Why PySpark?
PySpark is the Python API for Apache Spark, a widely used engine for processing large-scale data across a cluster. If you work with big data, data engineering pipelines, or distributed computing, PySpark is an essential skill.
Using the Minimal Viable Analytics (MVA) approach, you focus on just 6 skills instead of trying to learn the entire Spark API. Think of it like building a house — you need a solid foundation (your data), 6 pillars (core skills), and a roof (the decisions you make).
The 6 Pillars — PySpark Tutorials
Work through each pillar in order. Every tutorial is interactive — edit the code and run it directly in your browser. No Spark installation needed.
Grouping & Aggregation
Group rows with groupBy and aggregate with F.count, F.sum, F.avg. Filter groups with a chained .filter().
Filtering & Slicing
Select subsets with .filter(), F.col(), .isin(), .between(), .like(), and .isNull().
Sorting
Order data with .orderBy(), F.desc(), F.asc(), multi-column sort, and .limit() for top values.
Joining
Combine DataFrames with .join() using the inner, left, and left_anti join types, plus .crossJoin(). The left_anti type gives you an anti-join natively, with no workaround needed.
Creating Columns
Derive new columns with .withColumn(), F.when().otherwise(), F.coalesce(), F.round(), and .cast().
Creating Graphs
Convert to pandas with .toPandas() and visualise with matplotlib — bar, line, pie, and scatter charts.
How to Use These Tutorials
- Start with Pillar 1 (Grouping) — it introduces the dataset and core patterns
- Work through Pillars 2–5 — each builds on the same fundamentals
- Finish with Pillar 6 (Graphs) — bring your analysis to life visually
Each tutorial uses the same Australian petrol station dataset so concepts build naturally. By the end, you'll have the skills to tackle real data analytics problems with confidence.
Every tutorial runs in the browser using a PySpark-compatible shim. The syntax mirrors real PySpark — the same patterns apply when you move to a full Spark cluster.
What You'll Learn
- groupBy(), agg(), F.count(), F.sum(), F.avg(): aggregate data by category
- .filter(), F.col(), .isin(), .between(): filter DataFrames
- .orderBy(), F.desc(), F.asc(), .limit(): sort and find top values
- .join() with inner, left, left_anti, and cross joins
- .withColumn(), F.when().otherwise(), F.coalesce(): create and transform columns
- .toPandas() + matplotlib: visualise your Spark data
Same Skills, Other Languages
The MVA 6 pillars apply to every data tool, so if you prefer a different language, the same tutorials are available for other languages too.
Ready to start?
Begin with the first pillar — grouping and aggregation with PySpark.
Start Pillar 1: Grouping →