Why PySpark?

PySpark is the Python API for Apache Spark, a widely used engine for processing large-scale data across a cluster. If you work with big data, data engineering pipelines, or distributed computing, PySpark is an essential skill.

Using the Minimal Viable Analytics (MVA) approach, you focus on just 6 skills instead of trying to learn the entire Spark API. Think of it like building a house — you need a solid foundation (your data), 6 pillars (core skills), and a roof (the decisions you make).

[Diagram: the MVA house. Roof: Decisions & Insights. Six pillars: Group & Aggregate, Filter & Slice, Sort, Join, Create Columns, Create Graphs. Foundation: Your Data.]

The 6 Pillars — PySpark Tutorials

Work through each pillar in order. Every tutorial is interactive — edit the code and run it directly in your browser. No Spark installation needed.

1. Grouping & Aggregation

Group rows with groupBy and aggregate with F.count, F.sum, F.avg. Filter groups with a chained .filter().

2. Filtering & Slicing

Select subsets with .filter(), F.col(), .isin(), .between(), .like(), and .isNull().

3. Sorting

Order data with .orderBy(), F.desc(), F.asc(), multi-column sort, and .limit() for top values.

4. Joining

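Combine DataFrames with .join() using the inner, left, or left_anti join types, or with .crossJoin(). The left_anti type gives PySpark native anti-join support: it keeps the left-hand rows that have no match on the right.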

5. Creating Columns

Derive new columns with .withColumn(), F.when().otherwise(), F.coalesce(), F.round(), and .cast().

6. Creating Graphs

Convert to pandas with .toPandas() and visualise with matplotlib — bar, line, pie, and scatter charts.

How to Use These Tutorials

Each tutorial uses the same Australian petrol station dataset so concepts build naturally. By the end, you'll have the skills to tackle real data analytics problems with confidence.

Every tutorial runs in the browser using a PySpark-compatible shim. The syntax mirrors real PySpark — the same patterns apply when you move to a full Spark cluster.

Same Skills, Other Languages

The MVA 6 pillars apply to every data tool. If you prefer a different language, the same tutorials are available for other tools.

Ready to start?

Begin with the first pillar — grouping and aggregation with PySpark.
