PySpark is the Python API for Apache Spark — the most widely used framework for distributed data processing. The groupByagg pattern is the foundation of every aggregation in PySpark: group rows by one or more columns, then compute summary statistics for each group.

This tutorial covers six interactive examples, from basic counting to filtering aggregated groups (the PySpark equivalent of SQL's HAVING clause).

Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.

  1. groupBy().count() — count rows per group
  2. groupBy().agg() with .alias() — name the output column
  3. Multiple aggregations in a single .agg() call
  4. Grouping by multiple columns
  5. Filtering groups — the PySpark equivalent of SQL HAVING

The dataset

We will use a small petrol station dataset. Each row represents a fuel transaction recorded at stations across Australia. The columns capture the station name, state, fuel_type, litres sold, and the price per litre.

Python — editable
Figure 1: Petrol station transactions — 15 rows, 5 columns.

The dataset has 15 transactions spread across three Australian petrol stations: Shell Fortitude Valley (QLD), BP Southbank (VIC) and Caltex Bondi (NSW). A natural question is: how many transactions were recorded at each station?

Python — editable
Figure 2: Row counts per station using the .count() shortcut.

The .count() shortcut is the fastest way to count rows per group. It produces a DataFrame with two columns: the grouping column (station) and a column called count. This is the PySpark equivalent of SQL's SELECT station, COUNT(*) FROM ... GROUP BY station.

The only thing you might want to change is the column name count. Let's fix that with .alias().

Naming columns with alias()

In PySpark, you name output columns using .alias() on the aggregation expression inside .agg(). Instead of the shortcut .count(), we switch to the full .agg() syntax with F.count('*'):

Python — editable
Figure 3: The count column is now labeled "transactions".

The F.count('*') expression counts all rows per group (including nulls), and .alias('transactions') gives the output column a meaningful name. This is the PySpark equivalent of SELECT station, COUNT(*) AS transactions FROM ... GROUP BY station in SQL.

Multiple aggregations in one call

One of PySpark's strengths is running multiple aggregations in a single .agg() call. Each expression produces a column in the output:

Python — editable
Figure 4: Four aggregations in one call — count, average, sum, and max.

Each F.<function>() call inside .agg() produces exactly one named column. This is clean and explicit — you can see at a glance what each column will contain and what it will be named.

Grouping by multiple columns

Pass multiple column names to .groupBy() to create finer-grained groups:

Python — editable
Figure 5: Transactions and average price by station and fuel type.

Both station and fuel_type appear as regular columns in the output. PySpark has no index concept, so there is nothing to reset or flatten.

Filtering groups — the HAVING equivalent

SQL uses HAVING to filter groups after aggregation. PySpark has no HAVING clause — instead, you aggregate first, then .filter() on the result:

Python — editable
Figure 6: Only stations with more than 4 transactions.

This two-step pattern — aggregate, then filter — is the PySpark equivalent of SQL's HAVING COUNT(*) > 4. You can filter on any aggregated column using F.col().

PySpark vs pandas vs Polars vs SQL

If you work across multiple tools, here is how grouping syntax compares:

The core differences: PySpark uses camelCase (groupBy, not group_by), requires explicit function imports (F.count, F.avg), and returns lazy DataFrames on a real cluster. Pandas puts group keys in the index by default. Polars uses expression syntax similar to PySpark but with snake_case naming.

Try editing the code blocks above — change the grouping column to fuel_type or state, swap F.count('*') for F.sum('litres'), or add your own stations to see how each pattern behaves.

References

Suhith Illesinghe
Curiosity is the first step to make a difference. I hope to inspire others to explore, build and champion collaborative growth.