PySpark is the Python API for Apache Spark — the most widely used framework for distributed data processing. The groupBy → agg pattern is the foundation of every aggregation in PySpark: group rows by one or more columns, then compute summary statistics for each group.
This tutorial covers six interactive examples, from basic counting to filtering aggregated groups (the PySpark equivalent of SQL's HAVING clause).
Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.
groupBy().count()— count rows per groupgroupBy().agg()with.alias()— name the output column- Multiple aggregations in a single
.agg()call - Grouping by multiple columns
- Filtering groups — the PySpark equivalent of SQL
HAVING
The dataset
We will use a small petrol station dataset. Each row represents a fuel transaction recorded at stations across Australia. The columns capture the station name, state, fuel_type, litres sold, and the price per litre.
The dataset has 15 transactions spread across three Australian petrol stations: Shell Fortitude Valley (QLD), BP Southbank (VIC) and Caltex Bondi (NSW). A natural question is: how many transactions were recorded at each station?
The .count() shortcut is the fastest way to count rows per group. It produces a DataFrame with two columns: the grouping column (station) and a column called count. This is the PySpark equivalent of SQL's SELECT station, COUNT(*) FROM ... GROUP BY station.
The only thing you might want to change is the column name count. Let's fix that with .alias().
Naming columns with alias()
In PySpark, you name output columns using .alias() on the aggregation expression inside .agg(). Instead of the shortcut .count(), we switch to the full .agg() syntax with F.count('*'):
The F.count('*') expression counts all rows per group (including nulls), and .alias('transactions') gives the output column a meaningful name. This is the PySpark equivalent of SELECT station, COUNT(*) AS transactions FROM ... GROUP BY station in SQL.
Multiple aggregations in one call
One of PySpark's strengths is running multiple aggregations in a single .agg() call. Each expression produces a column in the output:
Each F.<function>() call inside .agg() produces exactly one named column. This is clean and explicit — you can see at a glance what each column will contain and what it will be named.
Grouping by multiple columns
Pass multiple column names to .groupBy() to create finer-grained groups:
Both station and fuel_type appear as regular columns in the output. PySpark has no index concept, so there is nothing to reset or flatten.
Filtering groups — the HAVING equivalent
SQL uses HAVING to filter groups after aggregation. PySpark has no HAVING clause — instead, you aggregate first, then .filter() on the result:
This two-step pattern — aggregate, then filter — is the PySpark equivalent of SQL's HAVING COUNT(*) > 4. You can filter on any aggregated column using F.col().
PySpark vs pandas vs Polars vs SQL
If you work across multiple tools, here is how grouping syntax compares:
- PySpark:
df.groupBy('col').agg(F.count('*').alias('n'))— camelCasegroupBy, functions frompyspark.sql.functions, results are always DataFrames. - pandas:
df.groupby('col').size().to_frame('n').reset_index()— lowercasegroupby, results often needreset_index()to get a clean DataFrame. - Polars:
df.group_by('col').agg(pl.len().alias('n'))— snake_casegroup_by, expression-based syntax, no index concept. - SQL:
SELECT col, COUNT(*) AS n FROM t GROUP BY col— declarative, filtering viaHAVING.
The core differences: PySpark uses camelCase (groupBy, not group_by), requires explicit function imports (F.count, F.avg), and returns lazy DataFrames on a real cluster. Pandas puts group keys in the index by default. Polars uses expression syntax similar to PySpark but with snake_case naming.
Try editing the code blocks above — change the grouping column to fuel_type or state, swap F.count('*') for F.sum('litres'), or add your own stations to see how each pattern behaves.
References
- PySpark documentation: GroupedData
- PySpark documentation: pyspark.sql.functions
- Pandas equivalent: How to Group Data in Pandas
- Polars equivalent: How to Group Data with Polars
- SQL equivalent: How to Group Data with SQL