How do you group data in PySpark?

Call df.groupBy('column').agg(...) with one or more aggregation expressions. To count rows per group, use the .count() shortcut or F.count('*') inside .agg(). Either way, the result is a new DataFrame.

What is the difference between PySpark groupBy and pandas groupby?

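PySpark spells the method .groupBy() (camelCase) and usually pairs it with .agg() and functions from pyspark.sql.functions. The result is always a DataFrame with the group keys as regular columns, because PySpark has no index concept. pandas uses lowercase .groupby() and puts the group keys in the index by default.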

How do you name an aggregated column in PySpark?

Use .alias() on the aggregation expression inside .agg(). Example: F.count('*').alias('total_count').

Can PySpark run multiple aggregations at once?

Yes. Pass several expressions to a single .agg() call, for example .agg(F.avg('col1').alias('avg'), F.sum('col2').alias('total')).

How to Group Data with PySpark — groupBy, agg, count, sum Guide

PySpark is the Python API for Apache Spark, one of the most widely used frameworks for distributed data processing. The groupBy → agg pattern underlies most aggregations in PySpark: group rows by one or more columns, then compute summary statistics for each group.

This tutorial covers six interactive examples, from basic counting to filtering aggregated groups (the PySpark equivalent of SQL's HAVING clause).

Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.

groupBy().count() — count rows per group
groupBy().agg() with .alias() — name the output column
Multiple aggregations in a single .agg() call
Grouping by multiple columns
Filtering groups — the PySpark equivalent of SQL HAVING

The dataset

We will use a small petrol station dataset. Each row represents a fuel transaction recorded at stations across Australia. The columns capture the station name, state, fuel_type, litres sold, and the price per litre.

Python — editable

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ('Shell Fortitude Valley', 'QLD', 'Unleaded', 44.9, 189.9),
    ('Shell Fortitude Valley', 'QLD', 'Diesel', 55.3, 193.5),
    ('Shell Fortitude Valley', 'QLD', 'Premium', 32.1, 215.9),
    ('BP Southbank', 'VIC', 'Unleaded', 52.1, 185.7),
    ('BP Southbank', 'VIC', 'Unleaded', 47.8, 185.7),
    ('BP Southbank', 'VIC', 'Diesel', 60.0, 192.3),
    ('BP Southbank', 'VIC', 'Premium', 41.0, 209.9),
    ('Caltex Bondi', 'NSW', 'Unleaded', 45.2, 189.5),
    ('Caltex Bondi', 'NSW', 'Diesel', 60.0, 195.9),
    ('Caltex Bondi', 'NSW', 'Premium', 38.5, 212.5),
    ('Caltex Bondi', 'NSW', 'Diesel', 58.7, 195.9),
    ('Caltex Bondi', 'NSW', 'Unleaded', 40.1, 189.5),
    ('BP Southbank', 'VIC', 'Diesel', 63.2, 192.3),
    ('BP Southbank', 'VIC', 'Premium', 35.6, 209.9),
    ('BP Southbank', 'VIC', 'Unleaded', 49.0, 185.7),
]

columns = ['station', 'state', 'fuel_type', 'litres', 'price']
df = spark.createDataFrame(data, columns)
df.show()

Figure 1: Petrol station transactions — 15 rows, 5 columns.

The dataset has 15 transactions spread across three Australian petrol stations: Shell Fortitude Valley (QLD), BP Southbank (VIC) and Caltex Bondi (NSW). A natural question is: how many transactions were recorded at each station?

Python — editable

df.groupBy('station').count().show()

Figure 2: Row counts per station using the .count() shortcut.

The .count() shortcut is the most concise way to count rows per group. It produces a DataFrame with two columns: the grouping column (station) and a column called count. This is the PySpark equivalent of SQL's SELECT station, COUNT(*) FROM ... GROUP BY station.
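To sanity-check the counts in Figure 2 without a Spark session, the same tally can be sketched in plain Python. This is only a model of what groupBy('station').count() computes, not how Spark executes it; the stations list is a stand-in for the station column of the dataset above.

```python
from collections import Counter

# Station column of the 15-row dataset above (3 + 7 + 5 transactions).
stations = (
    ['Shell Fortitude Valley'] * 3
    + ['BP Southbank'] * 7
    + ['Caltex Bondi'] * 5
)

# Counter keeps one tally per distinct key, like COUNT(*) ... GROUP BY station.
counts = Counter(stations)

for station, n in counts.items():
    print(station, n)
```

If the PySpark output disagrees with these tallies, the input data is the first thing to check.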

The only thing you might want to change is the column name count. Let's fix that with .alias().

Naming columns with alias()

In PySpark, you name output columns using .alias() on the aggregation expression inside .agg(). Instead of the shortcut .count(), we switch to the full .agg() syntax with F.count('*'):

Python — editable

df.groupBy('station').agg(
    F.count('*').alias('transactions')
).show()

Figure 3: The count column is now labeled "transactions".

The F.count('*') expression counts every row in each group, including rows that contain null values, and .alias('transactions') gives the output column a meaningful name. This is the PySpark equivalent of SELECT station, COUNT(*) AS transactions FROM ... GROUP BY station in SQL.
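The null handling matters once you count a specific column instead of '*'. In SQL, and likewise with F.count('column') in PySpark, counting a column skips rows where that column is null. The sketch below shows the SQL rule using Python's built-in sqlite3 module; the sales table and its two stations are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sales (station TEXT, litres REAL)')
conn.executemany(
    'INSERT INTO sales VALUES (?, ?)',
    [('A', 40.0), ('A', None), ('B', 55.0)],  # one null litres value for station A
)

# COUNT(*) counts rows; COUNT(litres) counts only non-null litres values.
rows = conn.execute(
    'SELECT station, COUNT(*) AS all_rows, COUNT(litres) AS non_null_litres '
    'FROM sales GROUP BY station ORDER BY station'
).fetchall()
conn.close()

print(rows)  # [('A', 2, 1), ('B', 1, 1)]
```

The petrol station dataset has no nulls, so both forms would return the same numbers there.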

Multiple aggregations in one call

One of PySpark's strengths is running multiple aggregations in a single .agg() call. Each expression produces a column in the output:

Python — editable

df.groupBy('station').agg(
    F.count('*').alias('transactions'),
    F.avg('litres').alias('avg_litres'),
    F.sum('litres').alias('total_litres'),
    F.max('price').alias('max_price')
).show()

Figure 4: Four aggregations in one call — count, average, sum, and max.

Each F.<function>() call inside .agg() produces exactly one named column. This is clean and explicit — you can see at a glance what each column will contain and what it will be named.
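To check the numbers in Figure 4 by hand, here is a plain-Python sketch of the same four aggregations. It assumes nothing about Spark internals and only mirrors the arithmetic; the rows list repeats the station, litres, and price columns of the dataset above, and the summary dictionary is a hypothetical stand-in for the output DataFrame.

```python
from collections import defaultdict

# (station, litres, price) for each of the 15 transactions above.
rows = [
    ('Shell Fortitude Valley', 44.9, 189.9),
    ('Shell Fortitude Valley', 55.3, 193.5),
    ('Shell Fortitude Valley', 32.1, 215.9),
    ('BP Southbank', 52.1, 185.7),
    ('BP Southbank', 47.8, 185.7),
    ('BP Southbank', 60.0, 192.3),
    ('BP Southbank', 41.0, 209.9),
    ('Caltex Bondi', 45.2, 189.5),
    ('Caltex Bondi', 60.0, 195.9),
    ('Caltex Bondi', 38.5, 212.5),
    ('Caltex Bondi', 58.7, 195.9),
    ('Caltex Bondi', 40.1, 189.5),
    ('BP Southbank', 63.2, 192.3),
    ('BP Southbank', 35.6, 209.9),
    ('BP Southbank', 49.0, 185.7),
]

# Step 1: group rows by station.
groups = defaultdict(list)
for station, litres, price in rows:
    groups[station].append((litres, price))

# Step 2: compute one value per aggregation, per group.
summary = {}
for station, items in groups.items():
    litres = [l for l, _ in items]
    prices = [p for _, p in items]
    summary[station] = {
        'transactions': len(items),
        'avg_litres': sum(litres) / len(litres),
        'total_litres': sum(litres),
        'max_price': max(prices),
    }

for station, stats in summary.items():
    print(station, stats)
```

Floating-point sums may differ from the displayed figures in the last decimal places, so compare with a small tolerance rather than exact equality.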

Grouping by multiple columns

Pass multiple column names to .groupBy() to create finer-grained groups:

Python — editable

df.groupBy('station', 'fuel_type').agg(
    F.count('*').alias('count'),
    F.avg('price').alias('avg_price')
).show()

Figure 5: Transactions and average price by station and fuel type.

Both station and fuel_type appear as regular columns in the output. PySpark has no index concept, so there is nothing to reset or flatten.
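Conceptually, grouping by two columns means each distinct (station, fuel_type) pair becomes its own group. A minimal plain-Python sketch of that idea uses a tuple as the dictionary key; it models the result only, and the rows list is a stand-in for the station, fuel_type, and price columns of the dataset above.

```python
from collections import defaultdict

# (station, fuel_type, price) for each of the 15 transactions above.
rows = [
    ('Shell Fortitude Valley', 'Unleaded', 189.9),
    ('Shell Fortitude Valley', 'Diesel', 193.5),
    ('Shell Fortitude Valley', 'Premium', 215.9),
    ('BP Southbank', 'Unleaded', 185.7),
    ('BP Southbank', 'Unleaded', 185.7),
    ('BP Southbank', 'Diesel', 192.3),
    ('BP Southbank', 'Premium', 209.9),
    ('Caltex Bondi', 'Unleaded', 189.5),
    ('Caltex Bondi', 'Diesel', 195.9),
    ('Caltex Bondi', 'Premium', 212.5),
    ('Caltex Bondi', 'Diesel', 195.9),
    ('Caltex Bondi', 'Unleaded', 189.5),
    ('BP Southbank', 'Diesel', 192.3),
    ('BP Southbank', 'Premium', 209.9),
    ('BP Southbank', 'Unleaded', 185.7),
]

# The composite key (station, fuel_type) identifies one group.
prices_by_group = defaultdict(list)
for station, fuel_type, price in rows:
    prices_by_group[(station, fuel_type)].append(price)

# One flat output row per group: both keys stay ordinary fields.
result = [
    (station, fuel_type, len(prices), sum(prices) / len(prices))
    for (station, fuel_type), prices in prices_by_group.items()
]

for row in result:
    print(row)
```

Three stations times three fuel types gives nine groups, which should match the number of rows in Figure 5.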

Filtering groups — the HAVING equivalent

SQL uses HAVING to filter groups after aggregation. The PySpark DataFrame API has no HAVING method; instead, you aggregate first, then call .filter() on the result:

Python — editable

result = df.groupBy('station').agg(
    F.count('*').alias('transactions')
)
result.filter(F.col('transactions') > 4).show()

Figure 6: Only stations with more than 4 transactions.

This two-step pattern — aggregate, then filter — is the PySpark equivalent of SQL's HAVING COUNT(*) > 4. You can filter on any aggregated column using F.col().
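For comparison, here is the SQL HAVING query that the two-step pattern mirrors, run with Python's built-in sqlite3 module so it needs no Spark session. The fuel_sales table name is hypothetical, and only the station column of the dataset above is loaded.

```python
import sqlite3

# Station column of the 15-row dataset above (3 + 7 + 5 transactions).
stations = (
    ['Shell Fortitude Valley'] * 3
    + ['BP Southbank'] * 7
    + ['Caltex Bondi'] * 5
)

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE fuel_sales (station TEXT)')
conn.executemany('INSERT INTO fuel_sales VALUES (?)', [(s,) for s in stations])

# HAVING filters groups after COUNT(*) has been computed.
having_rows = conn.execute(
    'SELECT station, COUNT(*) AS transactions '
    'FROM fuel_sales GROUP BY station '
    'HAVING COUNT(*) > 4 ORDER BY station'
).fetchall()
conn.close()

print(having_rows)  # [('BP Southbank', 7), ('Caltex Bondi', 5)]
```

Shell Fortitude Valley has only three transactions, so it drops out, which is the same result the PySpark filter should show in Figure 6.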

PySpark vs pandas vs Polars vs SQL

If you work across multiple tools, here is how grouping syntax compares:

PySpark: df.groupBy('col').agg(F.count('*').alias('n')) — camelCase groupBy, functions from pyspark.sql.functions, results are always DataFrames.
pandas: df.groupby('col').size().to_frame('n').reset_index() — lowercase groupby, results often need reset_index() to get a clean DataFrame.
Polars: df.group_by('col').agg(pl.len().alias('n')) — snake_case group_by, expression-based syntax, no index concept.
SQL: SELECT col, COUNT(*) AS n FROM t GROUP BY col — declarative, filtering via HAVING.

The core differences: PySpark uses camelCase (groupBy, not group_by), requires explicit function imports (F.count, F.avg), and returns lazy DataFrames on a real cluster. Pandas puts group keys in the index by default. Polars uses expression syntax similar to PySpark but with snake_case naming.

Try editing the code blocks above — change the grouping column to fuel_type or state, swap F.count('*') for F.sum('litres'), or add your own stations to see how each pattern behaves.



Suhith Illesinghe

