PySpark is the Python API for Apache Spark — the most widely used framework for distributed data processing. The .join() method is how you combine two DataFrames on a shared key, and it supports every standard join type: inner, left, right, outer, anti, semi, and cross.
This tutorial covers seven interactive examples, from creating the input DataFrames to practical join patterns you will use in real pipelines.
Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.
- Create a fuel transactions DataFrame
- Create a regions lookup DataFrame
innerjoin — keep only matched rowsleftjoin — keep all left rows, fill nulls for unmatchedleft_antijoin — find rows with no matchcrossJoin— every combination of rows- Practical example — join and aggregate
The datasets
We will use two small datasets. The first is a petrol station transactions table with station, state, fuel_type, litres, and price. The second is a regions lookup table that maps station names to suburbs and area types. Not every station appears in both tables — this is deliberate so we can see how different join types handle mismatches.
The fuel DataFrame has 15 transactions across three stations: Shell Fortitude Valley (QLD), BP Southbank (VIC), and Caltex Bondi (NSW). Next, we create a regions lookup table that maps some of these stations to their suburb and area type.
The regions table has four rows. Three stations match the fuel table, but 7-Eleven Parramatta exists only in the regions table. This asymmetry will show us exactly how each join type behaves.
Inner join — keep only matches
An inner join returns only rows where the key exists in both DataFrames. Rows from either side that have no match are dropped. This is the default join type in PySpark.
All 15 fuel rows matched because every station in the fuel table has a corresponding entry in the regions table. The 7-Eleven Parramatta row from the regions table is dropped because it has no fuel transactions. When you pass a single string like 'station' as the join key, PySpark automatically deduplicates the column — the result contains one station column, not two.
Left join — keep all left rows
A left join keeps every row from the left DataFrame. Where no match exists in the right DataFrame, the right columns are filled with null.
In this case, the result looks identical to the inner join because every fuel station happens to exist in the regions table. If a station in the fuel table had no region mapping, its suburb and area_type columns would show null. Left joins are essential when you want to enrich a dataset without losing any original rows.
Anti join — find missing matches
A left_anti join returns rows from the left DataFrame that have no match in the right DataFrame. It is the PySpark equivalent of SQL's WHERE NOT EXISTS or NOT IN. This is invaluable for data quality checks — finding records that should have a match but do not.
The result is empty because every fuel station has a region mapping. Now try the reverse — which regions have no fuel transactions? Swap the DataFrames: regions.join(df, 'station', 'left_anti').show(). You will see 7-Eleven Parramatta appear because it exists in the regions table but has no fuel data.
Cross join — every combination
A cross join (Cartesian product) produces every combination of rows from both DataFrames. If the left has M rows and the right has N rows, the result has M x N rows. Use .limit() to keep the output manageable:
Cross joins are rarely used on large datasets because the output grows quadratically. They are useful for generating test data, building comparison matrices, or combining a small lookup with a larger table when there is no natural key.
Practical example — join and aggregate
In real pipelines, a join is rarely the final step. You typically join first to enrich the data, then aggregate. Here we join the fuel data with regions, then compute average litres per suburb:
This two-step pattern — join to enrich, then groupBy to aggregate — is the backbone of most PySpark data pipelines. The join adds context (suburb, area type), and the aggregation summarises the data at the level you need.
PySpark join types comparison
Here is a summary of all join types available in PySpark's .join() method:
inner(default) — keeps only rows where the key exists in both DataFrames.left/left_outer— keeps all left rows; fills nulls for unmatched right columns.right/right_outer— keeps all right rows; fills nulls for unmatched left columns.outer/full/full_outer— keeps all rows from both sides; fills nulls where no match exists.left_semi— keeps left rows that have a match, but returns only left columns (like SQLWHERE EXISTS).left_anti— keeps left rows that have no match (like SQLWHERE NOT EXISTS).cross— Cartesian product of all rows. No key required.
PySpark vs pandas vs SQL
If you work across multiple tools, here is how join syntax compares:
- PySpark:
df.join(other, 'key', 'inner')— join type as a string argument. Passing a single column name auto-deduplicates the key. - pandas:
df.merge(other, on='key', how='inner')— usesmerge()withhowparameter. No built-in anti join; requires boolean indexing. - SQL:
SELECT * FROM a INNER JOIN b ON a.key = b.key— declarative syntax with explicitONclause.
The core differences: PySpark supports left_anti and left_semi natively, which pandas lacks. PySpark auto-deduplicates the join key when you pass a string, while pandas always keeps both columns unless you use on=. SQL uses explicit ON clauses and separate JOIN keywords for each type.
Try editing the code blocks above — change the join type from 'inner' to 'right' or 'outer', swap which DataFrame is on the left, or add new stations to see how each join type behaves.
References
- PySpark documentation: DataFrame.join
- PySpark documentation: DataFrame.crossJoin
- PySpark groupBy: How to Group Data with PySpark
- PySpark filter: How to Filter Data with PySpark