PySpark is the Python API for Apache Spark — the most widely used framework for distributed data processing. The .join() method is how you combine two DataFrames on a shared key, and it supports every standard join type: inner, left, right, outer, anti, semi, and cross.

This tutorial covers seven interactive examples, from creating the input DataFrames to practical join patterns you will use in real pipelines.

Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.

  1. Create a fuel transactions DataFrame
  2. Create a regions lookup DataFrame
  3. inner join — keep only matched rows
  4. left join — keep all left rows, fill nulls for unmatched
  5. left_anti join — find rows with no match
  6. crossJoin — every combination of rows
  7. Practical example — join and aggregate

The datasets

We will use two small datasets. The first is a petrol station transactions table with station, state, fuel_type, litres, and price. The second is a regions lookup table that maps station names to suburbs and area types. Not every station appears in both tables — this is deliberate so we can see how different join types handle mismatches.

Python — editable
Figure 1: Fuel transactions — 15 rows across three stations.

The fuel DataFrame has 15 transactions across three stations: Shell Fortitude Valley (QLD), BP Southbank (VIC), and Caltex Bondi (NSW). Next, we create a regions lookup table that maps some of these stations to their suburb and area type.

Python — editable
Figure 2: Regions lookup — 4 rows. Notice "7-Eleven Parramatta" has no match in the fuel table.

The regions table has four rows. Three stations match the fuel table, but 7-Eleven Parramatta exists only in the regions table. This asymmetry will show us exactly how each join type behaves.

Inner join — keep only matches

An inner join returns only rows where the key exists in both DataFrames. Rows from either side that have no match are dropped. This is the default join type in PySpark.

Python — editable
Figure 3: Inner join — only the 15 fuel rows with matching regions. "7-Eleven Parramatta" is excluded.

All 15 fuel rows matched because every station in the fuel table has a corresponding entry in the regions table. The 7-Eleven Parramatta row from the regions table is dropped because it has no fuel transactions. When you pass a single string like 'station' as the join key, PySpark automatically deduplicates the column — the result contains one station column, not two.

Left join — keep all left rows

A left join keeps every row from the left DataFrame. Where no match exists in the right DataFrame, the right columns are filled with null.

Python — editable
Figure 4: Left join — all 15 fuel rows retained. Suburb and area_type filled for matches.

In this case, the result looks identical to the inner join because every fuel station happens to exist in the regions table. If a station in the fuel table had no region mapping, its suburb and area_type columns would show null. Left joins are essential when you want to enrich a dataset without losing any original rows.

Anti join — find missing matches

A left_anti join returns rows from the left DataFrame that have no match in the right DataFrame. It is the PySpark equivalent of SQL's WHERE NOT EXISTS or NOT IN. This is invaluable for data quality checks — finding records that should have a match but do not.

Python — editable
Figure 5: Anti join — fuel rows with no matching region. Empty here because all stations have a region.

The result is empty because every fuel station has a region mapping. Now try the reverse — which regions have no fuel transactions? Swap the DataFrames: regions.join(df, 'station', 'left_anti').show(). You will see 7-Eleven Parramatta appear because it exists in the regions table but has no fuel data.

Cross join — every combination

A cross join (Cartesian product) produces every combination of rows from both DataFrames. If the left has M rows and the right has N rows, the result has M x N rows. Use .limit() to keep the output manageable:

Python — editable
Figure 6: Cross join — 3 x 2 = 6 rows. Every fuel row paired with every region row.

Cross joins are rarely used on large datasets because the output grows quadratically. They are useful for generating test data, building comparison matrices, or combining a small lookup with a larger table when there is no natural key.

Practical example — join and aggregate

In real pipelines, a join is rarely the final step. You typically join first to enrich the data, then aggregate. Here we join the fuel data with regions, then compute average litres per suburb:

Python — editable
Figure 7: Join then aggregate — transaction counts and averages by suburb.

This two-step pattern — join to enrich, then groupBy to aggregate — is the backbone of most PySpark data pipelines. The join adds context (suburb, area type), and the aggregation summarises the data at the level you need.

PySpark join types comparison

Here is a summary of all join types available in PySpark's .join() method:

PySpark vs pandas vs SQL

If you work across multiple tools, here is how join syntax compares:

The core differences: PySpark supports left_anti and left_semi natively, which pandas lacks. PySpark auto-deduplicates the join key when you pass a string, while pandas always keeps both columns unless you use on=. SQL uses explicit ON clauses and separate JOIN keywords for each type.

Try editing the code blocks above — change the join type from 'inner' to 'right' or 'outer', swap which DataFrame is on the left, or add new stations to see how each join type behaves.

References

Suhith Illesinghe
Curiosity is the first step to make a difference. I hope to inspire others to explore, build and champion collaborative growth.