How do you join two DataFrames in PySpark?

Use df.join(other_df, 'key_column', 'join_type'). The join type can be 'inner', 'left', 'right', 'outer', 'left_anti', 'left_semi', or 'cross'.

What is the difference between left join and left anti join in PySpark?

A left join keeps all rows from the left DataFrame and fills nulls where no match exists. A left anti join returns only left rows that have NO match in the right DataFrame — useful for finding missing data.

How does a cross join work in PySpark?

A cross join produces every combination of rows from both DataFrames (Cartesian product). If left has M rows and right has N rows, the result has M x N rows. Use with caution on large datasets.

What is the default join type in PySpark?

The default join type in PySpark is 'inner'. If you omit the third argument in df.join(other, 'key'), PySpark performs an inner join.

How to Join Data with PySpark — join, inner, left, cross, anti Guide

PySpark is the Python API for Apache Spark — the most widely used framework for distributed data processing. The .join() method is how you combine two DataFrames on a shared key, and it supports every standard join type: inner, left, right, outer, anti, semi, and cross.

This tutorial covers seven interactive examples, from creating the input DataFrames to practical join patterns you will use in real pipelines.

Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.

Create a fuel transactions DataFrame
Create a regions lookup DataFrame
inner join — keep only matched rows
left join — keep all left rows, fill nulls for unmatched
left_anti join — find rows with no match
crossJoin — every combination of rows
Practical example — join and aggregate

The datasets

We will use two small datasets. The first is a petrol station transactions table with station, state, fuel_type, litres, and price. The second is a regions lookup table that maps station names to suburbs and area types. Not every station appears in both tables — this is deliberate so we can see how different join types handle mismatches.

Python — editable

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ('Shell Fortitude Valley', 'QLD', 'Unleaded', 44.9, 189.9),
    ('Shell Fortitude Valley', 'QLD', 'Diesel', 55.3, 193.5),
    ('Shell Fortitude Valley', 'QLD', 'Premium', 32.1, 215.9),
    ('BP Southbank', 'VIC', 'Unleaded', 52.1, 185.7),
    ('BP Southbank', 'VIC', 'Unleaded', 47.8, 185.7),
    ('BP Southbank', 'VIC', 'Diesel', 60.0, 192.3),
    ('BP Southbank', 'VIC', 'Premium', 41.0, 209.9),
    ('Caltex Bondi', 'NSW', 'Unleaded', 45.2, 189.5),
    ('Caltex Bondi', 'NSW', 'Diesel', 60.0, 195.9),
    ('Caltex Bondi', 'NSW', 'Premium', 38.5, 212.5),
    ('Caltex Bondi', 'NSW', 'Diesel', 58.7, 195.9),
    ('Caltex Bondi', 'NSW', 'Unleaded', 40.1, 189.5),
    ('BP Southbank', 'VIC', 'Diesel', 63.2, 192.3),
    ('BP Southbank', 'VIC', 'Premium', 35.6, 209.9),
    ('BP Southbank', 'VIC', 'Unleaded', 49.0, 185.7),
]

columns = ['station', 'state', 'fuel_type', 'litres', 'price']
df = spark.createDataFrame(data, columns)
df.show()

Figure 1: Fuel transactions — 15 rows across three stations.

The fuel DataFrame has 15 transactions across three stations: Shell Fortitude Valley (QLD), BP Southbank (VIC), and Caltex Bondi (NSW). Next, we create a regions lookup table that maps some of these stations to their suburb and area type.

Python — editable

region_data = [
    ('Caltex Bondi', 'Eastern Suburbs', 'Metro'),
    ('BP Southbank', 'Inner City', 'Metro'),
    ('Shell Fortitude Valley', 'Inner North', 'Metro'),
    ('7-Eleven Parramatta', 'Western Sydney', 'Metro'),
]
regions = spark.createDataFrame(region_data, ['station', 'suburb', 'area_type'])
regions.show()

Figure 2: Regions lookup — 4 rows. Notice "7-Eleven Parramatta" has no match in the fuel table.

The regions table has four rows. Three stations match the fuel table, but 7-Eleven Parramatta exists only in the regions table. This asymmetry will show us exactly how each join type behaves.

Inner join — keep only matches

An inner join returns only rows where the key exists in both DataFrames. Rows from either side that have no match are dropped. This is the default join type in PySpark.

Python — editable

df.join(regions, 'station', 'inner').show()

Figure 3: Inner join — only the 15 fuel rows with matching regions. "7-Eleven Parramatta" is excluded.

All 15 fuel rows matched because every station in the fuel table has a corresponding entry in the regions table. The 7-Eleven Parramatta row from the regions table is dropped because it has no fuel transactions. When you pass a single string like 'station' as the join key, PySpark automatically deduplicates the column — the result contains one station column, not two.

Left join — keep all left rows

A left join keeps every row from the left DataFrame. Where no match exists in the right DataFrame, the right columns are filled with null.

Python — editable

df.join(regions, 'station', 'left').show()

Figure 4: Left join — all 15 fuel rows retained. Suburb and area_type filled for matches.

In this case, the result looks identical to the inner join because every fuel station happens to exist in the regions table. If a station in the fuel table had no region mapping, its suburb and area_type columns would show null. Left joins are essential when you want to enrich a dataset without losing any original rows.

Anti join — find missing matches

A left_anti join returns rows from the left DataFrame that have no match in the right DataFrame. It is the PySpark equivalent of SQL's WHERE NOT EXISTS or NOT IN. This is invaluable for data quality checks — finding records that should have a match but do not.

Python — editable

# Find fuel rows with NO matching region
df.join(regions, 'station', 'left_anti').show()

Figure 5: Anti join — fuel rows with no matching region. Empty here because all stations have a region.

The result is empty because every fuel station has a region mapping. Now try the reverse — which regions have no fuel transactions? Swap the DataFrames: regions.join(df, 'station', 'left_anti').show(). You will see 7-Eleven Parramatta appear because it exists in the regions table but has no fuel data.

Cross join — every combination

A cross join (Cartesian product) produces every combination of rows from both DataFrames. If the left has M rows and the right has N rows, the result has M x N rows. Use .limit() to keep the output manageable:

Python — editable

df.limit(3).crossJoin(regions.limit(2)).show()

Figure 6: Cross join — 3 x 2 = 6 rows. Every fuel row paired with every region row.

Cross joins are rarely used on large datasets because the output grows quadratically. They are useful for generating test data, building comparison matrices, or combining a small lookup with a larger table when there is no natural key.

Practical example — join and aggregate

In real pipelines, a join is rarely the final step. You typically join first to enrich the data, then aggregate. Here we join the fuel data with regions, then compute average litres per suburb:

Python — editable

enriched = df.join(regions, 'station', 'left')

enriched.groupBy('suburb').agg(
    F.count('*').alias('transactions'),
    F.avg('litres').alias('avg_litres'),
    F.avg('price').alias('avg_price')
).show()

Figure 7: Join then aggregate — transaction counts and averages by suburb.

This two-step pattern — join to enrich, then groupBy to aggregate — is the backbone of most PySpark data pipelines. The join adds context (suburb, area type), and the aggregation summarises the data at the level you need.

PySpark join types comparison

Here is a summary of all join types available in PySpark's .join() method:

inner (default) — keeps only rows where the key exists in both DataFrames.
left / left_outer — keeps all left rows; fills nulls for unmatched right columns.
right / right_outer — keeps all right rows; fills nulls for unmatched left columns.
outer / full / full_outer — keeps all rows from both sides; fills nulls where no match exists.
left_semi — keeps left rows that have a match, but returns only left columns (like SQL WHERE EXISTS).
left_anti — keeps left rows that have no match (like SQL WHERE NOT EXISTS).
cross — Cartesian product of all rows. No key required.

PySpark vs pandas vs SQL

If you work across multiple tools, here is how join syntax compares:

PySpark: df.join(other, 'key', 'inner') — join type as a string argument. Passing a single column name auto-deduplicates the key.
pandas: df.merge(other, on='key', how='inner') — uses merge() with how parameter. No built-in anti join; requires boolean indexing.
SQL: SELECT * FROM a INNER JOIN b ON a.key = b.key — declarative syntax with explicit ON clause.

The core differences: PySpark supports left_anti and left_semi natively, which pandas lacks. PySpark auto-deduplicates the join key when you pass a string, while pandas always keeps both columns unless you use on=. SQL uses explicit ON clauses and separate JOIN keywords for each type.

Try editing the code blocks above — change the join type from 'inner' to 'right' or 'outer', swap which DataFrame is on the left, or add new stations to see how each join type behaves.

Data Science PySpark Spark join

References

PySpark documentation: DataFrame.join
PySpark documentation: DataFrame.crossJoin
PySpark groupBy: How to Group Data with PySpark
PySpark filter: How to Filter Data with PySpark

Suhith Illesinghe

Curiosity is the first step to make a difference. I hope to inspire others to explore, build and champion collaborative growth.

How to join data with PySpark?

The datasets

Inner join — keep only matches

Left join — keep all left rows

Anti join — find missing matches

Cross join — every combination

Practical example — join and aggregate

PySpark join types comparison

PySpark vs pandas vs SQL

References

Related Articles