Joining two datasets on a shared key is one of the most fundamental operations in data analysis. Whether you are enriching a transaction table with customer details, combining survey responses with demographic data, or linking sensor readings to location metadata, joins are how you bring related tables together.

If you have used pandas, you know pd.merge() well. It works, but it only offers four join types: inner, left, right, and outer. Polars goes further. On top of the standard joins, Polars provides anti joins and semi joins as native, first-class operations — no workarounds, no indicator columns, no multi-step filtering.

This article covers seven patterns, each demonstrated with interactive code you can edit and run directly in your browser:

  1. Create two DataFrames with a shared key
  2. join() — inner join (default)
  3. how='left' — left join
  4. how='full' — full/outer join
  5. how='anti' — anti join (rows NOT in right)
  6. how='semi' — semi join (rows in right, no extra columns)
  7. left_on / right_on — join on differently named columns

The datasets

We will use two small DataFrames inspired by Australian petrol stations. The first, stations, contains station metadata: name, state, and pump capacity. The second, fuel_sales, contains fuel sales records: station name, fuel type, and total litres sold. The key column they share is station, but not every station appears in both tables — this is intentional so we can see how different join types handle mismatches.

Python — editable
Figure 1: Two DataFrames — stations (4 rows) and fuel_sales (6 rows). Ampol Hobart has no sales; 7-Eleven Parramatta has no station record.

Notice that Ampol Hobart exists in stations but has no sales records, and 7-Eleven Parramatta exists in fuel_sales but has no station record. These mismatches will reveal the difference between each join type.

Inner join (default)

An inner join returns only the rows where the key exists in both DataFrames. This is the default behaviour of .join():

Python — editable
Figure 2: Inner join — only stations present in both DataFrames appear.

Ampol Hobart is dropped (no sales data), and 7-Eleven Parramatta is dropped (no station record). Only the three stations that appear in both tables make it through. This is identical to pd.merge(left, right, on='key') in pandas.

Left join

A left join keeps all rows from the left DataFrame and fills in null where there is no match on the right:

Python — editable
Figure 3: Left join — Ampol Hobart is kept with null sales values. 7-Eleven Parramatta is excluded.

Ampol Hobart now appears with null for fuel_type and total_litres because there is no matching row in fuel_sales. 7-Eleven Parramatta is still excluded because it does not exist in the left table (stations).

Full join

A full join keeps all rows from both DataFrames. Important: Polars uses how='full', not how='outer' as in pandas:

Python — editable
Figure 4: Full join — every row from both DataFrames appears. Nulls fill gaps on either side.

Now both Ampol Hobart and 7-Eleven Parramatta appear. Ampol Hobart has nulls for the sales columns, and 7-Eleven Parramatta has nulls for the station metadata columns. In a full join with Polars, you will also see a station_right column — this is because the key column from the right DataFrame is preserved separately when rows do not match.

Anti join — a Polars superpower

An anti join returns rows from the left DataFrame that have no match in the right DataFrame. This is incredibly useful for finding missing data, orphan records, or rows that failed to link. In pandas, you would need a multi-step workaround: pd.merge() with indicator='_merge', then filter for 'left_only'. Polars makes it a single, native operation:

Python — editable
Figure 5: Anti join — only Ampol Hobart appears because it has no sales records.

Only Ampol Hobart is returned — it is the only station with no matching row in fuel_sales. No columns from the right DataFrame are added. This is one of the clearest advantages Polars has over pandas: what takes three lines in pandas is a single, expressive call in Polars.

Semi join

A semi join is the opposite of an anti join: it returns rows from the left DataFrame that do have a match in the right DataFrame, but — crucially — it does not add any columns from the right. Think of it as a filter: "keep only the left rows that exist in the right table."

Python — editable
Figure 6: Semi join — three stations returned (those with sales), but no sales columns added.

Three stations are returned: Caltex Bondi, BP Southbank, and Shell Fortitude Valley. These are the stations that have at least one record in fuel_sales. Ampol Hobart is excluded. Notice that no columns from the right DataFrame appear — the result has the same schema as stations. There is no direct equivalent in pandas.

Joining on different column names

Sometimes the key columns have different names in each DataFrame. Instead of on, use left_on and right_on:

Python — editable
Figure 7: Joining on differently named columns with left_on and right_on.

When using left_on and right_on, Polars keeps only the left key column in the result by default. This avoids the duplicate key column problem you sometimes get with pd.merge().

Polars vs pandas: joining compared

If you are coming from pandas, here is how the key join operations map:

The most significant advantages are anti and semi joins. In pandas, an anti join requires merging with an indicator column and then filtering — three operations that are easy to get wrong. In Polars, it is a single parameter: how='anti'. Similarly, a semi join in pandas requires either a merge-and-drop or an isin() filter on the key column. In Polars, it is simply how='semi'.

Try editing the code blocks above — swap stations and fuel_sales to see how the join direction changes, add new rows to either DataFrame, or change how to explore each join type.

References

Suhith Illesinghe
Curiosity is the first step to make a difference. I hope to inspire others to explore, build and champion collaborative growth.