PySpark is the Python API for Apache Spark — the most widely used framework for distributed data processing. The filter() method (and its alias where()) is how you select subsets of rows from a PySpark DataFrame based on conditions. Whether you need simple comparisons, compound logic, list membership, range checks, pattern matching, or null handling — PySpark has you covered.

This tutorial covers seven interactive examples, from basic comparison filters to pattern matching with like() and null detection with isNull().

Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.

  1. filter(F.col('col') > value) — filter by comparison
  2. filter((cond1) & (cond2)) — combine conditions with & and |
  3. F.col('col').isin([...]) — match a list of values
  4. F.col('col').between(low, high) — range filter
  5. F.col('col').like('%pattern%') — SQL-style pattern matching
  6. F.col('col').isNull() — handle missing values

The dataset

We will use a small petrol station dataset. Each row represents a fuel transaction recorded at stations across Australia. The columns capture the station name, state, fuel_type, litres sold, and the price per litre.

Python — editable
Figure 1: Petrol station transactions — 15 rows, 5 columns.

The dataset has 15 transactions spread across three Australian petrol stations: Shell Fortitude Valley (QLD), BP Southbank (VIC) and Caltex Bondi (NSW). Let's start filtering rows based on different conditions.

Basic filter — comparison operators

The simplest filter uses a comparison operator on a column. Use F.col('column') to reference a column and standard operators (>, <, ==, !=, >=, <=) to define the condition. The .filter() method keeps only rows where the condition evaluates to True.

Python — editable
Figure 2: Rows where price exceeds 200 — only Premium fuel transactions.

This is the PySpark equivalent of SQL's SELECT * FROM df WHERE price > 200. Only the Premium fuel rows have a price above 200, so those are the rows returned. Note that .where() is an alias for .filter() — they are interchangeable.

Compound conditions with & and |

Combine multiple conditions using & (and) or | (or). Each condition must be wrapped in parentheses because of Python's operator precedence:

Python — editable
Figure 3: Unleaded fuel transactions in Victoria only.

The & operator applies a logical AND — both conditions must be true. Use | for logical OR. The parentheses around each condition are required; without them, Python's bitwise operator precedence will cause a syntax error. This is the PySpark equivalent of SQL's WHERE fuel_type = 'Unleaded' AND state = 'VIC'.

Filtering with isin()

When you want to match a column against a list of values, use .isin(). This is the PySpark equivalent of SQL's IN clause:

Python — editable
Figure 4: Diesel and Premium transactions selected with isin().

The .isin() method accepts a Python list or individual arguments: .isin('Diesel', 'Premium') works too. To negate an isin filter, prefix with ~: df.filter(~F.col('fuel_type').isin(['Diesel', 'Premium'])).

Range filtering with between()

The .between(low, high) method filters rows where the column value falls within a range (inclusive on both ends):

Python — editable
Figure 5: Transactions with price between 190 and 200 (inclusive).

This is equivalent to df.filter((F.col('price') >= 190) & (F.col('price') <= 200)) but more concise. The SQL equivalent is WHERE price BETWEEN 190 AND 200. Both bounds are inclusive.

Pattern matching with like()

Use .like() for SQL-style wildcard matching on string columns. The % wildcard matches any sequence of characters, and _ matches a single character:

Python — editable
Figure 6: Stations containing "Bondi" selected with like().

The .like('%Bondi%') pattern matches any station name that contains "Bondi" anywhere in the string. This is the PySpark equivalent of SQL's WHERE station LIKE '%Bondi%'. For case-insensitive matching, combine with F.lower(): df.filter(F.lower(F.col('station')).like('%bondi%')).

Handling nulls with isNull()

Standard comparison operators do not match null values. Use .isNull() to find rows with missing data, or .isNotNull() to exclude them:

Python — editable
Figure 7: Rows where litres is null — empty result since our dataset has no nulls.

Our dataset has no null values, so the result is an empty DataFrame. In practice, .isNull() is essential for data quality checks. You can also use .isNotNull() to keep only non-null rows: df.filter(F.col('litres').isNotNull()). The SQL equivalent is WHERE litres IS NULL.

PySpark vs pandas vs SQL filtering

If you work across multiple tools, here is how filtering syntax compares:

The core differences: PySpark requires explicit F.col() references, parentheses around each condition in compound filters, and uses method-based syntax (.isin(), .between(), .like(), .isNull()) instead of SQL keywords. Pandas uses bracket indexing but shares the same operator precedence rules.

Try editing the code blocks above — change the filter column, swap & for |, or add new conditions to see how each pattern behaves.

References

Suhith Illesinghe
Curiosity is the first step to make a difference. I hope to inspire others to explore, build and champion collaborative growth.