How does PySpark filtering compare to pandas filtering?

PySpark uses df.filter(F.col('column') == value) with explicit column references via F.col(). Pandas uses boolean indexing like df[df['column'] == value]. PySpark filters are lazily evaluated on a cluster, while pandas filters execute immediately in memory.

What is the difference between .filter() and .where() in PySpark?

.filter() and .where() are identical in PySpark — they are aliases for the same method. Use whichever reads better in your code. Most PySpark developers prefer .filter().

How do you use isin in PySpark?

Use F.col('column').isin(['value1', 'value2']) or F.col('column').isin('value1', 'value2'). Both forms accept a list or individual arguments. It returns True for rows whose column value matches any item in the list.

How do you handle null values when filtering in PySpark?

Use F.col('column').isNull() to find rows with null values, or F.col('column').isNotNull() to exclude them. Standard comparison operators (==, !=) do not match nulls — you must use isNull() explicitly.

How to Filter Data with PySpark — filter, where, col, isin, between Guide

PySpark is the Python API for Apache Spark — the most widely used framework for distributed data processing. The filter() method (and its alias where()) is how you select subsets of rows from a PySpark DataFrame based on conditions. Whether you need simple comparisons, compound logic, list membership, range checks, pattern matching, or null handling — PySpark has you covered.

This tutorial covers seven interactive examples, from basic comparison filters to pattern matching with like() and null detection with isNull().

Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.

filter(F.col('col') > value) — filter by comparison
filter((cond1) & (cond2)) — combine conditions with & and |
F.col('col').isin([...]) — match a list of values
F.col('col').between(low, high) — range filter
F.col('col').like('%pattern%') — SQL-style pattern matching
F.col('col').isNull() — handle missing values

The dataset

We will use a small petrol station dataset. Each row represents a fuel transaction recorded at stations across Australia. The columns capture the station name, state, fuel_type, litres sold, and the price per litre.

Python — editable

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ('Shell Fortitude Valley', 'QLD', 'Unleaded', 44.9, 189.9),
    ('Shell Fortitude Valley', 'QLD', 'Diesel', 55.3, 193.5),
    ('Shell Fortitude Valley', 'QLD', 'Premium', 32.1, 215.9),
    ('BP Southbank', 'VIC', 'Unleaded', 52.1, 185.7),
    ('BP Southbank', 'VIC', 'Unleaded', 47.8, 185.7),
    ('BP Southbank', 'VIC', 'Diesel', 60.0, 192.3),
    ('BP Southbank', 'VIC', 'Premium', 41.0, 209.9),
    ('Caltex Bondi', 'NSW', 'Unleaded', 45.2, 189.5),
    ('Caltex Bondi', 'NSW', 'Diesel', 60.0, 195.9),
    ('Caltex Bondi', 'NSW', 'Premium', 38.5, 212.5),
    ('Caltex Bondi', 'NSW', 'Diesel', 58.7, 195.9),
    ('Caltex Bondi', 'NSW', 'Unleaded', 40.1, 189.5),
    ('BP Southbank', 'VIC', 'Diesel', 63.2, 192.3),
    ('BP Southbank', 'VIC', 'Premium', 35.6, 209.9),
    ('BP Southbank', 'VIC', 'Unleaded', 49.0, 185.7),
]

columns = ['station', 'state', 'fuel_type', 'litres', 'price']
df = spark.createDataFrame(data, columns)
df.show()

Figure 1: Petrol station transactions — 15 rows, 5 columns.

The dataset has 15 transactions spread across three Australian petrol stations: Shell Fortitude Valley (QLD), BP Southbank (VIC) and Caltex Bondi (NSW). Let's start filtering rows based on different conditions.

Basic filter — comparison operators

The simplest filter uses a comparison operator on a column. Use F.col('column') to reference a column and standard operators (>, <, ==, !=, >=, <=) to define the condition. The .filter() method keeps only rows where the condition evaluates to True.

Python — editable

df.filter(F.col('price') > 200).show()

Figure 2: Rows where price exceeds 200 — only Premium fuel transactions.

This is the PySpark equivalent of SQL's SELECT * FROM df WHERE price > 200. Only the Premium fuel rows have a price above 200, so those are the rows returned. Note that .where() is an alias for .filter() — they are interchangeable.

Compound conditions with `&` and `|`

Combine multiple conditions using & (and) or | (or). Each condition must be wrapped in parentheses because of Python's operator precedence:

Python — editable

df.filter((F.col('fuel_type') == 'Unleaded') & (F.col('state') == 'VIC')).show()

Figure 3: Unleaded fuel transactions in Victoria only.

The & operator applies a logical AND — both conditions must be true. Use | for logical OR. The parentheses around each condition are required; without them, Python's bitwise operator precedence will cause a syntax error. This is the PySpark equivalent of SQL's WHERE fuel_type = 'Unleaded' AND state = 'VIC'.

Filtering with `isin()`

When you want to match a column against a list of values, use .isin(). This is the PySpark equivalent of SQL's IN clause:

Python — editable

df.filter(F.col('fuel_type').isin(['Diesel', 'Premium'])).show()

Figure 4: Diesel and Premium transactions selected with isin().

The .isin() method accepts a Python list or individual arguments: .isin('Diesel', 'Premium') works too. To negate an isin filter, prefix with ~: df.filter(~F.col('fuel_type').isin(['Diesel', 'Premium'])).

Range filtering with `between()`

The .between(low, high) method filters rows where the column value falls within a range (inclusive on both ends):

Python — editable

df.filter(F.col('price').between(190, 200)).show()

Figure 5: Transactions with price between 190 and 200 (inclusive).

This is equivalent to df.filter((F.col('price') >= 190) & (F.col('price') <= 200)) but more concise. The SQL equivalent is WHERE price BETWEEN 190 AND 200. Both bounds are inclusive.

Pattern matching with `like()`

Use .like() for SQL-style wildcard matching on string columns. The % wildcard matches any sequence of characters, and _ matches a single character:

Python — editable

# Pattern matching with like
df.filter(F.col('station').like('%Bondi%')).show()

Figure 6: Stations containing "Bondi" selected with like().

The .like('%Bondi%') pattern matches any station name that contains "Bondi" anywhere in the string. This is the PySpark equivalent of SQL's WHERE station LIKE '%Bondi%'. For case-insensitive matching, combine with F.lower(): df.filter(F.lower(F.col('station')).like('%bondi%')).

Handling nulls with `isNull()`

Standard comparison operators do not match null values. Use .isNull() to find rows with missing data, or .isNotNull() to exclude them:

Python — editable

df.filter(F.col('litres').isNull()).show()

Figure 7: Rows where litres is null — empty result since our dataset has no nulls.

Our dataset has no null values, so the result is an empty DataFrame. In practice, .isNull() is essential for data quality checks. You can also use .isNotNull() to keep only non-null rows: df.filter(F.col('litres').isNotNull()). The SQL equivalent is WHERE litres IS NULL.

PySpark vs pandas vs SQL filtering

If you work across multiple tools, here is how filtering syntax compares:

PySpark: df.filter(F.col('price') > 200) — uses F.col() for column references, & / | for compound logic, parentheses required. .filter() and .where() are identical.
pandas: df[df['price'] > 200] — boolean indexing with square brackets. Compound conditions also use & / | with parentheses. .query('price > 200') is an alternative.
SQL: SELECT * FROM t WHERE price > 200 — declarative syntax with AND / OR, IN, BETWEEN, LIKE, IS NULL.

The core differences: PySpark requires explicit F.col() references, parentheses around each condition in compound filters, and uses method-based syntax (.isin(), .between(), .like(), .isNull()) instead of SQL keywords. Pandas uses bracket indexing but shares the same operator precedence rules.

Try editing the code blocks above — change the filter column, swap & for |, or add new conditions to see how each pattern behaves.

Data Science PySpark Spark filter

References

PySpark documentation: DataFrame.filter
PySpark documentation: Column class (isin, between, like, isNull)
PySpark documentation: pyspark.sql.functions
Pandas equivalent: How to Filter Data in Pandas
SQL equivalent: How to Filter Data with SQL

Suhith Illesinghe

Curiosity is the first step to make a difference. I hope to inspire others to explore, build and champion collaborative growth.

How to filter data with PySpark?

The dataset

Basic filter — comparison operators

Compound conditions with & and |

Filtering with isin()

Range filtering with between()

Pattern matching with like()

Handling nulls with isNull()

PySpark vs pandas vs SQL filtering

References

Related Articles

Compound conditions with `&` and `|`

Filtering with `isin()`

Range filtering with `between()`

Pattern matching with `like()`

Handling nulls with `isNull()`