PySpark is the Python API for Apache Spark — the most widely used framework for distributed data processing. The filter() method (and its alias where()) is how you select subsets of rows from a PySpark DataFrame based on conditions. Whether you need simple comparisons, compound logic, list membership, range checks, pattern matching, or null handling — PySpark has you covered.
This tutorial covers seven interactive examples, from basic comparison filters to pattern matching with like() and null detection with isNull().
Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.
filter(F.col('col') > value)— filter by comparisonfilter((cond1) & (cond2))— combine conditions with&and|F.col('col').isin([...])— match a list of valuesF.col('col').between(low, high)— range filterF.col('col').like('%pattern%')— SQL-style pattern matchingF.col('col').isNull()— handle missing values
The dataset
We will use a small petrol station dataset. Each row represents a fuel transaction recorded at stations across Australia. The columns capture the station name, state, fuel_type, litres sold, and the price per litre.
The dataset has 15 transactions spread across three Australian petrol stations: Shell Fortitude Valley (QLD), BP Southbank (VIC) and Caltex Bondi (NSW). Let's start filtering rows based on different conditions.
Basic filter — comparison operators
The simplest filter uses a comparison operator on a column. Use F.col('column') to reference a column and standard operators (>, <, ==, !=, >=, <=) to define the condition. The .filter() method keeps only rows where the condition evaluates to True.
This is the PySpark equivalent of SQL's SELECT * FROM df WHERE price > 200. Only the Premium fuel rows have a price above 200, so those are the rows returned. Note that .where() is an alias for .filter() — they are interchangeable.
Compound conditions with & and |
Combine multiple conditions using & (and) or | (or). Each condition must be wrapped in parentheses because of Python's operator precedence:
The & operator applies a logical AND — both conditions must be true. Use | for logical OR. The parentheses around each condition are required; without them, Python's bitwise operator precedence will cause a syntax error. This is the PySpark equivalent of SQL's WHERE fuel_type = 'Unleaded' AND state = 'VIC'.
Filtering with isin()
When you want to match a column against a list of values, use .isin(). This is the PySpark equivalent of SQL's IN clause:
The .isin() method accepts a Python list or individual arguments: .isin('Diesel', 'Premium') works too. To negate an isin filter, prefix with ~: df.filter(~F.col('fuel_type').isin(['Diesel', 'Premium'])).
Range filtering with between()
The .between(low, high) method filters rows where the column value falls within a range (inclusive on both ends):
This is equivalent to df.filter((F.col('price') >= 190) & (F.col('price') <= 200)) but more concise. The SQL equivalent is WHERE price BETWEEN 190 AND 200. Both bounds are inclusive.
Pattern matching with like()
Use .like() for SQL-style wildcard matching on string columns. The % wildcard matches any sequence of characters, and _ matches a single character:
The .like('%Bondi%') pattern matches any station name that contains "Bondi" anywhere in the string. This is the PySpark equivalent of SQL's WHERE station LIKE '%Bondi%'. For case-insensitive matching, combine with F.lower(): df.filter(F.lower(F.col('station')).like('%bondi%')).
Handling nulls with isNull()
Standard comparison operators do not match null values. Use .isNull() to find rows with missing data, or .isNotNull() to exclude them:
Our dataset has no null values, so the result is an empty DataFrame. In practice, .isNull() is essential for data quality checks. You can also use .isNotNull() to keep only non-null rows: df.filter(F.col('litres').isNotNull()). The SQL equivalent is WHERE litres IS NULL.
PySpark vs pandas vs SQL filtering
If you work across multiple tools, here is how filtering syntax compares:
- PySpark:
df.filter(F.col('price') > 200)— usesF.col()for column references,&/|for compound logic, parentheses required..filter()and.where()are identical. - pandas:
df[df['price'] > 200]— boolean indexing with square brackets. Compound conditions also use&/|with parentheses..query('price > 200')is an alternative. - SQL:
SELECT * FROM t WHERE price > 200— declarative syntax withAND/OR,IN,BETWEEN,LIKE,IS NULL.
The core differences: PySpark requires explicit F.col() references, parentheses around each condition in compound filters, and uses method-based syntax (.isin(), .between(), .like(), .isNull()) instead of SQL keywords. Pandas uses bracket indexing but shares the same operator precedence rules.
Try editing the code blocks above — change the filter column, swap & for |, or add new conditions to see how each pattern behaves.
References
- PySpark documentation: DataFrame.filter
- PySpark documentation: Column class (isin, between, like, isNull)
- PySpark documentation: pyspark.sql.functions
- Pandas equivalent: How to Filter Data in Pandas
- SQL equivalent: How to Filter Data with SQL