PySpark is the Python API for Apache Spark, a widely used framework for distributed data processing. The orderBy method sorts the rows of a PySpark DataFrame: it arranges data in ascending or descending order by one or more columns, lets you control where nulls land, and combines with limit to slice the result.
This tutorial covers six interactive examples, from basic ascending sort to multi-column ordering and null handling.
Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.
- orderBy('column'): sort ascending (default)
- orderBy(F.desc('column')): sort descending
- orderBy(F.desc('column')).limit(n): top N rows
- Multi-column sort with mixed directions
- Null handling: where nulls land in the sort order
The dataset
We will use a small petrol station dataset. Each row represents a fuel transaction recorded at stations across Australia. The columns capture the station name, state, fuel_type, litres sold, and the price per litre. A few rows have None in the litres column to demonstrate null handling.
The dataset has 15 transactions spread across three Australian petrol stations. Two rows have None in the litres column — we will use these to explore how PySpark handles nulls during sorting.
Sort ascending with orderBy
By default, orderBy sorts in ascending order. Pass a column name as a string:
The cheapest transactions (Unleaded at BP Southbank, 185.7) appear at the top, and the most expensive (Premium at Shell Fortitude Valley, 215.9) at the bottom. This is the PySpark equivalent of SQL's SELECT * FROM ... ORDER BY price ASC.
Sort descending with F.desc()
To sort from highest to lowest, wrap the column name in F.desc():
F.desc('price') creates a descending sort expression. You can also use F.col('price').desc() — both produce the same result. This is the PySpark equivalent of ORDER BY price DESC.
Get top N with limit
Chain .limit(n) after orderBy to keep only the first n rows. This is the standard pattern for "top 5 most expensive" or "bottom 3 cheapest" queries:
The limit(5) call returns a new DataFrame with only the first 5 rows. Combined with orderBy(F.desc('price')), this gives the 5 most expensive transactions. This is the PySpark equivalent of SQL's ORDER BY price DESC LIMIT 5.
Multi-column sort
Pass multiple columns to orderBy to sort by the first column, then break ties with the second. You can mix ascending and descending directions:
Rows are first grouped by fuel_type in alphabetical order (Diesel, Premium, Unleaded). Within each fuel type, transactions are sorted from most expensive to least expensive. This is the PySpark equivalent of ORDER BY fuel_type ASC, price DESC.
Null handling in sorts
When a column contains nulls, PySpark must decide where to place them. On a real Spark cluster, nulls sort first in ascending order and last in descending order by default. The pandas-backed simulation used in this tutorial does not necessarily reproduce that default. Let's see this with the litres column, which has two null values:
In this environment, the two rows with null in the litres column appear at the bottom. On a real Spark cluster, you can set null placement explicitly with F.col('litres').asc_nulls_first() or F.col('litres').asc_nulls_last(). The same applies to descending sorts with desc_nulls_first() and desc_nulls_last().
PySpark vs pandas vs Polars vs SQL
If you work across multiple tools, here is how sorting syntax compares:
- PySpark: df.orderBy(F.desc('col')) uses orderBy (or its alias sort), with direction via F.desc() / F.asc() and top N via .limit(n).
- pandas: df.sort_values('col', ascending=False).head(n) uses sort_values, with direction via the ascending parameter and top N via .head(n).
- Polars: df.sort('col', descending=True).head(n) uses sort, with direction via the descending parameter; nulls come first by default unless you pass nulls_last=True.
- SQL: SELECT * FROM t ORDER BY col DESC LIMIT n is declarative, with direction via ASC / DESC; null behavior varies by database.
The core differences: PySpark uses orderBy (matching SQL convention) and expresses descending order with F.desc() or the ascending=False keyword. pandas uses sort_values with a boolean ascending parameter. Polars uses sort with a boolean descending parameter. Default null placement also differs between tools, so set it explicitly whenever it matters.
Try editing the code blocks above — change the sort column to station or state, swap F.desc for ascending, or adjust limit to see different slices of the data.
References
- PySpark documentation: DataFrame.orderBy
- PySpark documentation: pyspark.sql.functions
- Polars equivalent: How to Sort Data with Polars
- SQL equivalent: How to Sort Data with SQL