PySpark is the Python API for Apache Spark, a widely used framework for distributed data processing. The orderBy method sorts the rows of a PySpark DataFrame: it arranges data in ascending or descending order by one or more columns, lets you control where nulls land, and combines with limit to slice the result.

This tutorial walks through six interactive examples: building the dataset, then five sorting patterns, from a basic ascending sort to multi-column ordering and null handling.

Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.

  1. orderBy('column') — sort ascending (default)
  2. orderBy(F.desc('column')) — sort descending
  3. orderBy(F.desc('column')).limit(n) — top N rows
  4. Multi-column sort with mixed directions
  5. Null handling — where nulls land in the sort order

The dataset

We will use a small petrol station dataset. Each row represents a fuel transaction recorded at a station in Australia. The columns are station, state, fuel_type, litres (litres sold), and price (price per litre). Two rows have None in the litres column to demonstrate null handling.

Figure 1: Petrol station transactions — 15 rows, 5 columns (two rows have null litres).

The dataset has 15 transactions spread across three Australian petrol stations. Two rows have None in the litres column — we will use these to explore how PySpark handles nulls during sorting.

Sort ascending with orderBy

By default, orderBy sorts in ascending order. Pass a column name as a string:

Figure 2: All 15 rows sorted by price, lowest first.

The cheapest transactions (Unleaded at BP Southbank, 185.7) appear at the top, and the most expensive (Premium at Shell Fortitude Valley, 215.9) at the bottom. This is the PySpark equivalent of SQL's SELECT * FROM ... ORDER BY price ASC.

Sort descending with F.desc()

To sort from highest to lowest, wrap the column name in F.desc():

Figure 3: All rows sorted by price descending — most expensive first.

F.desc('price') creates a descending sort expression. You can also use F.col('price').desc() — both produce the same result. This is the PySpark equivalent of ORDER BY price DESC.

Get top N with limit

Chain .limit(n) after orderBy to keep only the first n rows. This is the standard pattern for "top 5 most expensive" or "bottom 3 cheapest" queries:

Figure 4: Top 5 most expensive transactions.

The limit(5) call returns a new DataFrame with only the first 5 rows. Combined with orderBy(F.desc('price')), this gives the 5 most expensive transactions. This is the PySpark equivalent of SQL's ORDER BY price DESC LIMIT 5.

Multi-column sort

Pass multiple columns to orderBy to sort by the first column, then break ties with the second. You can mix ascending and descending directions:

Figure 5: Sorted by fuel_type ascending, then price descending within each fuel type.

Rows are first ordered by fuel_type alphabetically (Diesel, Premium, Unleaded). Within each fuel type, transactions are sorted from most expensive to least expensive. No aggregation takes place; every row is kept. This is the PySpark equivalent of ORDER BY fuel_type ASC, price DESC.

Null handling in sorts

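When a column contains nulls, PySpark must decide where to place them. In Spark itself, nulls sort first in ascending order and last in descending order by default. The pandas-backed simulation used here places them last in an ascending sort instead, so its output differs from what a real cluster returns. Let's see this with the litres column, which has two null values: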

Figure 6: Ascending sort on litres. Null values appear last in this pandas-backed simulation; real Spark places them first by default.

The two rows with null in the litres column appear at the bottom here. On a real Spark cluster, a plain ascending sort would put them at the top, and you can control null placement explicitly with F.col('litres').asc_nulls_first() or F.col('litres').asc_nulls_last(). The same applies to descending sorts with desc_nulls_first() and desc_nulls_last().

PySpark vs pandas vs Polars vs SQL

If you work across multiple tools, here is how sorting syntax compares:

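The core differences: PySpark uses orderBy (matching SQL convention) and expresses descending order with F.desc() or the ascending=False keyword argument. Pandas uses sort_values with a boolean ascending parameter. Polars uses sort with a boolean descending parameter. Default null placement also differs between tools: Spark puts nulls first in an ascending sort, pandas puts missing values last (na_position='last'), Polars exposes a nulls_last option, and SQL behavior depends on the database engine.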

Try editing the code blocks above: change the sort column to station or state, swap F.desc for F.asc, or adjust limit to see different slices of the data.

Suhith Illesinghe
Curiosity is the first step to make a difference. I hope to inspire others to explore, build and champion collaborative growth.