How do you sort data in PySpark?

Use df.orderBy('column') for ascending order or df.orderBy(F.desc('column')) for descending order, where F is pyspark.sql.functions. You can also use df.sort(), which is an alias for orderBy.

What is the difference between orderBy and sort in PySpark?

There is no functional difference. sort() and orderBy() are two names for the same DataFrame method: both accept the same arguments and produce the same result. orderBy is often preferred because it mirrors SQL's ORDER BY clause.

How does PySpark handle nulls when sorting?

By default, Spark places nulls first in ascending order and last in descending order. You can control this with .asc_nulls_first(), .asc_nulls_last(), .desc_nulls_first(), and .desc_nulls_last() on column expressions.

How do you get the top N rows in PySpark?

Sort descending with df.orderBy(F.desc('column')).limit(N). The limit(N) method returns a new DataFrame with only the first N rows.

How to Sort Data with PySpark — orderBy, asc, desc, limit Guide

PySpark is the Python API for Apache Spark, a widely used framework for distributed data processing. The orderBy method sorts the rows of a PySpark DataFrame: it arranges data in ascending or descending order by one or more columns, lets you control where nulls land, and combines with limit to slice the result.

This tutorial covers six interactive examples, from basic ascending sort to multi-column ordering and null handling.

Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.

orderBy('column') — sort ascending (default)
orderBy(F.desc('column')) — sort descending
orderBy(F.desc('column')).limit(n) — top N rows
Multi-column sort with mixed directions
Null handling — where nulls land in the sort order

The dataset

We will use a small petrol station dataset. Each row represents a fuel transaction recorded at stations across Australia. The columns capture the station name, state, fuel_type, litres sold, and the price per litre. Two rows have None in the litres column to demonstrate null handling.

Python — editable

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ('Shell Fortitude Valley', 'QLD', 'Unleaded', 44.9, 189.9),
    ('Shell Fortitude Valley', 'QLD', 'Diesel', 55.3, 193.5),
    ('Shell Fortitude Valley', 'QLD', 'Premium', 32.1, 215.9),
    ('BP Southbank', 'VIC', 'Unleaded', 52.1, 185.7),
    ('BP Southbank', 'VIC', 'Unleaded', 47.8, 185.7),
    ('BP Southbank', 'VIC', 'Diesel', 60.0, 192.3),
    ('BP Southbank', 'VIC', 'Premium', 41.0, 209.9),
    ('Caltex Bondi', 'NSW', 'Unleaded', 45.2, 189.5),
    ('Caltex Bondi', 'NSW', 'Diesel', 60.0, 195.9),
    ('Caltex Bondi', 'NSW', 'Premium', 38.5, 212.5),
    ('Caltex Bondi', 'NSW', 'Diesel', 58.7, 195.9),
    ('Caltex Bondi', 'NSW', 'Unleaded', 40.1, 189.5),
    ('BP Southbank', 'VIC', 'Diesel', None, 192.3),
    ('BP Southbank', 'VIC', 'Premium', 35.6, 209.9),
    ('BP Southbank', 'VIC', 'Unleaded', None, 185.7),
]

columns = ['station', 'state', 'fuel_type', 'litres', 'price']
df = spark.createDataFrame(data, columns)
df.show()

Figure 1: Petrol station transactions — 15 rows, 5 columns (two rows have null litres).

The dataset has 15 transactions spread across three Australian petrol stations. Two rows have None in the litres column — we will use these to explore how PySpark handles nulls during sorting.

Sort ascending with `orderBy`

By default, orderBy sorts in ascending order. Pass a column name as a string:

Python — editable

df.orderBy('price').show()

Figure 2: All 15 rows sorted by price, lowest first.

The cheapest transactions (Unleaded at BP Southbank, 185.7) appear at the top, and the most expensive (Premium at Shell Fortitude Valley, 215.9) at the bottom. This is the PySpark equivalent of SQL's SELECT * FROM ... ORDER BY price ASC.

Sort descending with `F.desc()`

To sort from highest to lowest, wrap the column name in F.desc():

Python — editable

df.orderBy(F.desc('price')).show()

Figure 3: All rows sorted by price descending — most expensive first.

F.desc('price') creates a descending sort expression. You can also use F.col('price').desc() — both produce the same result. This is the PySpark equivalent of ORDER BY price DESC.

Get top N with `limit`

Chain .limit(n) after orderBy to keep only the first n rows. This is the standard pattern for "top 5 most expensive" or "bottom 3 cheapest" queries:

Python — editable

df.orderBy(F.desc('price')).limit(5).show()

Figure 4: Top 5 most expensive transactions.

The limit(5) call returns a new DataFrame with only the first 5 rows. Combined with orderBy(F.desc('price')), this gives the 5 most expensive transactions. This is the PySpark equivalent of SQL's ORDER BY price DESC LIMIT 5.

Multi-column sort

Pass multiple columns to orderBy to sort by the first column, then break ties with the second. You can mix ascending and descending directions:

Python — editable

df.orderBy('fuel_type', F.desc('price')).show()

Figure 5: Sorted by fuel_type ascending, then price descending within each fuel type.

Rows are first ordered by fuel_type alphabetically (Diesel, Premium, Unleaded). Within each fuel type, transactions are sorted from most expensive to least expensive. This is the PySpark equivalent of ORDER BY fuel_type ASC, price DESC.

Null handling in sorts

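When a column contains nulls, PySpark must decide where to place them. Apache Spark's default is nulls first in ascending order and nulls last in descending order. The pandas-backed simulation used on this page behaves differently and places nulls last in the ascending sort below. Let's see this with the litres column, which has two null values: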

Python — editable

df.orderBy('litres').show()

Figure 6: Ascending sort on litres. Null values appear last in this simulation.

The two rows with null in the litres column appear at the bottom here. On a real Spark cluster the default for an ascending sort is nulls first, so the same call would list them at the top. To make the placement explicit and independent of the environment, use F.col('litres').asc_nulls_first() or F.col('litres').asc_nulls_last(). Descending sorts have the matching desc_nulls_first() and desc_nulls_last().

PySpark vs pandas vs Polars vs SQL

If you work across multiple tools, here is how sorting syntax compares:

PySpark: df.orderBy(F.desc('col')).limit(n). Uses orderBy (or its alias sort), direction via F.desc() / F.asc(), top N via .limit(n).
pandas: df.sort_values('col', ascending=False).head(n). Uses sort_values, direction via the ascending parameter, top N via .head(n).
Polars: df.sort('col', descending=True).head(n). Uses sort, direction via the descending parameter, nulls first by default (nulls_last=False).
SQL: SELECT * FROM t ORDER BY col DESC LIMIT n. Declarative, direction via ASC/DESC, null placement varies by database.

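The core differences: PySpark uses orderBy (matching SQL convention) and usually expresses descending order with F.desc(), although orderBy also accepts an ascending=False keyword. Pandas uses sort_values with a boolean ascending parameter. Polars uses sort with a boolean descending parameter. Default null placement is not uniform either: Spark puts nulls first in ascending sorts and last in descending sorts, pandas puts missing values last, Polars puts nulls first, and SQL engines vary.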
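For comparison, here is the pandas side of the same pattern as a minimal sketch. The station names and values are hypothetical, and na_position is the pandas parameter that controls where missing values land.

```python
import pandas as pd

# Hypothetical rows with one missing litres value, for illustration only
prices = pd.DataFrame(
    {
        "station": ["Shell", "BP", "Caltex", "Ampol"],
        "litres": [44.9, None, 60.0, 32.1],
    }
)

# Top 2 by litres: descending sort, then head(n)
top_two = prices.sort_values("litres", ascending=False).head(2)

# Ascending sort with missing values moved to the front
nulls_first = prices.sort_values("litres", na_position="first")

print(top_two["station"].tolist())
print(nulls_first["station"].tolist())
```

pandas keeps missing values last in both directions unless na_position says otherwise, so the missing row never reaches the top two here.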

Try editing the code blocks above — change the sort column to station or state, swap F.desc for ascending, or adjust limit to see different slices of the data.



Suhith Illesinghe

