PySpark is the Python API for Apache Spark, one of the most widely used frameworks for distributed data processing. Creating new columns is one of the most common operations in a Spark pipeline: you derive new values, apply conditional logic, clean up nulls, and cast types before writing results downstream.

This tutorial walks through six interactive examples: one that builds the dataset, followed by five column operations that range from simple computed columns to type casting with .cast().

Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.

  1. withColumn — add a computed column from existing columns
  2. F.when().otherwise() — conditional logic
  3. F.coalesce() — handle null values
  4. F.round() + F.upper() — numeric and string transforms
  5. .cast() + select — change column types

The dataset

We will use a small petrol station dataset. Each row represents a fuel transaction recorded at stations across Australia. The columns capture the station name, state, fuel_type, litres sold, and the price per litre. Note that one row has a None value in the litres column to demonstrate null handling.

Figure 1: Petrol station transactions — 15 rows, 5 columns (one null in litres).

The dataset has 15 transactions across three Australian petrol stations. One row has a missing litres value, which we will handle with F.coalesce() later in this tutorial.

Adding a computed column with withColumn

The .withColumn() method adds a new column (or replaces an existing one) to the DataFrame. Pass the column name and an expression that derives its value from existing columns:

Figure 2: A new total_cost column computed as litres x price.

The expression F.col('litres') * F.col('price') multiplies the two columns row by row. The result is a new DataFrame with all original columns plus total_cost. Notice that the row with null litres produces null for total_cost: in Spark, an arithmetic expression with a null operand evaluates to null.

Conditional logic with when / otherwise

PySpark's F.when() is the equivalent of SQL's CASE WHEN. Chain it with .otherwise() to provide a default value:

Figure 3: Rows labeled "Expensive" or "Affordable" based on price.

The logic is straightforward: if price > 200, the new category column gets 'Expensive'; otherwise, it gets 'Affordable'. You can chain multiple .when() calls for multi-condition logic, similar to CASE WHEN ... WHEN ... ELSE in SQL.

Handling nulls with coalesce

F.coalesce() returns the first non-null value from a list of columns or literals. It is the standard way to replace null values in PySpark:

Figure 4: Null litres replaced with 0 using coalesce.

F.coalesce(F.col('litres'), F.lit(0)) checks the litres column first. If it is null, it falls back to the literal 0. This is the PySpark equivalent of SQL's COALESCE(litres, 0) or pandas' .fillna(0).

Transforming with round and upper

PySpark provides a rich library of column functions in pyspark.sql.functions. You can chain multiple .withColumn() calls to apply several transformations at once:

Figure 5: Price rounded to 0 decimals and station names uppercased.

F.round(F.col('price'), 0) rounds the price to zero decimal places, while F.upper(F.col('station')) converts station names to uppercase. Each .withColumn() call returns a new DataFrame, so you can chain as many as needed.

Selecting and casting column types

Use .cast() to change a column's data type, and .alias() to rename the result. When you only need specific columns, .select() is cleaner than chaining .withColumn():

Figure 6: Price cast from double to integer using .cast('int').

.cast('int') truncates the decimal portion of price, and .alias('price_int') gives the new column a distinct name. Common cast targets include 'int', 'double', 'string', and 'date'. Use .select() when you want to return only a subset of columns.

PySpark vs pandas vs Polars vs SQL

If you work across multiple tools, here is how column creation syntax compares:

The core differences: PySpark uses withColumn for single-column additions and relies on functions imported from pyspark.sql.functions. In pandas, the usual df['col'] = ... assignment modifies the DataFrame in place, although .assign() returns a new one. Polars and PySpark both return new DataFrames, but Polars allows multiple columns in a single with_columns call.

Try editing the code blocks above — change the conditional threshold in F.when(), swap F.upper() for F.lower(), or cast to 'string' instead of 'int' to see how each pattern behaves.

Suhith Illesinghe