How do you create a new column in PySpark?

Use df.withColumn('new_col', expression). The expression can be arithmetic on existing columns, conditional logic with F.when(), or any pyspark.sql.functions transformation.

What is the difference between withColumn and select in PySpark?

withColumn adds or replaces a single column while keeping all existing columns. select returns only the columns you specify, so you must list every column you want to keep.

How do you handle nulls in PySpark columns?

Use F.coalesce(F.col('column'), F.lit(default)) to replace nulls with a fallback value. You can also use .isNull() or .isNotNull() in filter conditions.

How do you cast a column type in PySpark?

Use F.col('column').cast('target_type'). Common types include 'int', 'double', 'string', and 'date'. Combine with .alias() to rename the result.

How to Create Columns with PySpark — withColumn, when, coalesce, cast Guide

PySpark is the Python API for Apache Spark, one of the most widely used frameworks for distributed data processing. Creating new columns is one of the most common operations in any Spark pipeline: you derive new values, apply conditional logic, clean up nulls, and cast types before writing results downstream.

This tutorial covers six interactive examples: one that builds the dataset, followed by the five column patterns listed below, from simple computed columns to type casting with .cast().

Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.

withColumn — add a computed column from existing columns
F.when().otherwise() — conditional logic
F.coalesce() — handle null values
F.round() + F.upper() — numeric and string transforms
.cast() + select — change column types

The dataset

We will use a small petrol station dataset. Each row represents a fuel transaction recorded at stations across Australia. The columns capture the station name, state, fuel_type, litres sold, and the price per litre. Note that one row has a None value in the litres column to demonstrate null handling.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ('Shell Fortitude Valley', 'QLD', 'Unleaded', 44.9, 189.9),
    ('Shell Fortitude Valley', 'QLD', 'Diesel', 55.3, 193.5),
    ('Shell Fortitude Valley', 'QLD', 'Premium', 32.1, 215.9),
    ('BP Southbank', 'VIC', 'Unleaded', 52.1, 185.7),
    ('BP Southbank', 'VIC', 'Unleaded', 47.8, 185.7),
    ('BP Southbank', 'VIC', 'Diesel', 60.0, 192.3),
    ('BP Southbank', 'VIC', 'Premium', 41.0, 209.9),
    ('Caltex Bondi', 'NSW', 'Unleaded', 45.2, 189.5),
    ('Caltex Bondi', 'NSW', 'Diesel', 60.0, 195.9),
    ('Caltex Bondi', 'NSW', 'Premium', 38.5, 212.5),
    ('Caltex Bondi', 'NSW', 'Diesel', None, 195.9),
    ('Caltex Bondi', 'NSW', 'Unleaded', 40.1, 189.5),
    ('BP Southbank', 'VIC', 'Diesel', 63.2, 192.3),
    ('BP Southbank', 'VIC', 'Premium', 35.6, 209.9),
    ('BP Southbank', 'VIC', 'Unleaded', 49.0, 185.7),
]

columns = ['station', 'state', 'fuel_type', 'litres', 'price']
df = spark.createDataFrame(data, columns)
df.show()

Figure 1: Petrol station transactions — 15 rows, 5 columns (one null in litres).

The dataset has 15 transactions across three Australian petrol stations. One row has a missing litres value, which we will handle with F.coalesce() later in this tutorial.

Adding a computed column with `withColumn`

The .withColumn() method adds a new column (or replaces an existing one) to the DataFrame. Pass the column name and an expression that derives its value from existing columns:

df.withColumn('total_cost', F.col('litres') * F.col('price')).show()

Figure 2: A new total_cost column computed as litres × price.

The expression F.col('litres') * F.col('price') multiplies two columns element-wise. The result is a new DataFrame with all original columns plus total_cost. Notice the row with null litres produces null for total_cost — PySpark propagates nulls through arithmetic by default.

Conditional logic with `when` / `otherwise`

PySpark's F.when() is the equivalent of SQL's CASE WHEN. Chain it with .otherwise() to provide a default value:

df.withColumn('category',
    F.when(F.col('price') > 200, 'Expensive').otherwise('Affordable')
).show()

Figure 3: Rows labeled "Expensive" or "Affordable" based on price.

The logic is straightforward: if price > 200, the new category column gets 'Expensive'; otherwise, it gets 'Affordable'. You can chain multiple .when() calls for multi-condition logic, similar to CASE WHEN ... WHEN ... ELSE in SQL.

Handling nulls with `coalesce`

F.coalesce() returns the first non-null value from a list of columns or literals. It is the standard way to replace null values in PySpark:

df.withColumn('litres_clean', F.coalesce(F.col('litres'), F.lit(0))).show()

Figure 4: Null litres replaced with 0 using coalesce.

F.coalesce(F.col('litres'), F.lit(0)) checks the litres column first. If it is null, it falls back to the literal 0. This is the PySpark equivalent of SQL's COALESCE(litres, 0) or pandas' .fillna(0).

Transforming with `round` and `upper`

PySpark provides a rich library of column functions in pyspark.sql.functions. You can chain multiple .withColumn() calls to apply several transformations at once:

df.withColumn('rounded_price', F.round(F.col('price'), 0)) \
  .withColumn('upper_station', F.upper(F.col('station'))) \
  .show()

Figure 5: Price rounded to 0 decimals and station names uppercased.

F.round(F.col('price'), 0) rounds the price to zero decimal places, while F.upper(F.col('station')) converts station names to uppercase. Each .withColumn() call returns a new DataFrame, so you can chain as many as needed.

Selecting and casting column types

Use .cast() to change a column's data type, and .alias() to rename the result. When you only need specific columns, .select() is cleaner than chaining .withColumn():

df.select(
    'station', 'price', F.col('price').cast('int').alias('price_int')
).show()

Figure 6: Price cast from double to integer using .cast('int').

.cast('int') truncates the decimal portion of price, and .alias('price_int') gives the new column a distinct name. Common cast targets include 'int', 'double', 'string', and 'date'. Use .select() when you want to return only a subset of columns.

PySpark vs pandas vs Polars vs SQL

If you work across multiple tools, here is how column creation syntax compares:

PySpark: df.withColumn('new', expr) — adds one column at a time, uses F.when(), F.coalesce(), .cast(). Immutable DataFrames; each call returns a new DataFrame.
pandas: df['new'] = expr — direct assignment, uses np.where() for conditionals, .fillna() for nulls, .astype() for casting. Mutates in place.
Polars: df.with_columns(expr.alias('new')) — expression-based, uses pl.when().otherwise(), .fill_null(), .cast(). Immutable like PySpark.
SQL: SELECT *, col1 * col2 AS new FROM t — declarative, uses CASE WHEN, COALESCE(), CAST().
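For comparison, the pandas calls named in the list can be sketched as follows. The two-row frame and its values are hypothetical and chosen only to exercise each call.

```python
import numpy as np
import pandas as pd

# Two-row frame (hypothetical) with one missing litres value
pdf = pd.DataFrame({
    'litres': [60.0, None],
    'price': [192.3, 215.9],
})

# Direct assignment adds the column to the existing frame
pdf['total_cost'] = pdf['litres'] * pdf['price']

# np.where plays the role of F.when().otherwise()
pdf['category'] = np.where(pdf['price'] > 200, 'Expensive', 'Affordable')

# .fillna plays the role of F.coalesce with a literal fallback
pdf['litres_clean'] = pdf['litres'].fillna(0)

# .astype(int) truncates the decimals, like .cast('int')
pdf['price_int'] = pdf['price'].astype(int)

print(pdf)
```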

The core differences: PySpark's withColumn adds one column per call (Spark 3.3 and later also provide withColumns for adding several at once) and relies on functions imported from pyspark.sql.functions. pandas mutates DataFrames in place. Polars and PySpark both return new DataFrames, and Polars accepts multiple expressions in a single with_columns call.

Try editing the code blocks above — change the conditional threshold in F.when(), swap F.upper() for F.lower(), or cast to 'string' instead of 'int' to see how each pattern behaves.

Suhith Illesinghe
