PySpark is the Python API for Apache Spark, a widely used framework for distributed data processing. Creating new columns is one of the most common operations in any Spark pipeline: you derive new values, apply conditional logic, clean up nulls, and cast types before writing results downstream.
This tutorial covers six interactive examples, from simple computed columns to type casting with .cast().
Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.
- withColumn: add a computed column from existing columns
- F.when().otherwise(): conditional logic
- F.coalesce(): handle null values
- F.round() + F.upper(): numeric and string transforms
- .cast() + select: change column types
The dataset
We will use a small petrol station dataset. Each row represents a fuel transaction recorded at stations across Australia. The columns capture the station name, state, fuel_type, litres sold, and the price per litre. Note that one row has a None value in the litres column to demonstrate null handling.
The dataset has 15 transactions across three Australian petrol stations. One row has a missing litres value, which we will handle with F.coalesce() later in this tutorial.
Adding a computed column with withColumn
The .withColumn() method adds a new column (or replaces an existing one) to the DataFrame. Pass the column name and an expression that derives its value from existing columns:
The expression F.col('litres') * F.col('price') multiplies two columns element-wise. The result is a new DataFrame with all original columns plus total_cost. Notice the row with null litres produces null for total_cost — PySpark propagates nulls through arithmetic by default.
Conditional logic with when / otherwise
PySpark's F.when() is the equivalent of SQL's CASE WHEN. Chain it with .otherwise() to provide a default value:
The logic is straightforward: if price > 200, the new category column gets 'Expensive'; otherwise, it gets 'Affordable'. You can chain multiple .when() calls for multi-condition logic, similar to CASE WHEN ... WHEN ... ELSE in SQL.
Handling nulls with coalesce
F.coalesce() returns the first non-null value from a list of columns or literals. It is the standard way to replace null values in PySpark:
F.coalesce(F.col('litres'), F.lit(0)) checks the litres column first. If it is null, it falls back to the literal 0. This is the PySpark equivalent of SQL's COALESCE(litres, 0) or pandas' .fillna(0).
Transforming with round and upper
PySpark provides a rich library of column functions in pyspark.sql.functions. You can chain multiple .withColumn() calls to apply several transformations at once:
F.round(F.col('price'), 0) rounds the price to zero decimal places, while F.upper(F.col('station')) converts station names to uppercase. Each .withColumn() call returns a new DataFrame, so you can chain as many as needed.
Selecting and casting column types
Use .cast() to change a column's data type, and .alias() to rename the result. When you only need specific columns, .select() is cleaner than chaining .withColumn():
.cast('int') truncates the decimal portion of price, and .alias('price_int') gives the new column a distinct name. Common cast targets include 'int', 'double', 'string', and 'date'. Use .select() when you want to return only a subset of columns.
PySpark vs pandas vs Polars vs SQL
If you work across multiple tools, here is how column creation syntax compares:
- PySpark: df.withColumn('new', expr) adds one column at a time; uses F.when(), F.coalesce(), .cast(). Immutable DataFrames; each call returns a new DataFrame.
- pandas: df['new'] = expr is direct assignment; uses np.where() for conditionals, .fillna() for nulls, .astype() for casting. Mutates in place.
- Polars: df.with_columns(expr.alias('new')) is expression-based; uses pl.when().otherwise(), .fill_null(), .cast(). Immutable like PySpark.
- SQL: SELECT *, col1 * col2 AS new FROM t is declarative; uses CASE WHEN, COALESCE(), CAST().
The core differences: PySpark's withColumn adds one column per call (Spark 3.3 and later also provide withColumns for adding several at once) and relies on functions imported from pyspark.sql.functions. Pandas assignment mutates the DataFrame in place. Polars and PySpark both return new DataFrames, and Polars accepts multiple columns in a single with_columns call.
Try editing the code blocks above — change the conditional threshold in F.when(), swap F.upper() for F.lower(), or cast to 'string' instead of 'int' to see how each pattern behaves.