Creating new columns is one of the core skills in data analysis. Whether you're computing a derived metric, binning continuous data into categories, or preparing features for a machine learning model, you'll be adding columns to a DataFrame constantly.
In this article you'll learn three things:
- How to create and assign values to a column
- How to create a categorical column with pd.cut()
- How to create one-hot encodings with pd.get_dummies()
1. The Dataset
We'll use a petrol station dataset with fuel transactions across Australia. Each row has a station, fuel type, litres sold, price per litre, and the state. Some values are intentionally missing.
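If you don't have the dataset to hand, a small stand-in is enough to follow along. The sketch below is hypothetical: the station names and figures are made up, and the column names station and state are assumptions (litres, price_per_litre, and fuel_type match the names used later in this article).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: every value here is invented for illustration.
df = pd.DataFrame({
    "station": ["Station A", "Station B", "Station C",
                "Station D", "Station E", "Station F"],
    "fuel_type": ["Unleaded", "Diesel", "Premium", "Unleaded", "Diesel", "E10"],
    "litres": [42.5, 60.0, np.nan, 35.2, 80.0, 28.4],            # one missing value
    "price_per_litre": [1.89, 2.05, 2.19, np.nan, 2.01, 1.79],   # one missing value
    "state": ["NSW", "VIC", "QLD", "WA", "SA", "TAS"],
})

# Count the missing values per column
print(df.isna().sum())
```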
2. Create a Column with Bracket Assignment
The simplest way to create a new column is bracket assignment. Let's add a revenue column initialised to zero:
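A minimal sketch of bracket assignment, using a tiny made-up DataFrame. Assigning a scalar broadcasts it to every row; a float zero is used here (an assumption) so the column already has a float dtype when revenue is computed later.

```python
import pandas as pd

# Hypothetical rows for illustration
df = pd.DataFrame({
    "station": ["Station A", "Station B", "Station C"],
    "litres": [42.5, 60.0, 35.2],
    "price_per_litre": [1.89, 2.05, 1.95],
})

# A scalar on the right-hand side is broadcast to all rows
df["revenue"] = 0.0

print(df)
```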
You can also use .loc to achieve the same thing. This form is useful when you assign into a filtered subset of rows, where chained indexing would trigger the SettingWithCopyWarning:
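Here is a sketch of the .loc form on made-up data. It assumes the column created this way is the discount column, and the NSW filter with a value of 5 is purely illustrative.

```python
import pandas as pd

# Hypothetical rows for illustration
df = pd.DataFrame({
    "station": ["Station A", "Station B", "Station C"],
    "state": ["NSW", "VIC", "NSW"],
    "litres": [42.5, 60.0, 35.2],
})

# Create the column for every row with .loc
df.loc[:, "discount"] = 0

# Assign into a filtered subset in a single .loc call, rather than
# chained indexing such as df[df["state"] == "NSW"]["discount"] = 5
df.loc[df["state"] == "NSW", "discount"] = 5

print(df)
```

Selecting the rows and the column in one .loc call means pandas writes to the original DataFrame rather than to a temporary copy.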
3. Populate with Random Values
Zeros aren't very interesting. Let's populate the discount column with random values between 0 and 15 (cents off per litre) using np.random.randint():
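A possible version of this step is sketched below on made-up rows. Note that the upper bound of np.random.randint() is exclusive, so the sketch passes 16 on the assumption that a discount of 15 cents should be possible. The seed is only there to make the run repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration
df = pd.DataFrame({
    "station": ["Station A", "Station B", "Station C", "Station D"],
    "litres": [42.5, 60.0, 35.2, 80.0],
})

np.random.seed(42)  # fixed seed so the example is reproducible

# One random integer per row; high=16 is exclusive, so values are 0..15
df["discount"] = np.random.randint(0, 16, size=len(df))

print(df)
```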
4. Compute a Derived Column
Now let's compute the actual revenue column: revenue = litres × price per litre. There's a catch: some rows have NaN in litres or price_per_litre, and multiplying any number by NaN gives NaN. Fill the missing values with fillna(0) first, so those rows end up with a revenue of zero instead of NaN:
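The sketch below shows the difference on three made-up rows, two of which have a missing value. The variable name naive is only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one missing litres value, one missing price
df = pd.DataFrame({
    "litres": [40.0, np.nan, 50.0],
    "price_per_litre": [2.0, 1.5, np.nan],
})

# Without fillna, any row with a missing input becomes NaN
naive = df["litres"] * df["price_per_litre"]

# Fill missing inputs with 0 first, so the product is 0 rather than NaN
df["revenue"] = df["litres"].fillna(0) * df["price_per_litre"].fillna(0)

print(naive.isna().sum())   # number of NaN results in the naive version
print(df)
```

Treating a missing input as zero revenue is a modelling choice. If a missing value really means "unknown", leaving the NaN in place may be the more honest option.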
Let's also compute the discounted price by subtracting the discount (in cents) from the price per litre:
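One way to write this is sketched below. It assumes price_per_litre is stored in dollars, so the discount in cents is divided by 100 before subtracting; if your prices are already in cents, drop the division. The column name discounted_price is also an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: prices in dollars, discounts in cents (assumed units)
df = pd.DataFrame({
    "price_per_litre": [2.00, 1.80],
    "discount": [10, 5],
})

# Convert cents to dollars, then subtract from the price
df["discounted_price"] = df["price_per_litre"] - df["discount"] / 100

print(df)
```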
Let's clean up the extra columns and keep the ones we need going forward:
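The article doesn't say which columns are dropped here, so the sketch below assumes the two discount-related columns are the extras. Adjust the list to suit your own DataFrame.

```python
import pandas as pd

# Hypothetical rows for illustration
df = pd.DataFrame({
    "station": ["Station A", "Station B"],
    "litres": [40.0, 50.0],
    "price_per_litre": [2.0, 1.8],
    "discount": [10, 5],
    "discounted_price": [1.90, 1.75],
    "revenue": [80.0, 90.0],
})

# Assumed set of columns to remove; drop() returns a new DataFrame
df = df.drop(columns=["discount", "discounted_price"])

print(df.columns.tolist())
```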
5. Create Categorical Columns with pd.cut()
You have a continuous revenue column. What if the business wants to categorise transactions into groups, say Low, Medium, and High revenue? This is where pd.cut() comes in. It takes a continuous column and bins it into discrete categories.
First, let's calculate the bins dynamically from the data:
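A simple way to derive the edges from the data is to space them evenly between the minimum and maximum. The sketch below uses np.linspace() for that, which is an assumption about how the bins are built; three bins need four edges.

```python
import numpy as np
import pandas as pd

# Hypothetical revenue values for illustration
df = pd.DataFrame({"revenue": [0.0, 30.0, 60.0, 90.0, 120.0]})

# Four evenly spaced edges between min and max give three equal-width bins
bins = np.linspace(df["revenue"].min(), df["revenue"].max(), 4)

print(bins)
```

If you'd rather have roughly the same number of rows in each bucket, quantile-based edges (for example via pd.qcut()) are an alternative to equal-width bins.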
Now create labels for each bin range and apply pd.cut():
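Here is a minimal sketch of the binning step on made-up revenue values. The edges are hard-coded so the example stands alone, and the column name revenue_category follows the name used later in this article. include_lowest=True matters because intervals are open on the left by default, so without it the minimum value would fall outside the first bin and become NaN.

```python
import pandas as pd

# Hypothetical revenue values for illustration
df = pd.DataFrame({"revenue": [0.0, 30.0, 60.0, 90.0, 120.0]})

bins = [0.0, 40.0, 80.0, 120.0]      # four edges define three bins
labels = ["Low", "Medium", "High"]   # one label per bin

# include_lowest=True puts the minimum value (0.0) in the first bin
df["revenue_category"] = pd.cut(
    df["revenue"], bins=bins, labels=labels, include_lowest=True
)

print(df)
```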
Now you can group by the category to see how many transactions fall into each bucket:
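The sketch below counts the rows per category on made-up data. Passing observed=False is a choice made here so that empty categories still show up with a count of zero.

```python
import pandas as pd

# Hypothetical revenue values, binned into three labelled categories
df = pd.DataFrame({"revenue": [0.0, 30.0, 60.0, 90.0, 120.0]})
df["revenue_category"] = pd.cut(
    df["revenue"],
    bins=[0.0, 40.0, 80.0, 120.0],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)

# Count transactions per category; observed=False keeps empty categories
counts = df.groupby("revenue_category", observed=False)["revenue"].count()

print(counts)
```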
6. One-Hot Encoding with pd.get_dummies()
Categorical columns are great for analysis, but machine learning models typically need numerical inputs. One-hot encoding converts each category into its own binary column (1 or 0). Use pd.get_dummies() to do this in one line:
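A minimal sketch on made-up categories is shown below. The prefix "rev" is an arbitrary choice, and dtype=int is passed because recent pandas versions return True/False columns by default rather than 1/0.

```python
import pandas as pd

# Hypothetical category values for illustration
df = pd.DataFrame({
    "revenue_category": ["Low", "High", "Medium", "Low"],
})

# One indicator column per category; dtype=int gives 1/0 instead of True/False
dummies = pd.get_dummies(df["revenue_category"], prefix="rev", dtype=int)

print(dummies)
```

With plain string values the new columns come out in alphabetical order of the category names.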
The dummy_na=True parameter adds a column for NaN values — useful if missing data is meaningful. Now merge the dummies back into the original DataFrame:
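The sketch below shows both steps on made-up data with one missing category. It uses pd.concat() along the column axis to attach the dummies, which is one reasonable reading of "merge" here; df.join(dummies) would work as well. The name df_encoded is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the third category is missing
df = pd.DataFrame({
    "station": ["Station A", "Station B", "Station C", "Station D"],
    "revenue_category": ["Low", "High", np.nan, "Medium"],
})

# dummy_na=True adds an extra indicator column for missing values
dummies = pd.get_dummies(
    df["revenue_category"], prefix="rev", dummy_na=True, dtype=int
)

# Attach the indicator columns side by side; rows align on the index
df_encoded = pd.concat([df, dummies], axis=1)

print(df_encoded)
```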
The DataFrame now has binary indicator columns ready for any machine learning pipeline or statistical model.
Summary
You've learned three essential column creation techniques:
- Direct assignment: df['col'] = value or df.loc[:, 'col'] = value to create new columns with constants, random values, or computed expressions
- Categorical binning: pd.cut() to bin continuous values into labelled categories with dynamic or manual bins
- One-hot encoding: pd.get_dummies() to convert categorical columns into binary indicators for ML models
To practise, try changing the number of bins, using fuel_type instead of revenue_category for one-hot encoding, or computing new derived columns such as revenue per litre.
References
- Original article: How to create columns in Pandas? — Medium
- pandas documentation: pandas.cut
- pandas documentation: pandas.get_dummies
- pandas documentation: pandas.DataFrame.fillna