How do you plot a PySpark DataFrame?

Convert the PySpark DataFrame to pandas with .toPandas(), then use matplotlib or any other Python plotting library. PySpark versions before 4.0 have no built-in DataFrame plotting API.

What is toPandas() in PySpark?

toPandas() collects all rows from the distributed Spark DataFrame to the driver node and returns a pandas DataFrame. Use it only after filtering or aggregating to a manageable size.

Can you use matplotlib with PySpark?

Yes. Convert your PySpark DataFrame to pandas with .toPandas(), then pass columns to matplotlib functions like bar(), plot(), pie(), and scatter().

What chart types work best with PySpark data?

Bar charts for comparisons, line charts for trends over time, pie charts for proportions, and scatter plots for relationships between two numeric columns.

How to Plot with PySpark and Matplotlib — Bar, Line, Pie and Scatter Charts

PySpark is the Python API for Apache Spark, one of the most widely used frameworks for distributed data processing. Before version 4.0, which added a Plotly-backed DataFrame.plot accessor, PySpark DataFrames had no built-in plotting API. The standard workflow is still: aggregate or filter your data in PySpark, convert the result to pandas with .toPandas(), then plot with matplotlib (or seaborn, plotly, etc.).

This tutorial covers six interactive examples: creating datasets, bar charts, line charts, pie charts, scatter plots, and a combined 2×2 subplot grid. Every chart renders directly in your browser.

Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Charts are rendered using matplotlib's AGG backend and displayed as PNG images. Distributed execution and lazy evaluation do not apply in this environment.

Create two datasets — EV registrations and fuel station transactions
Bar chart — compare EV vs diesel registrations by year
Line chart — visualise trends with styled markers
Pie chart — fuel type distribution
Scatter plot — litres vs price relationship
Subplots — combine all four chart types in a 2×2 grid

The datasets

We use two datasets. The first tracks yearly electric vehicle (EV) and diesel registrations. The second is a petrol station transaction log with station, state, fuel_type, litres, and price columns. Run the block below to create both.

Python — editable

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import matplotlib.pyplot as plt
import numpy as np

spark = SparkSession.builder.getOrCreate()

# Dataset 1: EV vs Diesel registrations by year
ev_data = [
    (2019, 6718, 152400),
    (2020, 7248, 143600),
    (2021, 20665, 131200),
    (2022, 39353, 118900),
    (2023, 87217, 105300),
]
ev_columns = ['year', 'ev_count', 'diesel_count']
df = spark.createDataFrame(ev_data, ev_columns)

# Dataset 2: Fuel station transactions
fuel_data = [
    ('Shell Fortitude Valley', 'QLD', 'Unleaded', 44.9, 189.9),
    ('Shell Fortitude Valley', 'QLD', 'Diesel', 55.3, 193.5),
    ('Shell Fortitude Valley', 'QLD', 'Premium', 32.1, 215.9),
    ('BP Southbank', 'VIC', 'Unleaded', 52.1, 185.7),
    ('BP Southbank', 'VIC', 'Unleaded', 47.8, 185.7),
    ('BP Southbank', 'VIC', 'Diesel', 60.0, 192.3),
    ('BP Southbank', 'VIC', 'Premium', 41.0, 209.9),
    ('Caltex Bondi', 'NSW', 'Unleaded', 45.2, 189.5),
    ('Caltex Bondi', 'NSW', 'Diesel', 60.0, 195.9),
    ('Caltex Bondi', 'NSW', 'Premium', 38.5, 212.5),
    ('Caltex Bondi', 'NSW', 'Diesel', 58.7, 195.9),
    ('Caltex Bondi', 'NSW', 'Unleaded', 40.1, 189.5),
    ('BP Southbank', 'VIC', 'Diesel', 63.2, 192.3),
    ('BP Southbank', 'VIC', 'Premium', 35.6, 209.9),
    ('BP Southbank', 'VIC', 'Unleaded', 49.0, 185.7),
]
fuel_columns = ['station', 'state', 'fuel_type', 'litres', 'price']
fuel_df = spark.createDataFrame(fuel_data, fuel_columns)

print("EV Registrations:")
df.show()
print("Fuel Station Transactions:")
fuel_df.show()

Figure 1: Two datasets — EV registrations (5 rows) and fuel transactions (15 rows).

Both datasets are now available as PySpark DataFrames. The EV dataset spans 2019–2023 and shows the rapid growth in electric vehicle adoption alongside declining diesel registrations. The fuel dataset captures individual transactions at three Australian stations.

Bar chart — EV vs Diesel registrations

A grouped bar chart is ideal for comparing two categories across a shared axis. We convert the PySpark DataFrame to pandas with .toPandas(), then use matplotlib's bar() function with offset positions:

Python — editable

pdf = df.toPandas()

fig, ax = plt.subplots(figsize=(8, 4))
x = range(len(pdf))
w = 0.35
ax.bar([i - w/2 for i in x], pdf['ev_count'], w, label='EV', color='#FF8C00')
ax.bar([i + w/2 for i in x], pdf['diesel_count'], w, label='Diesel', color='#4dba8e')
ax.set_xticks(x)
ax.set_xticklabels(pdf['year'])
ax.set_xlabel('Year')
ax.set_ylabel('Registrations')
ax.legend()
ax.set_title('EV vs Diesel Registrations')
plt.tight_layout()
plt.show()

Figure 2: Grouped bar chart comparing EV and diesel registrations by year.

The key steps: .toPandas() collects the Spark DataFrame to the driver, then standard matplotlib functions handle the rendering. Each bar is w = 0.35 wide and is shifted by w/2 to either side of its tick position, so the EV and diesel bars sit side by side rather than overlapping. On a real cluster, always aggregate or filter before calling .toPandas() to avoid pulling millions of rows to the driver.
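The side-by-side placement generalises beyond two series. As a minimal sketch (the helper name grouped_bar_positions and the 0.7 total group width are illustrative assumptions, not part of matplotlib), the bar centres for any number of series could be computed like this:

```python
def grouped_bar_positions(n_groups, n_series, group_width=0.7):
    """Return (bar_width, positions) for a grouped bar chart.

    positions[s][g] is the x coordinate of series s in group g, with
    each group centred on its integer tick position 0, 1, 2, ...
    Hypothetical helper for illustration; not a matplotlib function.
    """
    bar_width = group_width / n_series
    positions = []
    for s in range(n_series):
        # Shift each series so the whole group stays centred on the tick.
        offset = (s - (n_series - 1) / 2) * bar_width
        positions.append([g + offset for g in range(n_groups)])
    return bar_width, positions


# Two series over five years reproduces the w/2 shift used in the bar chart.
width, pos = grouped_bar_positions(n_groups=5, n_series=2, group_width=0.7)
print(width)       # bar width, playing the role of w
print(pos[0][:2])  # first two EV bar centres
print(pos[1][:2])  # first two diesel bar centres
```

Each inner list can be passed straight to ax.bar() as the x positions for one series, with width as the bar width.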

Line chart — trends with styled markers

Line charts are the natural choice for time-series data. Adding markers and a grid makes trends easier to read:

Python — editable

pdf = df.toPandas()

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(pdf['year'], pdf['ev_count'], marker='o', linewidth=2,
        markersize=8, color='#FF8C00', label='EV')
ax.plot(pdf['year'], pdf['diesel_count'], marker='s', linewidth=2,
        markersize=8, color='#4dba8e', label='Diesel')
ax.fill_between(pdf['year'], pdf['ev_count'], alpha=0.1, color='#FF8C00')
ax.fill_between(pdf['year'], pdf['diesel_count'], alpha=0.1, color='#4dba8e')
ax.set_xticks(pdf['year'])  # one tick per year, avoiding fractional ticks such as 2019.5
ax.set_xlabel('Year')
ax.set_ylabel('Registrations')
ax.set_title('EV vs Diesel Registration Trends')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Figure 3: Line chart with markers, fill, and grid showing registration trends.

The marker='o' and marker='s' arguments add circle and square markers respectively. fill_between() adds a subtle shaded area beneath each line, making the diverging trends more visible. The grid(True, alpha=0.3) call adds faint gridlines for easier value reading.

Pie chart — fuel type distribution

Pie charts show proportions of a whole. Here we group the fuel station data by fuel_type, count the transactions, then plot the result:

Python — editable

fuel_counts = fuel_df.groupBy('fuel_type').count().toPandas()

plt.figure(figsize=(6, 6))
plt.pie(fuel_counts['count'], labels=fuel_counts['fuel_type'],
        autopct='%1.1f%%', colors=['#FF8C00', '#FFA500', '#4dba8e'],
        startangle=90, wedgeprops={'edgecolor': '#0A2A12', 'linewidth': 1.5})
plt.title('Fuel Type Distribution')
plt.tight_layout()
plt.show()

Figure 4: Pie chart showing the proportion of each fuel type in the dataset.

The groupBy('fuel_type').count() aggregation runs in PySpark, then .toPandas() brings the small result (3 rows) to the driver. autopct='%1.1f%%' formats each slice with one decimal place. The wedgeprops argument adds a dark edge to each slice for visual separation.
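To make the autopct labels concrete, and to keep each fuel type on the same colour whatever row order groupBy returns (Spark does not guarantee one), here is a small pure-Python sketch. The counts are tallied from the 15-row fuel dataset above; the fuel_colours mapping is an illustrative assumption.

```python
# Transaction counts per fuel type, tallied from the 15-row fuel dataset.
fuel_counts = {"Unleaded": 6, "Diesel": 5, "Premium": 4}

# Illustrative mapping (an assumption): tie each colour to a fuel type
# instead of relying on the row order of the grouped result.
fuel_colours = {"Unleaded": "#FF8C00", "Diesel": "#4dba8e", "Premium": "#FFA500"}

total = sum(fuel_counts.values())
labels = sorted(fuel_counts)  # deterministic slice order
sizes = [fuel_counts[name] for name in labels]
colours = [fuel_colours[name] for name in labels]

# autopct applies the format string to each slice's share of the total,
# expressed as a percentage.
for name, size in zip(labels, sizes):
    pct = 100.0 * size / total
    print(name, "%1.1f%%" % pct)
```

Passing sizes, labels and colours to plt.pie() in that order keeps the slice colours stable between runs.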

Scatter plot — litres vs price

Scatter plots reveal relationships between two numeric variables. We filter out any null values first (good practice), then plot:

Python — editable

clean = fuel_df.filter(
    F.col('litres').isNotNull() & F.col('price').isNotNull()
).toPandas()

plt.figure(figsize=(8, 4))
plt.scatter(clean['litres'], clean['price'], c='#FF8C00', alpha=0.7,
            s=60, edgecolors='#0A2A12', linewidth=0.5)
plt.xlabel('Litres')
plt.ylabel('Price (cents/litre)')
plt.title('Litres vs Price')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Figure 5: Scatter plot showing the relationship between litres purchased and price per litre.

The F.col('litres').isNotNull() filter runs on the Spark side before collecting to pandas. The alpha=0.7 argument makes overlapping points partially transparent, and edgecolors adds a thin border around each dot for clarity.

Subplots — 2×2 grid with all chart types

For dashboards or reports, combine multiple chart types in a single figure using plt.subplots():

Python — editable

pdf = df.toPandas()
clean = fuel_df.filter(
    F.col('litres').isNotNull() & F.col('price').isNotNull()
).toPandas()
fuel_counts = fuel_df.groupBy('fuel_type').count().toPandas()

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Top-left: Bar chart
ax = axes[0, 0]
x = range(len(pdf))
w = 0.35
ax.bar([i - w/2 for i in x], pdf['ev_count'], w, label='EV', color='#FF8C00')
ax.bar([i + w/2 for i in x], pdf['diesel_count'], w, label='Diesel', color='#4dba8e')
ax.set_xticks(x)
ax.set_xticklabels(pdf['year'], fontsize=8)
ax.legend(fontsize=8)
ax.set_title('Bar: EV vs Diesel', fontsize=10)

# Top-right: Line chart
ax = axes[0, 1]
ax.plot(pdf['year'], pdf['ev_count'], marker='o', color='#FF8C00', label='EV')
ax.plot(pdf['year'], pdf['diesel_count'], marker='s', color='#4dba8e', label='Diesel')
ax.legend(fontsize=8)
ax.set_xticks(pdf['year'])  # one tick per year
ax.grid(True, alpha=0.3)
ax.set_title('Line: Registration Trends', fontsize=10)

# Bottom-left: Pie chart
ax = axes[1, 0]
ax.pie(fuel_counts['count'], labels=fuel_counts['fuel_type'],
       autopct='%1.0f%%', colors=['#FF8C00', '#FFA500', '#4dba8e'],
       textprops={'fontsize': 9})
ax.set_title('Pie: Fuel Types', fontsize=10)

# Bottom-right: Scatter plot
ax = axes[1, 1]
ax.scatter(clean['litres'], clean['price'], c='#FF8C00', alpha=0.7, s=40,
           edgecolors='#0A2A12', linewidth=0.5)
ax.set_xlabel('Litres', fontsize=9)
ax.set_ylabel('Price', fontsize=9)
ax.grid(True, alpha=0.3)
ax.set_title('Scatter: Litres vs Price', fontsize=10)

fig.suptitle('PySpark Data Visualisation Dashboard', fontsize=13, fontweight='bold')
# Reserve the top 5% of the figure so the overall title is not clipped or overlapped
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

Figure 6: A 2×2 subplot grid combining bar, line, pie, and scatter charts.

The plt.subplots(2, 2) call creates a 2-row, 2-column grid. Each axes[row, col] is an independent matplotlib axes object. plt.tight_layout() adjusts spacing to prevent overlap. This pattern scales to any grid size. Try plt.subplots(1, 4) for a single-row dashboard, but note that axes is then a one-dimensional array, so index it as axes[0] to axes[3] rather than axes[row, col].
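Because plt.subplots() returns a 2-D array of axes for a 2×2 grid but a 1-D array for a single row, indexing code has to change with the layout. A minimal sketch of iterating with axes.flat, which works for either shape (the headless Agg backend and the placeholder panel titles are assumptions for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption for scripts without a display
import matplotlib.pyplot as plt

titles = ["Bar", "Line", "Pie", "Scatter"]  # placeholder panel titles

# A single row: axes is a 1-D array of four Axes objects.
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

# axes.flat iterates over every Axes regardless of the grid shape,
# so the same loop also works for plt.subplots(2, 2).
for ax, title in zip(axes.flat, titles):
    ax.set_title(title, fontsize=10)

fig.tight_layout()
print(axes.shape)
```

Writing the plotting code against axes.flat means the grid dimensions can change without touching the per-panel code.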

Choosing the right chart type

Not every chart works for every dataset. Here is a quick guide:

Bar chart: Best for comparing discrete categories or groups. Use grouped bars when comparing two measures (e.g., EV vs diesel). Use horizontal bars when category labels are long.
Line chart: Best for continuous data or time series. Shows trends, rates of change, and crossover points clearly. Add markers when you have few data points.
Pie chart: Best for showing parts of a whole (proportions). Limit to 3–5 slices — more than that and a bar chart is easier to read. Avoid 3D pie charts.
Scatter plot: Best for exploring relationships between two numeric variables. Add colour or size dimensions to encode additional variables. Use alpha transparency for overlapping points.
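The scatter plot advice about encoding an extra variable with colour can be sketched as follows. The small pandas frame stands in for the result of .toPandas() on the fuel data (six rows copied from the dataset above), and the colour mapping is an illustrative assumption.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption for scripts without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for fuel_df.toPandas(): six rows copied from the fuel dataset.
clean = pd.DataFrame({
    "fuel_type": ["Unleaded", "Diesel", "Premium", "Unleaded", "Diesel", "Premium"],
    "litres": [44.9, 55.3, 32.1, 52.1, 60.0, 41.0],
    "price": [189.9, 193.5, 215.9, 185.7, 192.3, 209.9],
})

# Illustrative colour per fuel type (an assumption, not from the tutorial).
colours = {"Unleaded": "#FF8C00", "Diesel": "#4dba8e", "Premium": "#FFA500"}

fig, ax = plt.subplots(figsize=(8, 4))
# One scatter call per category gives each fuel type its own legend entry.
for fuel_type, group in clean.groupby("fuel_type"):
    ax.scatter(group["litres"], group["price"], color=colours[fuel_type],
               alpha=0.7, s=60, label=fuel_type)
ax.set_xlabel("Litres")
ax.set_ylabel("Price (cents/litre)")
ax.legend(title="Fuel type")
fig.tight_layout()
```

Grouping on the pandas side is fine here because the data has already been reduced to a handful of rows.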

PySpark visualisation vs other tools

If you work across multiple tools, here is how PySpark plotting compares:

PySpark + matplotlib: Convert with .toPandas(), then use standard matplotlib. Full control over styling but requires more code. Best for static publication-quality charts.
PySpark + plotly: Same .toPandas() pattern, but produces interactive HTML charts with hover tooltips and zoom. Better for exploratory analysis and dashboards.
PySpark + seaborn: Built on matplotlib with cleaner defaults and statistical chart types (violin plots, heatmaps). Less boilerplate for common patterns.
pandas plotting: df.plot() is a one-liner but offers less control. Uses matplotlib under the hood.
Polars + matplotlib: Same pattern as PySpark — convert with .to_pandas(), then plot. Polars is faster for single-machine datasets.
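For comparison, the pandas one-liner mentioned in the list looks roughly like this. The frame below stands in for df.toPandas() on the EV dataset; treat it as a sketch rather than a replacement for the matplotlib version.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption for scripts without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for df.toPandas(): the EV registration data from the tutorial.
pdf = pd.DataFrame({
    "year": [2019, 2020, 2021, 2022, 2023],
    "ev_count": [6718, 7248, 20665, 39353, 87217],
    "diesel_count": [152400, 143600, 131200, 118900, 105300],
})

# pandas draws a grouped bar chart through matplotlib and returns the Axes,
# which can still be customised with ordinary matplotlib calls.
ax = pdf.plot(x="year", y=["ev_count", "diesel_count"], kind="bar",
              color=["#FF8C00", "#4dba8e"], figsize=(8, 4), rot=0)
ax.set_ylabel("Registrations")
ax.set_title("EV vs Diesel Registrations")
plt.tight_layout()
```

The bar offsets and tick labels are handled by pandas, which is where the saving in code comes from.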

The core principle is the same across all tools: reduce your data to a manageable size in the engine that holds it (PySpark, Polars), then convert the small result to pandas for plotting. Never call .toPandas() on a full billion-row dataset.
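One way to enforce that rule is a small guard around the conversion. The sketch below is hypothetical: to_pandas_capped and the max_rows default are illustrative names and values, not part of PySpark. It relies only on the DataFrame methods count() and toPandas().

```python
def to_pandas_capped(sdf, max_rows=100_000):
    """Convert a Spark DataFrame to pandas, refusing oversized results.

    Hypothetical helper (not a PySpark API). `sdf` is expected to be a
    pyspark.sql.DataFrame; only count() and toPandas() are called on it.
    """
    n_rows = sdf.count()  # runs a Spark job, but returns a single number
    if n_rows > max_rows:
        raise ValueError(
            f"{n_rows} rows exceeds max_rows={max_rows}; "
            "aggregate, filter or sample before calling toPandas()"
        )
    return sdf.toPandas()


# Usage on a cluster (assumes the fuel_df defined earlier in the tutorial):
#   fuel_counts = to_pandas_capped(fuel_df.groupBy('fuel_type').count())
```

The extra count() costs one more Spark job, so a guard like this suits interactive analysis better than tight production loops.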

Try editing the code blocks above — change bar colours, swap chart types, add a title, or combine different columns to see how each pattern behaves.

Suhith Illesinghe

Curiosity is the first step to make a difference. I hope to inspire others to explore, build and champion collaborative growth.
