How do you create a graph from a PySpark DataFrame?

Convert the PySpark DataFrame to pandas with .toPandas(), then use matplotlib (plt.bar, plt.plot, etc.) to create charts. PySpark itself has no built-in plotting API.

What does toPandas() do in PySpark?

toPandas() collects all rows from the distributed PySpark DataFrame to the driver node and returns a local pandas DataFrame. Use it only when the data fits in memory.

Can you use matplotlib directly with PySpark?

Not directly. Matplotlib works with local data (numpy arrays, pandas DataFrames). You must first convert a PySpark DataFrame to pandas with .toPandas() before passing it to matplotlib.

What is the best way to visualise large PySpark DataFrames?

Aggregate or sample the data in PySpark first (using groupBy, agg, or sample), then convert the smaller result to pandas for plotting. Never call toPandas() on a large raw DataFrame.

How to Create a Graph with PySpark and Matplotlib

PySpark is the Python API for Apache Spark — the most widely used framework for distributed data processing. However, PySpark has no built-in plotting API. The standard workflow is: aggregate or filter your data in PySpark, convert to a local pandas DataFrame with .toPandas(), then plot with matplotlib.

This tutorial covers six interactive examples — from a basic bar chart to a fully styled multi-series plot — all using EV vs diesel vehicle registration data.

Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.

Create the dataset and display it as a table
Simple bar chart with plt.bar()
Multi-series line chart with plt.plot()
Aggregate in PySpark, then plot a stacked bar chart
Polished line chart with custom styling
Grouped bar chart with a computed column

The dataset

We will use a small dataset of UK electric vehicle (EV) and diesel new car registrations from 2018 to 2023. Each row records the year, the number of new EV registrations (ev_count), and the number of new diesel registrations (diesel_count). The trend is clear: EVs are rising while diesel is declining.

Python — editable

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import matplotlib.pyplot as plt

spark = SparkSession.builder.getOrCreate()

data = [
    (2018, 2216, 58456),
    (2019, 6718, 53251),
    (2020, 7248, 40264),
    (2021, 20665, 30987),
    (2022, 33252, 25301),
    (2023, 87217, 24590),
]
columns = ['year', 'ev_count', 'diesel_count']
df = spark.createDataFrame(data, columns)
df.show()

Figure 1: EV vs diesel registrations — 6 rows, 3 columns.

The dataset has 6 rows spanning 2018 to 2023. EV registrations grew from 2,216 to 87,217, while diesel fell from 58,456 to 24,590. Let's visualise this trend.

Simple bar chart

The first step to any PySpark visualisation is .toPandas(). This collects the distributed DataFrame to the driver as a local pandas DataFrame. Then we use plt.bar() to create a simple bar chart:

Python — editable

pdf = df.toPandas()

plt.figure(figsize=(8, 4))
plt.bar(pdf['year'], pdf['ev_count'], color='#3692eb')
plt.xlabel('Year')
plt.ylabel('EV Registrations')
plt.title('EV Registrations by Year')
plt.tight_layout()
plt.show()

Figure 2: A simple bar chart showing EV registrations rising year over year.

The pattern is always the same: convert with .toPandas(), then use standard matplotlib calls. The plt.tight_layout() call prevents labels from being clipped, and plt.show() renders the figure.
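On a cluster driver or any machine without a display, plt.show() has nothing to render to. A minimal sketch of the same bar chart written to a file instead is shown here. It assumes the non-interactive Agg backend, uses plain lists as stand-ins for pdf['year'] and pdf['ev_count'], and writes to a hypothetical path (out_path) in the system temp directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no display required
import matplotlib.pyplot as plt

# Stand-ins for pdf['year'] and pdf['ev_count'] after df.toPandas()
years = [2018, 2019, 2020, 2021, 2022, 2023]
ev_count = [2216, 6718, 7248, 20665, 33252, 87217]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(years, ev_count, color="#3692eb")
ax.set_xlabel("Year")
ax.set_ylabel("EV Registrations")
ax.set_title("EV Registrations by Year")
fig.tight_layout()

# Hypothetical output path; any writable location works
out_path = os.path.join(tempfile.gettempdir(), "ev_registrations.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)  # release the figure's memory in long-running jobs
```

Closing the figure after saving matters mostly in scheduled jobs that produce many charts in one process.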

Line chart with two series

To compare EV and diesel trends side-by-side, a line chart with markers works well. We call plt.plot() twice — once for each series — and use plt.legend() to label them:

Python — editable

pdf = df.toPandas()

plt.figure(figsize=(8, 4))
plt.plot(pdf['year'], pdf['ev_count'], marker='o', label='EV', color='#3692eb')
plt.plot(pdf['year'], pdf['diesel_count'], marker='s', label='Diesel', color='#ff6384')
plt.xlabel('Year')
plt.ylabel('Registrations')
plt.legend()
plt.title('EV vs Diesel Registrations')
plt.tight_layout()
plt.show()

Figure 3: EV registrations crossing above diesel between 2021 and 2022.

The crossover is dramatic: EV registrations overtook diesel in 2022 (33,252 vs 25,301), and the gap widened sharply in 2023. The marker parameter adds data point indicators ('o' for circles, 's' for squares), making the trend easier to read.
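Rather than reading the crossover off the chart, the first year in which EV registrations exceed diesel can be located from the rows themselves. This is a small plain-Python sketch over the same tuples used to build the DataFrame; crossover_year is a name introduced here for illustration.

```python
# Same rows as the tutorial dataset: (year, ev_count, diesel_count)
data = [
    (2018, 2216, 58456),
    (2019, 6718, 53251),
    (2020, 7248, 40264),
    (2021, 20665, 30987),
    (2022, 33252, 25301),
    (2023, 87217, 24590),
]

# First year, in chronological order, where EV registrations exceed diesel.
# None is returned if the series never cross.
crossover_year = next(
    (year for year, ev, diesel in sorted(data) if ev > diesel),
    None,
)
print(crossover_year)
```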

Aggregate in PySpark, then plot

A common real-world pattern is to compute new columns in PySpark before converting to pandas. Here we use F.col() to add a total column (the combined height of each stack), then create a stacked bar chart:

Python — editable

# Compute total per year in PySpark, then plot
result = df.withColumn('total', F.col('ev_count') + F.col('diesel_count'))
pdf = result.toPandas()

plt.figure(figsize=(8, 4))
plt.bar(pdf['year'], pdf['ev_count'], label='EV', color='#3692eb')
plt.bar(pdf['year'], pdf['diesel_count'], bottom=pdf['ev_count'], label='Diesel', color='#ff6384')
plt.xlabel('Year')
plt.ylabel('Registrations')
plt.legend()
plt.title('Stacked: EV + Diesel Registrations')
plt.tight_layout()
plt.show()

Figure 4: Stacked bar chart. Total registrations dip in 2020, then rise sharply by 2023 as EV growth outpaces the decline in diesel.

The bottom parameter in the second plt.bar() call stacks diesel on top of EV. This shows the combined total while preserving the individual breakdown. The key insight: do heavy computation in PySpark (where it runs distributed), then convert only the final summary to pandas for plotting.
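One way to put the total to work is to label each stack with it. The sketch below assumes the Agg backend and uses plain lists as stand-ins for the pandas columns. It reads the top edge of every diesel segment back from matplotlib and writes that value above the stack; stack_tops is a name introduced here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for this sketch
import matplotlib.pyplot as plt

# Stand-ins for the pandas columns after result.toPandas()
years = [2018, 2019, 2020, 2021, 2022, 2023]
ev_count = [2216, 6718, 7248, 20665, 33252, 87217]
diesel_count = [58456, 53251, 40264, 30987, 25301, 24590]
total = [ev + d for ev, d in zip(ev_count, diesel_count)]  # mirrors the 'total' column

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(years, ev_count, label="EV", color="#3692eb")
diesel_bars = ax.bar(years, diesel_count, bottom=ev_count,
                     label="Diesel", color="#ff6384")

# Top edge of each diesel segment = its y offset (bottom) + its own height
stack_tops = [bar.get_y() + bar.get_height() for bar in diesel_bars]

# Write the combined total just above each stack
for year, top in zip(years, stack_tops):
    ax.text(year, top, f"{int(top):,}", ha="center", va="bottom", fontsize=8)

ax.set_xlabel("Year")
ax.set_ylabel("Registrations")
ax.legend()
fig.tight_layout()
```

Because the top of each stack is the bottom offset plus the segment height, stack_tops should match the total column value for value.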

Polished line chart with styling

Matplotlib defaults are functional but plain. With a few extra lines you can create publication-quality charts. Here we customise the background, gridlines, spines, and fonts:

Python — editable

pdf = df.toPandas()

fig, ax = plt.subplots(figsize=(8, 4))
fig.patch.set_facecolor('#0A2A12')
ax.set_facecolor('#0A2A12')

ax.plot(pdf['year'], pdf['ev_count'], marker='o', linewidth=2.5,
        color='#3692eb', label='EV', markersize=7)
ax.plot(pdf['year'], pdf['diesel_count'], marker='s', linewidth=2.5,
        color='#ff6384', label='Diesel', markersize=7)

ax.set_xlabel('Year', color='#ccc', fontsize=11)
ax.set_ylabel('Registrations', color='#ccc', fontsize=11)
ax.set_title('EV vs Diesel Registrations (Styled)', color='#fff', fontsize=14, fontweight='bold')
ax.legend(facecolor='#0d3318', edgecolor='#333', labelcolor='#ccc')
ax.tick_params(colors='#999')
ax.grid(True, alpha=0.15, color='#fff')

for spine in ax.spines.values():
    spine.set_color('#333')

plt.tight_layout()
plt.show()

Figure 5: Dark-themed line chart with custom colours, grid, and spines.

The fig.patch.set_facecolor() and ax.set_facecolor() calls set the background. The ax.grid() adds subtle gridlines, and ax.spines controls the chart border. These tweaks take a few seconds but make a big difference in readability.
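If several charts should share this look, the same settings can be collected once and applied with matplotlib's rc_context. This is a sketch under stated assumptions rather than a replacement for the code above: DARK_THEME is a name introduced here, the colours are written as six-digit hex, the Agg backend is assumed, and plain lists stand in for the pandas columns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for this sketch
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

# Hypothetical theme dictionary built from standard rcParams keys
DARK_THEME = {
    "figure.facecolor": "#0A2A12",
    "axes.facecolor": "#0A2A12",
    "axes.edgecolor": "#333333",
    "axes.labelcolor": "#cccccc",
    "text.color": "#ffffff",
    "xtick.color": "#999999",
    "ytick.color": "#999999",
    "axes.grid": True,
    "grid.color": "#ffffff",
    "grid.alpha": 0.15,
}

years = [2018, 2019, 2020, 2021, 2022, 2023]
ev_count = [2216, 6718, 7248, 20665, 33252, 87217]
diesel_count = [58456, 53251, 40264, 30987, 25301, 24590]

# Settings apply only to figures created inside the with-block
with plt.rc_context(DARK_THEME):
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(years, ev_count, marker="o", linewidth=2.5, color="#3692eb", label="EV")
    ax.plot(years, diesel_count, marker="s", linewidth=2.5, color="#ff6384", label="Diesel")
    ax.set_xlabel("Year")
    ax.set_ylabel("Registrations")
    ax.set_title("EV vs Diesel Registrations (Styled)")
    ax.legend()
    fig.tight_layout()
```

Figures created outside the with-block keep the default style, so the theme does not leak into other charts in the same session.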

Grouped bar chart with a computed column

For a grouped (side-by-side) bar chart, we offset the bars using numpy. We also compute the EV share percentage in PySpark before plotting:

Python — editable

import numpy as np

# Compute EV share in PySpark
result = df.withColumn(
    'ev_share_pct',
    F.round(F.col('ev_count') / (F.col('ev_count') + F.col('diesel_count')) * 100, 1)
)
pdf = result.toPandas()

x = np.arange(len(pdf['year']))
width = 0.35

fig, ax = plt.subplots(figsize=(8, 4))
bars1 = ax.bar(x - width/2, pdf['ev_count'], width, label='EV', color='#3692eb')
bars2 = ax.bar(x + width/2, pdf['diesel_count'], width, label='Diesel', color='#ff6384')

ax.set_xlabel('Year')
ax.set_ylabel('Registrations')
ax.set_title('EV vs Diesel — Grouped Bar Chart')
ax.set_xticks(x)
ax.set_xticklabels(pdf['year'])
ax.legend()

plt.tight_layout()
plt.show()

Figure 6: Grouped bar chart with EV share computed in PySpark.

The numpy.arange() call creates evenly spaced positions, and the width offset places the two bar groups side by side. This is the standard matplotlib pattern for grouped bars. The EV share column was computed in PySpark using F.round() and F.col(). It is not drawn in this chart, but it travels with pdf and is available for labels or annotations, which shows how to do the heavy lifting in Spark and keep only the result for plotting.
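The share column can be surfaced on the chart with ax.annotate(). The sketch below is one way to do it, assuming the Agg backend and plain lists in place of the pandas columns. The share is recomputed locally here as a stand-in for the ev_share_pct column coming back from Spark.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for this sketch
import matplotlib.pyplot as plt
import numpy as np

years = [2018, 2019, 2020, 2021, 2022, 2023]
ev_count = [2216, 6718, 7248, 20665, 33252, 87217]
diesel_count = [58456, 53251, 40264, 30987, 25301, 24590]

# Local stand-in for the ev_share_pct column computed in PySpark
ev_share_pct = [round(ev / (ev + d) * 100, 1) for ev, d in zip(ev_count, diesel_count)]

x = np.arange(len(years))
width = 0.35

fig, ax = plt.subplots(figsize=(8, 4))
ev_bars = ax.bar(x - width / 2, ev_count, width, label="EV", color="#3692eb")
ax.bar(x + width / 2, diesel_count, width, label="Diesel", color="#ff6384")

# Place each share label 3 points above the centre of its EV bar
for bar, share in zip(ev_bars, ev_share_pct):
    ax.annotate(
        f"{share}%",
        xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
        xytext=(0, 3),
        textcoords="offset points",
        ha="center",
        va="bottom",
        fontsize=8,
    )

ax.set_xticks(x)
ax.set_xticklabels(years)
ax.legend()
fig.tight_layout()
```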

PySpark visualisation vs pandas vs Polars vs SQL

Every tool has a different approach to graphing:

PySpark: No built-in plotting. Convert to pandas with .toPandas(), then use matplotlib, seaborn, or plotly. Always aggregate first to avoid collecting large datasets to the driver.
pandas: Built-in .plot() method wraps matplotlib. Example: df.plot.bar(x='year', y='ev_count'). Convenient but limited to single-machine data.
Polars: Recent versions expose a .plot namespace that delegates to a separate plotting library. For matplotlib, convert to pandas with .to_pandas() or use Polars' .to_numpy() for direct matplotlib arrays.
SQL: No native graphing. Results are typically exported to a BI tool (Tableau, Power BI) or a notebook for matplotlib/plotly rendering.

The core pattern is universal: prepare a small summary dataset, then hand it to a plotting library. PySpark's .toPandas() is the bridge between distributed computation and local visualisation. On a real cluster, always aggregate or sample before calling .toPandas() to avoid out-of-memory errors on the driver.
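To judge whether a result is small enough to collect, a back-of-envelope estimate helps. The helper below is a hypothetical sketch, not part of PySpark or pandas. It assumes every column is a 64-bit number (8 bytes per value) and ignores index and string overhead, so real usage will be higher. The row count would typically come from df.count(); the 500-million-row figure is an invented example.

```python
def estimate_pandas_mib(row_count: int, numeric_columns: int,
                        bytes_per_value: int = 8) -> float:
    """Rough lower bound, in MiB, for a frame of 64-bit numeric columns."""
    return row_count * numeric_columns * bytes_per_value / (1024 ** 2)


# Hypothetical raw table: 500 million rows, 3 numeric columns
raw_mib = estimate_pandas_mib(500_000_000, 3)

# The six-row summary used throughout this tutorial
summary_mib = estimate_pandas_mib(6, 3)

print(f"raw: {raw_mib:,.0f} MiB, summary: {summary_mib:.6f} MiB")
```

The gap between the two numbers is the argument for aggregating first: the summary is negligible, while the raw table would need many gigabytes on the driver before pandas overhead is counted.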

Try editing the code blocks above — change colours, swap bar charts for scatter plots (plt.scatter()), or add annotations with ax.annotate() to see how each pattern behaves.



Suhith Illesinghe

