PySpark is the Python API for Apache Spark, one of the most widely used frameworks for distributed data processing. PySpark DataFrames have historically had no built-in plotting API, so the standard workflow is: aggregate or filter your data in PySpark, convert the result to a local pandas DataFrame with .toPandas(), then plot with matplotlib.

This tutorial covers six interactive examples — from a basic bar chart to a fully styled multi-series plot — all using EV vs diesel vehicle registration data.

Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.

  1. Create the dataset and display it as a table
  2. Simple bar chart with plt.bar()
  3. Multi-series line chart with plt.plot()
  4. Aggregate in PySpark, then plot a stacked bar chart
  5. Polished line chart with custom styling
  6. Grouped bar chart with a computed column

The dataset

We will use a small dataset of UK electric vehicle (EV) and diesel new car registrations from 2018 to 2023. Each row records the year, the number of new EV registrations (ev_count), and the number of new diesel registrations (diesel_count). The trend is clear: EVs are rising while diesel is declining.

Figure 1: EV vs diesel registrations — 6 rows, 3 columns.

The dataset has 6 rows spanning 2018 to 2023. EV registrations grew from 2,216 to 87,217, while diesel fell from 58,456 to 24,590. Let's visualise this trend.

Simple bar chart

The first step in any PySpark visualisation is .toPandas(), which collects the distributed DataFrame to the driver as a local pandas DataFrame. We then use plt.bar() to create a simple bar chart:

Figure 2: A simple bar chart showing EV registrations rising year over year.

The pattern is always the same: convert with .toPandas(), then use standard matplotlib calls. The plt.tight_layout() call prevents labels from being clipped, and plt.show() renders the figure.

Line chart with two series

To compare EV and diesel trends side-by-side, a line chart with markers works well. We call plt.plot() twice — once for each series — and use plt.legend() to label them:

Figure 3: EV registrations crossing above diesel around 2022–2023.

The crossover is dramatic: EV registrations overtook diesel in 2023. The marker parameter adds data point indicators ('o' for circles, 's' for squares), making the trend easier to read.

Aggregate in PySpark, then plot

A common real-world pattern is to compute new columns or aggregates in PySpark before converting to pandas. Here we use F.col() to add a total column (EV plus diesel), then create a stacked bar chart:

Figure 4: Stacked bar chart of EV and diesel registrations per year, with the combined total shown by the full bar height.

The bottom parameter in the second plt.bar() call stacks diesel on top of EV. This shows the combined total while preserving the individual breakdown. The key insight: do heavy computation in PySpark (where it runs distributed), then convert only the final summary to pandas for plotting.

Polished line chart with styling

Matplotlib defaults are functional but plain. With a few extra lines you can create publication-quality charts. Here we customise the background, gridlines, spines, and fonts:

Figure 5: Dark-themed line chart with custom colours, grid, and spines.

The fig.patch.set_facecolor() and ax.set_facecolor() calls set the figure and axes backgrounds. The ax.grid() call adds subtle gridlines, and ax.spines controls the chart border. These tweaks take only a few extra lines but make a big difference in readability.

Grouped bar chart with a computed column

For a grouped (side-by-side) bar chart, we offset the bars using numpy. We also compute the EV share percentage in PySpark before plotting:

Figure 6: Grouped bar chart with EV share computed in PySpark.

The numpy.arange() call creates evenly spaced positions, and the width offset places the two bar groups side by side. This is the standard matplotlib pattern for grouped bars. The EV share column was computed in PySpark using F.round() and F.col() — demonstrating how to do the heavy lifting in Spark and only plot the result.

PySpark visualisation vs pandas vs Polars vs SQL

Each tool takes a different approach to graphing.

The core pattern is universal: prepare a small summary dataset, then hand it to a plotting library. PySpark's .toPandas() is the bridge between distributed computation and local visualisation. On a real cluster, always aggregate or sample before calling .toPandas() to avoid out-of-memory errors on the driver.

Try editing the code blocks above — change colours, swap bar charts for scatter plots (plt.scatter()), or add annotations with ax.annotate() to see how each pattern behaves.


Suhith Illesinghe
Curiosity is the first step to make a difference. I hope to inspire others to explore, build and champion collaborative growth.