PySpark is the Python API for Apache Spark, a widely used framework for distributed data processing. PySpark DataFrames have historically had no built-in plotting method, so the standard workflow is: aggregate or filter your data in PySpark, convert the result to a local pandas DataFrame with .toPandas(), then plot with matplotlib.
This tutorial covers six interactive examples — from a basic bar chart to a fully styled multi-series plot — all using EV vs diesel vehicle registration data.
Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Distributed execution and lazy evaluation do not apply in this environment.
- Create the dataset and display it as a table
- Simple bar chart with plt.bar()
- Multi-series line chart with plt.plot()
- Aggregate in PySpark, then plot a stacked bar chart
- Polished line chart with custom styling
- Grouped bar chart with a computed column
The dataset
We will use a small dataset of UK electric vehicle (EV) and diesel new car registrations from 2018 to 2023. Each row records the year, the number of new EV registrations (ev_count), and the number of new diesel registrations (diesel_count). The trend is clear: EVs are rising while diesel is declining.
The dataset has 6 rows spanning 2018 to 2023. EV registrations grew from 2,216 to 87,217, while diesel fell from 58,456 to 24,590. Let's visualise this trend.
Simple bar chart
The first step in any PySpark visualisation is .toPandas(), which collects the distributed DataFrame to the driver as a local pandas DataFrame. We then use plt.bar() to create a simple bar chart:
The pattern is always the same: convert with .toPandas(), then use standard matplotlib calls. The plt.tight_layout() call prevents labels from being clipped, and plt.show() renders the figure.
Line chart with two series
To compare EV and diesel trends side-by-side, a line chart with markers works well. We call plt.plot() twice — once for each series — and use plt.legend() to label them:
The crossover is dramatic: EV registrations overtook diesel in 2023. The marker parameter adds data point indicators ('o' for circles, 's' for squares), making the trend easier to read.
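The two-series pattern can be sketched as below. The pandas DataFrame here stands in for the result of .toPandas() and holds only the 2018 and 2023 figures quoted earlier, so each line is a single segment and the year-by-year crossover is not visible in this reduced version.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for df.toPandas(); intermediate years are omitted.
pdf = pd.DataFrame(
    {
        "year": [2018, 2023],
        "ev_count": [2216, 87217],
        "diesel_count": [58456, 24590],
    }
)

# One plt.plot() call per series; markers distinguish the two lines.
(ev_line,) = plt.plot(pdf["year"], pdf["ev_count"], marker="o", label="EV")
(diesel_line,) = plt.plot(pdf["year"], pdf["diesel_count"], marker="s", label="Diesel")

plt.xticks(pdf["year"])  # tick only the years that are present
plt.xlabel("Year")
plt.ylabel("New registrations")
plt.legend()
plt.tight_layout()
plt.show()
```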
Aggregate in PySpark, then plot
A common real-world pattern is to compute new columns in PySpark before converting to pandas. Here we use F.col() to compute a total, then create a stacked bar chart:
The bottom parameter in the second plt.bar() call stacks diesel on top of EV. This shows the combined total while preserving the individual breakdown. The key insight: do heavy computation in PySpark (where it runs distributed), then convert only the final summary to pandas for plotting.
Polished line chart with styling
Matplotlib defaults are functional but plain. With a few extra lines you can create publication-quality charts. Here we customise the background, gridlines, spines, and fonts:
The fig.patch.set_facecolor() and ax.set_facecolor() calls set the figure and plot-area backgrounds. The ax.grid() call adds subtle gridlines, and ax.spines controls the chart border. Each tweak is a single line, and together they make the chart noticeably easier to read.
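A hedged sketch of these styling calls follows. The colours, font sizes, and title are arbitrary illustrative choices, and the plain lists stand in for columns of a collected pandas DataFrame.

```python
import matplotlib.pyplot as plt

# Figures quoted in the text; intermediate years are omitted.
years = [2018, 2023]
ev = [2216, 87217]
diesel = [58456, 24590]

fig, ax = plt.subplots(figsize=(8, 5))
fig.patch.set_facecolor("#f7f7f7")  # area around the plot
ax.set_facecolor("#ffffff")         # plot area itself

ax.plot(years, ev, marker="o", linewidth=2.5, color="#2a9d8f", label="EV")
ax.plot(years, diesel, marker="s", linewidth=2.5, color="#6c757d", label="Diesel")

ax.grid(axis="y", linestyle="--", alpha=0.4)  # subtle horizontal gridlines
ax.spines["top"].set_visible(False)           # drop the top and right borders
ax.spines["right"].set_visible(False)

ax.set_title("EV vs diesel registrations", fontsize=14, fontweight="bold")
ax.set_xlabel("Year", fontsize=11)
ax.set_ylabel("New registrations", fontsize=11)
ax.set_xticks(years)
ax.legend(frameon=False)
fig.tight_layout()
plt.show()
```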
Grouped bar chart with a computed column
For a grouped (side-by-side) bar chart, we offset the bars using numpy. We also compute the EV share percentage in PySpark before plotting:
The numpy.arange() call creates evenly spaced positions, and the width offset places the two bar groups side by side. This is the standard matplotlib pattern for grouped bars. The EV share column was computed in PySpark using F.round() and F.col() — demonstrating how to do the heavy lifting in Spark and only plot the result.
PySpark visualisation vs pandas vs Polars vs SQL
Every tool has a different approach to graphing:
- PySpark: No plotting method on the classic DataFrame API. Convert to pandas with .toPandas(), then use matplotlib, seaborn, or plotly. Always aggregate first to avoid collecting large datasets to the driver.
- pandas: Built-in .plot() method wraps matplotlib. Example: df.plot.bar(x='year', y='ev_count'). Convenient but limited to single-machine data.
- Polars: Convert to pandas with .to_pandas() or use Polars' .to_numpy() for direct matplotlib arrays. Recent releases also expose an optional DataFrame.plot namespace backed by a separate plotting library.
- SQL: No native graphing. Results are typically exported to a BI tool (Tableau, Power BI) or a notebook for matplotlib/plotly rendering.
The core pattern is universal: prepare a small summary dataset, then hand it to a plotting library. PySpark's .toPandas() is the bridge between distributed computation and local visualisation. On a real cluster, always aggregate or sample before calling .toPandas() to avoid out-of-memory errors on the driver.
Try editing the code blocks above — change colours, swap bar charts for scatter plots (plt.scatter()), or add annotations with ax.annotate() to see how each pattern behaves.
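As a starting point for that kind of experiment, the sketch below swaps in a scatter plot and adds one annotation. The label text and its position are arbitrary, and the two data points are the EV figures quoted earlier.

```python
import matplotlib.pyplot as plt

# EV figures quoted in the text; intermediate years are omitted.
years = [2018, 2023]
ev = [2216, 87217]

fig, ax = plt.subplots()
points = ax.scatter(years, ev, color="tab:green", label="EV")

# Arrow from the label position (xytext) to the data point (xy).
note = ax.annotate(
    "87,217 in 2023",
    xy=(2023, 87217),
    xytext=(2020, 80000),
    arrowprops={"arrowstyle": "->"},
)

ax.set_xticks(years)
ax.set_xlabel("Year")
ax.set_ylabel("New EV registrations")
ax.legend()
fig.tight_layout()
plt.show()
```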
References
- PySpark documentation: DataFrame.toPandas()
- Matplotlib documentation: pyplot API reference
- Matplotlib documentation: Grouped bar chart example
- PySpark groupBy guide: How to Group Data with PySpark
- Pandas plotting: How to Group Data in Pandas