PySpark is the Python API for Apache Spark, one of the most widely used frameworks for distributed data processing. Spark DataFrames have traditionally had no built-in plotting API (the pandas API on Spark and recent releases add Plotly-backed plotting, which this tutorial does not use). The standard workflow is: aggregate or filter your data in PySpark, convert the result to pandas with .toPandas(), then plot with matplotlib (or seaborn, plotly, etc.).
This tutorial covers six interactive examples: creating datasets, bar charts, line charts, pie charts, scatter plots, and a combined 2×2 subplot grid. Every chart renders directly in your browser.
Note: This tutorial uses real PySpark syntax running on a browser-based simulation powered by pandas. The code is copy-paste ready for a real Spark cluster. Charts are rendered with matplotlib's non-interactive Agg backend and displayed as PNG images. Distributed execution and lazy evaluation do not apply in this environment.
- Create two datasets — EV registrations and fuel station transactions
- Bar chart — compare EV vs diesel registrations by year
- Line chart — visualise trends with styled markers
- Pie chart — fuel type distribution
- Scatter plot — litres vs price relationship
- Subplots — combine all four chart types in a 2×2 grid
The datasets
We use two datasets. The first tracks yearly electric vehicle (EV) and diesel registrations. The second is a petrol station transaction log with station, state, fuel_type, litres, and price columns. Run the block below to create both.
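A minimal sketch of such a block follows. The fuel columns match the ones named above; the EV column names (`year`, `ev`, `diesel`), the station names, and every number are placeholders chosen for illustration, not real registration or sales figures. One `litres` value is deliberately left empty so that null handling has something to act on.

```python
# Placeholder rows: all values are illustrative, not real data.
EV_COLUMNS = ["year", "ev", "diesel"]  # assumed column names
EV_ROWS = [
    (2019, 6, 55),
    (2020, 9, 50),
    (2021, 20, 45),
    (2022, 40, 40),
    (2023, 90, 34),
]

FUEL_COLUMNS = ["station", "state", "fuel_type", "litres", "price"]
FUEL_ROWS = [
    ("Station A", "NSW", "Unleaded", 42.5, 1.89),
    ("Station A", "NSW", "Diesel", 60.0, 2.05),
    ("Station A", "NSW", "Premium", 35.2, 2.19),
    ("Station B", "VIC", "Unleaded", 28.4, 1.92),
    ("Station B", "VIC", "Diesel", None, 2.01),   # missing litres on purpose
    ("Station B", "VIC", "Unleaded", 51.3, 1.87),
    ("Station C", "QLD", "Premium", 44.8, 2.25),
    ("Station C", "QLD", "Unleaded", 39.9, 1.95),
    ("Station C", "QLD", "Diesel", 72.1, 2.09),
]


def create_datasets(spark):
    """Return (ev_df, fuel_df) as PySpark DataFrames on the given SparkSession."""
    ev_df = spark.createDataFrame(EV_ROWS, schema=EV_COLUMNS)
    fuel_df = spark.createDataFrame(FUEL_ROWS, schema=FUEL_COLUMNS)
    return ev_df, fuel_df


def main():
    """Create a local session, build both DataFrames, and print them."""
    from pyspark.sql import SparkSession  # needs PySpark and a JVM installed

    spark = SparkSession.builder.appName("pyspark-charts").getOrCreate()
    ev_df, fuel_df = create_datasets(spark)
    ev_df.show()
    fuel_df.show()
```

Calling `main()` on a machine with PySpark installed prints both tables; `create_datasets(spark)` can also be reused with an existing session.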
Both datasets are now available as PySpark DataFrames. The EV dataset spans 2019–2023 and shows the rapid growth in electric vehicle adoption alongside declining diesel registrations. The fuel dataset captures individual transactions at three Australian stations.
Bar chart — EV vs Diesel registrations
A grouped bar chart is ideal for comparing two categories across a shared axis. We convert the PySpark DataFrame to pandas with .toPandas(), then use matplotlib's bar() function with offset positions:
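A minimal sketch of this step, assuming the EV data has columns named `year`, `ev`, and `diesel` (hypothetical names) and using placeholder values. The plotting function takes a pandas DataFrame, and a thin wrapper shows where `.toPandas()` fits when the input is a Spark DataFrame.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: figures are saved, not shown
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_grouped_bars(pdf):
    """Draw EV and diesel bars side by side for each year in a pandas frame."""
    x = np.arange(len(pdf))  # one slot per year
    w = 0.35                 # bar width; each series is shifted by w / 2
    fig, ax = plt.subplots(figsize=(8, 4.5))
    ax.bar(x - w / 2, pdf["ev"], width=w, label="EV")
    ax.bar(x + w / 2, pdf["diesel"], width=w, label="Diesel")
    ax.set_xticks(x)
    ax.set_xticklabels(pdf["year"].astype(str).tolist())
    ax.set_xlabel("Year")
    ax.set_ylabel("Registrations")
    ax.set_title("EV vs diesel registrations")
    ax.legend()
    return fig, ax


def bar_chart_from_spark(ev_df):
    """Collect a small Spark DataFrame to the driver, then plot it."""
    pdf = ev_df.orderBy("year").toPandas()
    return plot_grouped_bars(pdf)


# Placeholder values, for illustration only.
sample = pd.DataFrame(
    {"year": [2019, 2020, 2021], "ev": [6, 9, 20], "diesel": [55, 50, 45]}
)
fig, ax = plot_grouped_bars(sample)
```

To write the chart to disk, call `fig.savefig("ev_vs_diesel.png")`; the file name is arbitrary.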
The key steps: .toPandas() collects the Spark DataFrame to the driver, then standard matplotlib functions handle the rendering. The bar width w = 0.35 also sets the offset between the two series, so the bars sit side by side rather than overlapping. On a real cluster, always aggregate or filter before calling .toPandas() to avoid pulling millions of rows to the driver.
Line chart — trends with styled markers
Line charts are the natural choice for time-series data. Adding markers and a grid makes trends easier to read:
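One way this could look, again assuming hypothetical `year`, `ev`, and `diesel` columns and placeholder values. The shading uses `fill_between()` down to zero, which is a stylistic choice rather than a requirement.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd


def plot_trend_lines(pdf):
    """Plot EV and diesel trends with markers, light shading, and a faint grid."""
    years = pdf["year"].tolist()
    fig, ax = plt.subplots(figsize=(8, 4.5))
    ax.plot(years, pdf["ev"], marker="o", linewidth=2, label="EV")
    ax.plot(years, pdf["diesel"], marker="s", linewidth=2, label="Diesel")
    # Shade the area between each line and zero
    ax.fill_between(years, pdf["ev"], alpha=0.15)
    ax.fill_between(years, pdf["diesel"], alpha=0.15)
    ax.set_xticks(years)  # one tick per year, avoids fractional years
    ax.set_xlabel("Year")
    ax.set_ylabel("Registrations")
    ax.grid(True, alpha=0.3)
    ax.legend()
    return fig, ax


def line_chart_from_spark(ev_df):
    """Sort on the Spark side, collect the small result, then plot."""
    return plot_trend_lines(ev_df.orderBy("year").toPandas())


# Placeholder values, for illustration only.
sample = pd.DataFrame(
    {"year": [2019, 2020, 2021, 2022], "ev": [6, 9, 20, 40], "diesel": [55, 50, 45, 40]}
)
fig, ax = plot_trend_lines(sample)
```

Sorting by `year` before collecting matters for line charts, because Spark does not guarantee row order and matplotlib connects points in the order it receives them.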
The marker='o' and marker='s' arguments add circle and square markers respectively. fill_between() adds a subtle shaded area beneath each line, making the diverging trends more visible. The grid(True, alpha=0.3) call adds faint gridlines for easier value reading.
Pie chart — fuel type distribution
Pie charts show proportions of a whole. Here we group the fuel station data by fuel_type, count the transactions, then plot the result:
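A minimal sketch of the aggregate-then-plot step. `groupBy(...).count()` produces a column named `count`, which the plotting function relies on; the fuel type names and counts in the sample are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd


def plot_fuel_share(counts):
    """Pie chart from a pandas frame with 'fuel_type' and 'count' columns."""
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.pie(
        counts["count"],
        labels=counts["fuel_type"].tolist(),
        autopct="%1.1f%%",  # one decimal place per slice
        startangle=90,
        wedgeprops={"edgecolor": "#222222", "linewidth": 1},  # dark slice border
    )
    ax.set_title("Transactions by fuel type")
    return fig, ax


def pie_chart_from_spark(fuel_df):
    """Count transactions per fuel type in Spark, then collect the few result rows."""
    counts = fuel_df.groupBy("fuel_type").count().orderBy("fuel_type").toPandas()
    return plot_fuel_share(counts)


# Placeholder counts, for illustration only.
sample_counts = pd.DataFrame(
    {"fuel_type": ["Diesel", "Premium", "Unleaded"], "count": [2, 2, 4]}
)
fig, ax = plot_fuel_share(sample_counts)
```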
The groupBy('fuel_type').count() aggregation runs in PySpark, then .toPandas() brings the small result (3 rows) to the driver. autopct='%1.1f%%' formats each slice with one decimal place. The wedgeprops argument adds a dark edge to each slice for visual separation.
Scatter plot — litres vs price
Scatter plots reveal relationships between two numeric variables. We filter out any null values first (good practice), then plot:
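A sketch under the same assumptions about column names (`litres`, `price`), with placeholder values. The PySpark import sits inside the Spark-facing function so that the plotting half can also be called on any pandas DataFrame; filtering `price` as well as `litres` is an extra precaution added here.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd


def plot_litres_vs_price(pdf):
    """Scatter plot of litres against price from a pandas frame."""
    fig, ax = plt.subplots(figsize=(7, 4.5))
    ax.scatter(
        pdf["litres"],
        pdf["price"],
        alpha=0.7,            # partial transparency for overlapping points
        edgecolors="black",   # thin border around each dot
        linewidths=0.5,
    )
    ax.set_xlabel("Litres")
    ax.set_ylabel("Price")
    ax.set_title("Litres vs price")
    return fig, ax


def scatter_from_spark(fuel_df):
    """Drop nulls on the Spark side, keep only the two plotted columns, collect."""
    from pyspark.sql import functions as F  # requires PySpark

    clean = fuel_df.filter(F.col("litres").isNotNull() & F.col("price").isNotNull())
    return plot_litres_vs_price(clean.select("litres", "price").toPandas())


# Placeholder values, for illustration only.
sample = pd.DataFrame(
    {"litres": [42.5, 60.0, 28.4, 51.3], "price": [1.89, 2.05, 1.92, 1.87]}
)
fig, ax = plot_litres_vs_price(sample)
```

Selecting only the plotted columns before `.toPandas()` keeps the collected result as small as possible.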
The F.col('litres').isNotNull() filter runs on the Spark side before collecting to pandas. The alpha=0.7 argument makes overlapping points partially transparent, and edgecolors adds a thin border around each dot for clarity.
Subplots — 2×2 grid with all chart types
For dashboards or reports, combine multiple chart types in a single figure using plt.subplots():
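A compact sketch of such a grid. It takes three small pandas frames (in a Spark workflow, each would be the result of a `.toPandas()` call on an aggregated or filtered DataFrame); the column names and sample values are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_dashboard(ev_pdf, counts_pdf, fuel_pdf):
    """Draw bar, line, pie, and scatter charts in a 2x2 grid."""
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    years = ev_pdf["year"].tolist()

    # Top-left: grouped bars
    x, w = np.arange(len(ev_pdf)), 0.35
    axes[0, 0].bar(x - w / 2, ev_pdf["ev"], width=w, label="EV")
    axes[0, 0].bar(x + w / 2, ev_pdf["diesel"], width=w, label="Diesel")
    axes[0, 0].set_xticks(x)
    axes[0, 0].set_xticklabels([str(y) for y in years])
    axes[0, 0].set_title("Registrations by year")
    axes[0, 0].legend()

    # Top-right: trend lines
    axes[0, 1].plot(years, ev_pdf["ev"], marker="o", label="EV")
    axes[0, 1].plot(years, ev_pdf["diesel"], marker="s", label="Diesel")
    axes[0, 1].set_xticks(years)
    axes[0, 1].grid(True, alpha=0.3)
    axes[0, 1].set_title("Trends")
    axes[0, 1].legend()

    # Bottom-left: fuel type share
    axes[1, 0].pie(counts_pdf["count"], labels=counts_pdf["fuel_type"].tolist(),
                   autopct="%1.1f%%")
    axes[1, 0].set_title("Fuel type share")

    # Bottom-right: litres vs price
    axes[1, 1].scatter(fuel_pdf["litres"], fuel_pdf["price"], alpha=0.7,
                       edgecolors="black")
    axes[1, 1].set_title("Litres vs price")

    plt.tight_layout()  # fig is the current figure at this point
    return fig, axes


# Placeholder values, for illustration only.
ev = pd.DataFrame({"year": [2021, 2022, 2023], "ev": [20, 40, 90], "diesel": [45, 40, 34]})
counts = pd.DataFrame({"fuel_type": ["Diesel", "Unleaded"], "count": [3, 5]})
fuel = pd.DataFrame({"litres": [42.5, 60.0, 28.4], "price": [1.89, 2.05, 1.92]})
fig, axes = plot_dashboard(ev, counts, fuel)
```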
The plt.subplots(2, 2) call creates a 2-row, 2-column grid. Each axes[row, col] is an independent matplotlib axes object. plt.tight_layout() adjusts spacing to prevent overlap. This pattern scales to any grid size: try changing to plt.subplots(1, 4) for a single-row dashboard, noting that a single-row grid returns a one-dimensional array, so each panel is then axes[i] rather than axes[row, col].
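The indexing difference between grid shapes is easy to check directly. This short sketch compares the array returned for a 2×2 grid with the one returned for a single row, and uses `.flat` as a way to loop over panels regardless of layout.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

fig_grid, axes_grid = plt.subplots(2, 2)
fig_row, axes_row = plt.subplots(1, 4, figsize=(16, 4))

print(axes_grid.shape)  # (2, 2): index as axes_grid[row, col]
print(axes_row.shape)   # (4,): index as axes_row[i]

# .flat visits every panel in row-major order for either layout
for i, ax in enumerate(axes_row.flat):
    ax.set_title(f"Panel {i + 1}")
fig_row.tight_layout()
```

Passing `squeeze=False` to `plt.subplots()` keeps the result two-dimensional (shape `(1, 4)` here), which is convenient when the same `axes[row, col]` code should work for every grid size.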
Choosing the right chart type
Not every chart works for every dataset. Here is a quick guide:
- Bar chart: Best for comparing discrete categories or groups. Use grouped bars when comparing two measures (e.g., EV vs diesel). Use horizontal bars when category labels are long.
- Line chart: Best for continuous data or time series. Shows trends, rates of change, and crossover points clearly. Add markers when you have few data points.
- Pie chart: Best for showing parts of a whole (proportions). Limit to 3–5 slices — more than that and a bar chart is easier to read. Avoid 3D pie charts.
- Scatter plot: Best for exploring relationships between two numeric variables. Add colour or size dimensions to encode additional variables. Use alpha transparency for overlapping points.
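As an illustration of the last point, a scatter plot can encode a third variable by colour. The sketch below draws one `scatter()` call per category so that each group gets its own colour and legend entry; the `fuel_type`, `litres`, and `price` column names follow the fuel dataset, and the values are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd


def plot_scatter_by_category(pdf, category="fuel_type"):
    """Scatter litres vs price, one colour per value of the category column."""
    fig, ax = plt.subplots(figsize=(7, 4.5))
    for name, group in pdf.groupby(category):
        # Each call picks the next colour from matplotlib's default cycle
        ax.scatter(group["litres"], group["price"], alpha=0.7, label=str(name))
    ax.set_xlabel("Litres")
    ax.set_ylabel("Price")
    ax.legend(title=category)
    return fig, ax


# Placeholder values, for illustration only.
sample = pd.DataFrame(
    {
        "fuel_type": ["Unleaded", "Diesel", "Unleaded", "Diesel"],
        "litres": [42.5, 60.0, 28.4, 72.1],
        "price": [1.89, 2.05, 1.92, 2.09],
    }
)
fig, ax = plot_scatter_by_category(sample)
```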
PySpark visualisation vs other tools
If you work across multiple tools, here is how PySpark plotting compares:
- PySpark + matplotlib: Convert with .toPandas(), then use standard matplotlib. Full control over styling but requires more code. Best for static publication-quality charts.
- PySpark + plotly: Same .toPandas() pattern, but produces interactive HTML charts with hover tooltips and zoom. Better for exploratory analysis and dashboards.
- PySpark + seaborn: Built on matplotlib with cleaner defaults and statistical chart types (violin plots, heatmaps). Less boilerplate for common patterns.
- pandas plotting: df.plot() is a one-liner but offers less control. Uses matplotlib under the hood.
- Polars + matplotlib: Same pattern as PySpark: convert with .to_pandas(), then plot. Polars is faster for single-machine datasets.
The core principle is the same across all tools: reduce your data to a manageable size in the processing engine (PySpark on a cluster, Polars on a single machine), then hand the small result to pandas for plotting. Never call .toPandas() on a full billion-row dataset.
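One way to enforce this in code is a small guard around `.toPandas()`. The helper below is a hypothetical convenience function, not part of PySpark; it uses only the standard DataFrame methods `limit()`, `count()`, and `toPandas()`, and the default threshold of 10,000 rows is an arbitrary assumption.

```python
def to_pandas_capped(df, max_rows=10_000):
    """Collect a Spark DataFrame to pandas only if it has at most max_rows rows.

    Hypothetical helper: raises ValueError instead of silently pulling a
    large dataset onto the driver.
    """
    # Counting a limited view means Spark typically does not need to
    # count the whole dataset just to learn that it is too big.
    n = df.limit(max_rows + 1).count()
    if n > max_rows:
        raise ValueError(
            f"DataFrame has more than {max_rows} rows; "
            "aggregate or filter it before collecting"
        )
    return df.toPandas()
```

Typical use would be `pdf = to_pandas_capped(fuel_df.groupBy("fuel_type").count())`, which passes easily because the aggregated result has only a handful of rows.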
Try editing the code blocks above — change bar colours, swap chart types, add a title, or combine different columns to see how each pattern behaves.
References
- Matplotlib documentation: pyplot API
- PySpark documentation: DataFrame.toPandas()
- Matplotlib gallery: Example gallery
- PySpark groupBy: How to Group Data with PySpark
- PySpark filter: How to Filter Data with PySpark