PySpark Tutorials

PySpark is what you reach for when your data has outgrown a single machine. It gives you a DataFrame API that feels familiar if you've used pandas, but underneath it distributes the work across a Spark cluster, so the code you'd write against a few thousand rows also runs against terabytes. It's the default toolkit on Databricks, Amazon EMR, Azure Synapse and a lot of in-house platforms.

These tutorials are written in the same Minimal Viable Analytics style I use with clients: each one focuses on a single core operation — filtering, sorting, grouping, joining, creating columns, plotting — with working PySpark you can paste straight into a notebook. New to Spark? Start with filtering. Coming from pandas? Skim and notice where the API diverges.