This project is a hands-on PySpark tutorial designed to run inside GitHub Codespaces. It helps you learn and practice PySpark by walking through real examples — from basic DataFrame operations to reading/writing files and setting up your development environment.
| Lesson | Description |
|---|---|
| 1️⃣ | PySpark Basics — SparkSession, DataFrame ops, CSV I/O |
| 2️⃣ | Aggregations and GroupBy — summary stats, multi-level grouping, CSV output |
PySpark is the Python API for Apache Spark, an open-source distributed computing framework used for big data processing and analytics.
Compared with standard single-machine Python tools, it offers:

| Feature | Standard Python (pandas, etc.) | PySpark |
|---|---|---|
| Data Size | In-memory only (limited by RAM) | Distributed (handles huge datasets) |
| Execution | Single-threaded | Parallel, distributed |
| Syntax | Pythonic | Similar to SQL + functional API |
| Use Case | Small to mid-sized data | Big Data, scalable analytics |
| Performance | Slower on large datasets | Optimized with JVM + Spark engine |
This project uses a `.devcontainer` configuration to set up:

- `Dockerfile` — builds the container with Java (`default-jdk`) and PySpark installed
- `devcontainer.json` — VS Code Codespace settings
- `requirements.txt` — lists additional Python packages

To verify the installation, run:

```
python -c "import pyspark; print(pyspark.__version__)"
```
To run a lesson script:

```
python <filename>.py
```