This project is a hands-on PySpark tutorial designed to run inside GitHub Codespaces. It helps you learn and practice PySpark by walking through real examples — from basic DataFrame operations to reading/writing files and setting up your development environment.
| Lesson | Description |
|---|---|
| 1️⃣ | Lesson 1 – PySpark Basics — SparkSession, DataFrame ops, CSV I/O |
| 2️⃣ | Lesson 2 – Aggregations and GroupBy — summary stats, multi-level grouping, CSV output |
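To give a feel for what the lessons cover, here is a minimal sketch of a Lesson 1/Lesson 2 style workflow. The file paths and column names (`data/people.csv`, `name`, `age`, `city`) are hypothetical placeholders, not taken from the actual lesson code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Entry point for all DataFrame work: create (or reuse) a SparkSession.
spark = SparkSession.builder.appName("lesson-sketch").getOrCreate()

# Read a CSV into a DataFrame, inferring column types from the data.
# Assumes a hypothetical data/people.csv with name, age, city columns.
df = spark.read.csv("data/people.csv", header=True, inferSchema=True)

# Basic DataFrame ops: select columns, filter rows, derive a new column.
adults = (
    df.select("name", "age", "city")
      .filter(F.col("age") >= 18)
      .withColumn("age_next_year", F.col("age") + 1)
)

# Aggregation with groupBy (Lesson 2 territory): average age per city.
stats = adults.groupBy("city").agg(F.avg("age").alias("avg_age"))

# Write the result back out as CSV (Spark writes a folder of part files).
stats.write.mode("overwrite").csv("output/avg_age_by_city", header=True)

spark.stop()
```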
PySpark is the Python API for Apache Spark, an open-source distributed computing framework used for big data processing and analytics.
It lets you write familiar Python code that Spark executes in parallel across many cores or machines. Here is how it compares with standard single-machine Python tools:
| Feature | Standard Python (pandas, etc.) | PySpark |
|---|---|---|
| Data Size | In-memory only (limited by RAM) | Distributed (handles huge datasets) |
| Execution | Single-threaded | Parallel, distributed |
| Syntax | Pythonic | Similar to SQL + functional API |
| Use Case | Small to mid-sized data | Big Data, scalable analytics |
| Performance | Slower on large datasets | Optimized with JVM + Spark engine |
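The Syntax row is easiest to see side by side. As an illustration (the toy data below is made up), here is the same aggregation in both APIs:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# pandas: eager and in-memory, limited to one machine's RAM.
pdf = pd.DataFrame({"city": ["Oslo", "Oslo", "Bergen"], "sales": [10, 20, 5]})
print(pdf.groupby("city")["sales"].sum())

# PySpark: lazy and distributed; nothing executes until an action like show().
spark = SparkSession.builder.appName("compare-sketch").getOrCreate()
sdf = spark.createDataFrame(pdf)  # tiny demo data; real jobs read from storage
sdf.groupBy("city").agg(F.sum("sales").alias("total_sales")).show()

spark.stop()
```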
## Dev Container Setup (`.devcontainer`)

This project uses a `.devcontainer` configuration to set up:

- Java (via `default-jdk`)
- PySpark

Key files:

- `Dockerfile` — builds the container with Java and PySpark installed
- `devcontainer.json` — VS Code Codespace settings
- `requirements.txt` — lists additional Python packages

To verify that PySpark is installed inside the Codespace:

```bash
python -c "import pyspark; print(pyspark.__version__)"
```
## To Run

Run any lesson script directly with Python:

```bash
python <filename>.py
```
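For example, a minimal standalone script might look like the sketch below (the filename and contents are illustrative, not one of the repo's lessons):

```python
# example_job.py (hypothetical): run with `python example_job.py`
from pyspark.sql import SparkSession

# local[*] runs Spark on all local cores, which is what a Codespace provides.
spark = SparkSession.builder.appName("example-job").master("local[*]").getOrCreate()

# Build a tiny DataFrame in memory and print it.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()

spark.stop()
```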