⚡ PySpark Tutorial

This project is a hands-on PySpark tutorial designed to run inside GitHub Codespaces. It helps you learn and practice PySpark through real examples, from basic DataFrame operations to reading and writing files, and walks you through setting up your development environment.


📘 Lessons

| Lesson | Description |
| --- | --- |
| 1️⃣ Lesson 1 – PySpark Basics | SparkSession, DataFrame ops, CSV I/O |
| 2️⃣ Lesson 2 – Aggregations and GroupBy | Summary stats, multi-level grouping, CSV output |
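
For a taste of what the lessons cover before opening them, here is a minimal sketch in the same spirit (the data, column names, and output path below are illustrative, not taken from the lesson files):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# SparkSession is the entry point for all DataFrame work (Lesson 1)
spark = SparkSession.builder.appName("lessons-preview").getOrCreate()

# A tiny in-memory DataFrame standing in for a CSV you would normally read
df = spark.createDataFrame(
    [("sales", 100), ("sales", 250), ("hr", 80)],
    ["dept", "amount"],
)

# GroupBy with an aggregation (Lesson 2)
totals = df.groupBy("dept").agg(F.sum("amount").alias("total"))
totals.show()

# CSV output, as in the lessons (path is illustrative)
totals.write.mode("overwrite").csv("output/totals")

spark.stop()
```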

📘 What is PySpark?

PySpark is the Python API for Apache Spark, an open-source distributed computing framework used for big data processing and analytics.

It allows you to:

- Process datasets too large for a single machine by distributing work across a cluster
- Use a high-level DataFrame API and Spark SQL from Python
- Run the same code locally or on a cluster without changes
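
For example, starting a session and running a query takes only a few lines (a minimal sketch, assuming PySpark is installed as it is in this dev container; the names and data are illustrative):

```python
from pyspark.sql import SparkSession

# Every PySpark program starts from a SparkSession
spark = SparkSession.builder.appName("intro").getOrCreate()

# Build a small DataFrame; real jobs would read files or tables instead
df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], ["name", "age"])

# Functional DataFrame API
df.filter(df.age > 30).show()

# The same query through Spark SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```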
🔁 How is PySpark Different from Standard Python?

| Feature | Standard Python (pandas, etc.) | PySpark |
| --- | --- | --- |
| Data Size | In-memory only (limited by RAM) | Distributed (handles huge datasets) |
| Execution | Single-threaded | Parallel, distributed |
| Syntax | Pythonic | Similar to SQL + functional API |
| Use Case | Small to mid-sized data | Big Data, scalable analytics |
| Performance | Slower on large datasets | Optimized with JVM + Spark engine |
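
To make the syntax and execution rows concrete, here is the same aggregation in both APIs (a sketch with made-up data; it assumes pandas is installed, which this dev container may not include):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

data = [("a", 1), ("a", 2), ("b", 3)]

# pandas: eager, single-process, in-memory
pdf = pd.DataFrame(data, columns=["key", "val"])
print(pdf.groupby("key")["val"].mean())

# PySpark: lazy and distributed; nothing executes until an action like show()
spark = SparkSession.builder.appName("compare").getOrCreate()
sdf = spark.createDataFrame(data, ["key", "val"])
sdf.groupBy("key").agg(F.avg("val")).show()

spark.stop()
```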

🛠️ Environment Setup (via .devcontainer)

This project uses a .devcontainer configuration to set up everything you need to run PySpark inside GitHub Codespaces.

Dev Container Includes:

- Python
- Java (JDK), which the Spark engine requires
- PySpark

🚀 Getting Started

  1. Open this repo in GitHub Codespaces
  2. The container auto-builds with Java and PySpark installed
  3. To verify installation, run:
```bash
python -c "import pyspark; print(pyspark.__version__)"
```

To Run

Run any lesson script with:

```bash
python <filename>.py
```