PySpark

📘 Lesson 1: PySpark Basics

This lesson introduces the fundamentals of using PySpark in GitHub Codespaces. It walks through:

- Creating a SparkSession
- Building a DataFrame from in-memory data
- Basic DataFrame operations (schema, select, filter, derived columns)
- Writing the DataFrame to CSV and reading it back

🧱 Step-by-Step Guide

✅ Step 1: Create a SparkSession

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("BasicSparkApp") \
    .getOrCreate()

This initializes a Spark session, which is the entry point for all PySpark functionality. getOrCreate() returns the existing session if one is already running, so it is safe to call more than once.


✅ Step 2: Create a DataFrame

data = [
    (1, "Alice", 29),
    (2, "Bob", 31),
    (3, "Cathy", 27)
]

columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
df.show()

Creates an in-memory DataFrame with 3 rows and displays it. When only column names are given, Spark infers each column's type from the data.


🧪 Step 3: DataFrame Operations

3.1 Print Schema

df.printSchema()

Shows the structure and data types of the DataFrame.

3.2 Select Columns

df.select("name").show()

Selects and displays only the name column.

3.3 Filter Rows

df.filter(df.age > 28).show()

Filters the rows to show only those with age > 28.

3.4 Add New Column

from pyspark.sql.functions import col
df.withColumn("age_plus_5", col("age") + 5).show()

Adds a derived column called age_plus_5.


💾 Step 4: Write and Read CSV

✅ Write as Single CSV

df.coalesce(1).write.mode("overwrite").csv("Lesson-1/output/output.csv", header=True)

Writes the DataFrame as a single part file with a header row, overwriting any previous output. coalesce(1) moves all data into one partition so only one file is produced.

✅ Read Back CSV

df2 = spark.read.csv("Lesson-1/output/output.csv", header=True, inferSchema=True)
df2.show()

Reads the CSV back into a new DataFrame and displays it. header=True treats the first row as column names, and inferSchema=True makes Spark scan the data to guess each column's type.


๐Ÿ“ Output

The output directory contains:

Lesson-1/output/output.csv/
├── part-00000-xxxx.csv   <-- actual data
└── _SUCCESS              <-- Spark job success marker

🧼 Notes

- Spark writes output.csv as a directory, not a single file; the actual data lives in the part-* file(s) inside it.
- coalesce(1) forces everything into one partition so a single part file is written; avoid it for large datasets, since it removes parallelism.
- The empty _SUCCESS file is a marker Spark leaves behind when a write job completes successfully.

🧭 Next Up

Lesson 2 will cover: