This lesson introduces the fundamentals of using PySpark in GitHub Codespaces. It walks through:

- Creating a Spark session
- Building a small in-memory DataFrame
- Basic transformations: `select`, `filter`, `withColumn`
- Writing and reading CSV output
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("BasicSparkApp") \
    .getOrCreate()
```
This initializes a Spark session, which is the entry point for using PySpark.
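In Codespaces the default console logging can be noisy. As an optional convenience (not required by the lesson), you can turn it down through the session's `SparkContext`:

```python
# Optional: reduce console logging; "WARN" is a standard log4j level.
spark.sparkContext.setLogLevel("WARN")
```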
```python
data = [
    (1, "Alice", 29),
    (2, "Bob", 31),
    (3, "Cathy", 27),
]
columns = ["id", "name", "age"]

df = spark.createDataFrame(data, columns)
df.show()
```
Creates an in-memory DataFrame with 3 rows and displays it.
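If you prefer named records over positional tuples, the same DataFrame can be built from `Row` objects; a minimal sketch of that alternative (the `df_rows` name is just for illustration):

```python
from pyspark.sql import Row

# Each Row carries its column names, so no separate columns list is needed.
rows = [
    Row(id=1, name="Alice", age=29),
    Row(id=2, name="Bob", age=31),
    Row(id=3, name="Cathy", age=27),
]
df_rows = spark.createDataFrame(rows)
df_rows.show()
```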
```python
df.printSchema()
```
Shows the structure and data types of the DataFrame.
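When you don't want Spark to infer types, you can pass an explicit schema instead. A sketch reusing the `data` list from above (the `df_typed` name is just for illustration):

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Explicit schema: name, type, and nullability for each column.
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
    StructField("age", IntegerType(), nullable=True),
])

df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()
```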
```python
df.select("name").show()
```

Selects and displays only the `name` column.
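`select()` also takes several columns at once, or column expressions built with `col()`; for example:

```python
from pyspark.sql.functions import col

# Multiple columns, and a renamed column expression.
df.select("name", "age").show()
df.select(col("name").alias("first_name")).show()
```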
```python
df.filter(df.age > 28).show()
```

Filters the rows to show only those with `age > 28`.
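Conditions can be combined with `&` (and) and `|` (or); each side needs its own parentheses because of Python's operator precedence. A small sketch:

```python
from pyspark.sql.functions import col

# Rows older than 28 whose name is not "Bob".
df.filter((col("age") > 28) & (col("name") != "Bob")).show()
```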
```python
from pyspark.sql.functions import col

df.withColumn("age_plus_5", col("age") + 5).show()
```

Adds a derived column called `age_plus_5`.
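`withColumn()` calls chain naturally, and `when()`/`otherwise()` builds conditional columns; a sketch (the `age_group` column is just for illustration):

```python
from pyspark.sql.functions import col, when

df.withColumn("age_plus_5", col("age") + 5) \
  .withColumn("age_group", when(col("age") > 28, "28+").otherwise("under 28")) \
  .show()
```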
```python
df.coalesce(1).write.mode("overwrite").csv("Lesson-1/output/output.csv", header=True)
```

`coalesce(1)` ensures only one part file is written. Spark writes the output into the `Lesson-1/output/output.csv/` folder.

```python
df2 = spark.read.csv("Lesson-1/output/output.csv", header=True, inferSchema=True)
df2.show()
```
Reads the CSV back into a new DataFrame and displays it.
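A quick way to confirm the round trip worked is to compare row counts. For larger data, Parquet is often a better fit than CSV because it stores the schema; the output path below just follows this lesson's layout:

```python
# Sanity check: the re-read DataFrame has the same number of rows.
assert df2.count() == df.count()

# Alternative format: Parquet preserves column types, so no inferSchema is needed.
df.write.mode("overwrite").parquet("Lesson-1/output/output.parquet")
```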
The output directory contains:
```
Lesson-1/output/output.csv/
├── part-00000-xxxx.csv   <-- actual data
└── _SUCCESS              <-- Spark job success marker
```
We use `.coalesce(1)` to simplify file outputs when working locally. Don't open the `output.csv` folder expecting a single `.csv` file directly; Spark writes partitioned output.

Lesson 2 will cover:
- `groupBy()`
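As a small preview (a sketch only, not the Lesson 2 material itself), `groupBy()` pairs with aggregations like `count()`:

```python
# Count rows per distinct age value.
df.groupBy("age").count().show()
```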