This lesson builds on the basics from Lesson 1. It introduces how to summarize and analyze data using `groupBy()` and aggregation functions in PySpark. We also demonstrate saving both the original and transformed DataFrames to CSV files using `.coalesce(1)`.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, sum, max, min, count

spark = SparkSession.builder.appName("Lesson2-Aggregations").getOrCreate()
```
Initializes Spark.
```python
data = [
    ("Electronics", "Laptop", 1200),
    # ... remaining (category, product, price) rows elided
]

df = spark.createDataFrame(data, ["category", "product", "price"])
df.show()
```
Creates a DataFrame with product category, name, and price.
```python
df.coalesce(1).write.mode("overwrite").csv("Lesson-2/output/original_data.csv", header=True)
```
Saves the original data as a single CSV file in `Lesson-2/output/original_data.csv`. (Spark writes this path as a directory; `.coalesce(1)` ensures it contains just one part file.)
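If you want to sanity-check the write, you can read the file straight back. This is just a sketch, not part of the lesson script; it assumes the path above and the same Spark session (`check_df` is an illustrative name):

```python
# Read the saved CSV back; Spark treats the .csv path as a directory of part files
check_df = spark.read.csv("Lesson-2/output/original_data.csv", header=True, inferSchema=True)
check_df.show()
```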
```python
df.groupBy("category").count().show()
```
Shows how many products belong to each category.
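As an optional variation (not in the lesson code), you could sort the counts so the largest categories come first:

```python
from pyspark.sql.functions import col

# groupBy().count() produces a column literally named "count"; sort it descending
df.groupBy("category").count().orderBy(col("count").desc()).show()
```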
```python
# .alias() names below are illustrative; rename them as you see fit
agg_df = df.groupBy("category").agg(
    avg("price").alias("avg_price"),
    sum("price").alias("total_price"),
    max("price").alias("max_price"),
    min("price").alias("min_price"),
)
agg_df.show()
```
Shows average, total, highest, and lowest prices for each category.
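As an aside, `agg()` also accepts a dict mapping column names to aggregate function names. This is an equivalent shorthand, not what the lesson script uses, and it allows only one aggregate per column key:

```python
# Dict form of agg(): {"column": "function"}
df.groupBy("category").agg({"price": "avg"}).show()
```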
```python
agg_df.coalesce(1).write.mode("overwrite").csv("Lesson-2/output/category_aggregations.csv", header=True)
```
Saves aggregated stats per category.
```python
multi_df = df.groupBy("category", "product").agg(
    count("*").alias("num_products"),  # illustrative alias names
    avg("price").alias("avg_price"),
)
multi_df.show()
```
Groups by both `category` and `product` to get finer-grained stats.
```python
multi_df.coalesce(1).write.mode("overwrite").csv("Lesson-2/output/category_product_summary.csv", header=True)
```
Stores multi-level aggregation result in CSV format.
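Why `.coalesce(1)` everywhere? Spark writes one part file per partition, so a DataFrame with several partitions would produce several CSV fragments. A quick way to see how many partitions you have (a sketch, assuming the `df` from above):

```python
# Each partition becomes its own part-*.csv file on write
print(df.rdd.getNumPartitions())
```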
```
Lesson-2/output/
├── original_data.csv/
├── category_aggregations.csv/
└── category_product_summary.csv/
```
Each folder contains a `part-00000-*.csv` file and a `_SUCCESS` file (written automatically by Spark).

In this lesson we used:

- `.groupBy().agg()` to perform multi-column aggregations.
- `.alias()` to rename result columns.
- `.coalesce(1)` to avoid multiple part-files.

To run from the Codespaces terminal:
```bash
python Lesson-2/main.py
```
Lesson 3 will cover: