Day 10: Partitioning vs Bucketing - The Spark Optimization Guide Every Data Engineer Needs

Welcome to Day 10 of the Spark Mastery Series!
Today’s topic is one of the biggest performance boosters in Spark ETL pipelines.

Most Spark beginners learn transformations but never learn how data should be stored for maximum performance.
Partitioning and bucketing are the two most powerful tools for exactly that.

Let’s master them.

🌟 1. Why Partitioning Matters in Spark
Partitioning is the process of splitting data into separate folders/files based on one or more columns.

Example:

df.write.partitionBy("year", "month").parquet("/sales")

This creates folders:
year=2024/month=01/
year=2024/month=02/
year=2024/month=03/

Benefits:

  • Queries that filter on partition columns skip entire folders
  • Less I/O
  • Faster scans
  • Lower compute cost

This technique, known as partition pruning, underpins most Lakehouse architectures.
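Here is what that buys you at read time, in a minimal sketch (the /sales path and filter values follow the write example above):

df = spark.read.parquet("/sales")

# Spark prunes down to the year=2024/month=1 folders;
# every other folder is never read from disk
jan_sales = df.filter((df.year == 2024) & (df.month == 1))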

🌟 2. repartition() vs coalesce()

repartition()
Used to increase, decrease, or rebalance partitions across the cluster.

df = df.repartition("customer_id")

✔ Even distribution
✔ Useful before joins
✖ Slow (shuffle required)

coalesce()
Used to reduce the number of partitions by merging existing ones.

df = df.coalesce(5)

✔ No shuffle
✔ Faster writes
✖ Cannot increase partitions
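A quick way to see both in action is to check the partition count before and after each call (a minimal sketch; the counts in the comments are illustrative):

print(df.rdd.getNumPartitions())         # e.g. 200 after a wide shuffle

df = df.repartition(64, "customer_id")   # full shuffle, hash-distributed on the key
print(df.rdd.getNumPartitions())         # 64

df = df.coalesce(5)                      # merges existing partitions, no shuffle
print(df.rdd.getNumPartitions())         # 5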

🌟 3. When Should You Partition Your Data?

Partition when:

  • You filter heavily on the same column
  • You have time-based data
  • You want faster analytics

Avoid partitioning when (see the sketch below):

  • The column has millions of unique values (high cardinality)
  • Files become extremely small (<1 MB each)
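A sketch of the difference (event_date and user_id are hypothetical column names):

# Good: a date column yields a manageable number of folders
df.write.partitionBy("event_date").parquet("/events")

# Bad: millions of unique user_ids would yield millions of tiny folders
# df.write.partitionBy("user_id").parquet("/events")   # avoid this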

🌟 4. What Is Bucketing and Why Is It Powerful?

Bucketing hashes rows into a fixed number of buckets on a chosen column, which lets Spark avoid the shuffle in large-table joins.

df.write.bucketBy(20, "id").sortBy("id").saveAsTable("bucketed_users")

This hashes each row into one of 20 buckets based on id.
When you join two tables bucketed on the same key with the same bucket count, Spark doesn't need to shuffle! (See the join sketch after the list below.)

Benefits:

  • Faster joins
  • Deterministic data distribution
  • Better for high-cardinality columns
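Here is a sketch of such a join (bucketed_users comes from the write above; bucketed_orders is a hypothetical second table written with the same bucket count and key):

users = spark.table("bucketed_users")
orders = spark.table("bucketed_orders")

# Same key, same bucket count: the sort-merge join needs no Exchange step
joined = users.join(orders, "id")
joined.explain()   # check the plan for the absence of a shuffle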

🌟 5. Partition vs Bucket — Which One Should You Use?

Use Partitioning when:
✔ Queries heavily filter on the column
✔ Time-series queries
✔ Data skipping is needed

Use Bucketing when:
✔ You want to speed up joins on large datasets
✔ High-cardinality join keys
✔ You can combine it with partitioning for massive datasets (sketch below)
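Combining both looks like this (a sketch with a hypothetical events DataFrame; note that bucketBy only works with saveAsTable, not path-based writes):

(events.write
    .partitionBy("year", "month")    # data skipping on time filters
    .bucketBy(32, "user_id")         # shuffle-free joins on user_id
    .sortBy("user_id")
    .saveAsTable("events_bucketed"))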

🌟 6. Real-World Use Case (E-Commerce)

Sales table → partition by year, month, country
User table → bucket by user_id

When joining:
Bucketed tables → fast joins
Partitioned tables → fast filters

This mirrors how many Databricks Lakehouse architectures are built.
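Putting the whole use case together (a sketch; table and column names are illustrative):

# Sales: partitioned for fast filters
sales.write.partitionBy("year", "month", "country").saveAsTable("sales")

# Users: bucketed for fast joins
users.write.bucketBy(64, "user_id").sortBy("user_id").saveAsTable("users_bucketed")

# Partition pruning on sales, plus a users side that is
# already hash-distributed on the join key
report = (spark.table("sales")
            .filter("year = 2024 AND country = 'IN'")
            .join(spark.table("users_bucketed"), "user_id"))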

🚀 Summary
We learned:

  • What partitioning is
  • What bucketing is
  • Repartition vs coalesce
  • How Spark optimizes large joins
  • How to choose partition keys

Follow for more content like this, and let me know in the comments if I missed anything. Thank you!
