🔥 Day 4: RDD Internals - Partitions, Shuffles & Repartitioning Demystified

Welcome to Day 4 of the Spark Mastery Series.

Yesterday we learned RDD basics. Today we go deeper into partitions, shuffles, coalesce, repartition, and persistence: the core concepts that define Spark performance.

⚡ 1. Understanding Partitions
A partition is Spark's basic unit of parallel processing.

Think of partitions like:

  • Slices of a pizza
  • Each slice handled by one executor core
  • More partitions = more parallel workers

Why do partitions matter?

  • Too few partitions → cluster underutilized
  • Too many partitions → scheduler overhead

Default number of partitions:

  • parallelize() → spark.default.parallelism (by default, the total number of cores across your executors)
  • File reads → based on input block/split size

Check partitions:

```python
rdd.getNumPartitions()
```
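For example, here's a minimal sketch of how partition counts come about (assuming local mode; the file path is just a placeholder):

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "partitions-demo")  # 4 local cores

# Explicitly request 8 partitions when parallelizing
rdd = sc.parallelize(range(1000), numSlices=8)
print(rdd.getNumPartitions())  # 8

# For file reads, the count depends on input splits / block size
lines = sc.textFile("data/events.txt", minPartitions=4)
print(lines.getNumPartitions())
```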

๐Ÿ” 2. Narrow vs Wide Transformations (The Real Reason Your Jobs Are Slow)

Narrow transformations:

  • No data movement
  • No shuffle
  • Faster

Examples: map, filter, union

Wide transformations:

  • Data movement between executors
  • Causes shuffle
  • Creates new stage

Examples: reduceByKey, groupByKey, join, distinct
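Here's a small sketch to make the distinction concrete (assuming an existing SparkContext `sc`); `toDebugString()` exposes the stage boundary that the wide transformation introduces:

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow: each output partition depends on exactly one input partition
doubled = pairs.mapValues(lambda v: v * 2)

# Wide: rows with the same key must meet, so Spark shuffles them
totals = pairs.reduceByKey(lambda a, b: a + b)

# The indentation in the lineage marks the shuffle (stage) boundary
print(totals.toDebugString().decode())
```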

🔥 3. Shuffle: Spark's Most Expensive Operation

During shuffle, Spark:

  • Writes data to disk
  • Transfers it over network
  • Reorganizes partitions

This is why shuffle-heavy jobs run slowly, and why large Spark deployments invest so heavily in reducing shuffle.
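One classic way to shrink shuffle volume is to aggregate before shuffling. A minimal sketch (again assuming `sc`): `reduceByKey` computes partial sums on each partition before the shuffle, while `groupByKey` ships every record across the network:

```python
words = sc.parallelize(["spark", "rdd", "spark", "shuffle", "rdd", "spark"])
pairs = words.map(lambda w: (w, 1))

# groupByKey: every (word, 1) pair crosses the network, then we sum
counts_slow = pairs.groupByKey().mapValues(sum)

# reduceByKey: per-partition partial sums first, so far less data moves
counts_fast = pairs.reduceByKey(lambda a, b: a + b)

print(counts_fast.collect())
```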

🔄 4. Repartition vs Coalesce
This is one of the most misunderstood concepts.

Repartition:

  • Used to increase OR decrease partitions
  • Causes full shuffle
  • Data gets evenly distributed
  • Good for large operations like joins
```python
df2 = df.repartition(50)
```

When to use?

  • Before joins
  • Before large aggregations
  • When dealing with skew
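For instance, a minimal sketch of repartitioning both sides of a join on the join key (the DataFrame and column names here are made up for illustration, and 200 is an arbitrary partition count):

```python
# Hash-partition both sides on the join key so matching rows co-locate
orders_p = orders.repartition(200, "customer_id")
customers_p = customers.repartition(200, "customer_id")

joined = orders_p.join(customers_p, on="customer_id")
```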

Coalesce:

  • Used to reduce partitions only
  • No shuffle
  • Much faster than repartition
  • Moves minimal data
```python
df2 = df.coalesce(5)
```

When to use?

  • Writing to small number of output files
  • Improving file compactness
  • When merging small partitions
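A common pattern is compacting partitions right before a write so you don't produce hundreds of tiny files (the output path is just a placeholder):

```python
# Merge down to 5 partitions without a shuffle, then write 5 output files
df.coalesce(5).write.mode("overwrite").parquet("output/daily_report")
```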

📦 5. Persistence & Caching: Boosting Performance

Spark recomputes an RDD's entire lineage on every action unless you cache it.

Example:

```python
processed = rdd.map(lambda x: x * 2).filter(lambda x: x > 10)
processed.persist()  # mark for caching (default for RDDs: MEMORY_ONLY)

processed.count()    # first action: computes the lineage and caches the result
processed.collect()  # second action: served from the cache
```

Without persist → Spark computes the chain twice
With persist → the second action reads from cache
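If the cached data might not fit in memory, you can pass an explicit storage level; a small sketch (MEMORY_AND_DISK is one common choice, not the only one):

```python
from pyspark import StorageLevel

big = rdd.map(lambda x: (x, x * x))
big.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk if memory is tight
big.count()

big.unpersist()  # release the cache when you're done with it
```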

🧠 Summary
Today we learned:

  • How partitions work
  • What causes shuffle
  • Difference between narrow and wide transformations
  • When to use repartition vs coalesce
  • How caching helps performance

Follow for more such content, and let me know in the comments if I missed anything. Thank you!
