Welcome to Day 4 of the Spark Mastery Series.
Yesterday we learned RDD basics. Today we go deeper into partitions, shuffles, coalesce, repartition, and persistence: the core concepts that define Spark performance.
1. Understanding Partitions
A partition is Spark's basic unit of parallel processing.
Think of partitions like:
- Slices of a pizza
- Each slice handled by one executor core
- More partitions = more parallel workers
Why partitions matter:
- Too few partitions → cluster underutilized
- Too many partitions → scheduler overhead
Default number of partitions:
- `parallelize()` → `spark.default.parallelism` (typically the total number of executor cores)
- File reads → based on block size (roughly one partition per input block)
Check partitions:
rdd.getNumPartitions()
2. Narrow vs Wide Transformations (The Real Reason Your Jobs Are Slow)
Narrow transformations:
- No data movement
- No shuffle
- Faster
Examples: map, filter, union
Wide transformations:
- Data movement between executors
- Causes shuffle
- Creates new stage
Examples: reduceByKey, groupByKey, join, distinct
3. Shuffle: Spark's Most Expensive Operation
During shuffle, Spark:
- Writes data to disk
- Transfers it over network
- Reorganizes partitions
This is why shuffle-heavy jobs run slowly, and why reducing shuffle is central to Spark performance tuning at any scale.
4. Repartition vs Coalesce
This is one of the most misunderstood concepts.
Repartition:
- Used to increase OR decrease partitions
- Causes full shuffle
- Data gets evenly distributed
- Good for large operations like joins
df2 = df.repartition(50)
When to use it:
- Before joins
- Before large aggregations
- When dealing with skew
Coalesce:
- Used to reduce partitions only
- No shuffle
- Much faster than repartition
- Moves minimal data
df2 = df.coalesce(5)
When to use it:
- Writing to small number of output files
- Improving file compactness
- When merging small partitions
5. Persistence & Caching: Boosting Performance
Spark recomputes transformations unless cached.
Example:
processed = rdd.map(...).filter(...)
processed.persist()
processed.count()
processed.collect()
Without persist → Spark computes the pipeline twice
With persist → the second action reads from cache
Summary
Today we learned:
- How partitions work
- What causes shuffle
- Difference between narrow and wide transformations
- When to use repartition vs coalesce
- How caching helps performance
Follow for more such content, and let me know in the comments if I missed anything. Thank you!