Repartition vs Coalesce — When to Use Which
Understand the difference between repartition and coalesce in PySpark, when to use each, and practical patterns for controlling partition count.
What You'll Learn
- The exact difference between `repartition` and `coalesce`
- When each one triggers (or avoids) a shuffle
- Common patterns: writing fewer output files, pre-join partitioning, fixing skew
- The cost-benefit trade-off for each approach
The One-Line Answer
- `repartition(n)` — full shuffle, creates exactly n evenly distributed partitions. Can increase or decrease.
- `coalesce(n)` — no shuffle, combines adjacent partitions. Can only decrease. Partitions may be uneven.
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id
spark = SparkSession.builder.appName("RepartitionVsCoalesce").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
data = [(i, f"user_{i}", i * 10) for i in range(1000)]
df = spark.createDataFrame(data, ["id", "name", "score"])
print(f"Original partitions: {df.rdd.getNumPartitions()}")
Expected Output
Original partitions: 8
(The exact number depends on your environment — in local mode Spark defaults to one partition per CPU core.)
Repartition: Full Shuffle
# Repartition to 4 — triggers a full shuffle
df_repart = df.repartition(4)
print(f"After repartition(4): {df_repart.rdd.getNumPartitions()}")
# Check distribution — should be even
df_repart.withColumn("partition", spark_partition_id()) \
.groupBy("partition").count().orderBy("partition").show()
Expected Output
After repartition(4): 4
+---------+-----+
|partition|count|
+---------+-----+
| 0| 250|
| 1| 250|
| 2| 250|
| 3| 250|
+---------+-----+
Perfectly even — 250 rows each. The full shuffle redistributed data evenly.
# Repartition to 16 — can also INCREASE partitions
df_repart_up = df.repartition(16)
print(f"After repartition(16): {df_repart_up.rdd.getNumPartitions()}")
Coalesce: No Shuffle
# Coalesce to 4 — NO shuffle
df_coal = df.coalesce(4)
print(f"After coalesce(4): {df_coal.rdd.getNumPartitions()}")
# Check distribution — may be uneven
df_coal.withColumn("partition", spark_partition_id()) \
.groupBy("partition").count().orderBy("partition").show()
Expected Output
After coalesce(4): 4
+---------+-----+
|partition|count|
+---------+-----+
| 0| 250|
| 1| 250|
| 2| 250|
| 3| 250|
+---------+-----+
In this case the distribution looks even because the original 8 partitions were already balanced. But coalesce works by merging adjacent partitions — if the original partitions were uneven, the result would be uneven too.
# Coalesce CANNOT increase partitions — this does nothing
df_coal_up = df.coalesce(16)
print(f"After coalesce(16): {df_coal_up.rdd.getNumPartitions()}")
Expected Output
After coalesce(16): 8
Still 8 — coalesce silently ignores requests to increase.
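The two behaviours can be sketched in plain Python, with lists standing in for partitions. This is a toy model of the partition-count rules described above, not Spark's actual implementation — `coalesce_partitions` and `repartition_rows` are illustrative names, not Spark APIs.

```python
def coalesce_partitions(partitions, n):
    """Merge adjacent partitions into at most n groups; never increases the count."""
    if n >= len(partitions):
        return partitions  # coalesce cannot increase the partition count
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i * n // len(partitions)].extend(part)  # adjacent merge, no shuffle
    return groups

def repartition_rows(partitions, n):
    """Round-robin every row into exactly n partitions (models the full shuffle)."""
    groups = [[] for _ in range(n)]
    all_rows = (row for part in partitions for row in part)
    for i, row in enumerate(all_rows):
        groups[i % n].append(row)  # every row moves: that's the shuffle cost
    return groups

# 8 balanced partitions of 125 rows each, mirroring the example above
parts = [list(range(p * 125, (p + 1) * 125)) for p in range(8)]

print([len(p) for p in coalesce_partitions(parts, 4)])  # -> [250, 250, 250, 250]
print([len(p) for p in repartition_rows(parts, 4)])     # -> [250, 250, 250, 250]
print(len(coalesce_partitions(parts, 16)))              # -> 8, cannot increase
```

Both paths look identical here because the inputs were balanced; the difference shows up when the input partitions are skewed, or when you ask for more partitions than you have.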
Repartition by Column
You can repartition based on a column value — all rows with the same value end up in the same partition:
# Repartition by department — rows with the same department land in the same partition
data_dept = [(i, f"dept_{i % 3}", i * 100) for i in range(1000)]
df_dept = spark.createDataFrame(data_dept, ["id", "department", "salary"])
df_by_dept = df_dept.repartition("department")
df_by_dept.withColumn("partition", spark_partition_id()) \
.groupBy("department", "partition").count() \
.orderBy("department", "partition").show()
This is useful before a join — if both tables are partitioned by the join key, matching rows are already co-located, so Spark can avoid re-shuffling them. Note that `repartition("department")` with no explicit count produces `spark.sql.shuffle.partitions` partitions (200 by default); pass a count to control it, e.g. `repartition(4, "department")`.
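The co-location guarantee comes from hash partitioning: each row goes to partition `hash(key) % numPartitions`, so equal keys always map to the same index. A minimal sketch — Spark actually uses Murmur3 hashing, so the exact indices differ from Python's built-in `hash()` used here, but the property is the same:

```python
def partition_for(key, num_partitions):
    # Same key -> same hash -> same partition index.
    # Spark uses Murmur3; Python's hash() is only a stand-in.
    return hash(key) % num_partitions

rows = [(i, f"dept_{i % 3}") for i in range(12)]
num_partitions = 4

# Record which partition index each department is assigned to
assignment = {}
for _, dept in rows:
    assignment.setdefault(dept, set()).add(partition_for(dept, num_partitions))

# Every department maps to exactly one partition
print({dept: sorted(idx) for dept, idx in assignment.items()})
```

One consequence worth knowing: nothing stops two different keys from hashing to the same partition, so a column with few distinct values can leave some partitions empty.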
The Decision Matrix
Situation Use
──────────────────────────────────── ──────────────
Need FEWER partitions for output coalesce(n)
Need MORE partitions for parallelism repartition(n)
Need EVEN distribution repartition(n)
Pre-partitioning for a join repartition(col)
Writing a single output file coalesce(1)
After filter removed most data coalesce(n)
Fixing data skew repartition(n)
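The matrix above boils down to two questions: are you growing the partition count, and do you need even sizes? A toy decision helper (an illustration of the logic, not a Spark API):

```python
def choose_partition_op(current, target, need_even=False):
    """Mirror the decision matrix: pick repartition or coalesce."""
    if target > current:
        return "repartition"   # only repartition can increase the count
    if need_even:
        return "repartition"   # even sizes require a full shuffle
    return "coalesce"          # shrinking without evenness needs: no shuffle

print(choose_partition_op(200, 10))               # -> coalesce
print(choose_partition_op(8, 16))                 # -> repartition
print(choose_partition_op(8, 4, need_even=True))  # -> repartition
```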
Common Patterns
Pattern 1: Writing fewer output files
# Without coalesce: up to 200 tiny Parquet files (one per shuffle partition)
df.groupBy("department").count().write.mode("overwrite").parquet("output/counts")
# With coalesce: 1 file
df.groupBy("department").count().coalesce(1).write.mode("overwrite").parquet("output/counts")
Pattern 2: After a heavy filter
# Original: 100 partitions with 10 million rows
# After filter: 100 partitions with only 1000 rows (99.99% filtered out)
from pyspark.sql.functions import col

filtered = big_df.filter(col("status") == "error")
# Now most partitions are empty — coalesce to remove the empties
filtered.coalesce(4).write.parquet("output/errors")
Pattern 3: Before writing partitioned data
# Repartition by the write partition key for efficient, non-overlapping files
df.repartition("year", "month") \
.write.partitionBy("year", "month") \
.parquet("output/sales")
Without the repartition, each output partition directory might contain files from many input partitions — creating many small files. With repartition by the same columns, each output directory gets one well-sized file.
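The small-files blow-up is easy to quantify. In the worst case every input partition holds rows for every output directory, so each write task emits one file per directory. The numbers below are hypothetical, chosen only to show the arithmetic:

```python
# Back-of-envelope file counts for a partitionBy write.
# Assumption (worst case): every input partition contains rows
# for every (year, month) combination.
input_partitions = 200   # e.g. the shuffle-partition default
year_month_dirs = 24     # 2 years x 12 months

# Each of the 200 tasks writes a file into each of the 24 directories
files_without_repartition = input_partitions * year_month_dirs

# After repartition("year", "month"): one partition per directory, one file each
files_with_repartition = year_month_dirs

print(files_without_repartition)  # -> 4800 small files
print(files_with_repartition)     # -> 24 files
```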
Common Mistakes
- Using `coalesce(1)` on large datasets. This forces all data through a single partition — one core does all the work. Fine for 1000 rows, dangerous for 10 GB. If you need a single output file from a large dataset, consider: do you really need one file, or can downstream tools handle a directory of files?
- Using `repartition` when `coalesce` would work. If you're going from 200 partitions to 10, `coalesce(10)` avoids a shuffle. `repartition(10)` does the same thing but with an unnecessary shuffle. The only reason to prefer `repartition` when reducing is if you need perfectly even partition sizes.
- Forgetting that `coalesce` can create uneven partitions. If you have 8 partitions of sizes [100, 100, 100, 100, 100, 100, 100, 900] and coalesce to 4, you might get [200, 200, 200, 1000]. If this unevenness matters, use `repartition` instead.
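The uneven-merge example above can be reproduced with a toy adjacent-merge model (Spark's coalesce plans similarly, grouping neighbouring partitions, though its actual grouping also considers data locality):

```python
def merge_adjacent(sizes, n):
    """Merge a list of partition sizes into n groups of adjacent partitions."""
    merged = [0] * n
    for i, size in enumerate(sizes):
        merged[i * n // len(sizes)] += size  # neighbours merge; rows never move alone
    return merged

sizes = [100, 100, 100, 100, 100, 100, 100, 900]
print(merge_adjacent(sizes, 4))  # -> [200, 200, 200, 1000]
```

The 900-row partition is merged wholesale with a neighbour — coalesce never splits a partition, so existing skew survives the merge.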
Key Takeaways
- `coalesce(n)` — reduces partitions without a shuffle. Fast but may create uneven partitions.
- `repartition(n)` — full shuffle, creates exactly n even partitions. Can increase or decrease.
- `repartition(col)` — partitions by column value. Useful before joins and writes.
- Use `coalesce` to reduce, `repartition` to increase or to get even distribution.
- Always consider partition count when writing output — too many partitions = too many small files.
- The shuffle cost of `repartition` is real — don't use it if `coalesce` achieves the same goal.
Next Lesson
Module 4 is complete! You now understand Spark's execution model — lazy evaluation, DAGs, partitions, shuffles, and the Catalyst optimizer. In Module 5: Real-World Patterns, we start building production-quality PySpark code, beginning with Lesson 23: Building an ETL Pipeline from Scratch.