I'm not sure how many of us use these two operations frequently in our projects, but they are both very useful and very expensive. Here's why.
First, let's get to know what these two operations are meant for. The repartition and coalesce methods are used to control the number of partitions of a DataFrame. Whenever you save a DataFrame you will get many part files in the target directory. This is because, by default, Spark sets the number of shuffle partitions to 200 (the spark.sql.shuffle.partitions setting). If you want your data to be saved in a single file, you can use repartition or coalesce as below. But once again, be careful: these two operations can be very expensive because they involve shuffling.
Why are they expensive?
Spark never likes shuffling data. Shuffling is one more new term: it means transferring data among partitions. If the partitions live on different executors, a lot of data has to be transferred over the network.
Data shuffling is one of the main performance glitches in Spark, because it includes serializing objects to send them over the network. Note: please try to avoid shuffling as much as possible.
When to use coalesce?
Coalesce can only be used to reduce the number of partitions; it can't be used to increase them.
Advantages and disadvantages of using coalesce..
Coalesce groups together partitions that sit on the same executor and tries to minimize shuffling as much as possible. So coalesce is efficient and fast, but it gives unequal-sized partitions, and therefore unequal-sized files when we save the data.
When to use repartition?
Repartition can be used when we want to increase or reduce the number of partitions.
Advantages and disadvantages of repartition..
Repartition gives roughly equal-sized partitions and can be used both to increase and to reduce the number of partitions. But repartition is more expensive than coalesce because it performs a full shuffle, redistributing all the data into new partitions.