What is ORC file format?
How is ORC better than the RC file format?
ORC stands for Optimized Row Columnar format. It improves performance when reading and writing files and provides a highly efficient way of storing data; it was developed to overcome the drawbacks of other file formats.
ORC file data arrangement
Whenever we want to perform an aggregation on a column, a normal row-oriented file format forces us to go through each and every record to extract that column's value. ORC files avoid this: because all the values of a column are stored together, the column can be read in one shot, so aggregation performance increases enormously.
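The difference can be illustrated with a toy sketch (plain Python lists, not the real ORC binary layout): summing one column from a row-wise layout touches every record, while the columnar layout reads a single contiguous array.

```python
# Toy illustration of row-wise vs column-wise storage (not the ORC
# binary format itself).

rows = [
    {"id": 1, "name": "a", "price": 10},
    {"id": 2, "name": "b", "price": 20},
    {"id": 3, "name": "c", "price": 30},
]

# Row-oriented: every record must be visited to pull out one field.
row_sum = sum(r["price"] for r in rows)

# Column-oriented: each column is stored contiguously, so the
# aggregation scans one sequential array and skips the other columns.
columns = {
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "price": [10, 20, 30],
}
col_sum = sum(columns["price"])

assert row_sum == col_sum == 60
```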
ORC file structure
An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer and a postscript, which holds the compression parameters and the size of the compressed footer. The default stripe size is 250 MB; larger stripes enable more efficient reads from HDFS. The file footer contains the list of stripes in the file, the number of rows per stripe, and the data type of each column, and it also holds column-level aggregates such as count, minimum value, maximum value, and sum. Each stripe additionally has its own stripe footer containing the metadata of that stripe.
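The layering described above can be modeled in a few lines. This is a simplified sketch; the class and field names are illustrative, not the real ORC protobuf definitions.

```python
from dataclasses import dataclass

# Column-level aggregates of the kind ORC keeps for each stripe.
@dataclass
class ColumnStats:
    count: int
    minimum: int
    maximum: int
    total: int  # sum of the values

def stats_for(values):
    return ColumnStats(len(values), min(values), max(values), sum(values))

# A stripe holds a group of rows plus per-stripe statistics.
@dataclass
class Stripe:
    rows: list
    col_stats: ColumnStats

# Footer-level information: rows per stripe and schema for one column.
@dataclass
class Footer:
    stripe_row_counts: list
    column_type: str

# Two stripes of a single integer column.
s1 = Stripe([3, 7, 11], stats_for([3, 7, 11]))
s2 = Stripe([20, 25], stats_for([20, 25]))
footer = Footer([len(s1.rows), len(s2.rows)], "int")

assert s1.col_stats.maximum == 11
assert footer.stripe_row_counts == [3, 2]
```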
The diagram below shows the structure of an ORC file:
How is ORC more efficient than a normal RC file?
The RC file format is similar to ORC, but it does not store the stripe-level metadata, and its row groups are much smaller than ORC stripes, which makes RC files less efficient to read.
What happens if an ORC file contains metadata, or what is the need for the metadata?
There are two advantages of having metadata in an ORC file:
1). Faster search operation
2). Lazy decompression
Assume we have a compressed ORC file and we run an operation that first searches the data based on some condition and then processes the matching rows. Because each stripe in an ORC file carries metadata, the search is performed against that metadata first. If we are looking for rows with a particular integer value, the metadata is checked to see whether the value falls between the minimum and maximum recorded for that column in the stripe. If it does not fall in that range, the entire row group is skipped without being decompressed. Only when the value falls within the minimum and maximum range is that particular group decompressed and the operation performed. This technique is called lazy decompression.
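The skipping behavior can be simulated with independently compressed chunks. This is a hedged sketch using `zlib` and JSON as stand-ins for ORC's real compression and encoding; the `build_stripe` and `find` helpers are hypothetical names for illustration.

```python
import json
import zlib

# Each "stripe" is compressed independently and carries min/max
# metadata, so a search can skip stripes whose value range cannot
# contain the target.

def build_stripe(values):
    payload = zlib.compress(json.dumps(values).encode())
    return {"min": min(values), "max": max(values), "data": payload}

stripes = [build_stripe(v) for v in ([1, 5, 9], [20, 25, 30], [41, 45, 50])]

def find(target, stripes):
    decompressed = 0
    for s in stripes:
        if not (s["min"] <= target <= s["max"]):
            continue  # metadata check only: stripe skipped, never decompressed
        decompressed += 1  # lazy decompression happens only here
        values = json.loads(zlib.decompress(s["data"]))
        if target in values:
            return True, decompressed
    return False, decompressed

found, touched = find(25, stripes)
assert found and touched == 1  # only the middle stripe was decompressed
```

Searching for a value outside every stripe's range decompresses nothing at all, which is exactly the saving lazy decompression provides.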