What is a Sequence file in Spark/Hadoop?

You may have come across questions like the ones below in a Spark interview. To get a full picture of the Sequence file format, here is the blog.

Tell me something about the Sequence file format.
What do you know about the Sequence file format?

 

What is the Sequence file format?

It is a file format in the Big Data (Hadoop/Spark) ecosystem that stores data as binary key-value pairs, and it is used to read and write binary data from and to a file.
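For example, here is a minimal sketch in Scala of writing and reading a Sequence file with Spark. It assumes an existing SparkContext named `sc` (as in spark-shell); the output path is only an example.

```scala
// Write an RDD of (key, value) pairs as a Sequence file.
// Keys and values are converted to Hadoop Writables (Text, IntWritable) under the hood.
val pairs = sc.parallelize(Seq(("user1", 10), ("user2", 20), ("user3", 30)))
pairs.saveAsSequenceFile("/tmp/seq-demo")

// To read it back, you must tell Spark the key and value types stored in the file.
val readBack = sc.sequenceFile[String, Int]("/tmp/seq-demo")
readBack.collect().foreach(println)
```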

 

Structure of a Sequence file

A typical sequence file looks like the diagram below.

A Sequence file is built up of records. Each record is a key-value pair and consists of a record length, a key length, the key, and the value. Sync markers are placed at arbitrary positions between records. To read or write a Sequence file you need to know the class types of the key and the value, and, if a compression algorithm is used, which one. This information is stored at the beginning of the file, in the header section.
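You can see this header information through the Hadoop API. Below is a small sketch using `org.apache.hadoop.io.SequenceFile.Reader`; the part-file path is only an example.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.SequenceFile

val conf = new Configuration()
// Point this at any existing Sequence file part file.
val reader = new SequenceFile.Reader(conf,
  SequenceFile.Reader.file(new Path("/tmp/seq-demo/part-00000")))

// The header stores the key/value class names and the compression settings,
// which is exactly what a reader needs before it can parse the records.
println(s"key class        : ${reader.getKeyClassName}")
println(s"value class      : ${reader.getValueClassName}")
println(s"compression type : ${reader.getCompressionType}")   // NONE, RECORD or BLOCK
println(s"compression codec: ${Option(reader.getCompressionCodec).map(_.getClass.getName).getOrElse("none")}")
reader.close()
```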

 

The diagrams below show two Sequence file layouts: the first without compression and the second with compression. When a Sequence file is compressed, only the values are compressed, not the other metadata. The compression shown in the second diagram is record-level compression; a Sequence file can also use block-level compression.

[Diagram: Sequence file layout without compression and with record-level compression]

When multiple records are compressed together, that is block-level compression. The diagram below shows block-level compression in a Sequence file. Block-level compression is preferred over record-level compression because it exploits the similarity between records and therefore compresses more efficiently. Each block in a block-compressed Sequence file contains the number of records (uncompressed), the compressed key lengths, the compressed keys, the compressed value lengths, and the compressed values.
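To make the two compressed layouts concrete, here is a sketch using Hadoop's writer API; the path and codec are only examples, and switching `CompressionType.RECORD` to `CompressionType.BLOCK` switches between record-level and block-level compression.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, SequenceFile, Text}
import org.apache.hadoop.io.SequenceFile.CompressionType
import org.apache.hadoop.io.compress.DefaultCodec

val conf = new Configuration()
val writer = SequenceFile.createWriter(
  conf,
  SequenceFile.Writer.file(new Path("/tmp/seq-compressed")),
  SequenceFile.Writer.keyClass(classOf[Text]),
  SequenceFile.Writer.valueClass(classOf[IntWritable]),
  // Record-level compression; use CompressionType.BLOCK for block-level compression.
  SequenceFile.Writer.compression(CompressionType.RECORD, new DefaultCodec())
)

val key = new Text()
val value = new IntWritable()
for (i <- 1 to 1000) {
  key.set(s"key-$i")
  value.set(i)
  writer.append(key, value)   // appends one (key, value) record
}
writer.close()
```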

 

What are sync markers in Sequence file?

Plain compressed files (for example, gzip) cannot be split in a big data environment, so they are processed as a whole: the record reader in the input format cannot position itself at an arbitrary point inside the file and find a record boundary. Sync markers overcome this problem. Sync markers (or sync points) in a Sequence file mark the boundaries between groups of records, so a record reader that starts somewhere in the middle of the file can skip forward to the next sync marker and read complete records from there. This is what makes a Sequence file splittable.
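The sketch below illustrates the idea with the Hadoop reader API: seek to an arbitrary byte offset, let the reader skip forward to the next sync marker, and read whole records from there, which is essentially what each input split does. The path (reusing the writer sketch above) and the offset are only illustrative.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, SequenceFile, Text}

val conf = new Configuration()
val reader = new SequenceFile.Reader(conf,
  SequenceFile.Reader.file(new Path("/tmp/seq-compressed")))

reader.sync(5000)                 // jump to the first sync marker after byte 5000
println(s"reading from offset ${reader.getPosition}")

val key = new Text()
val value = new IntWritable()
while (reader.next(key, value)) { // reads complete records until end of file
  // process (key, value)
}
reader.close()
```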

How does the Sequence file format overcome the small files problem?

How do you solve small file problems in Spark or Hadoop?

These are similar questions that you might have faced in interviews. Here is one solution, using the Sequence file format.

If you have a lot of small files, it is very inefficient to read all of them, and a Hadoop job makes this worse: the record reader has to open the files one by one, and each file gets its own mapper, which wastes cluster resources unnecessarily. To solve this we can create one large file out of the small files. If we pack these small files into a Sequence file, the resulting dataset is splittable, so Sequence files are one of the solutions to the small files problem in Spark or Hadoop.
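For instance, a common pattern (sketched below, assuming a SparkContext `sc`; directory paths are examples only) is to pack the small files into one Sequence file dataset and let downstream jobs read that instead.

```scala
// wholeTextFiles yields one (filePath, fileContent) pair per small file.
val smallFiles = sc.wholeTextFiles("/data/many-small-files")

// Pack them into a Sequence file dataset: key = original path, value = content.
smallFiles.saveAsSequenceFile("/data/packed-seqfile")

// Downstream jobs read the packed dataset, which Spark can split across tasks
// using the sync markers instead of scheduling one tiny task per small file.
val packed = sc.sequenceFile[String, String]("/data/packed-seqfile")
println(packed.count())
```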
