Sqoop is a common data migration tool used to import data from RDBMS tables to other Hadoop components. It is developed in Java and developer friendly tool. In most of the projects it is used as migration tool. If you had worked on it or you have mentioned it in your resume it is mandatory to know how the interview questions will be asked around this topic. I'm diving the interview question on this tool into multiple blogs, which will completely cover all the interview question on Sqoop.
1). How can you list the columns of a table using Sqoop?
We don't have any direct option to accomplish this task. But we have commands in SQL that can be used to do this. So below is the Sqoop command that can be used to display all the columns of a table:
2). Difference between --target-dir and warehouse-dir
--target-dir: This argument takes the path of a folder inside which the data files that are imported from the RDBMS tables will be stored. This is used when we import single table from database.
--warehouse-dir: This used when we import multiple tables from RDBMS. Takes path of a folder in HDFS and creates sub-folder with the names of source tables and inside those folder it stores the data files of corresponding tables.
3). What is Free Form Import in Sqoop?
It means that user can import data from any RDBMS table using any SQL query rather than using the options table name and column name provided by Sqoop.
4). How can we execute Free Form SQL query to import the rows sequentially in Sqoop?
This can be achieved by setting the --num-mappers or -m argument to 1. Which creates only one task and imports the rows sequentially into HDFS.
5). What is the default file format for importing the data from RDBMS tables?
The default file format while importing data from database is text file format. User can explicitly specify this using the argument --as-textfile.
6). What is Sqoop metastore?
Sqoop metastore is the repository of all the saved jobs that are created using sqoop job command. The configuration details of metastore will be done sqoop-site.xml file.
7). How to import all the tables of a database expcept some tables?
Assume that we have 300 tables in a database and we want to import all the 300 but not table150, table180 and table 190. How can we do this in Sqoop?
There is an option in sqoop import-all-tables command, --exclude-tables, where we can specify all names of the tables which we don't want to import.
--exclude-tables table150, table180, table190
8). During Sqoop job preparation you mentioned --num-mappers to 8, but you found that the Sqoop job runs only 4 mappers. Why?
The possible reason for this is Hadoop cluster is set to use only 4 mappers for that user. In that case Sqoop cannot use more than 4 mappers even if we pass more than 4.
9). What is the significance of --split-by clause in Apache Sqoop?
Split-by clause is used to specify the column based on whose values the splits will be created. If we don't mention this argument by default Sqoop takes primary key column for splits generation. But some times primary key column might not have evenly distributed values between the maximum and minimum ranges. So the splits will of different sizes, so the execution of mapreduces job, which be will one for each split might take different times to complete. So to improve the performance in such case we can go with split-by command by supplying a column to it in such a way that the column has evenly distributed data between the ranges.
10). Even after creating having --split-by command in Sqoop command, Sqoop job is giving optimal performance. How can we optimize this?
There will cases where even after supplying the --split-by command Sqoop job might not give good perfoemance. In such cases we can use --boundary-query which can be used to specify maximum and minimum values for split column through an SQL query.
11). Is it mandatory to have metastore in Hadoop cluster?
No. It is not mandatory to have the metastore in cluster, we can have it even outside of the cluster.