How to create a dataframe with custom schema in Spark?

This is one of the most common interview questions.

 

Usually, if we create a dataframe in Spark without specifying any schema (for example, by reading a CSV file with no header), Spark applies a default schema: all the columns are of type String and the column names are generated in the pattern _c0, _c1, and so on.
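As a quick illustration of that default behaviour, here is a minimal sketch that writes a small headerless CSV to a temporary file and reads it back without a schema. It assumes a local SparkSession; the file contents are made up for the demo.

```scala
import org.apache.spark.sql.SparkSession
import java.nio.file.Files

// Local SparkSession just for this demo (in spark-shell, `spark` already exists).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("default-schema-demo")
  .getOrCreate()

// Write a tiny CSV with no header so Spark has nothing to infer names from.
val path = Files.createTempFile("emp", ".csv")
Files.write(path, "1,Raj,20000\n2,Ravi,30000\n".getBytes)

// Read it without supplying a schema.
val df = spark.read.csv(path.toString)
df.printSchema()
// root
//  |-- _c0: string (nullable = true)
//  |-- _c1: string (nullable = true)
//  |-- _c2: string (nullable = true)
```

Every column comes back as a string named _c0, _c1, _c2, which is rarely what we want.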

 

Instead, if we want to attach a custom schema to a dataframe, we can do it in two ways: using a case class, or using StructType. StructType gives us more flexibility than a case class, so it is the method used most often.
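For completeness, the case-class approach mentioned above can be sketched like this (a minimal example, assuming a local SparkSession; the Emp class and sample rows are made up for the demo):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("case-class-schema")
  .getOrCreate()
import spark.implicits._ // enables .toDF() on a Seq of case class instances

// Column names and types come from the case class fields.
case class Emp(Emp_id: Int, Emp_name: String, Emp_sal: Int)

val df = Seq(Emp(1, "Raj", 20000), Emp(2, "Ravi", 30000)).toDF()
df.printSchema()
// root
//  |-- Emp_id: integer (nullable = false)
//  |-- Emp_name: string (nullable = true)
//  |-- Emp_sal: integer (nullable = false)
```

Note that with a case class we do not control nullability per column: primitive fields like Int come out as nullable = false and reference fields like String as nullable = true, which is one reason StructType is preferred when we need fine-grained control.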

Import statement of StructType:

import org.apache.spark.sql.types.{StructType, StructField}

StructField does not take the normal datatypes that we use in Scala; it expects Spark SQL DataType objects. We need to import the required data types, as shown below, based on our requirement.

import org.apache.spark.sql.types.{StringType, IntegerType, BooleanType}

The generic syntax for creating the StructType schema will be as shown below:

val schema = StructType(
                 List(
                   StructField("col_name1", <Type>, is_nullable),
                   StructField("col_name2", <Type>, is_nullable)
                 )
              )

Using this generic syntax we can create a sample Spark dataframe using a custom schema.

import org.apache.spark.sql.Row

val data = Seq( Row(1, "Raj", 20000),
                Row(2, "Ravi", 30000))

val schema = StructType(
                 List(
                   StructField("Emp_id", IntegerType, false),
                   StructField("Emp_name", StringType, false),
                   StructField("Emp_sal", IntegerType, true)
                 )
              )
val df = spark.createDataFrame(
         spark.sparkContext.parallelize(data),
         schema)

df.show()

Output:
=======
+------+--------+-------+
|Emp_id|Emp_name|Emp_sal|
+------+--------+-------+
|     1|     Raj|  20000|
|     2|    Ravi|  30000|
+------+--------+-------+

df.printSchema()
Output:
=======
root
 |-- Emp_id: integer (nullable = false)
 |-- Emp_name: string (nullable = false)
 |-- Emp_sal: integer (nullable = true)

 

Please comment your thoughts about this post.
