Slowly Changing Dimensions in Hive|Hive Slowly Changing Dimensions

What is SCDs or Slowly Changing Dimensions?

Slowly changing dimensions is a concept related to data warehousing. They track the data changes at irregular and unpredictable intervals. So basically SCD are not syntax or code in language. They are many ways to mange SCDs based on business needs. In some cases we may need to maintain all the historical of the data and in some cases we may need to maintain some of the historical data. And in some other cases we may not worry at all about the historical data.

There are three types of managing SCDs.

Type1:

In this type we will overwrite the historical data with latest one. This approach is very simple as we don't need to update the the latest data. The disadvantage in this method is, we will lose all the historical data. When we need to synchronize the reporting system with operational systems, we can go for this approach.

Type1 SCDs, before merge:

Type1 SCDs, after merge:

Here we can observe that the old data is update with new data replacing the values, which doesn't allow the user to track the historical data.

Type2:

This type of update allows us to store a full history of the records by making new fields that will track the status of the records whether they are old or latest. In this we need to maintain a separate field that tracks the history of the record.

Type2 SCDs, before merge:

Type2 SCDs, after merge:

In the above records we can observe that the old records maintained according to the dates. If the ValidTo column is null then we can conclude that the record is latest one. The disadvantage in this type is data grows as the records need to updated. But in today's technology maintaining large data is not at all an issue, so we can track the total history of the data. That's why type2 is very famous and widely used.

Type3:

Type2 updates are little complex and data maintenance overhead will be there, which might be more relative than what we need. Type3 SCDs are more simpler to develop and easy to maintain as they offer limited history tracking. If it is allowed to maintain partial history of the data then it is good have type3 SCDs.

Type3 SCDs, before merge:

Type3 SCDs, after merge:

You can observe the above records before and after updating. Here it is maintaining the current and last values for contact and zipcode.

 

Please comment you thoughts about this post..

Leave a Reply

Your email address will not be published. Required fields are marked *