Slowly Changing Dimensions (SCD)

1. Data Warehouse Modelling

In data warehousing, a common approach to dimensional modelling is through the concept of a star schema, a logical grouping of one or more fact tables surrounded by related dimension tables, linked via primary and foreign key relationships. As can be seen in the image below, the star schema derives its name from the physical representation which resembles a star shape, with the fact table at its centre and the surrounding dimension tables representing the stars points.

  • Fact tables contain measurements and foreign keys which refer to the primary keys in the associated dimension tables. For example, a Sales fact table may include measures such as Quantity, Unit Price, and Profit.

  • Dimension tables contain a collection of descriptive attributes that categorise facts to enable users to answer business questions. Example dimensions include Customer, Stock Item, and Date.

 
 

2. What is a Slowly Changing Dimension (SCD)?

A slowly changing dimension (SCD) is a dimension that is able to handle data attributes which change over time.

For example:

  • A customer dimension may hold attributes such as name, address, and phone number.

  • Over time, a customer's details may change (e.g. move addresses, change phone number, etc).

A slowly changing dimension is able to accommodate these changes, with some SCD patterns having the added ability to preserve history. Deciding on which type of slowly changing dimension pattern to implement will vary based on your business requirements.

3. SCD Types (High-Level)

The table below provides a high-level summary of the most common SCD types. For more information on other types such as 4, 5, 6, and 7, check out Dimensional Modeling Techniques by the Kimball Group.

Type Title Allows Updates Preserves History Description
0 Retain Original Attributes never change (e.g. date of birth).
1 Overwrite Old values are overwritten with new values.
2 Add a new record Track changes by creating multiple records (e.g. ValidFrom, ValidTo).
3 Add a new field ✔(*) Track changes using separate columns (e.g. CurrentValue, PreviousValue).

(*) Preserves limited history as it is limited to the number of additional columns designated for storing historical data.

Type 0 - Retain original

In type 0, dimension attributes never change, the original value is always true. Type 0 applies to most date dimension attributes (e.g. for the date 2022-01-01, the day will always be Saturday, the month will always be January, etc).

Scenario

In the following types, we will look at how each pattern handles a simple scenario where we have a Customer dimension on the right with three existing records, and some new incoming data on the left which holds a combination of existing records (CustomerKey 1 and 2), updated records (CustomerKey 3), and new records (CustomerKey 4 and 5).

Type 1 - Overwrite

When new data arrives, the old attribute value in the dimension row are overwritten with the new value. While this technique is able to handle changes, this approach is unable to preserve history (i.e. simply reflects the most recent value).

Type 2 - Add a new record

When new data arrives, a new row is added to the dimension table with the updated attribute values. Since this can result in multiple rows describing different states of the same natural key, a surrogate key is needed to complement the natural key so that each row can be uniquely identified. Alongside the surrogate key are additional columns such as ValidFrom and ValidTo, to describe the period of time the record is relevant for, and finally a current row indicator such as isActive, to easily filter on the current state.

Summary of additional columns needed for type 2 include:

  • Surrogate Key (Unique identifier)

  • Effective Date (e.g. ValidFrom)

  • Expiration Date (e.g. ValidTo)

  • Current Row Indicator (e.g. isActive)

Note: The Current Row Indicator (isActive) column is technically optional, as filtering on ValidTo (e.g. ValidTo = 9999-12-31) would be sufficient to see current state. That being said, the additional field provides a rapid way to isolate the current set of dimension records that are active and therefore is generally recommended as best practice.

 
 

Type 3 - Add a new field

When new data arrives, a secondary column (e.g. PreviousValue) preserves the old value, the new value overwrites the primary column (e.g. CurrentValue).

 
 

4. SCD Type 2 (Detail)

As Type 2 is one of the most frequently used SCD patterns, let’s unpack this particular pattern further. When new data arrives for an existing SCD type 2 dimension, the following is what we conceptually need to evaluate in order to determine the type of operations that will need to be performed. 

Conceptual Logic

The comparisons below are performed against the existing dimension table where the current row indicator is true (e.g. isActive = 1).

Q1. Does the natural key exist?

  • If Yes, flag as an “existing” row.

    • Proceed to Q2.

  • If No, flag as a “new” row.

    • INSERT new (isActive = 1, ValidFrom = effectiveDate, ValidTo = 9999-12-31)

Q2. Where the natural key does exist, is the row different (i.e. has an attribute value changed)?

  • If Yes, flag as “changed”.

    • INSERT new (isActive = 1, ValidFrom = effectiveDate, ValidTo = 9999-12-31)

    • UPDATE old (isActive = 0, ValidTo = expirationDate)

  • If No, flag as “no change”.

    • No action.

Effective and Expiration Dates

Depending on the technology stack, each vendor will typically have some type of function such as currentDate() or currentDateTime() to populate effective date fields such as ValidFrom and ValidTo.

When handling new versions of data where we need to UPDATE the obsolete record (isActive = 0) and INSERT the new record (isActive = 1), the date time value used to expire the obsolete record (ValidTo) and activate the new record (ValidFrom) must be identical, otherwise a gap will be introduced where the record has no effective dimension values for a period of time. This can be accomplished by storing the date time value in a variable prior to using the value in an INSERT/UPDATE operation.

Lastly, it is recommended to use full date time stamps rather than date only if you need to perform multiple updates per day (e.g. hourly).

Detecting Changes

A common method for detecting if a row has changed is by generating a hash value on all the dimension columns. Functions will vary by technology stack but an example may look like md5(Address, PhoneNumber, EmailAddress). This hash column is computed for each record across both datasets, then this single hash column is compared instead of comparing multiple individual columns to determine if the data has changed.

Note: When generating a “hash” column for the existing dimension table, remember to only include the non-DW columns that exist in the source (i.e. the columns being hashed should be the same across both streams). In other words, exclude DW specific columns such as SurrogateKey, ValidFrom, ValidTo, and isActive.

Flow Chart

An illustration of the conceptual flow.

While there are additional SCD approaches, the three described in this article and type 2 in particular are broadly applicable and commonly used in the industry. If you would like to see a working example of how SCD type 2 can be implemented in Data Lakehouse architectures leveraging Delta Lake with Azure Synapse Analytics, check out this workshop.

5. Additional Resources