Back to tutorials

dbt_utils for Surrogate Key Generation

Introduction

Welcome to this tutorial on surrogate key generation using dbt's utility package. dbt is a powerful tool for transforming data in the data warehouse. One of its many utilities is the generation of surrogate keys, which are essential for data modeling and analytics.

Understanding Primary and Surrogate Keys

  • Primary Keys: These are unique identifiers for records in a table. They ensure no duplicate rows, establish relationships with other tables, and help identify the grain of the table.
  • Surrogate Keys: When your data doesn't come with a unique primary key, surrogate keys come to the rescue. They are primary keys derived in the analytics layer, ensuring each record has a unique identifier.

The Basics of dbt_utils.generate_surrogate_key

dbt_utils is a package for dbt that offers a collection of macros and materializations to simplify common tasks. One such macro is generate_surrogate_key, which aids in creating surrogate keys.

Generating Surrogate Keys: A Step-by-Step Guide

  • When to Use: Whenever your table lacks a unique primary key or when combining data from multiple sources.
  • How to Use: This macro concatenates the fields and applies a cryptographic hash (using MD5 by default) to produce a unique ID.
{{ dbt_utils.generate_surrogate_key('field_a', 'field_b') }}

Handling Null Values in Surrogate Keys

Null values can be tricky when generating surrogate keys. If any value is null, the entire concatenated string might return as null.

  • Solution: Use the coalesce function to manage null values. This function replaces null values with a default value, ensuring the concatenated string remains unique.
  • Separator Importance: To ensure uniqueness, add separators between fields, especially when handling null values.
select
  {{ dbt_utils.generate_surrogate_key(
    coalesce(activity_date,'NO_ACTIVITY'), 
    '---', 
    user_id) 
  }} as primary_key, 

  activity_date,
  user_id,
  total_actions

from {{ ref('int__user_daily_activity') }}

Customizing Hashing: Switching from MD5 to SHA

While MD5 is the default hashing function, you might prefer using SHA for its cryptographic advantages.

How to Switch: Override the default hashing macro in dbt by adding a macro called hash.sql to your dbt project's macros directory. The file should include this file:

{% macro default__hash(field) -%}
  sha(cast({{ field }} as {{ api.Column.translate_type('string') }}))
{%- endmacro %}

Best Practices for Surrogate Key Generation

  • Consistency: Ensure you use the same method for surrogate key generation across your project.
  • Data Integrity: Regularly test your surrogate keys in dbt to ensure data completeness and uniqueness.
  • Avoid Pitfalls: Be wary of null values and ensure you're using separators to maintain uniqueness.

Conclusion

Surrogate keys are invaluable in data modeling, especially when dealing with data from diverse sources. With dbt_utils, the process becomes streamlined, allowing you to focus on analytics rather than data preparation.

Additional Resources

Official dbt_utils Documentation

Previous

dbt sources
database icon
Unified workspace for your dbt workflow
Forget about the painful parts of dbt development, focus on what matters the most - data analysis