dbt_utils
for Surrogate Key Generation
Introduction
Welcome to this tutorial on surrogate key generation using dbt's utility package. dbt is a powerful tool for transforming data in the data warehouse. One of its many utilities is the generation of surrogate keys, which are essential for data modeling and analytics.
Understanding Primary and Surrogate Keys
- Primary Keys: These are unique identifiers for records in a table. They ensure no duplicate rows, establish relationships with other tables, and help identify the grain of the table.
- Surrogate Keys: When your data doesn't come with a unique primary key, surrogate keys come to the rescue. They are primary keys derived in the analytics layer, ensuring each record has a unique identifier.
The Basics of dbt_utils.generate_surrogate_key
dbt_utils
is a package for dbt that offers a collection of macros and materializations to simplify common tasks. One such macro is generate_surrogate_key
, which aids in creating surrogate keys.
Generating Surrogate Keys: A Step-by-Step Guide
- When to Use: Whenever your table lacks a unique primary key or when combining data from multiple sources.
- How to Use: This macro concatenates the fields and applies a cryptographic hash (using MD5 by default) to produce a unique ID.
{{ dbt_utils.generate_surrogate_key('field_a', 'field_b') }}
Handling Null Values in Surrogate Keys
Null values can be tricky when generating surrogate keys. If any value is null, the entire concatenated string might return as null.
- Solution: Use the
coalesce
function to manage null values. This function replaces null values with a default value, ensuring the concatenated string remains unique. - Separator Importance: To ensure uniqueness, add separators between fields, especially when handling null values.
select
{{ dbt_utils.generate_surrogate_key(
coalesce(activity_date,'NO_ACTIVITY'),
'---',
user_id)
}} as primary_key,
activity_date,
user_id,
total_actions
from {{ ref('int__user_daily_activity') }}
Customizing Hashing: Switching from MD5 to SHA
While MD5 is the default hashing function, you might prefer using SHA for its cryptographic advantages.
How to Switch: Override the default hashing macro in dbt by adding a macro called hash.sql
to your dbt project's macros directory. The file should include this file:
{% macro default__hash(field) -%}
sha(cast({{ field }} as {{ api.Column.translate_type('string') }}))
{%- endmacro %}
Best Practices for Surrogate Key Generation
- Consistency: Ensure you use the same method for surrogate key generation across your project.
- Data Integrity: Regularly test your surrogate keys in dbt to ensure data completeness and uniqueness.
- Avoid Pitfalls: Be wary of null values and ensure you're using separators to maintain uniqueness.
Conclusion
Surrogate keys are invaluable in data modeling, especially when dealing with data from diverse sources. With dbt_utils
, the process becomes streamlined, allowing you to focus on analytics rather than data preparation.
Additional Resources
Previous
dbt sources