Back to tutorials

dbt snapshot Command: Strategies & Examples

Introduction

dbt is a powerful data transformation tool that enables data analysts and engineers to transform data in their warehouses more effectively. One of the key features of dbt is the ability to create snapshots, which are a way of tracking changes in your data over time. This tutorial will guide you through the process of creating and using dbt snapshots.

Understanding Slowly Changing Dimensions

Slowly Changing Dimensions (SCDs) are a concept in data warehousing that refers to the ways in which data changes over time. They are called “slowly changing” because the changes, such as a customer’s address or a product’s price, usually occur infrequently, but the impact of these changes can be significant for data analysis.

SCDs are typically categorized into three types:

  • Type 1: This approach overwrites old data with new data, and therefore, no history is preserved.
  • Type 2: This approach adds a new record with the new data and preserves the old record as historical data. This is the most common type of SCD and the one that dbt snapshots implement.
  • Type 3: This approach adds a new column for the new data and preserves the old data in the original column. This type is less common and is used when it’s important to see the progression of changes.

dbt snapshots are directly connected to the concept of SCDs, particularly Type 2 SCDs. A snapshot in dbt is a way of tracking changes in your data over time, effectively creating a version history of each row of data. When you create a snapshot, dbt adds metadata columns to your data, including dbt_valid_from and dbt_valid_to, which indicate the time range during which a particular version of a record was valid.

This mechanism allows dbt snapshots to implement Type 2 SCDs. When a change occurs in the source data, instead of overwriting the existing record (Type 1) or adding a new column (Type 3), dbt adds a new record with the new data (Type 2). The dbt_valid_from and dbt_valid_to columns in the snapshot table indicate when each version of the record was valid, allowing you to track the full history of changes over time.

Creating a dbt Snapshot Model

A snapshot model in dbt is a special kind of model that tracks changes in your data over time. To create a snapshot model, you need to create a new file in the snapshots directory of your dbt project. This file should contain a snapshot block, which defines the configuration for your snapshot.

Here’s an example of a snapshot model that tracks changes in a users table:

{% snapshot users_snapshot %}  

{{
  config(      
    target_schema='snapshots',      
    strategy='timestamp',      
    unique_key='id',      
    updated_at='updated_at'    
  )  
}}  

  select * 
  from raw.users

{% endsnapshot %}

In this example, the strategy is set to 'timestamp', which means dbt will track changes based on the updated_at column. The unique_key is set to 'id', which is the unique identifier for each row in the users table.

Understanding dbt Snapshot Strategies

In dbt, there are two main strategies for creating snapshots: the timestamp strategy and the check strategy.

Timestamp strategy

The timestamp strategy is used when your source data includes a timestamp column that gets updated whenever a record changes. In this strategy, dbt will create a new snapshot record whenever the timestamp column is more recent than the last time the snapshot was run. This strategy is useful when you want to track changes based on when they occurred.

You can see an example of a timestamp strategy in the previous paragraph.

Check strategy

The check strategy, on the other hand, is used when you want to track changes based on the values of specific columns. In this strategy, dbt will create a new snapshot record whenever the values in the specified columns are different from the last time the snapshot was run. This strategy is useful when you want to track changes based on what changed, rather than when it changed.

Here’s an example of a snapshot configuration using the check strategy:

{% snapshot users_snapshot %}  

{{    
  config(      
    target_schema='snapshots',      
    strategy='check',      
    unique_key='id',      
    check_cols=['status']    
  )  
}}  

select * 
from raw.users

{% endsnapshot %}

In this example, dbt will create a new snapshot record whenever the status column changes for a user.

Choosing the right strategy for your snapshots depends on the nature of your source data and the specific requirements of your use case.

Running Your Snapshot Models

Once you’ve created your snapshot model, you can run it using the dbt snapshot command. This command will execute all snapshot models in your dbt project and create a new snapshot table in your data warehouse.

Here’s how you can run your snapshot models:

dbt snapshot

After running this command, you should see a new table in your data warehouse under the snapshots schema (or whatever schema you specified in your snapshot configuration).

Working with dbt Snapshot Data

The snapshot table created by dbt contains a full history of your data, with each row representing a different version of a record. You can query this table to see how your data has changed over time.

Here’s an example of a query that shows all changes to the status column of the users table:

select 
  id, 
  status, 
  dbt_valid_from, 
  dbt_valid_to

from snapshots.users_snapshot
order by id, dbt_valid_from;

This query will return a result set that shows when each status value was valid for each user.

Advanced Snapshot Techniques

In addition to the basic snapshot functionality, dbt also provides several advanced features that can help you manage your snapshots more effectively. For example, you can use the invalidate_hard_deletes configuration option to track records that have been deleted from your source data.

Here’s how you can modify the previous snapshot model to track hard deletes:

{% snapshot users_snapshot %}  

{{    
  config(      
    target_schema='snapshots',      
    strategy='timestamp',      
    unique_key='id',      
    updated_at='updated_at',      
    invalidate_hard_deletes=True    
  )  
}}  

select * 
from raw.users

{% endsnapshot %}

With this configuration, dbt will create a new row in the snapshot table for each record that is deleted from the users table, with the dbt_valid_to column set to the timestamp of the deletion.

Best Practices and Guidelines

When working with dbt snapshots, there are a few best practices you should follow:

  • Keep your snapshots simple: Avoid adding complex logic or joins to your snapshot models. You can always apply this logic downstream after the snapshot is created.
  • Use a separate schema for snapshots: This makes it easier to distinguish between snapshot tables and other types of tables in your data warehouse.
  • Ensure your unique key is unique: The unique key you specify in your snapshot configuration should uniquely identify each record in your source data.

Conclusion

dbt snapshots are a powerful tool for tracking changes in your data over time. By following this tutorial, you should now have a good understanding of how to create and use dbt snapshots in your own projects.

database icon
Unified workspace for your dbt workflow
Forget about the painful parts of dbt development, focus on what matters the most - data analysis