dbt seed Command: Usage & Examples
Introduction
dbt is a command-line tool that enables data analysts and engineers to transform data in their warehouses more effectively. One of the powerful features of dbt is the seed
command, which allows you to load CSV files (referred to as “seeds”) into your data warehouse. This tutorial will guide you through the process of using the dbt seed
command.
Understanding dbt Seeds
In dbt, seeds are CSV files that you can load into your data warehouse. They are particularly useful for handling static data that changes infrequently. For instance, you might have a CSV file containing a list of country codes and their corresponding country names. This data can be loaded into your warehouse as a seed and referenced in your dbt models.
dbt seed Use Cases
Mapping Codes to Descriptive Names
As we’ve seen in the previous example, you can use a seed to map product codes to product names. This can be extended to any scenario where you have codes that need to be mapped to more descriptive names. For example, you might have a seed that maps error codes to error descriptions, or abbreviations to their full forms.
Excluding Certain Data from Analysis
Suppose you have a list of test email addresses or user IDs that you want to exclude from your analysis. You can create a seed with these email addresses or user IDs, and then use the dbt seed
command to load this data into your warehouse. In your dbt models, you can then exclude these test email addresses or user IDs from your analysis.
Loading Small Reference Datasets
If you have small reference datasets that don’t change often, it can be more efficient to load them into your warehouse as seeds rather than storing them as tables in your source databases. For example, you might have a small dataset of exchange rates that you want to use in your dbt models.
Data Validation
You can use seeds to validate the data in your warehouse. For example, you might have a seed that contains the expected results of a certain calculation. You can then create a dbt test that compares the actual results in your warehouse to the expected results in your seed.
Machine Learning Model Testing
If you’re a data scientist or machine learning engineer, you can use seeds to load test data into your warehouse. You can then use this test data to evaluate the performance of your machine learning models.
Creating Your First dbt Seed
Let’s create a CSV file to use as a seed. For this tutorial, we’ll use a simple CSV file containing product codes and their corresponding product names. Save the following data in a file named **product_codes.csv**
:
product_code,product_name
PRD01,Apple iPhone 13
PRD02,Samsung Galaxy S21
PRD03,Google Pixel 6
Place this file in the seeds
directory of your dbt project.
Loading dbt Seeds into Your Data Warehouse
To load the seed into your data warehouse, run the following command in your terminal:
$ dbt seed
This command will load all CSV files in the seeds
directory into your data warehouse. If you want to load a specific seed, you can use the --select
flag followed by the name of the seed (without the .csv
extension). For example:
$ dbt seed --select product_codes
Referencing dbt Seeds in Your Models
Once your seed is loaded into your data warehouse, you can reference it in your dbt models using the ref
function. For example, if you have a model that needs to join with the **product_codes**
seed, you could do something like this:
select
orders.*,
product_codes.product_name
from {{ ref('orders') }} as orders
left join {{ ref('product_codes') }} as product_codes
on orders.product_code = product_codes.product_code
Configuring and Testing dbt Seeds
You can configure your seeds in the dbt_project.yml
file. For example, you can specify the schema where the seed data should be loaded. You can also document and test your seeds by declaring properties in YAML. For more information, check out the dbt documentation.
Conclusion
The dbt seed
command is a powerful tool for loading static data into your data warehouse. By using seeds, you can keep your transformation logic within dbt, ensuring that your data transformations are consistent, version controlled, and code reviewable.
Previous
dbt run-operationNext
dbt snapshot