Designing tables properly is critical to successful use of any database, and is emphasized a lot more in specialized databases such as Redshift. This articles talks about the options to use when creating tables to ensure performance, and continues from Redshift table creation basics.
When you create a table on Redshift, you can (and should) specify one or more columns as the sort key. You can think of a sort key as a specialized type of index, since Redshift does not have the regular indexes found in other relational databases. Redshift stores data on disk in sorted order according to the sort key, which has an important effect on query performance.
You choose sort keys based on the following criteria:
Here are some examples of defining the sort key:
-- sale_date is the timestamp column CREATE TABLE sales ( sale_id BIGINT NOT NULL PRIMARY KEY, sale_date timestamp NOT NULL SORTKEY, ... <other colums> ); -- use the SORTKEY table attribute keyword to create a multi-column sort key -- In this case searches are done frequently by the location columns, -- so state and city are part of sort key CREATE TABLE dim_customers ( ... <some columns>... state VARCHAR, city VARCHAR ) SORTKEY (state, city);
When you create a Redshift cluster, you define the number of nodes you want to use. The nodes work in parallel to speed up query execution. This also means that when you load data into a table, Redshift distributes the rows of the table to each of the node slices according to the table’s distribution style.
There are three distribution styles:
Distribution of a table is defined using the
-- Specifying a column as DISTKEY automatically sets distribution style to KEY CREATE TABLE sales ( sale_id BIGINT NOT NULL PRIMARY KEY, sale_date timestamp NOT NULL SORTKEY, customer_id int DISTKEY, amount float ); -- Use DISTSTYLE table attribute to set it to ALL CREATE TABLE atrribute_lookup ( attribute_id INT NOT NULL PRIMARY KEY, attribute_name VARCHAR ) DISTSTYLE ALL;
Compression is defined per column allows reduction of size of stored data, which reduces disk I/O and improves query performance.
You define compression per column, if you do not specify any, Redshift uses the following compression:
For example, if you want to force a VARCHAR column to use RAW compression:
CREATE TABLE atrribute_lookup ( attribute_id INT NOT NULL PRIMARY KEY, attribute_name VARCHAR ENCODE RAW );
See the Redshift Documentation for details on the different compression encodings.