How to Use DISTKEY, SORTKEY and Define Column Compression Encoding in Redshift

Designing tables properly is critical to successful use of any database, and is emphasized a lot more in specialized databases such as Redshift. This articles talks about the options to use when creating tables to ensure performance, and continues from Redshift table creation basics.

Selecting Sort Keys

When you create a table on Redshift, you can (and should) specify one or more columns as the sort key. You can think of a sort key as a specialized type of index, since Redshift does not have the regular indexes found in other relational databases. Redshift stores data on disk in sorted order according to the sort key, which has an important effect on query performance.

You choose sort keys based on the following criteria:

  • If recent data is queried most frequently, specify the timestamp column as the leading column.
  • If you frequently filter by a range of values or a single value on one column, that column should be the sort key.
  • Columns frequently used in joins should be used as the sort key

Here are some examples of defining the sort key:

-- sale_date is the timestamp column
CREATE TABLE sales (
  sale_id BIGINT NOT NULL PRIMARY KEY,
  sale_date timestamp NOT NULL SORTKEY,
  ... <other colums>
);

-- use the SORTKEY table attribute keyword to create a multi-column sort key
-- In this case searches are done frequently by the location columns,
-- so state and city are part of sort key
CREATE TABLE dim_customers (
  ... <some columns>...
  state VARCHAR,
  city VARCHAR
)
SORTKEY (state, city);

Selecting Distribution Styles

When you create a Redshift cluster, you define the number of nodes you want to use. The nodes work in parallel to speed up query execution. This also means that when you load data into a table, Redshift distributes the rows of the table to each of the node slices according to the table’s distribution style.

There are three distribution styles:

  • EVEN Distribution: This is the default and just uses a simple round-robin method to distribute data, regardless of values. This is appropriate when a table is not used in queries with joins or when there is no clear choice of distribution method between the next two.
  • KEY Distribution: The values in one column are used to determine the row distribution. Redshift will attempt to place matching values on the same node slice. Use this for tables that are frequently joined together so that Redshift will collocate the rows of the tables with the same values of the joining columns on the same node slices. This makes execution of the joins much faster since the matching values of the common columns are physically stored together.
  • ALL Distribution: Using this will cause a copy of the entire table to be stored on each node. This is normally used for small but frequently joined tables such as lookup tables.

Distribution of a table is defined using the DISTSTYLE and/or DISTKEY.

-- Specifying a column as DISTKEY automatically sets distribution style to KEY
CREATE TABLE sales (
  sale_id BIGINT NOT NULL PRIMARY KEY,
  sale_date timestamp NOT NULL SORTKEY,
  customer_id int DISTKEY,
  amount float
);

-- Use DISTSTYLE table attribute to set it to ALL
CREATE TABLE atrribute_lookup (
  attribute_id INT NOT NULL PRIMARY KEY,
  attribute_name VARCHAR
)
DISTSTYLE ALL;

Specifying Column Compression Encoding

Compression is defined per column allows reduction of size of stored data, which reduces disk I/O and improves query performance.

You define compression per column, if you do not specify any, Redshift uses the following compression:

  • All columns in temporary tables are assigned RAW compression by default
  • Columns defined as sort keys are assigned RAW compression
  • BOOLEAN, REAL, and DOUBLE PRECISION columns are assigned RAW compression
  • All others are assigned LZO compression.

For example, if you want to force a VARCHAR column to use RAW compression:

CREATE TABLE atrribute_lookup (
  attribute_id INT NOT NULL PRIMARY KEY,
  attribute_name VARCHAR ENCODE RAW
);

See the Redshift Documentation for details on the different compression encodings.