PopSQL

Finding Your Product's Most Engaged Users

Who are the power users of your product? These people are your champions. You'll definitely want to talk to them!

Why SQL?

Analytics tools can tell you aggregate event counts. To understand how much time a user spends in a product, though, you'll need to look at their sessions. SQL gives the most flexibility in how you "sessionize their data".

āš ļø This query is more complex than our other templates. We break it down thoroughly, but hereā€™s the final output in advance so you can see if this is worth your time:

| user_id | email                 | hours | num_days_used_product | hours_per_day | longest_session_minutes |
|---------|-----------------------|-------|-----------------------|---------------|-------------------------|
| 467801  | lukasz@cookbright.com | 4.7   | 5                     | 0.9           | 113.7                   |
| 519997  | ottilie@yobbit.com    | 4.3   | 5                     | 0.9           | 40.9                    |
| 553461  | anika@dogshare.com    | 2.8   | 3                     | 0.9           | 126.7                   |
| 506538  | rebeca@wearcade.com   | 2.5   | 3                     | 0.8           | 87.5                    |
| ...     | ...                   | ...   | ...                   | ...           | ...                     |

SQL you can copy / paste

If you have a users table and an events table with a timestamp and event_name you can use the queries below with little modification.

Don't feel overwhelmed! We break down each CTE below.

-- PART 1: data prep on each user's events to find time since their last event, if the next event constitutes a new session and is on a new day than the previous
with data as (
  select
    user_id,
    time,
    round(((extract(epoch from time) - extract(epoch from lag(time) over (partition by user_id order by time asc))) / 60)::decimal, 1) minutes_since_last_event,
    case
      when coalesce(round(((extract(epoch from time) - extract(epoch from lag(time) over (partition by user_id order by time asc))) / 60)::decimal, 1), 30) >= 30 then 1
      else 0
    end as new_session,
    case
      when date_trunc('day', time) <> date_trunc('day', lag(time) over (partition by user_id order by time asc)) then 1
      when lag(time) over (partition by user_id order by time asc) is null then 1
      else 0
    end as new_day
  from events
  where date_trunc('day', time) between '{{start_date}}' and '{{end_date}}' -- we recommend a week of data
  order by 2, 1 asc
),

-- PART 2: grouping events into sessions
data2 as (
  select
    user_id,
    time,
    sum(new_session) over (partition by user_id order by time asc) as user_session_number,
    min(time) over (partition by user_id order by time asc) as user_session_start,
    sum(new_day) over (partition by user_id order by time asc) as user_day_number
  from data
),

-- PART 3: defining each session's start and end time
data3 as (
  select
    user_id,
    time,
    user_session_number,
    user_day_number,
    min(time) over (partition by user_id, user_session_number order by time asc) as session_start,
    max(time) over (partition by user_id, user_session_number order by time asc) as session_end
  from data2
),

-- PART 4: calculating session durations
session_durations as (
  select
    user_id,
    users.email,
    user_session_number,
    max(user_day_number) as num_days,
    min(time) as session_start,
    max(session_end) as max_session_end,
    round(((extract(epoch from max(session_end)) - extract(epoch from max(session_start))) / 60)::decimal, 1) as session_duration_in_minutes
  from data3
  inner join users on users.id = user_id
  group by 1,2,3
  order by 5 desc
)

-- PART 5: aggregating session durations and other calculations (hours_per_week, num_days_used_product, hours_per_day, longest_weekly_session_in_minutes)
  select
    session_durations.user_id,
    email,
    round(sum(session_duration_in_minutes) / 60, 1) as hours,
    max(num_days) as num_days_used_product,
    round(round(sum(session_duration_in_minutes) / 60, 1) / max(num_days), 1) as hours_per_day,
    max(session_duration_in_minutes) as longest_session_in_minutes
  from session_durations
  group by 1,2
  having sum(session_duration_in_minutes) > 20
  order by 3 desc;

Part 1: data prep on each user's events

The query in the data CTE is tricky, so let's look at its output for just one user upfront:

| user_id | time                | minutes_since_last_event | new_session | new_day |
|---------|---------------------|--------------------------|-------------|---------|
| 467801  | 2020-04-06 16:19:43 |                          | 1           | 1       |
| 467801  | 2020-04-06 17:42:21 | 82.6                     | 1           | 0       |
| 467801  | 2020-04-06 17:46:04 | 3.7                      | 0           | 0       |
| 467801  | 2020-04-06 17:48:39 | 2.6                      | 0           | 0       |
| ...     | ...                 | ...                      | ...         | ...     |

Here we see a partial list of events for user 467801 ordered by time.

The lag() function allows us to calculate the time between one event and the event prior. That's what populates the minutes_since_last_event column. The first row is null since there is no preceding event. You can see that 82.6 minutes have passed from the event at 2020-04-06 17:42:21 and the prior event at 2020-04-06 16:19:43.

The column new_session is a binary. It looks at minutes_since_last_event, and says if more than 30 minutes has passed since the last event, then it's a new session and records a 1. The column new_day does similarly, examining if an event is a different day than the one prior.

We'll use these new_session and new_day columns later.

Part 2: grouping events into sessions

Just like in Part 1, let's look at the output of the CTE data2 for just one user upfront:

| user_id | time                | user_session_number | user_session_start  | user_day_number |
|---------|---------------------|---------------------|---------------------|-----------------|
| 467801  | 2020-04-06 16:19:43 | 1                   | 2020-04-06 16:19:43 | 1               |
| 467801  | 2020-04-06 17:42:21 | 2                   | 2020-04-06 16:19:43 | 1               |
| 467801  | 2020-04-06 17:46:04 | 2                   | 2020-04-06 16:19:43 | 1               |
| 467801  | 2020-04-06 17:48:39 | 2                   | 2020-04-06 16:19:43 | 1               |
| ...     | ...                 | ...                 | ...                 | ...             |

We see the same ordering of events by time. A Window function creates the column user_session_number. It creates a running count of each user session by tabulating the new_session binary (from the previous CTE data). This is easier shown than written:

The same approach creates the column user_day_number. And user_session_start finds the earliest timestamp for a session for that user.

Part 3: defining each session's start and end time

Again we'll look at the output of the CTE data3 for just one user.

| user_id | time                | user_session_number | session_start       | session_end         | user_day_number |
|---------|---------------------|---------------------|---------------------|---------------------|-----------------|
| 467801  | 2020-04-06 16:19:43 | 1                   | 2020-04-06 16:19:43 | 2020-04-06 16:19:43 | 1               |
| 467801  | 2020-04-06 17:42:21 | 2                   | 2020-04-06 17:42:21 | 2020-04-06 17:42:21 | 1               |
| 467801  | 2020-04-06 17:46:04 | 2                   | 2020-04-06 17:42:21 | 2020-04-06 17:46:04 | 1               |
| 467801  | 2020-04-06 17:48:39 | 2                   | 2020-04-06 17:42:21 | 2020-04-06 17:48:39 | 1               |
| ...     | ...                 | ...                 | ...                 | ...                 | ...             |

In this CTE, the query calculates the session_start by finding the earliest (minimum) timestamp of any event for each session (defined by user_session_number). Look at how the session_start changes for user_session_number = 1 vs user_session_number = 2.

Similarly, the query calculates the session_end by finding the latest (maximum) timestamp of any event for each session (defined by user_session_number). Look at the session_end for user_session_number = 2. The query iteratively records a higher session_end as it realizes that with each event in the same session there's a higher possible max.

Part 4: calculating session durations

Home stretch! Here's the output of the CTE session_durations. As the name implies, we calculate how long each session lasted.

| user_id | email                 | user_session_number | num_days | session_start       | max_session_end     | session_duration_in_minutes |
|---------|-----------------------|---------------------|----------|---------------------|---------------------|-----------------------------|
| 467801  | lukasz@cookbright.com | 1                   | 1        | 2020-04-06 16:19:43 | 2020-04-06 16:19:43 | 0.0                         |
| 467801  | lukasz@cookbright.com | 2                   | 1        | 2020-04-06 17:42:21 | 2020-04-06 18:25:36 | 43.3                        |
| 467801  | lukasz@cookbright.com | 3                   | 1        | 2020-04-06 19:05:07 | 2020-04-06 20:58:47 | 113.7                       |
| 467801  | lukasz@cookbright.com | 4                   | 2        | 2020-04-07 15:34:31 | 2020-04-07 15:34:31 | 0.0                         |
| ...     | ...                   | ...                 | ...      | ...                 | ...                 | ...                         |

Recall how in Part 3, we found an iteratively higher session_end?

In Part 4, we take the highest session_end for that particular session (max_session_end). We then calculate the minutes between max_session_end and session_start using extract(epoch from...) (tutorial).

Lastly, we join to a users table in this step to get email.

Part 5: aggregating session durations and other calculations

We've computed all the necessary pieces in our previous steps. Now it's straightforward arithmetic and MAX() functions to calculate hours in product, the number of days the user has used your product in the period, the hours per day they spend in the product, and their longest session šŸŽ‰

Final Output:

| user_id | email                 | hours | num_days_used_product | hours_per_day | longest_session_minutes |
|---------|-----------------------|-------|-----------------------|---------------|-------------------------|
| 467801  | lukasz@cookbright.com | 4.7   | 5                     | 0.9           | 113.7                   |
| 519997  | ottilie@yobbit.com    | 4.3   | 5                     | 0.9           | 40.9                    |
| 553461  | anika@dogshare.com    | 2.8   | 3                     | 0.9           | 126.7                   |
| 506538  | rebeca@wearcade.com   | 2.5   | 3                     | 0.8           | 87.5                    |
| ...     | ...                   | ...   | ...                   | ...           | ...                     |

Try it yourself?

Run this template against ourĀ sample database that mirrors real startup data. See the connection credentials, then connect in PopSQL.

Ready for a modern SQL editor?