Finding Your Product's Most Engaged Users
Who are the power users of your product? These people are your champions. You'll definitely want to talk to them!
Why SQL?
Analytics tools can tell you aggregate event counts. To understand how much time a user spends in a product, though, you'll need to look at their sessions. SQL gives the most flexibility in how you "sessionize their data".
ā ļø This query is more complex than our other templates. We break it down thoroughly, but hereās the final output in advance so you can see if this is worth your time:
| user_id | email | hours | num_days_used_product | hours_per_day | longest_session_minutes |
|---------|-----------------------|-------|-----------------------|---------------|-------------------------|
| 467801 | lukasz@cookbright.com | 4.7 | 5 | 0.9 | 113.7 |
| 519997 | ottilie@yobbit.com | 4.3 | 5 | 0.9 | 40.9 |
| 553461 | anika@dogshare.com | 2.8 | 3 | 0.9 | 126.7 |
| 506538 | rebeca@wearcade.com | 2.5 | 3 | 0.8 | 87.5 |
| ... | ... | ... | ... | ... | ... |
SQL you can copy / paste
If you have a users
table and an events
table with a timestamp
and event_name
you can use the queries below with little modification.
Don't feel overwhelmed! We break down each CTE below.
-- PART 1: data prep on each user's events to find time since their last event, if the next event constitutes a new session and is on a new day than the previous
with data as (
select
user_id,
time,
round(((extract(epoch from time) - extract(epoch from lag(time) over (partition by user_id order by time asc))) / 60)::decimal, 1) minutes_since_last_event,
case
when coalesce(round(((extract(epoch from time) - extract(epoch from lag(time) over (partition by user_id order by time asc))) / 60)::decimal, 1), 30) >= 30 then 1
else 0
end as new_session,
case
when date_trunc('day', time) <> date_trunc('day', lag(time) over (partition by user_id order by time asc)) then 1
when lag(time) over (partition by user_id order by time asc) is null then 1
else 0
end as new_day
from events
where date_trunc('day', time) between '{{start_date}}' and '{{end_date}}' -- we recommend a week of data
order by 2, 1 asc
),
-- PART 2: grouping events into sessions
data2 as (
select
user_id,
time,
sum(new_session) over (partition by user_id order by time asc) as user_session_number,
min(time) over (partition by user_id order by time asc) as user_session_start,
sum(new_day) over (partition by user_id order by time asc) as user_day_number
from data
),
-- PART 3: defining each session's start and end time
data3 as (
select
user_id,
time,
user_session_number,
user_day_number,
min(time) over (partition by user_id, user_session_number order by time asc) as session_start,
max(time) over (partition by user_id, user_session_number order by time asc) as session_end
from data2
),
-- PART 4: calculating session durations
session_durations as (
select
user_id,
users.email,
user_session_number,
max(user_day_number) as num_days,
min(time) as session_start,
max(session_end) as max_session_end,
round(((extract(epoch from max(session_end)) - extract(epoch from max(session_start))) / 60)::decimal, 1) as session_duration_in_minutes
from data3
inner join users on users.id = user_id
group by 1,2,3
order by 5 desc
)
-- PART 5: aggregating session durations and other calculations (hours_per_week, num_days_used_product, hours_per_day, longest_weekly_session_in_minutes)
select
session_durations.user_id,
email,
round(sum(session_duration_in_minutes) / 60, 1) as hours,
max(num_days) as num_days_used_product,
round(round(sum(session_duration_in_minutes) / 60, 1) / max(num_days), 1) as hours_per_day,
max(session_duration_in_minutes) as longest_session_in_minutes
from session_durations
group by 1,2
having sum(session_duration_in_minutes) > 20
order by 3 desc;
Part 1: data prep on each user's events
The query in the data
CTE is tricky, so let's look at its output for just one user upfront:
| user_id | time | minutes_since_last_event | new_session | new_day |
|---------|---------------------|--------------------------|-------------|---------|
| 467801 | 2020-04-06 16:19:43 | | 1 | 1 |
| 467801 | 2020-04-06 17:42:21 | 82.6 | 1 | 0 |
| 467801 | 2020-04-06 17:46:04 | 3.7 | 0 | 0 |
| 467801 | 2020-04-06 17:48:39 | 2.6 | 0 | 0 |
| ... | ... | ... | ... | ... |
Here we see a partial list of events for user 467801
ordered by time.
The lag()
function allows us to calculate the time between one event and the event prior. That's what populates the minutes_since_last_event
column. The first row is null
since there is no preceding event. You can see that 82.6 minutes have passed from the event at 2020-04-06 17:42:21
and the prior event at 2020-04-06 16:19:43
.
The column new_session
is a binary. It looks at minutes_since_last_event
, and says if more than 30 minutes has passed since the last event, then it's a new session and records a 1. The column new_day
does similarly, examining if an event is a different day than the one prior.
We'll use these new_session
and new_day
columns later.
Part 2: grouping events into sessions
Just like in Part 1, let's look at the output of the CTE data2
for just one user upfront:
| user_id | time | user_session_number | user_session_start | user_day_number |
|---------|---------------------|---------------------|---------------------|-----------------|
| 467801 | 2020-04-06 16:19:43 | 1 | 2020-04-06 16:19:43 | 1 |
| 467801 | 2020-04-06 17:42:21 | 2 | 2020-04-06 16:19:43 | 1 |
| 467801 | 2020-04-06 17:46:04 | 2 | 2020-04-06 16:19:43 | 1 |
| 467801 | 2020-04-06 17:48:39 | 2 | 2020-04-06 16:19:43 | 1 |
| ... | ... | ... | ... | ... |
We see the same ordering of events by time. A Window function creates the column user_session_number
. It creates a running count of each user session by tabulating the new_session
binary (from the previous CTE data
). This is easier shown than written:
The same approach creates the column user_day_number
. And user_session_start
finds the earliest timestamp for a session for that user.
Part 3: defining each session's start and end time
Again we'll look at the output of the CTE data3
for just one user.
| user_id | time | user_session_number | session_start | session_end | user_day_number |
|---------|---------------------|---------------------|---------------------|---------------------|-----------------|
| 467801 | 2020-04-06 16:19:43 | 1 | 2020-04-06 16:19:43 | 2020-04-06 16:19:43 | 1 |
| 467801 | 2020-04-06 17:42:21 | 2 | 2020-04-06 17:42:21 | 2020-04-06 17:42:21 | 1 |
| 467801 | 2020-04-06 17:46:04 | 2 | 2020-04-06 17:42:21 | 2020-04-06 17:46:04 | 1 |
| 467801 | 2020-04-06 17:48:39 | 2 | 2020-04-06 17:42:21 | 2020-04-06 17:48:39 | 1 |
| ... | ... | ... | ... | ... | ... |
In this CTE, the query calculates the session_start
by finding the earliest (minimum) timestamp of any event for each session (defined by user_session_number
). Look at how the session_start
changes for user_session_number = 1
vs user_session_number = 2
.
Similarly, the query calculates the session_end
by finding the latest (maximum) timestamp of any event for each session (defined by user_session_number
). Look at the session_end
for user_session_number = 2
. The query iteratively records a higher session_end
as it realizes that with each event in the same session there's a higher possible max.
Part 4: calculating session durations
Home stretch! Here's the output of the CTE session_durations
. As the name implies, we calculate how long each session lasted.
| user_id | email | user_session_number | num_days | session_start | max_session_end | session_duration_in_minutes |
|---------|-----------------------|---------------------|----------|---------------------|---------------------|-----------------------------|
| 467801 | lukasz@cookbright.com | 1 | 1 | 2020-04-06 16:19:43 | 2020-04-06 16:19:43 | 0.0 |
| 467801 | lukasz@cookbright.com | 2 | 1 | 2020-04-06 17:42:21 | 2020-04-06 18:25:36 | 43.3 |
| 467801 | lukasz@cookbright.com | 3 | 1 | 2020-04-06 19:05:07 | 2020-04-06 20:58:47 | 113.7 |
| 467801 | lukasz@cookbright.com | 4 | 2 | 2020-04-07 15:34:31 | 2020-04-07 15:34:31 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
Recall how in Part 3, we found an iteratively higher session_end
?
In Part 4, we take the highest session_end
for that particular session (max_session_end
). We then calculate the minutes between max_session_end
and session_start
using extract(epoch from...)
(tutorial).
Lastly, we join to a users
table in this step to get email
.
Part 5: aggregating session durations and other calculations
We've computed all the necessary pieces in our previous steps. Now it's straightforward arithmetic and MAX()
functions to calculate hours in product, the number of days the user has used your product in the period, the hours per day they spend in the product, and their longest session š
Final Output:
| user_id | email | hours | num_days_used_product | hours_per_day | longest_session_minutes |
|---------|-----------------------|-------|-----------------------|---------------|-------------------------|
| 467801 | lukasz@cookbright.com | 4.7 | 5 | 0.9 | 113.7 |
| 519997 | ottilie@yobbit.com | 4.3 | 5 | 0.9 | 40.9 |
| 553461 | anika@dogshare.com | 2.8 | 3 | 0.9 | 126.7 |
| 506538 | rebeca@wearcade.com | 2.5 | 3 | 0.8 | 87.5 |
| ... | ... | ... | ... | ... | ... |
Try it yourself?
Run this template against ourĀ sample database that mirrors real startup data. See the connection credentials, then connect in PopSQL.
Spread the word
Tweet