5

Useful SQL queries for analyzing Segment data

Below you'll find a library of some of the most useful SQL queries we've seen customers use with their Warehouse data. You can run these using your own Segment data with little or no modification.

Queries index

  • Events over time period
  • Defining sessions
  • Identifies to users
  • Counts of user traits
  • Groups to accounts

Events over time period

Tracking events

The track allows you to record any actions your users perform. A track call takes three parameters: the userId, the event, and any optional properties.

Here's a basic track call:

analytics.track('Completed Order',
  item: 'pants',
  color: 'blue'
  size: '32x32'
  payment: 'credit card'
});

And another completed order track call might look like this:

analytics.track('Completed Order', {
 item: 'shirt',
 color: 'green'
 size: 'Large'
 payment: 'paypal'
});

Each track call is stored as a distinct row in a single Redshift table called tracks. To get a table of your completed orders, you can run the following query:

select *
from initech.tracks
where event = 'completed_order'

which would return a table like this:

But why are there columns in the table that weren't a part of our track call, like event_id. This is because the track method automatically includes additional properties of the event, like event_idsent_at, and user_id (for client-side libraries)!

Grouping events by day

What if you want to know how many orders have been completed through time? You can use the date() and count() function with the sent_at timestamp:

select date(sent_at) as date, count(event)
from initech.tracks
where event = 'completed_order'
group by date

which will return a table like this:

Want to see how many shirts & pants were sold each day? We can easily query this with case statements:

select date(sent_at) as date,
sum(case when item = 'shirt' then 1 else 0 end) as shirts,
sum(case when item = 'pants' then 1 else 0 end) as pants
from initech.tracks
where event = 'completed_order'
group by date

And there you have it, your completed orders by item and by day!


Defining sessions

Segment’s API does not impose any restrictions on your data with regard to user sessions.

Sessions aren’t fundamental facts about the user experience. They’re stories we build around the data to understand how customers actually use the product in their day-to-day lives. And since Segment’s API is about collecting raw, factual data, we don’t have an API for collecting sessions. We leave session interpretation to our partners, which let you design how you measure sessions based on how customers use your product.

For more on why we don’t collect session data at the API level, check out our blog post here!

How to define user sessions in SQL

Each of our SQL partners allows you to define sessions based on your specific business needs. With Looker, for example, you can take advantage of their persistent derived tables and LookML modeling language to layer sessionization on top of your Segment SQL data. We recommend checking out their approach here!

For defining sessions with raw SQL, the best query and explanation we’ve come across is from our friends at Mode Analytics.

Here’s the query to make it happen, but we definitely recommend checking out their blog post as well! They walk you through the reasoning behind the query, what each portion accomplishes, how you can tweak it to suit your needs, and what kind of further analysis you can do on top of it.

-- Finding the start of every session
SELECT *
  FROM (
       SELECT *,
              LAG(sent_at,1) OVER (PARTITION BY user_id ORDER BY sent_at) AS last_event
        FROM "your_source".tracks
      ) last
WHERE EXTRACT('EPOCH' FROM sent_at) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
   OR last_event IS NULL

-- Mapping every event to its session
SELECT *,
       SUM(is_new_session) OVER (ORDER BY user_id, sent_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS global_session_id,
       SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY sent_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS user_session_id
  FROM (
       SELECT *,
              CASE WHEN EXTRACT('EPOCH' FROM sent_at) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
                     OR last_event IS NULL
                   THEN 1 ELSE 0 END AS is_new_session
         FROM (
              SELECT *,
                     LAG(sent_at,1) OVER (PARTITION BY user_id ORDER BY sent_at) AS last_event
                FROM "your_source".tracks
              ) last
       ) final

Identifies to users

Historical traits

The identify method ties user attributes to a userId.

analytics.identify('bob123',{
  email: 'bob@initech.com',
  plan:m'Free'
});

As these user traits change over time, you can continue calling the identify method to update their changes. Here we’ll update Bob’s account plan to “Premium”.

analytics.identify('bob123', {
  email: 'bob@initech.com',
  plan: 'Premium'
});

Each identify call is stored as a distinct row in a single Redshift table calledidentifies. To see how a user’s plan changes over time, you can run the following query:

select email, plan, sent_at
from initech.identifies
where email = 'bob@initech.com'

The previous example will return a table of Bob’s account information, with each entry representing the state of the account at different times.

If you want to see what your users looked like at a previous point in time, that data is right there in your identifies table! (To get this table for your users, replace ‘initech’ with your source slug).

But what if you only want to see the most recent state of the user? Luckily, we can convert the identifies table into a distinct users table by taking the most recent identify call for each account.

Converting the identifies table into a users table

The following query will return your identifies table:

select *
from initech.identifies

which will return a table like this: 

However, if all you want is a table of distinct users with their current traits, you can do so with the following query:

with identifies as (
  select user_id,
         email,
         plan,
         sent_at,
         row_number() over (partition by user_id order by sent_at desc) as rn
  from initech.identifies
),
users as (
  select user_id,
         email,
         plan
  from identifies
  where rn = 1
)

select *
from users

This query will return a table with your distinct users, without duplicates!

How it works

To get the most recent user traits, we want to select the most recent row for each unique user_id. To do this, we can partition the table by user_id, sort each partition by the sent_at timestamp, and use the row_number() function to give the row’s position inside the partition:

From there, we create a new table called users that selects the latest entry within each partition, or where rn = 1. And there you have it, the query returns the latest traits for your users!


Counts of user traits

Let's say you have an `identifies` table that looks like this: 

If we want to query the traits of these users, we first need to convert the identifies table into a users table. From there, we can run a query like this to get a count of users with each type of plan:

with identifies as (
  select user_id,
         email,
         plan,
         sent_at,
         row_number() over (partition by user_id order by sent_at desc) as rn
  from initech.identifies
),
users as (
  select plan
  from identifies
  where rn = 1
)

select sum(case when plan = 'Premium' then 1 else 0 end) as premium,
       sum(case when plan = 'Free' then 1 else 0 end) as free
from users

And there you go, a count of users with each type of plan!


Groups to accounts

Historical traits

The group method ties a user to a group. Be it a company, organization, account, source, team or whatever other crazy name you came up with for the same concept! It also lets you record custom traits about the group, like industry or number of employees.

Here’s what a basic group call looks like:

analytics.group'0e8c78ea9d97a7b8185e8632', {
  name: 'Initech',
  industry: 'Technology',
  employees: 329,
  plan: 'Premium'
});

As these group traits change over time, you can continue calling the group method to update their changes.

analytics.group('0e8c78ea9d97a7b8185e8632', {
  name: 'Initech',
  industry: 'Technology',
  employees: 600,
  plan: 'Enterprise'
});

Each group call is stored as a distinct row in a single Redshift table calledgroups. To see how a group changes over time, you can run the following query:

select name, industry, plan, employees, sent_at
from initech.groups
where name = 'Initech'

The previous example will return a table of Initech’s group information, with each entry representing the state of the account at different times.

If you want to see a group’s traits at a previous point in time, that data is right there in your groups table! (To get this table for your groups, replace ‘initech’ with your source slug).

But what if you only want to see the most recent state of the group? We can convert the groups table into a distinct groups table by taking the most recent groups call for each account.

Converting the groups rable into an organizations table

The following query will return your groups table.

select *
from initech.groups

which will return a table like this:

However, if all you want is a table of distinct groups and their current traits, you can do so with the following query:

with groups as (
  select name,
         industry,
         employees,
         plan,
         sent_at,
         row_number() over (partition by name order by sent_at desc) as rn
  from initech.groups
),
organizations as (
  select name,
         industry,
         employees,
         plan
  from groups
  where rn = 1
)

select *
from organizations

This query will return a table with your distinct groups, without duplicates!

How it works

To get the most recent group traits, we want to select the most recent entry for each group in the groupstable. To do this, we can partition the table by group name, sort each partition by the sent_at timestamp, and use therow_number() function to give the row’s position inside the partition:

From there, we create a new table called ‘organizations’ that selects only the latest entry for each name, or where ‘rn’ = 1. And there you have it, the query returns the latest traits for your groups!