I need to assign a time_group to each logged_time based on the following logic:
- For the earliest timestamp, the time_group should be the same as logged_time.
- For subsequent rows, compare the logged_time to the previous row’s time_group:
  - If the difference between logged_time and time_group is within 3 minutes, use the previous row’s time_group.
  - If the difference exceeds 3 minutes, set the time_group to the current logged_time.
logged_time | time_group (expected) |
---|---|
2023-12-10 17:03:05 | 2023-12-10 17:03:05 |
2023-12-10 17:05:02 | 2023-12-10 17:03:05 |
2023-12-10 17:06:18 | 2023-12-10 17:06:18 |
2023-12-10 17:10:07 | 2023-12-10 17:10:07 |
2023-12-11 08:31:27 | 2023-12-11 08:31:27 |
I’ve been trying to implement this in BigQuery, but I can’t seem to achieve the desired output.
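You can get close to this in BigQuery with window functions: LAG and TIMESTAMP_DIFF detect where a new group starts, a CASE expression marks those rows, and LAST_VALUE with IGNORE NULLS carries the marker forward. Note that the query below compares each row with the previous row’s logged_time, not with the previous row’s time_group. On the sample data the two rules differ for 17:06:18: it is only 76 seconds after 17:05:02, so the query keeps it in the 17:03:05 group, whereas the expected output starts a new group there because it is 193 seconds after 17:03:05.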
```sql
WITH grouped_data AS (
  SELECT
    logged_time,
    -- Calculate the time difference in seconds from the previous row
    TIMESTAMP_DIFF(logged_time, LAG(logged_time) OVER (ORDER BY logged_time), SECOND) AS diff_seconds,
    -- Determine if a new group should start
    CASE
      WHEN LAG(logged_time) OVER (ORDER BY logged_time) IS NULL OR
           TIMESTAMP_DIFF(logged_time, LAG(logged_time) OVER (ORDER BY logged_time), SECOND) > 180
      THEN logged_time
      ELSE NULL
    END AS new_group_marker
  FROM your_table_name
),
final_grouping AS (
  SELECT
    logged_time,
    -- Propagate the new group marker forward
    LAST_VALUE(new_group_marker IGNORE NULLS) OVER (
      ORDER BY logged_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS time_group
  FROM grouped_data
)
SELECT
  logged_time,
  time_group
FROM final_grouping
ORDER BY logged_time;
```
- Time difference:
  - LAG(logged_time) OVER (ORDER BY logged_time) retrieves the logged_time of the previous row.
  - TIMESTAMP_DIFF calculates the difference in seconds between the current row and the previous one.
- New group marker:
  - A CASE expression marks the rows where a new group starts: the first row (where LAG(logged_time) is NULL) and any row whose gap to the previous row exceeds 3 minutes (180 seconds).
- Group propagation:
  - LAST_VALUE with IGNORE NULLS carries the most recent non-NULL new_group_marker forward, so every row in a group gets that group’s first logged_time as its time_group.
- Sorting and displaying results:
  - The final SELECT returns logged_time and time_group, ordered by logged_time.
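To cross-check a query’s output against the rule in the question, it can help to have a plain sequential reference. The sketch below is not BigQuery code; it is a minimal Python illustration, and the names `group_by_anchor`, `group_by_previous_row`, and `WINDOW` are hypothetical. It contrasts the rule as stated in the question (compare with the current group’s starting timestamp) with the gap-to-previous-row rule that a LAG plus LAST_VALUE query computes, using the sample rows from the question.

```python
from datetime import datetime, timedelta

# 3-minute threshold from the question (180 seconds in the SQL).
WINDOW = timedelta(minutes=3)


def group_by_anchor(times):
    """Rule from the question: compare each row with the start of the current group."""
    groups = []
    anchor = None
    for t in sorted(times):
        if anchor is None or t - anchor > WINDOW:
            anchor = t  # this row starts a new group
        groups.append(anchor)
    return groups


def group_by_previous_row(times):
    """Gap rule (LAG + LAST_VALUE): compare each row with the row just before it."""
    groups = []
    anchor = None
    previous = None
    for t in sorted(times):
        if previous is None or t - previous > WINDOW:
            anchor = t  # gap to the previous row is too large: new group
        groups.append(anchor)
        previous = t
    return groups


# Sample rows copied from the question.
sample = [
    datetime.fromisoformat(s)
    for s in (
        "2023-12-10 17:03:05",
        "2023-12-10 17:05:02",
        "2023-12-10 17:06:18",
        "2023-12-10 17:10:07",
        "2023-12-11 08:31:27",
    )
]

# Print logged_time, anchor-based group, and previous-row-based group side by side.
for t, by_anchor, by_previous in zip(
    sample, group_by_anchor(sample), group_by_previous_row(sample)
):
    print(t, "|", by_anchor, "|", by_previous)
```

The two functions agree on every sample row except 17:06:18, which the anchor rule treats as the start of a new group and the previous-row rule leaves in the 17:03:05 group. Because the anchor rule depends on the group assigned to the preceding row, it is inherently sequential; in SQL that usually means something like a recursive CTE or a procedural step rather than a single window function.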
I hope this helps!