I am working on a transaction dataset with one date column & one identifier column. Now retroactively I want to create a boolean column to check whether any particular activity (e.g. marketing campaign) should have been sent for that identifier if the last time a campaign sent was before 30 days. I can use lag function but the problem is that the starting point would keep on changing.
For example, on first transaction date a campaign would be sent, then for any transaction within next 30 days no campaign; then, let’s say next transaction happens on 35th day from first transaction; then, send a campaign and now the new counter for next 30 days should start from 35th day.
I’m haven’t been able to figure out how I can achieve it. I am using redshift sql (table for reference below)
Identifier | Date (YYYY-MM-DD time) | desired column |
---|---|---|
A | 2024-06-30 11:22:03 | FALSE |
A | 2024-06-16 14:02:36 | FALSE |
A | 2024-06-15 8:15:57 | TRUE |
A | 2024-05-24 14:30:57 | FALSE |
A | 2024-05-10 4:45:06 | FALSE |
A | 2024-05-08 15:19:48 | FALSE |
A | 2024-05-07 15:09:14 | FALSE |
A | 2024-05-06 4:16:39 | FALSE |
A | 2024-05-04 10:37:16 | TRUE |
A | 2024-04-08 5:02:00 | FALSE |
A | 2024-04-06 9:03:23 | FALSE |
A | 2024-03-30 11:05:55 | TRUE |
A | 2024-03-16 8:39:56 | FALSE |
A | 2024-03-15 14:06:10 | FALSE |
A | 2024-02-28 16:55:28 | TRUE |
Thanks for the help!
3
You are looking for an iteration. The first date is 2024-02-28. With this date you are looking for the first date that is after its 30 days range. That date is 2024-03-30. Then again you want to find the first date after its 30 day range, and so on.
Iteration is done with recursive queries in SQL.
with recursive
starters (identifier, ts) as
(
select identifier, min(ts)
from mytable
group by identifier
union all
select t.identifier, min(t.ts)
from starters s
join mytable t on t.identifier = s.identifier
and t.ts > s.ts + interval '30' day
group by t.identifier
)
select
t.*,
exists
(
select null
from starters s
where s.identifier = t.identifier
and s.ts = t.ts
) as flag
from mytable t
order by t.identifier, t.ts;
If redshift is not yet capable of handling aggregation in recursive CTEs as you say below in the comments, then use another way to get the top 1 row per group. Getting the top 1 means there does not exist a better one, so you may be able to solve this with NOT EXISTS
.
The part
select identifier, min(ts)
from mytable
group by identifier
can also be written as
select identifier, ts
from mytable t
where not exists
(
select null
from mytable better
where better.identifier = t.identifier
and better.ts < t.ts
)
for instance.
4
I tried kind of a hacky solution for now which worked –
with min_ts as
(
select
identifier,
min(ts) as min_ts
from my_table
group by 1
),
step_2 as
(
select
a.*,
floor(datediff(second, b.min_ts, a.ts)/2592000::DOUBLE PRECISION) as floor_days_between_ts
from my_table a
inner join min_ts b
on a.identifier = b.identifier
),
step_3 as
(
select
identifier,
floor_days_between_ts,
min(ts) as min_ts
from step_2
group by 1,2
)
select
a.*,
case when a.ts = b.min_ts then True::bool
else False::bool end as campaign
from step_2 a
left join step_3 b
on a.identifier = b.identifier and a.floor_days_between_ts = b.floor_days_between_ts
order by a.identifier, a.ts desc
;
1
One option to do it is to find refference dates. You could try to do it combining dates and differences in days between them with a few ctes…
-- S a m p l e D a t a :
Create Table tbl (identifier Varchar(6), dt TimeStamp);
Insert Into tbl VALUES
('A', '2024-06-30 11:22:03'),
('A', '2024-06-16 14:02:36'),
('A', '2024-06-15 08:15:57'),
('A', '2024-05-24 14:30:57'),
('A', '2024-05-10 04:45:06'),
('A', '2024-05-08 15:19:48'),
('A', '2024-05-07 15:09:14'),
('A', '2024-05-06 04:16:39'),
('A', '2024-05-04 10:37:16'),
('A', '2024-04-08 05:02:00'),
('A', '2024-04-06 09:03:23'),
('A', '2024-03-30 11:05:55'),
('A', '2024-03-16 08:39:56'),
('A', '2024-03-15 14:06:10'),
('A', '2024-02-28 16:55:28');
… removed the time part and combined the dates (practicaly cross join) to get all day differences between them …
WITH
dates as
( Select Row_Number() Over(Partition By d.identifier, d.dt_0 Order By d.dt_1) as rn_0,
d.identifier, d.dt_0, d.dt_1, d.diff
From ( Select t1.identifier, Date_Trunc('day', t1.dt) as dt_1,
Case When DATE_PART('day', Date_Trunc('day', t1.dt) - Date_Trunc('day', t2.dt)) >= 30
Then Date_Trunc('day', t2.dt)
End as dt_0,
DATE_PART('day', Date_Trunc('day', t1.dt) - Date_Trunc('day', t2.dt)) as diff,
Min(Date_Trunc('day', t1.dt)) Over(Partition By t1.identifier) as start_dt
From tbl t1
Inner Join tbl t2 ON(t1.identifier = t2.identifier)
) d
Where diff >= 30
),
… filtering the resultset …
grid as
( Select Distinct d0.*, d1.dt_0 as dt_00, d1.dt_1 as dt_11
From dates d0
Inner Join dates d1 on(d1.identifier = d0.identifier And
d1.dt_0 = d0.dt_1 And
d0.rn_0 = 1 And d1.rn_0 = 1
)
Where d0.rn_0 = 1 And d0.dt_1 = d1.dt_0
),
… getting refference dates using UNION (not UNION ALL) to get ridd of duplicates …
reff_dates as
( Select identifier, Min(dt_0) as dt_0 From grid Group By identifier UNION
Select identifier, Min(dt_1) From grid Group By identifier UNION
Select identifier, Min(dt_11) From grid Group By identifier UNION
Select Distinct identifier, dt_11 From grid
)
… reference dates are left joined to your data defining the values ‘TRUE’/’FALSE’ for the new column using Case expressions …
-- M a i n S Q L :
Select t.identifier, t.dt,
Case When r.dt_0 Is Null Then 'FALSE'
Else 'TRUE'
End as new_column
From tbl t
Left Join reff_dates r ON( r.dt_0 = Date_Trunc('day', t.dt) )
Order By t.dt Desc
/* R e s u l t :
identifier dt new_column
--------- ---------------------- ----------
A 2024-06-30 11:22:03 FALSE
A 2024-06-16 14:02:36 FALSE
A 2024-06-15 08:15:57 TRUE
A 2024-05-24 14:30:57 FALSE
A 2024-05-10 04:45:06 FALSE
A 2024-05-08 15:19:48 FALSE
A 2024-05-07 15:09:14 FALSE
A 2024-05-06 04:16:39 FALSE
A 2024-05-04 10:37:16 TRUE
A 2024-04-08 05:02:00 FALSE
A 2024-04-06 09:03:23 FALSE
A 2024-03-30 11:05:55 TRUE
A 2024-03-16 08:39:56 FALSE
A 2024-03-15 14:06:10 FALSE
A 2024-02-28 16:55:28 TRUE */
See the fiddle here.