I have these two tables:
CREATE TABLE my_table1 (
name VARCHAR(50),
var1 DATE,
var2 INT
);
INSERT INTO my_table1 (name, var1, var2) VALUES
('john', '2010-01-01', 94),
('john', '2010-01-04', 106),
('john', '2015-01-01', 99),
('alex', '2010-01-01', 96),
('alex', '2018-01-01', 96),
('sara', '2005-01-01', 94),
('sara', '2006-01-01', 90),
('tim', '1999-01-01', 101);
CREATE TABLE my_table2 (
name VARCHAR(50),
var3 DATE,
var4 CHAR(1)
);
INSERT INTO my_table2 (name, var3, var4) VALUES
('john', '2001-01-01', 'a'),
('john', '2002-01-01', 'b'),
('alex', '2021-01-01', 'c'),
('alex', '2022-01-01', 'd'),
('sara', '1999-01-01', 'e'),
('sara', '2023-01-01', 'f');
I am trying to answer the following types of questions (the final result should have the same number of rows as my_table1 and all columns from both tables):
-
problem 1: for each name in my_table2, find the most recent row (based on date). join this row to my_table1. however, after the join – ensure that the date from my_table2 is greater than the date from my_table1 (if not, then remove)
-
problem 2: full join my_table1 and my_table2 based on name. after the join, only keep rows where the date from my_table2 is greater than the date from my_table1. when this is done (i.e. in the joined file), for each row from my_table1, count how many rows from my table_2 were retained
For problem 1, I tried to solve the problem like this (I can’t figure out how to include a NA row for John and Tim here):
# problem 1
SELECT t1.*, t2.*
FROM my_table1 t1
JOIN (
SELECT name, var3, var4
FROM (
SELECT name, var3, var4,
ROW_NUMBER() OVER (PARTITION BY name ORDER BY var3 DESC) as rn
FROM my_table2
) tmp
WHERE rn = 1
) t2
ON t1.name = t2.name
WHERE t1.var1 < t2.var3;
From problem 2, I was not sure how to proceed:
SELECT
t1.*,
t2.var3,
t2.var4,
COUNT(t2.name) OVER (PARTITION BY t1.name, t1.var1) as count
FROM
my_table1 t1
INNER JOIN
my_table2 t2
ON
t1.name = t2.name AND
EXTRACT(YEAR FROM t1.var1) = EXTRACT(YEAR FROM t2.var3) AND
t2.var3 > t1.var1;
The final answer for problem 2 should look like this:
name date var2 count
alex 2010-01-01 96 2
alex 2018-01-01 96 2
sara 2005-01-01 94 1
sara 2006-01-01 90 1
john 2010-01-01 94 0
john 2010-01-04 106 0
john 2015-01-01 99 0
tim 1999-01-01 101 0
Is this the correct way to work on these problems?