I have the two following tables:
<code>CREATE TABLE my_table1 (
color VARCHAR(50),
d1 DATE,
var1 VARCHAR(50)
);
INSERT INTO my_table1 (color, d1, var1)
VALUES
('red', '2010-01-01', 'a'),
('red', '2011-01-01', 'b'),
('red', '2012-01-01', 'c'),
('red', '2013-01-01', 'b'),
('blue', '2013-01-01', 'a'),
('blue', '2014-01-01', 'd'),
('blue', '2015-01-01', 'f'),
('green', '2020-01-01', 'a');
CREATE TABLE my_table2 (
color VARCHAR(50),
d2 DATE,
var2 VARCHAR(50)
);
INSERT INTO my_table2 (color, d2, var2)
VALUES
('red', '2008-01-01', 'zz'),
('red', '2020-01-01', 'cc'),
('blue', '1999-01-01', 'xx'),
('blue', '2005-01-01', 'yy'),
('purple', '2022-01-01', 'hh');
</code>
<code>CREATE TABLE my_table1 (
color VARCHAR(50),
d1 DATE,
var1 VARCHAR(50)
);
INSERT INTO my_table1 (color, d1, var1)
VALUES
('red', '2010-01-01', 'a'),
('red', '2011-01-01', 'b'),
('red', '2012-01-01', 'c'),
('red', '2013-01-01', 'b'),
('blue', '2013-01-01', 'a'),
('blue', '2014-01-01', 'd'),
('blue', '2015-01-01', 'f'),
('green', '2020-01-01', 'a');
CREATE TABLE my_table2 (
color VARCHAR(50),
d2 DATE,
var2 VARCHAR(50)
);
INSERT INTO my_table2 (color, d2, var2)
VALUES
('red', '2008-01-01', 'zz'),
('red', '2020-01-01', 'cc'),
('blue', '1999-01-01', 'xx'),
('blue', '2005-01-01', 'yy'),
('purple', '2022-01-01', 'hh');
</code>
CREATE TABLE my_table1 (
color VARCHAR(50),
d1 DATE,
var1 VARCHAR(50)
);
INSERT INTO my_table1 (color, d1, var1)
VALUES
('red', '2010-01-01', 'a'),
('red', '2011-01-01', 'b'),
('red', '2012-01-01', 'c'),
('red', '2013-01-01', 'b'),
('blue', '2013-01-01', 'a'),
('blue', '2014-01-01', 'd'),
('blue', '2015-01-01', 'f'),
('green', '2020-01-01', 'a');
CREATE TABLE my_table2 (
color VARCHAR(50),
d2 DATE,
var2 VARCHAR(50)
);
INSERT INTO my_table2 (color, d2, var2)
VALUES
('red', '2008-01-01', 'zz'),
('red', '2020-01-01', 'cc'),
('blue', '1999-01-01', 'xx'),
('blue', '2005-01-01', 'yy'),
('purple', '2022-01-01', 'hh');
These tables look like this:
<code>| color | d1 | var1 |
|-------|------------|------|
| red | 2010-01-01 | a |
| red | 2011-01-01 | b |
| red | 2012-01-01 | c |
| red | 2013-01-01 | b |
| blue | 2013-01-01 | a |
| blue | 2014-01-01 | d |
| blue | 2015-01-01 | f |
| green | 2020-01-01 | a |
| color | d2 | var2 |
|--------|------------|------|
| red | 2008-01-01 | zz |
| red | 2020-01-01 | cc |
| blue | 1999-01-01 | xx |
| blue | 2005-01-01 | yy |
| purple | 2022-01-01 | hh |
</code>
<code>| color | d1 | var1 |
|-------|------------|------|
| red | 2010-01-01 | a |
| red | 2011-01-01 | b |
| red | 2012-01-01 | c |
| red | 2013-01-01 | b |
| blue | 2013-01-01 | a |
| blue | 2014-01-01 | d |
| blue | 2015-01-01 | f |
| green | 2020-01-01 | a |
| color | d2 | var2 |
|--------|------------|------|
| red | 2008-01-01 | zz |
| red | 2020-01-01 | cc |
| blue | 1999-01-01 | xx |
| blue | 2005-01-01 | yy |
| purple | 2022-01-01 | hh |
</code>
| color | d1 | var1 |
|-------|------------|------|
| red | 2010-01-01 | a |
| red | 2011-01-01 | b |
| red | 2012-01-01 | c |
| red | 2013-01-01 | b |
| blue | 2013-01-01 | a |
| blue | 2014-01-01 | d |
| blue | 2015-01-01 | f |
| green | 2020-01-01 | a |
| color | d2 | var2 |
|--------|------------|------|
| red | 2008-01-01 | zz |
| red | 2020-01-01 | cc |
| blue | 1999-01-01 | xx |
| blue | 2005-01-01 | yy |
| purple | 2022-01-01 | hh |
I am trying to accomplish the following task:
- Using my_table2 as the main table, I want to join to my_table1 where color=color and d1 is closest to d2
- After the join, the columns from my_table1 should be called closest_color, closest_d1, closest_var1
- The final result should have the same number of rows as my_table2 (purple would have all NULLS)
I thought of using a function that can take the absolute value of the date differences and then do a full outer left join to make sure all rows from my_table2 are captured – and then use window functions to rank/partition to identify the smallest difference:
<code>WITH closest_dates AS (
SELECT
t2.color,
t2.d2,
t2.var2,
t1.color AS closest_color,
t1.d1 AS closest_d1,
t1.var1 AS closest_var1,
ROW_NUMBER() OVER(PARTITION BY t2.color, t2.d2 ORDER BY ABS(DATEDIFF(day, t2.d2, t1.d1))) as rn
FROM
my_table2 t2
LEFT OUTER JOIN
my_table1 t1
ON
t1.color = t2.color
)
SELECT
color,
d2,
var2,
closest_color,
closest_d1,
closest_var1
FROM
closest_dates
WHERE
rn = 1;
</code>
<code>WITH closest_dates AS (
SELECT
t2.color,
t2.d2,
t2.var2,
t1.color AS closest_color,
t1.d1 AS closest_d1,
t1.var1 AS closest_var1,
ROW_NUMBER() OVER(PARTITION BY t2.color, t2.d2 ORDER BY ABS(DATEDIFF(day, t2.d2, t1.d1))) as rn
FROM
my_table2 t2
LEFT OUTER JOIN
my_table1 t1
ON
t1.color = t2.color
)
SELECT
color,
d2,
var2,
closest_color,
closest_d1,
closest_var1
FROM
closest_dates
WHERE
rn = 1;
</code>
WITH closest_dates AS (
SELECT
t2.color,
t2.d2,
t2.var2,
t1.color AS closest_color,
t1.d1 AS closest_d1,
t1.var1 AS closest_var1,
ROW_NUMBER() OVER(PARTITION BY t2.color, t2.d2 ORDER BY ABS(DATEDIFF(day, t2.d2, t1.d1))) as rn
FROM
my_table2 t2
LEFT OUTER JOIN
my_table1 t1
ON
t1.color = t2.color
)
SELECT
color,
d2,
var2,
closest_color,
closest_d1,
closest_var1
FROM
closest_dates
WHERE
rn = 1;
I am not sure how to perform absolute date differences in Netezza – can someone please show me how to do this correctly?
Thanks!