I make SQL-queries in Hive using Hue. When I use inner join between two tables, one of the table’s column get new distinct values, that wasn’t in this column before join.
I have table periods with columns period_name and period_first_dt, table traffic with columns period_first_dt, product_id and traffic_sum. Detailed info about tables below:
Table periods
Column | Data type | Value example |
---|---|---|
period_name | string | DEC-23 |
period_first_dt | datetime | 2023-12-01 00:00:00 |
Table traffic
Column | Data type | Value example |
---|---|---|
period_first_dt | string | 2018-01-01T00:00:00Z |
product_id | bigint | 135 |
traffic_sum | bigint | 123123 |
I want to have period_name in traffic table and calculate sum of traffic_sum by period_name. I inner join two tables like this:
with periods as (
select
period_name,
cast(period_first_dt as date) as period_first_dt
from
periods
where 1=1
and period_name in ('NOV-23', 'DEC-23')
)
select
p.period_name,
sum(t.traffic_sum) as traffic_sum
from
traffic t
join
periods p on p.period_first_dt = cast(substring(t.period_first_dt, 1, 10) as date)
where 1=1
and t.product_id = 135
group by
p.period_name
;
CTE periods have following values:
period_name | period_first_dt |
---|---|
DEC-23 | 2023-12-01 |
NOV-23 | 2023-11-01 |
Query above have following results:
period_name | traffic_sum |
---|---|
2023-11-01 | 1894758884807.576 |
2023-12-01 | 1953671751139.3423 |
DEC-23 | 132211702997.26137 |
NOV-23 | 90683990757.3648 |
Column period_name somehow have new values that wasn’t before inner join: 2023-11-01
and 2023-12-01
. When I change condition in CTE and period_name in ('NOV-23', 'DEC-23')
to and period_name = 'DEC-23'
, results are without new values in period_name:
period_name | traffic_sum |
---|---|
DEC-23 | 2085883454136.6035 |
When I remove condition and t.product_id = 135
and leave condition in CTE and period_name in ('NOV-23', 'DEC-23')
, results are somehow without new values in period_name:
period_name | traffic_sum |
---|---|
DEC-23 | 2658701852149.3955 |
NOV-23 | 2518165900926.1953 |
Why new distinct values appear in period_name and how to fix that?