I have a table with 5 columns: date – date of temperature measurement, process_id – identifier of the process for which the temperature is measured, batch_id – identifier of the upload batch, code – upload code, value – temperature value for the process.
hive table
Some dates are missed. It may be the case that several days in a row have been missed.
I need to fill in the missing dates so that each date has a fixed set of rows (all unique combinations of process_id, batch_id, code). For example, for the date 15.01.2021, the following combinnations of process_id, batch_id, code are present (orange color):
records of 2021-01-15
On the date 16.01.2021, only the following process_id values are present:
records of 2021-01-16
In the example above missing values with process_id=18 and process_id=19 should be added:
fillin on 2021-01-16
After adding records by missing processes, I need to fill in the missed values in columns
process_id, batch_id, code, value with the last not null value on previous date by combination process_id, batch_id, code. See the picture above.
Also there may be situations where a date is missing. In the example above, measurements for the date 17.01.2021 are missing.
In this case, missing records for the date 17.01.2021 should be added. A total of 5 records will be added (process_id: 15, 16, 17, 18, 19).
enter image description here
After adding the records for the missing date, the last non-null temperature value for each process from the previous date should be propagated.
My current query:
with cvc as (
select t.code,
t.process_id,
d.next_date,
t.batch_id
from (select date_add(from_date, i) as next_date
from ( select min(date) as from_date,
max(date) as to_date
from measures
where year(date)=2021 and month(date)>=1
) p
lateral view posexplode(split(space(datediff(p.to_date, p.from_date)))) pe as i, x
) d
cross join (
select distinct code,
process_id,
batch_id
from measures
where year(date)=2021 and month(date)>=1
) c
left join (
select code,
process_id,
date,
batch_id
from measures
where year(date)=2021 and month(date)>=1
) t
on t.code=c.code
and t.process_id=c.process_id
and t.date=d.next_date
and t.batch_id=c.batch_id
)
select
next_date,
coalesce(code, last_value(code, true)
over(partition by code, process_id, order by next_date
rows between unbounded preceding and current row)) as code,
...
...
coalesce(process_id, last_value(process_id, true)
over(partition by code, process_id, order by next_date
rows between unbounded preceding and current row)) as process_id
from cvc
order by
next_date,
code,
process_id,
batch_id
The query presented above correctly fills in the majority of values. However, in some cases, there are duplicates for the same date. For example, for process_id=19 on 2021-01-17, there are two entries with different values (highlighted in red).
enter image description here
The desired output:
enter image description here
This needs to be done in Hive.
Please, help
bogdan eleseev is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.