I am working with a table that registers the hourly measurements taken by some devices our clients have purchased. A mock sample of it can be created with the following code:
import datetime

import numpy as np
import pandas as pd
from dateutil.relativedelta import relativedelta

contract_code = [1927318] * 168 + [73812631] * 168
client_code = ["090909"] * 168 + ["101010"] * 168
device = ["QWERTY"] * 168 + ["ASDFG"] * 168
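# one week of hourly timestamps (24 * 7 = 168); used once per device below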
one_week_hourly_data = pd.date_range(
start=datetime.datetime(2024, 5, 6, 0, 0, 0),
end=datetime.datetime(2024, 5, 12, 23, 59, 59),
freq="1h",
).to_list()
measurements = np.random.randint(low=0, high=4000, size=168 * 2)
mock_data_dict = {
"contract_code": contract_code,
"client_code": client_code,
"device": device,
"measurement_date_hour": one_week_hourly_data + one_week_hourly_data,
"measurements": measurements,
}
mock_data = pd.DataFrame.from_dict(mock_data_dict)
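For the SQL side of the question, assume the real data lives in a table shaped roughly like this (hypothetical DDL, column names matching the dataframe; I will call it MEASUREMENTS_TABLE in the sketches below):
-- hypothetical table mirroring the mock dataframe; the real one has
-- thousands of millions of rows
CREATE TABLE measurements_table (
    contract_code         NUMBER,
    client_code           VARCHAR2(10),
    device                VARCHAR2(10),
    measurement_date_hour DATE,  -- Oracle DATE stores date and time of day
    measurements          NUMBER
);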
As you can see from the code or by looking at the dataframe, this mock table has one week's worth of measurements from 2 different devices purchased by 2 different clients. What I now need to do is check whether each device has at least one month of data counting back from its last measurement and, if not, create the missing rows and assign them the average measurement, by weekday and hour, of the non-missing values. For example, the last measurement of device QWERTY in the mock data is 2024-05-12 23:00, so I need one row per hour from 2024-04-12 23:00 through 2024-05-12 23:00, and a missing hour falling on, say, a Tuesday at 14:00 would be filled with the average of that device's observed Tuesday 14:00 measurements.
All of this can be done fairly easily in Python with the following code (it could probably be optimized, but at least it does the job):
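# 1) average measurement per device, keyed by weekday and hour of day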
mean_values = mock_data.groupby(
[
"contract_code",
"client_code",
"device",
mock_data.measurement_date_hour.dt.weekday,
mock_data.measurement_date_hour.dt.hour,
]
).measurements.mean()
mean_values.index.names = ["contract_code", "client_code", "device", "weekday", "hour"]
mean_values = mean_values.reset_index()
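# 2) last measurement per device and the one-month window ending there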
end_date = mock_data.groupby(["contract_code", "client_code", "device"]).measurement_date_hour.max()
start_date = end_date.apply(lambda x: x - relativedelta(months=1))
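# 3) reindex each device to one row per hour over its one-month window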
mock_data = mock_data.set_index(["measurement_date_hour", "contract_code", "client_code", "device"])
multiindex = [
[
(date, group[0], group[1], group[2])
for date in pd.date_range(
start=start_date[group], end=end_date[group], freq="1h"
)
]
for group, _ in mock_data.groupby(["contract_code", "client_code", "device"])
]
flat_index = [x for xs in multiindex for x in xs]
mock_data = mock_data.reindex(flat_index).reset_index()
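# 4) weekday/hour keys used to look up the fill values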
mock_data["weekday"] = mock_data.measurement_date_hour.dt.weekday
mock_data["hour"] = mock_data.measurement_date_hour.dt.hour
result = pd.concat(
[
mock_data[mock_data.measurements.notnull()],
mock_data.loc[
mock_data.measurements.isnull(),
[
"measurement_date_hour",
"contract_code",
"client_code",
"device",
"weekday",
"hour",
],
].merge(
mean_values,
how="left",
on=["contract_code", "client_code", "device", "weekday", "hour"],
),
]
).sort_values("measurement_date_hour", ignore_index=True)
However, the real table is stored in an Oracle database and has thousands of millions of rows, so I cannot do this job in Python: it would be prohibitively expensive to move that massive amount of data out of the database just to transform it.
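Parts of the logic do look straightforward to express in SQL. The weekday/hour averages, for instance, I would write roughly like this (an untested sketch against the hypothetical MEASUREMENTS_TABLE above; I derive the weekday as days since Monday so that it matches pandas' dt.weekday, where Monday is 0):
-- untested sketch: per-device average by weekday and hour of day
-- TRUNC(d) - TRUNC(d, 'IW') counts days since Monday (Monday = 0),
-- matching pandas' dt.weekday without depending on NLS settings
SELECT contract_code,
       client_code,
       device,
       TRUNC(measurement_date_hour) - TRUNC(measurement_date_hour, 'IW') AS weekday_num,
       TO_NUMBER(TO_CHAR(measurement_date_hour, 'HH24'))                 AS hour_of_day,
       AVG(measurements)                                                 AS avg_measurement
FROM measurements_table
GROUP BY contract_code,
         client_code,
         device,
         TRUNC(measurement_date_hour) - TRUNC(measurement_date_hour, 'IW'),
         TO_NUMBER(TO_CHAR(measurement_date_hour, 'HH24'));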
Therefore, here goes my question: how can I translate this code into an Oracle SQL query? It does not necessarily have to be one single query; intermediate tables or views can be created too. I have tried a lot of things but I am still not able to achieve my desired result.
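Where I get stuck is materializing the missing hourly rows per device and stitching everything together. The closest overall shape I have come up with is the rough sketch below (untested; it assumes measurement_date_hour is a DATE so that adding 1/24 adds one hour, and the correlated CONNECT BY inside CROSS APPLY needs Oracle 12c or later). I am not confident it is correct, let alone that it will perform well at this scale:
-- untested sketch; MEASUREMENTS_TABLE is the hypothetical table from above
WITH bounds AS (
    -- last measurement per device and the one-month window ending there
    -- (I hope ADD_MONTHS(ts, -1) plays the role of relativedelta(months=1),
    -- though the two may disagree around month ends)
    SELECT contract_code,
           client_code,
           device,
           ADD_MONTHS(MAX(measurement_date_hour), -1) AS start_date,
           MAX(measurement_date_hour)                 AS end_date
    FROM measurements_table
    GROUP BY contract_code, client_code, device
),
calendar AS (
    -- one row per device per hour of its window; the correlated
    -- CONNECT BY inside CROSS APPLY needs Oracle 12c or later
    SELECT b.contract_code,
           b.client_code,
           b.device,
           h.measurement_date_hour
    FROM bounds b
    CROSS APPLY (
        SELECT b.start_date + (LEVEL - 1) / 24 AS measurement_date_hour
        FROM dual
        CONNECT BY b.start_date + (LEVEL - 1) / 24 <= b.end_date
    ) h
),
mean_values AS (
    -- fill values: average by device, weekday and hour (as in the first sketch)
    SELECT contract_code,
           client_code,
           device,
           TRUNC(measurement_date_hour) - TRUNC(measurement_date_hour, 'IW') AS weekday_num,
           TO_NUMBER(TO_CHAR(measurement_date_hour, 'HH24'))                 AS hour_of_day,
           AVG(measurements)                                                 AS avg_measurement
    FROM measurements_table
    GROUP BY contract_code, client_code, device,
             TRUNC(measurement_date_hour) - TRUNC(measurement_date_hour, 'IW'),
             TO_NUMBER(TO_CHAR(measurement_date_hour, 'HH24'))
)
SELECT c.contract_code,
       c.client_code,
       c.device,
       c.measurement_date_hour,
       -- observed value if present, otherwise the weekday/hour average
       NVL(m.measurements, mv.avg_measurement) AS measurements
FROM calendar c
LEFT JOIN measurements_table m
       ON  m.contract_code         = c.contract_code
       AND m.client_code           = c.client_code
       AND m.device                = c.device
       AND m.measurement_date_hour = c.measurement_date_hour
LEFT JOIN mean_values mv
       ON  mv.contract_code = c.contract_code
       AND mv.client_code   = c.client_code
       AND mv.device        = c.device
       AND mv.weekday_num   = TRUNC(c.measurement_date_hour) - TRUNC(c.measurement_date_hour, 'IW')
       AND mv.hour_of_day   = TO_NUMBER(TO_CHAR(c.measurement_date_hour, 'HH24'))
ORDER BY c.measurement_date_hour;
Any help is much appreciated! And if you need further details to answer the question I will gladly provide them when asked 🙂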