Question
How to fill missing values of a pandas dataframe based on the relationship between an existing preceeding row (predictions for a commodity), and an associated existing value in another column (actual values for a commodity).
Details
I have a pandas dataframe with 10 columns and 40 rows. The columns are Date, Actual, time_from_actual_1, time_from_actual_2, time_from_actual_3...
up to time_from_actual_8
.
The Actual
column contains actual values for a commodity with hourly timestamps in the Date
column. The time_from_actual
columns are predictions forward in time for the same commodity. These are produced once per day, hence the pre-existing values at Index 0
and Index 1
. And so, there are 23 missing observations per day.
Input dataframe
I would like to fill those missing values in a very specific way. I want the values for “time_from_actual” at index 1 up to 24 to follow the same pattern as the first column with regards to the differences between the different timesteps and the actual values.
Output dataframe
I’ve succeeded doing this with a nested for loop
, but I would very much like to see suggestions for more elegant approaches. Below you will find a complete attempt with sample data, code and output. Thank you for any suggestions!
Code
# imports
import pandas as pd
import numpy as np
# Random seed
np.random.seed(42)
# Sample data
data = {
'Date': pd.date_range(start='2023-01-01', periods=40, freq='H'),
'Actual': [100, 99.72, 101.02, 104.06, 103.60, 103.13, 106.29, 107.82, 106.88, 107.97,
107.04, 106.11, 106.59, 102.77, 99.32, 98.19, 96.17, 96.80, 94.98, 92.15,
95.09, 94.63, 94.77, 91.92, 90.83, 91.05, 88.75, 89.50, 88.30, 87.72,
86.51, 90.22, 90.19, 88.08, 89.72, 87.28, 87.70, 83.78, 81.12, 131.52],
'time_from_actual_1': [97] + [np.nan]*23 + [90] + [np.nan]*15,
'time_from_actual_2': [99] + [np.nan]*23 + [89] + [np.nan]*15,
'time_from_actual_3': [98] + [np.nan]*23 + [88] + [np.nan]*15,
'time_from_actual_4': [97] + [np.nan]*23 + [87] + [np.nan]*15,
'time_from_actual_5': [96] + [np.nan]*23 + [86] + [np.nan]*15,
'time_from_actual_6': [95] + [np.nan]*23 + [85] + [np.nan]*15,
'time_from_actual_7': [94] + [np.nan]*23 + [84] + [np.nan]*15,
'time_from_actual_8': [93] + [np.nan]*23 + [83] + [np.nan]*15,
}
# dataframe
df = pd.DataFrame(data)
# copy of the dataframe to reference original values only
original_df = df.copy()
# Fill missing values for columns starting with "time_from_actual"
time_cols = [col for col in df.columns if col.startswith('time_from_actual')]
for col in time_cols:
for i in range(1, len(df)):
if pd.isnull(df.loc[i, col]):
j = i
while j < len(df) and pd.isnull(original_df.loc[j, col]):
previous_actual = df.loc[j - 1, 'Actual']
previous_time = df.loc[j - 1, col]
current_actual = df.loc[j, 'Actual']
difference = previous_time - previous_actual
df.loc[j, col] = current_actual + difference
j += 1