I am trying to fit a multivariate (multiple-predictor) linear regression to a set of data. After fitting, I predicted Y using the same X that was used to estimate the regression coefficients. For one dataset the differences between the actual and predicted values were small, as expected, but for another dataset they were larger. The two datasets contain the same set of parameters (the same physical quantities). Did I do something wrong, and what can I do to improve the fit?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
.....
.....
# Construct the design matrix
X = np.column_stack((tir1, tir1_z, bt_diff, bt_diff_z, bt_diff_sst, s_theta, np.ones_like(tir1)))
# Convert buoy SST from degrees Celsius to kelvin
sst = np.asarray(selected_buoy_sst, dtype=float) + 273.15
# Fit the OLS model
model = sm.OLS(sst, X)
results = model.fit()
predicted_sst_same_data = results.predict(X)
# Calculate the difference between actual SST and predicted SST
difference = sst - predicted_sst_same_data
# Repeat the fit using a different dataset
X_n = np.column_stack((tir1, tir1_z, bt_diff, bt_diff_z, bt_diff_sst, s_theta, np.ones_like(tir1)))
sst_n = skin_temp_array
# Fit the OLS model
model = sm.OLS(sst_n, X_n)
results = model.fit()
# Print the summary of the regression results
print(results.summary())
# Predict on the same design matrix that was used for this fit
predicted_sst_same_data = results.predict(X_n)
# Calculate the difference between actual SST and predicted SST
difference = sst_n - predicted_sst_same_data
I can upload the data if needed. Does the size of the differences depend on my X and Y values?
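To put the two fits on equal terms, it may help to reduce each set of differences to a single number such as the in-sample RMSE and R². Below is a minimal sketch of that comparison on synthetic stand-in data, not the real arrays. The helper `fit_and_score`, the array `X_demo`, the coefficients and the noise levels are all made up for illustration. It uses `np.linalg.lstsq` rather than `sm.OLS` only to keep the example dependency-free; for a full-rank design matrix both should give the same least-squares coefficients.

```python
import numpy as np

def fit_and_score(X, y):
    """Least-squares fit of y on X; returns (coefficients, in-sample RMSE, in-sample R^2)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return coef, rmse, r2

rng = np.random.default_rng(0)
n = 500

# Hypothetical design matrix: two predictors plus an intercept column
X_demo = np.column_stack((rng.normal(size=n), rng.normal(size=n), np.ones(n)))
true_coef = np.array([1.5, -0.7, 290.0])  # made-up values, for illustration only

# Same predictors and same underlying relation, but different noise in the target
y_low_noise = X_demo @ true_coef + rng.normal(scale=0.1, size=n)
y_high_noise = X_demo @ true_coef + rng.normal(scale=1.0, size=n)

_, rmse_low, r2_low = fit_and_score(X_demo, y_low_noise)
_, rmse_high, r2_high = fit_and_score(X_demo, y_high_noise)

print(f"low-noise target:  RMSE={rmse_low:.3f}, R^2={r2_low:.3f}")
print(f"high-noise target: RMSE={rmse_high:.3f}, R^2={r2_high:.3f}")
```

In this toy setup the second fit has larger residuals even though the model and the predictors are identical, simply because the target is noisier. Whether something similar explains the difference between the buoy SST fit and the skin-temperature fit is an open question that only the real data can answer.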