I am running a simulation to check that my method works as expected. I define three variables X, Y, Z, using a Dirichlet distribution for X and Gaussians for Z and Y. The goal is to define and predict Y = theta f(X)+g(Z)+N_y, where X=N_x, Z=N_x+X, and to plot a graph where the x-axis represents theta and the y-axis the rejection rate. The test is for conditional independence, i.e. whether E[Y|X,Z]=E[Y|Z]. I expect to see the rejection rate rise as theta increases.
I have tried simulating the data in Python, initially using only one coordinate of X for the regression, with the following code:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Parameters
alpha = [0.2, 0.2, 0.2]
mu_z = 2
sigma_z = 1
mu_y = 23
sigma_y = 3
# Number of samples
num_samples = 1000
X = np.random.dirichlet(alpha, num_samples)[:, 0]
print(X)
Z = np.random.normal(mu_z, sigma_z, num_samples)
# Define functions f(X) and g(Z)
def f(X):
    return X
def g(Z):
    return np.sin(Z)
# Generate Ny
Ny = np.random.normal(mu_y, sigma_y, num_samples)
# Set up theta values for the simulation
theta_values = np.linspace(0, 4, 100)
rejection_rates = []
# Loop over different theta values to compute rejection rates
for theta in theta_values:
    # Generate Y
    Y = theta * f(X) + g(Z) + Ny
    # Fit the linear model
    X_ = sm.add_constant(np.column_stack((f(X), g(Z))))
    model = sm.OLS(Y, X_)
    results = model.fit()
    # Compute the p-value for the theta coefficient
    p_value = results.pvalues[1]  # p-value for the theta coefficient
    # Determine rejection (1 if p-value < 0.05, else 0)
    rejection_rates.append(int(p_value < 0.05))  # Significance level of 0.05
# Plot rejection rate against theta
plt.plot(theta_values, rejection_rates)
plt.xlabel('Theta')
plt.ylabel('Rejection Rate')
plt.title('Rejection Rate vs. Theta')
plt.show()
The plot is not quite what I expected, because it jumps to a very high rejection rate quite early. I am not sure whether this stems from the function I use or from a mistake in the method. Ideally, the curve should increase more evenly as theta increases.
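For comparison, here is a minimal sketch of what I think a rejection rate estimated over repeated simulations might look like, as opposed to the single 0/1 outcome per theta that the code above records. It assumes the same distributions and parameters as above. The function name `rejection_rate` and the parameters `n_reps` and `alpha_level` are hypothetical, and to keep the sketch free of extra dependencies it swaps the statsmodels p-value for a two-sided normal approximation to the OLS t-test, which should be close for 1000 samples.

```python
import math
import numpy as np

def rejection_rate(theta, n_samples=1000, n_reps=200, alpha_level=0.05, rng=None):
    """Fraction of n_reps fresh datasets in which the coefficient on X is significant."""
    rng = np.random.default_rng() if rng is None else rng
    rejections = 0
    for _ in range(n_reps):
        # Draw a new dataset on every repetition (same parameters as in the question)
        X = rng.dirichlet([0.2, 0.2, 0.2], n_samples)[:, 0]
        Z = rng.normal(2, 1, n_samples)
        Ny = rng.normal(23, 3, n_samples)
        Y = theta * X + np.sin(Z) + Ny

        # OLS of Y on [1, f(X), g(Z)] with f(X) = X and g(Z) = sin(Z)
        design = np.column_stack((np.ones(n_samples), X, np.sin(Z)))
        beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
        resid = Y - design @ beta
        dof = n_samples - design.shape[1]
        sigma2 = resid @ resid / dof
        cov = sigma2 * np.linalg.inv(design.T @ design)

        # Assumption: normal approximation to the t-test for the X coefficient
        t_stat = beta[1] / math.sqrt(cov[1, 1])
        p_value = math.erfc(abs(t_stat) / math.sqrt(2))
        rejections += int(p_value < alpha_level)
    return rejections / n_reps

# Estimate the rejection rate on a coarse grid of theta values
rng = np.random.default_rng(0)
theta_grid = np.linspace(0, 4, 9)
rates = [rejection_rate(theta, rng=rng) for theta in theta_grid]
for theta, rate in zip(theta_grid, rates):
    print(f"theta={theta:.1f}  rejection rate={rate:.2f}")
```

Under these assumptions each value in `rates` is a proportion between 0 and 1, so it could be passed to `plt.plot(theta_grid, rates)` in place of the 0/1 list. How fast the curve rises would still depend on the noise level `sigma_y` and on `num_samples`.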