I’m currently trying to fit a skewed Gaussian to a histogram of data by constraining the mean (mu), sigma1, sigma2, and amplitude (A) through a grid-search chi-square minimization.
It seemed to be working decently on my actual data: the reduced chi-square was reasonable, and visual inspection also showed a decent fit to the data.
As a test, I then tried simulating data with scipy.stats.skewnorm and running the exact same fitting process on it. When I started, I used a loc parameter of 3 (i.e. a skewed distribution centred roughly on 3). Running the same code gave a terrible fit, even though the simulated data matched my real data almost exactly apart from the difference in mean (my real data is centred closer to 0). I initially thought I’d just defined some variables incorrectly, but after going through the code, it’s exactly the same. I then made the simulated data match my real data more closely by changing the loc parameter to 0 and reran the same code; this time the fit was reasonable.
I just can’t understand why this happens. Ignoring the change in the x location of the distribution, the data looks exactly the same, and given that I’m performing an exhaustive grid search whose boundaries are defined by the min and max data values (so they should shift accordingly with a change in the data), why should the fit get worse for a change in loc?
I’ve attached my code below, as I’m very puzzled and can’t see the logic behind this behaviour.
import itertools
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skewnorm

sigclip = skewnorm.rvs(a=3, loc=3, scale=1, size=300) #generate fake data with skewnorm
#counts = counts in each bin, bin_edges = bin boundaries
counts, bin_edges, *_ = plt.hist(sigclip, bins=50, color='#0504aa', rwidth=0.85)
A = plt.gca().get_ylim() #(bottom, top) of the y-axis, used later to bound the amplitude range
plt.show()
#bin centers - to map counts to bins
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
#gaussian (without the normalisation factor)
def gauss(x, mu, sigma):
    return np.exp(-np.power((x - mu)/sigma, 2.)/2)
def gauss2sig(x, mu, sigma1, sigma2, A):
    res = np.zeros(len(x))
    res[x >= mu] = gauss(x[x >= mu], mu, sigma1) #positive-side sigma
    res[x < mu] = gauss(x[x < mu], mu, sigma2)   #negative-side sigma
    return A*res
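#quick illustrative sanity check of the piecewise model (not part of the fit):
#sigma1 should apply to the right of mu and sigma2 to the left
xs = np.linspace(-3, 3, 7)
print(gauss2sig(xs, mu=0, sigma1=2, sigma2=0.5, A=1))
#values decay slowly for x >= 0 (sigma1 = 2) and quickly for x < 0 (sigma2 = 0.5)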
# setting up dimensions of grid search
N = 50
#parameter ranges for mu,sigma1,sigma2, A
mu_range = np.linspace(min(sigclip), max(sigclip), N)
sig1_range = np.linspace(min(sigclip), max(sigclip), N)
sig2_range = np.linspace(min(sigclip), max(sigclip), N)
A_range = np.linspace(min(A), max(A), N)
chi_grid = np.zeros((N,N,N,N))
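#since the grid bounds come straight from the data, my expectation is that they
#simply translate when loc changes - a quick illustrative check of that assumption:
for loc_test in (0, 3):
    d = skewnorm.rvs(a=3, loc=loc_test, scale=1, size=300)
    print('loc =', loc_test, '-> grid bounds:', round(min(d), 2), 'to', round(max(d), 2))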
#chi values
chi_val = []
parameter_indices = []
for x, y, z, a in itertools.product(range(len(mu_range)), range(len(sig1_range)), range(len(sig2_range)), range(len(A_range))):
    mu_val = mu_range[x]
    sig1_val = sig1_range[y]
    sig2_val = sig2_range[z]
    A_val = A_range[a]
    #data - kept separate just for ease of reading, can delete after
    data = counts
    #model
    model = gauss2sig(x=bin_centers, mu=mu_val, sigma1=sig1_val, sigma2=sig2_val, A=A_val)
    #error = Poisson noise: sigma = sqrt(N) where N = counts, so sigma^2 = counts
    error = np.sqrt(counts)
    #chi will have some inf/nan values where counts (and therefore error) = 0
    chi = (data - model)**2 / error**2
    mask = np.isfinite(chi) #masking inf/nan values
    chi = chi[mask]         #masked chi values, i.e. infs removed
    chi = np.sum(chi)       #final chisq value
    #storing chisq at the corresponding grid position
    chi_grid[a, z, y, x] = chi
    #appending chisq value to list
    chi_val.append(chi)
    #appending parameter index values
    parameter_indices.append((x, y, z, a))
best_idx = chi_val.index(min(chi_val))
print('min chisq val is:', min(chi_val))
print()
best_fit_mu = mu_range[parameter_indices[best_idx][0]]
print('best mu value is:', best_fit_mu)
print()
best_fit_sig1 = sig1_range[parameter_indices[best_idx][1]]
print('best sigma>=mu (+ve):', best_fit_sig1)
print()
best_fit_sig2 = sig2_range[parameter_indices[best_idx][2]]
print('best sigma<mu (-ve):', best_fit_sig2)
print()
best_fit_A = A_range[parameter_indices[best_idx][3]]
print('best A value is:', best_fit_A)
print()
print('reduced chisq:', min(chi_val)/(len(data) - 4))
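#aside (illustrative): the dof above counts all bins, but empty bins were masked
#out of the chi sum; a version with dof restricted to the fitted bins would be:
n_fitted = np.count_nonzero(counts) #bins actually included in the chi sum
print('reduced chisq (masked dof):', min(chi_val)/(n_fitted - 4))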
plt.hist(sigclip, bins=50, color='#0504aa', rwidth=0.85)
# Generate x values
x_values = np.linspace(min(sigclip), max(sigclip), 1000)
# Calculate y values from the best-fit two-sided Gaussian
y_values = gauss2sig(x_values, mu=best_fit_mu, sigma1=best_fit_sig1, sigma2=best_fit_sig2, A=best_fit_A)
# Plot the result
plt.plot(x_values, y_values)
plt.show()
Again, the issue: as the loc parameter gets larger (e.g. > 2), the fit is terrible, but it seems fine for loc ~ 0. My actual data is centred around a mean of 0, but I want to understand what the issue is, so I know whether this behaviour will affect the accuracy of my analysis. If anyone has any insight, I would be most grateful, as I was under the impression this grid search should be fairly robust.
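For completeness, the only thing I change between the good and bad runs is the loc argument on the very first line; everything else is identical:
sigclip = skewnorm.rvs(a=3, loc=0, scale=1, size=300) #fit comes out reasonable
sigclip = skewnorm.rvs(a=3, loc=3, scale=1, size=300) #fit comes out terrible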