I’m working on kernel density estimation (KDE) in Python for road accident data, but here I will use 1D data just for illustration. When fitting the KDE, I’ve noticed that I get different results when using all data points versus restricting to those within the bandwidth. Specifically, when using only the points within the bandwidth, I get almost the same density values both in areas with few points and in areas with many points, which seems counterintuitive to me. I don’t know how to interpret this, or whether it is a problem in my code.
import numpy as np
import matplotlib.pyplot as plt

def K(u):
    # Gaussian kernel (assumed here; the original definition of K is not shown)
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def KDE(x_list, radius):
    # x_list holds the differences x - xi for the points being used
    return (1 / (len(x_list) * radius)) * np.sum([K(x / radius) for x in x_list])

def kde_val(x, dati):
    # contribution of every point in dati to the density at x (unused below)
    return np.sum([K(x - xi) for xi in dati])
dataset = np.array([10, 11, 10, 55, 56, 57, 58, 59])
x_range = np.linspace(dataset.min() - 0.3, dataset.max() + 0.3, num=600)
# bandwidth values for experimentation
H = [30, 40, 50, 30, 40, 50]
n_samples = dataset.size
# line properties for the different bandwidth values
color_list = ["brown", "black", "yellow", "blue", "red", "green"]
alpha_list = [0.8, 1, 0.8, 0.8, 1, 0.8]
width_list = [1.7, 2.5, 1.7, 1.7, 2.5, 1.7]
plt.figure(figsize=(10, 4))
# iterate over bandwidth values
i = 0
for h, color, alpha, width in zip(H, color_list, alpha_list, width_list):
    i += 1
    # iterate over the evaluation grid
    y_range = []
    for x in x_range:
        a = x - dataset
        b = a[abs(a) <= h]  # consider only datapoints within the bandwidth h
        if i > 3:           # for the last three curves, use all datapoints
            b = x - dataset
        y_range.append(KDE(b, h))
    y_range = np.array(y_range)
    plt.plot(x_range, y_range,
             color=color, alpha=alpha, linewidth=width,
             label=f'{h}')
plt.plot(dataset, np.zeros_like(dataset), 's',
         markersize=8, color='black')
plt.legend()
plt.show()
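A minimal, self-contained version of what I am seeing, without the plotting (assuming a standard Gaussian kernel for K, since my actual kernel is not shown above):

```python
import numpy as np

def K(u):
    # standard Gaussian kernel (assumption)
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

dataset = np.array([10, 11, 10, 55, 56, 57, 58, 59])
h = 30

def kde_windowed(x):
    # normalize by the number of points that fall within the bandwidth
    d = x - dataset
    d = d[np.abs(d) <= h]
    return np.sum(K(d / h)) / (len(d) * h)

def kde_full(x):
    # normalize by the total sample size n
    d = x - dataset
    return np.sum(K(d / h)) / (len(dataset) * h)

# density at the sparse cluster (x=10) vs the dense cluster (x=57)
print(kde_windowed(10), kde_windowed(57))  # nearly identical values
print(kde_full(10), kde_full(57))          # dense cluster is clearly higher
```

With the windowed version the two regions come out almost equal, while the full version ranks the dense cluster higher, which is exactly the behavior I described above.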
I couldn’t find any library that ignores datapoints outside the bandwidth, so I have nothing to compare my results against.
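The closest thing I found is using a compactly supported kernel: scikit-learn's KernelDensity with kernel='epanechnikov' gives zero weight to any point farther than the bandwidth from x, so points outside the bandwidth contribute nothing, while the estimate is still normalized by the full sample size n (this sketch assumes scikit-learn is an acceptable comparison):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

dataset = np.array([10, 11, 10, 55, 56, 57, 58, 59]).reshape(-1, 1)

# Epanechnikov kernel has compact support: weight is 0 beyond the bandwidth
kde = KernelDensity(kernel='epanechnikov', bandwidth=30).fit(dataset)

xs = np.array([[10.0], [57.0]])
dens = np.exp(kde.score_samples(xs))  # score_samples returns log-density
print(dens)  # the dense cluster around 55-59 gets the higher density
```

Even though this kernel ignores far-away points, it still shows a higher density in the dense region, unlike my windowed version.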