I have 22 columns and need to do k-means clustering (k=5) on selected columns: gender, education, age, years of marriage, and amount spent. After clustering, I am required to determine the number of samples in each cluster and count the number of 0s and 1s in each cluster (using another column named default). I need to use the cluster column together with the default column to calculate the proportion of 0 and 1.
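import pandas as pd
from sklearn.cluster import KMeans

K = 5
random_seed = 1
model = KMeans(n_clusters=K, random_state=random_seed)
model.fit(data_norm1)
# Rebuild a DataFrame from the normalized data and attach the cluster labels
df1_norm = pd.DataFrame(data_norm1, columns=df_1.columns)
cluster_labels = model.labels_
df1_norm['Cluster1'] = cluster_labels
df1_norm
# Number of samples in each cluster
cluster_counts = df1_norm.groupby('Cluster1').size()
print(cluster_counts)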
I managed to calculate the number of samples, but I am unable to split each cluster into 0 and 1. Earlier in the code (before clustering), I grouped the full data set into 0 and 1. I tried to use groupby() to link the two sets of data.
grp1 = df1_norm.groupby(['GENDER', 'EDUCATION', 'AGE', 'MARRIAGE', 'AMOUNT_SPENT', 'DEFAULT'])
The error message is KeyError: 'DEFAULT'.
Can someone point out the error for me?
The expected output would be something like this:
Cluster  No. of samples  No. of positive
0        9095            8098
1        6933            2345
2        11685           7892
3        7190            2345
4        5097            2342
dtype: int64
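One possible way to get a table of this shape is sketched below on made-up data. It assumes the original frame (called `df` here, a hypothetical name) still holds the DEFAULT column and that its rows are in the same order as the clustered data. The min-max scaling is also an assumption, standing in for whatever produced data_norm1.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up data standing in for the original frame (hypothetical name: df).
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "GENDER": rng.integers(1, 3, n),
    "EDUCATION": rng.integers(1, 5, n),
    "AGE": rng.integers(21, 70, n),
    "MARRIAGE": rng.integers(0, 30, n),
    "AMOUNT_SPENT": rng.uniform(0, 10000, n),
    "DEFAULT": rng.integers(0, 2, n),
})

features = ["GENDER", "EDUCATION", "AGE", "MARRIAGE", "AMOUNT_SPENT"]

# Assumed normalization (min-max); replace with the real preprocessing step.
X = df[features]
data_norm = (X - X.min()) / (X.max() - X.min())

model = KMeans(n_clusters=5, random_state=1, n_init=10)
labels = model.fit_predict(data_norm)

# Put the cluster label next to DEFAULT; this relies on identical row order.
result = pd.DataFrame({"Cluster1": labels, "DEFAULT": df["DEFAULT"].to_numpy()})

# Per cluster: number of samples, number of 1s, and proportion of 1s.
summary = result.groupby("Cluster1")["DEFAULT"].agg(
    n_samples="size",
    n_positive="sum",
    prop_positive="mean",
)
print(summary)

# Counts of 0 and 1 per cluster as separate columns.
counts = pd.crosstab(result["Cluster1"], result["DEFAULT"])
print(counts)
```

Since DEFAULT is 0/1, its sum per cluster is the number of positives and its mean is the proportion of positives; the proportion of 0s is one minus that.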