I have an embedding matrix of shape (4312, 1024), corresponding to 1024-dimensional embedding vectors of 4312 English sentences. I want to cluster these vectors and visualize the result, in order to see whether the distance threshold I chose was good enough.
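(For context, the variable embeddings used in the code below is simply this matrix as a NumPy array; the file name here is only illustrative:)

import numpy as np

# illustrative only: load the precomputed sentence embeddings from disk
embeddings = np.load('sentence_embeddings.npy')  # hypothetical file name
print(embeddings.shape)  # (4312, 1024)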
The clustering is done using:
import numpy as np
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(n_clusters=None, metric='cosine',
                                compute_full_tree='auto',
                                linkage='complete',
                                distance_threshold=0.2,
                                compute_distances=True)
clustering = model.fit(embeddings)
print(f'Number of clusters: {clustering.n_clusters_}')
print(f'Labels:\n{clustering.labels_}')
# count unique labels
unique_labels, counts = np.unique(clustering.labels_, return_counts=True)
print(f'Number of clusters by counting: {len(unique_labels)}')
# Sort in descending order of counts
sorted_indices = np.argsort(-counts)
unique_labels = unique_labels[sorted_indices]
counts = counts[sorted_indices]
print(f'Unique labels: {unique_labels}')
print(f'counts: {counts}')
The output of these print statements is:
Number of clusters: 1714
clustering.labels_:
[ 460 820 245 ... 1030 112 1367]
Number of clusters by counting: 1714
Unique labels: [ 410 352 229 ... 1039 1041 1713]
counts: [55 42 33 ... 1 1 1]
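For reference, to look at the actual sentences in a given cluster (for example the largest one), a minimal sketch like the following could be used; here, sentences is a hypothetical list holding the 4312 original sentences in the same order as the rows of embeddings:

from collections import defaultdict

# group sentence indices by cluster label
clusters = defaultdict(list)
for idx, label in enumerate(clustering.labels_):
    clusters[label].append(idx)

# inspect a few sentences from the largest cluster
# (unique_labels was sorted by descending count above, so unique_labels[0] is the largest cluster)
largest_label = unique_labels[0]
for idx in clusters[largest_label][:10]:
    print(sentences[idx])  # sentences is hypothetical: the original 4312 sentences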
I obtained 1714 clusters, and the largest cluster contains 55 points. If I increase the distance threshold to 0.25, the number of clusters decreases to 1395. I want to know which sentences have been merged with distance_threshold=0.25 compared to distance_threshold=0.2 (a rough sketch of how the two label sets could be compared directly is given after the plotting code below). So I plotted the results for 0.2, following the official scikit-learn dendrogram example, and did:
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)
plt.title("Hierarchical Clustering Dendrogram")
# plot the top three levels of the dendrogram
plot_dendrogram(clustering, truncate_mode="level", p=3, distance_sort='ascending', show_leaf_counts=True)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()
Results:
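(Independently of the dendrogram, the direct comparison mentioned above could be done by refitting at 0.25 and cross-tabulating the two label sets; this is only a sketch, assuming the same embeddings array:)

from collections import defaultdict

# refit with the larger distance threshold
model_025 = AgglomerativeClustering(n_clusters=None, metric='cosine',
                                    linkage='complete',
                                    distance_threshold=0.25,
                                    compute_distances=True)
labels_025 = model_025.fit_predict(embeddings)
labels_02 = clustering.labels_  # labels obtained with distance_threshold=0.2

# for each 0.25-cluster, collect the 0.2-clusters it contains
merged = defaultdict(set)
for l02, l025 in zip(labels_02, labels_025):
    merged[l025].add(l02)

# 0.25-clusters containing more than one 0.2-cluster were formed by merging
for l025, members in sorted(merged.items()):
    if len(members) > 1:
        print(f'0.25-cluster {l025} merges 0.2-clusters {sorted(members)}')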
I have two questions, please:

- I chose the cosine distance, which lies in [0, 1]. How is it possible that the distances from the leaves to their parents are larger than 1? Is it due to numerical errors?

- The dendrogram function seems to draw the tree top-down. Is it possible to draw it bottom-up, so that the clusters are the leaves? That would make more sense to me, because I would like to start from the clusters and see how they merge depending on the distances between them.
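(Regarding the first question, for reference, this is the kind of quick check that could be run on the fitted model to look at the raw merge distances it stored:)

# inspect the merge distances recorded by the fitted model
print(f'min merge distance: {clustering.distances_.min():.4f}')
print(f'max merge distance: {clustering.distances_.max():.4f}')
print(f'merges with distance > 1: {(clustering.distances_ > 1).sum()}')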
Thank you very much in advance for your help!