MemoryError: Unable to allocate 361. TiB for an array with shape (49604310962128,) and data type float64
from sklearn.cluster import Birch
from sklearn.metrics import pairwise_distances
branching_factor = 50
n_clusters = 1000000  # set a large number of clusters
threshold = 0.5
# Create the BIRCH clusterer
birch = Birch(n_clusters=n_clusters, threshold=threshold, branching_factor=branching_factor)
# Fit the BIRCH clusterer
birch.fit(reduced_data)
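
The size in the error message can be sanity-checked with a little arithmetic. The sketch below assumes the failing array is a condensed pairwise-distance vector of length m(m-1)/2, which is what a hierarchical global clustering step over m points would need. That interpretation is an assumption; the traceback shown here does not confirm it.

```python
import math

n_elements = 49_604_310_962_128  # array length from the error message
bytes_needed = n_elements * 8    # float64 uses 8 bytes per element
tib = bytes_needed / 2**40       # convert bytes to TiB

# Assumption: n_elements == m * (m - 1) / 2 for some number of points m.
# Solve the quadratic for m using exact integer arithmetic.
m = (1 + math.isqrt(1 + 8 * n_elements)) // 2
assert m * (m - 1) // 2 == n_elements  # the length is an exact pair count

print(f"{tib:.1f} TiB requested, m = {m:,} points")
```

If that reading is right, the tree kept about 9.96 million subclusters, nearly one per sample, so threshold = 0.5 would be compressing the data very little before the final clustering step.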
The dataset contains 10,000,000 samples, one per row. The first column is the sample number, and the remaining 64 columns are the sample features. The data comes from DNA sequences. The maximum number of clusters is 1,000,000. If the clustering produces more than 1,000,000 clusters, the excess clusters are merged into the 1,000,000th cluster.
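
The merging rule above can be sketched as a post-processing step on the predicted labels. This is a minimal illustration, assuming labels are zero-based integers so that the 1,000,000th cluster has label 999,999; `cap_labels` is a hypothetical helper name, not part of scikit-learn.

```python
MAX_CLUSTERS = 1_000_000  # cap stated in the description above


def cap_labels(labels, max_clusters=MAX_CLUSTERS):
    """Map every label >= max_clusters onto the last allowed label.

    Assumes zero-based integer labels, so the last allowed label
    is max_clusters - 1.
    """
    last = max_clusters - 1
    return [min(int(label), last) for label in labels]


# Tiny demonstration with a cap of 3 clusters (labels 0, 1, 2)
demo = cap_labels([0, 1, 2, 3, 7], max_clusters=3)
print(demo)  # [0, 1, 2, 2, 2]
```

For a NumPy label array, `np.minimum(labels, max_clusters - 1)` applies the same rule without a Python loop.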
What clustering scheme would be best for this data, and how should we implement it?