I have a dataset with categorical features that I want to cluster using a soft clustering approach, where each data point can belong to multiple clusters with different probabilities.
My questions are:
- Is combining Categorical Naive Bayes classification with the Expectation-Maximization (EM) algorithm a valid approach for soft clustering of categorical data? Are there any theoretical or practical issues with this combination?
- Are there better alternatives for soft clustering of purely categorical data? I’ve come across methods like fuzzy k-modes and latent class analysis, but I’m not sure how they compare.
- Any suggestions for improving the code or the overall approach?
Thanks in advance for your insights!
As a first attempt, I tried combining the Categorical Naive Bayes classifier with the Expectation Maximization (EM) algorithm. Here’s the code I used:
```python
import numpy as np
from scipy.special import logsumexp
from sklearn.naive_bayes import CategoricalNB

def EM_CategoricalNB(X, k, max_iters=100, tol=1e-4, verbose=False):
    """
    EM algorithm using CategoricalNB for clustering categorical data.

    Parameters:
        X (np.array): Data array; each row is a data point, and values are
            integer-encoded categories.
        k (int): Number of clusters.
        max_iters (int): Maximum number of EM iterations.
        tol (float): Tolerance on the change in log-likelihood for convergence.
        verbose (bool): Whether to print status messages.

    Returns:
        model (CategoricalNB): Trained Naive Bayes model.
        assignments (np.array): Hard cluster assignments of the data points.
    """
    n = len(X)
    # Initialize with random hard assignments (the shuffled 0..k-1 pattern
    # guarantees every cluster appears at least once)
    model = CategoricalNB()
    assignments = np.random.permutation(np.arange(n) % k)
    model.fit(X, assignments)

    # Replicate the data once per cluster so the M-step can use the soft
    # memberships as sample weights instead of hard argmax labels.
    X_rep = np.tile(X, (k, 1))
    y_rep = np.repeat(np.arange(k), n)

    previous_log_likelihood = -np.inf
    for iteration in range(max_iters):
        # E-step: posterior membership probabilities p(c | x)
        probabilities = model.predict_proba(X)

        # M-step: re-fit using the expected memberships as sample weights
        # (probabilities[:, c] weights the copy of X labeled c)
        model = CategoricalNB()
        model.fit(X_rep, y_rep, sample_weight=probabilities.T.ravel())

        # Marginal log-likelihood: sum_i logsumexp_c log p(x_i, c).
        # Note model.score() would report classification accuracy, not
        # likelihood; predict_joint_log_proba needs scikit-learn >= 1.2.
        log_likelihood = logsumexp(model.predict_joint_log_proba(X),
                                   axis=1).sum()
        if verbose:
            print(f"Iteration {iteration}: Log Likelihood = {log_likelihood}")

        # Check for convergence
        if np.abs(log_likelihood - previous_log_likelihood) < tol:
            if verbose:
                print("Convergence reached.")
            break
        previous_log_likelihood = log_likelihood

    # Final hard assignments
    final_assignments = model.predict(X)
    return model, final_assignments
```
The idea is to initialize the Categorical Naive Bayes model with random cluster assignments, and then iteratively:
- Estimate the membership probabilities of each data point to each cluster (E-step)
- Re-fit the model using the expected memberships as soft assignments (M-step)
- Check for convergence based on the change in log likelihood
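For reference, the latent class analysis I mentioned is, as far as I understand it, essentially these same E/M steps applied directly to a mixture of independent categorical distributions — which is also the model CategoricalNB assumes within each cluster. A minimal NumPy sketch on toy data (the sizes, seed, and smoothing constant are arbitrary choices of mine):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Toy data: n points, d categorical features, each encoded as 0..n_cat-1
n, d, n_cat, k = 300, 4, 3, 2
X = rng.integers(0, n_cat, size=(n, d))

# Parameters: mixing weights pi (k,) and per-cluster category
# probabilities theta (k, d, n_cat), normalized over the last axis
pi = np.full(k, 1.0 / k)
theta = rng.dirichlet(np.ones(n_cat), size=(k, d))

for _ in range(50):
    # E-step: log p(x_i, c) = log pi_c + sum_j log theta[c, j, x_ij],
    # then normalize over c to get responsibilities r[i, c]
    log_joint = np.log(pi)[None, :] + np.array(
        [np.log(theta[c, np.arange(d), X]).sum(axis=1) for c in range(k)]
    ).T
    r = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))

    # M-step: responsibility-weighted frequency estimates,
    # with a small smoothing term to avoid zero probabilities
    pi = r.mean(axis=0)
    for c in range(k):
        for j in range(d):
            counts = np.bincount(X[:, j], weights=r[:, c], minlength=n_cat)
            theta[c, j] = (counts + 1e-6) / (counts.sum() + n_cat * 1e-6)

# r holds the soft memberships; argmax gives hard labels if needed
labels = r.argmax(axis=1)
```

If this is equivalent to what my CategoricalNB-based loop is doing, I'd be interested to know whether there is any advantage to one formulation over the other.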