I try to implement the following basic k-Means algorithm in R (from James et al (2023).An Introduction to Statistical Learning, p. 517):
The code I came up with is working and produces a reasonable result (every 100 obs build 1 cluster, which corresponds with the data generating process).
The code currently looks like this:
# predefined number of clusters
k <- 3
# data simulation
set.seed(1234)
X <- matrix(rnorm(30000),nrow=300,ncol=100) # 300 obs with 100 features each
X[101:200,] <- X[101:200,] + 1
X[201:300,] <- X[201:300,] + 2
# random classification for each obs
class <- matrix(sample(1:k,dim(X)[1],replace=TRUE,set.seed(123)),nrow=dim(X)[1])
# Running k-Means Algorithm
kmeans.basic <- function(X,class,k) {
# total obs
points <- dim(X)[1]
# classification
classification <- class
# number of modified class assigments per iteration
mod.assignments <- NA
repeat{
# finding cluster centroids (rows = cluster, cols = feature mean)
center <- aggregate(X,list(class),mean)
rownames(center) <- 1:3
center <- center[,-1]
# calculating euclidean distance for each point
EuclideanDist <- matrix(nrow=300,ncol=3)
for (point in 1:points) {
for (cl in 1:k) {
EuclideanDist[point,cl] <- sqrt(sum((X[point,] - center[cl,])^2))
}
}
# new classification
class.new <- apply(EuclideanDist,1,which.min)
# classification per iteration
classification <- cbind(classification,unname(class.new))
# modified assignments per iteration
mod.assignments <- c(mod.assignments,unname(points-table(class.new==class)["TRUE"]))
# stop if assignments don't change anymore
if(identical(class.new,class)) break
# new classification for next iteration
class <- class.new
}
return(list("modified_assignements" = mod.assignments,"classifications"=classification))
}
kmeans.basic(X,class,k)
My question is whether the for-loop can be replaced with a quicker code. So I want to change this loop…
EuclideanDist <- matrix(nrow=300,ncol=3)
for (point in 1:points) {
for (cl in 1:k) {
EuclideanDist[point,cl] <- sqrt(sum((X[point,] - center[cl,])^2))
}
}
…into something like this:
index <- expand.grid(lapply(c(points,k),seq_len))
EuclideanDist <- array(dist(rbind(X[index$Var1,],center[index$Var2,])),dim=c(points,k))
But the results of the array EuclideanDist in the first iteration are already not the same:
for-loop results (for first 10 obs; col 1 = Eucl.Dist to cluster centroid 1,…):
EuclideanDist
[,1] [,2] [,3]
[1,] 15.054331 13.498277 12.891710
[2,] 14.676493 12.986101 12.458025
[3,] 14.604512 12.786467 12.161778
[4,] 15.795601 13.977206 13.645632
[5,] 16.475889 14.753928 14.011286
[6,] 14.506822 12.897481 12.381961
[7,] 17.822123 15.913398 15.451605
[8,] 15.569600 13.918099 13.397844
[9,] 14.593612 12.841399 12.340844
[10,] 16.350167 14.664481 14.148517
results with expand.grid():
EuclideanDist
[,1] [,2] [,3]
[1,] 13.08934 13.08934 13.08934
[2,] 13.27003 13.27003 13.27003
[3,] 12.65392 12.65392 12.65392
[4,] 14.29587 14.29587 14.29587
[5,] 14.66200 14.66200 14.66200
[6,] 14.41147 14.41147 14.41147
[7,] 13.45788 13.45788 13.45788
[8,] 13.33575 13.33575 13.33575
[9,] 14.77679 14.77679 14.77679
[10,] 12.77164 12.77164 12.77164
Can anybody please help with the alternative code using expand.grid() to replace the for-loop? Why isn’t this working?