I have clustered a bunch of lidar points using DBSCAN to improve classification of vegetation points. The output of DBSCAN is a new column, cluster_ID, indicating which cluster each lidar point belongs to.
What I want to do is remove any clusters that fall below a certain point count. My data has over 70,000 points in about 600 clusters, so my working solution, a for loop with an if statement, takes about 2 minutes to complete. I am hoping to speed this up, as this is a small subset of my actual data.
Here is a working example of what I implemented in my code that works but is slow:
rm(list = ls()) # clear env
library(dplyr)  # for group_by()/summarise()

set.seed(1) # make the example reproducible
# build a data frame with one cluster ID per point
test_data <- data.frame(cluster_ID = as.integer(runif(10, min = 1, max = 5)))

# count points per cluster
sum_test <- test_data %>%
  group_by(cluster_ID) %>%
  summarise(pt_count = n())

# remove all rows belonging to clusters with < 3 points
for (c_id in unique(test_data$cluster_ID)) {
  # row indices of this cluster
  idx <- which(test_data$cluster_ID == c_id)
  # drop the whole cluster if it is too small
  if (length(idx) < 3) {
    test_data <- test_data[-idx, , drop = FALSE] # drop rows, not columns
  }
}
What I can’t figure out is how to find the rows in test_data that fall below a count threshold, based on the output of the summarise function, without a loop, since test_data and sum_test have different numbers of rows.
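For reference, this is the shape of solution I am imagining: a grouped filter that computes the per-cluster count inline instead of joining back against sum_test. This is only an untested sketch on the toy data above (the threshold of 3 is my example value, and it assumes dplyr is loaded):

library(dplyr)

set.seed(42) # reproducible toy data
test_data <- data.frame(cluster_ID = as.integer(runif(10, min = 1, max = 5)))

# keep only rows whose cluster has at least 3 points;
# n() is evaluated per group, so no separate summary table is needed
filtered <- test_data %>%
  group_by(cluster_ID) %>%
  filter(n() >= 3) %>%
  ungroup()

# equivalent base R version: ave() replicates the per-cluster count
# onto every row, giving a logical mask of the same length as test_data
filtered_base <- test_data[ave(test_data$cluster_ID, test_data$cluster_ID,
                               FUN = length) >= 3, , drop = FALSE]

I don't know whether this scales well to 70,000+ points, which is part of what I'm asking.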
Thanks for any help/suggestions.