I am working with the R programming language.
Suppose I have the following dataset:
library(ggplot2)
set.seed(42)
heights <- rnorm(100, mean=170, sd=10)
weights <- rnorm(100, mean=65, sd=15)
data <- data.frame(heights, weights)
ggplot(data, aes(x = heights, y = weights)) +
geom_point() +
theme_bw() +
labs(title = "Scatterplot of Heights and Weights",
x = "Height (cm)",
y = "Weight (kg)") +
scale_x_continuous(breaks = seq(floor(min(data$heights)), ceiling(max(data$heights)), by = 5)) +
scale_y_continuous(breaks = seq(floor(min(data$weights)), ceiling(max(data$weights)), by = 5))
Suppose I look at every “box” that can be made of size 5×5 – I manually drew this on top of the original plot:
My Question: I am interested in counting the number of points contained in each of these boxes.
I first computed the maximum of each variable and built break points every 5 units, starting from 0:
max_height <- max(data$heights)
max_weight <- max(data$weights)
height_breaks <- seq(0, max_height, by=5)
weight_breaks <- seq(0, max_weight, by=5)
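One thing that may be worth checking at this step, shown as a small sketch on made-up toy values rather than the data above: `seq(0, m, by = 5)` stops at the largest multiple of 5 that does not exceed `m`, so the last break can never lie above the maximum. If the boxes use an exclusive upper bound, the maximum itself then falls outside every box. The helper name `padded_breaks` is hypothetical.

```r
# Toy values, chosen only for illustration
x <- c(1, 7, 12.3)

# seq() never goes past its `to` argument, so the last break is 10, below max(x)
breaks <- seq(0, max(x), by = 5)
print(breaks)                 # 0 5 10
print(max(breaks) < max(x))   # TRUE

# Hypothetical helper: extend the breaks one full step past the maximum
padded_breaks <- function(x, by = 5) {
  seq(0, (floor(max(x) / by) + 1) * by, by = by)
}
print(padded_breaks(x))       # 0 5 10 15
```

With the padded version, every value of `x` is strictly below the last break, so a half-open box `[min, max)` exists for each point.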
Then, using the expand.grid() function, I tried to build a data frame that contains every box (each combination of a height interval and a weight interval):
combinations <- expand.grid(height = seq_along(height_breaks)[-length(height_breaks)],
weight = seq_along(weight_breaks)[-length(weight_breaks)])
interval_label <- function(breaks, index) {
paste0("(", breaks[index], "-", breaks[index + 1], ")")
}
combinations$height_interval <- mapply(interval_label, list(height_breaks), combinations$height)
combinations$weight_interval <- mapply(interval_label, list(weight_breaks), combinations$weight)
height_weight_boxes <- combinations[, c("height_interval", "weight_interval")]
I then parsed these interval labels into four numeric columns holding the min/max of each variable:
library(dplyr)
transformed_df <- height_weight_boxes %>%
mutate(
box_number = row_number(),
min_height = as.numeric(sub("\\((.*?)-.*", "\\1", height_interval)),
max_height = as.numeric(sub(".*-(.*)\\)", "\\1", height_interval)),
min_weight = as.numeric(sub("\\((.*?)-.*", "\\1", weight_interval)),
max_weight = as.numeric(sub(".*-(.*)\\)", "\\1", weight_interval))
) %>%
select(box_number, min_height, max_height, min_weight, max_weight)
The first few rows look like this:
box_number min_height max_height min_weight max_weight
         1          0          5          0          5
         2          5         10          0          5
         3         10         15          0          5
         4         15         20          0          5
         5         20         25          0          5
         6         25         30          0          5
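As a possible simplification, the same kind of reference table can be built directly from the numeric breaks, without writing the intervals as strings and parsing them back. This is only a sketch on small toy breaks (the values and the name `boxes` are assumptions for illustration), using base R only.

```r
# Toy breaks, chosen only for illustration
height_breaks <- seq(0, 15, by = 5)   # 0 5 10 15
weight_breaks <- seq(0, 10, by = 5)   # 0 5 10
step <- 5

# Every combination of lower bounds; the last break is only ever an upper bound
boxes <- expand.grid(min_height = head(height_breaks, -1),
                     min_weight = head(weight_breaks, -1))

# Upper bounds follow from the fixed box size
boxes$max_height <- boxes$min_height + step
boxes$max_weight <- boxes$min_weight + step
boxes$box_number <- seq_len(nrow(boxes))

boxes <- boxes[, c("box_number", "min_height", "max_height", "min_weight", "max_weight")]
print(boxes)
```

`expand.grid()` varies its first argument fastest, so the rows come out in the same order as in the table above (height changes first, then weight).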
In the final step, I want to use this reference table (transformed_df) to query the original data frame and count how many points fall in each box. My approach was to count the rows of data whose height and weight lie within the boundaries of each box (lower bound inclusive, upper bound exclusive):
library(dplyr)
count_points_in_box <- function(min_height, max_height, min_weight, max_weight, data) {
data %>%
filter(heights >= min_height, heights < max_height,
weights >= min_weight, weights < max_weight) %>%
nrow()
}
final <- transformed_df %>%
rowwise() %>%
mutate(count = count_points_in_box(min_height, max_height, min_weight, max_weight, data))
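For comparison, here is a minimal sketch of an alternative that bins each variable with base R's `cut()` and cross-tabulates the bins with `table()`, instead of filtering once per box. It assumes breaks that extend one step past the maximum, which differs from the breaks defined earlier; the names `pad_breaks`, `height_box`, `weight_box` and `box_counts` are hypothetical.

```r
set.seed(42)
heights <- rnorm(100, mean = 170, sd = 10)
weights <- rnorm(100, mean = 65, sd = 15)
data <- data.frame(heights, weights)

# Assumption: breaks run from the multiple of `by` at or below the minimum
# to the first multiple of `by` strictly above the maximum
pad_breaks <- function(x, by = 5) {
  seq(floor(min(x) / by) * by, (floor(max(x) / by) + 1) * by, by = by)
}
height_breaks <- pad_breaks(data$heights)
weight_breaks <- pad_breaks(data$weights)

# right = FALSE gives [min, max) intervals, matching `>= min` and `< max`
data$height_box <- cut(data$heights, breaks = height_breaks, right = FALSE)
data$weight_box <- cut(data$weights, breaks = weight_breaks, right = FALSE)

# One row per box, including boxes with zero points
box_counts <- as.data.frame(table(height_box = data$height_box,
                                  weight_box = data$weight_box))
print(sum(box_counts$Freq) == nrow(data))   # TRUE
```

Because every point lies strictly below the last break on both axes, each point is assigned to exactly one box and the counts add up to the number of rows.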
Finally, I checked whether the box counts add up to the number of rows in the original data:
> sum(final$count)
[1] 97
> nrow(data)
[1] 100
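One way to investigate a gap like this would be to list the rows that no box can contain. The sketch below rebuilds the data and the breaks exactly as defined earlier and flags points outside the half-open range covered by the boxes; the names `covered` and `uncounted` are hypothetical, and base R is used throughout.

```r
set.seed(42)
heights <- rnorm(100, mean = 170, sd = 10)
weights <- rnorm(100, mean = 65, sd = 15)
data <- data.frame(heights, weights)

height_breaks <- seq(0, max(data$heights), by = 5)
weight_breaks <- seq(0, max(data$weights), by = 5)

# A point is covered only if it lies in [first break, last break) on both axes
covered <- data$heights >= min(height_breaks) & data$heights < max(height_breaks) &
  data$weights >= min(weight_breaks) & data$weights < max(weight_breaks)

# Rows that fall outside every box, shown next to the last break on each axis
uncounted <- data[!covered, ]
print(uncounted)
print(c(last_height_break = max(height_breaks),
        last_weight_break = max(weight_breaks)))
```

Comparing the printed rows with the last break on each axis shows whether the uncounted points sit at or above the upper edge of the grid.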
The totals are close but do not match: 3 of the 100 points are not counted in any box, so I think I have done something wrong. Can someone please help me correct this?
Thanks!