I have a dataset in the following format:
heights <- rnorm(10000, mean=170, sd=10)
weights <- rnorm(10000, mean=65, sd=15)
data <- data.frame(heights, weights)
heights weights
1 164.0554 75.21385
2 167.8416 80.20245
3 170.8382 64.86342
4 175.3897 73.40080
5 177.6491 42.82188
6 169.2133 79.28145
I used the following R code to count how many data points are in each bin of size 5×5:
max_height <- max(data$heights)
max_weight <- max(data$weights)
height_breaks <- seq(0, max_height, by=5)
weight_breaks <- seq(0, max_weight, by=5)
combinations <- expand.grid(height = seq_along(height_breaks)[-length(height_breaks)],
weight = seq_along(weight_breaks)[-length(weight_breaks)])
interval_label <- function(breaks, index) {
paste0("(", breaks[index], "-", breaks[index + 1], ")")
}
combinations$height_interval <- mapply(interval_label, list(height_breaks), combinations$height)
combinations$weight_interval <- mapply(interval_label, list(weight_breaks), combinations$weight)
height_weight_boxes <- combinations[, c("height_interval", "weight_interval")]
count_points_in_box <- function(min_height, max_height, min_weight, max_weight, data) {
data %>%
filter(height >= min_height, height <= max_height,
weight >= min_weight, weight <= max_weight) %>%
nrow()
}
library(dplyr)
transformed_df <- height_weight_boxes %>%
mutate(
box_number = row_number(),
min_height = as.numeric(sub("\((.*?)-.*", "\1", height_interval)),
max_height = as.numeric(sub(".*-(.*)\)", "\1", height_interval)),
min_weight = as.numeric(sub("\((.*?)-.*", "\1", weight_interval)),
max_weight = as.numeric(sub(".*-(.*)\)", "\1", weight_interval))
)
count_points_in_box <- function(min_height, max_height, min_weight, max_weight, data) {
data %>%
filter(heights >= min_height, heights < max_height,
weights >= min_weight, weights < max_weight) %>%
nrow()
}
final <- transformed_df %>%
rowwise() %>%
mutate(count = count_points_in_box(min_height, max_height, min_weight, max_weight, data))
From here, I want to make a heatmap of this data:
library(ggplot2)
library(viridisLite)
distinct_counts <- length(unique(final$count))
color_palette <- magma(distinct_counts)
final$color <- factor(final$count)
ggplot(final, aes(x = min_weight, y = min_height, fill = final$color)) +
geom_tile() +
scale_fill_manual(values = color_palette, guide = guide_legend(title = "Count")) +
labs(x = "Minimum Weight", y = "Minimum Height", title = "2D Heatmap of Counts") +
theme_minimal() +
theme(legend.position = "right")
Everything is looking good, but it seems that the legend is showing every possible color combination and it looks very long. I am trying to reduce the size of the legend and make it look more compact. Is it possible to reduce the size of this legend by “binning” the colors together in similar ranges?
I know that it’s possible to directly make a heatmap in R using functions like kde(). The issue is that my real dataset is very large and I am actually using SQL CASE WHEN statements to perform the counts and then bring the aggregated dataset into R to make the heatmap (i.e. “final”). From here, I am manually assigning colors and making the heatmap. In this question that I asked, the lengthy data manipulation I included above represents SQL data manipulation to get the data into the “final” format.
Generating SQL statements within R:
sql_query <- "SELECT *, CASE"
for (i in 1:nrow(df)) {
sql_query <- paste0(sql_query, " WHEN height BETWEEN ", df$min_height[i], " AND ", df$max_height[i],
" AND weight BETWEEN ", df$min_weight[i], " AND ", df$max_weight[i],
" THEN ", df$box_number[i])
}
sql_query <- paste0(sql_query, " END AS box_num FROM my_table;")
2