I have several datasets combined in one data frame, from which I want to remove outliers.
While trying different ways to calculate the upper and lower thresholds, I came across discrepancies between the limits shown in ggplot boxplots and those I calculated manually.
I would like to (1) understand the discrepancies and (2) find a convenient way to eliminate outliers from multiple similar datasets via dplyr.
Four of the datasets are given below, one for each combination of Var1 (A, B) and SSW (30, 32):
library(tidyverse)
values_A30 <- c(0.2079762,0.2605029,0.3054334,0.304067,0.8696487,0.3470931,0.2560001,0.3096838,0.2887556,0.3741472,0.2178375,0.2234682,0.2628923,0.2745458,0.5438208,0.7068278,0.6492924,1.100175,0.2740491,0.2849299,0.7562737,0.3009749,0.2598575,0.3460925,0.3265929,0.4208336,0.4353992,1.132036,0.3856708,0.1978752,0.3676808,0.4196799,0.4486595,0.3282394,0.3725664,0.385373,0.3680049,0.7875058,0.8098903,0.741165,1.260887,0.3521471,0.3883195,1.17124,0.3225514,0.3492051)
values_B30 <- c(0.2598824,0.3147266,0.3876806,0.3740659,0.9880903,0.3491571,0.2852879,0.3659836,0.3562278,0.3574071,0.2793339,0.2765582,0.326236,0.3305683,0.628697,0.7359492,0.6954842,1.139923,0.3106868,0.3187189,0.9236551,0.3218849,0.2722268,0.3102944,0.3590789,0.4290484,0.3649334,1.133538,0.3815261,0.313504,0.4090641,0.4127804,0.4103117,0.3039001,0.3421307,0.3383706,0.3697731,0.6795609,0.8174759,0.730511,1.248585,0.3350673,0.3678199,1.025086,0.3550109,0.2992851)
values_A32 <- c(0.3031411,0.6585525,0.2774704,0.3185133,0.3657107,0.36731,0.2690659,0.3000714,0.2638143,0.3952846,0.260601,0.2873786,0.3522794,0.4528319,0.2959548,0.3085563,0.2821835,0.28403,0.3282855,0.4996997,0.4005206,0.8866824,0.4036912,0.3818493,0.4250281,0.4804805,0.3840721,0.4288454,0.3920388,0.5721854,0.3303645,0.3137673,0.4255052,0.4639104,0.3755455,0.4013699,0.4690261,0.4198166,0.4578243,0.6717564)
values_B32 <- c(0.3597136,0.7568497,0.3340147,0.3257469,0.3921928,0.4232309,0.2661836,0.3098475,0.3049883,0.5052187,0.311451,0.3089702,0.367432,0.5030153,0.3493206,0.3470694,0.3631118,0.3742462,0.4100476,0.5922369,0.3922594,0.7923606,0.385271,0.3919856,0.4243319,0.4642854,0.3340272,0.3854504,0.3563194,0.5574781,0.3542073,0.3310583,0.4260903,0.5463172,0.3810555,0.3576101,0.4161085,0.4094533,0.4390219,0.6388255)
bpdata <- bind_rows(
  data.frame(Var1 = "A", SSW = 30, Value = values_A30),
  data.frame(Var1 = "B", SSW = 30, Value = values_B30),
  data.frame(Var1 = "A", SSW = 32, Value = values_A32),
  data.frame(Var1 = "B", SSW = 32, Value = values_B32)
)
I usually start with ggplot boxplots to get a visual impression.
# test plot full
ggplot(bpdata, aes(SSW, Value, group = SSW)) +
  geom_boxplot() +
  facet_wrap(~ Var1, scales = "free_y") +
  scale_y_continuous(limits = c(0, 1.5),
                     breaks = seq(0, 1.5, by = 0.1))
All four boxplots have some values marked as outliers, which I would like to remove. To make the discrepancy easier to discuss, I focus on the first combination (Var1 = A, SSW = 30).
To remove outliers with dplyr, I need the upper thresholds (and, for other data, the lower ones) in my data frame, so my first approach was to calculate them manually, based on the explanations in the geom_boxplot help page:
# manual calculation
bpstats_man <- bpdata |>
  filter(Var1 == "A", SSW == 30) |>
  summarise(
    Qu1 = quantile(Value, 0.25),
    Qu3 = quantile(Value, 0.75),
    IQR = IQR(Value)
  ) |>
  mutate(ymin = Qu1 - (1.5 * IQR),
         ymax = Qu3 + (1.5 * IQR))
However, the results (ymin = -0.05051965 and ymax = 0.8623606) differ noticeably from the whisker limits shown in the plot. For a direct comparison, I also extracted the statistics computed by geom_boxplot. As expected, these ymin and ymax values match the plot (ymin = 0.1978752 and ymax = 0.8098903).
# extract stats
bpstats_gg <- ggplot_build(
  ggplot(bpdata |> filter(Var1 == "A", SSW == 30),
         aes(x = SSW, y = Value)) +
    geom_boxplot()
)$data[[1]]
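The geom_boxplot help page describes each whisker as extending from the hinge to the most extreme value that lies no further than 1.5 * IQR from it. If that reading is right, the manual calculation gives the outlier thresholds (the "fences"), whereas ymin and ymax from ggplot_build() are actual observations: in the A/30 data, 0.8098903 is the largest value below 0.8623606, and 0.1978752 is simply the smallest value. The following minimal sketch illustrates the difference on made-up toy data; the vector x and all variable names are assumptions for illustration only.

```r
# Toy data (hypothetical values, not taken from the datasets above)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 20)

# Quartiles as computed by quantile() with its default settings
q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: the thresholds that decide which points count as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whisker ends: the most extreme observations that lie inside the fences
inside      <- x[x >= lower_fence & x <= upper_fence]
whisker_min <- min(inside)
whisker_max <- max(inside)

c(lower_fence = lower_fence, upper_fence = upper_fence,
  whisker_min = whisker_min, whisker_max = whisker_max)
```

Here the fences are -3.5 and 14.5, while the whiskers would end at 1 and 9, so the two pairs of numbers differ even though both classify 20 as the only outlier.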
In summary, I would like to (1) understand why ymin and ymax differ between the manual calculation and geom_boxplot, and (2) find a convenient way to calculate the limits, i.e., either manually or by extracting them from the geom_boxplot statistics. My goal is a comprehensible way to remove outliers from many different sets of 'Value', grouped by Var1 and SSW.
I think there could be a way using nest() and unnest() together with ggplot_build(), but I still find this difficult to understand (any hints on where to find a good tutorial are appreciated).
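For goal (2), one possible approach is to compute the thresholds per group with group_by() and filter on them directly. This is only a minimal sketch on made-up toy data: the data frame toy and the column names lower_fence and upper_fence are assumptions for illustration, and the same pipeline would presumably carry over to bpdata unchanged.

```r
library(dplyr)

# Toy data (hypothetical): two groups with one obvious outlier each
toy <- data.frame(
  Var1  = rep(c("A", "B"), each = 10),
  SSW   = 30,
  Value = c(1:9, 20, 11:19, 40)
)

cleaned <- toy |>
  group_by(Var1, SSW) |>
  # Thresholds are computed separately within each Var1/SSW group
  mutate(
    lower_fence = quantile(Value, 0.25, names = FALSE) - 1.5 * IQR(Value),
    upper_fence = quantile(Value, 0.75, names = FALSE) + 1.5 * IQR(Value)
  ) |>
  # Keep only the observations that lie within the thresholds
  filter(Value >= lower_fence, Value <= upper_fence) |>
  ungroup()

cleaned
```

Keeping the thresholds as columns makes it easy to inspect them per group before dropping any rows; they can be removed afterwards with select().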
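A possible alternative that avoids nest(), unnest() and ggplot_build() altogether is base R's boxplot.stats(), which returns the points beyond the whiskers in its out element. Note that this is a different technique from the one ggplot2 uses: boxplot.stats() is based on Tukey's hinges, which can differ slightly from the quantile() quartiles, so the flagged points may not always match the plot exactly, especially for small samples. The sketch below again uses made-up toy data (the data frame toy is an assumption for illustration).

```r
library(dplyr)

# Toy data (hypothetical): two groups with one obvious outlier each
toy <- data.frame(
  Var1  = rep(c("A", "B"), each = 10),
  SSW   = 30,
  Value = c(1:9, 20, 11:19, 40)
)

cleaned_bs <- toy |>
  group_by(Var1, SSW) |>
  # boxplot.stats()$out holds the values beyond the whiskers of each group
  filter(!Value %in% boxplot.stats(Value)$out) |>
  ungroup()

cleaned_bs
```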