I have a dataset with a data structure similar to Iris, so I will use that as an example.
I am trying to identify subsamples with outlier difference patterns between two continuous variables:
1) For each ‘Species’ I am calculating the correlation coefficient between Sepal.Length and Sepal.Width. As the first step, I want to know which of these coefficients might be outliers (in this case defined as lying more than 0.5 × IQR beyond the quartiles, not the typical 1.5 × IQR). I can figure that step out, but I am having trouble with the next step:
2) If there are outliers, then for each of them I would like to plot a ggplot histogram (like this: https://r-graph-gallery.com/histogram_several_group.html) of the difference between Sepal.Length and Sepal.Width, comparing the cases in the outlier group with the total sample, so that I can look at the total-sample pattern versus the outlier pattern.
For instance, by the definition below, ‘setosa’ is an outlier, so in this case, the loop should construct one histogram: a histogram that compares setosa difference between Sepal.Length and Sepal.Width to the total sample difference between Sepal.Length and Sepal.Width.
I am assuming that I need a for loop for the resulting outliers, but I can’t figure out how to set it up.
Attempt:
library(datasets)
library(ggplot2)
library(dplyr)
library(hrbrthemes)
iris <- iris
# step 1
corr <- iris %>% group_by(Species) %>% summarise(correlation = cor(Sepal.Length,
Sepal.Width, method='spearman'))
Q1 <- quantile(corr$correlation, .25)
Q3 <- quantile(corr$correlation, .75)
IQR <- IQR(corr$correlation)
outliers <- subset(corr, corr$correlation<(Q1 - 0.5*IQR) | corr$correlation>(Q3 +
0.5*IQR))
# step 2
for (x in outliers) {
p <- data %>%
ggplot( aes(x=, fill=)) +
geom_histogram( color="#e9ecef", alpha=0.6, position = '') +
scale_fill_manual(values=c("#69b3a2", "#404080")) +
theme_ipsum() +
labs(fill="")
print(x)
}
You might want to wrap an approach like the one below in a function f:
library(tidyverse)
library(hrbrthemes)
# re-write
o =
iris %>%
summarise(Correlation = cor(Sepal.Length, Sepal.Width, method="spearman"),
.by=Species) %>%
filter(Correlation < (quantile(Correlation, .25) - (i<-.5*IQR(Correlation))) |
Correlation > (quantile(Correlation, .75) + i))
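# \(...) is R's shorthand for function(...) (R >= 4.1)
f = \(df0, ll, x1, x2, grp) {
  # grp is accepted but not used yet; the grouping column is hard-coded as Species
  bind_rows(mutate(df0, Diff={{x1}}-{{x2}}, Reference="all"),  # total sample
            map(ll, \(x) filter(df0, Species==x) %>%           # one block per outlier
                  mutate(Diff={{x1}}-{{x2}}, Reference=x))) %>%
    ggplot() +
    geom_histogram(aes(x=Diff, fill=Reference), alpha=.2) +
    theme_ipsum()
}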
f(iris, o$Species, Sepal.Length, Sepal.Width, Species)
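If you would rather have one separate histogram per outlier, as the question asks, a plain for loop also works. The following is only a sketch under a few assumptions: it recomputes step 1 from the question, iterates over the species names in outliers$Species (looping over the data frame itself would iterate over its columns), overlays the two histograms with position = "identity", and leaves out theme_ipsum() to keep the dependencies small. The names all_rows, one_group and plots are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Step 1 as in the question: per-species Spearman correlation, 0.5 * IQR fences
corr <- iris %>%
  group_by(Species) %>%
  summarise(correlation = cor(Sepal.Length, Sepal.Width, method = "spearman"))
q <- quantile(corr$correlation, c(.25, .75), names = FALSE)
fence <- 0.5 * IQR(corr$correlation)
outliers <- corr %>%
  filter(correlation < q[1] - fence | correlation > q[2] + fence)

# Difference for every row, labelled as the total sample
all_rows <- iris %>%
  mutate(Diff = Sepal.Length - Sepal.Width, Reference = "all")

# Step 2: one histogram per outlier species, each compared with the total sample
plots <- list()
for (sp in as.character(outliers$Species)) {
  one_group <- all_rows %>%
    filter(Species == sp) %>%
    mutate(Reference = sp)
  plots[[sp]] <- bind_rows(all_rows, one_group) %>%
    ggplot(aes(x = Diff, fill = Reference)) +
    geom_histogram(colour = "#e9ecef", alpha = 0.6,
                   position = "identity", bins = 30) +
    labs(fill = "")
  print(plots[[sp]])  # ggplot objects need an explicit print() inside a loop
}
```

Storing the plots in a named list means they can be printed again or saved later, for example with ggsave(), without rerunning the loop.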
Note: if you would like to assign self-chosen colours, you need to implement a routine that deals with different lengths (cardinality) of outliers$Species to account for a varying number of colours (reference + length of outliers$Species, here 2). I recommend using a standard discrete colour palette.
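One way such a routine could look is sketched below. make_fill_values is a hypothetical helper, not part of any package; it assumes the reference group is labelled "all" as in f above, and it picks the per-outlier colours from base R's "Dark 3" HCL palette, which is an arbitrary choice.

```r
# Hypothetical helper: one fixed colour for the reference group plus one
# palette colour per outlier group, returned as a named vector.
make_fill_values <- function(groups, reference = "all",
                             reference_colour = "grey40") {
  groups <- as.character(groups)
  # Guard the empty case explicitly instead of relying on the palette function
  group_colours <- if (length(groups) == 0) {
    character(0)
  } else {
    grDevices::hcl.colors(length(groups), palette = "Dark 3")
  }
  stats::setNames(c(reference_colour, group_colours), c(reference, groups))
}

# The vector grows with the number of outlier groups
one_outlier  <- make_fill_values("setosa")
two_outliers <- make_fill_values(c("setosa", "virginica"))
```

Because the vector is named, it can be passed to the plot as scale_fill_manual(values = make_fill_values(o$Species)), and each colour is matched to its Reference level by name rather than by position.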
EDIT
Whether to use Spearman’s correlation coefficient, which assesses monotonic relationships (whether linear or not), instead of Pearson’s, which assesses linear relationships, depends entirely on the data. We haven’t seen yours.
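To illustrate the difference, here is a small sketch on made-up data (x and y are invented for this example and have nothing to do with iris): y increases strictly with x, but not linearly.

```r
x <- 1:20
y <- exp(x / 2)  # strictly increasing, clearly non-linear

# Spearman works on ranks, so a strictly increasing relationship
# gives a coefficient of 1 (up to floating point)
spearman <- cor(x, y, method = "spearman")

# Pearson measures linear association and stays below 1 here
pearson <- cor(x, y, method = "pearson")

c(spearman = spearman, pearson = pearson)
```

If your groups show curved but monotonic patterns like this one, the two coefficients can rank the groups differently, which would also change which groups are flagged as outliers.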
W.r.t. the toy data given, you might want to use much simpler visualisation strategies like
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, colour=Species)) +
geom_point() +
stat_ellipse() +
theme_ipsum()