I am trying to filter a dataset obtained from an RNA-Seq analysis. Each "Sequence Name" is assigned to a "topological domain" and/or a "transmembrane region". I would like to filter out all sequences that contain both a "topological domain" and a "transmembrane region", but keep the ones that contain only a "topological domain".
I have tried using the filter() function from dplyr, but I can only select one or the other. The difficulty for me is that a sequence can contain multiple "topological domain" and "transmembrane region" rows.
Here is the code I have used so far:
TMHMM_filter <- filter(TMHMM.df, TMHMM.df$Type %in% c("topological domain", "transmembrane domain"))
head(TMHMM_filter)
The "transmembrane region" rows are removed, but I still get every "Sequence Name" that has a "topological domain", including the sequences that also have a "transmembrane region".
Here is a part of my data set:
  Name                      Sequence Name  Type                 Minimum Maximum Length # Intervals
1 Cytoplasmic (potential)   XP_009600001.1 topological domain         1     743    743           1
2 Cytoplasmic (potential)   XP_009600003.1 topological domain         1     623    623           1
3 Cytoplasmic (potential)   XP_009600004.1 topological domain       360     475    116           1
4 Transmembrane (potential) XP_009600004.1 transmembrane region     339     359     21           1
5 Extracellular (potential) XP_009600004.1 topological domain       155     338    184           1
6 Transmembrane (potential) XP_009600004.1 transmembrane region     134     154     21           1
and what I would like to have is:
  Name                    Sequence Name  Type               Minimum Maximum Length # Intervals
1 Cytoplasmic (potential) XP_009600001.1 topological domain       1     743    743           1
2 Cytoplasmic (potential) XP_009600003.1 topological domain       1     623    623           1
filter(df, !any(Type != 'topological domain'), .by=Sequence.Name)
# A tibble: 2 x 4
  Name                    Sequence.Name  Type               Minimum
  <chr>                   <chr>          <chr>                <dbl>
1 Cytoplasmic (potential) XP_009600001.1 topological domain       1
2 Cytoplasmic (potential) XP_009600003.1 topological domain       1
df <- structure(list(Name = c("Cytoplasmic (potential)", "Cytoplasmic (potential)",
"Cytoplasmic (potential)", "Transmembrane (potential)", "Extracellular (potential)",
"Transmembrane (potential)"), Sequence.Name = c("XP_009600001.1",
"XP_009600003.1", "XP_009600004.1", "XP_009600004.1", "XP_009600004.1",
"XP_009600004.1"), Type = c("topological domain", "topological domain",
"topological domain", "transmembrane region", "topological domain",
"transmembrane region"), Minimum = c(1, 1, 360, 339, 155, 134
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L))
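To see why the filter() call above works, it can help to look at the per-sequence flag it computes. The following is only a minimal base R sketch, assuming the same Sequence.Name and Type values as the sample data; the vector names seq_name, type and has_other are made up for illustration. any(Type != 'topological domain') is TRUE for a sequence as soon as one of its rows is something else, and the ! in front keeps only the sequences where that flag is FALSE.

```r
# Sequence names and types copied from the sample data above.
seq_name <- c("XP_009600001.1", "XP_009600003.1", "XP_009600004.1",
              "XP_009600004.1", "XP_009600004.1", "XP_009600004.1")
type <- c("topological domain", "topological domain", "topological domain",
          "transmembrane region", "topological domain", "transmembrane region")

# One logical value per sequence: does it have any row that is
# NOT a topological domain?
has_other <- tapply(type != "topological domain", seq_name, any)
has_other

# Sequences that would survive the filter (flag is FALSE).
keep <- names(has_other)[!has_other]
keep
```

Note that the .by argument of filter() needs dplyr 1.1.0 or later; with older versions the same grouping can be done with group_by().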
A simple tidy solution like this does work on your small example.
df %>%
group_by(Sequence.Name) %>%
filter(n() == 1) %>%
ungroup()
It simply groups by sequence name and then removes every group that has more than one row. Since filter() does not ungroup automatically, I added ungroup() at the end.
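The n() == 1 check relies on every topological-only sequence having exactly one row, which holds in the sample but may not hold in the full dataset. Below is a minimal sketch of a variant that tests the Type values instead of the group size. It assumes the column names Sequence.Name and Type from the sample; the data frame df2 and the sequence XP_HYPOTHETICAL.1 are made up purely to illustrate the multi-row case.

```r
library(dplyr)

# Toy data: one single-row sequence, one mixed sequence, and one
# hypothetical sequence with two topological-domain rows and no
# transmembrane region.
df2 <- data.frame(
  Sequence.Name = c("XP_009600001.1", "XP_009600004.1", "XP_009600004.1",
                    "XP_HYPOTHETICAL.1", "XP_HYPOTHETICAL.1"),
  Type = c("topological domain", "topological domain", "transmembrane region",
           "topological domain", "topological domain"),
  stringsAsFactors = FALSE
)

# Keep a sequence only if every one of its rows is a topological domain.
kept <- df2 %>%
  group_by(Sequence.Name) %>%
  filter(all(Type == "topological domain")) %>%
  ungroup()

kept
```

With this toy input, both rows of the hypothetical sequence are kept, whereas filter(n() == 1) would drop them because that group has two rows.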