I have searched for an answer over the past two days and can’t figure this out. I have tidyverse
loaded but can use any solution.
I have a data frame structured like this. These are arbitrary values to use as an example:
Group Match_ID Gene TID Species
A Id224 Gene1 692 Species A
A Gene2 12589
A Gene3 966702
A Gene4 158
B Gene5 222985
B Gene6 225187
C Id456 Gene7 112
C Gene8 2256
C Id111 Gene9 658792 Species B
C Gene10 368800
C Gene11 22915
C Gene12 692
C Gene13 12589
D Id908 Gene14 966702 Species K
D Gene15 158
D Gene16 2256
D Gene17 658792
I want to group by the Group
column. Then check if there is one or more values in column Match_ID
. If yes, for all rows with a value in Match_ID
, check if the row(s) with the value in Match_ID
also have a value in the Species
column. If yes, keep that row. Keep the first row if there is more than one row that has a value in both Match_Id
and Species
. If the Group
has no value in Match_ID
, keep the first row of that group.
The final data frame should look like this with one row per Group
:
Group Match_ID Gene TID Species
A Id224 Gene1 692 Species A
B Gene5 222985
C Id111 Gene9 658792 Species B
D Id908 Gene15 158 Species K
I have been playing around with if/else filtering, case_when()
and just general filter()
but haven’t been able to get the code figured out for this use case.
I have this working partially but it doesn’t account for if the first row doesn’t have the match or if there is more than one match per group:
working <- working %>% group_by(Group) %>% filter(row_number() == 1)
How can I structure my filtering to achieve my desired result?