Say I have a data frame like the following:
mydf <- data.frame(id=LETTERS, locus=c(rep("alpha",14),rep("beta",12)),
group=c(rep(1,2),rep(2,6),rep(3,6),rep(4,4),rep(5,4),rep(6,4)),
pair=c(1:12,14,15,3,4,6,8,9:16))
which looks like this:
> mydf
id locus group pair
1 A alpha 1 1
2 B alpha 1 2
3 C alpha 2 3
4 D alpha 2 4
5 E alpha 2 5
6 F alpha 2 6
7 G alpha 2 7
8 H alpha 2 8
9 I alpha 3 9
10 J alpha 3 10
11 K alpha 3 11
12 L alpha 3 12
13 M alpha 3 14
14 N alpha 3 15
15 O beta 4 3
16 P beta 4 4
17 Q beta 4 6
18 R beta 4 8
19 S beta 5 9
20 T beta 5 10
21 U beta 5 11
22 V beta 5 12
23 W beta 6 13
24 X beta 6 14
25 Y beta 6 15
26 Z beta 6 16
It has an id
column as a primary key, and ids
are divided into locus
alpha and beta. Each id
in alpha should have a pair in beta (column pair
), though there might be cases of orphan alpha or beta ids
(due to prior filtering).
ids
are also grouped in different groups (column group
), which were defined on a per-locus basis.
Now I want to define another grouping variable new_group
that reflects the alpha–beta pairing information.
This seems very straightforward visually, but I am struggling to define the rules in an automatic way, so that every possible situation is taken into account.
The desired output for the example data frame above would be the following (colors just for guidance):
Note that I only define a new_group
when pair
information is found, so ids
A and B do not belong to any new_group
.
ids
C, D, F and H have pair
information in beta, so they all constitute new_group
1; since ids
E and G belong to the same group, they are pooled together (even if they do not have pair
information).
ids
from I to N have pair information in beta grouped in different groups, so new_group
2 has to be split in beta into 2a and 2b.
Essentially, rules would be:
- make a new group when at least one id of a group contains pair info in the other locus
- for ids without pair info, pool them into the same new group if other ids from their group have paired info
- groups in alpha can be paired to multiple groups in beta, and conversely groups in beta can be paired to multiple groups in alpha, in which case letter suffixes are used
- anything else I’m not thinking about…
I’m honestly not sure how to get this started… Any hint would be appreciated!