I have a pandas DataFrame like this:
sample closest_signature distance patient cluster correlation biopsy similarity n_biopsy
0 21506-A HU.1 0.416795 21506 HU.1 0.994611 A 2.399261 2.0
1 21506-B HU.1 0.340269 21506 HU.1 0.993852 B 2.938855 2.0
2 21507-A HU.3 0.181289 21507 HU.3 0.868993 A 5.516052 3.0
3 21507-B HU.3 0.128398 21507 HU.3 0.972968 B 7.788282 3.0
4 21507-C HU.3 0.117186 21507 HU.3 0.949540 C 8.533432 3.0
5 21521-A HU.2 0.111720 21521 HU.2 0.956889 A 8.950942 2.0
6 21521-B HU.2 0.116082 21521 HU.2 0.974804 B 8.614610 2.0
7 21531-A HU.4 0.251558 21531 ND 0.560867 A 3.975227 2.0
8 21531-B HU.2 0.197108 21531 HU.2 0.890214 B 5.073356 2.0
9 21543-A HU.2 0.184331 21543 HU.2 0.973885 A 5.425033 6.0
10 21543-B HU.2 0.151444 21543 HU.2 0.990204 B 6.603119 6.0
11 21543-C HU.2 0.156038 21543 HU.2 0.989900 C 6.408698 6.0
12 21543-D HU.2 0.196920 21543 HU.2 0.939929 D 5.078191 6.0
13 21543-E HU.1 0.234311 21543 HU.1 0.980673 E 4.267841 6.0
14 21543-F HU.2 0.152050 21543 HU.2 0.989276 F 6.576796 6.0
My goal is sorting the entire dataframe, like this:
- Records should be ordered by
closest_signature
as defined for thesample
ending with A (in other words, theclosest_signature
value to use as key is the one of the-A
sample for each group, for example if 21543-A isHU.2
, then it’sHU.2
even if other samples in the same group may be different) - Patient groups should be sorted by the highest
n_biopsy
but preserving order in a group (i.e. samples should be ordered from A to Z)
Expected result:
sample closest_signature distance patient cluster correlation biopsy similarity n_biopsy
0 21506-A HU.1 0.416795 21506 HU.1 0.994611 A 2.399261 2.0
1 21506-B HU.1 0.340269 21506 HU.1 0.993852 B 2.938855 2.0
9 21543-A HU.2 0.184331 21543 HU.2 0.973885 A 5.425033 6.0
10 21543-B HU.2 0.151444 21543 HU.2 0.990204 B 6.603119 6.0
11 21543-C HU.2 0.156038 21543 HU.2 0.989900 C 6.408698 6.0
12 21543-D HU.2 0.196920 21543 HU.2 0.939929 D 5.078191 6.0
13 21543-E HU.1 0.234311 21543 HU.1 0.980673 E 4.267841 6.0
14 21543-F HU.2 0.152050 21543 HU.2 0.989276 F 6.576796 6.0
5 21521-A HU.2 0.111720 21521 HU.2 0.956889 A 8.950942 2.0
6 21521-B HU.2 0.116082 21521 HU.2 0.974804 B 8.614610 2.0
2 21507-A HU.3 0.181289 21507 HU.3 0.868993 A 5.516052 3.0
3 21507-B HU.3 0.128398 21507 HU.3 0.972968 B 7.788282 3.0
4 21507-C HU.3 0.117186 21507 HU.3 0.949540 C 8.533432 3.0
7 21531-A HU.4 0.251558 21531 ND 0.560867 A 3.975227 2.0
8 21531-B HU.2 0.197108 21531 HU.2 0.890214 B 5.073356 2.0
If I don’t include closest_signature
the approach is very straightforward:
df = df.sort_values(["n_biopsy", "patient" ], ascending=[False, True])
df.head(10)
sample closest_signature distance patient cluster correlation biopsy similarity n_biopsy
9 21543-A HU.2 0.184331 21543 HU.2 0.973885 A 5.425033 6
10 21543-B HU.2 0.151444 21543 HU.2 0.990204 B 6.603119 6
11 21543-C HU.2 0.156038 21543 HU.2 0.989900 C 6.408698 6
12 21543-D HU.2 0.196920 21543 HU.2 0.939929 D 5.078191 6
13 21543-E HU.1 0.234311 21543 HU.1 0.980673 E 4.267841 6
14 21543-F HU.2 0.152050 21543 HU.2 0.989276 F 6.576796 6
32 21564-A HU.3 0.121428 21564 HU.3 0.975599 A 8.235334 6
33 21564-B HU.3 0.114477 21564 HU.3 0.978366 B 8.735386 6
34 21564-C HU.3 0.149845 21564 HU.3 0.983692 C 6.673560 6
35 21564-D HU.3 0.139047 21564 HU.3 0.949370 D 7.191837 6
However this of course starts with HU.2
instead of HU.1
.
I’ve tried making a composite key, mixing closest_signature
from the first A record of each group with the number of biopsies (e.g. 6_HU.1
), but again I run into a chicken-and-egg problem because closest_signature
needs to be sorted in ascending order, and the number of biopsies in descending order:
for gid, group in df.groupby("patient"):
num_biopsy = group["n_biopsy"].unique().item()
cluster = group.loc[group.index[0], "closest_signature"]
value = f"{num_biopsy}_{cluster}"
df.loc[group.index, "biopsy_cluster"] = value
df.sort_values(["biopsy_cluster", "patient"], ascending=[False, True])
sample closest_signature distance patient cluster correlation biopsy similarity n_biopsy biopsy_cluster
32 21564-A HU.3 0.121428 21564 HU.3 0.975599 A 8.235334 6 6_HU.3
33 21564-B HU.3 0.114477 21564 HU.3 0.978366 B 8.735386 6 6_HU.3
34 21564-C HU.3 0.149845 21564 HU.3 0.983692 C 6.673560 6 6_HU.3
35 21564-D HU.3 0.139047 21564 HU.3 0.949370 D 7.191837 6 6_HU.3
36 21564-E HU.3 0.125473 21564 HU.3 0.969198 E 7.969847 6 6_HU.3
Notice how the first is HU.3
. All other approaches I’ve tried lose the ordering within the groups, which is essential to keep.
Is what I’m trying to do doable?