Thiết kế website giá rẻ

Question

I have a pandas DataFrame like this:

sample  closest_signature   distance    patient cluster correlation biopsy  similarity  n_biopsy
0   21506-A HU.1    0.416795    21506   HU.1    0.994611    A   2.399261    2.0
1   21506-B HU.1    0.340269    21506   HU.1    0.993852    B   2.938855    2.0
2   21507-A HU.3    0.181289    21507   HU.3    0.868993    A   5.516052    3.0
3   21507-B HU.3    0.128398    21507   HU.3    0.972968    B   7.788282    3.0
4   21507-C HU.3    0.117186    21507   HU.3    0.949540    C   8.533432    3.0
5   21521-A HU.2    0.111720    21521   HU.2    0.956889    A   8.950942    2.0
6   21521-B HU.2    0.116082    21521   HU.2    0.974804    B   8.614610    2.0
7   21531-A HU.4    0.251558    21531   ND  0.560867    A   3.975227    2.0
8   21531-B HU.2    0.197108    21531   HU.2    0.890214    B   5.073356    2.0
9   21543-A HU.2    0.184331    21543   HU.2    0.973885    A   5.425033    6.0
10  21543-B HU.2    0.151444    21543   HU.2    0.990204    B   6.603119    6.0
11  21543-C HU.2    0.156038    21543   HU.2    0.989900    C   6.408698    6.0
12  21543-D HU.2    0.196920    21543   HU.2    0.939929    D   5.078191    6.0
13  21543-E HU.1    0.234311    21543   HU.1    0.980673    E   4.267841    6.0
14  21543-F HU.2    0.152050    21543   HU.2    0.989276    F   6.576796    6.0

My goal is sorting the entire dataframe, like this:

Records should be ordered by closest_signature as defined for the sample ending with A (in other words, the closest_signature value to use as key is the one of the -A sample for each group, for example if 21543-A is HU.2, then it’s HU.2 even if other samples in the same group may be different)
Patient groups should be sorted by the highest n_biopsy but preserving order in a group (i.e. samples should be ordered from A to Z)

Expected result:

sample  closest_signature   distance    patient cluster correlation biopsy  similarity  n_biopsy
0   21506-A HU.1    0.416795    21506   HU.1    0.994611    A   2.399261    2.0
1   21506-B HU.1    0.340269    21506   HU.1    0.993852    B   2.938855    2.0
9   21543-A HU.2    0.184331    21543   HU.2    0.973885    A   5.425033    6.0
10  21543-B HU.2    0.151444    21543   HU.2    0.990204    B   6.603119    6.0
11  21543-C HU.2    0.156038    21543   HU.2    0.989900    C   6.408698    6.0
12  21543-D HU.2    0.196920    21543   HU.2    0.939929    D   5.078191    6.0
13  21543-E HU.1    0.234311    21543   HU.1    0.980673    E   4.267841    6.0
14  21543-F HU.2    0.152050    21543   HU.2    0.989276    F   6.576796    6.0
5   21521-A HU.2    0.111720    21521   HU.2    0.956889    A   8.950942    2.0
6   21521-B HU.2    0.116082    21521   HU.2    0.974804    B   8.614610    2.0
2   21507-A HU.3    0.181289    21507   HU.3    0.868993    A   5.516052    3.0
3   21507-B HU.3    0.128398    21507   HU.3    0.972968    B   7.788282    3.0
4   21507-C HU.3    0.117186    21507   HU.3    0.949540    C   8.533432    3.0
7   21531-A HU.4    0.251558    21531   ND  0.560867    A   3.975227    2.0
8   21531-B HU.2    0.197108    21531   HU.2    0.890214    B   5.073356    2.0

If I don’t include closest_signature the approach is very straightforward:


df = df.sort_values(["n_biopsy", "patient" ], ascending=[False, True])
df.head(10)

    sample  closest_signature   distance    patient cluster correlation biopsy  similarity  n_biopsy
9   21543-A HU.2    0.184331    21543   HU.2    0.973885    A   5.425033    6
10  21543-B HU.2    0.151444    21543   HU.2    0.990204    B   6.603119    6
11  21543-C HU.2    0.156038    21543   HU.2    0.989900    C   6.408698    6
12  21543-D HU.2    0.196920    21543   HU.2    0.939929    D   5.078191    6
13  21543-E HU.1    0.234311    21543   HU.1    0.980673    E   4.267841    6
14  21543-F HU.2    0.152050    21543   HU.2    0.989276    F   6.576796    6
32  21564-A HU.3    0.121428    21564   HU.3    0.975599    A   8.235334    6
33  21564-B HU.3    0.114477    21564   HU.3    0.978366    B   8.735386    6
34  21564-C HU.3    0.149845    21564   HU.3    0.983692    C   6.673560    6
35  21564-D HU.3    0.139047    21564   HU.3    0.949370    D   7.191837    6

However this of course starts with HU.2 instead of HU.1.

I’ve tried making a composite key, mixing closest_signature from the first A record of each group with the number of biopsies (e.g. 6_HU.1), but again I run into a chicken-and-egg problem because closest_signature needs to be sorted in ascending order, and the number of biopsies in descending order:

for gid, group in df.groupby("patient"):
   num_biopsy = group["n_biopsy"].unique().item()
   cluster = group.loc[group.index[0], "closest_signature"]
   value = f"{num_biopsy}_{cluster}"
   df.loc[group.index, "biopsy_cluster"] = value
df.sort_values(["biopsy_cluster", "patient"], ascending=[False, True])

sample  closest_signature   distance    patient cluster correlation biopsy  similarity  n_biopsy    biopsy_cluster
32  21564-A HU.3    0.121428    21564   HU.3    0.975599    A   8.235334    6   6_HU.3
33  21564-B HU.3    0.114477    21564   HU.3    0.978366    B   8.735386    6   6_HU.3
34  21564-C HU.3    0.149845    21564   HU.3    0.983692    C   6.673560    6   6_HU.3
35  21564-D HU.3    0.139047    21564   HU.3    0.949370    D   7.191837    6   6_HU.3
36  21564-E HU.3    0.125473    21564   HU.3    0.969198    E   7.969847    6   6_HU.3

Notice how the first is HU.3. All other approaches I’ve tried lose the ordering within the groups, which is essential to keep.

Is what I’m trying to do doable?

Thiết kế website giá rẻ

Danh mục

Pandas: group and sort entire dataframe basing on the value of two columns, preserving within-group order