I have a pandas DataFrame like the following:
+------------+-------------------+---------+--------+----------+
| process_no | process_durations | columns | orders | customer |
+------------+-------------------+---------+--------+----------+
| 0 | 3 | [0] | [3109] | [0] |
+------------+-------------------+---------+--------+----------+
| 1 | 100 | [11] | [5855] | [0] |
+------------+-------------------+---------+--------+----------+
| 2 | 81 | [8] | [5304] | [0] |
+------------+-------------------+---------+--------+----------+
I want to combine this DataFrame with itself (every pair of rows) and, for each pair, get the total length of orders, the number of unique values in columns, and the list of unique values in customer. Then I want to sort the pairs in descending order of the orders length and ascending order of the columns length. I also need to keep the original rows of each combination. For this, I did the following:
from itertools import combinations
from operator import itemgetter
vals = data.values.tolist()
cross_combine = list(combinations(vals, r=2))
sorted_cross_combine = sorted(
    [
        (
            x,
            -(len(x[0][3]) + len(x[1][3])),
            len(set(x[0][2] + x[1][2])),
            list(set(x[0][4] + x[1][4]))
        )
        for x in cross_combine
    ],
    key=itemgetter(1, 2)
)
print(sorted_cross_combine)
[(([0, 3, [0], [3109], [0]], [1, 100, [11], [5855], [0]]), -2, 2, [0]),
(([0, 3, [0], [3109], [0]], [2, 81, [8], [5304], [0]]), -2, 2, [0]),
(([1, 100, [11], [5855], [0]], [2, 81, [8], [5304], [0]]), -2, 2, [0])]
What I want to ask is: can I do this computation while the combinations are still being generated, assuming that would be more efficient? For example, I know there is no such parameter, but I imagine something like this:
def calc(x, y):
    return (
        x + y,
        -(len(x[3]) + len(y[3])),
        len(set(x[2] + y[2])),
        list(set(x[4] + y[4]))
    )

# Imagined API: itertools.combinations has no func parameter
cross_combine = list(combinations(vals, r=2, func=calc))
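Something close to the imagined func argument can be sketched with itertools.starmap, which applies a function to each pair as combinations yields it. This is only a minimal sketch: the three sample rows are copied from the table above, and calc here keeps the pair as a tuple (x, y) instead of x + y so the original rows stay separate. It avoids building the intermediate list of pairs, but it does not reduce the number of pairs that have to be processed.

```python
from itertools import combinations, starmap
from operator import itemgetter

# Sample rows in the same layout as vals:
# [process_no, process_durations, columns, orders, customer]
vals = [
    [0, 3, [0], [3109], [0]],
    [1, 100, [11], [5855], [0]],
    [2, 81, [8], [5304], [0]],
]


def calc(x, y):
    """Return the original pair, the two sort keys, and the unique customers."""
    return (
        (x, y),
        -(len(x[3]) + len(y[3])),   # negated so ascending sort means descending length
        len(set(x[2] + y[2])),
        list(set(x[4] + y[4])),
    )


# starmap unpacks each pair into calc(x, y) as combinations produces it,
# so no separate list of raw pairs is materialized before sorting.
sorted_cross_combine = sorted(
    starmap(calc, combinations(vals, 2)),
    key=itemgetter(1, 2),
)
print(sorted_cross_combine)
```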
Or is there some other way to make the whole process more efficient? Creating the sorted_cross_combine value takes about 20 seconds for a vals list with approximately 6500 elements.
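One direction that might help, shown here as a rough sketch rather than a measured result: the per-row work (the length of orders, the sets of columns and customer) is repeated for every pair a row appears in, so it could be computed once per row up front. The names prepared and pair_summary are hypothetical, and whether this noticeably reduces the 20 seconds is an untested assumption, since the roughly 21 million pairs from 6500 rows still have to be created and sorted.

```python
from itertools import combinations
from operator import itemgetter

# Sample rows in the same layout as vals:
# [process_no, process_durations, columns, orders, customer]
vals = [
    [0, 3, [0], [3109], [0]],
    [1, 100, [11], [5855], [0]],
    [2, 81, [8], [5304], [0]],
]

# Compute the per-row pieces once instead of once per pair.
prepared = [
    (row, len(row[3]), frozenset(row[2]), frozenset(row[4]))
    for row in vals
]


def pair_summary(a, b):
    """Combine two prepared rows into (pair, -orders_len, n_unique_columns, customers)."""
    row_a, n_orders_a, cols_a, cust_a = a
    row_b, n_orders_b, cols_b, cust_b = b
    return (
        (row_a, row_b),
        -(n_orders_a + n_orders_b),
        len(cols_a | cols_b),
        list(cust_a | cust_b),
    )


sorted_cross_combine = sorted(
    (pair_summary(a, b) for a, b in combinations(prepared, 2)),
    key=itemgetter(1, 2),
)
print(sorted_cross_combine)
```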
Thanks in advance.