I have a DataFrame that I am rounding. After rounding, I subtract the original from the result. This gives me a DataFrame with the same shape as the original, containing the amount of change the rounding operation caused in each cell.
I need to transform this into a Boolean DataFrame where the max of each row is flagged True and everything else in the row is False. All steps but the final one are handled with vectorized functions, but I can't figure out how to vectorize the last step. This is what I am currently doing:
import pandas as pd

a = pd.DataFrame([[2.290119, 5.300725, 17.266693, 75.134857, 0.000000, 0.000000, 0.007606],
                  [0.000000, 7.560276, 55.579175, 36.858266, 0.000000, 0.000000, 0.002284],
                  [0.001574, 15.225538, 39.309742, 45.373800, 0.000951, 0.001198, 0.087197],
                  [0.000000, 55.085390, 15.547927, 29.327661, 0.000000, 0.017691, 0.021331],
                  [0.000000, 66.283488, 15.636673, 17.912315, 0.000000, 0.003185, 0.164339]])
b = a.round(-1) # round to 10's place (not 10ths)
c = b-a
round_modifier = c.apply(lambda x: x.eq(x.max()), axis="columns")
print(round_modifier)
0 1 2 3 4 5 6
0 False False False True False False False
1 False False True False False False False
2 False True False False False False False
3 False True False False False False False
4 False False True False False False False
I am aware of DataFrame.idxmax(axis="columns"), which gives me the column name (for each row) where the max is found, but I can't seem to find a (Pythonic) way to take that and populate the corresponding flag with a True. The lambda expression I'm using gives the correct result, but I'm hoping for a faster method.
For anyone wondering, the use case is that I want to round the values in the original data frame to the tens place, such that they sum to 100. I have pre-scaled this data so it should be close, but the rounding can cause the sum to come to 90 or 110. I intend to use this T/F matrix to decide which rounded value caused the most delta, then round it in the opposite direction since this is the minimum impact method with which to coerce the series to properly sum to 100 in chunks of 10.
Simply use max and eq:
c.eq(c.max(axis=1), axis=0)
Output:
0 1 2 3 4 5 6
0 False False False True False False False
1 False False True False False False False
2 False True False False False False False
3 False True False False False False False
4 False False True False False False False
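As a quick sanity check, here is a minimal sketch comparing this with the apply-based version from the question. The small frame and its values are made up for illustration, and the second row contains a deliberate tie to show that eq flags every cell equal to the row max, not just the first one.

```python
import pandas as pd

# Made-up values; the second row has a tie for the maximum.
c = pd.DataFrame([[0.5, 4.7, 2.7],
                  [3.0, 1.0, 3.0]])

# Row maxima compared along the index, so each row is checked against its own max.
vectorized = c.eq(c.max(axis=1), axis=0)

# The apply-based version from the question, for comparison.
via_apply = c.apply(lambda x: x.eq(x.max()), axis="columns")

# Both approaches produce the same flags.
same = bool((vectorized.to_numpy() == via_apply.to_numpy()).all())
print(same)

# Number of True flags per row; the tied row gets two.
print(vectorized.sum(axis=1).tolist())
```

If exactly one flag per row is required even when values tie, an idxmax- or argmax-based approach is probably the safer choice, since those pick the first maximum only.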
You can use idxmax to get the label of the column holding each row's max value, and use NumPy broadcasting to compare that label against all the column labels.
m = c.columns.to_numpy() == c.idxmax(axis=1).to_numpy()[:, None]
new_df = pd.DataFrame(np.where(m, True, False), columns=c.columns)
End result:
0 1 2 3 4 5 6
False False False True False False False
False False True False False False False
False True False False False False False
False True False False False False False
False False True False False False False
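To illustrate the broadcasting step, here is a small sketch on a made-up frame with string column labels and a custom index (both are assumptions for the example, not part of the question's data). Since the comparison already yields a Boolean array, it can be passed to the DataFrame constructor directly, and passing the index keeps the original row labels.

```python
import numpy as np
import pandas as pd

# Made-up values with non-integer labels; the second row has a tie.
c = pd.DataFrame([[0.5, 4.7, 2.7],
                  [3.0, 1.0, 3.0]],
                 columns=["x", "y", "z"], index=["r1", "r2"])

# One column label per row, reshaped to a column vector of shape (n_rows, 1).
labels = c.idxmax(axis=1).to_numpy()[:, None]

# Broadcasting (n_cols,) against (n_rows, 1) gives an (n_rows, n_cols) Boolean array.
m = c.columns.to_numpy() == labels

new_df = pd.DataFrame(m, index=c.index, columns=c.columns)
print(new_df)
```

Because idxmax returns the first maximum, the tied row ends up with a single True flag here.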
Use idxmax in combination with NumPy advanced indexing on an all-False array to create a Boolean mask where the maximum value in each row is flagged as True:
import pandas as pd
import numpy as np
a = pd.DataFrame([[2.290119, 5.300725, 17.266693, 75.134857, 0.000000, 0.000000, 0.007606],
[0.000000, 7.560276, 55.579175, 36.858266, 0.000000, 0.000000, 0.002284],
[0.001574, 15.225538, 39.309742, 45.373800, 0.000951, 0.001198, 0.087197],
[0.000000, 55.085390, 15.547927, 29.327661, 0.000000, 0.017691, 0.021331],
[0.000000, 66.283488, 15.636673, 17.912315, 0.000000, 0.003185, 0.164339]])
# Round the dataframe to the tens place
b = a.round(-1)
# Calculate the difference
c = b - a
# Get the index of the maximum value for each row
max_indices = c.idxmax(axis="columns")
# Create a Boolean mask where True corresponds to the maximum value in each row
round_modifier = np.zeros_like(c, dtype=bool)
# Use advanced indexing to set the True values
round_modifier[np.arange(len(c)), max_indices] = True
round_modifier_df = pd.DataFrame(round_modifier, columns=a.columns)
print(round_modifier_df)
Output:
0 1 2 3 4 5 6
0 False False False True False False False
1 False False True False False False False
2 False True False False False False False
3 False True False False False False False
4 False False True False False False False
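One caveat: idxmax returns column labels, and the snippet above uses them as positions, which works because the columns are the default 0..6 integers. A possible variant for arbitrary column labels is to take positions with argmax on the underlying array instead. This is only a sketch on made-up data with hypothetical labels "x", "y", "z".

```python
import numpy as np
import pandas as pd

# Made-up values with string column labels.
c = pd.DataFrame([[0.5, 4.7, 2.7],
                  [3.0, 1.0, 3.0]],
                 columns=["x", "y", "z"])

values = c.to_numpy()

# Position (not label) of the first maximum in each row.
max_pos = values.argmax(axis=1)

# Start from an all-False array and flag one cell per row.
round_modifier = np.zeros(values.shape, dtype=bool)
round_modifier[np.arange(len(values)), max_pos] = True

round_modifier_df = pd.DataFrame(round_modifier, index=c.index, columns=c.columns)
print(round_modifier_df)
```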
If you are looking for a vectorized solution, one option is to do the comparison in NumPy. Once you have done the rounding, you can pass the whole underlying array to np.max and compare it against the row maxima in a single step. This should be considerably faster than a row-wise apply, since it avoids a Python-level function call per row. If you are interested in reading more about the NumPy max function, here is the link: https://numpy.org/doc/stable/reference/generated/numpy.max.html
import pandas as pd
import numpy as np
a = pd.DataFrame([[2.290119, 5.300725, 17.266693, 75.134857, 0.000000, 0.000000, 0.007606],
[0.000000, 7.560276, 55.579175, 36.858266, 0.000000, 0.000000, 0.002284],
[0.001574, 15.225538, 39.309742, 45.373800, 0.000951, 0.001198, 0.087197],
[0.000000, 55.085390, 15.547927, 29.327661, 0.000000, 0.017691, 0.021331],
[0.000000, 66.283488, 15.636673, 17.912315, 0.000000, 0.003185, 0.164339]])
b = a.round(-1) # round to 10's place (not 10ths)
c = b - a
# Use numpy for max and comparison
max_in_rows = np.max(c.values, axis=1)[:, np.newaxis]
round_modifier = c.values == max_in_rows
round_modifier_df = pd.DataFrame(round_modifier, columns=a.columns)
print(round_modifier_df)
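For the use case described in the question (nudging the rounded values so each row sums to 100), the same kind of mask could be applied as below. This is only a rough sketch under several assumptions: the data is made up, each row is off by at most one step of 10, there are no ties within a row, and rows summing to 90 are fixed with the row minimum of c rather than the maximum.

```python
import pandas as pd

# Made-up rows that each sum to 100 before rounding.
a = pd.DataFrame([[2.3, 5.3, 17.3, 75.1],    # rounds to a total of 110
                  [3.0, 14.5, 34.0, 48.5]])  # rounds to a total of 90

b = a.round(-1)        # round to the tens place
c = b - a              # change caused by rounding
total = b.sum(axis=1)  # row sums after rounding

# Row-level conditions as column vectors so they broadcast across columns.
over = (total > 100).to_numpy()[:, None]
under = (total < 100).to_numpy()[:, None]

# Cells that were rounded up the most / rounded down the most in each row.
max_mask = c.eq(c.max(axis=1), axis=0).to_numpy()
min_mask = c.eq(c.min(axis=1), axis=0).to_numpy()

# Pull one value down by 10 in rows that overshoot, push one up in rows that undershoot.
adjusted = b - 10 * (max_mask & over) + 10 * (min_mask & under)
print(adjusted)
print(adjusted.sum(axis=1).tolist())
```

Rows that are off by more than 10, or that contain ties, would need extra handling that this sketch does not attempt.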