I would like to go through my dataframe `df` and set specific values to missing.
Specifically, from another dataframe `bounds`, I look up column-specific lower and upper limits and check whether given `start` and `end` values fall inside them. If the values don't fall within exactly one bounds row, I want to set the whole column in `df` to missing. If they do, I then check whether the position encoded in each `df` index label falls inside those same bounds; if it doesn't, I want to set that specific cell to missing.
The format of my bounds dataframe looks like this:
| index | lower | upper |
|-------|-------|-------|
| Col1  | 1     | 102   |
| Col1  | 103   | 244   |
| Col1  | 245   | 440   |
| Col2  | 1     | 99    |
| Col2  | 100   | 288   |
| Col2  | 289   | 427   |
and the format of index in df is “Alice:220:A:B”, “Bob:87:B:C”, etc.
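For reference, the numeric position can be pulled out of index labels in that format in one vectorized step instead of splitting per row (a sketch using pandas' string accessor on a plain string index):

```python
import pandas as pd

# example labels in the same "Name:position:A:B" format as my index
idx = pd.Index(["Alice:220:A:B", "Bob:87:B:C"])
# second ":"-separated field is the numeric position
pos = idx.str.split(":").str[1].astype(int)  # positions 220 and 87
```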
My current approach is inefficient and throws a `FutureWarning` when I use `None` as the missing value – the dtype in `df` is float64. When I try `np.nan` instead, I get a `SettingWithCopyWarning` about modifying a copy rather than the original, and I want to modify the original.
This is what I have tried so far:
```python
def filter_by_bounds(self, start, end, df):
    for cur_col in df.columns:
        cur_bounds = self.bounds.loc[
            (self.bounds.index == cur_col)
            & (self.bounds["lower"] <= start)
            & (self.bounds["upper"] >= start)
        ]
        for idx in df.index:
            cur_pos = int(idx.split(":")[1])
            # First check start and end fall into one row in bounds
            if len(cur_bounds) != 1:
                df.loc[:, cur_col] = None
                break
            # Then check if the current position falls into the current bounds
            elif (
                cur_bounds["lower"].values[0] > cur_pos
                or cur_bounds["upper"].values[0] < cur_pos
            ):
                df.loc[df.index == idx, cur_col] = None
    return df
```
This sort of works, but I would much prefer to get rid of the nested loop and use masking, for example, for efficiency. Also, my current way of setting values to missing is problematic and I would like to do it properly.
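To show the direction I'm aiming for, here is a masked version I sketched out (standalone function rather than a method, and, like my code above, it only checks `start` against the bounds – I haven't verified it on my real data):

```python
import numpy as np
import pandas as pd

def filter_by_bounds(bounds, start, end, df):
    # Parse the numeric position out of "Name:pos:A:B" labels once, vectorized
    pos = df.index.str.split(":").str[1].astype(int)
    for col in df.columns:
        rows = bounds.loc[
            (bounds.index == col)
            & (bounds["lower"] <= start)
            & (bounds["upper"] >= start)
        ]
        if len(rows) != 1:
            # start doesn't fall into exactly one bounds row -> whole column missing
            df[col] = np.nan
        else:
            lo = rows["lower"].iloc[0]
            up = rows["upper"].iloc[0]
            # mask only the cells whose index position is outside the bounds
            df.loc[(pos < lo) | (pos > up), col] = np.nan
    return df
```

This keeps a loop over columns but replaces the inner row loop with a boolean mask, and assigns `np.nan` through `.loc` on the original frame, which I believe avoids the copy warning.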
Any help to fix this would be appreciated 🙂