I have a dataframe with many columns, all of which contain text data mixed with NaNs
.
I want to count the number of characters in each cell in each column and then drop any rows where all the columns have less than 5 characters (if any cells have more than 5 characters, then the row is not dropped).
I was considering making a new column with str.len
for each column and then filter out rows using that, but it seems very cumbersome.
Example:
>>> df
column_1 column_2 column_3
0 werhi dsfhjk dh ---> not filtered because some columns have more than 5 characters
1 sgds fuo g ---> filtered
2 wqyuio dsklh fhkjfj
3 fhi d fgho ---> filtered
4 sadfhkj sdjfkhs yyisdk
>>> df_filtered
column_1 column_2 column_3
0 werhio dsfhjk dh
2 wqyuio dsfjklh fhkjfj
4 sadfhkj sdjfkhs yyisdk
6
No need to create new columns.
OPTION 1: minimal total string length per row
just apply
to all columns, aggregate the total sum
and create a boolean Series for boolean indexing:
thresh = 10
out = df[df.apply(lambda x: x.str.len()).sum(axis=1).ge(thresh)]
Output:
column_1 column_2 column_3
0 werhio dsfhjk dh
2 wqyuio dsfjklh fhkjfj
4 sadfhkj sdjfkhs yyisdk
OPTION 2: minimal string length for any strings in a row
If you want to ensure all values in a column have at least a given size, compare to the threshold before aggregating with any
:
thresh = 5
df[df.apply(lambda x: x.str.len()).ge(thresh).any(axis=1)]
Output:
column_1 column_2 column_3
0 werhio dsfhjk dh
2 wqyuio dsfjklh fhkjfj
4 sadfhkj sdjfkhs yyisdk
Intermediates
OPTION 1:
# df.apply(lambda x: x.str.len())
column_1 column_2 column_3
0 6 6 2
1 4 3 1
2 6 7 6
3 3 1 4
4 7 7 6
# df.apply(lambda x: x.str.len()).sum(axis=1)
0 14
1 8
2 19
3 8
4 20
dtype: int64
# df.apply(lambda x: x.str.len()).sum(axis=1).ge(thresh)
0 True
1 False
2 True
3 False
4 True
dtype: bool
OPTION 2:
# df.apply(lambda x: x.str.len().ge(5))
column_1 column_2 column_3
0 True True False
1 False False False
2 True True True
3 False False False
4 True True True
# df.apply(lambda x: x.str.len().ge(5)).any(axis=1)
0 True
1 False
2 True
3 False
4 True
dtype: bool
4
import pandas as pd
import numpy as np
data = {
'col1': ['abc', 'de', 'fghi', 'jklmn'],
'col2': ['opq', 'r', 'stuv', 'wx'],
'col3': ['y', 'z', '123', '45678']
}
df = pd.DataFrame(data)
threshold = 3
# If you want the sum of all values in a row to exceed a threshold:
row_filter = df.apply(lambda col: col.str.len()).sum(axis=1) >= threshold
# If you want the minimum value of each cell in a row to exceed a threshold
row_filter = df.apply(lambda col: col.str.len()).min(axis=1) >= threshold
filtered_df = df[row_filter]
df.apply(lambda col: col.str.len())
will create a new DF where each cell represents the length of the same cell in the original DFsum(axis=1)
will sum the length of all string in the row>= threshold
will filter out rows that are less than the desired lengthfiltered_df = df[row_filter]
will filter the original data frame to exclude rows that don’t match your predicate
3