I am working on this database of heart_diseases and I need to remove outliers from the numeric columns, using a threshold S equal to 1.5 times the interquartile range (Q3 - Q1).
A value is an outlier if it is less than Q1 - S or greater than Q3 + S.
The numeric columns of this database are:
col_numeriche = ['age', 'trestbps', 'chol', 'restecg', 'thalach', 'oldpeak', 'ca']
The rule is to remove rows that have at least one outlier value in a column.
So I created this function:
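def remove_outliers(df, colnames):
    for colname in colnames:
        n = df[colname]
        q1 = n.quantile(0.25)
        q3 = n.quantile(0.75)
        S = 1.5 * (q3 - q1)
        lower_bound = q1 - S
        upper_bound = q3 + S
        # Keep only rows within the bounds; df is reassigned here, so the
        # next column's quartiles are computed on the already-filtered rows
        df = df[(n >= lower_bound) & (n <= upper_bound)]
    return df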
This function deletes nearly 130 rows and leaves only 1 no_disease row in the target column, which seems weird.
Then I tried the solutions described here and on this one, which are very similar to my question, but when I use the .loc method or .index I get 0 no_disease values.
I am confused about which solution is right: should I use the index method, or is my function correct?
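For comparison, here is a minimal sketch of one possible reading of the rule: the bounds for every column are computed once on the unfiltered data, and a row is dropped if any listed column falls outside its bounds. This differs from the loop above, where each column's quartiles are computed on rows that earlier iterations have already filtered. The function name remove_outliers_once, the parameter k, and the toy data are hypothetical and only for illustration; whether this reading is the intended one is an assumption.

```python
import pandas as pd


def remove_outliers_once(df, colnames, k=1.5):
    """Drop rows with at least one value outside [Q1 - S, Q3 + S], S = k * IQR.

    Assumption: bounds are computed once per column on the unfiltered data.
    """
    cols = df[colnames]
    q1 = cols.quantile(0.25)  # Series indexed by column name
    q3 = cols.quantile(0.75)
    s = k * (q3 - q1)
    # Boolean frame: True where a value lies inside its own column's bounds.
    # NaN compares as False, so rows with missing values are dropped too.
    inside = cols.ge(q1 - s, axis=1) & cols.le(q3 + s, axis=1)
    # Keep only rows where every listed column is inside its bounds.
    return df[inside.all(axis=1)]


# Toy data (hypothetical): column "a" has one extreme value, "b" has none.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],
    "b": [10, 11, 12, 13, 14],
    "label": ["x", "y", "x", "y", "x"],
})
cleaned = remove_outliers_once(toy, ["a", "b"])
print(cleaned)
```

Comparing the number of rows this returns with the number returned by the loop version, on the same data, would show how much of the row loss comes from recomputing the quartiles after each filter.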