What is a the more efficient way to bin the amount
column into different bucket and get the length of each bucket.
buckets are amount
1. amount < 10000
2. amount >=10000 & <100000
3. amount >100000 & <500000
4. amount > 500000
I was trying to implement the above question using:
sample_data =
date amount type
0 2018-09-28 4000.0 D
1 2018-11-23 2000.0 D
2 2018-12-27 52.5 D
3 2018-10-02 20000.0 D
4 2018-11-27 4000.0 C
5 2018-06-01 500.0 D
6 2018-07-02 5000.0 D
7 2018-07-02 52.5 D
8 2018-10-31 500.0 D
9 2018-11-26 2000.0 C
sample_data['Transactions_bin_1'] = sample_data[sample_data.amount < 10000]['amount']
sample_data['Transactions_bin_2'] = sample_data[(sample_data['amount'] >= 10000) & (sample_data['amount'] < 100000)]['amount']
sample_data['Transactions_bin_3'] = sample_data[(sample_data['amount'] >= 100000) & (sample_data['amount'] < 500000)]['amount']
sample_data['Transactions_bin_4'] = sample_data[sample_data.amount > 500000]['amount']
bin_classification =
{
'bin1' : sample_data.Transactions_bin_1.count(),
'bin2' : sample_data.Transactions_bin_2.count(),
'bin3' : sample_data.Transactions_bin_3.count(),
'bin4' : sample_data.Transactions_bin_4.count()
}
0
use pd.cut
: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html:
import numpy as np
bins = [-np.inf,10000,100000,500000,np.inf]
labels = ['amount < 10000' ,'amount >=10000 & <100000','amount >=100000 & <500000', 'amount >= 500000']
df['bins'] = pd.cut(df.amount, bins=bins, , labels=labels, right=False, include_lowest=True)
OUTPUT
:
date amount type bins
0 2018-09-28 4000.0 D amount < 10000
1 2018-11-23 2000.0 D amount < 10000
2 2018-12-27 52.5 D amount < 10000
3 2018-10-02 20000.0 D amount >=10000 & <100000
4 2018-11-27 4000.0 C amount < 10000
5 2018-06-01 500.0 D amount < 10000
6 2018-07-02 5000.0 D amount < 10000
7 2018-07-02 52.5 D amount < 10000
8 2018-10-31 500.0 D amount < 10000
9 2018-11-26 2000.0 C amount < 10000
NOTE: you can manipulate the labels
list if required.
4
Use pd.cut
and pd.IntervalIndex.from_breaks
:
>>> pd.cut(df["amount"], closed="left",
bins=pd.IntervalIndex.from_breaks([0, 10000, 100000, 500000, np.inf],
.value_counts()
[0.0, 10000.0) 9
[10000.0, 100000.0) 1
[100000.0, 500000.0) 0
[500000.0, inf) 0
Name: amount, dtype: int64