I’m reading ~4,000 Excel files, each with a single sheet of around 200 rows and 10 columns.
To validate the DataFrame I build for each file, I run nine simple checks on the content and format of the spreadsheet. The whole process takes ~6 minutes, which is acceptable.
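For context, the loop looks roughly like this (the 'data/*.xlsx' pattern and the checks list are placeholders, not my real paths or function names):

import glob
import pandas as pd

# stand-ins: in my real code `checks` contains the nine validation functions
# (and now check_10), and the glob pattern points at the real folder
checks = [lambda frame: not frame.empty]

results = {}
for path in glob.glob('data/*.xlsx'):
    frame = pd.read_excel(path)              # one sheet, ~200 rows x 10 columns
    results[path] = all(check(frame) for check in checks)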
Now I added another check, check_10, and suddenly the process takes ~20 minutes.
This check targets four columns of the spreadsheet, which look like the DataFrame below:
import pandas as pd
df = pd.DataFrame({'id': ['381', '381', '381', '381', '381', '381', '381', '381', '381', '381', '382', '382', '382', '382', '382', '382', '382', '382', '382', '382'], 'components': ['A1', 'A1', 'A1', 'A1', 'B1', 'B1', 'B1', 'B1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'C1', 'C1', 'C1', 'C1'], 'capacity': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2'], 'chunks': ['4', '4', '4', '4', '4', '4', '4', '4', '4', '4', '6', '6', '6', '6', '6', '6', '6', '6', '6', '6']})
def check_10(df):
    cols = ['id', 'capacity', 'chunks']
    # how many times each component appears per (id, capacity, chunks)
    counts = df.groupby(cols, as_index=False)['components'].agg(total_components='value_counts')
    # number of distinct components per (id, capacity, chunks)
    counts['unique_components'] = counts.groupby(cols)['components'].transform('nunique')
    result1 = counts['unique_components'].le(counts['capacity'].astype(int)).all()
    result2 = counts['total_components'].eq(counts['chunks'].astype(int)).all()
    return result1 and result2
print(check_10(df))
# False
The purpose of check_10 is to make sure that, for each id (which has a fixed capacity and number of chunks), the number of unique components does not exceed the capacity of the id, and that each component appears exactly as many times as the number of chunks of that id.
For information, counts looks like this:
print(counts)
#     id capacity chunks components  total_components  unique_components
# 0  381        1      4         A1                 6                  2
# 1  381        1      4         B1                 4                  2
# 2  382        2      6         A1                 6                  2
# 3  382        2      6         C1                 4                  2
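Written out directly against df, the two conditions amount to something like this (just to illustrate the intent; it is not the code I actually run):

cols = ['id', 'capacity', 'chunks']

# condition 1: distinct components per id must not exceed its capacity
unique_ok = (df.groupby(cols)['components'].transform('nunique')
               .le(df['capacity'].astype(int)))

# condition 2: every component of an id must appear exactly `chunks` times
chunks_ok = (df.groupby(cols + ['components'])['components'].transform('size')
               .eq(df['chunks'].astype(int)))

print(bool(unique_ok.all() and chunks_ok.all()))
# False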
Does anyone know why the check I added slows my code down this much?
EDIT1: Solution with one groupby:
def f(x):
    # x.name holds the group key (id, capacity, chunks)
    i, cap, chunks = x.name
    result2 = (x.value_counts() == int(chunks)).all()
    result1 = len(set(x)) <= int(cap)
    return result1 and result2

cols = ['id', 'capacity', 'chunks']
# every group has to pass, matching check_10
out = df.groupby(cols, sort=False)['components'].apply(f).all()
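On the sample df above this returns False, since neither group satisfies both conditions:

print(out)
# False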
Optimization with Counter:
from collections import Counter

def f(x):
    # x.name holds the group key (id, capacity, chunks)
    i, cap, chunks = x.name
    counts = Counter(x).values()
    result2 = all(c == int(chunks) for c in counts)
    result1 = len(counts) <= int(cap)
    return result1 and result2

cols = ['id', 'capacity', 'chunks']
out = df.groupby(cols)['components'].apply(f).all()
Alternative:
from collections import Counter

def f(i, cap, chunks, x):
    counts = Counter(x).values()
    result2 = all(c == int(chunks) for c in counts)
    result1 = len(counts) <= int(cap)
    return result1 and result2

cols = ['id', 'capacity', 'chunks']
out = all(f(i, cap, chunks, x) for (i, cap, chunks), x in df.groupby(cols)['components'])
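For completeness, the last variant can be wrapped back into a drop-in replacement for check_10 that also stops at the first failing group (the name check_10_fast is mine):

from collections import Counter

def check_10_fast(df):
    cols = ['id', 'capacity', 'chunks']
    for (i, cap, chunks), comp in df.groupby(cols)['components']:
        counts = Counter(comp).values()
        if len(counts) > int(cap) or any(c != int(chunks) for c in counts):
            return False
    return True

print(check_10_fast(df))
# False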
EDIT: You can test the conditions in two separate DataFrames:
def check_10(df):
    cols = ['id', 'capacity', 'chunks']
    # one row per (id, capacity, chunks, component) with its count
    counts = (df.groupby(cols + ['components']).size()
                .reset_index(name='total_components'))
    print(counts)
    #     id capacity chunks components  total_components
    # 0  381        1      4         A1                 6
    # 1  381        1      4         B1                 4
    # 2  382        2      6         A1                 6
    # 3  382        2      6         C1                 4

    # one row per (id, capacity, chunks) with the number of distinct components
    counts1 = counts.groupby(cols).size().reset_index(name='unique_components')
    print(counts1)
    #     id capacity chunks  unique_components
    # 0  381        1      4                  2
    # 1  382        2      6                  2

    result1 = counts1['unique_components'].le(counts1['capacity'].astype(int)).all()
    result2 = counts['total_components'].eq(counts['chunks'].astype(int)).all()
    return result1 and result2
print(check_10(df))
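# False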
I think the values in the components column of the counts DataFrame are already unique within each group, so it is possible to simplify the code from:
counts['unique_components'] = counts.groupby(cols)['components'].transform('nunique')
to:
counts['unique_components'] = counts.groupby(cols)['components'].transform('size')
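A quick sanity check on the sample data confirms the two transforms agree on counts built as above:

cols = ['id', 'capacity', 'chunks']
counts = df.groupby(cols + ['components']).size().reset_index(name='total_components')
print(counts.groupby(cols)['components'].transform('nunique')
            .equals(counts.groupby(cols)['components'].transform('size')))
# True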
Another idea is to create one big DataFrame with 4,000 x 200 rows and work with that single DataFrame instead of 4,000 small ones, as sketched below.
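A possible sketch of that idea, assuming the files can be concatenated with a source_file column (the path pattern and the column name are placeholders, not from the original post):

from collections import Counter
import glob
import pandas as pd

def group_ok(comp, cap, chunks):
    # same per-group test as above
    counts = Counter(comp).values()
    return len(counts) <= int(cap) and all(c == int(chunks) for c in counts)

files = sorted(glob.glob('data/*.xlsx'))       # placeholder pattern
big = pd.concat((pd.read_excel(f).assign(source_file=f) for f in files),
                ignore_index=True)

cols = ['source_file', 'id', 'capacity', 'chunks']
per_group = big.groupby(cols)['components'].apply(
    lambda s: group_ok(s, s.name[-2], s.name[-1]))
per_file = per_group.groupby(level='source_file').all()   # one boolean per file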