I created an ETL pipeline using the Python ETL library petl.
It works as expected, but the transformation step is currently a bit slow.
I’m using ProcessPoolExecutor to transform each petl data table in a separate process. I realized that working on one data table at a time across all CPUs might be more performant than spreading whole data tables across CPUs, since some cores sit idle whenever there are fewer tables left to process than there are cores.
I guess my questions are:
- Does this thought process make sense?
- Is there a way to run a single table's transformation across all cores and join the results back? (There's a sketch of what I mean below my current code.)

What I have, simplified:
from concurrent.futures import ProcessPoolExecutor, as_completed

def transform(data_tables):
    results = {}
    futures = {}
    with ProcessPoolExecutor() as executor:
        # keep a mapping from each future back to its table name,
        # so results can be keyed by name as futures complete
        for name, table in data_tables.items():
            futures[executor.submit(transform_table, table)] = name
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # handling exception...
                pass
    return results
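
For the second question, here is roughly what I had in mind, in case it helps clarify: split one table into row slices, transform each slice in a separate worker, then stitch the pieces back together with petl.cat. The transform_single_table and transform_chunk names are just placeholders for this sketch, and the addfield call stands in for my real transformation; I'm also assuming that materializing each slice to a plain list of rows is a reasonable way to get picklable chunks, since petl tables are lazy:

import math
from concurrent.futures import ProcessPoolExecutor
import petl as etl

def transform_chunk(chunk_rows):
    # chunk_rows is a plain list of rows (header first) so it pickles cleanly;
    # wrap it back into a petl table, transform it, and materialize the result
    # so the work actually happens in this worker process (petl is lazy)
    chunk = etl.wrap(chunk_rows)
    transformed = etl.addfield(chunk, 'processed', True)  # stand-in transformation
    return list(transformed)

def transform_single_table(table, workers=4):
    total = etl.nrows(table)
    chunk_size = max(1, math.ceil(total / workers))
    header = etl.header(table)
    # materialize each slice, repeating the header so every chunk is a valid table
    chunks = [
        [header] + list(etl.data(etl.rowslice(table, i, i + chunk_size)))
        for i in range(0, total, chunk_size)
    ]
    with ProcessPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(transform_chunk, chunks))
    # join the transformed chunks back into a single table
    return etl.cat(*[etl.wrap(rows) for rows in results])

Is something along these lines the right approach, or does the overhead of materializing and pickling the chunks defeat the purpose?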
