I need to convert a batch of .xlsx files to PDF. I use Linux Mint, and I wrote a script that does the job correctly when the files are processed sequentially. However, this takes a long time, and I would like to speed things up by running the conversions concurrently. The idea is to split the list of files to be converted in half and process the two halves concurrently and independently. This should work, since the files are independent of each other.
I tried to use asyncio for this and asked ChatGPT for help, but I cannot solve the problem: on each run, about 10 files out of 100 (the number varies) randomly fail to convert to PDF.
I need your help to understand what is going on.
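To see exactly which files fail on a given run, a small check after the script finishes may help. The sketch below assumes that soffice names each PDF after the input file's base name (for example, report.xlsx becomes report.pdf); the helper name missing_pdfs is made up for illustration.

```python
from pathlib import Path


def missing_pdfs(xlsx_files, out_dir):
    """Return the input files that have no matching PDF in out_dir.

    Assumption: the converter writes <base name>.pdf into out_dir.
    """
    out = Path(out_dir)
    return [
        xlsx_file
        for xlsx_file in xlsx_files
        if not (out / (Path(xlsx_file).stem + '.pdf')).is_file()
    ]
```

Running this on the full input list after each attempt would show whether the same files fail every time or a different set each run.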
The original script used the following approach, which is slow but works correctly:
import subprocess

def convert_pdf_soffice(xlsx_file: str) -> None:
    out_dir = './PdfDir/'
    print('Started conversion of ', xlsx_file)
    subprocess.run(['soffice', '--headless', '--convert-to', 'pdf', '--outdir', out_dir, xlsx_file])
    print('Finished conversion of ', xlsx_file)
I called the conversion function in a loop, like this:
for file in xls_files_to_be_converted:
    convert_pdf_soffice(file)
The concurrent approach is the following:
#!/usr/local/bin/python3
import os
import asyncio
import time
async def convert_pdf_soffice(xlsx_files):
    out_dir = './PdfDir/'
    tasks = []
    for xlsx_file in xlsx_files:
        print('Started conversion of ', xlsx_file)
        process = await asyncio.create_subprocess_exec(
            'soffice', '--headless', '--convert-to', 'pdf', '--outdir', out_dir, xlsx_file,
            stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
        )
        tasks.append(process)
    for task, xlsx_file in zip(tasks, xlsx_files):
        stdout, stderr = await task.communicate()
        if task.returncode != 0:
            print(f'Conversion of {xlsx_file} failed with return code {task.returncode}')
        else:
            print('Finished conversion of ', xlsx_file)

async def main():
    start_t = time.time()
    INPUT_DIR = './XLSX/'
    OUTPUT_DIR = './PdfDir/'
    # Create folder if not exists
    if not os.path.exists(OUTPUT_DIR):
        os.makedirs(OUTPUT_DIR)
    # List of all xlsx files
    xlsx_file_list = [file for file in os.listdir(INPUT_DIR) if file.endswith('.xlsx')]
    # Split the list into two halves
    mid_index = len(xlsx_file_list) // 2
    first_half = xlsx_file_list[:mid_index]
    second_half = xlsx_file_list[mid_index:]
    # List of xlsx files to be converted to pdf
    first_half_paths = [os.path.join(INPUT_DIR, file) for file in first_half]
    second_half_paths = [os.path.join(INPUT_DIR, file) for file in second_half]
    # Run conversions concurrently for both halves
    await asyncio.gather(
        convert_pdf_soffice(first_half_paths),
        convert_pdf_soffice(second_half_paths)
    )
    end_t = time.time()
    duration_t = end_t - start_t
    print(f'Duration is {duration_t}')

if __name__ == '__main__':
    asyncio.run(main())
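
One thing that stands out in the script above: each call to convert_pdf_soffice starts a process for every file in its half before waiting for any of them to finish, so all the files are launched at once rather than two at a time. The following is a minimal sketch of a variant that might be worth trying, not a confirmed fix. It assumes (unverified) that the failures come from many soffice instances sharing one user profile, so it caps the number of parallel processes with asyncio.Semaphore and gives each process its own temporary profile through LibreOffice's -env:UserInstallation option. The names build_soffice_cmd, convert_one, convert_all, and max_parallel are hypothetical.

```python
import asyncio
import tempfile
from pathlib import Path


def build_soffice_cmd(xlsx_file, out_dir, profile_dir):
    """Build a soffice command line that uses its own profile directory."""
    # -env:UserInstallation expects a file:// URI, not a plain path.
    profile_uri = Path(profile_dir).resolve().as_uri()
    return [
        'soffice',
        f'-env:UserInstallation={profile_uri}',
        '--headless', '--convert-to', 'pdf',
        '--outdir', out_dir, xlsx_file,
    ]


async def convert_one(xlsx_file, out_dir, limit, build_cmd=build_soffice_cmd):
    """Convert one file while holding a slot in the `limit` semaphore."""
    async with limit:
        # Assumption: a throwaway profile per process avoids profile clashes.
        with tempfile.TemporaryDirectory() as profile_dir:
            cmd = build_cmd(xlsx_file, out_dir, profile_dir)
            process = await asyncio.create_subprocess_exec(
                *cmd,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
            # Wait for this process to exit before releasing the slot.
            _, stderr = await process.communicate()
    return xlsx_file, process.returncode, stderr.decode(errors='replace')


async def convert_all(xlsx_files, out_dir, max_parallel=2,
                      build_cmd=build_soffice_cmd):
    """Convert all files, running at most max_parallel processes at a time."""
    limit = asyncio.Semaphore(max_parallel)
    return await asyncio.gather(
        *(convert_one(f, out_dir, limit, build_cmd) for f in xlsx_files)
    )
```

In main(), the two gathered calls could then be replaced with something like results = await convert_all(first_half_paths + second_half_paths, OUTPUT_DIR). Each result is a tuple of file name, return code, and captured stderr, which makes it easier to print the error output for the files that fail.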