import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_files_from_folder(base_url, folder_name):
    folder_url = f"{base_url}{folder_name}/"
    response = requests.get(folder_url)
    soup = BeautifulSoup(response.content, "html.parser")
    links = soup.find_all("a")
    peninsular_rain_rates = []
    east_rain_rates = []
    completed_urls = []
    with ThreadPoolExecutor(max_workers=15) as executor:
        # Submit one task per .gz link in the folder listing (skip anchors without an href)
        futures = {
            executor.submit(process_gz_file, folder_url + link.get("href")): link.get("href")
            for link in links
            if link.get("href") and link.get("href").endswith(".gz")
        }
        for future in as_completed(futures):
            peninsular, east, completed = future.result()
            peninsular_rain_rates.extend(peninsular)
            east_rain_rates.extend(east)
            completed_urls.extend(completed)
            # Progress tracking: drop the finished future and print how many remain
            futures.pop(future)
            print(len(futures))
    return peninsular_rain_rates, east_rain_rates, completed_urls
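process_gz_file is not shown above; based on the three values unpacked from each future, it takes the URL of one .gz file and returns the peninsular rates, the east rates, and the completed URL. The sketch below is only illustrative of that shape — the download, decompress, and parsing steps are assumptions filled in for clarity, not the exact code:

import gzip
import requests

def process_gz_file(url):
    # Illustrative sketch only: fetch one .gz file, decompress it in memory,
    # and parse it into rain-rate records for each region.
    peninsular, east = [], []
    resp = requests.get(url)              # note: no timeout is set on this request
    data = gzip.decompress(resp.content)  # each file is ~1.5 MB compressed
    for line in data.decode(errors="ignore").splitlines():
        # ... parse the record and append to peninsular / east as appropriate ...
        pass
    return peninsular, east, [url]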
In this code, I am fetching the URLs of .gz files (around 8000 files in total, each ~1.5 MB) from a website, processing each one, and finally extending the results into lists. The problem is that the code seems to get stuck on the last future; I can tell because I pop each future when it is done and print the number of futures remaining in the dictionary. I have experimented with a smaller number of files (e.g. 50 links) and it works fine.
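One way to see whether the last future is genuinely hung rather than just slow (a sketch, not something in the code above, reusing the futures dictionary and result lists from the function) is to drive the loop with concurrent.futures.wait and a timeout, so the file name of any stalled task gets printed instead of blocking indefinitely:

from concurrent.futures import FIRST_COMPLETED, wait

pending = set(futures)
while pending:
    done, pending = wait(pending, timeout=120, return_when=FIRST_COMPLETED)
    if not done:
        # Nothing finished within 2 minutes: report which files are still outstanding.
        print("Stalled on:", [futures[f] for f in pending])
        break
    for future in done:
        peninsular, east, completed = future.result()
        peninsular_rain_rates.extend(peninsular)
        east_rain_rates.extend(east)
        completed_urls.extend(completed)
    print(len(pending))

That at least makes it visible whether a single URL never comes back or the whole pool slows down as the run gets larger.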
Could this be a memory issue?