I’m working on a Python script that downloads images from an Amazon S3 bucket and uploads them to Google Cloud Storage (GCS) using the google-cloud-storage library. The script processes a large dataset in chunks and uses concurrent.futures.ThreadPoolExecutor to parallelize the downloads and uploads.
However, after running for a while, I start seeing the following warning message:
2024-07-16 21:26:40,575 - WARNING - Connection pool is full, discarding connection: storage.googleapis.com
Here's the relevant part of my code:

def download_and_upload_image(s3_client, bucket_name, s3_key, target_bucket_name, target_key):
    # ... code to download image from S3 into image_data ...
    bucket = gcs_client.bucket(target_bucket_name)  # gcs_client is a shared storage.Client created in main()
    blob = bucket.blob(target_key)
    blob.upload_from_string(image_data)
def main():
    # ... code to set up S3 and GCS clients ...
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
        futures = []
        for row in tqdm(all_rows, desc="Downloading and uploading images"):
            if row is not None:
                s3_key, gcs_blob_name = row
                future = executor.submit(download_and_upload_image, s3_client, bucket_name, s3_key, gcs_bucket_name, gcs_blob_name)
                futures.append(future)
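(Not shown above: after the submit loop I also drain the futures and call result() on each, so worker exceptions aren't silently swallowed. Simplified sketch:)

        for future in concurrent.futures.as_completed(futures):
            future.result()  # re-raises any exception from the worker thread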
I suspect the warning comes from the HTTP connection pool the GCS client uses under the hood: requests' default HTTPAdapter keeps at most 10 connections per host, while I'm running 50 worker threads, so extra connections get opened and then discarded when they can't be returned to the full pool. I'm not sure how to resolve it, though.
I’ve tried increasing the max_workers parameter of the ThreadPoolExecutor, but that didn’t seem to help. I also tried using a storage.Client instance directly instead of creating a new one for each upload, but the warning still occurs.
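Concretely, the shared-client attempt looked roughly like this (simplified; the client is created once in main() and read by all worker threads):

    from google.cloud import storage

    # created once and reused by every worker, instead of one Client per upload
    gcs_client = storage.Client()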
How can I fix this warning and ensure that the script can handle the large number of concurrent uploads to GCS without discarding connections? Is there a way to configure the connection pool size or reuse connections more efficiently?
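For example, I've seen suggestions to mount a larger requests HTTPAdapter on the session the storage client uses, along these lines (the pool sizes are guesses on my part, and _http is a private constructor parameter, so I'm not sure this is the supported approach):

    import google.auth
    from google.auth.transport.requests import AuthorizedSession
    from google.cloud import storage
    from requests.adapters import HTTPAdapter

    credentials, project = google.auth.default()

    # AuthorizedSession is a requests.Session, so mounting a custom
    # HTTPAdapter raises the urllib3 per-host pool size above the default of 10
    session = AuthorizedSession(credentials)
    adapter = HTTPAdapter(pool_connections=64, pool_maxsize=64)
    session.mount("https://", adapter)

    # _http is not part of the public API and may change between releases
    gcs_client = storage.Client(project=project, _http=session)

Would something like this work, or is there a cleaner, officially supported knob I'm missing?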
Any guidance or suggestions would be greatly appreciated. Let me know if you need more details about my setup or code.