I run a spider to collect some data, and part of the job is downloading images and storing them in S3. I noticed that sometimes the images are not uploaded to S3. Checking the log, I see that when an image is successfully uploaded, boto3 uses the PUT method; when an image is not uploaded, it uses the HEAD method instead.
Here’s a snippet from the log:
dac3146722c4a04e7fd8b00ff7eb2ae76a82136f86c10b969e9a8b2a008c1c2a
2024-05-09 06:25:52 [botocore.hooks] DEBUG: Event before-sign.s3.HeadObject: calling handler <bound method S3ExpressIdentityResolver.resolve_s3express_identity of <botocore.utils.S3ExpressIdentityResolver object at 0x7fc1319328c0>>
2024-05-09 06:25:52 [botocore.endpoint] DEBUG: Making request for OperationModel(name=HeadObject) with params: {'url_path': '/path/imgs/folder/img.jpg', 'query_string': {}, 'method': 'HEAD', 'headers': {'User-Agent': 'Botocore/1.34.88 ua/2.0 os/linux#5.15.0-105-generic md/arch#x86_64 lang/python#3.10.12 md/pyimpl#CPython cfg/retry-mode#legacy'}, 'body': b'', 'auth_path': '/app/path/images/folder/img.jpg', 'url': 'https://URL-TO-S3.com/app/path/images/folder/img.jpg', 'context': {'client_region': 'fra1', 'client_config': <botocore.config.Config object at 0x7fc131ddb3a0>, 'has_streaming_input': False, 'auth_type': 'v4', 's3_redirect': {'redirected': False, 'bucket': 'app', 'params': {'Bucket': 'app', 'Key': 'path/images/folder/img.jpg'}}, 'input_params': {'Bucket': 'app', 'Key': 'path/images/folder/img.jpg'}, 'signing': {'region': 'fra1', 'signing_name': 's3', 'disableDoubleEncoding': True}, 'endpoint_properties': {'authSchemes': [{'disableDoubleEncoding': True, 'name': 'sigv4', 'signingName': 's3', 'signingRegion': 'fra1'}]}}}
2024-05-09 06:25:52 [botocore.auth] DEBUG: Signature:
f290eab475db8223123b40346bc99e06adb6d87ced24828440827008e76b221c
2024-05-09 06:25:52 [botocore.hooks] DEBUG: Event request-created.s3.HeadObject: calling handler <function add_retry_headers at 0x7fc1338eb010>
2024-05-09 06:25:52 [botocore.hooks] DEBUG: Event request-created.s3.HeadObject: calling handler functools.partial(<function _sentry_request_created at 0x7fc133768700>, service_id='s3')
2024-05-09 06:25:52 [botocore.endpoint] DEBUG: Sending http request: <AWSPreparedRequest stream_output=False, method=HEAD, url=https://URL-TO-S3.com/app/path/images/folder/img.jpg, headers={'User-Agent': b'Botocore/1.34.88 ua/2.0 os/linux#5.15.0-105-generic md/arch#x86_64 lang/python#3.10.12 md/pyimpl#CPython cfg/retry-mode#legacy', 'X-Amz-Date': b'20240509T062551Z', 'X-Amz-Content-SHA256': b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 'Authorization': b'AWS4-HMAC-SHA256 Credential=DO004NM8RQNYNFZMKLT6/20240509/fra1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=f290eab475db8223123b40346bc99e06adb6d87ced24828440827008e76b221c', 'amz-sdk-invocation-id': b'e0ca4edf-a0fb-4377-b0d5-f1eb4366e217', 'amz-sdk-request': b'attempt=1'}>
2024-05-09 06:25:52 [botocore.httpsession] DEBUG: Certificate path: /etc/ssl/certs/ca-certificates.crt
2024-05-09 06:25:52 [botocore.hooks] DEBUG: Event request-created.s3.HeadObject: calling handler <bound method RequestSigner.handler of <botocore.signers.RequestSigner object at 0x7fc131ddb490>>
2024-05-09 06:25:52 [botocore.hooks] DEBUG: Event choose-signer.s3.HeadObject: calling handler <function set_operation_specific_signer at 0x7fc1338e9000>
2024-05-09 06:25:52 [botocore.hooks] DEBUG: Event before-sign.s3.HeadObject: calling handler <function remove_arn_from_signing_path at 0x7fc1338eb1c0>
2024-05-09 06:25:52 [botocore.hooks] DEBUG: Event before-sign.s3.HeadObject: calling handler <bound method S3ExpressIdentityResolver.resolve_s3express_identity of <botocore.utils.S3ExpressIdentityResolver object at 0x7fc1319328c0>>
2024-05-09 06:25:52 [botocore.auth] DEBUG: Calculating signature using v4 auth.
2024-05-09 06:25:52 [botocore.auth] DEBUG: CanonicalRequest:
HEAD
/app/path/images/folder/img.jpg
host:URL-TO-S3.com
x-amz-content-sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
x-amz-date:20240509T062552Z
host;x-amz-content-sha256;x-amz-date
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
2024-05-09 06:25:52 [botocore.auth] DEBUG: StringToSign:
AWS4-HMAC-SHA256
20240509T062552Z
My CustomImagePipeline.py looks fairly simple:
import hashlib

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class CustomImgPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Pass the item id along so file_path can use it as the folder name.
        meta = {
            'f': item.get('id'),
        }
        if item.get('image_urls'):
            for image_url in item.get('image_urls'):
                yield scrapy.Request(image_url, meta=meta)

    def file_path(self, request, response=None, info=None, *, item=None):
        # The key is '<item id>/<sha1 of the image URL>.jpg'.
        image_guid = hashlib.sha1(request.url.encode('utf-8')).hexdigest()
        file_name = f'{image_guid}.jpg'
        return '%s/%s' % (request.meta["f"], file_name)
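To make it easier to match a given image URL against the `Key` values in the log, here is a minimal standalone sketch of the key computation that `file_path` performs. The function name `build_key`, the item id and the URL are made up for illustration and are not part of my pipeline.

```python
import hashlib


def build_key(item_id, image_url):
    """Mirror file_path: '<item id>/<sha1 of the image URL>.jpg'."""
    image_guid = hashlib.sha1(image_url.encode('utf-8')).hexdigest()
    return f'{item_id}/{image_guid}.jpg'


# Hypothetical values, for illustration only.
key = build_key(42, 'https://example.com/img.jpg')
print(key)
```

The same item id and URL always produce the same key, so a repeated crawl of the same image would address the same object in the bucket.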
The thing is that sometimes the images are successfully uploaded (I can see the PUT method in the logs) and sometimes they are not, and when that happens I see the HEAD method in the logs. Why is that happening? Is there a way to force Scrapy/boto3 to always use the PUT method?