I am testing an application which produces plots of data. These plots are downloadable as .png files. My approach to testing the plot (to make sure nothing has changed with the most recent software deploy of the application) is as follows:
- Download the plot as a .png file.
- Obtain a hash value of the downloaded plot file.
- Compare this hash value to my stored hash value for the same plot. If they are the same, the test passes.
But even after several rounds of this, using various hash algorithms, the hash value is different on every run. I also tried adding a few steps to remove metadata from the file, and I still cannot get the same result twice.
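To confirm that two downloads of the same plot really differ at the byte level, and roughly where, a check along these lines might help. This is only a sketch: `first_difference` is a hypothetical helper name, and it assumes both files are small enough to read into memory.

```python
from typing import Optional


def first_difference(path_a: str, path_b: str) -> Optional[int]:
    """Return the offset of the first differing byte, or None if the files are identical.

    Hypothetical diagnostic helper; reads both files fully into memory.
    """
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        bytes_a, bytes_b = file_a.read(), file_b.read()
    # Walk both files in parallel and stop at the first mismatch.
    for offset, (byte_a, byte_b) in enumerate(zip(bytes_a, bytes_b)):
        if byte_a != byte_b:
            return offset
    # One file may simply be a prefix of the other.
    if len(bytes_a) != len(bytes_b):
        return min(len(bytes_a), len(bytes_b))
    return None
```

An offset near the start of the file would suggest a difference in the header or metadata area, while one deep in the file is more likely to be in the compressed image data. That is a rough indication only, not a guarantee.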
These are the three higher-level steps of my test where this download and comparison takes place:
downloaded_path = sbDAPlotPage.download_plot(getData.plot_display_name)
# generate a hash on the downloaded file
hash_value_of_downloaded_plot = hash_image(downloaded_path)
# compare the hash value to a stored hash value
assert getData.hash_value == hash_value_of_downloaded_plot
Here is the first variation I tried for my “hash_image” method above, and the hash values were always different:
import hashlib

def hash_image(filepath: str) -> str:
    BUF_SIZE = 65536  # Read in 64KB chunks
    # create a hash object
    readable_hash = hashlib.sha256()
    with open(filepath, "rb") as f:
        while True:
            byte_cluster = f.read(BUF_SIZE)  # read the next chunk of bytes
            if not byte_cluster:
                break
            # update the hash object every time a chunk of data is read
            readable_hash.update(byte_cluster)
    # return the hexadecimal representation of the hash:
    hex_readable_hash = readable_hash.hexdigest()
    print(f"hash returned: {hex_readable_hash}")
    return hex_readable_hash
Here is the second variation I tried:
import base64
import hashlib

def hash_image(filepath: str) -> str:
    readable_hash = hashlib.sha256()
    with open(filepath, 'rb') as image_file:
        # base64-encode the raw file bytes before hashing
        base64_bytes = base64.b64encode(image_file.read())
    # hash the base64-encoded bytes instead of the raw bytes
    readable_hash.update(base64_bytes)
    hex_readable_hash = readable_hash.hexdigest()
    print(f"Hex readable hash: {hex_readable_hash}")
    return hex_readable_hash
And here was my last-ditch attempt where I then tried removing metadata:
import hashlib
from PIL import Image

def hash_image(file_path: str) -> str:
    img = Image.open(file_path)
    # Copy only the pixel data into a fresh image
    data = list(img.getdata())
    img_without_metadata = Image.new(img.mode, img.size)
    img_without_metadata.putdata(data)
    # Save the new image over the original file, effectively removing metadata.
    img_without_metadata.save(file_path)
    with open(file_path, "rb") as f:
        # Read the entire file as bytes
        file_bytes = f.read()
    # Create a hash object using SHA-256
    hash_obj = hashlib.sha256(file_bytes)
    # Generate the hexadecimal representation of the hash
    hash_hex = hash_obj.hexdigest()
    return hash_hex
I expected the hash values to be the same, but the hash generated from the freshly downloaded file was different from the stored hash, as well as from the hash in any previous or subsequent run. I’m totally baffled by this one. What am I not understanding about image files, or about hashing them?
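One way to narrow this down might be to look inside the PNG itself. A PNG file is an 8-byte signature followed by chunks, each made of a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC. The sketch below hashes each chunk separately, so two downloads can be compared chunk by chunk. `png_chunk_digests` is a hypothetical helper name, and which chunks actually appear in these plots is an assumption that depends on whatever library renders them.

```python
import hashlib
import struct
from typing import List, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk_digests(data: bytes) -> List[Tuple[str, int, str]]:
    """Return (chunk type, data length, shortened SHA-256 of the data) per chunk.

    Hypothetical diagnostic helper; it does not validate the chunk CRCs.
    """
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    chunks = []
    offset = 8  # skip the signature
    while offset + 8 <= len(data):
        # Each chunk starts with a 4-byte big-endian length and a 4-byte type.
        length, chunk_type = struct.unpack(">I4s", data[offset:offset + 8])
        body = data[offset + 8:offset + 8 + length]
        digest = hashlib.sha256(body).hexdigest()[:12]
        chunks.append((chunk_type.decode("ascii"), length, digest))
        offset += 8 + length + 4  # header + data + CRC
    return chunks
```

Running this on the bytes of two downloads of the same plot and comparing the lists should show whether only ancillary chunks such as `tEXt` or `tIME` (text metadata and last-modification time) differ, or whether the `IDAT` image data itself changes between runs. If `IDAT` differs, the rendered pixels or the compression output are not identical, and stripping metadata alone would not make the file hashes match.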