For example, let’s say I want to search through my 3 projects > many buckets > many CSV files > fields whose names include ‘name’ > for the value ‘Bob’. So far I have the following code:
import io
import pandas as pd
from google.cloud import storage

for project_id in ("project-1", "project-2", "project-3"):
    client = storage.Client(project=project_id)
    buckets = client.list_buckets()
    for bucket in buckets:
        num_found = []  # CSV files in this bucket that contain 'Bob'
        blobs = client.list_blobs(bucket)
        for blob in blobs:
            if blob.name.endswith('.csv'):
                csv = blob.download_as_text()
                df = pd.read_csv(io.StringIO(csv), low_memory=False)
                for col in df.columns:
                    # finds fields that include 'name'
                    if 'name' in col.lower():
                        name_found = df[df[col].str.contains('Bob', case=False, na=False)]
                        if not name_found.empty:
                            num_found.append(blob.name)
        if num_found:
            print(f"{len(num_found)} CSV files with Bob found in {bucket.name}. CSV files are:")
            print(num_found)
        else:
            print(f"No Bobs found in {bucket.name}!")
So far this code breaks when it hits data that can’t be decoded as ‘utf-8’ (‘invalid start byte’). Another issue is that Cloud Shell cannot run past 20 buckets; it kills itself (due to low memory?), so I am currently using the VS Code IDE.
It seems like you are experiencing a UnicodeDecodeError. What you can do is decode your CSV file with the encoding scheme it was actually written in:
csv = blob.download_as_text(encoding='utf-8')  # replace 'utf-8' with the file's actual encoding
If you are not sure what encoding the file is using, you can use chardet to detect the encoding type of your CSV file.
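For example, a minimal sketch (assuming chardet is installed, e.g. pip install chardet):

import chardet

csv_bytes = blob.download_as_bytes()
detected = chardet.detect(csv_bytes)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
csv = csv_bytes.decode(detected['encoding'] or 'utf-8')  # fall back to utf-8 if detection fails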
As mentioned in this StackOverflow post, you can use errors='replace', but take note that you’ll lose some characters, since this will replace any character that can’t be decoded (usually with �).
for blob in blobs:
    if blob.name.endswith('.csv'):
        try:
            csv = blob.download_as_text()
        except UnicodeDecodeError:
            # fall back to replacing undecodable bytes with U+FFFD
            csv_bytes = blob.download_as_bytes()
            csv = csv_bytes.decode(errors='replace')
Another possible cause is that the CSV file contains bytes that aren’t valid UTF-8 and therefore can’t be decoded.
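You can reproduce the error with any byte sequence that isn’t valid UTF-8 (0x92, used below, is a Windows-1252 curly apostrophe and an invalid UTF-8 start byte):

>>> b'\x92abc'.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 0: invalid start byte
>>> b'\x92abc'.decode('utf-8', errors='replace')
'�abc'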
I’ve tried recreating your issue using one project: I uploaded some CSV files, including one corrupted file (an .xlsx file force-saved as .csv), and I received the same “invalid start byte” error on the corrupted file. Maybe try checking your buckets for corrupted .csv files.
As for the limitations when using Cloud Shell, searching through many files within many buckets across 3 projects can exhaust Cloud Shell’s limited memory, which would explain the process being killed. You can read more about Cloud Shell’s quotas and limits.
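If memory is the bottleneck, one option (a sketch, assuming large files rather than the bucket count are what’s exhausting memory) is to scan each CSV in chunks instead of materializing the whole DataFrame at once:

for chunk in pd.read_csv(io.StringIO(csv), chunksize=50_000, low_memory=False):
    # stop scanning this file as soon as one match is found
    match = False
    for col in chunk.columns:
        if 'name' in col.lower() and chunk[col].str.contains('Bob', case=False, na=False).any():
            match = True
            break
    if match:
        num_found.append(blob.name)
        break

This keeps only chunksize rows in memory at a time; 50_000 is an arbitrary value you’d tune to your machine.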