For example, let’s say I want to search through my 3 projects > many buckets > many CSV files > fields whose names include ‘name’ > for the value ‘Bob’. So far I have the following code:
import io
import pandas as pd
from google.cloud import storage

for project_id in ("project-1", "project-2", "project-3"):
    client = storage.Client(project=project_id)
    buckets = client.list_buckets()
    for bucket in buckets:
        num_found = []  # CSV files in this bucket that contain 'Bob'
        blobs = client.list_blobs(bucket)
        for blob in blobs:
            if blob.name.endswith('.csv'):
                csv = blob.download_as_text()
                df = pd.read_csv(io.StringIO(csv), low_memory=False)
                for col in df.columns:
                    # finds fields that include 'name'
                    if 'name' in col.lower():
                        name_found = df[df[col].str.contains('Bob', case=False, na=False)]
                        if not name_found.empty:
                            num_found.append(blob.name)
        if num_found:
            print(f"{len(num_found)} CSV files with Bob found in {bucket.name}. CSV files are:")
            print(num_found)
        else:
            print(f"No Bobs found in {bucket.name}!")
So far this code breaks when it hits data that can’t be decoded as ‘utf-8’ (‘invalid start byte’). Another issue is that Cloud Shell cannot run past 20 buckets; it kills itself (due to low memory?), so I am currently using the VS Code IDE.
It seems like you are experiencing a UnicodeDecodeError. What you can do is decode your CSV file with the encoding scheme it was actually written in:
csv = blob.download_as_text(encoding='utf-8')  # replace 'utf-8' with the file's actual encoding
If you are not sure what encoding the file is using, you can use chardet to detect the encoding type of your CSV file.
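For example, a minimal sketch (assuming chardet is installed, e.g. pip install chardet):

import chardet

csv_bytes = blob.download_as_bytes()
detected = chardet.detect(csv_bytes)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
csv = csv_bytes.decode(detected['encoding'] or 'utf-8')  # fall back to utf-8 if detection fails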
As mentioned in this StackOverflow post, you can use errors='replace', but take note that you’ll lose some characters, since this will replace any character that can’t be decoded (usually with �).
for blob in blobs:
    if blob.name.endswith('.csv'):
        try:
            csv = blob.download_as_text()
        except UnicodeDecodeError:
            # fall back to replacing undecodable bytes with U+FFFD
            csv_bytes = blob.download_as_bytes()
            csv = csv_bytes.decode(errors='replace')
Another possible cause is that the CSV file contains bytes that aren’t valid UTF-8 and therefore can’t be decoded.
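You can reproduce the error with any byte sequence that isn’t valid UTF-8 (0x92, used below, is a Windows-1252 curly apostrophe and an invalid UTF-8 start byte):

>>> b'\x92abc'.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 0: invalid start byte
>>> b'\x92abc'.decode('utf-8', errors='replace')
'�abc'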
I’ve tried recreating your issue using one project: I uploaded some CSV files, including one corrupted file (an .xlsx file force-saved as .csv), and I received the same “invalid start byte” error on the corrupted file. Maybe try checking your buckets for corrupted .csv files.
As for the limitations when using Cloud Shell, searching through many files within many buckets across 3 projects can exhaust Cloud Shell’s limited memory, which would explain the process being killed. You can read more about Cloud Shell’s quotas and limits.
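If memory is the bottleneck, one option (a sketch, assuming large files rather than the bucket count are what’s exhausting memory) is to scan each CSV in chunks instead of materializing the whole DataFrame at once:

for chunk in pd.read_csv(io.StringIO(csv), chunksize=50_000, low_memory=False):
    # stop scanning this file as soon as one match is found
    match = False
    for col in chunk.columns:
        if 'name' in col.lower() and chunk[col].str.contains('Bob', case=False, na=False).any():
            match = True
            break
    if match:
        num_found.append(blob.name)
        break

This keeps only chunksize rows in memory at a time; 50_000 is an arbitrary value you’d tune to your machine.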