I’ve trained a FAISS index locally from several documents and saved it with the “save_local” function. The resulting files (index.pkl and index.faiss) were uploaded to a Google Cloud Storage bucket.
Now I want to load the index with LangChain’s “FAISS.load_local” function, but I haven’t found a way to make the index files accessible to it.
I’ve tried using a blob with the “download_as_string” and “download_to_file” functions and writing the files to the in-memory path “/tmp”. Neither works; the error is attached below. It looks as if the file is corrupted and can’t be read inside “load_local”, specifically by tiktoken.
The error:
File "/layers/google.python.pip/pip/lib/python3.11/site-packages/tiktoken/core.py", line 116, in encode
if match := _special_token_regex(disallowed_special).search(text):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected string or buffer
The blob code:
import os
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket(BUCKET)
EMBEDDINGS_PATH = '/tmp'
filepath = "embeddings"
# I created a function that iterates over the filenames; this is part of its code.
# Both files written by save_local must be downloaded
# (note that "index.faiss" or "index.pkl" always evaluates to just "index.faiss").
for filename in ("index.faiss", "index.pkl"):
    blob_file = bucket.blob(blob_name="{}/{}".format(filepath, filename))
    save_file = "{}/{}".format(EMBEDDINGS_PATH, filename)
    downloaded_file = blob_file.download_as_string()
    with open(save_file, "wb") as f:
        f.write(downloaded_file)
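For reference, here is a minimal, self-contained sketch of the download step. The bucket name and object prefix are placeholders for my setup; the path construction itself can be checked locally without touching GCS, and the actual transfer (commented out) uses the google-cloud-storage client:

```python
import os

EMBEDDINGS_PATH = "/tmp"                   # writable in-memory path on Cloud Functions
FILEPATH = "embeddings"                    # object prefix inside the bucket (placeholder)
FILENAMES = ("index.faiss", "index.pkl")   # both files written by save_local

def plan_downloads(prefix, local_dir, filenames):
    """Return (blob_name, local_path) pairs for each index file."""
    return [("{}/{}".format(prefix, name), os.path.join(local_dir, name))
            for name in filenames]

# The actual transfer (requires google-cloud-storage and credentials):
# from google.cloud import storage
# bucket = storage.Client().get_bucket(BUCKET)
# for blob_name, local_path in plan_downloads(FILEPATH, EMBEDDINGS_PATH, FILENAMES):
#     bucket.blob(blob_name).download_to_filename(local_path)
```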
# Load the embeddings file
# (EMBEDDINGS_PATH must be passed as a variable, not the string "EMBEDDINGS_PATH")
embeddings = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])
vectorstore = FAISS.load_local(folder_path=EMBEDDINGS_PATH, embeddings=embeddings, allow_dangerous_deserialization=True)
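Since load_local fails with an opaque error when either file is missing or empty, I added a small sanity check before loading (a hypothetical helper, not part of LangChain):

```python
import os

def missing_index_files(folder):
    """Return the names of FAISS index files that are absent or empty in `folder`."""
    required = ("index.faiss", "index.pkl")
    return [name for name in required
            if not os.path.isfile(os.path.join(folder, name))
            or os.path.getsize(os.path.join(folder, name)) == 0]
```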
I hope someone can share the correct way to access and load FAISS embeddings from a Google Cloud Storage bucket in a Cloud Function. Thank you.