I am using the Vertex AI Search API to build a generative search widget that searches a repository of articles and generates a response to the user's query. The data store uses unstructured data stored in a GCS bucket. The data is imported via a JSONL metadata file (stored in the same bucket) that references the article files and attaches metadata such as each article's author and URL.
Below is an example of what my metadata file looks like:
{"id":"1","structData":{"title":"Article 1", "author":"Author 1", "url":"https://url-1"},"content":{"mimeType":"text/plain","uri":"gs://bucket/article-1.txt"}}
{"id":"2","structData":{"title":"Article 2", "author":"Author 2", "url":"https://url-2"},"content":{"mimeType":"text/plain","uri":"gs://bucket/article-2.txt"}}
{"id":"3","structData":{"title":"Article 3", "author":"Author 3", "url":"https://url-3"},"content":{"mimeType":"text/plain","uri":"gs://bucket/article-3.txt"}}
By and large it works quite well. However, when I search for things like ‘how many articles were authored by x’ and the number of articles authored by x exceeds the summary result count (which can’t be set above 10), the answer reports the summary result count as the number of articles. I understand why: the summary result count is the number of top results used to generate the summary.
I would also like it to handle a query like “how many articles are there in total”.
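To make the numbers I'm after concrete: both counts can be computed directly from the metadata file, outside of Vertex AI Search. Below is a minimal sketch in Python, assuming the file has the same shape as my example above. The function name count_articles and the SAMPLE string are just for illustration and are not part of any Google API.

```python
import json
from collections import Counter


def count_articles(jsonl_text):
    """Return (total, per_author) for metadata in JSONL form.

    Assumes each non-empty line is a JSON object whose author lives at
    structData.author, as in the example metadata file.
    """
    total = 0
    per_author = Counter()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        doc = json.loads(line)
        total += 1
        author = doc.get("structData", {}).get("author")
        if author is not None:
            per_author[author] += 1
    return total, per_author


# Hypothetical sample mirroring the three example lines above.
SAMPLE = "\n".join([
    '{"id":"1","structData":{"title":"Article 1","author":"Author 1","url":"https://url-1"},"content":{"mimeType":"text/plain","uri":"gs://bucket/article-1.txt"}}',
    '{"id":"2","structData":{"title":"Article 2","author":"Author 2","url":"https://url-2"},"content":{"mimeType":"text/plain","uri":"gs://bucket/article-2.txt"}}',
    '{"id":"3","structData":{"title":"Article 3","author":"Author 3","url":"https://url-3"},"content":{"mimeType":"text/plain","uri":"gs://bucket/article-3.txt"}}',
])

total, per_author = count_articles(SAMPLE)
print(total, dict(per_author))
```

What I can't work out is how to get figures like these into the generated answer, since they describe the dataset as a whole rather than any single article.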
I am trying to find a way around this so that my widget can accurately answer these types of queries, but I’m having a lot of trouble sifting through Google’s documentation and keep getting lost. Is there a way to include extra metadata, either in the same metadata file or in a separate one, that describes the dataset as a whole rather than a particular item (an article, in my case)? Or is there some other way around this issue?
I tried including something like
{"totalNumberOfArticles":"100","articlesAuthoredByJohnDoe":10}
as the first line of the file to see if that would work, but the import failed with an ‘INCORRECT_JSON_FORMAT document’ error saying ‘no such field: “totalNumberOfArticles”’.
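My guess is that the importer expects every line to follow the same per-document shape as my working lines, so a line with other top-level keys is rejected. Here is a small sketch of the check I used to confirm which line is the odd one out. The ALLOWED_TOP_LEVEL set is my assumption, taken only from the keys my working lines use, and is not Google's actual schema; unexpected_fields is a hypothetical helper.

```python
import json

# Assumption: only the top-level keys used by my working lines are accepted.
# The real document schema may allow more fields than these.
ALLOWED_TOP_LEVEL = {"id", "structData", "content"}


def unexpected_fields(line):
    """Return the sorted top-level keys of a JSONL line that are not allowed."""
    doc = json.loads(line)
    return sorted(set(doc) - ALLOWED_TOP_LEVEL)


good = '{"id":"1","structData":{"title":"Article 1"},"content":{"mimeType":"text/plain","uri":"gs://bucket/article-1.txt"}}'
bad = '{"totalNumberOfArticles":"100","articlesAuthoredByJohnDoe":10}'

print(unexpected_fields(good))
print(unexpected_fields(bad))
```

If that guess is right, dataset-level values can't simply be added as a free-form line, which is why I'm asking whether there is a supported place for them.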