I'm running into an issue after importing and vectorizing data in my project. The import produces a large number of chunks, which is expected and actually helpful, since it keeps each chunk under the maximum token input limit of the embedding model.

The problem appears when I try to filter by metadata_content_type. I marked the field as facetable, and as a result a single PPTX file produces a facet count of 19 for the value "pptx". I assume this is because the search engine treats each chunk as a separate document and doesn't distinguish chunks from their parent document. Is there anything I can do about this? I need embeddings for large documents, which means chunking is required, but I also need faceted filtering to work.
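For context, here is a minimal sketch of how the facet query might look. I'm assuming the Azure AI Search Python SDK here; the endpoint, index name, and API key are placeholders:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="my-chunked-index",
    credential=AzureKeyCredential("<api-key>"),
)

# Facet over the content-type field. Every chunk is counted as its own
# document, so a single PPTX split into 19 chunks shows up as
# {"value": "pptx", "count": 19}.
results = search_client.search(search_text="*", facets=["metadata_content_type"], top=0)
for facet in results.get_facets()["metadata_content_type"]:
    print(facet["value"], facet["count"])
```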
What I expect is for facet counts to reflect distinct source documents, and for searches not to return multiple duplicate hits for the same document.
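Purely to illustrate what I mean by "distinct" counts: if each chunk carries a reference to its parent document (I'm assuming the chunks have a parent_id field, which the integrated chunking pipeline typically adds), the numbers I'm after would look like the client-side sketch below. Ideally I'd get this from the service itself rather than post-processing:

```python
from collections import defaultdict

# Hypothetical client-side workaround (not what I want long term):
# count distinct parent documents per content type instead of counting chunks.
# Reuses `search_client` from the snippet above; assumes a parent_id field.
results = search_client.search(
    search_text="*",
    select=["parent_id", "metadata_content_type"],
)

parents_by_type = defaultdict(set)
for chunk in results:
    parents_by_type[chunk["metadata_content_type"]].add(chunk["parent_id"])

for content_type, parents in parents_by_type.items():
    print(content_type, len(parents))  # e.g. "pptx 1" instead of "pptx 19"
```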