If you want to generate embeddings for documents using Azure OpenAI with the ada-002 model, you can send at most 8192 tokens per request to the API. If a document has more than 8K tokens, then, based on my investigation, it has to be processed with the following steps:
- Prepare the document text (clean, normalize, remove stop words) so that tokens can be counted the way Azure OpenAI ada-002 counts them.
- Tokenize the document text into words by splitting on spaces (" ").
- If the document has more than 8K tokens, split it into multiple sub-documents of at most 8K tokens each.
- Pass these sub-documents to the Azure OpenAI ada-002 endpoint and get an embedding for each sub-document.
- Concatenate those float embeddings (by appending) into one single vector that represents the original document (see the sketch after this list).
- Then, in order to find similar documents given a question, the question vector and the document vectors must have the same length, so we obviously need to reduce the dimensionality of the documents that were split and re-embedded into a single vector.
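To make the split/embed/concatenate steps concrete, here is a minimal sketch of that flow. GetEmbeddingAsync is a hypothetical placeholder for whatever Azure OpenAI client call you use to obtain the 1536-dimensional ada-002 embedding, and the whitespace split is only the rough token approximation from the steps above, not the model's actual tokenizer:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class DocumentEmbedder
{
    const int MaxTokensPerChunk = 8192;   // ada-002 request limit
    const int EmbeddingDimensions = 1536; // ada-002 embedding size

    // Hypothetical placeholder: call your Azure OpenAI ada-002 deployment here.
    static Task<float[]> GetEmbeddingAsync(string text) =>
        throw new NotImplementedException();

    // Split on whitespace (rough token count), chunk, embed, concatenate.
    public static async Task<float[]> EmbedDocumentAsync(string document)
    {
        string[] words = document.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        var chunks = new List<string>();
        for (int i = 0; i < words.Length; i += MaxTokensPerChunk)
            chunks.Add(string.Join(" ", words.Skip(i).Take(MaxTokensPerChunk)));

        var combined = new List<float>(chunks.Count * EmbeddingDimensions);
        foreach (string chunk in chunks)
            combined.AddRange(await GetEmbeddingAsync(chunk)); // 1536 floats per chunk

        return combined.ToArray(); // e.g. 1536 x 2 = 3072 floats for a two-chunk document
    }
}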
As an example, if a document (10K tokens) is split into two sub-documents (8K and 2K), each sub-document embedding will have 1536 dimensions, and therefore the complete document vector will have 1536 x 2 = 3072 dimensions. The question, which does not exceed 8K tokens, will have 1536 dimensions and therefore cannot be compared with those split documents.
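To see why the lengths must match: similarity is typically computed with cosine similarity, which is only defined for vectors of equal length. A rough sketch of such a helper (a hypothetical VectorMath utility, not from any particular library):

using System;

public static class VectorMath
{
    // Cosine similarity is only defined for vectors of the same dimensionality,
    // which is why a 3072-dim document vector cannot be compared to a 1536-dim question vector.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimensionality.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}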
So, is there any way to properly reduce those 3072-dimensional document vectors back to 1536 dimensions?
According to my research this can be done using PCA. I have found the following example in C# (it appears to use Accord.NET), but here the data is a double[][] (a set of vectors) instead of a single double[]:
using Accord.Statistics.Analysis;

double[][] data = new double[][]
{
    // ... Your combined embedding vectors here (one row per sample)
};

// Create a new Principal Component Analysis
var pca = new PrincipalComponentAnalysis()
{
    Method = PrincipalComponentMethod.Center, // mean-center the data before extracting components
    Whiten = false
};

// Learn the PCA model from the data matrix
pca.Learn(data);

// Transform the data into the reduced dimensionality space
double[][] reducedData = pca.Transform(data, 3); // Reducing to 3 dimensions
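For my case I assume the call would look roughly like the sketch below (still assuming Accord.NET): data would hold one 3072-dimensional combined vector per document, and the output size would be 1536 instead of 3. One caveat: PCA can only produce as many components as there are samples, so this presumably needs at least around 1536 document vectors to begin with.

// Assumption: data[i] is the 3072-dimensional combined vector of document i.
var pcaDocs = new PrincipalComponentAnalysis()
{
    Method = PrincipalComponentMethod.Center,
    Whiten = false
};

pcaDocs.Learn(data);            // fit on all combined document vectors
pcaDocs.NumberOfOutputs = 1536; // keep only the first 1536 principal components
double[][] docVectors1536 = pcaDocs.Transform(data);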
Any ideas?