If you want to generate embeddings for documents using Azure OpenAI with the ada-002 model, you can send at most 8192 tokens per request to the API. If a document has more than 8K tokens, then, based on my investigation, it has to be processed with the following steps:
- Prepare the document text (clean, normalize, remove stop words) so that tokens can be counted the way Azure OpenAI ada-002 counts them.
- Tokenize the document text into words by splitting on spaces (" ").
- If the document has more than 8K tokens, split it into multiple sub-documents of at most 8K tokens each.
- Pass these sub-documents to the Azure OpenAI ada-002 endpoint and get an embedding for each sub-document.
- Concatenate those float embeddings (by appending) into one single vector that represents the original document (see the sketch after this list).
- Then, in order to find similar documents given a question, the question vector and the document vectors must have the same length, so we obviously need to reduce the dimensionality of the documents that were split and re-embedded into a single vector.
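To make the split/embed/concatenate steps concrete, here is a minimal sketch of that flow. GetEmbeddingAsync is a hypothetical placeholder for whatever Azure OpenAI client call you use to obtain the 1536-dimensional ada-002 embedding, and the whitespace split is only the rough token approximation from the steps above, not the model's actual tokenizer:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class DocumentEmbedder
{
    const int MaxTokensPerChunk = 8192;   // ada-002 request limit
    const int EmbeddingDimensions = 1536; // ada-002 embedding size

    // Hypothetical placeholder: call your Azure OpenAI ada-002 deployment here.
    static Task<float[]> GetEmbeddingAsync(string text) =>
        throw new NotImplementedException();

    // Split on whitespace (rough token count), chunk, embed, concatenate.
    public static async Task<float[]> EmbedDocumentAsync(string document)
    {
        string[] words = document.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        var chunks = new List<string>();
        for (int i = 0; i < words.Length; i += MaxTokensPerChunk)
            chunks.Add(string.Join(" ", words.Skip(i).Take(MaxTokensPerChunk)));

        var combined = new List<float>(chunks.Count * EmbeddingDimensions);
        foreach (string chunk in chunks)
            combined.AddRange(await GetEmbeddingAsync(chunk)); // 1536 floats per chunk

        return combined.ToArray(); // e.g. 1536 x 2 = 3072 floats for a two-chunk document
    }
}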
As an example, if a document (10K tokens) is split into two sub-documents (8K and 2K), each sub-document embedding will have 1536 dimensions, and therefore the complete document vector will have 1536 x 2 = 3072 dimensions. The question, which does not exceed 8K tokens, will have 1536 dimensions and therefore cannot be compared with those split documents.
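To see why the lengths must match: similarity is typically computed with cosine similarity, which is only defined for vectors of equal length. A rough sketch of such a helper (a hypothetical VectorMath utility, not from any particular library):

using System;

public static class VectorMath
{
    // Cosine similarity is only defined for vectors of the same dimensionality,
    // which is why a 3072-dim document vector cannot be compared to a 1536-dim question vector.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimensionality.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}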
So, is there any way to properly reduce those 3072-dimensional document vectors back to 1536 dimensions?
According to my research this can be done using PCA. I have found the following example in C# (it appears to use Accord.NET), but here the data is a double[][] (a set of vectors) instead of a single double[]:
using Accord.Statistics.Analysis;

double[][] data = new double[][]
{
    // ... Your combined embedding vectors here (one row per sample)
};

// Create a new Principal Component Analysis
var pca = new PrincipalComponentAnalysis()
{
    Method = PrincipalComponentMethod.Center, // mean-center the data before extracting components
    Whiten = false
};

// Learn the PCA model from the data matrix
pca.Learn(data);

// Transform the data into the reduced dimensionality space
double[][] reducedData = pca.Transform(data, 3); // Reducing to 3 dimensions
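For my case I assume the call would look roughly like the sketch below (still assuming Accord.NET): data would hold one 3072-dimensional combined vector per document, and the output size would be 1536 instead of 3. One caveat: PCA can only produce as many components as there are samples, so this presumably needs at least around 1536 document vectors to begin with.

// Assumption: data[i] is the 3072-dimensional combined vector of document i.
var pcaDocs = new PrincipalComponentAnalysis()
{
    Method = PrincipalComponentMethod.Center,
    Whiten = false
};

pcaDocs.Learn(data);            // fit on all combined document vectors
pcaDocs.NumberOfOutputs = 1536; // keep only the first 1536 principal components
double[][] docVectors1536 = pcaDocs.Transform(data);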
Any ideas?