I want to automatically generate test cases using generative AI. For this purpose I will be using an open-source LLM (Llama 3 for now; I will try others as well). Since the LLM is trained only on publicly available data, it needs more information about the application being developed (the one I want to generate test cases for) and about the requirements, which contain detailed information on the expected behaviour.
This additional information can be provided through RAG. The vector database being used is ChromaDB.
As of now, I want this additional information to be provided as a PDF file.
From my research, there are many ways of dividing the text of this PDF file into smaller chunks:
- Recursive character splitting
- Sentence splitting
- Semantic splitting
- LLM-based chunking
- Document-specific splitting
Please do let me know if I missed some other useful method.
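To make the first two options concrete, here is a minimal sketch of how I understand them, written in plain Python with no library. The function names `sentence_split` and `recursive_split`, the `max_chars` parameter, the separator order, and the sample text are all my own assumptions for illustration, not the API of any particular framework; real splitters also add things like chunk overlap, which this sketch leaves out.

```python
import re


def sentence_split(text, max_chars=200):
    """Group whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole, so a chunk
    can exceed the limit in that case.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first and merge neighbouring pieces
    while they fit; pieces that are still too long are split again with
    the next, finer separator. Separators at chunk boundaries are dropped."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current.strip():
        chunks.append(current)
    return chunks


# Hypothetical requirement text standing in for the text extracted from the PDF.
sample = (
    "Login requirements.\n\n"
    "The user enters a username and a password. "
    "After three failed attempts the account is locked for ten minutes.\n\n"
    "A password reset link is sent by email."
)

for chunk in recursive_split(sample, max_chars=60):
    print(repr(chunk))
```

The difference I see between the two is that the sentence splitter never cuts inside a sentence, while the recursive splitter guarantees an upper bound on chunk size and only cuts as finely as it has to.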
So I have two questions:
- How do I decide which chunking method to choose?
- How can I evaluate the performance of the chosen chunking technique?
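For the second question, one approach I can imagine is to write a small set of questions together with the passage that should be retrieved for each, and then measure how often that passage shows up in the top-k retrieved chunks (a hit rate). Below is a rough sketch of that idea. The `retrieve` function is a toy word-overlap ranker that I made up as a stand-in for the real embedding search in ChromaDB, and the document text, questions, and expected passages are hypothetical examples.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, chunks, k=2):
    """Toy retriever: rank chunks by Jaccard word overlap with the query.

    Assumption: in the real pipeline this would be replaced by a
    similarity query against the vector database.
    """
    q = tokens(query)

    def score(chunk):
        c = tokens(chunk)
        union = q | c
        return len(q & c) / len(union) if union else 0.0

    return sorted(chunks, key=score, reverse=True)[:k]


def hit_rate(chunks, eval_set, k=2):
    """Fraction of questions whose expected passage is fully contained
    in at least one of the top-k retrieved chunks."""
    hits = 0
    for question, expected in eval_set:
        top = retrieve(question, chunks, k)
        if any(expected.lower() in chunk.lower() for chunk in top):
            hits += 1
    return hits / len(eval_set)


# Hypothetical requirement text and hand-written evaluation pairs.
document = (
    "The user logs in with a username and a password. "
    "After three failed attempts the account is locked for ten minutes. "
    "A password reset link is sent by email and expires after one hour."
)
eval_set = [
    ("How long is the account locked?", "locked for ten minutes"),
    ("When does the reset link expire?", "expires after one hour"),
]

# Two chunkings of the same text to compare.
by_sentence = re.split(r"(?<=[.])\s+", document)
by_fixed_size = [document[i:i + 40] for i in range(0, len(document), 40)]

for name, chunks in [("sentence", by_sentence), ("fixed-40", by_fixed_size)]:
    print(name, hit_rate(chunks, eval_set, k=1))
```

The same evaluation set can be reused for every chunking method, so the numbers are comparable. This only measures retrieval, though; whether the generated test cases are actually better would still need a separate check on the LLM output.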