I have a large collection of text snippets that all adhere to a certain template, like the example below.
**1. PERSONAL DETAILS**
* Age: 31
* Education: Professional bachelor in applied informatics
* Years active: 7 years
* Civil status: Married
* Sector/Industry: Private Sector
**2. SYMPTOMS**
* Lower back pain: No
* Bad vision: No
* Allergies: peanuts, pollen
* ...
**3. RECORDS**
* Amount of sick days taken: 7
... you get the point
I would like to use a local LLM to process all these snippets. I provide a JSON schema, defined with Zod or Pydantic for example, and have the model fill in as many of its properties as it can extract from each snippet:
import { z } from "zod";

const schema = z.object({
  age: z.number().nullable(),
  education: z.string().nullable(),
  years_active: z.number().nullable(),
  sector: z.string().nullable(),
  lower_back_pain: z.boolean().nullable(),
  allergies: z.string().array(),
  sick_days: z.number().nullable(),
});
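After each run I validate the raw model reply against this same schema before using it, roughly like this (a simplified sketch; `parseSnippetReply` is just a hypothetical helper name):

```ts
// Hypothetical helper: validate one raw model reply against the schema above.
function parseSnippetReply(raw: string) {
  const result = schema.safeParse(JSON.parse(raw)); // JSON.parse throws if the reply isn't JSON at all
  if (!result.success) {
    // result.error describes which properties failed validation
    throw result.error;
  }
  return result.data; // typed object matching the schema
}
```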
So far I've had the best results with Ollama running llama3 plus LangChain's StructuredOutputParser, but I can't get it to reliably emit strict JSON. Am I using the wrong model? Should I fine-tune? What can I do to improve my results?
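For reference, here is a minimal sketch of the chain I'm currently running (simplified; the prompt wording and the `snippet` value are placeholders):

```ts
import { ChatOllama } from "@langchain/community/chat_models/ollama";
import { StructuredOutputParser } from "langchain/output_parsers";
import { PromptTemplate } from "@langchain/core/prompts";

// `schema` is the Zod schema shown above
const parser = StructuredOutputParser.fromZodSchema(schema);

const prompt = PromptTemplate.fromTemplate(
  "Extract the requested fields from the snippet. " +
    "Use null for anything that is not mentioned.\n" +
    "{format_instructions}\n\nSnippet:\n{snippet}"
);

const model = new ChatOllama({ model: "llama3", temperature: 0 });

// prompt -> model -> parser; the parser throws whenever the reply
// is not the JSON payload its format instructions asked for
const chain = prompt.pipe(model).pipe(parser);

const result = await chain.invoke({
  format_instructions: parser.getFormatInstructions(),
  snippet: "...", // one of the snippets above
});
```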