I’m trying to build a chain that chunks a long PDF document (currently loaded as markdown). I have created the following Pydantic classes:
from langchain.pydantic_v1 import BaseModel, Field
from typing import List

class HeaderSection(BaseModel):
    """Class to save a section header and text from the section"""
    header: str = Field(description="Header of a section from the document.")
    text: str = Field(description="Text under the associated header.")

class AllSections(BaseModel):
    sections: List[HeaderSection]
I then set up the structured output with this code:
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
# llm = ChatOpenAI(temperature=0, model_name="gpt-4o")
llm = ChatAnthropic(model="claude-3-5-sonnet-20240620")
structured_json_output = llm.with_structured_output(AllSections)
system = """You are tasked with splitting up a document into sections. These sections all have headers, which is how you will determine where to split all of the data.
Return the header in the "header" field of the HeaderSection class. The text that comes below the header, return the text as part of the "text" field of the HeaderSection class.
You will be given the input text, and the headers.
Here's an example on how a chunk of data will be stored.
EXAMPLE TEXT:
# Bill of Lading and Driver Signature
Item 6
The signature of a Carrier Freight Driver/Sales Representative on any Bill of Lading other than a Carrier’s Bill of Lading will act only to acknowledge the receipt of freight as described on the document. This signature will not acknowledge agreement to any terms and conditions of carriage and/or liability conditions that may also appear on the document. Unless there is a written agreement, separate from the Bill of Lading, signed by shipper and Carrier, then the Carrier Freight Bill of Lading Terms and Conditions will apply.
EXAMPLE OUTPUT:
HeaderSection(header="Bill of Lading and Driver Signature Item 6", text="The signature of a Carrier Freight Driver/Sales Representative on any Bill of Lading other than a Carrier’s Bill of Lading will act only to acknowledge the receipt of freight as described on the document. This signature will not acknowledge agreement to any terms and conditions of carriage and/or liability conditions that may also appear on the document. Unless there is a written agreement, separate from the Bill of Lading, signed by shipper and Carrier, then the Carrier Freight Bill of Lading Terms and Conditions will apply.")
NOTE: The item number may not always be present.
"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "Headers: {headers}\n\nText: {text}"),
    ]
)
chunking_chain = prompt | structured_json_output
output = chunking_chain.invoke({"headers": llama_parse_headers_and_items, "text": llama_data_md[0].text})
And I receive this error:
ValidationError: 1 validation error for AllSections
sections
  field required (type=value_error.missing)
When I set the HeaderSection class as the structured output class, it works and returns one section, but I need it to return all of the sections, which is why I’m trying to use the AllSections class, which is supposed to hold a list of HeaderSection objects. Any idea what this error means in this context, and how I can get this to run?
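A minimal sketch of what the error is reporting, assuming plain `pydantic` is installed (the import below is a stand-in for `langchain.pydantic_v1`, and both sample payloads are made up for illustration). The message says that whatever payload reached `AllSections` had no top-level `sections` key, so the same failure can be reproduced without calling the model:

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class HeaderSection(BaseModel):
    """A section header and the text under it."""
    header: str = Field(description="Header of a section from the document.")
    text: str = Field(description="Text under the associated header.")


class AllSections(BaseModel):
    sections: List[HeaderSection]


# Hypothetical payloads standing in for the arguments of the model's tool call.
empty_payload = {}
good_payload = {
    "sections": [
        {
            "header": "Bill of Lading and Driver Signature Item 6",
            "text": "Text under the header.",
        }
    ]
}

try:
    AllSections(**empty_payload)
    missing_fields = []
except ValidationError as exc:
    # Each error entry records the location of the field that failed.
    missing_fields = [error["loc"] for error in exc.errors()]

# A payload with a "sections" list of header/text dicts validates cleanly.
parsed = AllSections(**good_payload)

print(missing_fields)
print(len(parsed.sections))
```

If this reproduces the message, the schema itself is probably fine and the model's reply is the thing to inspect. `with_structured_output` accepts `include_raw=True`, which returns the raw message alongside the parsed result. One possibility worth checking (an assumption, not something the error alone confirms) is that a long reply was cut off by the output token limit before `sections` was emitted.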