https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/
The default value of the separators argument contains "" (the 4th element). I wonder why it is needed, because I think the result is the same without it. What does "" mean here, and how can "" affect the result?
In the example below, I tested two cases: separators=["\n\n", "\n", " ", ""] and separators=["\n\n", "\n", " "]. The result (len(texts)) was 16 for both.
document = """
Lena had always loved the beach, but this visit felt different. The sun dipped below the horizon, painting the sky with hues of orange and pink. She walked along the shoreline, the cool water lapping at her feet, carrying away the stress of her city life.
Suddenly, Lena noticed something shimmering in the sand. She bent down and unearthed an old, tarnished locket. Opening it, she found a faded photograph of a young couple smiling brightly. Curiosity piqued, she turned it over and saw a date: July 10, 1944.
Lena's mind raced. Who were they? How had the locket ended up here? She decided to find out.
Back at her beachside cottage, Lena asked the locals about the locket. An elderly man named Mr. Thompson remembered the couple—John and Mary—who had visited the beach every summer during the war. John was a soldier; Mary waited for him by the shore.
Inspired, Lena wrote an article about their love story, hoping to reconnect the locket with their descendants. Weeks later, she received an email from a grateful granddaughter.
The locket was returned, and Lena felt a deep connection to a love story that had once graced the same shores she adored.
"""
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
    # Second case: the same list without the trailing ""
    separators=["\n\n", "\n", " ", ""],
)
texts = text_splitter.create_documents([document])
len(texts)
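
One way to see what "" can change is a simplified pure-Python sketch of recursive splitting. This is not LangChain's actual code: split_recursive is a hypothetical helper written only for illustration, it ignores chunk_overlap, and it assumes that an empty separator means "split between every character". Under that assumption, "" only matters when a piece with no remaining separators is still longer than chunk_size, for example a very long word or URL. In the text above no single word is longer than 100 characters, which would explain why both separator lists give the same count.

```python
def split_recursive(text, separators, chunk_size):
    """Simplified stand-in for recursive splitting (illustration only)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: the oversized piece is returned as-is.
        return [text]
    sep, rest = separators[0], separators[1:]
    # Assumption: an empty separator splits between every character.
    pieces = list(text) if sep == "" else text.split(sep)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the remaining separators.
            chunks.extend(split_recursive(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


# A 25-character "word" with no whitespace inside it.
sample = "short " + "a" * 25

with_empty = split_recursive(sample, [" ", ""], chunk_size=10)
without_empty = split_recursive(sample, [" "], chunk_size=10)

print(max(len(c) for c in with_empty))     # 10
print(max(len(c) for c in without_empty))  # 25
```

With "" in the list, the long word is cut into pieces that respect chunk_size; without it, the word comes back as one oversized chunk. To check this against the real splitter, the same sample string can be passed to RecursiveCharacterTextSplitter with each separator list.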