I am reading a list of PDF files and concatenating their contents into a single string. I then try to split the string into train_text and validation_text to train an LLM. When I use the train_test_split function from sklearn, I get the following error:
ValueError: With n_samples=1, test_size=0.1 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters. I know it has something to do with the split("\n") call, but what's wrong, and how can I solve it?
<code>import re

from pypdf import PdfReader  # or: from PyPDF2 import PdfReader
from sklearn.model_selection import train_test_split

text = ""
pages = 0
for i in range(len(lista)):
    reader = PdfReader(lista[i])
    print("File:", i)
    pages += len(reader.pages)
    for j in range(len(reader.pages)):
        page = reader.pages[j]
        texto = page.extract_text()
        text += texto
print(pages)

##############################################################################
def clean_text(text):
    # Remove unwanted characters and noise
    text = re.sub(r'\s+', ' ', text)  # Replace runs of whitespace with a single space
    text = text.lower()  # Convert text to lowercase
    return text.strip()

text = clean_text(text)

#############################################################################
# Split text into training and validation sets
train_text, validation_text = train_test_split(text.split("\n"), test_size=0.1)
</code>
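For reference, the cleaning and splitting steps can be reproduced in isolation with a made-up sample string (no PDFs needed) to check how many samples actually reach train_test_split:

```python
import re

def clean_text(text):
    # \s matches newlines too, so runs of whitespace (including "\n")
    # collapse to a single space
    text = re.sub(r'\s+', ' ', text)
    return text.lower().strip()

sample = "First line\nSecond line\nThird line"
cleaned = clean_text(sample)
lines = cleaned.split("\n")
print(len(lines))  # -> 1
```

After cleaning, the text contains no newline characters, so splitting on "\n" yields a single-element list.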