I’m trying to break a text into chunks using a tokenization-aware approach that splits at spaces or newlines when possible, so that words (and, if feasible, lines) are not broken across chunks. However, some words are missing from the final output, particularly when the size of a chunk in tokens equals the maximal chunk size.
Here’s the code I’m using:
from transformers import AutoTokenizer
# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained('model_id')
# Tokenize text
tokens = tokenizer.encode(my_text, return_tensors='pt')
# Chunk tokens into text
texts_chunks = chunk_tokens_into_text(tokens, chunk_size, overlap)  # chunk_size and overlap are set elsewhere
# Ensure the final text matches the original
assert ''.join(texts_chunks) == my_text
The chunking functions:
def get_this_chunk(pre_de_chunk):
    # Prefer cutting at the last newline if there is one
    split_by_endline = pre_de_chunk.split("\n")
    if len(split_by_endline) > 1:
        return "\n".join(split_by_endline[:-1])
    # Otherwise cut at the last space
    split_by_space = pre_de_chunk.split(" ")
    if len(split_by_space) > 1:
        return " ".join(split_by_space[:-1])
    # No newline or space found: drop the last character
    return pre_de_chunk[:-1]
def chunk_tokens_into_text(tokens, chunk_size, overlap):
    text_chunks = []
    i = 0
    end = len(tokens[0])
    while True:
        next_pointer = i + chunk_size
        if next_pointer >= end:
            # Collect the last chunk and stop
            final_chunk = tokens[0][i:]
            text_chunk = tokenizer.decode(final_chunk, skip_special_tokens=True)
            text_chunks.append(text_chunk)
            return text_chunks
        # Candidate chunk of chunk_size tokens
        pre_chunk = tokens[0][i:next_pointer]
        pre_de_chunk = tokenizer.decode(pre_chunk, skip_special_tokens=True)
        # Trim back to the last newline/space so words are not split
        text_chunk = get_this_chunk(pre_de_chunk)
        # Re-encode the trimmed text to find out how many tokens it covers
        size_in_tokens = tokenizer.encode(text_chunk, return_tensors='pt').size()[1]
        text_chunks.append(text_chunk)
        # Advance the pointer, keeping `overlap` tokens of overlap
        i += size_in_tokens - overlap
After running chunk_tokens_into_text, the reassembled text (''.join(texts_chunks)) does not match the original my_text: some words are missing, especially when the size of the re-encoded chunk (size_in_tokens) equals chunk_size.
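For reference, this is the kind of check one could use to locate the divergence; it is only a minimal sketch, assuming texts_chunks and my_text from the snippet above:
# Hypothetical diagnostic: find the first index where the reassembled
# text and the original text differ, then print some context around it.
reassembled = ''.join(texts_chunks)
first_diff = next(
    (k for k, (a, b) in enumerate(zip(reassembled, my_text)) if a != b),
    min(len(reassembled), len(my_text)),
)
print(repr(reassembled[max(0, first_diff - 40):first_diff + 40]))
print(repr(my_text[max(0, first_diff - 40):first_diff + 40]))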
What I tried:
- Adjusting how the i pointer is incremented, but the issue persists.
- The get_this_chunk function is meant to refine chunk boundaries by cutting at the last newline or space, but it seems to contribute to the word-dropping (see the small example after this list).
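To make that suspicion concrete, here is a tiny sketch with a made-up input string: when the decoded chunk contains neither a newline nor a space, get_this_chunk falls through to its last branch and silently drops the final character.
# Hypothetical example string; any chunk that decodes to a single long
# "word" (no newline, no space) ends up in the last branch of get_this_chunk.
sample = "supercalifragilistic"
print(get_this_chunk(sample))  # prints "supercalifragilisti" (last character lost)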
What is causing this effect? And how can I solve the issue?