I have a function, tokenize:
def tokenize(text, max_len=MAX_LEN):
    encoded = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=max_len,
        padding='max_length',
        return_attention_mask=True
    )
    return encoded['input_ids'], encoded['attention_mask']
And a block of code that applies it to some training data:
df_train['input_ids'], df_train['attention_masks'] = df_train['text'].progress_apply(tokenize)
The two values returned by tokenize, encoded['input_ids'] and encoded['attention_mask'], are both lists. When the function is applied to a column of text data (i.e. df_train['text'].progress_apply(tokenize)), each row yields a tuple of two lists, (encoded['input_ids'], encoded['attention_mask']).
Despite changes, the following error persists:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[13], line 1
----> 1 df_train['input_ids'], df_train['attention_masks'] = df_train['text'].progress_apply(tokenize)
2 df_val['input_ids'], df_val['attention_masks'] = df_val['text'].progress_apply(tokenize)
3 df_test['input_ids'], df_test['attention_masks'] = df_test['text'].progress_apply(tokenize)
ValueError: too many values to unpack (expected 2)
My understanding is this: the list inside the tuple is being unpacked for some reason, and instead of each list being saved to df_train['input_ids'] and df_train['attention_masks'], it is trying to assign the contents of the first list in the double assignment, and failing because there are more than two values.
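To check this, here is a minimal reproduction. It is only a sketch: fake_tokenize is a hypothetical stand-in for the real tokenizer (it just returns two lists of length max_len), and plain .apply is used instead of .progress_apply so that tqdm is not required. In this toy case the result of apply looks like a Series holding one tuple per row, so the double assignment appears to iterate over the rows rather than over the contents of the first list:

```python
import pandas as pd

def fake_tokenize(text, max_len=4):
    # Hypothetical stand-in for tokenize: returns two lists of length max_len
    ids = [ord(c) for c in text[:max_len]]
    ids += [0] * (max_len - len(ids))          # pad with zeros
    mask = [1 if i != 0 else 0 for i in ids]   # 1 for real tokens, 0 for padding
    return ids, mask

df = pd.DataFrame({"text": ["ab", "cde", "f"]})
result = df["text"].apply(fake_tokenize)

# One (ids, mask) tuple per row, so the Series has as many items as the frame has rows
n_items = len(result)
first_item = result.iloc[0]

# Unpacking iterates over the rows of the Series; with more than two rows it fails
error_message = None
try:
    a, b = result
except ValueError as exc:
    error_message = str(exc)

print(n_items, type(first_item).__name__, error_message)
```

If that is what is happening, the number of values being unpacked is the number of rows in df_train, not the length of a token list.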
I tried the solution in "Unpack two variables into two columns in DataFrame". Unfortunately, it causes the returned tuple to contain Series rather than lists, which causes problems downstream.
Is there any way to have df_train['input_ids'] and df_train['attention_masks'] be populated by the output of encoded['input_ids'] and encoded['attention_mask'], in list form, for the relevant training example passed in by df_train['text'].progress_apply(tokenize)?
Using a temporary column and splitting it afterwards, as in "‘Too many values to unpack’ trying to call function for two dataframe columns using apply and lambda", seems clunky. Is there a more elegant solution, possibly using the .explode() method?
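For reference, this is roughly the shape of result I am after. The sketch below assumes that transposing the per-row tuples with zip(*...) is acceptable; fake_tokenize is again a hypothetical stand-in for the real tokenizer, and plain .apply replaces .progress_apply (which I would expect to behave the same once tqdm.pandas() has been called, though I have not verified that here):

```python
import pandas as pd

def fake_tokenize(text, max_len=4):
    # Hypothetical stand-in for tokenize: returns two lists of length max_len
    ids = [ord(c) for c in text[:max_len]]
    ids += [0] * (max_len - len(ids))          # pad with zeros
    mask = [1 if i != 0 else 0 for i in ids]   # 1 for real tokens, 0 for padding
    return ids, mask

df = pd.DataFrame({"text": ["ab", "cde", "f"]})

# zip(*...) turns N (ids, mask) tuples into two tuples of N lists each
ids_col, mask_col = zip(*df["text"].apply(fake_tokenize))

# Each cell then holds a plain Python list rather than a Series
df["input_ids"] = list(ids_col)
df["attention_masks"] = list(mask_col)

print(df)
```

This avoids the temporary column, but it materialises all the tuples before assignment, so I am not sure whether it is the most idiomatic option for a large frame.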