I have a pandas dataframe with a column.
id  text_col
1   Was it Accurate?: Yes\n\nReasoning: This is a sample : text
2   Was it Accurate?: Yes\n\nReasoning: This is a :sample 2 text
3   Was it Accurate?: No\n\nReasoning: This is a sample: 1. text
I need to split text_col into two columns, "Was it Accurate?" and "Reasoning".
The final dataframe should look like:
id  Was it Accurate?  Reasoning
1   Yes               This is a sample : text
2   Yes               This is a :sample 2 text
3   No                This is a sample: 1. text
The text values can contain multiple colons (:).
I tried splitting text_col on "\n\nReasoning:" but didn't get the desired result. It leaves out the text after the second colon (:).
df[['Was it Accurate?', 'Reasoning']] = df['text_col'].str.extract(r'Was it Accurate\?: (Yes|No)\n\nReasoning: (.*)')
Edit:
I applied the function to the LLM_response column of my sample_100 dataframe and printed the first row. If you look closely, sample_100.iloc[0]['Reasoning'] has lost all the text after the colon (:).
Temporary dict object to test on:
{'id_no': [8736215],
'Notes': [' Temp Notes Sample xxxxxxxxxxxxx [4/21/23, 2:10 PM] Work started -work complete-'],
'ProblemDescription': ['Sample problem description xxxxxxxxxxxxxxxxxxxxxxxx'],
'LLM_response': ['Accurate & Understandable: Yes\n\nReasoning: The Technician notes are accurate and understandable as:\n1) The technician provided detailed steps on how they addressed the mold issue by removing materials, treating surfaces, priming, and painting them.\n2) Additionally, even though there was non-repair related information (toilet repairs), the main issue of mold growth was addressed.\n3) The process described logically follows the process for remedying a mold issue, which aligns with the problem description.'],
'Accurate & Understandable': ['Yes'],
'Reasoning': ['The Technician notes are accurate and understandable as:']}
The issue is not due to the colons but to the newlines in your sample text. By default, the regex dot (.) does not match newline characters, so (.*) stops at the first newline. You should add the re.DOTALL flag.
Example:
import re
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'text_col': ['Was it Accurate?: Yes\n\nReasoning: This is a sample: text',
                                'Was it Accurate?: Yes\n\nReasoning: This is a sample:\n with newline',
                                'Was it Accurate?: No\n\nReasoning: This is a sample text']})

# Escape the literal "?" and let "." match newlines via re.DOTALL
df[['Was it Accurate?', 'Reasoning']] = (df['text_col']
    .str.extract(r'Was it Accurate\?: (Yes|No)\n\nReasoning: (.*)',
                 flags=re.DOTALL)
)
Output:
   id  text_col                                                                Was it Accurate?  Reasoning
0   1  Was it Accurate?: Yes\n\nReasoning: This is a sample: text              Yes               This is a sample: text
1   2  Was it Accurate?: Yes\n\nReasoning: This is a sample:\n with newline    Yes               This is a sample:\n with newline
2   3  Was it Accurate?: No\n\nReasoning: This is a sample text                No                This is a sample text
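To see the effect of the flag on the kind of text shown in the question's edit, here is a minimal sketch that runs the same pattern with and without re.DOTALL. The names sample, pattern, without_flag and with_flag are illustrative, and the string is a truncated version of the LLM_response value, assuming it contains real newline characters.

```python
import re
import pandas as pd

# Truncated version of the LLM_response value from the question
# (assumption: the "\n" sequences are real newline characters).
sample = pd.Series([
    'Accurate & Understandable: Yes\n\n'
    'Reasoning: The Technician notes are accurate and understandable as:\n'
    '1) The technician provided detailed steps.\n'
    '2) The main issue of mold growth was addressed.'
])

pattern = r'Accurate & Understandable: (Yes|No)\n\nReasoning: (.*)'

# Without DOTALL, "." stops at the first newline, so the numbered points are lost.
without_flag = sample.str.extract(pattern)

# With DOTALL, "." also matches newlines, so the whole reasoning is captured.
with_flag = sample.str.extract(pattern, flags=re.DOTALL)

print(repr(without_flag.loc[0, 1]))
print(repr(with_flag.loc[0, 1]))
```

The first print shows only the line ending in "as:", which is the truncation described in the question; the second keeps the numbered points as well.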
Another way to handle it is with the split function, applied to the text_col column.
# First column: take the text before the first newline, then the part after the colon
df['Was it Accurate?'] = df['text_col'].apply(lambda x: x.split('\n')[0].split(':')[1].strip())
# Reasoning: split once on the blank line, then once on the first ':' so that
# any later colons and newlines stay in the text
df['Reasoning'] = df['text_col'].apply(lambda x: x.split('\n\n', 1)[1].split(':', 1)[1].strip())
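A vectorized variant of the same split idea could look like the sketch below. It assumes every row contains the exact separator "\n\nReasoning: " and the fixed label "Was it Accurate?: "; the variable name parts is illustrative, and the two-row frame is a cut-down copy of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'text_col': ['Was it Accurate?: Yes\n\nReasoning: This is a sample : text',
                 'Was it Accurate?: No\n\nReasoning: This is a sample: 1. text'],
})

# Split only once on the separator, so colons and newlines later in the
# reasoning text are left untouched.
parts = df['text_col'].str.split('\n\nReasoning: ', n=1, expand=True)

# Remove the fixed label from the first piece; regex=False keeps "?" literal.
df['Was it Accurate?'] = parts[0].str.replace('Was it Accurate?: ', '', regex=False)
df['Reasoning'] = parts[1]

print(df[['id', 'Was it Accurate?', 'Reasoning']])
```

Because the split is limited to one occurrence (n=1), this avoids rejoining the pieces with ':' afterwards.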