The following code successfully identifies the 2nd and 3rd texts, and only those texts, in a pandas dataframe by searching for rows that contain the word “cod” or “i”:
import numpy as np
import pandas as pd
texts_df = pd.DataFrame({"id":[1,2,3,4],
"text":["she loves coding",
"he was eating cod",
"i do not like fish",
"fishing is not for me"]})
texts_df.loc[texts_df["text"].str.contains(r'\b(cod|i)\b', regex=True)]
I would like to build the list of words up dynamically by inserting words from a long list but I can’t figure out how to do that successfully.
I’ve tried the following, but I get an error saying “r is not defined”. I expected that, since r is not a variable, but I can’t put it inside the string either, and I don’t know what I should do instead:
kw_list = ["cod", "i"]
kw_regex_string = "\b("
for kw in kw_list:
    kw_regex_string = kw_regex_string + kw + "|"
kw_regex_string = kw_regex_string[:-1] # remove the final "|" at the end
kw_regex_string = kw_regex_string + ")\b"
myregex = r + kw_regex_string
texts_df.loc[texts_df["text"].str.contains(myregex, regex=True)]
How can I build the ‘or’ condition from the list of keywords and then insert it into the regex in a way that will work in the pandas dataframe search?
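For reference, here is a minimal sketch of one way this might be done. It relies on the fact that the r prefix is only part of how a string literal is written in source code, so it cannot be added to a string afterwards; the backslashes have to be written in raw literals at the point where the pattern is assembled. The name kw_pattern, the use of re.escape, and the non-capturing group (?:...) are choices made for this sketch rather than anything from the code above.

```python
import re
import pandas as pd

texts_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["she loves coding",
             "he was eating cod",
             "i do not like fish",
             "fishing is not for me"],
})

kw_list = ["cod", "i"]

# Join the keywords with "|" to form the alternation. re.escape guards
# against keywords that contain regex metacharacters such as "." or "+".
alternatives = "|".join(re.escape(kw) for kw in kw_list)

# Raw literals keep "\b" as a regex word boundary rather than a backspace
# character. (?:...) groups without capturing, which avoids the pandas
# warning about match groups in str.contains.
kw_pattern = r"\b(?:" + alternatives + r")\b"

matches = texts_df.loc[texts_df["text"].str.contains(kw_pattern, regex=True)]
print(matches)
```

With the two keywords above, kw_pattern ends up as \b(?:cod|i)\b, and the same construction should carry over to a much longer kw_list without further changes.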