I am working with call centre transcripts. In the ideal case, the speech-to-text software transcribes an e-mail address as follows: [email protected]. This is not always the case, so I am looking for a regular expression (RegEx) solution that accommodates whitespace in the e-mail address, e.g. maya [email protected] or maya.lucco @proton.me or maya-lucco@pro ton.me
I have unsuccessfully tried to extend this solution with regex101. Compiling a re object (pattern) as suggested in this solution seems overly complex for the task. I looked at the posts on validating e-mail addresses, but they describe a different issue. Below is my code so far:
import re
import pandas as pd

# creating some data
test = ['some random text maya @ proton.me with some more text [email protected]',
        '[email protected] with another address [email protected]',
        'some text maya.lucco @proton.me with some more bla [email protected]',
        '[email protected] more text maya@ proton.me '
        ]
test = pd.DataFrame(test, columns=['words'])

# creating a function because I would like to add some other data cleaning to it later on
def anonymiseEmail(text):
    text = str(text)     # make sure text is a string
    text = text.strip()  # remove leading and trailing whitespace
    # replaces only the non-whitespace characters directly around '@'
    text = re.sub(r'\S*@\S*\s?', '{e-mail}', text)
    return text

# applying the function
test['noEmail'] = test.words.apply(anonymiseEmail)

# checking the results
print(test.noEmail[0])
How can the code be extended so that the whole e-mail address, regardless of how many whitespace characters it contains, is replaced with a placeholder or deleted?
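To make the target behaviour concrete, here is a minimal sketch of one possible direction, not a complete solution. The names `EMAIL_WITH_GAPS` and `anonymise_email_sketch` are hypothetical. It assumes that whitespace may appear around the `@` and at most once inside the part of the domain before the first dot; whitespace inside the local part (e.g. a first name separated from the rest by a space) is ambiguous and deliberately not handled, and spaces next to a dot are not handled either.

```python
import re

# Hypothetical pattern: an assumption about which gaps the transcription produces.
EMAIL_WITH_GAPS = re.compile(
    r"""
    [\w.+-]+          # local part, without any whitespace
    \s*@\s*           # the at-sign, optionally surrounded by whitespace
    [\w-]+            # first chunk of the domain
    (?:\s[\w-]+)?     # optionally one more chunk after a single stray whitespace
    (?:\.[\w-]+)+     # one or more dot-separated labels, e.g. ".me" or ".co.uk"
    """,
    re.VERBOSE,
)


def anonymise_email_sketch(text, placeholder="{e-mail}"):
    """Replace every match of EMAIL_WITH_GAPS in text with placeholder."""
    return EMAIL_WITH_GAPS.sub(placeholder, str(text))


if __name__ == "__main__":
    samples = [
        "some random text maya @ proton.me with some more text",
        "some text maya.lucco @proton.me with some more bla",
        "more text maya@ proton.me",
        "maya-lucco@pro ton.me",
    ]
    for sample in samples:
        print(anonymise_email_sketch(sample))
```

Passing an empty string as `placeholder` deletes the address instead of replacing it. The function can be applied to the data frame in the same way as above, e.g. `test.words.apply(anonymise_email_sketch)`. Because the pattern tolerates one space inside the domain, it can also swallow an unrelated word that happens to sit between an address fragment and a dotted token, so it should be checked against real transcripts before use.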