I’m trying to create a script that matches regex patterns in a series of text files and then removes those matches from each file. Right now I have the following, which works for my purposes, but I don’t think it’s an effective way to do it:
import os
import re

os.chdir("/home/user1/test_files")

patterns = ['(bannana)',
            '(peaches)',
            '(apples)'
            ]
subst = ""

cwd = os.getcwd()
for filename in os.listdir(cwd):
    # First pass: apply the combined pattern and write output_<filename>.
    with open(filename, 'r', encoding="utf8") as f:
        file = f.read()
        result = re.sub('|'.join(patterns), subst, file, re.MULTILINE)
    with open("/home/user1/output_files/" + "output_" + str(filename), 'w', encoding="utf-8") as newfile:
        newfile.write(result)
    # Then re-apply each pattern individually, rewriting the output file on every pass.
    for pattern in patterns:
        with open('/home/user1/output_files/output_' + str(filename), 'r', encoding="utf8") as f:
            file = f.read()
            result = re.sub(pattern, subst, file, re.MULTILINE)
        with open('/home/user1/output_files/output_' + str(filename), 'w', encoding="utf-8") as newfile:
            newfile.write(result)
So, let’s say I have a file, grocery.txt, and I want to remove the words apples, peaches, and bannana. The above script will first run through and create an output file, output_grocery.txt. It will then iterate through the patterns list, removing each pattern from output_grocery.txt and rewriting the file after every pass.
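To make that concrete, the per-pattern loop boils down to one full substitution pass per pattern (the sample line below is made up):

import re

text = "apples, peaches, and a bannana"
for pattern in ['(bannana)', '(peaches)', '(apples)']:
    # Each pass rewrites the entire text with one more pattern removed.
    text = re.sub(pattern, "", text)
print(text)  # -> ", , and a "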
The way I’m doing this right now is not scalable. I’ll eventually need to run this on hundreds of files, each of which would be rewritten once per regex pattern. I originally tried doing it in a single call, using:
result = re.sub('|'.join(patterns), subst, file, re.MULTILINE)
thinking that would remove all the patterns from the file at once. However, it only removes the first pattern, in this case bannana.
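For reference, the one-pass shape I’m after is roughly this sketch: compile the alternation once, then apply it once per file (directories as in my setup above; untested beyond small inputs):

import os
import re

patterns = ['(bannana)', '(peaches)', '(apples)']
# re.compile takes the flags as its second argument.
combined = re.compile('|'.join(patterns), re.MULTILINE)

src_dir = "/home/user1/test_files"
out_dir = "/home/user1/output_files"
for filename in os.listdir(src_dir):
    # One read, one combined substitution, one write per file.
    with open(os.path.join(src_dir, filename), 'r', encoding="utf8") as f:
        result = combined.sub("", f.read())
    with open(os.path.join(out_dir, "output_" + filename), 'w', encoding="utf8") as newfile:
        newfile.write(result)

I’m not sure whether that direction is sound either, which brings me to my question.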
Is there a better, more scalable way to do this?