I am attempting to remove all extraneous tags, URLs, and scripts from HTML prior to running the text through an LLM. Right now I have the following Python function.
from urllib.parse import unquote, urlparse

from bs4 import BeautifulSoup


def remove_tags(html) -> str:
    # First we decode any percent-encoded text
    html = unquote(html)
    # Next we strip out all of the HTML tags
    soup = BeautifulSoup(html, "html.parser")
    for data in soup(['style', 'script']):
        # Remove the tag and everything inside it
        data.decompose()
    # Now we get rid of the URLs
    tag_free = ' '.join(soup.stripped_strings)
    words = tag_free.split()
    for i, word in enumerate(words):
        parsed_url = urlparse(word)
        if parsed_url.scheme and parsed_url.netloc:
            words[i] = "[URL Removed]"
    final_text = ' '.join(words)
    # Finally we replace any stray tabs and line breaks with spaces
    final_text = final_text.replace("\t", " ").replace("\n", " ").replace("\r", " ")
    return final_text
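For example, on a simple snippet it does exactly what I want (the sample HTML here is made up):

sample = '<html><body><script>var x = 1;</script><p>Visit https://example.com for details.</p></body></html>'
print(remove_tags(sample))  # Visit [URL Removed] for details.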
This works for everything BUT content-urls such as the following:
content-url(https://link.sonos.com/f/a/ZinnmUI5FVMlzaiMExZvPw~~/AAQRxQA~/RgRoUgEXP0R5aHR0cHM6Ly9icmF6ZS1pbWFnZXMuY29tL2FwcGJveS9jb21tdW5pY2F0aW9uL2Fzc2V0cy9pbWFnZV9hc3NldHMvaW1hZ2VzLzY1OWQ5MjA0MDBhOTVmMDA1OTYwN2EwMS9vcmlnaW5hbC5wbmc_MTcwNDgyNTM0N1cDc3BjQgpmbRd8b2anjldDUhNrZW50b2xhanJAZ21haWwuY29tWAQAAAPP)
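If I check one of these tokens by hand, urlparse finds neither a scheme nor a netloc, so my check above never fires (the token is shortened here for readability):

from urllib.parse import urlparse

# Shortened version of one of the offending tokens, just for illustration
word = "content-url(https://link.sonos.com/f/a/ZinnmUI5FVMlzaiMExZvPw~~)"
parsed = urlparse(word)
print(repr(parsed.scheme), repr(parsed.netloc))  # prints '' '', so neither is set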
These scripted URLs are everywhere, they are bloating my content, and I need to remove them.
I have tried various regex options, such as ^['content-url']+[)]$, but none of them work.
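A quick check confirms the pattern never matches (again with a shortened token, just for illustration):

import re

word = "content-url(https://link.sonos.com/f/a/ZinnmUI5FVMlzaiMExZvPw~~)"
# This prints None, i.e. the pattern never matches the token
print(re.search(r"^['content-url']+[)]$", word))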
Can somebody please provide some help?