I have a lot of HTML pages that have somehow ended up with extra newline characters: each tag sits on its own line, and some sentences are split at apparently random points. Here is an example of what I am dealing with:
<html>
<head>
<title>One of many</title>
</head>
<body>
<h1>
Spam is not ham
</h1>
<p>
Many plates of Spam
</p>
<p>
Use the Fry option to properly cook the
Spam
until done.
</p>
<p>
Enquiries for more recipes can be made through the
Feed Me
option.
</p>
</body>
</html>
I used str.replace() with partial success on the opening tags, using this code:
html_filename = 'page.htm'

# Read every line first, because the same file is rewritten below.
with open(html_filename, encoding="utf-8") as f:
    file_str = f.readlines()

with open(html_filename, 'w', encoding="utf-8") as f:
    for line in file_str:
        if '<h1>\n' in line:
            # Drop the newline after the opening tag so the next line joins it.
            tmp = line.replace('<h1>\n', '<h1>')
            f.write(tmp)
        elif '<p>\n' in line:
            tmp = line.replace('<p>\n', '<p>')
            f.write(tmp)
        else:
            f.write(line)
and get the following result:
<html>
<head>
<title>One of many</title>
</head>
<body>
<h1>Spam is not ham
</h1>
<p>Many plates of Spam
</p>
<p>Use the Fry option to properly cook the
Spam
until done.
</p>
<p>Enquiries for more recipes can be made through the
Feed Me
option.
</p>
</body>
</html>
However, I can't figure out how to handle the lines that contain only text or only a closing tag.
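
For reference, here is a minimal sketch of one way the remaining lines could be joined. It assumes the whole file is read as a single string (f.read() rather than f.readlines()) and that only simple, non-nested elements such as h1 and p need fixing. The function name join_split_lines and its tags parameter are made up for illustration. A regular expression is not a real HTML parser, so this is only a rough approach for pages as regular as the example above.

```python
import re


def join_split_lines(html, tags=("h1", "p")):
    """Collapse line breaks inside the given elements onto one line.

    Hypothetical helper: assumes the listed elements are not nested
    inside themselves and that tags are written as in the sample page.
    """
    # Opening tag (with optional attributes), shortest possible body,
    # then the matching closing tag. The \b stops "p" from matching "<pre>".
    pattern = re.compile(
        r"<(?P<tag>" + "|".join(tags) + r")(?P<attrs>\b[^>]*)>"
        r"(?P<body>.*?)</(?P=tag)>",
        re.DOTALL | re.IGNORECASE,
    )

    def _collapse(match):
        # str.split() with no arguments splits on any run of whitespace,
        # so re-joining with a space removes newlines inside the element.
        body = " ".join(match.group("body").split())
        tag = match.group("tag")
        return f"<{tag}{match.group('attrs')}>{body}</{tag}>"

    return pattern.sub(_collapse, html)


sample = (
    "<p>\n"
    "Use the Fry option to properly cook the\n"
    "Spam\n"
    "until done.\n"
    "</p>\n"
)
print(join_split_lines(sample))
```

To apply it to a page, the file contents would be read with f.read(), passed through join_split_lines(), and written back. Elements outside the tags tuple, such as title or pre, are left untouched.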