I’m traying to webscrap some text from a website, the problem is its html formatting.
<div class="coptic-text html">
<div class="htmlvis"><t class="translation" title="The book of the genealogy of Jesus Christ, the son of David, the son of Abraham."><div class="verse" verse="1"><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϫⲱⲱⲙⲉ' target='_new'>ϫⲱⲱⲙⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲙ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡⲉ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϫⲡⲟ' target='_new'>ϫⲡⲟ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲛ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲓⲏⲥⲟⲩⲥ' target='_new'>ⲓⲏⲥⲟⲩⲥ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡⲉ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲭⲣⲓⲥⲧⲟⲥ' target='_new'>ⲭⲣⲓⲥⲧⲟⲥ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϣⲏⲣⲉ' target='_new'>ϣⲏⲣⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲛ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲇⲁⲩⲉⲓⲇ' target='_new'>ⲇⲁⲩⲉⲓⲇ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϣⲏⲣⲉ' target='_new'>ϣⲏⲣⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲛ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲁⲃⲣⲁϩⲁⲙ' target='_new'>ⲁⲃⲣⲁϩⲁⲙ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=.' target='_new'>.</a></span></span></div></t><!--
--></span></div></t></div>
My desired output:
1: ⲡϫⲱⲱⲙⲉ ⲙⲡⲉϫⲡⲟ ⲛⲓⲏⲥⲟⲩⲥ ⲡⲉⲭⲣⲓⲥⲧⲟⲥ ⲡϣⲏⲣⲉ ⲛⲇⲁⲩⲉⲓⲇ ⲡϣⲏⲣⲉ ⲛⲁⲃⲣⲁϩⲁⲙ.
My output:
ⲡϫⲱⲱⲙⲉⲙⲡⲉϫⲡⲟⲛⲓ ⲏⲥⲟⲩⲥⲡⲉⲭⲣⲓ ⲥⲧⲟⲥⲡϣⲏⲣⲉⲛⲇⲁⲩⲉⲓ ⲇⲡϣⲏⲣⲉⲛⲁⲃⲣⲁϩⲁⲙ.
My code so far:
#coding: utf-8
import requests
from bs4 import BeautifulSoup
import signal
import sys
import os.path
signal.signal(signal.SIGINT, lambda x, y: sys.exit(0))
if len(sys.argv) != 4:
print("Usage: %s <book name> <first chapter> <last chapter>" % os.path.basename(__file__))
quit()
book_name = sys.argv[1]
start = int(sys.argv[2])
stop = int(sys.argv[3])
while start <= stop:
out_file = open(f"./{book_name}_{str(start)}.txt", "a")
try:
response = requests.get(f'https://data.copticscriptorium.org/texts/new-testament/{book_name}_{str(start)}/sahidica')
soup = BeautifulSoup(response.text, "lxml")
content_list = soup.find_all("span", class_="norm")
text = []
print(f"[{str(start)}/{str(stop)}] https://data.copticscriptorium.org/texts/new-testament/{book_name}_{str(start)}/sahidica")
for element in content_list:
text.append(element.get_text())
text = ''.join(text).strip()
out_file.write("%sn" % text)
except:
print("Error")
start += 1
P.S. Language is old Coptic.