Question

I’m trying to web-scrape some text from a website; the problem is its HTML formatting.

        <div class="coptic-text html">
            <div class="htmlvis"><t class="translation" title="The book of the genealogy of Jesus Christ, the son of David, the son of Abraham."><div class="verse" verse="1"><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϫⲱⲱⲙⲉ' target='_new'>ϫⲱⲱⲙⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲙ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡⲉ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϫⲡⲟ' target='_new'>ϫⲡⲟ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲛ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲓⲏⲥⲟⲩⲥ' target='_new'>ⲓⲏⲥⲟⲩⲥ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡⲉ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲭⲣⲓⲥⲧⲟⲥ' target='_new'>ⲭⲣⲓⲥⲧⲟⲥ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϣⲏⲣⲉ' target='_new'>ϣⲏⲣⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲛ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲇⲁⲩⲉⲓⲇ' target='_new'>ⲇⲁⲩⲉⲓⲇ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϣⲏⲣⲉ' target='_new'>ϣⲏⲣⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲛ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲁⲃⲣⲁϩⲁⲙ' target='_new'>ⲁⲃⲣⲁϩⲁⲙ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=.' target='_new'>.</a></span></span></div></t><!--
--></span></div></t></div>

My desired output:

1: ⲡϫⲱⲱⲙⲉ ⲙⲡⲉϫⲡⲟ ⲛⲓⲏⲥⲟⲩⲥ ⲡⲉⲭⲣⲓⲥⲧⲟⲥ ⲡϣⲏⲣⲉ ⲛⲇⲁⲩⲉⲓⲇ ⲡϣⲏⲣⲉ ⲛⲁⲃⲣⲁϩⲁⲙ.

My output:

ⲡϫⲱⲱⲙⲉⲙⲡⲉϫⲡⲟⲛⲓ ⲏⲥⲟⲩⲥⲡⲉⲭⲣⲓ ⲥⲧⲟⲥⲡϣⲏⲣⲉⲛⲇⲁⲩⲉⲓ ⲇⲡϣⲏⲣⲉⲛⲁⲃⲣⲁϩⲁⲙ.

My code so far:

#coding: utf-8

import requests
from bs4 import BeautifulSoup
import signal
import sys
import os.path

signal.signal(signal.SIGINT, lambda x, y: sys.exit(0))

if len(sys.argv) != 4:
    print("Usage: %s <book name> <first chapter> <last chapter>" % os.path.basename(__file__))
    quit()

book_name = sys.argv[1]
start = int(sys.argv[2])
stop = int(sys.argv[3])

while start <= stop:
    url = f"https://data.copticscriptorium.org/texts/new-testament/{book_name}_{start}/sahidica"
    print(f"[{start}/{stop}] {url}")

    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "lxml")
        content_list = soup.find_all("span", class_="norm")

        text = []
        for element in content_list:
            text.append(element.get_text())

        # Every "norm" span is joined with no separator, so word and verse
        # boundaries are lost.
        text = ''.join(text).strip()

        # Open the file only once there is text to write, and close it afterwards.
        with open(f"./{book_name}_{start}.txt", "a", encoding="utf-8") as out_file:
            out_file.write("%s\n" % text)

    except Exception as exc:
        print(f"Error: {exc}")
    start += 1

P.S. Language is old Coptic.
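
One possible way to get the desired layout, sketched under the assumption that the whole page follows the snippet above (each `div.verse` carries a `verse` attribute and contains `span.word` elements, which in turn contain the `span.norm` pieces). The helper names `verses_from_html` and `is_punctuation` are made up for this illustration, and `html.parser` is used only to keep the sketch free of the lxml dependency. Pieces inside one word are concatenated, words are joined with spaces, and a punctuation-only word is attached to the preceding word:

```python
import unicodedata

from bs4 import BeautifulSoup


def is_punctuation(token):
    """True if the token is non-empty and made only of punctuation characters."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def verses_from_html(html):
    """Return one 'N: word word ...' line per <div class="verse"> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for verse in soup.find_all("div", class_="verse"):
        words = []
        for word in verse.find_all("span", class_="word"):
            # The "norm" spans inside one word are glued together without spaces.
            norms = word.find_all("span", class_="norm")
            token = "".join(n.get_text(strip=True) for n in norms)
            if is_punctuation(token) and words:
                # Attach a trailing "." (or similar) to the previous word.
                words[-1] += token
            elif token:
                words.append(token)
        # The verse number comes from the "verse" attribute of the div.
        lines.append(f"{verse.get('verse')}: {' '.join(words)}")
    return lines
```

In the script above, the lines returned by `verses_from_html(response.text)` could be joined with `"\n".join(...)` and written to the output file in place of the single concatenated string.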

BeautifulSoup output not properly formatted