As part of a larger project, I am writing a Python script that submits a word to the online Latin morphological analysis tool Collatinus and retrieves the full declension/conjugation of that word. Here is what I have so far:
from selenium.webdriver.edge.service import Service
from selenium.webdriver.common.by import By
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
import pandas
from bs4 import BeautifulSoup as bs
import sys
#x = input("Word:")
# Start Edge using a locally installed msedgedriver.
edge_service = Service(executable_path=r"C:\Program Files (x86)\msedgedriver.exe")
driver = webdriver.Edge(service=edge_service)
driver.get("https://outils.biblissima.fr/en/collatinus-web/")
time.sleep(15)
# Reload the page twice before using it.
driver.refresh()
time.sleep(15)
driver.refresh()
time.sleep(15)
search = driver.find_element(By.ID, "flexion_lemme")
search.send_keys('canis')
time.sleep(7)
submit = driver.find_elements(By.TAG_NAME, "button")
correct = submit[4]
correct.click()
time.sleep(15)
html = driver.page_source
if "Une erreur s'est produite" in html:
    # The site reported "An error occurred": reload and retry once.
    driver.refresh()
    time.sleep(20)
    search = driver.find_element(By.ID, "flexion_lemme")
    search.send_keys('canis')
    time.sleep(7)
    submit = driver.find_elements(By.TAG_NAME, "button")
    correct = submit[4]
    correct.click()
    time.sleep(20)
    html = driver.page_source
    if "Une erreur s'est produite" in html:
        driver.quit()
        raise RuntimeError("Collatinus returned an error twice.")
# Collect the candidate title elements inside the results container.
results = driver.find_element(By.ID, "results")
titlesh4 = results.find_elements(By.TAG_NAME, "h4")
titlesp = results.find_elements(By.TAG_NAME, "p")
# The leading "." keeps the XPath search inside `results` rather than the whole page.
titleshref = results.find_elements(By.XPATH, ".//*[ text() = 'Formes composées' ]")
html = driver.page_source
# Save the raw page, then parse it with BeautifulSoup.
with open("tables.html", "w", encoding="utf-8") as f:
    f.write(html)
with open("tables.html", "r", encoding="utf-8") as lh:
    soup = bs(lh, "html.parser")
prettyHTML = soup.prettify()
# Strip vowel-length marks. The macron+breve forms go first: they are assumed to be a
# macron vowel plus a combining breve, so replacing "ā" first would leave a stray breve.
for marked, plain in [("ā̆", "a"), ("ē̆", "e"), ("ī̆", "i"), ("ō̆", "o"), ("ū̆", "u"),
                      ("ā", "a"), ("ă", "a"), ("ē", "e"), ("ĕ", "e"), ("ī", "i"),
                      ("ĭ", "i"), ("ō", "o"), ("ŏ", "o"), ("ū", "u"), ("ŭ", "u")]:
    prettyHTML = prettyHTML.replace(marked, plain)
# Overwrite the file with the prettified, de-accented HTML.
with open("tables.html", "w", encoding="utf-8") as f:
    f.write(prettyHTML)
It’s still in the works.
My problem is with the titles of the tables that I extract from the page. I want each table I pull from the HTML to have a title that I can work with, but the titles the software produces come in different elements: h4, p, a, etc. Some titles also sit above others in a hierarchy; for example, the table title ‘subjonctif’ comes under another title such as ‘actif’. As far as I can tell, the only way to handle this is for the program to look at the title that precedes each table and decide from there. For the hierarchical titles, I would like the parent name to be included in each child title, so that ‘subjonctif’ becomes ‘actif subjonctif’. Finally, some titles cover two or three(?) tables, so I would like those to be labelled as, e.g., “subjonctif #1” and “subjonctif #2”. In my opinion, and please correct me if I am wrong, all of these problems could be solved if the program knew what came before each table.
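To show what I mean by "knowing what came before each table", here is a rough sketch of the idea on a made-up HTML fragment. I have not run it against the real Collatinus output. The ranking of the title tags (h4 above p above a), the sample markup, and the names `TableLabeller` and `label_tables` are all my own assumptions for illustration. It uses the standard library's `html.parser` only so that it runs on its own; the same walk could presumably be done with BeautifulSoup.

```python
from html.parser import HTMLParser

# Assumed hierarchy of title tags: a lower rank means a higher-level title.
# This ordering is a guess and would need adjusting to the real markup.
TITLE_RANK = {"h4": 0, "p": 1, "a": 2}


class TableLabeller(HTMLParser):
    """Walk the HTML in document order and label each table by the titles before it."""

    def __init__(self):
        super().__init__()
        self.titles = {}      # rank -> most recent title text at that rank
        self.open_tag = None  # title tag currently being read, if any
        self.buffer = []      # text pieces of the title being read
        self.table_depth = 0  # > 0 while inside a table (cell contents are not titles)
        self.counts = {}      # full title path -> number of tables seen under it
        self.labels = []      # (path, index) for each top-level table, in order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.table_depth == 0:
                # Join the current titles from the highest level down.
                path = " ".join(self.titles[r] for r in sorted(self.titles))
                self.counts[path] = self.counts.get(path, 0) + 1
                self.labels.append((path, self.counts[path]))
            self.table_depth += 1
        elif tag in TITLE_RANK and self.table_depth == 0 and self.open_tag is None:
            self.open_tag, self.buffer = tag, []

    def handle_data(self, data):
        if self.open_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1
        elif tag == self.open_tag:
            text = " ".join("".join(self.buffer).split())
            rank = TITLE_RANK[tag]
            if text:
                # A new title replaces its own level and discards deeper levels.
                self.titles = {r: t for r, t in self.titles.items() if r < rank}
                self.titles[rank] = text
            self.open_tag = None


def label_tables(html):
    """Return one label per top-level table, numbering tables that share a title."""
    parser = TableLabeller()
    parser.feed(html)
    parser.close()
    return [f"{path} #{n}" if parser.counts[path] > 1 else path
            for path, n in parser.labels]


# Invented sample; the real page structure may differ.
sample = """
<div id="results">
  <h4>actif</h4>
  <p>indicatif</p>
  <table><tr><td>amo</td></tr></table>
  <p>subjonctif</p>
  <table><tr><td>amem</td></tr></table>
  <table><tr><td>amarem</td></tr></table>
  <h4>passif</h4>
  <p>indicatif</p>
  <table><tr><td>amor</td></tr></table>
</div>
"""

print(label_tables(sample))
# ['actif indicatif', 'actif subjonctif #1', 'actif subjonctif #2', 'passif indicatif']
```

The point is only that a single pass in document order can remember the last title seen at each level, so every table can be named from what precedes it. Whether the real titles follow such a clean hierarchy is exactly what I am unsure about.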
I have not really tried anything on the real page, as I’m not sure where to start. If anybody could help, that would be much appreciated.