I am writing a script that opens the following page in Chrome:
https://isfinder.biotoul.fr/ where it opens the “TOOLS” dropdown menu and selects “Blast”. The script then uploads a file in FASTA format, changes the E-value under the “Algorithm parameters” heading to 0.01, and runs BLAST. Once the results page has fully loaded, it saves that page as blast_results_TA373.html in a specific folder. This part works.
From the results page I can also extract the links in the table under the heading “Sequences producing significant alignments” for each query node, for example the link for ISEc1 and many others. What I cannot do is fetch the contents of those links.
I want to save each linked page to a file named, for example, jobtitle_query_node_ISEc5.html in a specific folder, and this is the step that fails. Please help me achieve this and tell me where I made a mistake.
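To make the naming I am after concrete, here is a minimal sketch of how each output path could be built. `result_path` is a hypothetical helper of my own, not part of Selenium, and the underscore-separated pattern simply follows the example name above; the sample values are taken from my run.

```python
import os
import re

def result_path(directory, job_title, query_id, is_name):
    """Build <directory>/<job_title>_<query_id>_<is_name>.html (hypothetical helper)."""
    parts = [job_title, query_id, is_name]
    # Replace anything other than letters, digits, dot, dash or underscore,
    # so each piece is safe to use inside a file name.
    safe = [re.sub(r"[^A-Za-z0-9._-]", "_", part) for part in parts]
    return os.path.join(directory, "_".join(safe) + ".html")

print(result_path("TA373.IS_finder_results", "TA373", "NODE_195", "ISEc5"))
```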
I wrote code to extract the query identifiers and the links, which gave output like this:
Query Identifier: 123
Query Identifier: 123
Query Identifier: 123
Query Identifier: 2699
Query Identifier: 2699
Query Identifier: 2699
Link: https://www-is.biotoul.fr/under_construct.php
Link: https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISEc5
Link: https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISEc1
Link: https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISVsa13
Link: https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISEc1
Link: https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISEc5
Link: https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISSe1
Link: https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISCysp7
Link: https://www-is.biotoul.fr/scripts/ficheIS.php?name=IS1H
Link: https://www-is.biotoul.fr/scripts/ficheIS.ph
I also tried to filter out the unwanted links but could not do so completely. The links I am interested in look like this:
Link: https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISEc5
Link: https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISEc1
Link: https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISVsa13
Link: https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISEc1
Link: https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISEc5
Link: https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISSe1
Link: https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISCysp7
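In other words, I only want links that point to scripts/ficheIS.php and carry a name parameter. Below is a minimal sketch of that filter on plain URL strings, assuming the hrefs have already been read from the page. `is_name_from_link` and `links_of_interest` are hypothetical helper names, and only the standard library is used.

```python
from urllib.parse import urlparse, parse_qs

def is_name_from_link(href):
    """Return the IS name if href is a ficheIS.php?name=... link, else None."""
    parsed = urlparse(href)
    if not parsed.path.endswith("/scripts/ficheIS.php"):
        return None
    names = parse_qs(parsed.query).get("name")
    return names[0] if names else None

def links_of_interest(hrefs):
    """Keep only the IS record links, in their original order."""
    return [href for href in hrefs if is_name_from_link(href) is not None]

sample = [
    "https://www-is.biotoul.fr/under_construct.php",
    "https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISEc5",
    "https://www-is.biotoul.fr/scripts/ficheIS.php?name=ISEc1",
]
for link in links_of_interest(sample):
    print(is_name_from_link(link), link)
```

Duplicates are kept on purpose, because the same IS name can be a hit for more than one query node.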
Here is my code:
# Necessary webdrivers need to be imported
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import StaleElementReferenceException
import os
import time
# This is for Chrome. Similarly if
# Firefox is needed, then it has to be specified
webBrowser = webdriver.Chrome()
# This will open Is finder site in chrome
webBrowser.get('https://www-is.biotoul.fr/index.php')
# Find and click on the 'TOOLS' link
tools_link = WebDriverWait(webBrowser, 60).until(EC.element_to_be_clickable((By.LINK_TEXT, 'TOOLS')))
tools_link.click()
# Find and click on the 'Blast' link
blast_link = WebDriverWait(webBrowser, 60).until(EC.element_to_be_clickable((By.LINK_TEXT, 'Blast')))
blast_link.click()
# Locate and interact with form elements
file_input = WebDriverWait(webBrowser, 60).until(EC.element_to_be_clickable((By.XPATH, "//input[@type='file']")))
file_input.send_keys("/Users/somil/Desktop/gene_bank.file/TA373.fasta")
# Enter job title
job_title_input = WebDriverWait(webBrowser, 60).until(EC.element_to_be_clickable((By.XPATH, "//input")))
job_title_input.send_keys("TA373")
# Get the job title entered
job_title = job_title_input.get_attribute('value')
# Locate the Evalue input field within the Algorithm parameters fieldset
evalue_input = WebDriverWait(webBrowser, 60).until(EC.element_to_be_clickable((By.XPATH, "//input[@name='expect']")))
# Clear the current value (if any) and input your custom value
evalue_input.clear()
#set input values
evalue_input.send_keys("0.01")
#submit the task
submit_button = WebDriverWait(webBrowser, 60).until(EC.element_to_be_clickable(( By.CLASS_NAME,"boutonblast")))
submit_button.click()
# Wait until the page is fully loaded
WebDriverWait(webBrowser, 60).until(EC.presence_of_all_elements_located((By.XPATH, "//*")))
time.sleep(100) ##############ADJUST_TIME_ACCORDING_TO_NEED#######################################################
# Create a directory based on job title
directory_name = f'{job_title}.IS_finder_results'
os.makedirs(directory_name, exist_ok=True)
# Once the content is fully loaded, download the page
page_content = webBrowser.page_source
# Save the page content to a file within the directory
file_name = f'blast_results_{job_title}.html'
file_path = os.path.join(directory_name, file_name)
with open(file_path, 'w', encoding='utf-8') as f:
    f.write(page_content)
###################################################PART_1_COMPLETE################################################################################
###############PART_2###############################
# Find all elements containing "Query" sections
query_sections = webBrowser.find_elements(By.XPATH, '//b[starts-with(text(), "Query=")]')
# Iterate over each "Query" section
for query_section in query_sections:
    # Extract the parent element's text
    parent_text = query_section.find_element(By.XPATH, './..').text
    # Extract the query identifier
    try:
        query_identifier = parent_text.split('=')[1].split()[0]
        print("Query Identifier:", query_identifier)
    except IndexError:
        print("Error: Unable to extract query identifier from:", parent_text)
# Find the first <a> element on the page
continue_link = webBrowser.find_element(By.TAG_NAME, 'a')
# Find all elements with the href attribute
elements_with_href = webBrowser.find_elements(By.XPATH, "//*[@href]")
# Iterate over all elements with the href attribute
for elem in elements_with_href:
    try:
        # Get the href attribute value
        href = elem.get_attribute("href")
        # Exclude links leading to NCBI and links starting with # and containing #BL_ORD_ID
        if "ncbi" not in href and not href.startswith("#") and "#BL_ORD_ID" not in href:
            print("Link:", href)
    except StaleElementReferenceException:
        print("Element is stale. Refinding...")
        # Refind the element
        elem = webBrowser.find_element(By.XPATH, f'//a[@href="{href}"]')
        # Get the href attribute value again
        href = elem.get_attribute("href")
Then I modified this code to perform the required task:
# Necessary webdrivers need to be imported
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
import os
import time
# This is for Chrome. Similarly if
# Firefox is needed, then it has to be specified
webBrowser = webdriver.Chrome()
# This will open Is finder site in chrome
webBrowser.get('https://www-is.biotoul.fr/index.php')
# Find and click on the 'TOOLS' link
tools_link = WebDriverWait(webBrowser, 60).until(EC.element_to_be_clickable((By.LINK_TEXT, 'TOOLS')))
tools_link.click()
# Find and click on the 'Blast' link
blast_link = WebDriverWait(webBrowser, 60).until(EC.element_to_be_clickable((By.LINK_TEXT, 'Blast')))
blast_link.click()
# Locate and interact with form elements
file_input = WebDriverWait(webBrowser, 60).until(EC.element_to_be_clickable((By.XPATH, "//input[@type='file']")))
file_input.send_keys("/Users/somil/Desktop/gene_bank.file/TA373.fasta")
# Enter job title
job_title_input = WebDriverWait(webBrowser, 60).until(EC.element_to_be_clickable((By.XPATH, "//input")))
job_title_input.send_keys("TA373")
# Get the job title entered
job_title = job_title_input.get_attribute('value')
# Locate the Evalue input field within the Algorithm parameters fieldset
evalue_input = WebDriverWait(webBrowser, 60).until(EC.element_to_be_clickable((By.XPATH, "//input[@name='expect']")))
# Clear the current value (if any) and input your custom value
evalue_input.clear()
#set input values
evalue_input.send_keys("0.01")
#submit the task
submit_button = WebDriverWait(webBrowser, 60).until(EC.element_to_be_clickable(( By.CLASS_NAME,"boutonblast")))
submit_button.click()
# Wait until the page is fully loaded
WebDriverWait(webBrowser, 60).until(EC.presence_of_all_elements_located((By.XPATH, "//*")))
time.sleep(100) ##############ADJUST_TIME_ACCORDING_TO_NEED#######################################################
# Create a directory based on job title
directory_name = f'{job_title}.IS_finder_results'
os.makedirs(directory_name, exist_ok=True)
# Once the content is fully loaded, download the page
page_content = webBrowser.page_source
# Save the page content to a file within the directory
file_name = f'blast_results_{job_title}.html'
file_path = os.path.join(directory_name, file_name)
with open(file_path, 'w', encoding='utf-8') as f:
    f.write(page_content)
###################################################PART_1_COMPLETE################################################################################
###############PART_2###############################
# Find all elements containing "Query" sections
query_sections = webBrowser.find_elements(By.XPATH, '//b[starts-with(text(), "Query=")]')
# Iterate over each "Query" section
for query_section in query_sections:
    # Extract the parent element's text
    parent_text = query_section.find_element(By.XPATH, './..').text
    # Extract the query identifier
    try:
        query_identifier = parent_text.split('=')[1].split()[0]
        print("Query Identifier:", query_identifier)
    except IndexError:
        print("Error: Unable to extract query identifier from:", parent_text)
    # Find all elements with the href attribute again (to avoid StaleElementReferenceException)
    elements_with_href = webBrowser.find_elements(By.XPATH, "//*[@href]")
    # Iterate over all elements with the href attribute
    for elem in elements_with_href:
        try:
            # Get the href attribute value
            href = elem.get_attribute("href")
            # Exclude unwanted links
            exclude_links = [
                "biotoul.fr/styles/", "biotoul.fr/blast/", "biotoul.fr/index.php"
            ]
            if any(link in href for link in exclude_links):
                continue
            # Exclude links leading to NCBI and links starting with # and containing #BL_ORD_ID
            if "ncbi" not in href and not href.startswith("#") and "#BL_ORD_ID" not in href:
                print("Link:", href)
                # Get the content of the link
                webBrowser.get(href)
                link_content = webBrowser.page_source
                # Extract the part after '?name=' from the href
                identifier = href.split('=')[1] if '=' in href else 'NoIdentifier'
                # Save the content to a file within the specific directory
                file_name = f'{job_title}.{identifier}.{query_identifier}.html'
                file_path = os.path.join(directory_name, file_name)
                with open(file_path, 'w', encoding='utf-8') as f:
                    f.write(link_content)
                print(f"Content saved for Link: {href}")
        except StaleElementReferenceException:
            print("Element is stale. Refinding...")
            # Refind the element
            elem = webBrowser.find_element(By.XPATH, f'//a[@href="{href}"]')
            # Get the href attribute value again
            href = elem.get_attribute("href")
        except NoSuchElementException:
            print(f"No such element found for href: {href}. Skipping...")
            continue
I get the following error:
Query Identifier: NODE_195
Link: https://www-is.biotoul.fr/general_information.php
Content saved for Link: https://www-is.biotoul.fr/general_information.php
Element is stale. Refinding...
Traceback (most recent call last):
File "/Users/somil/Downloads/navigate_4.py", line 95, in <module>
href = elem.get_attribute("href")
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/somil/miniconda3/lib/python3.11/site-packages/selenium/webdriver/remote/webelement.py", line 178, in get_attribute
attribute_value = self.parent.execute_script(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/somil/miniconda3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 407, in execute_script
return self.execute(command, {"script": script, "args": converted_args})["value"]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/somil/miniconda3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 347, in execute
self.error_handler.check_response(response)
File "/Users/somil/miniconda3/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 229, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found
(Session info: chrome=124.0.6367.158); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#stale-element-reference-exception
Stacktrace:
0 chromedriver 0x0000000104e63ae8 chromedriver + 5217000
1 chromedriver 0x0000000104e5b723 chromedriver + 5183267
2 chromedriver 0x00000001049cd527 chromedriver + 406823
3 chromedriver 0x00000001049dd814 chromedriver + 473108
4 chromedriver 0x00000001049de10a chromedriver + 475402
5 chromedriver 0x00000001049d3595 chromedriver + 431509
6 chromedriver 0x00000001049de13b chromedriver + 475451
7 chromedriver 0x00000001049d3595 chromedriver + 431509
8 chromedriver 0x00000001049d189e chromedriver + 424094
9 chromedriver 0x00000001049d4bfa chromedriver + 437242
10 chromedriver 0x0000000104a5b6a4 chromedriver + 988836
11 chromedriver 0x0000000104a3b702 chromedriver + 857858
12 chromedriver 0x0000000104a5a6bf chromedriver + 984767
13 chromedriver 0x0000000104a3b4a3 chromedriver + 857251
14 chromedriver 0x0000000104a0bfe3 chromedriver + 663523
15 chromedriver 0x0000000104a0c92e chromedriver + 665902
16 chromedriver 0x0000000104e21a00 chromedriver + 4946432
17 chromedriver 0x0000000104e27ab4 chromedriver + 4971188
18 chromedriver 0x0000000104e024fe chromedriver + 4818174
19 chromedriver 0x0000000104e285c9 chromedriver + 4974025
20 chromedriver 0x0000000104df2784 chromedriver + 4753284
21 chromedriver 0x0000000104e4ac78 chromedriver + 5115000
22 chromedriver 0x0000000104e4ae37 chromedriver + 5115447
23 chromedriver 0x0000000104e5b343 chromedriver + 5182275
24 libsystem_pthread.dylib 0x00007ff80f5051d3 _pthread_start + 125
25 libsystem_pthread.dylib 0x00007ff80f500bd3 thread_start + 15
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/somil/Downloads/navigate_4.py", line 125, in <module>
elem = webBrowser.find_element(By.XPATH, f'//a[@href="{href}"]')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/somil/miniconda3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 741, in find_element
return self.execute(Command.FIND_ELEMENT, {"using": by, "value": value})["value"]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/somil/miniconda3/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 347, in execute
self.error_handler.check_response(response)
File "/Users/somil/miniconda3/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 229, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//a[@href="https://www-is.biotoul.fr/general_information.php"]"}
(Session info: chrome=124.0.6367.158); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
0 chromedriver 0x0000000104e63ae8 chromedriver + 5217000
1 chromedriver 0x0000000104e5b723 chromedriver + 5183267
2 chromedriver 0x00000001049cd527 chromedriver + 406823
3 chromedriver 0x0000000104a18ff2 chromedriver + 716786
4 chromedriver 0x0000000104a19181 chromedriver + 717185
5 chromedriver 0x0000000104a5d1d4 chromedriver + 995796
6 chromedriver 0x0000000104a3b72d chromedriver + 857901
7 chromedriver 0x0000000104a5a6bf chromedriver + 984767
8 chromedriver 0x0000000104a3b4a3 chromedriver + 857251
9 chromedriver 0x0000000104a0bfe3 chromedriver + 663523
10 chromedriver 0x0000000104a0c92e chromedriver + 665902
11 chromedriver 0x0000000104e21a00 chromedriver + 4946432
12 chromedriver 0x0000000104e27ab4 chromedriver + 4971188
13 chromedriver 0x0000000104e024fe chromedriver + 4818174
14 chromedriver 0x0000000104e285c9 chromedriver + 4974025
15 chromedriver 0x0000000104df2784 chromedriver + 4753284
16 chromedriver 0x0000000104e4ac78 chromedriver + 5115000
17 chromedriver 0x0000000104e4ae37 chromedriver + 5115447
18 chromedriver 0x0000000104e5b343 chromedriver + 5182275
19 libsystem_pthread.dylib 0x00007ff80f5051d3 _pthread_start + 125
20 libsystem_pthread.dylib 0x00007ff80f500bd3 thread_start + 15
Please help: I do not understand this error, and I have not been able to complete this task.
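For reference, a StaleElementReferenceException generally means a WebElement was used after the page it came from was replaced, which happens here as soon as webBrowser.get(href) leaves the results page. Below is a minimal sketch of one way around that, not tested against ISfinder: copy every href into a plain list of strings first, then navigate. `collect_hrefs` and `save_pages` are hypothetical helper names, `driver` stands for the Selenium webBrowser object, and the page_<n>.html file names are placeholders.

```python
import os

def collect_hrefs(driver):
    """Read every href into plain strings while the results page is still loaded."""
    # "xpath" is the string value behind selenium's By.XPATH constant.
    elements = driver.find_elements("xpath", "//a[@href]")
    return [element.get_attribute("href") for element in elements]

def save_pages(driver, hrefs, directory):
    """Visit each URL and write its page source; returns the paths written."""
    os.makedirs(directory, exist_ok=True)
    written = []
    for index, href in enumerate(hrefs):
        # Navigating is safe here: only strings are reused, never WebElements.
        driver.get(href)
        path = os.path.join(directory, f"page_{index}.html")
        with open(path, "w", encoding="utf-8") as f:
            f.write(driver.page_source)
        written.append(path)
    return written
```

With the script above, this would be called as `save_pages(webBrowser, collect_hrefs(webBrowser), directory_name)` once the results page has loaded, with any link filtering applied to the list of strings in between.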