Web scraping with selenium fails to paginate

I am trying to scrape https://mst.dk/publikationer. The page is paginated, and judging from the source, the pagination is handled by the section I've added below.

<div class="Container_Container__G5vVd Container_Container___width_std__y2_Pn">
    <div class="Pagination_Pagination_wrapper__kp62j">
        <ul class="Pagination_Pagination__UOZ60" role="navigation" aria-label="Pagination">
            <li class="Pagination_Pagination_prev__zIUqn Pagination_Pagination_item___disabled__g5CaR">
                <a class="Pagination_Pagination_link__Z2LW0 Pagination_Pagination_prevLink__HDKS4" tabindex="-1" role="button" aria-disabled="true" aria-label="Previous page" rel="prev"></a>
            </li>
            <li class="Pagination_Pagination_item__suqyV selected">
                <a rel="canonical" role="button" class="Pagination_Pagination_link__Z2LW0 Pagination_Pagination_link___active__to_Os" tabindex="-1" aria-label="Side 1" aria-current="page">1</a>
            </li>
            <li class="Pagination_Pagination_item__suqyV">
                <a role="button" class="Pagination_Pagination_link__Z2LW0" tabindex="0" aria-label="Side 2" rel="next">2</a>
            </li>
            <li class="Pagination_Pagination_break__dKVzB">
                <a class="Pagination_Pagination_breakLink__jB8Rd" role="button" tabindex="0">...</a>
            </li>
            <li class="Pagination_Pagination_item__suqyV">
                <a role="button" class="Pagination_Pagination_link__Z2LW0" tabindex="0" aria-label="Side 321">321</a>
            </li>
            <li class="Pagination_Pagination_next__N6tkt">
                <a class="Pagination_Pagination_link__Z2LW0 Pagination_Pagination_nextLink__mytrA" tabindex="0" role="button" aria-disabled="false" aria-label="Next page" rel="next"></a>
            </li>
        </ul>
    </div>
</div>

I've tried several approaches: adding page=x to the URL, using different Selenium locators and selectors, increasing the wait time, clicking the next button, and imitating a click on the list items. Nothing seems to be working for me. Can anybody please help me figure out the dynamics of this page and how to paginate through it?
What I am trying to do is open each link on each page, find the PDF, and download it. This works fine for the first page with the code below:

def parse_epa_filtered_keywords():
    # Get number of search results
    page_no = int(int(get_number_of_results(link_filtered)) / 10) + 1
    driver = webdriver.Chrome(options=options)
    search_query = '+'.join(keywords.split())
    
    for i in tqdm(range(1, page_no + 1)):
        try:
            search_url = f"{link_filtered}?search={search_query}&page={i}"
            print(f"Fetching URL: {search_url}")
            
            # Load the search URL
            driver.get(search_url)
            
            # Wait for the page to load completely
            time.sleep(5)  # Adjust the sleep time as needed
            
            # Wait for the main page to load again
            publications = driver.find_elements(By.CSS_SELECTOR, 'a[class^="Link_Link__lzynb SearchResultItem_SearchResult"]')
            ....
    driver.quit()

This is the attempt using the page parameter in the URL; it keeps opening the first page over and over.
Then I tried locating the next button in the following ways:

next_button = driver.find_element(By.XPATH, "//li[contains(@class, 'Pagination_Pagination_next')]/a[@rel='next']")

next_button = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "li.Pagination_Pagination_next_N6tkt a")))

and many more attempts with different elements, which led either to a general ChromeDriver error or to something like:

An error occurred: Message: element click intercepted: Element is not clickable at point (732, 2911)
  (Session info: chrome=128.0.6613.114)
Stacktrace:
0   chromedriver                        0x0000000104f83998 cxxbridge1$str$ptr + 1887096
1   chromedriver                        0x0000000104f7be00 cxxbridge1$str$ptr + 1855456
2   chromedriver                        0x0000000104b80be0 cxxbridge1$string$len + 89508
3   chromedriver                        0x0000000104bca6fc cxxbridge1$string$len + 391360
4   chromedriver                        0x0000000104bc8d28 cxxbridge1$string$len + 384748
5   chromedriver

Here is another solution using the API:

import requests

def get_all_results():
    headers = {
        'hostname': 'http://mst.local:3001'
    }

    payload = {
        'key': 'a2369450-5ec7-494c-b910-d72074a73af9',
        'documentTypes': ['articlePage'],
        'subjects': [],
        'categories': [{'name': 'Publikation'}],
        'takeAmount': 100,
        'skipAmount': 0,
        'direction': 'descending',
        'UserTextInputField': ''
    }

    url = 'https://search.mst.dk/api/News/Search'
    
    results = []
    while True:
        response = requests.post(url, headers=headers, json=payload)
        data = response.json()
        results.extend(data['searchResults'])
        payload['skipAmount'] += payload['takeAmount']

        if len(results) >= data['pagination']['totalResults']:
            break

    return results


results = get_all_results()

print(f'{len(results) = }')

*This takes about 30-40 seconds to fetch all 3201 results; you can send the requests asynchronously to make it faster.
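One way to speed this up is sketched below. It is a minimal sketch, not a tested client for this API: it uses a thread pool instead of asyncio so it can stay with requests, and `fetch_page`, `page_offsets`, `fetch_all_concurrently` and the `max_workers` default are all hypothetical names and values introduced here. It assumes the total number of results is already known from a first request (`data['pagination']['totalResults']` above).

```python
from concurrent.futures import ThreadPoolExecutor


def page_offsets(total_results, take_amount):
    """Return the skipAmount value for every page needed to cover total_results."""
    return list(range(0, total_results, take_amount))


def fetch_all_concurrently(fetch_page, total_results, take_amount=100, max_workers=8):
    """Fetch all pages in parallel and return the results in page order.

    fetch_page is a hypothetical callable: it takes a skipAmount and returns
    the list of search results for that page.
    """
    offsets = page_offsets(total_results, take_amount)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = pool.map(fetch_page, offsets)  # map() preserves the input order
        return [item for page in pages for item in page]
```

Here `fetch_page(skip)` could wrap the same `requests.post(url, headers=headers, json=...)` call as above, with a copy of the payload whose `skipAmount` is set to `skip`, returning `response.json()['searchResults']`. Keeping `max_workers` small avoids hammering the server.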

next_button = driver.find_element(By.XPATH, "//li[contains(@class, 'Pagination_Pagination_next')]/a[@rel='next']")

Although the XPath expression in your code above is correct, for some reason the click does not reach the element. I used ActionChains as below, and it successfully clicked the next button.

next_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@aria-label='Next page']")))
actions = ActionChains(driver)
actions.move_to_element(next_button).click().perform()

Here is full working code that scrapes the pages in a loop.

Note: I am scraping only the first 3 pages and only the search result headings; you can scrape whatever you want:

from selenium.webdriver import ActionChains
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def click_next_page():
    next_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@aria-label='Next page']")))
    actions = ActionChains(driver)
    actions.move_to_element(next_button).click().perform()

def extract_headings(wait):
    headings = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//li//h3")))
    search_results_headings = ""
    for heading in headings:
        search_results_headings += "\n" + heading.text
    return search_results_headings

driver = webdriver.Chrome()
driver.get("https://mst.dk/publikationer")
driver.maximize_window()
wait = WebDriverWait(driver, 10)

# Use below line of code only if you see accept/reject cookies pop-up
accept_all = wait.until(EC.element_to_be_clickable((By.ID, "CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll")))
driver.execute_script("arguments[0].click();", accept_all)

search_results_headings = ""
# Below for loop iterates 3 times, so 3 pages will be scraped, if you want more pages change the range accordingly
for _ in range(3):
    search_results_headings += extract_headings(wait)
    click_next_page()

print(search_results_headings)

Console output:

Diffus forurening med PFAS i jord, grundvand og overfladevand
Digitale værktøjer til klimatilpasning
Performancebenchmarking
Oprensning af PFAS-forurening i jord, slam og vand - Test af teknologier i praksis
Lokalt funderede analyse – afrapportering
Maritime Emissionsløsninger i Kystnære Farvande
Biokinetisk lattergasreduktion i renseanlæg
Inter DAN NRW
Gennemførelse og anvendelse af slamdirektivet 2023
CombiControl - Combining above- and belowground biological control agents for improved pest control in strawberry tunnel production
Affaldsstatistik 2022
Scientific investigation of ballast water discharge - Random checks on ships in autumn – winter 2022
Control of Biocides 2023
Ny kosteffektiv teknologi til måling af klimagasudledninger fra renseanlæg
Recycling potential of separately collected post-consumer textile waste
Modelling and mapping pesticide exposure risk at the catchment scale (MOMAPEST)
Indberetning af status for anvendelse af almene vandforsyningsboringer i Virk.dk
PFAS i jord - International screening af andre landes praksis for håndtering af jord med PFAS
Anbefalinger til screening og kortlægning af bygge- og anlægsaffald
Emissions of Quaternary Alkylammonium Compounds
Nikotinposer – indhold og miljøkonsekvenser
Udredningsprojekt vedr. analysemetoder til undersøgelse for PFAS-forbindelser i jord, grundvand og overfladevand
Rensningsmuligheder for pesticider med fokus på aktivt kul og membraner
Renholds- og omkostningsanalyse jf. Engangsplastdirektivets oprydningsansvar
Kemiske stoffer i en cirkulær økonomi - Et MUDP projekt
Pesticider og biocider i den danske pindsvinebestand
Kortlægning af madaffald i primærproduktionen samt forarbejdnings- og fremstillingssektoren for 2022
Kortlægning af madaffald og madspild i restaurationsbranchen og restaurationstjenester for 2022
Inhibition of lung surfactant function as an alternative method to predict lung toxicity following exposure to plant protection products
Survey and risk assessment of pesticides in cut flowers from non-EU countries

Process finished with exit code 0
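To go beyond a fixed number of pages, one option is to keep clicking until the next link is disabled. The sketch below assumes the site marks the next link with aria-disabled="true" on the last page, the same way the previous link is marked on page 1 in the question's HTML; that has not been verified here. `next_is_disabled`, `scrape_until_last_page` and their parameters are hypothetical names introduced for illustration.

```python
def next_is_disabled(next_link):
    """Return True if the element reports aria-disabled="true".

    next_link is any object with Selenium's WebElement.get_attribute method.
    """
    return next_link.get_attribute("aria-disabled") == "true"


def scrape_until_last_page(extract_page, find_next_link, click, max_pages=1000):
    """Collect one extract_page() result per page until the next link is disabled.

    extract_page, find_next_link and click are hypothetical callables supplied
    by the caller; max_pages is a safety cap against an endless loop.
    """
    collected = []
    for _ in range(max_pages):
        collected.append(extract_page())
        next_link = find_next_link()
        if next_is_disabled(next_link):
            break  # assumed to mean we are on the last page
        click(next_link)
    return collected
```

With the code above, `extract_page` could be `lambda: extract_headings(wait)`, `find_next_link` could locate `//a[@aria-label='Next page']` with `EC.presence_of_element_located`, and `click` could perform the same ActionChains click as `click_next_page`.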


Tags: selenium-webdriver, pagination
