Purely for study purposes, I am trying to scrape this site: https://it.indeed.com/.
In the code below I try to extract the titles of the job ads. The scraping itself works, but every run gives different results. As you can see, I follow the pagination links to get the rest of the results. Sometimes it scans 3 pages, sometimes 1, sometimes none.
This is the script:
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
$httpClient = new Client();
$headers = [
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Connection' => 'keep-alive',
        'Accept-Language' => 'en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6',
    ],
];
$rootUrl = 'https://it.indeed.com/';
$url = $rootUrl . 'jobs?q=artistico&vjk=21eb3c13ddabd151';
$numPage = "";
while ($url) {
    $response = $httpClient->get($url, $headers);
    $htmlString = (string) $response->getBody();

    // Suppress libxml warnings caused by malformed HTML
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($htmlString);
    $xpath = new DOMXPath($doc);

    // Get the current page number
    $currentPageElement = $xpath->query('//a[@data-testid="pagination-page-current"]')->item(0);
    if ($currentPageElement) {
        $numPage = $currentPageElement->textContent;
        echo "<br> -------- PAGE - " . $numPage . " -------- <br>";
    } else {
        echo "<br> -------- PAGE - NULL -------- <br>";
    }

    // Get the job titles
    $jobTitleElements = $xpath->query('//h2[contains(@class, "jobTitle")]//a//span');
    foreach ($jobTitleElements as $element) {
        echo $element->textContent . PHP_EOL . "<br>";
    }

    // Build the next-page URL; trim slashes so root and href do not join as "//"
    $nextPageLinkNode = $xpath->query('//a[@aria-label="Next Page"]')->item(0);
    $nextPageLink = $nextPageLinkNode
        ? rtrim($rootUrl, '/') . '/' . ltrim($nextPageLinkNode->getAttribute('href'), '/')
        : null;

    // Stop the loop if there is no next-page link
    if (!$nextPageLink) {
        break;
    }
    echo $nextPageLink;
    $url = $nextPageLink;
}
?>
As mentioned above, each run returns a different number of results. What could be causing this?