I have a Node.js script using Puppeteer. It's supposed to go through a list of URLs and call a Puppeteer scraping function for each page (URL). It does work, sort of, but it seems like this code could be much more optimized and just plain better. I had to force it all onto one page, because when I tried to modularize the scrape function into another script, it didn't like it, probably because I'm not handling async the way I should. I'm not asking for a lecture on async (though I'd appreciate one if you're up for it), but I want this code to look right: if I were working in a shop, I don't feel this would be the best way to do it, plus it doesn't work reliably.
Let me take that back: sometimes it does run through multiple URLs, but other times, if there are more than 100 or so records, it quits after one URL.
const puppeteer = require('puppeteer');
const fs = require('fs');
async function scrapeData(url, totalRecords, ctype, pageSize = 24) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  page.setDefaultNavigationTimeout(60000);
  try {
    // Calculate the number of pages needed
    const totalPages = Math.ceil(totalRecords / pageSize);
    const csvRows = []; // Array to store CSV rows
    for (let i = 0; i < totalPages; i++) {
      // Calculate the start offset and page size for the current page
      const start = i * pageSize;
      const size = (i === totalPages - 1) ? (totalRecords % pageSize || pageSize) : pageSize;
      const currentPageUrl = `${url}?start=${start}&sz=${size}`;
      await page.goto(currentPageUrl);
      // Collect the product URLs from the current page
      const productUrls = await page.evaluate(() => {
        const urls = [];
        document.querySelectorAll('.l-products-grid_item-wrapper').forEach(link => {
          urls.push(link.href);
        });
        return urls;
      });
      // Create a CSV row with URL and type
      productUrls.forEach(productUrl => {
        csvRows.push(`"${productUrl}","${ctype}"`);
      });
      // Output the product URLs
      console.log(`Product URLs from page ${i + 1}:`);
      productUrls.forEach(productUrl => {
        console.log(productUrl);
      });
    }
    // Write CSV rows to a file, one row per line
    fs.writeFileSync('product_urls.csv', csvRows.join('\n'), 'utf8');
  } catch (error) {
    console.error('An error occurred:', error);
  } finally {
    await browser.close();
  }
}
const urls = [
  { url: 'https://www.xtestf.com/explore-bathrooms', totalRecords: 2891, ctype: 'bathroom' },
  { url: 'https://www.xtestf.com/explore-showers', totalRecords: 1817, ctype: 'shower' },
  { url: 'https://www.xtestf.com/explore-backsplashes', totalRecords: 1067, ctype: 'backsplash' },
  { url: 'https://www.xtestf.com/explore-kitchens', totalRecords: 2111, ctype: 'kitchen' },
  // I have a list of about 30 of these
];
// Scrape data for each URL
urls.forEach(async ({ url, totalRecords, ctype }) => {
  console.log(`Scraping data from ${url}`);
  await scrapeData(url, totalRecords, ctype);
});
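My current suspicion (and this is just a guess) is that urls.forEach(async ...) kicks off every scrape at once, because forEach doesn't wait for an async callback to finish, which might explain why it sometimes dies after one URL. Here's an untested sketch of what I think the sequential version should look like, using a plain for...of loop inside an async wrapper:

// Untested sketch: an async IIFE so await works at the top level, and a
// for...of loop so each scrape finishes before the next one starts.
(async () => {
  for (const { url, totalRecords, ctype } of urls) {
    console.log(`Scraping data from ${url}`);
    await scrapeData(url, totalRecords, ctype);
  }
})();

Is that the right shape, or is there a better pattern?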
I also tried to make the scrapeData function a module, but no matter which way I tried it, I couldn't get it to call a list of multiple URLs in succession. It's very aggravating doing this one at a time. Any suggestions or hints are appreciated.
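For context, here's roughly the shape of the module split I was attempting. The file names are just what I happened to use, and the export/require part is exactly what I'm least sure about:

// scraper.js — the same scrapeData function as above, with an export at the bottom
module.exports = { scrapeData };

// main.js — require it and drive it with the same sequential loop as above
const { scrapeData } = require('./scraper');

const urls = [
  { url: 'https://www.xtestf.com/explore-bathrooms', totalRecords: 2891, ctype: 'bathroom' },
  // ...the rest of the list
];

(async () => {
  for (const { url, totalRecords, ctype } of urls) {
    console.log(`Scraping data from ${url}`);
    await scrapeData(url, totalRecords, ctype);
  }
})();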
Thanks in advance.