I’m working on a project where I need to scrape the documentation of a technical software website using PowerShell. The goal is to feed the scraped HTML into a GPT-based assistant for automated question answering and information retrieval. I’m looking for a way to recursively download all relevant HTML pages, follow links correctly, and stay within the site’s domain.
However, I’m running into trouble with pages that rely heavily on JavaScript or dynamic content: Invoke-WebRequest only returns the initial HTML and does not execute page scripts, so anything loaded dynamically is missing from the response.
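For context, here is a simplified sketch of the recursive crawl I have in mind (the URL and output path below are placeholders, not my real targets):

```powershell
# Breadth-first crawl with Invoke-WebRequest, restricted to one domain
$startUrl  = 'https://docs.example.com/'    # placeholder start page
$baseHost  = ([uri]$startUrl).Host          # only follow links on this host
$outputDir = 'C:\scrape\docs'               # placeholder output folder

$visited = [System.Collections.Generic.HashSet[string]]::new()
$queue   = [System.Collections.Generic.Queue[string]]::new()
$queue.Enqueue($startUrl)

New-Item -ItemType Directory -Path $outputDir -Force | Out-Null

while ($queue.Count -gt 0) {
    $url = $queue.Dequeue()
    if (-not $visited.Add($url)) { continue }   # skip URLs already fetched

    try {
        $response = Invoke-WebRequest -Uri $url -UseBasicParsing
    } catch {
        Write-Warning "Failed to fetch $url : $_"
        continue
    }

    # Save the raw HTML under a filename derived from the URL path
    $fileName = (([uri]$url).AbsolutePath.Trim('/') -replace '[\\/:*?"<>|]', '_')
    if (-not $fileName) { $fileName = 'index' }
    $response.Content | Out-File -FilePath (Join-Path $outputDir "$fileName.html") -Encoding utf8

    # Queue same-domain links found in the static HTML
    foreach ($link in $response.Links) {
        if (-not $link.href) { continue }
        $absolute = [uri]::new([uri]$url, $link.href)       # resolve relative links
        if ($absolute.Host -eq $baseHost -and $absolute.Scheme -like 'http*') {
            $queue.Enqueue($absolute.GetLeftPart('Path'))   # drop query string / fragment
        }
    }
}
```

This works fine for purely static pages, but the links collection only reflects the HTML as delivered by the server, so anything injected by JavaScript is never discovered.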
Questions:
1. How can I improve this script to handle dynamic content more effectively?
2. Are there better strategies for making sure I capture all necessary documentation, including pages that are only loaded dynamically?
3. Any recommendations on managing large amounts of data and links efficiently in PowerShell?
Any help or insights would be greatly appreciated!