I’m trying to make a ‘frontend web scraper’. I had the idea of setting up a Node app with a dynamic route that uses Request (the NPM package) to scrape pages and serve them as its own. Then my JS plugin would add an invisible iframe, with a src of ‘nodeapp.com/[URL-TO-SCRAPE]’, to whatever page it’s used on, and use getElementById() across frames to get the text. The reason I ask is that I know you can’t use getElementById() across multiple origins, but if I were able to serve the page from a Heroku app with an ‘Allow Cross Origin Requests’ header, would this work? I know my question is very confusing.
What you’re describing is a reverse proxy.
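As a rough sketch of such a proxy (assuming an Express app on port 3000, and using the Request package the question mentions, which has since been deprecated), the dynamic route might look something like this; the wildcard route, port, and error handling are illustrative only:

```js
const express = require('express');
const request = require('request'); // the package named in the question (now deprecated)

const app = express();

// Express 4-style wildcard route: GET /<target-url> proxies that page,
// e.g. GET /https://example.com/article
// (For simplicity this ignores the target URL's query string.)
app.get('/*', (req, res) => {
  const target = req.params[0]; // everything after the leading slash

  request(target, (err, upstreamRes, body) => {
    if (err) return res.status(502).send('Upstream fetch failed');
    // Permissive CORS header so any origin can read this response via Ajax
    res.set('Access-Control-Allow-Origin', '*');
    res.send(body);
  });
});

app.listen(3000);
```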
CORS allows cross-origin reading of resources, but it does not allow cross-origin iframe reads. However, if your reverse proxy serves permissive CORS headers (e.g., `Access-Control-Allow-Origin: *`), then the contents of your reverse proxy will be readable with an Ajax request. You simply need to make an Ajax request to `nodeapp.com/[URL-TO-SCRAPE]`, load the fetched HTML into a document, and then find the element with `myNewDocument.getElementById()`.
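For example, a minimal client-side sketch (using the Fetch API and DOMParser; the host `nodeapp.com`, the target URL, and the element id `headline` are all placeholders):

```js
// Ask the proxy (hypothetical host) for the target page's HTML
fetch('https://nodeapp.com/https://example.com/article')
  .then((res) => res.text())
  .then((html) => {
    // Parse the raw HTML string into a detached document
    const myNewDocument = new DOMParser().parseFromString(html, 'text/html');
    // Query it like any other document ('headline' is a hypothetical id)
    const el = myNewDocument.getElementById('headline');
    console.log(el ? el.textContent : 'element not found');
  });
```

Since the document produced by DOMParser is never attached to the page, scripts in the fetched HTML don’t run; you’re only reading its markup.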
Note that if the page making the Ajax request is on the same origin as the reverse proxy, CORS isn’t even necessary. (If they’re on different origins, then the CORS headers served by the proxy are sufficient to allow its content to be read.)
Note also that reverse proxies will not use the user’s cookies when making the request to the page. So, for example, when your reverse proxy fetches `gmail.com`, it will show the user the Gmail login page, not the user’s inbox. If the content you’re trying to read is not protected by credentials (i.e., it can be read without logging in), then there won’t be any issues.