I’m trying to extract some data from a mutual fund information page. The page has a static header and footer, but the middle section is built and displayed at load time, apparently by AngularJS. In the HTML source that Chrome retrieves (visible in the DevTools Network or Elements tabs), there is a section like the following, which sets a JavaScript variable holding all the data for the Angular controller that renders the dynamic middle part:
<section class="td-fund-card" ng-cloak ng-controller="fundCardController as $ctrl">
<script type="text/javascript">
var data = {...};
</script>
But when I request the same URL programmatically (using the Python requests library), that section/script chunk is missing from the response body. I first thought some load-time JavaScript might be fetching the data in separate requests and inserting the chunk into the DOM. However, when I disable JavaScript for the site in Chrome, the chunk is still present in the initial HTML response (although, as expected, the dynamic middle part is not rendered). So presumably the server is distinguishing between my browser request and my programmatic request in some way, and sending a different response to each?
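For concreteness, here is a minimal sketch of how the presence of that chunk can be checked in a response body. The function name `find_fund_data` is hypothetical, and the regex assumes the payload is assigned as a single `var data = {...};` statement directly before the closing script tag, as in the snippet above.

```python
import re

# Assumption: the payload looks like `var data = {...};` followed by </script>,
# inside the section that carries the fundCardController attribute.
DATA_RE = re.compile(r"var\s+data\s*=\s*(\{.*?\})\s*;\s*</script>", re.DOTALL)


def find_fund_data(html):
    """Return the raw object literal assigned to `data`, or None if absent."""
    if "fundCardController" not in html:
        return None
    match = DATA_RE.search(html)
    return match.group(1) if match else None
```

Running this on `page.text` from the programmatic request and on the source saved from Chrome makes the difference explicit: the first returns `None`, the second returns the object literal.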
I added some likely headers to the programmatic request, including Chrome’s User-Agent string, but the response still lacked that chunk:
page = self.requests_session.get(
    url,
    headers={
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.7",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
    },
    **kwargs,
)
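To see which headers still differ from the browser’s, the two sets can be compared directly. The following is only a sketch: `diff_headers` is a hypothetical helper, and the header values in the example are made up for illustration, not taken from the real site.

```python
def diff_headers(browser, sent):
    """Compare two header mappings, ignoring the case of header names.

    Returns (missing, different): names present only in `browser`, and
    names present in both whose values differ.
    """
    browser_ci = {name.lower(): value for name, value in browser.items()}
    sent_ci = {name.lower(): value for name, value in sent.items()}
    missing = sorted(set(browser_ci) - set(sent_ci))
    different = sorted(
        name
        for name in set(browser_ci) & set(sent_ci)
        if browser_ci[name] != sent_ci[name]
    )
    return missing, different


if __name__ == "__main__":
    # Made-up values: copy the real ones from the request in Chrome's Network tab.
    browser = {"User-Agent": "Chrome UA", "Accept": "text/html", "Cookie": "k=v"}
    sent = {"user-agent": "Chrome UA", "accept": "*/*"}
    print(diff_headers(browser, sent))
```

With requests, the headers that were actually sent are available afterwards as `page.request.headers`, so that mapping can be passed as `sent`.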
I’m not sure what else the server might be checking to produce different responses for the browser and the programmatic request. I don’t really know Angular, or how common it is for a server to omit parts of a template when it thinks the client doesn’t support JavaScript, or whatever else the server is trying to achieve. If anyone has an idea of something else I could set in the programmatic request that might influence this, I’d love to know.