When it is necessary to integrate with a web application, and an API is unavailable, is it a viable solution to simulate a web browser interacting with the web application as a real user would interact with it?
UPDATE
Some context.
The web app belongs to a vendor/partner. Their timeline for building a proper API will not meet our needs.
Scraping for data is the least used part. Full CRUD for interacting with the app is required.
As a developer, I can see that it will work, but I want to be able to address concerns that others may raise about this approach.
Would you base a mission critical business application on this approach? Why or why not?
6
If the problem is with a third-party vendor/partner that your company has a solid relationship with, then the best person to ask is the vendor. Get in contact with someone at their company, preferably someone associated with the team building the API, and ask their opinion; they will probably know the ins and outs of their application fairly well, and be able to offer good advice.
How viable it is to interact with the webapp by pretending to be a browser depends heavily on the nature of the webapp. If it’s a JavaScript-heavy, AJAXy, stateful app, you may have to do a LOT of testing and extra development to make things work. Without a proper API, you have no guarantee that they aren’t doing something really crazy and evil, like changing state logic when they see your browser request a particular page’s CSS, because that signals that some particular line of JavaScript loaded and executed, and it was the only way their junior programmer knew how to get something done.
On the other hand, if it’s a fairly stable, well-designed, preferably RESTful webapp, you should be able to send HTTP requests of various types at it, parse the responses, and have it work.
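As a rough illustration of the "send requests, parse responses" case, here is a minimal Python sketch using only the standard library. The base URL, the `/records` form action, and the results-table layout are all hypothetical; a real webapp will have its own paths, fields, and markup, and probably session cookies or CSRF tokens on top.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://vendor.example/app"  # hypothetical address of the webapp


def build_create_request(fields: dict) -> Request:
    """Build the POST a browser would send when the (hypothetical) form is submitted."""
    body = urlencode(fields).encode("utf-8")
    return Request(
        BASE_URL + "/records",  # hypothetical form action
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


class RecordTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self._in_cell = False
            if self._row:  # skip header rows, which contain only <th> cells
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def parse_records(html: str) -> list:
    """Turn the HTML of a listing page into a list of rows of cell text."""
    parser = RecordTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Actually sending the request would be a call to `urllib.request.urlopen(build_create_request(...))`, with the response body passed to `parse_records`. Everything in the parser is tied to the page markup, which is exactly the part that breaks when the vendor changes their UI.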
There’s also always the old programmer standby of adding another layer of abstraction. Depending on the nature of your company’s relationship with the vendor, you may be able to get the actual interface of the API before it’s completed, or even just get a general idea of how it will work. At that point, you can write your own “API” internally to emulate theirs using the webapp’s existing features. Once the real API arrives, you should be able to quickly convert your implementation into nothing more than a thin wrapper around the proper API. Then, hopefully, you can ultimately refactor the wrapper layer out of existence, without ever breaking functionality. It’s not a very pretty or purist solution, but if the vendor is willing and able to work closely with you, or even just willing to throw a few pages of API documentation at you, this might be a fairly practical way to get the functionality working now, while being able to switch over to the proper API quickly and (as much as is possible) painlessly when it becomes available.
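The abstraction-layer idea might look something like the Python sketch below. All of the names here (`VendorClient`, the record paths, the injected `fetch_page`, `submit_form`, and `call_api` callables) are made up for illustration; the real interface should mirror whatever the vendor tells you their API will look like.

```python
from abc import ABC, abstractmethod
from typing import Callable


class VendorClient(ABC):
    """Internal interface, shaped like the API we expect the vendor to ship."""

    @abstractmethod
    def get_record(self, record_id: str) -> dict:
        ...

    @abstractmethod
    def create_record(self, fields: dict) -> str:
        ...


class ScrapingVendorClient(VendorClient):
    """Interim implementation that drives the existing web UI."""

    def __init__(
        self,
        fetch_page: Callable[[str], dict],         # hypothetical: path -> parsed fields
        submit_form: Callable[[str, dict], str],   # hypothetical: path, fields -> new id
    ):
        self._fetch_page = fetch_page
        self._submit_form = submit_form

    def get_record(self, record_id: str) -> dict:
        return self._fetch_page(f"/records/{record_id}")

    def create_record(self, fields: dict) -> str:
        return self._submit_form("/records/new", fields)


class ApiVendorClient(VendorClient):
    """Later implementation: a thin wrapper once the real API exists."""

    def __init__(self, call_api: Callable[[str, str, dict], dict]):
        self._call_api = call_api  # hypothetical: method, path, payload -> JSON dict

    def get_record(self, record_id: str) -> dict:
        return self._call_api("GET", f"/api/records/{record_id}", {})

    def create_record(self, fields: dict) -> str:
        return self._call_api("POST", "/api/records", fields)["id"]


def duplicate_record(client: VendorClient, record_id: str) -> str:
    """Business logic depends only on the interface, not on how it is backed."""
    original = client.get_record(record_id)
    return client.create_record(original)
```

Code like `duplicate_record` never learns whether it is talking to the scraper or to the real API, so swapping one for the other is a change at the point where the client is constructed and nowhere else.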
In any of these cases, if the webapp is being hosted on servers owned by the vendor, you really do need to clear this with them first, especially if your app is going to be hitting theirs for data with any regularity. If the webapp is instead part of something hosted internally in your company, then you should similarly check with your network/system admins, give them an idea how heavy the traffic will be, and make sure they’re okay with it. Failure to alert people before slamming their server with traffic is not cool, and may result in things like being IP banned from accessing them, making their company more hostile to yours, or even getting yourself fired when both those things happen, and management starts looking for someone to blame so they can get back in the vendor’s good graces.
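Even with permission, it is worth making the agreed traffic level something the code enforces. A minimal sketch of a client-side throttle in Python follows; the class name and the interval are assumptions, and the actual limit is whatever you negotiate with the vendor or your admins.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self._min_interval_s = min_interval_s
        self._clock = clock    # injectable so the behaviour can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval_s has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before every request to the webapp keeps a single process under the agreed rate. It does nothing for multiple processes or machines hitting the same server, which would need a shared limiter.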
5
When there is no API, ask the owner of the website for a suggestion.
For example, let’s say you’re interested in a list of movies published by a website that has spent decades building a large database of nearly every movie ever released. There is no API.
If you ask the owner:
- Maybe he can sell you the raw data.
- Maybe there is an API, just not a public one. This is not a rare case: creating an API for internal use is one thing; writing complete documentation, examples, etc. in order to release it to the public is a different story. In such a case, giving a trusted partner access to the internal API may be possible.
- Maybe the website doesn’t have any database of its own, but simply retrieves the data from a third-party web service. The owner can point you to this service so that you can use it directly.
- Maybe the owner is willing to spend some time creating an API.
- Maybe the owner is ready to hire you to create the API.
- Maybe there is a way to call a simplified version of the website, which would give you easy-to-understand markup to work with and spare the owner from excessive bandwidth use.
- etc.
There are plenty of opportunities in this case.
If you start scraping the website without asking anyone, sooner or later the owner will notice. It would then be up to him to:
- Either change the markup of the web pages to force you to rewrite your entire app. Then change it again. And again.
- Or block you.
- Or file a claim for copyright violation.
In all three cases, the consequences would range from mildly annoying to a significant loss of money. Don’t go this way.
1
Would you base a mission critical business application on this approach? Why or why not?
No.
The approach is brittle; it’s too easy to break, too difficult to maintain and impossible to test. I wouldn’t rely on it in any sort of mission-critical context.
To get it to work at all, your partner would have to be willing to freeze the UI so that it would not change and break your interface. Otherwise, you’re signing up for an ongoing, never-ending maintenance nightmare.
1