At the past two companies, I’ve been at, REST API’s exist for querying data via webapp – i.e. instead of having the webapp do SQL directly it calls a REST API and that does the SQL and returns the result.
My question is – why is this done?
If it was going to be exposed to third-parties I could understand. Better to expose a limited REST API than the full DB. But in both of these companies that’s not the the case.
It’s been suggested to me that these REST API’s make it easier to switch between DBMS. But isn’t that the point of a database abstraction layer (DBAL)? Maybe you use an ORM as your DBAL or maybe you could just write raw SQL and have your DBAL translate the DB specific stuff if appropriate (e.g. translate LIMIT for MySQL to TOP for MSSQL).
Either way it seems unnecessary to me. And I think it makes diagnosing issues more difficult as well. If a report on the webapp is giving the wrong numbers you can’t just dump out the SQL query – you have to dump the REST URL and then go into the project that’s serving as the REST API and pull out the SQL from that. So it’s an extra layer of indirection that slows down the diagnostic process.
15
If you allow a client to access the database directly – which they would do, even with a database abstraction layer, then:
- You get a coupling between their code and yours – particularly, there is a very strong coupling between your database structure and their code;
- Your client may do some pretty undesirable stuff on your database – whether it be updating data that they should not, writing a query that takes too much time, deadlocking something because they do not acquire locks cleanly…
- If you have made some less than optimal choice in your database structure, then moving out of that choice may be very hard, especially if you do not have a good way to make your clients migrate over to new structures.
That is, I am not touching at all on the REST part – isolating your database behind an API is simply a more sensible choice if the team that maintains the database and the teams that use it are not in sync, as it allows these parts to evolve at their own pace.
You are right, there is no clear benefit to introduce a REST API layer between a web app and a database, and it has a cost in complexity and performance overhead.
The reason you are getting contradictory answers is confusion about what is the ‘client’ in your architecture.
In your architecture (if I understand it correct), you have browsers interacting with a single web app, which in turn is interacting with the database. Introducing a REST API layer between the web app and the database has no benefit. All the stated benefits (caching, isolation of database etc) can be achieved with data access layer(s) in code.
But there are some other architectures where a REST API makes sense:
-
If you have multiple clients accessing the database – that is, not a single web app but multiple independent web apps accessing the same database. It may have benefit to create a common REST interface to allow sharing of data model, caching etc. Sure you can get some of the benefit by sharing the same DAL libraries, but that wont work if the apps are developed in different languages and on different platforms. This is common in enterprise systems.
-
If you have multiple desktop apps accessing the database directly. This is the classic “two-tier” architecture, which have fallen out of favor compared to web apps. Introducing a REST layer allows you to centralize data access logic and especially it allows tighter control of security, since it is risky to have multiple distributed clients accessing the same database directly.
-
If you have JavaScript code which directly fetches data from the server, then you need something like a REST API in any case.
6
Warning: big post, some opinions, vague ‘do what works best for you’ conclusion
Generally, this is done as a means of implementing ‘hexagonal architecture’ around your database. You can have web applications, mobile applications, desktop applications, bulk importers, and background processing all consume your database in a uniform way. Certainly you could accomplish the same thing to some extent by writing a rich library for accessing your database, and having all of your processes use that library. And indeed, if you’re in a small shop with a very simple system, that’s actually probably a better route to go; It’s a simpler approach and if you don’t need the advanced capabilities of a more complicated system, why pay for the complexity? However, if you’re working with a large, sophisticated set of systems that all need to interact with your database at scale, there’s a lot of benefits to putting a web service between your applications and your data:
Platform independence & maintenance
If you have a database, and you write a Python library to interact with that database, and everybody pulls in that library to interact with the database, that’s great. But let’s say suddenly you need to write a mobile app, and that mobile app now needs to talk to the database as well. And your iOS engineers don’t use Python, and your Android engineers don’t use Python. Maybe the iOS guys want to use Apple’s languages and the Android engineers want to use Java. Then you’d be stuck writing and maintaining your data access library in 3 different languages. Maybe iOS and Android devs decide to use something like Xamarin to maximize the code they can share. Perfect, except you’re probably still going to have to port your data access library to .NET. And then your company just purchased another company who’s web application is a disparate but related product, and the business wants to integrate some of the data from your company’s platform into the newly acquired subsidiary’s platform. Only there’s one problem: The subsidiary was a start-up and decided to write the bulk of their application in Dart. Plus, for whatever reasons (reasons probably beyond your control) the mobile team that was piloting Xamarin decided it’s not for them, and that they’d rather use the tools and languages specific to the mobile devices they’ll be developing for. But while you were in that phase, your team had already delivered a large portion of your data access library in .NET, and another team in the company was writing some crazy Salesforce integration stuff and decided to do all of that in .NET since there was already a data access library for .NET and it seemed like a good idea because mobile was initially planning to use .NET as well.
So now, because of a very realistic turn of events, you have your data access library written in Python, .NET, Swift, Java, and Dart. They’re not as nice as you’d like them to be, either. You couldn’t use an ORM as effectively as you’d like to, because each language has different ORM tools, so you’ve had to write more code than you would have liked to. And you haven’t been able to devote as much time to each incarnation as you would have wanted, because there’s 5 of them. And the Dart version of the library is especially hairy because the you had to roll-your-own transactional stuff for a some of it because the libraries and support just wasn’t really there. You tried to make the case that because of this, the Dart application should have only had read-only functionality for your database, but the business had already made up their minds that whatever features they were planning were worth the extra effort. And it turns out there’s a bug in some of the validation logic that exists in all of these incarnations of your data access library. Now you have to write tests and code to fix this bug in all of these libraries, get code reviews for your changes to all of these libraries, get QA on all of these libraries, and releases your changes to all of the systems using all of these libraries. Meanwhile, your customers are displeased and have taken to Twitter, stringing together combinations of vulgarities you never would have imagined could be conceived, let alone targeted at your company’s flagship product. And the product owner decides to be not very understanding about the situation at all.
Please understand that in some environments, the above example is anything but contrived. Also take into consideration that this sequence of events may unfold over the course of a few years. Generally, when you get to the point where architects and business people start talking about hooking up other systems to your database, that’s when you’re going to want to get ‘putting a REST API in front of the database’ onto your roadmap. Consider if early on, when it was clear that this database was going to start being shared by a few systems, that a web service/REST API was put in front of it. Fixing your validation bug would be a lot quicker and easier because you’re doing it once instead of 5 times. And releasing the fix would be a lot easier to coordinate, because you’re not dependent on several other systems releasing in order to get your change out there.
TLDR; It’s easier to centralize the data access logic and maintain very thin HTTP clients than it is to distribute the data access logic to each application that needs to access the data. In fact, your HTTP client may even be generated from meta-data. In large systems, the REST API lets you maintain less code
Performance and scalability
Some people may believe that talking to the database directly instead of going through a web service first is faster. If you have only one application, that’s certainly true. But in larger systems, I disagree with the sentiment. Eventually, at some level of scale, it’s going to be very beneficial to put some kind of cache in front of the database. Maybe you’re using Hibernate, and want to install an Infinispan grid as an L2 cache. If you’ve got a cluster of 4 beefy servers to host your web service separate from your applications, you can afford to have an embedded topology with synchronous replication turned on. If you try to put that on a cluster of 30 application servers, the overhead of turning on replication in that setup will be too much, so you’ll either have to run Infinispan in a distributed mode or in some kind of dedicated topology, and suddenly Hibernate has to go out over the network in order to read from the cache. Plus, Infinispan only works in Java. If you have other languages, you’ll need other caching solutions. The network overhead of having to go from your application to your web service before reaching the database is quickly offset by the need to use much more complicated caching solutions that generally come with overhead of their own.
Additionally, that HTTP layer of your REST API provides another valuable caching mechanism. Your servers for your REST API can put caching headers on their responses, and these responses can be cached at the network layer, which scales exceptionally well. In a small setup, with one or two servers, your best bet is to just use an in memory cache in the application when it talks to the database, but in a large platform with many applications running on many servers, you want to leverage the network to handle your caching, because when properly configured something like squid or varnish or nginx can scale out to insane levels on relatively small hardware. Hundreds of thousands or millions of requests per second of throughput is a lot cheaper to do from an HTTP cache than it is from an application server or a database.
On top of that, having a ton of clients all pointed at your database, instead of having them all pointed at a few servers which in turn point to the database, can make tuning the database and connection pooling a lot harder. In general, most of the actual workload on an application server is application stuff; waiting for data to come back from the database is often time consuming, but generally not very computationally expensive. You may need 40 servers to handle your application’s workload, but you probably don’t need 40 servers to orchestrate fetching the data from the database. If you dedicate that task to a web service, the web service will probably be running on far fewer servers than the rest of the application, which means you’ll need far fewer connections to the database. Which is important, because databases generally don’t perform as well when they’re servicing tons of concurrent connections.
TLDR; It’s easier to tune, scale and cache your data access when it’s something that happens inside of a single dedicated web service than it is when it’s something that happens across many different applications using different languages and technologies
Final thoughts
Please don’t come away from this thinking “Oh wow, I should always be using REST APIs to get my data” or “This idiot is trying to say we’re doing it wrong because our web app talks to the database directly, but our stuff works fine!”. The major point I’m trying to make is that different systems and different businesses have different requirements; In a lot of cases, putting a REST API in front of your database really doesn’t make sense. It is a more complicated architecture that requires justifying that complexity. But when the complexity is warranted, there’s a ton of benefits to having the REST API. Being able to weigh the different concerns and choose the right approach for your system is what makes a good engineer.
Additionally, if the REST API is getting in the way of debugging things, there’s likely something wrong or missing in that picture. I don’t believe having that added abstraction layer intrinsically makes debugging harder. When I work with large, n-tier systems, I like to make sure I have a distributed logging context. Perhaps when a user initiates a request, generate a GUID for that request and log the username of that user and the request they made. Then, pass that GUID on as your application talks to other systems. With proper log aggregation and indexing, you can query your entire platform for the user reporting the issue, and have visibility into all of their actions and they trickle through the system to quickly identify where things went wrong. Again, it’s a more complicated architecture, so you should probably have more complicated infrastructure in place to facilitate supporting that architecture.
Sources:
http://alistair.cockburn.us/Hexagonal+architecture
https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-Sizing
1
If I understand correctly what a DBAL is, then the answer is that a REST interface allows you to use any language for its clients, while a DBAL is a library that allows you to use a single language for its clients.
This, in turn, can be an advantage for a company where there are many development teams and not all of them are proficient in the same language. Allowing their software to query directly the DB would be equivalent in functionality, but as you say “better to expose a limited REST API than the full DB”.
In more abstract terms, you yourself are answering the question:
So it’s an extra layer of indirection that slows down the diagnostic process
… since there is this famous aphorism that states: “All problems in computer science can be solved by another level of indirection”. 🙂
Just because you’re inside the same company doesn’t mean you should expose everything to everyone. REST APIs are a way to define a limited consumer/provider relationship between teams in a company, with a clear contract. Amazon has been a pioneer in this form of organization.
APIs also provide a layer of abstraction, allowing you to use a specific set of idioms — you don’t necessarily want to talk to your consumers in the same terms that are used in your database. You also don’t necessarily want to talk to each consumer the same way.
You are thinking that REST is for database queries and it is not. REST represents the state of something at the moment. Using REST changes or retrieves a representation but that is all. If that state becomes available by database, it doesn’t matter and no one cares because HOW that representation comes to be is not part of REST and neither are database queries.
6
DBAL is concerned only about accessing data, but in most applications, you aren’t only concerned about accessing data, but also performing all the side effects of an operation.
For example, an ecommerce system. When an order comes into the system, you’d want to deliver an email to the customer for their receipt, you’d want to connect to your accounting system to update the balance sheet, you’d want to connect to the payment processor to process credit card, you’d want to submit the order details to fraud detection service to verify if a card payment is likely to become a chargeback (common in fraudulent orders), you’d want to connect to the warehouse system to dispatch the order, you’d want to connect to your own suppliers to order more inventories when your supplies are running low, you’d want to run a machine learning algorithms to provide a cross selling product recommendations, you’d want to connect to your CRM to raise a flag about orders that failed, you’d want to log the order to your analytic service for reporting and alerting, etc, etc, etc.
Most of these operations need to be done on a trusted system, i.e. on the server, and these side operations are not database related and are often too complex to develop on the DBAL layer.