http://highscalability.com/blog/2010/11/9/facebook-uses-non-stored-procedures-to-update-social-graphs.html talks about how Facebook moved logic out of the database into the application in order to improve ease of development (scaling better across employees instead of across databases).
Are there any studies/articles/benchmarks that evaluate the business and technical tradeoffs of running as much of the logic in the database versus in the application?
Specifically, I’m wondering what scales better (both in terms of the database performance and ease of development as the business grows):
Scenario 1 (Logic in database)
- Running as much of the logic on the database end.
- Using database-specific SQL extensions/features.
- Using stored procedures.
- Few queries over short transactions, low network overhead.
Scenario 2 (Logic in application)
- Running as much of the logic on the application end.
- Limiting oneself to features which are common across all major databases.
- Not using stored procedures.
- Many queries over long transactions, high network overhead.
What kind of people-scaling overhead can one expect in Scenario 1?
What kind of database-scaling overhead can one expect in Scenario 2?
5
Answering my own question (aggregating everything I’ve read so far):
- Separation of Concerns
- Implement the business logic in the application in order to benefit from the richer development, testing and debugging environment offered outside the database.
- Data integrity and cross-cutting concerns (e.g. auditing, reporting) should reside in the database.
- Databases need an abstraction layer to avoid code duplication and enable you to quickly update SQL statements when the schema is changed. The abstraction layer can reside in the application or database.
- Performance
- Prepared statements are as fast as stored procedures.
- Network overhead is negligible, unless you need to transfer large amounts of data from the server to the client and back again.
- Security
- Stored procedures aren’t any safer than dynamic SQL for injection attacks
- Stored procedures allows you to grant permissions in terms of API methods instead of the table/view level. Meaning, instead of “Customers should only be allowed to view table
Account
and update tableBalance
” you have “Customers should only be allowed to invokeTransferFunds()
“. You should be able to emulate the same permission scheme using views.
- Common use-cases
- If the database needs to support multiple applications, the abstraction layer should probably reside in the database in the form of stored procedures. This is the most frequently cited use-case by DBAs for implementing logic in the database.
- If the application needs to support multiple databases, the abstraction layer should probably reside in the application layer.
- Miscellaneous
- Your developers should be well versed in SQL. Learning the language is a lot easier than you think. Don’t fool yourself: developers that seek to avoid SQL will produce scalability problems.
Summary: Contrary to popular belief, you shouldn’t use stored procedures for performance or security reasons. You should use them to support multiple applications, to enforce data integrity and to implement cross-cutting concerns that cannot be easily implemented in the application layer.
Feel free to push all remaining logic into the application layer.
18
Scalability isn’t the only issue in making this choice (and I do know of successful databases with trillions of records so databases are more scalable than you think). Nor is it probably the most important.
First you need to look at the meaning of the data. Something like Facebook is tied intrinsically to its application and thus putting the logic there isn’t as risky as a business application which has to get its data from imports, database jobs, user data entry from several different applications including some that perhaps the business has no control over. So the risk to data integrity is the first and most important thing to consider when making this choice.
Also how the data will be used and in what environment is critical. How is the information audited? Are there regulatory requirements? How is the reporting done? Do I need to be able to reuse the logic in a different reporting system as well as in the user application? If so the logic needs to rest outside of the application. Do I need to do exports of the data and how is that affected by the logic being in the application. Does that mean the people writing the reporting and export code will not have the ability to see what the logic is because they don’t have access to application code? That can be a major problem.
Another consideration is the scalability which includes the performance. How much scalability do you need? Very few things need the scalability of a Facebook. How much performance do you need? Designing for scalability when you will never need it leads to less than optimum results. Will the methods you use to put logic in the application have a negative impact on database performance (many ORMs for instance write terrible database code).
Then there is the argument about less time for development which is ridiculous. If you know what you are doing, putting the logic in the database does not take more time than putting it in the application. It is just that most application developers aren’t SQL specialists. However, is saving the devs time to get better at SQL really a plus? No it is not because that choice almost always comes at a cost of performance on the database and at the cost of data integrity.
What I am trying to say is there is no one size fits all. There are some applications where putting the logic in the application makes sense and some where it does not, but thinking there are only one or two critical factors to consider when making the choice is generally a mistake.
1
Well, there’s the practical answer, and then there’s this one.
The question is titled “Where should the application logic run?” But if you think about it, this conflates two things that ought to be separable:
- How should the application logic be coded?
- Where should the application logic be run?
One of the big, not generally recognized weaknesses of present-day database technology is that making a choice for either of these questions usually forces your hand on the other one:
- If you choose to run application logic on the database, you’re forced to either use the database’s procedural language (eeek, PL/SQL), or really cumbersome support for third-party language stored procedures.
- If you choose to code application logic in a good programming language, then you lose the data integrity advantages unless you duplicate effort and code some of the same logic at the database as well.
But ideally, I should be able to code all my application logic once, in a powerful programming language of my choice, and whether the logic runs on the application or the database should be a late-bound decision—one that I choose when I deploy the application, or even as late as runtime.
In fact, ideally I should be able to run the same logic simultaneously in the browser, the application server and the database. Think of validation rules in a forms-driven web application. Ideally you want the validation logic to run at all three tiers:
- The browser should run the validation logic, because it can provide very low latency feedback to the user when they’re filling the form wrong. Also, by preventing the user from submitting invalid forms, it reduces load on the lower tiers.
- The application server should run the validation logic, because it cannot trust the browser to do so—a hacker can bypass the browser’s logic. Also, by catching invalid forms before they’re sent to the database, it reduces the load on that critical shared resource.
- The database should run the validation logic, because it’s responsible for enforcing data integrity for multiple users of the data.
So really you’d want to code that logic once, and have tools that apply it in all three places. But precious little technology exists to achieve this.
1
I tend to recommend / prefer:
-
Logging
- App level for application logic (“delete user”, “change password”, “add customer”).
- For compliance / audit, apply directly to database (since this will catch any other indirect means of access, such as DBAs)
-
Security / Permissions
- Security should be implemented within the application.
- Within the DBRMS, security can be implemented a dozen different ways (SQL Admins, often Local Admins are SQL Admins (not that they should be), DB owner, DB roles, object rights)… this can easily lead to confusion (“who has rights to X?”, “how does Jane Doe have permission to Y?”)… avoid wherever possible. And, rights directly to the DB can be dangerous, since there are so many tools to access databases (Excel, Access, etc)… too easy for users to hit the DB directly, and do things that bypass the app’s process.
- Let’s face it, most apps DON’T use integrated authentication (NTLM / Kerberos). WebApps require Kerberos, which is annoying and confusing to many, plus it means that permissions within the DB can cause problems to the app (if for example, the app doesn’t know how to handle when a specific call fails due to the user not being in the correct group)… plus, there are too many ways to apply permissions directly to SQL, to enforce good consistency… PLUS, apps aren’t going to scan the DB for such misconfigurations (“gee, user can run things because they’re SA, when in fact they SHOULD be a member of this DB role”).
- (MSSQL) Application roles are rarely used, but SHOULD BE (looking at the devs here). Great way to enforce app permissions within SQL.
-
Queries
- simple queries (SELECT *, SELECT by ID) can be handled by ORMs writing dynamic SQL
- Complex queries tend to be written better by hand, so write a sproc and map within the ORM
-
INS/UPD/DEL
- no specific preference.
- on one hand, passing an app object to a sproc makes the INS/UPD transaction across tables easier
- on the other hand, scaling out may mean that some tables are moved to other servers, so the transactions aren’t contained to a single server
- also, transaction locking requires tuning to ensure that locks won’t become performance impediments. The alternative “NOSQL” / “eventual consistency” model is high performance, but involves more code, and more sysadmin effort (since there are usually more systems involved to track and propogate the data between the databases / cache servers).
-
SDL
- should only run during deployment… presumably an installation / upgrade process runs the SDL code
2