Why use a database instead of just saving your data to disk?

Instead of a database I just serialize my data to JSON, saving and loading it to disk when necessary. All the data management is done in the program itself, which is faster AND easier than using SQL queries. For that reason I have never understood why databases are necessary at all.

Why should one use a database instead of just saving the data to disk?

You can query data in a database (ask it questions).
You can look up data from a database relatively rapidly.
You can relate data from two different tables together using JOINs.
You can create meaningful reports from data in a database.
Your data has a built-in structure to it.
Information of a given type is always stored only once.
Databases are ACID.
Databases are fault-tolerant.
Databases can handle very large data sets.
Databases are concurrent; multiple users can use them at the same time without corrupting the data.
Databases scale well.

In short, you benefit from a wide range of well-known, proven technologies developed over many years by a wide variety of very smart people.

If you’re worried that a database is overkill, check out SQLite.
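As a rough illustration of how little setup SQLite needs, here is a minimal sketch using Python's standard `sqlite3` module. The `items` table and its columns are made up for this example. It shows two of the points listed above: querying with a filter, and a transaction that is rolled back as a whole when one statement fails.

```python
import sqlite3

# In-memory database for the demo; pass a file path instead to persist to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT NOT NULL, qty INTEGER NOT NULL)"
)

# Used as a context manager, the connection commits if the block succeeds...
with conn:
    conn.executemany(
        "INSERT INTO items (name, qty) VALUES (?, ?)",
        [("apple", 3), ("pear", 0), ("plum", 7)],
    )

# ...and rolls back the whole transaction if an exception escapes the block.
try:
    with conn:
        conn.execute("UPDATE items SET qty = qty - 1 WHERE name = 'apple'")
        # This violates the NOT NULL constraint and raises IntegrityError.
        conn.execute("INSERT INTO items (name, qty) VALUES (NULL, 1)")
except sqlite3.IntegrityError:
    pass  # the UPDATE above was undone together with the failed INSERT

# Ask the database a question instead of filtering a list by hand.
in_stock = conn.execute(
    "SELECT name, qty FROM items WHERE qty > 0 ORDER BY name"
).fetchall()
print(in_stock)
```

The apple quantity is unchanged after the failed block, which is the "atomic" part of ACID in miniature.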

Whilst I agree with everything Robert said, he didn’t tell you when you should use a database as opposed to just saving the data to disk.

So take this in addition to what Robert said about scalability, reliability, fault tolerance, etc.

For when to use a RDBMS, here are some points to consider:

You have relational data, i.e. you have a customer who purchases your products and those products have a supplier and manufacturer
You have large amounts of data and you need to be able to locate relevant information quickly
You need to start worrying about the previous issues identified: scalability, reliability, ACID compliance
You need to use reporting or intelligence tools to work out business problems

As for when to use a NoSQL database

You have lots of data that needs to be stored which is unstructured
You have scalability and speed needs
You generally don’t need to define your schema up front, so if you have changing requirements this might be a good point

Finally, when to use files

You have unstructured data in reasonable amounts that the file system can handle
You don’t care about structure, relationships
You don’t care about scalability or reliability (although these can be done, depending on the file system)
You don’t want or can’t deal with the overhead a database will add
You are dealing with structured binary data that belongs in the file system, for example: images, PDFs, documents, etc.

One thing that no one seems to have mentioned is indexing of records. Your approach is fine at the moment, and I assume that you have a very small data set and very few people accessing it.

As you get more complex, you’re actually creating a database. Whatever you want to call it, a database is just a set of records stored to disk. Whether you’re creating the file, or MySQL, SQLite or whatever is creating the file(s), they’re both databases.

What you’re missing is the complex functionality that has been built into the database systems to make them easier to use.

The main thing that springs to mind is indexing. OK, so you can store 10 or 20 or even 100 or 1000 records in a serialised array, or a JSON string and pull it out of your file and iterate it relatively quickly.

Now, imagine you have 10,000, 100,000, or even 1,000,000 records. When someone tries to log in you’re going to have to open a file which is now several hundred megabytes large, load it into memory in your program, pull out a similarly sized array of information and then iterate 100s of thousands of records just to find the one record you want to access.

A proper database will allow you to set up indexes on certain fields in records allowing you to query the database and receive a response very quickly, even with huge data sets. Combine that with something like Memcached, or even a home-brew caching system (for example, store the results of a search in a separate table for 10 minutes and load those results in case someone else searches for the same thing soon afterwards), and you’ll have blazing fast queries, something you won’t get with such a large dataset when you’re manually reading/writing to files.
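To make the indexing point concrete, here is a small sketch with Python's built-in `sqlite3` module. The `users` table, the `idx_users_username` index and the generated rows are all invented for the example. It asks SQLite for its query plan before and after creating an index on the lookup column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, email TEXT)")
with conn:
    conn.executemany(
        "INSERT INTO users (username, email) VALUES (?, ?)",
        ((f"user{i}", f"user{i}@example.com") for i in range(10_000)),
    )

def plan(sql, params=()):
    """Return SQLite's human-readable query plan for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The description is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)

lookup = "SELECT id, email FROM users WHERE username = ?"

plan_before = plan(lookup, ("user9999",))  # without an index: a full table scan
conn.execute("CREATE INDEX idx_users_username ON users (username)")
plan_after = plan(lookup, ("user9999",))   # with the index: a direct search

row = conn.execute(lookup, ("user9999",)).fetchone()
print(plan_before)
print(plan_after)
print(row)
```

The exact wording of the plan text varies between SQLite versions, but the first plan reports a scan of the table and the second reports a search using the index. Only the matching row is returned to the program, not the whole data set.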

Another thing loosely related to indexing is transfer of information. As I said above, when you’ve got files of hundreds or thousands of megabytes you’re having to load all of that information into memory, iterate it manually (probably on the same thread) and then manipulate your data.

With a database system it will run on its own thread(s), or even on its own server. All that is transmitted between your program and the database server is an SQL query and all that is transmitted back is the data you want to access. You’re not loading the whole dataset into memory – all you’re sending and receiving is a tiny fraction of your total data set.

TLDR

It sounds like you made an essentially valid, short term data-store technical decision for your application – you chose to write a custom data store management tool.

You’re sitting on a continuum, with options to move in either direction.

In the long term, you’ll likely (almost, but not 100% certainly) find yourself running into trouble, and may be better off switching to existing data store solutions. There are specific, very common, predictable performance problems you will be forced to deal with, and you’re better off using existing tools instead of rolling your own.

It sounds like you’ve written a (small) custom-purpose database, built into and directly used by your application. I assume you’re relying on an OS and file system to manage the actual disk writing and reading, and treating the combination as a data-store.

When to do what you did

You’re sitting at a sweet-spot for data storage. An OS and file system data store is incredibly convenient, accessible, and cross-platform portable. The combination has been around for so long, that you’re certain to be supported, and have your application run, on almost any standard deployment configuration.

It’s also an easy combination to write code for – the API is fairly straight-forward and basic, and it takes relatively few lines of code to get it working.

Generally, it’s ideal to do what you’ve done when:

Prototyping new ideas
Building applications which are highly unlikely to need to scale, performance wise
Constrained by unusual circumstances, such as lack of resources for installing a database

Alternatives

You’re on a continuum of options, and there are two ‘directions’ you can go from here, what I think of as ‘down’ and ‘up’:

Down

This is the least likely option to apply, but it’s here for completeness’ sake:

You can, if you want, go down, that is, bypass the OS and filesystem altogether and really write and read directly from disk. This choice is usually relevant only in cases where extreme efficiency is required – think, for example, of a minimal/tiny MP3 player device, without enough RAM for a fully functional OS, or of something like the Wayback Machine, which requires incredibly efficient mass data write operations (most data stores trade off slower writes for faster reads, since that’s the overwhelmingly more common use case for almost all applications).

Up

There are several sub-categories here – these aren’t exactly exclusive, though. Some tools span both, providing some functionality in each, some can completely switch from working in one mode to working in the other, and some can be layered on top of each other, providing different functionality to different parts of your application.

More powerful data stores

You may find yourself needing to store higher and higher volumes of data, while still relying on your own application for managing the data manipulation complexity. A whole range of key-value stores are available to you, with varying extents of support for related functions. NoSQL tools fall into this category, as well as others.

This is the obvious path to scale up on when the following describe your application:

It is unusually heavy read reliant
You’re OK with trading lower (short-term) consistency guarantees for higher performance (many offer “eventual consistency”).
It is “directly” managing most of the data manipulation and the lack of consistency itself (in practice, you’ll probably end up using a third-party tool at first, though eventually you’ll bring this into your application or into a custom-written intermediate layer).
You’re looking to massively scale the amount of data you’re storing and/or your ability to search through it, with “relatively simple” data manipulation requirements.

There is some wiggle room here – you can force better read consistency, for slower reads. Various tools and options provide data manipulation apis, indexing and other options, which may be more or less suited for easily writing your specific application. So if the above points almost completely describe your application, you might be “close enough” to work with a more powerful data store solution.

Well-known examples: CouchDB, MongoDB, Redis, cloud storage solutions like Microsoft’s Azure, Google App Data Store and Amazon’s ECE.

More complex data manipulation engines

The “SQL” family of data storage applications, as well as a range of others, are better described as data manipulation tools than as pure storage engines. They provide a wide range of additional functionality, beyond storage of data, and often beyond what’s available in the key-value store side of things. You’ll want to take this path when:

You absolutely have to have read consistency, even if it means you’ll take a performance hit.
You’re looking to efficiently perform highly complex data manipulation – think of very complex JOIN and UPDATE operations, data cubes and slicing, etc…
You’re OK with accepting rigidity in exchange for performance (think forced, fixed data storage formats, such as tables, which cannot easily and/or efficiently be altered).
You have the resources to deal with an often times more complex set of tools and interfaces.

This is the more “traditional” way of thinking of a database or data store, and has been around for much longer, so there is a lot that’s available here, and there’s often a lot of complexity to deal with. It’s possible, though it takes some expertise and knowledge, to build simple solutions and avoid much of the complexity. You will most likely end up using third-party tools and libraries to manage most of it for you, though.

Well known examples are MySQL, SQL Server, Oracle’s Database, and DB2.

Outsource the work

There are several, modern, third-party tools and libraries, which interpose themselves between your data storage tools and your application, to help you manage the complexity.

They attempt to initially take away most or all of the work that goes into managing and manipulating data stores, and, ideally, allow you to make a smooth transition into complexity only when and if it is required. This is an active area of entrepreneurship and research, with a few recent results that are immediately accessible and useable.

Well-known examples are MVC tools (Django, Yii), Ruby on Rails, and Datomic. It is hard to be fair here as there are literally dozens of tools and libraries which act as wrappers around the APIs of various data stores.

PS: if you prefer videos to text, you might want to watch some of Rich Hickey’s database related videos; he does a good job of elucidating most of the thinking that goes into choosing, designing and using a data store.

A file system fits the description of a NoSQL database, so I’d say you should definitely consider using that when deciding on how to store your data and not just dismiss it offhand in favor of an RDBMS, like some answers seem to suggest here.

One issue with file systems (and NoSQL in general) is handling relationships between data. If that is not a major blocker here, then I’d say skip the RDBMS for now. Also remember the positive sides of using a file system as storage:

Zero administration
Low complexity, easy to set up
Works with any operating system, language, platform, libraries etc
Only configuration setting is the directory
Trivial to test
Trivial to examine with existing tools, backup, modify etc
Good performance characteristics and well tuned by the operating system
Easy for any developer to understand
No dependencies, no extra drivers
Security model is trivial to understand and is a base part of operating system
Data is not externally accessible

When you have simple data, like a list of things as you describe in the comments of your question, then an SQL database won’t give you much. A lot of people still use them, because they know their data can get more complicated over time, and there are a lot of libraries that make working with a database trivial.

But even a simple list that you load, hold in memory, then write when needed can suffer from a number of problems:

Abnormal program termination can lose data, and if something goes wrong while data is being written to disk, you can end up corrupting the whole file. You can roll your own mechanisms to handle this, but databases handle this for you using battle-proven techniques.
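For a sense of what “rolling your own” involves, here is one common technique, sketched in Python: write the new contents to a temporary file in the same directory, flush it to disk, then rename it over the old file. The function name `save_json_atomically` is made up for this example, and the sketch only protects against a half-written file; it does nothing for concurrent writers.

```python
import json
import os
import tempfile

def save_json_atomically(path, data):
    """Replace the file at `path` with `data` as JSON, all or nothing."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same file system for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        # os.replace swaps the new file in as a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure the original file is untouched; clean up the temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the program dies or serialization fails part-way through, the previous version of the file is still there. A database gives you this, plus recovery of partially applied multi-record changes, without extra code.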

If your data starts growing too big and updating too often, serializing all your data and saving is going to be a big resource hog and slow everything down. You’d have to start working out how to partition things, so it won’t be so expensive. Databases are optimized to save just the things that change to disk in a fault tolerant way. Also they are designed, so you can quickly just load the little bits of data you need at any given time.

Also, you don’t have to use SQL databases. You can use NoSQL “databases”, many of which just use JSON to store the data. But it is done in a fault-tolerant way, and in a way where the data can be queried and intelligently split across multiple computers.

Also, some people mix things up. They might use a NoSQL data store like Redis for storing login information. Then use relational databases to store more complex data where they need to do more interesting queries.

I see a lot of answers focus on the problem of concurrency and reliability. Databases provide other benefits besides concurrency, reliability and performance. They free you from worrying about how bytes and characters are represented in memory. In other words, databases let the programmer focus on “what” rather than “how”.

One of the answers mentions queries. “Asking an SQL database a question” scales well with the complexity of the question. As code evolves during development, simple queries such as “fetch all” can easily expand to “fetch all where property1 equals this value and then sort by property2” without making it the programmer’s concern to optimize the data structure for such a query. Most queries can be sped up by creating an index on the relevant property.

Another benefit is relations. With queries it’s cleaner to cross-reference data from different data sets than with nested loops. For example, searching for all forum posts from users that have fewer than 3 posts, in a system where users and posts are different data sets (or DB tables or JSON objects), can be done with a single query without sacrificing readability.
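The forum example could look roughly like this in SQLite, driven from Python. The `users` and `posts` tables, their columns and the sample rows are assumptions made up for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        title TEXT NOT NULL
    );
    INSERT INTO users (id, name) VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO posts (user_id, title) VALUES
        (1, 'a1'), (1, 'a2'), (1, 'a3'),
        (2, 'b1'), (2, 'b2'),
        (3, 'c1');
""")

# All posts written by users who have fewer than 3 posts in total.
query = """
    SELECT u.name, p.title
    FROM posts AS p
    JOIN users AS u ON u.id = p.user_id
    WHERE p.user_id IN (
        SELECT user_id FROM posts GROUP BY user_id HAVING COUNT(*) < 3
    )
    ORDER BY u.name, p.title
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The equivalent over two JSON lists needs one pass to count posts per user and a second pass to filter and join, and it has to be rewritten whenever the question changes.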

All in all, SQL databases are better than plain arrays if the data volume can be big (let’s say more than 1000 objects), data access is non-trivial, and different parts of the code access different subsets of the data.

File systems are a type of database. Maybe not an RDBMS like everyone else is talking about, but certainly a DB in the strictest sense. You provide keys (file names) to look up data (file contents), with abstracted storage and an API through which your program communicates.

So, you are using a Database. The other posts can argue about the virtues of different types of database…

A database is needed if you have multiple processes (users/servers) modifying the data. Then the database serves to prevent them from overwriting each other’s changes.

You also need a database when your data is larger than memory. Nowadays, with the memory we have available, this does indeed make the use of databases in many applications obsolete.

Your approach is definitely better than the nonsense of “in-memory databases”, which are essentially your approach but with a lot of overhead added.

You should always ask yourself if a particular application needs an RDBMS. Too many applications are built with a design process that automatically assumes all the required tools and frameworks at the beginning. Relational databases are so common, and so many developers have worked on similar applications before, that they’re automatically included before the project starts. Many projects can get away with this, so don’t judge too harshly.

You started your project without one, and it works. It was easier for you to get this up and running without waiting until you learned SQL. There is nothing wrong with that.

As this project expands and the requirements become more complicated, some things are going to become difficult to build. Until you research and test alternate methods, how do you know which is better? You can ask on Programmers and weed through the flames and ‘it depends’ to answer this question. Once you learn it, you can consider how many lines of code you’re willing to write in your language to handle some of the benefits of a database. At some point, you’re reinventing the wheel.

Easy is often relative. There are some frameworks that can build a web page and connect a form to a database table without requiring the user to write any code. I guess if you struggle with the mouse, this could be a problem. Everyone knows, this isn’t scalable or flexible because god forbid you’ve tightly coupled everything to the GUI. A non-programmer just built a prototype; lots of YAGNI to be found here.

If you’d rather learn an ORM manipulated by your language of choice instead of learning SQL, go for it, but try to install, create a table and pull some data out of a popular database with SQL (Select * From ; isn’t mindblowing stuff). It’s easy to do. That’s why someone created them in the first place. It doesn’t seem like such a huge investment in order to make an informed decision. You could probably do a performance test as well.

Saving the data to disk IS writing it to a database, especially if you put each object in its own file with the name of the file being the key to the record. And to minimize lookup times for reading the file, create subdirectories based on the first few characters of the key.

For instance key=ghostwriter would go in g/ho/stwriter.json or g/h/o/stwriter.json or g/ho/ghostwriter.json or g/h/o/ghostwriter.json. Choose your naming scheme based on the distribution of your keys. If they are sequence numbers then 5/4/3/12345.json is better than the other way around.
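One of the naming schemes above (first character, next two characters, then the full key as the file name) can be sketched in a few lines of Python. The helper names `key_to_path`, `save_record` and `load_record` are invented for this example, and restricting keys to alphanumeric characters is an added assumption so that a key can never escape the data directory.

```python
import json
from pathlib import Path

def key_to_path(root, key):
    """Map a key such as 'ghostwriter' to root/g/ho/ghostwriter.json."""
    key = str(key)
    # Assumption for this sketch: keys are alphanumeric and at least 3 characters,
    # which also rules out path separators and '..' in keys.
    if len(key) < 3 or not key.isalnum():
        raise ValueError(f"unsupported key: {key!r}")
    return Path(root) / key[0] / key[1:3] / f"{key}.json"

def save_record(root, key, record):
    path = key_to_path(root, key)
    path.parent.mkdir(parents=True, exist_ok=True)  # create g/ho/ on first use
    path.write_text(json.dumps(record), encoding="utf-8")

def load_record(root, key):
    return json.loads(key_to_path(root, key).read_text(encoding="utf-8"))
```

For sequence-number keys, the directory parts would be taken from the last digits instead of the first, so that new records spread evenly across directories.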

That is a database, and if it does all that you need, then do it that way. Nowadays that would be called a NoSQL database, like GDBM or Berkeley DB. So many choices. First figure out what you need, then build an interface library to deal with the details, perhaps a get/set interface like memcached or a CRUD interface, and then you will be able to swap libraries if you need to change the database format for one with different characteristics.
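A get/set interface of that kind might look like the following sketch. The class names `KeyValueStore`, `MemoryStore` and `SqliteStore`, the `kv` table and the `record_login` function are all hypothetical. The application code only talks to the interface, so the backend can be swapped without touching it.

```python
import json
import sqlite3
from abc import ABC, abstractmethod

class KeyValueStore(ABC):
    """The only storage API the rest of the application is allowed to use."""

    @abstractmethod
    def get(self, key, default=None): ...

    @abstractmethod
    def set(self, key, value): ...

class MemoryStore(KeyValueStore):
    """Simplest backend: a dict held in memory."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value

class SqliteStore(KeyValueStore):
    """Same interface, backed by a single SQLite table of JSON values."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def get(self, key, default=None):
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return default if row is None else json.loads(row[0])

    def set(self, key, value):
        with self._conn:  # commit each write
            self._conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                (key, json.dumps(value)),
            )

def record_login(store, username):
    """Application code: works with any KeyValueStore."""
    count = store.get(username, 0) + 1
    store.set(username, count)
    return count
```

Calling `record_login(MemoryStore(), "ghostwriter")` and `record_login(SqliteStore(), "ghostwriter")` runs the same application logic against two different storage formats.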

Note that some SQL databases like PostgreSQL and Apache Derby DB, will allow you to do SQL queries on top of many NoSQL formats including your own homegrown databases. Not sure about MyBatis but it may be similar.

Avoid NoSQL hype. Read about the features, test the performance and capability and then choose based upon how well it matches your application needs.

http://www.hdfgroup.org/HDF5/ is yet another interesting and widely used datastore format that people do not often consider.

As soon as the data are updated concurrently, the approach using a database (it could well be an in memory database) will likely be more correct and more performant, while at the same time your code remains easy, because you simply don’t have to worry about concurrent updates, transactions, caching, asynchronous I/O and all that.

You need a database to store and retrieve Q&As like the ones we are posting here! A simple file is unable to organize data related to different topics.


Filed under: softwareengineering - @ 16:12

Tags: database, mysql, nosql, sql
