It seems that most source control systems still use files as the means of storing version data. Vault and TFS use SQL Server as their data store, which I would think would be better for data consistency as well as speed.
So why do SVN, Git (I believe), CVS, etc. still use the file system as essentially a database (I ask because our SVN server just corrupted itself during a normal commit) instead of using actual database software (MSSQL, Oracle, Postgres, etc.)?
EDIT: I think another way of asking my question is “why do VCS developers roll their own structured data storage system instead of using an existing one?”
9
TL;DR: Few version control systems use a database because it isn’t necessary.
To answer a question with a question: why wouldn’t they? What benefits do “real” database systems offer over a file system in this context?
Consider that revision control is mostly keeping track of a little metadata and a lot of text diffs. Databases do not store text more efficiently, and indexability of the contents isn’t going to be a factor.
Let’s presume that Git (for argument’s sake) used a BDB or SQLite DB for its back-end to store data. What would be more reliable about that? Anything that can corrupt simple files can also corrupt the database (since the database is also just a file, with a more complex encoding).
Following the programmer’s maxim of not optimizing unless it’s necessary: if the revision control system is fast enough and works reliably enough, why change the entire design to use a more complex system?
15
You seem to be making a lot of assumptions, possibly based on your experience with SVN and CVS.
Git and Mercurial are basically like SVN and CVS
Comparing git and CVS is like comparing an iPad and an Atari. CVS was created back when dinosaurs roamed the Earth. Subversion is basically an improved version of CVS. Assuming that modern version control systems like git and Mercurial work like them makes very little sense.
A relational database is more efficient than a single-purpose database
Why? Relational databases are really complicated, and may not be as efficient as single-purpose databases. Some differences off the top of my head:
- Version control systems don’t need complicated locking, since you can’t do multiple commits at the same time anyway.
- Distributed version control systems need to be extremely space efficient, since the local database is a full copy of the repo.
- Version control systems only need to look up data in a couple specific ways (by author, by revision ID, sometimes full-text search). Making your own database that can handle author/revision ID searches is trivial and full-text searches aren’t very fast in any relational database I’ve tried.
- Version control systems need to work on multiple platforms. This makes it harder to use a database that needs to be installed and running as a service (like MySQL or PostgreSQL).
- Version control systems on your local machine only need to be running when you’re doing something (like a commit). Leaving a service like MySQL running all the time just in case you want to do a commit is wasteful.
- For the most part, version control systems never want to delete history, just append to it. That may lead to different optimizations, and different methods of protecting integrity.
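The append-only, look-up-by-ID storage pattern described in the list above can be sketched in a few lines of Python. This is a hypothetical illustration (all names are invented, and no real VCS is this simple): each revision is written once under its ID and never modified afterwards.

```python
import os
import tempfile

# Hypothetical sketch of an append-only revision store: lookups happen
# only by revision ID, and existing revisions are never overwritten.
class AppendOnlyStore:
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, rev_id):
        return os.path.join(self.root, rev_id)

    def put(self, rev_id, data):
        if os.path.exists(self._path(rev_id)):
            raise ValueError("history is append-only; %s already exists" % rev_id)
        with open(self._path(rev_id), "wb") as f:
            f.write(data)

    def get(self, rev_id):
        with open(self._path(rev_id), "rb") as f:
            return f.read()

store = AppendOnlyStore(tempfile.mkdtemp())
store.put("r1", b"first revision")
```

Because nothing is ever updated in place, there is no need for row-level locking or multi-statement transactions, which is exactly why a relational engine would be overkill here.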
Relational databases are safer
Again, why? You seem to be assuming that because data is stored in files, version control systems like git and Mercurial don’t have atomic commits, but they do. Relational databases also store their databases as files. It’s notable here that CVS doesn’t do atomic commits, but that’s likely because it’s from the dark ages, not because it doesn’t use a relational database.
There’s also the issue of protecting the data from corruption once it’s in the database, and again the answer is the same. If the filesystem is corrupted, then it doesn’t matter which database you’re using. If the filesystem isn’t corrupted, then the corruption came from a bug in the storage engine itself, and I don’t see why a version control system’s storage code would be more prone to such bugs than a relational database engine.
I would argue that distributed version control systems (like git and Mercurial) are better for protecting your database than centralized version control, since you can restore the entire repo from any clone. So, if your central server spontaneously combusts, along with all of your backups, you can restore it by running git init on the new server, then git push from any developer’s machine.
Reinventing the wheel is bad
Just because you can use a relational database for any storage problem doesn’t mean you should. Why do you use configuration files instead of a relational database? Why store images on the filesystem when you could store the data in a relational database? Why keep your code on the filesystem when you could store it all in a relational database?
“If all you have is a hammer, everything looks like a nail.”
There’s also the fact that open-source projects can afford to reinvent the wheel whenever it’s convenient, since they don’t have the same kinds of resource constraints that commercial projects do. If you have a volunteer who’s an expert at writing databases, then why not use them?
As for why we would trust the writers of revision control systems to know what they’re doing: I can’t speak for other VCSes, but I’m pretty confident that Linus Torvalds understands filesystems.
Why do some commercial version control systems use a relational database then?
Most likely some combination of the following:
- Some developers don’t want to write databases.
- Developers of commercial version control systems have time and resource constraints, so they can’t afford to write a database when they have something close to what they want already. Also, developers are expensive, and database developers (as in, people who write databases) are probably more expensive, since most people don’t have that kind of experience.
- Users of commercial version control systems are less likely to care about the overhead of setting up and running a relational database, since they already have one.
- Users of commercial version control systems are more likely to want a relational database backing their revision data, since this may integrate with their processes better (like backups for example).
13
Actually, svn used to use BDB for repositories. This was eventually dropped because it was prone to breakage.
Another VCS that currently uses a DB (SQLite) is fossil. It also integrates a bug tracker.
My guess at the real reason is that VCSes work with lots of files. Filesystems are just another kind of database (hierarchical, focused on CLOB/BLOB storage efficiency). Normal databases don’t handle that well because there’s no reason to — filesystems already exist.
7
- A filesystem is a database. Not a relational database, of course, but most are very efficient key/value stores. And if your access patterns are well-designed for a key-value store (eg, the git repository format), then using a database probably doesn’t offer significant advantages over using the filesystem. (In fact, it’s just another layer of abstraction to get in the way.)
- A lot of the database features are just extra baggage. Full text search? Does full text search make sense for source code? Or do you need to tokenize it differently? It also requires that you store full files at every revision, which is uncommon. Many version control systems store deltas between revisions of the same file in order to save space, for example Subversion and Git (at least when using pack files).
- The cross-platform requirements make using a database more challenging.
Most version control tools are built to run on multiple platforms. For centralized version control tools, this only affects the server component, but it is still difficult to rely upon a single database server, since Unix users cannot install Microsoft SQL Server and Windows users may be unwilling to install PostgreSQL or MySQL. The filesystem is the least common denominator. However, there are several tools whose server must be installed on a Windows machine and which thus can require SQL Server, for example SourceGear Vault and Microsoft Team Foundation Server.
Distributed version control systems make this more challenging still, since every user gets a copy of the repository. This means that every user needs a database to put the repository into. This implies that the software:
- Is limited to a subset of platforms where a particular database exists
- Targets a single database backend that is cross-platform (eg, SQLite).
- Targets a pluggable storage backend, so that one could use whatever database they wished (possibly including the filesystem).
Most distributed version control systems, therefore, just use the filesystem. A notable exception is SourceGear’s Veracity, which can store in a SQLite database (useful for local repositories) or a relational database like SQL Server (possibly useful for a server.) Their cloud hosted offering may use a non-relational storage backend like Amazon SimpleDB, but I do not know this to be true.
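The delta storage mentioned above (keeping only the difference between revisions of a file, rather than a second full copy) can be illustrated with a toy example using Python’s difflib. This is purely illustrative: real systems use compact binary delta formats (as in Subversion’s storage and git’s pack files), not human-readable unified diffs.

```python
import difflib

# Two revisions of the same source file, as lists of lines.
old = ["def greet():\n", "    print('hello')\n"]
new = ["def greet(name):\n", "    print('hello', name)\n"]

# Store only the delta between the revisions instead of the full new file.
delta = list(difflib.unified_diff(old, new, fromfile="r1", tofile="r2"))
print("".join(delta))
```

The delta is typically far smaller than the full file, which matters a great deal when every clone carries the whole history.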
1
As far as I’ve seen in many offerings, files are “good enough” for the job, which is reasonable considering that, at the end of the day, a VCS’s output is also files.
There are many companies that offer an RDBMS back end with an svn/git/etc. interface, so what you are asking for basically already exists.
I would say it’s because the primary data structure of a version control system is a DAG, which maps to databases very poorly. A lot of the data is also content addressable, which also maps to databases very poorly.
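The content-addressable storage just mentioned maps very naturally onto a filesystem, though. The sketch below follows the general shape of git’s loose-object scheme (header, SHA-1 key, zlib compression, two-character directory fan-out); treat the details as an approximation rather than a spec:

```python
import hashlib
import os
import tempfile
import zlib

# Content-addressed storage on the filesystem, in the style of git's
# loose objects: the SHA-1 of the encoded content is the key, and the
# first two hex digits become a subdirectory to keep listings small.
def store_blob(objects_dir, content):
    data = b"blob %d\x00" % len(content) + content
    sha = hashlib.sha1(data).hexdigest()
    path = os.path.join(objects_dir, sha[:2], sha[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(data))
    return sha

def load_blob(objects_dir, sha):
    path = os.path.join(objects_dir, sha[:2], sha[2:])
    with open(path, "rb") as f:
        data = zlib.decompress(f.read())
    return data.split(b"\x00", 1)[1]  # strip the "blob <len>" header

objects = tempfile.mkdtemp()
sha = store_blob(objects, b"hello world\n")
```

In a relational database, the same scheme would reduce to a single two-column table keyed by hash, which is exactly the “poor fit” being described: the relational machinery adds nothing.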
Data integrity isn’t the only concern of a VCS; they are also concerned with version history integrity, which databases aren’t very good at. In other words, when you retrieve a version, you not only need to make sure that version has no current flaws, but also that nothing in its entire history has been surreptitiously altered.
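That history-integrity property falls out of content addressing, and can be sketched in a few lines (a simplified model, not any VCS’s actual format): each commit’s ID covers both its content and its parent’s ID, so altering any ancestor changes the IDs of every descendant.

```python
import hashlib

# Each commit ID is a hash over the parent's ID plus this commit's
# content, forming a chain: tampering with any ancestor changes every
# ID that was built on top of it.
def commit_id(content, parent_id):
    return hashlib.sha256(parent_id.encode() + content).hexdigest()

c1 = commit_id(b"first version", "")
c2 = commit_id(b"second version", c1)

# Surreptitiously altering the first commit changes its ID...
tampered_c1 = commit_id(b"FIRST version", "")
# ...which would change every descendant ID, making the edit detectable.
tampered_c2 = commit_id(b"second version", tampered_c1)
```

A relational database guarantees that a row you read is the row that was written, but it has no built-in notion of “this row’s entire ancestry is unmodified”, which is the guarantee a VCS actually needs.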
VCSes are also a consumer product in addition to an enterprise product. People use them in small, one-man hobby projects. If you add the hassle of installing and configuring a database server, you will alienate much of that part of the market. I’m guessing you don’t see a lot of Vault and TFS installations at home. It’s the same reason spreadsheets and word processors don’t use databases.
Also, this is more a reason for DVCS, but not using a database makes it extremely portable. I can copy my source tree onto a thumb drive and reuse it on any machine, without having to configure a database server process.
As for corruption during commits: VCSes use the exact same techniques as databases to prevent simultaneous access, make transactions atomic, and so on. Corruption in both is very rare, but it does happen. For all intents and purposes, a VCS data store is a database.
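One of those shared techniques is the write-then-rename trick for atomic updates. A minimal sketch in Python (the function and file names here are invented for illustration): the new state is written to a temporary file, flushed to disk, then renamed over the old file in one step, so a reader never observes a half-written file even if the process crashes mid-write.

```python
import os
import tempfile

# Atomic file update via write-then-rename: readers see either the old
# contents or the new contents, never a partially written file.
def atomic_write(path, data):
    dirname = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the data to disk before renaming
        os.replace(tmp, path)     # the atomic step
    except BaseException:
        os.unlink(tmp)
        raise

target = os.path.join(tempfile.mkdtemp(), "HEAD")
atomic_write(target, b"ref: refs/heads/master\n")
```

Journaling databases and journaling filesystems protect against the same failure mode with the same basic idea: never modify the only copy of the data in place.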
9
- Better disaster recovery (worst-case scenario: we’ll parse it by eye, like in the old times).
- Making it easier to track and debug such disasters, possibly caused by faults in the VCS itself.
- Lowering the number of dependencies (let’s not forget one of those systems is handling the kernel, and the other was supposed to).
- A text editor is always available. (MS SQL Server licenses... not so much.)
10
Fossil is an excellent distributed version control system (DVCS) and uses SQLite for storage rather than plain files.
I really like that it has bug tracking and a wiki integrated, and that it is truly distributed: you can genuinely work offline and fix bugs.
Fossil uses SQLite as its application file format.
In his keynote at PgCon, Dr. Richard Hipp explains the advantages of using SQLite as an application file format, and makes a pretty convincing argument for using a database as a file format.
The second main topic was that SQLite should be seen as an application file format: an alternative to inventing your own file format or using ZIPped XML. The statement “SQLite is not a replacement for PostgreSQL. SQLite is a replacement for fopen()” nails that (slide 21). Finally, Richard put a lot of emphasis on the fact that SQLite takes care of your data (crash safe, ACID). (use-the-index.com)
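A minimal sketch of what “SQLite as an application file format” looks like in practice. The schema below is invented for illustration (it is not Fossil’s actual schema): the point is that all project state, content and tickets alike, lives in one crash-safe, ACID database file instead of a hand-rolled on-disk format.

```python
import os
import sqlite3
import tempfile

# One SQLite file is the whole "repository": hypothetical schema for
# illustration only, covering versioned content and bug-tracker tickets.
db_path = os.path.join(tempfile.mkdtemp(), "project.db")
con = sqlite3.connect(db_path)
con.execute("CREATE TABLE blob (id INTEGER PRIMARY KEY, content BLOB)")
con.execute("CREATE TABLE ticket (id INTEGER PRIMARY KEY, title TEXT)")

with con:  # one ACID transaction covering both inserts
    con.execute("INSERT INTO blob (content) VALUES (?)", (b"hello\n",))
    con.execute("INSERT INTO ticket (title) VALUES (?)", ("first bug",))
```

If the process crashes mid-transaction, SQLite’s journal rolls the file back to a consistent state on the next open, which is exactly the crash-safety claim quoted above.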
Dr. Hipp has also addressed the concern about storing code in a database:
- Why is Fossil based on SQLite instead of a distributed NoSQL database?
Fossil is not based on SQLite. The current implementation of Fossil uses SQLite as a local store for the content of the distributed database and as a cache for meta-information about the distributed database that is precomputed for quick and easy presentation. But the use of SQLite in this role is an implementation detail and is not fundamental to the design. Some future version of Fossil might do away with SQLite and substitute a pile-of-files or a key/value database in place of SQLite. (Actually, that is very unlikely to happen since SQLite works amazingly well in its current role, but the point is that omitting SQLite from Fossil is a theoretical possibility.)