I always wondered why Git prefers hashes over revision numbers. Revision numbers are much clearer and easier to refer to (in my opinion): there is a difference between telling someone to take a look at revision 1200 and telling them to look at commit 92ba93e! (Just to give one example.)
So, is there any reason for this design?
A single, monotonically increasing revision number only really makes sense for a centralized version control system, where all revisions flow to a single place that can track and assign numbers. Once you get into the DVCS world, where numerous copies of the repository exist and changes are being pulled from and pushed to them in arbitrary workflows, the concept just doesn’t apply. (For example, there’s no one place to assign revision numbers – if I fork your repository and you decide a year later to pull my changes, how could a system ensure that our revision numbers don’t conflict?)
You need hashes in a distributed system. Let’s say you and a colleague are both working on the same repository and you both commit a change locally and then push it. Who gets to be revision number 1200 and who is revision number 1201 given neither party has any knowledge about each other? The only realistic technical solution is to create a hash of the changes using a known method and link things up based on that.
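To make that concrete, here is a minimal sketch of how a content-derived ID can be computed on two machines with no coordination at all. It uses the blob format of a SHA-1 repository (a `blob <size>` header, a NUL byte, then the content); commits are hashed the same way with a different header. The helper name `git_blob_id` is made up for this example.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID a SHA-1 Git repository assigns to a blob."""
    # Git hashes a small header ("blob <size>" plus a NUL byte) followed by the content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Two developers hash the same content on separate machines, without talking to each other.
alice = git_blob_id(b"hello world\n")
bob = git_blob_id(b"hello world\n")

# Different content gets a different ID.
other = git_blob_id(b"hello world!\n")

print(alice == bob)    # same content, same ID
print(alice == other)  # different content, different ID
```

Nobody had to ask a central server for "the next number": the ID falls out of the content itself, so both sides agree on it automatically.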
Interestingly, Mercurial (hg) does support revision numbers, but they are explicitly a local-only feature: your repository has one set, and your co-worker’s repo will have a different set depending on how they pushed and pulled. It does make command-line usage a bit friendlier than Git, though.
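A toy model shows why such numbers can only be local. This is not how Mercurial is implemented; the `Clone` class and the commit IDs are invented for illustration, and the only assumption is that each clone numbers commits in the order they arrive.

```python
class Clone:
    """Toy repository that numbers commits in the order they arrive locally."""

    def __init__(self):
        self.order = []  # commit IDs in local arrival order

    def add(self, commit_id):
        # Ignore commits we already have, append new ones at the end.
        if commit_id not in self.order:
            self.order.append(commit_id)

    def pull(self, other):
        # Take every commit from the other clone that we do not have yet.
        for commit_id in other.order:
            self.add(commit_id)

    def local_number(self, commit_id):
        return self.order.index(commit_id)

mine, theirs = Clone(), Clone()
mine.add("base")
theirs.add("base")

mine.add("aaa111")    # my local commit (hypothetical ID)
theirs.add("bbb222")  # my co-worker's local commit (hypothetical ID)

mine.pull(theirs)
theirs.pull(mine)

numbers_mine = {c: mine.local_number(c) for c in mine.order}
numbers_theirs = {c: theirs.local_number(c) for c in theirs.order}
print(numbers_mine)
print(numbers_theirs)
```

Both clones end up with the same three commits, but `aaa111` is number 1 in one clone and number 2 in the other, so "revision 1" means different things to the two of us.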
Data integrity.
I respectfully disagree with the current answers. Hashes are not necessary for a DVCS; see the Bazaar way. Any other kind of globally unique identifier would do as well. The hashes are a measure to guarantee data integrity: they are a digest of the information contained in the object (commit, tree, …) that the hash refers to. Altering the contents without altering the hash (i.e., a second-preimage or collision attack) is believed to be difficult, although not impossible. (If you’re really into it, take a look at the 2011 paper by Marc Stevens.)
Hence, referring to objects by their SHA hash makes it possible to check whether the contents have been tampered with. And, given that the hashes are (almost) guaranteed to be unique, they can conveniently be used as revision identifiers, too.
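Here is a rough sketch of the tamper-detection idea. It does not use Git's real commit format: `commit_id` and `build_history` are hypothetical helpers, and the only assumption is that each ID is a SHA-1 over the parent's ID plus the commit's own content and message.

```python
import hashlib

def commit_id(parent_id: str, content: str, message: str) -> str:
    """Toy commit ID: a SHA-1 over the parent ID, the content, and the message."""
    payload = f"parent {parent_id}\n{content}\n{message}".encode()
    return hashlib.sha1(payload).hexdigest()

def build_history(snapshots):
    """Return the list of commit IDs for a linear history of (content, message) pairs."""
    ids, parent = [], ""
    for content, message in snapshots:
        parent = commit_id(parent, content, message)
        ids.append(parent)
    return ids

original = build_history([("v1", "init"), ("v2", "fix bug"), ("v3", "add feature")])

# Someone quietly rewrites the very first snapshot.
tampered = build_history([("v1 + backdoor", "init"), ("v2", "fix bug"), ("v3", "add feature")])

for good, bad in zip(original, tampered):
    print(good[:7], bad[:7], good == bad)
```

Because every ID covers its parent's ID, changing an old snapshot changes every ID after it. If you know the hash of the latest commit, you can detect a modification anywhere in its history.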
See Chapter 9 of the Git book for more details.
In layman’s words:
- Hashes are intended to be nearly universally unique. Uniqueness is NOT guaranteed, but it is extremely unlikely that the same SHA is generated for different content. In practical terms, for a given project you can treat a hash as unique.
- With revision numbers, you would have to use a namespace in order to refer specifically to revision 1200.
- Git can work distributed and/or centralized. So how do you get revision numbers correct and unique?
- Also, using revision numbers would create the false expectation that newer revisions have higher numbers, which would not be true because of branching, merging, rebasing, etc.
- You always have the option of tagging commits.
In mathematical terms:
- A total order over Git’s commits would be required for monotonically increasing version numbers.
- Git’s commits form a directed, acyclic graph (DAG) that can only be ordered partially / topologically.
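The two points above can be checked on a tiny, made-up history with one fork and one merge. The commit names and the `parents` mapping are hypothetical; the sketch only asks whether one commit is an ancestor of another.

```python
# Hypothetical history: each commit maps to the list of its parents.
parents = {
    "A": [],
    "B": ["A"],       # commit on branch 1
    "C": ["A"],       # commit on branch 2
    "D": ["B", "C"],  # merge of both branches
}

def ancestors(commit):
    """Return every commit reachable by following parent links."""
    seen, stack = set(), list(parents[commit])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def comparable(x, y):
    """Two commits are ordered only if one is an ancestor of the other."""
    return x == y or x in ancestors(y) or y in ancestors(x)

print(comparable("A", "D"))  # A is an ancestor of D
print(comparable("B", "C"))  # neither is an ancestor of the other
```

`B` and `C` have no defined order between them, so both `A, B, C, D` and `A, C, B, D` are equally valid numberings. Any single increasing number would have to pick one arbitrarily.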
Hashes are not the only solution for a distributed VCS. But when dealing with a distributed system, only the partial ordering of events can be recorded. (For a VCS, an event can be a commit.) That is why maintaining a monotonically increasing revision number is impossible. Usually something like a vector clock (or vector timestamp) is adopted to record such a partially ordered relation. As far as I know, this is the solution used in Bazaar.
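For readers who have not met vector clocks, here is a minimal sketch of how they capture a partial order. It is a generic textbook-style illustration, not code from Bazaar or any other VCS; the clocks are plain dicts from a (made-up) author name to a counter, and `merge` is simplified to an element-wise maximum.

```python
def compare(a, b):
    """Compare two vector clocks: 'before', 'after', 'equal', or 'concurrent'."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

def merge(a, b):
    """Element-wise maximum: the clock of an event that has seen both a and b."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

base = {"alice": 1}
alice_commit = {"alice": 2}          # Alice commits on top of base
bob_commit = {"alice": 1, "bob": 1}  # Bob commits on top of base, independently
merged = merge(alice_commit, bob_commit)

print(compare(base, alice_commit))        # before
print(compare(alice_commit, bob_commit))  # concurrent
print(compare(alice_commit, merged))      # before
```

The "concurrent" result is exactly the case where no revision number can honestly say which commit came first.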
But why does Git use hashes rather than vector clocks? I think the root cause is cherry-pick. When we cherry-pick in a repository, the partial ordering of commits changes, so some commits’ vector clocks must be reassigned to represent the new partial ordering. However, such reassignment in a distributed system would produce inconsistent vector clocks. That is the real problem that hashes deal with.