Which hashing algorithm is best for uniqueness and speed? Example (good) uses include hash dictionaries.
I know there are things like SHA-256 and such, but these algorithms are designed to be secure, which usually means they are slower than algorithms that are less unique. I want a hash algorithm designed to be fast, yet remain fairly unique to avoid collisions.
Answer (score 30)
I tested some different algorithms, measuring speed and number of collisions.
I used three different key sets:
- A list of 216,553 English words (in lowercase)
- The numbers "1" to "216553" (think ZIP codes, and how a poor hash took down msn.com)
- 216,553 "random" (i.e. Type 4 UUID) GUIDs
For each corpus, the number of collisions and the average time spent hashing was recorded.
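Here's a minimal sketch (not the original test harness) of how the collision counting can be done: hash every line of a word list, sort the 32-bit results, and count adjacent duplicates. FNV-1a stands in for whichever hash is under test.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* FNV-1a as a stand-in; swap in any 32-bit hash under test. */
static uint32_t hash32(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    static uint32_t hashes[216553];
    size_t n = 0, collisions = 0;
    char line[128];

    /* read one key per line from stdin */
    while (n < sizeof hashes / sizeof hashes[0] &&
           fgets(line, sizeof line, stdin)) {
        line[strcspn(line, "\r\n")] = '\0';
        hashes[n++] = hash32(line);
    }

    /* sort, then count equal neighbors: each one is a colliding pair */
    qsort(hashes, n, sizeof hashes[0], cmp_u32);
    for (size_t i = 1; i < n; i++)
        if (hashes[i] == hashes[i - 1])
            collisions++;

    printf("%zu keys, %zu collisions\n", n, collisions);
    return 0;
}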
I tested:
- DJB2
- DJB2a (variant using XOR rather than +)
- FNV-1 (32-bit)
- FNV-1a (32-bit)
- SDBM
- CRC32
- Murmur2 (32-bit)
- SuperFastHash
Results
Each result contains the average hash time and the number of collisions:

Hash           Lowercase      Random UUID  Numbers
=============  =============  ===========  ==============
Murmur            145 ns      259 ns          92 ns
                    6 collis    5 collis       0 collis
FNV-1a            152 ns      504 ns          86 ns
                    4 collis    4 collis       0 collis
FNV-1             184 ns      730 ns          92 ns
                    1 collis    5 collis       0 collis▪
DJB2a             158 ns      443 ns          91 ns
                    5 collis    6 collis       0 collis▪▪▪
DJB2              156 ns      437 ns          93 ns
                    7 collis    6 collis       0 collis▪▪▪
SDBM              148 ns      484 ns          90 ns
                    4 collis    6 collis       0 collis▪▪▪
SuperFastHash     164 ns      344 ns         118 ns
                   85 collis    4 collis   18742 collis
CRC32             250 ns      946 ns         130 ns
                    2 collis    0 collis       0 collis
LoseLose          338 ns      -               -
               215178 collis
Notes:
- The LoseLose algorithm (where hash = hash + character) is truly awful. Everything collides into the same 1,375 buckets.
- SuperFastHash is fast, with things looking pretty scattered; but my goodness, the number of collisions. I'm hoping the guy who ported it got something wrong; it's pretty bad.
- CRC32 is pretty good. It's slower, though, and carries a 1k lookup table (see the sketch below).
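For the curious, here's a sketch of where that 1k table comes from: a table-driven CRC-32 (the IEEE/zlib polynomial) precomputes 256 32-bit entries, exactly 1 KiB, and then processes one byte per lookup. This is the generic textbook construction, not the exact code I tested.

#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[256];   /* 256 x 4 bytes = the 1k lookup table */

/* call once before hashing */
static void crc32_init(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[n] = c;
    }
}

static uint32_t crc32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t c = 0xFFFFFFFFu;     /* standard initial value */
    while (len--)
        c = crc_table[(c ^ *p++) & 0xFF] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;       /* standard final xor */
}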
Do collisions actually happen?
Yes. I started writing my test program to see if hash collisions actually happen – and are not just a theoretical construct. They do indeed happen:
FNV-1 collisions:
- creamwove collides with quists

FNV-1a collisions:
- costarring collides with liquid
- declinate collides with macallums
- altarage collides with zinke
- altarages collides with zinkes

Murmur2 collisions:
- cataract collides with periti
- roquette collides with skivie
- shawl collides with stormbound
- dowlases collides with tramontane
- cricketings collides with twanger
- longans collides with whigs

DJB2 collisions:
- hetairas collides with mentioner
- heliotropes collides with neurospora
- depravement collides with serafins
- stylist collides with subgenera
- joyful collides with synaphea
- redescribed collides with urites
- dram collides with vivency

DJB2a collisions:
- haggadot collides with loathsomenesses
- adorablenesses collides with rentability
- playwright collides with snush
- playwrighting collides with snushing
- treponematoses collides with waterbeds

CRC32 collisions:
- codding collides with gnu
- exhibiters collides with schlager

SuperFastHash collisions:
- dahabiah collides with drapability
- encharm collides with enclave
- grahams collides with gramary
- …snip 79 collisions…
- night collides with vigil
- nights collides with vigils
- finks collides with vinic
Randomnessification
The other subjective measure is how randomly distributed the hashes are. Mapping the resulting HashTables shows how evenly the data is distributed. All the hash functions show good distribution when mapping the table linearly:
Or as a Hilbert Map (XKCD is always relevant):
Except when hashing number strings ("1", "2", …, "216553") (for example, ZIP codes), where patterns begin to emerge in most of the hashing algorithms:
SDBM:
DJB2a:
FNV-1:
All except FNV-1a, which still looks pretty random to me:
In fact, Murmur2 seems to have even better randomness with Numbers than FNV-1a:
When I look at the FNV-1a "number" map, I think I see subtle vertical patterns. With Murmur I see no patterns at all. What do you think?
The extra ▪ marks in the table denote how bad the randomness is, with FNV-1a being the best and LoseLose being the worst:
Murmur2:       .
FNV-1a:        .
FNV-1:         ▪
DJB2:          ▪▪
DJB2a:         ▪▪
SDBM:          ▪▪▪
SuperFastHash: .
CRC:           ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
LoseLose:      ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
               ▪
               ▪▪▪▪▪▪▪▪▪▪▪▪▪
               ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
               ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
I originally wrote this program to decide if I even had to worry about collisions: I do.
And then it turned into making sure that the hash functions were sufficiently random.
FNV-1a algorithm
The FNV-1 hash comes in variants that return 32-, 64-, 128-, 256-, 512- and 1024-bit hashes.
The FNV-1a algorithm is:
hash = FNV_offset_basis
for each octetOfData to be hashed
    hash = hash xor octetOfData
    hash = hash * FNV_prime
return hash
where the constants FNV_offset_basis and FNV_prime depend on the return hash size you want:
Hash Size
===========
32-bit
    prime: 2^24 + 2^8 + 0x93 = 16777619
    offset: 2166136261
64-bit
    prime: 2^40 + 2^8 + 0xb3 = 1099511628211
    offset: 14695981039346656037
128-bit
    prime: 2^88 + 2^8 + 0x3b = 309485009821345068724781371
    offset: 144066263297769815596495629667062367629
256-bit
    prime: 2^168 + 2^8 + 0x63 = 374144419156711147060143317175368453031918731002211
    offset: 100029257958052580907070968620625704837092796014241193945225284501741471925557
512-bit
    prime: 2^344 + 2^8 + 0x57 = 35835915874844867368919076489095108449946327955754392558399825615420669938882575126094039892345713852759
    offset: 9659303129496669498009435400716310466090418745672637896108374329434462657994582932197716438449813051892206539805784495328239340083876191928701583869517785
1024-bit
    prime: 2^680 + 2^8 + 0x8d = 5016456510113118655434598811035278955030765345404790744303017523831112055108147451509157692220295382716162651878526895249385292291816524375083746691371804094271873160484737966720260389217684476157468082573
    offset: 1419779506494762106872207064140321832088062279544193396087847491461758272325229673230371772250864096521202355549365628174669108571814760471015076148029755969804077320157692458563003215304957150157403644460363550505412711285966361610267868082893823963790439336411086884584107735010676915
See the main FNV page for details.
All my results are with the 32-bit variant.
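For concreteness, here's the 32-bit variant written out in C, a minimal sketch using the constants from the table above:

#include <stdint.h>
#include <stddef.h>

uint32_t fnv1a_32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t hash = 2166136261u;      /* FNV_offset_basis (32-bit) */

    for (size_t i = 0; i < len; i++) {
        hash ^= p[i];                 /* xor the octet in first...     */
        hash *= 16777619u;            /* ...then multiply by FNV_prime */
    }
    return hash;
}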
FNV-1 better than FNV-1a?
No. FNV-1a is all-around better. There were more collisions with FNV-1a when using the English word corpus:
Hash Word Collisions
====== ===============
FNV-1 1
FNV-1a 4
Now compare lowercase and uppercase:
Hash lowercase word Collisions UPPERCASE word collisions
====== ========================= =========================
FNV-1 1 9
FNV-1a 4 11
In this case FNV-1a isn't "400%" worse than FNV-1, only 20% worse.
I think the more important takeaway is that there are two classes of algorithms when it comes to collisions:
- collisions rare: FNV-1, FNV-1a, DJB2, DJB2a, SDBM
- collisions common: SuperFastHash, LoseLose

And then there's how evenly distributed the hashes are:

- outstanding distribution: Murmur2, FNV-1a, SuperFastHash
- excellent distribution: FNV-1
- good distribution: SDBM, DJB2, DJB2a
- horrible distribution: LoseLose
Update: Murmur? Sure, why not.
Update: @whatshisname wondered how a CRC32 would perform, so I added numbers to the table.

CRC32 is pretty good: few collisions, but it's slower, and there's the overhead of a 1k lookup table.

(Snipped all the erroneous stuff about CRC distribution – my bad.)
Up until today I was going to use FNV-1a as my de facto hash-table hashing algorithm. But now I’m switching to Murmur2:
- Faster
- Better randomnessification of all classes of input
And I really, really hope there's something wrong with the SuperFastHash algorithm I found; it's too bad to be as popular as it is.
Update: From the MurmurHash3 homepage on Google:
(1) – SuperFastHash has very poor collision properties, which have been documented elsewhere.
So I guess it’s not just me.
Update: I realized why Murmur is faster than the others. MurmurHash2 operates on four bytes at a time. Most algorithms are byte by byte:

for each octet in Key
    AddTheOctetToTheHash
This means that as keys get longer Murmur gets its chance to shine.
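To make the four-bytes-at-a-time point concrete, here's a sketch of MurmurHash2 adapted from Austin Appleby's public-domain reference implementation (memcpy replaces the original's unaligned word read; the result still assumes a little-endian host):

#include <stdint.h>
#include <string.h>

uint32_t MurmurHash2(const void *key, int len, uint32_t seed)
{
    const uint32_t m = 0x5bd1e995;
    const int r = 24;
    uint32_t h = seed ^ (uint32_t)len;
    const unsigned char *data = key;

    while (len >= 4) {                    /* the whole-word main loop */
        uint32_t k;
        memcpy(&k, data, 4);
        k *= m; k ^= k >> r; k *= m;
        h *= m; h ^= k;
        data += 4; len -= 4;
    }
    switch (len) {                        /* handle the last 0-3 bytes */
    case 3: h ^= (uint32_t)data[2] << 16; /* fall through */
    case 2: h ^= (uint32_t)data[1] << 8;  /* fall through */
    case 1: h ^= data[0];
            h *= m;
    }
    h ^= h >> 13; h *= m; h ^= h >> 15;   /* final avalanche */
    return h;
}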
Update: GUIDs are designed to be unique, not random
A timely post by Raymond Chen reiterates the fact that “random” GUIDs are not meant to be used for their randomness. They, or a subset of them, are unsuitable as a hash key:
Even the Version 4 GUID algorithm is not guaranteed to be unpredictable, because the algorithm does not specify the quality of the random number generator. The Wikipedia article for GUID contains primary research which suggests that future and previous GUIDs can be predicted based on knowledge of the random number generator state, since the generator is not cryptographically strong.
Randomness is not the same as collision avoidance, which is why it would be a mistake to try to invent your own "hashing" algorithm by taking some subset of a "random" GUID:
int HashKeyFromGuid(Guid type4uuid)
{
    // A "4" is put somewhere in the GUID.
    // I can't remember exactly where, but it doesn't matter for
    // the illustrative purposes of this pseudocode.
    int guidVersion = ((type4uuid.D3 & 0x0f00) >> 8);
    Assert(guidVersion == 4);

    return (int)GetFirstFourBytesOfGuid(type4uuid);
}
Note: Again, I put "random GUID" in quotes because it's the "random" variant of GUIDs. A more accurate description would be Type 4 UUID. But nobody knows what Type 4, or Types 1, 3 and 5, are. So it's just easier to call them "random" GUIDs.
All English Words mirrors
- https://web.archive.org/web/20070221060514/http://www.sitopreferito.it/html/all_english_words.html
- https://drive.google.com/file/d/0B3BLwu7Vb2U-dEw1VkUxc3U4SG8/view?usp=sharing
Answer (score 85)
If you want to create a hash map from an unchanging dictionary, you might want to consider perfect hashing (https://en.wikipedia.org/wiki/Perfect_hash_function): during the construction of the hash function and hash table, you can guarantee, for a given dataset, that there will be no collisions.
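A toy sketch of the idea (a hypothetical construction with made-up keys, not a production perfect-hash generator): brute-force a seed for an ordinary string hash until the fixed key set maps into the table with no collisions, then ship that seed alongside the table.

#include <stdint.h>
#include <stdio.h>

/* seeded FNV-style string hash; the seed is what we search over */
static uint32_t hash_seeded(const char *s, uint32_t seed)
{
    uint32_t h = seed;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

int main(void)
{
    const char *keys[] = { "alpha", "bravo", "charlie", "delta", "echo" };
    const size_t n = sizeof keys / sizeof keys[0];
    enum { TABLE_SIZE = 8 };              /* fixed table for the fixed key set */

    for (uint32_t seed = 0; ; seed++) {   /* search for a collision-free seed */
        int used[TABLE_SIZE] = {0}, ok = 1;
        for (size_t i = 0; i < n && ok; i++)
            if (used[hash_seeded(keys[i], seed) % TABLE_SIZE]++)
                ok = 0;                   /* collision: try the next seed */
        if (ok) {
            printf("collision-free seed: %u\n", seed);
            return 0;
        }
    }
}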
Answer (score 12)
Here is a list of hash functions, but the short version is:
If you just want to have a good hash function and cannot wait, djb2 is one of the best string hash functions I know. It has excellent distribution and speed on many different sets of keys and table sizes:
unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++))                 /* extra parens: assignment intended */
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
Answer (score 5)
CityHash by Google is the algorithm you are looking for. It is not good for cryptography but is good for generating unique hashes.
Read the blog for more details and the code is available here.
CityHash is written in C++. There also is a plain C port.
About 32-bit support:
All the CityHash functions are tuned for 64-bit processors. That said, they will run (except for the new ones that use SSE4.2) in 32-bit code. They won’t be very fast though. You may want to use Murmur or something else in 32-bit code.
Answer (score 5)
I’ve plotted a short speed comparison of different hashing algorithms when hashing files.
The individual plots only differ slightly in the reading method and can be ignored here, since all files were stored in a tmpfs. The benchmark was therefore not IO-bound, if you are wondering.
- Linear Scale
- Logarithmic scale
Algorithms include: SpookyHash, CityHash, Murmur3, MD5, SHA{1,256,512}.
Conclusions:
- Non-cryptographic hash functions like Murmur3, CityHash and Spooky are pretty close together. One should note that CityHash may be faster on CPUs with the SSE 4.2 CRC instruction, which my CPU does not have. SpookyHash was in my case always a tiny bit ahead of CityHash.
- MD5 seems to be a good tradeoff when using cryptographic hash functions, although SHA256 may be more secure against the collision vulnerabilities of MD5 and SHA1.
- The complexity of all algorithms is linear – which is really not surprising since they work blockwise. (I wanted to see if the reading method makes a difference, so you can just compare the rightmost values).
- SHA256 was slower than SHA512.
- I did not investigate the randomness of the hash functions. But here is a good comparison of the hash functions that are missing in Ian Boyd's answer.
This points out that CityHash has some problems in corner cases.
The source used for the plots:
- https://github.com/sahib/rmlint/tree/gh-pages/plots (sorry for the ugly code)
Answer (score 1)
I know there are things like SHA-256 and such, but these algorithms are designed to be secure, which usually means they are slower than algorithms that are less unique.
The assumption that cryptographic hash functions are more unique is wrong, and in fact it can be shown to be often backwards in practice. In truth:
- Cryptographic hash functions ideally should be indistinguishable from random;
- But with non-cryptographic hash functions, it’s desirable for them to interact favorably with likely inputs.
Which means that a non-cryptographic hash function may well have fewer collisions than a cryptographic one for "good" data sets – data sets that it was designed for.
We can actually demonstrate this with the data in Ian Boyd's answer and a bit of math: the birthday problem. The formula for the expected number of colliding pairs if you pick n integers at random from the set [1, d] is this (taken from Wikipedia):

    n - d + d * ((d - 1) / d)^n
Plugging n = 216,553 and d = 2^32 we get about 5.5 expected collisions. Ian's tests mostly show results around that neighborhood, but with one dramatic exception: most of the functions got zero collisions in the consecutive-numbers tests. The probability of choosing 216,553 32-bit numbers at random and getting zero collisions is about 0.43%. And that's just for one function – here we have five distinct hash function families with zero collisions!
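As a sanity check, this throwaway C snippet evaluates that formula with those values:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 216553.0;       /* keys hashed */
    double d = 4294967296.0;   /* 2^32 possible hash values */

    /* expected colliding pairs: n - d + d * ((d - 1) / d)^n */
    double expected = n - d + d * pow((d - 1.0) / d, n);
    printf("expected collisions: %.2f\n", expected);   /* prints ~5.46 */
    return 0;
}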
So what we’re seeing here is that the hashes that Ian tested are interacting favorably with the consecutive numbers dataset—i.e., they’re dispersing minimally different inputs more widely than an ideal cryptographic hash function would. (Side note: this means that Ian’s graphical assessment that FNV-1a and MurmurHash2 “look random” to him in the numbers data set can be refuted from his own data. Zero collisions on a data set of that size, for both hash functions, is strikingly nonrandom!)
This is not a surprise because this is a desirable behavior for many uses of hash functions. For example, hash table keys are often very similar; Ian’s answer mentions a problem MSN once had with ZIP code hash tables. This is a use where collision avoidance on likely inputs wins over random-like behavior.
Another instructive comparison here is the contrast in the design goals between CRC and cryptographic hash functions:
- CRC is designed to catch errors resulting from noisy communications channels, which are likely to be a small number of bit flips;
- Crypto hashes are designed to catch modifications made by malicious attackers, who are allotted limited computational resources but arbitrarily much cleverness.
So for CRC it is again good to have fewer collisions than random in minimally different inputs. With crypto hashes, this is a no-no!
The SHA algorithms (including SHA-256) are designed to be fast.
In fact, their speed can be a problem sometimes. In particular, a common technique for storing a password-derived token is to run a standard fast hash algorithm 10,000 times (storing the hash of the hash of the hash of the hash of the … password).
#!/usr/bin/env ruby
require 'securerandom'
require 'digest'
require 'benchmark'

def run_random_digest(digest, count)
  v = SecureRandom.random_bytes(digest.block_length)
  count.times { v = digest.digest(v) }
  v
end

Benchmark.bmbm do |x|
  x.report { run_random_digest(Digest::SHA256.new, 1_000_000) }
end
Output:
Rehearsal ------------------------------------
   1.480000   0.000000   1.480000 (  1.391229)
--------------------------- total: 1.480000sec

       user     system      total        real
   1.400000   0.000000   1.400000 (  1.382016)
Answer (score 7)
Use SipHash. It has many desirable properties:
- Fast. An optimized implementation takes around 1 cycle per byte.
- Secure. SipHash is a strong PRF (pseudorandom function). This means that it is indistinguishable from a random function (unless you know the 128-bit secret key). Hence:
  - There is no need to worry about your hash table probes becoming linear time due to collisions. With SipHash, you know that you will get average-case performance on average, regardless of inputs.
  - Immunity to hash-based denial of service attacks.
  - You can use SipHash (especially the version with a 128-bit output) as a MAC (Message Authentication Code). If you receive a message and a SipHash tag, and the tag is the same as that from running SipHash with your secret key, then you know that whoever created the hash was also in possession of your secret key, and that neither the message nor the hash have been altered since.
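For reference, here's a condensed sketch of SipHash-2-4, adapted from the authors' public-domain reference implementation (it assumes a little-endian host; the 128-bit secret key is the pair k0, k1):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define ROTL(x, b) (uint64_t)(((x) << (b)) | ((x) >> (64 - (b))))

#define SIPROUND do {                                              \
        v0 += v1; v1 = ROTL(v1, 13); v1 ^= v0; v0 = ROTL(v0, 32); \
        v2 += v3; v3 = ROTL(v3, 16); v3 ^= v2;                     \
        v0 += v3; v3 = ROTL(v3, 21); v3 ^= v0;                     \
        v2 += v1; v1 = ROTL(v1, 17); v1 ^= v2; v2 = ROTL(v2, 32); \
    } while (0)

uint64_t siphash24(const void *src, size_t len, uint64_t k0, uint64_t k1)
{
    /* "somepseudorandomlygeneratedbytes" constants, xored with the key */
    uint64_t v0 = 0x736f6d6570736575ULL ^ k0;
    uint64_t v1 = 0x646f72616e646f6dULL ^ k1;
    uint64_t v2 = 0x6c7967656e657261ULL ^ k0;
    uint64_t v3 = 0x7465646279746573ULL ^ k1;
    const unsigned char *p = src;
    const unsigned char *end = p + (len & ~(size_t)7);
    uint64_t b = (uint64_t)len << 56;   /* length goes in the last block */

    for (; p != end; p += 8) {          /* two rounds per 8-byte word */
        uint64_t m;
        memcpy(&m, p, 8);
        v3 ^= m; SIPROUND; SIPROUND; v0 ^= m;
    }
    for (size_t i = 0; i < (len & 7); i++)
        b |= (uint64_t)p[i] << (8 * i); /* remaining 0-7 bytes */

    v3 ^= b; SIPROUND; SIPROUND; v0 ^= b;
    v2 ^= 0xff;                         /* finalization: four rounds */
    SIPROUND; SIPROUND; SIPROUND; SIPROUND;
    return v0 ^ v1 ^ v2 ^ v3;
}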
Answer (score 3)
It depends on the data you are hashing. Some hashes work better with specific data like text. Some hashing algorithms were specifically designed to be good for specific data.
Paul Hsieh once made a fast hash. He lists source code and explanations. But it was already beaten. 🙂
Java uses this simple multiply-and-add algorithm:
The hash code for a String object is computed as

    s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the i-th character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)
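Written out in C for illustration, the same scheme is just a multiply-and-accumulate loop (unsigned arithmetic is used here to get Java's defined mod-2^32 wrapping without C's signed-overflow issues):

#include <stdint.h>
#include <stdio.h>

/* Equivalent of Java's String.hashCode():
   h = h*31 + s[i] accumulates s[0]*31^(n-1) + ... + s[n-1]. */
int32_t java_string_hash(const char *s)
{
    uint32_t h = 0;
    while (*s)
        h = 31 * h + (unsigned char)*s++;
    return (int32_t)h;
}

int main(void)
{
    /* prints 99162322, matching "hello".hashCode() in Java */
    printf("%d\n", java_string_hash("hello"));
    return 0;
}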
There are probably much better ones out there but this is fairly widespread and seems to be a good trade-off between speed and uniqueness.
Answer (score 5)
First of all, why do you need to implement your own hashing? For most tasks you should get good results with data structures from a standard library, assuming there’s an implementation available (unless you’re just doing this for your own education).
As far as actual hashing algorithms go, my personal favorite is FNV. Here's an example implementation of the 32-bit version in C:
unsigned long int FNV_hash(void *dataToHash, unsigned long int length)
{
    unsigned char *p = (unsigned char *) dataToHash;
    unsigned long int h = 2166136261UL;  /* 32-bit FNV offset basis */
    unsigned long int i;

    for (i = 0; i < length; i++)
        h = (h * 16777619) ^ p[i];       /* FNV-1: multiply by the prime, then xor */

    return h;
}