Imagine I have a file called strings.dat. Inside this file there are a lot of strings, for example one million. The strings are sorted. Now I want to find a specified string, so I can write a method like this:
public long FindByName (string text)
{
// ...
}
This method returns the position within the file where the string occurs. But iterating through that much data is not efficient.
I could build some indexing system, like a Dictionary<char, long>, which records the position within the file of the first string that starts with a given character.
Example: if I have 5 strings:
hello
hello2
world
world2
Zzz
then my dictionary will look like this:
h 0
w 2
z 4
But it's not efficient if I have 1,000 strings whose first letter is "d" and 10 million whose first letter is "r".
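Building that index is simple enough, something like this (a rough sketch; here I use the line number as the position, but a byte offset would work the same way):

using System;
using System.Collections.Generic;
using System.IO;

var index = new Dictionary<char, long>();
long lineNumber = 0;
foreach (string line in File.ReadLines("strings.dat"))
{
    char first = char.ToLowerInvariant(line[0]);
    if (!index.ContainsKey(first))
        index[first] = lineNumber;    // first string that starts with this letter
    lineNumber++;
}
Console.WriteLine(string.Join(", ", index));  // e.g. [h, 0], [w, 2], [z, 4]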
Do you know any good algorithms to achieve what I’m asking for?
Because your data is sorted, I would use a Binary Search algorithm on it.
Since this is a text file (i.e., string lengths / record sizes are not all equal) you’ll have a little adjusting to do.
As an example, say your file is 1 GB in size and the average string length, including the end-of-line characters, is around 1,024 bytes.
Open the file and start reading bytes at position 512 MB. Chances are you’ve hit the middle of a string. No problem, just keep reading bytes until you get to the end of the line, then read the next line and use that as your pivot.
It will be a little bit messier than if all the strings were the same length, but not terribly so.
If your search target comes before the pivot string, your next random access read would begin around 256 MB.
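A rough C# sketch of that approach (assumptions on my part, not from the question: the file is ASCII/UTF-8 text with \n or \r\n line endings, it is sorted by ordinal comparison, and I read byte by byte to keep the sketch short):

using System.Collections.Generic;
using System.IO;
using System.Text;

public static class SortedFileSearch
{
    public static long FindByName(string text)
    {
        using (var file = File.OpenRead("strings.dat"))
        {
            long low = 0, high = file.Length;
            while (low < high)
            {
                long mid = (low + high) / 2;

                // We probably landed mid-string: move to the first line start at or after mid.
                long lineStart = NextLineStart(file, mid);
                if (lineStart >= high) { high = mid; continue; }   // no line begins in [mid, high)

                string pivot = ReadLineAt(file, lineStart, out long nextLine);
                int cmp = string.CompareOrdinal(text, pivot);

                if (cmp == 0) return lineStart;   // found: return the byte offset of the line
                if (cmp < 0) high = mid;          // target sorts before the pivot
                else low = nextLine;              // target sorts after the pivot
            }
            return -1;                            // not found
        }
    }

    // First line start at or after pos.
    static long NextLineStart(FileStream file, long pos)
    {
        if (pos == 0) return 0;
        file.Position = pos - 1;
        if (file.ReadByte() == '\n') return pos;  // pos already begins a line
        int b;
        while ((b = file.ReadByte()) != -1 && b != '\n') { }
        return file.Position;
    }

    // Read the line starting at pos; nextLine reports where the following line begins.
    static string ReadLineAt(FileStream file, long pos, out long nextLine)
    {
        file.Position = pos;
        var bytes = new List<byte>();
        int b;
        while ((b = file.ReadByte()) != -1 && b != '\n')
            if (b != '\r') bytes.Add((byte)b);
        nextLine = file.Position;
        return Encoding.UTF8.GetString(bytes.ToArray());
    }
}

Each iteration costs one seek plus roughly one line read, so a million strings take about 20 probes.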
You mention in the comments to your question that you used to use a DBMS but are moving away in order to save money and make your project more portable.
There are free and portable DBMS alternatives. I highly recommend looking into those instead of writing your own.
Existing products will have been designed by better devs than you or me, and will already have hundreds of testing hours on them. In other words: They will be faster and more stable.
Try looking into SQLite.
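For example, a minimal sketch with the Microsoft.Data.Sqlite package (the table layout and file names here are just illustrative):

using Microsoft.Data.Sqlite;

using var connection = new SqliteConnection("Data Source=strings.db");
connection.Open();

// One-time setup: load the strings into an indexed table instead of a flat file.
var setup = connection.CreateCommand();
setup.CommandText = "CREATE TABLE IF NOT EXISTS strings (value TEXT PRIMARY KEY)";
setup.ExecuteNonQuery();

// Lookup: the B-tree index behind PRIMARY KEY does the searching for you.
var query = connection.CreateCommand();
query.CommandText = "SELECT rowid FROM strings WHERE value = $value";
query.Parameters.AddWithValue("$value", "hello2");
object rowid = query.ExecuteScalar();   // null if the string is not present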
I would use a bucket sort to help optimize this.
Since strings are different lengths in your example, and there is a lot of data considering this is a flat file, I would simply split the file into multiple pieces based on the string prefix.
So instead of having strings.dat, I would have the following:
a.dat
b.dat
…
z.dat
If any particular file is still too large (for example, a common first letter such as r or t makes that bucket disproportionately large), split it into smaller buckets:
r.dat
ra.dat
re.dat
…
This can continue with rab.dat and so on.
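The first level of that split can be sketched like this (note that GroupBy buffers everything in memory; a streaming split that appends line by line would scale better for a 1 GB file):

using System.IO;
using System.Linq;

foreach (var bucket in File.ReadLines("strings.dat")
                           .Where(line => line.Length > 0)
                           .GroupBy(line => char.ToLowerInvariant(line[0])))
{
    File.WriteAllLines($"{bucket.Key}.dat", bucket);   // e.g. h.dat, w.dat, z.dat
}

Deeper levels (ra.dat, rab.dat, …) would repeat the same idea on any bucket that is still too large.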
To find a string, simply start with the full string and look for a file matching its name. If not found, look for a file matching the string minus its last character. Do this until a match is found or the string is zero length (e.g. maybe there is no x.dat, so no word starting with x will ever be found).
Example: here are some files, and a string that could be located in each one.
- r.dat – rhombus (no rh.dat)
- ra.dat – rat (no rat.dat)
- rab.dat – rabid (no rabi.dat)
Each file should be small enough for efficient searching either using a full scan or a binary search, whatever you believe is appropriate given the amount of data.
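The file lookup itself is only a few lines (the bucket directory and example string below are hypothetical):

using System;
using System.IO;

string dir = "buckets";     // wherever the *.dat files live
string text = "rabid";      // the string we are looking for

string bucket = null;
// Try the longest prefix first, then drop one character at a time.
for (int len = text.Length; len > 0 && bucket == null; len--)
{
    string candidate = Path.Combine(dir, text.Substring(0, len) + ".dat");
    if (File.Exists(candidate))
        bucket = candidate;                // e.g. "rabid" falls through to rab.dat
}
Console.WriteLine(bucket ?? "no bucket, so the string cannot be present");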
Benefits
- Smaller text files are easier to manage in a text editor if you need to modify them.
- Seeking through text files can be inefficient depending on your language, library, and file mode. By reducing the amount of data in each file, this may alleviate that.
- From an algorithmic complexity perspective this does not buy you much if anything. However, from a practical perspective it does. File I/O is expensive, easily one of the most expensive operations a computer can perform (unless you have an SSD). By minimizing file seeks/reads, you mitigate the real-world performance impact of searching 500 MB of data each time. How much? Profile it to say for sure, but I would feel comfortable sticking my neck out and saying it would be a measurable amount.
Use a hash table. The hash table would include a reasonably unique key generated from the string and pointers to the strings that match that hash key.
A simple, fast hash with a small hash table (256 entries), but lots of key collisions, would be the xor of all bytes in the string. A slower, more complicated hash with a much larger hash table (as many entries as strings), but where key collisions are unlikely, would be AES encryption.
I realize you're using C#, but here's a little Perl script to help you investigate what hash table size you'd like to use. This version of the keyify() function just sums the string into a 16-bit integer.
# keyspace.pl
sub keyify {
    use constant HASH_KEY_SIZE_IN_BITS => 16;
    return unpack( '%' . HASH_KEY_SIZE_IN_BITS . 'A*', $_ );
}

$/ = "\r\n";    # Windows EOL
$offset = 0;
while (<>) {
    $newoffset = $offset + length($_);
    $key = keyify($_);
    if (defined $myhash{$key}) {
        # key collision, add to the list of offsets
        push @{ $myhash{$key} }, $offset;
    } else {
        # new key, create the list of offsets
        $myhash{$key} = [$offset];
    }
    $offset = $newoffset;
}
printf "%d keys generated\n", scalar (keys %myhash);

$max = 0;
foreach (keys %myhash) {
    $collisions = scalar @{ $myhash{$_} };
    $max = $collisions if ( $collisions > $max );
}
print "maximum # of string offsets in a hash = $max\n";
exit;

# dump hash table
foreach (keys %myhash) {
    print "key = $_:";
    foreach my $offset ( @{ $myhash{$_} } ) {
        print " $offset";
    }
    print "\n";
}
Use it like so:
perl keyspace.pl <strings.dat
The same thing in PowerShell, with a much simpler hashing function. You’ll have to put in some effort if you want this to be useful.
# keyspace.ps1
# Don't use "gc -Encoding Byte -Raw" because it reads the ENTIRE file into memory.
function keyify {
    return $args[0].Substring(0,1);
}

$myHash = @{};
$offset = 0;
$file = New-Object System.IO.StreamReader($args[0]);
while ($line = $file.ReadLine()) {
    $newoffset = $offset + $line.Length + 2;   # adjust by 2 for Windows EOL (CRLF)
    $key = keyify($line);
    if ($myHash.ContainsKey($key)) {
        # key collision, add to the list of offsets
        $myHash.Set_Item($key, $myHash.Get_Item($key) + $offset);
    } else {
        # new key, create the list of offsets
        $myHash.Add($key, @($offset));
    }
    $offset = $newoffset;
}
$file.close()

echo "$($myHash.Count) keys generated";
$max = 0;
foreach ($i in $myHash.KEYS.GetEnumerator()) {
    $collisionList = $myHash.Get_Item($i);
    if ($collisionList.Count -gt $max) {
        $max = $collisionList.Count;
    }
}
echo "maximum # of string offsets in a hash = $max";
# echo $myHash;
Use it like so:
.\keyspace.ps1 strings.dat
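If you would rather stay in C#, the same bookkeeping with the simple one-byte XOR key might look roughly like this (the file name, the CRLF assumption, and the summary output are mine):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

var index = new Dictionary<byte, List<long>>();
long offset = 0;
foreach (string line in File.ReadLines("strings.dat"))
{
    byte key = 0;
    foreach (byte b in Encoding.UTF8.GetBytes(line))
        key ^= b;                                   // XOR of all bytes -> 256 possible keys

    if (!index.TryGetValue(key, out var offsets))
        index[key] = offsets = new List<long>();
    offsets.Add(offset);                            // byte offset where this line starts

    offset += Encoding.UTF8.GetByteCount(line) + 2; // +2 for the Windows CRLF
}
Console.WriteLine($"{index.Count} keys generated");
Console.WriteLine($"maximum # of string offsets in a hash = {index.Values.Max(o => o.Count)}");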