Imagine I have a file called strings.dat. Inside this file there are a lot of strings, for example one million. The strings are sorted. Now I want to find a specified string, so I can write a method like this:
public long FindByName (string text)
{
// ...
}
This method returns the position within the file where the string occurs. But iterating through that much data is not efficient.
I could build some indexing system, like a Dictionary<char, long>, which records the position within the file of the first string that starts with a given character.
Example: if I have 5 strings:
hello
hello2
world
world2
Zzz
then my dictionary will look like this:
h 0
w 2
z 4
But it's not efficient if I have 1,000 strings whose first letter is "d" and 10 million whose first letter is "r".
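Building that index is simple enough, something like this (a rough sketch; here I use the line number as the position, but a byte offset would work the same way):

using System;
using System.Collections.Generic;
using System.IO;

var index = new Dictionary<char, long>();
long lineNumber = 0;
foreach (string line in File.ReadLines("strings.dat"))
{
    char first = char.ToLowerInvariant(line[0]);
    if (!index.ContainsKey(first))
        index[first] = lineNumber;    // first string that starts with this letter
    lineNumber++;
}
Console.WriteLine(string.Join(", ", index));  // e.g. [h, 0], [w, 2], [z, 4]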
Do you know any good algorithms to achieve what I’m asking for?
Because your data is sorted, I would use a Binary Search algorithm on it.
Since this is a text file (i.e., string lengths / record sizes are not all equal) you’ll have a little adjusting to do.
As an example, say your file is 1 GB in size and the average string length, including the end-of-line characters, is around 1,024 bytes.
Open the file and start reading bytes at position 512 MB. Chances are you’ve hit the middle of a string. No problem, just keep reading bytes until you get to the end of the line, then read the next line and use that as your pivot.
It will be a little bit messier than if all the strings were the same length, but not terribly so.
If your search target comes before the pivot string, your next random access read would begin around 256 MB.
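A rough C# sketch of that approach (assumptions on my part, not from the question: the file is ASCII/UTF-8 text with \n or \r\n line endings, it is sorted by ordinal comparison, and I read byte by byte to keep the sketch short):

using System.Collections.Generic;
using System.IO;
using System.Text;

public static class SortedFileSearch
{
    public static long FindByName(string text)
    {
        using (var file = File.OpenRead("strings.dat"))
        {
            long low = 0, high = file.Length;
            while (low < high)
            {
                long mid = (low + high) / 2;

                // We probably landed mid-string: move to the first line start at or after mid.
                long lineStart = NextLineStart(file, mid);
                if (lineStart >= high) { high = mid; continue; }   // no line begins in [mid, high)

                string pivot = ReadLineAt(file, lineStart, out long nextLine);
                int cmp = string.CompareOrdinal(text, pivot);

                if (cmp == 0) return lineStart;   // found: return the byte offset of the line
                if (cmp < 0) high = mid;          // target sorts before the pivot
                else low = nextLine;              // target sorts after the pivot
            }
            return -1;                            // not found
        }
    }

    // First line start at or after pos.
    static long NextLineStart(FileStream file, long pos)
    {
        if (pos == 0) return 0;
        file.Position = pos - 1;
        if (file.ReadByte() == '\n') return pos;  // pos already begins a line
        int b;
        while ((b = file.ReadByte()) != -1 && b != '\n') { }
        return file.Position;
    }

    // Read the line starting at pos; nextLine reports where the following line begins.
    static string ReadLineAt(FileStream file, long pos, out long nextLine)
    {
        file.Position = pos;
        var bytes = new List<byte>();
        int b;
        while ((b = file.ReadByte()) != -1 && b != '\n')
            if (b != '\r') bytes.Add((byte)b);
        nextLine = file.Position;
        return Encoding.UTF8.GetString(bytes.ToArray());
    }
}

Each iteration costs one seek plus roughly one line read, so a million strings take about 20 probes.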
You mention in the comments to your question that you used to use a DBMS but are moving away in order to save money and make your project more portable.
There are free and portable DBMS alternatives. I highly recommend looking into those instead of writing your own.
Existing products will have been designed by better devs than you or me, and will already have hundreds of testing hours on them. In other words: They will be faster and more stable.
Try looking into SQLite.
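For example, a minimal sketch with the Microsoft.Data.Sqlite package (the table layout and file names here are just illustrative):

using Microsoft.Data.Sqlite;

using var connection = new SqliteConnection("Data Source=strings.db");
connection.Open();

// One-time setup: load the strings into an indexed table instead of a flat file.
var setup = connection.CreateCommand();
setup.CommandText = "CREATE TABLE IF NOT EXISTS strings (value TEXT PRIMARY KEY)";
setup.ExecuteNonQuery();

// Lookup: the B-tree index behind PRIMARY KEY does the searching for you.
var query = connection.CreateCommand();
query.CommandText = "SELECT rowid FROM strings WHERE value = $value";
query.Parameters.AddWithValue("$value", "hello2");
object rowid = query.ExecuteScalar();   // null if the string is not present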
I would use a bucket sort to help optimize this.
Since strings are different lengths in your example, and there is a lot of data considering this is a flat file, I would simply split the file into multiple pieces based on the string prefix.
So instead of having strings.dat, I would have the following:
a.dat
b.dat
…
z.dat
If any particular file is still too large (for example, a common first letter such as r or t makes that bucket disproportionately large), split it into smaller buckets:
r.dat
ra.dat
re.dat
…
This can continue with rab.dat and so on.
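The first level of that split can be sketched like this (note that GroupBy buffers everything in memory; a streaming split that appends line by line would scale better for a 1 GB file):

using System.IO;
using System.Linq;

foreach (var bucket in File.ReadLines("strings.dat")
                           .Where(line => line.Length > 0)
                           .GroupBy(line => char.ToLowerInvariant(line[0])))
{
    File.WriteAllLines($"{bucket.Key}.dat", bucket);   // e.g. h.dat, w.dat, z.dat
}

Deeper levels (ra.dat, rab.dat, …) would repeat the same idea on any bucket that is still too large.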
To find a string, simply start with the full string and look for a file matching its name. If not found, look for a file matching the string minus its last character. Do this until a match is found or the string is zero length (e.g. maybe there is no x.dat, so no word starting with x will ever be found).
Example: here are some files, and a string that could be located in each one.
- r.dat – rhombus (no rh.dat)
- ra.dat – rat (no rat.dat)
- rab.dat – rabid (no rabi.dat)
Each file should be small enough for efficient searching either using a full scan or a binary search, whatever you believe is appropriate given the amount of data.
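The file lookup itself is only a few lines (the bucket directory and example string below are hypothetical):

using System;
using System.IO;

string dir = "buckets";     // wherever the *.dat files live
string text = "rabid";      // the string we are looking for

string bucket = null;
// Try the longest prefix first, then drop one character at a time.
for (int len = text.Length; len > 0 && bucket == null; len--)
{
    string candidate = Path.Combine(dir, text.Substring(0, len) + ".dat");
    if (File.Exists(candidate))
        bucket = candidate;                // e.g. "rabid" falls through to rab.dat
}
Console.WriteLine(bucket ?? "no bucket, so the string cannot be present");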
Benefits
- Smaller text files are easier to manage in a text editor if you need to modify them.
- Seeking through text files can be inefficient depending on your language, library, and file mode. By reducing the amount of data in each file, this may alleviate that.
- From an algorithmic complexity perspective this does not buy you much if anything. However, from a practical perspective it does. File I/O is expensive, easily one of the most expensive operations a computer can perform (unless you have an SSD). By minimizing file seeks/reads, you mitigate the real-world performance impact of searching 500 MB of data each time. How much? Profile it to say for sure, but I would feel comfortable sticking my neck out and saying it would be a measurable amount.
Use a hash table. The hash table would include a reasonably unique key generated from the string and pointers to the strings that match that hash key.
A simple, fast hash with a small hash table (256 entries), but lots of key collisions, would be the xor of all bytes in the string. A slower, more complicated hash with a much larger hash table (as many entries as strings), but where key collisions are unlikely, would be AES encryption.
I realize you're using C#, but here's a little Perl script to help you investigate what hash table size you'd like to use. This version of the keyify() function just sums the string into a 16-bit integer.
# keyspace.pl
sub keyify {
    use constant HASH_KEY_SIZE_IN_BITS => 16;
    return unpack( '%' . HASH_KEY_SIZE_IN_BITS . 'A*', $_ );
}

$/ = "\r\n";    # Windows EOL
$offset = 0;
while (<>) {
    $newoffset = $offset + length($_);
    $key = keyify($_);
    if (defined $myhash{$key}) {
        # key collision, add to the list of offsets
        push @{ $myhash{$key} }, $offset;
    } else {
        # new key, create the list of offsets
        $myhash{$key} = [$offset];
    }
    $offset = $newoffset;
}
printf "%d keys generated\n", scalar (keys %myhash);

$max = 0;
foreach (keys %myhash) {
    $collisions = scalar @{ $myhash{$_} };
    $max = $collisions if ( $collisions > $max );
}
print "maximum # of string offsets in a hash = $max\n";
exit;

# dump hash table
foreach (keys %myhash) {
    print "key = $_:";
    foreach my $offset ( @{ $myhash{$_} } ) {
        print " $offset";
    }
    print "\n";
}
Use it like so:
perl keyspace.pl <strings.dat
The same thing in PowerShell, with a much simpler hashing function. You’ll have to put in some effort if you want this to be useful.
# keyspace.ps1
# Don't use "gc -Encoding Byte -Raw" because it reads the ENTIRE file into memory.
function keyify {
    return $args[0].Substring(0,1);
}

$myHash = @{};
$offset = 0;
$file = New-Object System.IO.StreamReader($args[0]);
while ($line = $file.ReadLine()) {
    $newoffset = $offset + $line.Length + 2;   # adjust by 2 for Windows EOL (CRLF)
    $key = keyify($line);
    if ($myHash.ContainsKey($key)) {
        # key collision, add to the list of offsets
        $myHash.Set_Item($key, $myHash.Get_Item($key) + $offset);
    } else {
        # new key, create the list of offsets
        $myHash.Add($key, @($offset));
    }
    $offset = $newoffset;
}
$file.close()

echo "$($myHash.Count) keys generated";
$max = 0;
foreach ($i in $myHash.KEYS.GetEnumerator()) {
    $collisionList = $myHash.Get_Item($i);
    if ($collisionList.Count -gt $max) {
        $max = $collisionList.Count;
    }
}
echo "maximum # of string offsets in a hash = $max";
# echo $myHash;
Use it like so:
.\keyspace.ps1 strings.dat
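If you would rather stay in C#, the same bookkeeping with the simple one-byte XOR key might look roughly like this (the file name, the CRLF assumption, and the summary output are mine):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

var index = new Dictionary<byte, List<long>>();
long offset = 0;
foreach (string line in File.ReadLines("strings.dat"))
{
    byte key = 0;
    foreach (byte b in Encoding.UTF8.GetBytes(line))
        key ^= b;                                   // XOR of all bytes -> 256 possible keys

    if (!index.TryGetValue(key, out var offsets))
        index[key] = offsets = new List<long>();
    offsets.Add(offset);                            // byte offset where this line starts

    offset += Encoding.UTF8.GetByteCount(line) + 2; // +2 for the Windows CRLF
}
Console.WriteLine($"{index.Count} keys generated");
Console.WriteLine($"maximum # of string offsets in a hash = {index.Values.Max(o => o.Count)}");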