I’m faced with the following problem:
I have an array produced by parsing a YAML file that contains patterns in the following structure:
"Programming books"
Title:
a list of titles
Author:
"Art Books"
Title:
foo
bar
Author:
Mrs. Foo
Mr. Bar
Year:
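For illustration, this is roughly what the parsed structure looks like; I'm assuming Python and PyYAML here, since the question doesn't name a language, and the concrete values are placeholders rather than real data.

```python
import yaml

# Placeholder config mirroring the structure above; the titles and year
# are illustrative, not taken from the question.
raw = """
"Programming books":
    Title:
        - Some Title
        - Another Title
    Author:
        - Some Author
"Art Books":
    Title:
        - foo
        - bar
    Author:
        - Mrs. Foo
        - Mr. Bar
    Year:
        - 2001
"""

config = yaml.safe_load(raw)
# config is now a dict of book types, each mapping a tag name to a list of values:
# {'Programming books': {'Title': [...], 'Author': [...]},
#  'Art Books':         {'Title': ['foo', 'bar'],
#                        'Author': ['Mrs. Foo', 'Mr. Bar'],
#                        'Year': [2001]}}
```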
The program is supposed to be given random input and find whether there is a book with that particular Title, Author and Year.
So far I’m using the following (pseudocode):
foreach (book_type, tags) in config {
    tag_match = 0
    foreach (tag, values) in tags {
        foreach value in values {
            if value in input {
                tag_match++
                break   /* one matching value is enough for this tag */
            }
        }
    }
    if tag_match == tags.length()
        /* the book type matched, continue matching */
}
The program is not supposed to receive many matching lines, but it is supposed to receive a huge amount of data.
So far it has to do book_type.length() * tag.length() * value.length() iterations for each line of input. Is there a better way to do this?
Use a dictionary or collection with key-value pairs if the size of the data allows; if not, a database, as stated above.
There are lots of different collection, dictionary and hash table implementations, and the right/fastest one to use will depend on the nature of its intended use, so you will have to look around to find the most suitable one. For example, some load faster but are slower to read, others load slower but have faster access times, and so on.
One other thing: currently you are doing something like O(3n), by which I mean a lookup over type, then tag, then value, so if each of those is a loop over, say, 10 items, that is 10 + 10 + 10 = 30 in the worst case. You could combine Type, Tag and Value into a single key, which would be 10 in the worst case; just a thought.
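One way to read that suggestion (a minimal sketch, assuming Python and the parsed config dict from the question's example; the flat index and the tuple key are my own illustration):

```python
from itertools import product

# Build one flat index keyed on the combined (title, author, year) tuple.
# This is done once up front; each input line then costs a single hash lookup
# instead of looping over every type, tag and value.
index = {}
for book_type, tags in config.items():
    titles  = tags.get("Title")  or [None]
    authors = tags.get("Author") or [None]
    years   = tags.get("Year")   or [None]
    for key in product(titles, authors, years):
        index[key] = book_type

def match(title, author, year):
    # Returns the matching book type, or None if the combination is unknown.
    return index.get((title, author, year))
```

The trade-off is that the index holds the cross product of the values per type, so it costs memory up front in exchange for an O(1) lookup per input line.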
Here’s a link to an article I found useful, though it is .NET specific: IDictionary Options – Performance Test – SortedList vs. SortedDictionary vs. Dictionary vs. Hashtable.
Hashing/indexing is the way to go. Up to a few million items or so, create a single- or multi-level hashtable/hashmap/whatever it’s called in your language. In C# it could be Dictionary<string, Dictionary<string, List<Book>>> books_by_title_and_author. The first key is the title, the second is the author, and the value is a list of matching books, expected to hold only a few items. If you have too much data, you should do what a database would do: build a B-tree index. Again, this can be single- or multi-column. An index takes a bit longer to build, but can be more efficient with many items. Also, an on-disk structure is the only way to go if your data is too big to fit in memory.
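A rough Python equivalent of that two-level structure, just to make the lookup concrete (the Book records and the find helper are mine, assumed for this sketch):

```python
from collections import defaultdict, namedtuple

# Illustrative records; in the question these would come from the parsed YAML.
Book = namedtuple("Book", "title author year")
books = [Book("foo", "Mrs. Foo", 2001), Book("bar", "Mr. Bar", 2002)]

# title -> author -> list of matching books, mirroring
# Dictionary<string, Dictionary<string, List<Book>>> from the answer.
books_by_title_and_author = defaultdict(lambda: defaultdict(list))
for book in books:
    books_by_title_and_author[book.title][book.author].append(book)

def find(title, author, year):
    # Two hash lookups, then a scan over the (usually short) candidate list
    # to check the remaining field.
    candidates = books_by_title_and_author.get(title, {}).get(author, [])
    return [b for b in candidates if b.year == year]

print(find("foo", "Mrs. Foo", 2001))  # [Book(title='foo', author='Mrs. Foo', year=2001)]
```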