I have around 800,000 rows of data, loaded from the database and stored in Boost shared memory. The data are in the form:
Id  Color  Length  Size
1   1      2       4
2   3      4       5
3   2      2       0
4   1      2       4
... and so on
Color can take values from 1-12, Length from 1-4, and Size from 1-5. Id, Color, Length and Size are each stored in a separate vector of size 800,000 in the shared memory: there is an Id vector for Id, a Color vector for Color, and so on.
I want to filter the data before I perform some computation. For example, I want the data for which Color is 1, Length is 2 and Size is 4, i.e. rows 1 and 4 in the above case. Is there an efficient way to filter the data without a for loop that goes through all 800,000 images and checks the condition?
Right now I am just using a MySQL statement to get the Ids of the rows which satisfy the condition:
"select Id from features_table where Color=1 and Length=2 and Size=4"
But is there a faster way to do this, or should I stick to this method? Since I am looking for speed, I am not sure whether fetching the Ids from the database will increase the execution time of the algorithm.
What are the other options I can consider in this case? I have read about hash tables, B-trees and binary search trees, and I am confused about which is suitable here. Would a kd-tree be helpful? Many images may have the same combination of Color, Length and Size, so I am not sure a kd-tree is the right thing to use. I also read about FLANN in OpenCV, which is used for kd-trees. Are there any examples or resources that may be helpful in this case? Or are there any built-in C++ libraries?
You have a lot of data duplication there, as the number of possible combinations of Color, Length and Size values is far smaller than the number of rows.
A better alternative is to store a set of row ids for each possible tuple:
#include <tuple>
#include <vector>

// Color, Length and Size here stand for small integer types or enums.
struct Rows
{
    std::tuple<Color, Length, Size> properties; // one distinct combination
    std::vector<int> rowids;                    // ids of all rows having it
};
This way, queries are straightforward.
If the number of possible tuples grows large enough, it would also be worth building secondary indexes, one per property: std::multimap<Color, Rows*> and so on.
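A minimal sketch of one way to organise this, assuming Color, Length and Size are plain integers; here the Rows structures are folded into a single std::map keyed by the property tuple, and rowIndex, addRow and lookup are illustrative names:

#include <map>
#include <tuple>
#include <vector>

using Color = int; using Length = int; using Size = int;
using Key = std::tuple<Color, Length, Size>;

// One vector of row ids per distinct (Color, Length, Size) combination.
std::map<Key, std::vector<int>> rowIndex;

// Build once while loading the 800,000 rows from shared memory.
void addRow(int id, Color c, Length l, Size s)
{
    rowIndex[std::make_tuple(c, l, s)].push_back(id);
}

// Query: all ids for one combination, in a single O(log n) lookup.
const std::vector<int>* lookup(Color c, Length l, Size s)
{
    auto it = rowIndex.find(std::make_tuple(c, l, s));
    return it == rowIndex.end() ? nullptr : &it->second;
}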
You have exactly 240 unique combinations of colour, length and size (12 × 4 × 5). If the id is strictly ascending as in your example, you do not need to store it, and each row can conveniently be packed into a single short int. An array of 800,000 shorts takes 1.6 MB of memory and can be brute-force searched in under a millisecond. Do you need faster than this?
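A minimal sketch of that packing and scan, assuming the 1-based value ranges from the question; the encoding layout and the names pack and matchingIds are just one possible choice:

#include <cstdint>
#include <vector>

// Pack Color (1-12), Length (1-4) and Size (1-5) into one of 240 codes, 0-239.
inline std::uint16_t pack(int color, int length, int size)
{
    return static_cast<std::uint16_t>((color - 1) * 20 + (length - 1) * 5 + (size - 1));
}

// Brute-force scan over all 800,000 codes; the vector index stands in for
// the id (ids are strictly ascending, so they need not be stored).
std::vector<int> matchingIds(const std::vector<std::uint16_t>& codes,
                             int color, int length, int size)
{
    std::vector<int> ids;
    const std::uint16_t wanted = pack(color, length, size);
    for (std::size_t i = 0; i < codes.size(); ++i)
        if (codes[i] == wanted)
            ids.push_back(static_cast<int>(i));
    return ids;
}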
If the id has to be stored as well, you need 28 bits per row (20 bits for ids up to 800,000 plus 8 bits for the 240 combinations), but a short and an int may be more convenient. Still only 4.8 MB of memory, and still very fast to search.
If you know in advance the searches you want to make, then you can build indexes to make them even faster. You could use the 240 unique combinations as keys and the ids as data, with about 3,333 ids in each bucket.
Hash table: Yes. The above could be implemented with an unordered multimap, such as the one in the standard library (a sketch follows below). A custom solution might be faster, however.
B-trees and binary search trees are for sorted data. You have not indicated that ordering is a requirement.
Kd-trees are for multi-dimensional sorted data and excel at nearest-neighbour searches. Again, this does not seem to match your needs.
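For completeness, a minimal sketch of the hash-table index described above, assuming the packed 0-239 code from the earlier sketch is used as the key; buildIndex and query are illustrative names:

#include <cstdint>
#include <unordered_map>
#include <vector>

// Build once: packed (Color, Length, Size) code -> id, ~3,333 ids per key.
std::unordered_multimap<std::uint16_t, int>
buildIndex(const std::vector<std::uint16_t>& codes)
{
    std::unordered_multimap<std::uint16_t, int> idx;
    idx.reserve(codes.size());
    for (std::size_t i = 0; i < codes.size(); ++i)
        idx.emplace(codes[i], static_cast<int>(i));
    return idx;
}

// Query in average O(1) plus the size of the result set.
std::vector<int> query(const std::unordered_multimap<std::uint16_t, int>& idx,
                       std::uint16_t code)
{
    std::vector<int> ids;
    auto range = idx.equal_range(code);
    for (auto it = range.first; it != range.second; ++it)
        ids.push_back(it->second);
    return ids;
}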