I have around 800,000 rows of data, loaded from the database and stored in Boost shared memory. The data are in the form:
Id  Color  Length  Size
1   1      2       4
2   3      4       5
3   2      2       0
4   1      2       4
... and so on
Color can take values from 1-12, Length from 1-4, and Size from 1-5. Id, Color, Length and Size are each stored in a separate vector of size 800,000 in the shared memory: there is an Id vector for Id, a Color vector for Color, and so on.
I want to filter the data before I perform some computation. For example, I want the data for which Color is 1, Length is 2 and Size is 4, i.e. rows 1 and 4 in the above case. Is there an efficient way to filter the data without a for loop that goes through all 800,000 images and checks the condition?
Right now I am just using a MySQL statement to get the Ids of the rows which satisfy the condition:
"select Id from features_table where Color=1 and Length=2 and Size=4"
But is there a faster way to do this, or should I stick to this method? Since I am looking for speed, I am not sure whether fetching the Ids from the database will increase the execution time of the algorithm.
What are the other options I can consider in this case? I have read about hash tables, B-trees and binary search trees, and I am confused about which is suitable here. Would a kd-tree be helpful? Many images may have the same combination of Color, Length and Size, so I am not sure a kd-tree is the right thing to use. I also read about FLANN in OpenCV, which is used for kd-trees. Are there any examples or resources that may be helpful in this case? Or are there any built-in C++ libraries?
You have a lot of data duplication there, as the number of possible combinations of Color, Length and Size values is far smaller than the number of rows.
A better alternative is to store a set of row ids for each possible tuple:
#include <tuple>
#include <vector>

// Color, Length and Size here stand for small integer types or enums.
struct Rows
{
    std::tuple<Color, Length, Size> properties; // one distinct combination
    std::vector<int> rowids;                    // ids of all rows having it
};
This way, queries are straightforward.
If the number of possible tuples grows large enough, it would also be worth building secondary indexes, one per property: std::multimap<Color, Rows*> and so on.
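A minimal sketch of one way to organise this, assuming Color, Length and Size are plain integers; here the Rows structures are folded into a single std::map keyed by the property tuple, and rowIndex, addRow and lookup are illustrative names:

#include <map>
#include <tuple>
#include <vector>

using Color = int; using Length = int; using Size = int;
using Key = std::tuple<Color, Length, Size>;

// One vector of row ids per distinct (Color, Length, Size) combination.
std::map<Key, std::vector<int>> rowIndex;

// Build once while loading the 800,000 rows from shared memory.
void addRow(int id, Color c, Length l, Size s)
{
    rowIndex[std::make_tuple(c, l, s)].push_back(id);
}

// Query: all ids for one combination, in a single O(log n) lookup.
const std::vector<int>* lookup(Color c, Length l, Size s)
{
    auto it = rowIndex.find(std::make_tuple(c, l, s));
    return it == rowIndex.end() ? nullptr : &it->second;
}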
You have exactly 240 unique combinations of colour, length and size (12 × 4 × 5). If the id is strictly ascending as in your example, you do not need to store it, and each row can conveniently be packed into a single short int. An array of 800,000 shorts takes 1.6 MB of memory and can be brute-force searched in under a millisecond. Do you need faster than this?
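A minimal sketch of that packing and scan, assuming the 1-based value ranges from the question; the encoding layout and the names pack and matchingIds are just one possible choice:

#include <cstdint>
#include <vector>

// Pack Color (1-12), Length (1-4) and Size (1-5) into one of 240 codes, 0-239.
inline std::uint16_t pack(int color, int length, int size)
{
    return static_cast<std::uint16_t>((color - 1) * 20 + (length - 1) * 5 + (size - 1));
}

// Brute-force scan over all 800,000 codes; the vector index stands in for
// the id (ids are strictly ascending, so they need not be stored).
std::vector<int> matchingIds(const std::vector<std::uint16_t>& codes,
                             int color, int length, int size)
{
    std::vector<int> ids;
    const std::uint16_t wanted = pack(color, length, size);
    for (std::size_t i = 0; i < codes.size(); ++i)
        if (codes[i] == wanted)
            ids.push_back(static_cast<int>(i));
    return ids;
}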
If the id has to be stored as well, you need 28 bits per row (20 bits for ids up to 800,000 plus 8 bits for the 240 combinations), but a short and an int may be more convenient. Still only 4.8 MB of memory, and still very fast to search.
If you know in advance the searches you want to make, then you can build indexes to make them even faster. You could use the 240 unique combinations as keys and the ids as data, with about 3,333 ids in each bucket.
Hash table: Yes. The above could be implemented with an unordered multimap, such as the one in the standard library (a sketch follows below). A custom solution might be faster, however.
B-trees and binary search trees are for sorted data. You have not indicated that ordering is a requirement.
Kd-trees are for multi-dimensional sorted data and excel at nearest-neighbour searches. Again, this does not seem to match your needs.
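For completeness, a minimal sketch of the hash-table index described above, assuming the packed 0-239 code from the earlier sketch is used as the key; buildIndex and query are illustrative names:

#include <cstdint>
#include <unordered_map>
#include <vector>

// Build once: packed (Color, Length, Size) code -> id, ~3,333 ids per key.
std::unordered_multimap<std::uint16_t, int>
buildIndex(const std::vector<std::uint16_t>& codes)
{
    std::unordered_multimap<std::uint16_t, int> idx;
    idx.reserve(codes.size());
    for (std::size_t i = 0; i < codes.size(); ++i)
        idx.emplace(codes[i], static_cast<int>(i));
    return idx;
}

// Query in average O(1) plus the size of the result set.
std::vector<int> query(const std::unordered_multimap<std::uint16_t, int>& idx,
                       std::uint16_t code)
{
    std::vector<int> ids;
    auto range = idx.equal_range(code);
    for (auto it = range.first; it != range.second; ++it)
        ids.push_back(it->second);
    return ids;
}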