Algorithm for fast tag search

The problem is the following.

There’s a set of simple entities E, each one having a set of tags T attached.
Each entity might have an arbitrary number of tags.
Total number of entities is near 100 million, and the total number of tags is about 5000.

So the initial data is something like this:

E1 - T1, T2, T3, ... Tn
E2 - T1, T5, T100, ... Tk
..
Ez - T10, T12, ... Tl

This initial data is quite rarely updated.

Somehow my app generates a logical expression on tags like this:

T1&T2&T3 | (T5&!T6)
What I need to is to calculate a number of entities matching given expression (note – not the entities, but just the number). This one might be not totally accurate, of course.

What I’ve got now is a simple in-memory table lookup, giving me a 5-10 seconds execution time on a single thread.

I’m curious, is there any efficient way to handle this stuff? What approach would you recommend? Is there some common algorithms or data structures for this?

Update

A bit of clarification as requested.

T objects are actually relatively short constant strings. But it doesn’t actually matter – we can always assign some IDs and operate on integers.
We definitely can sort them.

i would do this in sql having a link table EntityCategory between eid referencing entity and cid referencing category using self-joins:

    select count(ec1.eid)
    from EntityCategory ec1 
    left join EntityCategory ec2 on ec1.eid=ec2.eid 
    left join EntityCategory ec3 on ec1.eid=ec3.eid 
    ...
    where 
      ec1.cid={categoryId1} and 
      ec2.cid={categoryId2} and
      ec3.cid={categoryId3} ...

After having written this answer, I should probably flag the question as “too broad” – we can talk for ages about various strategies, in the end a benchmark will have to be run with your data.

Each tag can be efficiently represented by an integer. Each entity has a set of tags. Choosing the correct set implementation is important – both B-trees and sorted arrays are possible. With this set, we will only do membership tests. As both structures do this in O(log t) (with t tags per entity), I would prefer arrays due to their denser representation.

We can now filter through all entities in an O(n · log t · p) operation, where p is the average path length in the predicate decision tree. This decision tree can be ordered so that a decision can be reached quickly. Without statistical data, it is only possible to factor out common subexpression.

The order in which the entities are searched is not really important. On the other hand it can be advantageous to sort it such that the entities at indices 0 to i all have a certain tag, while the rest doesn’t. This reduces the n when searching for this specific tag (in the decision tree, this should be the first test). This can be expanded to multiple levels, but this complicates things and takes O(2^k) memory with k levels. With multiple levels, the tags with the highest gain should be decided first, where the gain is the number of entities that don’t have to be looked up times the probability of discarding them. The gain becomes maximal for 50:50 chances or when 50% of the entities have this specific tag. This would allow you to optimize even if the access patterns are not known.

You could also create sets that index the entities by each tag used – one set with all entities for T1, the next for T2. An obvious (space and time) optimization is to stop when a set contains more than half of all elements, and to save those elements that don’t have this tag – this way, building indices for all tags will take less than ½ · n · t space (with t tags in total). Note that saving complementary sets can make other optimizations more difficult. Again, I would (sorted) arrays for the sets.

If you also represent your entities through an integer range, you can compress the space used for the index sets by only storing the starting and ending member of a continuous range. Implementation-wise this would probably be done with a high bit to indicate whether an entry is a range bound or regular entry.

If we now have index sets (and thus statistics on the tags) we can optimize our predicates so that unlikely properties are tested first (fail-fast strategy). This means that if T1 is common and T2 is rare, then the predicate T1 & T2 should be evaluated by iterating through all T2 index set entries and testing each element for T1.

If we use sorted arrays to implement the index sets, then many evaluation steps can be implemented as merging operations. T1 & T2 means that we take the T1 and T2 lists, allocate a target array the size of the larger inputs and perform the following algorithm until both inputs are empty: If T1[0] < T2[0], then T1++ (discard the head). If T1[0] > T2[0] then T2++. If both heads are equal, then copy that number over to the target array, and increment all three pointers (T1, T2, target). If the predicate is T1 | T2, then no elements are discarded but the smaller one copied over. A predicate of the form T1 & ¬T2 can also be implemented using a merging strategy, but ¬T1 or T1 | ¬T2 can’t.

This should be considered when ordering the predicate decision tree: Complements should either occur on the RHS of an &, or at the end, when the final count is being determined and the actual elements don’t have to be looked at.

Without using index sets, each thread can filter through its part of the entities and return the count of elements that match the predicate, which can be summed up. When using index sets, then each thread would be assigned one node in the decision tree. It takes two input streams which correspond to the ordered sets, and return a merged stream. Notice that each node in the decision tree has a corresponding set that represents all entities which fulfill that sub-expression, and that due to the ordering of the sets, it isn’t necessary to know the whole set at once in order to merge them.

Different strategies like merging indexed sets or filtering through a list of entities can be combined to a certain degree. Filtering has very predictable performance. If a query is very specific so that the usage of index sets reduces the search space drastically, then merge operations could be better. It is important to note that merging many large input sets can result in much worse performance than brute-force searching. A very optimized algorithm will pick a suitable strategy based on the input size, query structure and statistical indicators.

As an aside, caching partial results can be beneficial if it is expected that similar queries will be run in the future, even though they don’t speed up the inital run.

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: softwareengineering - @ 16:42

Thẻ: algorithms, optimization

Thiết kế website giá rẻ

Danh mục

Algorithm for fast tag search