I am wondering if there is a known algorithm for solving the following.
Data: There is a number of sets containing items. An item can be a member of more than one set.
Aim: return the minimum number of sets which between them contain all of the items, and avoid returning duplicate items as much as possible. For example:
- Set 1: a
- Set 2: a, b
- Set 3: b, c
For this I would want to return sets 1 and 3. This is the fewest sets which combined contain at least 1 copy of all the items and duplicate items are kept to a minimum.
BTW – I will be looking to implement this in C#.
4
Your problem is, I believe, NP-complete, so unless the number of sets you are dealing with is small, there isn’t a guaranteed accurate solution that will run in a realistic time. You may plausibly be able to get accurate results by a global search (basically listing all possible combinations of sets and checking which one gives the best results) for up to about 20 sets, but not much larger problems than that.
Your requirement to minimize duplication makes it slightly different to the set cover problem as linked in the comments, and it may be that the approximate solution described in that article (essentially iterating the process of picking the choice that brings you closest to a solution until you have complete cover) doesn’t work well enough for you as it entirely ignores your extra requirement.
If this is the case, I’d take an approach of using this algorithm to generate a first candidate solution, and then using simulated annealing in a similar manner to how it is applied to the traveling salesman problem as a final optimization step. I think this should result in something close to an optimal solution most of the time.