I’m not sure if this process has a name.
I have some words (about 9,000). They are in Japanese, but I’ll try to explain this using English words. I want to categorize the words by the components (in English, letters).
A
B
C
act bar play
This should create:
A: play
B: bar
C: act
Now, ‘a’ appears in all 3 words, but I want to make sure that each category (letter) has at least word. Now, it would make sense to delete a word after it’s used, but there are a few cases where 2 letters make up one word and that’s each letter’s only word–so I’d like to account for that somehow.
Is there an approach for solving this aside from brute force? Dynamic programming perhaps? Even a name for this process (if it exists) would be great.
3
Lucene is a powerful library used to do a variety of text searching and stemming and it might be worth taking a look at. The typical use case is full-text search but it is pretty component based under the hood so you could certainly build a custom analyizer or stemmer to get to where you wanted to be.
You can express this as a maximum flow problem. A maximum flow problem optimizes a flow in a network. It can be used to solve assignment problems. You want to assign words to categories.
You need to build a graph as follows:
Each category and each word is a node, to which you have to add a start node and an end node.
From the start node, add an edge going to each category.
From each category add an edge going to each word that is compatible with the category.
From each word, add an edge going to the end node.
All edges have capacity 1.
In this graph you want to maximuze the flow going from the start node to the end node. The Ford-Fulkerson is especially interesting because it works with integer weights and returns a solution in integers.
The solution assigns one word to each category just follow for each category which edge has a positive flow. If it is impossible to assign a word to all categories, it will return the assignment that maximizes the number of categories matched with a word.
If you have more words than categories, or any words left over, you can assign the remaining words to the first matching category.
This will guarantee that a maximum of categories have at least one word matched.
2