I want to identify most matching sentence using some pattern. That means by using java algorithm I want to create identical value for each sentences.Each sentence when entering to that algorithm can be out some kind of identical value. How can I develop it? What can I refer any web sites do you know? What sort of sites should I look for? Actually I want clarify about when I’m giving as ex: 5 sentence to algorithm that possible to generate some kind of 5 values.Then I compare with those values with previously generated values(I should be store those values in my database) and get the gap between new 5 value and previously stored values.Then I get the distance and I selected most suitable sentence as most lowest gap value.
I’m use those things for my machine translation tool.
As an example we think using my ruled based translation model generate 2 sentences.
1. I want eat an apple.
2. I want eat a house.
In my corpus we think more sentences include and I store values for sentences in my database. (Value assign part is I don’t know yet)
I want to create java algorithm to assign value for each sentence. As an example if we think
Sentence 1 value: 250.8
Sentence 2 value : 290.5
Database included values 248, 400,800
Then I got the difference.
So we can see here most minimum difference get for 250.8 and most suitable sentence is 1 one.
17
If you want to create unique value for each and every sentence, then don’t even try, because thanks to Pigeonhole principle, you are guaranteed to get collisions and thus the identifiers won’t be unique. You could limit the input space accordingly, but in this case the algorithm looses it’s purpose.
If you are looking for way to create identifier than might indicate that sentences are be equal, but not guarantee it, then hashing is practically created for this purpose. This the allows you to check if the sentences might be equal, but you still have to run the equality function to guarantee it.
3