How would you go about parsing a sentence like “Bought two kilos of steak from Acme supermarket” into a data structure like the following JSON representation?
{
    item: { name: "steak", tags: "meat,beef" },
    quantity: { value: 2, unit: "kg" },
    source: "ACME Supermarket"
}
I’m looking for a high-level conceptual overview of where to start, e.g. some papers or introductory material that don’t require PhD-level knowledge. For context, this is part of the preliminary investigation for a personal expense tracker I’m planning to build.
To break it down a little further, I’m interested in basic named entity recognition and categorization strategies. I don’t have a CS education, so you may want to keep that in mind when answering 🙂 I’m not interested in third-party web services, since this is a learning exercise and is intended to work offline. Thanks in advance.
Your best bet is to define your own syntax which appears to be natural (or close-to-natural) language, even though it’s actually much more rigid.
For example, a SQL select statement: SELECT <something> AS <foo> FROM <table> WHERE <something else> IS <a value> (note that I replaced = with IS to make the point, and that the capitalization is not important).
In your case, it would be BOUGHT <quantity> OF <thing> FROM <location>. Then you just have to parse <quantity>, <thing>, and <location> to match them up against known items.
switch (quantityString)
{
    case "a":
    case "1":
    case "a single":
    case "one":
        return 1;
    case "a pair":
    case "2":
    case "two":
        return 2;
    // ...
}
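To make the idea a bit more concrete, here is a minimal Python sketch of how the rigid template and the lookups could fit together. It assumes the quantity is always written as a number word followed by a unit ("two kilos"). The names NUMBERS, UNITS, KNOWN_ITEMS, PATTERN, and parse_purchase are made up for illustration, and the lookup tables are tiny placeholders rather than a real vocabulary.

```python
import re

# Hypothetical lookup tables; a real tracker would need many more entries.
NUMBERS = {"a": 1, "1": 1, "a single": 1, "one": 1,
           "a pair": 2, "2": 2, "two": 2}
UNITS = {"kilo": "kg", "kilos": "kg", "kg": "kg"}
KNOWN_ITEMS = {"steak": "meat,beef"}

# Rigid template: BOUGHT <quantity> OF <thing> FROM <location>
PATTERN = re.compile(
    r"^bought\s+(?P<quantity>.+?)\s+of\s+(?P<thing>.+?)\s+from\s+(?P<location>.+)$",
    re.IGNORECASE,
)

def parse_purchase(sentence):
    """Return a dict for a sentence that fits the template, else None."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None  # the sentence does not follow the template
    # Assumption: the last word of <quantity> is the unit, the rest is the number.
    number, _, unit = match["quantity"].lower().rpartition(" ")
    thing = match["thing"].lower()
    if number not in NUMBERS or unit not in UNITS or thing not in KNOWN_ITEMS:
        return None  # some part is missing from the lookup tables
    return {
        "item": {"name": thing, "tags": KNOWN_ITEMS[thing]},
        "quantity": {"value": NUMBERS[number], "unit": UNITS[unit]},
        "source": match["location"],
    }

print(parse_purchase("Bought two kilos of steak from Acme supermarket"))
```

Returning None for anything outside the template or the tables keeps the sketch simple; a real tool would probably want to tell the user which part it failed to recognize.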
You could even make the first word variable input too, but only from a select set of known verbs (the way SQL has SELECT, DELETE, INSERT, etc). You could handle derivatives of them (“purchased” vs “bought”) similarly.
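One possible way to handle a small set of verbs and their variants is a lookup table that maps each accepted word to a canonical command. The sketch below is only an illustration: VERBS and split_command are hypothetical names, and the "sold" entry is an invented example of a second command, not something from the question.

```python
# Hypothetical table: every accepted verb maps to one canonical command.
VERBS = {
    "bought": "BOUGHT",
    "purchased": "BOUGHT",
    "sold": "SOLD",  # invented second command, for illustration only
}

def split_command(sentence):
    """Return (canonical_verb, rest_of_sentence), or None for an unknown verb."""
    words = sentence.strip().split(None, 1)  # first word, then the remainder
    if not words:
        return None  # empty input
    verb = VERBS.get(words[0].lower())
    if verb is None:
        return None  # the first word is not a known verb
    rest = words[1] if len(words) > 1 else ""
    return verb, rest
```

With this in place, "Purchased two kilos of steak" and "Bought two kilos of steak" both reduce to the BOUGHT command, so the rest of the parser only has to deal with one form.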
This would be an example of defining a Domain-Specific Language, as @Robert-Harvey mentioned.