So i am currently working a machine-translation-like project, where i have a transformer with the encoder-decoder structure, which is supposed to generate SQL queries from natural language commands, example:
Input: Count the number of aircrafts produced by the company XYZ
Output: SELECT COUNT(*) FROM aircrafts WHERE manufacturer=’XYZ’;
Now i have kind of achieved my goal and my model can generate decent looking queries in most of cases, but there’s still one problem to solve, which is that the queries it could generate are all SYNTACTICALLY correct but it doesn’t nail the attributes/tables names, for example:
Input: Count the number of aircrafts produced by the company XYZ
Output: SELECT COUNT(*) FROM aircrafts WHERE produced=’XYZ’;
or:
Input: Show all the employees older than 40
Output: SELECT * FROM employee WHERE country > 40;
So like i said, syntactically-speaking the query has no errors but it sure cannot be executed correctly on the database itself.
My model is trained on a json dataset that contains a list of samples who follows the following form:
[
[
"Count the number of aircraft produced by company XYZ",
"SELECT COUNT(*) FROM aircraft WHERE manufacturer = 'XYZ';"
],
[
"How many marine species are found in the Atlantic Ocean?",
"SELECT COUNT(*) FROM marine_species WHERE location = 'Atlantic Ocean';"
]
]
Although i found some datasets that contains the schema of the database as a set of CREATE queries like this one:
{"instruction": "CREATE TABLE table_72445 (
"County" text,
"Population" real,
"Per capita income" text,
"Median household income" text,
"Median family income" text
)
-- Name the median family income for riverside",
"output": "SELECT "Median family income" FROM table_72445 WHERE "County" = 'Riverside'"}
So is there anyway i can leverage this kind of dataset so i can give my transformer the database context it should work on, or is there any other way to achieve this task goal?
NOTE: i cannot make the previous corpus as a whole my input, because it will result in a huge performance-overhead. Imagine defining the very SAME set that contains tens of tables every single time that i want to execute, so this is no choice to take.
Thank you all in advance.