I am working on a recipe classification system and am struggling with preprocessing my data. The data comes from Food.com, and I want every ingredient in singular form to reduce the number of unique ingredients, but I am not sure how to do that or which library to use. It is tricky because some ingredient names are very long, e.g. “del monte crushed tomatoes with mild green chilies”, and others include specific brand names, e.g. “baker’s angel flake sweetened coconut”.
After reading in the ingredients and inspecting them, I first removed all “, ’ and [] characters, which leaves the data in the form shown in the examples above (the second image shows the data after this cleaning step). I intentionally did not remove numbers and dots, since there are ingredients like “a.1. steak sauce” and “1% low-fat chocolate milk”. What I am struggling with now is converting the terms to singular. So far I have around 14000 unique ingredients, and I expect that number to drop by roughly half once everything is in singular form. Some names are very distinctive, so I tried a mapping function that converts, e.g.:
‘8-inch 97% fat free flour tortillas’: ‘8-inch 97% fat-free flour tortilla’
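A minimal sketch of the cleaning and mapping steps described above. The names `clean_ingredient`, `MANUAL_MAP` and `apply_mapping` are hypothetical, and lowercasing is an assumption added for illustration. Note that stripping every single quote also removes apostrophes inside words.

```python
# Characters removed during cleaning: straight quotes and square brackets.
# Digits, dots, '%' and '-' are kept on purpose ("a.1. steak sauce", "1% low-fat ...").
# Caveat: this also strips apostrophes inside words.
_STRIP_TABLE = str.maketrans("", "", "\"'[]")

def clean_ingredient(raw: str) -> str:
    """Remove quotes/brackets, trim whitespace, lowercase (lowercasing is an assumption)."""
    return raw.translate(_STRIP_TABLE).strip().lower()

# Hand-written corrections for distinctive names (hypothetical dictionary).
MANUAL_MAP = {
    "8-inch 97% fat free flour tortillas": "8-inch 97% fat-free flour tortilla",
}

def apply_mapping(name: str) -> str:
    """Return the mapped name if one exists, otherwise the name unchanged."""
    return MANUAL_MAP.get(name, name)

print(apply_mapping(clean_ingredient("['8-inch 97% fat free flour tortillas']")))
```

The dictionary lookup falls back to the original name, so unmapped ingredients pass through untouched. That makes the gap obvious: every new variant needs its own entry.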
Authors are also inconsistent: instead of “8-inch”, some write 8 followed by a double-quote inch mark (8"), which complicates the process a lot. I cannot go through all 14000 ingredients and adjust them by hand with the mapping function. And since the recipe classifier should also be able to classify completely new recipes, I need a way to integrate the preprocessing into the system itself.
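To make the goal concrete, here is a rough sketch of the kind of reusable preprocessing function I have in mind. It rewrites the inch mark to “-inch” and then applies a naive suffix heuristic to each purely alphabetic token. The heuristic is a stand-in I made up for illustration, not a library call, and all names (`normalize_inches`, `naive_singular`, `IRREGULAR`, `preprocess_ingredient`) are hypothetical. It will mis-handle plenty of words (e.g. “molasses”), which is why I am asking about a proper library.

```python
import re

# A number followed by an inch mark (straight quote, curly quote, or double prime).
_INCH_MARK = re.compile(r'(\d+(?:\.\d+)?)\s*["”″]')

# Small exception table for words the suffix rules get wrong (illustrative only).
IRREGULAR = {"chilies": "chili", "leaves": "leaf", "loaves": "loaf"}

def normalize_inches(name: str) -> str:
    """Rewrite e.g. 8" as 8-inch."""
    return _INCH_MARK.sub(r"\1-inch", name)

def naive_singular(word: str) -> str:
    """Very rough English singularization based on suffixes; not linguistically complete."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if len(word) <= 3 or word.endswith(("ss", "us", "is")):
        return word  # e.g. "swiss", "hummus", "anis"
    if word.endswith("ies"):
        return word[:-3] + "y"  # berries -> berry
    if word.endswith(("ches", "shes", "xes", "oes")):
        return word[:-2]  # peaches -> peach, tomatoes -> tomato
    if word.endswith("s"):
        return word[:-1]  # tortillas -> tortilla
    return word

def preprocess_ingredient(name: str) -> str:
    """Lowercase, normalize inch marks, singularize alphabetic tokens only."""
    name = normalize_inches(name.lower())
    tokens = [naive_singular(t) if t.isalpha() else t for t in name.split()]
    return " ".join(tokens)

print(preprocess_ingredient('8" 97% fat free flour tortillas'))
print(preprocess_ingredient("del monte crushed tomatoes with mild green chilies"))
```

Tokens containing digits, dots, hyphens or apostrophes (such as “a.1.”, “1%”, “low-fat”, “baker’s”) are left alone. Because everything goes through a single function, the same call could be applied to the training data and to new recipes at prediction time.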