i’m trying to automate some aspect of my small business mainly with Python.
Where i live when a product is sold to a business a stadardized .xml invoice is created and sent to the buyer.
From these files i created a script to extract the data i need and then it gets divided in columns and then sent to Excel. The main columns are:
Extracted columns
Example of extracted table
Where Data is the Date, Denominazione is the sellers name, Descrizione is the product desctiption, Quantita is the quantity, UnitaMisura is the unit of measurement (usually KG or unit), PrezzoUnitario is the single item price, PrezzoTotale is the toal price so Quanity x PrezzoUnitario.
The format is pretty standard and i can extract these informations quite accurately; now the main problem is that in the product Description the various selles write pretty much whatever they want and however they like so the product description might have only the first part of the word, maybe it’s shortened maybe is written slightly wrong and various other problems like these. I’d like to create a “mapping” file in excel (or some other way) where i have a list of the names of all the products i buy usually with an assigned Macro Category for each word so for example Apple – Fruit, Banana – Fruit, Chicken – Meat, Potato – Vegetable and so on (everything would be in italian but to make it understandable i’ll use english to explain). Now when i buy any product i want to match the product description against my mapping file and assign a category to each product so that i can automate this process. I’m not expecting it to work with a 100% accuracy but i’d like to remove the bulk of the work.
I was thinking of a mix of exact text matching and if that doesen’t find any matches then use fuzzy matching but i’m not sure it’s the best way to do this. Also this way there would be problems if a certain word is contained in another like Applepie isn’t a fruit but it contains Apple which is.
Maybe i should add another “mapping” with a probability for the product category based on the sellers name since 80% of the sellers have only a certain number of categories (it’s unlikely a seller will sell fruit, vegetables, meat and also computer parts or pens).
Maybe my thought process is completly wrong or useless so i need a bit of advice to move forward.
This is what i did with Python (3.11) using fuzzywuzzy library but it works probably 30% of the time so it’s probably not a good approach.
def categorize_description(description, keyword_dict, threshold=80):
description = description.lower()
# Exact word boundary matching
for category, keywords in keyword_dict.items():
for keyword in keywords:
pattern = r'b' + re.escape(keyword) + r'b' # Using b for word boundaries
if re.search(pattern, description):
return category, 'Parola esatta', 100, keyword # Return exact match type with 100% similarity and matched keyword
# Fuzzy matching
# Create a combined list of all keywords
all_keywords = [(keyword, category) for category, keywords in keyword_dict.items() for keyword in keywords]
# Extract the best match
best_match = process.extractOne(description, [kw[0] for kw in all_keywords], scorer=fuzz.partial_ratio)
if best_match and best_match[1] >= threshold:
matched_keyword = best_match[0]
matched_category = next(cat for kw, cat in all_keywords if kw == matched_keyword)
return matched_category, 'Parola approssimata', best_match[
1], matched_keyword # Return fuzzy match type with similarity percentage and matched keyword
return 'Sconosciuto', 'NA', 0, '' # Return no match type with 0% similarity and empty matched keyword
1