I have a CSV file with multiple categorical columns, but most of these columns contain messy data due to typing mistakes (e.g., ‘spciulated’, ‘SPICULATED’, etc. for the category ‘spiculated’ of the column ‘margins’). Is there a standard way to deal with such situations?
To be precise, I would like to read the CSV file directly into a clean DataFrame with a dtype category for the categorical columns, but with all variants collapsed into one category (e.g., each variant of ‘spiculated’ would be read as ‘spiculated’). The spelling variants could be given by a dict, for instance.
Expected solution:
import pandas as pd
FEAT_VALS = {
"margins": {
"spiculated": ["spiculated", "spiiculated", "SPICULATED"],
"circumscribed": ["circumscribed", "cicumscribed"],
}
}
# somehow give FEAT_VALS to read_csv
df = pd.read_csv('test.csv', dtype='category')
df.margins
where test.csv is:
margins
spiculated
spiiculated
SPICULATED
circumscribed
cicumscribed
to obtain:
0 spiculated
1 spiculated
2 spiculated
3 circumscribed
4 circumscribed
Name: margins, dtype: category
Categories (2, object): ['circumscribed', 'spiculated']
However, without the spelling variants information, I get:
0 spiculated
1 spiiculated
2 SPICULATED
3 circumscribed
4 cicumscribed
Name: margins, dtype: category
Categories (5, object): ['SPICULATED', 'cicumscribed', 'circumscribed', 'spiculated', 'spiiculated']
My current solution: Looks like this
df2 = pd.read_csv('test.csv')
for feat, feat_vals in FEAT_VALS.items():
for enc_val, str_vals in feat_vals.items():
df2.loc[df2[feat].isin(str_vals), feat] = enc_val
df2.margins = df2.margins.astype('category')
Alek Fröhlich is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
6
You could flip the inner dicts then use .map()
. This isn’t as direct as you want, but at least the code using Pandas is cleaner.
for feat, feat_vals in FEAT_VALS.items():
feat_strs = {
str_val: enc_val
for enc_val, str_vals in feat_vals.items()
for str_val in str_vals
}
df[feat] = df[feat].map(feat_strs).astype('category')
# For demo
print(df[feat].cat.categories)
Output:
Index(['circumscribed', 'spiculated'], dtype='object')
1
There are three things to consider in your question:
-
Make the strings somewhat comparable (for eg. remove the accents and use the same case).
What is the best way to remove accents (normalize) in a Python unicode string? -
Select a fitting distance algorithm/library.
looking for python library which can perform levenshtein/other edit distance at word-level -
Create a DataFrame from the transformed input.
Create a pandas DataFrame from generator?
Here’s a sample implementation that uses the Levenshtein distance for comparisons:
import csv
import unicodedata
import Levenshtein
import pandas as pd
def normalize_string(str):
return ''.join(
char for char in unicodedata.normalize('NFD', str)
if unicodedata.category(char) != 'Mn'
).lower()
def parse_csv(filename, categories = ["spiculated", "circumscribed"]):
with open(filename) as f:
reader = csv.DictReader(f)
normalized_categories = [normalize_string(str) for str in categories]
for row in reader:
min_distance = float('inf')
normalized_margins = normalize_string(row["margins"])
for i,normalized_category in enumerate(normalized_categories):
distance = Levenshtein.distance(
normalized_margins,
normalized_category
)
if distance < min_distance:
min_distance = distance
row["margins"] = categories[i]
yield row
df = pd.DataFrame(data=parse_csv("test.csv"), dtype='category')
print(df.margins)
output:
0 spiculated
1 spiculated
2 spiculated
3 circumscribed
4 circumscribed
Name: margins, dtype: category
Categories (2, object): ['circumscribed', 'spiculated']
1