I’m cleaning a column of Spanish text using the following function, which uses re and unicodedata:
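
    import re
    import unicodedata

    def CleanText(texto: str) -> str:
        texto = texto.lower()
        # Decompose accented characters and drop the combining marks (á -> a, ñ -> n)
        texto = ''.join(c for c in unicodedata.normalize('NFD', texto) if unicodedata.category(c) != 'Mn')
        # Keep only lowercase letters, digits, spaces, newlines, periods and commas
        texto = re.sub(r'[^a-z0-9 \n.,]', '', texto)
        # Make sure every period or comma is followed by whitespace
        texto = re.sub(r'([.,])(?![\s])', r'\1 ', texto)
        # Collapse whitespace runs, then drop the punctuation itself
        texto = re.sub(r'\s+', ' ', texto).strip()
        texto = texto.replace('.', '')
        texto = texto.replace(',', '')
        return texto
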
I then apply it to my DataFrame using:

    (
        df
        .with_columns(
            pl.col("Comment").map_elements(CleanText, return_dtype=pl.String).alias("CleanedText")
        )
    )

However, since Polars handles regular expressions through the Rust regex crate, I think I could do the cleaning with Polars expressions alone, without needing an auxiliary function.
What are your thoughts on this?
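
One thing I would have to watch out for: the Rust regex crate does not support look-around, so the `(?!\s)` negative look-ahead in my function could not be reused as-is. Below is a minimal sketch, in plain Python so it can be checked without Polars, of a look-ahead-free variant that pads every period or comma with a space and lets the later whitespace collapse absorb the extra one. The function names and sample strings are hypothetical, made up only for this comparison.

```python
import re
import unicodedata


def clean_with_lookahead(texto: str) -> str:
    """Reference behaviour: the pipeline from the question, look-ahead included."""
    texto = texto.lower()
    texto = ''.join(c for c in unicodedata.normalize('NFD', texto)
                    if unicodedata.category(c) != 'Mn')
    texto = re.sub(r'[^a-z0-9 \n.,]', '', texto)
    texto = re.sub(r'([.,])(?!\s)', r'\1 ', texto)
    texto = re.sub(r'\s+', ' ', texto).strip()
    return texto.replace('.', '').replace(',', '')


def clean_without_lookahead(texto: str) -> str:
    """Hypothetical variant that avoids look-around entirely."""
    texto = texto.lower()
    texto = ''.join(c for c in unicodedata.normalize('NFD', texto)
                    if unicodedata.category(c) != 'Mn')
    texto = re.sub(r'[^a-z0-9 \n.,]', '', texto)
    # Pad EVERY period/comma with a space; if whitespace already followed,
    # the next step collapses the run back to a single space.
    texto = re.sub(r'[.,]', r'\g<0> ', texto)
    texto = re.sub(r'\s+', ' ', texto).strip()
    return texto.replace('.', '').replace(',', '')


# Made-up sample inputs, including a few awkward punctuation cases
samples = [
    "¡Hola, Señor! ¿Cómo está? Bien.gracias",
    "a..b",
    "a ,  b",
    "uno.\ndos , tres.",
    "",
]

for s in samples:
    assert clean_with_lookahead(s) == clean_without_lookahead(s)

print(clean_without_lookahead(samples[0]))
```

If the two versions agree, the second set of patterns is what I would try with Polars' `str.replace_all`, together with `str.to_lowercase` and `str.strip_chars`. The regex crate writes the whole-match reference as `$0` or `${0}` rather than `\g<0>`, and the accent stripping would still need a Polars-side replacement for `unicodedata`.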