I am an awful coder and have no idea what I am doing most of the time, but coding is very useful and timesaving in manipulating and managing linguistic data, so I try.
My big question is whether there is a way to make .sort() insensitive to diacritics (accent marks such as á or ô). Currently, when I sort, I get “aeiou” before “áéíóú”, when I want “[aá][eé][ií][oó][uú]” or “aáeéiíoóuú”. This is my code, though I doubt it’s relevant to the question (the locale line was an attempt at a fix suggested by Google’s Gemini):
import os
import locale
import polars as pl
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8') # set the locale to en_US.UTF-8
# print the current working directory
print("Current working directory:", os.getcwd())
# Make paths for the lexicon and the sorted lexicon
LexiconPath = r'C:UsersmawanDocumentsCodeWorldBuildingCodeKalagyonManNyal_Lexicon.csv'
UnsortedLexiconPath = r'C:UsersmawanDocumentsCodeWorldBuildingCodeKalagyonManNyal_Lexicon_Sorted.csv'
# read the csv file name KalagyonManNyal_Lexicon and make it a DataFrame
df = pl.read_csv(LexiconPath)
df = df.fill_null("-") # fill null values with a dash (fill_null returns a new DataFrame, so the result must be reassigned)
# print the first 5 rows of the DataFrame
print(df.head(5))
# sort the DataFrame by the column "Lexeme" in alphabetical order
df_sorted = df.sort("Lexeme")
print(df_sorted)
df_sorted.write_csv(UnsortedLexiconPath, include_bom=True) # write the sorted DataFrame to a new csv file (returns None when given a path, so there is nothing to assign)
# print(df.filter(df.is_duplicated())) # print duplicated rows
# print(df.columns)
# make a dataframe that filters by the part of speech input by the user
PoS = input("Enter the part of speech you want to filter by: ")
df_PoS = df.filter(df['PoS'] == PoS)
print(df_PoS)
So what I am looking for is either diacritic-insensitive sorting or custom sorting instructions, so that I can make some diacritics count as separate letters while others don’t.
I tried using locale.setlocale(locale.LC_ALL, 'en_US.UTF-8') after Gemini, the generative AI, suggested it, but it didn’t seem to change anything.
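To make the desired behaviour concrete, here is a minimal sketch of the kind of sort key described above, using only Python's standard unicodedata module rather than polars. The names fold, sort_key, and the keep parameter are hypothetical helpers made up for illustration, and the word lists are invented test data, not entries from the lexicon.

```python
import unicodedata


def fold(text, keep=frozenset()):
    """Strip diacritics from text, except on characters listed in `keep`.

    `keep` is a hypothetical parameter: the accented letters that should
    still count as separate letters.
    """
    out = []
    # NFC first, so accented letters are single precomposed characters
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            # NFD splits "á" into "a" plus a combining accent; drop the accent
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def sort_key(word, keep=frozenset()):
    # Primary key: folded, case-insensitive form.
    # Secondary key: the original word, so "a" still comes before "á".
    return (fold(word, keep).casefold(), word)


# Invented example data
vowels = ["u", "á", "e", "ú", "a", "é"]
insensitive = sorted(vowels, key=sort_key)

words = ["ôa", "ob", "oa"]
merged = sorted(words, key=sort_key)
separate = sorted(words, key=lambda w: sort_key(w, keep={"ô"}))
```

With an empty keep set, accented vowels sort next to their base letters. A letter placed in keep is left untouched and therefore sorts by its code point, which for ô means after z; putting it directly after o would need an explicit alphabet order instead. In polars, one possible way to use this (an assumption, not tested against the lexicon) would be to compute the folded form as a helper column, for example with map_elements, and sort on that column followed by "Lexeme".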