I am exploring some of the more esoteric aspects of string comparisons at the moment (I went down a bit of a rabbit hole and there doesn’t seem to be an end in sight!).
I’d like to know how to compare a string containing ligature characters to a canonical non-ligature version (imagine an application for French language learning that lets the user type in ‘oeuf’ or ‘œuf’ interchangeably). I think this is called ‘folding’, but I could be wrong.
I’ve tried normalizing my strings using NFKD, which I thought would decompose the character into its constituent parts, but only some Unicode codepoints support decomposition. (Of course, my example character ‘œ’ doesn’t, which resulted in much hair-pulling.)
For example:
using System; // Console
using System.Text; // NormalizationForm
using System.Globalization;
// This character does not support decomposition.
string str1 = "\u0153"; // œ (LATIN SMALL LIGATURE OE)
string str2 = "oe";
string str1norm = str1.Normalize(NormalizationForm.FormKD);
Console.WriteLine(str1norm.IsNormalized()); // True
Console.WriteLine(str1norm.Equals(str2)); // False
Console.WriteLine(str1norm.Length); // 1
// This character supports decomposition.
str1 = "\uFB06"; // ﬆ (LATIN SMALL LIGATURE ST)
str2 = "st";
str1norm = str1.Normalize(NormalizationForm.FormKD);
Console.WriteLine(str1norm.IsNormalized()); // True
Console.WriteLine(str1norm.Equals(str2)); // True
// (Non-normalized comparison.) True in most locales, but not all (see below)
Console.WriteLine(str1.Equals(str2, StringComparison.CurrentCultureIgnoreCase));
Console.WriteLine(str1norm.Length); // 2
References for the two Unicode characters:
- https://www.compart.com/en/unicode/U+0153
- https://www.compart.com/en/unicode/U+FB06
If only some ligatures are decomposable, I have a couple of questions:
- How do I determine this without manually checking through all the Unicode code points?
- Can I do a string comparison to a canonical version of the string without having to create a dictionary of all the characters and their decomposed forms? I mean, I like typing, but not that much.
- (Bonus round.) Why are some multi-character code points decomposable (like the ligature ‘st’ example), and some not? Is there something special about characters like ‘œ’? Who do I have to bribe to get these exceptions sorted out?
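For the first question, one rough way to probe this without reading the Unicode tables by hand is to normalize each code point on its own and check whether NFKD changes it. The sketch below does that for a handful of Latin ligatures. DecomposesUnderNfkd is a hypothetical helper name (not a .NET API), the sample list is only illustrative, and it assumes the runtime's normalization data is available (i.e. not running in invariant-globalization mode).

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of .NET: returns true when NFKD changes the
// code point, i.e. Unicode gives it a canonical or compatibility decomposition.
static bool DecomposesUnderNfkd(int codePoint)
{
    // ConvertFromUtf32 throws for surrogate values (U+D800..U+DFFF),
    // so a full scan of the code space would need to skip that range.
    string s = char.ConvertFromUtf32(codePoint);
    string decomposed = s.Normalize(NormalizationForm.FormKD);
    return !string.Equals(s, decomposed, StringComparison.Ordinal);
}

// Illustrative samples: œ, æ, ĳ, ﬁ, ﬆ
int[] samples = { 0x0153, 0x00E6, 0x0133, 0xFB01, 0xFB06 };

foreach (int cp in samples)
{
    Console.WriteLine($"U+{cp:X4} decomposes under NFKD: {DecomposesUnderNfkd(cp)}");
}
```

This only says whether a decomposition exists in the Unicode data; it does not supply a mapping for characters such as ‘œ’ that have none.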
As a final illustration of my “what-on-earth-is-going-on?” mindset at the moment, ‘ﬆ’ is equivalent to ‘st’ without normalizing in 862 locales on my PC. The string comparison fails in only 7 locales:
- aa (Afar language)
- aa-DJ (Afar – Djibouti)
- aa-ER (Afar – Eritrea)
- aa-ET (Afar – Ethiopia)
- en-US-POSIX (US English POSIX settings)
- ssy (Saho language)
- ssy-ER (Saho – Eritrea)
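A per-locale survey like this can be sketched roughly as follows. This is a minimal sketch, assuming a case-insensitive, culture-sensitive comparison through each culture's CompareInfo; the variable names are made up for illustration, and the counts depend on the OS, the .NET version, and the installed globalization data, so they will not necessarily match the numbers above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

string ligature = "\uFB06"; // ﬆ (LATIN SMALL LIGATURE ST)
string plain = "st";

var equalIn = new List<string>();
var differentIn = new List<string>();

// Compare the two strings under every culture the runtime knows about.
foreach (CultureInfo culture in CultureInfo.GetCultures(CultureTypes.AllCultures))
{
    bool equal = culture.CompareInfo.Compare(ligature, plain, CompareOptions.IgnoreCase) == 0;
    (equal ? equalIn : differentIn).Add(culture.Name);
}

Console.WriteLine($"Equal in {equalIn.Count} cultures, different in {differentIn.Count}.");
Console.WriteLine("Different in: " + string.Join(", ", differentIn));
```

The invariant culture shows up in this enumeration with an empty name.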
This is all very interesting, but the sheer apparent arbitrariness of it all is a bit overwhelming 🙂 As far as I know, Afar and Saho have Latin-based written forms, but I assume their Unicode code pages are less popular and may just lack the entries for ligatures which are present in others.
Finally finally, the single-character ‘ﬆ’ and its decomposed form ‘st’ are equal in the .NET Invariant culture, but only if case is ignored:
//...continuing from prior code...
Console.WriteLine(str1.Equals(str2, StringComparison.InvariantCulture)); // False
Console.WriteLine(str1.Equals(str2, StringComparison.InvariantCultureIgnoreCase)); // True
This is puzzling because both strings are lowercase.
(I know, “Just normalize your string and forget you ever saw this, Dave.” But I still think it’s interesting to know why this happens.)
Thanks for your time.