I have two tables of data. In one of them the Katakana-Hiragana Voiced Sound Mark is part of the preceding character; in the other it is a separate symbol. I need to match values between the two tables. Unicode equivalence should handle these cases, but unexpectedly U+309B (Katakana-Hiragana Voiced Sound Mark) is decomposed into U+0020 (space) and U+3099 (Combining Katakana-Hiragana Voiced Sound Mark). The space prevents U+3099 from combining with the preceding character.
Example:
From one table I get the value ジ (U+30B8). I perform the NFKC transformation: U+30B8 is decomposed as U+30B7 and U+3099 and then composed back to U+30B8.
From the other table I get the value シ゛ (U+30B7 and U+309B). I perform the NFKC transformation: (U+30B7 U+309B) is decomposed as (U+30B7 U+0020 U+3099), and (U+30B7 U+3099) is not composed back to U+30B8 because of the space in between. So I'm left with シ ゙ (U+30B7 U+0020 U+3099) and I can't match this value with ジ (U+30B8) from the previous table.
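For reference, the decomposition mappings described above can be inspected directly. This is a minimal sketch that uses the standard-library `unicodedata` module, on the assumption that it agrees with `unicodedata2` for these code points. The `mappings` dictionary is just an illustrative name.

```python
import unicodedata

# Look up the raw decomposition mapping stored in the Unicode database
# for the precomposed kana and for the spacing voiced sound mark.
mappings = {
    f"U+{ord(ch):04X}": unicodedata.decomposition(ch)
    for ch in ("\u30b8", "\u309b")
}

for code_point, mapping in mappings.items():
    print(code_point, "->", mapping)
```

The mapping for U+309B carries a `<compat>` tag and begins with 0020, which is where the space in the NFKC result comes from.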
How can I get rid of the space in the decomposition of U+309B, and why is it even there?
Here is the Python code:
import unicodedata2
print(f"Unicode code points: {[hex(ord(c)) for c in unicodedata2.normalize('NFKC', 'シ゛')]}")
# Result: Unicode code points: ['0x30b7', '0x20', '0x3099']
print(f"Unicode code points: {[hex(ord(c)) for c in unicodedata2.normalize('NFKC', 'ジ')]}")
# Result: Unicode code points: ['0x30b8']
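One workaround I could imagine, sketched here under assumptions, is to replace the spacing sound marks with their combining counterparts before normalizing. The names `SPACING_TO_COMBINING` and `normalize_kana` are hypothetical helpers, not part of any library, and the standard-library `unicodedata` is assumed to behave like `unicodedata2` for these characters.

```python
import unicodedata

# Hypothetical pre-processing table: spacing sound marks mapped to the
# combining marks that NFKC can compose with the preceding kana.
SPACING_TO_COMBINING = str.maketrans({
    "\u309b": "\u3099",  # voiced sound mark -> combining voiced sound mark
    "\u309c": "\u309a",  # semi-voiced sound mark -> combining semi-voiced sound mark
})


def normalize_kana(text: str) -> str:
    """Swap spacing sound marks for combining ones, then apply NFKC."""
    return unicodedata.normalize("NFKC", text.translate(SPACING_TO_COMBINING))


# U+30B7 U+309B now composes to U+30B8 and matches the other table's value.
print([hex(ord(c)) for c in normalize_kana("\u30b7\u309b")])
```

This assumes every spacing mark in the data is meant to modify the character before it. A mark at the start of a string, or after a character it cannot compose with, would be left as a bare combining mark.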