In Rust, OS file paths are stored using a dedicated Path type instead of a str. That's because str represents a UTF-8 sequence of bytes, while kernels either don't enforce any encoding at all (Unix), as long as the slash is represented by its ASCII code point, or use a 16-bit encoding (Windows).
In Java, paths in the standard library are built from the String type, which exposes its contents as UTF-16 code units. This is not an implementation detail, since String methods such as charAt and length "leak" this encoding by operating on 16-bit code units.
How does Java manage to represent Unix's arbitrary byte-sequence paths as Unicode? I assume it treats paths as ASCII or UTF-8 in order to map "raw" bytes to Unicode code points, since in practice paths are almost always ASCII or UTF-8. But what if there's a sequence of bytes that is neither valid ASCII nor valid UTF-8? Is the conversion to Unicode/UTF-16 lossless? Is there a documented algorithm for going back and forth? Is there anything I should consider when dealing with file paths in Java if I don't want to exclude users of non-Latin alphabets?
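To make the concern concrete, here is a minimal sketch of the round trip I'm worried about. It does not show what the JVM's file APIs actually do internally; it only assumes a plain UTF-8 decode followed by a UTF-8 encode, using the standard String constructor and getBytes. The class name PathRoundTrip and its helper methods are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PathRoundTrip {
    // Decode raw file-name bytes as UTF-8, then encode the String back to bytes.
    // Assumption: this mimics a naive bytes -> String -> bytes conversion,
    // not necessarily what the JVM does for real file names.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // True if the bytes survive the round trip unchanged.
    static boolean isLossless(byte[] raw) {
        return Arrays.equals(raw, roundTrip(raw));
    }

    public static void main(String[] args) {
        // A valid UTF-8 name with non-Latin characters.
        byte[] valid = "日本語.txt".getBytes(StandardCharsets.UTF_8);
        // 0xFF can never appear in well-formed UTF-8, yet Unix allows it in a file name.
        byte[] invalid = { 'a', (byte) 0xFF, 'b' };

        System.out.println(isLossless(valid));   // prints "true"
        System.out.println(isLossless(invalid)); // prints "false"
    }
}
```

With this naive decode, the malformed byte is replaced by U+FFFD, so the original bytes cannot be recovered. Valid non-Latin names round-trip fine; it is only the non-UTF-8 byte sequences that get lost.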