I recently ran into the following problem in Java: I need to limit a filename to 255 bytes when it is encoded as UTF-8.
Given that a single character can take multiple bytes in UTF-8, this is not as simple as:
String sampleString = "컴퓨터";
byte[] bytes = sampleString.getBytes("utf8");
String limitedString = new String(bytes, 0, 5, "utf8");
because the cut can fall in the middle of a character, as it does in the example above, which produces:
컴�
I looked for a good solution but could not find one. ChatGPT suggested using StringBuilder, adding characters one by one and checking whether the limit has been reached, something like this (this isn’t ChatGPT’s code, but my own interpretation):
String sampleString = "컴퓨터";
StringBuilder sb = new StringBuilder();
for (int i = 0; i < sampleString.length(); ) {
    int cp = sampleString.codePointAt(i);
    // build temporary string with the next code point appended
    String temp = new StringBuilder(sb).appendCodePoint(cp).toString();
    if (temp.getBytes("utf8").length > 5) { // convert it back to bytes and check size
        break; // if it does not fit, break
    }
    sb.appendCodePoint(cp); // add that tested character otherwise
    i += Character.charCount(cp); // step over both chars of a surrogate pair
}
and then the result is as expected:
컴
but I see this as a very memory-expensive solution. Perhaps a much more performant one exists out there?
Use the fact that in UTF-8, every byte of a multi-byte code point, except the first byte, starts with binary digits 10xxxxxx. If you are looking at such a byte, you know that it’s part of a multi-byte code point and you need to scan backwards until you find a byte starting 11xxxxxx to find the start of that code point. Therefore, you can chop your byte array at the required limit and then remove any incomplete code point at the end, if there is one:
String sampleString = "컴퓨터";
byte[] bytes = sampleString.getBytes(StandardCharsets.UTF_8);
int limit = 5;
String limitedString;
if (limit >= bytes.length) {
    limitedString = sampleString;
} else {
    // back up while the first dropped byte is a continuation byte (10xxxxxx)
    while (limit > 0 && (bytes[limit] & 0xC0) == 0x80)
        limit--;
    limitedString = new String(bytes, 0, limit, StandardCharsets.UTF_8);
}
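As a rough sketch, the same backward scan can be wrapped in a reusable method. The class name `Utf8Truncate` and the method name `truncateUtf8` are hypothetical, chosen only for illustration, and the sketch assumes the input string is well-formed (no unpaired surrogates).

```java
import java.nio.charset.StandardCharsets;

public class Utf8Truncate {
    // Hypothetical helper: returns the longest prefix of s whose UTF-8
    // encoding fits in maxBytes without splitting a code point.
    static String truncateUtf8(String s, int maxBytes) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (maxBytes >= bytes.length) {
            return s; // already fits
        }
        int limit = Math.max(maxBytes, 0);
        // bytes[limit] is the first byte that would be dropped. If it is a
        // continuation byte (10xxxxxx), the cut falls inside a code point,
        // so back up to that code point's first byte.
        while (limit > 0 && (bytes[limit] & 0xC0) == 0x80) {
            limit--;
        }
        return new String(bytes, 0, limit, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(truncateUtf8("컴퓨터", 5)); // prints 컴
        System.out.println(truncateUtf8("컴퓨터", 6)); // prints 컴퓨
    }
}
```

For the filename case from the question, the call would be `truncateUtf8(name, 255)`.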
UTF-8 is designed such that the 1st byte of an encoded codepoint specifies the number of bytes it actually takes up. So you can simply sum up the codepoint byte counts until you reach your desired max byte count, eg:
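int getUtf8SeqLen(byte b) {
    if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
    // a continuation byte (10xxxxxx) or an invalid lead byte;
    // unchecked exception, so no throws clause is needed
    throw new IllegalArgumentException("Invalid");
}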
String sampleString = ...; // "컴퓨터"
byte[] bytes = sampleString.getBytes(StandardCharsets.UTF_8);
int maxBytes = ...; // 255
String limitedString;
if (bytes.length <= maxBytes) {
    limitedString = sampleString;
} else {
    int numBytes = 0, newLength = 0;
    do {
        int seqLen = getUtf8SeqLen(bytes[newLength]);
        numBytes += seqLen;
        if (numBytes > maxBytes) break;
        newLength = numBytes;
    } while (newLength < bytes.length);
    limitedString = new String(bytes, 0, newLength, StandardCharsets.UTF_8);
}
Alternatively, since you are converting the bytes back to a String, you could simply avoid the byte[] array altogether and iterate the original String as-is, summing up encoded codepoint byte counts, eg:
int getUtf8SeqLen(int cp) {
    if (cp <= 0x007F) return 1; // 0xxxxxxx
    if (cp <= 0x07FF) return 2; // 110xxxxx 10xxxxxx
    if (cp <= 0xFFFF) return 3; // 1110xxxx 10xxxxxx 10xxxxxx
    if (cp <= 0x10FFFF) return 4; // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    // unchecked exception, so no throws clause is needed
    throw new IllegalArgumentException("Invalid");
}
String sampleString = ...; // "컴퓨터"
int maxBytes = ...; // 255
int numBytes = 0, newLength = 0;
while (newLength < sampleString.length()) {
    int cp = sampleString.codePointAt(newLength);
    numBytes += getUtf8SeqLen(cp);
    if (numBytes > maxBytes) break;
    newLength += Character.charCount(cp);
}
String limitedString = sampleString.substring(0, newLength);
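To illustrate why the loop advances by `Character.charCount(cp)` rather than by one, here is a small sketch that runs the same idea on a string containing a supplementary code point (U+1F600), which is two `char`s in a Java String and four bytes in UTF-8. The class name `Utf8Limit` and the method names `utf8Length` and `limitUtf8` are hypothetical, and the sketch assumes well-formed input.

```java
public class Utf8Limit {
    // Bytes UTF-8 needs for one code point (assumes a valid code point).
    static int utf8Length(int cp) {
        if (cp <= 0x7F) return 1;
        if (cp <= 0x7FF) return 2;
        if (cp <= 0xFFFF) return 3;
        return 4;
    }

    // Hypothetical helper: longest prefix of s that encodes to at most
    // maxBytes bytes of UTF-8, without building a byte array.
    static String limitUtf8(String s, int maxBytes) {
        int numBytes = 0, end = 0;
        while (end < s.length()) {
            int cp = s.codePointAt(end);
            numBytes += utf8Length(cp);
            if (numBytes > maxBytes) break;
            end += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b': 1 + 4 + 1 bytes
        System.out.println(limitUtf8(s, 4).length()); // prints 1 (only "a" fits)
        System.out.println(limitUtf8(s, 5).length()); // prints 3 ("a" plus the surrogate pair)
    }
}
```

Because the cut index always lands on a code point boundary, `substring` never separates the two halves of a surrogate pair.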