While looking for a UTF-8 handling library for C I found this, which seems to be the most popular one on GitHub. It overruns the buffer you pass to its UTF-8 validation function:
```c
#include "utf8.h"

int main()
{
    unsigned char c[] = { 0b11110000, 0b10000000, 0 };
    utf8_int8_t* is_valid = utf8valid((char*)c);
}
```
The part of the code that overruns the buffer is:
```c
if (0xf0 == (0xf8 & *str)) {
  /* ensure that there's 4 bytes or more remaining */
  if (remaining < 4) {
    return (utf8_int8_t*)str;
  }

  auto val = str[3];

  /* ensure each of the 3 following bytes in this 4-byte
   * utf8 codepoint began with 0b10xxxxxx */
  if ((0x80 != (0xc0 & str[1])) || (0x80 != (0xc0 & str[2])) ||
      (0x80 != (0xc0 & str[3]))) {
    return (utf8_int8_t*)str;
  }
```
When it reads `str[3]` it reads past the end of the buffer I allocated, even though my string buffer was null-terminated. Is this behaviour normal or expected for a UTF-8 validation function? There is a utf8nvalid() function where you pass a maximum buffer size, but the function also explicitly checks for the null terminator, so the library seems to assume that this protects against buffer overflows. I think it's broken. The code is a single header, it's here