I am trying to detect extended ASCII characters with lex, e.g., àÀ.
%{
#include <stdio.h>
%}
DIGIT [0-9]
ALPHA_CHAR [A-Za-z]
EXTENDED [àÀ]
CHAR {ALPHA_CHAR}|{DIGIT}|{EXTENDED}
CHARS ({CHAR})+
%%
{CHARS} { printf("CHARS: %s\n", yytext); }
. { printf("Unknown character: %s\n", yytext); }
%%
int main(int argc, char **argv) {
yylex();
return 0;
}
int yywrap() {
return 1;
}
When I give àÀ as input, my code prints each byte as a separate unknown character (Ã, NBSP, Ã, €), because à is encoded in UTF-8 as the two bytes 0xC3 0xA0 (0xC3 displays as Ã and 0xA0 as a no-break space when read as a single-byte encoding), and so is À: 0xC3 0x80 (0xC3 = Ã and 0x80 = € in Windows-1252).
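To double-check what the scanner actually receives, here is a small standalone C snippet (just my illustration; the string literal is the UTF-8 encoding of àÀ) that dumps the raw bytes:

#include <stdio.h>

int main(void) {
    const unsigned char utf8[] = "\xC3\xA0\xC3\x80"; /* UTF-8 bytes of "àÀ" */
    for (const unsigned char *p = utf8; *p; p++)
        printf("0x%02X ", *p); /* prints: 0xC3 0xA0 0xC3 0x80 */
    printf("\n");
    return 0;
}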
What I could do is: whenever I detect 0xC3, read a second byte and add an appropriate offset to it to get the extended-ASCII equivalent (sketched below).
The offset would be 0xC0 - 0x80 = 0x40, since À is 0xC0 in extended ASCII (Latin-1) but 0xC3 0x80 in UTF-8.
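A minimal sketch of that idea as a flex rule, assuming the input is UTF-8 and only the Latin-1 code points U+00C0..U+00FF (first byte 0xC3) matter; the pattern and the 0x40 offset are my illustration, not a tested solution:

%{
#include <stdio.h>
%}
%%
\xC3[\x80-\xBF]  {
    /* The second UTF-8 byte plus 0x40 yields the Latin-1 code point:
       0x80 + 0x40 = 0xC0 ('À'), 0xA0 + 0x40 = 0xE0 ('à'). */
    unsigned char latin1 = (unsigned char)yytext[1] + 0x40;
    printf("Extended char: 0x%02X\n", latin1);
}
.  { printf("Unknown character: %s\n", yytext); }
%%
int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }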
But I find this idea kind of dirty.
Any better ideas to handle this issue?