While looking for a solution to "How to solve Perl's `length 'für' == 4` for `LC_CTYPE="en_US.UTF-8"`?", I wrote another little test program.
It seems Perl (5.18.2) does not output UTF-8 encoded strings correctly in a UTF-8 Linux (SLES12 SP5) environment when the strings are interpreted as UTF-8.
The basic problem was that a string such as "Gemäß", read from a file, had a length of 7 instead of 5, so I wrote this test program ("length.pl"; the first test is "commented out" in an odd way, as a POD block):
#!/usr/bin/perl
use warnings;
use strict;

=begin debug

use utf8;
print length('Gemäß'), "\n";

=end debug

=cut

if (open(my $fh, "<:encoding(UTF-8)", 'length.txt')) {
    while (<$fh>) {
        chomp;
        print length($_), ':', $_, "\n";
    }
    close($fh);
} else {
    warn "length.txt: $!\n";
}
The input file "length.txt" contains just a single line, like this:
> cat length.txt
Gemäß
> hexdump -C length.txt
00000000  47 65 6d c3 a4 c3 9f 0a                           |Gem.....|
00000008
> ./length.pl
5:Gem▒▒
> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
> vi length.pl # remove the ":encoding(UTF-8)" from open
> ./length.pl
7:Gemäß
So with `:encoding(UTF-8)` in place, the length is correct, but the output on the screen is wrong.
When I drop `:encoding(UTF-8)` from the `open` call, the string length is wrong, but the output is correct.
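The two lengths seen above (5 and 7) can be reproduced without any file or terminal involved. The following is only a minimal sketch of the character/byte distinction, assuming the core `Encode` module is available; the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# The 7 bytes of "Gemäß" as shown by hexdump (without the trailing 0a).
my $bytes = "\x47\x65\x6d\xc3\xa4\xc3\x9f";

# Decoding interprets the byte sequence as UTF-8 and yields a character string.
my $chars = decode('UTF-8', $bytes);

# length() counts bytes for the undecoded string and characters for the decoded one.
printf "undecoded length: %d\n", length($bytes);
printf "decoded length:   %d\n", length($chars);

# Encoding the characters again gives back the original 7 bytes.
my $round_trip = encode('UTF-8', $chars);
printf "round trip ok:    %s\n", ($round_trip eq $bytes ? 'yes' : 'no');
```

This matches the two runs: with the `:encoding(UTF-8)` layer on `open`, `length` sees 5 characters; without it, `length` sees the 7 raw bytes.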
I'm using an SSH session via PuTTY with "Remote character set:" set to "UTF-8" (just in case someone asks).
Obviously I'd like to have both for UTF-8 input: the correct string length and correct text output.
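One variant that might give both is to decode on input and also encode on output. The sketch below is untested on the SLES12 SP5 system described here and rests on two assumptions: that an `:encoding(UTF-8)` layer on `STDOUT` (set via `binmode`) is what is missing, and that an in-memory filehandle is an acceptable stand-in for "length.txt". The helper name `report_lengths` is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read decoded lines from $in and
# write "length:line" for each of them to $out.
sub report_lengths {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        chomp $line;
        print {$out} length($line), ':', $line, "\n";
    }
    return;
}

# Stand-in for length.txt: the same 8 bytes that hexdump shows.
my $file_bytes = "Gem\xc3\xa4\xc3\x9f\n";

# Decode on input, as in the original test program ...
open(my $in, '<:encoding(UTF-8)', \$file_bytes)
    or die "in-memory open: $!\n";

# ... and encode on output, so characters leave the program as UTF-8 bytes.
binmode(STDOUT, ':encoding(UTF-8)')
    or die "binmode STDOUT: $!\n";

report_lengths($in, \*STDOUT);
close($in);
```

With both layers in place, the expectation is a line reading `5:Gemäß` on a UTF-8 terminal. For the real program, only the `binmode` call would need to be added before the first `print`.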