Consider the following script, which attempts to read a fixed number of characters from a file containing Unicode text using bash's read builtin, and then counts how many bytes and characters were read.
MWE.sh
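#!/bin/bash
# Number of characters to request from read, taken from the first argument.
char_count=$1
while true; do
    read -r -n "$char_count" output
    echo -n "$output" | wc -c   # bytes
    echo -n "$output" | wc -m   # characters
    echo "${#output}"           # length as reported by bash
    echo
    exit
done < MWE.txt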
MWE.txt
main · square
Suppose I invoke it like this:
for x in $(seq 1 10 ); do ./MWE.sh $x; done
This produces the following output (three lines per invocation: byte count, character count, and ${#output}):
1
1
1

2
2
2

3
3
3

4
4
4

4
4
4

7
6
6

7
6
6

9
8
8

10
9
9

11
10
10
If bash is attempting to read a particular number of characters, then I would expect jumps in the byte count wherever a multibyte character is read, but a character count that increases by exactly one each time.
However, we observe discontinuities in both the byte and the character counts.
For example, when we attempt to read 5 characters, we obtain only 4, and when we attempt to read 7, we obtain only 6.
Why does this happen, and is there a way to make read -n <count> behave in a more consistent way?
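As a comparison point, here is a minimal sketch that feeds the same text to read -n 5 twice, once with the default IFS and once with IFS set to the empty string. It assumes (this is a guess at the cause, not something established above) that the missing character is a trailing space removed by read's word splitting. The variable names trimmed and kept are only illustrative.

```shell
#!/bin/bash
# Same text as MWE.txt, supplied inline so the sketch is self-contained.
line='main · square'

# Default IFS: leading and trailing IFS whitespace is removed from the value.
read -r -n 5 trimmed <<< "$line"

# Empty IFS for this one command: no whitespace trimming takes place.
IFS= read -r -n 5 kept <<< "$line"

# Brackets make a trailing space visible.
printf '[%s] %d\n' "$trimmed" "${#trimmed}"   # prints "[main] 4"
printf '[%s] %d\n' "$kept" "${#kept}"         # prints "[main ] 5"
```

If the assumption holds, the same trimming would also account for the 7-character case, where the seventh character is again a space.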
I am running the following bash version:
bash --version
GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)