I have a 57-byte text file encoded in UTF-8. It can be generated with the following command:
echo '3C6469763E3C6469763E5F3C2F6469763E5F3C68313E6162636465665F3C2F68313E5F3C2F6469763E3C6469763EF09F93853C2F6469763E0A' | xxd -r -p > input.txt
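For readers who want to see what the hex decodes to, here is a minimal sketch that writes the same bytes without xxd. It assumes a shell and terminal that handle UTF-8 literals; the last character inside the second div is the emoji U+1F4C5, which takes 4 bytes in UTF-8.

```shell
# Equivalent to the xxd command above: one line of HTML-like text plus a newline.
# The emoji (U+1F4C5) is encoded as the bytes F0 9F 93 85.
printf '%s\n' '<div><div>_</div>_<h1>abcdef_</h1>_</div><div>📅</div>' > input.txt

# Print the file size in bytes to confirm it matches the stated 57 bytes.
wc -c < input.txt
```

The text between _<h1> and _</h1> is abcdef, and the only non-ASCII character comes after that span.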
I am trying to extract the part between _<h1> and _</h1>. I used the following gawk program:
LC_ALL=en_US.utf8 gawk 'BEGIN {IGNORECASE = 1; }
{match($0, /^.*_<h1>(.*)_<\/h1>.*$/, a);
print(a[1]) > "output.txt"}' input.txt
Question: why is a[1] empty for the combination of this particular program and this particular input? If I remove the IGNORECASE = 1 part, or replace /^.*_<h1>(.*)_<\/h1>.*$/ with /_<h1>(.*)_<\/h1>/, the output is correct (a[1] contains abcdef). How can either of these changes affect the output in this situation?