I’m trying to implement a robust ere_parenthesize function, which requires accurately parsing the bracket expressions of a user-provided ERE.
The difficult part is that support for character classes [: :], equivalence classes [= =] and collating symbols [. .] in bracket expressions differs between Awk implementations, yet it is critical for determining where a bracket expression terminates.
A simple example: /[[:punct:]]/ is equivalent to /[:[punct]]/ when Awk doesn’t support [: :].
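To make the termination problem concrete, here is a minimal sketch of a scanner that locates the closing ] of a bracket expression under either assumption. The names bracket_len and has_classes are hypothetical and not part of ere_parenthesize. The sketch assumes that a backslash escapes the next character inside brackets and that [: :], [= =] and [. .] are either all parsed or all ignored, which real engines do not guarantee.

```shell
out=$(awk '
# Hypothetical helper: index of the "]" that closes the bracket expression
# starting at the first character of s, or 0 if it is unterminated.
# has_classes: whether the engine is assumed to parse [: :], [= =] and [. .].
function bracket_len(s, has_classes,    n, i, c, d, j) {
    n = length(s)
    if (substr(s, 1, 1) != "[") return 0
    i = 2
    if (substr(s, i, 1) == "^") i++       # optional negation
    if (substr(s, i, 1) == "]") i++       # a leading "]" is a literal
    while (i <= n) {
        c = substr(s, i, 1)
        if (c == "]") return i
        if (c == "\\") { i += 2; continue }   # assumption: backslash escapes the next character
        if (c == "[" && has_classes) {
            d = substr(s, i + 1, 1)
            if (d == ":" || d == "=" || d == ".") {
                j = index(substr(s, i + 2), d "]")
                if (j == 0) return 0      # unterminated class: reported as an error here
                i += j + 3                # jump past the closing delimiter
                continue
            }
        }
        i++
    }
    return 0
}
BEGIN {
    ere = "[[:punct:]]"
    print bracket_len(ere, 1)   # classes supported
    print bracket_len(ere, 0)   # classes not supported
}')
printf "%s\n" "$out"
```

With class support the whole 11-character string is one bracket expression; without it the expression ends at character 10 and the final ] is a literal, which is the /[:[punct]]/ reading.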
I brainstormed a few runtime checks. Given the constraint that they must not induce a crash, they are not enough to fully characterize a regex engine. Still, I ran them with multiple Awks/OSs and made a few assumptions in light of the results:
- An implementation that supports [= =] but doesn’t support standard backslash-escape sequences within it always has the termination bug found here:
    match("]", /[[=x=]?]/) == 0     (support for equivalence classes)
    match("]", /[[=\t=]?]/) == 1    (no support for standard backslash-escape sequences within [= =])
  implies:
    match("]", /[[=x]?]/) == 1      (termination bug)
- An implementation that supports [= =] and standard backslash-escape sequences within it does not have termination bugs:
    match("]", /[[=x=]?]/) == 0     (support for equivalence classes)
    match("\t", /[[=\t=]?]/) == 1   (support for standard backslash-escape sequences within [= =])
  implies:
    match("]", /[[=\t]]/)           (crash)
- An implementation that supports [: :] but doesn’t support [= =] always has termination bugs:
    match("1", /[[:xdigit:]]/) == 1   (support for character classes)
    match("]", /[[=x=]?]/) == 1       (no support for equivalence classes)
  implies:
    match("]", /[[:xdigit]?]/) == 1   (termination bug)
    match("]", /[[:abc:]?]/) == 1     (termination bug)
    match("]", /[[::]?]/) == 1        (termination bug)
    match("]", /[[:]?]/) == 1         (termination bug)
My question is about confirming or invalidating the above assumptions: can you provide the results of running the following code with the Awks/OSs that you have at hand?
awk 'BEGIN {
    # Probes that did not crash on any Awk tested so far
    ere_brackets_have_character_classes = match("1", /[[:xdigit:]]/)
    ere_brackets_have_equivalence_classes = !match("]", /[[=x=]?]/)
    ere_brackets_have_backslash_escape_bug = match("]", /[[=\t=]?]/)
    print "ere_brackets_have_character_classes :", ere_brackets_have_character_classes
    print "ere_brackets_have_equivalence_classes :", ere_brackets_have_equivalence_classes
    print "ere_brackets_have_backslash_escape_bug :", ere_brackets_have_backslash_escape_bug
    # Riskier EREs are kept in variables so that they are only compiled when used
    if (ere_brackets_have_equivalence_classes) {
        if (ere_brackets_have_backslash_escape_bug) {
            print "Assumption #1: expected output: 1"
            r = "[[=x]?]"
            print match("]", r)
        } else {
            print "Assumption #2: expected output: crash"
            r = "[[=\t]]"
            match("]", r)
        }
    } else if (ere_brackets_have_character_classes) {
        print "Assumption #3: expected output: 1"
        split("[[:xdigit]?] [[:abc:]?] [[::]?] [[:]?]", a, " ")
        print match("]", a[1]) &&
              match("]", a[2]) &&
              match("]", a[3]) &&
              match("]", a[4])
    } else {
        print "no expected output: nothing"
    }
}'
Note: some Awks compile the EREs before running the code when they are provided as string constants or within / /; as a workaround I stored them in variables.
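A minimal sketch of that workaround follows. It assumes, as my runs suggest but without any guarantee for every Awk, that an ERE held in a variable is compiled only when the match() using it actually executes. The variable names supported and ere are illustrative only.

```shell
out=$(awk 'BEGIN {
    supported = 0            # stand-in for a probe result
    ere = "[[=x]?]"          # kept as a string in a variable, not as a /.../ literal
    if (supported) {
        print match("]", ere)    # assumed to be compiled only if this line executes
    } else {
        print "skipped"
    }
}')
printf "%s\n" "$out"
```

Because the guarded branch is never taken, the questionable ERE is never handed to the regex engine and the script prints "skipped".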
ASIDE
match("1", /[[:xdigit:]]/) should be locale-independent, am I right?