Will the \b regex for word boundary work in C++ for all languages? Or is it just the Latin alphabet?
If not – how would one match a whole word such as “תפוח”?
Specifically, I thought about something like this: [^\s]תפוח[$\s]
but I'm not sure if ^ is interpreted as negation or start of string here…
I’m using the PCRE library.
You don’t say what regex engine you are using. But anyway, you might like to consider using Boost.Regex, because it has a wrapper that can be used with the ICU library for handling Unicode.
The documentation for this says you can:
Create regular expressions that support various Unicode data properties, including character classification.
This implies \b and \B should work with any encoding supported by ICU.
In the ‘standards’ section for Unicode compliance it says:
1.4 Simple Word Boundaries: Conforming: non-spacing marks are included in the set of word characters.
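On the bracket question itself: inside a character class, a leading ^ negates the class and $ is just a literal dollar sign, so [^\s] means "one non-whitespace character" rather than "start of string". To express "start of string or whitespace", the anchors have to sit outside the brackets in an alternation. Below is a minimal sketch of the difference, assuming UTF-8 source and input. It uses std::regex (ECMAScript grammar) instead of PCRE or Boost only to stay self-contained, and both function names are hypothetical helpers, not part of any library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `word` occurs in `text` delimited by
// whitespace or by the start/end of the string.
// Assumes UTF-8 input and that `word` contains no regex metacharacters.
bool contains_whole_word(const std::string& text, const std::string& word) {
    // Anchors are placed in alternations, outside any [...] class.
    const std::regex re("(^|\\s)" + word + "($|\\s)");
    return std::regex_search(text, re);
}

// The pattern shape from the question, brackets kept as written.
// [^\s] must consume one non-whitespace character before the word,
// and [$\s] must consume a literal '$' or a whitespace character after it.
bool bracket_version(const std::string& text, const std::string& word) {
    const std::regex re("[^\\s]" + word + "[$\\s]");
    return std::regex_search(text, re);
}
```

With these definitions, contains_whole_word("תפוח", "תפוח") should return true, while bracket_version on the same input should return false, because the bracketed form needs an extra character on each side. Note that std::regex on std::string works byte by byte and this sketch only treats whitespace as a delimiter, so punctuation next to the word is not handled; a Unicode-aware \b, such as the ICU-backed one described above, is the more general route.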