This regular expression '(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
matches Ġmeousrtr as a single match, as expected. This can be seen in the shared link https://regex101.com/r/UR0P6T/1
But when I use the PCRE2 library from C++, I get 3 individual matches instead of 1. I understand that the Unicode character Ġ
is 2 bytes wide in UTF-8 and that the expression is matching each of those two bytes separately, but shouldn't this match the whole string, as it does at https://regex101.com/r/UR0P6T/1?
# Output of regex expression
'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
# Matches
Match Succeeded at 0
�x
Match Succeeded at 1
�x
Match Succeeded at 2
meousrtrx
Below is the C++ code:
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <iostream>
using namespace std;

int main(int argc, char **argv)
{
    // Backslashes are doubled so that the regex engine receives \p{L}, \s, etc.
    PCRE2_SPTR expression = (PCRE2_SPTR) "'(?:[sdmt]|ll|ve|re)| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+";
    PCRE2_SPTR text = (PCRE2_SPTR) "Ġmeousrtr";
    PCRE2_SIZE eoffset;
    PCRE2_SIZE *ovector;
    pcre2_code *re;
    pcre2_match_data *match_data;

    // Print the pattern exactly as it is passed to pcre2_compile.
    char *c = (char *)expression;
    while (*c)
        printf("%c", (unsigned int)*c++);
    printf("\n");

    int error_number;
    int result;
    int exit_code = 0;
    size_t start_offset = 0;
    size_t text_len;
    uint32_t options = 0;
    text_len = strlen((char *)text); // length in bytes, not characters

    // Compile with no options (third argument is 0).
    re = pcre2_compile(expression, PCRE2_ZERO_TERMINATED, 0, &error_number, &eoffset, NULL);
    if (re == NULL)
    {
        PCRE2_UCHAR buffer[256];
        pcre2_get_error_message(error_number, buffer, sizeof(buffer));
        cout << buffer;
        return 1;
    }

    match_data = pcre2_match_data_create_from_pattern(re, NULL);

    // Match repeatedly, restarting each time at the end of the previous match.
    while (true)
    {
        result = pcre2_match(re, text, text_len, start_offset, options, match_data, NULL);
        if (result < 0)
        {
            if (result == PCRE2_ERROR_NOMATCH)
            {
                cout << "No matches found!";
            }
            else
            {
                cout << "Matching Error" << result;
                exit_code = -1;
            }
            break; // leave the loop so the cleanup below always runs
        }

        ovector = pcre2_get_ovector_pointer(match_data);
        printf("Match Succeeded at %zu\n", ovector[0]);

        // Print the whole match (i == 0) and any captured groups.
        for (int i = 0; i < result; i++)
        {
            PCRE2_SPTR substring_start = text + ovector[2 * i];
            PCRE2_SIZE substring_length = ovector[2 * i + 1] - ovector[2 * i];
            printf("%.*s\n", (int)substring_length, (char *)substring_start);
        }
        start_offset = ovector[1];
    }

    pcre2_match_data_free(match_data);
    pcre2_code_free(re);
    return exit_code;
}