Word boundary with words starting or ending with special characters gives unexpected results

Say I want to match the presence of the phrase Sortes\index[persons]{Sortes} in the phrase test Sortes\index[persons]{Sortes} text.

Using Python's re module, I could do this:

>>> import re
>>> search = re.escape(r'Sortes\index[persons]{Sortes}')
>>> match = r'test Sortes\index[persons]{Sortes} text'
>>> re.search(search, match)
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>

This works, but I want to prevent the search pattern Sortes from giving a positive result on the phrase test Sortes\index[persons]{Sortes} text.

>>> re.search(re.escape('Sortes'), match)
<_sre.SRE_Match object; span=(5, 11), match='Sortes'>

So I use the \b pattern, like this:

search = r'\b' + re.escape(r'Sortes\index[persons]{Sortes}') + r'\b'
match = r'test Sortes\index[persons]{Sortes} text'
re.search(search, match)

Now, I don’t get a match.

If the search pattern does not contain any of the characters []{}, it works. E.g.:

>>> re.search(r'\b' + re.escape(r'Sortes\index') + r'\b', r'test Sortes\index test')
<_sre.SRE_Match object; span=(5, 17), match='Sortes\\index'>

Also, if I remove the final r'\b', it works:

>>> re.search(r'\b' + re.escape(r'Sortes\index[persons]{Sortes}'), r'test Sortes\index[persons]{Sortes} test')
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>

Furthermore, the documentation says about \b:

Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string.

So I tried replacing the final \b with (\W|$):

>>> re.search(r'\b' + re.escape(r'Sortes\index[persons]{Sortes}') + r'(\W|$)', r'test Sortes\index[persons]{Sortes} test')
<_sre.SRE_Match object; span=(5, 35), match='Sortes\\index[persons]{Sortes} '>

Lo and behold, it works!
What is going on here? What am I missing?

See what a word boundary matches:

A word boundary can occur in one of three positions:

Before the first character in the string, if the first character is a word character.

After the last character in the string, if the last character is a word character.

Between two characters in the string, where one is a word character and the other is not a word character.

In your pattern, }\b only matches if there is a word char after } (a letter, digit or _).

When you use (\W|$), you explicitly require a non-word char or the end of the string.
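To make the point concrete, here is a minimal sketch that lists every position where \b matches in the question's sample string and checks the position right after the closing }. The variable names and the glued variant (the same text with the space after } removed) are assumptions added for illustration only.

```python
import re

phrase = r'Sortes\index[persons]{Sortes}'
text = r'test Sortes\index[persons]{Sortes} text'
glued = r'test Sortes\index[persons]{Sortes}text'  # hypothetical variant: no space after '}'

# All positions in `text` where \b matches.
boundaries = [m.start() for m in re.finditer(r'\b', text)]

# Position immediately after the closing '}' of the phrase.
end_of_phrase = text.index(phrase) + len(phrase)

pattern = r'\b' + re.escape(phrase) + r'\b'

print(end_of_phrase in boundaries)   # is there a word boundary right after '}'?
print(re.search(pattern, text))      # '}' followed by a space
print(re.search(pattern, glued))     # '}' followed by a letter
```

The first line prints False and the second prints None, because } and the space are both non-word chars. The third search succeeds, because } followed by a letter is a word boundary, which is the opposite of what the trailing \b was meant to enforce.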

A solution is adaptive word boundaries:

re.search(r'(?:(?!\w)|\b(?=\w)){}(?:(?<=\w)\b|(?<!\w))'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')

Or equivalent:

re.search(r'(?!\B\w){}(?<!\w\B)'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')

Here, adaptive dynamic word boundaries are used that mean the following:

(?:(?!\w)|\b(?=\w)) (equal to (?!\B\w)) – a left-hand boundary, making sure the current position is at the word boundary if the next char is a word char, or no context restriction is applied if the next char is not a word char (note that you will need to use (?:\B(?!\w)|\b(?=\w)) if you want to disallow a word char immediately on the left if the next char is not a word char)
(?:(?<=\w)\b|(?<!\w)) (equal to (?<!\w\B)) – a right-hand boundary, making sure the current position is at the word boundary if the previous char is a word char, or no context restriction is applied if the previous char is not a word char (note that you will need to use (?:(?<=\w)\b|\B(?<!\w)) if you want to disallow a word char immediately on the right if the preceding char is not a word char).
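As a quick sanity check of the two spellings, the sketch below builds both forms and compares them on a few sample strings. The helper names adaptive_long and adaptive_short and the sample strings are assumptions made up for this illustration; string concatenation is used instead of .format only to keep the braces out of the way.

```python
import re

def adaptive_long(phrase):
    # Long form of the adaptive boundaries (hypothetical helper name).
    return r'(?:(?!\w)|\b(?=\w))' + re.escape(phrase) + r'(?:(?<=\w)\b|(?<!\w))'

def adaptive_short(phrase):
    # Short form of the adaptive boundaries (hypothetical helper name).
    return r'(?!\B\w)' + re.escape(phrase) + r'(?<!\w\B)'

def span(pattern, text):
    # Return the match span, or None when there is no match.
    m = re.search(pattern, text)
    return m.span() if m else None

# (phrase, text) pairs chosen for illustration only.
samples = [
    (r'Sortes\index[persons]{Sortes}', r'test Sortes\index[persons]{Sortes} text'),
    ('Sortes', 'testSortes text'),    # word char glued to a word-char edge
    ('{Sortes}', 'x{Sortes}y'),       # word chars glued to non-word edges
]

for phrase, text in samples:
    long_span = span(adaptive_long(phrase), text)
    short_span = span(adaptive_short(phrase), text)
    print(repr(phrase), repr(text), long_span, short_span)
```

On these samples both forms agree: the full phrase is found at (5, 34), Sortes is rejected inside testSortes, and {Sortes} is accepted even when letters touch the braces, since no restriction applies on an edge that is a non-word char.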

You might also consider using unambiguous word boundaries based on negative lookarounds in these cases:

re.search(r'(?<!\w){}(?!\w)'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')

Here, the (?<!\w) negative lookbehind will fail the match if there is a word char immediately to the left of the current location, and the (?!\w) negative lookahead will fail the match if there is a word char immediately to the right of the current location.

Which to choose? Adaptive word boundaries are more lenient compared to unambiguous word boundaries as the latter presume there must be no word chars on both ends of a match, while the former allow matching leading and trailing non-word chars in any context.
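The difference is easiest to see side by side. This is a small sketch, assuming two hypothetical helpers (adaptive and unambiguous) and made-up sample strings; it is only meant to illustrate the leniency described above.

```python
import re

def adaptive(phrase):
    # Adaptive word boundaries, short form (hypothetical helper name).
    return r'(?!\B\w)' + re.escape(phrase) + r'(?<!\w\B)'

def unambiguous(phrase):
    # Unambiguous word boundaries: no word char allowed on either side.
    return r'(?<!\w)' + re.escape(phrase) + r'(?!\w)'

phrase = '{Sortes}'
results = {}
for text in ['see {Sortes} here', 'see x{Sortes}y here']:
    results[text] = (
        bool(re.search(adaptive(phrase), text)),
        bool(re.search(unambiguous(phrase), text)),
    )
    print(repr(text), results[text])
```

Both variants accept the phrase when it is surrounded by spaces. Only the adaptive variant accepts it in x{Sortes}y, because the phrase starts and ends with non-word chars and the adaptive form places no restriction on such edges.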

Note: It is easy to customize these lookaround patterns further. Say, to fail the match only if there are letters around the pattern, use [^\W\d_] instead of \w; or, if you only allow matches surrounded by whitespace (or the start/end of the string), use the whitespace boundaries (?<!\S) and (?!\S).
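A minimal sketch of both customizations follows. The helper names letter_bounded and whitespace_bounded, and the sample strings, are assumptions introduced only for this example.

```python
import re

def letter_bounded(phrase):
    # Fail only when a letter touches the match; digits and '_' are allowed.
    return r'(?<![^\W\d_])' + re.escape(phrase) + r'(?![^\W\d_])'

def whitespace_bounded(phrase):
    # Require whitespace or the start/end of the string on both sides.
    return r'(?<!\S)' + re.escape(phrase) + r'(?!\S)'

phrase = '{Sortes}'

print(bool(re.search(letter_bounded(phrase), '1{Sortes}2')))              # digits on both sides
print(bool(re.search(letter_bounded(phrase), 'a{Sortes}')))               # letter on the left
print(bool(re.search(whitespace_bounded(phrase), 'see {Sortes} here')))   # spaces on both sides
print(bool(re.search(whitespace_bounded(phrase), 'see {Sortes}, here')))  # comma on the right
```

The letter-based version accepts digits next to the match but rejects a neighboring letter, while the whitespace version rejects any adjacent non-whitespace char, including punctuation such as the comma.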

I think this is what you’re running into:

\b lands on the boundary between \w and \W, which is why it doesn't work in your example. In '{Sortes}\b', the \b sits between two \W characters: the '}' and the space that follows it. Neither matches [a-zA-Z0-9_], the ordinary set for \w, so there is no word boundary at that position.
