regex - "Variable length lookbehind not implemented" but it isn't variable length

Question

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a very crazy regex that I'm trying to diagnose. It is also very long, but I have cut it down to just the following script. Run using Strawberry Perl v5.26.2.

use strict;
use warnings;

my $text = "M Y H A P P Y T E X T";
my $regex = '(?i)(?<!(Mon|Fri|Sun)day |August )abcd(?-i)';

if ($text =~ m/$regex/){
    print "true\n";
}
else {
    print "false\n";
}

This gives the error "Variable length lookbehind not implemented in regex."

I am hoping you can help with several issues:

1. I don't see why this error would occur, because all of the possible lookbehind values are 7 characters: "Monday ", "Friday ", "Sunday ", "August ".
2. I did not write this regex myself, and I am not sure how to interpret the syntax (?i) and (?-i). When I get rid of the (?i) the error actually goes away. How will Perl interpret this part of the regex? I would think the first two characters are evaluated as "optional literal parenthesis", except that the parenthesis isn't escaped, and in that case I would also get a different syntax error because the closing parenthesis would then not be matched.
3. This behavior starts somewhere between Perl 5.16.3_64 and 5.26.1_64, at least in Strawberry Perl. The former version is fine with the code, the latter is not. Why did it start?

1 Answer

深蓝 · Answer 1 · 2021-10-17T02:51:22+0000

I have reduced your problem to this:

my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!st)A';
print ($text =~ m/$regex/i ? "true\n" : "false\n");

With the /i (case-insensitive) modifier present, certain character combinations such as "ss" or "st" can also match a single-character typographic ligature, which makes the lookbehind variable length. For instance, /August/i matches both AUGUST (6 characters) and auguﬆ (5 characters, the last one being U+FB06, LATIN SMALL LIGATURE ST).

However, if we remove the /i (case-insensitive) modifier, it works, because typographic ligatures are no longer matched.
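
The length mismatch can be checked directly. The following is a minimal sketch, assuming a Perl with Unicode case folding under /i (5.14 or later); the helper name matches_st is made up for this illustration.

```perl
use strict;
use warnings;

# U+FB06 LATIN SMALL LIGATURE ST: a single character that case-folds to "st".
my $ligature = "\x{FB06}";

# Hypothetical helper: 1 if the whole string matches /st/i, 0 otherwise.
sub matches_st {
    my ($string) = @_;
    return $string =~ /\Ast\z/i ? 1 : 0;
}

# Both strings match the same two-letter pattern, but their lengths differ.
printf "'ST':     length %d, matches %d\n", length('ST'),      matches_st('ST');
printf "ligature: length %d, matches %d\n", length($ligature), matches_st($ligature);
```

Because the same pattern can consume either one or two characters, the regex engine cannot tell how far to step back for the lookbehind.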

Solution: use the /aa modifier, i.e.:

/(?<!st)A/iaa

Or in your regex:

my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!(Mon|Fri|Sun)day |August )abcd';
print ($text =~ m/$regex/iaa ? "true\n" : "false\n");

From perlre:

To forbid ASCII/non-ASCII matches (like "k" with "\N{KELVIN SIGN}"), specify the "a" twice, for example /aai or /aia. (The first occurrence of "a" restricts the \d, etc., and the second occurrence adds the "/i" restrictions.) But, note that code points outside the ASCII range will use Unicode rules for /i matching, so the modifier doesn't really restrict things to just ASCII; it just forbids the intermixing of ASCII and non-ASCII.
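
The KELVIN SIGN example from that passage can be tried out as follows. This is only a sketch; the helper names matches_k_i and matches_k_iaa are invented here, and the code point is written as \x{212A} so that no charnames import is needed on older Perls.

```perl
use strict;
use warnings;

# U+212A KELVIN SIGN case-folds to the ASCII letter "k".
my $kelvin = "\x{212A}";

# Hypothetical helpers: the same whole-string test under /i and under /iaa.
sub matches_k_i   { my ($s) = @_; return $s =~ /\Ak\z/i   ? 1 : 0 }
sub matches_k_iaa { my ($s) = @_; return $s =~ /\Ak\z/iaa ? 1 : 0 }

# Under /i the non-ASCII KELVIN SIGN matches "k"; under /iaa only ASCII "K" does.
printf "/i:   K=%d kelvin=%d\n", matches_k_i('K'),   matches_k_i($kelvin);
printf "/iaa: K=%d kelvin=%d\n", matches_k_iaa('K'), matches_k_iaa($kelvin);
```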

See a closely related discussion here
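
As for the (?i) and (?-i) in the question: these are inline modifiers. (?i) switches on case-insensitive matching for the rest of the enclosing group, and (?-i) switches it off again. perlre also allows "a" (once or twice) in this inline form, so if the pattern has to remain a plain string, the fix can probably live inside it. A sketch, with sample strings and the helper is_match made up for illustration:

```perl
use strict;
use warnings;

# The pattern from the question, with "aa" added to the inline flags.
my $regex = '(?iaa)(?<!(Mon|Fri|Sun)day |August )abcd(?-i)';

# Hypothetical helper: 1 if $regex matches the string, 0 otherwise.
sub is_match {
    my ($string) = @_;
    return $string =~ m/$regex/ ? 1 : 0;
}

# Invented sample inputs: only the first is not preceded by a blocked word.
for my $sample ('Tuesday abcd', 'sunday ABCD', 'AUGUST abcd') {
    printf "%-14s %s\n", $sample, is_match($sample) ? 'true' : 'false';
}
```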
