In non-POSIX regex flavors (as in Java, as The Pattern
engine performs traditional NFA-based matching with ordered alternation as occurs in Perl 5), the first alternative is matched. In POSIX, the longest alternative is matched.
See what Perl help says about alternation:
To match dog
or cat
, we form the regexp dog|cat
. As before, Perl will try to match the regexp at the earliest possible point in the string. At each character position, Perl will first try to match the first alternative, dog
. If dog
doesn't match, Perl will then try the next alternative, cat
. If cat
doesn't match either, then the match fails and Perl moves to the next position in the string.
See the Alternation with The Vertical Bar or Pipe Symbol at regular-expressions.info that describes NFA-compliant alternation behavior:
The order of the alternatives matters. Suppose you want to use a regex to match a list of function names in a programming language: Get
, GetValue
, Set
or SetValue
. The obvious solution is Get|GetValue|Set|SetValue
.
The regex engine starts at the first token in the regex, G
, and at the first character in the string, S
. The match fails. However, the regex engine studied the entire regular expression before starting. So it knows that this regular expression uses alternation, and that the entire regex has not failed yet. So it continues with the second option, being the second G
in the regex. The match fails again. The next token is the first S
in the regex. The match succeeds, and the engine continues with the next character in the string, as well as the next token in the regex. The next token in the regex is the e
after the S
that just successfully matched. e
matches e
. The next token, t
matches t
.
At this point, the third option in the alternation has been successfully matched. Because the regex engine is eager, it considers the entire alternation to have been successfully matched as soon as one of the options has. In this example, there are no other tokens in the regex outside the alternation, so the entire regex has successfully matched Set
in SetValue
.
And then:
But the POSIX standard does mandate that the longest match be returned, even when a regex-directed engine is used. Such an engine cannot be eager. It has to continue trying all alternatives even after a match is found, in order to find the longest one.
However, the order of alternatives may be irrelevant if the context on either side is strictly defined. If you use anchors, ^(asd|asdf)$
, to match the full string, you will only get the one that corresponds to the right alternative.