It turns out that probably the best solution is the following:
((https?|ftps?)://[^"<\s]+)(?![^<>]*>|[^"]*?</a)
It looks like the negative lookahead works properly only if it starts with a quantified expression rather than a literal string. In that case, in practice we can only rely on backtracking.
Again, we just want to make sure that nothing inside HTML tags (that is, the attributes) gets messed up. So we backtrack starting from </a up to the first " symbol. The quote is used as the boundary because it is not a valid URL symbol, whereas the < and > symbols do appear between the URL and </a when tags are nested.
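A minimal sketch of how the pattern behaves, written in Python as an assumption (the answer does not name a language). The helper find_plain_urls and the sample markup are hypothetical, and the URL character class is read as [^"<\s], i.e. with the whitespace escape.

```python
import re

# Pattern from the answer; the \s in the character class is an assumption.
URL_OUTSIDE_TAGS = re.compile(r'((https?|ftps?)://[^"<\s]+)(?![^<>]*>|[^"]*?</a)')

def find_plain_urls(html):
    """Return URLs that are neither inside a tag nor inside <a>...</a> text."""
    return [m.group(1) for m in URL_OUTSIDE_TAGS.finditer(html)]

# Hypothetical sample: one plain URL, two href attributes, link text,
# link text nested in <b>, and an <img> src attribute.
sample = (
    'Plain http://plain.example/page and '
    '<a href="http://attr.example">http://text.example</a> and '
    '<a href="http://attr2.example"><b>http://nested.example</b></a> '
    '<img src="http://img.example/a.png"> end'
)

print(find_plain_urls(sample))  # -> ['http://plain.example/page']
```

The first alternative of the lookahead rejects URLs that sit inside a tag (the next angle bracket is a closing >), and the second rejects URLs from which </a can be reached without crossing a double quote.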
Now nested tags inside <a> tags are also handled properly. Of course, the code is not perfect, but it should work with almost any simple HTML markup. You just need to be a bit careful with the following:
- avoid placing quotes within <a> tags;
- do not use this algorithm on <a> tags without any attribute (placeholders);
- avoid multiple nested tags/lines unless the URL inside the <a> tag comes after a double quote.
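The first two caveats can be reproduced with a short sketch, again in Python as an assumption and with hypothetical sample strings. The pattern is the one from the answer, read with \s in the character class.

```python
import re

# Pattern from the answer; the \s in the character class is an assumption.
URL_OUTSIDE_TAGS = re.compile(r'((https?|ftps?)://[^"<\s]+)(?![^<>]*>|[^"]*?</a)')

def urls(html):
    """Hypothetical helper: list the URLs the pattern reports."""
    return [m.group(1) for m in URL_OUTSIDE_TAGS.finditer(html)]

# Quote inside the link text: the scan towards </a stops at the quote,
# so the URL in the link text is reported (false positive).
quoted = '<a href="http://a.example">http://b.example "quoted"</a>'
print(urls(quoted))       # -> ['http://b.example']

# Attribute-less <a>: no quote separates the plain URL from </a,
# so the plain URL is skipped (false negative).
placeholder = 'http://c.example <a>placeholder</a>'
print(urls(placeholder))  # -> []

# With a quoted attribute in between, the same plain URL is found.
linked = 'http://c.example <a href="#">link</a>'
print(urls(linked))       # -> ['http://c.example']
```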
Here is a very good and messy example (the last match should not be found but it is):
https://regex101.com/r/pC0jR7/2
It is a pity that this lookahead does not work: (?!<a.*?</a>)
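One possible reading of why that lookahead has no effect: a lookahead only inspects the text that starts right after the URL, so (?!<a.*?</a>) merely asserts that no <a ...</a> element begins at that point and says nothing about whether the URL is already inside one. The sketch below is an assumption about how the lookahead was combined with the URL part; the name NAIVE and the sample string are hypothetical.

```python
import re

# Hypothetical variant: the URL part with the simpler lookahead appended.
# The \s in the character class is an assumption.
NAIVE = re.compile(r'((https?|ftps?)://[^"<\s]+)(?!<a.*?</a>)')

html = '<a href="http://attr.example">http://text.example</a>'
naive_hits = [m.group(1) for m in NAIVE.finditer(html)]

# Neither URL is directly followed by "<a", so both pass the lookahead.
print(naive_hits)  # -> ['http://attr.example', 'http://text.example']
```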