You're close! Your pattern, '#<.*?>', only matches the opening tag. Try this:
r'#<a href=".*?">(.*?)</a>'
This is also a little more specific, in that it will only match <a> tags. Also note that it's good practice to specify regular expressions as raw string literals (the r at the beginning). The parentheses, (.*?), form a capturing group. From the docs:
(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below.
You can refer back to this group in your replacement argument as \g<#>, where # is the number of the group you want. We've only defined one group, so it's naturally the first one: \g<1>.
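As a minimal sketch of how the group behaves (the variable names here are only illustrative), the snippet below pulls the captured text out of a match and then uses it in a replacement. It also shows that \1 is an equivalent spelling of \g<1>; the \g<1> form is the safer choice when the reference is immediately followed by a digit, since \g<1>0 is unambiguous while \10 would mean group 10.

```python
import re

text = '#<a href="stuff1">stuff1</a>'
pattern = re.compile(r'#<a href=".*?">(.*?)</a>')

# Group 1 holds whatever the parenthesized part of the pattern matched.
match = pattern.search(text)
link_text = match.group(1)

# \g<1> and \1 both refer to group 1 inside a replacement string.
with_g = pattern.sub(r'#\g<1>', text)
with_digit = pattern.sub(r'#\1', text)

print(link_text)
print(with_g)
print(with_digit)
```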
Additionally, once you've compiled a regular expression, you can call its own sub method:
pattern = re.compile(r'my pattern')
pattern.sub(r'replacement', 'text')
The module-level re.sub function is usually for when you haven't compiled:
re.sub(r'my pattern', r'replacement', 'text')
The performance difference is usually nil or minimal, so use whichever makes your code clearer. (Personally, I usually prefer compiling. Like any other variable, a compiled pattern lets me use a clear, reusable name.)
So your code would be:
import re
pound_links = re.compile(r'#<a href=".*?">(.*?)</a>')
output = pound_links.sub(r'#\g<1>', '#<a href="stuff1">stuff1</a>')
print(output)
Or:
import re
output = re.sub(r'#<a href=".*?">(.*?)</a>',
                r'#\g<1>',
                '#<a href="stuff1">stuff1</a>')
print(output)
Either one outputs:
#stuff1