You are not using the correct notation for non-BMP Unicode code points; you want to use \U0001FFFF, a capital U and 8 hex digits:
myre = re.compile(u'['
    u'\U0001F300-\U0001F5FF'
    u'\U0001F600-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+',
    re.UNICODE)
This can be reduced to:
myre = re.compile(u'['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+',
    re.UNICODE)
as your first two ranges are adjacent.
Your version was specifying (with added spaces for readability):
[\u1F30 0-\u1F5F F\u1F60 0-\u1F64 F\u1F68 0-\u1F6F F \u2600-\u26FF\u2700-\u27BF]+
That's because the \uxxxx escape sequence always takes only 4 hex digits, not 5.
The largest of those ranges is 0-\u1F6F (so from the digit 0 through to Ὧ, U+1F6F), which covers a very large swathe of the Unicode standard, including ASCII digits and letters.
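As a quick check of the point above, here is a minimal sketch showing how the two escape forms are parsed. The variable names are illustrative only, and the miswritten pattern is reduced to a single range to keep it short.

```python
import re

# Illustrative names; not part of the original script.
wrong = u'\u1F300'       # \u consumes 4 hex digits: U+1F30, then the digit '0'
right = u'\U0001F300'    # \U consumes 8 hex digits: one non-BMP code point

print(len(wrong))                      # 2
print([hex(ord(c)) for c in wrong])    # ['0x1f30', '0x30']

# The miswritten class holds U+1F30, the range '0' through U+1F5F, and 'F',
# so it swallows ASCII digits and letters as well.
bad = re.compile(u'[\u1F300-\u1F5FF]+')
cleaned = bad.sub(u'', u'abc 123')     # only the space survives
```

The space is left alone because it sorts before the digit 0, which is the start of the accidental range.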
The corrected expression works, provided you use a UCS-4 wide Python executable:
>>> import re
>>> myre = re.compile(u'['
...     u'\U0001F300-\U0001F64F'
...     u'\U0001F680-\U0001F6FF'
...     u'\u2600-\u26FF\u2700-\u27BF]+',
...     re.UNICODE)
>>> myre.sub('', u'Some example text with a sleepy face: \U0001f62a')
u'Some example text with a sleepy face: '
The UCS-2 equivalent is:
myre = re.compile(u'('
    u'\ud83c[\udf00-\udfff]|'
    u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
    u'[\u2600-\u26FF\u2700-\u27BF])+',
    re.UNICODE)
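The surrogate values in the UCS-2 pattern follow from the standard UTF-16 encoding of the range endpoints. A small sketch of that arithmetic, using a hypothetical helper named surrogate_pair that is not part of the original answer:

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('not a supplementary-plane code point')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits of the offset
    return high, low


for cp in (0x1F300, 0x1F64F, 0x1F680, 0x1F6FF):
    high, low = surrogate_pair(cp)
    print('U+%05X -> \\u%04x \\u%04x' % (cp, high, low))
# U+1F300 -> \ud83c \udf00
# U+1F64F -> \ud83d \ude4f
# U+1F680 -> \ud83d \ude80
# U+1F6FF -> \ud83d \udeff
```

The range U+1F300 to U+1F64F crosses from high surrogate \ud83c to \ud83d at U+1F400, which is why the narrow-build pattern needs two alternatives for it.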
You can combine the two into your script with an exception handler:
try:
    # Wide UCS-4 build
    myre = re.compile(u'['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+',
        re.UNICODE)
except re.error:
    # Narrow UCS-2 build
    myre = re.compile(u'('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+',
        re.UNICODE)
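A different way to pick the pattern, swapped in here purely as an illustration, is to test the build explicitly through sys.maxunicode rather than rely on a compile error. This sketch assumes the same two patterns as above; on Python 3.3 and later the wide branch is always taken.

```python
import re
import sys

# sys.maxunicode is 0xFFFF on narrow (UCS-2) builds and 0x10FFFF otherwise.
if sys.maxunicode > 0xFFFF:
    # Wide build: non-BMP code points are single characters.
    pattern = (u'['
               u'\U0001F300-\U0001F64F'
               u'\U0001F680-\U0001F6FF'
               u'\u2600-\u26FF\u2700-\u27BF]+')
else:
    # Narrow build: match the surrogate pairs instead.
    pattern = (u'('
               u'\ud83c[\udf00-\udfff]|'
               u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
               u'[\u2600-\u26FF\u2700-\u27BF])+')

myre = re.compile(pattern, re.UNICODE)
stripped = myre.sub(u'', u'sleepy face: \U0001F62A')
```

Either approach ends up with the same myre object; the explicit check just makes the build dependency visible in the code.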
Of course, the regex is already out of date, as it doesn't cover Emoji defined in newer Unicode releases; it appears to cover Emoji defined up to Unicode 8.0 (since U+1F91D HANDSHAKE was added in Unicode 9.0).
If you need a more up-to-date regex, take one from a package that actively tracks new Emoji releases, such as the emoji package; it specifically supports generating such a regex:
import emoji

def remove_emoji(text):
    return emoji.get_emoji_regexp().sub(u'', text)
The package is currently up-to-date for Unicode 11.0 and has the infrastructure in place to update to future releases quickly. All your project has to do is upgrade along when there is a new release.