Welcome To Ask or Share your Answers For Others

python - How do I specify a range of unicode characters


1 Answer

answered Oct 17, 2021 by 深蓝 (71.8m points)

The syntax of your unicode range will not do what you expect.

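The raw r'' string prevents \u escapes from being parsed by Python, and the regex engine (Python 2's re module) will not parse them either, so each \u is treated as a literal u. The only range in this set is 0-\u, i.e. the characters '0' (48) through 'u' (117):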

>>> re.compile(r'[\u0020-\u00d7ff]', re.DEBUG)
in
  literal 117
  literal 48
  literal 48
  literal 50
  range (48, 117)
  literal 48
  literal 48
  literal 100
  literal 55
  literal 102
  literal 102

Making it a raw Unicode literal (ur'') causes \u escapes to be parsed while leaving other backslashes alone (although that's not a concern here), but the leading zeroes are messing it up. The syntax is \uxxxx (exactly four hex digits) or \Uxxxxxxxx (exactly eight), so \u00d7ff is parsed as "\u00d7, f, f".
```
>>> re.compile(ur'[\u0020-\u00d7ff]', re.DEBUG)
in
  range (32, 215)
  literal 102
  literal 102
```
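
The fixed-width rule can be checked directly on string literals. This is only a minimal sketch, and it assumes Python 3, where ordinary string literals parse \u and \U the way Python 2's u'' literals do; the variable names are made up for illustration.

```python
# Assumption: Python 3, where plain '' literals handle \u and \U escapes.
four_digit = '\u00d7ff'      # \u takes exactly four hex digits: U+00D7, then 'f', 'f'
eight_digit = '\U0000d7ff'   # \U takes exactly eight hex digits: U+D7FF
short_form = '\ud7ff'        # the same code point written with \u

# The first literal is three characters long, not one.
print([hex(ord(c)) for c in four_digit])

# Both spellings of U+D7FF produce the same single character.
print(hex(ord(eight_digit)), eight_digit == short_form)
```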

Removing the leading zeroes or switching to \U0000d7ff will fix it:

>>> re.compile(ur'[\u0020-\ud7ff]', re.DEBUG)
in
  range (32, 55295)
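
For readers on Python 3, where the ur'' prefix no longer exists, a rough equivalent might look like the sketch below. It assumes Python 3.4 or later: there the re module understands \uXXXX on its own (so a plain raw string works) and `fullmatch` is available. The names `allowed_chars` and `all_in_range` are hypothetical helpers, not part of the original answer.

```python
import re

# Assumption: Python 3.4+, where re parses \uXXXX itself inside a raw string.
# One or more characters, each in the range U+0020..U+D7FF.
allowed_chars = re.compile(r'[\u0020-\ud7ff]+')

def all_in_range(text):
    """Return True if text is non-empty and every character is in U+0020..U+D7FF."""
    return allowed_chars.fullmatch(text) is not None

print(all_in_range('caf\u00e9 \u00d7 2'))   # ordinary text inside the range
print(all_in_range('tab\there'))            # contains U+0009, below the range
print(all_in_range('\U0001F600'))           # U+1F600, above the range
```

The same four-digit pitfall applies here: writing the upper bound as \u00d7ff would again end the range at U+00D7 and add two literal f characters to the set.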


...