I am attempting to use the re module in Python 2.7.3 with Unicode-encoded Devanagari text. I have added from __future__ import unicode_literals to the top of my code, so all string literals should be unicode objects.
However, I am running into some odd problems with Python's regex matching. For instance, consider this name: "??????". This is a (mis-spelled) name, in Hindi, entered by one of my users. Any Hindi reader would recognise this as a word.
The following returns a match, as it should:
re.search("^[\w\s][\w\s]*", "??????", re.UNICODE)
But this does not:
re.search("^[\w\s][\w\s]*$", "??????", re.UNICODE)
Some spelunking revealed that only one character in this string, U+0915 (क), is recognised as falling within the \w character class. This looks incorrect, as the Unicode Character Database file on "derived core properties" lists other characters in this string (I have not checked all of them) as alphabetic, as indeed they are.
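For anyone who wants to reproduce the spelunking, here is a minimal sketch of the per-character check. The sample string is a hypothetical stand-in (the letter KA followed by the vowel sign I), not the actual name, and classify is just a helper name I made up for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals, print_function
import re
import unicodedata

# Hypothetical sample, NOT the name from the question:
# DEVANAGARI LETTER KA (U+0915) + DEVANAGARI VOWEL SIGN I (U+093F).
sample = "\u0915\u093f"

# Matches a single character that the re module treats as a word character.
word_char = re.compile(r"\w", re.UNICODE)

def classify(text):
    """Return (code point, Unicode general category, matched by \\w) per character."""
    return [
        ("U+%04X" % ord(ch), unicodedata.category(ch), bool(word_char.match(ch)))
        for ch in text
    ]

for row in classify(sample):
    print(row)
```

On the interpreters I would expect this to run on, the letter (category Lo) is reported as a word character while the vowel sign (category Mc, a combining mark) is not, which is the same pattern I see in the real name.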
Is this just a bug in Python's implementation? I could get around this by manually defining all the Devanagari alphanumeric characters as a character range, but that would be painful. Or am I doing something wrong?
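To make the workaround concrete, this is roughly what I have in mind. It is only a sketch: it assumes that accepting the whole Devanagari block (U+0900 to U+097F) is acceptable, even though that block also contains digits and punctuation such as the danda, and is_plain_name is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import re

# Assumption: allow every code point in the Devanagari block (U+0900-U+097F)
# in addition to whatever \w and \s already accept.
DEVANAGARI_BLOCK = "\u0900-\u097F"

# The range is spliced into the character class as literal characters,
# so it does not rely on the regex engine understanding \u escapes.
pattern = re.compile("^[\\w\\s" + DEVANAGARI_BLOCK + "]+$", re.UNICODE)

def is_plain_name(text):
    """True if text consists only of word characters, whitespace, or Devanagari."""
    return pattern.search(text) is not None
```

With this, a string such as "\u0915\u093f" (a letter plus a vowel sign) is accepted, but I would rather not maintain a range like this for every script my users might type in.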