The sentence may include non-English characters, e.g. Chinese:
你好,hello world
The expected value for the length is 5 (2 Chinese characters, 2 English words, and 1 comma):
5
You can use the fact that most Chinese characters are located in the Unicode range 0x4e00 - 0x9fcc.
# -*- coding: utf-8 -*-
import re

s = '你好 hello, world'
# Python 2: decode the byte string to unicode
s = s.decode('utf-8')

# First find all 'normal' words and interpunction
# '[\x21-\x2f]' includes most interpunction, change it to ',' if you only need to match a comma
count = len(re.findall(r'\w+|[\x21-\x2f]', s))

# Then count every Chinese character as one item
for ch in s:
    # see https://stackoverflow.com/a/11415841/1248554 for additional ranges if needed
    # the bounds are inclusive so that 0x4e00 itself is counted
    if 0x4e00 <= ord(ch) <= 0x9fcc:
        count += 1

print count
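The snippet above relies on Python 2 (`s.decode` and the `print` statement). In Python 3, `str` is already Unicode and `\w` matches Chinese characters by default, so the regex would count `你好` as a word in addition to the two characters. A minimal sketch of the same idea for Python 3, assuming the `re.ASCII` flag is acceptable for restricting `\w` to ASCII words (the function name `count_tokens` is made up for illustration):

```python
import re

def count_tokens(sentence):
    """Count ASCII words, ASCII punctuation marks, and Chinese characters."""
    # re.ASCII keeps \w limited to [a-zA-Z0-9_], so Chinese characters
    # are not swallowed into a "word" here.
    count = len(re.findall(r'\w+|[\x21-\x2f]', sentence, flags=re.ASCII))
    # Each character in the range 0x4e00 - 0x9fcc counts as one item.
    count += sum(1 for ch in sentence if 0x4e00 <= ord(ch) <= 0x9fcc)
    return count

print(count_tokens('你好,hello world'))  # 5
```

As in the original, only the range 0x4e00 - 0x9fcc is checked, and punctuation outside `\x21-\x2f` (for example a full-width comma) is not counted.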