Is there any way to preserve the punctuation marks !, ?, " and ' from my text documents using the parameters of CountVectorizer or TfidfVectorizer in scikit-learn?
You should customize the token_pattern parameter when you instantiate the vectorizer. For example:
vect = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'")
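To see which tokens such a pattern keeps, it can help to try the regular expression on its own first. The sketch below is only an approximation of what the vectorizer does with token_pattern (lowercase the text, then collect every match); the names TOKEN_PATTERN and tokenize and the sample sentence are made up for illustration and are not part of scikit-learn.

```python
import re

# The default word rule (two or more word characters) plus the four
# punctuation marks, each as its own alternative. Hypothetical name.
TOKEN_PATTERN = r"(?u)\b\w\w+\b|!|\?|\"|\'"


def tokenize(text):
    # Lowercase first, as the vectorizers do by default,
    # then return every match of the pattern in order.
    return re.findall(TOKEN_PATTERN, text.lower())


tokens = tokenize('Wow! Is this "fine"? It\'s fine.')
print(tokens)
```

In this sample the !, ?, " and ' characters come back as separate tokens, while the period and the lone "s" after the apostrophe are dropped, because the word rule still requires at least two characters. The same pattern string can be passed as token_pattern to TfidfVectorizer, which accepts the same parameter.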