StandardTokenizerFactory :-
It tokenizes on whitespace and punctuation characters, stripping the punctuation out of the resulting tokens.
Documentation :-
Splits words at punctuation characters, removing punctuation.
However, a dot that's not followed by whitespace is considered part of
a token. Splits words at hyphens, unless there's a number in the
token. In that case, the whole token is interpreted as a product
number and is not split. Recognizes email addresses and Internet
hostnames as one token.
You would use this for fields where you want to search on the individual words within the field data.
e.g. -
http://example.com/I-am+example?Text=-Hello
would generate 7 tokens (shown comma-separated) -
http,example.com,I,am,example,Text,Hello
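To make this concrete, here is a minimal sketch that runs the same input through Lucene's StandardTokenizer, the class this factory creates. It assumes a recent Lucene version (where the tokenizer has a no-argument constructor and the input is supplied via setReader); exact token boundaries can vary slightly between Lucene releases.

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import java.io.StringReader;

    public class StandardTokenizerDemo {
        public static void main(String[] args) throws Exception {
            // The tokenizer class that StandardTokenizerFactory instantiates
            Tokenizer tok = new StandardTokenizer();
            tok.setReader(new StringReader("http://example.com/I-am+example?Text=-Hello"));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
                // Expected output: http, example.com, I, am, example, Text, Hello
                System.out.println(term.toString());
            }
            tok.end();
            tok.close();
        }
    }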
KeywordTokenizerFactory :-
Keyword Tokenizer does not split the input at all.
No processing is performed on the string, and the whole string is treated as a single entity.
This doesn't actually do any tokenization. It returns the original text as one term.
Mainly used for sorting or faceting requirements, where you want to match the exact facet value when filtering on multi-word values, and for sorting, since sorting does not work on tokenized fields.
e.g.
http://example.com/I-am+example?Text=-Hello
would generate a single token -
http://example.com/I-am+example?Text=-Hello
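For comparison, a similar sketch with Lucene's KeywordTokenizer (the class behind KeywordTokenizerFactory) shows the whole input coming back as a single token; the same assumption about a recent Lucene version applies.

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.KeywordTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import java.io.StringReader;

    public class KeywordTokenizerDemo {
        public static void main(String[] args) throws Exception {
            // The tokenizer class that KeywordTokenizerFactory instantiates
            Tokenizer tok = new KeywordTokenizer();
            tok.setReader(new StringReader("http://example.com/I-am+example?Text=-Hello"));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
                // Prints the entire input string unchanged, as one token
                System.out.println(term.toString());
            }
            tok.end();
            tok.close();
        }
    }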