java - Using Lucene Analyzer Without Indexing - Is My Approach Reasonable?

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

My objective is to use some of Lucene's many tokenizers and filters to transform input text, without creating any indexes.

For example, given this (contrived) input string...

" Someone’s - [texté] goes here, foo . "

...and a Lucene analyzer like this...

Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("lowercase")
        .addTokenFilter("icuFolding")
        .build();

I want to get the following output:

someone's texte goes here foo

The Java method below does what I want.

But is there a better (i.e. more typical and/or concise) way that I should be doing this?

I am specifically thinking about the way I have used TokenStream and CharTermAttribute, since I have never used them like this before. It feels clunky.

Here is the code:

Imports (Lucene 8.3.0, plus java.io for the method below):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

My method:

private String transform(String input) throws IOException {

    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("lowercase")
            .addTokenFilter("icuFolding")
            .build();

    TokenStream ts = analyzer.tokenStream("myField", new StringReader(input));
    CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);

    StringBuilder sb = new StringBuilder();
    try {
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(charTermAtt.toString()).append(" ");
        }
        ts.end();
    } finally {
        ts.close();
    }
    return sb.toString().trim();
}

1 Answer

深蓝 · Answer 1 · 2021-10-23T21:31:55+0000

I have been using this set-up for a few weeks without issue. I have not found a more concise approach. I think the code in the question is OK.
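
As a point of comparison only, the transformation described in the question can be roughly approximated with the JDK alone. This is a minimal sketch, not Lucene's API, and it is not equivalent to the icu tokenizer or icuFolding: the class name PlainTextFolding and its methods are hypothetical, and it assumes that splitting on whitespace, stripping punctuation at token edges, removing combining accents, and mapping the typographic apostrophe to the ASCII one is enough for the input at hand.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: a rough JDK-only stand-in for the
// tokenizer + lowercase + folding chain used in the question.
final class PlainTextFolding {

    static String transform(String input) {
        List<String> tokens = new ArrayList<>();
        // Crude tokenization: split on runs of whitespace only.
        for (String raw : input.trim().split("\\s+")) {
            // Strip leading and trailing characters that are neither letters nor digits.
            String token = raw.replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
            if (!token.isEmpty()) {
                tokens.add(fold(token));
            }
        }
        return String.join(" ", tokens);
    }

    static String fold(String token) {
        // Decompose accented characters, then drop the combining marks.
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        // Map the typographic apostrophe to the ASCII one, then lowercase.
        return withoutMarks.replace('\u2019', '\'').toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Same contrived input as in the question.
        System.out.println(transform(" Someone\u2019s - [text\u00e9] goes here, foo . "));
        // prints: someone's texte goes here foo
    }
}
```

For anything beyond simple Latin-script text (CJK segmentation, full Unicode folding, script-aware word breaks), the analyzer chain from the question remains the more robust choice; this sketch only covers the contrived example.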
