Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
314 views
in Technique[技术] by (71.8m points)

wildcard - Lucene query: bla~* (match words that start with something fuzzy), how?

In the Lucene query syntax I'd like to combine * and ~ in a valid query similar to: bla~* //invalid query

Meaning: Please match words that begin with "bla" or something similar to "bla".

Update: What I do now, works for small input, is use the following (snippet of SOLR schema):

<fieldtype name="text_ngrams" class="solr.TextField">
  <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>

In case you don't use SOLR, this does the following.

Indextime: Index data by creating a field containing all prefixes of my (short) input.

Searchtime: only use the ~ operator, as prefixes are explicitly present in the index.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

in the development trunk of lucene (not yet a release), there is code to support use cases like this, via AutomatonQuery. Warning: the APIs might/will change before its released, but it gives you the idea.

Here is an example for your case:

// a term representative of the query, containing the field. 
// the term text is not so important and only used for toString() and such
Term term = new Term("yourfield", "bla~*");

// builds a DFA that accepts all strings within an edit distance of 2 from "bla"
Automaton fuzzy = new LevenshteinAutomata("bla").toAutomaton(2);

// concatenate this DFA with another DFA equivalent to the "*" operator
Automaton fuzzyPrefix = BasicOperations.concatenate(fuzzy, BasicAutomata.makeAnyString());

// build a query, search with it to get results.
AutomatonQuery query = new AutomatonQuery(term, fuzzyPrefix);

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

57.0k users

...