Changeset 32506

Timestamp:

2018-10-09T19:24:52+13:00

Author:

ak19

Message:

Fix for a bug Kathy discovered in code I committed. After the upgrade to lucene 4, standalone wildcard searches such as season* worked, but boolean searches that combined a wildcard term with regular terms or with other wildcard terms did not. When the query was a BooleanQuery, the wildcard terms it contained were not expanded, even though BooleanQuery's source code shows that it otherwise rewrites its clauses recursively. The solution is to rewrite the query recursively ourselves, so that MultiTermQuery clauses inside a BooleanQuery are handled in addition to standalone MultiTermQuerys (which can be of type WildcardQuery or PrefixQuery, though they get wrapped in ConstantScoreQuery objects). I've moved the existing code that deals with MultiTermQuerys into the new recursive function, which now also recursively rewrites BooleanQuerys so that the MultiTermQuery objects they contain are preserved and expanded.
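
The idea behind the fix can be sketched without Lucene. The following is a minimal illustration only: Query, Term, Wildcard and Bool are hypothetical stand-ins (not Lucene classes) for Query, TermQuery, a prefix-style MultiTermQuery and BooleanQuery, and indexTerms is an assumed in-memory term list standing in for the IndexReader. Unlike the committed recursiveRewriteQuery(), which updates clauses in place and then delegates to Lucene's own rewrite(), this sketch returns new objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-ins for Lucene's query types; none of these are Lucene classes.
class RecursiveRewriteSketch {

    interface Query {}

    // A plain term such as "farm".
    static final class Term implements Query {
        final String text;
        Term(String text) { this.text = text; }
        @Override public String toString() { return text; }
    }

    // A prefix wildcard such as "season*"; plays the role of a MultiTermQuery.
    static final class Wildcard implements Query {
        final String prefix;
        Wildcard(String prefix) { this.prefix = prefix; }
        @Override public String toString() { return prefix + "*"; }
    }

    // A list of sub-queries; plays the role of a BooleanQuery.
    static final class Bool implements Query {
        final List<Query> clauses;
        Bool(List<Query> clauses) { this.clauses = clauses; }
        @Override public String toString() { return clauses.toString(); }
    }

    // Recursively rewrite a query so that every wildcard, however deeply
    // nested inside boolean clauses, is expanded to the index terms it matches.
    static Query rewrite(Query q, List<String> indexTerms) {
        if (q instanceof Bool) {
            // Recursive step: rewrite each clause before rebuilding the parent.
            List<Query> rewritten = new ArrayList<>();
            for (Query clause : ((Bool) q).clauses) {
                rewritten.add(rewrite(clause, indexTerms));
            }
            return new Bool(rewritten);
        }
        if (q instanceof Wildcard) {
            // Expansion step: replace the wildcard with its matching terms.
            String prefix = ((Wildcard) q).prefix;
            List<Query> matches = new ArrayList<>();
            for (String t : indexTerms) {
                if (t.startsWith(prefix)) {
                    matches.add(new Term(t));
                }
            }
            return new Bool(matches);
        }
        // Plain terms need no rewriting.
        return q;
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList("farm", "farmer", "season", "seasonal", "snail");
        Query q = new Bool(Arrays.<Query>asList(new Wildcard("season"), new Term("farm")));
        System.out.println(rewrite(q, index)); // prints [[season, seasonal], farm]
    }
}
```

With the toy index above, the query "season* farm" keeps its regular term and gains the expanded wildcard terms, which is the behaviour the commit restores for boolean queries.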

File:

1 edited

main/trunk/greenstone2/common-src/indexers/lucene-gs/src/org/greenstone/LuceneWrapper4/GS2LuceneQuery.java (modified) (4 diffs)

main/trunk/greenstone2/common-src/indexers/lucene-gs/src/org/greenstone/LuceneWrapper4/GS2LuceneQuery.java

-              r30159
+              r32506
 import org.apache.lucene.queryparser.classic.ParseException;
 import org.apache.lucene.queryparser.classic.QueryParser;
+import org.apache.lucene.search.BooleanClause;
 import org.apache.lucene.search.BooleanQuery; // for the TooManyClauses exception
+import org.apache.lucene.search.ConstantScoreQuery;
 import org.apache.lucene.search.Filter;
 import org.apache.lucene.search.IndexSearcher;
 …
         query_including_stop_words = query_including_stop_words.rewrite(reader);
         // System.err.println("********* query_string " + query_string + "****");
+        System.err.println("********* query_string " + query_string + "****");
         Query query = parseQuery(reader, query_parser, query_string, fuzziness);
+        // GS2's LuceneWrapper uses lucene-2.3.2. GS3's LuceneWrapper3 works with lucene-3.3.0.
+        // This change in lucene core library for GS3 (present since after version 2.4.1) had the
+        // side-effect that searching on "econom*" didn't display what terms it was searching for,
+        // whereas it had done so in GS2.
+        // The details of this problem and its current solution are explained in the ticket
+        // http://trac.greenstone.org/ticket/845
+        // We need to change the settings for the rewriteMethod in order to get searches on wildcards
+        // to produce search terms again when the query gets rewritten.
+        // We try, in order:
+        // 1. RewriteMethod set to BooleanQuery, to get it working as in GS2 which uses lucene-2.3.2
+        // it will expand wildcard searches to its terms when searching at both section AND doc level.
+        // If that throws a TooManyClauses exception (like when searching for "a*" over lucene demo collection)
+        // 2. Then try a custom rewriteMethod which sets termCountCutoff=350 and docCountPercent cutoff=100%
+        // If that throws a TooManyClauses exception (could perhaps happen if the collection has a huge number of docs)
+        // 3. Then try the default apache rewriteMethod with its optimum defaults of
+        // termCountCutoff=350 and docCountPercent cutoff=0.1%
+        //  See http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/MultiTermQuery.html
+        if(query instanceof MultiTermQuery) {
+        MultiTermQuery multiTermQuery = (MultiTermQuery)query;
+        multiTermQuery.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE);
+             // less CPU intensive than MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE)
+        }
+        try {
+        query = query.rewrite(reader);
+        }
+        catch(BooleanQuery.TooManyClauses clauseException) {
+        // Example test case: try searching the lucene demo collection for "a*"
+        // and you'll hit this exception
+        lucene_query_result.setError(LuceneQueryResult.TOO_MANY_CLAUSES_ERROR);
+        if(query instanceof MultiTermQuery) {
+            // CustomRewriteMethod: setting the docCountPercent cutoff to a custom 100%.
+            // This will at least expand the query to its terms when searching with wildcards at section-level
+            // (though it doesn't seem to work for doc-level searches, no matter what the cutoffs are set to).
+            MultiTermQuery.ConstantScoreAutoRewrite customRewriteMethod = new MultiTermQuery.ConstantScoreAutoRewrite();
+            customRewriteMethod.setDocCountPercent(100.0);
+            customRewriteMethod.setTermCountCutoff(350); // same as default
+            MultiTermQuery multiTermQuery = (MultiTermQuery)query;
+            multiTermQuery.setRewriteMethod(customRewriteMethod);
+            try {
+            query = query.rewrite(reader);
+            }
+            catch(BooleanQuery.TooManyClauses clauseExceptionAgain) {
+            // do what the code originally did: use the default rewriteMethod which
+            // uses a default docCountPercent=0.1 (%) and termCountCutoff=350
+            multiTermQuery.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT);
+            query = query.rewrite(reader);
+            }
+        }
+        }
+        query = recursiveRewriteQuery(query, reader);
+        System.err.println("@@@@ final query class name: " + query.getClass());
         // http://stackoverflow.com/questions/13537126/term-frequency-in-lucene-4-0
 …
         Term term = (Term) iter.next();
+        System.err.println("@@@ GS2LuceneQuery.java: Next term: " + term.text());
         BytesRef term_bytes = term.bytes();
         DocsEnum term_docs = MultiFields.getTermDocsEnum(reader, liveDocs, term.field(), term_bytes); // flags?
 …
+    }
+    // If you're dealing with a BooleanQuery, they need to be recursively rewritten
+    // as they can contain queries with wildcards (WildcardQuery|PrefixQuery subclasses of MultiTermQuery)
+    // e.g. season* farm
+    // If MultiTermQuery, then expand here. e.g. WildcardQuerys like season*.
+    // DON'T call this method from inside parseQuery() (in place of its query.rewrite()), because then wildcard
+    // queries like season* won't contain Terms (extractTerms() will be empty) since the ConstantScoreQuerys
+    // that a WildcardQuery gets rewritten to here will contain Filters in place of Terms.
+    // Call this method from runQuery() after it calls parseQuery().
+    // Now searches like these will work
+    //    season* farm
+    //    season* farm*
+    // and not just searches like the following which already used to work:
+    //    season*
+    //    snail farm
+    // Idea for this method came from inspecting source code to BooleanQuery
+    // https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java
+    // which also does a recursive rewrite. Unfortunately, the existing BooleanQuery does not handle MultiTermQuery
+    // subcomponents.
+    protected Query recursiveRewriteQuery(Query orig_query, IndexReader reader) throws java.io.IOException
+    {
+    //Query query = orig_query.rewrite(reader);
+    Query query = orig_query;
+    if(orig_query instanceof BooleanQuery) {
+        BooleanQuery booleanQuery = (BooleanQuery)orig_query;
+        List<BooleanClause> clauses = booleanQuery.clauses();
+        for (BooleanClause clause : clauses) {
+        Query subQuery = clause.getQuery();
+        subQuery = recursiveRewriteQuery(subQuery, reader);
+        clause.setQuery(subQuery);
+        }
+    }
+    // GS2's LuceneWrapper uses lucene-2.3.2. GS3's LuceneWrapper3 works with lucene-3.3.0.
+        // This change in lucene core library for GS3 (present since after version 2.4.1) had the
+        // side-effect that searching on "econom*" didn't display what terms it was searching for,
+        // whereas it had done so in GS2.
+        // The details of this problem and its current solution are explained in the ticket
+        // http://trac.greenstone.org/ticket/845
+        // We need to change the settings for the rewriteMethod in order to get searches on wildcards
+        // to produce search terms again when the query gets rewritten.
+        // We try, in order:
+        // 1. RewriteMethod set to BooleanQuery, to get it working as in GS2 which uses lucene-2.3.2
+        // it will expand wildcard searches to its terms when searching at both section AND doc level.
+        // If that throws a TooManyClauses exception (like when searching for "a*" over lucene demo collection)
+        // 2. Then try a custom rewriteMethod which sets termCountCutoff=350 and docCountPercent cutoff=100%
+        // If that throws a TooManyClauses exception (could perhaps happen if the collection has a huge number of docs)
+        // 3. Then try the default apache rewriteMethod with its optimum defaults of
+        // termCountCutoff=350 and docCountPercent cutoff=0.1%
+        //  See http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/MultiTermQuery.html
+        System.err.println("@@@@ query class name: " + orig_query.getClass());
+        System.err.println("@@@@ QUERY: " + orig_query);
+        if(orig_query instanceof MultiTermQuery) {
+        MultiTermQuery multiTermQuery = (MultiTermQuery)orig_query;
+        multiTermQuery.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE);
+             // less CPU intensive than MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE)
+        }
+        try {
+        query = orig_query.rewrite(reader);
+        }
+        catch(BooleanQuery.TooManyClauses clauseException) {
+        // Example test case: try searching the lucene demo collection for "a*"
+        // and you'll hit this exception
+        //lucene_query_result.setError(LuceneQueryResult.TOO_MANY_CLAUSES_ERROR);
+        if(query instanceof MultiTermQuery) {
+            // CustomRewriteMethod: setting the docCountPercent cutoff to a custom 100%.
+            // This will at least expand the query to its terms when searching with wildcards at section-level
+            // (though it doesn't seem to work for doc-level searches, no matter what the cutoffs are set to).
+            MultiTermQuery.ConstantScoreAutoRewrite customRewriteMethod = new MultiTermQuery.ConstantScoreAutoRewrite();
+            customRewriteMethod.setDocCountPercent(100.0);
+            customRewriteMethod.setTermCountCutoff(350); // same as default
+            MultiTermQuery multiTermQuery = (MultiTermQuery)query;
+            multiTermQuery.setRewriteMethod(customRewriteMethod);
+            try {
+            query = query.rewrite(reader);
+            }
+            catch(BooleanQuery.TooManyClauses clauseExceptionAgain) {
+            // do what the code originally did: use the default rewriteMethod which
+            // uses a default docCountPercent=0.1 (%) and termCountCutoff=350
+            multiTermQuery.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT);
+            query = query.rewrite(reader);
+            }
+        }
+        }
+        if(orig_query == query) {
+        return query;
+        } else {
+        return recursiveRewriteQuery(query, reader);
+        }
+    }
     protected Filter parseFilterString(String filter_string)
+    {
