Changeset 32506 for main/trunk

Timestamp:
09.10.2018 19:24:52 (13 months ago)
Author:
ak19
Message:

Bugfix for a bug that Kathy discovered in code I committed: after the upgrade to Lucene 4, standalone wildcard searches worked, e.g. season*, but boolean searches that combine wildcard terms with regular terms or with other wildcard terms (e.g. season* farm, season* farm*) did not. If a query was a BooleanQuery, the wildcard terms it contained were not expanded, even though BooleanQuery's own rewrite is recursive according to its source code. The solution is to recursively rewrite the query ourselves, so that MultiTermQuery clauses inside a BooleanQuery are handled in addition to standalone MultiTermQuery objects (which can be of type WildcardQuery or PrefixQuery, though they get wrapped in ConstantScoreQuery objects). I've moved the existing code that deals with MultiTermQuery objects into the new recursive function, recursiveRewriteQuery(), which now also takes the recursive step of rewriting the clauses of a BooleanQuery so that its MultiTermQuery objects are preserved and expanded.
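
The recursive idea described in this message can be illustrated with a small, self-contained sketch. It does not use Lucene at all: Query, TermQ, PrefixQ and BoolQ are hypothetical stand-ins for Query, TermQuery, PrefixQuery/WildcardQuery and BooleanQuery, and the "index" is just a list of strings. It only shows the shape of the traversal (descend into boolean clauses, expand each wildcard where it is found), not the rewriteMethod handling or the TooManyClauses fallbacks of the real code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: these classes are assumptions for illustration, NOT the Lucene API.
class RecursiveRewriteSketch {

    interface Query {}

    // A plain term such as "farm".
    static final class TermQ implements Query {
        final String text;
        TermQ(String text) { this.text = text; }
        @Override public String toString() { return text; }
    }

    // A prefix wildcard such as "season*" (plays the role of a MultiTermQuery).
    static final class PrefixQ implements Query {
        final String prefix;
        PrefixQ(String prefix) { this.prefix = prefix; }
        @Override public String toString() { return prefix + "*"; }
    }

    // A group of sub-queries (plays the role of a BooleanQuery).
    static final class BoolQ implements Query {
        final List<Query> clauses;
        BoolQ(List<Query> clauses) { this.clauses = clauses; }
        @Override public String toString() {
            StringBuilder sb = new StringBuilder("(");
            for (int i = 0; i < clauses.size(); i++) {
                if (i > 0) sb.append(' ');
                sb.append(clauses.get(i));
            }
            return sb.append(')').toString();
        }
    }

    // Recursively rewrites a query tree: boolean clauses are visited one by one,
    // and every wildcard found at any depth is expanded to the index terms it matches.
    static Query rewrite(Query q, List<String> indexTerms) {
        if (q instanceof BoolQ) {
            List<Query> rewritten = new ArrayList<>();
            for (Query clause : ((BoolQ) q).clauses) {
                rewritten.add(rewrite(clause, indexTerms));
            }
            return new BoolQ(rewritten);
        }
        if (q instanceof PrefixQ) {
            String prefix = ((PrefixQ) q).prefix;
            List<Query> matches = new ArrayList<>();
            for (String term : indexTerms) {
                if (term.startsWith(prefix)) {
                    matches.add(new TermQ(term));
                }
            }
            return new BoolQ(matches);
        }
        return q; // plain terms need no expansion
    }

    public static void main(String[] args) {
        List<String> indexTerms = Arrays.asList("farm", "season", "seasonal", "snail");
        Query q = new BoolQ(Arrays.asList(new PrefixQ("season"), new TermQ("farm")));
        // prints "(season* farm) -> ((season seasonal) farm)"
        System.out.println(q + " -> " + rewrite(q, indexTerms));
    }
}
```

Unlike this sketch, which builds a new tree, the committed method updates each BooleanClause in place with clause.setQuery() and keeps calling itself until a rewrite returns the same object.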

Files:
1 modified

  • main/trunk/greenstone2/common-src/indexers/lucene-gs/src/org/greenstone/LuceneWrapper4/GS2LuceneQuery.java

    r30159 r32506  
    4040import org.apache.lucene.queryparser.classic.ParseException; 
    4141import org.apache.lucene.queryparser.classic.QueryParser; 
     42import org.apache.lucene.search.BooleanClause; 
    4243import org.apache.lucene.search.BooleanQuery; // for the TooManyClauses exception 
     44import org.apache.lucene.search.ConstantScoreQuery; 
    4345import org.apache.lucene.search.Filter; 
    4446import org.apache.lucene.search.IndexSearcher; 
     
    167169        query_including_stop_words = query_including_stop_words.rewrite(reader); 
    168170         
    169         // System.err.println("********* query_string " + query_string + "****"); 
     171        System.err.println("********* query_string " + query_string + "****"); 
    170172 
    171173        Query query = parseQuery(reader, query_parser, query_string, fuzziness); 
    172  
    173         // GS2's LuceneWrapper uses lucene-2.3.2. GS3's LuceneWrapper3 works with lucene-3.3.0.  
    174         // This change in lucene core library for GS3 (present since after version 2.4.1) had the 
    175         // side-effect that searching on "econom*" didn't display what terms it was searching for,  
    176         // whereas it had done so in GS2.  
    177  
    178         // The details of this problem and its current solution are explained in the ticket  
    179         // http://trac.greenstone.org/ticket/845 
    180  
    181         // We need to change the settings for the rewriteMethod in order to get searches on wildcards 
    182         // to produce search terms again when the query gets rewritten. 
    183  
    184         // We try, in order: 
    185         // 1. RewriteMethod set to BooleanQuery, to get it working as in GS2 which uses lucene-2.3.2 
    186         // it will expand wildcard searches to its terms when searching at both section AND doc level. 
    187         // If that throws a TooManyClauses exception (like when searching for "a*" over lucene demo collection) 
    188         // 2. Then try a custom rewriteMethod which sets termCountCutoff=350 and docCountPercent cutoff=0.1% 
    189         // If that throws a TooManyClauses exception (could perhaps happen if the collection has a huge number of docs 
    190         // 3. Then try the default apache rewriteMethod with its optimum defaults of  
    191         // termCountCutoff=350 and docCountPercent cutoff=0.1% 
    192         //  See http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/MultiTermQuery.html 
    193  
    194         if(query instanceof MultiTermQuery) { 
    195         MultiTermQuery multiTermQuery = (MultiTermQuery)query; 
    196         multiTermQuery.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE); 
    197              // less CPU intensive than MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE) 
    198         } 
    199  
    200         try { 
    201         query = query.rewrite(reader); 
    202         }  
    203         catch(BooleanQuery.TooManyClauses clauseException) { 
    204         // Example test case: try searching the lucene demo collection for "a*"  
    205         // and you'll hit this exception 
    206  
    207         lucene_query_result.setError(LuceneQueryResult.TOO_MANY_CLAUSES_ERROR); 
    208  
    209         if(query instanceof MultiTermQuery) { 
    210  
    211             // CustomRewriteMethod: setting the docCountPercent cutoff to a custom 100%.  
    212             // This will at least expand the query to its terms when searching with wildcards at section-level  
    213             // (though it doesn't seem to work for doc-level searches, no matter what the cutoffs are set to). 
    214  
    215             MultiTermQuery.ConstantScoreAutoRewrite customRewriteMethod = new MultiTermQuery.ConstantScoreAutoRewrite(); 
    216             customRewriteMethod.setDocCountPercent(100.0); 
    217             customRewriteMethod.setTermCountCutoff(350); // same as default 
    218              
    219             MultiTermQuery multiTermQuery = (MultiTermQuery)query; 
    220             multiTermQuery.setRewriteMethod(customRewriteMethod); 
    221             try { 
    222             query = query.rewrite(reader); 
    223             }  
    224             catch(BooleanQuery.TooManyClauses clauseExceptionAgain) { 
    225  
    226             // do what the code originally did: use the default rewriteMethod which 
    227             // uses a default docCountPercent=0.1 (%) and termCountCutoff=350 
    228  
    229             multiTermQuery.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT); 
    230             query = query.rewrite(reader); 
    231             } 
    232         } 
    233         } 
     174        query = recursiveRewriteQuery(query, reader);        
     175        System.err.println("@@@@ final query class name: " + query.getClass()); 
    234176 
    235177        // http://stackoverflow.com/questions/13537126/term-frequency-in-lucene-4-0 
     
    259201             
    260202        Term term = (Term) iter.next(); 
     203        System.err.println("@@@ GS2LuceneQuery.java: Next term: " + term.text()); 
    261204        BytesRef term_bytes = term.bytes(); 
    262205        DocsEnum term_docs = MultiFields.getTermDocsEnum(reader, liveDocs, term.field(), term_bytes); // flags? 
     
    516459    } 
    517460 
     461    // If you're dealing with a BooleanQuery, they need to be recursively rewritten 
     462    // as they can contain queries with wildcards (WildcardQuery|PrefixQuery subclasses of MultiTermQuery) 
     463    // e.g. season* farm 
     464    // If MultiTermQuery, then expand here. e.g. WildcardQuerys like season*. 
     465    // DON'T call this method from inside parseQuery() (in place of its query.rewrite()), because then wildcard 
     466    // queries like season* won't contain Terms (extractTerms() will be empty) since the ConstantScoreQuerys 
     467    // that a WildcardQuery gets rewritten to here will contain Filters in place of Terms. 
     468    // Call this method from runQuery() after it calls parseQuery(). 
     469    // Now searches like these will work 
     470    //    season* farm 
     471    //    season* farm* 
     472    // and not just searches like the following which already used to work: 
     473    //    season* 
     474    //    snail farm 
     475    // Idea for this method came from inspecting source code to BooleanQuery 
     476    // https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java 
     477    // which also does a recursive rewrite. Unfortunately, the existing BooleanQuery does not handle MultiTermQuery 
     478    // subcomponents. 
     479    protected Query recursiveRewriteQuery(Query orig_query, IndexReader reader) throws java.io.IOException 
     480    { 
     481    //Query query = orig_query.rewrite(reader); 
     482    Query query = orig_query; 
     483 
     484    if(orig_query instanceof BooleanQuery) { 
     485        BooleanQuery booleanQuery = (BooleanQuery)orig_query; 
     486        List<BooleanClause> clauses = booleanQuery.clauses();  
     487        for (BooleanClause clause : clauses) { 
     488        Query subQuery = clause.getQuery(); 
     489        subQuery = recursiveRewriteQuery(subQuery, reader); 
     490        clause.setQuery(subQuery); 
     491        } 
     492    } 
     493     
     494    // GS2's LuceneWrapper uses lucene-2.3.2. GS3's LuceneWrapper3 works with lucene-3.3.0.  
     495        // This change in lucene core library for GS3 (present since after version 2.4.1) had the 
     496        // side-effect that searching on "econom*" didn't display what terms it was searching for,  
     497        // whereas it had done so in GS2.  
     498 
     499        // The details of this problem and its current solution are explained in the ticket  
     500        // http://trac.greenstone.org/ticket/845 
     501 
     502        // We need to change the settings for the rewriteMethod in order to get searches on wildcards 
     503        // to produce search terms again when the query gets rewritten. 
     504 
     505        // We try, in order: 
     506        // 1. RewriteMethod set to BooleanQuery, to get it working as in GS2 which uses lucene-2.3.2 
     507        // it will expand wildcard searches to its terms when searching at both section AND doc level. 
     508        // If that throws a TooManyClauses exception (like when searching for "a*" over lucene demo collection) 
     509        // 2. Then try a custom rewriteMethod which sets termCountCutoff=350 and docCountPercent cutoff=100% 
     510        // If that throws a TooManyClauses exception (could perhaps happen if the collection has a huge number of docs) 
     511        // 3. Then try the default apache rewriteMethod with its optimum defaults of  
     512        // termCountCutoff=350 and docCountPercent cutoff=0.1% 
     513        //  See http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/MultiTermQuery.html 
     514 
     515        System.err.println("@@@@ query class name: " + orig_query.getClass()); 
     516        System.err.println("@@@@ QUERY: " + orig_query); 
     517 
     518        if(orig_query instanceof MultiTermQuery) { 
     519        MultiTermQuery multiTermQuery = (MultiTermQuery)orig_query; 
     520        multiTermQuery.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE); 
     521             // less CPU intensive than MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE) 
     522        } 
     523 
     524        try { 
     525        query = orig_query.rewrite(reader); 
     526        }  
     527        catch(BooleanQuery.TooManyClauses clauseException) { 
     528        // Example test case: try searching the lucene demo collection for "a*"  
     529        // and you'll hit this exception 
     530 
     531        //lucene_query_result.setError(LuceneQueryResult.TOO_MANY_CLAUSES_ERROR); 
     532 
     533        if(query instanceof MultiTermQuery) { 
     534 
     535            // CustomRewriteMethod: setting the docCountPercent cutoff to a custom 100%.  
     536            // This will at least expand the query to its terms when searching with wildcards at section-level  
     537            // (though it doesn't seem to work for doc-level searches, no matter what the cutoffs are set to). 
     538 
     539            MultiTermQuery.ConstantScoreAutoRewrite customRewriteMethod = new MultiTermQuery.ConstantScoreAutoRewrite(); 
     540            customRewriteMethod.setDocCountPercent(100.0); 
     541            customRewriteMethod.setTermCountCutoff(350); // same as default 
     542             
     543            MultiTermQuery multiTermQuery = (MultiTermQuery)query; 
     544            multiTermQuery.setRewriteMethod(customRewriteMethod); 
     545            try { 
     546            query = query.rewrite(reader); 
     547            }  
     548            catch(BooleanQuery.TooManyClauses clauseExceptionAgain) { 
     549 
     550            // do what the code originally did: use the default rewriteMethod which 
     551            // uses a default docCountPercent=0.1 (%) and termCountCutoff=350 
     552 
     553            multiTermQuery.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT); 
     554            query = query.rewrite(reader); 
     555            } 
     556        } 
     557        } 
     558 
     559        if(orig_query == query) { 
     560        return query; 
     561        } else { 
     562        return recursiveRewriteQuery(query, reader); 
     563        } 
     564    } 
     565 
    518566    protected Filter parseFilterString(String filter_string) 
    519567    {