Ticket #845 (closed defect: fixed)

Opened 5 years ago

Last modified 5 years ago

Searching a GS3 lucene collection with wildcards doesn't show terms

Reported by: ak19 Owned by: ak19
Priority: moderate Milestone: 3.05 Release
Component: Greenstone3 Runtime Severity: major
Keywords: Cc:

Description

In GS2, searching a lucene collection displays information for possible search terms when the original search used a wildcard.

This doesn't happen with a lucene collection in GS3. The disparity is owing to GS3's LuceneWrapper3.jar using version 3.3.0 of the lucene-core library, whereas GS2's LuceneWrapper.jar used lucene 2.3.2.

E.g. when searching for econom* in a GS2 lucene collection containing the documents from Demo (as in the indexers tutorial), the search results start with:

Word count: econometrics: 1, economique: 1, economist: 4, economical: 6, economists: 6, economically: 25, economics: 27, economies: 38, economy: 156, economic: 507

10 documents matched the query.

The same search in GS3 however produces just the documents in the results list without the term information seen above.

Change History

Changed 5 years ago by ak19

The cause of the problem and the solution.

GS2's LuceneWrapper uses lucene-2.3.2. GS3 needed LuceneWrapper3 to work with lucene-3.3.0 (probably for solr). This changeover to a later version of the lucene core library for GS3 had the side effect that searching on "econom*" no longer displayed the terms it was searching for, as it had done in GS2.

For GS2, the query upon rewrite would be expanded to:

TX:econometrics TX:economic TX:economical TX:economically TX:economics TX:economies TX:economique TX:economist TX:economists TX:economy

whereas for GS3 the query upon rewrite didn't expand to any terms at all. This was because of a change in the lucene core library. See https://issues.apache.org/jira/browse/LUCENE-1557
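To illustrate what such an expansion amounts to, here is a minimal plain-Java sketch that does not use Lucene at all. It expands a trailing-wildcard pattern against a sorted term dictionary and returns field-qualified terms. The class name WildcardExpansionSketch, the expand method and the hard-coded dictionary (the ten terms from the GS2 rewrite above plus two non-matching terms added for illustration) are assumptions made for this example, not Greenstone or Lucene code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class WildcardExpansionSketch {

    // Expand a pattern such as "econom*" against a sorted term dictionary and
    // return the matches as field-qualified terms, e.g. "TX:economy".
    // Only a single trailing '*' is handled in this sketch.
    public static List<String> expand(String field, String pattern, TreeSet<String> dictionary) {
        if (!pattern.endsWith("*")) {
            throw new IllegalArgumentException("sketch only supports a trailing '*'");
        }
        String prefix = pattern.substring(0, pattern.length() - 1);
        List<String> expanded = new ArrayList<String>();
        // tailSet starts at the first term >= prefix; stop at the first non-match.
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            expanded.add(field + ":" + term);
        }
        return expanded;
    }

    public static void main(String[] args) {
        // "ecology" and "education" are illustrative non-matching terms.
        TreeSet<String> dictionary = new TreeSet<String>(Arrays.asList(
            "ecology", "econometrics", "economic", "economical", "economically",
            "economics", "economies", "economique", "economist", "economists",
            "economy", "education"));
        System.out.println(String.join(" ", expand("TX", "econom*", dictionary)));
    }
}
```

Run as is, this prints the ten TX: terms in the same alphabetical order as the GS2 rewrite shown above.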

In more recent versions of the lucene library, queries of type MultiTermQuery are no longer rewritten to BooleanQuery by default, since this can throw an exception when there are too many terms. Instead, the RewriteMethod is by default set to a ConstantScoreAutoRewrite object. This can be changed back to using BooleanQuery by calling either setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE) or the slightly more optimal setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE). Both allow things to work again as before (searching for "econom*" gets expanded to multiple terms), but with the same potential for an exception when there are too many terms.

 http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/api/all/index.html?org/apache/lucene/search/Query.html

A ConstantScoreAutoRewrite as RewriteMethod, however, uses default document-count and term-count cutoff values to determine whether to use BooleanQuery for the rewrite or, if the number of terms might be too large and could throw an exception, to use something else.

 http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/api/all/org/apache/lucene/search/MultiTermQuery.ConstantScoreAutoRewrite.html

The default ConstantScoreAutoRewrite object that the MultiTermQuery.RewriteMethod member is set to has DocCountPercent=0.1 and TermCountCutoff=350. The default ConstantScoreAutoRewrite cannot be changed.
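As a rough aid to understanding how the two cutoffs interact, here is a simplified plain-Java model of the decision. It is an assumption based on my reading of the documented behaviour, not a copy of the Lucene source: the class AutoRewriteCutoffSketch, the method rewritesToBooleanQuery, the exact order of the checks, and the 50-document collection with its document frequencies are all hypothetical.

```java
public class AutoRewriteCutoffSketch {

    // Simplified model: walk the matching terms in order, summing their
    // document frequencies, and give up on the BooleanQuery form as soon as
    // either cutoff is reached. docCountPercent is a percentage of maxDoc,
    // so 0.1 means 0.1% of the documents in the index.
    public static boolean rewritesToBooleanQuery(int[] docFreqs, int maxDoc,
                                                 double docCountPercent, int termCountCutoff) {
        int docCountCutoff = (int) ((docCountPercent / 100.0) * maxDoc);
        int termsSeen = 0;
        int docsVisited = 0;
        for (int docFreq : docFreqs) {
            termsSeen++;
            if (termsSeen >= termCountCutoff || docsVisited >= docCountCutoff) {
                return false; // cutoff reached: a filter-style rewrite would be used
            }
            docsVisited += docFreq;
        }
        return true; // under both cutoffs: the query can expand to its terms
    }

    public static void main(String[] args) {
        int[] docFreqs = {1, 1, 4}; // hypothetical document frequencies of three terms
        int maxDoc = 50;            // hypothetical small collection

        // Default-like cutoffs: 0.1% of 50 documents truncates to 0.
        System.out.println(rewritesToBooleanQuery(docFreqs, maxDoc, 0.1, 350));   // false
        // Custom cutoffs with docCountPercent raised to 100%.
        System.out.println(rewritesToBooleanQuery(docFreqs, maxDoc, 100.0, 350)); // true
    }
}
```

In this model the 0.1% cutoff truncates to 0 documents for a small collection, so the very first term trips it. That is one plausible reading of why the default method produced no term information on the demo collection, but it is not confirmed here.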

In order to still get things to work as in GS2 when searching for "econom*", while also avoiding the BooleanQuery issue as much as possible, we replace the unalterable default ConstantScoreAutoRewrite object with a custom one. The custom object keeps the TermCountCutoff the same but allows the DocCountPercent to be 100%, so that a rewrite of the query "econom*" still returns the multiple search terms it would in GS2, which uses the earlier lucene library.

Changed 5 years ago by ak19

The code change is to be made to the runQuery() method in

common-src/indexers/lucene-gs/src/org/greenstone/LuceneWrapper3/GS2LuceneQuery.java

Changed 5 years ago by ak19

  • milestone set to 3.05 Release

Changed 5 years ago by ak19

...

Query query = parseQuery(reader, query_parser, query_string, fuzziness);

if (query instanceof MultiTermQuery) {
    // debug display of existing cutoff values to work out optimum values
    //MultiTermQuery.ConstantScoreAutoRewrite oldRewriteMethod = (MultiTermQuery.ConstantScoreAutoRewrite)MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT;
    // display oldRewriteMethod.getDocCountPercent()+"/"+oldRewriteMethod.getTermCountCutoff()
    // default docCountPercent=0.1; default termCountCutoff=350

    // Creating custom cutoff values, taking the existing cutoff values into account
    MultiTermQuery.ConstantScoreAutoRewrite customRewriteMethod = new MultiTermQuery.ConstantScoreAutoRewrite();
    customRewriteMethod.setDocCountPercent(100.0); // default would be MultiTermQuery.ConstantScoreAutoRewrite.DEFAULT_DOC_COUNT_PERCENT
    customRewriteMethod.setTermCountCutoff(350);

    MultiTermQuery multiTermQuery = (MultiTermQuery)query;
    // alternatives tried: MultiTermQuery.CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE, MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE
    multiTermQuery.setRewriteMethod(customRewriteMethod);
}

query = query.rewrite(reader);

...

Changed 5 years ago by ak19

As per https://issues.apache.org/jira/browse/LUCENE-1557 this change to the lucene core classes happened after lucene v2.4.1, when MultiTermQuery ceased to be rewritten to BooleanQuery by default.

Changed 5 years ago by ak19

  • owner changed from nobody to ak19

The bug has been solved as follows, by changing the rewriteMethod. See http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/MultiTermQuery.html

We try, in order:

1. RewriteMethod set to BooleanQuery, to get it working as in GS2, which uses lucene-2.3.2. This expands wildcard searches to their terms when searching at both section AND doc level.

If that throws a TooManyClauses exception (as when searching for "a*" over the lucene demo collection):

2. Then try a custom rewriteMethod which sets termCountCutoff=350 and docCountPercent cutoff=100% (the values used in the code above).

If that throws a TooManyClauses exception (which could perhaps happen if the collection has a huge number of docs):

3. Then try the default apache rewriteMethod with its optimum defaults of termCountCutoff=350 and docCountPercent cutoff=0.1%
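The three attempts above can be sketched as a generic fallback chain. This is a plain-Java illustration of the control flow only, not the code committed to GS2LuceneQuery. The class RewriteFallbackSketch, the RewriteAttempt interface, the local TooManyClausesException (a stand-in for Lucene's TooManyClauses) and the three pretend strategies are all hypothetical names introduced for this example.

```java
import java.util.Arrays;
import java.util.List;

public class RewriteFallbackSketch {

    // Local stand-in for Lucene's TooManyClauses exception (hypothetical).
    public static class TooManyClausesException extends RuntimeException {
        public TooManyClausesException(String message) {
            super(message);
        }
    }

    // One rewrite strategy: returns a description of the rewritten query,
    // or throws TooManyClausesException if the expansion is too large.
    public interface RewriteAttempt {
        String rewrite(String queryString);
    }

    // Try each strategy in order; the first one that does not throw wins.
    public static String rewriteWithFallback(String queryString, List<RewriteAttempt> attempts) {
        TooManyClausesException last = null;
        for (RewriteAttempt attempt : attempts) {
            try {
                return attempt.rewrite(queryString);
            } catch (TooManyClausesException e) {
                last = e; // remember the failure and move on to the next strategy
            }
        }
        if (last == null) {
            throw new IllegalArgumentException("no rewrite attempts supplied");
        }
        throw last; // every strategy failed
    }

    public static void main(String[] args) {
        // Pretend that a very short prefix expands to too many terms.
        RewriteAttempt booleanRewrite = q -> {
            if (q.length() <= 2) {
                throw new TooManyClausesException("too many terms for " + q);
            }
            return "boolean(" + q + ")";
        };
        RewriteAttempt customAuto = q -> "customAuto(" + q + ")";
        RewriteAttempt defaultAuto = q -> "defaultAuto(" + q + ")";
        List<RewriteAttempt> attempts = Arrays.asList(booleanRewrite, customAuto, defaultAuto);

        System.out.println(rewriteWithFallback("econom*", attempts)); // boolean(econom*)
        System.out.println(rewriteWithFallback("a*", attempts));      // customAuto(a*)
    }
}
```

Ordering the attempts from most informative to most conservative means term information is shown whenever the cheapest-to-explain strategy succeeds, while the later strategies still return search results when it does not.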

Changed 5 years ago by ak19

  • status changed from new to closed
  • resolution set to fixed

Changed 5 years ago by ak19

Preliminary fix to GS2LuceneQuery was in http://trac.greenstone.org/changeset/26155

Final and more comprehensive fix (described above) is in the commit to GS2LuceneQuery at http://trac.greenstone.org/changeset/26157

Changed 5 years ago by ak19

Note:

Attempt 2, used if setting the RewriteMethod to BooleanQuery fails, uses a custom rewriteMethod which sets termCountCutoff=350 and docCountPercent cutoff=100% (the values used in the code above). This does not produce term information when performing a wildcard search at document level, only at section level.

Attempt 3, if the above fails, is to use the default lucene rewriteMethod, with its optimised defaults of termCountCutoff=350 and docCountPercent cutoff=0.1%. This does not show any term information in the search results for "econom*" in the lucene demo collection, whether at doc level or section level.

Therefore attempt 1 (BooleanQuery) is the ideal, but if that doesn't work, attempts 2 and then 3 are tried in turn, to at least produce accurate search results even if term information is not always produced in those cases.
