Opened 12 years ago

Closed 12 years ago

Last modified 6 years ago

#845 closed defect (fixed)

Searching a GS3 lucene collection with wildcards doesn't show terms

Reported by: ak19 Owned by: ak19
Priority: moderate Milestone: 3.05 Release
Component: Greenstone3 Runtime Severity: major
Keywords: Cc:

Description

In GS2, searching a lucene collection displays information for possible search terms when the original search used a wildcard.

This doesn't happen with a lucene collection in GS3. The disparity is owing to GS3's LuceneWrapper3.jar using version 3.3.0 of the lucene-core library, whereas GS2's LuceneWrapper.jar used lucene 2.3.2.

E.g. when searching for econom* in a GS2 lucene collection containing the documents from Demo (as in the indexers tutorial), the search results start with:

Word count: econometrics: 1, economique: 1, economist: 4, economical: 6, economists: 6, economically: 25, economics: 27, economies: 38, economy: 156, economic: 507

10 documents matched the query.

The same search in GS3 however produces just the documents in the results list without the term information seen above.

Change History (10)

comment:1 by ak19, 12 years ago

The cause of the problem and the solution.

GS2's LuceneWrapper uses lucene-2.3.2. GS3 needed LuceneWrapper3 to work with lucene-3.3.0 (probably for solr). This changeover to a later version of the lucene core library for GS3 had the side effect that searching on "econom*" no longer displayed the terms it was searching for, as it had done in GS2.

For GS2, the query upon rewrite would be expanded to:

TX:econometrics TX:economic TX:economical TX:economically TX:economics TX:economies TX:economique TX:economist TX:economists TX:economy

whereas for GS3 the query upon rewrite didn't expand to any terms at all. This was because of a change in the lucene core library. See https://issues.apache.org/jira/browse/LUCENE-1557

In more recent versions of the lucene library, queries of type MultiTermQuery are no longer rewritten to BooleanQuery by default, since this can throw an exception when there are too many terms. Instead, the RewriteMethod is set by default to a ConstantScoreAutoRewrite object. This can be changed back to using BooleanQuery by calling either setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE) or the slightly more optimal setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE). Both allow things to work again as before (searching for "econom*" gets expanded to multiple terms), but with the same potential for an exception when there are too many terms.

http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/api/all/index.html?org/apache/lucene/search/Query.html

A ConstantScoreAutoRewrite as RewriteMethod, however, uses default document-count and term-count cutoff values to decide whether to use BooleanQuery for the rewrite or, if the number of terms is large enough that it might throw an exception, to use something else upon rewrite.

http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/api/all/org/apache/lucene/search/MultiTermQuery.ConstantScoreAutoRewrite.html

The default ConstantScoreAutoRewrite object that the MultiTermQuery.RewriteMethod member is set to has DocumentCountPercent=0.1 and TermCountCutoff=350. The default ConstantScoreAutoRewrite cannot be changed.

To still get things to work as in GS2 when searching for "econom*", while also avoiding the BooleanQuery issue as much as possible, we replace the unalterable default ConstantScoreAutoRewrite object with a custom one. The custom object keeps the TermCountCutoff the same but sets the DocCountPercent to 100%, so that a rewrite of the query "econom*" still returns the multiple search terms it would in GS2, which uses the earlier lucene library.
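To illustrate why raising the doc-count cutoff matters, here is a simplified model of the cutoff decision. It is a sketch only and not Lucene's code: the class name CutoffModel, the method expandsToBooleanQuery, and the sample document frequencies and maxDoc value are all hypothetical, and the real ConstantScoreAutoRewrite does more (for instance, it also takes the BooleanQuery maximum clause count into account).

```java
// Simplified stand-in for the auto-rewrite cutoff decision; NOT Lucene code.
class CutoffModel {

    // Returns true if the wildcard would still be expanded into its terms,
    // false if a cutoff is hit and the terms are not expanded.
    static boolean expandsToBooleanQuery(int[] docFreqs, int maxDoc,
                                         double docCountPercent, int termCountCutoff) {
        // The percentage is applied to the total number of docs in the index.
        int docCountCutoff = (int) ((docCountPercent / 100.0) * maxDoc);
        int termsSeen = 0;
        int docsVisited = 0;
        for (int docFreq : docFreqs) {
            termsSeen++;
            docsVisited += docFreq;
            if (termsSeen >= termCountCutoff || docsVisited >= docCountCutoff) {
                return false; // too many terms or docs: no term expansion
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] docFreqs = {1, 1, 4, 6, 6}; // hypothetical per-term doc frequencies
        int maxDoc = 1000;                // hypothetical index size

        // Default-like cutoffs (0.1%, 350 terms): the doc cutoff is hit at once.
        System.out.println(expandsToBooleanQuery(docFreqs, maxDoc, 0.1, 350));   // prints false
        // Custom cutoffs (100%, 350 terms): the terms are expanded.
        System.out.println(expandsToBooleanQuery(docFreqs, maxDoc, 100.0, 350)); // prints true
    }
}
```

Under these assumed numbers, a 0.1% cutoff stops expansion as soon as a single document has been visited, while a 100% cutoff leaves room for the wildcard to expand.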

comment:2 by ak19, 12 years ago

The code change is to be made to the runQuery() method in

common-src/indexers/lucene-gs/src/org/greenstone/LuceneWrapper3/GS2LuceneQuery.java

comment:3 by ak19, 12 years ago

Milestone: 3.05 Release

comment:4 by ak19, 12 years ago

...

Query query = parseQuery(reader, query_parser, query_string, fuzziness);

if(query instanceof MultiTermQuery) {

// debug display of existing cutoff values to work out optimum values

MultiTermQuery.ConstantScoreAutoRewrite oldRewriteMethod = (MultiTermQuery.ConstantScoreAutoRewrite)MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT;

// display oldRewriteMethod.getDocCountPercent()+"/"+oldRewriteMethod.getTermCountCutoff()

// default docCountPercent=0.1; default termCountCutoff=350

// Creating custom cutoff values, taking account of existing cutoff values

MultiTermQuery.ConstantScoreAutoRewrite customRewriteMethod = new MultiTermQuery.ConstantScoreAutoRewrite();
customRewriteMethod.setDocCountPercent(100.0); // instead of MultiTermQuery.ConstantScoreAutoRewrite.DEFAULT_DOC_COUNT_PERCENT
customRewriteMethod.setTermCountCutoff(350);

MultiTermQuery multiTermQuery = (MultiTermQuery)query;
multiTermQuery.setRewriteMethod(customRewriteMethod); // alternatives: MultiTermQuery.CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE, MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE

}

query = query.rewrite(reader);

...

comment:5 by ak19, 12 years ago

As per https://issues.apache.org/jira/browse/LUCENE-1557 this change to the lucene core classes happened after lucene v2.4.1, when MultiTermQuery ceased to be rewritten to BooleanQuery by default.

comment:6 by ak19, 12 years ago

Owner: changed from nobody to ak19

The bug has been solved as follows, by changing the rewriteMethod. See http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/MultiTermQuery.html

We try, in order:

  1. RewriteMethod set to BooleanQuery, to get it working as in GS2, which uses lucene-2.3.2. It will expand wildcard searches to their terms when searching at both section AND doc level.

  2. If that throws a TooManyClauses exception (as when searching for "a*" over the lucene demo collection), then try a custom rewriteMethod which sets termCountCutoff=350 and docCountPercent cutoff=0.1%.

  3. If that throws a TooManyClauses exception (which could perhaps happen if the collection has a huge number of docs), then try the default apache rewriteMethod with its optimum defaults of termCountCutoff=350 and docCountPercent cutoff=0.1%.
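The try-in-order approach above can be sketched as a generic fallback chain. This is an illustration only, not the code committed to GS2LuceneQuery: the class RewriteFallback, the method firstThatSucceeds, and the nested TooManyClauses class (a stand-in for lucene's BooleanQuery.TooManyClauses) are all hypothetical, and the suppliers merely simulate the three rewrite attempts.

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative fallback chain; does not use the lucene library.
class RewriteFallback {

    // Hypothetical stand-in for lucene's BooleanQuery.TooManyClauses.
    static class TooManyClauses extends RuntimeException {}

    // Runs each attempt in order and returns the first result that does not
    // throw TooManyClauses. If every attempt fails, the last exception is rethrown.
    static <T> T firstThatSucceeds(List<Supplier<T>> attempts) {
        TooManyClauses last = null;
        for (Supplier<T> attempt : attempts) {
            try {
                return attempt.get();
            } catch (TooManyClauses e) {
                last = e; // fall through to the next, more conservative attempt
            }
        }
        if (last == null) {
            throw new IllegalArgumentException("no attempts given");
        }
        throw last;
    }

    public static void main(String[] args) {
        List<Supplier<String>> attempts = List.of(
            () -> { throw new TooManyClauses(); }, // 1. BooleanQuery rewrite (simulated failure)
            () -> "custom auto rewrite",           // 2. custom cutoff values
            () -> "default auto rewrite");         // 3. lucene defaults
        System.out.println(firstThatSucceeds(attempts)); // prints custom auto rewrite
    }
}
```

Ordering the attempts from most to least informative means term information is shown whenever the index is small enough to allow it, while larger queries still return results.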

comment:7 by ak19, 12 years ago

Resolution: fixed
Status: new → closed

comment:8 by ak19, 12 years ago

Preliminary fix to GS2LuceneQuery was in http://trac.greenstone.org/changeset/26155

Final and more comprehensive fix (described above) is in the commit to GS2LuceneQuery at http://trac.greenstone.org/changeset/26157

comment:9 by ak19, 12 years ago

Note:

Attempt 2, used if setting the RewriteMethod to BooleanQuery fails, is a custom rewriteMethod which sets termCountCutoff=350 and docCountPercent cutoff=0.1%. This does not produce term information when performing a wildcard search at document level, only at section level.

Attempt 3, if the above fails, is to use the default lucene rewritemethod, with its optimised defaults of termCountCutoff=350 and docCountPercent cutoff=0.1%, which doesn't show up any term information in the search results for "econom*" in the lucene demo collection, whether this is at doc level or section level.

Therefore attempt 1 (BooleanQuery) is the ideal, but if that doesn't work, attempts 2 and then 3 are tried in turn, to at least produce accurate search results even if term information is not always produced in their case.

comment:10 by ak19, 6 years ago

http://trac.greenstone.org/changeset/32506 - http://trac.greenstone.org/changeset/32509

(All changes to http://trac.greenstone.org/browser/main/trunk/greenstone2/common-src/indexers/lucene-gs/src/org/greenstone/LuceneWrapper4/GS2LuceneQuery.java)

Bugfix to bug that Kathy discovered in code I committed:

With the upgrade to lucene 4, wildcard searches would work, e.g. season*. But boolean searches that combine wildcard search terms with regular terms or with other wildcard terms didn't work. If a query was a BooleanQuery, it would not expand any wildcard search terms it contained, despite BooleanQuery otherwise recursively doing a rewrite as per its source code.

The solution was to recursively rewrite the query ourselves, handling MultiTermQuery clauses within a BooleanQuery in addition to the existing handling of standalone MultiTermQuerys (which can be of type WildcardQuery and PrefixQuery, though they get wrapped in ConstantScoreQuery objects). I've moved the existing code that deals with MultiTermQuerys into the new recursive function, which now takes the further step of recursively rewriting BooleanQuerys to preserve and expand the MultiTermQuery objects they contain.
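The shape of that recursion can be sketched with a tiny stand-in query tree. This is not the GS2LuceneQuery code and does not use lucene: the types Term, Prefix (standing in for a MultiTermQuery such as season*) and Bool (standing in for BooleanQuery) are hypothetical, as is the list of index terms, and the sketch ignores details such as MUST/SHOULD occurrence flags on boolean clauses.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative recursive expansion over a stand-in query tree; NOT lucene code.
class RecursiveExpandSketch {
    interface Query {}

    static final class Term implements Query {
        final String text;
        Term(String text) { this.text = text; }
    }

    // Stand-in for a MultiTermQuery such as the wildcard "season*".
    static final class Prefix implements Query {
        final String prefix;
        Prefix(String prefix) { this.prefix = prefix; }
    }

    // Stand-in for a BooleanQuery holding nested clauses.
    static final class Bool implements Query {
        final List<Query> clauses;
        Bool(List<Query> clauses) { this.clauses = clauses; }
    }

    // Expands wildcards wherever they occur, including inside boolean clauses.
    static Query expand(Query query, List<String> indexTerms) {
        if (query instanceof Prefix) {
            List<Query> matches = new ArrayList<>();
            for (String term : indexTerms) {
                if (term.startsWith(((Prefix) query).prefix)) {
                    matches.add(new Term(term));
                }
            }
            return new Bool(matches);
        }
        if (query instanceof Bool) {
            List<Query> rewritten = new ArrayList<>();
            for (Query clause : ((Bool) query).clauses) {
                rewritten.add(expand(clause, indexTerms)); // the recursive step
            }
            return new Bool(rewritten);
        }
        return query; // plain terms are left as they are
    }

    // Collects the plain terms of a (rewritten) query, in order.
    static List<String> terms(Query query) {
        List<String> out = new ArrayList<>();
        collect(query, out);
        return out;
    }

    private static void collect(Query query, List<String> out) {
        if (query instanceof Term) {
            out.add(((Term) query).text);
        } else if (query instanceof Bool) {
            for (Query clause : ((Bool) query).clauses) {
                collect(clause, out);
            }
        }
    }

    public static void main(String[] args) {
        List<String> indexTerms = List.of("farm", "sea", "season", "seasons"); // hypothetical index
        Query parsed = new Bool(List.of(new Prefix("season"), new Term("farm")));
        System.out.println(terms(expand(parsed, indexTerms))); // prints [season, seasons, farm]
    }
}
```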


To recompile after any changes to gs2build/common-src/indexers/lucene-gs/src/org/greenstone/LuceneWrapper4/GS2LuceneQuery.java:

  1. GS3/gs2build/common-src/indexers/lucene-gs>make all
  2. GS3/gs2build/common-src/indexers/lucene-gs>cp LuceneWrapper4.jar /Scratch/ak19/gs3-svn-13Sep2018/web/WEB-INF/lib/.

To test the changes, you can run a lucene query on a GS lucene collection directly (following Kathy's instructions):

  1. source gs3-setup
  2. run the following from a lucene collection folder:

java org.greenstone.LuceneWrapper4.GS2LuceneQuery ./index/sidx/

  3. The "prompt" will wait for you to type search terms, e.g.

season* farm

  4. Ctrl-C to quit. (Not sure if there's a better way to quit it.)