Ticket #666 (closed enhancement: fixed)

Opened 9 years ago

Last modified 4 years ago

can we add stemming and plurals to lucene? Spanish. Snowball.

Reported by: kjdon Owned by: nobody
Priority: moderate Milestone: 3.06 Release
Component: Collection Building Severity: enhancement
Keywords: Cc:

Description

Requested by Diego. Nov 2008

Apparently this can be done using SnowballAnalyzer.

Change History

Changed 6 years ago by ak19

  • milestone changed from Collection building wishlist to 3.06 Release

Email Oct 2013 from Diego:

I want to comment on some things about Lucene. I was reviewing old pending emails from the Spanish list and found many of them asking for stemming in Lucene. As Kathy told me before, Lucene has no stemming process, and stemming can be replaced with "*" at the end of the word. This workaround does not always work. In English you have "fish" and "fishes", so "fish*" is enough. In Spanish we have "pez" and "peces", and "pe*" is not a good query string.

Digging a little, I read this information:

 http://lucene.apache.org/core/3_6_2/api/all/org/apache/lucene/analysis/snowball/SnowballFilter.html

And these are the available stemmers:

 http://lucene.apache.org/core/3_6_2/api/all/org/tartarus/snowball/ext/package-summary.html

Could this be a solution? Perhaps it can be implemented as an extension?
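
The wildcard problem described in the email above can be illustrated with a toy sketch. This is not Lucene code: it only uses plain prefix matching over a made-up word list (the vocabulary and both function names are assumptions for illustration) to show why the longest prefix shared by "pez" and "peces" is too short to be a useful query.

```python
# Toy illustration only: plain prefix matching over a made-up word list,
# standing in for a trailing-wildcard query such as "fish*".
VOCABULARY = ["fish", "fishes", "fishing", "pez", "peces", "pelota", "perro", "pera"]


def wildcard_matches(prefix, vocabulary):
    """Return the words that a trailing-wildcard query on `prefix` would match."""
    return [word for word in vocabulary if word.startswith(prefix)]


def common_prefix(a, b):
    """Longest prefix shared by two word forms: the longest prefix a single
    trailing-wildcard query can use and still match both forms."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return a[:length]


# English: singular and plural share the whole singular form.
english = wildcard_matches(common_prefix("fish", "fishes"), VOCABULARY)

# Spanish: "pez" and "peces" only share "pe", which also matches unrelated words.
spanish = wildcard_matches(common_prefix("pez", "peces"), VOCABULARY)

print(english)
print(spanish)
```

In this sketch the English query only returns words built on "fish", while the Spanish query also returns "pelota", "perro" and "pera". A stemmer avoids this by normalising both forms to the same index term at indexing and query time instead of relying on a shared prefix.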

Changed 6 years ago by ak19

  • summary changed from can we add stemming and plurals to lucene? to can we add stemming and plurals to lucene? Spanish. Snowball.

Changed 6 years ago by ak19

More information from Diego:

I think that Snowball applies to Solr too. Look at this:

 https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=32604229

It says:

Word stemming is, obviously, very language specific. Solr includes several language-specific stemmers created by the Snowball generator that are based on the Porter stemming algorithm. The generic Snowball Porter Stemmer Filter can be used to configure any of these language stemmers. Solr also includes a convenience wrapper for the English Snowball stemmer. There are also several purpose-built stemmers for non-English languages. These stemmers are described in Language Analysis.

And here:

 https://cwiki.apache.org/confluence/display/solr/Language+Analysis#LanguageAnalysis-Spanish

you have the Spanish stemmers.

Here is the Solr implementation of Snowball:

 http://wiki.apache.org/solr/LanguageAnalysis#Notes_about_solr.SnowballPorterFilterFactory

Changed 4 years ago by ak19

The Snowball analyzer is now available to use in GS3.06 since the upgrade to solr 4.7.2, see http://trac.greenstone.org/ticket/885

A related ticket is http://trac.greenstone.org/ticket/872, which mentions the Japanese analyzer Kuromoji, also available since lucene/solr 3.6.

The relevant commits for the lucene and solr update from 3.3.0 to 4.7.2 are the commit revisions between 29133 of 16.07.2014 and 29228 of 21.08.2014, and a further commit (important fix) at http://trac.greenstone.org/changeset/29355 from 08.10.2014

Diego has tested it and found that the Snowball analyzer has issues with Spanish, and that the Hunspell analyzer, which also works with solr 4.7.2, may be better suited to Spanish and some other languages. See http://wiki.greenstone.org/doku.php?id=en:release:3.06_release_notes#greenstone_3_runtimechanges_since_305 and the links there.

Changed 4 years ago by ak19

  • status changed from new to closed
  • resolution set to fixed