Opened 14 years ago
Closed 9 years ago
#666 closed enhancement (fixed)
can we add stemming and plurals to lucene? Spanish. Snowball.
Reported by: | kjdon | Owned by: | nobody |
---|---|---|---|
Priority: | moderate | Milestone: | 3.06 Release |
Component: | Collection Building | Severity: | enhancement |
Keywords: | Cc: |
Description
Requested by Diego. Nov 2008
Can apparently be done using SnowballAnalyzer??
Change History (5)
comment:1 by , 11 years ago
Milestone: | Collection building wishlist → 3.06 Release |
---|
comment:2 by , 11 years ago
Summary: | can we add stemming and plurals to lucene? → can we add stemming and plurals to lucene? Spanish. Snowball. |
---|
comment:3 by , 11 years ago
More information from Diego:
I think that Snowball applies to Solr too. Look at this:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=32604229
It says:
Word stemming is, obviously, very language specific. Solr includes several language-specific stemmers created by the Snowball generator that are based on the Porter stemming algorithm. The generic Snowball Porter Stemmer Filter can be used to configure any of these language stemmers. Solr also includes a convenience wrapper for the English Snowball stemmer. There are also several purpose-built stemmers for non-English languages. These stemmers are described in Language Analysis.
And here:
https://cwiki.apache.org/confluence/display/solr/Language+Analysis#LanguageAnalysis-Spanish
you have the Spanish stemmers.
Here is the Solr implementation of Snowball:
http://wiki.apache.org/solr/LanguageAnalysis#Notes_about_solr.SnowballPorterFilterFactory
comment:4 by , 9 years ago
The Snowball analyzer is now available to use in GS3.06 since the upgrade to solr 4.7.2, see http://trac.greenstone.org/ticket/885
A related ticket is http://trac.greenstone.org/ticket/872, which mentions the Japanese analyzer Kuromoji, also available since lucene/solr 3.6.
The relevant commits for the lucene and solr update from 3.3.0 to 4.7.2 are the commit revisions between 29133 of 16.07.2014 and 29228 of 21.08.2014, and a further commit (important fix) at http://trac.greenstone.org/changeset/29355 from 08.10.2014
Diego has tested it and found that Snowball Analyzer has issues for Spanish, and that the Hunspell analyzer, which also works with solr 4.7.2, may be better suited for Spanish and some other languages. See http://wiki.greenstone.org/doku.php?id=en:release:3.06_release_notes#greenstone_3_runtimechanges_since_305 and the links there.
comment:5 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Email Oct 2013 from Diego:
I want to comment on a few things about Lucene. I was reviewing old pending emails from the Spanish list and I found many of them asking for stemming in Lucene. As Kathy told me before, Lucene has no stemming process, and that can be replaced with "*" at the word ending. This workaround does not always work. For example, in English you have "fish" and "fishes", so "fish*" is enough. In Spanish we have "pez" and "peces", and "pe*" is not a good query string.
Digging a little, I read this information:
http://lucene.apache.org/core/3_6_2/api/all/org/apache/lucene/analysis/snowball/SnowballFilter.html
And these are the available stemmers:
http://lucene.apache.org/core/3_6_2/api/all/org/tartarus/snowball/ext/package-summary.html
Could it be a solution? Perhaps it can be implemented as an extension?
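
To illustrate the wildcard problem described in the email, here is a minimal Python sketch. It does not use Lucene or Snowball; it only computes the longest prefix shared by two word forms (the longest "prefix*" query that would match both) and lists what that prefix matches. The vocabulary and the helper names `wildcard_prefix` and `prefix_matches` are hypothetical, chosen for illustration only.

```python
import os.path


def wildcard_prefix(forms):
    """Return the longest prefix shared by all word forms.

    This is the longest string P such that the query "P*" matches every form.
    """
    return os.path.commonprefix(list(forms))


def prefix_matches(prefix, vocabulary):
    """Return, sorted, every vocabulary word that the query "prefix*" would match."""
    return sorted(word for word in vocabulary if word.startswith(prefix))


# Hypothetical index vocabulary, assumed for illustration only.
VOCABULARY = {"fish", "fishes", "pez", "peces", "pelo", "perro", "pera"}

english = wildcard_prefix(["fish", "fishes"])
spanish = wildcard_prefix(["pez", "peces"])

print(english, prefix_matches(english, VOCABULARY))
print(spanish, prefix_matches(spanish, VOCABULARY))
```

In this toy vocabulary, "fish*" matches only the two intended forms, while "pe*" also pulls in unrelated words such as "pelo", "pera" and "perro". That over-matching is the reason a stemmer, rather than a trailing wildcard, is being requested here.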