Opened 10 years ago
Closed 9 years ago
#872 closed enhancement (fixed)
Japanese Analyzer
Reported by: | ak19 | Owned by: | nobody |
---|---|---|---|
Priority: | high | Milestone: | 3.06 Release |
Component: | Greenstone2&3 | Severity: | major |
Keywords: | Cc: |
Description
Refer to the email exchanges on the mailing list with Mr Gaku Yamaguchi.
Lucene/Solr versions 3.6 and 4.0 provide proper support for searching Japanese text.
- This ticket is a reminder to assess whether we can upgrade the version of Lucene used in GS2 to 3.6, so that the Japanese Analyzer, which builds on the Japanese indexer/tokenizer work at atilika.org, can be incorporated.
- For the GS3 release, investigate upgrading to Solr 3.6 solely to include the Japanese language support, so that text_ja can be added as a setting that makes use of the work at atilika.org.
Change History (4)
comment:1 by , 10 years ago
comment:2 by , 10 years ago
On 14 Oct 2013, Mr Yamaguchi wrote to the list:
... the full text indexing is not available in Japanese version (in GS286 rc2).
At present, one of the ways to solve this problem is to integrate Apache Lucene/Solr version 3.6 with Greenstone.
Lucene/Solr version 3.6 is integrated with full text search engine for Japanese named kuromoji, therefore Greenstone can handle Japanese full text indexing if it is integrated with Lucene/Solr with kuromoji. Unfortunately, I do not have skills in developing Lucene/Solr. If you are interested in this point, please access http://www.atilika.org/ .
comment:3 by , 9 years ago
The Kuromoji analyzer has been available in GS3.06 since the upgrade to Solr 4.7.2; see http://trac.greenstone.org/ticket/885
A related ticket is http://trac.greenstone.org/ticket/666 which mentions the Snowball analyzer.
The relevant commits for the Lucene and Solr upgrade from 3.3.0 to 4.7.2 are the revisions between 29133 (16.07.2014) and 29228 (21.08.2014), plus a further commit (an important fix) at http://trac.greenstone.org/changeset/29355
comment:4 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
From Mr Yamaguchi's email of 2 Dec 2011:
I had tried your files and found that it is written in Japanese and Japanese can be displayed, however, it was confirmed by scholars at Ritsumeikan University in Kyoto, Japan before (it is so regret that their articles are written in Japanese so I cannot show you them) that even Japanese can be displayed in Greenstone, if Japanese analyzer is not integrated in Lucene, Japanese indexing is not worked well because ordinarily in Japanese sentences there is no segmentation.
For instance, ぼくはにほんじんです。(I am a Japanese) in this sentence, there is no segmentation. About this sentence, on original Greenstone indexing, this sentence would be segmented by a character one by one, ぼ く は に ほ ん じ ん で す. However, if it is not segmented by words one by one, such as ぼく(I)/ は/にほんじん(Japanese)/ です(be), it cannot be searched correctly by Japanese people. As Japanese analyzer has its own Japanese dictionary, it is possible to segment Japanese sentences to a word one by one correctly so it is possible to search indexes by words. Therefore I am planning to integrate Lucene in which Japanese analyzer is included with Greenstone and I contacted a computer developer which mainly develops indexing engines for Japanese applications using Lucene/Solr with Japanese analyzer.
Its manager says that they should know in what part of Greenstone source codes the way to integrate Lucene with Greenstone is written, if they can confirm these codes, they are able to test on integrating Lucene with Japanese analyzer into Greenstone so I have questioned you this time.
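To make the segmentation point in the email above concrete, here is a minimal Python sketch that contrasts per-character tokens with dictionary-based word tokens for the same sentence. It is only an illustration under stated assumptions: the tiny dictionary and the function names (`char_tokens`, `word_tokens`, `contains_phrase`) are hypothetical, and the greedy longest-match strategy is a stand-in. It is not how Kuromoji or Greenstone actually tokenize.

```python
# Toy illustration only: greedy longest-match segmentation over a tiny,
# hand-written dictionary. This is NOT Kuromoji's algorithm; a real Japanese
# analyzer uses a full dictionary and a more sophisticated search.

SENTENCE = "ぼくはにほんじんです"

# Hypothetical mini-dictionary covering just the example sentence.
TOY_DICTIONARY = {"ぼく", "は", "にほんじん", "です"}


def char_tokens(text):
    """Split text into single characters, as in per-character indexing."""
    return list(text)


def word_tokens(text, dictionary):
    """Greedy longest-match segmentation.

    Characters that start no dictionary word are emitted as single tokens.
    """
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def contains_phrase(tokens, query_tokens):
    """Return True if query_tokens occur consecutively within tokens."""
    n = len(query_tokens)
    return any(tokens[k:k + n] == query_tokens for k in range(len(tokens) - n + 1))


if __name__ == "__main__":
    print(char_tokens(SENTENCE))
    print(word_tokens(SENTENCE, TOY_DICTIONARY))
```

With per-character tokens, a query for ほん ("book") matches the sentence, because ほ and ん happen to be adjacent inside にほんじん. With word tokens the same query does not match, while a query for にほんじん does. This is the kind of false hit that word-level segmentation avoids.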
Note how the following test example I described in my earlier email to him (15 Nov 2011) was simplistic, and not realistic in terms of how Japanese language searches naturally work: