Ticket #872 (closed enhancement: fixed)

Opened 4 years ago

Last modified 3 years ago

Japanese Analyzer

Reported by: ak19 Owned by: nobody
Priority: high Milestone: 3.06 Release
Component: Greenstone2&3 Severity: major
Keywords: Cc:

Description

Refer to the email exchanges on the mailing list with Mr Gaku Yamaguchi.

Proper support for searching Japanese text is available in Lucene/Solr versions 3.6 and 4.0.

See  http://www.atilika.org/

- This ticket is a reminder to assess whether we can upgrade the version of Lucene used in GS2 to 3.6, so that the Japanese Analyzer can be incorporated into GS2, making use of the Japanese Indexer/Tokenizer work at atilika.org.

- For the GS3 release, investigate using Solr 3.6 for the sole purpose of including Japanese language support, adding text_ja as a setting that makes use of the work at atilika.org.

Change History

Changed 4 years ago by ak19

From Mr Yamaguchi's email from 2 Dec 2011:

I tried your files and found that they are written in Japanese and that Japanese can be displayed. However, scholars at Ritsumeikan University in Kyoto, Japan have previously confirmed (regrettably, their articles are written in Japanese, so I cannot show them to you) that even though Japanese can be displayed in Greenstone, Japanese indexing does not work well unless a Japanese analyzer is integrated into Lucene, because Japanese sentences ordinarily have no segmentation.

For instance, the sentence ぼくはにほんじんです。 (I am Japanese) has no segmentation. With the original Greenstone indexing, this sentence would be segmented character by character: ぼ く は に ほ ん じ ん で す. However, unless it is segmented word by word, such as ぼく (I) / は / にほんじん (Japanese) / です (be), it cannot be searched correctly by Japanese people. Because the Japanese analyzer has its own Japanese dictionary, it can segment Japanese sentences correctly into individual words, which makes it possible to search the index by word.

I am therefore planning to integrate a version of Lucene that includes the Japanese analyzer with Greenstone, and I have contacted a software developer that mainly builds indexing engines for Japanese applications using Lucene/Solr with the Japanese analyzer.

Its manager says that they need to know which part of the Greenstone source code implements the integration of Lucene with Greenstone. If they can examine that code, they will be able to test integrating Lucene with the Japanese analyzer into Greenstone, which is why I am asking you this question.
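The segmentation difference described in the email above can be sketched in a few lines of Python. This is only an illustration: the four-word dictionary and both function names are made up for this example, and greedy longest-match is a simplification swapped in for whatever algorithm a real analyzer such as Kuromoji uses with its full dictionary.

```python
# Hypothetical toy dictionary covering only the example sentence.
# A real Japanese analyzer ships a full lexicon.
TOY_DICTIONARY = {"ぼく", "は", "にほんじん", "です"}


def segment_by_character(text):
    """Split text into single characters, as in per-character indexing."""
    return [ch for ch in text if not ch.isspace() and ch != "。"]


def segment_by_dictionary(text, dictionary):
    """Greedy longest-match segmentation against a word list (simplified)."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Skip whitespace and the sentence-final full stop.
        if text[i] == "。" or text[i].isspace():
            i += 1
            continue
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens


sentence = "ぼくはにほんじんです。"
print(segment_by_character(sentence))
# ['ぼ', 'く', 'は', 'に', 'ほ', 'ん', 'じ', 'ん', 'で', 'す']
print(segment_by_dictionary(sentence, TOY_DICTIONARY))
# ['ぼく', 'は', 'にほんじん', 'です']
```

With word-level tokens in the index, a query for にほんじん matches one indexed term instead of five unrelated characters.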

Note how the following test example I described in my earlier email to him (15 Nov 2011) was simplistic, and not realistic in terms of how Japanese language searches naturally work:

Mr Yamaguchi said: However, there is a problem: indexing Japanese is currently impossible in Greenstone, because the Lucene integrated in Greenstone has no Japanese indexing function.

ak19 responded:

On the Linux machine here, using the new Greenstone 2.85, I just tried creating a Greenstone collection of 3 text documents containing Japanese text that Ms Fuyuki (Dong Xue) has helpfully collected for me from the Japanese Wikipedia. One document contains an excerpt on Kabuki, another on Geisha and the other on Hinamatsuri.

The collection used the Lucene indexer. Once built, I previewed the collection in the browser and in the search form I typed in 歌舞伎 (which Fuyuki tells me means Kabuki), and the search results returned the Kabuki document. It further told me that there were 9 occurrences of the word in the text, and this looks to be correct.

In case you want to try the collection out yourself, I am attaching a zip file containing the collection I built here. You will need to have Greenstone 2.85 installed for this to work for you. You can get version 2.85 from http://www.greenstone.org/download. Once you have 2.85 installed, unzip the attached collection, which is called "Japanese", into your Greenstone 2.85 installation's "collect" subfolder (note that the source documents are in the "japanese/import" subfolder). Then launch the Greenstone 2.85 server and preview the collection. Try searching the text field for 歌舞伎.
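A small Python sketch may help show why a single-term test like this one can succeed without word segmentation, and where it stops being realistic. It assumes the index holds one token per character and that a query matches wherever its characters appear next to each other. That is an assumption made for illustration, not a confirmed description of how Greenstone 2.85 processed the query. The two sample sentences and the function names are invented for this example and are not the Wikipedia excerpts from the collection.

```python
def char_tokens(text):
    """One token per non-whitespace character (assumed indexing behaviour)."""
    return [ch for ch in text if not ch.isspace()]


def count_phrase(doc_text, query):
    """Count positions where the query's characters occur consecutively."""
    doc = char_tokens(doc_text)
    q = char_tokens(query)
    if not q:
        return 0
    return sum(
        1 for i in range(len(doc) - len(q) + 1) if doc[i:i + len(q)] == q
    )


# Hypothetical sample texts, not the documents used in the ticket.
docs = {
    "kabuki": "歌舞伎は日本の伝統芸能です。歌舞伎の舞台。",
    "tokyo": "東京都は日本の首都です。",
}

# A distinctive term is found and counted correctly.
print(count_phrase(docs["kabuki"], "歌舞伎"))

# 京都 (Kyoto) also "matches" inside 東京都 (Tokyo Metropolis),
# a false positive that word-level segmentation would avoid.
print(count_phrase(docs["tokyo"], "京都"))
```

Under these assumptions a distinctive term such as 歌舞伎 is counted correctly, while a query whose characters happen to sit inside a longer word produces a spurious hit. This is the sort of case a single-term test does not exercise.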

Changed 4 years ago by ak19

On 14 Oct 2013, Mr Yamaguchi wrote to the list:

... the full text indexing is not available in Japanese version (in GS286 rc2).

At present, one way to solve this problem is to integrate Apache Lucene/Solr version 3.6 with Greenstone.

Lucene/Solr version 3.6 is integrated with a full-text search engine for Japanese named kuromoji, so Greenstone can handle Japanese full-text indexing if it is integrated with Lucene/Solr with kuromoji. Unfortunately, I do not have skills in developing Lucene/Solr. If you are interested in this point, please access http://www.atilika.org/ .

Changed 3 years ago by ak19

The Kuromoji analyzer has been available in GS3.06 since the upgrade to Solr 4.7.2; see http://trac.greenstone.org/ticket/885

A related ticket is http://trac.greenstone.org/ticket/666 which mentions the Snowball analyzer.

The relevant commits for the Lucene and Solr update from 3.3.0 to 4.7.2 are the revisions between 29133 (16.07.2014) and 29228 (21.08.2014), plus a further commit (an important fix) at http://trac.greenstone.org/changeset/29355

Changed 3 years ago by ak19

  • status changed from new to closed
  • resolution set to fixed