Ticket #342 (closed defect: fixed)

Opened 11 years ago

Last modified 11 years ago

CJK character segmentation

Reported by: kjdon
Owned by: kjdon
Priority: high
Milestone: Next Release (2 or 3)
Component: Collection Building
Severity: enhancement
Keywords: build-overhaul
Cc:

Description

My plugin changes have broken CJK character segmentation; it doesn't work anymore.

TODO:

* Make it work for Japanese and Korean (I think I have done this but not committed yet).
* The option is not available anymore for all plugins. We used to use a global collect.cfg option which was added to all plugins, and also used by runtime. How to do this now?
* Add the option to all plugins that have text, not just ReadTextFile ones.

Change History

Changed 11 years ago by kjdon

  • status changed from new to assigned

Changed 11 years ago by kjdon

The option is now part of AutoExtractMetadata (which needs to be renamed).

It works for Chinese, Japanese and Korean.

Just the config file and GLI issue left to do.

Changed 11 years ago by kjdon

  • status changed from assigned to closed
  • resolution set to fixed

separate_cjk is now an indexoption (along with stem, case and accentfold).

The text is segmented before going to the indexer.
