Ticket #681 (new defect)

Opened 9 years ago

mgpp word separator

Reported by: kjdon
Owned by: nobody
Priority: moderate
Milestone: Collection building wishlist
Component: Collection Building
Severity: major
Keywords:
Cc:

Description

As reported on the mailing list, 12-4-2010:

Where can I define my own "word separator character", or exclude certain characters from the word separator functions in Greenstone?

It seems that my Greenstone collection treats some special Unicode control characters as spaces. For example, according to the Unicode standard, Mongolian text has four special control characters that change the shapes (glyphs) of letters: 1. Free Variation Selector One (FVS1) (U+180B), 2. Free Variation Selector Two (FVS2) (U+180C), 3. Free Variation Selector Three (FVS3) (U+180D) and 4. Mongolian Vowel Separator (MVS) (U+180E). These control characters must be treated as part of the word, whether they occur at the beginning, in the middle, or at the end of it. For example, abc'MVS'defg is a single word, not the two words 'abc' and 'defg'. I have failed to retrieve such words in Greenstone: it retrieves Mongolian words containing control characters as two or more separate words (several control characters can be used in a single word).
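
The behaviour described above can be reproduced outside Greenstone with a small Python sketch. This is only an illustration of the requested word-breaking rule, not mgpp's actual code; the names MONGOLIAN_JOINERS, NAIVE_WORD, JOINER_AWARE_WORD and tokenize are hypothetical. It assumes a tokenizer that keeps runs of word characters, and shows how adding U+180B to U+180E to the set of word characters keeps the reporter's example as one word.

```python
import re

# The four Mongolian control characters named in the report:
# Free Variation Selectors One to Three (U+180B..U+180D) and the
# Mongolian Vowel Separator (U+180E).
MONGOLIAN_JOINERS = "\u180B\u180C\u180D\u180E"

# Naive rule: a word is a run of letters, digits and underscores only.
NAIVE_WORD = re.compile(r"\w+")

# Requested rule (assumption): the four characters above also count as
# word characters, so they never end a word.
JOINER_AWARE_WORD = re.compile(r"[\w" + MONGOLIAN_JOINERS + r"]+")


def tokenize(text, pattern):
    """Return the list of words found in text by the given pattern."""
    return pattern.findall(text)


# The reporter's example: 'abc', the vowel separator, then 'defg'.
sample = "abc\u180Edefg hij"

naive_tokens = tokenize(sample, NAIVE_WORD)
aware_tokens = tokenize(sample, JOINER_AWARE_WORD)

print(len(naive_tokens))  # the naive rule splits the first word in two
print(len(aware_tokens))  # the joiner-aware rule keeps it whole
```

With the naive rule the sample yields three tokens ('abc', 'defg', 'hij'); with the joiner-aware rule it yields two, and the first token still contains U+180E. For search to work, the same rule would have to be applied both at indexing time and when parsing queries.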
