Changeset 16980 for gsdl/trunk/perllib


Timestamp:
2008-08-25T09:58:13+12:00
Author:
kjdon
Message:

CJK character segmentation. text_t chars are not big enough to hold code points > 0xffff, so the ranges above that value have been commented out in both the C++ and the Perl code until we implement a better solution. These high ranges only cover extension sets anyway, so most common words will still be segmented.
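
The size limit mentioned in the message can be illustrated with a small sketch. It assumes that a character slot is 16 bits wide and that larger values are truncated by masking; the actual behaviour of text_t may differ. `to_16bit` is a hypothetical helper, not part of Greenstone.

```perl
use strict;
use warnings;

# Assumption: a character slot holds 16 bits and anything above that
# is simply masked off. This only mimics the limitation, it is not
# the text_t implementation.
sub to_16bit {
    my ($c) = @_;
    return $c & 0xffff;
}

# Boundaries of the two ranges that the changeset comments out.
foreach my $c (0x20000, 0x2a6d6, 0x2f800, 0x2fa1d) {
    printf("0x%05x -> 0x%04x\n", $c, to_16bit($c));
}
```

Under that assumption every boundary value comes back as a different, smaller number, so a range test against 0x20000 or 0x2f800 could never match on the C++ side.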

File:
1 edited

  • gsdl/trunk/perllib/cnseg.pm

    r16641 r16980
    @@ -55,7 +55,11 @@
         my $space = 1; # start doesn't need a space
         foreach $c (@$uniin) {
    -        if (($c >= 0x2e80 && $c <= 0xfa6a) || # main east asian codes
    -            ($c >= 0x20000 && $c <= 0x2a6d6) || # cjk unified ideographs ext B
    -            ($c >= 0x2f800 && $c <= 0x2fa1d)) { #cjk compatibility ideographs supplement
    +        if (($c >= 0x2e80 && $c <= 0xd7a3) ||
    +            ( $c >= 0xf900 && $c <= 0xfa6a)) { # main east asian codes
    +            # currently c++ receptionist code can't handle these large numbers
    +            # search terms need to be segmented the same way. Add these back
    +            # in when fix up c++
    +            # ($c >= 0x20000 && $c <= 0x2a6d6) || # cjk unified ideographs ext B
    +            # ($c >= 0x2f800 && $c <= 0x2fa1d)) { #cjk compatibility ideographs supplement
                 # CJK character
                 push (@$out, 0x200b) unless $space;
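
For readers without the full file, the loop touched by this change can be sketched as a standalone script. The range test and the first `push` are taken from the diff above; everything after that `push`, including the separator added after each CJK character and the `else` branch, is an assumption about the rest of the loop. `is_cjk` and `segment_codepoints` are hypothetical names, not functions in cnseg.pm.

```perl
use strict;
use warnings;

# Hypothetical helper: true for the ranges r16980 leaves enabled.
# Both ranges fit in 16 bits.
sub is_cjk {
    my ($c) = @_;
    return (($c >= 0x2e80 && $c <= 0xd7a3) ||
            ($c >= 0xf900 && $c <= 0xfa6a)) ? 1 : 0;
}

# Hypothetical stand-in for the loop in cnseg.pm. Takes and returns a
# reference to an array of code points.
sub segment_codepoints {
    my ($uniin) = @_;
    my $out = [];
    my $space = 1; # start doesn't need a space
    foreach my $c (@$uniin) {
        if (is_cjk($c)) {
            # CJK character: separate it from the preceding text
            push(@$out, 0x200b) unless $space;
            push(@$out, $c);
            # Assumption: a zero-width space also follows the character
            push(@$out, 0x200b);
            $space = 1;
        } else {
            push(@$out, $c);
            $space = 0;
        }
    }
    return $out;
}

# "a", U+4E2D, U+6587, "b"
my $out = segment_codepoints([0x61, 0x4e2d, 0x6587, 0x62]);
print join(' ', map { sprintf('%04x', $_) } @$out), "\n";
```

With the extension ranges commented out, a code point such as 0x20000 passes through this sketch unchanged, which matches the trade-off described in the commit message: extension-set characters are left unsegmented until the C++ side can represent them.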