perllib

Timestamp:

2008-08-25T09:58:13+12:00

Author:

kjdon

Message:

CJK character segmentation. text_t chars are not big enough to hold code points > 0xffff, so the ranges above that limit have been commented out in both the C++ and the Perl code until we implement a better solution. These high ranges are only for extension sets anyway, so most common words will still be segmented.

File:

1 edited

gsdl/trunk/perllib/cnseg.pm (modified) (1 diff)

gsdl/trunk/perllib/cnseg.pm

-              r16641
+              r16980
     my $space = 1; # start doesn't need a space
     foreach $c (@$uniin) {
-    if (($c >= 0x2e80 && $c <= 0xfa6a) || # main east asian codes
-        ($c >= 0x20000 && $c <= 0x2a6d6) || # cjk unified ideographs ext B
-        ($c >= 0x2f800 && $c <= 0x2fa1d)) { #cjk compatibility ideographs supplement
+    if (($c >= 0x2e80 && $c <= 0xd7a3) ||
+        ( $c >= 0xf900 && $c <= 0xfa6a)) { # main east asian codes
+        # currently c++ receptionist code can't handle these large numbers
+        # search terms need to be segmented the same way. Add these back
+        # in when fix up c++
+       # ($c >= 0x20000 && $c <= 0x2a6d6) || # cjk unified ideographs ext B
+       # ($c >= 0x2f800 && $c <= 0x2fa1d)) { #cjk compatibility ideographs supplement
         # CJK character
         push (@$out, 0x200b) unless $space;
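
For reference, the range test introduced in r16980 can be sketched as a standalone predicate. This is only an illustration, not the code in cnseg.pm: the names `is_cjk_codepoint` and `segment_codepoints` are hypothetical, and the loop body beyond the `push (@$out, 0x200b) unless $space;` line shown in the diff is an assumption about how the segmenter continues. Note that, besides dropping the ranges above 0xffff, the new condition also splits the main range in two, so 0xd7a4 to 0xf8ff is no longer matched.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of cnseg.pm): returns 1 if the code point
# lies in one of the two ranges that r16980 still treats as CJK.
sub is_cjk_codepoint {
    my ($c) = @_;
    return (($c >= 0x2e80 && $c <= 0xd7a3) ||
            ($c >= 0xf900 && $c <= 0xfa6a)) ? 1 : 0;
}

# Assumed loop shape: wrap each CJK code point in zero-width spaces
# (U+200B), skipping the leading one when a separator was just emitted.
sub segment_codepoints {
    my ($uniin) = @_;
    my @out;
    my $space = 1;    # start doesn't need a space
    foreach my $c (@$uniin) {
        if (is_cjk_codepoint($c)) {
            push(@out, 0x200b) unless $space;
            push(@out, $c, 0x200b);
            $space = 1;
        } else {
            # assumption: other code points pass through unchanged
            push(@out, $c);
            $space = 0;
        }
    }
    return \@out;
}

# Code points above 0xffff (for example 0x20000, the start of CJK
# Extension B) fall outside both ranges and are passed through unsegmented.
my $out = segment_codepoints([0x41, 0x4e2d, 0x6587, 0x20000]);
print join(' ', map { sprintf('%04x', $_) } @$out), "\n";
```

Keeping the test in one predicate would make it easier to keep the Perl and C++ sides in step, which matters here because search terms have to be segmented the same way as the indexed text.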


Changeset 16980 for gsdl/trunk/perllib
