cnseg.pm

Timestamp:

2012-06-14T11:03:14+12:00

Author:

kjdon

Message:

The segmentation code assumed its input strings were UTF-8 encoded, but we have changed to using Unicode-aware strings, so no conversion is needed.

File:

-              r16980
+              r25788
     my ($in) = @_;
     my ($c);
+    my ($cl);
+    my $len = length($in);
+    my $i = 0;
+    my $out = "";
+    my $space = 1; # start doesn't need a space
+    while ($i < $len) {
+        $c = substr ($in, $i, 1);
+        $cl = ord($c);
+        if (($cl >= 0x2e80 && $cl <= 0xd7a3) ||
+            ( $cl >= 0xf900 && $cl <= 0xfa6a)) { # main east asian codes
+            # currently c++ receptionist code can't handle these large numbers
+            # search terms need to be segmented the same way. Add these back
+            # in when fix up c++
+            # ($cl >= 0x20000 && $cl <= 0x2a6d6) || # cjk unified ideographs ext B
+            # ($cl >= 0x2f800 && $cl <= 0x2fa1d)) { #cjk compatibility ideographs supplement
+            # CJK character
+            $out .= chr(0x200b) unless $space;
+            $out .= $c;
+            $out .= chr(0x200b);
+            $space = 1;
+        } else {
+            $out .= $c;
+            $space = 0;
+        }
+        $i++;
+    }
+    return $out;
+}
+sub segment_old {
+    my ($in) = @_;
+    my ($c);
     my $uniin = &unicode::utf82unicode($in);
     my $out = [];
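The behaviour of the new code above can be sketched as a standalone script. This is a minimal illustration, not the module itself: `segment_sketch` is a hypothetical name, and the sample strings are assumptions chosen for the example. It shows that on a Unicode-aware (character) string, `ord` and `length` work on code points directly, whereas a UTF-8 byte string would first need decoding (here with the core `Encode` module).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the new segment(); mirrors the logic in the diff.
# Expects a character string, not UTF-8 encoded bytes.
sub segment_sketch {
    my ($in) = @_;
    my $out   = "";
    my $space = 1;    # no separator needed at the very start
    foreach my $c (split //, $in) {
        my $cl = ord($c);
        if (($cl >= 0x2e80 && $cl <= 0xd7a3) ||
            ($cl >= 0xf900 && $cl <= 0xfa6a)) {
            # CJK character: surround it with zero-width spaces (U+200B),
            # skipping the leading one if a separator was just written.
            $out .= chr(0x200b) unless $space;
            $out .= $c . chr(0x200b);
            $space = 1;
        } else {
            $out .= $c;
            $space = 0;
        }
    }
    return $out;
}

# "ab" followed by U+4E2D and U+6587, built as a character string.
my $text      = "ab" . chr(0x4e2d) . chr(0x6587);
my $segmented = segment_sketch($text);

# Contrast: the same first ideograph as raw UTF-8 bytes is three "characters"
# until it is decoded into a character string.
my $bytes = "\xe4\xb8\xad";
my $chars = decode('UTF-8', $bytes);
```

With these inputs, each ideograph ends up followed by a zero-width space, and only one separator is written between "ab" and the first ideograph. Passing undecoded bytes would make `ord` see individual bytes in the range 0x00 to 0xFF, so the CJK range checks would never match.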
