Changeset 1868

Timestamp:

2001-01-26T17:25:49+13:00

Author:

sjboddie

Message:

Made a bunch of changes to the building code to support lots of new
languages and encodings. It's still kind of a mess but should be fixed
up over the weekend.

Location:

trunk/gsdl

Files:

100 added
35 deleted
6 edited

bin/script/makemapfile.pl (added)
etc/main.cfg (modified) (2 diffs)
mappings (added)
mappings/README (added)
mappings/from_uc (added)
mappings/from_uc/8859_1.ump (added)
mappings/from_uc/8859_2.ump (added)
mappings/from_uc/8859_3.ump (added)
mappings/from_uc/8859_4.ump (added)
mappings/from_uc/8859_5.ump (added)
mappings/from_uc/8859_6.ump (added)
mappings/from_uc/8859_7.ump (added)
mappings/from_uc/8859_8.ump (added)
mappings/from_uc/8859_9.ump (added)
mappings/from_uc/iscii_de.ump (added)
mappings/from_uc/koi8_r.ump (added)
mappings/from_uc/koi8_u.ump (added)
mappings/from_uc/uhc.ump (added)
mappings/from_uc/win1250.ump (added)
mappings/from_uc/win1251.ump (added)
mappings/from_uc/win1252.ump (added)
mappings/from_uc/win1253.ump (added)
mappings/from_uc/win1254.ump (added)
mappings/from_uc/win1255.ump (added)
mappings/from_uc/win1256.ump (added)
mappings/from_uc/win1257.ump (added)
mappings/from_uc/win1258.ump (added)
mappings/from_uc/win874.ump (added)
mappings/to_uc (added)
mappings/to_uc/8859_1.ump (added)
mappings/to_uc/8859_2.ump (added)
mappings/to_uc/8859_3.ump (added)
mappings/to_uc/8859_4.ump (added)
mappings/to_uc/8859_5.ump (added)
mappings/to_uc/8859_6.ump (added)
mappings/to_uc/8859_7.ump (added)
mappings/to_uc/8859_8.ump (added)
mappings/to_uc/8859_9.ump (added)
mappings/to_uc/iscii_de.ump (added)
mappings/to_uc/koi8_r.ump (added)
mappings/to_uc/koi8_u.ump (added)
mappings/to_uc/uhc.ump (added)
mappings/to_uc/win1250.ump (added)
mappings/to_uc/win1251.ump (added)
mappings/to_uc/win1252.ump (added)
mappings/to_uc/win1253.ump (added)
mappings/to_uc/win1254.ump (added)
mappings/to_uc/win1255.ump (added)
mappings/to_uc/win1256.ump (added)
mappings/to_uc/win1257.ump (added)
mappings/to_uc/win1258.ump (added)
mappings/to_uc/win874.ump (added)
perllib/cjk.pm (added)
perllib/doc.pm (modified) (1 diff)
perllib/gb.pm (deleted)
perllib/multiread.pm (modified) (13 diffs)
perllib/plugins/BasPlug.pm (modified) (11 diffs)
perllib/textcat/README (deleted)
perllib/textcat/afrikaans.lm (deleted)
perllib/textcat/ar-iso_8859_6.lm (added)
perllib/textcat/ar-windows_1256.lm (added)
perllib/textcat/arabic-iso8859_6.lm (deleted)
perllib/textcat/arabic-windows1256.lm (deleted)
perllib/textcat/be-windows_1251.lm (added)
perllib/textcat/belarus-windows1251.lm (deleted)
perllib/textcat/bg-iso_8859_5.lm (added)
perllib/textcat/bulgarian-iso8859_5.lm (deleted)
perllib/textcat/chinese-big5.lm (deleted)
perllib/textcat/chinese-gb2312.lm (deleted)
perllib/textcat/cs-iso_8859_2.lm (added)
perllib/textcat/czech-iso8859_2.lm (deleted)
perllib/textcat/da-iso_8859_1.lm (added)
perllib/textcat/danish.lm (deleted)
perllib/textcat/de-iso_8859_1.lm (added)
perllib/textcat/dutch.lm (deleted)
perllib/textcat/el-iso_8859_7.lm (added)
perllib/textcat/en-iso_8859_1.lm (added)
perllib/textcat/english.lm (deleted)
perllib/textcat/es-iso_8859_1.lm (added)
perllib/textcat/esperanto.lm (deleted)
perllib/textcat/fi-iso_8859_1.lm (added)
perllib/textcat/finnish.lm (deleted)
perllib/textcat/fr-iso_8859_1.lm (added)
perllib/textcat/french.lm (deleted)
perllib/textcat/german.lm (deleted)
perllib/textcat/greek-iso8859-7.lm (deleted)
perllib/textcat/hebrew-iso8859_8.lm (deleted)
perllib/textcat/hi-iscii_de.lm (added)
perllib/textcat/hindi.lm (deleted)
perllib/textcat/in-iso_8859_1.lm (added)
perllib/textcat/it-iso_8859_1.lm (added)
perllib/textcat/italian.lm (deleted)
perllib/textcat/iw-iso_8859_8.lm (added)
perllib/textcat/ja-euc_jp.lm (added)
perllib/textcat/ja-shift_jis.lm (added)
perllib/textcat/japanese-euc_jp.lm (deleted)
perllib/textcat/japanese-shift_jis.lm (deleted)
perllib/textcat/ji-utf8.lm (added)
perllib/textcat/ko-uhc.lm (added)
perllib/textcat/korean.lm (deleted)
perllib/textcat/nl-iso_8859_1.lm (added)
perllib/textcat/no-iso_8859_1.lm (added)
perllib/textcat/norwegian.lm (deleted)
perllib/textcat/pl-iso_8859_2.lm (added)
perllib/textcat/polish.lm (deleted)
perllib/textcat/portuguese.lm (deleted)
perllib/textcat/pt-iso_8859_1.lm (added)
perllib/textcat/ro-iso_8859_2.lm (added)
perllib/textcat/ru-iso_8859_5.lm (added)
perllib/textcat/ru-koi8_r.lm (added)
perllib/textcat/ru-windows_1251.lm (added)
perllib/textcat/russian-iso8859_5.lm (deleted)
perllib/textcat/russian-koi8_r.lm (deleted)
perllib/textcat/russian-windows1251.lm (deleted)
perllib/textcat/sk-windows_1250.lm (added)
perllib/textcat/sl-ascii.lm (added)
perllib/textcat/sl-iso_8859_2.lm (added)
perllib/textcat/spanish.lm (deleted)
perllib/textcat/sv-iso_8859_1.lm (added)
perllib/textcat/swedish.lm (deleted)
perllib/textcat/th-windows_874.lm (added)
perllib/textcat/tr-iso_8859_9.lm (added)
perllib/textcat/uk-koi8_r.lm (added)
perllib/textcat/ukrainian-koi8_r.lm (deleted)
perllib/textcat/vi-windows_1258.lm (added)
perllib/textcat/vietnamese.lm (deleted)
perllib/textcat/zh-big5.lm (added)
perllib/textcat/zh-gb.lm (added)
perllib/unicode.pm (modified) (3 diffs)
unicode/MAPPINGS/EASTASIA/GB/makemapfile.pl (deleted)
unicode/MAPPINGS/EASTASIA/JIS (added)
unicode/MAPPINGS/EASTASIA/JIS/JIS0201.TXT (added)
unicode/MAPPINGS/EASTASIA/JIS/JIS0208.TXT (added)
unicode/MAPPINGS/EASTASIA/JIS/JIS0212.TXT (added)
unicode/MAPPINGS/EASTASIA/JIS/SJIS.TXT (added)
unicode/MAPPINGS/ISCII/Devanagari.txt (modified) (2 diffs)
unicode/MAPPINGS/WINDOWS/1257.TXT (added)
unicode/MAPPINGS/WINDOWS/1258.TXT (added)
unicode/MAPPINGS/WINDOWS/874.TXT (added)
unicode/sjisu.ump (added)
unicode/usjis.ump (added)


trunk/gsdl/etc/main.cfg

-              r1856
+              r1868
 # longname  -- The display name of the given encoding. If longname isn't set
 #              it will default to using shortname instead.
 # type      -- The type of encoding. Note that for most encodings this
 #              value is the directory name under which the map file for
 #              this encoding resides in the Greenstone unicode/MAPPINGS
 #              directory (e.g. 'WINDOWS', 'ISO_8859' etc.). It may also
-#              take the values 'GB' and 'UTF8'.
+#              take the values 'CJK' and 'UTF8'.
 # mapfile   -- The name of the map file for use when converting between
 #              utf8 and the given encoding. The mapfile option is mandatory
-#              for all encoding types with the exception of GB and UTF8.
+#              for all encoding types with the exception of UTF8. If type
+#              is CJK, mapfile is the abbreviated name of the encoding as
+#              used by the binary mapping files (.ump files). i.e. if the
+#              encoding uses the map files gbku.ump and ugbk.ump, mapfile
+#              will be set to "gbk".
 # label     -- The standard label to which you must set the value of
 #              "charset" within http headers or html meta tags to get a web
 …
 Encoding shortname=w1251 "longname=Cyrillic (Windows-1251)" type=WINDOWS mapfile=1251.TXT label=windows-1251
 Encoding shortname=w1256 "longname=Arabic (Windows-1256)" type=WINDOWS mapfile=1256.TXT label=windows-1256
-Encoding shortname=gb "longname=Simplified Chinese (GBK)" type=GB label=GBK
+Encoding shortname=w1256 "longname=Central European (Windows-1250)" type=WINDOWS mapfile=1250.TXT label=windows-1250
+Encoding shortname=gb "longname=Chinese Simplified (GBK)" type=CJK label=GBK mapfile=gbk
+Encoding shortname=sjis "longname=Japanese (Shift-JIS)" type=CJK label=shift_jis mapfile=sjis
 Encoding shortname=koi8r "longname=Cyrillic (KOI8-R)" type=CYRILLIC mapfile=koi8_r.txt label=koi8-r
+# The following encoding is not currently supported
+# Encoding shortname=eucjp "longname=Japanese (EUC)" type=CJK label=euc-jp mapfile=jis
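
The Encoding line format described in the comments above can be illustrated with a small parser. This is a minimal sketch only: parse_encoding_line is a hypothetical helper name, not a function from the Greenstone source, and it assumes options are either bare key=value tokens or wrapped whole in double quotes, as in the lines shown in this diff.

```perl
use strict;
use warnings;

# Split an "Encoding" line from main.cfg into a hash of option => value.
# Tokens are either bare (key=value) or wrapped in double quotes
# ("key=value with spaces"). Hypothetical helper, for illustration only.
sub parse_encoding_line {
    my ($line) = @_;
    return undef unless $line =~ s/^\s*Encoding\s+//;
    my %opts;
    while ($line =~ /\G\s*(?:"([^"]*)"|(\S+))/gc) {
        my $token = defined $1 ? $1 : $2;
        my ($key, $value) = split (/=/, $token, 2);
        $opts{$key} = $value if defined $value;
    }
    # longname falls back to shortname, as the config comments describe
    $opts{'longname'} = $opts{'shortname'} unless defined $opts{'longname'};
    return \%opts;
}

my $enc = parse_encoding_line (
    'Encoding shortname=sjis "longname=Japanese (Shift-JIS)" type=CJK label=shift_jis mapfile=sjis');
print "$enc->{'longname'} uses map files $enc->{'mapfile'}u.ump and u$enc->{'mapfile'}.ump\n";
```

For a CJK entry the mapfile value is only the abbreviated name, so the two binary map file names are derived from it, which matches the sjisu.ump and usjis.ump files added in this changeset.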

trunk/gsdl/perllib/doc.pm

-              r1844
+              r1868
+}
-sub set_source_encoding {
-    my $self = shift (@_);
-    my ($source_encoding) = @_;
-    $self->set_metadata_element ($self->get_top_section(),
-                 "gsdlsourceencoding",
-                 $source_encoding);
+}
-# returns the source_encoding as it was provided
-sub get_source_encoding {
-    my $self = shift (@_);
-    return $self->get_metadata_element ($self->get_top_section(), "gsdlsourceencoding");
+}
 sub _escape_text {
     my ($text) = @_;

trunk/gsdl/perllib/multiread.pm

-              r1844
+              r1868
 # gb               - GB
 # iso_8859_[1-9]   - 8 bit extended ascii encodings
-# windows_125[0-6] - Windows codepages 1250 to 1256
+# windows_125[0-8] - Windows codepages 1250 to 1258
+# windows 874      - Windows codepage 874
+# iscii_de         - ISCII Devanagari
+# shift_jis        - Shift-JIS
+# euc_jp           - EUC encoded Japanese
+# uhc              - Unified Hangul Code (Korean)
 package multiread;
 use unicode;
-use gb;
+use cjk;
 sub new {
 …
 # if automatic detection between utf8 and unicode is desired
 # then the encoding should be initially set to utf8
-sub read_char {
+sub read_unicode_char {
     my $self = shift (@_);
 …
     return undef if ($self->{'handle'} eq "");
     my $handle = $self->{'handle'};
+    binmode ($handle);
     if ($self->{'encoding'} eq "utf8") {
 …
             $self->{'encoding'} = "unicode";
             $self->{'bigendian'} = 0;
-            if ($ENV{'GSDLOS'} =~ /windows/i) {
-                binmode ($handle); # silly windows
+            }
             last;
 …
             $self->{'encoding'} = "unicode";
             $self->{'bigendian'} = 1;
-            if ($ENV{'GSDLOS'} =~ /windows/i) {
-                binmode ($handle); # silly windows
+            }
             last;
+            }
 …
+    }
-    if ($self->{'encoding'} eq "gb") {
-    # GB or GBK
-    return undef if (eof ($handle));
-    my $c1 = getc ($handle);
-    if (ord ($c1) >= 0x81) {
-        # double byte character
-        return undef if (eof ($handle));
-        my $c2 = getc ($handle);
-        return &unicode::unicode2utf8 (&gb::gb2unicode ($c1.$c2));
-    } else {
-        # single byte character
-        return &unicode::ascii2utf8 ($c1);
+    }
+    }
-    if ($self->{'encoding'} eq "iso_8859_1") {
-    # special case for iso_8859_1 as &ascii2utf8($char) is faster than
-    # &unicode2utf8(iso2unicode('1', $char))
-    return undef if (eof ($handle));
-    return &unicode::ascii2utf8 (getc ($handle));
+    }
-    if ($self->{'encoding'} =~ /^iso_8859_(\d+)$/) {
-    return undef if (eof ($handle));
-    return &unicode::unicode2utf8(&unicode::iso2unicode ($1, getc($handle)));
+    }
-    if ($self->{'encoding'} =~ /windows_(\d{4})$/) {
-    return undef if (eof ($handle));
-    return &unicode::unicode2utf8(&unicode::windows2unicode ($1, getc($handle)));
+    }
-    if ($self->{'encoding'} =~ /^koi8_[ru]$/) {
-    return undef if (eof ($handle));
-    return &unicode::unicode2utf8(&unicode::cyrillic2unicode ($self->{'encoding'}, getc($handle)));
+    }
-    # unknown encoding
     return undef;
+}
 …
     my $out = "";
     my $thisc = "";
-    while (defined ($thisc = $self->read_char())) {
+    while (defined ($thisc = $self->read_unicode_char())) {
         $out .= $thisc;
         last if ($thisc eq "\n");
 …
     return undef;
+    }
     if ($self->{'encoding'} eq "utf8") {
 …
     my $line = "";
     if (defined ($line = <$handle>)) {
-        return &unicode::unicode2utf8 (&gb::gb2unicode ($line));
+        return &unicode::unicode2utf8 (&cjk::gb2unicode ($line));
+    }
     return undef;
 …
+    }
-    if ($self->{'encoding'} =~ /windows_(\d{4})$/) {
+    if ($self->{'encoding'} =~ /windows_(\d{3,4})$/) {
     my $line = "";
     if (defined ($line = <$handle>)) {
 …
     if (defined ($line = <$handle>)) {
         return &unicode::unicode2utf8(&unicode::cyrillic2unicode ($self->{'encoding'}, $line));
+    }
+    return undef;
+    }
+    if ($self->{'encoding'} eq "iscii_de") {
+    my $line = "";
+    if (defined ($line = <$handle>)) {
+        return &unicode::unicode2utf8(&unicode::iscii2unicode ("Devanagari", $line));
+    }
     return undef;
 …
     my $text = <$handle>;
     $/ = "\n";
-    $$outputref .= &unicode::unicode2utf8 (&gb::gb2unicode ($text));
+    $$outputref .= &unicode::unicode2utf8 (&cjk::gb2unicode ($text));
     return;
+    }
 …
     return;
+    }
+    if ($self->{'encoding'} =~ /^iso_8859_(\d+)$/) {
+    undef $/;
+    my $text = <$handle>;
+    $/ = "\n";
+    $$outputref .= &unicode::unicode2utf8(&unicode::iso2unicode ($1, $text));
+    return;
+    }
+    if ($self->{'encoding'} =~ /windows_(\d{4})$/) {
+    undef $/;
+    my $text = <$handle>;
+    $/ = "\n";
+    $$outputref .= &unicode::unicode2utf8(&unicode::windows2unicode ($1, $text));
+    return;
+    }
+    if ($self->{'encoding'} =~ /^koi8_[ru]$/) {
+    undef $/;
+    my $text = <$handle>;
+    $/ = "\n";
+    $$outputref .= &unicode::unicode2utf8(&unicode::cyrillic2unicode ($self->{'encoding'}, $text));
+    return;
+    }
+    if ($self->{'encoding'} eq "shift_jis") {
+    undef $/;
+    my $text = <$handle>;
+    $/ = "\n";
+    $$outputref .= &unicode::unicode2utf8(&cjk::sjis2unicode ($text));
+    return;
+    }
+    if ($self->{'encoding'} eq "euc_jp") {
+    undef $/;
+    my $text = <$handle>;
+    $/ = "\n";
+    $$outputref .= &unicode::unicode2utf8(&cjk::eucjp2unicode ($text));
+    return;
+    }
+    if ($self->{'encoding'} eq "euc_kr") {
+    undef $/;
+    my $text = <$handle>;
+    $/ = "\n";
+    $$outputref .= &unicode::unicode2utf8(&cjk::euckr2unicode ($text));
+    return;
+    }
+    if ($self->{'encoding'} eq "uhc") {
+    undef $/;
+    my $text = <$handle>;
+    $/ = "\n";
+    $$outputref .= &unicode::unicode2utf8(&cjk::uhc2unicode ($text));
+    return;
+    }
+    # if we get to here we assume it's a simple 8 bit encoding
+    undef $/;
+    my $text = <$handle>;
+    $/ = "\n";
+    $$outputref .= &unicode::unicode2utf8(&unicode::simple2unicode ($self->{'encoding'}, $text));
+}
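
The whole-file branches above all repeat the same three steps: undefine $/, read the handle, restore $/. A possible way to factor that pattern is sketched below. It is an illustration under stated assumptions, not the code in multiread.pm: slurp_and_convert and the %converters table are hypothetical names, and uc/lc stand in for real converters such as those in unicode.pm and cjk.pm.

```perl
use strict;
use warnings;

# Read everything remaining on $handle and pass it through $convert.
# `local $/` limits the change of the input record separator to this
# sub, so it is restored even on an early return.
sub slurp_and_convert {
    my ($handle, $convert) = @_;
    local $/ = undef;
    my $text = <$handle>;
    return "" unless defined $text;
    return $convert->($text);
}

# Hypothetical dispatch table; uc/lc stand in for real converters
my %converters = (
    'upper' => sub { return uc ($_[0]); },
    'lower' => sub { return lc ($_[0]); },
);

my $data = "Line one\nLine two\n";
open (my $fh, '<', \$data) or die "cannot open in-memory file: $!";
print slurp_and_convert ($fh, $converters{'upper'});
close ($fh);
```

With a table keyed by encoding name, adding a new encoding would mean adding one entry rather than another copy of the read logic.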

trunk/gsdl/perllib/plugins/BasPlug.pm

-              r1857
+              r1868
 %supported_encodings = (
             "ascii" => "",
+            "utf8" => "",
             "iso_8859_1" => "",
             "windows_1252" => "",
 …
             "iso_8859_9" => "",
             "windows_1254" => "",
-            "gb" => ""
+            "gb" => "",
+            "iscii_de" => "",
+            "windows_1257" => "",
+            "windows_874" => "",
+            "windows_1258" => "",
+            "shift_jis" => "",
+            "euc_jp" => "",
+            "uhc" => ""
             );
 …
     print STDERR "                       windows_1254: Windows codepage 1254 (WinTurkish)\n";
-    print STDERR "                       gb: GB or GBK simplified Chinese\n\n";
+    print STDERR "                       gb: GB or GBK simplified Chinese\n";
+    print STDERR "                       iscii_de: ISCII Devanagari\n";
+    print STDERR "                       windows_1257: Windows codepage 1257 (WinBaltic)\n";
+    print STDERR "                       windows_874: Windows codepage 874 (Thai)\n";
+    print STDERR "                       windows_1258: Windows codepage 1258 (Vietnamese)\n";
+    print STDERR "                       shift_jis: Shift-JIS (Japanese)\n";
+    print STDERR "                       euc_jp: EUC encoded Japanese\n";
+    print STDERR "                       uhc: Unified Hangul Code (Korean). This is a superset of\n";
+    print STDERR "                            EUC encoded Korean\n\n";
     print STDERR "   -default_encoding If -input_encoding is set to 'auto' and the text categorization\n";
 …
     print STDERR "                     this value.\n\n";
-    print STDERR "   -extract_acronyms Extract acronyms from within text and set as metadata\n\n";
+    print STDERR "   -extract_acronyms Extract acronyms from within text and set as metadata\n";
     print STDERR "   -markup_acronyms  Add acronym metadata into document text\n\n";
 …
     print STDERR "   -extract_email    Extract email addresses as metadata\n\n";
-    print STDERR "   -extract_date     Extract dates pertaining to the content of documents about history\n\n";
-    print STDERR "   -maximum_date     The maximum historical date to be used as metadata (in a Common Era date such as 1950)\n\n";
-    print STDERR "   -maximum_century  The maximum named ceuntury to be extracted as historical metadata (e.g. 14 will extract all references up to the 14th century)\n\n";
-    print STDERR "   -no_bibliography Do not try and block pbibliographic dates when extracting historical dates.\n\n";
+    print STDERR "   -extract_date     Extract dates pertaining to the content of documents about history\n";
+    print STDERR "   -maximum_date     The maximum historical date to be used as metadata (in a Common Era\n";
+    print STDERR "                     date such as 1950)\n";
+    print STDERR "   -maximum_century  The maximum named century to be extracted as historical metadata\n";
+    print STDERR "                     (e.g. 14 will extract all references up to the 14th century)\n";
+    print STDERR "   -no_bibliography  Do not try and block bibliographic dates when extracting historical dates.\n\n";
+}
 …
 sub print_usage {
     print STDERR "\nThis plugin has no plugin specific options\n\n";
+}
 …
     my $enc = "^(";
     map {$enc .= "|$_";} keys %supported_encodings;
-    my $denc = $enc . "|utf8|unicode)\$";
-    $enc .= "|utf8|unicode|auto)\$";
+    my $denc = $enc . "|unicode)\$";
+    $enc .= "|unicode|auto)\$";
     $self->{'outhandle'} = STDERR;
 …
     my $doc_obj = new doc ($filename, "indexed_doc");
     $doc_obj->add_utf8_metadata($doc_obj->get_top_section(), "Language", $language);
-    $doc_obj->set_source_encoding ($encoding);
+    $doc_obj->add_utf8_metadata($doc_obj->get_top_section(), "Encoding", $encoding);
     # read in file ($text will be in utf8)
 …
     if (scalar @results != 1) {
     if ($self->{'input_encoding'} ne 'auto') {
         if ($self->{'extract_language'} && $self->{'verbosity'}) {
 …
     # format language/encoding
     my ($language, $encoding) = $results[0] =~ /^([^-]*)(?:-(.*))?$/;
-    $language = $iso639::toiso639{lc($language)};
     die "Invalid language\n" if !defined $language;
 …
     # if textcat returned no encoding info it is assumed to be iso_8859_1
     $encoding = "iso_8859_1";
-    } else {
-    # convert to the format we expect
-    $encoding =~ s/windows/windows_/;
-    $encoding =~ s/iso8859/iso_8859/;
-    $encoding =~ s/^gb.*$/gb/;
+    }

trunk/gsdl/perllib/unicode.pm

-              r1844
+              r1868
+}
+# iscii2unicode is basically identical to iso2unicode, the only
+# difference being that the map files live in unicode/MAPPINGS/ISCII
+#
+# values for $encoding may be 'Devanagari' only at present
+sub iscii2unicode {
+    my ($encoding, $in) = @_;
+    my $out = [];
+    my $mapfile = &util::filename_cat($ENV{'GSDLHOME'}, "unicode", "MAPPINGS",
+                      "ISCII", "$encoding.txt");
+    return $out unless &loadmapping ($encoding, $mapfile);
+    my $i = 0;
+    my $len = length($in);
+    while ($i < $len) {
+    my $c = ord(substr ($in, $i, 1));
+    $c = $translations{"$encoding-unicode"}->{$c} if ($c >= 0xA0);
+    push (@$out, $c);
+    $i++;
+    }
+    return $out;
+}
 # ascii2utf8 takes a (extended) ascii string and
 …
     foreach $num (@$in) {
+    next unless defined $num;
     if ($num < 0x80) {
         $out .= chr ($num);
 …
+####################################################################################################
+# %translations is of the form:
+#
+# encodings{encodingname-encodingname}->blocktranslation
+# blocktranslation->[[0-255],[256-511], ..., [65280-65535]]
+#
+# Any of the top translation blocks can point to an undefined
+# value. This data structure aims to allow fast translation and
+# efficient storage.
+%translations = ();
+# @array256 is used for initialisation, there must be
+# a better way...
+@array256 = (0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);
+$encodings = {
+    'iso_8859_1' => {'fullname' => 'Latin1 (western languages)',
+             'mapfile' => '8859_1.ump', 'ascii_delim' => 0xA0},
+    'iso_8859_2' => {'fullname' => 'Latin2 (central and eastern european languages)',
+             'mapfile' => '8859_2.ump', 'ascii_delim' => 0xA0},
+    'iso_8859_3' => {'fullname' => 'Latin3',
+             'mapfile' => '8859_3.ump', 'ascii_delim' => 0xA0},
+    'iso_8859_4' => {'fullname' => 'Latin4',
+             'mapfile' => '8859_4.ump', 'ascii_delim' => 0xA0},
+    'iso_8859_5' => {'fullname' => 'Cyrillic',
+             'mapfile' => '8859_5.ump', 'ascii_delim' => 0xA0},
+    'iso_8859_6' => {'fullname' => 'Arabic',
+             'mapfile' => '8859_6.ump', 'ascii_delim' => 0xA0},
+    'iso_8859_7' => {'fullname' => 'Greek',
+             'mapfile' => '8859_7.ump', 'ascii_delim' => 0xA0},
+    'iso_8859_8' => {'fullname' => 'Hebrew',
+             'mapfile' => '8859_8.ump', 'ascii_delim' => 0xA0},
+    'iso_8859_9' => {'fullname' => 'Latin5',
+             'mapfile' => '8859_9.ump', 'ascii_delim' => 0xA0},
+    'windows_1250' => {'fullname' => 'Windows codepage 1250 (WinLatin2)',
+               'mapfile' => 'win1250.ump', 'ascii_delim' => 0x80},
+    'windows_1251' => {'fullname' => 'Windows codepage 1251 (WinCyrillic)',
+               'mapfile' => 'win1251.ump', 'ascii_delim' => 0x80},
+    'windows_1252' => {'fullname' => 'Windows codepage 1252 (WinLatin1)',
+               'mapfile' => 'win1252.ump', 'ascii_delim' => 0x80},
+    'windows_1253' => {'fullname' => 'Windows codepage 1253 (WinGreek)',
+               'mapfile' => 'win1253.ump', 'ascii_delim' => 0x80},
+    'windows_1254' => {'fullname' => 'Windows codepage 1254 (WinTurkish)',
+               'mapfile' => 'win1254.ump', 'ascii_delim' => 0x80},
+    'windows_1255' => {'fullname' => 'Windows codepage 1255 (WinHebrew)',
+               'mapfile' => 'win1255.ump', 'ascii_delim' => 0x80},
+    'windows_1256' => {'fullname' => 'Windows codepage 1256 (WinArabic)',
+               'mapfile' => 'win1256.ump', 'ascii_delim' => 0x80},
+    'windows_1257' => {'fullname' => 'Windows codepage 1257 (WinBaltic)',
+               'mapfile' => 'win1257.ump', 'ascii_delim' => 0x80},
+    'windows_1258' => {'fullname' => 'Windows codepage 1258 (Vietnamese)',
+               'mapfile' => 'win1258.ump', 'ascii_delim' => 0x80},
+    'windows_874' => {'fullname' => 'Windows codepage 874 (Thai)',
+              'mapfile' => 'win874.ump', 'ascii_delim' => 0x80},
+    'koi8_r' => {'fullname' => 'Cyrillic',
+         'mapfile' => 'koi8_r.ump', 'ascii_delim' => 0x80},
+    'koi8_u' => {'fullname' => 'Cyrillic (Ukrainian)',
+         'mapfile' => 'koi8_u.ump', 'ascii_delim' => 0x80},
+    'iscii_de' => {'fullname' => 'ISCII Devanagari',
+           'mapfile' => 'iscii_de.ump', 'ascii_delim' => 0xA0}
+};
+# returns a pointer to unicode array
+sub simple2unicode {
+    my ($encoding, $intext) = @_;
+    if (!defined ($encodings->{$encoding})) {
+    print STDERR "unicode::simple2unicode: ERROR: $encoding encoding not supported\n";
+    return [];
+    }
+    my $info = $encodings->{$encoding};
+    my $encodename = "$encoding-unicode";
+    my $mapfile = &util::filename_cat($ENV{'GSDLHOME'}, "mappings", "to_uc",
+                      $info->{'mapfile'});
+    if (!&loadmapencoding ($encodename, $mapfile)) {
+    print STDERR "unicode: ERROR - could not load encoding $encodename\n";
+    return [];
+    }
+    my @outtext = ();
+    my $len = length($intext);
+    my ($c);
+    my $i = 0;
+    while ($i < $len) {
+    if (($c = ord(substr($intext, $i, 1))) < $info->{'ascii_delim'}) {
+        # normal ascii character
+        push (@outtext, $c);
+    } else {
+        push (@outtext, &transchar ($encodename, $c));
+    }
+    $i ++;
+    }
+    return \@outtext;
+}
+# returns 1 if successful, 0 if unsuccessful
+sub loadmapencoding {
+    my ($encoding, $mapfile) = @_;
+    # check to see if the encoding has already been loaded
+    return 1 if (defined $translations{$encoding});
+    return 0 unless open (MAPFILE, $mapfile);
+    binmode (MAPFILE);
+    $translations{$encoding} = [@array256];
+    my $block = $translations{$encoding};
+    my ($in,$i,$j);
+    while (read(MAPFILE, $in, 1) == 1) {
+    $i = unpack ("C", $in);
+    $block->[$i] = [@array256];
+    for ($j=0; $j<256 && read(MAPFILE, $in, 2)==2; $j++) {
+        my ($n1, $n2) = unpack ("CC", $in);
+        $block->[$i]->[$j] = ($n1*256) + $n2;
+    }
+    }
+    close (MAPFILE);
+}
+sub transchar {
+    my ($encoding, $from) = @_;
+    my $high = ($from / 256) % 256;
+    my $low = $from % 256;
+    return 0 unless defined $translations{$encoding};
+    my $block = $translations{$encoding};
+    if (ref ($block->[$high]) ne "ARRAY") {
+    return 0;
+    }
+    return $block->[$high]->[$low];
+}
 ;
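
The binary map file layout that loadmapencoding reads appears to be a sequence of records, each a 1-byte block index followed by 256 big-endian 16-bit values. The sketch below builds one such record in memory and looks a character up through the same two-level table shape as %translations. The layout is inferred from the code in this diff rather than from a format specification, and make_block, load_table and lookup are hypothetical names.

```perl
use strict;
use warnings;

# Build one record in memory: a 1-byte block index followed by
# 256 big-endian 16-bit values (layout inferred from loadmapencoding).
sub make_block {
    my ($index, %map) = @_;
    my @values = (0) x 256;               # 0 means "no mapping"
    $values[$_] = $map{$_} foreach keys %map;
    return pack ("C n256", $index, @values);
}

# Read records into a two-level table: $table->[high]->[low].
sub load_table {
    my ($data) = @_;
    my $table = [];
    open (my $fh, '<', \$data) or die "cannot open in-memory file: $!";
    binmode ($fh);
    my $in;
    while (read ($fh, $in, 1) == 1) {
        my $high = unpack ("C", $in);
        last unless read ($fh, $in, 512) == 512;
        $table->[$high] = [unpack ("n256", $in)];
    }
    close ($fh);
    return $table;
}

# Look up a code; unmapped blocks and unmapped entries both yield 0.
sub lookup {
    my ($table, $code) = @_;
    my ($high, $low) = (($code >> 8) & 0xFF, $code & 0xFF);
    return 0 unless ref ($table->[$high]) eq "ARRAY";
    return $table->[$high]->[$low];
}

# ISCII 0xA4 -> U+0905, taken from Devanagari.txt in this changeset
my $table = load_table (make_block (0, 0xA4 => 0x0905));
printf ("0x%04X\n", lookup ($table, 0xA4));
```

The list repetition `(0) x 256` is one way to initialise a 256-entry array without spelling out a literal like @array256.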

trunk/gsdl/unicode/MAPPINGS/ISCII/Devanagari.txt

-              r1522
+              r1868
 #  ISCII / IS 13194:1991
-# This table was generated by Stuart ([email protected]) for
-# the Greenstone Digital Library software from the ISCII (Indian
-# Script Code for Information Interchange). It maps from the
-# ISCII 7 bit code page covering Latin and Indian Scripts to
-# the Unicode 0900-907F range.
 # see Unicode Standard Version 2.0 pages 7-72
 …
 #LETTERS
-xA1    0x0901  # DEVANAGARI VOWEL-MODIFIER CHANDRABINDU
-xA2    0x0902  # DEVANAGARI VOWEL-MODIFIER ANUSWAR
-xA3    0x0903  # DEVANAGARI VOWEL-MODIFIER VISARG
+xA1    0x0901  # DEVANAGARI VOWEL-MODIFIER CHANDRABINDU
+xA2    0x0902  # DEVANAGARI VOWEL-MODIFIER ANUSWAR
+xA3    0x0903  # DEVANAGARI VOWEL-MODIFIER VISARG
-xA4    0x0905  # DEVANAGARI VOWEL A
-xA5    0x0906  # DEVANAGARI VOWEL AA
-xA6    0x0907  # DEVANAGARI VOWEL I
-xA7    0x0908  # DEVANAGARI VOWEL II
-xA8    0x0909  # DEVANAGARI VOWEL U
-xA9    0x090A  # DEVANAGARI VOWEL UU
-xAA    0x090B  # DEVANAGARI VOWEL RI
-xAB    0x090E  # DEVANAGARI VOWEL E (SOUTHERN SCRIPTS)
-xAC    0x090F  # DEVANAGARI VOWEL EY
-xAD    0x0910  # DEVANAGARI VOWEL AI
-xAE    0x090D  # DEVANAGARI VOWEL AYE (DEVANAGARI SCRIPT)
-xAF    0x0912  # DEVANAGARI VOWEL O (SOUTHERN SCRIPTS)
+xA4    0x0905  # DEVANAGARI VOWEL A
+xA5    0x0906  # DEVANAGARI VOWEL AA
+xA6    0x0907  # DEVANAGARI VOWEL I
+xA7    0x0908  # DEVANAGARI VOWEL II
+xA8    0x0909  # DEVANAGARI VOWEL U
+xA9    0x090A  # DEVANAGARI VOWEL UU
+xAA    0x090B  # DEVANAGARI VOWEL RI
+xAB    0x090E  # DEVANAGARI VOWEL E (SOUTHERN SCRIPTS)
+xAC    0x090F  # DEVANAGARI VOWEL EY
+xAD    0x0910  # DEVANAGARI VOWEL AI
+xAE    0x090D  # DEVANAGARI VOWEL AYE (DEVANAGARI SCRIPT)
+xAF    0x0912  # DEVANAGARI VOWEL O (SOUTHERN SCRIPTS)
-xB0    0x0913  # DEVANAGARI VOWEL OW
-xB1    0x0914  # DEVANAGARI VOWEL AU
-xB2    0x0911  # DEVANAGARI VOWEL AWE  (DEVANAGARI SCRIPT)
-xB3    0x0915  # DEVANAGARI CONSONANT KA
-xB4    0x0916  # DEVANAGARI CONSONANT KHA
-xB5    0x0917  # DEVANAGARI CONSONANT GA
-xB6    0x0918  # DEVANAGARI CONSONANT GHA
-xB7    0x0919  # DEVANAGARI CONSONANT NGA
-xB8    0x091A  # DEVANAGARI CONSONANT CHA
-xB9    0x091B  # DEVANAGARI CONSONANT CHHA
-xBA    0x091C  # DEVANAGARI CONSONANT JA
-xBB    0x091D  # DEVANAGARI CONSONANT JHA
-xBC    0x091E  # DEVANAGARI CONSONANT JNA
-xBD    0x091F  # DEVANAGARI CONSONANT HARD TA
-xBE    0x0920  # DEVANAGARI CONSONANT HARD THA
-xBF    0x0921  # DEVANAGARI CONSONANT HARD DA
+xB0    0x0913  # DEVANAGARI VOWEL OW
+xB1    0x0914  # DEVANAGARI VOWEL AU
+xB2    0x0911  # DEVANAGARI VOWEL AWE  (DEVANAGARI SCRIPT)
+xB3    0x0915  # DEVANAGARI CONSONANT KA
+xB4    0x0916  # DEVANAGARI CONSONANT KHA
+xB5    0x0917  # DEVANAGARI CONSONANT GA
+xB6    0x0918  # DEVANAGARI CONSONANT GHA
+xB7    0x0919  # DEVANAGARI CONSONANT NGA
+xB8    0x091A  # DEVANAGARI CONSONANT CHA
+xB9    0x091B  # DEVANAGARI CONSONANT CHHA
+xBA    0x091C  # DEVANAGARI CONSONANT JA
+xBB    0x091D  # DEVANAGARI CONSONANT JHA
+xBC    0x091E  # DEVANAGARI CONSONANT JNA
+xBD    0x091F  # DEVANAGARI CONSONANT HARD TA
+xBE    0x0920  # DEVANAGARI CONSONANT HARD THA
+xBF    0x0921  # DEVANAGARI CONSONANT HARD DA
-xC0    0x0922  # DEVANAGARI CONSONANT HARD DHA
-xC1    0x0923  # DEVANAGARI CONSONANT HARD NA
-xC2    0x0924  # DEVANAGARI CONSONANT SOFT TA
-xC3    0x0925  # DEVANAGARI CONSONANT SOFT THA
-xC4    0x0926  # DEVANAGARI CONSONANT SOFT DA
-xC5    0x0927  # DEVANAGARI CONSONANT SOFT DHA
-xC6    0x0928  # DEVANAGARI CONSONANT SOFT NA
-xC7    0x0929  # DEVANAGARI CONSONANT NA (TAMIL)
-xC8    0x092A  # DEVANAGARI CONSONANT PA
-xC9    0x092B  # DEVANAGARI CONSONANT PHA
-xCA    0x092C  # DEVANAGARI CONSONANT BA
-xCB    0x092D  # DEVANAGARI CONSONANT BHA
-xCC    0x092E  # DEVANAGARI CONSONANT MA
-xCD    0x092F  # DEVANAGARI CONSONANT YA
+xC0    0x0922  # DEVANAGARI CONSONANT HARD DHA
+xC1    0x0923  # DEVANAGARI CONSONANT HARD NA
+xC2    0x0924  # DEVANAGARI CONSONANT SOFT TA
+xC3    0x0925  # DEVANAGARI CONSONANT SOFT THA
+xC4    0x0926  # DEVANAGARI CONSONANT SOFT DA
+xC5    0x0927  # DEVANAGARI CONSONANT SOFT DHA
+xC6    0x0928  # DEVANAGARI CONSONANT SOFT NA
+xC7    0x0929  # DEVANAGARI CONSONANT NA (TAMIL)
+xC8    0x092A  # DEVANAGARI CONSONANT PA
+xC9    0x092B  # DEVANAGARI CONSONANT PHA
+xCA    0x092C  # DEVANAGARI CONSONANT BA
+xCB    0x092D  # DEVANAGARI CONSONANT BHA
+xCC    0x092E  # DEVANAGARI CONSONANT MA
+xCD    0x092F  # DEVANAGARI CONSONANT YA
 # WARNING: THIS CHARACTER IS NON-CANNONICAL
-xCE    0x095F  # DEVANAGARI CONSONANT JKA (BENGALI, ASSAMESE & ORIYA)
-xCF    0x0930  # DEVANAGARI CONSONANT RA
+xCE    0x095F  # DEVANAGARI CONSONANT JKA (BENGALI, ASSAMESE & ORIYA)
+xCF    0x0930  # DEVANAGARI CONSONANT RA
-xD0    0x0931  # DEVANAGARI CONSONANT HARD RA (SOUTHERN SCRIPTS)
-xD1    0x0932  # DEVANAGARI CONSONANT LA
-xD2    0x0933  # DEVANAGARI CONSONANT HARD LA
-xD3    0x0934  # DEVANAGARI CONSONANT ZHA (TAMIL & MALAYALAM)
-xD4    0x0935  # DEVANAGARI CONSONANT VA
-xD5    0x0936  # DEVANAGARI CONSONANT SHA
-xD6    0x0937  # DEVANAGARI CONSONANT HARD SHA
-xD7    0x0938  # DEVANAGARI CONSONANT SA
-xD8    0x0939  # DEVANAGARI CONSONANT HA
-#0xD9    0x0900  # DEVANAGARI INVISIBLE (NO UNICODE EQUALIVENT)
-xDA    0x093E  # DEVANAGARI VOWEL SIGN AA
-xDB    0x093F  # DEVANAGARI VOWEL SIGN I
-xDC    0x0940  # DEVANAGARI VOWEL SIGN II
-xDD    0x0941  # DEVANAGARI VOWEL SIGN U
-xDE    0x0942  # DEVANAGARI VOWEL SIGN UU
-xDF    0x0943  # DEVANAGARI VOWEL SIGN RI
+xD0    0x0931  # DEVANAGARI CONSONANT HARD RA (SOUTHERN SCRIPTS)
+xD1    0x0932  # DEVANAGARI CONSONANT LA
+xD2    0x0933  # DEVANAGARI CONSONANT HARD LA
+xD3    0x0934  # DEVANAGARI CONSONANT ZHA (TAMIL & MALAYALAM)
+xD4    0x0935  # DEVANAGARI CONSONANT VA
+xD5    0x0936  # DEVANAGARI CONSONANT SHA
+xD6    0x0937  # DEVANAGARI CONSONANT HARD SHA
+xD7    0x0938  # DEVANAGARI CONSONANT SA
+xD8    0x0939  # DEVANAGARI CONSONANT HA
+#0xD9   0x0900  # DEVANAGARI INVISIBLE (NO UNICODE EQUALIVENT)
+xDA    0x093E  # DEVANAGARI VOWEL SIGN AA
+xDB    0x093F  # DEVANAGARI VOWEL SIGN I
+xDC    0x0940  # DEVANAGARI VOWEL SIGN II
+xDD    0x0941  # DEVANAGARI VOWEL SIGN U
+xDE    0x0942  # DEVANAGARI VOWEL SIGN UU
+xDF    0x0943  # DEVANAGARI VOWEL SIGN RI
-xE0    0x0946  # DEVANAGARI VOWEL SIGN E (SOUTHERN SCRIPTS)
-xE1    0x0947  # DEVANAGARI VOWEL SIGN EY
-xE2    0x0948  # DEVANAGARI VOWEL SIGN AI
-xE3    0x0945  # DEVANAGARI VOWEL SIGN AYE (DEVANAGARI SCRIPT)
-xE4    0x094A  # DEVANAGARI VOWEL SIGN O SOUTHERN SCRIPTS)
-xE5    0x094B  # DEVANAGARI VOWEL SIGN OW
-xE6    0x094C  # DEVANAGARI VOWEL SIGN AU
-xE7    0x0949  # DEVANAGARI VOWEL SIGN AWE (DEVANAGARI SCRIPT)
-xE8    0x094D  # DEVANAGARI VOWEL SIGN OMISSION SIGN (HALANT)
+xE0    0x0946  # DEVANAGARI VOWEL SIGN E (SOUTHERN SCRIPTS)
+xE1    0x0947  # DEVANAGARI VOWEL SIGN EY
+xE2    0x0948  # DEVANAGARI VOWEL SIGN AI
+xE3    0x0945  # DEVANAGARI VOWEL SIGN AYE (DEVANAGARI SCRIPT)
+xE4    0x094A  # DEVANAGARI VOWEL SIGN O SOUTHERN SCRIPTS)
+xE5    0x094B  # DEVANAGARI VOWEL SIGN OW
+xE6    0x094C  # DEVANAGARI VOWEL SIGN AU
+xE7    0x0949  # DEVANAGARI VOWEL SIGN AWE (DEVANAGARI SCRIPT)
+xE8    0x094D  # DEVANAGARI VOWEL SIGN OMISSION SIGN (HALANT)
 #PUNCTUATION
-xE9    0x093C  # DEVANAGARI DIACRITIC SIGN (NUKTA)
-xEA    0x0964  # DEVANAGARI FULL STOP
+xE9    0x093C  # DEVANAGARI DIACRITIC SIGN (NUKTA)
+xEA    0x0964  # DEVANAGARI FULL STOP
 #DIGITS
-xF1    0x0966  # DEVANAGARI DIGIT ZERO
-xF2    0x0967  # DEVANAGARI DIGIT ONE
-xF3    0x0968  # DEVANAGARI DIGIT TWO
-xF4    0x0969  # DEVANAGARI DIGIT THREE
-xF5    0x096A  # DEVANAGARI DIGIT FOUR
-xF6    0x096B  # DEVANAGARI DIGIT FIVE
-xF7    0x096C  # DEVANAGARI DIGIT SIX
-xF8    0x096D  # DEVANAGARI DIGIT SEVEN
-xF9    0x096E  # DEVANAGARI DIGIT EIGHT
-xFA    0x096F  # DEVANAGARI DIGIT NINE
+xF1    0x0966  # DEVANAGARI DIGIT ZERO
+xF2    0x0967  # DEVANAGARI DIGIT ONE
+xF3    0x0968  # DEVANAGARI DIGIT TWO
+xF4    0x0969  # DEVANAGARI DIGIT THREE
+xF5    0x096A  # DEVANAGARI DIGIT FOUR
+xF6    0x096B  # DEVANAGARI DIGIT FIVE
+xF7    0x096C  # DEVANAGARI DIGIT SIX
+xF8    0x096D  # DEVANAGARI DIGIT SEVEN
+xF9    0x096E  # DEVANAGARI DIGIT EIGHT
+xFA    0x096F  # DEVANAGARI DIGIT NINE
