Changeset 1842


Timestamp:
2001-01-19T10:01:57+13:00
Author:
sjboddie
Message:

Added codepages for a bunch of new encodings - iso-8859-1 thru iso-8859-9
are now all supported along with Windows codepages 1250 thru 1256 and
koi8-r and koi8-u Cyrillic encodings.

Location:
trunk/gsdl/unicode/MAPPINGS
Files:
13 added
4 edited

  • trunk/gsdl/unicode/MAPPINGS/EASTASIA/GB/GB12345.TXT

    r72 r1842  
    1 #
    2 #   Name:             GB12345-80 to Unicode table (complete, hex format)
    3 #   Unicode version:  1.1
    4 #   Table version:    0.0d1
    5 #   Table format:     Format A
    6 #   Date:             6 December 1993
    7 #   Author:           Glenn Adams <[email protected]>
    8 #                     John H. Jenkins <[email protected]>
    9 #
    10 #   Copyright (c) 1991-1994 Unicode, Inc.  All Rights reserved.
    11 #
    12 #   This file is provided as-is by Unicode, Inc. (The Unicode Consortium).
    13 #   No claims are made as to fitness for any particular purpose.  No
    14 #   warranties of any kind are expressed or implied.  The recipient
    15 #   agrees to determine applicability of information provided.  If this
    16 #   file has been provided on magnetic media by Unicode, Inc., the sole
    17 #   remedy for any claim will be exchange of defective media within 90
    18 #   days of receipt.
    19 #
    20 #   Recipient is granted the right to make copies in any form for
    21 #   internal distribution and to freely use the information supplied
    22 #   in the creation of products supporting Unicode.  Unicode, Inc.
    23 #   specifically excludes the right to re-distribute this file directly
    24 #   to third parties or other organizations whether for profit or not.
    25 #
    26 #   General notes:
    27 #
    28 #   This table contains the data Metis and Taligent currently have on how
    29 #       GB12345-90 characters map into Unicode.
    30 #
    31 #   Format:  Three tab-separated columns
    32 #        Column #1 is the GB12345 code (in hex as 0xXXXX)
    33 #        Column #2 is the Unicode (in hex as 0xXXXX)
    34 #        Column #3 the Unicode name (follows a comment sign, '#')
    35 #                   The official names for Unicode characters U+4E00
    36 #                   to U+9FA5, inclusive, is "CJK UNIFIED IDEOGRAPH-XXXX",
    37 #                   where XXXX is the code point.  Including all these
    38 #                   names in this file increases its size substantially
    39 #                   and needlessly.  The token "<CJK>" is used for the
    40 #                   name of these characters.  If necessary, it can be
    41 #                   expanded algorithmically by a parser or editor.
    42 #
    43 #   The entries are in GB12345 order
    44 #
    45 #   The following algorithms can be used to change the hex form
    46 #       of GB12345 to other standard forms:
    47 #
    48 #       To change hex to EUC form, add 0x8080
    49 #       To change hex to kuten form, first subtract 0x2020.  Then
    50 #           the high and low bytes correspond to the ku and ten of
    51 #           the kuten form.  For example, 0x2121 -> 0x0101 -> 0101;
    52 #           0x777E -> 0x575E -> 8794
    53 #
    54 #   Any comments or problems, contact <[email protected]>
    55 #
    56 #
    57 1 0x2121  0x3000  # IDEOGRAPHIC SPACE
    58 2 0x2122  0x3001  # IDEOGRAPHIC COMMA
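The removed header describes two conversions from the table's hex form: adding 0x8080 gives the EUC form, and subtracting 0x2020 gives the kuten form, whose high and low bytes are the ku and ten. A minimal Python sketch of those two rules follows. The function names `gb_to_euc` and `gb_to_kuten` are hypothetical and are not part of the Greenstone sources or the Unicode tables.

```python
def gb_to_euc(code: int) -> int:
    """Hex form to EUC form: add 0x8080, as the table header describes."""
    return code + 0x8080


def gb_to_kuten(code: int) -> str:
    """Hex form to kuten form: subtract 0x2020, then read the high byte
    as ku and the low byte as ten, each written as two decimal digits."""
    shifted = code - 0x2020
    ku = shifted >> 8      # high byte
    ten = shifted & 0xFF   # low byte
    return f"{ku:02d}{ten:02d}"


# The two worked examples from the header comment.
print(hex(gb_to_euc(0x2121)))  # 0xa1a1
print(gb_to_kuten(0x2121))     # 0101
print(gb_to_kuten(0x777E))     # 8794
```

The sketch does no range checking; it assumes the input is a valid two-byte code from the table.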
  • trunk/gsdl/unicode/MAPPINGS/EASTASIA/GB/GB2312.TXT

    r72 r1842  
    1 #
    2 #   Name:             GB2312-80 to Unicode table (complete, hex format)
    3 #   Unicode version:  1.1
    4 #   Table version:    0.0d2
    5 #   Table format:     Format A
    6 #   Date:             6 December 1993
    7 #   Author:           Glenn Adams <[email protected]>
    8 #                     John H. Jenkins <[email protected]>
    9 #
    10 #   Copyright (c) 1991-1994 Unicode, Inc.  All Rights reserved.
    11 #
    12 #   This file is provided as-is by Unicode, Inc. (The Unicode Consortium).
    13 #   No claims are made as to fitness for any particular purpose.  No
    14 #   warranties of any kind are expressed or implied.  The recipient
    15 #   agrees to determine applicability of information provided.  If this
    16 #   file has been provided on magnetic media by Unicode, Inc., the sole
    17 #   remedy for any claim will be exchange of defective media within 90
    18 #   days of receipt.
    19 #
    20 #   Recipient is granted the right to make copies in any form for
    21 #   internal distribution and to freely use the information supplied
    22 #   in the creation of products supporting Unicode.  Unicode, Inc.
    23 #   specifically excludes the right to re-distribute this file directly
    24 #   to third parties or other organizations whether for profit or not.
    25 #
    26 #   General notes:
    27 #
    28 #   This table contains the data Metis and Taligent currently have on how
    29 #       GB2312-80 characters map into Unicode.
    30 #
    31 #   Format:  Three tab-separated columns
    32 #        Column #1 is the GB2312 code (in hex as 0xXXXX)
    33 #        Column #2 is the Unicode (in hex as 0xXXXX)
    34 #        Column #3 the Unicode name (follows a comment sign, '#')
    35 #                   The official names for Unicode characters U+4E00
    36 #                   to U+9FA5, inclusive, is "CJK UNIFIED IDEOGRAPH-XXXX",
    37 #                   where XXXX is the code point.  Including all these
    38 #                   names in this file increases its size substantially
    39 #                   and needlessly.  The token "<CJK>" is used for the
    40 #                   name of these characters.  If necessary, it can be
    41 #                   expanded algorithmically by a parser or editor.
    42 #
    43 #   The entries are in GB2312 order
    44 #
    45 #   The following algorithms can be used to change the hex form
    46 #       of GB2312 to other standard forms:
    47 #
    48 #       To change hex to EUC form, add 0x8080
    49 #       To change hex to kuten form, first subtract 0x2020.  Then
    50 #           the high and low bytes correspond to the ku and ten of
    51 #           the kuten form.  For example, 0x2121 -> 0x0101 -> 0101;
    52 #           0x777E -> 0x575E -> 8794
    53 #
    54 #   Any comments or problems, contact <[email protected]>
    55 #
    56 #
    57 1 0x2121  0x3000  # IDEOGRAPHIC SPACE
    58 2 0x2122  0x3001  # IDEOGRAPHIC COMMA
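The header's format notes (three columns: source code, Unicode value, and a name after '#', with "<CJK>" standing in for "CJK UNIFIED IDEOGRAPH-XXXX") suggest a simple line parser. The following is a sketch under stated assumptions, not Greenstone's actual loader: `parse_mapping` is a hypothetical name, columns are split on any whitespace rather than strictly on tabs, and the third sample row is illustrative rather than taken from this changeset.

```python
from typing import Iterable, Iterator, Tuple


def parse_mapping(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (source_code, unicode_value, name) for each data line."""
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and full-line comments (the header block).
        if not stripped or stripped.startswith("#"):
            continue
        data, _, comment = stripped.partition("#")
        fields = data.split()
        if len(fields) < 2:
            continue  # not a mapping row
        code = int(fields[0], 16)   # int() accepts the "0x" prefix in base 16
        value = int(fields[1], 16)
        name = comment.strip()
        # Expand the "<CJK>" token algorithmically, as the header allows.
        if name == "<CJK>":
            name = f"CJK UNIFIED IDEOGRAPH-{value:04X}"
        yield code, value, name


sample = [
    "#   The entries are in GB2312 order",
    "0x2121\t0x3000\t# IDEOGRAPHIC SPACE",
    "0x3021\t0x554A\t# <CJK>",  # illustrative row, not shown in this diff
]
for code, value, name in parse_mapping(sample):
    print(f"{code:#06x} -> U+{value:04X} {name}")
```

Because it only relies on two hex columns and a trailing comment, the same sketch should also read the single-byte Windows codepage tables in this changeset, whose rows look like `0x80    0x0402  # CYRILLIC CAPITAL LETTER DJE`.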
  • trunk/gsdl/unicode/MAPPINGS/WINDOWS/1251.TXT

    r1838 r1842  
    1 1 # Microsoft Windows Codepage : 1251 (Cyrillic)
    2 2
    3 # This table was generated by Stefan Boddie ([email protected]) for
    4 # the Greenstone Digital Library software from the codepage found at
    5 # http://www.microsoft.com/globaldev/reference/sbcs/1251.htm
    6 
      3 # This table was generated for the Greenstone Digital Library software from the
      4 # codepage found at http://www.microsoft.com/globaldev/reference/sbcs/1251.htm
    7 5
    8 6 0x80    0x0402  # CYRILLIC CAPITAL LETTER DJE
  • trunk/gsdl/unicode/MAPPINGS/WINDOWS/1256.TXT

    r1228 r1842  
    1 1 # Microsoft Windows Codepage : 1256 (Arabic)
    2 2
    3 # This table was generated by Stefan Boddie ([email protected]) for
    4 # the Greenstone Digital Library software from the codepage found at
    5 # http://www.microsoft.com/globaldev/reference/sbcs/1256.htm
      3 # This table was generated for the Greenstone Digital Library software from the
      4 # codepage found at http://www.microsoft.com/globaldev/reference/sbcs/1256.htm
    6 5
    7 6 0x80    0x20AC  # EURO SIGN