Changeset 2799


Timestamp:
2001-10-12T12:45:17+13:00
Author:
sjboddie
Message:

Fixed a bug where Word documents containing non-ASCII characters weren't
being handled correctly. The problem occurred because Greenstone doesn't
currently detect UTF-8 encoded text automatically, and the HTML produced by
wvWare is UTF-8. The Greenstone build process was assuming it was
ISO-8859-1 and so was not importing the documents correctly. I've fixed it by
forcing the build process to assume that HTML imported with WordPlug is
UTF-8. This works, but we should also add support for detecting UTF-8
encoded documents at some point.
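
The UTF-8 detection that the message leaves as future work could be sketched roughly as follows. This is an illustration only and is not part of this changeset or of Greenstone: the helper name looks_like_utf8 is hypothetical, and the sketch assumes a Perl that ships the core Encode module (5.8 or later). It treats a byte string as UTF-8 when it contains at least one high-bit byte and decodes strictly without error.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not a Greenstone function): returns 1 if $bytes is
# well-formed UTF-8 containing at least one non-ASCII byte, otherwise 0.
sub looks_like_utf8 {
    my ($bytes) = @_;

    # Pure ASCII is valid in both UTF-8 and ISO-8859-1, so it tells us nothing.
    return 0 unless $bytes =~ /[\x80-\xFF]/;

    # decode() may modify its input when a CHECK value is given, so use a copy.
    my $copy = $bytes;

    # Strict 'UTF-8' plus FB_CROAK makes decode() die on malformed input.
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}
```

A check like this only says the bytes are valid UTF-8. Short ISO-8859-1 strings can occasionally pass by coincidence, so it would probably be used as one signal alongside the existing textcat-based guess rather than as a replacement for it.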

File:
1 edited

  • trunk/gsdl/perllib/plugins/ConvertToPlug.pm

    --- trunk/gsdl/perllib/plugins/ConvertToPlug.pm (r2796)
    +++ trunk/gsdl/perllib/plugins/ConvertToPlug.pm (r2799)
    @@ -258,4 +258,11 @@
         # Do encoding stuff
         my ($language, $encoding);
    +
    +    # WordPlug's wvWare will always produce html files encoded as utf-8
    +    if ($plugin_name eq "WordPlug") {
    +    $self->{'input_encoding'} = "utf8";
    +    $self->{'extract_language'} = 1;
    +    }
    +
         if ($self->{'input_encoding'} eq "auto") {
             # use textcat to automatically work out the input encoding and language
    @@ -264,5 +271,6 @@
             # use textcat to get language metadata
     
    -        my ($language, $extracted_encoding) = $self->get_language_encoding ($conv_filename);
    +    my ($extracted_encoding);
    +        ($language, $extracted_encoding) = $self->get_language_encoding ($conv_filename);
             $encoding = $self->{'input_encoding'};
             if ($extracted_encoding ne $encoding && $self->{'verbosity'}) {