Ticket #294 (closed defect: fixed)

Opened 11 years ago

Last modified 11 years ago

Investigation of multi lingual filenames

Reported by: kjdon
Owned by: kjdon
Priority: high
Milestone: Release 2.81
Component: Collection Building
Severity: major
Keywords: multilingual
Cc:

Description

Test the whole Greenstone system with multilingual filenames.

Do we need two versions of the filename metadata, one for presentation, one for use in paths?

Change History

Changed 11 years ago by kjdon

  • keywords multilingual added
  • owner changed from nobody to kjdon
  • status changed from new to assigned

Changed 11 years ago by ak19

With the correction from "encodings::encoding" to "encodings::encodings" made by Dr Bainbridge, collections of HTML documents that contain

  • French text (special characters) and
  • img src tags referencing image names with French characters

work on Windows and Linux. That is, the images load and Greenstone automatically displays the text in the correct encoding, UTF-8.

However, when these HTML documents contain relative links to files whose names contain special French characters, the links do not go to the pages they refer to but show the "External Link" message instead. When I create a copy of the collection in which the filenames contain only regular characters, the relative links work.

Here are two examples of the difference:

1. WORKS (normal chars + spaces): http://localhost:9090/greenstone3/library?el=&a=d&c=french3&d=&rl=1&href=http://Le%20Petit%20Prince%20d'Antoine%20de%20Saint%20Exupery.html

   DOES NOT WORK (special chars + spaces): http://localhost:9090/greenstone3/library?el=&a=d&c=frenchti&d=&rl=1&href=http://Le%20Petit%20Prince%20d'Antoine%20de%20Saint%20Exup%C3%A9ry.html

2. WORKS (normal chars + spaces): http://localhost:9090/greenstone3/library?el=&a=d&c=french3&d=&rl=1&href=http://Le%20Chateau%20de%20ma%20mere.html

   DOES NOT WORK (special chars + spaces): http://localhost:9090/greenstone3/library?el=&a=d&c=frenchti&d=&rl=1&href=http://Le%20Ch%C3%A2teau%20de%20ma%20m%C3%A8re.html

I don't know whether the following is part of the problem. Note how the special characters above are URL-encoded. This is odd, because the online encoder/decoder at http://meyerweb.com/eric/tools/dencoder/ shows the filename "Le Château de ma mère" URL-encoded as "Le%20Ch%E2teau%20de%20ma%20m%E8re", as opposed to what appears above: "Le%20Ch%C3%A2teau%20de%20ma%20m%C3%A8re".
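The two encodings of the same filename differ only in the character set applied before percent-encoding: %E2 and %E8 are the single Latin-1 (ISO-8859-1) bytes for â and è, while %C3%A2 and %C3%A8 are their two-byte UTF-8 sequences. A minimal Python sketch of that difference, using only the standard library (an illustration, not Greenstone code):

```python
from urllib.parse import quote

filename = "Le Château de ma mère"

# Percent-encode the Latin-1 (ISO-8859-1) bytes: one byte per accented letter.
latin1_encoded = quote(filename, encoding="latin-1")

# Percent-encode the UTF-8 bytes: two bytes per accented letter.
utf8_encoded = quote(filename, encoding="utf-8")

print(latin1_encoded)  # Le%20Ch%E2teau%20de%20ma%20m%E8re
print(utf8_encoded)    # Le%20Ch%C3%A2teau%20de%20ma%20m%C3%A8re
```

So the UTF-8 form seen in the failing URLs is not wrong in itself; it is simply a different byte encoding from the one the online tool applied.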

I've tried calling BasPlug::filename_to_metadata (which works with encodings) in both HTMLPlug's process and format_link subroutines, since these two subs deal with internal and relative links, but it does not fix the problem:

$file = &BasPlug::filename_to_metadata($self, $file);

and

$link = &BasPlug::filename_to_metadata($self, $link);

Changed 11 years ago by ak19

The problem is with perllib/doc.pm's add_metadata, in the line:

$self->add_utf8_metadata ($section, $field, &unicode::ascii2utf8(\$value)); # problem here

unicode::ascii2utf8 turns the correct special characters into strange characters, so the URL metadata in doc.xml is incorrect. This causes relative links to HTML pages whose filenames contain special characters to fail.
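The corruption described here matches what happens when text that is already UTF-8 is converted to UTF-8 a second time. The Python sketch below reproduces that effect. It assumes that unicode::ascii2utf8 treats every input byte as a separate Latin-1 character, which the ticket does not state, and the function reencode_as_utf8 is a hypothetical stand-in rather than Greenstone code.

```python
from urllib.parse import quote


def reencode_as_utf8(raw: bytes) -> bytes:
    """Hypothetical stand-in: read each byte as a Latin-1 character, then encode as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


# The filename is already UTF-8: the letter é is the two bytes C3 A9.
already_utf8 = "Exupéry.html".encode("utf-8")

# Converting again turns each of those two bytes into its own two-byte sequence.
double_encoded = reencode_as_utf8(already_utf8)

print(quote(already_utf8))             # Exup%C3%A9ry.html
print(quote(double_encoded))           # Exup%C3%83%C2%A9ry.html
print(double_encoded.decode("utf-8"))  # ExupÃ©ry.html
```

Pure ASCII filenames pass through such a conversion unchanged, which would be consistent with the collection using regular characters continuing to work.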

I have come up with a solution that works for now, but I don't know whether it will suddenly fail for other instances that used to work. I don't want to break anything.

1. Here's the old code:

a. perllib/plugins/HTMLPlug.pm's process subroutine:

my $web_url = "http://$file";
$doc_obj->add_metadata($cursection, "URL", $web_url, $convert_to_utf8);

$file = &BasPlug::filename_to_metadata($self, $file);

b. perllib/doc.pm's add_metadata subroutine:

sub add_metadata {
    my $self = shift (@_);
    my ($section, $field, $value) = @_;

    $self->add_utf8_metadata ($section, $field, &unicode::ascii2utf8(\$value));
}

2. Here's the modified code:

a. perllib/plugins/HTMLPlug.pm's process subroutine:

$file = &BasPlug::filename_to_metadata($self, $file); # filename character encoding
my $web_url = "http://$file";
my $convert_to_utf8 = 0; # set to false, since we have just encoded the filename above (is that always utf8 though?)
$doc_obj->add_metadata($cursection, "URL", $web_url, $convert_to_utf8);

b. perllib/doc.pm's add_metadata subroutine:

sub add_metadata {
    my $self = shift (@_);
    my ($section, $field, $value, $convert_to_utf8) = @_;

    print STDERR "###$field=$value\n";

    if (!defined $convert_to_utf8 || $convert_to_utf8) {
        $self->add_utf8_metadata ($section, $field, &unicode::ascii2utf8(\$value));
    } else { # don't convert specially to utf8
        $self->add_utf8_metadata ($section, $field, $value);
    }
}

The above now works for my sample collection. I have also tested it on the version where the French filenames use regular characters, and that still builds and works as before. But will anything else break? What else do I need to test? And how do we handle the case where the filename metadata is not in UTF-8 and the URL has therefore not yet been encoded as UTF-8, i.e. where the character encoding is ASCII and the value still needs to be converted to UTF-8?

In doc.pm there is this code (sub add_utf8_metadata gets called by doc.pm's add_metadata):

sub add_utf8_metadata {
    ...
    #print STDERR "###$field=$value\n";
    # double check that the value is utf-8
    if (unicode::ensure_utf8(\$value)) {
        print STDERR "doc::add_utf8_metadata: warning: '$field' wasn't utf8\n";
    }
    ...
}

That looks promising.
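The check in add_utf8_metadata suggests a validate-then-convert pattern: accept the value if it is already valid UTF-8 and convert it only otherwise. A rough Python sketch of that pattern follows. The function name ensure_utf8, its return convention (the bytes plus a flag) and the Latin-1 fallback are assumptions made for illustration; the ticket does not show how unicode::ensure_utf8 is implemented.

```python
from typing import Tuple


def ensure_utf8(raw: bytes) -> Tuple[bytes, bool]:
    """Return (utf8_bytes, was_converted) for a raw metadata value.

    Assumption for this sketch: anything that is not valid UTF-8 is Latin-1.
    """
    try:
        raw.decode("utf-8")  # strict validation only; the result is discarded
        return raw, False
    except UnicodeDecodeError:
        return raw.decode("latin-1").encode("utf-8"), True


# Already UTF-8: passed through untouched, so it is never double-encoded.
print(ensure_utf8("Château".encode("utf-8")))

# Latin-1 input: detected as invalid UTF-8 and converted.
print(ensure_utf8("Château".encode("latin-1")))
```

Validation of this kind is a heuristic: a few Latin-1 byte sequences also happen to be valid UTF-8, so a check like this can occasionally leave a non-UTF-8 value unconverted.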

Changed 11 years ago by ak19

Undid all changes to doc.pm. Now HTMLPlug.pm calls doc.pm's add_utf8_metadata for the URL metadata instead of calling add_metadata. This has the same effect as the changes described above.

Changes to perllib/plugins/HTMLPlug.pm's process subroutine:

$file = &BasPlug::filename_to_metadata($self, $file); # filename with character encoding
my $web_url = "http://$file";
$doc_obj->add_utf8_metadata($cursection, "URL", $web_url); # will eventually ensure it is utf8

This still works for the 2 cases tested:

1. Interlinking html pages where the pages have filenames with special French characters
2. Interlinking html pages where the pages have filenames with no special characters

Changed 11 years ago by kjdon

  • status changed from assigned to closed
  • resolution set to fixed

I think we have come to the end of this. We are using a UTF-8 version for display and a URL-encoded version for filenames.
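The resolution keeps two forms of each filename. As a hedged illustration of that idea (not the code committed for this ticket), the Python sketch below derives both forms from the raw filename bytes. The function name filename_metadata, the dictionary keys and the source_encoding parameter are all hypothetical.

```python
from urllib.parse import quote


def filename_metadata(raw_name: bytes, source_encoding: str = "utf-8") -> dict:
    """Build a display form and a path form from the bytes of a filename."""
    return {
        # Unicode text for presentation, to be written out as UTF-8.
        "display": raw_name.decode(source_encoding),
        # Percent-encoded form of the original bytes, for use in paths and links.
        "path": quote(raw_name),
    }


meta = filename_metadata("Le Château de ma mère.html".encode("utf-8"))
print(meta["display"])  # Le Château de ma mère.html
print(meta["path"])     # Le%20Ch%C3%A2teau%20de%20ma%20m%C3%A8re.html
```

Percent-encoding the untouched bytes means the path form can always be decoded back to the exact name on disk, whatever encoding the filesystem used, while the display form is free to be normalised for the reader.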
