Ticket #667 (closed defect: fixed)

Opened 10 years ago

Last modified 9 years ago

wvware bug?

Reported by: kjdon Owned by: nobody
Priority: moderate Milestone: 2.84 Release
Component: Collection Building: Plugins Severity: major
Keywords: Cc:

Description

See also #430

Mariana Pichinini reported a bug with our current wvWare (0.7.1?), copied below, in which it adds invalid characters to the HTML. She thinks this is fixed in 1.2.4, so can we upgrade to that? I know we are supposed to be moving away from wvWare towards OpenOffice or something else (Apache Tika?), but this might be a small step to improve our doc processing.

Here is her report:

When testing 2.82, we found an issue. We got this:

**** Error is: not well-formed (invalid token) at line 8293, column 33, byte 607053 at /usr/lib/perl5/XML/Parser.pm line 187.

This error was repeated, by our estimate, for about half of all Word-sourced documents (quite a lot, I'd say).

If we go to the line and column given in the message, we find a character there that is not valid UTF-8.
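
A minimal sketch of how such positions could be located directly, without going through the XML parser. The function name `find_invalid_utf8` and the sample bytes are hypothetical, chosen for illustration only; the column is counted here as a 0-based byte offset within the line, which may not match the parser's own convention.

```python
def find_invalid_utf8(data: bytes):
    """Return (line, column, byte_offset) for each byte sequence that is not valid UTF-8.

    line is 1-based; column is a 0-based byte offset within that line.
    """
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as err:
            offset = pos + err.start
            line = data.count(b"\n", 0, offset) + 1
            line_start = data.rfind(b"\n", 0, offset) + 1
            problems.append((line, offset - line_start, offset))
            pos += err.end  # skip the bad bytes and keep scanning
    return problems


if __name__ == "__main__":
    # Hypothetical sample: one Latin-1 byte (0xED) inside otherwise ASCII markup.
    sample = b'<p>ok</p>\n<div name="Cuerpo de texto con sangr\xeda">\n'
    for line, column, offset in find_invalid_utf8(sample):
        print(f"invalid UTF-8 at line {line}, column {column}, byte {offset}")
```

In practice the bytes would come from the generated doc.xml, read in binary mode, rather than from an inline sample.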

Furthermore, the offending lines are always of this kind:

<p><div name="Cuerpo de texto con sangr&#65533;a" align="left" style=" padding: 0.00mm 0.00mm 0.00mm 0.00mm; ">

We know the invalid character belongs to the style section of the original Word DOC, since "Cuerpo de texto con sangría" is the name of a style in the OpenOffice/MS Word world.

So the problem arises when the .doc is imported into the XML file. Note that there is no problem with converting the body of the document. We found all the offending characters in the "name=" HTML attribute of <div> tags, in the <content> section of doc.xml.

We then followed the build process more closely and found that if we replace the bundled wvWare with a soft link to the system version in /usr/bin (which is the wvWare our Greenstone 2.74 uses), the issue goes away. We conclude that the wvWare bundled with 2.82 (v0.7.1) copies every style name into a "name" attribute on the corresponding DIV tag in the resulting HTML, but does NOT convert the encoding when doing so. If the name contains characters outside the ASCII range... ArgH! You get an error, the document is not indexed, and so it cannot be found by any search.

As stated, we are actually using wvWare 1.2.4. The issue seems to be solved because this version does not include any "name" attribute in HTML tags. That said, we cannot confirm whether wvWare 1.2.4, if it did emit "name" attributes or anything similar in the HTML source for whatever reason, would encode them properly, as it does the document content. This could be a bug in wvWare, and in any case it can block building collections from .doc files, hence this report.
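
The failure mode described in this report can be reproduced in miniature. The sketch below assumes the style name reaches the output in a single-byte Windows encoding (cp1252); the report does not say which encoding wvWare actually emits, so that choice, along with the helper names `build_div` and `is_well_formed`, is hypothetical. It uses Python's expat binding rather than Perl's XML::Parser.

```python
import xml.parsers.expat

STYLE_NAME = "Cuerpo de texto con sangría"


def build_div(style_name: str, attr_encoding: str) -> bytes:
    """Build a <div> whose markup is UTF-8 but whose name attribute uses attr_encoding."""
    prefix = '<div name="'.encode("utf-8")
    suffix = '" align="left">texto</div>'.encode("utf-8")
    return prefix + style_name.encode(attr_encoding) + suffix


def is_well_formed(xml_bytes: bytes) -> bool:
    """Return True if expat parses the bytes without error (UTF-8 is assumed by default)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_bytes, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


# Assumption: the attribute was copied without transcoding (cp1252 bytes in a UTF-8 file).
broken = build_div(STYLE_NAME, "cp1252")
# The same attribute, transcoded to UTF-8 like the document body.
fixed = build_div(STYLE_NAME, "utf-8")

print("untranscoded attribute parses:", is_well_formed(broken))
print("transcoded attribute parses:", is_well_formed(fixed))
```

The only difference between the two inputs is the encoding of the attribute value, which is consistent with the observation that the document body converts correctly while the style name does not.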

Change History

Changed 9 years ago by kjdon

  • status changed from new to closed
  • resolution set to fixed

We are using 1.2.4 now, so hopefully this has gone away. We have no files to test with. Also, we now have OpenOffice conversion too.
