|Reported by:||kjdon||Owned by:||nobody|
|Component:||Collection Building: Plugins||Severity:||major|
See also #430
Mariana Pichinini reported a bug with our current wvWare (0.7.1?), copied below, where it adds invalid characters to the HTML. She thinks using 1.2.4 fixes this, so can we upgrade to that? I know we are supposed to be moving away from wvWare to OpenOffice or something else (Apache Tika?), but this might be a small step to improve our doc processing.
Here is her report:
When testing 2.82, we found an issue. We got this:
Error is: not well-formed (invalid token) at line 8293, column 33, byte 607053 at /usr/lib/perl5/XML/Parser.pm line 187.
This error recurred for roughly half of all Word-sourced documents (quite a lot, I'd say).
If we go to the line and column reported in the error, we find a character that is invalid in UTF-8.
Furthermore, the offending lines are always of this kind:
<p><div name="Cuerpo de texto con sangr�a" align="left" style=" padding: 0.00mm 0.00mm 0.00mm 0.00mm; ">
We know the invalid character belongs to the style section of the original Word DOC, since "Cuerpo de texto con sangría" is the name of a style in the OpenOffice/MS Word world.
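The lookup described above (going to the reported position and finding a byte that is not valid UTF-8) can be sketched in Python. This is only an illustration, not part of Greenstone or wvWare: the function name `find_invalid_utf8` is hypothetical, and the column is a 0-based byte offset within the line, which may or may not match how XML::Parser counts columns.

```python
def find_invalid_utf8(data: bytes):
    """Return a list of (line, column, byte_offset) for each invalid UTF-8 sequence.

    line is 1-based; column is a 0-based byte offset within that line
    (an assumption of this sketch, not necessarily the parser's convention).
    """
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; if it succeeds, nothing else is invalid.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            bad = pos + err.start                      # absolute offset of the bad byte
            line = data.count(b"\n", 0, bad) + 1       # newlines before it, plus one
            line_start = data.rfind(b"\n", 0, bad) + 1 # 0 if there is no earlier newline
            problems.append((line, bad - line_start, bad))
            pos += err.end                             # resume after the invalid bytes
    return problems
```

Running it over the raw bytes of a doc.xml file, for example `find_invalid_utf8(open("doc.xml", "rb").read())`, would list every position that needs inspecting rather than only the first one the parser stops at.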
So the problem arises when the .doc is imported into the XML file. Note that the body of the document converts without any issue. The offending characters all sit in a "name=" HTML attribute of the <div> tags, in the <content> section of doc.xml.

We then followed the build process more closely, and found that if we replace the bundled wvWare with a soft link to the system version we have in /usr/bin (which is the wvWare our Greenstone 2.74 actually uses), the issue goes away.

We can now conclude that the wvWare bundled with 2.82 (v0.7.1) copies every style name into a "name" attribute on the corresponding DIV tag in the resulting HTML, but does NOT convert the encoding while doing so. If the "name" contains characters outside plain ASCII, you get the error above, the document is not indexed, and so no search will ever find it.

As stated, the system version we are using is wvWare 1.2.4. The issue seems to be solved there because that version does not put any "name" attribute in HTML tags. That said, we cannot confirm whether wvWare 1.2.4, if it did emit "name" attributes or anything similar in the HTML source for any reason, would encode them properly, as it does with the document content. This could be a bug in wvWare, and in any case it can block building collections from .doc files, hence this report.
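As a stopgap that mimics what the report observes in 1.2.4 (no "name" attribute on the tags), the converter's HTML output could be post-processed to drop the attribute before it reaches the XML parser. The following is a minimal sketch under that assumption. The function `strip_div_names` is hypothetical and is not an existing Greenstone or wvWare hook. It works on raw bytes so that it does not need to know which encoding the style names were written in.

```python
import re

# Matches a <div ...> opening tag up to and including a name="..." attribute.
# Operating on bytes means invalid UTF-8 inside the value is matched like any other byte.
_DIV_NAME = re.compile(rb'(<div\b[^>]*?)\s+name="[^"]*"', re.IGNORECASE)

def strip_div_names(html: bytes) -> bytes:
    """Remove the name="..." attribute from every <div> opening tag."""
    return _DIV_NAME.sub(rb"\1", html)
```

This throws away the style names rather than re-encoding them, which seems acceptable only if nothing downstream relies on them. Upgrading wvWare, as the ticket requests, remains the cleaner fix.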