2013-07-03T21:37:18+12:00

The Basic Word-PDF collection now produces the same number of diffcol errors on Windows as on Linux and Mac. This needed a lot of special processing for Windows: removing the carriage returns introduced into doc.xml when doing a multiread on the HTML version of a PDF doc after it has been converted to HTML. (Similarly, needed to get rid of the Windows carriage returns introduced into the ex.Title meta for pdf01.pdf converted to HTML. This was handled in HTMLPlugin.) Further special tags need either to be ignored, if they're timestamps, or specially handled, if they're filepaths. Not sure whether it's the encoding setting in multiread or maybe the locale that is introducing the carriage returns, but am dealing with this at the point of diffcol since it's not a 'problem' in Greenstone, just an inconsistency across OSes. There's still one diffcol error remaining for this collection on all 3 OSes: one Word document has a different word wrap length on the machine where the model collection was built compared to the wrap length on the other machines. This may be a setting in wvWare or else LibreOffice/StarOffice, if these are used.
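The kind of normalisation described above can be sketched in Perl roughly as follows. This is only an illustration, not the code committed to task.pl: the helper names, the metadata names in the two lists, the `<Metadata name="...">` line layout, and the idea that "special handling" of filepaths means unifying path separators are all assumptions made for the example.

```perl
use strict;
use warnings;

# ASSUMPTION: illustrative metadata names only; the lists actually used by
# diffcol are not shown in this changeset.
my @TIMESTAMP_META = qw(lastmodified lastmodifieddate);
my @FILEPATH_META  = qw(gsdlsourcefilename);

# Hypothetical helper: turn Windows backslashes into forward slashes.
sub to_forward_slashes {
    my ($path) = @_;
    $path =~ tr{\\}{/};
    return $path;
}

# Hypothetical helper: normalise doc.xml text before it is diffed against
# the model collection.
sub normalise_doc_xml {
    my ($text) = @_;

    # Strip carriage returns so Windows output lines up with Linux/Mac output.
    $text =~ s/\r//g;

    # Ignore timestamp metadata: drop the whole line, since the value
    # changes with every build.
    for my $name (@TIMESTAMP_META) {
        $text =~ s{^[ \t]*<Metadata name="\Q$name\E">[^<]*</Metadata>[ \t]*\n?}{}mg;
    }

    # Keep filepath metadata, but make the separators uniform (assumed
    # meaning of "specially handled").
    for my $name (@FILEPATH_META) {
        $text =~ s{(<Metadata name="\Q$name\E">)([^<]*)}
                  { my ($open, $path) = ($1, $2); $open . to_forward_slashes($path) }ge;
    }

    return $text;
}
```

Both the model doc.xml and the freshly built one would be passed through `normalise_doc_xml` before comparison, so that only genuine content differences remain in the diff.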

1 edited


  • other-projects/nightly-tasks/diffcol/trunk/task.pl

    r27725  r27743
    417     417       for my $collection (readdir $collect_handle) {
    418     418       next if ($collection eq "." || $collection eq "..");
    419               ##next if ($collection ne "Small-HTML"); ## TEMPORARY, FOR TESTING THIS SCRIPT
            419       #   next if ($collection ne "Word-PDF-Basic"); ## TEMPORARY, FOR TESTING THIS SCRIPT
    421     421       #escape the filename (in case of space)