Changeset 27990

Timestamp:
06.08.2013 22:20:41
Author:
ak19
Message:

Two fixes: 1. The Tudor collections' HTML source documents contain stray carriage returns (\r) that HTML Tidy does not clean up, so they end up in the Linux doc.xml, which otherwise uses only linefeed characters. The Windows doc.xml, by contrast, was already explicitly processed to turn every carriage-return-linefeed pair (\r\n) into a linefeed by removing the carriage returns. The stray carriage returns therefore survived in the Linux doc.xml but not in the Windows doc.xml, which showed up as differences. Carriage returns are now stripped from the Linux doc.xml as well. 2. Partly truncated ampersand entities in the XML report are now completed, so that nothing breaks when the XSLT is applied during the summarise command that generates the report.
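
The two fixes described in the message can be sketched in isolation as follows. This is a minimal illustration, not the code from diffcol.pl: the helper names `strip_carriage_returns` and `complete_trailing_entity` are hypothetical, and escaping a bare trailing `&` as `&amp;` is an assumption made for this sketch only.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every carriage return, so that CRLF line
# endings and solitary CRs both disappear and only linefeeds remain.
sub strip_carriage_returns {
    my ($text) = @_;
    $text =~ s/\r//g;
    return $text;
}

# Hypothetical helper: if the string was cut off in the middle of an
# &lt; &gt; or &amp; entity, complete that entity.
# Assumption of this sketch: a bare trailing "&" is escaped as "&amp;".
sub complete_trailing_entity {
    my ($text) = @_;
    if ($text =~ /&(lt|gt|amp|am|a|l|g)?\z/) {
        my $partial = defined $1 ? $1 : '';
        my $name = $partial =~ /^l/ ? 'lt'
                 : $partial =~ /^g/ ? 'gt'
                 :                    'amp';
        # Replace the truncated tail with the complete entity.
        $text =~ s/&\Q$partial\E\z/&$name;/;
    }
    return $text;
}

print complete_trailing_entity("x &l"), "\n";
```

Anchoring the match with `\z` means only an entity cut off at the very end of the string is touched; complete entities earlier in the text are left alone.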

Files:
1 modified

  • other-projects/nightly-tasks/diffcol/trunk/diffcol/diffcol.pl

    r27971 r27990
       654    654
       655    655            my $win_contents = $testIsWin ? \$test_contents : \$model_contents;
       656           -
              656  +         my $lin_contents = $testIsWin ? \$model_contents : \$test_contents;
              657  +
       657    658            # remove all carriage returns \r - introduced into doc.xml by multiread after pdf converted to html
       658    659            $$win_contents =~ s@[\r]@@g;
         …      …
       665    666            #FOR MAC: old macs use CR carriage return (see http://www.perlmonks.org/?node_id=745018), so replace with \n?)
       666    667            # $$win_contents =~ s@\r@\n@mg;
              668  +
              669  +         # remove solitary, stray carriage returns \r in the linux doc.xml, as occurs in the tudor collection owing to the source material
              670  +         # containing solitary carriage returns instead of linefeed
              671  +         $$lin_contents =~ s@[\r]@@g; #$$lin_contents =~ s@[\r][^\n]@@g;
       667    672        }
       668    673
         …      …
       862    867
       863    868    # make sure there are no stray ampersands/partial ampersands that need to be completed as < or >
       864         - if($strOutput =~ m/&(.{1,2})?$/) { # &lt => < or &g => >
              869  + if($strOutput =~ m/&(.{1,2})?$/ || $strOutput =~ m/&amp$/) { # &lt => < or &g => > or &a(m)=> & or &amp => &
       865    870    if(defined $1 && $1) {
       866    871        my $rest = $1;
       867         -     if($rest eq "g" || $rest eq "l") {
              872  +     if($rest =~ m/^a/) {
              873  +         $strOutput =~ s@am?p?$@amp;@;
              874  +     }
              875  +     elsif($rest eq "g" || $rest eq "l") {
       868    876        $strOutput .= "t;"; # close the known tag
       869    877        }
       870         -     elsif($1 eq "gt" || $1 eq "lt") {
              878  +     elsif($rest eq "gt" || $rest eq "lt") {
       871    879        $strOutput .= ";";
       872         -     }
              880  +     }
       873    881    } else { # & on its own
       874    882        #$strOutput = substr( $strOutput, 0, 977); # lop off the &