Changeset 34220

Timestamp:
29.06.2020 23:54:16
Author:
ak19
Message:

1. TextPlugin takes care to preserve whitespace formatting when converting txt to html by nesting the text in pre tags. However, for a recent, carefully tab-spaced txt file, the final document produced by GS had lost all of its tabs. It turns out that the tabs were being stripped along with other control chars so that XML::Parser would not choke on them. Tabs are now encoded as entities on their way into doc.xml (and ultimately the html context) instead of being destructively removed.
2. TextPlugin now also skips leading punctuation, not just whitespace, before setting the title meta to the first non-newline sequence of content.
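
The two changes described in the message can be tried in isolation with the sketch below. It is only an illustration: the helper names escape_control_chars and first_title_line are hypothetical and do not exist in Greenstone, while the regular expressions are copied from the added lines of this changeset.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Greenstone) wrapping the two
# substitutions added to docprint.pm in this changeset.
sub escape_control_chars {
    my ($text) = @_;
    # Remove control chars, but leave tab (\x09), line feed (\x0A)
    # and carriage return (\x0D) untouched.
    $text =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F]//g;
    # Encode each tab as a numeric character reference instead of dropping it.
    $text =~ s/\x09/&#09;/g;
    return $text;
}

# Hypothetical helper (not part of Greenstone) wrapping the title match
# added to TextPlugin.pm in this changeset.
sub first_title_line {
    my ($text) = @_;
    # Inside a bracketed class '|' is a literal character, and [:punct:]
    # already includes it, so the '|' here does not act as alternation.
    my ($title) = $text =~ /^[\s|[:punct:]]*([^\n]*)/s;
    return $title;
}

my $escaped = escape_control_chars("a\tb\x01c\n");
my $title   = first_title_line("  -- \"Hello, world\"\nsecond line");

print $escaped;
print "$title\n";
```

Only leading whitespace and punctuation are skipped, so in this example the closing quote stays in the title. Stripping trailing punctuation would need a separate substitution, which this changeset does not add.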

Location:
main/trunk/greenstone2/perllib
Files:
2 modified

  • main/trunk/greenstone2/perllib/docprint.pm

    r32575 r34220
     103  103        # (XML::Parser will barf on anything it doesn't consider to be
     104  104        # valid UTF-8 text, including things like \c@, \cC etc.)
     105       -     $all_text =~ s/[\x00-\x09\x0B\x0C\x0E-\x1F]//g;
     106       -
          105  +     # Will treat tab chars, \x09, as a special case right after this
          106  +     $all_text =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F]//g;
          107  +
          108  +     # $all_text gets written out into an xml context and represents the html version of a doc,
          109  +     # allowing the use of html entities for the tab character (&#09;)
          110  +     # Tabs (ASCII \x09) may be meaningful spacing in such cases whether the html emanated from a
          111  +     # text file, original html or other doc. Particularly when tabs are nested in <pre> tags.
          112  +     # Instead of removing tabs, replacing tabs with their entity reference will allow <pre> tags
          113  +     # to continue preserving any tabs in the final html display.
          114  +     # Hopefully with this, XML::Parser will not choke on tabs, and we get tab stop spaces preserved
          115  +     # in the html output.
          116  +     # This may be the best location to do this replacement and not in TextPlugin, because an html
          117  +     # source doc may contain <pre> elements with tab stops, so then HTMLPlugin would have to do the
          118  +     # replacement too.
          119  +     $all_text =~ s/\x09/&#09;/g;
          120  +
     107  121        return $all_text;
     108  122    }
  • main/trunk/greenstone2/perllib/plugins/TextPlugin.pm

    r31492 r34220
     111  111        $title =~ s/$self->{'title_sub'}//;
     112  112        }
     113       -     $title =~ /^\s*([^\n]*)/s; $title=$1;
          113  +     # A series of spaces and/or punctuation too can be skipped to get at a meaningful title?
          114  +     # https://www.geeksforgeeks.org/perl-special-character-classes-in-regular-expressions/
          115  +     $title =~ /^[\s|[:punct:]]*([^\n]*)/s; $title=$1;
     114  116        $title =~ s/\t/ /g;
     115  117        $title =~ s/\r?\n?$//s; # remove any carriage returns and/or line feeds at line end,