Changeset 34220 for main


Timestamp:
2020-06-29T23:54:16+12:00
Author:
ak19
Message:
  1. TextPlugin takes care to preserve whitespace formatting when converting txt to html by nesting the text in pre tags. For a recent, carefully tab-spaced txt file, however, the final document produced by GS had lost all of its tabs. It turns out the tabs were being stripped so that XML::Parser would not choke on control chars. Tabs are now encoded as entities on their way into doc.xml, and ultimately the html context, instead of being destructively removed.
  2. TextPlugin now also skips opening punctuation, not just spaces, before setting the title meta to the first non-newline sequence of content.
Location:
main/trunk/greenstone2/perllib
Files:
2 edited

  • main/trunk/greenstone2/perllib/docprint.pm

    r32575 r34220
    @@ -103,6 +103,20 @@
         # (XML::Parser will barf on anything it doesn't consider to be
         # valid UTF-8 text, including things like \c@, \cC etc.)
    -    $all_text =~ s/[\x00-\x09\x0B\x0C\x0E-\x1F]//g;
    -
    +    # Will treat tab chars, \x09, as a special case right after this
    +    $all_text =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F]//g;
    +
    +    # $all_text gets written out into an xml context and represents the html version of a doc,
    +    # allowing the use of html entities for the tab character (&#09;)
    +    # Tabs (ASCII \x09) may be meaningful spacing in such cases whether the html emanated from a
    +    # text file, original html or other doc. Particularly when tabs are nested in <pre> tags.
    +    # Instead of removing tabs, replacing tabs with their entity reference will allow <pre> tags
    +    # to continue preserving any tabs in the final html display.
    +    # Hopefully with this, XML::Parser will not choke on tabs, and we get tab stop spaces preserved
    +    # in the html output.
    +    # This may be the best location to do this replacement and not in TextPlugin, because an html
    +    # source doc may contain <pre> elements with tab stops, so then HTMLPlugin would have to do the
    +    # replacement too.
    +    $all_text =~ s/\x09/&#09;/g;
    +
         return $all_text;
     }
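
The effect of the two substitutions in docprint.pm can be tried in isolation. The following is only a minimal sketch: the helper name `escape_control_chars` and the sample input are hypothetical and do not appear in the changeset; only the two regular expressions are taken from the new revision.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of docprint.pm); it wraps the two
# substitutions from r34220 so they can be exercised on their own.
sub escape_control_chars {
    my ($text) = @_;

    # Remove control characters, but leave tab (\x09), line feed (\x0A)
    # and carriage return (\x0D) alone.
    $text =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F]//g;

    # Encode each tab as a numeric character reference instead of deleting it.
    $text =~ s/\x09/&#09;/g;

    return $text;
}

# Made-up sample: a tab between two columns and a stray \x03 control char.
my $sample = "<pre>col1\tcol2\x03</pre>";
print escape_control_chars($sample), "\n";   # prints <pre>col1&#09;col2</pre>
```

With the previous character class (`\x00-\x09`), the tab in the sample would have been deleted along with the `\x03`, leaving `col1col2`.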
  • main/trunk/greenstone2/perllib/plugins/TextPlugin.pm

    r31492 r34220
    @@ -111,5 +111,7 @@
             $title =~ s/$self->{'title_sub'}//;
         }
    -    $title =~ /^\s*([^\n]*)/s; $title=$1;
    +    # A series of spaces and/or punctuation too can be skipped to get at a meaningful title?
    +    # https://www.geeksforgeeks.org/perl-special-character-classes-in-regular-expressions/
    +    $title =~ /^[\s|[:punct:]]*([^\n]*)/s; $title=$1;
         $title =~ s/\t/ /g;
         $title =~ s/\r?\n?$//s; # remove any carriage returns and/or line feeds at line end,
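
The title extraction in TextPlugin.pm can be illustrated the same way. This is a sketch under assumptions: `first_title_line` is a hypothetical wrapper and the sample text is made up. One detail worth noting is that the `|` inside the bracketed class in the changeset is a literal pipe rather than alternation; because `[:punct:]` already covers `|`, the sketch omits it and should match the same inputs.

```perl
use strict;
use warnings;

# Hypothetical wrapper (not part of TextPlugin.pm) around the title regexes.
sub first_title_line {
    my ($text) = @_;

    # Skip leading whitespace and punctuation, then capture everything up to
    # the first newline. The match cannot fail, since every part is optional.
    my ($title) = $text =~ /^[\s[:punct:]]*([^\n]*)/s;

    $title =~ s/\t/ /g;       # turn tabs into single spaces
    $title =~ s/\r?\n?$//s;   # drop a trailing carriage return and/or line feed

    return $title;
}

# Made-up sample: leading spaces, asterisks and an opening quote are skipped.
print first_title_line("  *** \"Chapter\tOne\"\r\nBody text\n"), "\n";
# prints Chapter One"
```

Only leading punctuation is skipped, so the closing quote in the sample stays in the title. Because `\s` also matches newlines, a first line made up solely of punctuation is skipped entirely and the title comes from the next line with real content.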