PDFPlugin.pm

Timestamp:

2018-06-21T21:41:12+12:00

Author:

ak19

Message:

First set of commits implementing the new 'paged_html' output option of PDFPlugin, which uses xpdftools' new pdftohtml. So far tested only on Linux (64 bit); things work there, so I'm optimistically committing the changes.

1. Committing the pre-built Linux binaries of XPDFtools for both 32 and 64 bit, built by the XPDF group.
2. To use the correct bitness variant of xpdftools, setup.bash now exports the BITNESS env var, consulted by gsConvert.pl.
3. All the perl code changes to do with using xpdftools' pdftohtml to generate paged_html and feed it in the desired form into GS(3): gsConvert.pl, PDFPlugin.pm and its parent ConvertBinaryFile.pm have been modified to make it all work.

xpdftools' pdftohtml generates a folder containing an html file and a screenshot for each page in a PDF (as well as an index.html linking to each page's html). However, we want a single html file that contains each individual 'page' html's content in a div, and we need to do some further HTML style, attribute and structure modifications to massage the xpdftools output into what we want for GS. To parse and manipulate the HTML 'DOM' for this, we're using the Mojo::DOM package that Dr Bainbridge found and compiled up. Mojo::DOM is therefore also committed in this revision. Some further changes and some display fixes are required, but I need to check with the others about those first.
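The post-processing described above depends on visiting the per-page html files in page order rather than lexical order (page10.html must come after page2.html), while skipping index.html and the screenshots. A minimal sketch of that selection and ordering step follows. The directory listing is made-up sample data, and `page_number` here is a simplified stand-in for the helper in the diff, not the committed code.

```perl
use strict;
use warnings;

# Hypothetical listing of a "pages" directory (sample data, not real output).
my @listing = ("page10.html", "index.html", "page2.html", "page1.png",
               "page1.html", "page12.html", "page2.png");

# Extract N from "pageN.html" (or .htm); anything else yields undef.
sub page_number {
    my ($name) = @_;
    return $name =~ m/^page(\d+)\.html?$/i ? $1 : undef;
}

# Keep only the per-page html files, then order them numerically by page.
my @pages = sort { page_number($a) <=> page_number($b) }
            grep { defined page_number($_) } @listing;

print join(" ", @pages), "\n";
# prints: page1.html page2.html page10.html page12.html
```

Filtering before sorting means index.html and the image files never reach the comparison, so no placeholder page number is needed for them.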

File:

: 1 edited

main/trunk/greenstone2/perllib/plugins/PDFPlugin.pm (modified) (4 diffs)


main/trunk/greenstone2/perllib/plugins/PDFPlugin.pm

-              r31494
+              r32205
 use strict;
 no strict 'refs'; # so we can use a var for filehandles (e.g. STDERR)
+no strict 'subs'; # allow filehandles to be variables and vice versa
 use ReadTextFile;
 use unicode;
+use Mojo::DOM; # for HTML parsing
 use AutoLoadConverters;
 …
       { 'name' => "text",
     'desc' => "{ConvertBinaryFile.convert_to.text}" },
+      { 'name' => "paged_html",
+    'desc' => "{PDFPlugin.convert_to.paged_html}"},
       { 'name' => "pagedimg_jpg",
     'desc' => "{ConvertBinaryFile.convert_to.pagedimg_jpg}"},
 …
     # check convert_to
+    # TODO: Start supporting PDF to txt on Windows if we're going to be using XPDF Tools (incl pdftotext) on Windows/Linux/Mac
     if ($self->{'convert_to'} eq "text" && $ENV{'GSDLOS'} =~ /^windows$/i) {
     print STDERR "Windows does not support pdf to text. PDFs will be converted to HTML instead\n";
 …
     my ($conv_filename) = @_;
+    my $outhandle=$self->{'outhandle'};
+#    print STDERR "@@@ convert_to: ".$self->{'convert_to'}."\n";
+    if($self->{'convert_to'} eq "paged_html") {
+    # special post-processing for paged_html mode, as HTML pages generated
+    # by xpdf's pdftohtml need to be massaged into the form we want
+    $self->xpdftohtml_convert_post_process($conv_filename);
+    }
+    else { # use PDFPlugin's usual post processing
+    $self->default_convert_post_process($conv_filename);
+    }
+}
+# Called after gsConvert.pl has been run to convert a PDF to paged_html
+# using Xpdftools' pdftohtml
+# This method will do some cleanup of the HTML files produced after XPDF has produced
+# an HTML doc for each PDF page: it first gets rid of the default index.html.
+# Instead, it constructs a single html page containing each original HTML page
+# <body> nested as divs instead, with simple section information inserted at the top
+# of each 'page' <div> and some further styling customisation. This HTML manipulation
+# is to be done with the Mojo::DOM perl package.
+# Note that since xpdf's pdftohtml would have failed if the output dir already
+# existed and for simpler naming, the output files are created in a new "pages"
+# subdirectory of the tmp location parent of $conv_filename instead
+sub xpdftohtml_convert_post_process
+{
+    my $self = shift (@_);
+    my ($output_filename) = @_; # output_filename = tmp location + filename
+    # if a single html were generated.
+    # We just want the tmp location, append "pages", and read all the html files
+    # in except for index.html. Then we create a new html file by name
+    # $output_filename, which will consist of a slightly modified version of
+    # each of the other html files concatenated together.
+    my $outhandle=$self->{'outhandle'};
+    my ($tailname, $tmp_dir, $suffix)
+    = &File::Basename::fileparse($output_filename, "\\.[^\\.]+\$");
+    my $pages_subdir = &FileUtils::filenameConcatenate($tmp_dir, "pages");
+    # Code from util::create_itemfile()
+    # Read in all the files
+    opendir(DIR, $pages_subdir) || die "can't opendir $pages_subdir: $!";
+    my @page_files = grep {-f "$pages_subdir/$_"} readdir(DIR);
+    closedir DIR;
+    # Sort files in the directory by page_num
+    # files are named index.html, page1.html, page2.html, ..., pagen.html
+    sub page_number {
+    my ($dir) = @_;
+    my ($pagenum) =($dir =~ m/^page(\d+)\.html?$/i);
+    $pagenum = 0 unless defined $pagenum; # index.html will be given pagenum=0
+    return $pagenum;
+    }
+    # sort the files in the directory in the order of page_num rather than lexically.
+    @page_files = sort { page_number($a) <=> page_number($b) } @page_files;
+    #my $num_html_pages = (scalar(@page_files) - 1)/2; # skip index file.
+              # For every html file there's an img file, so halve the total num.
+              # What about other file types that may potentially be there too???
+    my $num_html_pages = 0;
+    foreach my $pagefile (@page_files) {
+    $num_html_pages++ if $pagefile =~ m/\.html?$/ && $pagefile !~ /^index\.html?/i;
+    }
+    # Prepare to create our new html page that will contain all the individual
+    # htmls generated by xpdf's pdftohtml in sequence.
+    # First write the opening html tags out to the output file. These are the
+    # same tags and their contents, including <meta>, as is generated by
+    # Xpdf's pdftohtml for each of its individual html pages.
+    my $start_text = "<html>\n<head>\n";
+    $start_text .= "<title>$tailname</title>\n";
+    $start_text .= "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n";
+    $start_text .= "</head>\n<body>\n\n";
+    #handle content encodings the same way that default_convert_post_process does
+    # $self->utf8_write_file ($start_text, $conv_filename); # will close file after write
+    # Don't want to build a giant string in memory of all the pages concatenated
+    # and then write it out in one go. Instead, build up the final single page
+    # by writing each modified paged_html file out to it as this is processed.
+    # Copying file open/close code from CommonUtil::utf8_write_file()
+    if (!open (OUTFILE, ">:utf8", $output_filename)) {
+    gsprintf(STDERR, "PDFPlugin::xpdftohtml_convert_post_process {ConvertToPlug.could_not_open_for_writing} ($!)\n", $output_filename);
+    die "\n";
+    }
+    print OUTFILE $start_text;
+    # Get the contents of each individual HTML page generated by Xpdf, after first
+    # modifying each, and write each out into our single all-encompassing html
+    foreach my $pagefile (@page_files) {
+    if ($pagefile =~ m/\.html?$/ && $pagefile !~ /^index\.html?/i) {
+        my $page_num = page_number($pagefile);
+        # get full path to pagefile
+        $pagefile = &FileUtils::filenameConcatenate($pages_subdir, $pagefile);
+#       print STDERR "@@@ About to process html file $pagefile (num $page_num)\n";
+        my $modified_page_contents = $self->_process_paged_html_page($pagefile, $page_num, $num_html_pages);
+        print OUTFILE "$modified_page_contents\n\n";
+    }
+    }
+    # we've now created a single HTML file by concatenating (a modified version)
+    # of each paged html file
+    print OUTFILE "</body>\n</html>\n"; # write out closing tags
+    close OUTFILE; # done
+    # Get rid of all the htm(l) files incl index.html in the associated "pages"
+    # subdir, since we've now processed them all into a single html file
+    # one folder level up and we don't want HTMLPlugin to process all of them next.
+#    my @fullpath_page_files = map { &FileUtils::filenameConcatenate($pages_subdir, $_) } @page_files;
+    &FileUtils::removeFilesFiltered($pages_subdir, "\\.html?\$"); #  no specific whitelist, but blacklist htm(l)
+    # now the tmp area should contain a single html file containing all the html pages'
+    # contents in sequence, and a "pages" subdir containing the screenshot images
+    # of each page.
+    # HTMLPlugin will process these further in the plugin pipeline
+}
+# For whatever reason, most html <tags> don't get printed out in GLI
+# So when debugging, use this function to print them out as [tags] instead.
+sub _debug_print_html
+{
+    my $self = shift (@_);
+    my ($string_or_dom) = @_;
+    # can't seem to determine type of string with ref/reftype
+    # https://stackoverflow.com/questions/1731333/how-do-i-tell-what-type-of-value-is-in-a-perl-variable
+    # $dom objects appear to get correctly stringified in string contexts
+    # $dom.to_string/$dom.stringify seem to get called, no need to call them
+    # https://stackoverflow.com/questions/5214543/what-is-stringification-in-perl
+    my $escapedTxt = $string_or_dom;
+    $escapedTxt =~ s@\<@[@sg;
+    $escapedTxt =~ s@\>@]@sg;
+    print STDERR "#### $escapedTxt\n";
+}
+# Helper function to read in each paged_html generated by Xpdf's pdftohtml
+# then modify the html suitably using the HTML parsing functions offered by
+# Mojo::DOM, then return the modified HTML content as a string
+# See https://mojolicious.org/perldoc/Mojo/DOM
+sub _process_paged_html_page
+{
+    my $self = shift (@_);
+    my ($pagefile, $page_num, $num_html_pages) = @_;
+    my $text = "";
+    # handling content encoding the same way default_convert_post_process does
+    $self->read_file ($pagefile, "utf8", "", \$text);
+    my $dom = Mojo::DOM->new($text);
+#    $self->_debug_print_html($dom);
+    # there's a <style> element on the <html>, we need to shift it into the <div>
+    # tag that we'll be creating. We'll first slightly modify the <style> element
+    # store the first style element, which is the only one and in the <body>
+    # we'll later insert it as child of an all-encompassing div that we'll create
+#    my $page_style_tag_str = $dom->find('style')->[0]->to_string;
+#    my $page_style_tag_str = $dom->find('html style')->[0]->to_string;
+    my $page_style_tag_str = $dom->at('html')->at('style')->to_string;
+    # In the style tag, convert id style references to class style references
+    my $css_class = ".p".$page_num."f";
+    $page_style_tag_str =~ s@\#f@$css_class@sg;
+    my $style_element = Mojo::DOM->new($page_style_tag_str)->at('style'); # modified
+#$self->_debug_print_html($style_element);
+    # need to know the image's height to set the height of the surrounding
+    # div that's to replace this page's <body>:
+    my $img_height = $dom->find('img')->[0]{height};
+    # 1. Fix up the style attr on the image by additionally setting z-index: -1 for it
+    # 2. Adjust the img#background src attribute to point to the pages subdir for imgs
+    # 3. Set that img tag's class=background, and change its id to background+$page_num
+    my $bg_img_tag=$dom->find('img#background')->[0];
+    my $img_style_str = $bg_img_tag->{style}; # = $dom->find('img#background')->[0]{style}
+    $img_style_str = $img_style_str." z-index: -1;";
+#print STDERR "img_style_str: " . $img_style_str."\n";
+    my $img_src_str = $bg_img_tag->{src};
+    $img_src_str = "pages/$img_src_str";
+    $bg_img_tag->attr({style => $img_style_str, src => $img_src_str}); # reset
+#$self->_debug_print_html($bg_img_tag);
+    # set both class and modified id attributes in one step:
+    $bg_img_tag->attr({class => "background", id => "background".$page_num});
+#$self->_debug_print_html($bg_img_tag);
+    # get all the <span> nested inside <div class="txt"> elements and
+    # 1. set their class attr to be "p + page_num + id-of-the-span",
+    # 2. then delete the id, because the span ids have been reused when element
+    # ids ought to be unique. Which is why we set the modified ids to be the
+    # value of the class attribute instead
+    $dom->find('div.txt span')->each(sub {
+    $_->attr(class => "p". $page_num. $_->{id});
+    delete $_->{id};
+                     }); # both changes done in one find() operation
+#$self->_debug_print_html($dom->find('div.txt span')->last);
+    # Finally can create our new dom, starting with a div tag for the current page
+    # Must be: <div id="page$page_num" style="position: relative; height: ${img_height}px;"/>
+    my $new_dom = Mojo::DOM->new_tag('div', id => "page".$page_num, style => "position: relative; height: ".$img_height."px;" );
+#$self->_debug_print_html($new_dom);
+    $new_dom->at('div')->append_content($style_element)->root;
+    # Append a page range bucket heading if applicable
+    # Dr Bainbridge thinks for now we need only consider PDFs where the
+    # total number of pages < 1000 and create buckets of size 10 (e.g. 1-10, ... 51-60, ...)
+    # If number of remaining pages > 10, then create new bucket heading
+    # e.g. "Pages 31-40"
+    if(($num_html_pages - $page_num) > 10) {
+    # Double-digit page numbers that start with 2
+    # i.e. 21 to 29 (and 30) should be in 21 to 30 range
+    my $start_range = $page_num - ($page_num % 10) + 1;
+    my $end_range = $page_num + 10 - ($page_num % 10);
+    if($page_num % 10 == 0) { # page 20 however, should be in 11 to 20 range
+        $start_range -= 10;
+        $end_range -= 10;
+    }
+    $new_dom->at('div')->append_content($new_dom->new_tag('h1', "Pages ".$start_range . "-" . $end_range))->root;
+    }
+    # Add a simpler heading: just the pagenumber, "Page #"
+    $new_dom->at('div')->append_content($new_dom->new_tag('h2', "Page ".$page_num))->root;
+#$self->_debug_print_html($new_dom);
+    # Copy across all the old html's body tag's child nodes into the new dom's new div tag
+    $dom->at('body')->child_nodes->each(sub { $new_dom->at('div')->append_content($_)}); #$_->to_string
+#$self->_debug_print_html($new_dom);
+    # Finished processing a single html page of the paged_html output generated by
+    # Xpdf's pdftohtml: finished massaging that single html page into the right form
+    return $new_dom->to_string;
+}
+# This subroutine is called to do the PDFPlugin post-processing for all cases
+# except the "paged_html" conversion mode. This is what PDFPlugin always used to do:
+sub default_convert_post_process
+{
+    my $self = shift (@_);
+    my ($conv_filename) = @_;
     my $outhandle=$self->{'outhandle'};
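The page-range heading logic in `_process_paged_html_page` maps a page number to its ten-page bucket, with a special case for multiples of 10. The sketch below shows an equivalent way to compute the same range without the special case. `page_bucket` is a hypothetical helper name used only for illustration and is not part of PDFPlugin.pm.

```perl
use strict;
use warnings;

# Return the (start, end) of the ten-page bucket containing $page_num,
# e.g. pages 11..20 share one bucket. Hypothetical helper for illustration.
sub page_bucket {
    my ($page_num) = @_;
    # Shift to zero-based so that page 20 falls in the 11-20 bucket.
    my $start = int(($page_num - 1) / 10) * 10 + 1;
    return ($start, $start + 9);
}

for my $n (1, 10, 11, 20, 21) {
    my ($start, $end) = page_bucket($n);
    print "Page $n -> Pages $start-$end\n";
}
```

Subtracting 1 before the integer division gives the same ranges as the modulo arithmetic in the diff (20 maps to 11-20, 21 maps to 21-30).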
