Changeset 32289

Timestamp:

2018-07-18T21:11:21+12:00

Author:

ak19

Message:

The PDFPlugin is being deprecated, since the PDFv1 and PDFv2 plugins are replacing it. PDFPlugin itself will remain available for migrating users, but it contained code added to support paged_html/xpdftools before the plugin was refactored into v1 and v2. This commit removes all the xpdftools/paged_html related changes added to the deprecated PDFPlugin since revision 31494, so that it goes back to using just the old pdftohtml tool and the pdf-box extension; the xpdftools code has been moved into PDFv2Plugin. Some other changes made since revision 31494, such as deprecation messages and the use of translation strings instead of hardcoded English strings, remain in PDFPlugin.pm as they are generally relevant.

File:

1 edited

main/trunk/greenstone2/perllib/plugins/PDFPlugin.pm (modified) (7 diffs)


main/trunk/greenstone2/perllib/plugins/PDFPlugin.pm

-              r32277
+              r32289
 use ReadTextFile;
 use unicode;
-use Mojo::DOM; # for HTML parsing
 use AutoLoadConverters;
 …
       { 'name' => "text",
     'desc' => "{ConvertBinaryFile.convert_to.text}" },
-      { 'name' => "paged_html",
-    'desc' => "{PDFPlugin.convert_to.paged_html}"},
       { 'name' => "pagedimg_jpg",
     'desc' => "{ConvertBinaryFile.convert_to.pagedimg_jpg}"},
 …
        'desc' => "{PDFPlugin.zoom}",
        'deft' => "2",
-#       'range' => "1,3", # actually the range is 0.5-3
-       'type' => "string" },
+       'range' => "1,3", # actually the range is 0.5-3
+       'type' => "int" },
      { 'name' => "use_sections",
        'desc' => "{PDFPlugin.use_sections}",
 …
     elsif ($self->{'convert_to'} eq "auto") {
     # choose html ?? is this the best option
-    $self->{'convert_to'} = "paged_html";
+    $self->{'convert_to'} = "html";
+    }
     if ($self->{'use_realistic_book'}) {
 …
         push(@$specific_options, "-use_realistic_book");
+    }
-        if($self->{'convert_to'} eq "paged_html") { # for paged html, the default should be to sectionalise on headings the single superpage containing divs representing individual pages as section
-            push(@$specific_options, "sectionalise_using_h_tags");
+        }
+    }
     elsif ($secondary_plugin_name eq "PagedImagePlugin") {
 …
+}
-# Overriding to do some extra handling for paged_html output mode
-sub run_conversion_command {
-    my $self = shift (@_);
-    my ($tmp_dirname, $tmp_inputPDFname, $utf8_tailname, $lc_suffix, $tailname, $suffix) = @_;
-    if($self->{'convert_to'} ne "paged_html") {
-    return $self->ConvertBinaryFile::run_conversion_command(@_);
+    }
-    # if output mode is paged_html, we use Xpdf tools' pdftohtml and tell it
-    # to create a subdir called "pages" in the tmp area to puts its products
-    # in there. (Xpdf's pdftohtml needs to be passed a *non-existent* directory
-    # parameter, the "pages" subdir). If Xpdf's pdftohtml has successfully run,
-    # the intermediary output file tmp/<random-num>/pages/index.html should
-    # exist (besides other output products there)
-    # We let ConvertBinaryFile proceed normally, but the return value should reflect
-    # that on success it should expect the intermediary product tmpdir/pages/index.html
-    # (which is the product of xpdftohtml conversion).
-    my $output_filename = $self->ConvertBinaryFile::run_conversion_command(@_);
-    $output_filename = &FileUtils::filenameConcatenate($tmp_dirname, "pages", "index.html");
-    # However, when convert_post_process() is done, it should have output the final
-    # product of the paged_html conversion: an html file of the same name and in the
-    # same tmp location as the input PDF file.
-    my ($name_prefix, $output_dir, $ext)
-    = &File::Basename::fileparse($tmp_inputPDFname, "\\.[^\\.]+\$");
-    $self->{'conv_filename_after_post_process'} = &FileUtils::filenameConcatenate($output_dir, $name_prefix.".html");
-#    print STDERR "@@@@@ final paged html file will be: " . $self->{'conv_filename_after_post_process'} . "\n";
-    return $output_filename;
+}
 sub convert_post_process
+{
 …
     my ($conv_filename) = @_;
-    my $outhandle=$self->{'outhandle'};
-    if($self->{'convert_to'} eq "paged_html") {
-    # special post-processing for paged_html mode, as HTML pages generated
-    # by xpdf's pdftohtml need to be massaged into the form we want
-    $self->xpdftohtml_convert_post_process($conv_filename);
+    }
-    else { # use PDFPlugin's usual post processing
-    $self->default_convert_post_process($conv_filename);
+    }
+}
-# Called after gsConvert.pl has been run to convert a PDF to paged_html
-# using Xpdftools' pdftohtml
-# This method will do some cleanup of the HTML files produced after XPDF has produced
-# an HTML doc for each PDF page: it first gets rid of the default index.html.
-# Instead, it constructs a single html page containing each original HTML page
-# <body> nested as divs instead, with simple section information inserted at the top
-# of each 'page' <div> and some further styling customisation. This HTML manipulation
-# is to be done with the Mojo::DOM perl package.
-# Note that since xpdf's pdftohtml would have failed if the output dir already
-# existed and for simpler naming, the output files are created in a new "pages"
-# subdirectory of the tmp location parent of $conv_filename instead
-sub xpdftohtml_convert_post_process
+{
-    my $self = shift (@_);
-    my ($pages_index_html) = @_; # = tmp/<rand>/pages/index.html for paged_html output mode
-    my $output_filename = $self->{'conv_filename_after_post_process'};
-    # Read in all the html files in tmp's "pages" subdir, except for index.html.
-    # and use it to create a new html file called $self->{'conv_filename_after_post_process'}
-    # which will consist of a slightly modified version of
-    # each of the other html files concatenated together.
-    my $outhandle=$self->{'outhandle'};
-    my ($tailname, $pages_subdir, $suffix)
-    = &File::Basename::fileparse($pages_index_html, "\\.[^\\.]+\$");
-    # Code from util::create_itemfile()
-    # Read in all the files
-    opendir(DIR, $pages_subdir) || die "can't opendir $pages_subdir: $!";
-    my @page_files = grep {-f "$pages_subdir/$_"} readdir(DIR);
-    closedir DIR;
-    # Sort files in the directory by page_num
-    # files are named index.html, page1.html, page2.html, ..., pagen.html
-    sub page_number {
-    my ($dir) = @_;
-    my ($pagenum) =($dir =~ m/^page(\d+)\.html?$/i);
-    $pagenum = 0 unless defined $pagenum; # index.html will be given pagenum=0
-    return $pagenum;
+    }
-    # sort the files in the directory in the order of page_num rather than lexically.
-    @page_files = sort { page_number($a) <=> page_number($b) } @page_files;
-    #my $num_html_pages = (scalar(@page_files) - 1)/2; # skip index file.
-              # For every html file there's an img file, so halve the total num.
-              # What about other file types that may potentially be there too???
-    my $num_html_pages = 0;
-    foreach my $pagefile (@page_files) {
-    $num_html_pages++ if $pagefile =~ m/\.html?$/ && $pagefile !~ /^index\.html?/i;
+    }
-    # Prepare to create our new html page that will contain all the individual
-    # htmls generated by xpdf's pdftohtml in sequence.
-    # First write the opening html tags out to the output file. These are the
-    # same tags and their contents, including <meta>, as is generated by
-    # Xpdf's pdftohtml for each of its individual html pages.
-    my $start_text = "<html>\n<head>\n";
-    my ($output_tailname, $tmp_subdir, $html_suffix)
-    = &File::Basename::fileparse($output_filename, "\\.[^\\.]+\$");
-    $start_text .= "<title>$output_tailname</title>\n";
-    $start_text .= "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n";
-    $start_text .= "</head>\n<body>\n\n";
-    $start_text .= "<h1>$output_tailname</h1>\n\n";
-    #handle content encodings the same way that default_convert_post_process does
-    # $self->utf8_write_file ($start_text, $conv_filename); # will close file after write
-    # Don't want to build a giant string in memory of all the pages concatenated
-    # and then write it out in one go. Instead, build up the final single page
-    # by writing each modified paged_html file out to it as this is processed.
-    # Copying file open/close code from CommonUtil::utf8_write_file()
-    if (!open (OUTFILE, ">:utf8", $output_filename)) {
-    gsprintf(STDERR, "PDFPlugin::xpdftohtml_convert_post_process {CommonUtil.could_not_open_for_writing} ($!)\n", $output_filename);
-    die "\n";
+    }
-    print OUTFILE $start_text;
-    # Get the contents of each individual HTML page generated by Xpdf, after first
-    # modifying each, and write each out into our single all-encompassing html
-    foreach my $pagefile (@page_files) {
-    if ($pagefile =~ m/\.html?$/ && $pagefile !~ /^index\.html?/i) {
-        my $page_num = page_number($pagefile);
-        # get full path to pagefile
-        $pagefile = &FileUtils::filenameConcatenate($pages_subdir, $pagefile);
-#       print STDERR "@@@ About to process html file $pagefile (num $page_num)\n";
-        my $modified_page_contents = $self->_process_paged_html_page($pagefile, $page_num, $num_html_pages);
-        print OUTFILE "$modified_page_contents\n\n";
+    }
+    }
-    # we've now created a single HTML file by concatenating (a modified version)
-    # of each paged html file
-    print OUTFILE "</body>\n</html>\n"; # write out closing tags
-    close OUTFILE; # done
-    # Get rid of all the htm(l) files incl index.html in the associated "pages"
-    # subdir, since we've now processed them all into a single html file
-    # one folder level up and we don't want HTMLPlugin to process all of them next.
-    &FileUtils::removeFilesFiltered($pages_subdir, "\.html?\$"); #  no specific whitelist, but blacklist htm(l)
-    # now the tmp area should contain a single html file contain all the html pages'
-    # contents in sequence, and a "pages" subdir containing the screenshot images
-    # of each page.
-    # HTMLPlugin will process these further in the plugin pipeline
+}
-# For whatever reason, most html <tags> don't get printed out in GLI
-# So when debugging, use this function to print them out as [tags] instead.
-sub _debug_print_html
+{
-    my $self = shift (@_);
-    my ($string_or_dom) = @_;
-    # can't seem to determine type of string with ref/reftype
-    # https://stackoverflow.com/questions/1731333/how-do-i-tell-what-type-of-value-is-in-a-perl-variable
-    # Not needed, as $dom objects seem to get correctly stringified in string contexts
-    # $dom.to_string/$dom.stringify seem to get called, no need to call them
-    # https://stackoverflow.com/questions/5214543/what-is-stringification-in-perl
-    my $escapedTxt = $string_or_dom;
-    $escapedTxt =~ s@\<@[@sg;
-    $escapedTxt =~ s@\>@]@sg;
-    print STDERR "#### $escapedTxt\n";
+}
-# Helper function to read in each paged_html generated by Xpdf's pdftohtml
-# then modify the html suitably using the HTML parsing functions offered by
-# Mojo::DOM, then return the modified HTML content as a string
-# See https://mojolicious.org/perldoc/Mojo/DOM
-sub _process_paged_html_page
+{
-    my $self = shift (@_);
-    my ($pagefile, $page_num, $num_html_pages) = @_;
-    my $text = "";
-    # handling content encoding the same way default_convert_post_process does
-    $self->read_file ($pagefile, "utf8", "", \$text);
-    my $dom = Mojo::DOM->new($text);
-#    $self->_debug_print_html($dom);
-    # there's a <style> element on the <html>, we need to shift it into the <div>
-    # tag that we'll be creating. We'll first slightly modify the <style> element
-    # store the first style element, which is the only one and in the <body>
-    # we'll later insert it as child of an all-encompassing div that we'll create
-    my $page_style_tag_str = $dom->at('html')->at('style')->to_string;
-    # In the style tag, convert id style references to class style references
-    my $css_class = ".p".$page_num."f";
-    $page_style_tag_str =~ s@\#f@$css_class@sg;
-    my $style_element = Mojo::DOM->new($page_style_tag_str)->at('style'); # modified
-#$self->_debug_print_html($style_element);
-    # need to know the image's height to set the height of the surrounding
-    # div that's to replace this page's <body>:
-    my $img_height = $dom->find('img')->[0]{height};
-    # 2. Adjust the img#background src attribute to point to the pages subdir for imgs
-    # 3. Set that img tag's class=background, and change its id to background+$page_num
-    my $bg_img_tag=$dom->find('img#background')->[0];
-    my $img_src_str = $bg_img_tag->{src};
-    $img_src_str = "pages/$img_src_str";
-    $bg_img_tag->attr(src => $img_src_str); # reset
-#$self->_debug_print_html($bg_img_tag);
-    # set both class and modified id attributes in one step:
-    $bg_img_tag->attr({class => "background", id => "background".$page_num});
-#$self->_debug_print_html($bg_img_tag);
-    # get all the <span> nested inside <div class="txt"> elements and
-    # 1. set their class attr to be "p + page_num + id-of-the-span",
-    # 2. then delete the id, because the span ids have been reused when element
-    # ids ought to be unique. Which is why we set the modified ids to be the
-    # value of the class attribute instead
-    $dom->find('div.txt span')->each(sub {
-    $_->attr(class => "p". $page_num. $_->{id});
-    delete $_->{id};
-                     }); # both changes done in one find() operation
-#$self->_debug_print_html($dom->find('div.txt span')->last);
-    # Finally can create our new dom, starting with a div tag for the current page
-    # Must be: <div id="$page_num" style="position:relative; height:$img_height;"/>
-#    my $new_dom = Mojo::DOM->new_tag('div', id => "page".$page_num, style => "position: relative; height: ".$img_height."px;" )
-    my $new_dom = Mojo::DOM->new_tag('div', style => "position: relative; height: ".$img_height."px;" );
-#$self->_debug_print_html($new_dom);
-    $new_dom->at('div')->append_content($style_element)->root;
-#$self->_debug_print_html($new_dom);
-    # Copy across all the old html's body tag's child nodes into the new dom's new div tag
-    $dom->at('body')->child_nodes->each(sub { $new_dom->at('div')->append_content($_)}); #$_->to_string
-#$self->_debug_print_html($new_dom);
-    # build up the outer div with the <h>tags for sectionalising
-    my $inner_div_str = $new_dom->to_string;
-    my $page_div = "<div id=\"page".$page_num."\">\n";
-    # Append a page range bucket heading if applicable: if we have more than 10 pages
-    # to display in the current bucket AND we're on the first page of each bucket of 10 pages.
-    # Dr Bainbridge thinks for now we need only consider PDFs where the
-    # total number of pages < 1000 and create buckets of size 10 (e.g. 1-10, ... 51-60, ...)
-    # If number of remaining pages >= 10, then create new bucket heading
-    # e.g. "Pages 30-40"
-    if(($page_num % 10) == 1 && ($num_html_pages - $page_num) > 10) {
-    # Double-digit page numbers that start with 2
-    # i.e. 21 to 29 (and 30) should be in 21 to 30 range
-    my $start_range = $page_num - ($page_num % 10) + 1;
-    my $end_range = $page_num + 10 - ($page_num % 10);
-    $page_div .= "<h2 style=\"font-size:1em;font-weight:normal;\">Pages ".$start_range . "-" . $end_range."</h2>\n";
+    }
-    # No sectionalising for 10 pages or under. Otherwise, every page is a section too, not just buckets
-    if($num_html_pages > 10) {
-        # Whether we're starting a new bucket or not, add a simpler heading: just the pagenumber, "Page #"
-        $page_div .= "<h3 style=\"font-size:1em;font-weight:normal;\">Page ".$page_num."</h3>\n";
+    }
-    $page_div .= $inner_div_str;
-    $page_div .= "\n</div>";
-    # Finished processing a single html page of the paged_html output generated by
-    # Xpdf's pdftohtml: finished massaging that single html page into the right form
-    return $page_div;
+}
-# This subroutine is called to do the PDFPlugin post-processing for all cases
-# except the "paged_html" conversion mode. This is what PDFPlugin always used to do:
-sub default_convert_post_process
+{
-    my $self = shift (@_);
-    my ($conv_filename) = @_;
     my $outhandle=$self->{'outhandle'};
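
The removed paged_html code relied on two small pieces of logic that are easy to misread in diff form: sorting Xpdf's page files numerically rather than lexically, and deciding when a page opens a new "Pages N-M" bucket. The following is a minimal standalone sketch of both, not the plugin's actual code path. `page_number` mirrors the removed helper, while `bucket_heading` is a hypothetical name introduced here to isolate the heading condition, and the sample file names are made up for illustration.

```perl
use strict;
use warnings;

# Extract the numeric page index from names like "page12.html".
# Anything else (e.g. index.html) is treated as page 0, so it sorts first.
sub page_number {
    my ($filename) = @_;
    my ($pagenum) = ($filename =~ m/^page(\d+)\.html?$/i);
    return defined $pagenum ? $pagenum : 0;
}

# Hypothetical helper: return the bucket label for a page that opens a
# bucket of 10, or undef when no bucket heading applies. The condition
# and the range arithmetic are copied from the removed code.
sub bucket_heading {
    my ($page_num, $num_html_pages) = @_;
    return undef
        unless ($page_num % 10) == 1 && ($num_html_pages - $page_num) > 10;
    my $start_range = $page_num - ($page_num % 10) + 1;
    my $end_range   = $page_num + 10 - ($page_num % 10);
    return "Pages $start_range-$end_range";
}

# Illustrative file names only; a lexical sort would put page10 before page2.
my @page_files = sort { page_number($a) <=> page_number($b) }
    qw(page10.html page2.html index.html page1.html);
print join(" ", @page_files), "\n";

# Page 21 of a 45-page document opens the 21-30 bucket.
my $heading = bucket_heading(21, 45);
print defined $heading ? $heading : "(no bucket heading)", "\n";
```

Note that the code's condition requires strictly more than 10 pages to remain after the current page, so in a 45-page document page 41 gets no bucket heading. The removed comment describes the threshold as ">= 10", which does not quite match the code.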
