greenstone2

Timestamp:

2018-06-21T21:41:12+12:00 (6 years ago)

Author:

ak19

Message:

First set of commits implementing the new 'paged_html' output option of PDFPlugin, which uses xpdftools' new pdftohtml. So far tested only on Linux (64 bit), but things work there, so I'm optimistically committing the changes.
1. Committing the pre-built Linux binaries of XPDFtools for both 32 and 64 bit, as built by the XPDF group.
2. To use the correct bitness variant of xpdftools, setup.bash now exports the BITNESS env var, which gsConvert.pl consults.
3. All the perl code changes needed to use xpdftools' pdftohtml to generate paged_html and feed it in the desired form into GS(3): gsConvert.pl, PDFPlugin.pm and its parent ConvertBinaryFile.pm have been modified to make it all work.
xpdftools' pdftohtml generates a folder containing an html file and a screenshot for each page in a PDF (as well as an index.html linking to each page's html). However, we want a single html file that contains each individual 'page' html's content in a div, and we need some further HTML style, attribute and structure modifications to massage the xpdftools output into what we want for GS. To parse and manipulate the HTML 'DOM' for this, we're using the Mojo::DOM package that Dr Bainbridge found and compiled up. Mojo::DOM is therefore also committed in this revision. Some further changes and some display fixes are required, but I need to check with the others about those.

Location:

main/trunk/greenstone2

Files:

211 added
5 edited

bin/linux/xpdf-tools (added)
bin/linux/xpdf-tools/ANNOUNCE (added)
bin/linux/xpdf-tools/CHANGES (added)
bin/linux/xpdf-tools/COPYING (added)
bin/linux/xpdf-tools/COPYING3 (added)
bin/linux/xpdf-tools/INSTALL (added)
bin/linux/xpdf-tools/README (added)
bin/linux/xpdf-tools/bin32 (added)
bin/linux/xpdf-tools/bin32/pdfdetach (added)
bin/linux/xpdf-tools/bin32/pdffonts (added)
bin/linux/xpdf-tools/bin32/pdfimages (added)
bin/linux/xpdf-tools/bin32/pdfinfo (added)
bin/linux/xpdf-tools/bin32/pdftohtml (added)
bin/linux/xpdf-tools/bin32/pdftopng (added)
bin/linux/xpdf-tools/bin32/pdftoppm (added)
bin/linux/xpdf-tools/bin32/pdftops (added)
bin/linux/xpdf-tools/bin32/pdftotext (added)
bin/linux/xpdf-tools/bin64 (added)
bin/linux/xpdf-tools/bin64/pdfdetach (added)
bin/linux/xpdf-tools/bin64/pdffonts (added)
bin/linux/xpdf-tools/bin64/pdfimages (added)
bin/linux/xpdf-tools/bin64/pdfinfo (added)
bin/linux/xpdf-tools/bin64/pdftohtml (added)
bin/linux/xpdf-tools/bin64/pdftopng (added)
bin/linux/xpdf-tools/bin64/pdftoppm (added)
bin/linux/xpdf-tools/bin64/pdftops (added)
bin/linux/xpdf-tools/bin64/pdftotext (added)
bin/linux/xpdf-tools/doc (added)
bin/linux/xpdf-tools/doc/pdfdetach.1 (added)
bin/linux/xpdf-tools/doc/pdffonts.1 (added)
bin/linux/xpdf-tools/doc/pdfimages.1 (added)
bin/linux/xpdf-tools/doc/pdfinfo.1 (added)
bin/linux/xpdf-tools/doc/pdftohtml.1 (added)
bin/linux/xpdf-tools/doc/pdftopng.1 (added)
bin/linux/xpdf-tools/doc/pdftoppm.1 (added)
bin/linux/xpdf-tools/doc/pdftops.1 (added)
bin/linux/xpdf-tools/doc/pdftotext.1 (added)
bin/linux/xpdf-tools/doc/sample-xpdfrc (added)
bin/linux/xpdf-tools/doc/xpdfrc.5 (added)
bin/script/gsConvert.pl (modified) (3 diffs)
perllib/cpan/Mojo (added)
perllib/cpan/Mojo.pm (added)
perllib/cpan/Mojo/Asset (added)
perllib/cpan/Mojo/Asset.pm (added)
perllib/cpan/Mojo/Asset/File.pm (added)
perllib/cpan/Mojo/Asset/Memory.pm (added)
perllib/cpan/Mojo/Base.pm (added)
perllib/cpan/Mojo/ByteStream.pm (added)
perllib/cpan/Mojo/Cache.pm (added)
perllib/cpan/Mojo/Collection.pm (added)
perllib/cpan/Mojo/Content (added)
perllib/cpan/Mojo/Content.pm (added)
perllib/cpan/Mojo/Content/MultiPart.pm (added)
perllib/cpan/Mojo/Content/Single.pm (added)
perllib/cpan/Mojo/Cookie (added)
perllib/cpan/Mojo/Cookie.pm (added)
perllib/cpan/Mojo/Cookie/Request.pm (added)
perllib/cpan/Mojo/Cookie/Response.pm (added)
perllib/cpan/Mojo/DOM (added)
perllib/cpan/Mojo/DOM.pm (added)
perllib/cpan/Mojo/DOM/CSS.pm (added)
perllib/cpan/Mojo/DOM/HTML.pm (added)
perllib/cpan/Mojo/Date.pm (added)
perllib/cpan/Mojo/EventEmitter.pm (added)
perllib/cpan/Mojo/Exception.pm (added)
perllib/cpan/Mojo/File.pm (added)
perllib/cpan/Mojo/Headers.pm (added)
perllib/cpan/Mojo/HelloWorld.pm (added)
perllib/cpan/Mojo/Home.pm (added)
perllib/cpan/Mojo/IOLoop (added)
perllib/cpan/Mojo/IOLoop.pm (added)
perllib/cpan/Mojo/IOLoop/Client.pm (added)
perllib/cpan/Mojo/IOLoop/Delay.pm (added)
perllib/cpan/Mojo/IOLoop/Server.pm (added)
perllib/cpan/Mojo/IOLoop/Stream (added)
perllib/cpan/Mojo/IOLoop/Stream.pm (added)
perllib/cpan/Mojo/IOLoop/Stream/HTTPClient.pm (added)
perllib/cpan/Mojo/IOLoop/Stream/HTTPServer.pm (added)
perllib/cpan/Mojo/IOLoop/Stream/WebSocketClient.pm (added)
perllib/cpan/Mojo/IOLoop/Stream/WebSocketServer.pm (added)
perllib/cpan/Mojo/IOLoop/Subprocess.pm (added)
perllib/cpan/Mojo/IOLoop/TLS.pm (added)
perllib/cpan/Mojo/IOLoop/resources (added)
perllib/cpan/Mojo/IOLoop/resources/server.crt (added)
perllib/cpan/Mojo/IOLoop/resources/server.key (added)
perllib/cpan/Mojo/JSON (added)
perllib/cpan/Mojo/JSON.pm (added)
perllib/cpan/Mojo/JSON/Pointer.pm (added)
perllib/cpan/Mojo/Loader.pm (added)
perllib/cpan/Mojo/Log.pm (added)
perllib/cpan/Mojo/Message (added)
perllib/cpan/Mojo/Message.pm (added)
perllib/cpan/Mojo/Message/Request.pm (added)
perllib/cpan/Mojo/Message/Response.pm (added)
perllib/cpan/Mojo/Parameters.pm (added)
perllib/cpan/Mojo/Path.pm (added)
perllib/cpan/Mojo/Promise.pm (added)
perllib/cpan/Mojo/Reactor (added)
perllib/cpan/Mojo/Reactor.pm (added)
perllib/cpan/Mojo/Reactor/EV.pm (added)
perllib/cpan/Mojo/Reactor/Poll.pm (added)
perllib/cpan/Mojo/Server (added)
perllib/cpan/Mojo/Server.pm (added)
perllib/cpan/Mojo/Server/CGI.pm (added)
perllib/cpan/Mojo/Server/Daemon.pm (added)
perllib/cpan/Mojo/Server/Hypnotoad.pm (added)
perllib/cpan/Mojo/Server/Morbo (added)
perllib/cpan/Mojo/Server/Morbo.pm (added)
perllib/cpan/Mojo/Server/Morbo/Backend (added)
perllib/cpan/Mojo/Server/Morbo/Backend.pm (added)
perllib/cpan/Mojo/Server/Morbo/Backend/Poll.pm (added)
perllib/cpan/Mojo/Server/PSGI.pm (added)
perllib/cpan/Mojo/Server/Prefork.pm (added)
perllib/cpan/Mojo/Template.pm (added)
perllib/cpan/Mojo/Transaction (added)
perllib/cpan/Mojo/Transaction.pm (added)
perllib/cpan/Mojo/Transaction/HTTP.pm (added)
perllib/cpan/Mojo/Transaction/WebSocket.pm (added)
perllib/cpan/Mojo/URL.pm (added)
perllib/cpan/Mojo/Upload.pm (added)
perllib/cpan/Mojo/UserAgent (added)
perllib/cpan/Mojo/UserAgent.pm (added)
perllib/cpan/Mojo/UserAgent/CookieJar.pm (added)
perllib/cpan/Mojo/UserAgent/Proxy.pm (added)
perllib/cpan/Mojo/UserAgent/Server.pm (added)
perllib/cpan/Mojo/UserAgent/Transactor.pm (added)
perllib/cpan/Mojo/Util.pm (added)
perllib/cpan/Mojo/WebSocket.pm (added)
perllib/cpan/Mojolicious (added)
perllib/cpan/Mojolicious.pm (added)
perllib/cpan/Mojolicious/Command (added)
perllib/cpan/Mojolicious/Command.pm (added)
perllib/cpan/Mojolicious/Command/cgi.pm (added)
perllib/cpan/Mojolicious/Command/cpanify.pm (added)
perllib/cpan/Mojolicious/Command/daemon.pm (added)
perllib/cpan/Mojolicious/Command/eval.pm (added)
perllib/cpan/Mojolicious/Command/generate (added)
perllib/cpan/Mojolicious/Command/generate.pm (added)
perllib/cpan/Mojolicious/Command/generate/app.pm (added)
perllib/cpan/Mojolicious/Command/generate/lite_app.pm (added)
perllib/cpan/Mojolicious/Command/generate/makefile.pm (added)
perllib/cpan/Mojolicious/Command/generate/plugin.pm (added)
perllib/cpan/Mojolicious/Command/get.pm (added)
perllib/cpan/Mojolicious/Command/inflate.pm (added)
perllib/cpan/Mojolicious/Command/prefork.pm (added)
perllib/cpan/Mojolicious/Command/psgi.pm (added)
perllib/cpan/Mojolicious/Command/routes.pm (added)
perllib/cpan/Mojolicious/Command/test.pm (added)
perllib/cpan/Mojolicious/Command/version.pm (added)
perllib/cpan/Mojolicious/Commands.pm (added)
perllib/cpan/Mojolicious/Controller.pm (added)
perllib/cpan/Mojolicious/Guides (added)
perllib/cpan/Mojolicious/Guides.pod (added)
perllib/cpan/Mojolicious/Guides/Contributing.pod (added)
perllib/cpan/Mojolicious/Guides/Cookbook.pod (added)
perllib/cpan/Mojolicious/Guides/FAQ.pod (added)
perllib/cpan/Mojolicious/Guides/Growing.pod (added)
perllib/cpan/Mojolicious/Guides/Rendering.pod (added)
perllib/cpan/Mojolicious/Guides/Routing.pod (added)
perllib/cpan/Mojolicious/Guides/Testing.pod (added)
perllib/cpan/Mojolicious/Guides/Tutorial.pod (added)
perllib/cpan/Mojolicious/Lite.pm (added)
perllib/cpan/Mojolicious/Plugin (added)
perllib/cpan/Mojolicious/Plugin.pm (added)
perllib/cpan/Mojolicious/Plugin/Config.pm (added)
perllib/cpan/Mojolicious/Plugin/DefaultHelpers.pm (added)
perllib/cpan/Mojolicious/Plugin/EPLRenderer.pm (added)
perllib/cpan/Mojolicious/Plugin/EPRenderer.pm (added)
perllib/cpan/Mojolicious/Plugin/HeaderCondition.pm (added)
perllib/cpan/Mojolicious/Plugin/JSONConfig.pm (added)
perllib/cpan/Mojolicious/Plugin/Mount.pm (added)
perllib/cpan/Mojolicious/Plugin/PODRenderer.pm (added)
perllib/cpan/Mojolicious/Plugin/TagHelpers.pm (added)
perllib/cpan/Mojolicious/Plugins.pm (added)
perllib/cpan/Mojolicious/Renderer.pm (added)
perllib/cpan/Mojolicious/Routes (added)
perllib/cpan/Mojolicious/Routes.pm (added)
perllib/cpan/Mojolicious/Routes/Match.pm (added)
perllib/cpan/Mojolicious/Routes/Pattern.pm (added)
perllib/cpan/Mojolicious/Routes/Route.pm (added)
perllib/cpan/Mojolicious/Sessions.pm (added)
perllib/cpan/Mojolicious/Static.pm (added)
perllib/cpan/Mojolicious/Types.pm (added)
perllib/cpan/Mojolicious/Validator (added)
perllib/cpan/Mojolicious/Validator.pm (added)
perllib/cpan/Mojolicious/Validator/Validation.pm (added)
perllib/cpan/Mojolicious/resources (added)
perllib/cpan/Mojolicious/resources/public (added)
perllib/cpan/Mojolicious/resources/public/favicon.ico (added)
perllib/cpan/Mojolicious/resources/public/mojo (added)
perllib/cpan/Mojolicious/resources/public/mojo/failraptor.png (added)
perllib/cpan/Mojolicious/resources/public/mojo/jquery (added)
perllib/cpan/Mojolicious/resources/public/mojo/jquery/jquery.js (added)
perllib/cpan/Mojolicious/resources/public/mojo/logo-black-2x.png (added)
perllib/cpan/Mojolicious/resources/public/mojo/logo-black.png (added)
perllib/cpan/Mojolicious/resources/public/mojo/logo-white-2x.png (added)
perllib/cpan/Mojolicious/resources/public/mojo/logo-white.png (added)
perllib/cpan/Mojolicious/resources/public/mojo/noraptor.png (added)
perllib/cpan/Mojolicious/resources/public/mojo/notfound.png (added)
perllib/cpan/Mojolicious/resources/public/mojo/pinstripe-dark.png (added)
perllib/cpan/Mojolicious/resources/public/mojo/pinstripe-light.png (added)
perllib/cpan/Mojolicious/resources/public/mojo/prettify (added)
perllib/cpan/Mojolicious/resources/public/mojo/prettify/prettify-mojo-dark.css (added)
perllib/cpan/Mojolicious/resources/public/mojo/prettify/prettify-mojo-light.css (added)
perllib/cpan/Mojolicious/resources/public/mojo/prettify/run_prettify.js (added)
perllib/cpan/Mojolicious/resources/templates (added)
perllib/cpan/Mojolicious/resources/templates/mojo (added)
perllib/cpan/Mojolicious/resources/templates/mojo/debug.html.ep (added)
perllib/cpan/Mojolicious/resources/templates/mojo/exception.html.ep (added)
perllib/cpan/Mojolicious/resources/templates/mojo/menubar.html.ep (added)
perllib/cpan/Mojolicious/resources/templates/mojo/not_found.html.ep (added)
perllib/cpan/Mojolicious/resources/templates/mojo/perldoc.html.ep (added)
perllib/plugins/ConvertBinaryFile.pm (modified) (3 diffs)
perllib/plugins/PDFPlugin.pm (modified) (4 diffs)
perllib/strings.properties (modified) (2 diffs)
setup.bash (modified) (1 diff)

main/trunk/greenstone2/bin/script/gsConvert.pl

-              r30724
+              r32205
     # Attempt conversion to HTML
-    if (!$output_type || ($output_type =~ m/html/i)) {
+    # Uses the old pdftohtml that doesn't work for newer PDF versions
+    #if ($output_type =~ m/^html/i) {
+    if (!$output_type || ($output_type =~ m/^html/i)) {
     $success = &pdf_to_html($dirname, $input_filename, $output_filestem);
     if ($success) {
         return "html";
+    }
+    }
+    # Attempt conversion to (paged) HTML using the newer pdftohtml of Xpdftools. This
+    # will be the new default for PDFs when output_type for PDF docs is not specified
+    # (once our use of xpdftools' pdftohtml has been implemented on win and mac).
+    if ($output_type =~ m/paged_html/i) {
+    #if (!$output_type || ($output_type =~ m/paged_html/i)) {
+    $success = &xpdf_to_html($dirname, $input_filename, $output_filestem);
+    if ($success) {
+        return "paged_html";
+    }
+    }
 …
-# Convert a pdf file to html with the pdftohtml command
+# Convert a pdf file to html with the old pdftohtml command
+# which only works for older PDF versions
 sub pdf_to_html {
     my ($dirname, $input_filename, $output_filestem) = @_;
 …
     return 1;
+}
+# Convert a pdf file to html with the newer Xpdftools' pdftohtml
+# This generates "paged HTML" where extracted, selectable text is positioned
+# over screenshots of each page.
+# Since xpdf's pdftohtml fails if the output dir already exists and for easier
+# naming, the output files are created in a "pages" subdirectory of the tmp
+# location parent of $output_filestem instead
+sub xpdf_to_html {
+    my ($dirname, $input_filename, $output_filestem) = @_;
+    my $cmd = "";
+    # build up the path to the doc-to-html conversion tool we're going to use
+    my $xpdf_pdftohtml = &FileUtils::filenameConcatenate($ENV{'GSDLHOME'}, "bin", $ENV{'GSDLOS'}, "xpdf-tools");
+    if ($ENV{'GSDLOS'} =~ m/^windows$/i) {
+    # TODO
+    } elsif ($ENV{'GSDLOS'} =~ m/^darwin$/i) {
+    # TODO
+    } else { # unix, use the appropriate bin folder for the bitness of the system
+    # Don't use $ENV{'GSDLARCH'}, use the new $ENV{'BITNESS'}, since
+    # $ENV{'GSDLARCH'} is only (meant to be) set when many other 32-bit or 64-bit
+    # specific subdirectories exist in a greenstone installation.
+    # None of those locations need exist when xpdf-tools is installed with GS.
+    # So don't depend on GSDLARCH as forcing that to be exported has side-effects
+    if($ENV{'BITNESS'}) {
+        $xpdf_pdftohtml = &FileUtils::filenameConcatenate($xpdf_pdftohtml, "bin".$ENV{'BITNESS'});
+    } else { # what if $ENV{'BITNESS'} undefined, fallback on bin32? or 64?
+        $xpdf_pdftohtml = &FileUtils::filenameConcatenate($xpdf_pdftohtml, "bin32");
+    }
+    }
+    # We'll create the file by name $output_filestem during post-conversion processing.
+    # Note that Xpdf tools will only create its conversion products in a dir that does
+    # not yet exist. So we'll create this location as a subdir of the output_filestem's
+    # parent directory. The parent dir is the already generated tmp area for conversion. So:
+    # - tmpdir gs2build/tmp/<random-num> already exists at this stage
+    # - We'll create gs2build/tmp/<rand>/output_filestem.html later, during post-processing
+    # - For now, XPdftools will create gs2build/tmp/<rand>/pages and put its products in there.
+    my ($tailname, $tmp_dirname, $suffix)
+    = &File::Basename::fileparse($output_filestem, "\\.[^\\.]+\$");
+    $tmp_dirname = &FileUtils::filenameConcatenate($tmp_dirname, "pages");
+    $xpdf_pdftohtml = &FileUtils::filenameConcatenate($xpdf_pdftohtml, "pdftohtml");
+    # xpdf's pdftohtml tool also takes a zoom factor, where a zoom of 1 is 100%
+    $cmd .= "\"$xpdf_pdftohtml\"";
+    $cmd .= " -z $pdf_zoom" if ($pdf_zoom);
+#    $cmd .= " -c" if ($pdf_complex);
+#    $cmd .= " -i" if ($pdf_ignore_images);
+#    $cmd .= " -a" if ($pdf_allow_images_only);
+#    $cmd .= " -hidden" unless ($pdf_nohidden);
+    $cmd .= " \"$input_filename\" \"$tmp_dirname\"";
+    #$cmd .= " \"$input_filename\" \"$output_filestem\"";
+    if ($ENV{'GSDLOS'} !~ m/^windows$/i || $is_winnt_2000) {
+    $cmd .= " > \"$output_filestem.out\" 2> \"$output_filestem.err\"";
+    } else {
+    $cmd .= " > \"$output_filestem.err\"";
+    }
+    #print STDERR "@@@@ Running command: $cmd\n";
+    $!=0;
+    my $retval=system($cmd);
+    if ($retval!=0)
+    {
+    print STDERR "Error executing xpdf's pdftohtml tool";
+    if ($!) {print STDERR ": $!";}
+    print STDERR "\n";
+    }
+    # make sure the converter made something
+    if ($retval!=0 || ! -s &FileUtils::filenameConcatenate($tmp_dirname,"index.html"))
+    {
+    &FileUtils::removeFiles("$output_filestem.out") if (-e "$output_filestem.out");
+    # print out the converter's std err, if any
+    if (-s "$output_filestem.err") {
+        open (ERRLOG, "$output_filestem.err") || die "$!";
+        print STDERR "pdftohtml error log:\n";
+        while (<ERRLOG>) {
+        print STDERR "$_";
+        }
+        close ERRLOG;
+    }
+    #print STDERR "***********output filestem $output_filestem.html\n";
+    &FileUtils::removeFiles("$tmp_dirname") if (-d "$tmp_dirname");
+    if (-e "$output_filestem.err") {
+        if ($faillogfile ne "" && defined(open(FAILLOG,">>$faillogfile")))
+        {
+        open (ERRLOG, "$output_filestem.err");
+        while (<ERRLOG>) {print FAILLOG $_;}
+        close ERRLOG;
+        close FAILLOG;
+        }
+        &FileUtils::removeFiles("$output_filestem.err");
+    }
+    return 0;
+    }
+    &FileUtils::removeFiles("$output_filestem.err") if (-e "$output_filestem.err");
+    &FileUtils::removeFiles("$output_filestem.out") if (-e "$output_filestem.out");
+    return 1;
+}
 # Convert a pdf file to various types of image with the convert command
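The way xpdf_to_html picks its output directory and assembles its command line can be sketched in isolation. This is a minimal sketch, not the committed code: the helper names pages_dir_for and build_xpdf_cmd and all the sample paths are hypothetical, and File::Spec stands in for FileUtils::filenameConcatenate. Only the -z zoom option and the "pages" subdirectory come from the diff above.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Hypothetical helper: derive the not-yet-existing "pages" directory that sits
# next to the final output file, inside the per-conversion tmp directory.
sub pages_dir_for {
    my ($output_filestem) = @_;
    my ($tailname, $tmp_dirname, $suffix) = fileparse($output_filestem, qr/\.[^.]+$/);
    return File::Spec->catdir($tmp_dirname, "pages");
}

# Hypothetical helper: assemble the shell command for Xpdf's pdftohtml.
sub build_xpdf_cmd {
    my ($tool, $input, $outdir, $log_stem, $zoom) = @_;
    my $cmd = "\"$tool\"";
    # -z takes a zoom factor where 1 means 100%; left out when no zoom is set
    $cmd .= " -z $zoom" if $zoom;
    $cmd .= " \"$input\" \"$outdir\"";
    # keep stdout and stderr in separate files next to the output stem
    $cmd .= " > \"$log_stem.out\" 2> \"$log_stem.err\"";
    return $cmd;
}

# Sample (made-up) locations, for illustration only
my $stem  = "/tmp/123/in";
my $pages = pages_dir_for($stem);
print build_xpdf_cmd("/opt/gs/bin/linux/xpdf-tools/bin64/pdftohtml",
                     "/tmp/in.pdf", $pages, $stem, 1.5), "\n";
```

Keeping the directory derivation separate from the command string makes it easier to check that the tool is always pointed at a directory it can create itself, which the comments in the diff say pdftohtml requires.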

main/trunk/greenstone2/perllib/plugins/ConvertBinaryFile.pm

-              r31766
+              r32205
+    }
-    if ($convert_to =~ /^html/) { # may be html or html_multi
+    if ($convert_to =~ /^html/ || $convert_to eq "paged_html") { # may be html or html_multi, or paged_html with the new Xpdf's own pdftohtml
     $self->{'convert_to_plugin'} = "HTMLPlugin";
     $self->{'convert_to_ext'} = "html";
 …
         $output_filename = $tmp_dirname . "\/$utf8_tailname\/" . $utf8_tailname . ".$output_type";
+    }
+    } elsif ($output_type eq "paged_html") {
+    $output_filename =~ s/$lc_suffix$/.html/;
     } else {
     $output_filename =~ s/$lc_suffix$/.$output_type/;
 …
     if ("$conv_filename" eq "") {return -1;} # had an error, will be passed down pipeline
-    if (! -e "$conv_filename") {return -1;}
+    # We used to return -1 here if $conv_filename didn't exist at this stage
+    # However, for "paged_html" convert_to mode, the converted HTML file $conv_filename
+    # will only be created from conversion products *after* convert_post_process() returns
+    my $output_type=$self->{'convert_to'};
+    if ($output_type ne "paged_html" && ! -e "$conv_filename") {return -1;}
     $self->{'conv_filename'} = $conv_filename;
     $self->convert_post_process($conv_filename);
+    if ($output_type eq "paged_html" && ! -e "$conv_filename") {return -1;}
     # Run the "fribidi" (http://fribidi.org) Unicode Bidirectional Algorithm program over the converted file
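The reordered existence check for paged_html can be illustrated on its own. The following is a sketch under assumptions: check_conversion is a hypothetical stand-in for the relevant lines of ConvertBinaryFile.pm, the code ref passed in plays the role of convert_post_process(), and returning 0 on success is chosen only for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical stand-in for the check order shown in the diff above.
sub check_conversion {
    my ($output_type, $conv_filename, $post_process) = @_;
    return -1 if $conv_filename eq "";
    # Other modes: the converter itself must already have produced the file
    return -1 if $output_type ne "paged_html" && !-e $conv_filename;
    $post_process->($conv_filename);
    # paged_html: the single HTML file only exists once post-processing has run
    return -1 if $output_type eq "paged_html" && !-e $conv_filename;
    return 0;
}

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/doc.html";
# Post-processing step that creates the output file, as the paged_html path does
my $make_file = sub {
    open(my $fh, '>', $_[0]) or die "cannot write $_[0]: $!";
    print $fh "<html></html>\n";
    close $fh;
};

print check_conversion("paged_html", $file, $make_file), "\n";  # prints 0
unlink $file;
print check_conversion("html", $file, $make_file), "\n";        # prints -1
```

The second call fails because, for every mode other than paged_html, the file is expected to exist before post-processing starts.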

main/trunk/greenstone2/perllib/plugins/PDFPlugin.pm

-              r31494
+              r32205
 use strict;
 no strict 'refs'; # so we can use a var for filehandles (e.g. STDERR)
+no strict 'subs'; # allow filehandles to be variables and viceversa
 use ReadTextFile;
 use unicode;
+use Mojo::DOM; # for HTML parsing
 use AutoLoadConverters;
 …
       { 'name' => "text",
     'desc' => "{ConvertBinaryFile.convert_to.text}" },
+      { 'name' => "paged_html",
+    'desc' => "{PDFPlugin.convert_to.paged_html}"},
       { 'name' => "pagedimg_jpg",
     'desc' => "{ConvertBinaryFile.convert_to.pagedimg_jpg}"},
 …
     # check convert_to
+    # TODO: Start supporting PDF to txt on Windows if we're going to be using XPDF Tools (incl pdftotext) on Windows/Linux/Mac
     if ($self->{'convert_to'} eq "text" && $ENV{'GSDLOS'} =~ /^windows$/i) {
     print STDERR "Windows does not support pdf to text. PDFs will be converted to HTML instead\n";
 …
     my ($conv_filename) = @_;
+    my $outhandle=$self->{'outhandle'};
+#    print STDERR "@@@ convert_to: ".$self->{'convert_to'}."\n";
+    if($self->{'convert_to'} eq "paged_html") {
+    # special post-processing for paged_html mode, as HTML pages generated
+    # by xpdf's pdftohtml need to be massaged into the form we want
+    $self->xpdftohtml_convert_post_process($conv_filename);
+    }
+    else { # use PDFPlugin's usual post processing
+    $self->default_convert_post_process($conv_filename);
+    }
+}
+# Called after gsConvert.pl has been run to convert a PDF to paged_html
+# using Xpdftools' pdftohtml
+# This method will do some cleanup of the HTML files produced after XPDF has produced
+# an HTML doc for each PDF page: it first gets rid of the default index.html.
+# Instead, it constructs a single html page containing each original HTML page
+# <body> nested as divs instead, with simple section information inserted at the top
+# of each 'page' <div> and some further styling customisation. This HTML manipulation
+# is to be done with the Mojo::DOM perl package.
+# Note that since xpdf's pdftohtml would have failed if the output dir already
+# existed and for simpler naming, the output files are created in a new "pages"
+# subdirectory of the tmp location parent of $conv_filename instead
+sub xpdftohtml_convert_post_process
+{
+    my $self = shift (@_);
+    my ($output_filename) = @_; # output_filename = tmp location + filename
+    # if a single html were generated.
+    # We just want the tmp location, append "pages", and read all the html files
+    # in except for index.html. Then we create a new html file by name
+    # $output_filename, which will consist of a slightly modified version of
+    # each of the other html files concatenated together.
+    my $outhandle=$self->{'outhandle'};
+    my ($tailname, $tmp_dir, $suffix)
+    = &File::Basename::fileparse($output_filename, "\\.[^\\.]+\$");
+    my $pages_subdir = &FileUtils::filenameConcatenate($tmp_dir, "pages");
+    # Code from util::create_itemfile()
+    # Read in all the files
+    opendir(DIR, $pages_subdir) || die "can't opendir $pages_subdir: $!";
+    my @page_files = grep {-f "$pages_subdir/$_"} readdir(DIR);
+    closedir DIR;
+    # Sort files in the directory by page_num
+    # files are named index.html, page1.html, page2.html, ..., pagen.html
+    sub page_number {
+    my ($dir) = @_;
+    my ($pagenum) =($dir =~ m/^page(\d+)\.html?$/i);
+    $pagenum = 0 unless defined $pagenum; # index.html will be given pagenum=0
+    return $pagenum;
+    }
+    # sort the files in the directory in the order of page_num rather than lexically.
+    @page_files = sort { page_number($a) <=> page_number($b) } @page_files;
+    #my $num_html_pages = (scalar(@page_files) - 1)/2; # skip index file.
+              # For every html file there's an img file, so halve the total num.
+              # What about other file types that may potentially be there too???
+    my $num_html_pages = 0;
+    foreach my $pagefile (@page_files) {
+    $num_html_pages++ if $pagefile =~ m/\.html?$/ && $pagefile !~ /^index\.html?/i;
+    }
+    # Prepare to create our new html page that will contain all the individual
+    # htmls generated by xpdf's pdftohtml in sequence.
+    # First write the opening html tags out to the output file. These are the
+    # same tags and their contents, including <meta>, as is generated by
+    # Xpdf's pdftohtml for each of its individual html pages.
+    my $start_text = "<html>\n<head>\n";
+    $start_text .= "<title>$tailname</title>\n";
+    $start_text .= "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n";
+    $start_text .= "</head>\n<body>\n\n";
+    #handle content encodings the same way that default_convert_post_process does
+    # $self->utf8_write_file ($start_text, $conv_filename); # will close file after write
+    # Don't want to build a giant string in memory of all the pages concatenated
+    # and then write it out in one go. Instead, build up the final single page
+    # by writing each modified paged_html file out to it as this is processed.
+    # Copying file open/close code from CommonUtil::utf8_write_file()
+    if (!open (OUTFILE, ">:utf8", $output_filename)) {
+    gsprintf(STDERR, "PDFPlugin::xpdftohtml_convert_post_process {ConvertToPlug.could_not_open_for_writing} ($!)\n", $output_filename);
+    die "\n";
+    }
+    print OUTFILE $start_text;
+    # Get the contents of each individual HTML page generated by Xpdf, after first
+    # modifying each, and write each out into our single all-encompassing html
+    foreach my $pagefile (@page_files) {
+    if ($pagefile =~ m/\.html?$/ && $pagefile !~ /^index\.html?/i) {
+        my $page_num = page_number($pagefile);
+        # get full path to pagefile
+        $pagefile = &FileUtils::filenameConcatenate($pages_subdir, $pagefile);
+#       print STDERR "@@@ About to process html file $pagefile (num $page_num)\n";
+        my $modified_page_contents = $self->_process_paged_html_page($pagefile, $page_num, $num_html_pages);
+        print OUTFILE "$modified_page_contents\n\n";
+    }
+    }
+    # we've now created a single HTML file by concatenating (a modified version)
+    # of each paged html file
+    print OUTFILE "</body>\n</html>\n"; # write out closing tags
+    close OUTFILE; # done
+    # Get rid of all the htm(l) files incl index.html in the associated "pages"
+    # subdir, since we've now processed them all into a single html file
+    # one folder level up and we don't want HTMLPlugin to process all of them next.
+#    my @fullpath_page_files = map { &FileUtils::filenameConcatenate($pages_subdir, $_) } @page_files;
+    &FileUtils::removeFilesFiltered($pages_subdir, "\.html?\$"); #  no specific whitelist, but blacklist htm(l)
+    # now the tmp area should contain a single html file contain all the html pages'
+    # contents in sequence, and a "pages" subdir containing the screenshot images
+    # of each page.
+    # HTMLPlugin will process these further in the plugin pipeline
+}
+# For whatever reason, most html <tags> don't get printed out in GLI
+# So when debugging, use this function to print them out as [tags] instead.
+sub _debug_print_html
+{
+    my $self = shift (@_);
+    my ($string_or_dom) = @_;
+    # can't seem to determine type of string with ref/reftype
+    # https://stackoverflow.com/questions/1731333/how-do-i-tell-what-type-of-value-is-in-a-perl-variable
+    # $dom objects appear to get correctly stringified in string contexts
+    # $dom.to_string/$dom.stringify seem to get called, no need to call them
+    # https://stackoverflow.com/questions/5214543/what-is-stringification-in-perl
+    my $escapedTxt = $string_or_dom;
+    $escapedTxt =~ s@\<@[@sg;
+    $escapedTxt =~ s@\>@]@sg;
+    print STDERR "#### $escapedTxt\n";
+}
+# Helper function to read in each paged_html generated by Xpdf's pdftohtml
+# then modify the html suitably using the HTML parsing functions offered by
+# Mojo::DOM, then return the modified HTML content as a string
+# See https://mojolicious.org/perldoc/Mojo/DOM
+sub _process_paged_html_page
+{
+    my $self = shift (@_);
+    my ($pagefile, $page_num, $num_html_pages) = @_;
+    my $text = "";
+    # handling content encoding the same way default_convert_post_process does
+    $self->read_file ($pagefile, "utf8", "", \$text);
+    my $dom = Mojo::DOM->new($text);
+#    $self->_debug_print_html($dom);
+    # there's a <style> element on the <html>, we need to shift it into the <div>
+    # tag that we'll be creating. We'll first slightly modify the <style> element
+    # store the first style element, which is the only one and in the <body>
+    # we'll later insert it as child of an all-encompassing div that we'll create
+#    my $page_style_tag_str = $dom->find('style')->[0]->to_string;
+#    my $page_style_tag_str = $dom->find('html style')->[0]->to_string;
+    my $page_style_tag_str = $dom->at('html')->at('style')->to_string;
+    # In the style tag, convert id style references to class style references
+    my $css_class = ".p".$page_num."f";
+    $page_style_tag_str =~ s@\#f@$css_class@sg;
+    my $style_element = Mojo::DOM->new($page_style_tag_str)->at('style'); # modified
+#$self->_debug_print_html($style_element);
+    # need to know the image's height to set the height of the surrounding
+    # div that's to replace this page's <body>:
+    my $img_height = $dom->find('img')->[0]{height};
+    # 1. Fix up the style attr on the image by additionally setting z-index=-1 for it
+    # 2. Adjust the img#background src attribute to point to the pages subdir for imgs
+    # 3. Set that img tag's class=background, and change its id to background+$page_num
+    my $bg_img_tag=$dom->find('img#background')->[0];
+    my $img_style_str = $bg_img_tag->{style}; # = $dom->find('img#background')->[0]{style}
+    $img_style_str = $img_style_str." z-index=-1;";
+#print STDERR "img_style_str: " . $img_style_str."\n";
+    my $img_src_str = $bg_img_tag->{src};
+    $img_src_str = "pages/$img_src_str";
+    $bg_img_tag->attr({style => $img_style_str, src => $img_src_str}); # reset
+#$self->_debug_print_html($bg_img_tag);
+    # set both class and modified id attributes in one step:
+    $bg_img_tag->attr({class => "background", id => "background".$page_num});
+#$self->_debug_print_html($bg_img_tag);
+    # get all the <span> nested inside <div class="txt"> elements and
+    # 1. set their class attr to be "p + page_num + id-of-the-span",
+    # 2. then delete the id, because the span ids have been reused when element
+    # ids ought to be unique. Which is why we set the modified ids to be the
+    # value of the class attribute instead
+    $dom->find('div.txt span')->each(sub {
+    $_->attr(class => "p". $page_num. $_->{id});
+    delete $_->{id};
+                     }); # both changes done in one find() operation
+#$self->_debug_print_html($dom->find('div.txt span')->last);
+    # Finally can create our new dom, starting with a div tag for the current page
+    # Must be: <div id="$page_num" style="position:relative; height:$img_height;"/>
+    my $new_dom = Mojo::DOM->new_tag('div', id => "page".$page_num, style => "position: relative; height: ".$img_height."px;" );
+#$self->_debug_print_html($new_dom);
+    $new_dom->at('div')->append_content($style_element)->root;
+    # Append a page range bucket heading if applicable
+    # Dr Bainbridge thinks for now we need only consider PDFs where the
+    # total number of pages < 1000 and create buckets of size 10 (e.g. 1-10, ... 51-60, ...)
+    # If number of remaining pages >= 10, then create new bucket heading
+    # e.g. "Pages 30-40"
+    if(($num_html_pages - $page_num) > 10) {
+    # Double-digit page numbers that start with 2
+    # i.e. 21 to 29 (and 30) should be in 21 to 30 range
+    my $start_range = $page_num - ($page_num % 10) + 1;
+    my $end_range = $page_num + 10 - ($page_num % 10);
+    if($page_num % 10 == 0) { # page 20 however, should be in 11 to 20 range
+        $start_range -= 10;
+        $end_range -= 10;
+    }
+    $new_dom->at('div')->append_content($new_dom->new_tag('h1', "Pages ".$start_range . "-" . $end_range))->root;
+    }
+    # Add a simpler heading: just the pagenumber, "Page #"
+    $new_dom->at('div')->append_content($new_dom->new_tag('h2', "Page ".$page_num))->root;
+#$self->_debug_print_html($new_dom);
+    # Copy across all the old html's body tag's child nodes into the new dom's new div tag
+    $dom->at('body')->child_nodes->each(sub { $new_dom->at('div')->append_content($_)}); #$_->to_string
+#$self->_debug_print_html($new_dom);
+    # Finished processing a single html page of the paged_html output generated by
+    # Xpdf's pdftohtml: finished massaging that single html page into the right form
+    return $new_dom->to_string;
+}
+# This subroutine is called to do the PDFPlugin post-processing for all cases
+# except the "paged_html" conversion mode. This is what PDFPlugin always used to do:
+sub default_convert_post_process
+{
+    my $self = shift (@_);
+    my ($conv_filename) = @_;
     my $outhandle=$self->{'outhandle'};
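Two pieces of page-number handling in the PDFPlugin.pm changes are easy to try out in isolation: ordering the pageN.html files numerically rather than lexically, and working out which 10-page range a page falls into. The sketch below assumes a made-up file list, and page_bucket is a hypothetical helper that uses a more compact formula than the committed code; for positive page numbers it should yield the same ranges as the start_range/end_range arithmetic in the diff.

```perl
use strict;
use warnings;

# Extract N from "pageN.html"; anything else (e.g. index.html) counts as 0
sub page_number {
    my ($file) = @_;
    my ($n) = ($file =~ m/^page(\d+)\.html?$/i);
    return defined $n ? $n : 0;
}

# Hypothetical helper: the 10-page bucket containing $page_num, as (start, end),
# so that pages 21..30 all map to (21, 30).
sub page_bucket {
    my ($page_num) = @_;
    my $start = $page_num - (($page_num - 1) % 10);
    return ($start, $start + 9);
}

# Made-up directory listing, for illustration only
my @files = qw(page10.html index.html page2.html page1.png page1.html);

# Keep only the per-page HTML files, then sort by page number
my @pages = sort { page_number($a) <=> page_number($b) }
            grep { m/\.html?$/i && !m/^index\.html?$/i } @files;

print "@pages\n";                    # page1.html page2.html page10.html
printf "%d-%d\n", page_bucket(20);   # 11-20
printf "%d-%d\n", page_bucket(21);   # 21-30
```

A plain lexical sort would place page10.html before page2.html, which is why the numeric comparison matters once a PDF has ten or more pages.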

main/trunk/greenstone2/perllib/strings.properties

-              r32112
+              r32205
 PDFPlugin.complex:Create more complex output. With this option set the output html will look much more like the original PDF file. For this to function properly you Ghostscript installed (for *nix gs should be on your path while for windows you must have gswin32c.exe on your path).
+PDFPlugin.convert_to.paged_html:A series of HTML pages, one for each page. Each HTML page contains selectable text positionally overlaid on top of a screenshot of the PDF page background comprising any images, tables and drawings. Generated with Xpdf tools.
 PDFPlugin.desc:Plugin that processes PDF documents.
 …
 PDFPlugin.use_sections:Create a separate section for each page of the PDF file.
-PDFPlugin.zoom:The factor by which to zoom the PDF for output (this is only useful if -complex is set).
+PDFPlugin.zoom:The factor by which to zoom the PDF for output. When not outputting as paged_html, then zoom is only useful if -complex is set. If output is as paged_html, then a zoom factor of 1 means 100 percent.
 PostScriptPlugin.desc:This is a \"poor man's\" ps to text converter. If you are serious, consider using the PRESCRIPT package, which is available for download at http://www.nzdl.org/html/software.html

main/trunk/greenstone2/setup.bash

-              r32013
+              r32205
       ;;
   esac
+  # for xpdf tools, need to know whether we're using the bin32 or bin64 folder
+  BITNESS=$GSDLARCH
+  export BITNESS
   # Only want non-trival GSDLARCH value set if there is evidence of
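On the Perl side, the exported BITNESS value decides which xpdf-tools folder is used. A minimal sketch of that lookup follows, assuming a hypothetical helper xpdf_bin_dir and a made-up install path, with File::Spec used in place of FileUtils::filenameConcatenate. The fallback to bin32 when BITNESS is unset mirrors the committed gsConvert.pl code.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: choose the xpdf-tools bin directory on Linux.
# Falls back to bin32 when BITNESS is unset or empty.
sub xpdf_bin_dir {
    my ($gsdlhome, $gsdlos, $bitness) = @_;
    my $subdir = $bitness ? "bin$bitness" : "bin32";
    return File::Spec->catdir($gsdlhome, "bin", $gsdlos, "xpdf-tools", $subdir);
}

# "/opt/greenstone2" is a made-up install location for illustration
print xpdf_bin_dir("/opt/greenstone2", "linux", $ENV{BITNESS}), "\n";
```

Passing the environment value in as an argument, rather than reading %ENV inside the helper, keeps the choice easy to exercise for both the 32 and 64 bit cases.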
