gsConvert.pl

Timestamp:

2018-06-21T21:41:12+12:00

Author:

ak19

Message:

First set of commits implementing the new 'paged_html' output option of PDFPlugin, which uses xpdftools' new pdftohtml. So far tested only on 64-bit Linux, where everything works, so I'm optimistically committing the changes.

1. Committing the pre-built 32-bit and 64-bit Linux binaries of XPDFtools, as built by the XPDF group.
2. To use the correct bitness variant of xpdftools, setup.bash now exports the BITNESS env var, which gsConvert.pl consults.
3. All the Perl code changes for using xpdftools' pdftohtml to generate paged_html and feed it in the desired form into GS(3): gsConvert.pl, PDFPlugin.pm and its parent ConvertBinaryFile.pm have been modified to make it all work.

xpdftools' pdftohtml generates a folder containing an HTML file and a screenshot for each page of a PDF (as well as an index.html linking to each page's HTML). However, we want a single HTML file that holds each individual page's HTML content in a div, so further HTML style, attribute and structure modifications are needed to massage the xpdftools output into what we want for GS. To parse and manipulate the HTML 'DOM' for this, we're using the Mojo::DOM package that Dr Bainbridge found and compiled. Mojo::DOM is therefore also committed in this revision. Some further changes and display fixes are still required, but I need to check with the others about those first.
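The page-merging step described in the message can be illustrated with a minimal sketch. This is not the code from PDFPlugin.pm: the helper name `merge_pages`, the `page<N>` ids and the `pdf-page` class are hypothetical, and a simple regex stands in for Mojo::DOM so that the example runs with core Perl only. It assumes each page is a complete HTML document with a single body.

```perl
use strict;
use warnings;

# Hypothetical helper (not taken from the changeset): combine per-page HTML
# documents into one document, wrapping each page's body content in a <div>.
sub merge_pages {
    my (@pages) = @_;
    my @divs;
    my $page_num = 0;
    for my $html (@pages) {
        $page_num++;
        # Crude body extraction; a real DOM parser such as Mojo::DOM is more robust.
        my ($body) = $html =~ m{<body[^>]*>(.*?)</body>}is;
        $body = "" unless defined $body;
        # The id and class names are illustrative assumptions.
        push @divs, qq{<div id="page$page_num" class="pdf-page">$body</div>};
    }
    return "<html><body>\n" . join("\n", @divs) . "\n</body></html>\n";
}

my $merged = merge_pages(
    "<html><body><p>First page</p></body></html>",
    "<html><body><p>Second page</p></body></html>",
);
print $merged;
```

The actual post-processing also adjusts styles and attributes, which this sketch leaves out.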

File:

1 edited

main/trunk/greenstone2/bin/script/gsConvert.pl (modified) (3 diffs)

Legend:

(no marker): Unmodified
+: Added
-: Removed

main/trunk/greenstone2/bin/script/gsConvert.pl

-              r30724
+              r32205
     # Attempt conversion to HTML
-    if (!$output_type || ($output_type =~ m/html/i)) {
+    # Uses the old pdftohtml that doesn't work for newer PDF versions
+    #if ($output_type =~ m/^html/i) {
+    if (!$output_type || ($output_type =~ m/^html/i)) {
         $success = &pdf_to_html($dirname, $input_filename, $output_filestem);
         if ($success) {
             return "html";
+        }
+    }
+    # Attempt conversion to (paged) HTML using the newer pdftohtml of Xpdftools. This
+    # will be the new default for PDFs when output_type for PDF docs is not specified
+    # (once our use of xpdftools' pdftohtml has been implemented on win and mac).
+    if ($output_type =~ m/paged_html/i) {
+    #if (!$output_type || ($output_type =~ m/paged_html/i)) {
+        $success = &xpdf_to_html($dirname, $input_filename, $output_filestem);
+        if ($success) {
+            return "paged_html";
+        }
+    }
 …
-# Convert a pdf file to html with the pdftohtml command
+# Convert a pdf file to html with the old pdftohtml command
+# which only works for older PDF versions
 sub pdf_to_html {
     my ($dirname, $input_filename, $output_filestem) = @_;
 …
     return 1;
+}
+# Convert a pdf file to html with the newer Xpdftools' pdftohtml
+# This generates "paged HTML" where extracted, selectable text is positioned
+# over screenshots of each page.
+# Since xpdf's pdftohtml fails if the output dir already exists and for easier
+# naming, the output files are created in a "pages" subdirectory of the tmp
+# location parent of $output_filestem instead
+sub xpdf_to_html {
+    my ($dirname, $input_filename, $output_filestem) = @_;
+    my $cmd = "";
+    # build up the path to the doc-to-html conversion tool we're going to use
+    my $xpdf_pdftohtml = &FileUtils::filenameConcatenate($ENV{'GSDLHOME'}, "bin", $ENV{'GSDLOS'}, "xpdf-tools");
+    if ($ENV{'GSDLOS'} =~ m/^windows$/i) {
+        # TODO
+    } elsif ($ENV{'GSDLOS'} =~ m/^darwin$/i) {
+        # TODO
+    } else { # unix, use the appropriate bin folder for the bitness of the system
+        # Don't use $ENV{'GSDLARCH'}, use the new $ENV{'BITNESS'}, since
+        # $ENV{'GSDLARCH'} is only (meant to be) set when many other 32-bit or 64-bit
+        # specific subdirectories exist in a greenstone installation.
+        # None of those locations need exist when xpdf-tools is installed with GS.
+        # So don't depend on GSDLARCH as forcing that to be exported has side-effects
+        if($ENV{'BITNESS'}) {
+            $xpdf_pdftohtml = &FileUtils::filenameConcatenate($xpdf_pdftohtml, "bin".$ENV{'BITNESS'});
+        } else { # what if $ENV{'BITNESS'} undefined, fallback on bin32? or 64?
+            $xpdf_pdftohtml = &FileUtils::filenameConcatenate($xpdf_pdftohtml, "bin32");
+        }
+    }
+    # We'll create the file by name $output_filestem during post-conversion processing.
+    # Note that Xpdf tools will only create its conversion products in a dir that does
+    # not yet exist. So we'll create this location as a subdir of the output_filestem's
+    # parent directory. The parent dir is the already generated tmp area for conversion. So:
+    # - tmpdir gs2build/tmp/<random-num> already exists at this stage
+    # - We'll create gs2build/tmp/<rand>/output_filestem.html later, during post-processing
+    # - For now, XPdftools will create gs2build/tmp/<rand>/pages and put its products in there.
+    my ($tailname, $tmp_dirname, $suffix)
+        = &File::Basename::fileparse($output_filestem, "\\.[^\\.]+\$");
+    $tmp_dirname = &FileUtils::filenameConcatenate($tmp_dirname, "pages");
+    $xpdf_pdftohtml = &FileUtils::filenameConcatenate($xpdf_pdftohtml, "pdftohtml");
+    # xpdf's pdftohtml tool also takes a zoom factor, where a zoom of 1 is 100%
+    $cmd .= "\"$xpdf_pdftohtml\"";
+    $cmd .= " -z $pdf_zoom" if ($pdf_zoom);
+#    $cmd .= " -c" if ($pdf_complex);
+#    $cmd .= " -i" if ($pdf_ignore_images);
+#    $cmd .= " -a" if ($pdf_allow_images_only);
+#    $cmd .= " -hidden" unless ($pdf_nohidden);
+    $cmd .= " \"$input_filename\" \"$tmp_dirname\"";
+    #$cmd .= " \"$input_filename\" \"$output_filestem\"";
+    if ($ENV{'GSDLOS'} !~ m/^windows$/i || $is_winnt_2000) {
+        $cmd .= " > \"$output_filestem.out\" 2> \"$output_filestem.err\"";
+    } else {
+        $cmd .= " > \"$output_filestem.err\"";
+    }
+    #print STDERR "@@@@ Running command: $cmd\n";
+    $!=0;
+    my $retval=system($cmd);
+    if ($retval!=0)
+    {
+        print STDERR "Error executing xpdf's pdftohtml tool";
+        if ($!) {print STDERR ": $!";}
+        print STDERR "\n";
+    }
+    # make sure the converter made something
+    if ($retval!=0 || ! -s &FileUtils::filenameConcatenate($tmp_dirname,"index.html"))
+    {
+        &FileUtils::removeFiles("$output_filestem.out") if (-e "$output_filestem.out");
+        # print out the converter's std err, if any
+        if (-s "$output_filestem.err") {
+            open (ERRLOG, "$output_filestem.err") || die "$!";
+            print STDERR "pdftohtml error log:\n";
+            while (<ERRLOG>) {
+                print STDERR "$_";
+            }
+            close ERRLOG;
+        }
+        #print STDERR "***********output filestem $output_filestem.html\n";
+        &FileUtils::removeFiles("$tmp_dirname") if (-d "$tmp_dirname");
+        if (-e "$output_filestem.err") {
+            if ($faillogfile ne "" && defined(open(FAILLOG,">>$faillogfile")))
+            {
+                open (ERRLOG, "$output_filestem.err");
+                while (<ERRLOG>) {print FAILLOG $_;}
+                close ERRLOG;
+                close FAILLOG;
+            }
+            &FileUtils::removeFiles("$output_filestem.err");
+        }
+        return 0;
+    }
+    &FileUtils::removeFiles("$output_filestem.err") if (-e "$output_filestem.err");
+    &FileUtils::removeFiles("$output_filestem.out") if (-e "$output_filestem.out");
+    return 1;
+}
 # Convert a pdf file to various types of image with the convert command
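
The path handling in xpdf_to_html can be tried out in isolation with the sketch below. It is a simplified illustration, not part of the changeset: the helper names `xpdf_pdftohtml_path` and `pages_dir` are hypothetical, File::Spec stands in for Greenstone's FileUtils::filenameConcatenate, and the example paths are made up. As in the diff, the Windows and Darwin cases are left without a bitness subdirectory (those branches are still TODO), and an unset BITNESS falls back on bin32.

```perl
use strict;
use warnings;
use File::Spec;
use File::Basename qw(fileparse);

# Hypothetical helper mirroring how the diff builds the path to pdftohtml.
# File::Spec->catfile is used here in place of FileUtils::filenameConcatenate.
sub xpdf_pdftohtml_path {
    my ($gsdlhome, $gsdlos, $bitness) = @_;
    my @parts = ($gsdlhome, "bin", $gsdlos, "xpdf-tools");
    if ($gsdlos !~ m/^(windows|darwin)$/i) {
        # Unix: choose bin32 or bin64, falling back on bin32 when BITNESS is unset.
        push @parts, "bin" . ($bitness ? $bitness : "32");
    }
    return File::Spec->catfile(@parts, "pdftohtml");
}

# Hypothetical helper: the "pages" directory that sits beside $output_filestem,
# which pdftohtml is asked to create because it refuses an existing output dir.
sub pages_dir {
    my ($output_filestem) = @_;
    my ($tailname, $tmp_dirname, $suffix) = fileparse($output_filestem, qr/\.[^.]+$/);
    return File::Spec->catdir($tmp_dirname, "pages");
}

# Example values below are illustrative only.
print xpdf_pdftohtml_path("/opt/gs", "linux", "64"), "\n";
print xpdf_pdftohtml_path("/opt/gs", "linux", undef), "\n";
print pages_dir("/opt/gs/tmp/12345/doc.html"), "\n";
```

In gsConvert.pl itself the three inputs come from `$ENV{'GSDLHOME'}`, `$ENV{'GSDLOS'}` and `$ENV{'BITNESS'}`; passing them as arguments here only makes the logic easier to exercise.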

Changeset 32205 for main/trunk/greenstone2/bin/script/gsConvert.pl
