gsConvert.pl

Timestamp:

2018-07-13T20:40:24+12:00

Author:

ak19

Message:

First of the commits restructuring and refactoring the PDFPlugin.

1. Introduces PDFv1Plugin.pm, which only runs the old pdftohtml. The pdfbox_conversion options are moved into PDFv2Plugin.

2. In the meantime we still have PDFPlugin, the current state of the plugin, for backward compatibility: it uses the old pdftohtml tool and still has the pdfbox_conversion option. PDFv2Plugin is yet to be introduced.

3. gsConvert.pl has the new flag pdf_tool, set and passed in by PDFPlugin.pm and all PDFPlugin classes hereafter. The pdf_tool flag can be set to pdftohtml, xpdftools or pdfbox. PDFv1Plugin will always set it to pdftohtml, to denote that the old pdftohtml tool is to be used, whereas PDFv2Plugin will set it to xpdftools. PDFBoxConverter sets it to pdfbox for symmetry's sake, even though, being an AutoLoadConverter at present, the PDFBoxConverter class bypasses gsConvert.pl. gsConvert.pl uses the pdf_tool flag to determine which tool performs the conversion that produces the selected output_type.

4. Added some strings. One tells migrating users that PDFPlugin is being deprecated in favour of the PDFv1 and PDFv2 plugins. Another was referenced by CommonUtil, and more recently by PDFPlugin, but was not defined in strings.properties.

Once PDFv2Plugin has been added, references to paged_html need to be removed from PDFPlugin.

File:

1 edited

main/trunk/greenstone2/bin/script/gsConvert.pl (modified) (5 diffs)

main/trunk/greenstone2/bin/script/gsConvert.pl

-              r32263
+              r32273
 my $use_strings;
+my $pdf_tool;
 my $pdf_complex;
 my $pdf_nohidden;
 …
     print STDERR "  options:\n\t-type\tdoc|dot|pdf|ps|ppt|rtf|xls\t(input file type)\n";
     print STDERR "\t-errlog\t<filename>\t(append err messages)\n";
-    print STDERR "\t-output\tauto|html|text|pagedimg_jpg|pagedimg_gif|pagedimg_png\t(output file type)\n";
+    print STDERR "\t-output\tauto|html|paged_html|text|pagedimg_jpg|pagedimg_gif|pagedimg_png\t(output file type)\n";
     print STDERR "\t-timeout\t<max cpu seconds>\t(ulimit on unix systems)\n";
     print STDERR "\t-use_strings\tuse strings to extract text if conversion fails\n";
     print STDERR "\t-windows_scripting\tuse windows VB script (if available) to convert Microsoft Word and PPT documents\n";
+    print STDERR "\t-pdf_tool\tpdftohtml|xpdftools|pdfbox (not all output types are supported by every pdf_tool)\n";
     print STDERR "\t-pdf_complex\tuse complex output when converting PDF to HTML\n";
     print STDERR "\t-pdf_nohidden\tDon't attempt to extract hidden text from PDF files\n";
 …
              "type/$type_re/", \$input_type,
              '/errlog/.*/', \$faillogfile,
-             'output/(auto|html|text|pagedimg).*/', \$output_type,
+             'output/(auto|html|text|pagedimg).*/', \$output_type, # regex includes html_multi and paged_html besides html
              'timeout/\d+/0',\$timeout,
              'verbose/\d+/0', \$verbose,
              'windows_scripting',\$windows_scripting,
              'use_strings', \$use_strings,
-             'pdf_complex', \$pdf_complex,
+             'pdf_tool/(pdftohtml|pdfbox|xpdftools)/', \$pdf_tool, # the old pdftohtml tool, pdfbox extensions or the newer xpdf-tools
+             'pdf_complex', \$pdf_complex, # options for pdf_tool = pdftohtml (the old pdftohtml tool)
              'pdf_ignore_images', \$pdf_ignore_images,
              'pdf_allow_images_only', \$pdf_allow_images_only,
 …
     my $success = 0;
     $output_type =~ s/.*\-(.*)/$1/i;
+    # First determine which pdf conversion tool we're using among pdftohtml/pdfbox/xpdftools
+    # and then decide which conversion command to run based on the output type
+    # (pdfbox does not currently go through gsConvert.pl
+    # as PDFBoxConverter inherits from AutoLoadConverters)
+  if ($pdf_tool eq "pdftohtml" ) { # old pdftohtml tool
     # Attempt conversion to Image
     if ($output_type =~ m/jp?g|gif|png/i) {
 …
+    }
+    # Attempt conversion to (paged) HTML using the newer pdftohtml of Xpdftools. This
+    # will be the new default for PDFs when output_type for PDF docs is not specified
+    # (once our use of xpdftools' pdftohtml has been implemented on win and mac).
+    #if ($output_type =~ m/paged_html/i) {
+    if (!$output_type || ($output_type =~ m/paged_html/i)) {
+    $success = &xpdf_to_html($dirname, $input_filename, $output_filestem);
+    if ($success) {
+        return "paged_html";
+    }
+    }
-    # Attempt conversion to TEXT
+    # Attempt conversion to TEXT (not for Windows, but PDFPlugin/PDFv1Plugin takes care of that)
     if (!$output_type || ($output_type =~ m/text/i)) {
+        $success = &xpdf_to_text($dirname, $input_filename, $output_filestem);
+        #if ($ENV{'GSDLOS'} =~ m/^windows$/i) { # we now have pdf to text support for windows by using xpdf tools
+        #   $success = &xpdf_to_text($dirname, $input_filename, $output_filestem);
+        #} else {
+        #   $success = &pdf_to_text($dirname, $input_filename, $output_filestem);
+        #}
+    $success = &pdf_to_text($dirname, $input_filename, $output_filestem);
     if ($success) {
         return "text";
+    }
+    }
+  }
+  elsif ($pdf_tool eq "xpdftools" ) {
+    # default to html output
+    if (!$output_type) {
+        $output_type = "html";
+    }
+    # Attempt conversion to Image
+    #if ($output_type =~ m/jp?g|gif|png/i) {
+    #    $success = &pdfps_to_img($dirname, $input_filename, $output_filestem, $output_type);
+    #    if ($success){
+    #   return "item";
+    #    }
+    #}
+    # Attempt conversion to (paged) HTML using the newer pdftohtml of Xpdftools.
+    if ($output_type =~ m/^(paged_html|html)$/i) {
+        $success = &xpdf_to_html($dirname, $input_filename, $output_filestem);
+        if ($success) {
+        return $output_type;
+        }
+    }
+    # Attempt conversion to TEXT
+    if (!$output_type || ($output_type =~ m/text/i)) {
+        $success = &xpdf_to_text($dirname, $input_filename, $output_filestem);
+        if ($success) {
+        return "text";
+        }
+    }
+  }
     return "fail";
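
The branching introduced by this changeset can be summarised in a small standalone sketch. It is not part of gsConvert.pl: `plan_pdf_conversion` is a hypothetical helper that only returns the names of the routines the visible branches would attempt, in order, without running any tool. Steps hidden behind the elided parts of the diff (for example in the pdftohtml branch) are deliberately left out, and the commented-out image conversion in the xpdftools branch is ignored.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only (not in gsConvert.pl).
# Returns the names of the conversion routines that the visible branches
# of the diff would try, in order, for a given pdf_tool and output_type.
sub plan_pdf_conversion {
    my ($pdf_tool, $output_type) = @_;
    $pdf_tool    = '' unless defined $pdf_tool;
    $output_type = '' unless defined $output_type;
    my @plan;

    if ($pdf_tool eq 'pdftohtml') {
        # Only the text step is fully visible in the diff for this branch;
        # the elided steps are not reproduced here.
        push @plan, 'pdf_to_text'
            if !$output_type || $output_type =~ m/text/i;
    }
    elsif ($pdf_tool eq 'xpdftools') {
        # The xpdftools branch defaults to html output.
        $output_type = 'html' unless $output_type;
        push @plan, 'xpdf_to_html'
            if $output_type =~ m/^(paged_html|html)$/i;
        push @plan, 'xpdf_to_text'
            if $output_type =~ m/text/i;
    }

    # An empty plan corresponds to falling through to return "fail",
    # which is also what happens for pdf_tool = pdfbox in this diff.
    return @plan;
}

for my $case (['xpdftools', ''], ['xpdftools', 'text'],
              ['pdftohtml', 'text'], ['pdfbox', 'html']) {
    my ($tool, $type) = @$case;
    my @plan = plan_pdf_conversion($tool, $type);
    printf "%-10s %-5s => %s\n", $tool, $type || '-',
        @plan ? join(', ', @plan) : 'fail';
}
```

The sketch makes one consequence of the new code easier to see: because the xpdftools branch assigns a default output type before testing it, an unspecified output type leads to HTML conversion there, whereas in the pdftohtml branch an unspecified output type still reaches the text step.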
