Changeset 32290

Timestamp:

2018-07-19T19:54:32+12:00

Author:

ak19

Message:

1. Making paged_pretty_html the default rather than pretty_html, since it's likely that more users will want their converted PDFs sectionalised. 2. Hopefully improved the display strings so that they make sense to users rather than just to me.

Location:

main/trunk/greenstone2

Files:

4 edited

bin/script/gsConvert.pl (modified) (2 diffs)
perllib/plugins/PDFPlugin.pm (modified) (1 diff)
perllib/plugins/PDFv2Plugin.pm (modified) (6 diffs)
perllib/strings.properties (modified) (4 diffs)

main/trunk/greenstone2/bin/script/gsConvert.pl

-              r32287
+              r32290
              "type/$type_re/", \$input_type,
              '/errlog/.*/', \$faillogfile,
-             'output/(auto|html|text|pagedimg).*/', \$output_type, # regex includes html_multi and paged_html besides html
+             'output/(auto|html|text|pagedimg).*/', \$output_type, # regex includes html_multi and (paged_)pretty_html besides html, as well as pagedimgtxt_<imgext> besides pagedimg_<imgext>
              'timeout/\d+/0',\$timeout,
              'verbose/\d+/0', \$verbose,
 …
     elsif ($pdf_tool eq "xpdftools" ) {
-    # default to pretty html output
+    # default to paged_pretty_html output
     if (!$output_type) {
-        $output_type = "pretty_html";
+        $output_type = "paged_pretty_html";
+    }
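
The effect of the option pattern and the new default can be illustrated with a small standalone sketch. It is not code from gsConvert.pl: `resolve_output_type` is a hypothetical helper, and applying the pattern unanchored is an assumption based on the comment that the regex also covers (paged_)pretty_html. In the script itself the default is applied only in the xpdftools branch.

```perl
use strict;
use warnings;

# Pattern copied from the option spec in the diff; unanchored matching is an assumption.
my $output_re = qr/(auto|html|text|pagedimg).*/;

# Hypothetical helper: returns the output type to use, or undef if it would be rejected.
sub resolve_output_type {
    my ($requested) = @_;
    return "paged_pretty_html" if !$requested;    # default introduced in r32290
    return $requested =~ $output_re ? $requested : undef;
}

for my $type ("", "html", "html_multi", "pretty_html", "paged_pretty_html",
              "pagedimg_jpg", "pagedimgtxt_png", "docx") {
    my $resolved = resolve_output_type($type);
    printf "%-20s => %s\n", "'$type'", defined $resolved ? $resolved : "(rejected)";
}
```

Under these assumptions an empty output type resolves to paged_pretty_html, and every listed value except docx is accepted.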

main/trunk/greenstone2/perllib/plugins/PDFPlugin.pm

-              r32289
+              r32290
     if ($self->{'use_realistic_book'}) {
     if ($self->{'convert_to'} ne "html") {
-        print STDERR "PDFs will be converted to HTML for realistic book functionality\n";
+        &gsprintf::gsprintf(STDERR, "PDFv2Plugin: {PDFPlugin.html_for_realistic_book}\n");
         $self->{'convert_to'} = "html";
     }

main/trunk/greenstone2/perllib/plugins/PDFv2Plugin.pm

-              r32287
+              r32290
        'reqd' => "yes",
        'list' => $convert_to_list,
-       'deft' => "pretty_html" },
+       'deft' => "paged_pretty_html" },
      { 'name' => "process_exp",
        'desc' => "{BaseImporter.process_exp}",
 …
        'type' => "regexp",
        'deft' => &get_default_block_exp() },
-     { 'name' => "metadata_fields",
-       'desc' => "{HTMLPlugin.metadata_fields}",
-       'type' => "string",
-       'deft' => "Title,Author,Subject,Keywords" },
-     { 'name' => "metadata_field_separator",
-   'desc' => "{HTMLPlugin.metadata_field_separator}",
-   'type' => "string",
-   'deft' => "" },
+#     { 'name' => "metadata_fields",
+#       'desc' => "{HTMLPlugin.metadata_fields}",
+#       'type' => "string",
+#       'deft' => "Title,Author,Subject,Keywords" },
+#     { 'name' => "metadata_field_separator",
+#   'desc' => "{HTMLPlugin.metadata_field_separator}",
+#   'type' => "string",
+#   'deft' => "" },
      { 'name' => "dpi",
        'desc' => "{PDFv2Plugin.dpi}",
 …
       { 'name' => "use_realistic_book",
         'desc' => "{PDFPlugin.use_realistic_book}",
-    'type' => "flag"}
+    'type' => "flag" }
      ];
 …
     my $pdfbox_converter_self = new PDFBoxConverter($pluginlist, $inputargs, $hashArgOptLists);
     my $cbf_self = new ConvertBinaryFile($pluginlist, $inputargs, $hashArgOptLists);
-    my $self = BaseImporter::merge_inheritance($pdfbox_converter_self, $cbf_self);
+    my $self = BaseImporter::merge_inheritance($pdfbox_converter_self, $cbf_self); # this param order seems necessary to preserve the default/user-selected value for the convert_to option
     if ($self->{'info_only'}) {
 …
     if ($self->{'convert_to'} eq "auto") {
-    # choose pretty_html is the best default option when using xpdftools
-    $self->{'convert_to'} = "pretty_html";
+    # defaulting to paged_pretty_html, as it's the best default option when using xpdftools
+    $self->{'convert_to'} = "paged_pretty_html";
+    &gsprintf::gsprintf(STDERR, "PDFv2Plugin: {PDFv2Plugin.auto_output_default}\n", $self->{'convert_to'});
+    }
     if ($self->{'use_realistic_book'}) {
     if ($self->{'convert_to'} ne "html") {
-        print STDERR "PDFs will be converted to HTML for realistic book functionality\n";
+        &gsprintf::gsprintf(STDERR, "PDFv2Plugin: {PDFPlugin.html_for_realistic_book}\n");
         $self->{'convert_to'} = "html";
+    }
 …
     # Copying file open/close code from CommonUtil::utf8_write_file()
     if (!open (OUTFILE, ">:utf8", $output_filename)) {
-    gsprintf(STDERR, "PDFv2Plugin::xpdftohtml_convert_post_process {CommonUtil.could_not_open_for_writing} ($!)\n", $output_filename);
+    &gsprintf::gsprintf(STDERR, "PDFv2Plugin::xpdftohtml_convert_post_process {CommonUtil.could_not_open_for_writing} ($!)\n", $output_filename);
     die "\n";
+    }
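
The two checks on convert_to shown above run in a fixed order: "auto" is first resolved to the new default, and use_realistic_book then forces "html". A minimal sketch of that order follows. `effective_convert_to` is a hypothetical helper written for illustration, not a PDFv2Plugin method, and it leaves out the gsprintf messages.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the order of the two checks in the diff.
sub effective_convert_to {
    my ($self) = @_;
    my $convert_to = $self->{'convert_to'};
    # "auto" now resolves to paged_pretty_html (previously pretty_html)
    $convert_to = "paged_pretty_html" if $convert_to eq "auto";
    # realistic book output needs plain html, so it overrides any other choice
    $convert_to = "html" if $self->{'use_realistic_book'} && $convert_to ne "html";
    return $convert_to;
}

print effective_convert_to({ 'convert_to' => "auto" }), "\n";
print effective_convert_to({ 'convert_to' => "auto", 'use_realistic_book' => 1 }), "\n";
print effective_convert_to({ 'convert_to' => "paged_text" }), "\n";
```

In this sketch the realistic book flag wins over both an explicit choice and the resolved default, which matches the order of the blocks in the diff.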

main/trunk/greenstone2/perllib/strings.properties

-              r32287
+              r32290
 ConvertBinaryFile.convert_to.text:Plain text format.
-ConvertBinaryFile.convert_to.paged_text:Text separately extracted for each individual page.
+ConvertBinaryFile.convert_to.paged_text:Sectionalised plain text, where every page's text is its own section.
 ConvertBinaryFile.convert_to.pagedimg:A series of images.
 …
 PDFPlugin.complex:Create more complex output. With this option set the output html will look much more like the original PDF file. For this to function properly you Ghostscript installed (for *nix gs should be on your path while for windows you must have gswin32c.exe on your path).
-PDFPlugin.convert_to.html:HTML. Text only, no images.
-PDFPlugin.convert_to.pretty_html:A series of HTML pages, one for each page. Each HTML page contains selectable text positionally overlaid on top of a screenshot of the PDF page background comprising any images, tables and drawings.
-PDFPlugin.convert_to.paged_pretty_html:Sectionalised variant of pretty_html to allow jumping to individual pages.
+PDFPlugin.convert_to.html:very basic HTML comprising just the extracted text, no images.
+PDFPlugin.convert_to.pretty_html:Each PDF page as HTML containing selectable text positionally overlaid on top of a textless screenshot of the PDF page.
+PDFPlugin.convert_to.paged_pretty_html:Sectionalised pretty_html, where each page's html is its own section.
 PDFPlugin.deprecated_plugin:*************IMPORTANT******************\nPDFPlugin is being deprecated.\nConsider upgrading to the recommended PDFv2Plugin, which supports newer versions of PDFs.\nAlternatively, if you wish to retain the old style of conversion and are NOT relying on PDFBox,\nchange to PDFv1Plugin.\nIf you are using PDFBox then upgrade to PDFv2Plugin.\n*****************************************\n
 …
 PDFPlugin.desc:Plugin that processes PDF documents using the older pdftohtml tool. Does not support newer PDF versions.
+PDFPlugin.html_for_realistic_book:PDFs will be converted to HTML for realistic book functionality
 PDFPlugin.nohidden:Prevent pdftohtml from attempting to extract hidden text. This is only useful if the -complex option is also set.
 PDFPlugin.noimages:Don't attempt to extract images from PDF.
+PDFv2Plugin.auto_output_default:Defaulting to output format %s
 PDFPlugin.use_realistic_book:Converts the PDF to a well-formed XHTML document to enable users view it in the realistic book format.
 …
 PDFv1Plugin.zoom:The factor by which to zoom the PDF for output. Only useful if -complex is set.
-PDFv2Plugin.dpi:The resolution in DPI of background images generated for pagedimg(txt) and (paged_)pretty_html output settings.
+PDFv2Plugin.dpi:The resolution in DPI of background images generated when convert_to is set to any of the pagedimg(txt) and (paged_)pretty_html formats.
 PostScriptPlugin.desc:This is a \"poor man's\" ps to text converter. If you are serious, consider using the PRESCRIPT package, which is available for download at http://www.nzdl.org/html/software.html
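
The new strings are referenced from the plugins as {key} placeholders, with %s filled in from the extra argument. The sketch below shows that idea using two entries copied from the diff. `expand_message` is a simplified, hypothetical stand-in and does not reproduce how gsprintf itself loads or looks up strings.

```perl
use strict;
use warnings;

# Two entries copied from the strings.properties diff above.
my $properties = <<'END';
PDFPlugin.html_for_realistic_book:PDFs will be converted to HTML for realistic book functionality
PDFv2Plugin.auto_output_default:Defaulting to output format %s
END

# Parse "key:value" lines, splitting on the first colon only.
my %strings;
for my $line (split /\n/, $properties) {
    my ($key, $value) = split /:/, $line, 2;
    $strings{$key} = $value;
}

# Hypothetical stand-in for gsprintf: expand {key} placeholders, then apply sprintf.
# Unknown keys are left in place so that a missing string is easy to spot.
sub expand_message {
    my ($format, @args) = @_;
    $format =~ s/\{([^\}]+)\}/exists $strings{$1} ? $strings{$1} : '{' . $1 . '}'/ge;
    return sprintf($format, @args);
}

print expand_message("PDFv2Plugin: {PDFv2Plugin.auto_output_default}\n", "paged_pretty_html");
```

Keeping the wording in strings.properties rather than in the print statements is what lets this changeset reword the user-facing text without touching the call sites.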
