Changeset 32205

Timestamp:
21.06.2018 21:41:12 (4 weeks ago)
Author:
ak19
Message:

First set of commits to do with implementing the new 'paged_html' output option of PDFPlugin, which uses xpdftools' new pdftohtml. So far tested only on Linux (64 bit), but things work there, so I'm optimistically committing the changes.
1. Committing the pre-built Linux binaries of XPDFtools for both 32 and 64 bit, built by the XPDF group.
2. To use the correct bitness variant of xpdftools, setup.bash now exports the BITNESS env var, consulted by gsConvert.pl.
3. All the perl code changes to do with using xpdftools' pdftohtml to generate paged_html and feed it in the desired form into GS(3): gsConvert.pl, PDFPlugin.pm and its parent ConvertBinaryFile.pm have been modified to make it all work. xpdftools' pdftohtml generates a folder containing an html file and a screenshot for each page in a PDF (as well as an index.html linking to each page's html). However, we want a single html file that contains each individual 'page' html's content in a div, and we need to do some further HTML style, attribute and structure modifications to massage the xpdftools output into what we want for GS. In order to parse and manipulate the HTML 'DOM' to do this, we're using the Mojo::DOM package that Dr Bainbridge found and which he's compiled up. Mojo::DOM is therefore also committed in this revision. Some further changes and some display fixes are required, but I need to check with the others about that.

Location:
main/trunk/greenstone2
Files:
211 added
5 modified

  • main/trunk/greenstone2/bin/script/gsConvert.pl

    r30724 r32205  
    323323 
    324324    # Attempt conversion to HTML 
    325     if (!$output_type || ($output_type =~ m/html/i)) { 
     325    # Uses the old pdftohtml that doesn't work for newer PDF versions 
     326    #if ($output_type =~ m/^html/i) { 
     327    if (!$output_type || ($output_type =~ m/^html/i)) { 
    326328    $success = &pdf_to_html($dirname, $input_filename, $output_filestem); 
    327329    if ($success) { 
    328330        return "html"; 
     331    } 
     332    } 
     333 
     334    # Attempt conversion to (paged) HTML using the newer pdftohtml of Xpdftools. This 
     335    # will be the new default for PDFs when output_type for PDF docs is not specified 
     336    # (once our use of xpdftools' pdftohtml has been implemented on win and mac). 
     337    if ($output_type =~ m/paged_html/i) { 
     338    #if (!$output_type || ($output_type =~ m/paged_html/i)) { 
     339    $success = &xpdf_to_html($dirname, $input_filename, $output_filestem); 
     340    if ($success) { 
     341        return "paged_html"; 
    329342    } 
    330343    } 
     
    756769 
    757770 
    758 # Convert a pdf file to html with the pdftohtml command 
    759  
     771# Convert a pdf file to html with the old pdftohtml command 
     772# which only works for older PDF versions 
    760773sub pdf_to_html { 
    761774    my ($dirname, $input_filename, $output_filestem) = @_; 
     
    819832    return 1; 
    820833} 
     834 
     835 
     836# Convert a pdf file to html with the newer Xpdftools' pdftohtml 
     837# This generates "paged HTML" where extracted, selectable text is positioned 
     838# over screenshots of each page. 
     839# Since xpdf's pdftohtml fails if the output dir already exists and for easier 
     840# naming, the output files are created in a "pages" subdirectory of the tmp 
     841# location parent of $output_filestem instead 
     842sub xpdf_to_html { 
     843    my ($dirname, $input_filename, $output_filestem) = @_; 
     844 
     845    my $cmd = ""; 
     846 
     847    # build up the path to the doc-to-html conversion tool we're going to use 
     848    my $xpdf_pdftohtml = &FileUtils::filenameConcatenate($ENV{'GSDLHOME'}, "bin", $ENV{'GSDLOS'}, "xpdf-tools"); 
     849 
     850    if ($ENV{'GSDLOS'} =~ m/^windows$/i) { 
     851    # TODO 
     852    } elsif ($ENV{'GSDLOS'} =~ m/^darwin$/i) { 
     853    # TODO 
     854    } else { # unix, use the appropriate bin folder for the bitness of the system 
     855 
     856    # Don't use $ENV{'GSDLARCH'}, use the new $ENV{'BITNESS'}, since 
     857    # $ENV{'GSDLARCH'} is only (meant to be) set when many other 32-bit or 64-bit 
     858    # specific subdirectories exist in a greenstone installation. 
     859    # None of those locations need exist when xpdf-tools is installed with GS. 
     860    # So don't depend on GSDLARCH as forcing that to be exported has side-effects 
     861    if($ENV{'BITNESS'}) { 
     862        $xpdf_pdftohtml = &FileUtils::filenameConcatenate($xpdf_pdftohtml, "bin".$ENV{'BITNESS'}); 
     863    } else { # what if $ENV{'BITNESS'} undefined, fallback on bin32? or 64? 
     864        $xpdf_pdftohtml = &FileUtils::filenameConcatenate($xpdf_pdftohtml, "bin32"); 
     865    } 
     866    } 
     867 
     868    # We'll create the file by name $output_filestem during post-conversion processing. 
     869    # Note that Xpdf tools will only create its conversion products in a dir that does 
     870    # not yet exist. So we'll create this location as a subdir of the output_filestem's 
     871    # parent directory. The parent dir is the already generated tmp area for conversion. So: 
     872    # - tmpdir gs2build/tmp/<random-num> already exists at this stage 
     873    # - We'll create gs2build/tmp/<rand>/output_filestem.html later, during post-processing 
     874    # - For now, XPdftools will create gs2build/tmp/<rand>/pages and put its products in there. 
     875    my ($tailname, $tmp_dirname, $suffix) 
     876    = &File::Basename::fileparse($output_filestem, "\\.[^\\.]+\$"); 
     877    $tmp_dirname = &FileUtils::filenameConcatenate($tmp_dirname, "pages"); 
     878 
     879    $xpdf_pdftohtml = &FileUtils::filenameConcatenate($xpdf_pdftohtml, "pdftohtml"); 
     880    # xpdf's pdftohtml tool also takes a zoom factor, where a zoom of 1 is 100% 
     881    $cmd .= "\"$xpdf_pdftohtml\""; 
     882    $cmd .= " -z $pdf_zoom" if ($pdf_zoom); 
     883#    $cmd .= " -c" if ($pdf_complex); 
     884#    $cmd .= " -i" if ($pdf_ignore_images); 
     885#    $cmd .= " -a" if ($pdf_allow_images_only); 
     886#    $cmd .= " -hidden" unless ($pdf_nohidden);     
     887    $cmd .= " \"$input_filename\" \"$tmp_dirname\""; 
     888    #$cmd .= " \"$input_filename\" \"$output_filestem\""; 
     889 
     890    if ($ENV{'GSDLOS'} !~ m/^windows$/i || $is_winnt_2000) { 
     891    $cmd .= " > \"$output_filestem.out\" 2> \"$output_filestem.err\""; 
     892    } else { 
     893    $cmd .= " > \"$output_filestem.err\""; 
     894    } 
     895 
     896    #print STDERR "@@@@ Running command: $cmd\n"; 
     897 
     898    $!=0; 
     899    my $retval=system($cmd); 
     900    if ($retval!=0) 
     901    { 
     902    print STDERR "Error executing xpdf's pdftohtml tool"; 
     903    if ($!) {print STDERR ": $!";}  
     904    print STDERR "\n"; 
     905    } 
     906 
     907    # make sure the converter made something 
     908    if ($retval!=0 || ! -s &FileUtils::filenameConcatenate($tmp_dirname,"index.html")) 
     909    { 
     910    &FileUtils::removeFiles("$output_filestem.out") if (-e "$output_filestem.out"); 
     911    # print out the converter's std err, if any 
     912    if (-s "$output_filestem.err") { 
     913        open (ERRLOG, "$output_filestem.err") || die "$!"; 
     914        print STDERR "pdftohtml error log:\n"; 
     915        while (<ERRLOG>) { 
     916        print STDERR "$_"; 
     917        } 
     918        close ERRLOG; 
     919    } 
     920    #print STDERR "***********output filestem $output_filestem.html\n"; 
     921    &FileUtils::removeFiles("$tmp_dirname") if (-d "$tmp_dirname"); 
     922    if (-e "$output_filestem.err") { 
     923        if ($faillogfile ne "" && defined(open(FAILLOG,">>$faillogfile"))) 
     924        { 
     925        open (ERRLOG, "$output_filestem.err"); 
     926        while (<ERRLOG>) {print FAILLOG $_;} 
     927        close ERRLOG; 
     928        close FAILLOG; 
     929        }    
     930        &FileUtils::removeFiles("$output_filestem.err"); 
     931    } 
     932    return 0; 
     933    } 
     934 
     935    &FileUtils::removeFiles("$output_filestem.err") if (-e "$output_filestem.err"); 
     936    &FileUtils::removeFiles("$output_filestem.out") if (-e "$output_filestem.out"); 
     937    return 1; 
     938} 
     939 
     940 
    821941 
    822942# Convert a pdf file to various types of image with the convert command 
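
The path handling in xpdf_to_html above can be sketched in isolation as follows. This is only an illustrative sketch, not the committed code: xpdf_bin_dir and pages_dir_for are hypothetical helper names, core File::Spec stands in for FileUtils::filenameConcatenate, and the example paths are made up and assume a Unix-style layout.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Hypothetical helper (not in gsConvert.pl): directory holding the pdftohtml
# binary for this platform. Falls back to "bin32" when no bitness is given,
# which is the same fallback the committed code uses.
sub xpdf_bin_dir {
    my ($gsdlhome, $gsdlos, $bitness) = @_;
    my $subdir = "bin" . ($bitness ? $bitness : "32");
    return File::Spec->catdir($gsdlhome, "bin", $gsdlos, "xpdf-tools", $subdir);
}

# Hypothetical helper: the "pages" directory that pdftohtml is asked to create.
# It sits next to the eventual output file, inside the existing tmp directory.
sub pages_dir_for {
    my ($output_filestem) = @_;
    my ($tailname, $parent_dir, $suffix) = fileparse($output_filestem, qr/\.[^.]+$/);
    return File::Spec->catdir($parent_dir, "pages");
}

# Made-up example paths, for illustration only.
print xpdf_bin_dir("/opt/greenstone2", "linux", $ENV{'BITNESS'}), "\n";
print pages_dir_for("/opt/greenstone2/tmp/1234/mydoc"), "\n";
```

Keeping the two computations separate makes it easier to see that the binary location depends only on the OS and bitness, while the "pages" directory depends only on the tmp location of the output file.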
  • main/trunk/greenstone2/perllib/plugins/ConvertBinaryFile.pm

    r31766 r32205  
    161161    } 
    162162 
    163     if ($convert_to =~ /^html/) { # may be html or html_multi 
     163    if ($convert_to =~ /^html/ || $convert_to eq "paged_html") { # may be html or html_multi, or paged_html with the new Xpdf's own pdftohtml 
    164164    $self->{'convert_to_plugin'} = "HTMLPlugin"; 
    165165    $self->{'convert_to_ext'} = "html"; 
     
    349349        $output_filename = $tmp_dirname . "\/$utf8_tailname\/" . $utf8_tailname . ".$output_type"; 
    350350    } 
     351    } elsif ($output_type eq "paged_html") { 
     352    $output_filename =~ s/$lc_suffix$/.html/; 
    351353    } else { 
    352354    $output_filename =~ s/$lc_suffix$/.$output_type/; 
     
    371373         
    372374    if ("$conv_filename" eq "") {return -1;} # had an error, will be passed down pipeline  
    373     if (! -e "$conv_filename") {return -1;}  
     375 
     376    # We used to return -1 here if $conv_filename didn't exist at this stage 
     377    # However, for "paged_html" convert_to mode, the converted HTML file $conv_filename  
     378    # will only be created from conversion products *after* convert_post_process() returns 
     379    my $output_type=$self->{'convert_to'}; 
     380    if ($output_type ne "paged_html" && ! -e "$conv_filename") {return -1;}   
    374381    $self->{'conv_filename'} = $conv_filename; 
    375382    $self->convert_post_process($conv_filename); 
     383    if ($output_type eq "paged_html" && ! -e "$conv_filename") {return -1;}   
    376384 
    377385    # Run the "fribidi" (http://fribidi.org) Unicode Bidirectional Algorithm program over the converted file 
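
The reordered existence check above (before post-processing for most modes, after it for paged_html) can be sketched on its own as below. This is a minimal sketch under stated assumptions: check_then_post_process is a hypothetical stand-alone function, the post-processing step is passed in as a code reference, and a temporary directory stands in for the Greenstone tmp area.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper (not part of ConvertBinaryFile.pm). Returns -1 on
# failure and 1 on success. For "paged_html" the converted file only comes
# into existence during post-processing, so its existence is checked afterwards.
sub check_then_post_process {
    my ($convert_to, $conv_filename, $post_process) = @_;
    my $deferred = ($convert_to eq "paged_html");

    return -1 if !$deferred && !-e $conv_filename;
    $post_process->($conv_filename);
    return -1 if $deferred && !-e $conv_filename;
    return 1;
}

# Illustration: a post-processing step that creates the file, as the
# paged_html post-processing does when it writes the combined page.
my $dir    = tempdir(CLEANUP => 1);
my $file   = File::Spec->catfile($dir, "doc.html");
my $create = sub {
    open(my $fh, ">", $_[0]) or die "Cannot write $_[0]: $!";
    print $fh "<html></html>\n";
    close $fh;
};

print check_then_post_process("paged_html", $file, $create), "\n";
```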
  • main/trunk/greenstone2/perllib/plugins/PDFPlugin.pm

    r31494 r32205  
    2727use strict; 
    2828no strict 'refs'; # so we can use a var for filehandles (e.g. STDERR) 
     29no strict 'subs'; # allow filehandles to be variables and vice versa 
    2930 
    3031use ReadTextFile; 
    3132use unicode; 
     33use Mojo::DOM; # for HTML parsing 
    3234 
    3335use AutoLoadConverters; 
     
    4446      { 'name' => "text", 
    4547    'desc' => "{ConvertBinaryFile.convert_to.text}" }, 
     48      { 'name' => "paged_html", 
     49    'desc' => "{PDFPlugin.convert_to.paged_html}"}, 
    4650      { 'name' => "pagedimg_jpg", 
    4751    'desc' => "{ConvertBinaryFile.convert_to.pagedimg_jpg}"}, 
     
    145149 
    146150    # check convert_to 
     151    # TODO: Start supporting PDF to txt on Windows if we're going to be using XPDF Tools (incl pdftotext) on Windows/Linux/Mac 
    147152    if ($self->{'convert_to'} eq "text" && $ENV{'GSDLOS'} =~ /^windows$/i) { 
    148153    print STDERR "Windows does not support pdf to text. PDFs will be converted to HTML instead\n"; 
     
    281286    my ($conv_filename) = @_; 
    282287 
     288    my $outhandle=$self->{'outhandle'}; 
     289#    print STDERR "@@@ convert_to: ".$self->{'convert_to'}."\n"; 
     290 
     291    if($self->{'convert_to'} eq "paged_html") {  
     292    # special post-processing for paged_html mode, as HTML pages generated 
     293    # by xpdf's pdftohtml need to be massaged into the form we want  
     294    $self->xpdftohtml_convert_post_process($conv_filename); 
     295    } 
     296    else { # use PDFPlugin's usual post processing 
     297    $self->default_convert_post_process($conv_filename); 
     298    } 
     299} 
     300 
     301# Called after gsConvert.pl has been run to convert a PDF to paged_html 
     302# using Xpdftools' pdftohtml 
     303# This method will do some cleanup of the HTML files produced after XPDF has produced 
     304# an HTML doc for each PDF page: it first gets rid of the default index.html. 
     305# Instead, it constructs a single html page containing each original HTML page 
     306# <body> nested as divs instead, with simple section information inserted at the top 
     307# of each 'page' <div> and some further styling customisation. This HTML manipulation 
     308# is to be done with the Mojo::DOM perl package. 
     309# Note that since xpdf's pdftohtml would have failed if the output dir already 
     310# existed and for simpler naming, the output files are created in a new "pages" 
     311# subdirectory of the tmp location parent of $conv_filename instead 
     312sub xpdftohtml_convert_post_process 
     313{ 
     314    my $self = shift (@_); 
     315    my ($output_filename) = @_; # output_filename = tmp location + filename  
     316    # if a single html were generated. 
     317    # We just want the tmp location, append "pages", and read all the html files 
     318    # in except for index.html. Then we create a new html file by name 
     319    # $output_filename, which will consist of a slightly modified version of 
     320    # each of the other html files concatenated together. 
     321 
     322    my $outhandle=$self->{'outhandle'}; 
     323 
     324    my ($tailname, $tmp_dir, $suffix) 
     325    = &File::Basename::fileparse($output_filename, "\\.[^\\.]+\$"); 
     326    my $pages_subdir = &FileUtils::filenameConcatenate($tmp_dir, "pages"); 
     327 
     328    # Code from util::create_itemfile() 
     329    # Read in all the files 
     330    opendir(DIR, $pages_subdir) || die "can't opendir $pages_subdir: $!"; 
     331    my @page_files = grep {-f "$pages_subdir/$_"} readdir(DIR); 
     332    closedir DIR; 
     333    # Sort files in the directory by page_num 
     334    # files are named index.html, page1.html, page2.html, ..., pagen.html 
     335    sub page_number { 
     336    my ($dir) = @_; 
     337    my ($pagenum) =($dir =~ m/^page(\d+)\.html?$/i); 
     338    $pagenum = 0 unless defined $pagenum; # index.html will be given pagenum=0 
     339    return $pagenum; 
     340    } 
     341    # sort the files in the directory in the order of page_num rather than lexically. 
     342    @page_files = sort { page_number($a) <=> page_number($b) } @page_files; 
     343 
     344    #my $num_html_pages = (scalar(@page_files) - 1)/2; # skip index file. 
     345              # For every html file there's an img file, so halve the total num. 
     346              # What about other file types that may potentially be there too??? 
     347    my $num_html_pages = 0; 
     348    foreach my $pagefile (@page_files) { 
     349    $num_html_pages++ if $pagefile =~ m/\.html?$/ && $pagefile !~ /^index\.html?/i;  
     350    } 
     351 
     352    # Prepare to create our new html page that will contain all the individual 
     353    # htmls generated by xpdf's pdftohtml in sequence. 
     354    # First write the opening html tags out to the output file. These are the 
     355    # same tags and their contents, including <meta>, as is generated by  
     356    # Xpdf's pdftohtml for each of its individual html pages. 
     357    my $start_text = "<html>\n<head>\n"; 
     358    $start_text .= "<title>$tailname</title>\n"; 
     359    $start_text .= "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n"; 
     360    $start_text .= "</head>\n<body>\n\n"; 
     361 
     362    #handle content encodings the same way that default_convert_post_process does 
     363    # $self->utf8_write_file ($start_text, $conv_filename); # will close file after write     
     364    # Don't want to build a giant string in memory of all the pages concatenated 
     365    # and then write it out in one go. Instead, build up the final single page 
     366    # by writing each modified paged_html file out to it as this is processed. 
     367    # Copying file open/close code from CommonUtil::utf8_write_file() 
     368    if (!open (OUTFILE, ">:utf8", $output_filename)) { 
     369    gsprintf(STDERR, "PDFPlugin::xpdftohtml_convert_post_process {ConvertToPlug.could_not_open_for_writing} ($!)\n", $output_filename); 
     370    die "\n"; 
     371    } 
     372    print OUTFILE $start_text; 
     373 
     374    # Get the contents of each individual HTML page generated by Xpdf, after first 
     375    # modifying each, and write each out into our single all-encompassing html 
     376    foreach my $pagefile (@page_files) { 
     377    if ($pagefile =~ m/\.html?$/ && $pagefile !~ /^index\.html?/i) { 
     378        my $page_num = page_number($pagefile);     
     379        # get full path to pagefile 
     380        $pagefile = &FileUtils::filenameConcatenate($pages_subdir, $pagefile); 
     381#       print STDERR "@@@ About to process html file $pagefile (num $page_num)\n"; 
     382        my $modified_page_contents = $self->_process_paged_html_page($pagefile, $page_num, $num_html_pages); 
     383        print OUTFILE "$modified_page_contents\n\n"; 
     384    } 
     385    } 
     386 
     387    # we've now created a single HTML file by concatenating (a modified version) 
     388    # of each paged html file 
     389    print OUTFILE "</body>\n</html>\n"; # write out closing tags 
     390    close OUTFILE; # done 
     391 
     392    # Get rid of all the htm(l) files incl index.html in the associated "pages" 
     393    # subdir, since we've now processed them all into a single html file 
     394    # one folder level up and we don't want HTMLPlugin to process all of them next. 
     395#    my @fullpath_page_files = map { &FileUtils::filenameConcatenate($pages_subdir, $_) } @page_files; 
     396    &FileUtils::removeFilesFiltered($pages_subdir, "\\.html?\$"); # no specific whitelist, but blacklist htm(l) 
     397 
     398    # now the tmp area should contain a single html file containing all the html pages' 
     399    # contents in sequence, and a "pages" subdir containing the screenshot images 
     400    # of each page.     
     401    # HTMLPlugin will process these further in the plugin pipeline 
     402} 
     403 
     404# For whatever reason, most html <tags> don't get printed out in GLI 
     405# So when debugging, use this function to print them out as [tags] instead. 
     406sub _debug_print_html 
     407{ 
     408    my $self = shift (@_); 
     409    my ($string_or_dom) = @_; 
     410 
     411    # can't seem to determine type of string with ref/reftype 
     412    # https://stackoverflow.com/questions/1731333/how-do-i-tell-what-type-of-value-is-in-a-perl-variable 
     413 
     414    # $dom objects appear to get correctly stringified in string contexts 
     415    # $dom.to_string/$dom.stringify seem to get called, no need to call them 
     416    # https://stackoverflow.com/questions/5214543/what-is-stringification-in-perl 
     417    my $escapedTxt = $string_or_dom;  
     418    $escapedTxt =~ s@\<@[@sg; 
     419    $escapedTxt =~ s@\>@]@sg; 
     420 
     421    print STDERR "#### $escapedTxt\n"; 
     422} 
     423 
     424# Helper function to read in each paged_html generated by Xpdf's pdftohtml 
     425# then modify the html suitably using the HTML parsing functions offered by  
     426# Mojo::DOM, then return the modified HTML content as a string 
     427# See https://mojolicious.org/perldoc/Mojo/DOM 
     428sub _process_paged_html_page 
     429{ 
     430    my $self = shift (@_); 
     431    my ($pagefile, $page_num, $num_html_pages) = @_; 
     432 
     433    my $text = ""; 
     434 
     435    # handling content encoding the same way default_convert_post_process does 
     436    $self->read_file ($pagefile, "utf8", "", \$text); 
     437 
     438    my $dom = Mojo::DOM->new($text); 
     439 
     440#    $self->_debug_print_html($dom); 
     441 
     442    # there's a <style> element in the <html> that we need to shift into the <div> 
     443    # tag that we'll be creating. We'll first store the first style element (which 
     444    # is the only one, and is in the <body>) and slightly modify it; 
     445    # we'll later insert it as child of an all-encompassing div that we'll create 
     446#    my $page_style_tag_str = $dom->find('style')->[0]->to_string; 
     447#    my $page_style_tag_str = $dom->find('html style')->[0]->to_string; 
     448    my $page_style_tag_str = $dom->at('html')->at('style')->to_string; 
     449    # In the style tag, convert id style references to class style references 
     450    my $css_class = ".p".$page_num."f"; 
     451    $page_style_tag_str =~ s@\#f@$css_class@sg; 
     452    my $style_element = Mojo::DOM->new($page_style_tag_str)->at('style'); # modified     
     453#$self->_debug_print_html($style_element); 
     454 
     455    # need to know the image's height to set the height of the surrounding 
     456    # div that's to replace this page's <body>: 
     457    my $img_height = $dom->find('img')->[0]{height}; 
     458 
     459 
     460    # 1. Fix up the style attr on the image by additionally setting z-index: -1 for it 
     461    # 2. Adjust the img#background src attribute to point to the pages subdir for imgs 
     462    # 3. Set that img tag's class=background, and change its id to background+$page_num 
     463    my $bg_img_tag=$dom->find('img#background')->[0]; 
     464 
     465    my $img_style_str = $bg_img_tag->{style}; # = $dom->find('img#background')->[0]{style} 
     466    $img_style_str = $img_style_str." z-index: -1;"; 
     467#print STDERR "img_style_str: " . $img_style_str."\n"; 
     468    my $img_src_str = $bg_img_tag->{src}; 
     469    $img_src_str = "pages/$img_src_str"; 
     470    $bg_img_tag->attr({style => $img_style_str, src => $img_src_str}); # reset 
     471#$self->_debug_print_html($bg_img_tag); 
     472    # set both class and modified id attributes in one step: 
     473    $bg_img_tag->attr({class => "background", id => "background".$page_num}); 
     474#$self->_debug_print_html($bg_img_tag); 
     475 
     476    # get all the <span> nested inside <div class="txt"> elements and 
     477    # 1. set their class attr to be "p + page_num + id-of-the-span", 
     478    # 2. then delete the id, because the span ids have been reused when element 
     479    # ids ought to be unique. Which is why we set the modified ids to be the 
     480    # value of the class attribute instead 
     481    $dom->find('div.txt span')->each(sub {  
     482    $_->attr(class => "p". $page_num. $_->{id}); 
     483    delete $_->{id}; 
     484                     }); # both changes done in one find() operation 
     485#$self->_debug_print_html($dom->find('div.txt span')->last); 
     486 
     487    # Finally can create our new dom, starting with a div tag for the current page 
     488    # Must be: <div id="page$page_num" style="position: relative; height: ${img_height}px;"/> 
     489    my $new_dom = Mojo::DOM->new_tag('div', id => "page".$page_num, style => "position: relative; height: ".$img_height."px;" ); 
     490#$self->_debug_print_html($new_dom); 
     491    $new_dom->at('div')->append_content($style_element)->root; 
     492 
     493    # Append a page range bucket heading if applicable 
     494    # Dr Bainbridge thinks for now we need only consider PDFs where the 
     495    # total number of pages < 1000 and create buckets of size 10 (e.g. 1-10, ... 51-60, ...) 
     496    # If number of remaining pages > 10, then create new bucket heading 
     497    # e.g. "Pages 31-40" 
     498    if(($num_html_pages - $page_num) > 10) { 
     499    # Double-digit page numbers that start with 2 
     500    # i.e. 21 to 29 (and 30) should be in 21 to 30 range 
     501    my $start_range = $page_num - ($page_num % 10) + 1; 
     502    my $end_range = $page_num + 10 - ($page_num % 10); 
     503    if($page_num % 10 == 0) { # page 20 however, should be in 11 to 20 range 
     504        $start_range -= 10; 
     505        $end_range -= 10; 
     506    } 
     507    $new_dom->at('div')->append_content($new_dom->new_tag('h1', "Pages ".$start_range . "-" . $end_range))->root; 
     508    } 
     509 
     510    # Add a simpler heading: just the pagenumber, "Page #" 
     511    $new_dom->at('div')->append_content($new_dom->new_tag('h2', "Page ".$page_num))->root; 
     512#$self->_debug_print_html($new_dom); 
     513    # Copy across all the old html's body tag's child nodes into the new dom's new div tag 
     514    $dom->at('body')->child_nodes->each(sub { $new_dom->at('div')->append_content($_)}); #$_->to_string 
     515#$self->_debug_print_html($new_dom); 
     516 
     517    # Finished processing a single html page of the paged_html output generated by 
     518    # Xpdf's pdftohtml: finished massaging that single html page into the right form 
     519    return $new_dom->to_string; 
     520} 
     521 
     522# This subroutine is called to do the PDFPlugin post-processing for all cases 
     523# except the "paged_html" conversion mode. This is what PDFPlugin always used to do: 
     524sub default_convert_post_process 
     525{ 
     526    my $self = shift (@_); 
     527    my ($conv_filename) = @_; 
    283528    my $outhandle=$self->{'outhandle'}; 
    284529 
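
Two small pieces of logic in the PDFPlugin.pm changes above lend themselves to a stand-alone sketch: ordering the page files numerically rather than lexically, and working out which 10-page bucket a page belongs to. The sketch below reuses the arithmetic from the diff; bucket_range is a hypothetical helper name, and the file list is invented for illustration.

```perl
use strict;
use warnings;

# Page number embedded in a file name like "page12.html"; 0 for anything else
# (so index.html sorts first), mirroring page_number() in the diff above.
sub page_number {
    my ($filename) = @_;
    my ($num) = ($filename =~ m/^page(\d+)\.html?$/i);
    return defined $num ? $num : 0;
}

# Hypothetical helper: the (start, end) of the 10-page bucket a page falls in.
sub bucket_range {
    my ($page_num) = @_;
    my $start = $page_num - ($page_num % 10) + 1;
    my $end   = $page_num + 10 - ($page_num % 10);
    if ($page_num % 10 == 0) {    # e.g. page 20 belongs to 11-20, not 21-30
        $start -= 10;
        $end   -= 10;
    }
    return ($start, $end);
}

# Invented file list; a plain lexical sort would put page10 before page2.
my @files  = ("page10.html", "page2.html", "index.html", "page1.html");
my @sorted = sort { page_number($a) <=> page_number($b) } @files;
print join(" ", @sorted), "\n";

printf "Page %d is in bucket %d-%d\n", $_, bucket_range($_) for (1, 20, 25);
```

The special case for multiples of 10 is what keeps page 20 in the 11-20 bucket; without it the modulo arithmetic would push it into 21-30.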
  • main/trunk/greenstone2/perllib/strings.properties

    r32112 r32205  
    11631163PDFPlugin.complex:Create more complex output. With this option set the output html will look much more like the original PDF file. For this to function properly you need Ghostscript installed (for *nix gs should be on your path while for windows you must have gswin32c.exe on your path). 
    11641164 
     1165PDFPlugin.convert_to.paged_html:A series of HTML pages, one for each page. Each HTML page contains selectable text positionally overlaid on top of a screenshot of the PDF page background comprising any images, tables and drawings. Generated with Xpdf tools. 
     1166 
    11651167PDFPlugin.desc:Plugin that processes PDF documents. 
    11661168 
     
    11731175PDFPlugin.use_sections:Create a separate section for each page of the PDF file. 
    11741176 
    1175 PDFPlugin.zoom:The factor by which to zoom the PDF for output (this is only useful if -complex is set). 
     1177PDFPlugin.zoom:The factor by which to zoom the PDF for output. When not outputting as paged_html, then zoom is only useful if -complex is set. If output is as paged_html, then a zoom factor of 1 means 100 percent. 
    11761178 
    11771179PostScriptPlugin.desc:This is a \"poor man's\" ps to text converter. If you are serious, consider using the PRESCRIPT package, which is available for download at http://www.nzdl.org/html/software.html 
  • main/trunk/greenstone2/setup.bash

    r32013 r32205  
    193193      ;; 
    194194  esac 
     195 
     196  # for xpdf tools, need to know whether we're using the bin32 or bin64 folder 
     197  BITNESS=$GSDLARCH 
     198  export BITNESS 
    195199 
    196200  # Only want non-trivial GSDLARCH value set if there is evidence of