Changeset 31757

Show
Ignore:
Timestamp:
28.06.2017 20:39:16 (5 months ago)
Author:
ak19
Message:

Fixed the earlier problems, which, it turned out, had to do with the order in which the superclass plugin instances were merged to create the subclass plugin. I can run the PDFBox command via UnknownConverterPlugin? now at last, but while the text does go into doc.xml, previewing doesn't give me access to the HTML file. Not sure if this is requires fixing up a Formatting statement, or I'm not doing enough in the plugin.

Location:
main/trunk/greenstone2/perllib
Files:
2 modified

Legend:

Unmodified
Added
Removed
  • main/trunk/greenstone2/perllib/plugins/UnknownConverterPlugin.pm.bak

    r31745 r31757  
    4545# At present, a file or folder of files is assumed. 
    4646# Need to look in there for files with extension process_ext. 
     47# Support html_multi as output? Then a folder of html files is generated per document? OR Flag that indicates whether an html file + associated folder (such as of images) gets generated. And name of assoc folder. Such output gets generated for instance when a doc file is replaced by its html version. 
    4748 
    4849sub BEGIN { 
     
    7879    'desc' => "{UnknownConverterPlugin.output_file_or_dir_name}", 
    7980    'type' => "string", 
    80     'reqd' => "yes", 
     81    'reqd' => "no", 
    8182    'deft' => "" } ]; 
    8283 
     
    9697    push(@{$hashArgOptLists->{"OptList"}},$options); 
    9798 
     99    my $unknown_converter_self = new UnknownPlugin($pluginlist, $inputargs, $hashArgOptLists); 
    98100    my $cbf_self = new ConvertBinaryFile($pluginlist, $inputargs, $hashArgOptLists); 
    99     my $unknown_converter_self = new UnknownPlugin($pluginlist, $inputargs, $hashArgOptLists); 
    100     my $self = BaseImporter::merge_inheritance($cbf_self, $unknown_converter_self); 
     101     
     102    # Need to feed the superclass plugins to merge_inheritance() below in the order that the 
     103    # superclass plugins were declared in the ISA listing earlier in this file: 
     104    my $self = BaseImporter::merge_inheritance($unknown_converter_self, $cbf_self); 
    101105 
    102106    $self = bless $self, $class; 
    103107 
    104 $self->{'convert_to'} = "text"; # why do I have to set a value for convert_to here, when a default's already set at start of this file??????? 
     108my $outhandle = $self->{'outhandle'}; 
     109    print STDERR "\n\n**** convert_to is |" . $self->{'convert_to'} . "|\n\n"; 
     110    if(!defined $self->{'convert_to'}) { 
     111    $self->{'convert_to'} = "text"; # why do I have to set a value for convert_to here, when a default's already set at the start of this file??????? 
     112    } 
    105113 
    106114    # Convert_To set up, including secondary_plugins for processing the text or html generated 
     
    173181    my $plugin_name = $self->{'plugin_type'}; # inherited from BaseImporter 
    174182 
     183    #### COPIED FROM ConvertBinaryFile::tmp_area_convert_file() 
     184    my $outhandle = $self->{'outhandle'}; 
     185    my $convert_to = $self->{'convert_to'}; 
     186    my $failhandle = $self->{'failhandle'}; 
     187    my $convert_to_ext = $self->{'convert_to_ext'}; #set by ConvertBinaryFile::set_standard_convert_settings() 
     188     
     189 
     190    my $upgraded_input_filename = &util::upgrade_if_dos_filename($input_filename); 
     191 
     192    # derive tmp filename from input filename 
     193    my ($tailname, $dirname, $suffix) 
     194    = &File::Basename::fileparse($upgraded_input_filename, "\\.[^\\.]+\$"); 
     195 
     196    # softlink to collection tmp dir 
     197    my $tmp_dirname = &util::get_timestamped_tmp_folder(); 
     198    if (defined $tmp_dirname) { 
     199    $self->{'tmp_dir'} = $tmp_dirname; 
     200    } else { 
     201    $tmp_dirname = $dirname; 
     202    } 
     203     
     204#    # convert to utf-8 otherwise we have problems with the doc.xml file later on 
     205#    my $utf8_tailname = (&unicode::check_is_utf8($tailname)) ? $tailname : $self->filepath_to_utf8($tailname); 
     206 
     207    # make sure filename to be used can be stored OK in a UTF-8 compliant doc.xml file 
     208     my $utf8_tailname = &unicode::raw_filename_to_utf8_url_encoded($tailname); 
     209 
     210 
     211    # URLEncode this since htmls with images where the html filename is utf8 don't seem 
     212    # to work on Windows (IE or Firefox), as browsers are looking for filesystem-encoded 
     213    # files on the filesystem. 
     214    $utf8_tailname = &util::rename_file($utf8_tailname, $self->{'file_rename_method'}, "without_suffix"); 
     215 
     216    my $lc_suffix = lc($suffix); 
     217    my $tmp_filename = &FileUtils::filenameConcatenate($tmp_dirname, "$utf8_tailname$lc_suffix"); 
     218     
     219    # If gsdl is remote, we're given relative path to input file, of the form import/utf8_tailname.suffix 
     220    # But we can't softlink to relative paths. Therefore, we need to ensure that 
     221    # the input_filename is the absolute path, see http://perldoc.perl.org/File/Spec.html 
     222    my $ensure_path_absolute = 1; # true 
     223    &FileUtils::softLink($input_filename, $tmp_filename, $ensure_path_absolute); 
     224    my $verbosity = $self->{'verbosity'}; 
     225    if ($verbosity > 0) { 
     226    print $outhandle "Converting $tailname$suffix to $convert_to format with extension $convert_to_ext\n"; 
     227    } 
     228 
     229    my $errlog = &FileUtils::filenameConcatenate($tmp_dirname, "err.log"); 
     230     
     231  
     232    my $output_type=$self->{'convert_to'}; 
     233 
     234    # store the *actual* output type and return the output filename 
     235    # it's possible we requested conversion to html, but only to text succeeded 
     236    #$self->{'convert_to_ext'} = $output_type; 
     237    if ($output_type =~ /html/i) { 
     238    $self->{'converted_to'} = "HTML"; 
     239    } elsif ($output_type =~ /te?xt/i) { 
     240    $self->{'converted_to'} = "Text"; 
     241    } elsif ($output_type =~ /item/i || $output_type =~ /^pagedimg/){ 
     242    $self->{'converted_to'} = "PagedImage"; 
     243    } 
     244     
     245    my $output_filename = $tmp_filename; 
     246    my $output_dirname; 
     247    if ($output_type =~ /item/i || $output_type =~ /^pagedimg/) { 
     248    # running under windows 
     249    if ($ENV{'GSDLOS'} =~ /^windows$/i) { 
     250        $output_dirname = $tmp_dirname . "\\$utf8_tailname\\" . $utf8_tailname; 
     251    } else { 
     252        $output_dirname = $tmp_dirname . "\/$utf8_tailname\/" . $utf8_tailname; 
     253    } 
     254    $output_filename .= ".item"; 
     255    } else { 
     256    $output_filename =~ s/$lc_suffix$/.$output_type/; 
     257    } 
     258 
     259    #### END COPIED FROM ConvertBinaryFile::tmp_area_convert_file() 
     260 
     261    # Execute the conversion command and get the type of the result, 
     262    # making sure the converter gives us the appropriate output type 
     263 
    175264    # On Linux: if the program isn't installed, $? tends to come back with 127, in any case neither 0 nor 1. 
    176265    # On Windows: echo %ERRORLEVEL% ends up as 9009 if the program is not installed. 
     
    178267    # should produce either a text file or output to stdout. 
    179268 
    180     my $outhandle=$self->{'outhandle'}; 
    181  
    182269    my $cmd = $self->{'exec_cmd'}; 
    183270    if(!$cmd) { # empty string for instance 
    184     print $outhandle "$plugin_name Conversion error: invalid cmd $cmd\n"; 
     271    print $outhandle "$plugin_name Conversion error: a command to execute is required, cmd provided is |$cmd|\n"; 
    185272    return ""; 
    186273    } 
    187274 
    188     # replace occurrences of '*' placeholder in cmd string with input filename 
    189     my ($tailname, $dir, $suffix) = &File::Basename::fileparse($input_filename, "\\.[^\\.]+\$"); 
    190     $cmd =~ s/\*/$tailname/g; 
    191     print STDERR "@@@@ $plugin_name: executing conversion cmd $cmd\n"; 
     275    # HARDCODING CMD FOR NOW 
     276    #$cmd ="/Scratch/ak19/gs3-svn-15Nov2016/packages/jre/bin/java -cp \"/Scratch/ak19/gs3-svn-15Nov2016/gs2build/ext/pdf-box/lib/java/pdfbox-app.jar\" -Dline.separator=\"<br />\" org.apache.pdfbox.ExtractText -html \"/Scratch/ak19/tutorial_sample_files/pdfbox/A9-access-best-practices.pdf\" \"/Scratch/ak19/gs3-svn-15Nov2016/pdf-tmp/1.html\""; 
     277 
     278    #$cmd ="/Scratch/ak19/gs3-svn-15Nov2016/packages/jre/bin/java -cp \"/Scratch/ak19/gs3-svn-15Nov2016/gs2build/ext/pdf-box/lib/java/pdfbox-app.jar\" -Dline.separator=\"<br />\" org.apache.pdfbox.ExtractText -html INPUT_FILE OUTPUT"; 
     279 
     280    # replace occurrences of placeholders in cmd string 
     281    #$cmd =~ s@\"@\\"@g; 
     282    $cmd =~ s@INPUT_FILE@\"$input_filename\"@g; 
     283    if(defined $output_dirname) { 
     284    $cmd =~ s@OUTPUT@\"$output_dirname\"@g; 
     285    } else { 
     286    $cmd =~ s@OUTPUT@\"$output_filename\"@g; 
     287    }    
     288 
     289    print STDERR "@@@@ $plugin_name: executing conversion cmd \n|$cmd|\n"; 
     290    print STDERR "   on infile |$input_filename|\n"; 
     291    print STDERR "   to produce expected $output_filename\n"; 
    192292    my $status = system($cmd); 
    193293 
     
    202302    } 
    203303 
    204     my $output_file_or_dir = $self->{'output_file_or_dir_name'}; 
    205     if (!-e $output_file_or_dir) { 
    206     print $outhandle "$plugin_name Conversion error: Output file/dir $output_file_or_dir doesn't exist\n"; 
     304    # remove symbolic link to original file 
     305    &FileUtils::removeFiles($tmp_filename); 
     306 
     307 
     308    if(defined $output_dirname && -d $output_dirname) { 
     309    print $outhandle "$plugin_name Conversion error: Output directory $output_dirname doesn't exist\n"; 
    207310    return ""; 
    208311    } 
     312    elsif (!-e $output_filename) { 
     313    print $outhandle "$plugin_name Conversion error: Output file $output_filename doesn't exist\n"; 
     314    return ""; 
     315    } 
    209316 
    210317    # else, conversion success 
    211318     
    212319    # if multiple images were generated by running the conversion 
    213     if ($self->{'convert_to'} eq "pagedimg") { 
    214     my $item_filename = $self->generate_item_file($output_file_or_dir); 
    215     return $item_filename; 
    216     } 
    217  
    218     return $output_file_or_dir; 
     320    if ($self->{'convert_to'} =~ /^pagedimg/) { 
     321    my $item_filename = $self->generate_item_file($output_filename); #my $item_filename = $self->generate_item_file($output_file_or_dir); 
     322 
     323    if (!-e $item_filename) { 
     324        print $outhandle "$plugin_name Conversion error: Item file $item_filename was not generated\n"; 
     325        return ""; 
     326    }    
     327    $output_filename = $item_filename; 
     328    } 
     329 
     330    $self->{'output_dirname'} = $output_dirname; 
     331    $self->{'output_filename'} = $output_filename; 
     332     
     333    return $output_filename; #$output_file_or_dir; 
    219334} 
    220335 
     
    230345    return undef unless $self->can_process_this_file($filename_full_path); 
    231346     
    232     my $output_file_or_dir = $self->{'output_file_or_dir_name'}; 
    233     my $is_output_dir = (-d $output_file_or_dir) ? 1 : 0; 
     347    my $is_output_dir = (defined $self->{'output_dirname'}) ? 1 : 0; 
    234348 
    235349    # we are only doing something special if we have a directory of html files 
    236     if (!$is_output_dir || $self->{'convert_to'} ne "html") { 
     350    #if ($is_output_dir || $self->{'convert_to'} ne "html") { 
     351    if ($self->{'convert_to'} ne "html_multi") { 
    237352    return $self->BaseImporter::read(@_); # no read in ConvertBinaryFile.pm 
    238353    } 
     
    320435    # deleted some commented out code here that exists in PowerPointPlugin 
    321436 
    322     # for UnknownConverterPlugin, don't delete any temp files that the conversion may have created 
    323     # as we don't know where it was created 
    324     #$self->clean_up_after_doc_obj_processing(); 
     437    # for UnknownConverterPlugin, don't delete any temp files that the conversion may have created? 
     438    # as we don't know where it was created. No. Now creating in tmp. 
     439    $self->clean_up_after_doc_obj_processing(); 
    325440 
    326441 
     
    333448sub read_into_doc_obj { 
    334449    my $self = shift (@_); 
    335     $self->ConvertBinaryFile::deinit(@_); 
     450    $self->ConvertBinaryFile::read_into_doc_obj(@_); 
    336451} 
    337452 
  • main/trunk/greenstone2/perllib/strings.properties

    r31754 r31757  
    12711271TextPlugin.title_sub:Substitution expression to modify string stored as Title. Used by, for example, PostScriptPlugin to remove "Page 1" etc from text used as the title. 
    12721272 
    1273 UnknownConverterPlugin.desc:If you have a custom conversion tool installed that you're able to run from the command line to convert from an unsupported document format to either text or HTML, provide that command to this Plugin and it will run the command for you, capturing the output for indexing by Greenstone, making your document searchable. Use * as placeholder for input file name, but specify suffix of file to be converted (and also of any output file generated, if a file and not dir of files is generated). 
    1274  
    1275 UnknownConverterPlugin.exec_cmd:Command line command string to execute that will do the conversion. 
     1273UnknownConverterPlugin.desc:If you have a custom conversion tool installed that you're able to run from the command line to convert from an unsupported document format to text, HTML or a series of images in jpg, png or gif form, then provide that command to this Plugin. It will then run the command for you, capturing the output for indexing by Greenstone, making any documents that aren't converted to images searchable. Set the process_extension to the suffix of files to be converted. Set convert_to to be the output format that the conversion command will generate, which will determine the output file's suffix. Use INPUT_FILE and OUTPUT as place holders in the command, which Greenstone will replace. It will pass in the full path to each file that matches the process_extension suffix in turn as INPUT_FILE. OUTPUT will be replaced with a path in the temporary folder of the output file with suffix determined by the value of convert_to. If convert_to is a pagedimg type, Greenstone sets OUTPUT to be a directory to contain the expected files and will create an item file collating the parts of the document. 
     1274 
     1275UnknownConverterPlugin.exec_cmd:Command line command string to execute that will do the conversion. Quoted elements need to have the quotes escaped with a backslash to preserve them. 
    12761276 
    12771277UnknownConverterPlugin.output_file_or_dir_name: Full pathname of the output file or of the directory (of output files) that get generated by the conversion