source: trunk/gsdl/perllib/plugins/PDFPlug.pm@ 5096

Last change on this file since 5096 was 4873, checked in by mdewsnip, 21 years ago

Further work on standardising option descriptions. Specifically, in preparation for translating the option descriptions into other languages, all the option description strings have been moved into a "resource bundle" file (modelled on a Java resource bundle). This also has the advantage of reducing the number of duplicate descriptions. The option descriptions in the plugins, classifiers, mkcol.pl, import.pl and buildcol.pl have been replaced with keys into this resource bundle (perllib/strings.rb). When translating the strings in this file into a new language, the new resource bundle should be named strings_<language-code>.rb (where <language-code> is a combination of language and country, e.g. 'fr_FR' for the version of French spoken in France).

To support these changes, the PrintUsage module (perllib/printusage.pm) has new code for reading resource bundles and displaying the correct strings. Also, pluginfo.pl, classinfo.pl, mkcol.pl, import.pl and buildcol.pl have a new option (-language) for specifying the language code to display option descriptions in.

If a resource bundle for the specified language code does not exist, a generic resource bundle is used (strings.rb). This currently contains the English text descriptions. However, for users who always use Greenstone in another language, it would be easier to rename the standard file to strings_en_US.rb and rename the resource bundle of their desired language to strings.rb. This would mean they would not have to constantly specify their language with the -language option, since the default resource bundle will suit them.
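The fallback rule described above can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: the helper name choose_bundle and its arguments are hypothetical and are not taken from perllib/printusage.pm, whose actual lookup code is not shown here.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (NOT the code in printusage.pm): pick the resource
# bundle file for a language code, falling back to the generic strings.rb
# when no language-specific bundle exists in $dir.
sub choose_bundle {
    my ($dir, $language) = @_;

    if (defined $language && $language ne "") {
        my $specific = File::Spec->catfile($dir, "strings_$language.rb");
        return $specific if -e $specific;
    }

    # Generic bundle; in the scheme described above this holds the
    # English descriptions unless the user has swapped the files around.
    return File::Spec->catfile($dir, "strings.rb");
}
```

Under these assumptions, choose_bundle("perllib", "fr_FR") would return the path to strings_fr_FR.rb when that file exists, and the path to strings.rb otherwise.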

Currently, the encoding names (in encodings.pm) are not part of this scheme. These are displayed as part of BasPlug's input_encoding option. It is debatable whether these names would be worth translating into other languages.

Parse errors in plugins and classifiers currently cause them to display the usage information using the default resource bundle. It is likely that BasPlug will soon have an option added to specify the language for the usage information in this case. (Note that this does not include using pluginfo.pl or classinfo.pl to display usage information; these already have a -language option.)

  • Property svn:keywords set to Author Date Id Revision
File size: 7.1 KB
###########################################################################
#
# PDFPlug.pm -- reasonably with-it pdf plugin
# A component of the Greenstone digital library software
# from the New Zealand Digital Library Project at the
# University of Waikato, New Zealand.
#
# Copyright (C) 1999-2001 New Zealand Digital Library Project
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
###########################################################################

package PDFPlug;

use ConvertToPlug;

sub BEGIN {
    @ISA = ('ConvertToPlug');
}

my $arguments =
    [ { 'name' => "process_exp",
        'desc' => "{BasPlug.process_exp}",
        'type' => "string",
        'deft' => &get_default_process_exp(),
        'reqd' => "no" },
      { 'name' => "block_exp",
        'desc' => "{BasPlug.block_exp}",
        'type' => "string",
        'deft' => &get_default_block_exp() },
      { 'name' => "noimages",
        'desc' => "{PDFPlug.noimages}",
        'type' => "flag" },
      { 'name' => "complex",
        'desc' => "{PDFPlug.complex}",
        'type' => "flag" },
      { 'name' => "nohidden",
        'desc' => "{PDFPlug.nohidden}",
        'type' => "flag" },
      { 'name' => "zoom",
        'desc' => "{PDFPlug.zoom}",
        'deft' => "2",
        'type' => "int" },
      { 'name' => "use_sections",
        'desc' => "{PDFPlug.use_sections}",
        'type' => "flag" } ];

my $options = { 'name'     => "PDFPlug",
                'desc'     => "Reasonably with-it pdf plugin.",
                'inherits' => "yes",
                'args'     => $arguments };

sub new {
    my $class = shift (@_);

    my ($noimages, $complex, $zoom, $use_sections, $nohidden);

    if (!parsargv::parse(\@_,
                         q^noimages^, \$noimages,
                         q^complex^, \$complex,
                         q^zoom/\d+/2^, \$zoom,
                         q^nohidden^, \$nohidden,
                         q^use_sections/1?/^, \$use_sections,
                         "allow_extra_options")) {

        print STDERR "\nIncorrect options passed to PDFPlug, check your collect.cfg configuration file\n";
        local $self = new ConvertToPlug($class, @_, "-title_sub", '^(Page\s+\d+)?(\s*1\s+)?');
        $self->print_txt_usage(""); # Use default resource bundle
        die "\n";
    }


    my @args=@_;
    if ($use_sections) {
        push (@args, "-description_tags");
    }

    # following title_sub removes "Page 1" added by pdftohtml, and a leading
    # "1", which is often the page number at the top of the page. Bad Luck
    # if your document title actually starts with "1 " - is there a better way?

    my $self = new ConvertToPlug ($class, @args, "-title_sub", '^(Page\s+\d+)?(\s*1\s+)?');

    if ($use_sections) {
        $self->{'use_sections'}=1;
    }

    # 14-05-02 To allow for proper inheritance of arguments - John Thompson
    my $option_list = $self->{'option_list'};
    push( @{$option_list}, $options );

    # these are passed through to gsConvert.pl by ConvertToPlug.pm
    $self->{'convert_options'} = "-pdf_zoom $zoom";
    $self->{'convert_options'} .= " -pdf_complex" if $complex;
    $self->{'convert_options'} .= " -pdf_nohidden" if $nohidden;
    $self->{'convert_options'} .= " -pdf_ignore_images" if $noimages;

    # pdftohtml will always produce html files encoded as utf-8
    if ($self->{'input_encoding'} eq "auto") {
        $self->{'input_encoding'} = "utf8";
        $self->{'extract_language'} = 1;
    }

    return bless $self, $class;
}


# sub print_usage {
#     print STDERR "\n usage: plugin PDFPlug [options]\n\n";
#     print STDERR " options:\n";
#     print STDERR " -convert_to (html|text) Convert to TEXT or HTML (default html)\n";
#     print STDERR " -use_sections Create a separate section for each page\n";
#     print STDERR " of the PDF file.\n";
#     print STDERR " -noimages Don't attempt to extract images from PDF.\n";
#     print STDERR " -complex Create more complex output. With this option\n";
#     print STDERR " set the output html will look much more like\n";
#     print STDERR " the original PDF file. For this to function\n";
#     print STDERR " properly you need Ghostscript installed (for *nix\n";
#     print STDERR " gs should be on your path while for windows\n";
#     print STDERR " you must have gswin32c.exe on your path).\n";
#     print STDERR " -nohidden Prevent pdftohtml from attempting to extract\n";
#     print STDERR " hidden text. This is only useful if the -complex\n";
#     print STDERR " option is also set.\n";
#     print STDERR " -zoom The factor by which to zoom the PDF for output\n";
#     print STDERR " (this is only useful if -complex is set).\n\n";
# }



sub get_default_process_exp {
    my $self = shift (@_);

    return q^(?i)\.pdf$^;
}

# so we don't inherit HTMLPlug's block exp...
sub get_default_block_exp {
    return "";
}


# do plugin specific processing of doc_obj for HTML type
sub process {
    my $self = shift (@_);
    if ($self->{'use_sections'}
        && $self->{'converted_to'} eq "HTML") {

        print STDERR "PDFPlug: Calculating sections...\n";
        my $textref=$_[0];

        # we have "<a name=1></a>" etc for each page
        my @sections = split('<a name=', $$textref);

        shift @sections; # don't need HTML header, etc
        # handle first section specially for title? Or all use first 100...

        my $title = $sections[0];
        $title =~ s/^\d+>//; # specific for pdftohtml...
        $title =~ s/<\/([^>]+)><\1>//g; # (eg) </b><b> - no space
        $title =~ s/<[^>]*>/ /g;
        $title =~ s/(?:&nbsp;|\xc2\xa0)/ /g; # utf-8 for nbsp...
        $title =~ s/^\s+//s;
        $title =~ s/\s+$//;
        $title =~ s/\s+/ /gs;
        $title =~ s/^$self->{'title_sub'}// if ($self->{'title_sub'});
        $title =~ s/^\s+//s; # in case title_sub introduced any...
        $title = substr ($title, 0, 100);
        $title =~ s/\s\S*$/.../;

        my $top_section = "<!--<Section>\n<Metadata name=\"Title\">$title</Metadata>\n-->\n <!--</Section>-->\n";

        # add metadata per section...
        foreach my $section (@sections) {
            $section =~ s@^(\d+)></a>@@; # leftover from split expression...

            $title = $1; # Greenstone does magic if sections are titled digits
            if (! defined($title) ) {
                print STDERR "no title: $section\n";
            }
            my $newsection = "<!-- from PDFPlug -->\n<!-- <Section>\n";
            $newsection .= "<Metadata name=\"Title\">" . $title
                . "</Metadata>\n--><p>\n";
            $newsection .= $section;
            $newsection .= "<!--</Section>-->\n";
            $section = $newsection;
        }

        $$textref=join('', ($top_section, @sections));
    }

    my $outhandle = $self->{'outhandle'};
    print $outhandle "PDFPlug: passing $_[3] on to $self->{'converted_to'}Plug\n"
        if $self->{'verbosity'} > 1;

    return ConvertToPlug::process_type($self,"pdf",@_);
}

1;
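
To see what the title clean-up in process() does to a page fragment, the same substitutions can be run in isolation. The following is a minimal sketch, not part of PDFPlug.pm: the helper name clean_title and the sample input string are assumptions made up for illustration, while the regular expressions and the title_sub value are copied from the module above.

```perl
use strict;
use warnings;

# Hypothetical standalone helper mirroring the title clean-up steps in
# process(); it takes the first split section and the title_sub pattern.
sub clean_title {
    my ($title, $title_sub) = @_;

    $title =~ s/^\d+>//;                  # leftover of the '<a name=' split
    $title =~ s/<\/([^>]+)><\1>//g;       # e.g. </b><b>, joined without a space
    $title =~ s/<[^>]*>/ /g;              # remaining tags become spaces
    $title =~ s/(?:&nbsp;|\xc2\xa0)/ /g;  # nbsp entity or its utf-8 bytes
    $title =~ s/^\s+//s;
    $title =~ s/\s+$//;
    $title =~ s/\s+/ /gs;                 # collapse runs of whitespace
    $title =~ s/^$title_sub// if $title_sub;
    $title =~ s/^\s+//s;
    $title = substr($title, 0, 100);
    $title =~ s/\s\S*$/.../;              # drop the last word, append "..."
    return $title;
}

# Made-up sample of what the first section might look like after the split.
my $page = '1></a><b>Page 1</b><br> 1 Annual&nbsp;Report of the Library<br>';
print clean_title($page, '^(Page\s+\d+)?(\s*1\s+)?'), "\n";
# prints "Annual Report of the..."
```

Note that the final substitution replaces the last word with "..." even when the title is shorter than 100 characters, so short titles also lose their final word.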