Changeset 2352

Timestamp:

2001-05-02T14:01:55+12:00

Author:

jrm21

Message:

removed crappy heuristical code that tried to check for extractable text
first, due to pdftohtml.bin being updated (and not breaking (much) anymore).

File:

1 edited

trunk/gsdl/bin/script/pdftohtml.pl (modified) (5 diffs)

Legend:

  (space) Unmodified
  + Added
  - Removed

trunk/gsdl/bin/script/pdftohtml.pl

-              r2346
+              r2352
 sub print_usage {
+# note - we don't actually ever use most of these options...
 print STDERR
-    ("pdftohtml version 0.22\n",
+    ("pdftohtml version 0.22 - modified for NZDL use\n",
      "Usage: pdftohtml [options] <PDF-file> <html-file>\n",
      "  -f <int>      : first page to convert\n",
 …
      "  -h            : print this usage information\n",
      "  -p            : exchange .pdf links by .html\n",
-     "  -c            : generate complex HTML document\n",
-     "  -F            : don't use frames in HTML document\n",
+# these options now have no effect in gs-custom pdftohtml.bin
+#     "  -c            : generate complex HTML document\n",
+#     "  -F            : don't use frames in HTML document\n",
      "  -i            : ignore images\n",
      "  -e <string>   : set extension for images (in the Html-file) (default png)\n"
 …
     my (@ARGV) = @_;
     my ($first,$last,$target_dir,$out_file,$img_ext,
-    $optq,$opth,$optp,$optc,$optF,$opti);
+    $optq,$opth,$optp,$optF,$opti);
     # read command-line arguments so that
 …
              'h', \$opth,
              'p', \$optp,
-             'c', \$optc,
+#            'c', \$optc,
              'F', \$optF,
              'i', \$opti
 …
     }
-    # Heuristical code added by John McPherson to attempt to reject
-    # PDF's with no text in them.... based entirely on observation. We
-    # should really read the PDF specifications someday...
-    open (PDFIN, $input_filename) ||
-    die "Error: unable to open $input_filename for reading\n";
+    # Heuristical code removed due to pdftohtml.bin being "fixed" to not
+    # create bitmaps for each char in some pdfs. However, this means we
+    # now create .html files even if we can't extract any text. We should
+    # check for that now instead someday...
-    my $found_text_object=0;
-    my $num_objects=0;
-    my $non_text_objects=0;
-    my $unenc_stream_objects=0;
-    my $line;
-    while (!$found_text_object && ($_=<PDFIN>)) {
-    s/\r/\n/g;
-    if (/^\d+ \d+ obj/ms) {
-        # start of new object
-        my $object="";
-        $num_objects++;
-        while (! eof && ! /(>>\s*)?endobj/) {
-        $object.=$_;
-        $_=<PDFIN>;
-        }
-        if (!defined $_) {$_="";} # we've hit end of file in a funny place.
-        # we've got to the end of the current PDF object.
-        $object.=$_;
-        # remove newline chars, to help our pattern matching for whitespace
-        $object =~ s/\n/ /gs;
-        #determine object type...
-        $_=$object;
-# for PDFWriter , and pdflatex and distill. Eg:
-# "12 0 obj << /Length 13 0 R /Filter /LZWDecode >> stream ..."
-# Ie this looks like compressed text....
-        if (/\d+\s+\d+\s+obj\s+<<\s+\/Length\s+\d+\s+\d+\s*.\s*\/Filter/) {
-        $found_text_object=1;
-        }
-        # For pdflatex or ps2pdf from dvi->ps:
-        # if we are setting a font, then following object is probably text
-        # Eg "obj << /Font" or "obj << /ProcSet [...] /Font"
-        elsif (/obj\s*<<\s*(\/ProcSet \[.+?\]\s*)?\/Font /s) {
-        $found_text_object=1;
-        }
-        # Unencoded streams. Eg
-        # "<< /Length 45 0 R >> stream BT /R43 8.96638 Tf 1..."
-        elsif (/<<\s+\/Length\s+\d+\s+\d+\s+R\s+>>\s+stream\s+(q\s)?BT\s/s)
-        {
-        $unenc_stream_objects++;
-        }
-        # (some) non-text objects
-        elsif (/<<.*\/(Type).*>>/s) {
-        $non_text_objects++;
-        }
-    } else { # not in an object...
-        # header? footer?
-#       print $_;
-    }
-    if ($found_text_object) {close PDFIN;}
-    } # end of while
-    close PDFIN;
-    # decide whether to accept or reject...
-    # some of these numbers are completely arbitrary based on a few .pdfs.
-    if ( ($found_text_object > 0) ||
-     ($num_objects<=1500 && $unenc_stream_objects > 5)
-     )
-    {
-    # accept this .pdf. Currently do nothing except fall through...
-    } else {
-    # reject this .pdf.
-    print STDERR "pdftohtml.pl: $input_filename appears to have no ";
-    print STDERR "textual data. Aborting.\n";
-    # print STDERR "num: $unenc_stream_objects and $non_text_objects from $num_objects\n";
-    exit(1);
-    }
     # formulate the command
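
The replacement comment in this changeset notes that .html files are now created even when no text could be extracted, and that this should be checked "someday". One way such a check might look is sketched below. It is not part of pdftohtml.pl: the names `html_has_text` and `file_has_text` are hypothetical, and the sketch assumes that stripping comments, script/style blocks, the head section, and all remaining tags is a good enough approximation of "visible text" for the HTML that pdftohtml.bin produces.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from pdftohtml.pl): returns 1 if
# the HTML string contains any non-whitespace text once markup is stripped,
# and 0 otherwise.
sub html_has_text {
    my ($html) = @_;
    return 0 unless defined $html;

    # Drop comments, script/style blocks and the head section, since none
    # of these hold text extracted from the PDF body.
    $html =~ s/<!--.*?-->//gs;
    $html =~ s/<(script|style)\b.*?<\/\1\s*>//gis;
    $html =~ s/<head\b.*?<\/head\s*>//gis;

    # Replace every remaining tag with a space.
    $html =~ s/<[^>]*>/ /g;

    # Treat non-breaking-space entities as plain whitespace.
    $html =~ s/&nbsp;|&#160;/ /gi;

    return ($html =~ /\S/) ? 1 : 0;
}

# Hypothetical wrapper: slurp a generated .html file and test it.
sub file_has_text {
    my ($path) = @_;
    open(my $fh, '<', $path)
        or die "Error: unable to open $path for reading\n";
    local $/;    # slurp mode: read the whole file at once
    my $html = <$fh>;
    close($fh);
    return html_has_text($html);
}
```

A caller could run `file_has_text` on the output file after the conversion command finishes and exit non-zero when it returns 0, which would restore the old "no textual data" rejection without inspecting PDF objects directly.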
