package prescript

#######################################################################
# java images/scripts
#######################################################################

# the _javalinks_ macros are the flashy image links at the top right of
# the page.
_javalinks_ {_imagehome_}
_javalinks_ [v=1] {
_imagehome_
}

#######################################################################
# icons
#######################################################################

_iconhpscrpt_ {PreScript offers:
PostScript conversion to plain ASCII or HTML.
PreScript is primarily a PostScript to plain text converter, but it can also produce rudimentary HTML. Tags are inserted to mark paragraphs (<p>), short lines (<br>), page breaks (<hr>), and headers and footers (italicized with <i>...</i>).
Paragraph boundary detection.
PreScript determines the line spacing of a document and uses it, together with indentation, to determine paragraph boundaries.
Hyphenation removal.
Words hyphenated across line breaks are rejoined.
Ligature translation.
Most ligatures used by TeX documents are detected. Because PreScript doesn't track font changes, it cannot reliably detect all ligatures.
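
The paragraph boundary detection described above can be illustrated with a minimal sketch. It is not PreScript's actual code: the (y, x, text) line tuples, the function name split_paragraphs, and both thresholds are assumptions made for the example. It takes the most common gap between consecutive lines as the document's line spacing, then starts a new paragraph wherever the gap is clearly larger or the line is indented from the left margin.

```python
from collections import Counter

def split_paragraphs(lines, gap_factor=1.5, indent_tol=5.0):
    """Group (y, x, text) lines into paragraphs.

    y grows down the page and x is the left edge of the line. The tuple
    layout and both thresholds are assumptions of this sketch.
    """
    if not lines:
        return []
    gaps = [round(b[0] - a[0]) for a, b in zip(lines, lines[1:])]
    # Dominant line spacing: the most common gap between consecutive lines.
    spacing = Counter(gaps).most_common(1)[0][0] if gaps else 0
    margin = min(line[1] for line in lines)
    paragraphs = [[lines[0][2]]]
    for prev, cur in zip(lines, lines[1:]):
        big_gap = spacing > 0 and (cur[0] - prev[0]) > gap_factor * spacing
        indented = cur[1] - margin > indent_tol
        if big_gap or indented:
            paragraphs.append([cur[2]])      # start a new paragraph
        else:
            paragraphs[-1].append(cur[2])    # continue the current one
    return [" ".join(p) for p in paragraphs]

sample = [
    (0, 10, "A first"),
    (12, 10, "line."),
    (24, 20, "Indented"),
    (36, 10, "start."),
    (60, 10, "After gap."),
]
result = split_paragraphs(sample)
print(result)
```

A real converter would also have to cope with multi-column layouts and with headers and footers, which this sketch ignores.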

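Hyphenation removal and ligature translation, as listed in the features above, can be sketched as follows. This is an illustration under stated assumptions, not PreScript's implementation: the ligature table uses the character slots of TeX's OT1 text fonts (such as Computer Modern), and the names LIGATURES, expand_ligatures, and join_hyphenated are hypothetical. Ligatures are expanded first because several of those slots are control characters that a later whitespace strip would discard.

```python
import re

# Ligature slots of TeX's OT1 text fonts. Whether PreScript uses this
# exact table is an assumption of the sketch.
LIGATURES = (
    ("\x0b", "ff"),
    ("\x0c", "fi"),
    ("\x0d", "fl"),
    ("\x0e", "ffi"),
    ("\x0f", "ffl"),
)

def expand_ligatures(text):
    """Replace OT1 ligature character codes with their letters."""
    for code, letters in LIGATURES:
        text = text.replace(code, letters)
    return text

def join_hyphenated(lines):
    """Join lines, merging a word split by a hyphen at a line end."""
    out = ""
    for line in lines:
        line = line.strip()
        if re.search(r"[a-z]-$", out) and re.match(r"[a-z]", line):
            out = out[:-1] + line    # drop the hyphen and glue the halves
        elif out:
            out += " " + line
        else:
            out = line
    return out

raw = ["The conver-", "sion is e\x0ecient and sim-", "ple."]
clean = join_hyphenated([expand_ligatures(line) for line in raw])
print(clean)
```

The hyphen rule is deliberately naive: a genuine compound such as "well-known" that happens to break at its hyphen would be joined incorrectly, so a fuller version would need a word list or similar check.
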
Installing PreScript

PreScript is written in PostScript and Python. You will need Ghostscript (at least version 4.01) and the Python interpreter (at least version 1.4).

The PreScript 0.1 distribution

This distribution is the most stable - it is what you should use to do real work.

The PreScript 2 distribution

This is a beta release of our latest version. This version is a lot cleaner and faster; it is also extensible (users can write their own renderers), better documented, and better at predicting line, paragraph, and page breaks. If you notice any bugs, want to request new features, or want to become a beta tester, please email the New Zealand Digital Library administrator.

Running PreScript

Usage:
prescript format input [output]

Bugs

Please report bugs to the New Zealand Digital Library administrator.


Notes

PreScript is a port of a Perl program used by the New Zealand Digital Library project to convert computer science technical reports to HTML. The Perl version is deemed unfit for public release because the code is quite messy (a consequence of Perl's cumbersome syntax for defining objects). The Python version is considerably easier to understand, maintain, and extend. The technical paper prescript.ps.gz documents the algorithms and heuristics used in PreScript 0.1; an update to it for PreScript 2 is included in that version's distribution archive.


Other PostScript Converters

Here is a summary of other PostScript to text converters we found.
pstotext
From the DEC Virtual Paper research project. PostScript program and C program. Probably the best PostScript to text converter (after PreScript, of course).
ps2html, The Sequel
Developed at Johns Hopkins University to convert JHU journal articles to HTML. This converter attempts to preserve the formatting of the original PostScript document, but is tied to PostScript files generated with a specific package (QuarkXPress?). A table describing a number of parameters is used to aid conversion and can be modified for new formats. Uses a variation of Ghostscript's ps2ascii.ps.
ps2ascii.ps
Part of the Ghostscript distribution. ps2ascii.ps is considerably less robust than PreScript.
ps2a.sh
A PostScript program similar to Ghostscript's ps2ascii.ps.
ps2ascii.shar
A PostScript program and Perl script.
ps2ascii.pl
A Perl script that extracts parenthesized text from a PostScript file.
ps2txt
A standalone C program that extracts parenthesized text. Includes some special code to deal with dvips-generated files.
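
For readers unfamiliar with the "extracts parenthesized text" approach mentioned for the last two converters, here is a minimal sketch of the idea. It is not taken from any of the tools listed; the function name extract_strings is hypothetical. It scans PostScript source for (...) string literals, honouring nested parentheses, backslash escapes, and % comments. Octal and control-character escapes are not decoded, and hexadecimal <...> strings are ignored.

```python
def extract_strings(ps_source):
    """Return the contents of top-level (...) strings in PostScript source."""
    strings = []
    current = []
    depth = 0            # nesting depth of parentheses; 0 means outside a string
    in_comment = False
    i = 0
    while i < len(ps_source):
        ch = ps_source[i]
        if depth == 0:
            if in_comment:
                in_comment = ch != "\n"   # a comment runs to the end of the line
            elif ch == "%":
                in_comment = True
            elif ch == "(":
                depth = 1
                current = []
        elif ch == "\\" and i + 1 < len(ps_source):
            # Keep the character after the backslash as-is; escapes such
            # as octal codes are not decoded in this sketch.
            i += 1
            current.append(ps_source[i])
        elif ch == "(":
            depth += 1
            current.append(ch)
        elif ch == ")":
            depth -= 1
            if depth == 0:
                strings.append("".join(current))
            else:
                current.append(ch)
        else:
            current.append(ch)
        i += 1
    return strings

source = "% a (comment)\n(Hello) show (a \\(b\\) (c)) show"
found = extract_strings(source)
print(found)
```

This approach recovers only the raw strings, in file order. It knows nothing about where the text lands on the page, which is why converters that run the document through a PostScript interpreter tend to be more robust.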
}