Changeset 25965

2012-07-17T18:42:10+12:00 (11 years ago)

Adding a new tutorial that Dr Bainbridge said would be useful for people wishing to set up the PDFBox extension for GS to work with newer PDF document versions.

1 edited


  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r25964 r25965  
    1088 1088<Text id="0322a"><b>Rebuild</b> and <b>preview</b> the collection. Now the <AutoText key="coredm::_Global:labelCreator_" type="italics"/> list classifies documents based on the first author appearing in the <AutoText key="metadata::dc.Creator"/> metadata.</Text>
    1089 1089<Text id="0322b">If you set the <AutoText text="metadata"/> field of <AutoText text="AZCompactList"/> to <Format><AutoText key="metadata::dc.Creator" type="plain"/><AutoText text=",ex.Creator" type="plain"/></Format> in the <TutorialRef id="word_pdf_collection"/> exercise, now the <AutoText key="coredm::_Global:labelCreator_" type="italics"/> list will classify based on the first author appearing in either the <AutoText key="metadata::dc.Creator"/> metadata or the <AutoText key="metadata::ex.Creator"/> metadata.</Text>
     1093<Tutorial id="pdfbox-extension">
     1095<Text id="pdfbox-ext-0">Setting up the PDFBox extension to process newer versions of PDF</Text>
     1097<Prerequisite id="word_pdf_collection"/>
     1098<Version initial="2.85" current="2.85"/>
     1100<Text id="pdfbox-ext-1">Greenstone's PDFPlugin can handle older versions of PDF but, by default, cannot cope with newer PDF files. However, a Greenstone extension that uses <b>PDFBox</b>, an open-source PDF conversion tool, is available if you want Greenstone to extract text from more recent PDF files. This tutorial covers how to install the PDFBox extension for Greenstone and how to switch on its functionality in the Greenstone Librarian Interface.</Text>
     1103<Heading>Obtaining and installing the PDFBox extension for Greenstone</Heading>
     1105<Text id="pdfbox-ext-2">The wiki release notes for the Greenstone binary you downloaded contain the download link to the PDFBox extension that works with that binary. If you want to try the most up-to-date version of the extension and you are on Windows, visit <Link></Link> and download the zip archive from there. If you are working on a *nix machine, you may prefer to download the compressed tar file of the same extension from <Link></Link>.</Text>
     1108<Text id="pdfbox-ext-3">Move the downloaded file into your Greenstone installation's <Format>ext</Format> folder.</Text>
     1111<Text id="pdfbox-ext-4">You will now need to decompress the file you downloaded in this location.</Text>
     1112<Text id="pdfbox-ext-5">To do so on Windows XP, right-click on the file, choose <b>Extract All...</b> and go through the Extraction wizard. On Windows Vista and 7, double-clicking on the zip file will open an Explorer window showing you its contents. Click on an empty part inside that window and choose <b>Extract All...</b> to extract its contents. On Linux, to decompress the tar.gz file, run the command:</Text>
     1113<Format>tar -xvzf &lt;tar file name&gt;</Format>
     1114<Text id="pdfbox-ext-6">If all goes well, you will have a folder called <Format>pdf-box</Format> inside your Greenstone installation's <Format>ext</Format> folder.</Text>
     1116<Heading>Turning on the PDFBox extension functionality in GLI</Heading>
     1118<Text id="pdfbox-ext-7">Before you can use the extension, make sure that all instances of GLI, the Greenstone Librarian Interface, are closed.</Text>
     1120<Text id="pdfbox-ext-8">Note that if you were running GLI through a console, you will need to open a fresh console and run the setup script again. This sets up the Greenstone environment once more, this time taking the presence of the PDFBox extension into account.</Text>
     1121<Text id="pdfbox-ext-9">To run the setup script, your console's current directory must be your Greenstone installation directory. From there, run <Format>setup.bat</Format> if you're on Windows, or <Format>source setup.bash</Format> if you're on Linux.</Text>
     1125<Text id="pdfbox-ext-10">Launch GLI once more, in the manner you're accustomed to. On Windows, the easiest way is the shortcut to GLI available through the Windows <b>Start</b> menu.</Text>
     1128<Text id="pdfbox-ext-11">Now that you've installed the PDFBox extension, it is available as an option in the PDFPlugin's configuration dialog. To turn on the PDFBox extension for any collection you open in GLI, go to the <AutoText key="glidict::GUI.Design"/> panel, select <AutoText key="glidict::CDM.GUI.Plugins"/> on the left and, on the right, double-click the <AutoText text="PDFPlugin"/> (alternatively, select this plugin and click the <AutoText key="glidict::CDM.PlugInManager.Configure" type="button"/> below) to open the dialog for configuring this plugin. In the <AutoText key="glidict::CDM.PlugInManager.Configure"/> dialog, scroll down to the <AutoText text="AutoLoadConverters"/> section and select the checkbox next to the <AutoText text="pdfbox_conversion"/> option. Click <AutoText key="glidict::General.OK"/> to close the dialog, switch to the <AutoText key="glidict::GUI.Create"/> panel and rebuild your collection. This time, PDF files will be processed by PDFBox, which will extract their text.</Text>
     1129<Text id="pdfbox-ext-12">Try this feature out on a collection of recent PDF files by configuring its PDFPlugin with the <AutoText text="pdfbox_conversion"/> option turned on.</Text>
     1130<Text id="pdfbox-ext-13">You can also experiment by configuring the PDFPlugin used in the <b>Reports</b> collection, although that collection contains old PDF versions which the default settings of <AutoText text="PDFPlugin"/> can already process successfully. If you do decide to test the PDFBox extension with the <b>Reports</b> collection, rebuild it and preview it. Once you've inspected the results, you may wish to go back to the <AutoText key="glidict::GUI.Design"/> panel, turn off <AutoText text="pdfbox_conversion"/> and rebuild the collection once more, so that it is back in its original state and ready for future tutorials.</Text>
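
The installation steps described in the new tutorial (move the downloaded archive into the installation's ext folder, decompress it there, then check for the pdf-box folder) could be scripted roughly as below. This is only a sketch and not part of the changeset: the function name install_extension and the paths in the usage note are hypothetical, and it assumes the archive unpacks to a top-level pdf-box folder, as the tutorial text states.

```python
import shutil
from pathlib import Path


def install_extension(archive: Path, greenstone_home: Path) -> Path:
    """Move a downloaded extension archive into <greenstone_home>/ext and unpack it.

    Hypothetical helper, not part of Greenstone. Returns the path of the
    resulting pdf-box folder.
    """
    ext_dir = greenstone_home / "ext"
    if not ext_dir.is_dir():
        raise FileNotFoundError(f"no ext folder found at {ext_dir}")

    # Step 1: move the downloaded file into the ext folder (skip if already there).
    target = ext_dir / archive.name
    if archive.resolve() != target.resolve():
        shutil.move(str(archive), str(target))

    # Step 2: decompress it in that location. unpack_archive picks the
    # format (zip or tar.gz) from the file name.
    shutil.unpack_archive(str(target), extract_dir=str(ext_dir))

    # Step 3: check for the folder the tutorial says should now exist.
    pdfbox_dir = ext_dir / "pdf-box"
    if not pdfbox_dir.is_dir():
        raise RuntimeError(f"expected {pdfbox_dir} after extraction")
    return pdfbox_dir
```

A call might look like install_extension(Path("pdf-box.tar.gz"), Path("/path/to/greenstone")), with both paths adjusted to the local setup. GLI still has to be closed and the setup script re-run afterwards, as the tutorial describes.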