Changeset 25996
- Timestamp: 2012-07-20T14:53:33+12:00
- File: 1 edited
documentation/trunk/tutorials/xml-source/tutorial_en.xml
r25966 r25996 1093 1093 <Tutorial id="pdfbox-extension"> 1094 1094 <Title> 1095 <Text id="pdfbox-ext-0"> Setting up the PDFBox extension to process newer versions of PDF</Text>1095 <Text id="pdfbox-ext-0">Processing newer versions of PDF with PDFBox</Text> 1096 1096 </Title> 1097 1097 <Prerequisite id="word_pdf_collection"/> … … 1126 1126 </NumberedItem> 1127 1127 <NumberedItem> 1128 <Text id="pdfbox-ext-11">Now that you've installed the PDFBox extension, this will be available as an option in the plugin's configuration dialog. To turn on the PDFBox extension for any collection you open in GLI, you would go to the <AutoText key="glidict::GUI.Design"/> panel, select <AutoText key="glidict::CDM.GUI.Plugins"/> from the left and on the right, double click the <Auto text text="PDFPlugin"/> (alternatively, select this plugin and click the <AutoText key="glidict::CDM.PlugInManager.Configure" type="button"/> below) to open the dialog to configure this plugin. In the <AutoText key="glidict::CDM.PlugInManager.Configure"/> dialog, scroll down to the section <Autotext text="AutoLoadConverters"/> and select the checkbox next to the <Autotext text="pdfbox_conversion"/> option. Click <AutoText key="glidict::General.OK"/> to close the dialog, switch to the <AutoText key="glidict::GUI.Create"/> panel and rebuild your collection. This time, PDF files will be processed by PDFBox which will extract their text.</Text>1129 <Text id="pdfbox-ext-12">Try this feature out on a collection of recent PDF files, by configuring its PDFPlugin with the <Auto text text="pdfbox_conversion"/> option turned on.</Text>1130 <Text id="pdfbox-ext-12">You can also experiment by configuring the PDFPlugin used in the <b>Reports</b> collection, although that one contains old PDF versions which the default settings of <Auto text text="PDFPlugin"/> can already process successfully. If you do decide to test out the PDFBox extension with the <b>Reports</b> collection, then rebuild it and preview it. 
However, once you've inspected the results, you may wish to go back to the <AutoText key="glidict::GUI.Design"/> panel and turn off <Autotext text="pdfbox_conversion"/> and rebuild the collection once more, so that it's back to its original state and ready for future tutorials.</Text>1128 <Text id="pdfbox-ext-11">Now that you've installed the PDFBox extension, this will be available as an option in the plugin's configuration dialog. To turn on the PDFBox extension for any collection you open in GLI, you would go to the <AutoText key="glidict::GUI.Design"/> panel, select <AutoText key="glidict::CDM.GUI.Plugins"/> from the left and on the right, double click the <AutoText text="PDFPlugin"/> (alternatively, select this plugin and click the <AutoText key="glidict::CDM.PlugInManager.Configure" type="button"/> below) to open the dialog to configure this plugin. In the <AutoText key="glidict::CDM.PlugInManager.Configure"/> dialog, scroll down to the section <AutoText text="AutoLoadConverters"/> and select the checkbox next to the <AutoText text="pdfbox_conversion"/> option. Click <AutoText key="glidict::General.OK"/> to close the dialog, switch to the <AutoText key="glidict::GUI.Create"/> panel and rebuild your collection. This time, PDF files will be processed by PDFBox which will extract their text.</Text> 1129 <Text id="pdfbox-ext-12">Try this feature out on a collection of recent PDF files, by configuring its PDFPlugin with the <AutoText text="pdfbox_conversion"/> option turned on.</Text> 1130 <Text id="pdfbox-ext-12">You can also experiment by configuring the PDFPlugin used in the <b>Reports</b> collection, although that one contains old PDF versions which the default settings of <AutoText text="PDFPlugin"/> can already process successfully. If you do decide to test out the PDFBox extension with the <b>Reports</b> collection, then rebuild it and preview it. 
However, once you've inspected the results, you may wish to go back to the <AutoText key="glidict::GUI.Design"/> panel and turn off <AutoText text="pdfbox_conversion"/> and rebuild the collection once more, so that it's back to its original state and ready for future tutorials.</Text> 1131 1131 </NumberedItem> 1132 1132 </Content> … … 1480 1480 </NumberedItem> 1481 1481 <NumberedItem> 1482 <Text id="assoc-files-8">In <AutoText key="glidict::CDM.GUI.Plugins"/>, select the <Auto text text="WordPlugin"/> and press the <AutoText key="glidict::CDM.PlugInManager.Configure" type="button"/> button.1482 <Text id="assoc-files-8">In <AutoText key="glidict::CDM.GUI.Plugins"/>, select the <AutoText text="WordPlugin"/> and press the <AutoText key="glidict::CDM.PlugInManager.Configure" type="button"/> button. 1483 1483 In the resulting popup, scroll down to find the associate_ext option, and set this option to <AutoText text="pdf" type="italics"/>.</Text> 1484 <Text id="assoc-files-9">Note 1: as this is an option that is categorized under the <Auto text text="BasePlugin"/> heading, it is therefore an option that is available across all the plugins provided by Greenstone. In our example, we happen to be binding a PDF document to a Word document, however it could equally be used to bind MP3 versions of files to PNG artwork of album covers.</Text>1485 <Text id="assoc-files-10">Note 2: More than one filename extension can be provided as part of this option, separated by a comma. For example, setting the value of the associate_ext in <Auto text text="TextPlugin"/> to <Autotext text="avi,png" type="italics"/> would allow both an AVI video file (say an oral history interview) and a PNG image (say a picture of the interviewee taken at the time of the recording) to bind to a text version of the document (say representing a transcript of the interview). 
Both AVI and PNG versions of the file can be present at the same time, or alternatively only one of the two file types need be present, or neither, and Greenstone will process the situation accordingly.</Text>1484 <Text id="assoc-files-9">Note 1: as this is an option that is categorized under the <AutoText text="BasePlugin"/> heading, it is therefore an option that is available across all the plugins provided by Greenstone. In our example, we happen to be binding a PDF document to a Word document, however it could equally be used to bind MP3 versions of files to PNG artwork of album covers.</Text> 1485 <Text id="assoc-files-10">Note 2: More than one filename extension can be provided as part of this option, separated by a comma. For example, setting the value of the associate_ext in <AutoText text="TextPlugin"/> to <AutoText text="avi,png" type="italics"/> would allow both an AVI video file (say an oral history interview) and a PNG image (say a picture of the interviewee taken at the time of the recording) to bind to a text version of the document (say representing a transcript of the interview). Both AVI and PNG versions of the file can be present at the same time, or alternatively only one of the two file types need be present, or neither, and Greenstone will process the situation accordingly.</Text> 1486 1486 <Text id="assoc-files-11">Note 3: The option <Format>associate_ext</Format> is in fact a simplified version of a more general option <Format>associate_tail_re</Format>. Using regular expression syntax, the latter provides a more powerful way of manipulating filenames. Rather than focus on just the filename extension, with <Format>associate_tail_re</Format>, one is able to group files together that share a similar filename root, but might start to differ in characters before the filename extension. 
For instance, the Word version of the document might be <Format>my-article.doc</Format> but the PDF version might be <Format>my-article-ver13.pdf</Format> reflecting the fact that the PDF file is saved in version 1.3 of this format. Using <Format>associate_tail_re</Format> (and a little bit of regular expression know-how!), such differences can be surmounted, and the two files still processed automatically as different versions of the same document.</Text> 1487 1487 </NumberedItem> 1488 1488 <NumberedItem> 1489 <Text id="assoc-files-12">If you're working with structured Word documents that contain formatted headings and you want better structured and formatted HTML versions of the documents to be generated by Greenstone from the Word format, optionally set the <Format>windows_scripting</Format> option for the <Auto text text="WordPlugin"/> if building on Windows, or turn on the <Format>open_office_scripting</Format> option if this extension has been added to your Greenstone installation and either OpenOffice or LibreOffice is available on your system.</Text>1490 <Text id="assoc-files-13">Optionally set the <Auto text text="level1_heading" type="italics"/> to <i>heading\s*1</i>, or whatever is appropriate for your documents if they use style information for headings that deviate from the norm for Word. Repeat as is needed for <Autotext text="level2_heading" type="italics"/> and so forth. 
For more details on how to control sections within a Word document, see the <TutorialRef id="enhanced_word"/> tutorial.</Text>1489 <Text id="assoc-files-12">If you're working with structured Word documents that contain formatted headings and you want better structured and formatted HTML versions of the documents to be generated by Greenstone from the Word format, optionally set the <Format>windows_scripting</Format> option for the <AutoText text="WordPlugin"/> if building on Windows, or turn on the <Format>open_office_scripting</Format> option if this extension has been added to your Greenstone installation and either OpenOffice or LibreOffice is available on your system.</Text> 1490 <Text id="assoc-files-13">Optionally set the <AutoText text="level1_heading" type="italics"/> to <i>heading\s*1</i>, or whatever is appropriate for your documents if they use style information for headings that deviate from the norm for Word. Repeat as is needed for <AutoText text="level2_heading" type="italics"/> and so forth. For more details on how to control sections within a Word document, see the <TutorialRef id="enhanced_word"/> tutorial.</Text> 1491 1491 </NumberedItem> 1492 1492 <NumberedItem> … … 1499 1499 <Text id="assoc-files-18">to:</Text> 1500 1500 <Format><td valign="top">[ex.equivDocLink][ex.equivDocIcon][ex./equivDocLink]</td></Format> 1501 <Text id="assoc-files-19">Two things occur in this edit. The main difference is the switch from using <Auto text text="ex.srclink" type="italics"/> and <Autotext text="ex.srcicon" type="italics"/> that provides the link to the primary source document (which is the Word document), and replace it with a hyperlink around an icon to the document that Greenstone has associated as an equivalent document (which is the PDF version). The icon Greenstone chooses to show is based on the filename extension of the matching file it has found. 
In this case <img src="../tutorial_files/ipdf.gif"/>.</Text>1502 <Text id="assoc-files-20">The second (more minor) change in this edit is to simplify the statement a bit. The original uses an <Format>{Or}</Format> statement to show a thumbnail version of the document if Greenstone has one, in preference over the source icon. Since in this collection we have no thumbnails generated, it has been simplified by eliminating the <Format>{Or}</Format> combination and going straight to the <Auto text text="ex.equivDocIcon" type="italics"/> metadata item.</Text>1501 <Text id="assoc-files-19">Two things occur in this edit. The main difference is the switch from using <AutoText text="ex.srclink" type="italics"/> and <AutoText text="ex.srcicon" type="italics"/> that provides the link to the primary source document (which is the Word document), and replace it with a hyperlink around an icon to the document that Greenstone has associated as an equivalent document (which is the PDF version). The icon Greenstone chooses to show is based on the filename extension of the matching file it has found. In this case <img src="../tutorial_files/ipdf.gif"/>.</Text> 1502 <Text id="assoc-files-20">The second (more minor) change in this edit is to simplify the statement a bit. The original uses an <Format>{Or}</Format> statement to show a thumbnail version of the document if Greenstone has one, in preference over the source icon. 
Since in this collection we have no thumbnails generated, it has been simplified by eliminating the <Format>{Or}</Format> combination and going straight to the <AutoText text="ex.equivDocIcon" type="italics"/> metadata item.</Text> 1503 1503 <Text id="assoc-files-21">Switch to the <AutoText key="glidict::GUI.Format"/> panel and edit the format statement for VList (All).</Text> 1504 1504 <Text id="assoc-files-22">Change:</Text> … … 1519 1519 [/highlight]{If}{[dc.Creator],: [sibling(All'\, '):dc.Creator]}</td><br /> 1520 1520 </Format> 1521 <Text id="assoc-files-24">Note: When Greenstone encounters a file that matches the provided <Format>associate_ext</Format> value (<Format>pdf</Format> in our case), it sets the metadata value <Auto text text="ex.equivDocIcon"/> for that document to be the macro <i>_iconXXX_</i>, where <i>XXX</i> is whatever the filename extension is (so <Autotext text="_iconpdf_" type="italics"/> in our case). As long as there is an existing macro defined for that combination of the word <i>icon</i> and the filename extension, then a suitable icon will be displayed when the document appears in a VList. For <i>pdf</i> the displayed icon will be <img src="../tutorial_files/ipdf.gif"/>.</Text>1521 <Text id="assoc-files-24">Note: When Greenstone encounters a file that matches the provided <Format>associate_ext</Format> value (<Format>pdf</Format> in our case), it sets the metadata value <AutoText text="ex.equivDocIcon"/> for that document to be the macro <i>_iconXXX_</i>, where <i>XXX</i> is whatever the filename extension is (so <AutoText text="_iconpdf_" type="italics"/> in our case). As long as there is an existing macro defined for that combination of the word <i>icon</i> and the filename extension, then a suitable icon will be displayed when the document appears in a VList. 
For <i>pdf</i> the displayed icon will be <img src="../tutorial_files/ipdf.gif"/>.</Text> 1522 1522 </NumberedItem> 1523 1523 </Content> … … 3313 3313 </NumberedItem> 3314 3314 <NumberedItem> 3315 <Text id="oaiserver-14">Although the data transmitted over OAI is in the form of XML, Greenstone uses a stylesheet to transform that XML response into a user-friendly, structured web page you see when you perform the <Auto text text="Identify"/> request (thereby visiting the <AutoText text="verb=Identify" type="italics"/> response page). This allows <AutoText text="Identify" type="italics"/> and other verbs in the OAI specification to be shown in the main Greenstone OAI Server pages as link buttons. You can see these in the main Greenstone <AutoText text="oaiserver.cgi" type="italics"/> (or <AutoText text="oaiserver.cgi?verb=Identify" type="italics"/>) page, as a row of links starting with "Identify" at the top and in the lower end of the page.</Text>3315 <Text id="oaiserver-14">Although the data transmitted over OAI is in the form of XML, Greenstone uses a stylesheet to transform that XML response into a user-friendly, structured web page you see when you perform the <AutoText text="Identify"/> request (thereby visiting the <AutoText text="verb=Identify" type="italics"/> response page). This allows <AutoText text="Identify" type="italics"/> and other verbs in the OAI specification to be shown in the main Greenstone OAI Server pages as link buttons. You can see these in the main Greenstone <AutoText text="oaiserver.cgi" type="italics"/> (or <AutoText text="oaiserver.cgi?verb=Identify" type="italics"/>) page, as a row of links starting with "Identify" at the top and in the lower end of the page.</Text> 3316 3316 <Text id="oaiserver-15">Clicking on the links will execute that verb as a request and return the response from your Greenstone OAI server as a structured web page. 
Try clicking on all the links.</Text> 3317 3317 </NumberedItem> 3318 3318 <NumberedItem> 3319 <Text id="oaiserver-16">OAI defines a concept called a <Auto text text="Set"/>. In Greenstone, the OAI Set concept is mapped to the practical Greenstone collection. The link to the <AutoText text="ListSets" type="italics"/> verb will therefore request the Greenstone OAI server to list all the collections that have been enabled for OAI.</Text>3319 <Text id="oaiserver-16">OAI defines a concept called a <AutoText text="Set"/>. In Greenstone, the OAI Set concept is mapped to the practical Greenstone collection. The link to the <AutoText text="ListSets" type="italics"/> verb will therefore request the Greenstone OAI server to list all the collections that have been enabled for OAI.</Text> 3320 3320 <Text id="oaiserver-17">Click on the <b>ListSets</b> button link and have a look.</Text> 3321 3321 <Text id="oaiserver-18">The response page for the <AutoText text="ListSets" type="italics"/> verb will show you that your backdrop collection is one of the collections available over OAI in your Greenstone repository.</Text> 3322 3322 </NumberedItem> 3323 3323 <NumberedItem> 3324 <Text id="oaiserver-19">You will see a couple of buttons next to each collection (or <Auto text text="Set"/>) listed here. The first is <b>Identifiers</b> and the second <b>Records</b>. Click on the <b>Identifiers</b> button for the backdrop Set. This will list all the IDs of the documents contained in your OAI collection. If you look at the IDs, they look similar enough to Greenstone's internal document IDs, but with an additional prefix (<Format>oai:<repositoryID>:setname</Format>, where <AutoText text="repositoryID" type="italics"/> was set by you in the oai.cfg configuration file).</Text>3324 <Text id="oaiserver-19">You will see a couple of buttons next to each collection (or <AutoText text="Set"/>) listed here. The first is <b>Identifiers</b> and the second <b>Records</b>. 
Click on the <b>Identifiers</b> button for the backdrop Set. This will list all the IDs of the documents contained in your OAI collection. If you look at the IDs, they look similar enough to Greenstone's internal document IDs, but with an additional prefix (<Format>oai:<repositoryID>:setname</Format>, where <AutoText text="repositoryID" type="italics"/> was set by you in the oai.cfg configuration file).</Text> 3325 3325 </NumberedItem> 3326 3326 <NumberedItem> … … 3328 3328 <Text id="oaiserver-21">As you would have specified some Dublin Core (dc) metadata for some of the images in the backdrop collection, the page that loads will display this information for each document in the collection (Set).</Text> 3329 3329 <Text id="oaiserver-22">Greenstone's OAI at present supports 3 metadata formats, as is explained in the comments in the oai.cfg file. Of these three, the OAI standard for Dublin Core, <AutoText text="oai_dc" type="italics"/>, is the one pertinent to this tutorial. If your collection specifies metadata for a different metadata set format, you can use the oai.cfg file to tell Greenstone how to map the metadata fields of your chosen metadata set format into the Dublin Core metadata set supported by the Greenstone OAI server (or one of the other metadata sets it supports).</Text> 3330 <Text id="oaiserver-23">Look in the oai.cfg file again and scroll down to the section on <AutoText text="oaimapping" type="italics"/>, which will explain and provide examples for how to specify such mappings from your metadata format to one that Greenstone's OAI server uses. For instance, the <b>demo</b> collection comes enabled for OAI upon installation, and specifies some mappings from its <Auto text text="DLS" type="italics"/> metadata format to <Autotext text="OAI DC" type="italics"/>. 
Its <AutoText key="metadata::dls.Title"/> metadata is mapped using the following line in the oai.cfg configuration file:</Text>3330 <Text id="oaiserver-23">Look in the oai.cfg file again and scroll down to the section on <AutoText text="oaimapping" type="italics"/>, which will explain and provide examples for how to specify such mappings from your metadata format to one that Greenstone's OAI server uses. For instance, the <b>demo</b> collection comes enabled for OAI upon installation, and specifies some mappings from its <AutoText text="DLS" type="italics"/> metadata format to <AutoText text="OAI DC" type="italics"/>. Its <AutoText key="metadata::dls.Title"/> metadata is mapped using the following line in the oai.cfg configuration file:</Text> 3331 3331 <Format>oaimapping dls.Title oai_dc.title</Format> 3332 3332 <Text id="oaiserver-24">Because the backdrop collection uses DC metadata already, no mapping is required.</Text> … … 3338 3338 <Text id="gli-oai-0">Connecting to an OAI server from GLI</Text> 3339 3339 </Title> 3340 <Prerequisite id="s imple_image_collection"/>3340 <Prerequisite id="setting_up_GS_OAI_server"/> 3341 3341 <Version initial="2.85" current="2.85"/> 3342 3342 <Comment> … … 3367 3367 </NumberedItem> 3368 3368 <NumberedItem> 3369 <Text id="gli-oai-9">After a while, it will have finished downloading. Change to the <AutoText key="glidict::GUI.Gather"/> panel, and on the left-hand side, open up the <AutoText key="glidict::Tree.DownloadedFiles"/> Downloaded Filesfolder. This is where Greenstone stores files you downloaded using the <AutoText key="glidict::GUI.Download"/> panel. In this case, it will contain a folder wherein the oai metadata files and images that you've just downloaded from your own Greenstone OAI server is stored.</Text>3369 <Text id="gli-oai-9">After a while, it will have finished downloading. 
Change to the <AutoText key="glidict::GUI.Gather"/> panel, and on the left-hand side, open up the <AutoText key="glidict::Tree.DownloadedFiles"/> folder. This is where Greenstone stores files you downloaded using the <AutoText key="glidict::GUI.Download"/> panel. In this case, it will contain a folder wherein the oai metadata files and images that you've just downloaded from your own Greenstone OAI server is stored.</Text> 3370 3370 </NumberedItem> 3371 3371 <NumberedItem> … … 3393 3393 <Content> 3394 3394 <NumberedItem> 3395 <Text id="gs-oai-3">You will want to be running the included Apache web server. So if you're on Windows and using the Local Library Server, quit it and rename the <Auto text text="server.exe" type="italics"/> application in your Greenstone installation folder to server.not. Then use the <Autotext text="Start" type="italics"/> menu shortcut to the Greenstone Server once more, to now launch the Apache web server.</Text>3396 </NumberedItem> 3397 <NumberedItem> 3398 <Text id="gs-oai-4">For this exercise, we will visit the <b>Open Archives Validator</b>, for which your OAIserver needs to provide a valid email address. In a text editor, open up your greenstone installation's etc/oai.cfg file and set the value of the <Auto text text="maintainer" type="italics"/> field to your email address.</Text>3399 <Text id="gs-oai-5">Note that by default, your Greenstone installation will make the <b>demo</b> collection available over OAI. This collection has been set up with a dummy (and invalid) email address for the <Auto text text="creator" type="italics"/> and <Autotext text="maintainer" type="italics"/> fields in the collection's collect.cfg file. You will need to open up collect/demo/etc/collect.cfg and clear the email values for the <Autotext text="creator" type="italics"/> and <Autotext text="maintainer" type="italics"/> properties (or else set these to a valid email again). 
Otherwise the OpenArchives validator will resort to using the <b>demo</b> collection's default dummy email to send the initial validation results to. Alternatively, you can simply remove the <b>demo</b> collection from being listed in the oai.cfg file's oaicollection property, which will cease to make the <b>demo</b> collection available over OAI.</Text>3400 <Text id="gs-oai-6">Note also that, if you wish to specify contact emails at a collection level, you will need to edit your greenstone installation's <Format>collect/<collection-name>/etc/collect.cfg</Format> file for those collections and set the <Auto text text="creator" type="italics"/> and <Autotext text="maintainer" type="italics"/> fields to the desired email address.</Text>3395 <Text id="gs-oai-3">You will want to be running the included Apache web server. So if you're on Windows and using the Local Library Server, quit it and rename the <AutoText text="server.exe" type="italics"/> application in your Greenstone installation folder to server.not. Then use the <AutoText text="Start" type="italics"/> menu shortcut to the Greenstone Server once more, to now launch the Apache web server.</Text> 3396 </NumberedItem> 3397 <NumberedItem> 3398 <Text id="gs-oai-4">For this exercise, we will visit the <b>Open Archives Validator</b>, for which your OAIserver needs to provide a valid email address. In a text editor, open up your greenstone installation's etc/oai.cfg file and set the value of the <AutoText text="maintainer" type="italics"/> field to your email address.</Text> 3399 <Text id="gs-oai-5">Note that by default, your Greenstone installation will make the <b>demo</b> collection available over OAI. This collection has been set up with a dummy (and invalid) email address for the <AutoText text="creator" type="italics"/> and <AutoText text="maintainer" type="italics"/> fields in the collection's collect.cfg file. 
You will need to open up collect/demo/etc/collect.cfg and clear the email values for the <AutoText text="creator" type="italics"/> and <AutoText text="maintainer" type="italics"/> properties (or else set these to a valid email again). Otherwise the OpenArchives validator will resort to using the <b>demo</b> collection's default dummy email to send the initial validation results to. Alternatively, you can simply remove the <b>demo</b> collection from being listed in the oai.cfg file's oaicollection property, which will cease to make the <b>demo</b> collection available over OAI.</Text> 3400 <Text id="gs-oai-6">Note also that, if you wish to specify contact emails at a collection level, you will need to edit your greenstone installation's <Format>collect/<collection-name>/etc/collect.cfg</Format> file for those collections and set the <AutoText text="creator" type="italics"/> and <AutoText text="maintainer" type="italics"/> fields to the desired email address.</Text> 3401 3401 </NumberedItem> 3402 3402 <NumberedItem>