Timestamp:
2017-10-06T23:11:43+13:00
Author:
ak19
Message:
  1. First round of composing the new DjVu portion of the recently added UnknownConverterPlugin tutorial. The DjVu section only really covers linux, since that's where I tested it and composed the tutorial with. Despite the windows and mac links, it's untested and unadjusted yet for windows and mac, same for the Icecite portion of the tutorial. 2. Some more fixes to Images_GPS' tutorial markup to load existing glidict strings, instead of using plain html markup for this.
File:
1 edited

  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r32030 r32031  
    787787</NumberedItem>
    788788<NumberedItem>
    789 <Text id="images-gps-3">Go to the <AutoText key="glidict::GUI.Design"/> panel. In the <b>Browsing Classifiers</b> section, choose <b>AZCompactList</b> from the <b>select classifier to add</b> dropdown box and press <b>Add Classifier...</b>. In the configuration dialog that appears, set the metadata field to <b>dc.Title</b> and tick the buttonname option and set its value to <Format>locations</Format>. This will create a classifier labelled <Format>locations</Format> that groups all images under Eiffel Tower into one bookshelf and similarly creates bookshelves for the other 3 categories.</Text>
     789<Text id="images-gps-3">Go to the <AutoText key="glidict::GUI.Design"/> panel. In the <b>Browsing Classifiers</b> section, choose <b>AZCompactList</b> from the <AutoText key="glidict::CDM.ClassifierManager.Classifier"/> dropdown box and press <AutoText key="glidict::CDM.ClassifierManager.Add" type="button"/>. In the configuration dialog that appears, set the metadata field to <b>dc.Title</b> and tick the buttonname option and set its value to <Format>locations</Format>. This will create a classifier labelled <Format>locations</Format> that groups all images under Eiffel Tower into one bookshelf and similarly creates bookshelves for the other 3 categories.</Text>
    790790</NumberedItem>
    791791<NumberedItem>
     
    803803  Each of these image files has metadata embedded in it&mdash;including GPS data&mdash;generated by the smartphone when the photo was taken. We can extract this metadata when the collection is built, and in particular, make use of the GPS metadata to provide map-based views of the collection to the user.</Text>
    804804
    805   <Text id="images-gps7">In the <AutoText key="glidict::CDM.GUI.Plugins"/> section of the <AutoText key="glidict::GUI.Design"/> panel, go down to the <b>select plugin to add</b> and choose the <AutoText text="EmbeddedMetadataPlugin"/>. Press the <b>Add Plugin</b> button, and then click <AutoText key="glidict::General.OK" type="button"/> to add it to the plugin list. Select this plugin in the list, then use the <AutoText key="glidict::CDM.Move.Move_Up" type="button"/> button to shift it upwards until it comes just after the GreenstoneXMLPlugin.</Text>
     805  <Text id="images-gps7">In the <AutoText key="glidict::CDM.GUI.Plugins"/> section of the <AutoText key="glidict::GUI.Design"/> panel, go down to the <AutoText key="glidict::CDM.PlugInManager.PlugIn"/> and choose the <AutoText text="EmbeddedMetadataPlugin"/>. Press the <AutoText key="glidict::CDM.PlugInManager.Add" type="button"/> button, and then click <AutoText key="glidict::General.OK" type="button"/> to add it to the plugin list. Select this plugin in the list, then use the <AutoText key="glidict::CDM.Move.Move_Up" type="button"/> button to shift it upwards until it comes just after the GreenstoneXMLPlugin.</Text>
    806806</NumberedItem>
    807807
     
    45424542</NumberedItem>
    45434543<NumberedItem>
    4544 <Text id="oaiserver-6">For this tutorial, we'll make the backdrop collection created in the simple image tutorial available over OAI. Therefore, add this collection's name to the end of the <AutoText text="oaicollection" type="italics"/> property:</Text>
     4544<Text id="oaiserver-6">For this tutorial, we'll make the backdrop collection created in the <i>simple image</i> tutorial available over OAI. Therefore, add this collection's name to the end of the <AutoText text="oaicollection" type="italics"/> property:</Text>
    45454545<Format>oaicollection demo documented-examples/oai-e backdrop</Format>
    45464546<Text id="oaiserver-7">If you have a great many documents and do not want the OAI server to return all of them in one go, you could set the <AutoText text="resumeafter" type="italics"/> property to something lower than the default 250 value in the oai.cfg file. Like:</Text>
     
    47804780<Tutorial id="unknown_converter_plugin">
    47814781<Title>
    4782 <Text id="ucp-01">Using the UnknownConverterPlugin</Text>
     4782<Text id="ucp-01">Using the UnknownConverterPlugin to make unsupported document formats searchable</Text>
    47834783</Title>
    47844784<SampleFiles folder="pdfbox"/>
    47854785<Version initial="2.88" current="2.87"/>
    47864786<Content>
    4787 <Comment><Text id="ucp-02">The UnknownConverterPlugin builds on the idea of the UnknownPlugin, in that it can be configured to handle documents of unknown format and file extension. It can also be made to handle documents with known file extensions in a custom manner.</Text></Comment>
    4788 <Comment><Text id="ucp-03">The UnknownConverterPlugin extends the UnknownPlugin's abilities by letting you launch a tool you have installed on your own operating system that can be run from the commandline to convert from the "unknown" file format to either text, html or gif/jpg/png images, or a folder of these. If you know how to launch this tool from the commandline to do the conversion, then you would configure the UnknownConverterPlugin by supplying the file format (file extension) of the documents it should process, the expected output file format (text, html or paged images), and the tool's conversion command that the UnknownConverterPlugin should launch to perform the conversion. In place of the input file and the output file or folder you provide placeholders in the command to run. Once configured, the UnknownConverterPlugin will be used during building to process documents that match the specified file format. The conversion tool will be launched with the command provided, and the expected output files as specified can then be processed by Greenstone in the usual manner.</Text></Comment>
    4789 <Comment><Text id="ucp-04">An example would be djvu files, for which Greenstone provides no custom plugin. However, there's a free commandline tool available for unix systems that can convert from djvu to one of the text based format that Greenstone can process, text or html. So in this case, you could try using the UnknownConverterPlugin with the commandline tool on djvu files that you've gathered. The result should be that the djvu in your collection are now searchable.</Text></Comment>
     4787<Comment><Text id="ucp-02">This is an advanced tutorial: it assumes not only that you are familiar with the material covered in the preceding tutorials, but also that you're comfortable with downloading and installing software from the web and have a little familiarity with image editing software.</Text></Comment>
     4788<Comment><Text id="ucp-03">The <AutoText text="UnknownConverterPlugin"/> builds on the idea of the <AutoText text="UnknownPlugin"/>, in that it can be configured to handle documents of unknown format and file extension. It can also be made to handle documents with known file extensions in a custom manner.</Text></Comment>
     4789<Comment><Text id="ucp-04">The UnknownConverterPlugin extends the UnknownPlugin's abilities by letting you launch a tool you have installed on your own operating system that can be run from the commandline to convert from the "unknown" file format to either text, html or gif/jpg/png images, or a folder of these. If you know how to launch this tool from the commandline to do the conversion, then you would configure the UnknownConverterPlugin by supplying the file format (file extension) of the documents it should process, the expected output file format (text, html or paged images), and the tool's conversion command that the UnknownConverterPlugin should launch to perform the conversion. In place of the input file and the output file or folder you provide placeholders in the command to run. Once configured, the UnknownConverterPlugin will be used during building to process documents that match the specified file format. The conversion tool will be launched with the command provided, and the expected output files as specified can then be processed by Greenstone in the usual manner.</Text></Comment>
     4790<Comment><Text id="ucp-05">An example would be djvu files, for which Greenstone provides no custom plugin. However, there's a free commandline tool available for unix systems that can convert from djvu to one of the text based formats that Greenstone can process, text or html. So in this case, you could try using the UnknownConverterPlugin with the commandline tool on djvu files that you've gathered. The result should be that the djvu files in your Greenstone collection are now searchable.</Text></Comment>
     4791<Heading><Text id="ucp-06">Working with DjVu documents in Greenstone</Text></Heading>
     4792<Text id="ucp-07">DjVu documents (pronounced like the French phrase <i>déjà vu</i>) are becoming a popular document format. <Link url="http://djvu.sourceforge.net/doc/index.html">DjVuLibre</Link>, which provides open source tools for processing DjVu documents, describes DjVu as</Text>
     4793<Comment><Text id="ucp-07a">"a web-centric format and software platform for distributing documents and images. DjVu can advantageously replace PDF, PS, TIFF, JPEG, and GIF for distributing scanned documents, digital documents, or high-resolution pictures. DjVu content downloads faster, displays and renders faster, looks nicer on a screen, and consume less client resources than competing formats. DjVu images display instantly and can be smoothly zoomed and panned with no lengthy re-rendering. DjVu is used by hundreds of academic, commercial, governmental, and non-commercial web sites around the world."</Text></Comment>
     4794<Text id="ucp-08">In this part of the tutorial we'll see how to get Greenstone to not just include a collection's DjVu documents, but make them searchable too.</Text>
     4795<Heading><Text id="ucp-09">Extracting the text from DjVu documents with DjVuLibre's djvutxt</Text></Heading>
     4796<NumberedItem><Text id="ucp-10">Start up GLI and create a new collection called <i>DjVu Collection</i>.</Text>
     4797</NumberedItem>
     4798<NumberedItem><Text id="ucp-11">Visit the <Link url="http://www.djvu.org/resources/djvu_digital_vs_super_hero_pdf.php">'DjVu-Digital vs. "Super Hero" PDF' page</Link>. The page compares a PDF sample document to its equivalent DjVu version and provides download links for both.</Text>
     4799<Text id="ucp-11a">Download their <Link url="http://www.djvu.org/docs/superhero.djvu?djvuopts&amp;zoom=page">sample DjVu document</Link> into your <i>DjVu Collection</i>'s import folder.</Text>
     4800</NumberedItem>
     4801<NumberedItem><Text id="ucp-12">Back in GLI, in the <b>Collection</b> view of the <AutoText key="glidict::GUI.Gather"/> pane, right click and select <AutoText key="glidict::CollectionPopupMenu.Refresh"/>. You should now see your new document "superhero.djvu" ready to be built.</Text>
     4802</NumberedItem>
     4803<NumberedItem><Text id="ucp-13">Head over to the <AutoText key="glidict::GUI.Create"/> pane and build the collection. The document isn't recognised. You can press Preview to confirm that there's nothing much to look at in this collection.</Text>
     4804<Text id="ucp-14">If you were to search through the <AutoText key="glidict::GUI.Design"/> pane's <AutoText key="glidict::CDM.GUI.Plugins"/> for a "DjVuPlugin", you wouldn't find one, because Greenstone hasn't got one. Greenstone knows about a lot of common formats, but there are a great many formats that different people like to work with that Greenstone knows nothing about and for which Greenstone developers have not created a custom plugin.</Text>
     4805</NumberedItem>
     4806<Comment><Text id="ucp-15">You've already learnt about the <AutoText text="UnknownPlugin"/> in the <i>Multimedia</i> tutorial and know that it can be configured to process document formats for which Greenstone has no custom plugin. However, UnknownPlugin cannot index textual document formats that are unknown to Greenstone to make them searchable upon build, because it doesn't know anything about their internal structure and consequently doesn't know how to extract their text content.</Text></Comment>
     4807<Comment><Text id="ucp-16">This is where the <AutoText text="UnknownConverterPlugin"/> comes in. It builds on the idea of the UnknownPlugin, allowing you to work with document formats unknown to Greenstone. But it offers the additional advantage of being able to extract the text of the unknown document, subject to an important proviso: that you have a software tool installed on your machine, one that can be run readily from the commandline, which can perform the process of converting the unknown document format into text or HTML (or a series of images). If the tool can convert the document to text or HTML, Greenstone can proceed as usual to index the content to make it searchable on previewing.</Text></Comment>
     4808<NumberedItem><Text id="ucp-17">So in order to process the "superhero.djvu" document in our collection, such that its text content gets indexed for searching, we need to do a number of things: find out if there's a free djvu to text conversion tool out there, work out how to run it from the commandline and finally configure the UnknownConverterPlugin to run it for us, so Greenstone can take care of the rest.</Text>
     4809<Text id="ucp-18">We're in luck, because among the DjVu related tools that <Link url="http://djvu.sourceforge.net">DjVuLibre</Link> provides is a tool called "<Format>djvutxt</Format>" that can perform the text extraction for us. DjVuLibre is available for Windows, Mac and Linux. Some <Link url="https://unix.stackexchange.com/questions/25256/why-isnt-there-a-djvu2text">Linux machines may even come pre-installed with DjVuLibre</Link>. If not, you can use a package manager to install it for you, or compile it up easily from <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre/">source</Link> in the usual Unix manner.</Text>
     4810<Text id="ucp-19">DjVuLibre provides binary installers for <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre_Windows/">Windows</Link> and <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre_MacOS/">Mac</Link>. Grab the one for your operating system and install it somewhere sensible, where you have permissions to run it. You can read about the other DjVu tools that DjVuLibre provides in their <Link url="http://djvu.sourceforge.net/doc/index.html">documentation page</Link>, but for this tutorial, we'll just be using their <Format>djvutxt</Format> tool.</Text>
     4811</NumberedItem>
     4812<NumberedItem><Text id="ucp-20">The next step is to find out how to run DjVuLibre's <Format>djvutxt</Format> conversion tool from the commandline.</Text>
     4813<Text id="ucp-21">The general format of the command is</Text>
     4814<Format>djvutxt input.djvu output.txt</Format>
     4815<Text id="ucp-22">Open a DOS prompt on Windows or a terminal on Mac/Linux and experiment to see what it takes to convert your Greenstone installation's <Format>web/sites/localsite/collect/DjVuColl/superhero.djvu</Format> file.</Text>
     4816<Text id="ucp-22a">You may have to invoke <Format>djvutxt</Format> using its full filepath, in which case the command would look like:</Text>
     4817<Format>/PATH/TO/YOUR/djvutxt /PATH/TO/GS/web/sites/localsite/collect/DjVuColl/superhero.djvu /PATH/TO/YOUR/GS/superhero.txt</Format>
     4818<Text id="ucp-23">Once you have the command working, inspect the output file. You should see mostly legible text in it. Only when you've been able to successfully complete this step should you proceed to the next steps.</Text>
     4819</NumberedItem>
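The commandline step above can also be scripted. The following is a minimal Python sketch, not part of the tutorial or of Greenstone: the helper names `build_djvutxt_command` and `convert` are hypothetical, and the only thing taken from the tutorial is the general command form `djvutxt input.djvu output.txt`. It assumes `superhero.djvu` sits in the current directory.

```python
import shutil
import subprocess
from pathlib import Path


def build_djvutxt_command(djvutxt, input_djvu, output_txt):
    """Build the argument list for: djvutxt input.djvu output.txt"""
    # Passing a list (not a single string) avoids shell quoting problems
    # when any of the filepaths contain spaces.
    return [str(djvutxt), str(input_djvu), str(output_txt)]


def convert(djvutxt, input_djvu, output_txt):
    """Run the conversion and report whether a non-empty text file resulted."""
    subprocess.run(build_djvutxt_command(djvutxt, input_djvu, output_txt), check=True)
    out = Path(output_txt)
    return out.is_file() and out.stat().st_size > 0


if __name__ == "__main__":
    tool = shutil.which("djvutxt")  # None if djvutxt is not on the PATH
    if tool is None:
        print("djvutxt not found on PATH; use its full filepath instead")
    elif not Path("superhero.djvu").is_file():
        print("superhero.djvu not found in the current directory")
    else:
        print("conversion produced text:", convert(tool, "superhero.djvu", "superhero.txt"))
```

As in the tutorial, the extracted text file should still be inspected by eye before moving on to configuring the plugin.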
     4820<Heading><Text id="ucp-24">Processing DjVu documents with the UnknownConverterPlugin</Text></Heading>
     4821<NumberedItem><Text id="ucp-25">Now that you know how to run the <i>djvutxt</i> conversion tool from the commandline, open up the DjVu Collection in GLI. Go into the <AutoText key="glidict::GUI.Design"/> pane's <AutoText key="glidict::CDM.GUI.Plugins"/> section and add the <AutoText text="UnknownConverterPlugin"/>. Click the <AutoText key="glidict::CDM.PlugInManager.Configure" type="button"/> and set up the plugin as follows:</Text>
     4822<BulletList>
     4823<Bullet><Text id="ucp-26">set its <Format>convert_to</Format> field to <Format>text</Format></Text></Bullet>
     4824<Bullet><Text id="ucp-27">set its <Format>mime_type</Format> field to <Format>image/vnd.djvu</Format>, which is one of the <Link url="http://djvu.sourceforge.net/doc/man/nsdejavu.html">mime types for the DjVu format</Link></Text></Bullet>
     4825<Bullet><Text id="ucp-28">set its <Format>process_extension</Format> to <Format>djvu</Format></Text></Bullet>
     4826<Bullet><Text id="ucp-29">Finally, copy the full <Format>djvutxt</Format> command you ran from the commandline and paste it into the UnknownConverterPlugin Configuration dialog's <Format>exec_cmd</Format> field. Keep the full path to the <i>djvutxt</i> binary, but replace the entire input filepath with the literal string <Format>%%INPUT_FILE</Format> and replace the output filepath with the literal string <Format>%%OUTPUT</Format>.</Text>
     4827<Text id="ucp-30">Doing so means that when you build the collection, Greenstone will replace <Format>%%INPUT_FILE</Format> with each DjVu document in your collection that it needs to process, and will replace <Format>%%OUTPUT</Format> with the expected text output file of each document upon conversion by <i>djvutxt</i>.</Text></Bullet>
     4828</BulletList>
     4829<Text id="ucp-31">If you have any spaces in any filepaths in your <Format>exec_cmd</Format>, make sure to always nest them in escaped double quotes (<Format>\"</Format>), so Greenstone can preserve the spaces in the filepath.</Text>
     4830<Text id="ucp-32">If any filepaths, other than <Format>%%INPUT_FILE</Format> and <Format>%%OUTPUT</Format> are within your Greenstone installation, you can use the <Format>%%GSDLHOME</Format><MajorVersion number="3">, <Format>%%GSDL3SRCHOME</Format> and <Format>%%GSDL3HOME</Format> (the latter for Greenstone 3's <Format>web</Format> folder)</MajorVersion> as placeholders and write out your filepaths relative to this. For instance, if your DjVuLibre is installed in your Greenstone's <Format>ext</Format> subfolder, then you would start the filepath to <i>djvutxt</i> with <Format>%%GSDL<MajorVersion number="3">3SRC</MajorVersion>HOME/ext</Format>.</Text>
     4831</NumberedItem>
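To picture what the placeholder substitution in `exec_cmd` amounts to, here is a small Python sketch. It is an illustration only and not Greenstone's implementation: the function name `expand_exec_cmd` and the sample filepaths are hypothetical, and only the placeholder names `%%INPUT_FILE` and `%%OUTPUT` come from the tutorial text.

```python
def expand_exec_cmd(exec_cmd, input_file, output_file):
    """Return exec_cmd with the two placeholders replaced by real filepaths."""
    # Plain string replacement; the placeholder names are taken from the
    # tutorial, everything else here is an assumption for illustration.
    return (exec_cmd
            .replace("%%INPUT_FILE", input_file)
            .replace("%%OUTPUT", output_file))


# Hypothetical exec_cmd value and filepaths.
template = "/PATH/TO/YOUR/djvutxt %%INPUT_FILE %%OUTPUT"
command = expand_exec_cmd(template,
                          "/tmp/import/superhero.djvu",
                          "/tmp/converted/superhero.txt")
print(command)
```

The same template is reused for every DjVu document in the collection, which is why the input and output filepaths must not be hard-coded in `exec_cmd`.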
     4832<NumberedItem><Text id="ucp-33">Having sufficiently configured the UnknownConverterPlugin, click on the OK button to close the plugin's Configuration dialog. Move to the <AutoText key="glidict::GUI.Create"/> pane and build the collection. Your document has now been recognised. What's more, if you preview it and search for the term "Interoperability", a term that occurs in our collection's superhero.djvu document, you should now get a search result linking to that document. So Greenstone has successfully indexed the document's text, thanks to DjVuLibre's <Format>djvutxt</Format> tool extracting the text which got fed into the rest of Greenstone's building pipeline.</Text>
     4833</NumberedItem>
     4834<Heading><Text id="ucp-34">Associating an icon with DjVu documents in Greenstone</Text></Heading>
     4835<NumberedItem><Text id="ucp-35">When previewing the search result, you may notice that there's no proper icon for the superhero.djvu document. The Greenstone extracted text variant of the document has an icon, a plain text one. However, the <Format>superhero.djvu</Format> has the "unknown document format" icon, the one with the question mark on it. We can change this.</Text>
     4836</NumberedItem>
     4837<NumberedItem><Text id="ucp-36">Go back to the <AutoText key="glidict::GUI.Design"/> pane to configure your <AutoText text="UnknownConverterPlugin"/> once more. This time, enable the <Format>srcicon</Format> field and set its value to <Format>icondjvu</Format>.</Text>
     4838<Text id="ucp-37">This is a macroname we're just inventing, though we're following existing Greenstone convention in naming document icon macros, in that it's of the form "<Format>icon&lt;file-extension&gt;</Format>".</Text>
     4839<Text id="ucp-38">Click OK to close the UnknownConverterPlugin configuration dialog. Quit GLI, since there's a little more work to do.</Text>
     4840</NumberedItem>
     4841<NumberedItem><Text id="ucp-39">Greenstone doesn't have an icon for DjVu documents, since it doesn't know about the format. If you Google for the djvu icon, you'd probably find the <Link url="http://djvu.sourceforge.net/doc/man/nsdejavu.html">Wikipedia page for it</Link>.</Text>
     4842<Text id="ucp-40">Save one of their DjVu icon images. Then open the image in GIMP or another image editor and use the application's scaling feature to scale the height or the width (whichever is greater) to anywhere between 26 and 32 pixels. Save the scaled image as a GIF file with the name "<Format>idjvu.gif</Format>", storing it in your Greenstone installation's <Format>web/interfaces/default/images</Format> folder.</Text>
     4843</NumberedItem>
     4844<NumberedItem><Text id="ucp-41">Greenstone knows nothing about the icondjvu macro we defined as the value for UnknownConverterPlugin's srcicon field, so we have to teach Greenstone about this new macro. Use a text editor to open your Greenstone 3's <Format>web/sites/localsite/siteConfig.xml</Format> file.</Text>
     4845<Text id="ucp-42">Locate the line</Text>
     4846<Format>&lt;replace macro="_iconunknown_" scope="metadata" text="&amp;lt;img src='interfaces/default/images/iunknown.gif' border='0'/&amp;gt;" resolve="false"/&gt;</Format>
     4847<Text id="ucp-43">Add a similar line above or below it and adjust it to say:</Text>
     4848<Format>&lt;replace macro="_icondjvu_" scope="metadata" text="&amp;lt;img src='interfaces/default/images/idjvu.gif' border='0'/&amp;gt;" resolve="false"/&gt;</Format>
     4849<Text id="ucp-44">Save the file.</Text>
     4850<Text id="ucp-45">The above has now associated the icon image we want appearing for the djvu document with the macro we defined for the srcicon field in UnknownConverterPlugin's configuration.</Text>
     4851</NumberedItem>
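The `replace` line added to `siteConfig.xml` follows a regular pattern, which the Python sketch below makes explicit. This is a hedged illustration rather than a Greenstone utility: the helper `icon_macro_element` is hypothetical, and the attribute values are copied from the `_icondjvu_` line shown in the tutorial. It relies on the standard library's `xml.etree.ElementTree` to escape the embedded `img` markup inside the `text` attribute.

```python
import xml.etree.ElementTree as ET


def icon_macro_element(extension, image_file):
    """Build a <replace> element mapping _icon<extension>_ to an icon image."""
    # The img markup is stored as an attribute value, so its angle brackets
    # are escaped automatically when the element is serialised.
    img = "<img src='interfaces/default/images/{}' border='0'/>".format(image_file)
    return ET.Element("replace", {
        "macro": "_icon{}_".format(extension),
        "scope": "metadata",
        "text": img,
        "resolve": "false",
    })


element = icon_macro_element("djvu", "idjvu.gif")
line = ET.tostring(element, encoding="unicode")
print(line)
```

The image filename passed in must match the name of the GIF file saved into the `web/interfaces/default/images` folder, otherwise the icon will not display.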
     4852<NumberedItem><Text id="ucp-45">Restart GLI, which will restart the Greenstone server, reloading the <Format>siteConfig.xml</Format> you have just edited. Rebuild the DjVu Collection and preview it. This time, when you browse and search the collection, you should see the djvu icon appearing in place of the unknown icon for your DjVu document.</Text>
     4853</NumberedItem>
     4854<NumberedItem><Text id="ucp-45">Having designed your collection to handle DjVu documents, you can now add any other documents, including more DjVu documents. Greenstone should now be able to index the text content of DjVu documents in the collection to make them searchable, in all instances where text can be successfully extracted from them by <Format>djvutxt</Format>.</Text>
     4855</NumberedItem>
     4856<Heading>
     4857<Text id="ucp-06">Using Icecite's commandline tool to convert from PDF to text</Text>
     4858</Heading>
    47904859<Comment><Text id="ucp-05"><Link url="https://github.com/ckorzen/icecite">Icecite</Link> is an open-source tool that can do many PDF related tasks, including extracting text from a PDF. In this part of the tutorial, we're going to learn how to run Icecite's PDF to text conversion utility from the command line. Based on that command, we'll configure the UnknownConverterPlugin to launch Icecite from GLI, to do the conversion on a PDF document in a Greenstone collection. This ends up being a useful exercise in instances where certain PDFs aren't recognised by Greenstone's PDFPlugin, even when the <Format>pdfbox_conversion</Format> option (which uses the PDFBox tool for the conversion) is switched on. In such cases, you can use what you learn here.</Text></Comment>
    4791 <Heading>
    4792 <Text id="ucp-06">Using the Icecite tool to convert from PDF to text</Text>
    4793 </Heading>
    47944860<Comment>
    47954861<Text id="ucp-07">As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial.</Text>