Changeset 32044


Timestamp: 2017-10-13T22:22:27+13:00
Author: ak19
Message:

Fixed up UnknownConverterPlugin tutorial again after DjVuLibre's djvutxt worked easily on Windows. But I've been having trouble with icecite once I finally got the command right (I think) for windows.

File:
1 edited

  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r32035 r32044  
    48074807<Comment><Text id="ucp-16">This is where the <AutoText text="UnknownConverterPlugin"/> comes in. It builds on the idea of the UnknownPlugin, allowing you to work with document formats unknown to Greenstone. But it offers the additional advantage of being able to extract the text of the unknown document, subject to an important proviso: that you have a software tool installed on your machine, one that can be run readily from the commandline, which can perform the process of converting the unknown document format into text or HTML (or a series of images). If the tool can convert the document to text or HTML, Greenstone can proceed as usual to index the content to make it searchable on previewing.</Text></Comment>
    48084808<NumberedItem><Text id="ucp-17">So in order to process the "superhero.djvu" document in our collection, such that its text content gets indexed for searching, we need to do a number of things: find out if there's a free djvu to text conversion tool out there, work out how to run it from the commandline and finally configure the UnknownConverterPlugin to run it for us, so Greenstone can take care of the rest.</Text>
    4809 <Text id="ucp-18">We're in luck, because among the DjVu related tools that <Link url="http://djvu.sourceforge.net">DjVuLibre</Link> provides a tool called "<Format>djvutxt</Format>" that can perform the text extraction for us. DjVuLibre is available for Windows, Mac and Linux. Some <Link url="https://unix.stackexchange.com/questions/25256/why-isnt-there-a-djvu2text">Linux machines may even come pre-installed with DjVuLibre</Link>. If not, you can use a package manager to install it for you, or compile it up easily from <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre/">source</Link> in the usual Unix manner.</Text>
    4810 <Text id="ucp-19">DjVuLibre provides binary installers for <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre_Windows/">Windows</Link> and <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre_MacOS/">Mac</Link>. Grab the one for your operating system and install it somewhere sensible, where you have permissions to run it. You can read about the other DjVu tools that DjVuLibre provide in their <Link url="http://djvu.sourceforge.net/doc/index.html">documentation page</Link>, but for this tutorial, we'll just be using their <Format>djvutxt</Format> tool.</Text>
     4809<Text id="ucp-18">We're in luck, because among the DjVu related tools that <Link url="http://djvu.sourceforge.net">DjVuLibre</Link> provides is a tool called "<Format>djvutxt</Format>" that can perform the text extraction for us. DjVuLibre is available for Windows, Mac and Linux:</Text>
     4810<BulletList>
     4811  <Bullet><Text id="ucp-19">DjVuLibre provides binary installers for <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre_Windows/">Windows</Link> and <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre_MacOS/">Mac</Link>. Grab the one for your operating system and install it somewhere sensible, somewhere you have permissions to install and run it from. Upon successful installation, you're given the option to launch DjVuLibre's <i>DjView</i> tool, which will open the DjVuLibre manual (in djvu format). In the left pane of DjView, you can see a listing of the various tools DjVuLibre is comprised of, and read up on them. You can also read about <i>djvutxt</i> or the other DjVu tools that DjVuLibre provides in their <Link url="http://djvu.sourceforge.net/doc/index.html">documentation page</Link>, but for this tutorial, we'll just be using their <Format>djvutxt</Format> tool.</Text></Bullet>
     4812<Bullet><Text id="ucp-19b">As for Linux, some <Link url="https://unix.stackexchange.com/questions/25256/why-isnt-there-a-djvu2text">Linux machines may even come pre-installed with DjVuLibre</Link>. If not, you can use a package manager to install it for you, or compile it up easily from <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre/">source</Link> in the usual Unix manner.</Text></Bullet>
     4813</BulletList>
    48114814</NumberedItem>
    48124815<NumberedItem><Text id="ucp-20">The next step is to find out how to run DjVuLibre's <Format>djvutxt</Format> conversion tool from the commandline.</Text>
     
    48144817<Format>djvutxt input.djvu output.txt</Format>
    48154818<Text id="ucp-22">Open a DOS prompt on Windows or a terminal on Mac/Linux and experiment to see what it takes to convert your Greenstone installation's <Format>web/sites/localsite/collect/DjVuColl/superhero.djvu</Format> file.</Text>
    4816 <Text id="ucp-22a">You may have to invoke <Format>djvutxt</Format> using it's full filepath, in which case the command would look like:</Text>
     4819<Text id="ucp-22a">You may have to invoke <Format>djvutxt</Format> using its full filepath, in which case the command would look like:</Text>
    48174820<Format>/PATH/TO/YOUR/djvutxt /PATH/TO/GS/web/sites/localsite/collect/DjVuColl/superhero.djvu /PATH/TO/YOUR/GS/superhero.txt</Format>
    48184821<Text id="ucp-23">Once you have the command working, inspect the output file. You should see mostly legible text in it. Only when you've been able to successfully complete this step should you proceed to the next steps.</Text>
     
    48294832<Text id="ucp-31">If you have any spaces in any filepaths in your <Format>exec_cmd</Format>, make sure to always nest them in escaped double quotes (<Format>\"</Format>), so Greenstone can preserve the spaces in the filepath.</Text>
    48304833<Text id="ucp-32">If any filepaths, other than <Format>%%INPUT_FILE</Format> and <Format>%%OUTPUT</Format>, are within your Greenstone installation, you can use <Format>%%GSDLHOME</Format><MajorVersion number="3">, <Format>%%GSDL3SRCHOME</Format> and <Format>%%GSDL3HOME</Format> (the latter for Greenstone 3's <Format>web</Format> folder)</MajorVersion> as placeholders and write out your filepaths relative to them. For instance, if your DjVuLibre is installed in your Greenstone's <Format>ext</Format> subfolder, then you would start the filepath to <i>djvutxt</i> with <Format>%%GSDL<MajorVersion number="3">3SRC</MajorVersion>HOME/ext</Format>.</Text>
    4831 </NumberedItem>
    4832 <NumberedItem><Text id="ucp-33">Having sufficiently configured the UnknownConverterPlugin, click on the OK button close the plugin's Configuration dialog. Move to the <AutoText key="glidict::GUI.Create"/> pane and build the collection. Your document has now been recognised. What's more, if you preview it and search for the term "Interoperability", a term that occurs in our collection's superhero.djvu document, you should now get a search result linking to that document. So Greenstone has successfully indexed the document's text, thanks to DjVuLibre's <Format>djvutxt</Format> tool extracting the text which got fed into the rest of Greenstone's building pipeline.</Text>
     4834<Text id="ucp-32a">The value for your <Format>exec_cmd</Format> may look something like this, if you have DjVuLibre installed in <Path>C:\Program Files</Path>. Note the escaped double quotes bookending the path to <Format>djvutxt</Format>, to protect spaces in its filepath:</Text>
     4835<Format>\"C:\Program Files\DjVuLibre\djvutxt\" %%INPUT_FILE %%OUTPUT</Format>
     4836</NumberedItem>
     4837<NumberedItem><Text id="ucp-33">Having sufficiently configured the UnknownConverterPlugin, click on the OK button to close the plugin's Configuration dialog. Move to the <AutoText key="glidict::GUI.Create"/> pane and build the collection. Your document has now been recognised. What's more, if you preview it and search for the term "Interoperability", a term that occurs in our collection's <i>superhero.djvu</i> document, you should now get a search result linking to that document. So Greenstone has successfully indexed the document's text, thanks to DjVuLibre's <Format>djvutxt</Format> tool extracting the text which got fed into the rest of Greenstone's building pipeline.</Text>
    48334838</NumberedItem>
    48344839<Heading><Text id="ucp-34">Associating an icon with DjVu documents in Greenstone</Text></Heading>
     
    48404845</NumberedItem>
    48414846<NumberedItem><Text id="ucp-39">Greenstone doesn't have an icon for DjVu documents, since it doesn't know about the format. If you Google for the djvu icon, you'd probably find the <Link url="http://djvu.sourceforge.net/doc/man/nsdejavu.html">Wikipedia page for it</Link>.</Text>
    4842 <Text id="ucp-40">Save one of their DjVu icon images. Then open the image in GIMP or another image editor and use the application's scaling feature to scale the height or the width (whichever is greater) to anywhere between 26 and 32 pixels. Save the scaled image as a GIF file with the name "<Format>icondjvu.gif</Format>", storing it in your Greenstone installation's <Format>web/interfaces/default/images</Format> folder.</Text>
    4843 </NumberedItem>
    4844 <NumberedItem><Text id="ucp-41">Greenstone knows nothing about the icondjvu macro we defined as the value for UnknownConverterPlugin's srcicon field, so we have to teach Greenstone about this new macro. Use a text editor to open your Greenstone 3's <Format>web/sites/localsite/siteConfig.xml</Format> file.</Text>
     4847<Text id="ucp-40">Save one of their DjVu icon images. Then open the image in GIMP or another image editor and use the application's scaling feature to scale the height or the width (whichever is greater) to anywhere between 26 and 32 pixels. Save the scaled image as a GIF file with the name "<Format>idjvu.gif</Format>", storing it in your Greenstone installation's <Format>web/interfaces/default/images</Format> folder.</Text>
     4848</NumberedItem>
     4849<NumberedItem><Text id="ucp-41">Greenstone knows nothing about the <Format>icondjvu</Format> macro we defined as the value for UnknownConverterPlugin's <Format>srcicon</Format> field, so we have to teach Greenstone about this new macro. Use a text editor to open your Greenstone 3's <Format>web/sites/localsite/siteConfig.xml</Format> file.</Text>
    48454850<Text id="ucp-42">Locate the line</Text>
    48464851<Format>&lt;replace macro="_iconunknown_" scope="metadata" text="&amp;lt;img src='interfaces/default/images/iunknown.gif' border='0'/&amp;gt;" resolve="false"/&gt;</Format>