Changeset 32894 for documentation

Timestamp:
13.03.2019 18:52:17 (5 months ago)
Author:
ak19
Message:

Modified the UnknownConverterPlugin tutorial with corrections before removing the icecite section in the next commit. This commit now has instructions on how to run the DjVuLibre installer when you don't have admin permissions so you can install it in a user location.

Files:
1 modified

  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r32606 r32894  
    47844784<Comment><Text id="ucp-02">This is an advanced tutorial, in that it not only supposes you have familiarised yourself with most of what you've learned in preceding tutorials, but that you're also comfortable with downloading and installing software from the web, and have a little familiarity with using image editing software.</Text></Comment> 
    47854785<Comment><Text id="ucp-03">The <AutoText text="UnknownConverterPlugin"/> builds on the idea of the <AutoText text="UnknownPlugin"/>, in that it can be configured to handle documents of unknown format and file extension. It can also be made to handle documents with known file extensions in a custom manner.</Text></Comment> 
    4786 <Comment><Text id="ucp-04">The UnknownConverterPlugin extends the UnknownPlugin's abilities by letting you launch a tool you have installed on your own operating system that can be run from the commandline to convert from the "unknown" file format to either text, html or gif/jpg/png images, or a folder of these. If you know how to launch this tool from the commandline to do the conversion, then you would configure the UnknownConverterPlugin by supplying the file format (file extension) of the documents it should process, the expected output file format (text, html or paged images), and the tool's conversion command that the UnknownConverterPlugin should launch to perform the conversion. In place of the input file and the output file or folder you provide placeholders in the command to run. Once configured, the UnknownConverterPlugin will be used during building to process documents that match the specified file format. The conversion tool will be launched with the command provided, and the expected output files as specified can then be processed by Greenstone in the usual manner.</Text></Comment> 
    4787 <Comment><Text id="ucp-05">An example would be djvu files, for which Greenstone provides no custom plugin. However, there's a free commandline tool available for unix systems that can convert from djvu to one of the text based format that Greenstone can process, text or html. So in this case, you could try using the UnknownConverterPlugin with the commandline tool on djvu files that you've gathered. The result should be that the djvu files in your Greenstone collection are now searchable.</Text></Comment> 
     4786<Comment><Text id="ucp-04">The UnknownConverterPlugin extends the UnknownPlugin's abilities by letting you launch a tool you have installed on your own PC that can be run from the commandline to convert from the "unknown" file format to either text, html or gif/jpg/png images, or a folder of these. If you know how to launch this tool from the commandline to do the conversion, then you would configure the UnknownConverterPlugin by supplying the file format (file extension) of the documents it should process, the expected output file format (text, html or paged images), and the tool's conversion command that the UnknownConverterPlugin should launch to perform the conversion. In place of the input file and the output file or folder, you provide placeholders in the command to run. Once configured, the UnknownConverterPlugin will be used during building to process documents that match the specified file format. It will launch the commandline conversion tool with the command provided, and the expected output files as specified can then be processed by Greenstone in the usual manner.</Text></Comment> 
     4787<Comment><Text id="ucp-05">An example scenario would be if your collection contained djvu files, for which Greenstone provides no custom plugin. However, there's a free commandline tool available that can convert from djvu to one of the text based formats that Greenstone can process, text or html. So in this case, you could try using the UnknownConverterPlugin with the commandline tool on djvu files that you've gathered. The result should be that the djvu files in your Greenstone collection are now searchable.</Text></Comment> 
    47884788<Heading><Text id="ucp-06">Working with DjVu documents in Greenstone</Text></Heading> 
    47894789<Text id="ucp-07">DjVu (pronounced like the French phrase <i>déjà vu</i>) is a <Link url="https://www.djvuzone.org/">document format</Link> suited for archiving digital documents. <Link url="http://djvu.sourceforge.net/doc/index.html">DjVuLibre</Link>, which provides open source tools for processing DjVu documents, describes DjVu as</Text> 
    47904790<Comment><Text id="ucp-07a">"a web-centric format and software platform for distributing documents and images. DjVu can advantageously replace PDF, PS, TIFF, JPEG, and GIF for distributing scanned documents, digital documents, or high-resolution pictures. DjVu content downloads faster, displays and renders faster, looks nicer on a screen, and consume less client resources than competing formats. DjVu images display instantly and can be smoothly zoomed and panned with no lengthy re-rendering. DjVu is used by hundreds of academic, commercial, governmental, and non-commercial web sites around the world."</Text></Comment> 
    4791 <Text id="ucp-08">In this part of the tutorial we'll see how to get Greenstone to not just include a collection's DjVu documents, but make them searchable too. There are several tools out there to convert a DjVu document into text or HTML. For instance, Linux users can install the <i>ocrodjvu</i> package and use its <i>djvu2hocr</i> tool to extract the text content in HTML format. Janusz S. Bien, a Greenstone user on the mailing list, has recommended it as being of possible use to Greenstone users, as it's a front-end to OCR programs. In this tutorial, however, we'll look at using <i>djvutxt</i> which is part of the DjVuLibre suite of tools.</Text> 
     4791<Text id="ucp-08">In this part of the tutorial we'll see how to get Greenstone to not just include a collection's DjVu documents, but make them searchable too. There are several tools out there to convert a DjVu document into text or HTML. For instance, Linux users can install the <i>ocrodjvu</i> package and use its <i>djvu2hocr</i> tool to extract the text content in HTML format. Janusz S. Bien, a Greenstone user on the mailing list, has recommended it as being of possible use to Greenstone users, as it's a front-end to OCR programs. In this tutorial, however, we'll look at using <i>djvutxt</i> which is part of the DjVuLibre suite of tools and which is also available for other operating systems like Windows.</Text> 
    47924792<Heading><Text id="ucp-09">Extracting the text from DjVu documents with DjVuLibre's djvutxt</Text></Heading> 
    47934793<NumberedItem><Text id="ucp-10">Start up GLI and create a new collection called <i>DjVu Collection</i>.</Text> 
     
    48014801<Text id="ucp-14">If you were to search through the <AutoText key="glidict::GUI.Design"/> pane's <AutoText key="glidict::CDM.GUI.Plugins"/> for a "DjVuPlugin", you wouldn't find one, because Greenstone hasn't got one. Greenstone knows about a lot of common formats, but there's a great many formats that different people like to work with that Greenstone knows nothing about and which Greenstone developers have not created a custom plugin for.</Text> 
    48024802</NumberedItem> 
    4803 <Comment><Text id="ucp-15">You've already learnt about the <AutoText text="UnknownPlugin"/> in the <i>Multimedia</i> tutorial and know that it can be configured to process document formats for which Greenstone has no custom plugin. However, UnknownPlugin cannot index textual document formats that are unknown to Greenstone to make them searchable upon build, because it doesn't know anything about their internal structure and consequently doesn't know how to extract their text content.</Text></Comment> 
    4804 <Comment><Text id="ucp-16">This is where the <AutoText text="UnknownConverterPlugin"/> comes in. It builds on the idea of the UnknownPlugin, allowing you to work with document formats unknown to Greenstone. But it offer the additional advantage of being able to extract the text of the unknown document based on an important proviso: that you have a software tool installed on your machine, one that can be run readily from the commandline, which can perform the process of converting the unknown document format into text or HTML (or a series of images). If the tool can convert the document to text or HTML, Greenstone can proceed as usual to index the content to make it searchable on previewing.</Text></Comment> 
    4805 <NumberedItem><Text id="ucp-17">So in order to process the "superhero.djvu" document in our collection, such that its text content gets indexed for searching, we need to do a number of things: find out if there's a free djvu to text conversion tool out there, work out how to run it from the commandline and finally configure the UnknownConverterPlugin to run it for us, so Greenstone can take care of the rest.</Text> 
    4806 <Text id="ucp-18">We're in luck, because among the DjVu related tools that <Link url="http://djvu.sourceforge.net">DjVuLibre</Link> provides a tool called "<Format>djvutxt</Format>" that can perform the text extraction for us. DjVuLibre is available for Windows, Mac and Linux:</Text> 
     4803<Comment><Text id="ucp-15">You've already learnt about the <AutoText text="UnknownPlugin"/> in the <i>Multimedia</i> tutorial and know that it can be configured to process document formats for which Greenstone has no custom plugin. However, UnknownPlugin cannot index textual document formats that are unknown to Greenstone to make them searchable upon building, because it doesn't know anything about their internal structure and consequently doesn't know how to extract their text content.</Text></Comment> 
     4804<Comment><Text id="ucp-16">This is where the <AutoText text="UnknownConverterPlugin"/> comes in. It builds on the idea of the UnknownPlugin, allowing you to work with document formats unknown to Greenstone. But it offers the additional advantage of being able to extract the text of the unknown document, depending on an important proviso: that you have a software tool installed on your machine, one that can be run readily from the commandline, which can perform the process of converting the unknown document format into text or HTML (or a series of images). If the tool can convert the document to text or HTML, Greenstone can proceed as usual to index the content to make it searchable on previewing.</Text></Comment> 
     4805<NumberedItem><Text id="ucp-17">So in order to process the "superhero.djvu" document in our collection, such that its text content gets indexed for searching, we need to do a number of things: find out if there's a free djvu to text conversion tool out there, work out how to run it from the commandline and finally configure the UnknownConverterPlugin to automatically run this commandline tool for us, so Greenstone can take care of the rest.</Text> 
     4806<Text id="ucp-18">We're in luck, because among the DjVu related tools that <Link url="http://djvu.sourceforge.net">DjVuLibre</Link> provides is one called "<Format>djvutxt</Format>" that can perform the text extraction for us. DjVuLibre is available for Windows, Mac and Linux:</Text> 
    48074807<BulletList> 
    4808   <Bullet><Text id="ucp-19">DjVuLibre provides binary installers for <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre_Windows/">Windows</Link> and <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre_MacOS/">Mac</Link>. Grab the one for your operating system and install it somewhere sensible, somewhere you have permissions to install and run it from. Upon successful installation, you're given the option to launch DjVuLibre's <i>DjView</i> tool, which will open the DjVuLibre manual (in djvu format). In the left pane of DjView, you can see a listing of the various tools DjVuLibre is comprised of, and read up on them. You can also read about <i>djvutxt</i> or the other DjVu tools that DjVuLibre provides in their <Link url="http://djvu.sourceforge.net/doc/index.html">documentation page</Link>, but for this tutorial, we'll just be using their <Format>djvutxt</Format> tool.</Text></Bullet> 
    4809 <Bullet><Text id="ucp-19b">As for Linux, some <Link url="https://unix.stackexchange.com/questions/25256/why-isnt-there-a-djvu2text">Linux machines may even come pre-installed with DjVuLibre</Link>. If not, you can use a package manager to install it for you, or compile it up easily from <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre/">source</Link> in the usual Unix manner.</Text></Bullet> 
     4808  <Bullet><Text id="ucp-19b">Some <Link url="https://unix.stackexchange.com/questions/25256/why-isnt-there-a-djvu2text">Linux machines may even come pre-installed with DjVuLibre</Link>. If not, you can use a package manager to install it for you, or compile it up easily from <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre/">source</Link> in the usual Unix manner.</Text></Bullet> 
     4809  <Bullet><Text id="ucp-19">DjVuLibre provides binary installers for <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre_Windows/">Windows</Link> and <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre_MacOS/">Mac</Link>. Grab the one for your operating system and install it somewhere sensible: somewhere you have permissions to install and run it from. On Windows, running the installer in the regular manner requires you to have admin permissions. If you don't have admin rights, you can run the installer as follows (instructions taken from <Link url="https://superuser.com/questions/171917/force-a-program-to-run-without-administrator-privileges-or-uac">this superuser exchange</Link>) to install DjVuLibre in a non-admin location. Use a text editor to create a file called <Format>nonadmin.bat</Format> (take care that the file doesn't end up with an additional <Format>.txt</Format> extension when saving it). Copy and paste, or carefully type, the following text into the file, then save and close it:</Text> 
     4810  <Format>cmd /min /C "set __COMPAT_LAYER=RUNASINVOKER &amp;&amp; start "" %1"</Format> 
     4811  <Text id="ucp-19a">Next, drag and drop the DjVuLibre setup executable onto the new <Format>nonadmin.bat</Format> file to run setup in a way that bypasses the admin privileges usually required for a successful installation. When installing, you'll now finally be allowed to choose a custom install directory, instead of the installer choosing an off-limits admin location like <Format>C:\Program Files (x86)</Format> for you. So make sure to choose a location in your User area as install directory.</Text> 
     4812  <Text id="ucp-19c">Upon successful installation, you're given the option to launch DjVuLibre's <i>DjView</i> tool, which will open the DjVuLibre manual (in djvu format). In the left pane of DjView, you can see a listing of the various tools DjVuLibre is comprised of, and read up on them. You can also read about <i>djvutxt</i> or the other DjVu tools that DjVuLibre provides in their <Link url="http://djvu.sourceforge.net/doc/index.html">documentation page</Link>, but for this tutorial, we'll just be using their <Format>djvutxt</Format> tool.</Text></Bullet> 
    48104813</BulletList> 
    48114814</NumberedItem> 
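The step of working out how to run the conversion tool by hand, before handing it to the UnknownConverterPlugin, can be sketched in Python. This is only an illustration: it assumes a `djvutxt` binary is reachable on the PATH and that it takes the input DjVu file followed by the output text file as arguments. The helper name `djvutxt_argv` and the file name `superhero.txt` are hypothetical and not part of the tutorial.

```python
import shutil
import subprocess
from pathlib import Path


def djvutxt_argv(djvutxt: str, input_file: str, output_file: str) -> list[str]:
    """Build the argument list for one djvutxt run (hypothetical helper)."""
    # Assumed invocation form: djvutxt <input.djvu> <output.txt>
    return [djvutxt, input_file, output_file]


# "superhero.djvu" is the tutorial's sample document; the output name is assumed.
argv = djvutxt_argv("djvutxt", "superhero.djvu", "superhero.txt")
print(" ".join(argv))

# Only attempt the conversion when the tool and the input file both exist.
if shutil.which(argv[0]) and Path(argv[1]).is_file():
    subprocess.run(argv, check=True)
```

If the manual run produces a readable text file, the same command is the starting point for the plugin's `exec_cmd` field.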
     
    48264829<Bullet><Text id="ucp-27">set its <Format>mime_type</Format> field to <Format>image/vnd.djvu</Format>, which is one of the <Link url="http://djvu.sourceforge.net/doc/man/nsdejavu.html">mime types for the DjVu format</Link></Text></Bullet> 
    48274830<Bullet><Text id="ucp-28">set its <Format>process_extension</Format> to <Format>djvu</Format></Text></Bullet> 
    4828 <Bullet><Text id="ucp-29">Finally, copy the full <Format>djvutxt</Format> command you ran from the commandline and paste it into the UnknownConverterPlugin Configuration dialog's <Format>exec_cmd</Format> field. Keep the full path to the <i>djvutxt</i> binary, but replace the entire input filepath with the literal string <Format>%%INPUT_FILE</Format> and replace the output filepath with the literal string <Format>%%OUTPUT</Format>.</Text> 
     4831<Bullet><Text id="ucp-29">Finally, copy the full <Format>djvutxt</Format> command you ran from the commandline and paste it into the UnknownConverterPlugin Configuration dialog's <Format>exec_cmd</Format> field. Keep the full path to the <i>djvutxt</i> binary, but replace the entire input filepath with the literal string <Format>%%INPUT_FILE</Format> and replace the output filepath with the literal string <Format>%%OUTPUT</Format>. </Text> 
    48294832<Text id="ucp-30">Doing so means that when you build the collection, Greenstone will replace <Format>%%INPUT_FILE</Format> with each DjVu document in your collection that it needs to process, and will replace <Format>%%OUTPUT</Format> with the expected text output file of each document upon conversion by <i>djvutxt</i>.</Text></Bullet> 
    48304833</BulletList> 
    4831 <Text id="ucp-31">If you have any spaces in any filepaths in your <Format>exec_cmd</Format>, make sure to always nest them in escaped double quotes (<Format>\"</Format>), so Greenstone can preserve the spaces in the filepath.</Text> 
     4834<Text id="ucp-31">If you have any spaces in any filepaths in your <Format>exec_cmd</Format>, make sure to always nest that entire filepath in escaped double quotes (<Format>\"</Format>), so Greenstone can preserve the spaces in it.</Text> 
    48324835<Text id="ucp-32">If any filepaths, other than <Format>%%INPUT_FILE</Format> and <Format>%%OUTPUT</Format> are within your Greenstone installation, you can use the <Format>%%GSDLHOME</Format><MajorVersion number="3">, <Format>%%GSDL3SRCHOME</Format> and <Format>%%GSDL3HOME</Format> (the latter for Greenstone 3's <Format>web</Format> folder)</MajorVersion> as placeholders and write out your filepaths relative to this. For instance, if your DjVuLibre is installed in your Greenstone's <Format>ext</Format> subfolder, then you would start the filepath to <i>djvutxt</i> with <Format>%%GSDL<MajorVersion number="3">3SRC</MajorVersion>HOME/ext</Format>.</Text> 
    4833 <Text id="ucp-32a">The value for your <Format>exec_cmd</Format> may look something like this, if you have DjVuLibre installed in <Path>C:\Program Files</Path>. Note the escaped double quotes bookending the path to <Format>djvutxt</Format>, to protect spaces in its filepath:</Text> 
     4836<Text id="ucp-32a">The value for your <Format>exec_cmd</Format> field may look something like the following, if you have DjVuLibre installed in <Path>C:\Program Files</Path>. Note the escaped double quotes bookending the path to <Format>djvutxt</Format>, to protect spaces in its filepath:</Text> 
    48344837<Format>\"C:\Program Files\DjVuLibre\djvutxt\" %%INPUT_FILE %%OUTPUT</Format> 
    48354838</NumberedItem> 
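The placeholder and quoting rules described above can be illustrated with a small Python sketch. It is not Greenstone's own implementation, only a model of the behaviour the tutorial text describes: paths containing spaces are wrapped in escaped double quotes, and `%%INPUT_FILE` and `%%OUTPUT` are replaced per document. The function names and the `C:\tmp` file paths are hypothetical.

```python
def quote_if_spaces(path: str) -> str:
    """Wrap a path in escaped double quotes (\\") when it contains a space."""
    return f'\\"{path}\\"' if " " in path else path


def expand_exec_cmd(template: str, input_file: str, output_file: str) -> str:
    """Substitute the two placeholders in an exec_cmd-style template."""
    return (template
            .replace("%%INPUT_FILE", input_file)
            .replace("%%OUTPUT", output_file))


# Build the exec_cmd value the way the tutorial describes it.
template = quote_if_spaces(r"C:\Program Files\DjVuLibre\djvutxt") + " %%INPUT_FILE %%OUTPUT"
print(template)

# Model what happens for one document at build time (hypothetical paths).
expanded = expand_exec_cmd(template, r"C:\tmp\superhero.djvu", r"C:\tmp\superhero.txt")
print(expanded)
```

The sketch leaves the substituted file paths unquoted; how Greenstone itself handles spaces in those paths is not covered by the text above.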
     
    48434846<Text id="ucp-38">Click OK to close the UnknownConverterPlugin configuration dialog. Quit GLI, since there's a little more work to do.</Text> 
    48444847</NumberedItem> 
    4845 <NumberedItem><Text id="ucp-39">Greenstone doesn't have an icon for DjVu documents, since it doesn't know about the format. If you Google for the djvu icon, you'd probably find the <Link url="http://djvu.sourceforge.net/doc/man/nsdejavu.html">Wikipedia page for it</Link>.</Text> 
    4846 <Text id="ucp-40">Save one of their DjVu icon images. Then open the image in GIMP or another image editor and use the application's scaling feature to scale the height or the width (whichever is greater) to anywhere between 26 and 32 pixels. Save the scaled image as a GIF file with the name "<Format>idjvu.gif</Format>", storing it in your Greenstone installation's <Format>web/interfaces/default/images</Format> folder.</Text> 
     4848<NumberedItem><Text id="ucp-39">Greenstone doesn't have an icon for DjVu documents, since it doesn't know about the format. If you Google for the djvu icon, you'd probably find the <Link url="https://en.wikipedia.org/wiki/DjVu">Wikipedia page for it</Link>.</Text> 
     4849<Text id="ucp-40">Save one of their DjVu icon images. Then open the image in Windows Paint or GIMP or another image editor, and use the application's scaling feature to scale the image's height or the width (whichever is greater) to anywhere between 26 and 32 pixels. Save the scaled image as a GIF file with the name "<Format>idjvu.gif</Format>", storing it in your Greenstone installation's <Format>web/interfaces/default/images</Format> folder.</Text> 
    48474850</NumberedItem> 
    48484851<NumberedItem><Text id="ucp-41">Greenstone knows nothing about the <Format>icondjvu</Format> macro we defined as the value for UnknownConverterPlugin's <Format>srcicon</Format> field, so we have to teach Greenstone about this new macro. Use a text editor to open your Greenstone 3's <Format>web/sites/localsite/siteConfig.xml</Format> file.</Text> 
     
    48594862</NumberedItem> 
    48604863<Heading> 
    4861 <Text id="ucp-46">Using the Icecite's commandline tool to convert from PDF to text</Text> 
    4862 </Heading> 
    4863 <Comment><Text id="ucp-47"><Link url="https://github.com/ckorzen/icecite">Icecite</Link> is an open-source tool that can do many PDF related tasks, including extracting text from a PDF. In this part of the tutorial, we're going to learn how to run Icecite's PDF to text conversion utility from the command line. Based on that command, we'll configure the UnknownConverterPlugin to launch Icecite from GLI, to do the conversion on a PDF document in a Greenstone collection. This ends up being a useful exercise in instances where certain PDFs aren't recognised by Greenstone's PDFPlugin, even when <Format>pdfbox_conversion</Format> option (which uses the PDFBox tool for the conversion) is switched on. In such cases, you can use what you learn here.</Text></Comment> 
     4864<Text id="ucp-46">Using Icecite's commandline tool to convert from PDF to text</Text> 
     4865</Heading> 
     4866<Comment><Text id="ucp-47"><Link url="https://github.com/ckorzen/icecite">Icecite</Link> (now known as <Link url="https://github.com/ad-freiburg/pdfact">PdfAct</Link>) is an open-source tool that can do many PDF related tasks, including extracting text from a PDF. In this part of the tutorial, we're going to learn how to run Icecite's PDF to text conversion utility from the command line. Based on that command, we'll configure the UnknownConverterPlugin to launch Icecite from GLI, to do the conversion on a PDF document in a Greenstone collection. This ends up being a useful exercise in instances where certain PDFs aren't recognised by Greenstone's PDFv1Plugin or old PDFPlugin, even when the <Format>pdfbox_conversion</Format> option (which uses the PDFBox tool for the conversion) is switched on. In such cases, you can use what you learn here. The PDFv2Plugin introduced since Greenstone 3.09 should however be able to handle more PDF documents out of the box, so try that first before using the UnknownConverterPlugin with Icecite.</Text></Comment> 
    48644867<Comment> 
    48654868<Text id="ucp-48">As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial.</Text> 
     
    48944897</Heading> 
    48954898<Comment><Text id="ucp-65">We're now ready to use the <AutoText text="UnknownConverterPlugin"/> to launch Icecite as the external tool to do the conversion, producing output that Greenstone's building scripts can ingest into Greenstone and index for searching.</Text></Comment> 
    4896 <NumberedItem><Text id="ucp-66">Run GLI</Text></NumberedItem> 
     4899<NumberedItem><Text id="ucp-66">Run GLI.</Text></NumberedItem> 
    48974900<NumberedItem><Text id="ucp-67">Create a new collection called Icecite. In the <AutoText key="glidict::GUI.Gather"/> pane, drop in the sample PDF file into your collection.</Text></NumberedItem> 
    48984901<NumberedItem><Text id="ucp-68">Go to the <AutoText key="glidict::GUI.Design"/> pane and select <AutoText key="glidict::CDM.GUI.Plugins"/> from the list on the left. Add the <AutoText text="UnknownConverterPlugin"/>. Having tried out the Icecite conversion command manually in the previous part of this tutorial, we're now ready to use it when configuring the <AutoText text="UnknownConverterPlugin"/>. Click <AutoText key="glidict::CDM.PlugInManager.Configure" type="button"/> and set up the plugin with the following settings:</Text>