Changeset 32894
- Timestamp: 2019-03-13T18:52:17+13:00
- File: 1 edited
documentation/trunk/tutorials/xml-source/tutorial_en.xml
r32606 r32894 4784 4784 <Comment><Text id="ucp-02">This is an advanced tutorial, in that it not only supposes you have familiarised yourself with most of what you've learned in preceding tutorials, but that you're also comfortable with downloading and installing software from the web, and have a little familiarity with using image editing software.</Text></Comment> 4785 4785 <Comment><Text id="ucp-03">The <AutoText text="UnknownConverterPlugin"/> builds on the idea of the <AutoText text="UnknownPlugin"/>, in that it can be configured to handle documents of unknown format and file extension. It can also be made to handle documents with known file extensions in a custom manner.</Text></Comment> 4786 <Comment><Text id="ucp-04">The UnknownConverterPlugin extends the UnknownPlugin's abilities by letting you launch a tool you have installed on your own operating system that can be run from the commandline to convert from the "unknown" file format to either text, html or gif/jpg/png images, or a folder of these. If you know how to launch this tool from the commandline to do the conversion, then you would configure the UnknownConverterPlugin by supplying the file format (file extension) of the documents it should process, the expected output file format (text, html or paged images), and the tool's conversion command that the UnknownConverterPlugin should launch to perform the conversion. In place of the input file and the output file or folder you provide placeholders in the command to run. Once configured, the UnknownConverterPlugin will be used during building to process documents that match the specified file format. The conversion tool will be launchedwith the command provided, and the expected output files as specified can then be processed by Greenstone in the usual manner.</Text></Comment>4787 <Comment><Text id="ucp-05">An example would be djvu files, for which Greenstone provides no custom plugin. 
However, there's a free commandline tool available for unix systems that can convert from djvu to one of the text based formatthat Greenstone can process, text or html. So in this case, you could try using the UnknownConverterPlugin with the commandline tool on djvu files that you've gathered. The result should be that the djvu files in your Greenstone collection are now searchable.</Text></Comment>4786 <Comment><Text id="ucp-04">The UnknownConverterPlugin extends the UnknownPlugin's abilities by letting you launch a tool you have installed on your own PC that can be run from the commandline to convert from the "unknown" file format to either text, html or gif/jpg/png images, or a folder of these. If you know how to launch this tool from the commandline to do the conversion, then you would configure the UnknownConverterPlugin by supplying the file format (file extension) of the documents it should process, the expected output file format (text, html or paged images), and the tool's conversion command that the UnknownConverterPlugin should launch to perform the conversion. In place of the input file and the output file or folder, you provide placeholders in the command to run. Once configured, the UnknownConverterPlugin will be used during building to process documents that match the specified file format. It will launch the commandline conversion tool with the command provided, and the expected output files as specified can then be processed by Greenstone in the usual manner.</Text></Comment> 4787 <Comment><Text id="ucp-05">An example scenario would be if your collection contained djvu files, for which Greenstone provides no custom plugin. However, there's a free commandline tool available that can convert from djvu to one of the text based formats that Greenstone can process, text or html. So in this case, you could try using the UnknownConverterPlugin with the commandline tool on djvu files that you've gathered. 
The result should be that the djvu files in your Greenstone collection are now searchable.</Text></Comment> 4788 4788 <Heading><Text id="ucp-06">Working with DjVu documents in Greenstone</Text></Heading> 4789 4789 <Text id="ucp-07">DjVu (pronounced like the French phrase <i>déjà vu</i>) is a <Link url="https://www.djvuzone.org/">document format</Link> suited for archiving digital documents. <Link url="http://djvu.sourceforge.net/doc/index.html">DjVuLibre</Link>, which provides open source tools for processing DjVu documents, describes DjVu as</Text> 4790 4790 <Comment><Text id="ucp-07a">"a web-centric format and software platform for distributing documents and images. DjVu can advantageously replace PDF, PS, TIFF, JPEG, and GIF for distributing scanned documents, digital documents, or high-resolution pictures. DjVu content downloads faster, displays and renders faster, looks nicer on a screen, and consume less client resources than competing formats. DjVu images display instantly and can be smoothly zoomed and panned with no lengthy re-rendering. DjVu is used by hundreds of academic, commercial, governmental, and non-commercial web sites around the world."</Text></Comment> 4791 <Text id="ucp-08">In this part of the tutorial we'll see how to get Greenstone to not just include a collection's DjVu documents, but make them searchable too. There are several tools out there to convert a DjVu document into text or HTML. For instance, Linux users can install the <i>ocrodjvu</i> package and use its <i>djvu2hocr</i> tool to extract the text content in HTML format. Janusz S. Bien, a Greenstone user on the mailing list, has recommended it as being of possible use to Greenstone users, as it's a front-end to OCR programs. 
In this tutorial, however, we'll look at using <i>djvutxt</i> which is part of the DjVuLibre suite of tools .</Text>4791 <Text id="ucp-08">In this part of the tutorial we'll see how to get Greenstone to not just include a collection's DjVu documents, but make them searchable too. There are several tools out there to convert a DjVu document into text or HTML. For instance, Linux users can install the <i>ocrodjvu</i> package and use its <i>djvu2hocr</i> tool to extract the text content in HTML format. Janusz S. Bien, a Greenstone user on the mailing list, has recommended it as being of possible use to Greenstone users, as it's a front-end to OCR programs. In this tutorial, however, we'll look at using <i>djvutxt</i> which is part of the DjVuLibre suite of tools and which is also available for other operating systems like Windows.</Text> 4792 4792 <Heading><Text id="ucp-09">Extracting the text from DjVu documents with DjVuLibre's djvutxt</Text></Heading> 4793 4793 <NumberedItem><Text id="ucp-10">Start up GLI and create a new collection called <i>DjVu Collection</i>.</Text> … … 4801 4801 <Text id="ucp-14">If you were to search through the <AutoText key="glidict::GUI.Design"/> pane's <AutoText key="glidict::CDM.GUI.Plugins"/> for a "DjVuPlugin", you wouldn't find one, because Greenstone hasn't got one. Greenstone knows about a lot of common formats, but there's a great many formats that different people like to work with that Greenstone knows nothing about and which Greenstone developers have not created a custom plugin for.</Text> 4802 4802 </NumberedItem> 4803 <Comment><Text id="ucp-15">You've already learnt about the <AutoText text="UnknownPlugin"/> in the <i>Multimedia</i> tutorial and know that it can be configured to process document formats for which Greenstone has no custom plugin. 
However, UnknownPlugin cannot index textual document formats that are unknown to Greenstone to make them searchable upon build , because it doesn't know anything about their internal structure and consequently doesn't know how to extract their text content.</Text></Comment>4804 <Comment><Text id="ucp-16">This is where the <AutoText text="UnknownConverterPlugin"/> comes in. It builds on the idea of the UnknownPlugin, allowing you to work with document formats unknown to Greenstone. But it offer the additional advantage of being able to extract the text of the unknown document basedon an important proviso: that you have a software tool installed on your machine, one that can be run readily from the commandline, which can perform the process of converting the unknown document format into text or HTML (or a series of images). If the tool can convert the document to text or HTML, Greenstone can proceed as usual to index the content to make it searchable on previewing.</Text></Comment>4805 <NumberedItem><Text id="ucp-17">So in order to process the "superhero.djvu" document in our collection, such that its text content gets indexed for searching, we need to do a number of things: find out if there's a free djvu to text conversion tool out there, work out how to run it from the commandline and finally configure the UnknownConverterPlugin to run itfor us, so Greenstone can take care of the rest.</Text>4806 <Text id="ucp-18">We're in luck, because among the DjVu related tools that <Link url="http://djvu.sourceforge.net">DjVuLibre</Link> provides a toolcalled "<Format>djvutxt</Format>" that can perform the text extraction for us. DjVuLibre is available for Windows, Mac and Linux:</Text>4803 <Comment><Text id="ucp-15">You've already learnt about the <AutoText text="UnknownPlugin"/> in the <i>Multimedia</i> tutorial and know that it can be configured to process document formats for which Greenstone has no custom plugin. 
However, UnknownPlugin cannot index textual document formats that are unknown to Greenstone to make them searchable upon building, because it doesn't know anything about their internal structure and consequently doesn't know how to extract their text content.</Text></Comment> 4804 <Comment><Text id="ucp-16">This is where the <AutoText text="UnknownConverterPlugin"/> comes in. It builds on the idea of the UnknownPlugin, allowing you to work with document formats unknown to Greenstone. But it offers the additional advantage of being able to extract the text of the unknown document, depending on an important proviso: that you have a software tool installed on your machine, one that can be run readily from the commandline, which can perform the process of converting the unknown document format into text or HTML (or a series of images). If the tool can convert the document to text or HTML, Greenstone can proceed as usual to index the content to make it searchable on previewing.</Text></Comment> 4805 <NumberedItem><Text id="ucp-17">So in order to process the "superhero.djvu" document in our collection, such that its text content gets indexed for searching, we need to do a number of things: find out if there's a free djvu to text conversion tool out there, work out how to run it from the commandline and finally configure the UnknownConverterPlugin to automatically run this commandline tool for us, so Greenstone can take care of the rest.</Text> 4806 <Text id="ucp-18">We're in luck, because among the DjVu related tools that <Link url="http://djvu.sourceforge.net">DjVuLibre</Link> provides is one called "<Format>djvutxt</Format>" that can perform the text extraction for us. 
DjVuLibre is available for Windows, Mac and Linux:</Text> 4807 4807 <BulletList> 4808 <Bullet><Text id="ucp-19">DjVuLibre provides binary installers for <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre_Windows/">Windows</Link> and <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre_MacOS/">Mac</Link>. Grab the one for your operating system and install it somewhere sensible, somewhere you have permissions to install and run it from. Upon successful installation, you're given the option to launch DjVuLibre's <i>DjView</i> tool, which will open the DjVuLibre manual (in djvu format). In the left pane of DjView, you can see a listing of the various tools DjVuLibre is comprised of, and read up on them. You can also read about <i>djvutxt</i> or the other DjVu tools that DjVuLibre provides in their <Link url="http://djvu.sourceforge.net/doc/index.html">documentation page</Link>, but for this tutorial, we'll just be using their <Format>djvutxt</Format> tool.</Text></Bullet> 4809 <Bullet><Text id="ucp-19b">As for Linux, some <Link url="https://unix.stackexchange.com/questions/25256/why-isnt-there-a-djvu2text">Linux machines may even come pre-installed with DjVuLibre</Link>. If not, you can use a package manager to install it for you, or compile it up easily from <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre/">source</Link> in the usual Unix manner.</Text></Bullet> 4808 <Bullet><Text id="ucp-19b">Some <Link url="https://unix.stackexchange.com/questions/25256/why-isnt-there-a-djvu2text">Linux machines may even come pre-installed with DjVuLibre</Link>. 
If not, you can use a package manager to install it for you, or compile it up easily from <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre/">source</Link> in the usual Unix manner.</Text></Bullet> 4809 <Bullet><Text id="ucp-19">DjVuLibre provides binary installers for <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre_Windows/">Windows</Link> and <Link url="https://sourceforge.net/projects/djvu/files/DjVuLibre_MacOS/">Mac</Link>. Grab the one for your operating system and install it somewhere sensible: somewhere you have permissions to install and run it from. On Windows, running the installer in the regular manner requires you to have admin permissions. If you don't have admin rights, you can run the installer as follows (instructions taken from <Link url="https://superuser.com/questions/171917/force-a-program-to-run-without-administrator-privileges-or-uac">this superuser exchange</Link>) to install DjVuLibre in a non-admin location. Use a text editor to create a file called <Format>nonadmin.bat</Format> (beware the file doesn't end up with an additional <Format>.txt</Format> extension when saving it). Copy and paste, or carefully type, the following text into the file, then save and close it:</Text> 4810 <Format>cmd /min /C "set __COMPAT_LAYER=RUNASINVOKER && start "" %1"</Format> 4811 <Text id="ucp-19a">Next, drag and drop the DjVuLibre setup executable onto the new <Format>nonadmin.bat</Format> file to run setup in a way that bypasses the admin privileges usually required for a successful installation. When installing, you'll now finally be allowed to choose a custom install directory, instead of the installer choosing an off-limits admin location like <Format>C:\Program Files (x86)</Format> for you. 
So make sure to choose a location in your User area as install directory.</Text> 4812 <Text id="ucp-19c">Upon successful installation, you're given the option to launch DjVuLibre's <i>DjView</i> tool, which will open the DjVuLibre manual (in djvu format). In the left pane of DjView, you can see a listing of the various tools DjVuLibre is comprised of, and read up on them. You can also read about <i>djvutxt</i> or the other DjVu tools that DjVuLibre provides in their <Link url="http://djvu.sourceforge.net/doc/index.html">documentation page</Link>, but for this tutorial, we'll just be using their <Format>djvutxt</Format> tool.</Text></Bullet> 4810 4813 </BulletList> 4811 4814 </NumberedItem> … … 4826 4829 <Bullet><Text id="ucp-27">set its <Format>mime_type</Format> field to <Format>image/vnd.djvu</Format>, which is one of the <Link url="http://djvu.sourceforge.net/doc/man/nsdejavu.html">mime types for the DjVu format</Link></Text></Bullet> 4827 4830 <Bullet><Text id="ucp-28">set its <Format>process_extension</Format> to <Format>djvu</Format></Text></Bullet> 4828 <Bullet><Text id="ucp-29">Finally, copy the full <Format>djvutxt</Format> command you ran from the commandline and paste it into the UnknownConverterPlugin Configuration dialog's <Format>exec_cmd</Format> field. Keep the full path to the <i>djvutxt</i> binary, but replace the entire input filepath with the literal string <Format>%%INPUT_FILE</Format> and replace the output filepath with the literal string <Format>%%OUTPUT</Format>. </Text>4831 <Bullet><Text id="ucp-29">Finally, copy the full <Format>djvutxt</Format> command you ran from the commandline and paste it into the UnknownConverterPlugin Configuration dialog's <Format>exec_cmd</Format> field. Keep the full path to the <i>djvutxt</i> binary, but replace the entire input filepath with the literal string <Format>%%INPUT_FILE</Format> and replace the output filepath with the literal string <Format>%%OUTPUT</Format>. 
</Text> 4829 4832 <Text id="ucp-30">Doing so means that when you build the collection, Greenstone will replace <Format>%%INPUT_FILE</Format> with each DjVu document in your collection that it needs to process, and will replace <Format>%%OUTPUT</Format> with the expected text output file of each document upon conversion by <i>djvutxt</i>.</Text></Bullet> 4830 4833 </BulletList> 4831 <Text id="ucp-31">If you have any spaces in any filepaths in your <Format>exec_cmd</Format>, make sure to always nest th em in escaped double quotes (<Format>\"</Format>), so Greenstone can preserve the spaces in the filepath.</Text>4834 <Text id="ucp-31">If you have any spaces in any filepaths in your <Format>exec_cmd</Format>, make sure to always nest that entire filepath in escaped double quotes (<Format>\"</Format>), so Greenstone can preserve the spaces in it.</Text> 4832 4835 <Text id="ucp-32">If any filepaths, other than <Format>%%INPUT_FILE</Format> and <Format>%%OUTPUT</Format> are within your Greenstone installation, you can use the <Format>%%GSDLHOME</Format><MajorVersion number="3">, <Format>%%GSDL3SRCHOME</Format> and <Format>%%GSDL3HOME</Format> (the latter for Greenstone 3's <Format>web</Format> folder)</MajorVersion> as placeholders and write out your filepaths relative to this. For instance, if your DjVuLibre is installed in your Greenstone's <Format>ext</Format> subfolder, then you would start the filepath to <i>djvutxt</i> with <Format>%%GSDL<MajorVersion number="3">3SRC</MajorVersion>HOME/ext</Format>.</Text> 4833 <Text id="ucp-32a">The value for your <Format>exec_cmd</Format> may look something like this, if you have DjVuLibre installed in <Path>C:\Program Files</Path>. Note the escaped double quotes bookending the path to <Format>djvutxt</Format>, to protect spaces in its filepath:</Text>4836 <Text id="ucp-32a">The value for your <Format>exec_cmd</Format> field may look something like the following, if you have DjVuLibre installed in <Path>C:\Program Files</Path>. 
Note the escaped double quotes bookending the path to <Format>djvutxt</Format>, to protect spaces in its filepath:</Text> 4834 4837 <Format>\"C:\Program Files\DjVuLibre\djvutxt\" %%INPUT_FILE %%OUTPUT</Format> 4835 4838 </NumberedItem> … … 4843 4846 <Text id="ucp-38">Click OK to close the UnknownConverterPlugin configuration dialog. Quit GLI, since there's a little more work to do.</Text> 4844 4847 </NumberedItem> 4845 <NumberedItem><Text id="ucp-39">Greenstone doesn't have an icon for DjVu documents, since it doesn't know about the format. If you Google for the djvu icon, you'd probably find the <Link url="http ://djvu.sourceforge.net/doc/man/nsdejavu.html">Wikipedia page for it</Link>.</Text>4846 <Text id="ucp-40">Save one of their DjVu icon images. Then open the image in GIMP or another image editor and use the application's scaling feature to scale theheight or the width (whichever is greater) to anywhere between 26 and 32 pixels. Save the scaled image as a GIF file with the name "<Format>idjvu.gif</Format>", storing it in your Greenstone installation's <Format>web/interfaces/default/images</Format> folder.</Text>4848 <NumberedItem><Text id="ucp-39">Greenstone doesn't have an icon for DjVu documents, since it doesn't know about the format. If you Google for the djvu icon, you'd probably find the <Link url="https://en.wikipedia.org/wiki/DjVu">Wikipedia page for it</Link>.</Text> 4849 <Text id="ucp-40">Save one of their DjVu icon images. Then open the image in Windows Paint or GIMP or another image editor, and use the application's scaling feature to scale the image's height or the width (whichever is greater) to anywhere between 26 and 32 pixels. 
Save the scaled image as a GIF file with the name "<Format>idjvu.gif</Format>", storing it in your Greenstone installation's <Format>web/interfaces/default/images</Format> folder.</Text> 4847 4850 </NumberedItem> 4848 4851 <NumberedItem><Text id="ucp-41">Greenstone knows nothing about the <Format>icondjvu</Format> macro we defined as the value for UnknownConverterPlugin's <Format>srcicon</Format> field, so we have to teach Greenstone about this new macro. Use a text editor to open your Greenstone 3's <Format>web/sites/localsite/siteConfig.xml</Format> file.</Text> … … 4859 4862 </NumberedItem> 4860 4863 <Heading> 4861 <Text id="ucp-46">Using theIcecite's commandline tool to convert from PDF to text</Text>4862 </Heading> 4863 <Comment><Text id="ucp-47"><Link url="https://github.com/ckorzen/icecite">Icecite</Link> is an open-source tool that can do many PDF related tasks, including extracting text from a PDF. In this part of the tutorial, we're going to learn how to run Icecite's PDF to text conversion utility from the command line. Based on that command, we'll configure the UnknownConverterPlugin to launch Icecite from GLI, to do the conversion on a PDF document in a Greenstone collection. This ends up being a useful exercise in instances where certain PDFs aren't recognised by Greenstone's PDFPlugin, even when <Format>pdfbox_conversion</Format> option (which uses the PDFBox tool for the conversion) is switched on. In such cases, you can use what you learn here.</Text></Comment>4864 <Text id="ucp-46">Using Icecite's commandline tool to convert from PDF to text</Text> 4865 </Heading> 4866 <Comment><Text id="ucp-47"><Link url="https://github.com/ckorzen/icecite">Icecite</Link> (now known as <Link url="https://github.com/ad-freiburg/pdfact">PdfAct</Link>) is an open-source tool that can do many PDF related tasks, including extracting text from a PDF. 
In this part of the tutorial, we're going to learn how to run Icecite's PDF to text conversion utility from the command line. Based on that command, we'll configure the UnknownConverterPlugin to launch Icecite from GLI, to do the conversion on a PDF document in a Greenstone collection. This ends up being a useful exercise in instances where certain PDFs aren't recognised by Greenstone's PDFv1Plugin or old PDFPlugin, even when the <Format>pdfbox_conversion</Format> option (which uses the PDFBox tool for the conversion) is switched on. In such cases, you can use what you learn here. The PDFv2Plugin introduced in Greenstone 3.09 should however be able to handle more PDF documents out of the box, so try that first before using the UnknownConverterPlugin with Icecite.</Text></Comment> 4864 4867 <Comment> 4865 4868 <Text id="ucp-48">As Icecite needs Java 8, you need to have either a JDK8 or a JRE8 installed in order to proceed with this portion of the tutorial.</Text> … … 4894 4897 </Heading> 4895 4898 <Comment><Text id="ucp-65">We're now ready to use the <AutoText text="UnknownConverterPlugin"/> to launch Icecite as the external tool to do the conversion, producing output that Greenstone's building scripts can ingest into Greenstone and index for searching.</Text></Comment> 4896 <NumberedItem><Text id="ucp-66">Run GLI </Text></NumberedItem>4899 <NumberedItem><Text id="ucp-66">Run GLI.</Text></NumberedItem> 4897 4900 <NumberedItem><Text id="ucp-67">Create a new collection called Icecite. In the <AutoText key="glidict::GUI.Gather"/> pane, drop the sample PDF file into your collection.</Text></NumberedItem> 4898 4901 <NumberedItem><Text id="ucp-68">Go to the <AutoText key="glidict::GUI.Design"/> pane and select <AutoText key="glidict::CDM.GUI.Plugins"/> from the list on the left. Add the <AutoText text="UnknownConverterPlugin"/>. 
Having tried out the Icecite conversion command manually in the previous part of this tutorial, we're now ready to use it when configuring the <AutoText text="UnknownConverterPlugin"/>. Click <AutoText key="glidict::CDM.PlugInManager.Configure" type="button"/> and set up the plugin with the following settings:</Text>
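For illustration only, the placeholder handling that ucp-29 to ucp-31 describe for the <Format>exec_cmd</Format> field can be sketched in Python. This is not Greenstone's own code: the function name `expand_exec_cmd`, the example file paths, and the decision to wrap the substituted paths in double quotes are all assumptions made for this sketch. It only shows the general idea of unescaping `\"` and filling in `%%INPUT_FILE` and `%%OUTPUT` for one document.

```python
def expand_exec_cmd(template: str, input_file: str, output_file: str) -> str:
    """Fill in an exec_cmd template for a single document.

    Hypothetical helper, not part of Greenstone. It assumes that:
      * escaped double quotes (backslash + quote) become plain double quotes,
      * %%INPUT_FILE and %%OUTPUT are replaced by quoted file paths.
    """
    command = template.replace('\\"', '"')  # unescape \" so spaces in paths survive
    command = command.replace("%%INPUT_FILE", f'"{input_file}"')
    command = command.replace("%%OUTPUT", f'"{output_file}"')
    return command


if __name__ == "__main__":
    # The exec_cmd value shown in the tutorial text (ucp-32a).
    template = r'\"C:\Program Files\DjVuLibre\djvutxt\" %%INPUT_FILE %%OUTPUT'
    # Both paths below are made up for the example.
    print(expand_exec_cmd(template,
                          r"C:\docs\superhero.djvu",
                          r"C:\tmp\superhero.txt"))
```

The sketch uses plain string replacement rather than regular expressions, so the backslashes in Windows paths need no special treatment.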