Changeset 30856 for documentation

06.10.2016 18:42:14 (3 years ago)

First commit of the formatted (for tutorials) version of the GS3 tutorial Cmdline Incremental Building.

1 modified


  • documentation/trunk/tutorials/xml-source/tutorial_en.xml

    r30761 r30856  
    52945294<Text id="depositor-22">A major benefit of using <AutoText key="coredm::_depositor:textdepositor_"/> is that the user can upload documents and metadata remotely, without having to have Greenstone installed at the client end. <AutoText key="coredm::_depositor:textdepositor_"/> is a tool for remote data input, allowing you to also deposit items to collections built with the MG or MGPP indexers. The difference is that the MG and MGPP indexers need to rebuild the entire index after adding a new item, while the Lucene indexer incrementally adds the new document to the existing index.</Text> 
     5298<MajorVersion number="3"> 
     5299<Tutorial id="incremental_cmdline"> 
     5301<Text id="depositor-1">Incrementally building a collection using the command line</Text> 
     5303<SampleFiles folder="demo_NewFiles"/> 
     5304<Prerequisite id="indexers"/> 
     5305<Version initial="2.71" current="2.86"/> 
     5307<Text id="ic-00">Intro</Text> 
     5309<Text id="ic-01">In GLI, create a new collection called <i>Incremental With Manifests</i> and base it on the <i>Demo Collection</i>. The short name of this collection will become <i>incremen</i>, and this will be the name of the collection's folder on the file system.</Text> 
     5312<Text id="ic-02">Use GLI's Workspace view to navigate to this tutorial's sample files folder, <i>incr-build</i>. It will contain a folder named <i>import</i>. Open this, and drag and drop into your new collection the 3 subfolders within it.</Text> 
     5315<Text id="ic-03">Do not build the collection in GLI. Instead, open a terminal. We'll be building and rebuilding manually, from the command-line. To open a terminal in Windows, press the Windows key + R and type <Format>cmd</Format> in the <b>Run</b> dialog that displays. To open a terminal on a Mac machine, click on menu <Path>Go &rarr; Utilities &rarr; Terminal</Path>.</Text> 
     5318<Text id="ic-04">Close GLI if it's running. You can run the Greenstone server or not. In a text editor open your <Format>incremen</Format> collection's <Format>collectionConfig.xml</Format> file located in <Format>web\sites\localsite\collect\incremen\etc</Format>.</Text> 
     5319<Text id="ic-04a">Scroll down to the following line near the bottom:</Text> 
     5320<Format>&lt;importOption name=&quot;OIDtype&quot; value=&quot;dirname&quot;/&gt;</Format> 
     5321<Text id="ic-04b">Edit it to refer to the full filenames instead of directory names:</Text> 
     5322<Format>&lt;importOption name=&quot;OIDtype&quot; value=&quot;full_filename&quot;/&gt;</Format> 
     5323<Text id="ic-04c">The above step sets the identifiers used by Greenstone for this collection's documents to be based on their full filenames. Doing so will allow us to refer to the files by name in the &lt;Filename&gt; elements of any manifest file we use for incrementally building the collection. These &lt;Filename&gt; elements will then identify which files are to be indexed if newly added, which are to be deleted, and which to be re-indexed (as should happen if a document or its metadata has been edited).</Text> 
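The edit described above can also be scripted. The following is a minimal sketch, not part of Greenstone: the helper name `switch_oidtype` and the `<import>` wrapper in the sample text are assumptions made for illustration. It uses plain text replacement so that the rest of the configuration file is left untouched.

```python
# Hypothetical helper (not a Greenstone tool): switch the OIDtype import option
# from "dirname" to "full_filename" by replacing the exact line shown above.
OLD = '<importOption name="OIDtype" value="dirname"/>'
NEW = '<importOption name="OIDtype" value="full_filename"/>'


def switch_oidtype(config_text: str) -> str:
    """Return config_text with the first dirname OIDtype option replaced."""
    if OLD not in config_text:
        # Refuse to guess if the file does not contain the expected line.
        raise ValueError("expected OIDtype line not found; edit the file by hand")
    return config_text.replace(OLD, NEW, 1)


# Stand-in for the contents of collectionConfig.xml; the <import> wrapper
# element is an assumption, only the importOption line comes from the tutorial.
sample = '<import>\n  <importOption name="OIDtype" value="dirname"/>\n</import>\n'
updated = switch_oidtype(sample)
print(updated)
```

To apply it to a real collection, you would read the `collectionConfig.xml` file into a string, pass it to the helper, and write the result back, keeping a backup copy of the original file.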
     5326<Text id="ic-05">Since this is the first time we're building our collection, we're going to do a complete build. And we'll use the command line to do so. Use the terminal to <Format>cd</Format> into your Greenstone 3 installation folder. For instance, if you have your Greenstone installed on Windows as "<i>Greenstone3</i>" within your account folder at <Format>C:\Users\you</Format>, then type the following in your terminal and hit Enter:</Text> 
     5327<Format>cd C:\Users\you\Greenstone3</Format> 
     5328<Text id="ic-05a">On Linux or Macs, the general command is the same, but the installed location would be different and the slashes go the other way. For example, if installed in <Format>/Users/me/Greenstone3</Format>, you'd type the following and hit Enter:</Text> 
     5329<Format>cd /Users/me/Greenstone3</Format> 
     5330<Text id="ic-05b">Now you're ready to set up the Greenstone environment in your terminal. On Windows, type the following into your terminal and hit Enter again:</Text> 
     5332<Text id="ic-05c">On Linux and Mac:</Text> 
     5333<Format>source ./</Format> 
     5334<Text id="ic-05d">In terminals, you'll need to hit Enter after each command in order to execute the command you just finished typing. We won't repeat this instruction anymore. Just remember to hit Enter after every complete command entered into a terminal.</Text> 
     5335<Text id="ic-05e">With the terminal now operating within your Greenstone installation folder, and with the Greenstone environment now set up and ready, type the following command to do a complete build of your new collection. Although the command contains the word "rebuild", this is the first time the collection is being built, so it will simply build it.</Text> 
     5336<Format>perl -S -site localsite incremen</Format> 
     5337<Text id="ic-05f">Preview the collection. If the Greenstone server is not running (as would happen if you had closed GLI and didn't start the standalone Greenstone server application), then run it from the Start Menu on Windows now. You could also run the Greenstone 3 server by running the <Format>gs3-server.bat</Format> script in the terminal on Windows, or by running the <Format></Format> script from a Linux/Mac terminal.</Text> 
     5338<Text id="ic-05g">Leave the terminal (in which you have set up your Greenstone 3's environment) open for the rest of this tutorial exercise. We'll be using it throughout.</Text> 
     5341<Text id="ic-06">Incrementally rebuilding your collection after adding some additional new documents to it</Text> 
     5344<Text id="ic-06a">If you want you can use GLI to drag and drop the <i>fb33fe</i>, <i>fb34fe</i> and <i>wb34te</i> folders, located in the <i>incr-build/more-files</i> sample files subfolder, into your collection. 
     5345Alternatively, you can use a File Browser to copy the folders <i>fb33fe</i>, <i>fb34fe</i> and <i>wb34te</i>, located in the <i>incr-build/more-files</i> sample files subfolder, into your collection's <Format>import</Format> folder at <Format>web\sites\localsite\collect\incremen\import</Format>.</Text> 
     5346<Text id="ic-06b">The above step has only gathered 3 new documents into your collection. Since these changes have not been built yet, previewing the collection at this stage will not show them.</Text> 
     5349<Text id="ic-07">We want to build just the newly added documents into the collection if possible, instead of rebuilding everything. Return to the terminal you had left open. This time, instead of running <Format>full-rebuild</Format>, we'll run the <Format>incremental-import</Format> and <Format>incremental-buildcol</Format> scripts to perform the two phases of a Greenstone build operation incrementally. Incremental building allows us to (re)build just what is necessary, rather than everything.</Text> 
     5350<Text id="ic-07a">Since we know exactly which files have been added and thus which files need to be built, we can write a manifest file specifying this. The manifest files used by the Greenstone incremental building process are just XML files that can be created and edited in a plain text editor, and which indicate which files need to be (re)processed by a Greenstone incremental build operation.</Text> 
     5351<Text id="ic-07b">We've already prepared the manifest files we'll be using in this tutorial exercise for you. Use a File Browser to copy the <i>manifests</i> subfolder from the sample files folder into your <Format>incremen</Format> collection folder that's located inside your Greenstone 3 installation directory (at <Format>web\sites\localsite\collect\incremen</Format>).</Text> 
     5352<Text id="ic-07c">In a text editor, open the <i>add-new-files.xml</i> manifest file found in the newly copied <i>manifests</i> subfolder. Inspect the contents of this manifest file. It should contain:</Text> 
     5354&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;<br /> 
     5355&lt;Manifest&gt;<br /> 
     5356  <Tab n="1"/>&lt;Index&gt;<br /> 
     5357    <Tab n="2"/>&lt;Filename&gt;fb33fe/fb33fe.htm&lt;/Filename&gt;<br /> 
     5358    <Tab n="2"/>&lt;Filename&gt;fb34fe/fb34fe.htm&lt;/Filename&gt;<br /> 
     5359    <Tab n="2"/>&lt;Filename&gt;wb34te/wb34te.htm&lt;/Filename&gt;<br /> 
     5360  <Tab n="1"/>&lt;/Index&gt;<br /> 
     5363<Text id="ic-07d">The above lists the 3 main documents to be added/indexed by Greenstone (hence the keyword &lt;Index&gt;). Since these documents are located inside their own subfolders when copied into the <i>import</i> folder, the manifest file also indicates the relative folder structure of these documents, e.g. <i>"fb33fe/fb33fe.htm"</i> shows that the <i>fb33fe.htm</i> HTML document is located in the folder <i>fb33fe</i>. Only the main documents to be added are listed, not the associated image files also found at the same folder level, as Greenstone will track down all the image files referred to by the main html documents to be indexed and will process them as files associated with the html.</Text> 
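A manifest like the one above could also be generated rather than typed by hand. The following is a minimal sketch in Python, assuming the manifest layout shown in this tutorial (a Manifest root, one section element, and Filename children). The function name `build_manifest` is hypothetical and not part of Greenstone. It requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_manifest(section: str, filenames: list[str]) -> str:
    """Build a manifest with one <Index> or <Reindex> section of <Filename> entries."""
    # Only the filename-based sections seen in this tutorial are handled here.
    if section not in ("Index", "Reindex"):
        raise ValueError("section must be 'Index' or 'Reindex'")
    root = ET.Element("Manifest")
    section_element = ET.SubElement(root, section)
    for name in filenames:
        # Paths are relative to the collection's import folder.
        ET.SubElement(section_element, "Filename").text = name
    ET.indent(root, space="  ")  # pretty-print; available from Python 3.9
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


manifest = build_manifest(
    "Index",
    ["fb33fe/fb33fe.htm", "fb34fe/fb34fe.htm", "wb34te/wb34te.htm"],
)
print(manifest)
```

The resulting string could be saved as a file in the collection's `manifests` folder and passed to the incremental import step in the same way as the prepared manifest files.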
     5366<Text id="ic-08">We can finally run the commands for the incremental build operation.</Text> 
     5367<Text id="ic-08a">Use the terminal to first run the incremental import stage:</Text> 
     5368<Format>perl -S -manifest manifests/add-new-files.xml -site localsite incremen</Format> 
     5369<Text id="ic-08b">Once that finishes running, start off the incremental buildcol stage of the build process:</Text> 
     5370<Format>perl -S -activate -site localsite incremen</Format> 
     5371<Text id="ic-08c">The incremental import command specifies the manifest file that Greenstone is to consult in order to work out which files should be processed and how (Indexed, Deleted or Reindexed). By the buildcol stage, the specific files would then be ready for further incremental processing by the buildcol script. The activate flag to the incremental buildcol script tells Greenstone to (re-)activate the updated collection if the Greenstone 3 server is running.</Text> 
     5374<Text id="ic-09">Preview the collection either by running the Greenstone server application, if it isn't already, or by starting the Greenstone server from the command line with the command:</Text> 
     5375<Format>ant start</Format> 
     5376<Text id="ic-09a">(To stop the Greenstone server at any point, use the command <Format>ant stop</Format>. To stop-and-start it, you'd use <Format>ant restart</Format>.)</Text> 
     5377<Text id="ic-09b">When the server is running, preview your library home page, located by default at <Format>http://localhost:8383/greenstone3/library</Format>. Click on the Titles browser. There should be 3 additional documents now, and you should be able to search for terms that occur in them.</Text> 
     5380<Text id="ic-10">Incrementally rebuilding your collection after deleting some documents from it</Text> 
     5383<Text id="ic-10a">Inspect the <i>delete-some-files.xml</i> manifest file (located in your <Format>incremen</Format> collection folder's <i>manifests</i> subfolder). It contains:</Text> 
     5385&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;<br /> 
     5386&lt;Manifest&gt;<br /> 
     5387  <Tab n="1"/>&lt;Delete&gt;<br /> 
     5388    <Tab n="2"/>&lt;OID&gt;b18ase-b18ase_htm&lt;/OID&gt;<br /> 
     5389    <Tab n="2"/>&lt;OID&gt;fb33fe-fb33fe_htm&lt;/OID&gt;<br /> 
     5390  <Tab n="1"/>&lt;/Delete&gt;<br /> 
     5393<Text id="ic-10b">As per the above manifest file, the operation to be performed by an incremental build is a &lt;Delete&gt; operation involving two documents. For the delete operation, the documents are not indicated by the &lt;Filename&gt; XML element, but by the &lt;OID&gt; element which specifies the object identifier. We need to use the OID here because we're telling Greenstone precisely what the identifiers of the documents are that we wish to have removed from our collection. The identifiers of every built document in a Greenstone collection are specified in the Identifier field of the document's <i>doc.xml</i> file located in the collection's <Format>archives</Format> folder. The <i>doc.xml</i> file is the Greenstone-specific XML format in which Greenstone stores documents already imported.</Text> 
     5394<Text id="ic-10c">For instance, to find the identifier of the <i>b18ase.htm</i> document in your built collection, open up <Format>web/sites/localsite/collect/incremen/archives/b18ase-b.dir/doc.xml</Format> in a text editor. Then scroll down, looking for a piece of Greenstone extracted metadata labelled Identifier, which is the OID for this document:</Text> 
     5395<Format>&lt;Metadata name=&quot;Identifier&quot;&gt;b18ase-b18ase_htm&lt;/Metadata&gt;</Format> 
     5396<Text id="ic-10d">The above value for the document identifier is what's used in the <i>delete-some-files.xml</i> manifest file to refer to this document. This document is one of two that are to be deleted as per the manifest file. Make sure to close the <i>doc.xml</i> file if you have it open.</Text> 
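Looking up the identifier can be automated. The following is a minimal sketch in Python, not part of Greenstone. The function names are hypothetical, and the wrapper elements in the sample text are an assumption: only the Metadata element with name="Identifier" is taken from this tutorial. The `expected_oid` helper encodes a pattern inferred from the examples here (directory separator becomes a hyphen, full stop becomes an underscore) and should be treated as an assumption, so always confirm against the real <i>doc.xml</i>.

```python
import xml.etree.ElementTree as ET


def find_identifier(doc_xml_text: str):
    """Return the text of the first Metadata element named Identifier, or None."""
    root = ET.fromstring(doc_xml_text)
    for metadata in root.iter("Metadata"):
        if metadata.get("name") == "Identifier":
            return metadata.text
    return None


def expected_oid(relative_path: str) -> str:
    """Guess the OID for a file under import/ when OIDtype is full_filename.

    Assumption inferred from this tutorial's examples, not a documented rule:
    "/" becomes "-" and "." becomes "_".
    """
    return relative_path.replace("/", "-").replace(".", "_")


# Stand-in for part of a doc.xml file; the Archive/Section/Description
# wrappers are assumed for illustration.
sample = """<Archive><Section><Description>
<Metadata name="Identifier">b18ase-b18ase_htm</Metadata>
</Description></Section></Archive>"""

print(find_identifier(sample))
print(expected_oid("b18ase/b18ase.htm"))
```

For a real collection, you would read the relevant <i>doc.xml</i> file from the `archives` folder into a string and pass that to `find_identifier`.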
     5399<Text id="ic-11">So then, let's first physically remove these two documents from our collection, so that the contents of the import folder match what the manifest specifies: use a file browser to remove the folders <i>b18ase</i> and <i>fb33fe</i> from the collection's <Format>import</Format> folder.</Text> 
     5402<Text id="ic-12">Finally, let's incrementally rebuild the collection, specifying the manifest file that Greenstone should use this time to carry out the incremental build operation. As before, there are two steps.</Text> 
     5403<Text id="ic-12a">First run the modified incremental import command:</Text> 
     5404<Format>perl -S -manifest manifests/delete-some-files.xml -site localsite incremen</Format> 
     5405<Text id="ic-12b">When that has finished running, run the old incremental buildcol command again (it doesn't change):</Text> 
     5406<Format>perl -S -activate -site localsite incremen</Format> 
     5409<Text id="ic-13">When it has finished, preview the collection once more and check that the 2 documents have been removed. They should not turn up in the browse classifiers, nor in search results.</Text> 
     5412<Text id="ic-14">Incrementally rebuilding your collection after editing a document's text and modifying document metadata</Text> 
     5415<Text id="ic-14a">Inspect the <i>mod-text-and-meta.xml</i> manifest file (located in <Format>incremen/manifests</Format>) in a text editor. It should contain:</Text> 
     5417&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;<br /> 
     5418&lt;Manifest&gt;<br /> 
     5419  <Tab n="1"/>&lt;Reindex&gt;<br /> 
     5420    <Tab n="2"/>&lt;Filename&gt;fb34fe/fb34fe.htm&lt;/Filename&gt;<br /> 
     5421    <Tab n="2"/>&lt;Filename&gt;b20cre/b20cre.htm&lt;/Filename&gt;<br /> 
     5422  <Tab n="1"/>&lt;/Reindex&gt;<br /> 
     5425<Text id="ic-14b">Note the &lt;Reindex&gt; used this time. It indicates which documents <b>already</b> in the collection are to be re-processed when the collection is incrementally rebuilt as per this manifest file.</Text> 
     5428<Text id="ic-15">Open up the file <i>fb34fe/fb34fe.htm</i> of your <Format>incremen</Format> collection's <Format>import</Format> folder in a text editor and add, remove or change some text nested anywhere in between the HTML tags within the &lt;BODY&gt; tag. Be careful not to partially modify HTML element names or HTML entities (entities start with an ampersand, &amp;, and end with a semi-colon, ;), as doing so can make your text contents invalid HTML. 
     5430<Text id="ic-15a">Save and close the edited file.</Text> 
     5433<Text id="ic-16">Next, quit the Greenstone server application if it was running, so that the Greenstone server is stopped. Start up GLI. Open the incremen collection and go to the Enrich panel. Add or modify dc.Title metadata for the b20cre document.</Text> 
     5436<Text id="ic-17">Quit GLI. Optionally run the Greenstone server application.</Text> 
     5437<Text id="ic-17a">In the above two steps, we've modified the text contents of fb34fe and the metadata associated with b20cre. Our mod-text-and-meta.xml manifest file already indicates that these two files are to be reindexed, so we can go ahead and incrementally rebuild the collection with this manifest file.</Text> 
     5440<Text id="ic-18">Run the incremental rebuild operation to re-process just these two files. To do so, pass the <Format>mod-text-and-meta.xml</Format> manifest file this time.</Text> 
     5441<Text id="ic-18a">First run:</Text> 
     5442<Format>perl -S -manifest manifests/mod-text-and-meta.xml -site localsite incremen</Format> 
     5443<Text id="ic-18b">Followed by:</Text> 
     5444<Format>perl -S -activate -site localsite incremen</Format> 
     5447<Text id="ic-19">Preview the collection once more. Check that the 2 documents contain your edits: try searching for any additional words you added. Also check that the dc.Title metadata you modified can now be searched and appears as the title for the b20cre document in the Titles browsing classifier.</Text> 
     5449<Text id="ic-20">&lt;CONCLUSION&gt;</Text> 
     5450<Text id="ic-20a">In this tutorial, we looked at cutting down the amount of time spent on rebuilding a collection by manually controlling the rebuild operation so that it processes only what has changed. We do so by means of a manifest that specifies exactly what files need to be rebuilt and how (whether they need to be Indexed, Deleted or Reindexed).</Text> 
     5451<Text id="ic-20b">&lt;Also mention how lucene provides incremental-buildcol too, whereas mg and mgpp only provide incremental-import.&gt;</Text> 
     5452<Text id="ic-20c">Note: There's no search highlighting in collection documents that were modified and then incrementally rebuilt.</Text>