===== Overview =====

Building tools (matching the import and buildcol phases of Greenstone) that make use of the OpenMPI library to provide flexible parallel processing capabilities to Greenstone.

==== Import ====

Currently the import tool is quite basic: it splits the collection into fixed-size batches and processes each batch with a separate processor.

==== Buildcol ====

The buildcol tool is slightly more complex. It lets the Greenstone Perl code create an XML 'recipe' describing the collection build process and any precedence requirements, and then follows this recipe, farming out parallelizable parts of the collection build to separate processes where possible.

===== Requirements =====

Requires the GS2 extension "tdb-edit" to be installed, compiled and enabled, in order to allow multiple parallel readers/writers to access key datastore/base files (such as the archive-inf databases).

==== Supported InfoDBs ====

The "tdb-edit" extension was created specifically for parallel building, so it is 100% compatible. Included in this extension are versions of the GDBM tools txt2db and db2txt (named txt2dbl and db2txtl respectively) that use simple file locking to allow parallel readers/writers, albeit less efficiently. SQLite is currently not supported (see below). No other infodb types are currently supported.

=== GDBM ===

*NEW* GDBM support is now available through the 'gdbmserver' database driver. This makes use of a GDBMCLI invoked via a daemonized server, so the Greenstone processes become clients in a client/server model. This not only allows GDBM databases to be used in parallel building, but should also improve serial building, as it reduces the number of GDBM database opens/closes from 1-2 per document in the build process to 1-2 per build process. (Obviously the gain won't be huge with 10 documents, especially given the extra cost of starting, waiting for, and stopping the GDBMServers, but over one million documents that difference is significant.)

The GDBMServer class requires two non-standard CPAN packages, namely IPC::Run and Proc::Daemon. I've included the source for these in the packages folder, but for now compiling them is a manual process (go to packages, extract, go to the folder, run "perl Makefile.PL PREFIX=$GEXTPARALLELBUILDING", then "make", then "make install"). Logs are dropped in the logs directory, but good luck making heads or tails of them in a parallel building context (even with write buffering/synchronization the server communications are going to be all jumbled up).

=== SQLite ===

The SQLite (3.6.23.1) included in circa-2.84 Greenstone, while allowing multiple readers and one active writer via the CLI, does not support parallel writers. Any attempt to use the CLI to write to a database table already opened exclusively by another writer results in an "Error: database is locked" being thrown in the SQLite library and the write being ignored.

I can see two possible ways to work around this. The first, and I guess preferred, is to move away from using the CLI and instead use native Perl DBI/DBD. By doing so we can listen for, and react appropriately to, SQLITE_BUSY errors (by waiting and then retrying). We could even, conceivably, implement some locking schema similar to HS's changes to GDBM. There may even be versions of the SQLite DBD (OpenPGK?) that allow you to specify that writes *must* succeed and to retry indefinitely until they do.
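To make the idea concrete, here's a minimal sketch of that first approach - retrying a write whenever SQLite reports the database as locked. This is illustration only, not code shipped with the extension; the database path, table name and values are invented:

<code perl>
use strict;
use warnings;
use DBI;
use Time::HiRes qw(usleep);

# Hypothetical database path for illustration
my $dbh = DBI->connect("dbi:SQLite:dbname=gsdl.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

# DBD::SQLite's built-in busy handler: wait up to 10 seconds for a lock
# before the driver gives up and reports SQLITE_BUSY
$dbh->sqlite_busy_timeout(10_000);

# Belt-and-braces fallback: catch "database is locked" errors ourselves,
# back off briefly, and retry the write
sub retried_do {
    my ($dbh, $sql, @bind) = @_;
    for my $attempt (1 .. 100) {
        my $ok = eval { $dbh->do($sql, undef, @bind); 1 };
        return 1 if $ok;
        die $@ unless $@ =~ /database is locked/;
        usleep(50_000 + int(rand 50_000));  # sleep 50-100ms, then retry
    }
    die "write still blocked after 100 attempts\n";
}

# Invented table/columns, purely to show the call shape
retried_do($dbh,
    "INSERT OR REPLACE INTO data (key, value) VALUES (?, ?)",
    "HASH0123", "<serialised metadata>");
</code>

In most cases the busy-timeout handler alone should absorb the contention; the explicit retry loop only matters when a writer holds the lock for longer than the timeout.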
As an alternative, I tried upgrading to the latest version of SQLite (3.7.8), which includes a new feature called Write-Ahead Logging. Presumably this feature allows multiple writers on a single database - so it sounded perfect for our purposes. However, in practice there still appear to be locking issues, with the occasional import/build throwing one or more of these errors:

* Error: database is locked
* Use of uninitialized value $filenames[0] in join or string at /research/jmt12/gsdl-64-svn/ext/parallel-building/perllib/util.pm line 860.
* Unexpected EOF in "bitio_m_stdio.cpp" on line 22[1316641505][Worker2] complete

All subsequent builds without a successful import (presumably on a now-corrupt archives directory) produce garbage. I've included the SQLite source in the packages folder for anyone who wants to try to figure out why. Meanwhile, I've added a sanity test to buildcol.pl that explicitly ignores -parallel if the infodb type is sqlite (and prints a warning about doing so).

===== bin/script and perllib =====

**Note:** The following is included for historical reasons - these changes have since been merged (or otherwise dealt with) by major changes to the way the import and build scripts are run. The customized files in Parallel Building's perllib are now fewer in number, and tend to rely on proper class inheritance and overriding.

In order to try to make this compatible with the latest advances in the main trunk (so not the 64-bit version I've been testing on), I've implemented the parallel building using an SVN head version of import.pl, buildcol.pl and perllib. I'll try to keep a list of the files I've changed here, to aid in merging this code back into Greenstone:

* bin/script/buildcol.pl: extended with a -parallel argument and the code to use it (basically, call builder methods to construct a 'recipe' of how the build should proceed, and then call mpibuilder with the recipe to do the actual build!)
* perllib/IncrementalBuildTools.pm: made modifying @INC conditional on the paths not already existing in @INC - otherwise all our clever @INC building to support extensions is clobbered. Similar change for modifying PATH.
* perllib/basebuilder.pm: added an indexlevel variable to the build_indexes() function (although it doesn't do anything in the base class), and also added a dummy prepare_build_recipe() function that just complains that the inheriting class should implement it.
* perllib/classify.pm: modified the classifier loader to use @INC rather than hardcoded locations - so extensions are properly supported (two places).
* perllib/colcfg.pm: added indexname and indexlevel CLI configuration options.
* perllib/dbutil.pm: removed, as I want the one in the TDB-Edit extension (with its dynamic database driver loading) to apply.
* perllib/doc.pm: commented out a part hardcoded(?) to a particular 64-bit version of Perl... is this from my earlier checkins?
* perllib/lucenebuilder.pm: added an indexlevel parameter and code to ensure that, if specified, only that level is built (with appropriate testing for the 'paragraph' level, which Lucene doesn't support). Added a prepare_build_recipe() function that knows all phases in Lucene builds can be run in parallel, with build_indexes being separable by level.
* perllib/mgbuilder.pm: the most 'complex' recipe generator: MG creates the compress_text item, then adds all the build_index items as dependents. InfoDB can happen in parallel with compress_text, however (see the sketch after this list).
* perllib/mgppbuilder.pm: the simplest recipe: MGPP can run all three of its phases in parallel.
* perllib/parse2.pm: see IncrementalBuildTools
* perllib/parse3.pm: see IncrementalBuildTools
* perllib/plugin.pm: see IncrementalBuildTools
* perllib/util.pm: made it only complain about periods (.) in the Identifier once - rather than once per document (which is a PITA when building one million documents).
* perllib/dbutil/gdbm.pm: changed to call the lock-enabled versions of txt2db and db2txt.
* perllib/dbutil/sqlite.pm: added the WAL pragma (for all the good it did). Also needed to redirect output (as for db_fast), since WAL reports each type of action ("add", "update", and "delete") that it has queued - very quickly becoming annoying.
* perllib/plugins/DirectoryPlugin.pm: made the "Global file scan..." comment obey verbosity.
* perllib/plugins/MARCPlugin.pm: see IncrementalBuildTools (in this case the path to cpan)
* perllib/plugins/MetadataXMLPlugin.pm: see IncrementalBuildTools (in this case the path to cpan)
* perllib/plugins/OAIMetadataXMLPlugin.pm: see IncrementalBuildTools (in this case the path to cpan)
* perllib/plugins/ReadXMLPlugin.pm: see IncrementalBuildTools (in this case the path to cpan)
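To make the recipe generation concrete, here is a rough sketch of what an mgbuilder-style prepare_build_recipe() override might look like. The task/dependency data structure shown is invented purely for illustration - it is not the extension's actual internal format:

<code perl>
# Illustrative only: models the MG case described above, where infodb may
# run alongside compress_text, but every build_index task must wait for
# compress_text to finish before it can start.
sub prepare_build_recipe {
    my ($self) = @_;
    my $recipe = { tasks => [] };

    # compress_text and infodb have no prerequisites, so the scheduler
    # is free to run them in parallel with each other
    push @{ $recipe->{tasks} },
        { id => 'compress_text', depends_on => [] },
        { id => 'infodb',        depends_on => [] };

    # one independently schedulable task per index, each gated on
    # compress_text having completed
    foreach my $index (@{ $self->{'collect_cfg'}->{'indexes'} }) {
        push @{ $recipe->{tasks} },
            { id => "build_index:$index", depends_on => ['compress_text'] };
    }
    return $recipe;
}
</code>

The mgppbuilder variant would be the degenerate case of this: every task with an empty depends_on list.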
===== Packages =====

==== Bit-Vector-7.2 ====

Required by Thrift's Perl API.

==== Hadoop-1.1.0 ====

Provides Hadoop capabilities to the extension - you can then either run Greenstone in parallel (using OpenMPI as the parallel framework) pulling the files out of HDFS, or you can run the alternative Hadoop framework import (and maybe build, if I can be bothered) and make even better use of HDFS.

==== IPC-Run-0.90 ====

Used in the server daemons (GDBM and TDB) to provide a handle to running applications that allows bi-directional piping and better process control (getting child PIDs, etc.).

==== OpenMPI-1.4.3 ====

Provides a framework within which to run Greenstone in parallel.

==== Proc-Daemon-0.14 ====

Perl module to allow proper daemonization of Perl processes.

==== Sort-Key-1.32 ====

Perl module providing better sorting algorithms, including natural sorting of keys.

==== ThriftFS-0.9.0 ====

A custom collection of files extracted from a source install of Hadoop and Thrift, providing a persistent Hadoop-Thrift server (in Java) and an API for communicating with the server from Perl. Includes a Java file providing slightly more efficient Base91 encoding/decoding (as compared to Base64). Required by tweaks to Thrift to allow binary data to be passed around as Java Strings without UTF-8 encoding accidentally mangling things (if only they'd used Java byte[]s instead).

==== Tinyxml-gs-2.6.2 ====

Used to parse the XML 'build recipes' in the parallel version of buildcol.
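For a feel of what such a recipe expresses, here is a purely hypothetical example for an MG build. The element and attribute names are invented - the real schema is whatever the builder code emits and Tinyxml-gs parses:

<code xml>
<!-- Hypothetical sketch only: not the extension's actual recipe schema -->
<recipe>
  <task id="compress_text"/>
  <task id="infodb"/>                          <!-- may run alongside compress_text -->
  <task id="build_index" index="section:text"> <!-- one task per index -->
    <depends-on id="compress_text"/>           <!-- MG indexes need the compressed text first -->
  </task>
</recipe>
</code>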