BUILDING COLLECTIONS OUTSIDE OF GREENSTONE

The programs used for building are mgpp_passes, mgpp_invf_dict and so on. Each has its own man page (mgpp_passes.1 and so on). mgpp_passes is the main program used for building, and its man page contains a simple script that gives an example of how to build a collection.

Here is a simple bash script that can be used to build a collection:

    #!/bin/bash

    # The arguments on the command line specify the
    # source of the text
    source=$@

    # This is the name of the collection
    text=demo

    echo $source

    # Create *.text.stats, *.invf.dict, *.invf.level
    # *.invf.chunk and *.invf.chunks.trans
    cat ${source} | mgpp_passes -T1 -I1 -f ${text}

    # Create *.text.dict
    mgpp_compression_dict -f ${text}

    # Create *.invf.dict.hash
    mgpp_perf_hash_build -f ${text}

    # Create *.text, *.text.idx, *.text.level
    # *.invf and *.invf.idx
    cat ${source} | mgpp_passes -T2 -I2 -f ${text}

    # Create *.text.weight and *.weight.approx
    mgpp_weights_build -f ${text}

    # Create *.invf.dict.blocked
    mgpp_invf_dict -f ${text}

    # Create *.invf.dict.blocked.1
    mgpp_stem_idx -s 1 -f ${text}

    # Create *.invf.dict.blocked.2
    mgpp_stem_idx -s 2 -f ${text}

    # Create *.invf.dict.blocked.3
    mgpp_stem_idx -s 3 -f ${text}

This builds a basic collection, using 'Document' as the document level tag, with a word level index.

Format of documents:

There must be a document level tag that starts (and optionally ends) each document. The default is 'Document'. Text outside these tags is ignored. To change the document tag, add the '-J' option to the mgpp_passes commands. There can only be one -J option.

Smaller granularity is added using level tags. These must enclose all the text enclosed by the document tags. They are specified to mgpp_passes with the '-K' option. There can be many -K options.

The smallest level of granularity (word level by default) can be changed by adding the '-L' option. This can be no larger than the smallest level.

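The tag layout described above can be sketched with a small sample file. This is only an illustration: the angle-bracket tag syntax, the 'Section' level tag, the file name and the exact form of the -J and -K arguments are assumptions, not taken from the man pages. The build command is printed rather than run, so the sketch works without MGPP installed.

```shell
#!/bin/bash
# Sketch only: the 'Section' level tag, the angle-bracket tag syntax and
# the file name are assumptions made for illustration.
workdir=$(mktemp -d)
sample="$workdir/docs.txt"

# Two documents. Every piece of text inside a document is also inside a
# level tag; the first line lies outside any document and would be ignored.
cat > "$sample" <<'EOF'
this line is outside any document
<Document>
<Section>
First section of the first document.
<Section>
Second section of the first document.
<Document>
<Section>
The only section of the second document.
EOF

# Count the documents and sections that the tags delimit.
doc_count=$(grep -c '^<Document>$' "$sample")
section_count=$(grep -c '^<Section>$' "$sample")
echo "documents: $doc_count, sections: $section_count"

# First-pass command that would name the document tag and one level tag
# (assumed argument form: each option is followed by a tag name).
build_cmd="mgpp_passes -T1 -I1 -J Document -K Section -f demo"
echo "cat $sample | $build_cmd"
```

Counting the tags before building is a cheap way to check that the input is delimited the way you expect.
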
Metadata or tagged fields are specified by enclosing the text in tags, like the text of the title. These are all automatically indexed and can be searched as fields.

The output files are placed in the current directory. To change where they go, use the '-d' option for all the commands.

The man pages describe all the different options to the various programs.

There are four stages to mgpp_passes, specified by T1 and T2 (for text compression) and I1 and I2 (for indexing). T1 must be done before T2, and I1 before I2. As shown above, T1/I1 and T2/I2 can be run together, reducing the number of passes through the documents to two. However, all output files are then placed in the same directory, and the same text is passed to both the indexer and the compressor.

In Greenstone, we pass only the text to the compressor, but both metadata and text to the indexer, and the text files and index files are put in separate directories. So mgpp_passes is run four times in this case, like:

    cat | mgpp_passes -d -T1 ...
    cat | mgpp_passes -d -T2 ...
    cat | mgpp_passes -d -I1 ...
    cat | mgpp_passes -d -I2 ...
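
The four-pass arrangement can be sketched as a small wrapper script. This is only a sketch under stated assumptions: the input file names, the directory names and the collection name are hypothetical, and the helper programs that run between the passes in the single-directory script (mgpp_compression_dict, mgpp_perf_hash_build and so on) are left out to keep it short. By default it only prints the commands.

```shell
#!/bin/bash
# Sketch only: all file, directory and collection names are hypothetical.
text_input=text_only.txt            # text only, for the compressor
index_input=text_and_metadata.txt   # text plus metadata, for the indexer
text_dir=text
index_dir=index
name=demo

# Print each pipeline instead of running it unless DRY_RUN=0 is set,
# so the sketch can be tried without MGPP installed.
DRY_RUN=${DRY_RUN:-1}
planned=""

run_pass() {
    input=$1
    dir=$2
    stage=$3
    # Record the order in which the stages were requested.
    planned="${planned:+$planned }$stage"
    if [ "$DRY_RUN" = 1 ]; then
        echo "cat $input | mgpp_passes -d $dir $stage -f $name"
    else
        cat "$input" | mgpp_passes -d "$dir" "$stage" -f "$name"
    fi
}

# Text compression passes: T1 must come before T2.
run_pass "$text_input" "$text_dir" -T1
run_pass "$text_input" "$text_dir" -T2

# Indexing passes: I1 must come before I2.
run_pass "$index_input" "$index_dir" -I1
run_pass "$index_input" "$index_dir" -I2
```

Keeping the two inputs and the two directories in separate variables makes it hard to send the metadata-bearing text to the compressor by accident.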