BUILDING COLLECTIONS OUTSIDE OF GREENSTONE

The programs used for building are mgpp_passes, mgpp_invf_dict and so on. Each
has a man page (mgpp_passes.1 etc.). mgpp_passes is the main program used for
building, and its man page includes a simple script that gives an example of
how to build a collection.

Here is a simple bash script that can be used to build a collection:

#! /bin/bash

# The arguments on the command line specify the
# source of the text. Keep them as an array so that
# file names containing spaces survive intact.
source=("$@")

# This is the name of the collection
text=demo

echo "${source[@]}"

# Create *.text.stats, *.invf.dict, *.invf.level
# *.invf.chunk and *.invf.chunks.trans
cat "${source[@]}" | mgpp_passes -T1 -I1 -f ${text}

# Create *.text.dict
mgpp_compression_dict -f ${text}

# Create *.invf.dict.hash
mgpp_perf_hash_build -f ${text}

# Create *.text, *.text.idx, *.text.level
# *.invf and *.invf.idx
cat "${source[@]}" | mgpp_passes -T2 -I2 -f ${text}

# Create *.text.weight and *.weight.approx
mgpp_weights_build -f ${text}

# Create *.invf.dict.blocked
mgpp_invf_dict -f ${text}

# Create *.invf.dict.blocked.1
mgpp_stem_idx -s 1 -f ${text}

# Create *.invf.dict.blocked.2
mgpp_stem_idx -s 2 -f ${text}

# Create *.invf.dict.blocked.3
mgpp_stem_idx -s 3 -f ${text}

This builds a basic collection with a word level index, using 'Document' as
the document level tag.

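Since the script calls several separate programs, it can help to check that
they are all on the PATH before starting a build. The following is only a
sketch: the function name missing_tools and the variable MGPP_TOOLS are made
up for this example, and the list of programs is simply the set used in the
script above.

```bash
# Print the name of every listed program that cannot be found on PATH.
# (missing_tools is a hypothetical helper, not part of mgpp.)
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The programs called by the build script above.
MGPP_TOOLS="mgpp_passes mgpp_compression_dict mgpp_perf_hash_build
mgpp_weights_build mgpp_invf_dict mgpp_stem_idx"
```

A build script could then begin with something like
missing=$(missing_tools $MGPP_TOOLS) and stop if $missing is not empty.
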
Format of documents:

There must be a document level tag that starts (and optionally ends) each
document. The default is 'Document'. Text outside these tags is ignored. To
change the document tag, add '-J <tagname>' to the mgpp_passes commands. There
can only be one -J option.

Smaller granularity is added using level tags. These must enclose all the
text enclosed by the document tags. They are specified to mgpp_passes with
'-K <level tag>'. There can be many -K options.

The smallest level of granularity (default is word level) can be changed by
adding '-L <index level>'. This can be no larger than the smallest level.

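The three tagging options can be assembled in one place so that both
mgpp_passes runs receive the same set. This is a sketch only: the helper name
passes_tag_opts is made up, the tag names 'Section' and 'Paragraph' in the
comments are hypothetical, and it assumes tag names contain no spaces. The
value given for -L is passed through unchanged; check the man page for what it
accepts.

```bash
# Build the tagging options for mgpp_passes (hypothetical helper).
#   $1      document tag, e.g. Document
#   $2      index level for -L, or "" to keep the default (word level)
#   $3...   level tags, e.g. Section Paragraph (example names only)
# Must be called with at least two arguments.
passes_tag_opts() {
    local doc_tag="$1"
    local index_level="$2"
    local level opts
    shift 2
    opts="-J $doc_tag"              # only one -J is allowed
    for level in "$@"; do
        opts="$opts -K $level"      # one -K per level tag
    done
    if [ -n "$index_level" ]; then
        opts="$opts -L $index_level"
    fi
    echo "$opts"
}
```

The result is meant to be expanded unquoted on the command line, for example
mgpp_passes $(passes_tag_opts Document "" Section) -T1 -I1 -f demo, and the
same options should be given again for the -T2 -I2 run.
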
Metadata or tagged fields are specified like this:

<Title>the text of the title</Title>

These are all automatically indexed and can be searched as fields.

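To try the build on something small, a test input file can be written by
hand. The sketch below assumes that the document tag is written in angle
brackets in the same way as the Title field shown above. The function name
write_sample_docs and the document text are invented for the example. The
closing document tags are optional according to the description above; they
are included here for readability.

```bash
# Write a small two-document input file to the path given in $1.
# (write_sample_docs is a hypothetical helper; the tag syntax for
# <Document> is assumed by analogy with the <Title> field.)
write_sample_docs() {
    cat > "$1" <<'EOF'
<Document>
<Title>First demo document</Title>
Some body text for the first document.
</Document>
<Document>
<Title>Second demo document</Title>
Some body text for the second document.
</Document>
EOF
}
```

After write_sample_docs demo.txt, the file demo.txt can be given as the
argument to the build script.
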
The output files are placed in the current directory. To change where they go,
use '-d <directory name>' for all the commands.

The man pages describe all the different options to the various programs.

There are four stages to mgpp_passes, specified by -T1 and -T2 (for text
compression) and -I1 and -I2 (for indexing). T1 must be done before T2, and
I1 before I2. As shown above, T1/I1 and T2/I2 can be run together, reducing
the number of passes through the documents to two. In that case, however, all
output files are placed in the same directory, and the same text is passed to
the indexer and the compressor.

In Greenstone, we pass only the text to the compressor, but both metadata and
text to the indexer, and text files and index files are put in separate
directories.

So mgpp_passes is run four times in this case, like this:

cat <text src> | mgpp_passes -d <text_dir> -T1 ...
cat <text src> | mgpp_passes -d <text_dir> -T2 ...
cat <index src> | mgpp_passes -d <index_dir> -I1 ...
cat <index src> | mgpp_passes -d <index_dir> -I2 ...

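The four separate passes can be wrapped in a function along the following
lines. This is a sketch under several assumptions, not the way Greenstone
itself drives the build: the function name build_split is made up, each source
is taken to be a single file, and '-f <name>' is passed as in the simple
script. The calls to mgpp_compression_dict and mgpp_perf_hash_build between
the passes are placed by analogy with the simple script, where they run
between the first and second pass; which directory each one needs is inferred
from the files it is said to create.

```bash
# Build text and index files into separate directories (hypothetical helper).
#   $1 text source file     $2 index source file
#   $3 text directory       $4 index directory
#   $5 collection name
build_split() {
    local text_src="$1"
    local index_src="$2"
    local text_dir="$3"
    local index_dir="$4"
    local name="$5"

    mkdir -p "$text_dir" "$index_dir" || return 1

    # Text compression: T1 must come before T2.
    mgpp_passes -d "$text_dir" -T1 -f "$name" < "$text_src" || return 1
    # Assumed step: creates *.text.dict, as in the simple script.
    mgpp_compression_dict -d "$text_dir" -f "$name" || return 1
    mgpp_passes -d "$text_dir" -T2 -f "$name" < "$text_src" || return 1

    # Indexing: I1 must come before I2.
    mgpp_passes -d "$index_dir" -I1 -f "$name" < "$index_src" || return 1
    # Assumed step: creates *.invf.dict.hash, as in the simple script.
    mgpp_perf_hash_build -d "$index_dir" -f "$name" || return 1
    mgpp_passes -d "$index_dir" -I2 -f "$name" < "$index_src" || return 1
}
```

A call might look like build_split text.src index.src text index demo, where
the file and directory names are placeholders. The remaining programs from the
simple script (mgpp_weights_build, mgpp_invf_dict, mgpp_stem_idx) would still
need to be run afterwards with the appropriate '-d' option.
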