BUILDING COLLECTIONS OUTSIDE OF GREENSTONE


Collections are built with programs such as mgpp_passes and mgpp_invf_dict.
Each has an accompanying man page (mgpp_passes.1 and so on). mgpp_passes is
the main program used for building, and its man page includes a simple script
that gives an example of how to build a collection.

Here is a simple bash script that can be used to build a collection:

#! /bin/bash


# The arguments on the command line specify the
# source of the text (kept as an array so that
# file names containing spaces survive intact)
source=("$@")

# This is the name of the collection
text=demo

echo "${source[@]}"

# Create *.text.stats, *.invf.dict, *.invf.level
# *.invf.chunk and *.invf.chunks.trans
cat "${source[@]}" | mgpp_passes -T1 -I1 -f "${text}"

# Create *.text.dict
mgpp_compression_dict -f "${text}"

# Create *.invf.dict.hash
mgpp_perf_hash_build -f "${text}"

# Create *.text, *.text.idx, *.text.level
# *.invf and *.invf.idx
cat "${source[@]}" | mgpp_passes -T2 -I2 -f "${text}"

# Create *.text.weight and *.weight.approx
mgpp_weights_build -f "${text}"

# Create *.invf.dict.blocked
mgpp_invf_dict -f "${text}"

# Create *.invf.dict.blocked.1
mgpp_stem_idx -s 1 -f "${text}"

# Create *.invf.dict.blocked.2
mgpp_stem_idx -s 2 -f "${text}"

# Create *.invf.dict.blocked.3
mgpp_stem_idx -s 3 -f "${text}"


This builds a basic collection, using 'Document' as the document level tag,
with a word level index.

Format of documents:

There must be a document level tag that starts (and optionally ends) each
document. The default is 'Document'. Text outside these tags is ignored. To
change the document tag, add '-J <tagname>' to the mgpp_passes commands. There
can only be one -J option.

Finer granularity is added using level tags. These must enclose all the text
that lies within the document tags. They are specified to mgpp_passes with
'-K <level tag>'. There can be many -K options.

The smallest level of granularity (default is word level) can be changed by
adding '-L <index level>'. The index level can be no larger than the smallest
level.

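As a rough illustration of how these options combine, the sketch below
assembles an mgpp_passes command line for a custom document tag and two level
tags. The tag names Doc, Section and Paragraph and the collection name demo
are made up for the example. It only prints the command and does not run it.
Since the tag options are added "to the mgpp_passes commands", the same
options would presumably be given on both passes.

```shell
#!/bin/bash
# Hypothetical tag names, chosen only for illustration.
doc_tag=Doc
level_tags="Section Paragraph"

# Exactly one -J option for the document tag ...
opts="-J $doc_tag"

# ... and one -K option per level tag.
for tag in $level_tags; do
    opts="$opts -K $tag"
done

# Print the resulting first-pass command line (not executed here).
echo "mgpp_passes $opts -T1 -I1 -f demo"
```
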
Metadata or tagged fields are specified like

<Title>the text of the title </Title>

These are all automatically indexed and can be searched as fields.

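To make the format concrete, here is a minimal sketch that writes a two
document source file using only the default 'Document' tag and a Title field
as described above. The file name sample.txt and the document contents are
invented for the example, and no level tags are used.

```shell
#!/bin/bash
# Write a small sample source file (sample.txt is a hypothetical name).
# Each document starts with <Document>; the closing tag is optional.
cat > sample.txt <<'EOF'
<Document>
<Title>First sample document</Title>
Some body text for the first document.
</Document>
<Document>
<Title>Second sample document</Title>
Some body text for the second document.
</Document>
EOF

# Count the documents by counting the opening document tags.
grep -c '^<Document>' sample.txt
```

A file like this could then be given as the argument to the build script
shown earlier.
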
The output files are placed in the current directory. To change where they go,
use '-d <directory name>' for all the commands.

The man pages describe all the different options to the various programs.


There are four stages to mgpp_passes, selected by -T1 and -T2 (for text
compression) and -I1 and -I2 (for indexing). -T1 and -I1 must be done before
-T2 and -I2, respectively. As shown above, -T1/-I1 and -T2/-I2 can be run
together, reducing the number of passes through the documents to two. In that
case, however, all output files are placed in the same directory, and the
same text is passed to both the indexer and the compressor.

In Greenstone, we pass only the text to the compressor, but both metadata and
text to the indexer, and the text files and index files are put in separate
directories.

So mgpp_passes is run four times in this case, like:

cat <text src> | mgpp_passes -d <text_dir> -T1 ...
cat <text src> | mgpp_passes -d <text_dir> -T2 ...
cat <index src> | mgpp_passes -d <index_dir> -I1 ...
cat <index src> | mgpp_passes -d <index_dir> -I2 ...
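
The four runs above can be sketched as a small script that prints each
command with the placeholders filled in. Every file, directory and collection
name below is hypothetical, and the only extra option supplied is
'-f <name>', which is assumed to work as in the build script earlier. The
other elided options are left out. The helper programs from that script
(mgpp_compression_dict and so on) would presumably still need to be run
between the passes, with the matching -d.

```shell
#!/bin/bash
# All file, directory and collection names here are hypothetical.
text_src=text.src
index_src=index.src
text_dir=text_out
index_dir=index_out
name=demo

# Print (rather than run) one mgpp_passes invocation.
# $1 = source file, $2 = output directory, $3 = stage option.
show_pass() {
    echo "cat $1 | mgpp_passes -d $2 $3 -f $name"
}

# Text compression passes: text only, written to the text directory.
show_pass "$text_src" "$text_dir" -T1
show_pass "$text_src" "$text_dir" -T2

# Indexing passes: text plus metadata, written to the index directory.
show_pass "$index_src" "$index_dir" -I1
show_pass "$index_src" "$index_dir" -I2
```
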