source: trunk/indexers/mgpp/docs/standalone_building.txt@ 3365

Last change on this file since 3365 was 3365, checked in by kjdon, 22 years ago

Initial revision

BUILDING COLLECTIONS OUTSIDE OF GREENSTONE


The programs used for building are mgpp_passes, mgpp_invf_dict and so on. Each
has a man page, such as mgpp_passes.1. mgpp_passes is the main program used
for building, and its man page contains a simple script that gives an example
of how to build a collection.

Here is a simple bash script that can be used to build a collection:

#!/bin/bash


# The arguments on the command line specify the
# source of the text. They are kept in an array so
# that file names containing spaces survive intact.
source=("$@")

# This is the name of the collection
text=demo

echo "${source[@]}"

# Create *.text.stats, *.invf.dict, *.invf.level,
# *.invf.chunk and *.invf.chunks.trans
cat "${source[@]}" | mgpp_passes -T1 -I1 -f "${text}"

# Create *.text.dict
mgpp_compression_dict -f "${text}"

# Create *.invf.dict.hash
mgpp_perf_hash_build -f "${text}"

# Create *.text, *.text.idx, *.text.level,
# *.invf and *.invf.idx
cat "${source[@]}" | mgpp_passes -T2 -I2 -f "${text}"

# Create *.text.weight and *.weight.approx
mgpp_weights_build -f "${text}"

# Create *.invf.dict.blocked
mgpp_invf_dict -f "${text}"

# Create *.invf.dict.blocked.1
mgpp_stem_idx -s 1 -f "${text}"

# Create *.invf.dict.blocked.2
mgpp_stem_idx -s 2 -f "${text}"

# Create *.invf.dict.blocked.3
mgpp_stem_idx -s 3 -f "${text}"


This builds a basic collection, using 'Document' as the document level tag,
with a word level index.

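As a quick sanity check after a build, the sketch below lists which of the expected output files are missing. The function name missing_outputs is hypothetical, the suffixes are copied from the comments in the script above, and it assumes the files are named <collection>.<suffix>, as the '*' in those comments suggests.

```shell
#!/bin/bash
# Hypothetical helper: print the name of every expected output file that
# does not exist. $1 = collection name, $2 = directory (default: current).
# The suffix list is taken from the comments in the build script above.
missing_outputs() {
    local name="$1" dir="${2:-.}" suffix
    for suffix in \
        text.stats invf.dict invf.level invf.chunk invf.chunks.trans \
        text.dict invf.dict.hash \
        text text.idx text.level invf invf.idx \
        text.weight weight.approx \
        invf.dict.blocked invf.dict.blocked.1 \
        invf.dict.blocked.2 invf.dict.blocked.3
    do
        [ -e "$dir/$name.$suffix" ] || echo "$name.$suffix"
    done
}
```

Running 'missing_outputs demo .' after the script should print nothing if every step produced its files; any name it prints points at the step that failed.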
Format of documents:

There must be a document level tag that starts (and optionally ends) each
document. The default is 'Document'. Text outside these tags is ignored. To
change the document tag, add '-J <tagname>' to the mgpp_passes commands. There
can only be one -J option.

Smaller granularity is added using level tags. These must enclose all the
text enclosed by the document tags. They can be specified to mgpp_passes with
'-K <level tag>'. There can be many -K options.

The smallest level of granularity (default is word level) can be changed by
adding '-L <index level>'. This can be no larger than the smallest level.

Metadata or tagged fields are specified like
<Title>the text of the title </Title>

These are all automatically indexed and can be searched as fields.

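To make the format concrete, here is a minimal sketch that writes a two-document input file and feeds it to the first pass. 'Document' is the default document tag described above; the level tag name 'Section', the file name sample.txt and the collection name demo are assumptions chosen for illustration. The pass is only run if mgpp_passes is installed.

```shell
#!/bin/bash
# Write a tiny input file. Each document is wrapped in the document tag,
# all of its text sits inside a level tag ('Section' is an assumed name),
# and the title is given as a tagged metadata field.
cat > sample.txt <<'EOF'
<Document>
<Section>
<Title>First title</Title>
Text of the first document.
</Section>
</Document>
<Document>
<Section>
<Title>Second title</Title>
Text of the second document.
</Section>
</Document>
EOF

# First pass only, with the document and level tags named explicitly.
# Skipped when the MGPP tools are not on the PATH.
if command -v mgpp_passes >/dev/null 2>&1; then
    mgpp_passes -T1 -I1 -J Document -K Section -f demo < sample.txt
fi
```

The same -J and -K options would need to be repeated on the second mgpp_passes run, since the text above says they are added to the mgpp_passes commands.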
The output files are placed in the current directory. To change where they go,
use '-d <directory name>' for all the commands.

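One way to keep the -d option consistent is to hold the shared options in a single variable. The sketch below only prints the commands, so it is safe to try without the tools installed; the directory name ./built and the collection name demo are hypothetical, and creating the directory up front is a precaution rather than a documented requirement.

```shell
#!/bin/bash
# Hypothetical output directory, created in advance in case the tools
# do not create it themselves (an assumption, not documented behaviour).
outdir=./built
mkdir -p "$outdir"

# Options shared by every command: -d from the text above, -f as in the
# build script.
common="-d $outdir -f demo"

# Print the command lines rather than running them. In a real build these
# programs are interleaved with the mgpp_passes runs, as in the script above.
for prog in mgpp_compression_dict mgpp_perf_hash_build \
            mgpp_weights_build mgpp_invf_dict
do
    echo "$prog $common"
done
```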
The man pages describe all the different options to the various programs.


There are four stages to mgpp_passes, specified by -T1 and -T2 (for text
compression) and -I1 and -I2 (for indexing). T1 must be done before T2, and
I1 before I2. As shown above, T1/I1 and T2/I2 can be run together, reducing
the number of passes through the documents to two. In that case, however, all
output files are placed in the same directory, and the same text is passed to
the indexer and the compressor.

In Greenstone, we pass only the text to the compressor, but both metadata and
text to the indexer, and text files and index files are put in separate
directories.

So mgpp_passes is run four times in this case, like:

cat <text src> | mgpp_passes -d <text_dir> -T1 ...
cat <text src> | mgpp_passes -d <text_dir> -T2 ...
cat <index src> | mgpp_passes -d <index_dir> -I1 ...
cat <index src> | mgpp_passes -d <index_dir> -I2 ...

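The four runs could be wrapped in a small helper, sketched below. All file and directory names are hypothetical, '-f demo' reuses the collection name from the earlier script, and the options elided as '...' above are simply left out. The helper programs that run between the passes in the earlier script (mgpp_compression_dict and the rest) are omitted here, just as they are in the listing. When mgpp_passes is not installed or an input file is missing, the helper prints the command instead of running it.

```shell
#!/bin/bash
# Hypothetical inputs and output directories.
text_src=text_only.txt        # text without metadata, for the compressor
index_src=text_and_meta.txt   # text plus metadata, for the indexer
text_dir=./text
index_dir=./index
mkdir -p "$text_dir" "$index_dir"

# Run one stage of mgpp_passes.
# $1 = input file, $2 = output directory, $3 = stage flag (-T1, -T2, -I1, -I2)
pass() {
    if command -v mgpp_passes >/dev/null 2>&1 && [ -r "$1" ]; then
        mgpp_passes -d "$2" "$3" -f demo < "$1"
    else
        echo "would run: mgpp_passes -d $2 $3 -f demo < $1"
    fi
}

# T1 before T2, and I1 before I2.
pass "$text_src"  "$text_dir"  -T1
pass "$text_src"  "$text_dir"  -T2
pass "$index_src" "$index_dir" -I1
pass "$index_src" "$index_dir" -I2
```

Redirecting the input file with '<' is equivalent to the 'cat ... |' form used above when there is a single source file.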