source: tags/greenstone-3_01-distribution/indexers/mgpp/text/mgpp_passes.1@ 10896

.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.\"------------------------------------------------------------
.TH mgpp_passes 1 \*(Dt CITRI
.SH NAME
mgpp_passes \- builds mgpp databases
.SH SYNOPSIS
.B mgpp_passes
[
.BI \-J " doc-tag"
]
[
.BI \-K " level-tag"
]
.if n .ti +10n
[
.BI \-L " index-level"
]
[
.BI \-m " invf-mem-buffer"
]
.if n .ti +10n
[
.B \-T1
]
[
.B \-T2
]
[
.B \-I1
]
[
.B \-I2
]
[
.B \-S
]
[
.B \-C
]
.if n .ti +10n
[
.B \-h
]
[
.BI \-d " directory"
]
.BI \-f " name"
[
.I filename(s)
]
.SH DESCRIPTION
.B mgpp_passes
is the program that does most of the work when building mgpp
database systems. The input documents can come either from
.I stdin
or from a list of files on the command line. In general,
.B mgpp_passes
must be run twice to build a database, first with the
.B \-T1
and
.B \-I1
options, and then with the
.B \-T2
and
.B \-I2
options. Several other programs must also be run to produce a complete
mgpp database. The
.SB EXAMPLE
section below shows how to build a complete
mgpp database.
.SH OPTIONS
Options may appear in any order, but the
.IR filename(s) ,
if specified, must be last.
.TP "\w'\fB\-C\fP \fIcompstatpointt\fP'u+2n"
.BI \-J " doc-tag"
Specifies the SGML tag that encloses each document. Text appearing
outside this tag is ignored. The document tag defines the highest
level document that can be queried and printed. The default document
tag is 'Document'.
.TP
.BI \-K " level-tag"
Specifies the SGML tag of a sub-document level. A level tag must
enclose all text enclosed by the document tag. Levels can be
queried and printed as if they were separate documents. Multiple
document levels can be specified (the document tag is always
added as a document level).
.TP
.BI \-L " index-level"
Specifies the SGML tag enclosing the smallest indexed element. The
index level should be no larger than the smallest document
level. An empty string can be used to specify a word-level index
(which is the default).
.TP
.BI \-m " invf-mem-buffer"
Maximum amount of memory, in megabytes, to use for the pass-2 file
inversion. This option is only useful when used in conjunction with
the option
.BR \-I1 .
The larger this value, the faster the pass-2 inversion will proceed.
The default value is 5 MB.
.TP
.B \-T1
Generate the
.I *.text.stats
file.
.TP
.B \-T2
Generate the
.IR *.text ,
.IR *.text.idx ,
.IR *.text.level ,
and possibly the
.I *.text.dict.aux
files. Using this option requires that the
.I *.text.dict
file be present.
.TP
.B \-I1
Generate the
.IR *.invf.dict ,
.IR *.invf.level ,
.IR *.invf.chunk ,
and
.I *.invf.chunk.trans
files.
.TP
.B \-I2
Generate the
.I *.invf
and
.I *.invf.idx
files. Using this option requires
that the
.IR *.invf.dict.hash ,
.IR *.invf.level ,
.IR *.invf.chunk ,
and
.I *.invf.chunk.trans
files be present. The
.I *.invf.dict.hash
file is generated by
.BR mgpp_perf_hash_build (1)
from the
.I *.invf.dict
file.
.TP
.B \-S
This option causes a special pass to be executed. It is up to the user
to modify
.I mg.special.c
in the source code to do something with the documents it is given.
.TP
.B \-C
This activates the compatibility parsing mode. In this mode,
documents are separated by control-B and paragraphs are separated
by control-C. Internally these are converted to documents surrounded
by 'Document' tags and paragraphs surrounded by 'Paragraph' tags.
.TP
.B \-h
This displays a usage line on
.IR stderr .
.TP
.BI \-d " directory"
This specifies the directory where the document collection is to be
written.
.TP
.BI \-f " name"
This specifies the base name of the document collection that will be
created.
.TP
.I filename(s)
This specifies the source text. If this is not specified, then the
program expects the source text from
.IR stdin .
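.PP
The conversion described for
.B \-C
can be pictured with the following POSIX shell sketch. It is an
illustration only and is not taken from the mgpp source: the function
name compat_to_tags is hypothetical, the angle-bracket tag syntax is
an assumption based on the SGML tags mentioned above, and skipping
empty pieces is a choice made for this sketch.
.nf
```shell
# Hypothetical helper, not part of mgpp: mimic the -C conversion by
# turning control-B separated documents and control-C separated
# paragraphs into Document and Paragraph tagged text.
compat_to_tags() {
    awk 'BEGIN { RS = "\002" }    # control-B ends each record (document)
    {
        n = split($0, para, "\003")    # control-C separates paragraphs
        out = ""
        for (i = 1; i <= n; i++) {
            # Skip pieces that contain only whitespace
            if (para[i] ~ /[^ \t\n]/)
                out = out "<Paragraph>" para[i] "</Paragraph>"
        }
        if (out != "")
            print "<Document>" out "</Document>"
    }'
}
```
.fi
For example, piping the output of printf 'a\e003b\e002c\e002' through
compat_to_tags prints one tagged line per document.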
.SH EXAMPLE
The following
.BR csh (1)
script shows how to build an mgpp document collection.
.LP
.nf
.DT
.ft B
.I #! /bin/csh
.I
# The first argument on the command line specifies the
.I
# source of the text
set source = ($1)
.PP
.I
# The second argument is the name of the collection
set text = ($2)
.PP
.I
# Create *.text.stats, *.invf.dict, *.invf.level
.I
# *.invf.chunk and *.invf.chunk.trans
${source} | mgpp_passes -T1 -I1 -f ${text}
.PP
.I
# Create *.text.dict
mgpp_compression_dict -f ${text}
.PP
.I
# Create *.invf.dict.hash
mgpp_perf_hash_build -f ${text}
.PP
.I
# Create *.text, *.text.idx, *.text.level
.I
# *.invf and *.invf.idx
${source} | mgpp_passes -T2 -I2 -f ${text}
.PP
.I
# Create *.text.weight and *.weight.approx
mgpp_weights_build -f ${text}
.PP
.I
# Create *.invf.dict.blocked
mgpp_invf_dict -f ${text}
.PP
.I
# Create *.invf.dict.blocked.1
mgpp_stem_idx -s 1 -f ${text}
.PP
.I
# Create *.invf.dict.blocked.2
mgpp_stem_idx -s 2 -f ${text}
.PP
.I
# Create *.invf.dict.blocked.3
mgpp_stem_idx -s 3 -f ${text}
.PP
.I
# Create *.text.dict.fast
mgpp_fast_comp_dict -f ${text}
.ft R
.fi
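.PP
Before the second
.B mgpp_passes
run in the script above, the files that
.B \-T2
and
.B \-I2
require must already exist. The following POSIX shell sketch lists
any that are missing. The function name missing_pass2_inputs is
hypothetical, and the layout directory/name.suffix is an assumption
about how the collection files are named on disk.
.nf
```shell
# Hypothetical helper, not part of mgpp: print the pass-2 input files
# (as listed under -T2 and -I2) that are missing from a collection.
# Assumes files are stored as <directory>/<name>.<suffix>.
missing_pass2_inputs() {
    dir=$1
    name=$2
    for suffix in text.dict invf.dict.hash invf.level \
                  invf.chunk invf.chunk.trans
    do
        [ -f "$dir/$name.$suffix" ] || printf '%s\n' "$name.$suffix"
    done
}
```
.fi
An empty result means every listed prerequisite was found. It does
not check that the files are valid.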
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable exists, then its value is used as the
default directory where the mgpp
collection files are located. If this variable does not exist, then the
directory \*(lq\fB.\fP\*(rq is used by default. The command line
option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
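.PP
The precedence described above can be summarised by the following
POSIX shell sketch. The function name mgpp_collection_dir is
hypothetical and the code is not taken from mgpp. It only restates
the rule: a
.B \-d
value wins, then
.BR MGDATA ,
then the current directory.
.nf
```shell
# Hypothetical helper, not part of mgpp: pick the collection directory.
# $1 is the value given to -d, or empty/absent if -d was not used.
mgpp_collection_dir() {
    if [ -n "${1:-}" ]; then
        printf '%s\n' "$1"          # -d overrides MGDATA
    elif [ -n "${MGDATA+set}" ]; then
        printf '%s\n' "$MGDATA"     # MGDATA exists in the environment
    else
        printf '%s\n' "."           # fall back to the current directory
    fi
}
```
.fi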
.SH FILES
.TP 22
.B *.invf
Inverted file.
.TP
.B *.invf.chunk
Inverted file chunk descriptor file. When the inverted file is
created it is created in chunks that use no more than a set amount of
memory. This file describes those chunks.
.TP
.B *.invf.chunk.trans
Word-occurrence-order to lexical-order translation file. The
.B *.invf.chunk
file is written in word-occurrence order but is required by
.B \-I2
to be in lexical order.
.TP
.B *.invf.dict
Compressed stemmed dictionary.
.TP
.B *.invf.dict.blocked
Compressed stemmed dictionary with index into the dictionary.
.TP
.B *.invf.dict.blocked.n
Transformation dictionary from words stemmed with method
.B n
to unstemmed words.
.TP
.B *.invf.dict.hash
Data for an order-preserving perfect hash function.
.TP
.B *.invf.idx
The index into the inverted file.
.TP
.B *.invf.level
Information about the document levels needed for querying.
.TP
.B *.text
Compressed text.
.TP
.B *.text.dict
Compressed compression dictionary.
.TP
.B *.text.dict.fast
A fast loading version of the compressed compression dictionary.
.TP
.B *.text.idx
Index into the compressed documents.
.TP
.B *.text.level
Information about the document levels needed for text decompression.
.TP
.B *.text.stats
Statistics about the text.
.TP
.B *.weight
The exact weights file.
.TP
.B *.weight.approx
The approximate weights file.
.SH "SEE ALSO"
.na
.BR mgpp_compression_dict (1),
.BR mgpp_fast_comp_dict (1),
.BR mgpp_invf_dict (1),
.BR mgpp_perf_hash_build (1),
.BR mgpp_stem_idx (1),
.BR mgpp_weights_build (1)