.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.Id $Id: mg_passes.1 856 2000-01-14 02:26:25Z sjboddie $
.\"------------------------------------------------------------
.TH mg_passes 1 \*(Dt CITRI
.SH NAME
mg_passes \- builds mg databases
.SH SYNOPSIS
.B mg_passes
[
.BI \-J " doc-tag"
]
[
.BI \-K " level-tag"
]
.if n .ti +10n
[
.BI \-L " index-level"
]
[
.BI \-m " invf-mem-buffer"
]
.if n .ti +10n
[
.B \-T1
]
[
.B \-T2
]
[
.B \-I1
]
[
.B \-I2
]
[
.B \-S
]
[
.B \-C
]
.if n .ti +10n
[
.B \-h
]
[
.BI \-d " directory"
]
.BI \-f " name"
[
.I filename(s)
]
.SH DESCRIPTION
.B mg_passes
is the program that does most of the work when building mg
database systems. The input documents can come either from
.I stdin
or from a list of files on the command line. In general,
.B mg_passes
must be run twice to build a database, first with the
.B \-T1
and
.B \-I1
options, and then with the
.B \-T2
and
.B \-I2
options. Several other programs must also be run to produce a
complete mg database. The
.SB EXAMPLE
section below shows how to build a complete
mg database.
.SH OPTIONS
Options may appear in any order, but the
.IR filename(s) ,
if specified, must be last.
.TP "\w'\fB\-C\fP \fIcompstatpointt\fP'u+2n"
.BI \-J " doc-tag"
Specifies the SGML tag that encloses each document. Text appearing
outside this tag is ignored. The document tag defines the highest
level document that can be queried and printed. The default document
tag is 'Document'.
.TP
.BI \-K " level-tag"
Specifies the SGML tag of a sub-document level. All text enclosed by
the document tag must also be enclosed by a level tag. Levels can be
queried and printed as if they were separate documents. Multiple
document levels can be specified (the document tag is always
added as a document level).
.TP
.BI \-L " index-level"
Specifies the SGML tag enclosing the smallest indexed element. The
index level should be no larger than the smallest document
level. An empty string can be used to specify a word-level index
(which is the default).
.TP
.BI \-m " invf-mem-buffer"
Maximum amount of memory, in megabytes, to use for the pass-2 file
inversion. This option is only useful when used in conjunction with
the option
.BR \-I1 ,
which writes the
.I *.invf.chunk
file describing the memory-limited chunks used by the second pass.
The larger this value, the faster the pass-2 inversion will proceed.
The default value is 5 MB.
.TP
.B \-T1
Generate the
.I *.text.stats
file.
.TP
.B \-T2
Generate the
.IR *.text ,
.IR *.text.idx ,
.IR *.text.level ,
and possibly the
.I *.text.dict.aux
files. Using this option requires that the
.I *.text.dict
file be present.
.TP
.B \-I1
Generate the
.IR *.invf.dict ,
.IR *.invf.level ,
.IR *.invf.chunk ,
and
.I *.invf.chunk.trans
files.
.TP
.B \-I2
Generate the
.I *.invf
and
.I *.invf.idx
files. Using this option requires
that the
.IR *.invf.dict.hash ,
.IR *.invf.level ,
.IR *.invf.chunk ,
and
.I *.invf.chunk.trans
files be present. The
.I *.invf.dict.hash
file is generated by
.BR mg_perf_hash_build (1)
from the
.I *.invf.dict
file.
.TP
.B \-S
This option causes a special pass to be executed. It is up to the user
to modify
.I mg.special.c
in the source code to do something with the documents it is given.
.TP
.B \-C
This activates the compatibility parsing mode. In this mode,
documents are separated by control-B and paragraphs are separated
by control-C. Internally these are converted to documents surrounded
by 'Document' tags and paragraphs surrounded by 'Paragraph' tags.
.TP
.B \-h
This displays a usage line on
.IR stderr .
.TP
.BI \-d " directory"
This specifies the directory where the document collection is to be
written.
.TP
.BI \-f " name"
This specifies the base name of the document collection that will be
created.
.TP
.I filename(s)
This specifies the source text. If this is not specified, then the
program expects the source text from
.IR stdin .
.SH EXAMPLE
The following UNIX
.BR csh (1)
script shows how to build an mg document collection.
.LP
.nf
.DT
.ft B
.I #! /bin/csh
.I
# The first argument on the command line is a command that
.I
# writes the source text to standard output
set source = ($1)
.PP
.I
# The second argument is the name of the collection
set text = ($2)
.PP
.I
# Create *.text.stats, *.invf.dict, *.invf.level,
.I
# *.invf.chunk and *.invf.chunk.trans
${source} | mg_passes -T1 -I1 -f ${text}
.PP
.I
# Create *.text.dict
mg_compression_dict -f ${text}
.PP
.I
# Create *.invf.dict.hash
mg_perf_hash_build -f ${text}
.PP
.I
# Create *.text, *.text.idx, *.text.level,
.I
# *.invf and *.invf.idx
${source} | mg_passes -T2 -I2 -f ${text}
.PP
.I
# Create *.weight and *.weight.approx
mg_weights_build -f ${text}
.PP
.I
# Create *.invf.dict.blocked
mg_invf_dict -f ${text}
.PP
.I
# Create *.invf.dict.blocked.1
mg_stem_idx -s 1 -f ${text}
.PP
.I
# Create *.invf.dict.blocked.2
mg_stem_idx -s 2 -f ${text}
.PP
.I
# Create *.invf.dict.blocked.3
mg_stem_idx -s 3 -f ${text}
.PP
.I
# Create *.text.dict.fast
mg_fast_comp_dict -f ${text}
.ft R
.fi
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable exists, then its value is used as the
default directory for the mg
collection files. If this variable does not exist, then the
directory \*(lq\fB.\fP\*(rq is used by default. The command line
option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
.SH FILES
.TP 22
.B *.invf
Inverted file.
.TP
.B *.invf.chunk
Inverted file chunk descriptor file. When the inverted file is
created it is created in chunks that use no more than a set amount of
memory. This file describes those chunks.
.TP
.B *.invf.chunk.trans
Word-occurrence-order to lexical-order translation file. The
.B *.invf.chunk
file is written in word-occurrence order but is required by
.B \-I2
to be in lexical order.
.TP
.B *.invf.dict
Compressed stemmed dictionary.
.TP
.B *.invf.dict.blocked
Compressed stemmed dictionary with index into the dictionary.
.TP
.B *.invf.dict.blocked.n
Transformation dictionary from words stemmed with method
.B n
to unstemmed words.
.TP
.B *.invf.dict.hash
Data for an order-preserving perfect hash function.
.TP
.B *.invf.idx
The index into the inverted file.
.TP
.B *.invf.level
Information about the document levels needed for querying.
.TP
.B *.text
Compressed text.
.TP
.B *.text.dict
Compressed compression dictionary.
.TP
.B *.text.dict.fast
A fast loading version of the compressed compression dictionary.
.TP
.B *.text.idx
Index into the compressed documents.
.TP
.B *.text.level
Information about the document levels needed for text decompression.
.TP
.B *.text.stats
Statistics about the text.
.TP
.B *.weight
The exact weights file.
.TP
.B *.weight.approx
The approximate weights file.
.SH "SEE ALSO"
.na
.BR mg_compression_dict (1),
.BR mg_fast_comp_dict (1),
.BR mg_invf_dict (1),
.BR mg_perf_hash_build (1),
.BR mg_stem_idx (1),
.BR mg_weights_build (1)