.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.Id $Id: mg_passes.1 16583 2008-07-29 10:20:36Z davidb $
.\"------------------------------------------------------------
.TH mg_passes 1 \*(Dt CITRI
.SH NAME
mg_passes \- builds mg databases
.SH SYNOPSIS
.B mg_passes
[
.B \-h
]
[
.B \-G
]
[
.B \-S
]
[
.B \-D
]
[
.B \-W
]
.if n .ti +9n
[
.BR \-1 " |"
.BR \-2 " |"
.B \-3
]
[
.BI \-C " compstatpoint"
]
.if n .ti +9n
[
.BI \-n " tracename"
]
.if t .ti +.5i
[
.BI \-b " bufsize"
]
[
.BI \-m " memlimit"
]
.if n .ti +9n
[
.BI \-c " numchunks"
]
[
.BI \-a " stemmer"
]
[
.BI \-s " stemmethod"
]
[
.BI \-t " tracepos"
]
.if n .ti +9n
[
.B \-T1
]
[
.B \-T2
]
.if t .ti +.5i
[
.B \-I1
]
[
.B \-I2
]
.if n .ti +9n
[
.BI \-d " directory"
]
.BI \-f " name"
[
.I filename(s)
]
.SH DESCRIPTION
.B mg_passes
is the program that does most of the work when building
.BR mg (1)
database systems. The input documents can come from either
.I stdin
or from a list of files on the command line. Individual documents
must be separated with control-B characters. In general,
.B mg_passes
must be run twice to build a database, first with the
.B \-T1
and
.B \-I1
options, and second with the
.B \-T2
and
.B \-I2
options. Several other programs must be run in order to get an
.BR mg (1)
database that is ready for the
.BR mgquery (1)
program. The
.SB EXAMPLE
section below gives an example of how to build a complete
.BR mg (1)
database.
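.PP
As an illustration of the control-B convention described above, the
following Python sketch joins several documents into one input stream.
The function name and the check for embedded separators are assumptions
made for this example and are not part of
.BR mg (1);
whether a trailing control-B is also expected after the last document
is not stated here, so none is written.

```python
DOC_SEPARATOR = b"\x02"  # control-B, the separator named in the description


def join_documents(documents):
    """Concatenate documents, placing a control-B byte between each pair.

    Hypothetical helper: rejects documents that already contain the
    separator, since they would otherwise be split in two.
    """
    for doc in documents:
        if DOC_SEPARATOR in doc:
            raise ValueError("document already contains a control-B byte")
    return DOC_SEPARATOR.join(documents)
```

The resulting bytes could then be written to the standard input of
.BR mg_passes .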
.SH OPTIONS
Options may appear in any order, but the
.IR filename(s) ,
if specified, must be last.
.TP "\w'\fB\-C\fP \fIcompstatpoint\fP'u+2n"
.B \-h
This displays a usage line on
.IR stderr .
.TP
.B \-G
Treat SGML tags as non-words when building the inverted file. An SGML
tag is anything between angle brackets, i.e., `<' and `>'.
.TP
.B \-S
This option causes a special pass to be executed. It is up to a user
to modify
.I mg.special.c
in the source code to do something with the documents it is given.
.TP
.B \-D
If
.B mg_passes
fails, then print the document that caused the failure to the trace
file if tracing is active, or to
.I stderr
if it is not.
.TP
.B \-W
This option enables the generation of the weights file when
.B \-I2
is specified. It causes
.B \-I2
to use a little more memory and CPU.
.TP
.B \-1
Produce a level-1 inverted file. This option is only useful when
specified with
.BR \-I1 .
A level-1 inverted file makes it possible for
.BR mgquery (1)
to do Boolean queries. Ranked queries can still be done,
although the quality of the ranking is abysmal.
.TP
.B \-2
Produce a level-2 inverted file. This option is only useful when
specified with
.BR \-I1 .
This is the default when neither
.BR \-1 ", " "\-2 " "or " \-3
is specified.
A level-2 inverted file makes it possible for
.BR mgquery (1)
to do Boolean queries and cosine-ranked queries.
.TP
.B \-3
Produce a level-3 inverted file. This option is only useful when
specified with
.BR \-I1 .
This has been implemented to enable paragraph-level inversion.
Paragraphs are delimited by control-C characters in the source text.
.TP
.BI \-C " compstatpoint"
This option causes statistics on the compression performance to be
output to a file called
.IR *.compression.stats .
.I compstatpoint
specifies the interval between outputting each line of statistics. The
units of
.I compstatpoint
are kilobytes of source text. E.g., if
.I compstatpoint
is 10, then a line is output to the file every 10 KB of input
source. Each line of the file consists of 4 numbers. The first number
is the amount of input text, in bytes, processed so far. The second
number is the amount of input text, in bytes, processed since the
last line was output to the file. The third number is the number of
output bytes generated since the last line was output to the file, and
the fourth number is the compression achieved since the last line was
output, i.e., the third number divided by the second number.
.TP
.BI \-n " tracename"
This specifies the filename to use for the trace log, if tracing is
enabled using the
.B \-t
option. If
.BI \-n " tracename"
is not given and tracing is enabled, a default trace filename will be
used.
.TP
.BI \-s " stemmethod"
This specifies the method to use to \*(lqstem\*(rq the words in the
inverted file dictionary. This is a bit mask specifying the
operations to do on words as they are parsed out of the text, where
bit number 0 is the low-order (rightmost) bit. Bit 0 does case
folding, and bit 1 does simple stemming, so the value 3 for
.I stemmethod
does both case folding and stemming.
.TP
.BI \-a " stemmer"
This specifies the stemmer to use when stemming words. This
is a description of the language the stemmer is intended for
or a description of the stemmer. Valid options include:
english, lovin, french, and simplefrench.
.TP
.BI \-b " bufsize"
Specify the size of the document buffer in kilobytes. If any document
is larger than
.IR bufsize ,
the program will abort with an error message. This should probably be
replaced with some system which automatically increases the buffer
size as required. The default size is 3072 KB (3 MB).
.TP
.BI \-m " memlimit"
Maximum amount of memory to use for the pass-2 file inversion in
megabytes. This option is only useful when used in conjunction with
the option
.BR \-I1 .
The larger this value, the faster the pass-2 inversion will proceed.
The default value is 5 MB.
.TP
.BI \-c " numchunks"
The maximum number of inversion chunks to write to disk. Each chunk
will be approximately as large as
.IR memlimit .
This option is only useful when used in conjunction with the option
.BR \-I2 .
The larger this value, the faster the pass-2 inversion will proceed.
The default value is 5 MB.
.TP
.BI \-t " tracepos"
This option activates tracing. A line will be generated in the
trace file for every
.I tracepos
input bytes processed. The default name for the trace file can be
overridden using the
.BI \-n " tracename"
option.
.TP
.B \-T1
Generate the
.I *.text.stats
file.
.TP
.B \-T2
Generate the
.IR *.text ,
.IR *.text.idx ,
and possibly the
.I *.text.dict.aux
files. Using this option requires that the
.I *.text.dict
file be present.
.TP
.B \-I1
Generate the
.IR *.invf.dict ,
.IR *.invf.chunk ,
and
.I *.invf.chunk.trans
files.
.TP
.B \-I2
Generate the
.I *.invf
and
.I *.invf.idx
files. Using this option requires
that the
.IR *.invf.dict.hash ,
.IR *.invf.chunk ,
and
.I *.invf.chunk.trans
files
be present. The
.I *.invf.dict.hash
file is generated by
.BR mg_perf_hash_build (1)
from the
.I *.invf.dict.build
file. If the
.B \-W
option is specified, the
.I *.weight
file will also be generated.
.TP
.BI \-d " directory"
This specifies the directory where the document collection is to be
written.
.TP
.BI \-f " name"
This specifies the base name of the document collection that will be
created.
.TP
.I filename(s)
This specifies the source text. If this is not specified, then the
program expects the source text from
.IR stdin .
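.PP
The bit mask taken by the
\fB\-s\fP \fIstemmethod\fP
option can be decoded as in the following Python sketch. The constant
and function names are invented for this illustration; only the meaning
of bits 0 and 1 comes from the option description above.

```python
CASE_FOLD = 1 << 0  # bit 0: case folding
STEM = 1 << 1       # bit 1: simple stemming


def describe_stemmethod(value):
    """Return the operations selected by a stemmethod bit mask."""
    operations = []
    if value & CASE_FOLD:
        operations.append("case folding")
    if value & STEM:
        operations.append("stemming")
    return operations
```

With this sketch, a value of 3 selects both operations, 1 selects case
folding only, and 0 selects neither.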
.SH EXAMPLE
The following UNIX
.BR csh (1)
script is an example of how to build an
.BR mg (1)
document collection.
.LP
.nf
.DT
.ft B
.I #! /bin/csh
.I
# The first argument on the command line specifies the
.I
# source of the text
set source = ($1)
.PP
.I
# The second argument is the name of the collection
set text = ($2)
.PP
.I
# Create *.text.stats, *.invf.dict.build,
.I
# *.invf.chunk and *.invf.chunk.trans
${source} | mg_passes -T1 -I1 -m 1 -t 1 -f ${text}
.PP
.I
# Create *.text.dict
mg_compression_dict -f ${text}
.PP
.I
# Create *.invf.dict.hash
mg_perf_hash_build -f ${text}
.PP
.I
# Create *.text, *.text.idx,
.I
# *.invf and *.invf.idx
${source} | mg_passes -T2 -I2 -c 2 -t 1 -f ${text}
.PP
.I
# Create *.text.idx.wgt and *.weight.approx
mg_weights_build -f ${text} -b 8
.PP
.I
# Create *.invf.dict
mg_invf_dict -f ${text} -b 4096
.PP
.I
# Create *.text.dict
mg_fast_comp_dict -f ${text}
.ft R
.fi
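.PP
For callers that drive the build from a program rather than from
.BR csh (1),
the same sequence can be expressed as data. The Python sketch below
only assembles the argument lists used in the script above and does not
run anything; the function name and the tuple layout are assumptions
made for this illustration.

```python
def build_steps(name):
    """Return (reads_source_on_stdin, argv) pairs in the script's order.

    The commands and option values are copied from the csh example;
    only the collection name is a parameter.
    """
    return [
        (True, ["mg_passes", "-T1", "-I1", "-m", "1", "-t", "1", "-f", name]),
        (False, ["mg_compression_dict", "-f", name]),
        (False, ["mg_perf_hash_build", "-f", name]),
        (True, ["mg_passes", "-T2", "-I2", "-c", "2", "-t", "1", "-f", name]),
        (False, ["mg_weights_build", "-f", name, "-b", "8"]),
        (False, ["mg_invf_dict", "-f", name, "-b", "4096"]),
        (False, ["mg_fast_comp_dict", "-f", name]),
    ]
```

The two steps flagged as reading the source are the two
.B mg_passes
runs, each of which needs the full source text on its standard input.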
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable exists, then its value is used as the
default directory where the
.BR mg (1)
collection files are. If this variable does not exist, then the
directory \*(lq\fB.\fP\*(rq is used by default. The command line
option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
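.PP
The precedence described above can be summarised by the following
Python sketch. The function and parameter names are hypothetical; the
order of precedence (the
\fB\-d\fP
option, then
.SB MGDATA\c
, then the current directory) is taken from the text.

```python
import os


def resolve_collection_dir(d_option=None, environ=None):
    """Pick the collection directory: -d, then MGDATA, then "."."""
    env = os.environ if environ is None else environ
    if d_option is not None:
        return d_option
    return env.get("MGDATA", ".")
```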
.SH FILES
.TP 20
.B *.invf
Inverted file.
.TP
.B *.invf.chunk
Inverted file chunk descriptor file. When the inverted file is
created it is created in chunks that use no more than a set amount of
memory. This file describes those chunks.
.TP
.B *.invf.chunk.trans
Word-occurrence-order to lexical-order translation file. The
.B *.invf.chunk
file is written in word-occurrence order but is required by
.B \-I2
to be in lexical order.
.TP
.B *.invf.dict.build
Compressed stemmed dictionary.
.TP
.B *.invf.dict.hash
Data for an order-preserving perfect hash function.
.TP
.B *.invf.idx
The index into the inverted file.
.TP
.B *.weight
The exact weights file.
.TP
.B *.text
Compressed documents.
.TP
.B *.text.stats
Statistics about the text.
.TP
.B *.text.dict
Compressed compression dictionary.
.TP
.B *.text.idx
Index into the compressed documents.
.TP
.B *.trace
The default trace file.
.TP
.B *.compression.stats
Statistics about the compression of the text.
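.PP
A line of the
.I *.compression.stats
file, as described under the
\fB\-C\fP
option, could be read with a Python sketch like the one below. It
assumes that the four numbers are separated by white space, which this
page does not state; the function and key names are likewise invented
for the illustration.

```python
def parse_stats_line(line):
    """Split one statistics line into its four documented fields.

    Assumption: fields are white-space separated decimal numbers.
    """
    total_in, delta_in, delta_out, ratio = (float(x) for x in line.split())
    return {
        "total_input_bytes": total_in,   # input processed so far
        "input_bytes": delta_in,         # input since the previous line
        "output_bytes": delta_out,       # output since the previous line
        "compression": ratio,            # documented as output / input
    }


def recomputed_compression(record):
    """Recompute the fourth field from the second and third."""
    return record["output_bytes"] / record["input_bytes"]
```

Comparing the stored and recomputed ratios is a simple consistency
check on a parsed line.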
.SH "SEE ALSO"
.na
.BR mg (1),
.BR mg_compression_dict (1),
.BR mg_fast_comp_dict (1),
.BR mg_get (1),
.BR mg_invf_dict (1),
.BR mg_invf_dump (1),
.BR mg_invf_rebuild (1),
.BR mg_perf_hash_build (1),
.BR mg_text_estimate (1),
.BR mg_weights_build (1),
.BR mgbilevel (1),
.BR mgbuild (1),
.BR mgdictlist (1),
.BR mgfelics (1),
.BR mgquery (1),
.BR mgstat (1),
.BR mgtic (1),
.BR mgticbuild (1),
.BR mgticdump (1),
.BR mgticprune (1),
.BR mgticstat (1).