mg_passes.1@ 16583

Last change on this file since 16583 was 16583, checked in by davidb, 16 years ago
Undoing change committed in r16582
Property svn:executable set to `*`
Property svn:keywords set to `Author Date Id Revision`
File size: 9.5 KB

.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.Id $Id: mg_passes.1 16583 2008-07-29 10:20:36Z davidb $
.\"------------------------------------------------------------
.TH mg_passes 1 \*(Dt CITRI
.SH NAME
mg_passes \- builds mg databases
.SH SYNOPSIS
.B mg_passes
[
.B \-h
]
[
.B \-G
]
[
.B \-S
]
[
.B \-D
]
[
.B \-W
]
.if n .ti +9n
[
.BR \-1 " |"
.BR \-2 " |"
.B \-3
]
[
.BI \-C " compstatpoint"
]
.if n .ti +9n
[
.BI \-n " tracename"
]
.if t .ti +.5i
[
.BI \-b " bufsize"
]
[
.BI \-m " memlimit"
]
.if n .ti +9n
[
.BI \-c " numchunks"
]
[
.BI \-a " stemmer"
]
[
.BI \-s " stemmethod"
]
[
.BI \-t " tracepos"
]
.if n .ti +9n
[
.B \-T1
]
[
.B \-T2
]
.if t .ti +.5i
[
.B \-I1
]
[
.B \-I2
]
.if n .ti +9n
[
.BI \-d " directory"
]
.BI \-f " name"
[
.I filename(s)
]
.SH DESCRIPTION
.B mg_passes
is the program that does most of the work when building
.BR mg (1)
database systems. The input documents can come either from
.I stdin
or from a list of files on the command line. Individual documents
must be separated with control-B characters. In general,
.B mg_passes
must be run twice to build a database, first with the
.B \-T1
and
.B \-I1
options, and second with the
.B \-T2
and
.B \-I2
options. Several other programs must be run in order to get an
.BR mg (1)
database that is ready for the
.BR mgquery (1)
program. The
.SB EXAMPLE
section below gives an example of how to build a complete
.BR mg (1)
database.
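.PP
As a minimal sketch of the input format described above, the following
Python fragment joins several documents into a single stream with a
control-B character (byte value 2) between them. The helper name
.I join_documents
and its
.I trailing
parameter are assumptions made for this illustration and are not part of
.BR mg (1);
whether a separator is also expected after the last document is not
stated here, so the sketch adds one only on request.
.PP
.nf
```python
# Hypothetical helper, not part of the mg distribution.
SEPARATOR = bytes([2])  # control-B, ASCII code 2


def join_documents(documents, trailing=False):
    """Return one byte stream with control-B between the documents."""
    for doc in documents:
        if SEPARATOR in doc:
            raise ValueError("document already contains control-B")
    stream = SEPARATOR.join(documents)
    if trailing and documents:
        # Assumption: some callers may want a final separator as well.
        stream += SEPARATOR
    return stream


stream = join_documents([b"first document", b"second document"])
print(stream.count(SEPARATOR))  # prints 1
```
.fi
.PP
The resulting bytes could then be piped to
.B mg_passes
on
.I stdin
or written to a file named on the command line.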
.SH OPTIONS
Options may appear in any order, but the
.IR filename(s) ,
if specified, must be last.
.TP "\w'\fB\-C\fP \fIcompstatpoint\fP'u+2n"
.B \-h
This displays a usage line on
.IR stderr .
.TP
.B \-G
Treat SGML tags as non-words when building the inverted file. An SGML
tag is anything between angle brackets, i.e., `<' and `>'.
.TP
.B \-S
This option causes a special pass to be executed. It is up to a user
to modify
.I mg.special.c
in the source code to do something with the documents it is given.
.TP
.B \-D
If
.B mg_passes
fails, then print the document that caused the failure to the trace
file if tracing is active, or to
.I stderr
if it is not.
.TP
.B \-W
This option enables the generation of the weights file when
.B \-I2
is specified. It causes
.B \-I2
to use a little more memory and CPU.
.TP
.B \-1
Produce a level-1 inverted file. This option is only useful when
specified with
.BR \-I1 .
A level-1 inverted file makes it possible for
.BR mgquery (1)
to do Boolean queries. Ranked queries can still be done,
although the quality of the ranking is abysmal.
.TP
.B \-2
Produce a level-2 inverted file. This option is only useful when
specified with
.BR \-I1 .
This is the default when none of
.BR \-1 ", " \-2 ", or " \-3
is specified.
A level-2 inverted file makes it possible for
.BR mgquery (1)
to do Boolean queries and cosine-ranked queries.
.TP
.B \-3
Produce a level-3 inverted file. This option is only useful when
specified with
.BR \-I1 .
This has been implemented to enable paragraph-level inversion.
Paragraphs are delimited by control-C characters in the source text.
.TP
.BI \-C " compstatpoint"
This option causes statistics on the compression performance to be
output to a file called
.IR *.compression.stats .
.I compstatpoint
specifies the interval between successive lines of statistics. The
units of
.I compstatpoint
are kilobytes of source text. E.g., if
.I compstatpoint
is 10, then a line is output to the file every 10 KB of input
source. Each line of the file consists of 4 numbers. The first number
is the amount of input text, in bytes, processed so far. The second
number is the amount of input text, in bytes, processed since the
last line was output to the file. The third number is the number of
output bytes generated since the last line was output to the file, and
the fourth number is the compression achieved since the last line was
output, i.e., the third number divided by the second number.
.TP
.BI \-n " tracename"
This specifies the filename to use for the trace log, if tracing is
enabled using the
.B \-t
option. If
.BI \-n " tracename"
is not given and tracing is enabled, a default trace filename will be
used.
.TP
.BI \-s " stemmethod"
This specifies the method to use to \(lqstem\(rq the words in the
inverted file dictionary. This is a bit mask specifying the
operations to do on words as they are parsed out of the text, where
bit number 0 is the low-order (rightmost) bit. Bit 0 does case
folding, and bit 1 does simple stemming, so the value 3 for
.I stemmethod
does both case folding and stemming.
.TP
.BI \-a " stemmer"
This specifies the stemmer to use when stemming words. This
is a description of the language the stemmer is intended for
or a description of the stemmer. Valid options include:
english, lovin, french, and simplefrench.
.TP
.BI \-b " bufsize"
Specify the size of the document buffer in kilobytes. If any document
is larger than
.IR bufsize ,
the program will abort with an error message. This should probably be
replaced with some system which automatically increases the buffer
size as required. The default size is 3072 KB (3 MB).
.TP
.BI \-m " memlimit"
Maximum amount of memory to use for the pass-2 file inversion in
megabytes. This option is only useful when used in conjunction with
the option
.BR \-I1 .
The larger this value, the faster the pass-2 inversion will proceed.
The default value is 5 MB.
.TP
.BI \-c " numchunks"
The maximum number of inversion chunks to write to disk. Each chunk
will be approximately as large as
.IR memlimit .
This option is only useful when used in conjunction with the option
.BR \-I2 .
The larger this value, the faster the pass-2 inversion will proceed.
The default value is 5 MB.
.TP
.BI \-t " tracepos"
This option activates tracing. A line will be generated in the
trace file for every
.I tracepos
input bytes processed. The default name for the trace file can be
overridden using the
.BI \-n " tracename"
option.
.TP
.B \-T1
Generate the
.I *.text.stats
file.
.TP
.B \-T2
Generate the
.IR *.text ,
.IR *.text.idx ,
and possibly the
.I *.text.dict.aux
files. Using this option requires that the
.I *.text.dict
file be present.
.TP
.B \-I1
Generate the
.IR *.invf.dict.build ,
.IR *.invf.chunk ,
and
.I *.invf.chunk.trans
files.
.TP
.B \-I2
Generate the
.I *.invf
and
.I *.invf.idx
files. Using this option requires
that the
.IR *.invf.dict.hash ,
.IR *.invf.chunk ,
and
.I *.invf.chunk.trans
files
be present. The
.I *.invf.dict.hash
file is generated by
.BR mg_perf_hash_build (1)
from the
.I *.invf.dict.build
file. If the
.B \-W
option is specified, the
.I *.weight
file will also be generated.
.TP
.BI \-d " directory"
This specifies the directory where the document collection is to be
written.
.TP
.BI \-f " name"
This specifies the base name of the document collection that will be
created.
.TP
.I filename(s)
This specifies the source text. If this is not specified, then the
program expects the source text from
.IR stdin .
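.PP
The bit mask taken by the
.B \-s
option can be illustrated with a short Python sketch. The constant and
function names are assumptions made for this example; only the bit
assignments (bit 0 for case folding, bit 1 for simple stemming) come
from the option description.
.PP
.nf
```python
# Names are illustrative only; the bit positions follow the -s description.
CASE_FOLD = 1 << 0  # bit 0: case folding
STEM = 1 << 1       # bit 1: simple stemming


def stem_method(case_fold=False, stem=False):
    """Build the integer value to pass as stemmethod."""
    value = 0
    if case_fold:
        value |= CASE_FOLD
    if stem:
        value |= STEM
    return value


def describe(value):
    """List the operations selected by a stemmethod value."""
    names = []
    if value & CASE_FOLD:
        names.append("case folding")
    if value & STEM:
        names.append("stemming")
    return names


print(stem_method(case_fold=True, stem=True))  # prints 3
print(describe(2))
```
.fi
.PP
Under these assumptions, the value 1 selects case folding alone and the
value 2 selects stemming alone.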
.SH EXAMPLE
What follows is a UNIX
.BR csh (1)
script as an example of how to build an
.BR mg (1)
document collection.
.LP
.nf
.DT
.ft B
.I #! /bin/csh
.I
# The first argument on the command line specifies the
.I
# source of the text
set source = ($1)
.PP
.I
# The second argument is the name of the collection
set text = ($2)
.PP
.I
# Create *.text.stats, *.invf.dict.build,
.I
# *.invf.chunk and *.invf.chunk.trans
${source} | mg_passes -T1 -I1 -m 1 -t 1 -f ${text}
.PP
.I
# Create *.text.dict
mg_compression_dict -f ${text}
.PP
.I
# Create *.invf.dict.hash
mg_perf_hash_build -f ${text}
.PP
.I
# Create *.text, *.text.idx,
.I
# *.invf and *.invf.idx
${source} | mg_passes -T2 -I2 -c 2 -t 1 -f ${text}
.PP
.I
# Create *.text.idx.wgt and *.weight.approx
mg_weights_build -f ${text} -b 8
.PP
.I
# Create *.invf.dict
mg_invf_dict -f ${text} -b 4096
.PP
.I
# Create *.text.dict.fast
mg_fast_comp_dict -f ${text}
.ft R
.fi
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable exists, then its value is used as the
default directory where the
.BR mg (1)
collection files are. If this variable does not exist, then the
directory \(lq\fB.\fP\(rq is used by default. The command line
option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
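.PP
The precedence described above (the
.B \-d
option first, then
.BR MGDATA ,
then the current directory) can be sketched in Python as follows. The
function name and the sample paths are hypothetical; the sketch mirrors
the wording of this section and is not taken from the
.B mg_passes
source.
.PP
.nf
```python
import os


def collection_directory(d_option=None, environ=None):
    """Pick the collection directory using the documented precedence."""
    if environ is None:
        environ = os.environ
    if d_option is not None:
        return d_option              # -d overrides MGDATA
    if "MGDATA" in environ:
        return environ["MGDATA"]     # MGDATA, when it exists
    return "."                       # default: current directory


# Sample paths below are made up for the illustration.
print(collection_directory("/tmp/coll", {"MGDATA": "/data/mg"}))
print(collection_directory(None, {"MGDATA": "/data/mg"}))
print(collection_directory(None, {}))
```
.fi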
.SH FILES
.TP 20
.B *.invf
Inverted file.
.TP
.B *.invf.chunk
Inverted file chunk descriptor file. When the inverted file is
created it is created in chunks that use no more than a set amount of
memory. This file describes those chunks.
.TP
.B *.invf.chunk.trans
Word-occurrence-order to lexical-order translation file. The
.B *.invf.chunk
file is written in word-occurrence order but is required by
.B \-I2
to be in lexical order.
.TP
.B *.invf.dict.build
Compressed stemmed dictionary.
.TP
.B *.invf.dict.hash
Data for an order-preserving perfect hash function.
.TP
.B *.invf.idx
The index into the inverted file.
.TP
.B *.weight
The exact weights file.
.TP
.B *.text
Compressed documents.
.TP
.B *.text.stats
Statistics about the text.
.TP
.B *.text.dict
Compressed compression dictionary.
.TP
.B *.text.idx
Index into the compressed documents.
.TP
.B *.trace
The default trace file.
.TP
.B *.compression.stats
Statistics about the compression of the text.
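.PP
As a rough illustration of how a line of the
.I *.compression.stats
file might be read, the Python sketch below splits a line into the four
numbers described under the
.B \-C
option and recomputes the ratio as the third number divided by the
second. It assumes the fields are separated by white space, which this
page does not state, and the sample line is made up rather than real
.B mg_passes
output. The function name and dictionary keys are likewise hypothetical.
.PP
.nf
```python
def parse_stats_line(line):
    """Split one statistics line into its four numeric fields."""
    fields = [float(x) for x in line.split()]
    if len(fields) != 4:
        raise ValueError("expected 4 numbers, got %d" % len(fields))
    total_in, delta_in, delta_out, reported = fields
    return {
        "total_in": total_in,    # input bytes processed so far
        "delta_in": delta_in,    # input bytes since the previous line
        "delta_out": delta_out,  # output bytes since the previous line
        "reported": reported,    # fourth field, kept as written
        # Recomputed as described: third number / second number.
        "ratio": delta_out / delta_in if delta_in else None,
    }


# Made-up sample line, not real mg_passes output.
sample = "20480 10240 2560 0.25"
print(parse_stats_line(sample)["ratio"])  # prints 0.25
```
.fi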
.SH "SEE ALSO"
.na
.BR mg (1),
.BR mg_compression_dict (1),
.BR mg_fast_comp_dict (1),
.BR mg_get (1),
.BR mg_invf_dict (1),
.BR mg_invf_dump (1),
.BR mg_invf_rebuild (1),
.BR mg_perf_hash_build (1),
.BR mg_text_estimate (1),
.BR mg_weights_build (1),
.BR mgbilevel (1),
.BR mgbuild (1),
.BR mgdictlist (1),
.BR mgfelics (1),
.BR mgquery (1),
.BR mgstat (1),
.BR mgtic (1),
.BR mgticbuild (1),
.BR mgticdump (1),
.BR mgticprune (1),
.BR mgticstat (1).
