.\"------------------------------------------------------------ .\" Id - set Rv,revision, and Dt, Date using rcs-Id tag. .de Id .ds Rv \\$3 .ds Dt \\$4 .. .Id $Id: mg_passes.1 16583 2008-07-29 10:20:36Z davidb $ .\"------------------------------------------------------------ .TH mg_passes 1 \*(Dt CITRI .SH NAME mg_passes \- builds mg databases .SH SYNOPSIS .B mg_passes [ .B \-h ] [ .B \-G ] [ .B \-S ] [ .B \-D ] [ .B \-W ] .if n .ti +9n [ .BR \-1 " |" .BR \-2 " |" .B \-3 ] [ .BI \-C " compstatpoint" ] .if n .ti +9n [ .BI \-n " tracename" ] .if t .ti +.5i [ .BI \-b " bufsize" ] [ .BI \-m " memlimit" ] .if n .ti +9n [ .BI \-c " numchunks" ] [ .BI \-a " stemmer" ] [ .BI \-s " stemmethod" ] [ .BI \-t " tracepos" ] .if n .ti +9n [ .B \-T1 ] [ .B \-T2 ] .if t .ti +.5i [ .B \-I1 ] [ .B \-I2 ] .if n .ti +9n [ .BI \-d " directory" ] .BI \-f " name" [ .I filename(s) ] .SH DESCRIPTION .B mg_passes is the program that does most of the work when building .BR mg (1) database systems. The input documents can come from either .I stdin or from a list of files on the command line. Individual documents must be separated with control-B characters. In general, .B mg_passes must be run twice to build a database, first with the .B \-T1 and .B \-I1 options, and second with the .B \-T2 and .B \-I2 options. Several other programs must be run in order to get an .BR mg (1) database that is ready for the .BR mgquery (1) program. The .SB EXAMPLE section below gives an example of how to build a complete .BR mg (1) database. .SH OPTIONS Options may appear in any order, but the .IR filename(s) , if specified, must be last. .TP "\w'\fB\-C\fP \fIcompstatpoint\fP'u+2n" .B \-h This displays a usage line on .IR stderr . .TP .B \-G Treat SGML tags as non-words when building the inverted file. An SGML tag is anything between angle brackets, i.e., `<' and `>'. .TP .B \-S This option causes a special pass to be executed. It is up to a user to modify .I mg.special.c in the source code to do something with the documents it is given. .TP .B \-D If .B mg_passes fails, then print the document that caused the failure to the trace file if tracing is active, or to .I stderr if it is not. .TP .B \-W This option enables the generation of the weights file when .B \-I2 is specified. It causes .B \-I2 to use a little more memory and CPU. .TP .B \-1 Produce a level-1 inverted file. This option is only useful when specified with .BR "\-I1 ". A level-1 inverted file makes it possible for .BR mgquery (1) to do Boolean queries. Ranked queries can still be done, although the quality of the ranking is abysmal. .TP .B \-2 Produce a level-2 inverted file. This option is only useful when specified with .BR "\-I1 ". This is the default when neither .BR \-1 ", " "\-2 " "or " \-3 is specified. A level-2 inverted file makes it possible for .BR mgquery (1) to do Boolean queries and cosine-ranked queries. .TP .B \-3 Produce a level-3 inverted file. This option is only useful when specified with .BR "\-I1 ". This has been implemented to enable paragraph-level inversion. Paragraphs are delimited by control-C characters in the source text. .TP .BI \-C " compstatpoint" This option causes statistics on the compression performance to be output to a file called .IR *.compression.stats . .I compstatpoint specifies the interval between outputting each line of statistics. The units of .I compstatpoint are kilobytes of source text. E.g., if .I compstatpoint is 10, then a line is output to the file every 10 KB of input source. Each line of the file consists of 4 numbers The first number is the amount of input text, in bytes, processed so far. The second number is the amount of input text, in bytes, processed since the last line was output to the file. The third number is the number of output bytes generated since the last line was output to the file, and the fourth number is the compression achieved since the last line was output, i.e., the third number divided by the second number. .TP .BI \-n " tracename" This specifies the filename to use for the trace log, if tracing is enabled using the .B \-t option. If .BI \-n " tracename" is not given and tracing is enabled, a default trace filename will be used. .TP .BI \-s " stemmethod" This specifies the method to use to \*(lqstem\*(rq the words in the inverted file dictionary. This is a bit mask specifying the operations to do on words as they are parsed out of the text, where bit number 0 is the low-order (rightmost) bit. Bit 0 does case folding, and bit 1 does simple stemming, so the value 3 for .I stemmethod does both case folding and stemming. .TP .BI \-a " stemmer" This specifies the stemmer to use when stemming words. This is a description of the language the stemmer is intended for or a description of the stemmer. Valid options include: english, lovin, french, and simplefrench. .TP .BI \-b " bufsize" Specify the size of the document buffer in kilobytes. If any document is larger than .IR bufsize , the program will abort with an error message. This should probably be replaced with some system which automatically increases the buffer size as required. The default size is 3072 KB (3 MB). .TP .BI \-m " memlimit" Maximum amount of memory to use for the pass-2 file inversion in megabytes. This option is only useful when used in conjunction with the option .BR \-I1 . The larger this value, the faster the pass-2 inversion will proceed. The default value is 5 MB. .TP .BI \-c " numchunks" The maximum number of inversion chunks to write to disk. Each chunk will be approximately as large as .IR memlimit . This option is only useful when used in conjunction with the option .BR \-I2 . The larger this value, the faster the pass-2 inversion will proceed. The default value is 5 MB. .TP .BI \-t " tracepos" This option activates tracing. A line will be generated in the trace file for every .I tracepos input bytes processed. The default name for the trace file can be overridden using the .BI \-n " tracename" option. .TP .B \-T1 Generate the .I *.text.stats file. .TP .B \-T2 Generate the .IR *.text , .IR *.text.idx , and possibly the .I *.text.dict.aux files. Using this option requires that the .I *.text.dict file be present. .TP .B \-I1 Generate the .IR *.invf.dict , .IR *.invf.chunk , and .I *.invf.chunk.trans files. .TP .B \-I2 Generate the .I *.invf and .I *.invf.idx files. Using this option requires that the .IR *.invf.dict.hash , .IR *.invf.chunk , and .I *.invf.chunk.trans files be present. The .I *.invf.dict.hash file is generated by .BR mg_perf_hash_build (1) from the .I *.invf.dict.build file. If the .B \-W option is specified, the .I *.weight file will also be generated. .TP .BI \-d " directory" This specifies the directory where the document collection is to be written. .TP .BI \-f " name" This specifies the base name of the document collection that will be created. .TP .I filename(s) This specifies the source text. If this is not specified, then the program expects the source text from .IR stdin . .SH EXAMPLE What follows is a UNIX .BR csh (1) script as an example of how to build an .BR mg (1) document collection. .LP .nf .DT .ft B .I #! /bin/csh .I # The first argument on the command line specifies the .I # source of the text set source = ($1) .PP .I # The second argument is the name of the collection set text = ($2) .PP .I # Create *.text.stats, *.invf.dict.build, .I # *.invf.chunk and *.invf.chunks.trans ${source} | mg_passes -T1 -I1 -m 1 -t 1 -f ${text} .PP .I # Create *.text.dict mg_compression_dict -f ${text} .PP .I # Create *.invf.dict.hash mg_perf_hash_build -f ${text} .PP .I # Create *.text, *.text.idx, .I # *.invf and *.invf.idx ${source} | mg_passes -T2 -I2 -c 2 -t 1 -f ${text} .PP .I # Create *.text.idx.wgt and *.weight.approx mg_weights_build -f ${text} -b 8 .PP .I # Create *.invf.dict mg_invf_dict -f ${text} -b 4096 .PP .I # Create *.text.dict mg_fast_comp_dict -f ${text} .ft R .fi .SH ENVIRONMENT .TP "\w'\fBMGDATA\fP'u+2n" .SB MGDATA If this environment variable exists, then its value is used as the default directory where the .BR mg (1) collection files are. If this variable does not exist, then the directory \*(lq\fB.\fP\*(rq is used by default. The command line option .BI \-d " directory" overrides the directory in .BR MGDATA . .SH FILES .TP 20 .B *.invf Inverted file. .TP .B *.invf.chunk Inverted file chunk descriptor file. When the inverted file is created it is created in chunks that use no more than a set amount of memory. This file describes those chunks. .TP .B *.invf.chunk.trans Word-occurrence-order to lexical-order translation file. The .B *.invf.chunk file is written in word-occurrence order but is required by .B \-I2 to be in lexical order. .TP .B *.invf.dict.build Compressed stemmed dictionary. .TP .B *.invf.dict.hash Data for an order-preserving perfect hash function. .TP .B *.invf.idx The index into the inverted file. .TP .B *.weight The exact weights file. .TP .B *.text Compressed documents. .TP .B *.text.stats Statistics about the text. .TP .B *.text.dict Compressed compression dictionary. .TP .B *.text.idx Index into the compressed documents. .TP .B *.trace The default trace file. .TP .B *.compression.stats Statistics about the compression of the text. .SH "SEE ALSO" .na .BR mg (1), .BR mg_compression_dict (1), .BR mg_fast_comp_dict (1), .BR mg_get (1), .BR mg_invf_dict (1), .BR mg_invf_dump (1), .BR mg_invf_rebuild (1), .BR mg_perf_hash_build (1), .BR mg_text_estimate (1), .BR mg_weights_build (1), .BR mgbilevel (1), .BR mgbuild (1), .BR mgdictlist (1), .BR mgfelics (1), .BR mgquery (1), .BR mgstat (1), .BR mgtic (1), .BR mgticbuild (1), .BR mgticdump (1), .BR mgticprune (1), .BR mgticstat (1).