.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.\"------------------------------------------------------------
.TH mgpp_passes 1 \*(Dt CITRI
.SH NAME
mgpp_passes \- builds mgpp databases
.SH SYNOPSIS
.B mgpp_passes
[
.BI \-J " doc-tag"
]
[
.BI \-K " level-tag"
]
.if n .ti +10n
[
.BI \-L " index-level"
]
[
.BI \-m " invf-mem-buffer"
]
.if n .ti +10n
[
.B \-T1
]
[
.B \-T2
]
[
.B \-I1
]
[
.B \-I2
]
[
.B \-S
]
[
.B \-C
]
.if n .ti +10n
[
.B \-h
]
[
.BI \-d " directory"
]
.BI \-f " name"
[
.I filename(s)
]
.SH DESCRIPTION
.B mgpp_passes
is the program that does most of the work when building mgpp database
systems. The input documents can come from either
.I stdin
or from a list of files on the command line. In general,
.B mgpp_passes
must be run twice to build a database, first with the
.B \-T1
and
.B \-I1
options, and second with the
.B \-T2
and
.B \-I2
options. Several other programs must be run in order to get an mgpp
database. The
.SB EXAMPLE
section below gives an example of how to build a complete mgpp database.
.SH OPTIONS
Options may appear in any order, but the
.IR filename(s) ,
if specified, must be last.
.TP "\w'\fB\-C\fP \fIcompstatpointt\fP'u+2n"
.BI \-J " doc-tag"
Specifies the SGML tag that encloses each document. Text appearing
outside this tag is ignored. The document tag defines the highest level
document that can be queried and printed. The default document tag is
\&'Document'.
.TP
.BI \-K " level-tag"
Specifies the SGML tag of a sub document level. A level tag must enclose
all text enclosed by the document tag. Levels can be queried and printed
as if they were separate documents. Multiple document levels can be
specified (the document tag is always added as a document level).
.TP
.BI \-L " index-level"
Specifies the SGML tag enclosing the smallest indexed element. The index
level should be no larger than the smallest document level.
An empty string can be used to specify a word-level index (which is the
default).
.TP
.BI \-m " invf-mem-buffer"
Maximum amount of memory, in megabytes, to use for the pass-2 file
inversion. This option is only useful when used in conjunction with the
option
.BR \-I1 .
The larger this value, the faster the pass-2 inversion will proceed. The
default value is 5 MB.
.TP
.B \-T1
Generate the
.I *.text.stats
file.
.TP
.B \-T2
Generate the
.IR *.text ,
.IR *.text.idx ,
.IR *.text.level ,
and possibly the
.I *.text.dict.aux
files. Using this option requires that the
.I *.text.dict
file be present.
.TP
.B \-I1
Generate the
.IR *.invf.dict ,
.IR *.invf.level ,
.IR *.invf.chunk ,
and
.I *.invf.chunk.trans
files.
.TP
.B \-I2
Generate the
.I *.invf
and
.I *.invf.idx
files. Using this option requires that the
.IR *.invf.dict.hash ,
.IR *.invf.level ,
.IR *.invf.chunk ,
and
.I *.invf.chunk.trans
files be present. The
.I *.invf.dict.hash
file is generated by
.BR mgpp_perf_hash_build (1)
from the
.I *.invf.dict
file.
.TP
.B \-S
This option causes a special pass to be executed. It is up to the user
to modify
.I mg.special.c
in the source code to do something with the documents it is given.
.TP
.B \-C
This activates the compatibility parsing mode. When using this mode,
documents are separated by control-B and paragraphs are separated by
control-C. Internally these are converted to documents surrounded by
\&'Document' tags and paragraphs surrounded by 'Paragraph' tags.
.TP
.B \-h
This displays a usage line on
.IR stderr .
.TP
.BI \-d " directory"
This specifies the directory where the document collection is to be
written.
.TP
.BI \-f " name"
This specifies the base name of the document collection that will be
created.
.TP
.I filename(s)
This specifies the source text. If this is not specified, then the
program expects the source text from
.IR stdin .
.SH EXAMPLE
What follows is a UNIX
.BR csh (1)
script as an example of how to build an mgpp document collection.
.LP
.nf
.DT
.ft B
.I #! /bin/csh
.I # The first argument on the command line specifies the
.I # source of the text
set source = ($1)
.PP
.I # The second argument is the name of the collection
set text = ($2)
.PP
.I # Create *.text.stats, *.invf.dict, *.invf.level,
.I # *.invf.chunk and *.invf.chunk.trans
${source} | mgpp_passes -T1 -I1 -f ${text}
.PP
.I # Create *.text.dict
mgpp_compression_dict -f ${text}
.PP
.I # Create *.invf.dict.hash
mgpp_perf_hash_build -f ${text}
.PP
.I # Create *.text, *.text.idx, *.text.level,
.I # *.invf and *.invf.idx
${source} | mgpp_passes -T2 -I2 -f ${text}
.PP
.I # Create *.text.weight and *.weight.approx
mgpp_weights_build -f ${text}
.PP
.I # Create *.invf.dict.blocked
mgpp_invf_dict -f ${text}
.PP
.I # Create *.invf.dict.blocked.1
mgpp_stem_idx -s 1 -f ${text}
.PP
.I # Create *.invf.dict.blocked.2
mgpp_stem_idx -s 2 -f ${text}
.PP
.I # Create *.invf.dict.blocked.3
mgpp_stem_idx -s 3 -f ${text}
.PP
.I # Create *.text.dict.fast
mgpp_fast_comp_dict -f ${text}
.ft R
.fi
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable exists, then its value is used as the
default directory where the mgpp collection files are. If this variable
does not exist, then the directory \*(lq\fB.\fP\*(rq is used by default.
The command line option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
.SH FILES
.TP 22
.B *.invf
Inverted file.
.TP
.B *.invf.chunk
Inverted file chunk descriptor file. When the inverted file is created,
it is built in chunks that use no more than a set amount of memory. This
file describes those chunks.
.TP
.B *.invf.chunk.trans
Word-occurrence-order to lexical-order translation file. The
.B *.invf.chunk
file is written in word-occurrence order but is required by
.B \-I2
to be in lexical order.
.TP
.B *.invf.dict
Compressed stemmed dictionary.
.TP
.B *.invf.dict.blocked
Compressed stemmed dictionary with index into the dictionary.
.TP
.B *.invf.dict.blocked.n
Transformation dictionary from words stemmed with method
.B n
to unstemmed words.
.TP
.B *.invf.dict.hash
Data for an order-preserving perfect hash function.
.TP
.B *.invf.idx
The index into the inverted file.
.TP
.B *.invf.level
Information about the document levels needed for querying.
.TP
.B *.text
Compressed text.
.TP
.B *.text.dict
Compressed compression dictionary.
.TP
.B *.text.dict.fast
A fast loading version of the compressed compression dictionary.
.TP
.B *.text.idx
Index into the compressed documents.
.TP
.B *.text.level
Information about the document levels needed for text decompression.
.TP
.B *.text.stats
Statistics about the text.
.TP
.B *.weight
The exact weights file.
.TP
.B *.weight.approx
The approximate weights file.
.SH "SEE ALSO"
.na
.BR mgpp_compression_dict (1),
.BR mgpp_fast_comp_dict (1),
.BR mgpp_invf_dict (1),
.BR mgpp_perf_hash_build (1),
.BR mgpp_stem_idx (1),
.BR mgpp_weights_build (1)