.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.\"------------------------------------------------------------
.TH mgpp_passes 1 \*(Dt CITRI
.SH NAME
mgpp_passes \- builds mgpp databases
.SH SYNOPSIS
.B mgpp_passes
[
.BI \-J " doc-tag"
]
[
.BI \-K " level-tag"
]
.if n .ti +10n
[
.BI \-L " index-level"
]
[
.BI \-m " invf-mem-buffer"
]
.if n .ti +10n
[
.B \-T1
]
[
.B \-T2
]
[
.B \-I1
]
[
.B \-I2
]
[
.B \-S
]
[
.B \-C
]
.if n .ti +10n
[
.B \-h
]
[
.BI \-d " directory"
]
.BI \-f " name"
[
.I filename(s)
]
.SH DESCRIPTION
.B mgpp_passes
does most of the work of building an mgpp database.
The input documents can come from either
.I stdin
or from a list of files on the command line.  In general,
.B mgpp_passes
must be run twice to build a database, first with the
.B \-T1
and
.B \-I1
options, and second with the
.B \-T2
and
.B \-I2
options.  Several other programs must also be run to produce a
complete mgpp database.  The
.SB EXAMPLE
section below gives an example of how to build a complete
mgpp database.
.SH OPTIONS
Options may appear in any order, but the
.IR filename(s) ,
if specified, must be last.
.TP "\w'\fB\-C\fP \fIcompstatpointt\fP'u+2n"
.BI \-J " doc-tag"
Specifies the SGML tag that encloses each document.  Text appearing
outside this tag is ignored.  The document tag defines the highest
level document that can be queried and printed.  The default document
tag is 'Document'.
.TP
.BI \-K " level-tag"
Specifies the SGML tag of a sub-document level.  A level tag must
enclose all text enclosed by the document tag.  Levels can be
queried and printed as if they were separate documents.  Multiple
document levels can be specified (the document tag is always
added as a document level).
.TP
.BI \-L " index-level"
Specifies the SGML tag enclosing the smallest indexed element.  The
index level should be no larger than the smallest document
level.  An empty string can be used to specify a word-level index
(which is the default).
.TP
.BI \-m " invf-mem-buffer"
Maximum amount of memory, in megabytes, to use for the pass-2 file
inversion.  This option has an effect only in conjunction with
the option
.BR \-I1 .
The larger this value, the faster the pass-2 inversion will proceed.
The default value is 5 MB.
.TP
.B \-T1
Generate the
.I *.text.stats
file.
.TP
.B \-T2
Generate the
.IR *.text ,
.IR *.text.idx ,
.IR *.text.level ,
and possibly the
.I *.text.dict.aux
files.  Using this option requires that the
.I *.text.dict
file be present.
.TP
.B \-I1
Generate the
.IR *.invf.dict ,
.IR *.invf.level ,
.IR *.invf.chunk ,
and
.I *.invf.chunk.trans
files.
.TP
.B \-I2
Generate the
.I *.invf
and
.I *.invf.idx
files.  Using this option requires
that the
.IR *.invf.dict.hash ,
.IR *.invf.level ,
.IR *.invf.chunk ,
and
.I *.invf.chunk.trans
files be present.  The
.I *.invf.dict.hash
file is generated by
.BR mgpp_perf_hash_build (1)
from the
.I *.invf.dict
file.
.TP
.B \-S
This option causes a special pass to be executed.  It is up to the user
to modify
.I mg.special.c
in the source code to do something with the documents it is given.
.TP
.B \-C
This activates the compatibility parsing mode.  In this mode,
documents are separated by control-B and paragraphs are separated
by control-C.  Internally these are converted to documents surrounded
by 'Document' tags and paragraphs surrounded by 'Paragraph' tags.
.TP
.B \-h
This displays a usage line on
.IR stderr .
.TP
.BI \-d " directory"
This specifies the directory where the document collection is to be
written.
.TP
.BI \-f " name"
This specifies the base name of the document collection that will be
created.
.TP
.I filename(s)
Specifies the files containing the source text.  If none are given,
the source text is read from
.IR stdin .
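.PP
As a hedged sketch of the input expected by
.BR \-C ,
the following command feeds one document of two paragraphs, followed
by a second document, to the first pass.  The collection name
.I demo
and the sample text are assumptions made for illustration only.
.LP
.nf
.ft B
.I
# \e002 is control-B (document separator), \e003 is control-C
printf \(aqFirst paragraph\e003Second paragraph\e002Next document\(aq | \e
    mgpp_passes \-C \-T1 \-I1 \-f demo
.ft R
.fi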
.SH EXAMPLE
The following
.BR csh (1)
script shows how to build an mgpp document collection.
.LP
.nf
.DT
.ft B
.I #! /bin/csh
.I
# The first argument on the command line specifies the
.I
#   source of the text
set source = ($1)
.PP
.I
# The second argument is the name of the collection
set text = ($2)
.PP
.I
# Create *.text.stats, *.invf.dict, *.invf.level,
.I
#   *.invf.chunk and *.invf.chunk.trans
${source} | mgpp_passes -T1 -I1 -f ${text}
.PP
.I
# Create *.text.dict
mgpp_compression_dict -f ${text}
.PP
.I
# Create *.invf.dict.hash
mgpp_perf_hash_build -f ${text}
.PP
.I
# Create *.text, *.text.idx, *.text.level,
.I
#   *.invf and *.invf.idx
${source} | mgpp_passes -T2 -I2 -f ${text}
.PP
.I
# Create *.weight and *.weight.approx
mgpp_weights_build -f ${text}
.PP
.I
# Create *.invf.dict.blocked
mgpp_invf_dict -f ${text}
.PP
.I
# Create *.invf.dict.blocked.1
mgpp_stem_idx -s 1 -f ${text}
.PP
.I
# Create *.invf.dict.blocked.2
mgpp_stem_idx -s 2 -f ${text}
.PP
.I
# Create *.invf.dict.blocked.3
mgpp_stem_idx -s 3 -f ${text}
.PP
.I
# Create *.text.dict.fast
mgpp_fast_comp_dict -f ${text}
.ft R
.fi
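.PP
As a further, hedged illustration of the tag options, a first pass
over SGML files whose documents are enclosed in 'Document' tags and
subdivided by a hypothetical 'Section' tag might look like the
following.  The tag name 'Section', the directory, the collection
name and the file names are assumptions made for illustration only.
.LP
.nf
.ft B
.I
# Index at Section level, with a 20 MB inversion buffer
mgpp_passes \-J Document \-K Section \-L Section \-m 20 \e
    \-T1 \-I1 \-d /tmp/mgpp \-f demo a.sgml b.sgml
.ft R
.fi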
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable is set, its value is used as the
default directory for the mgpp collection files.
If it is not set, the
directory \*(lq\fB.\fP\*(rq is used by default.  The command line
option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
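.PP
A minimal sketch of the two ways of choosing the directory follows.
The directories, the collection name
.I demo
and the file name are assumptions made for illustration only.
.LP
.nf
.ft B
.I
# csh: make /data/mgpp the default collection directory
setenv MGDATA /data/mgpp
mgpp_passes \-T1 \-I1 \-f demo a.sgml
.PP
.I
# \-d overrides MGDATA for this run only
mgpp_passes \-T1 \-I1 \-d /tmp/mgpp \-f demo a.sgml
.ft R
.fi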
.SH FILES
.TP 22
.B *.invf
Inverted file.
.TP
.B *.invf.chunk
Inverted file chunk descriptor file.  The inverted file is
created in chunks, each using no more than a set amount of
memory.  This file describes those chunks.
.TP
.B *.invf.chunk.trans
Word-occurrence-order to lexical-order translation file.  The
.B *.invf.chunk
file is written in word-occurrence order but is required by
.B \-I2
to be in lexical order.
.TP
.B *.invf.dict
Compressed stemmed dictionary.
.TP
.B *.invf.dict.blocked
Compressed stemmed dictionary with index into the dictionary.
.TP
.B *.invf.dict.blocked.n
Transformation dictionary from words stemmed with method
.B n
to unstemmed words.
.TP
.B *.invf.dict.hash
Data for an order-preserving perfect hash function.
.TP
.B *.invf.idx
The index into the inverted file.
.TP
.B *.invf.level
Information about the document levels needed for querying.
.TP
.B *.text
Compressed text.
.TP
.B *.text.dict
Compressed compression dictionary.
.TP
.B *.text.dict.fast
A fast loading version of the compressed compression dictionary.
.TP
.B *.text.idx
Index into the compressed documents.
.TP
.B *.text.level
Information about the document levels needed for text decompression.
.TP
.B *.text.stats
Statistics about the text.
.TP
.B *.weight
The exact weights file.
.TP
.B *.weight.approx
The approximate weights file.
.SH "SEE ALSO"
.na
.BR mgpp_compression_dict (1),
.BR mgpp_fast_comp_dict (1),
.BR mgpp_invf_dict (1),
.BR mgpp_perf_hash_build (1),
.BR mgpp_stem_idx (1),
.BR mgpp_weights_build (1)