.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.Id $Id: mg_passes.1 16583 2008-07-29 10:20:36Z davidb $
.\"------------------------------------------------------------
.TH mg_passes 1 \*(Dt CITRI
.SH NAME
mg_passes \- builds mg databases
.SH SYNOPSIS
.B mg_passes
[
.B \-h
]
[
.B \-G
]
[
.B \-S
]
[
.B \-D
]
[
.B \-W
]
.if n .ti +9n
[
.BR \-1 " |"
.BR \-2 " |"
.B  \-3
]
[
.BI \-C " compstatpoint"
]
.if n .ti +9n
[
.BI \-n " tracename"
]
.if t .ti +.5i
[
.BI \-b " bufsize"
]
[
.BI \-m " memlimit"
]
.if n .ti +9n
[
.BI \-c " numchunks"
]
[
.BI \-a " stemmer"
]
[
.BI \-s " stemmethod"
]
[
.BI \-t " tracepos"
]
.if n .ti +9n
[
.B \-T1
]
[
.B \-T2
]
.if t .ti +.5i
[
.B \-I1
]
[
.B \-I2
]
.if n .ti +9n
[
.BI \-d " directory"
]
.BI \-f " name"
[
.I filename(s)
]
.SH DESCRIPTION
.B mg_passes
is the program that does most of the work when building
.BR mg (1)
database systems.  The input documents can come from either
.I stdin
or from a list of files on the command line.  Individual documents
must be separated with control-B characters.  In general,
.B mg_passes
must be run twice to build a database, first with the
.B \-T1
and
.B \-I1
options, and second with the
.B \-T2
and
.B \-I2
options.  Several other programs must be run in order to get an
.BR mg (1)
database that is ready for the
.BR mgquery (1)
program.  The
.SB EXAMPLE
section below gives an example of how to build a complete
.BR mg (1)
database.
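.PP
As a rough illustration of the input format, the following Python
sketch joins several documents with control-B (ASCII 2) characters so
that the result could be piped to
.BR mg_passes .
The function name
.I join_documents
is hypothetical and not part of
.BR mg (1);
whether a trailing control-B is also accepted is not stated here, so
none is written.
.LP
.nf
.DT
import sys

CONTROL_B = bytes([2])  # control-B, the separator described above

def join_documents(documents):
    """Return the documents (bytes) with one control-B between each pair."""
    return CONTROL_B.join(documents)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: every argument names a file that holds one document.
    docs = []
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            docs.append(handle.read())
    sys.stdout.buffer.write(join_documents(docs))
.fi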
.SH OPTIONS
Options may appear in any order, but the
.IR filename(s) ,
if specified, must be last.
.TP "\w'\fB\-C\fP \fIcompstatpoint\fP'u+2n"
.B \-h
This displays a usage line on
.IR stderr .
.TP
.B \-G
Treat SGML tags as non-words when building the inverted file.  An SGML
tag is anything between angle brackets, i.e., `<' and `>'.
.TP
.B \-S
This option causes a special pass to be executed.  It is up to a user
to modify
.I mg.special.c
in the source code to do something with the documents it is given.
.TP
.B \-D
If
.B mg_passes
fails, then print the document that caused the failure to the trace
file if tracing is active, or to
.I stderr
if it is not.
.TP
.B \-W
This option enables the generation of the weights file when
.B \-I2
is specified.  It causes
.B \-I2
to use a little more memory and CPU.
.TP
.B \-1
Produce a level-1 inverted file.  This option is only useful when
specified with
.BR \-I1 .
A level-1 inverted file makes it possible for
.BR mgquery (1)
to do Boolean queries.  Ranked queries can still be done,
although the quality of the ranking is abysmal.
.TP
.B \-2
Produce a level-2 inverted file.  This option is only useful when
specified with
.BR \-I1 .
This is the default when none of
.BR \-1 ", " \-2 ", or " \-3
is specified.
A level-2 inverted file makes it possible for
.BR mgquery (1)
to do Boolean queries and cosine-ranked queries.
.TP
.B \-3
Produce a level-3 inverted file.  This option is only useful when
specified with
.BR \-I1 .
This has been implemented to enable paragraph-level inversion.
Paragraphs are delimited by control-C characters in the source text.
.TP
.BI \-C " compstatpoint"
This option causes statistics on the compression performance to be
output to a file called
.IR *.compression.stats .
.I compstatpoint
specifies the interval between outputting each line of statistics.  The
units of
.I compstatpoint
are kilobytes of source text.  For example, if
.I compstatpoint
is 10, then a line is output to the file every 10 KB of input
source.  Each line of the file consists of four numbers.  The first number
is the amount of input text, in bytes, processed so far.  The second
number is the amount of input text, in bytes, processed since the
last line was output to the file.  The third number is the number of
output bytes generated since the last line was output to the file, and
the fourth number is the compression achieved since the last line was
output, i.e., the third number divided by the second number.
.TP
.BI \-n " tracename"
This specifies the filename to use for the trace log, if tracing is
enabled using the
.B \-t
option.  If
.BI \-n " tracename"
is not given and tracing is enabled, a default trace filename will be
used.
.TP
.BI \-s " stemmethod"
This specifies the method to use to \*(lqstem\*(rq the words in the
inverted file dictionary.  This is a bit mask specifying the
operations to do on words as they are parsed out of the text, where
bit number 0 is the low-order (rightmost) bit.  Bit 0 does case
folding, and bit 1 does simple stemming, so the value 3 for
.I stemmethod
does both case folding and stemming.
.TP
.BI \-a " stemmer"
This specifies the stemmer to use when stemming words.  This
is a description of the language the stemmer is intended for
or a description of the stemmer.  Valid options include:
english, lovin, french, and simplefrench.
.TP
.BI \-b " bufsize"
Specify the size of the document buffer in kilobytes.  If any document
is larger than
.IR bufsize ,
the program will abort with an error message.  The buffer is not
enlarged automatically, so
.I bufsize
must be at least as large as the largest document.  The default size
is 3072 KB (3 MB).
.TP
.BI \-m " memlimit"
Maximum amount of memory to use for the pass-2 file inversion in
megabytes.  This option is only useful when used in conjunction with
the option
.BR \-I1 .
The larger this value, the faster the pass-2 inversion will proceed.
The default value is 5 MB.
.TP
.BI \-c " numchunks"
The maximum number of inversion chunks to write to disk.  Each chunk
will be approximately as large as
.IR memlimit .
This option is only useful when used in conjunction with the option
.BR \-I2 .
The larger this value, the faster the pass-2 inversion will proceed.
The default value is 5 MB.
.TP
.BI \-t " tracepos"
This option activates tracing.  A line will be generated in the
trace file for every
.I tracepos
input bytes processed.  The default name for the trace file can be
overridden using the
.BI \-n " tracename"
option.
.TP
.B \-T1
Generate the
.I *.text.stats
file.
.TP
.B \-T2
Generate the
.IR *.text ,
.IR *.text.idx ,
and possibly the
.I *.text.dict.aux
files.  Using this option requires that the
.I *.text.dict
file be present.
.TP
.B \-I1
Generate the
.IR *.invf.dict.build ,
.IR *.invf.chunk ,
and
.I *.invf.chunk.trans
files.
.TP
.B \-I2
Generate the
.I *.invf
and
.I *.invf.idx
files.  Using this option requires
that the
.IR *.invf.dict.hash ,
.IR *.invf.chunk ,
and
.I *.invf.chunk.trans
files
be present.  The
.I *.invf.dict.hash
file is generated by
.BR mg_perf_hash_build (1)
from the
.I *.invf.dict.build
file.  If the
.B \-W
option is specified, the
.I *.weight
file will also be generated.
.TP
.BI \-d " directory"
This specifies the directory where the document collection is to be
written.
.TP
.BI \-f " name"
This specifies the base name of the document collection that will be
created.
.TP
.I filename(s)
This specifies the source text. If this is not specified, then the
program expects the source text from
.IR stdin .
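.PP
The bit mask taken by
.B \-s
can be computed rather than memorised.  The Python sketch below is
only an illustration: the constant and function names are made up for
this example, and only bits 0 and 1 are covered because they are the
only ones described above.
.LP
.nf
.DT
CASE_FOLD = 1 << 0  # bit 0: case folding
STEM = 1 << 1       # bit 1: simple stemming

def stemmethod(case_fold, stem):
    """Build the numeric stemmethod value from two booleans."""
    value = 0
    if case_fold:
        value |= CASE_FOLD
    if stem:
        value |= STEM
    return value

print(stemmethod(True, True))  # both bits set
.fi
.PP
The printed value is what would be passed as
.BI \-s " stemmethod"
to request both operations.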
.SH EXAMPLE
What follows is a UNIX
.BR csh (1)
script as an example of how to build an
.BR mg (1)
document collection.
.LP
.nf
.DT
.ft B
.I #! /bin/csh
.I
# The first argument on the command line specifies the
.I
#   source of the text
set source = ($1)
.PP
.I
# The second argument is the name of the collection
set text = ($2)
.PP
.I
# Create *.text.stats,  *.invf.dict.build,
.I
#   *.invf.chunk and *.invf.chunk.trans
${source} | mg_passes -T1 -I1 -m 1 -t 1 -f ${text}
.PP
.I
# Create *.text.dict
mg_compression_dict -f ${text}
.PP
.I
# Create *.invf.dict.hash
mg_perf_hash_build -f ${text}
.PP
.I
# Create *.text,  *.text.idx,
.I
#   *.invf and *.invf.idx
${source} | mg_passes -T2 -I2 -c 2 -t 1 -f ${text}
.PP
.I
# Create *.text.idx.wgt and *.weight.approx
mg_weights_build -f ${text} -b 8
.PP
.I
# Create *.invf.dict
mg_invf_dict -f ${text} -b 4096
.PP
.I
# Create *.text.dict.fast
mg_fast_comp_dict -f ${text}
.ft R
.fi
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable exists, then its value is used as the
default directory where the
.BR mg (1)
collection files are.  If this variable does not exist, then the
directory \*(lq\fB.\fP\*(rq is used by default.  The command line
option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
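.PP
The precedence between
.B \-d
and
.SB MGDATA
described above amounts to the following lookup, shown as a Python
sketch.  The function
.I collection_directory
is hypothetical and is not taken from the
.B mg_passes
source; a variable that is set but empty is treated here as existing.
.LP
.nf
.DT
import os

def collection_directory(d_option=None, environ=None):
    """Pick the collection directory: d option, then MGDATA, then "."."""
    env = os.environ if environ is None else environ
    if d_option is not None:
        return d_option
    return env.get("MGDATA", ".")
.fi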
.SH FILES
.TP 20
.B *.invf
Inverted file.
.TP
.B *.invf.chunk
Inverted file chunk descriptor file.  When the inverted file is
created it is created in chunks that use no more than a set amount of
memory.  This file describes those chunks.
.TP
.B *.invf.chunk.trans
Word-occurrence-order to lexical-order translation file.  The
.B *.invf.chunk
file is written in word-occurrence order but is required by
.B \-I2
to be in lexical order.
.TP
.B *.invf.dict.build
Compressed stemmed dictionary.
.TP
.B *.invf.dict.hash
Data for an order-preserving perfect hash function.
.TP
.B *.invf.idx
The index into the inverted file.
.TP
.B *.weight
The exact weights file.
.TP
.B *.text
Compressed documents.
.TP
.B *.text.stats
Statistics about the text.
.TP
.B *.text.dict
Compressed compression dictionary.
.TP
.B *.text.idx
Index into the compressed documents.
.TP
.B *.trace
The default trace file.
.TP
.B *.compression.stats
Statistics about the compression of the text.
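.PP
The four fields on each line of
.I *.compression.stats
are described under the
.B \-C
option.  The Python sketch below reads one such line and recomputes
the fourth field from the second and third.  It assumes that the
fields are separated by white space and written in decimal notation,
which this page does not state; the function names and any sample
line fed to them are invented for illustration.
.LP
.nf
.DT
def parse_stats_line(line):
    """Split one line into the four fields described under the C option."""
    total_in, delta_in, delta_out, ratio = line.split()
    return int(total_in), int(delta_in), int(delta_out), float(ratio)

def interval_ratio(delta_in, delta_out):
    """Recompute the fourth field: output bytes divided by input bytes."""
    return delta_out / delta_in
.fi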
.SH "SEE ALSO"
.na
.BR mg (1),
.BR mg_compression_dict (1),
.BR mg_fast_comp_dict (1),
.BR mg_get (1),
.BR mg_invf_dict (1),
.BR mg_invf_dump (1),
.BR mg_invf_rebuild (1),
.BR mg_perf_hash_build (1),
.BR mg_text_estimate (1),
.BR mg_weights_build (1),
.BR mgbilevel (1),
.BR mgbuild (1),
.BR mgdictlist (1),
.BR mgfelics (1),
.BR mgquery (1),
.BR mgstat (1),
.BR mgtic (1),
.BR mgticbuild (1),
.BR mgticdump (1),
.BR mgticprune (1),
.BR mgticstat (1).