.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.Id $Id: mg.1 16583 2008-07-29 10:20:36Z davidb $
.\"------------------------------------------------------------
.TH mg 1 \*(Dt CITRI
.\"=====================================================================
.\" Author:
.\" Nelson H. F. Beebe
.\" Center for Scientific Computing
.\" Department of Mathematics
.\" University of Utah
.\" Salt Lake City, UT 84112
.\" USA
.\" Email: [email protected] (Internet)
.\"=====================================================================
.if t .ds Bi B\s-2IB\s+2T\\h'-0.1667m'\\v'0.20v'E\\v'-0.20v'\\h'-0.125m'X
.if n .ds Bi BibTeX
.if t .ds Te T\\h'-0.1667m'\\v'0.20v'E\\v'-0.20v'\\h'-0.125m'X
.if n .ds Te TeX
.\"=====================================================================
.SH NAME
mg \- full-text inverted index support
.\"=====================================================================
.SH DESCRIPTION
.B mg
is a suite of programs that can be used to create
and query full-text inverted indexes for a
collection of documents.
.PP
An inverted index is essentially a list of
pointers to occurrences of every word in a
collection of documents, so that, for example, in
a collection of Shakespeare's works, one can pose
questions like
.I "How often is a fool mentioned in the plays?"
and
.I "Are Romeo and Juliet ever mentioned in the same sentence?"
While search utilities like the UNIX
.BR grep (1)
family can be used for simple queries, they suffer
from several limitations:
.TP \w'\(bu'u+2n
\(bu
Each invocation requires a separate pass over the
documents, which becomes impossibly slow if the
document collection is very large.
.TP
\(bu
It is difficult to compose Boolean queries like
.IR "Caesar AND Brutus AND NOT Antony" .
.TP
\(bu
They provide no mechanism for ranking matches
according to importance, which is extremely
important when queries are complex and numerous
matches are found.
.TP
\(bu
They provide no mechanism for displaying
surrounding context (although the Free Software
Foundation
implementations for the GNU Project remedy this
with their
.BI \-A " num"
(after)
and
.BI \-B " num"
(before)
switches).
.TP
\(bu
They provide no mechanism for searching for
occurrences in the same paragraph, unless
preprocessing is done on the files to wrap
paragraphs into long lines. Even that will often
fail, because input buffer sizes may be limited to
a few hundred characters.
.TP
\(bu
They provide no easy way to deal with grammatical
word-ending variants, except by explicit
enumeration, such as searching for
.IR compress ,
.IR compressed ,
.IR compresses ,
.IR compressing ,
and
.IR compression .
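.PP
To make the idea of an inverted index concrete, the
following is a minimal sketch, not the
.B mg
implementation: it uses
.BR sh (1)
and
.BR awk (1)
to map each word to the documents that contain it.
Treating each input line as one document, the sample
text, and the variable names are all assumptions made
for illustration; real
.B mg
indexes are compressed and far more elaborate.
.PP
.nf
```shell
# Build a toy inverted index: for every word, list the documents containing it.
# Each input line stands in for one document (an assumption for this sketch).
docs=$(mktemp)
cat > "$docs" <<'EOF'
the fool doth think he is wise
romeo loves juliet
the wise man knows himself to be a fool
EOF

index=$(awk '{
    for (i = 1; i <= NF; i++) {
        key = $i SUBSEP NR
        if (!(key in seen)) {          # record each (word, document) pair once
            seen[key] = 1
            postings[$i] = postings[$i] " " NR
        }
    }
}
END { for (w in postings) print w ":" postings[w] }' "$docs" | sort)

# Look up the documents that mention "fool" without rescanning the text.
fool_docs=$(echo "$index" | grep '^fool:')
echo "$fool_docs"
rm -f "$docs"
```
.fi
.PP
Once the index exists, a query becomes a lookup in the
index rather than a fresh pass over every document,
which is the first limitation listed above.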
.PP
Most computer users have been faced with the
problem of finding information in a large
collection of files, such as electronic mail,
on-line documentation, or source code. Inverted
indexes provide a highly effective solution to
this problem.
.PP
The
.B mg
software is described in Appendix A of the book
.RS
.nf
Ian H. Witten, Alistair Moffat, and Timothy C. Bell
.I "Managing Gigabytes: Compressing and Indexing Documents and Images"
Van Nostrand Reinhold
1994
xiv + 429 pages
US$54.95
ISBN 0-442-01863-0
Library of Congress catalog number TA1637 .W58 1994
.fi
.RE
.PP
Many of the algorithms implemented in
.B mg
represent significant advances over previous work,
both in speed and in storage requirements. On a
fast workstation, in tens of minutes, or a few
hours,
.BR mgbuild (1)
can create an index to
.I all
of the words in hundreds of thousands of documents
occupying hundreds of megabytes, or even more than
a gigabyte, of disk space.
.BR mgquery (1)
can then be used to answer complex queries, with
responses often returned in a second or less.
.B mg
also contains algorithms to deal with images, so
that with a small amount of descriptive text for
each image, it is possible to do searches in
collections of images, and to have retrievals
display the images using a viewer like
.BR xv (1).
.PP
.B mg
can deal with compressed text and image files and,
surprisingly, it usually runs faster than it would
if the files were not compressed! Thus, the
considerable disk space savings possible from
file compression are not lost because of the need
for fast document search and retrieval.
.PP
The Free Software Foundation GNU Project
compression utilities
.BR gzip (1)
and
.BR gunzip (1)
are recommended for general use over older
alternatives, like
.BR compress (1),
because of their speed and high compression
ratios.
.\"=====================================================================
.SH AVAILABILITY
The
.B mg
software can be obtained via anonymous ftp to the
Australian archive host
.I "munnari.oz.au [128.250.1.21]"
from the directory
.IR "/pub/mg" .
.\"=====================================================================
.SH "TYPICAL USE OF mg"
Although
.B mg
consists of more than 20 separate programs, many
of which have complicated command-line options,
take heart: most users require only two or three
of these programs, and nothing more than a
document name on the command line.
.PP
A
.I document
for
.B mg
is a fragment of text suitable for retrieval as a
unit when it is found to contain a requested word,
or words. In a collection of poetry, a document
might be a stanza, while in a novel, it could be a
paragraph. In an index of first lines of poems, a
document would likely be just a single line.
.PP
Just what constitutes a document is decided by a
user-modifiable UNIX shell script,
.BR mg_get (1).
The default script provided with the
.B mg
source distribution knows about these named
document collections:
.TP \w'mailfiles'u+2n
.I alice
Lewis Carroll's
.I "Alice in Wonderland"
book.
.TP
.I allfiles
All mail files in the directory tree
.IR $HOME/Mail ,
including all of its nested subdirectories.
.TP
.I mailfiles
Individual mail messages in
.IR $HOME/mbox
and
.IR $HOME/.sentmail .
.TP
.I davinci
A small collection of text and images from the
work of Leonardo da Vinci.
.PP
A document collection name is used by
.B mg
as a
.BR csh (1)
.I case
statement selector, and as a subdirectory name in the
.B $MGDATA
directory, or the current directory, if
.B MGDATA
is not defined.
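.PP
The directory rule just described can be sketched as
follows. This is an illustration written for
.BR sh (1),
not code taken from
.BR mg ;
the variable names and the sample path
.I /tmp/mgdata
are assumptions.
.PP
.nf
```shell
# Work out where the data for a collection would live: under $MGDATA when
# that variable is defined, otherwise under the current directory.
collection=bibfiles

unset MGDATA
without_env="${MGDATA-.}/$collection"     # MGDATA undefined: use "."

MGDATA=/tmp/mgdata                        # hypothetical location
with_env="${MGDATA-.}/$collection"        # MGDATA defined: use it

echo "$without_env"
echo "$with_env"
```
.fi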
.\"=====================================================================
.SH "EXTENDING mg_get"
This section describes how to extend
.BR mg_get (1)
to handle a new document collection.
.PP
Let us take two examples: all \*(Bi
.I .bib
files, and all \*(Te files, contained in
subdirectories under the login directory.
.PP
For \*(Bi, each bibliographic entry will be
considered a separate document. In order to
facilitate easy identification of entries, we
shall require them to begin at the start of a
line; the
.BR bibclean (1)
utility can be used to standardize the format of
.I .bib
files, and to validate their string values, so
that this requirement is met.
.PP
For \*(Te, each paragraph will be a separate
document, and we assume that paragraphs are
separated by blank lines. We assume that files
with extensions
.IR .atx ,
.IR .ltx ,
.IR .stx ,
and
.I .tex
contain input to common \*(Te macro package
variants.
.PP
Make a personal copy of the
.I mg_get
script, using the one in the
.B mg
source distribution
.RI ( mg-1.0/mg/mg_get.sh ),
or the one in the local binary program directory,
at many sites called
.IR /usr/local/bin/mg_get .
.PP
Examination of the
.I mg_get
script shows that each document collection name is
used in a
.BR csh (1)
.I case
statement selector, and that most of the work is done
by very simple
.BR awk (1)
programs that extract documents from files. In
your private copy of the
.I mg_get
file, after the line
.PP
.nf
breaksw #davinci
.fi
.PP
and before the line
.PP
.nf
default:
.fi
.PP
insert this new code:
.PP
.nf
case bibfiles:
    # Takes a list of files that contain BibTeX entries, and splits them up
    # by putting ^B after each entry. Assumes that each entry
    # begins with a line '^@'.
    switch ($flag)
    case '-init':
        breaksw

    case '-text':
        find $HOME -name '*.bib' -print | \e
            sort | \e
            xargs -l100 awk \e
            '/^@/&&NR!=1{print "^B"} {print $0} END{print "^B"}'
        breaksw #-text

    case '-cleanup':
        breaksw #-cleanup

    endsw #flag
    breaksw #bibfiles

case texfiles:
    # Takes a list of TeX files and splits them up
    # by putting ^B after each paragraph. Assumes that paragraphs
    # are separated by blank lines.
    switch ($flag)
    case '-init':
        breaksw

    case '-text':
        find $HOME -name '*x' -print | \e
            egrep '[.]tex$|[.]ltx$|[.]atx$|[.]stx$' | \e
            sort | \e
            xargs -l100 nawk ' /^ *$/ {if (b!=1) printf "^B";b=1} \e
            \e!/^ *$/ {print;b=0} \e
            END {printf "^B"}'
        breaksw #-text

    case '-cleanup':
        breaksw #-cleanup

    endsw #flag
    breaksw #texfiles
.fi
.PP
The ^B characters here are Control-B characters,
.I not
caret-B pairs.
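.PP
The effect of the \*(Bi splitting rule can be checked
on a small sample before building a real index. The
sketch below is written for
.BR sh (1)
rather than
.BR csh (1),
and the two sample entries and variable names are
hypothetical. It produces the Control-B character with
.I sprintf("%c", 2)
so that no literal control character has to be typed,
then counts the resulting documents.
.PP
.nf
```shell
# Split a sample .bib file into documents separated by Control-B (octal 002),
# using the same awk rule as the bibfiles case, then count the documents.
bib=$(mktemp)
cat > "$bib" <<'EOF'
@Book{witten94,
  title = "Managing Gigabytes",
}
@Article{sample95,
  title = "A hypothetical second entry",
}
EOF

n_docs=$(awk 'BEGIN { sep = sprintf("%c", 2) }
    /^@/ && NR != 1 { print sep }
    { print $0 }
    END { print sep }' "$bib" |
  awk 'BEGIN { RS = sprintf("%c", 2) }
    /@/ { n++ }
    END { print n + 0 }')

echo "$n_docs"
rm -f "$bib"
```
.fi
.PP
The second
.BR awk (1)
invocation treats Control-B as its record separator,
so each record it sees corresponds to one document as
the indexer would receive it.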
.PP
If you have a large number of \*(Bi or \*(Te
files, it is likely that a list of them would be
too long for the UNIX shell to hold in a single
variable, or on a single command line. Thus,
instead of storing the output of
.BR find (1)
in a variable, we proceed more cautiously, and
employ it to produce a list of the required files,
then pipe them to
.BR xargs (1),
which in turn passes up to 100 filenames at a time
to
.BR awk (1)
or
.BR nawk (1)
for document selection.
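.PP
The batching behaviour of
.BR xargs (1)
is easy to observe on a small scale. This sketch is an
illustration only: it uses the
.B \-n
option with a batch size of two and
.BR echo (1)
as a stand-in command, and the file names are made up.
.PP
.nf
```shell
# Feed five names to xargs in batches of at most two per command invocation.
# Each output line corresponds to one invocation of the command.
batches=$(printf '%s %s %s %s %s' a.bib b.bib c.bib d.bib e.bib | xargs -n 2 echo)
echo "$batches"

# Count how many times the command was run.
n_batches=$(echo "$batches" | wc -l | tr -d ' ')
echo "$n_batches"
```
.fi
.PP
Because each invocation receives only a bounded number
of names, the command line never grows beyond what the
system allows, however many files
.BR find (1)
reports.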
.PP
Install this modified
.I mg_get
script in your private directory for executable
programs (e.g.
.IR $HOME/bin ),
create a directory
.I $HOME/mgdata
to hold the index, issue a
.B rehash
command if you are using
.BR csh (1)
or
.BR tcsh (1),
ensure that
.I mg_get
occurs in your search path
.I before
any system-wide one (the command
.BI which " mg_get"
will tell you which version will be selected),
then create the inverted indexes by
.BI mgbuild " bibfiles"
and
.B mgbuild
.IR texfiles .
These commands may take several minutes to run if
you have a lot of \*(Bi or \*(Te files, or a large
home directory tree. Once they are complete, you
can then query the index with the commands
.BI mgquery " bibfiles"
and
.B mgquery
.IR texfiles .
These should respond very rapidly.
.PP
In order to keep your index up-to-date, you should
arrange for it to be recreated automatically and
regularly, probably every night. You can do this
with
.BR cron (1).
Use the command
.B "crontab \-e"
to edit your
.I crontab
file and add two lines like this:
.PP
.nf
00 04 * * * mgbuild bibfiles >$HOME/mgdata/bibfiles.log 2>&1
15 04 * * * mgbuild texfiles >$HOME/mgdata/texfiles.log 2>&1
.fi
.PP
Save the file and exit the editor. Now, every
night at 4am and 4:15am,
.BR mgbuild (1)
will reconstruct your inverted indexes, and the
results of the builds will be saved in log files
in your
.I $HOME/mgdata
directory.
.\"=====================================================================
.SH "SEE ALSO"
.na
.BR awk (1),
.BR bibclean (1),
.BR bibtex (1),
.BR compress (1),
.BR csh (1),
.BR grep (1),
.BR gunzip (1),
.BR gzip (1),
.BR mg_compression_dict (1),
.BR mg_fast_comp_dict (1),
.BR mg_get (1),
.BR mg_invf_dict (1),
.BR mg_invf_dump (1),
.BR mg_invf_rebuild (1),
.BR mg_passes (1),
.BR mg_perf_hash_build (1),
.BR mg_text_estimate (1),
.BR mg_weights_build (1),
.BR mgbilevel (1),
.BR mgbuild (1),
.BR mgdictlist (1),
.BR mgfelics (1),
.BR mgquery (1),
.BR mgstat (1),
.BR mgtic (1),
.BR mgticbuild (1),
.BR mgticdump (1),
.BR mgticprune (1),
.BR mgticstat (1),
.BR nawk (1),
.BR tcsh (1),
.BR tex (1),
.BR xargs (1),
.BR xmg (1),
.BR xv (1).
.\"=====================================================================
.\" This is for GNU Emacs file-specific customization:
.\" Local Variables:
.\" fill-column: 50
.\" End: