.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.Id $Id: mg.1 16583 2008-07-29 10:20:36Z davidb $
.\"------------------------------------------------------------
.TH mg 1 \*(Dt CITRI
.\"=====================================================================
.\" Author:
.\" Nelson H. F. Beebe
.\" Center for Scientific Computing
.\" Department of Mathematics
.\" University of Utah
.\" Salt Lake City, UT 84112
.\" USA
.\" Email: [email protected] (Internet)
.\"=====================================================================
.if t .ds Bi B\s-2IB\s+2T\\h'-0.1667m'\\v'0.20v'E\\v'-0.20v'\\h'-0.125m'X
.if n .ds Bi BibTeX
.if t .ds Te T\\h'-0.1667m'\\v'0.20v'E\\v'-0.20v'\\h'-0.125m'X
.if n .ds Te TeX
.\"=====================================================================
.SH NAME
mg \- full-text inverted index support
.\"=====================================================================
.SH DESCRIPTION
.B mg
is a suite of programs that can be used to create
and query full-text inverted indexes for a
collection of documents.
.PP
An inverted index is essentially a list of
pointers to occurrences of every word in a
collection of documents, so that, for example, in
a collection of Shakespeare's works, one can pose
questions like
.I "How often is a fool mentioned in the plays?"
and
.I "Are Romeo and Juliet ever mentioned in the same sentence?"
While search utilities like the UNIX
.BR grep (1)
family can be used for simple queries, they suffer
from several limitations:
.TP \w'\(bu'u+2n
\(bu
Each invocation requires a separate pass over the
documents, which becomes impossibly slow if the
document collection is very large.
.TP
\(bu
It is difficult to compose Boolean queries like
.IR "Caesar AND Brutus AND NOT Antony" .
.TP
\(bu
They provide no mechanism for ranking matches
according to importance, which is essential
when queries are complex and numerous
matches are found.
.TP
\(bu
They provide no mechanism for displaying
surrounding context (although the Free Software
Foundation
implementations for the GNU Project remedy this
with their
.BI \-A " num"
(after)
and
.BI \-B " num"
(before)
switches).
.TP
\(bu
They provide no mechanism for searching for
occurrences in the same paragraph, unless
preprocessing is done on the files to wrap
paragraphs into long lines.  Even that will often
fail, because input buffer sizes may be limited to
a few hundred characters.
.TP
\(bu
They provide no easy way to deal with grammatical
word-ending variants, except by explicit
enumeration, such as searching for
.IR compress ,
.IR compressed ,
.IR compresses ,
.IR compressing ,
and
.IR compression .
.PP
Most computer users have been faced with the
problem of finding information in a large
collection of files, such as electronic mail,
on-line documentation, or source code.  Inverted
indexes provide a highly effective solution to
this problem.
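.PP
As a rough illustration of the idea only (it is not the data
structure that
.B mg
itself uses), the following sketch builds a toy inverted index with
.BR awk (1).
Each input line is treated as one document, and the helper name
.I build_index
is hypothetical.

```shell
#!/bin/sh
# Toy inverted index: map each lowercased word to the numbers of the
# documents (here, input lines) in which it occurs.
# Hypothetical helper for illustration; not part of mg.
build_index() {
    awk '{
        line = tolower($0)
        gsub(/[^a-z]+/, " ", line)        # keep letters only
        n = split(line, words, " ")
        for (i = 1; i <= n; i++) {
            key = words[i] SUBSEP NR
            if (!(key in seen)) {         # list each document once per word
                seen[key] = 1
                postings[words[i]] = postings[words[i]] " " NR
            }
        }
    }
    END {
        for (w in postings) print w ":" postings[w]
    }' "$@" | sort
}

# Usage: three one-line "documents".
printf '%s\n' 'Romeo loves Juliet' 'The fool speaks' 'Juliet and the fool' |
    build_index
```

A query for a word then needs only a lookup of its posting list,
rather than a fresh pass over every document.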
.PP
The
.B mg
software is described in Appendix A of the book
.RS
.nf
Ian H. Witten, Alistair Moffat, and Timothy C. Bell
.I "Managing Gigabytes: Compressing and Indexing Documents and Images"
Van Nostrand Reinhold
1994
xiv + 429 pages
US$54.95
ISBN 0-442-01863-0
Library of Congress catalog number TA1637 .W58 1994
.fi
.RE
.PP
Many of the algorithms implemented in
.B mg
represent significant advances over previous work,
both in speed and in storage requirements.  On a
fast workstation, in tens of minutes, or a few
hours,
.BR mgbuild (1)
can create an index to
.I all
of the words in hundreds of thousands of documents
occupying hundreds of megabytes, or even more than
a gigabyte, of disk space.
.BR mgquery (1)
can then be used to answer complex queries, with
responses often returned in a second or less.
.B mg
also contains algorithms to deal with images, so
that with a small amount of descriptive text for
each image, it is possible to do searches in
collections of images, and to have retrievals
display the images using a viewer like
.BR xv (1).
.PP
.B mg
can deal with compressed text and image files and,
surprisingly, it usually runs faster than it would
if the files were not compressed!  Thus, the
considerable disk space savings possible from
file compression are not lost because of the need
for fast document search and retrieval.
.PP
The Free Software Foundation GNU Project
compression utilities
.BR gzip (1)
and
.BR gunzip (1)
are recommended for general use over older
alternatives, like
.BR compress (1),
because of their speed and high compression
ratios.
.\"=====================================================================
.SH AVAILABILITY
The
.B mg
software can be obtained via anonymous ftp to the
Australian archive host
.I "munnari.oz.au [128.250.1.21]"
from the directory
.IR "/pub/mg" .
.\"=====================================================================
.SH "TYPICAL USE OF mg"
Although
.B mg
consists of more than 20 separate programs, many
of which have complicated command-line options,
take heart: most users require only two or three
of these programs, and nothing more than a
document name on the command line.
.PP
A
.I document
for
.B mg
is a fragment of text suitable for retrieval as a
unit when it is found to contain a requested word,
or words.  In a collection of poetry, a document
might be a stanza, while in a novel, it could be a
paragraph.  In an index of first lines of poems, a
document would likely be just a single line.
.PP
Just what constitutes a document is decided by a
user-modifiable UNIX shell script,
.BR mg_get (1).
The default script provided with the
.B mg
source distribution knows about these named
document collections:
.TP \w'mailfiles'u+2n
.I alice
Lewis Carroll's
.I "Alice in Wonderland"
book.
.TP
.I allfiles
All mail files in the directory tree
.IR $HOME/Mail ,
including all of its nested subdirectories.
.TP
.I mailfiles
Individual mail messages in
.IR $HOME/mbox
and
.IR $HOME/.sentmail .
.TP
.I davinci
A small collection of text and images from the
work of Leonardo da Vinci.
.PP
A document collection name is used by
.B mg
as a
.BR csh (1)
.I case
statement selector, and as a subdirectory name in the
.B $MGDATA
directory, or the current directory, if
.B MGDATA
is not defined.
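.PP
The directory rule just described can be sketched in
.BR sh (1)
as follows.  This is an illustration only, not code taken from
.BR mg ;
the function name
.I collection_dir
is hypothetical, and the sketch assumes that an empty
.B MGDATA
should be treated the same as an undefined one.

```shell
#!/bin/sh
# Hypothetical helper: print the directory in which the files for a
# named document collection would be kept.
collection_dir() {
    # ${MGDATA:-.} expands to $MGDATA when it is set and non-empty,
    # and to the current directory (.) otherwise.
    printf '%s/%s\n' "${MGDATA:-.}" "$1"
}

# Usage, run in subshells so the caller's environment is untouched:
(MGDATA=$HOME/mgdata; collection_dir bibfiles)
(unset MGDATA; collection_dir alice)
```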
.\"=====================================================================
.SH "EXTENDING mg_get"
This section describes how to extend
.BR mg_get (1)
to handle a new document collection.
.PP
Let us take two examples: all \*(Bi
.I .bib
files, and all \*(Te files, contained in
subdirectories under the login directory.
.PP
For \*(Bi, each bibliographic entry will be
considered a separate document.  In order to
facilitate easy identification of entries, we
shall require them to begin at the start of a
line; the
.BR bibclean (1)
utility can be used to standardize the format of
.I .bib
files, and to validate their string values, so
that this requirement is met.
.PP
For \*(Te, each paragraph will be a separate
document, and we assume that paragraphs are
separated by blank lines.  We assume that files
with extensions
.IR .atx ,
.IR .ltx ,
.IR .stx ,
and
.I .tex
contain input to common \*(Te macro package
variants.
.PP
Make a personal copy of the
.I mg_get
script, using the one in the
.B mg
source distribution
.RI ( mg-1.0/mg/mg_get.sh ),
or the one in the local binary program directory,
at many sites called
.IR /usr/local/bin/mg_get .
.PP
Examination of the
.I mg_get
script shows that each document collection name is
used in a
.BR csh (1)
.I case
statement selector, and that most of the work is done
by very simple
.BR awk (1)
programs that extract documents from files.  In
your private copy of the
.I mg_get
file, after the line
.PP
.nf
  breaksw #davinci
.fi
.PP
and before the line
.PP
.nf
  default:
.fi
.PP
insert this new code:
.PP
.nf
  case bibfiles:
    # Takes a list of files that contain BibTeX entries, and splits them up
    # by putting ^B after each entry.  Assumes that each entry
    # begins with a line '^@'.
    switch ($flag)
      case '-init':
      breaksw

      case '-text':
        find $HOME -name '*.bib' -print | \e
          sort | \e
          xargs -l100 awk \e
            '/^@/&&NR!=1{print "^B"} {print $0} END{print "^B"}'
      breaksw #-text

      case '-cleanup':
      breaksw #-cleanup

    endsw #flag
  breaksw #bibfiles

  case texfiles:
    # Takes a list of TeX files and splits them up
    # by putting ^B after each paragraph.  Assumes that paragraphs
    # are separated by blank lines.
    switch ($flag)
      case '-init':
      breaksw

      case '-text':
        find $HOME -name '*x' -print | \e
          egrep '[.]tex$|[.]ltx$|[.]atx$|[.]stx$' | \e
          sort | \e
          xargs -l100 nawk ' /^ *$/ {if (b!=1) printf "^B";b=1} \e
            \e!/^ *$/ {print;b=0} \e
            END {printf "^B"}'
      breaksw #-text

      case '-cleanup':
      breaksw #-cleanup

    endsw #flag
  breaksw #texfiles
.fi
.PP
The ^B characters here are Control-B characters,
.I not
caret-B pairs.
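.PP
Because a literal Control-B is easy to lose when a script is copied
or printed, it can help to see the same splitting logic written with
an octal escape instead.  The sketch below is an
.BR sh (1)
illustration rather than a replacement for the
.BR csh (1)
code above; the function name
.I split_entries
is hypothetical, and it relies on
.BR awk (1)
accepting the escape \e002 in string constants.

```shell
#!/bin/sh
# Split BibTeX-style input into documents by writing a real Control-B
# (octal 002) on a line of its own after each entry.
# Hypothetical helper for illustration; not part of mg.
split_entries() {
    awk '/^@/ && NR != 1 { printf "\002\n" }   # close the previous entry
         { print }                             # copy the entry text
         END { printf "\002\n" }               # close the last entry
        ' "$@"
}

# Usage: count the separators written for two small entries.
printf '%s\n' '@Book{a,' '}' '@Article{b,' '}' |
    split_entries | tr -cd '\002' | wc -c
```

The count printed should equal the number of entries, since one
separator follows each of them.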
.PP
If you have a large number of \*(Bi or \*(Te
files, it is likely that a list of them would be
too long for the UNIX shell to hold in a single
variable, or on a single command line.  Thus,
instead of storing the output of
.BR find (1)
in a variable, we proceed more cautiously, and
employ it to produce a list of the required files,
then pipe them to
.BR xargs (1),
which in turn passes up to 100 filenames at a time
to
.BR awk (1)
or
.BR nawk (1)
for document selection.
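.PP
The batching behaviour of
.BR xargs (1)
can be observed on a small scale.  In this sketch,
.BR echo (1)
stands in for the
.BR awk (1)
program so that each batch is visible, and the limit is set with the
.B \-n
option (maximum number of arguments per command) instead of a
line-based limit; the two coincide when each input line holds a single
filename without blanks.  The function name
.I batch_names
is hypothetical.

```shell
#!/bin/sh
# Run "echo" once for every batch of at most $1 names read from
# standard input.  Hypothetical helper for illustration only.
batch_names() {
    xargs -n "$1" echo
}

# Usage: five names, at most two per command, so three commands run.
printf '%s\n' a.bib b.bib c.bib d.bib e.bib | batch_names 2
```

Each output line corresponds to one invocation of the command, which
is why the
.BR awk (1)
programs must cope with being started several times.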
.PP
Install this modified
.I mg_get
script in your private directory for executable
programs (e.g.
.IR $HOME/bin ),
create a directory
.I $HOME/mgdata
to hold the index, issue a
.B rehash
command if you are using
.BR csh (1)
or
.BR tcsh (1),
ensure that
.I mg_get
occurs in your search path
.I before
any system-wide one (the command
.BI which " mg_get"
will tell you which version will be selected),
then create the inverted indexes by
.BI mgbuild " bibfiles"
and
.B mgbuild
.IR texfiles .
These commands may take several minutes to run if
you have a lot of \*(Bi or \*(Te files, or a large
home directory tree.  Once they are complete, you
can then query the index with the commands
.BI mgquery " bibfiles"
and
.B mgquery
.IR texfiles .
These should respond very rapidly.
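.PP
The effect of search path order can be checked safely before touching
a real installation.  The following
.BR sh (1)
sketch creates two throwaway directories, each holding a dummy
.I mg_get
script, and reports which copy the shell would run.  The directory
names and the function
.I demo_path_order
are hypothetical, and
.B "command \-v"
is used here as the
.BR sh (1)
counterpart of
.BR which .

```shell
#!/bin/sh
# Show that the first matching directory in PATH wins.
# Everything is created under a temporary directory and removed again.
demo_path_order() {
    tmp=$(mktemp -d) || return 1
    mkdir "$tmp/private" "$tmp/system"
    for d in private system; do
        # Dummy stand-ins; these are not real mg_get scripts.
        printf '#!/bin/sh\necho %s\n' "$d" > "$tmp/$d/mg_get"
        chmod +x "$tmp/$d/mg_get"
    done
    # The private directory is listed first, so its copy is selected.
    (PATH="$tmp/private:$tmp/system:$PATH"; command -v mg_get)
    rm -rf "$tmp"
}

demo_path_order
```

The path printed ends in
.IR private/mg_get ,
confirming that a personal copy placed earlier in the search path
takes precedence over a system-wide one.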
.PP
In order to keep your index up-to-date, you should
arrange for it to be recreated automatically and
regularly, probably every night.  You can do this
with
.BR cron (1).
Use the command
.B "crontab \-e"
to edit your
.I crontab
file and add two lines like this:
.PP
.nf
00 04 * * * mgbuild bibfiles >$HOME/mgdata/bibfiles.log 2>&1
15 04 * * * mgbuild texfiles >$HOME/mgdata/texfiles.log 2>&1
.fi
.PP
Save the file and exit the editor.  Now, every
night at 4am and 4:15am,
.BR mgbuild (1)
will reconstruct your inverted indexes, and the
results of the builds will be saved in log files
in your
.I $HOME/mgdata
directory.
.\"=====================================================================
.SH "SEE ALSO"
.na
.BR awk (1),
.BR bibclean (1),
.BR bibtex (1),
.BR compress (1),
.BR csh (1),
.BR grep (1),
.BR gunzip (1),
.BR gzip (1),
.BR mg_compression_dict (1),
.BR mg_fast_comp_dict (1),
.BR mg_get (1),
.BR mg_invf_dict (1),
.BR mg_invf_dump (1),
.BR mg_invf_rebuild (1),
.BR mg_passes (1),
.BR mg_perf_hash_build (1),
.BR mg_text_estimate (1),
.BR mg_weights_build (1),
.BR mgbilevel (1),
.BR mgbuild (1),
.BR mgdictlist (1),
.BR mgfelics (1),
.BR mgquery (1),
.BR mgstat (1),
.BR mgtic (1),
.BR mgticbuild (1),
.BR mgticdump (1),
.BR mgticprune (1),
.BR mgticstat (1),
.BR nawk (1),
.BR tcsh (1),
.BR tex (1),
.BR xargs (1),
.BR xmg (1),
.BR xv (1).
.\"=====================================================================
.\" This is for GNU Emacs file-specific customization:
.\" Local Variables:
.\" fill-column: 50
.\" End: