.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.Id $Id: mg.1 3745 2003-02-20 21:20:24Z mdewsnip $
.\"------------------------------------------------------------
.TH mg 1 \*(Dt CITRI
.\"=====================================================================
.\" Author:
.\" Nelson H. F. Beebe
.\" Center for Scientific Computing
.\" Department of Mathematics
.\" University of Utah
.\" Salt Lake City, UT 84112
.\" USA
.\" Email: [email protected] (Internet)
.\"=====================================================================
.if t .ds Bi B\s-2IB\s+2T\\h'-0.1667m'\\v'0.20v'E\\v'-0.20v'\\h'-0.125m'X
.if n .ds Bi BibTeX
.if t .ds Te T\\h'-0.1667m'\\v'0.20v'E\\v'-0.20v'\\h'-0.125m'X
.if n .ds Te TeX
.\"=====================================================================
.SH NAME
mg \- full-text inverted index support
.\"=====================================================================
.SH DESCRIPTION
.B mg
is a suite of programs that can be used to create
and query full-text inverted indexes for a
collection of documents.
.PP
An inverted index is essentially a list of
pointers to occurrences of every word in a
collection of documents, so that, for example, in
a collection of Shakespeare's works, one can pose
questions like
.I "How often is a fool mentioned in the plays?"
and
.I "Are Romeo and Juliet ever mentioned in the same sentence?"
While search utilities like the UNIX
.BR grep (1)
family can be used for simple queries, they suffer
from several limitations:
.TP \w'\(bu'u+2n
\(bu
Each invocation requires a separate pass over the
documents, which becomes impossibly slow if the
document collection is very large.
.TP
\(bu
It is difficult to compose Boolean queries like
.IR "Caesar AND Brutus AND NOT Antony" .
.TP
\(bu
They provide no mechanism for ranking matches
according to importance, which is extremely
important when queries are complex and numerous
matches are found.
.TP
\(bu
They provide no mechanism for displaying
surrounding context (although the Free Software
Foundation
implementations for the GNU Project remedy this
with their
.BI \-A " num"
(after)
and
.BI \-B " num"
(before)
switches).
.TP
\(bu
They provide no mechanism for searching for
occurrences in the same paragraph, unless
preprocessing is done on the files to wrap
paragraphs into long lines. Even that will often
fail, because input buffer sizes may be limited to
a few hundred characters.
.TP
\(bu
They provide no easy way to deal with grammatical
word-ending variants, except by explicit
enumeration, such as searching for
.IR compress ,
.IR compressed ,
.IR compresses ,
.IR compressing ,
and
.IR compression .
.PP
Most computer users have been faced with the
problem of finding information from a large
collection of files, such as electronic mail,
on-line documentation, or source code. Inverted
indexes provide a highly effective solution to
this problem.
.PP
The
.B mg
software is described in Appendix A of the book
.RS
.nf
Ian H. Witten, Alistair Moffat, and Timothy C. Bell
.I "Managing Gigabytes: Compressing and Indexing Documents and Images"
Van Nostrand Reinhold
1994
xiv + 429 pages
US$54.95
ISBN 0-442-01863-0
Library of Congress catalog number TA1637 .W58 1994
.fi
.RE
.PP
Many of the algorithms implemented in
.B mg
represent significant advances over previous work,
both in speed and in storage requirements. On a
fast workstation, in tens of minutes, or a few
hours,
.BR mgbuild (1)
can create an index to
.I all
of the words in hundreds of thousands of documents
occupying hundreds of megabytes, or even more than
a gigabyte, of disk space.
.BR mgquery (1)
can then be used to answer complex queries, with
responses often returned in a second or less.
.B mg
also contains algorithms to deal with images, so
that with a small amount of descriptive text for
each image, it is possible to do searches in
collections of images, and to have retrievals
display the images using a viewer like
.BR xv (1).
.PP
.B mg
can deal with compressed text and image files and,
surprisingly, it usually runs faster than it would
if the files were not compressed! Thus, the
considerable disk space savings possible from
file compression are not lost because of the need
for fast document search and retrieval.
.PP
The Free Software Foundation GNU Project
compression utilities
.BR gzip (1)
and
.BR gunzip (1)
are recommended for general use over older
alternatives, like
.BR compress (1),
because of their speed and high compression
ratios.
.\"=====================================================================
.SH AVAILABILITY
The
.B mg
software can be obtained via anonymous ftp to the
Australian archive host
.I "munnari.oz.au [128.250.1.21]"
from the directory
.IR "/pub/mg" .
.\"=====================================================================
.SH "TYPICAL USE OF mg"
Although
.B mg
consists of more than 20 separate programs, many
of which have complicated command-line options,
take heart: most users require only two or three
of these programs, and nothing more than a
document collection name on the command line.
.PP
A
.I document
for
.B mg
is a fragment of text suitable for retrieval as a
unit when it is found to contain a requested word,
or words. In a collection of poetry, a document
might be a stanza, while in a novel, it could be a
paragraph. In an index of first lines of poems, a
document would likely be just a single line.
.PP
Just what constitutes a document is decided by a
user-modifiable UNIX shell script,
.BR mg_get (1).
The default script provided with the
.B mg
source distribution knows about these named
document collections:
.TP \w'mailfiles'u+2n
.I alice
Lewis Carroll's
.I "Alice in Wonderland"
book.
.TP
.I allfiles
All mail files in the directory tree
.IR $HOME/Mail ,
including all of its nested subdirectories.
.TP
.I mailfiles
Individual mail messages in
.I $HOME/mbox
and
.IR $HOME/.sentmail .
.TP
.I davinci
A small collection of text and images from the
work of Leonardo da Vinci.
.PP
A document collection name is used by
.B mg
as a
.BR csh (1)
.I case
statement selector, and as a subdirectory name in the
.B $MGDATA
directory, or the current directory, if
.B MGDATA
is not defined.
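.PP
The directory rule just described can be sketched in POSIX
.BR sh (1).
This is only an illustration, not code taken from
.BR mg ;
the function name
.I index_dir_for
is hypothetical.
.PP
.nf
```shell
# Hypothetical helper: print the directory in which the index for
# a named document collection is expected to live.
index_dir_for() {
    # Use $MGDATA when it is defined, otherwise the current directory.
    echo "${MGDATA-.}/$1"
}

index_dir_for bibfiles
```
.fi
.PP
With
.B MGDATA
unset, the call prints a path below the current directory.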
.\"=====================================================================
.SH "EXTENDING mg_get"
This section describes how to extend
.BR mg_get (1)
to handle a new document collection.
.PP
Let us take two examples: all \*(Bi
.I .bib
files, and all \*(Te files, contained in
subdirectories under the login directory.
.PP
For \*(Bi, each bibliographic entry will be
considered a separate document. In order to
facilitate easy identification of entries, we
shall require them to begin at the start of a
line; the
.BR bibclean (1)
utility can be used to standardize the format of
.I .bib
files, and to validate their string values, so
that this requirement is met.
.PP
For \*(Te, each paragraph will be a separate
document, and we assume that paragraphs are
separated by blank lines. We assume that files
with extensions
.IR .atx ,
.IR .ltx ,
.IR .stx ,
and
.I .tex
contain input to common \*(Te macro package
variants.
.PP
Make a personal copy of the
.I mg_get
script, using the one in the
.B mg
source distribution
.RI ( mg-1.0/mg/mg_get.sh ),
or the one in the local binary program directory,
at many sites called
.IR /usr/local/bin/mg_get .
.PP
Examination of the
.I mg_get
script shows that each document collection name is
used in a
.BR csh (1)
.I case
statement selector, and that most of the work is done
by very simple
.BR awk (1)
programs that extract documents from files. In
your private copy of the
.I mg_get
file, after the line
.PP
.nf
    breaksw #davinci
.fi
.PP
and before the line
.PP
.nf
    default:
.fi
.PP
insert this new code:
.PP
.nf
    case bibfiles:
# Takes a list of files that contain BibTeX entries, and splits them up
# by putting ^B after each entry. Assumes that each entry
# begins with a line '^@'.
        switch ($flag)
            case '-init':
                breaksw

            case '-text':
                find $HOME -name '*.bib' -print | \e
                    sort | \e
                    xargs -l100 awk \e
                    '/^@/&&NR!=1{print "^B"} {print $0} END{print "^B"}'
                breaksw #-text

            case '-cleanup':
                breaksw #-cleanup

        endsw #flag
        breaksw #bibfiles

    case texfiles:
# Takes a list of TeX files and splits them up
# by putting ^B after each paragraph. Assumes that paragraphs
# are separated by blank lines.
        switch ($flag)
            case '-init':
                breaksw

            case '-text':
                find $HOME -name '*x' -print | \e
                    egrep '[.]tex$|[.]ltx$|[.]atx$|[.]stx$' | \e
                    sort | \e
                    xargs -l100 nawk ' /^ *$/ {if (b!=1) printf "^B";b=1} \e
                    \e!/^ *$/ {print;b=0} \e
                    END {printf "^B"}'
                breaksw #-text

            case '-cleanup':
                breaksw #-cleanup

        endsw #flag
        breaksw #texfiles
.fi
.PP
The ^B characters here are Control-B characters,
.I not
caret-B pairs.
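.PP
To see the effect of the Control-B separators, the entry-splitting
rule of the
.I bibfiles
case can be tried on a small file. The following is a minimal
sketch in POSIX
.BR sh (1),
not part of
.IR mg_get :
the file
.I sample.bib
and the helper functions
.I split_bib
and
.I count_docs
are made up for the illustration, and the separator is generated
inside
.BR awk (1)
rather than typed literally.
.PP
.nf
```shell
# Sketch in POSIX sh; the mg_get script itself is written for csh.
# sample.bib is a made-up file used only for this illustration.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.bib" <<'EOF'
@Book{first-entry,
  title = "A Made-Up Book",
}
@Article{second-entry,
  title = "A Made-Up Article",
}
EOF

# Same rule as the bibfiles case: print a Control-B line before every
# entry except the first, and one more at the end of the input.
split_bib() {
    awk 'BEGIN { sep = sprintf("%c", 2) }
         /^@/ && NR != 1 { print sep }
         { print $0 }
         END { print sep }' "$@"
}

# Count the separator lines; each one terminates one document.
count_docs() {
    awk 'BEGIN { sep = sprintf("%c", 2) }
         $0 == sep { n++ }
         END { print n + 0 }'
}

ndocs=$(split_bib "$tmpdir/sample.bib" | count_docs)
echo "documents: $ndocs"
rm -rf "$tmpdir"
```
.fi
.PP
The two entries in the sample file yield two separators, so the
script reports two documents.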
.PP
If you have a large number of \*(Bi or \*(Te
files, it is likely that a list of them would be
too long for the UNIX shell to hold in a single
variable, or on a single command line. Thus,
instead of storing the output of
.BR find (1)
in a variable, we proceed more cautiously, and
employ it to produce a list of the required files,
then pipe them to
.BR xargs (1),
which in turn passes up to 100 filenames at a time
to
.BR awk (1)
or
.BR nawk (1)
for document selection.
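.PP
The batching done by
.BR xargs (1)
can be observed without any real files. The sketch below is an
illustration only: the 250 generated names are made up, and it uses
the
.B \-n
option (a limit on arguments per command) where the
.I mg_get
code uses
.B \-l100
(a limit on input lines per command); with one blank-free filename
per line the two should batch alike.
.PP
.nf
```shell
# Sketch: 250 made-up names stand in for the output of find(1).
# Each line printed by echo corresponds to one command started by
# xargs, so counting the lines counts the invocations.
batches=$(awk 'BEGIN { for (i = 1; i <= 250; i++) print "file" i ".bib" }' |
    xargs -n 100 echo |
    wc -l |
    tr -d ' ')
echo "invocations: $batches"
```
.fi
.PP
The 250 names are handed over in batches of 100, 100, and 50, so
three invocations are reported.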
.PP
Install this modified
.I mg_get
script in your private directory for executable
programs (e.g.,
.IR $HOME/bin ),
create a directory
.I $HOME/mgdata
to hold the index, issue a
.B rehash
command if you are using
.BR csh (1)
or
.BR tcsh (1),
ensure that
.I mg_get
occurs in your search path
.I before
any system-wide one (the command
.BI which " mg_get"
will tell you which version will be selected),
then create the inverted indexes by
.BI mgbuild " bibfiles"
and
.B mgbuild
.IR texfiles .
These commands may take several minutes to run if
you have a lot of \*(Bi or \*(Te files, or a large
home directory tree. Once they are complete, you
can then query the indexes with the commands
.BI mgquery " bibfiles"
and
.B mgquery
.IR texfiles .
These should respond very rapidly.
.PP
In order to keep your indexes up to date, you should
arrange for them to be recreated automatically and
regularly, probably every night. You can do this
with
.BR cron (1).
Use the command
.B "crontab \-e"
to edit your
.I crontab
file and add two lines like this:
.PP
.nf
00 04 * * * mgbuild bibfiles >$HOME/mgdata/bibfiles.log 2>&1
15 04 * * * mgbuild texfiles >$HOME/mgdata/texfiles.log 2>&1
.fi
.PP
Save the file and exit the editor. Now, every
night at 4am and 4:15am,
.BR mgbuild (1)
will reconstruct your inverted indexes, and the
results of the builds will be saved in log files
in your
.I $HOME/mgdata
directory.
.\"=====================================================================
.SH "SEE ALSO"
.na
.BR awk (1),
.BR bibclean (1),
.BR bibtex (1),
.BR compress (1),
.BR csh (1),
.BR grep (1),
.BR gunzip (1),
.BR gzip (1),
.BR mg_compression_dict (1),
.BR mg_fast_comp_dict (1),
.BR mg_get (1),
.BR mg_invf_dict (1),
.BR mg_invf_dump (1),
.BR mg_invf_rebuild (1),
.BR mg_passes (1),
.BR mg_perf_hash_build (1),
.BR mg_text_estimate (1),
.BR mg_weights_build (1),
.BR mgbilevel (1),
.BR mgbuild (1),
.BR mgdictlist (1),
.BR mgfelics (1),
.BR mgquery (1),
.BR mgstat (1),
.BR mgtic (1),
.BR mgticbuild (1),
.BR mgticdump (1),
.BR mgticprune (1),
.BR mgticstat (1),
.BR nawk (1),
.BR tcsh (1),
.BR tex (1),
.BR xargs (1),
.BR xmg (1),
.BR xv (1).
.\"=====================================================================
.\" This is for GNU Emacs file-specific customization:
.\" Local Variables:
.\" fill-column: 50
.\" End: