mg.1@ 3745

Last change on this file since 3745 was 3745, checked in by mdewsnip, 21 years ago
Addition of MG package for search and retrieval
Property svn:executable set to `*`
Property svn:keywords set to `Author Date Id Revision`
File size: 11.4 KB

1	.\"------------------------------------------------------------
2	.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
3	.de Id
4	.ds Rv \\$3
5	.ds Dt \\$4
6	..
7	.Id $Id: mg.1 3745 2003-02-20 21:20:24Z mdewsnip $
8	.\"------------------------------------------------------------
9	.TH mg 1 \*(Dt CITRI
10	.\"=====================================================================
11	.\" Author:
12	.\" Nelson H. F. Beebe
13	.\" Center for Scientific Computing
14	.\" Department of Mathematics
15	.\" University of Utah
16	.\" Salt Lake City, UT 84112
17	.\" USA
18	.\" Email: [email protected] (Internet)
19	.\"=====================================================================
20	.if t .ds Bi B\s-2IB\s+2T\\h'-0.1667m'\\v'0.20v'E\\v'-0.20v'\\h'-0.125m'X
21	.if n .ds Bi BibTeX
22	.if t .ds Te T\\h'-0.1667m'\\v'0.20v'E\\v'-0.20v'\\h'-0.125m'X
23	.if n .ds Te TeX
24	.\"=====================================================================
25	.SH NAME
26	mg \- full-text inverted index support
27	.\"=====================================================================
28	.SH DESCRIPTION
29	.B mg
30	is a suite of programs that can be used to create
31	and query full-text inverted indexes for a
32	collection of documents.
33	.PP
34	An inverted index is essentially a list of
35	pointers to occurrences of every word in a
36	collection of documents, so that, for example, in
37	a collection of Shakespeare's works, one can pose
38	questions like
39	.I "How often is a fool mentioned in the plays?"
40	and
41	.I "Are Romeo and Juliet ever mentioned in the same sentence?"
42	While search utilities like the UNIX
43	.BR grep (1)
44	family can be used for simple queries, they suffer
45	from several limitations:
46	.TP \w'\(bu'u+2n
47	\(bu
48	Each invocation requires a separate pass over the
49	documents, which becomes impossibly slow if the
50	document collection is very large.
51	.TP
52	\(bu
53	It is difficult to compose Boolean queries like
54	.IR "Caesar AND Brutus AND NOT Antony" .
55	.TP
56	\(bu
57	They provide no mechanism for ranking matches
58	according to importance, which is essential
59	when queries are complex and numerous
60	matches are found.
61	.TP
62	\(bu
63	They provide no mechanism for displaying
64	surrounding context (although the Free Software
65	Foundation
66	implementations for the GNU Project remedy this
67	with their
68	.BI \-A " num"
69	(after)
70	and
71	.BI \-B " num"
72	(before)
73	switches).
74	.TP
75	\(bu
76	They provide no mechanism for searching for
77	occurrences in the same paragraph, unless
78	preprocessing is done on the files to wrap
79	paragraphs into long lines. Even that will often
80	fail, because input buffer sizes may be limited to
81	a few hundred characters.
82	.TP
83	\(bu
84	They provide no easy way to deal with grammatical
85	word-ending variants, except by explicit
86	enumeration, such as searching for
87	.IR compress ,
88	.IR compressed ,
89	.IR compresses ,
90	.IR compressing ,
91	and
92	.IR compression .
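.PP
As a rough illustration of the last limitation, the shell sketch below counts the lines that mention any of the listed variants. The function name count_compress_lines is hypothetical and not part of mg; note that every ending has to be spelled out by hand in the pattern.

```sh
#!/bin/sh
# Hypothetical helper (not part of mg): count input lines that contain
# "compress" or one of its explicitly enumerated endings as a whole word.
# With no file arguments, grep reads standard input.
count_compress_lines() {
    grep -E -c '(^|[^[:alnum:]])compress(ed|es|ing|ion)?([^[:alnum:]]|$)' "$@"
}
```

For example, feeding it the four lines "compress", "compressor", "compression here" and "nothing" should report 2, because "compressor" is not one of the enumerated forms.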
93	.PP
94	Most computer users have been faced with the
95	problem of finding information from a large
96	collection of files, such as electronic mail,
97	on-line documentation, or source code. Inverted
98	indices provide a highly effective solution to
99	this problem.
100	.PP
101	The
102	.B mg
103	software is described in Appendix A of the book
104	.RS
105	.nf
106	Ian H. Witten, Alistair Moffat, and Timothy C. Bell
107	.I "Managing Gigabytes: Compressing and Indexing Documents and Images"
108	Van Nostrand Reinhold
109	1994
110	xiv + 429 pages
111	US$54.95
112	ISBN 0-442-01863-0
113	Library of Congress catalog number TA1637 .W58 1994
114	.fi
115	.RE
116	.PP
117	Many of the algorithms implemented in
118	.B mg
119	represent significant advances over previous work,
120	both in speed, and in storage requirements. On a
121	fast workstation, in tens of minutes, or a few
122	hours,
123	.BR mgbuild (1)
124	can create an index to
125	.I all
126	of the words in hundreds of thousands of documents
127	occupying hundreds of megabytes, or even more than
128	a gigabyte, of disk space.
129	.BR mgquery (1)
130	can then be used to answer complex queries, with
131	responses often returned in a second or less.
132	.B mg
133	also contains algorithms to deal with images, so
134	that with a small amount of descriptive text for
135	each image, it is possible to do searches in
136	collections of images, and to have retrievals
137	display the images using a viewer like
138	.BR xv (1).
139	.PP
140	.B mg
141	can deal with compressed text and image files and,
142	surprisingly, it usually runs faster than it would
143	if the files were not compressed! Thus, the
144	considerable disk space savings possible from
145	file compression are not lost because of the need
146	for fast document search and retrieval.
147	.PP
148	The Free Software Foundation GNU Project
149	compression utilities
150	.BR gzip (1)
151	and
152	.BR gunzip (1)
153	are recommended for general use over older
154	alternatives, like
155	.BR compress (1),
156	because of their speed and high compression
157	ratios.
158	.\"=====================================================================
159	.SH AVAILABILITY
160	The
161	.B mg
162	software can be obtained via anonymous ftp to the
163	Australian archive host
164	.I "munnari.oz.au [128.250.1.21]"
165	from the directory
166	.IR "/pub/mg" .
167	.\"=====================================================================
168	.SH "TYPICAL USE OF mg"
169	Although
170	.B mg
171	consists of more than 20 separate programs, many
172	of which have complicated command-line options,
173	take heart: most users require only two or three
174	of these programs, and nothing more than a
175	document name on the command line.
176	.PP
177	A
178	.I document
179	for
180	.B mg
181	is a fragment of text suitable for retrieval as a
182	unit when it is found to contain a requested word,
183	or words. In a collection of poetry, a document
184	might be a stanza, while in a novel, it could be a
185	paragraph. In an index of first lines of poems, a
186	document would likely be just a single line.
187	.PP
188	Just what constitutes a document is decided by a
189	user-modifiable UNIX shell script,
190	.BR mg_get (1).
191	The default script provided with the
192	.B mg
193	source distribution knows about these named
194	document collections:
195	.TP \w'mailfiles'u+2n
196	.I alice
197	Lewis Carroll's
198	.I "Alice in Wonderland"
199	book.
200	.TP
201	.I allfiles
202	all mail files in the directory tree
203	.IR $HOME/Mail ,
204	including all of its nested subdirectories.
205	.TP
206	.I mailfiles
207	individual mail messages in
208	.IR $HOME/mbox
209	and
210	.IR $HOME/.sentmail .
211	.TP
212	.I davinci
213	A small collection of text and images from the
214	work of Leonardo da Vinci.
215	.PP
216	A document collection name is used by
217	.B mg
218	as a
219	.BR csh (1)
220	.I case
221	statement selector, and as a subdirectory name in the
222	.B $MGDATA
223	directory, or the current directory, if
224	.B MGDATA
225	is not defined.
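.PP
The directory rule just described can be sketched in Bourne shell as follows. The function name collection_dir is hypothetical; it only mirrors the wording above (use MGDATA when it is defined, otherwise the current directory) and is not taken from the mg scripts themselves.

```sh
#!/bin/sh
# Hypothetical helper (not part of mg): print the directory that would
# hold the index for the collection named in $1.
collection_dir() {
    # ${MGDATA-.} expands to "." only when MGDATA is not defined at all.
    printf '%s/%s\n' "${MGDATA-.}" "$1"
}
```

With MGDATA set to /tmp/mgdata, the call "collection_dir alice" prints /tmp/mgdata/alice; with MGDATA unset it prints ./alice.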
226	.\"=====================================================================
227	.SH "EXTENDING mg_get"
228	This section describes how to extend
229	.BR mg_get (1)
230	to handle a new document collection.
231	.PP
232	Let us take two examples: all \*(Bi
233	.I .bib
234	files, and all \*(Te files, contained in
235	subdirectories under the login directory.
236	.PP
237	For \*(Bi, each bibliographic entry will be
238	considered a separate document. In order to
239	facilitate easy identification of entries, we
240	shall require them to begin at the start of a
241	line; the
242	.BR bibclean (1)
243	utility can be used to standardize the format of
244	.I .bib
245	files, and to validate their string values, so
246	that this requirement is met.
247	.PP
248	For \*(Te, each paragraph will be a separate
249	document, and we assume that paragraphs are
250	separated by blank lines. We assume that files
251	with extensions
252	.IR .atx ,
253	.IR .ltx ,
254	.IR .stx ,
255	and
256	.I .tex
257	contain input to common \*(Te macro package
258	variants.
259	.PP
260	Make a personal copy of the
261	.I mg_get
262	script, using the one in the
263	.B mg
264	source distribution
265	.RI ( mg-1.0/mg/mg_get.sh ),
266	or the one in the local binary program directory,
267	at many sites called
268	.IR /usr/local/bin/mg_get .
269	.PP
270	Examination of the
271	.I mg_get
272	script shows that each document collection name is
273	used in a
274	.BR csh (1)
275	.I case
276	statement selector, and that most of the work is done
277	by very simple
278	.BR awk (1)
279	programs that extract documents from files. In
280	your private copy of the
281	.I mg_get
282	file, after the line
283	.PP
284	.nf
285	breaksw #davinci
286	.fi
287	.PP
288	and before the line
289	.PP
290	.nf
291	default:
292	.fi
293	.PP
294	insert this new code:
295	.PP
296	.nf
297	case bibfiles:
298	# Takes a list of files that contain BibTeX entries, and splits them up
299	# by putting ^B after each entry. Assumes that each entry
300	# begins with a line '^@'.
301	switch ($flag)
302	case '-init':
303	breaksw
304
305	case '-text':
306	find $HOME -name '*.bib' -print | \e
307	sort | \e
308	xargs -l100 awk \e
309	'/^@/&&NR!=1{print "^B"} {print $0} END{print "^B"}'
310	breaksw #-text
311
312	case '-cleanup':
313	breaksw #-cleanup
314
315	endsw #flag
316	breaksw #bibfiles
317
318	case texfiles:
319	# Takes a list of TeX files and splits them up
320	# by putting ^B after each paragraph. Assumes that
321	# paragraphs are separated by blank lines.
322	switch ($flag)
323	case '-init':
324	breaksw
325
326	case '-text':
327	find $HOME -name '*x' -print | \e
328	egrep '[.]tex$|[.]ltx$|[.]atx$|[.]stx$' | \e
329	sort | \e
330	xargs -l100 nawk ' /^ *$/ {if (b!=1) printf "^B";b=1} \e
331	\e!/^ *$/ {print;b=0} \e
332	END {printf "^B"}'
333	breaksw #-text
334
335	case '-cleanup':
336	breaksw #-cleanup
337
338	endsw #flag
339	breaksw #texfiles
340	.fi
341	.PP
342	The ^B characters here are Control-B characters,
343	.I not
344	caret-B pairs.
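.PP
Because a literal Control-B is easy to lose when copying text, the separator can also be written as an octal escape. The sketch below is a Bourne shell restatement of the bibfiles awk program under that assumption; the names split_bib and count_docs are hypothetical and do not appear in mg_get.

```sh
#!/bin/sh
# Hypothetical helper (not part of mg): emit a Control-B byte (octal 002)
# on its own line after each BibTeX entry. An entry is assumed to start
# with "@" in column one. With no file arguments, awk reads standard input.
split_bib() {
    awk '/^@/ && NR != 1 { printf "\002\n" }
         { print }
         END { printf "\002\n" }' "$@"
}

# Count the Control-B separators produced for the text on standard input.
count_docs() {
    split_bib | tr -cd '\002' | wc -c | tr -d ' '
}
```

For an input holding two entries, count_docs reports 2: one separator before the second entry and one at the end of the input.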
345	.PP
346	If you have a large number of \*(Bi or \*(Te
347	files, it is likely that a list of them would be
348	too long for the UNIX shell to hold in a single
349	variable, or on a single command line. Thus,
350	instead of storing the output of
351	.BR find (1)
352	in a variable, we proceed more cautiously, and
353	employ it to produce a list of the required files,
354	then pipe them to
355	.BR xargs (1),
356	which in turn passes up to 100 filenames at a time
357	to
358	.BR nawk (1)
359	for document selection.
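.PP
The batching behaviour is easy to observe on a small scale. The sketch below uses the -n option of xargs, which limits the number of arguments per invocation, as a stand-in for the -l100 used in the script; the two behave alike only under the assumption that each file name arrives on its own line and contains no blanks. The name batch_demo is hypothetical.

```sh
#!/bin/sh
# Hypothetical demonstration (not part of mg): read names from standard
# input and run echo once per batch of at most $1 names.
batch_demo() {
    xargs -n "$1" echo
}
```

Running "printf '%s\n' a b c d e | batch_demo 2" prints three lines, "a b", "c d" and "e", one per invocation of echo. In the script, each such batch becomes one run of the awk program.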
360	.PP
361	Install this modified
362	.I mg_get
363	script in your private directory for executable
364	programs (e.g.
365	.IR $HOME/bin ),
366	create a directory
367	.I $HOME/mgdata
368	to hold the index, issue a
369	.B rehash
370	command if you are using
371	.BR csh (1)
372	or
373	.BR tcsh (1),
374	ensure that
375	.I mg_get
376	occurs in your search path
377	.I before
378	any system-wide one (the command
379	.BI which " mg_get"
380	will tell you which version will be selected),
381	then create the inverted indexes by
382	.BI mgbuild " bibfiles"
383	and
384	.B mgbuild
385	.IR texfiles .
386	These commands may take several minutes to run if
387	you have a lot of \*(Bi or \*(Te files, or a large
388	home directory tree. Once they are complete, you
389	can then query the index with the commands
390	.BI mgquery " bibfiles"
391	and
392	.B mgquery
393	.IR texfiles .
394	These should respond very rapidly.
395	.PP
396	In order to keep your index up-to-date, you should
397	arrange for it to be recreated automatically and
398	regularly, probably every night. You can do this
399	with
400	.BR cron (1).
401	Use the command
402	.B "crontab \-e"
403	to edit your
404	.I crontab
405	file and add two lines like this:
406	.PP
407	.nf
408	00 04 * * * mgbuild bibfiles >$HOME/mgdata/bibfiles.log 2>&1
409	15 04 * * * mgbuild texfiles >$HOME/mgdata/texfiles.log 2>&1
410	.fi
411	.PP
412	Save the file and exit the editor. Now, every
413	night at 4am and 4:15am,
414	.BR mgbuild (1)
415	will reconstruct your inverted indexes, and the
416	results of the builds will be saved in log files
417	in your
418	.I $HOME/mgdata
419	directory.
420	.\"=====================================================================
421	.SH "SEE ALSO"
422	.na
423	.BR awk (1),
424	.BR bibclean (1),
425	.BR bibtex (1),
426	.BR compress (1),
427	.BR csh (1),
428	.BR grep (1),
429	.BR gunzip (1),
430	.BR gzip (1),
431	.BR mg_compression_dict (1),
432	.BR mg_fast_comp_dict (1),
433	.BR mg_get (1),
434	.BR mg_invf_dict (1),
435	.BR mg_invf_dump (1),
436	.BR mg_invf_rebuild (1),
437	.BR mg_passes (1),
438	.BR mg_perf_hash_build (1),
439	.BR mg_text_estimate (1),
440	.BR mg_weights_build (1),
441	.BR mgbilevel (1),
442	.BR mgbuild (1),
443	.BR mgdictlist (1),
444	.BR mgfelics (1),
445	.BR mgquery (1),
446	.BR mgstat (1),
447	.BR mgtic (1),
448	.BR mgticbuild (1),
449	.BR mgticdump (1),
450	.BR mgticprune (1),
451	.BR mgticstat (1),
452	.BR nawk (1),
453	.BR tcsh (1),
454	.BR tex (1),
455	.BR xargs (1),
456	.BR xmg (1),
457	.BR xv (1).
458	.\"=====================================================================
459	.\" This is for GNU Emacs file-specific customization:
460	.\" Local Variables:
461	.\" fill-column: 50
462	.\" End:
