mg.1@ 16583

Last change on this file since 16583 was 16583, checked in by davidb, 16 years ago
Undoing change commited in r16582
Property svn:executable set to `*`; property svn:keywords set to `Author Date Id Revision`
File size: 11.4 KB

Rev	Line
[3745]	1	.\"------------------------------------------------------------
	2	.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
	3	.de Id
	4	.ds Rv \\$3
	5	.ds Dt \\$4
	6	..
	7	.Id $Id: mg.1 16583 2008-07-29 10:20:36Z davidb $
	8	.\"------------------------------------------------------------
	9	.TH mg 1 \*(Dt CITRI
	10	.\"=====================================================================
	11	.\" Author:
	12	.\" Nelson H. F. Beebe
	13	.\" Center for Scientific Computing
	14	.\" Department of Mathematics
	15	.\" University of Utah
	16	.\" Salt Lake City, UT 84112
	17	.\" USA
	18	.\" Email: [email protected] (Internet)
	19	.\"=====================================================================
	20	.if t .ds Bi B\s-2IB\s+2T\\h'-0.1667m'\\v'0.20v'E\\v'-0.20v'\\h'-0.125m'X
	21	.if n .ds Bi BibTeX
	22	.if t .ds Te T\\h'-0.1667m'\\v'0.20v'E\\v'-0.20v'\\h'-0.125m'X
	23	.if n .ds Te TeX
	24	.\"=====================================================================
	25	.SH NAME
	26	mg \- full-text inverted index support
	27	.\"=====================================================================
	28	.SH DESCRIPTION
	29	.B mg
	30	is a suite of programs that can be used to create
	31	and query full-text inverted indexes for a
	32	collection of documents.
	33	.PP
	34	An inverted index is essentially a list of
	35	pointers to occurrences of every word in a
	36	collection of documents, so that, for example, in
	37	a collection of Shakespeare's works, one can pose
	38	questions like
	39	.I "How often is a fool mentioned in the plays?"
	40	and
	41	.I "Are Romeo and Juliet ever mentioned in the same sentence?"
	42	While search utilities like the UNIX
	43	.BR grep (1)
	44	family can be used for simple queries, they suffer
	45	from several limitations:
	46	.TP \w'\(bu'u+2n
	47	\(bu
	48	Each invocation requires a separate pass over the
	49	documents, which becomes impossibly slow if the
	50	document collection is very large.
	51	.TP
	52	\(bu
	53	It is difficult to compose Boolean queries like
	54	.IR "Caesar AND Brutus AND NOT Antony" .
	55	.TP
	56	\(bu
	57	They provide no mechanism for ranking matches
	58	according to importance, which is extremely
	59	important when queries are complex and numerous
	60	matches are found.
	61	.TP
	62	\(bu
	63	They provide no mechanism for displaying
	64	surrounding context (although the Free Software
	65	Foundation
	66	implementations for the GNU Project remedy this
	67	with their
	68	.BI \-A " num"
	69	(after)
	70	and
	71	.BI \-B " num"
	72	(before)
	73	switches).
	74	.TP
	75	\(bu
	76	They provide no mechanism for searching for
	77	occurrences in the same paragraph, unless
	78	preprocessing is done on the files to wrap
	79	paragraphs into long lines. Even that will often
	80	fail, because input buffer sizes may be limited to
	81	a few hundred characters.
	82	.TP
	83	\(bu
	84	They provide no easy way to deal with grammatical
	85	word-ending variants, except by explicit
	86	enumeration, such as searching for
	87	.IR compress ,
	88	.IR compressed ,
	89	.IR compresses ,
	90	.IR compressing ,
	91	and
	92	.IR compression .
	93	.PP
	94	Most computer users have been faced with the
	95	problem of finding information in a large
	96	collection of files, such as electronic mail,
	97	on-line documentation, or source code. Inverted
	98	indexes provide a highly effective solution to
	99	this problem.
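		.PP
		As a rough illustration of the idea only (it is
		not how
		.B mg
		stores its indexes), the following
		.BR awk (1)
		sketch treats each line of a hypothetical file
		.I docs.txt
		as one document and lists, for every word, the
		numbers of the documents that contain it. The
		file names
		.I docs.txt
		and
		.I invert.awk
		are assumptions made for this example.
		.PP
		.nf
		% cat docs.txt
		to be or not to be
		brutus is an honourable man
		not a man but a fool
		% cat invert.awk
		# Record each (word, document) pair once only.
		{ for (i = 1; i <= NF; i++)
		    if (seen[$i, NR]++ == 0) ix[$i] = ix[$i] " " NR }
		# Print each word with its list of documents.
		END { for (w in ix) print w ":" ix[w] }
		% awk -f invert.awk docs.txt
		.fi
		.PP
		Among the output lines, which appear in no
		particular order, are
		.I "man: 2 3"
		and
		.IR "not: 1 3" ,
		so a query for a word need only consult its list
		instead of rescanning every document.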
	100	.PP
	101	The
	102	.B mg
	103	software is described in Appendix A of the book
	104	.RS
	105	.nf
	106	Ian H. Witten, Alistair Moffat, and Timothy C. Bell
	107	.I "Managing Gigabytes: Compressing and Indexing Documents and Images"
	108	Van Nostrand Reinhold
	109	1994
	110	xiv + 429 pages
	111	US$54.95
	112	ISBN 0-442-01863-0
	113	Library of Congress catalog number TA1637 .W58 1994
	114	.fi
	115	.RE
	116	.PP
	117	Many of the algorithms implemented in
	118	.B mg
	119	represent significant advances over previous work,
	120	both in speed, and in storage requirements. On a
	121	fast workstation, in tens of minutes, or a few
	122	hours,
	123	.BR mgbuild (1)
	124	can create an index to
	125	.I all
	126	of the words in hundreds of thousands of documents
	127	occupying hundreds of megabytes, or even more than
	128	a gigabyte, of disk space.
	129	.BR mgquery (1)
	130	can then be used to answer complex queries, with
	131	responses often returned in a second or less.
	132	.B mg
	133	also contains algorithms to deal with images, so
	134	that with a small amount of descriptive text for
	135	each image, it is possible to do searches in
	136	collections of images, and to have retrievals
	137	display the images using a viewer like
	138	.BR xv (1).
	139	.PP
	140	.B mg
	141	can deal with compressed text and image files and,
	142	surprisingly, it usually runs faster than it would
	143	if the files were not compressed! Thus, the
	144	considerable disk space savings possible from
	145	file compression are not lost because of the need
	146	for fast document search and retrieval.
	147	.PP
	148	The Free Software Foundation GNU Project
	149	compression utilities
	150	.BR gzip (1)
	151	and
	152	.BR gunzip (1)
	153	are recommended for general use over older
	154	alternatives, like
	155	.BR compress (1),
	156	because of their speed and high compression
	157	ratios.
	158	.\"=====================================================================
	159	.SH AVAILABILITY
	160	The
	161	.B mg
	162	software can be obtained via anonymous ftp to the
	163	Australian archive host
	164	.I "munnari.oz.au [128.250.1.21]"
	165	from the directory
	166	.IR "/pub/mg" .
	167	.\"=====================================================================
	168	.SH "TYPICAL USE OF mg"
	169	Although
	170	.B mg
	171	consists of more than 20 separate programs, many
	172	of which have complicated command-line options,
	173	take heart: most users require only two or three
	174	of these programs, and nothing more than a
	175	document name on the command line.
	176	.PP
	177	A
	178	.I document
	179	for
	180	.B mg
	181	is a fragment of text suitable for retrieval as a
	182	unit when it is found to contain a requested word,
	183	or words. In a collection of poetry, a document
	184	might be a stanza, while in a novel, it could be a
	185	paragraph. In an index of first lines of poems, a
	186	document would likely be just a single line.
	187	.PP
	188	Just what constitutes a document is decided by a
	189	user-modifiable UNIX shell script,
	190	.BR mg_get (1).
	191	The default script provided with the
	192	.B mg
	193	source distribution knows about these named
	194	document collections:
	195	.TP \w'mailfiles'u+2n
	196	.I alice
	197	Lewis Carroll's
	198	.I "Alice in Wonderland"
	199	book.
	200	.TP
	201	.I allfiles
	202	All mail files in the directory tree
	203	.IR $HOME/Mail ,
	204	including all of its nested subdirectories.
	205	.TP
	206	.I mailfiles
	207	Individual mail messages in
	208	.IR $HOME/mbox
	209	and
	210	.IR $HOME/.sentmail .
	211	.TP
	212	.I davinci
	213	A small collection of text and images from the
	214	work of Leonardo da Vinci.
	215	.PP
	216	A document collection name is used by
	217	.B mg
	218	as a
	219	.BR csh (1)
	220	.I case
	221	statement selector, and as a subdirectory name in the
	222	.B $MGDATA
	223	directory, or the current directory, if
	224	.B MGDATA
	225	is not defined.
	226	.\"=====================================================================
	227	.SH "EXTENDING mg_get"
	228	This section describes how to extend
	229	.BR mg_get (1)
	230	to handle a new document collection.
	231	.PP
	232	Let us take two examples: all \*(Bi
	233	.I .bib
	234	files, and all \*(Te files, contained in
	235	subdirectories under the login directory.
	236	.PP
	237	For \*(Bi, each bibliographic entry will be
	238	considered a separate document. In order to
	239	facilitate easy identification of entries, we
	240	shall require them to begin at the start of a
	241	line; the
	242	.BR bibclean (1)
	243	utility can be used to standardize the format of
	244	.I .bib
	245	files, and to validate their string values, so
	246	that this requirement is met.
	247	.PP
	248	For \*(Te, each paragraph will be a separate
	249	document, and we assume that paragraphs are
	250	separated by blank lines. We assume that files
	251	with extensions
	252	.IR .atx ,
	253	.IR .ltx ,
	254	.IR .stx ,
	255	and
	256	.I .tex
	257	contain input to common \*(Te macro package
	258	variants.
	259	.PP
	260	Make a personal copy of the
	261	.I mg_get
	262	script, using the one in the
	263	.B mg
	264	source distribution
	265	.RI ( mg-1.0/mg/mg_get.sh ),
	266	or the one in the local binary program directory,
	267	at many sites called
	268	.IR /usr/local/bin/mg_get .
	269	.PP
	270	Examination of the
	271	.I mg_get
	272	script shows that each document collection name is
	273	used in a
	274	.BR csh (1)
	275	.I case
	276	statement selector, and that most of the work is done
	277	by very simple
	278	.BR awk (1)
	279	programs that extract documents from files. In
	280	your private copy of the
	281	.I mg_get
	282	file, after the line
	283	.PP
	284	.nf
	285	breaksw #davinci
	286	.fi
	287	.PP
	288	and before the line
	289	.PP
	290	.nf
	291	default:
	292	.fi
	293	.PP
	294	insert this new code:
	295	.PP
	296	.nf
	297	case bibfiles:
	298	# Takes a list of files that contain BibTeX entries, and splits them up
	299	# by putting ^B after each entry. Assumes that each entry
	300	# begins with a line '^@'.
	301	switch ($flag)
	302	case '-init':
	303	breaksw
	304
	305	case '-text':
	306	find $HOME -name '*.bib' -print | \e
	307	sort | \e
	308	xargs -l100 awk \e
	309	'/^@/&&NR!=1{print "^B"} {print $0} END{print "^B"}'
	310	breaksw #-text
	311
	312	case '-cleanup':
	313	breaksw #-cleanup
	314
	315	endsw #flag
	316	breaksw #bibfiles
	317
	318	case texfiles:
	319	# Takes a list of TeX files and splits them up
	320	# by putting ^B after each paragraph. Assumes that
	321	# paragraphs are separated by blank lines.
	322	switch ($flag)
	323	case '-init':
	324	breaksw
	325
	326	case '-text':
	327	find $HOME -name '*x' -print | \e
	328	egrep '[.]tex$|[.]ltx$|[.]atx$|[.]stx$' | \e
	329	sort | \e
	330	xargs -l100 nawk ' /^ *$/ {if (b!=1) printf "^B";b=1} \e
	331	\e!/^ *$/ {print;b=0} \e
	332	END {printf "^B"}'
	333	breaksw #-text
	334
	335	case '-cleanup':
	336	breaksw #-cleanup
	337
	338	endsw #flag
	339	breaksw #texfiles
	340	.fi
	341	.PP
	342	The ^B characters here are Control-B characters,
	343	.I not
	344	caret-B pairs.
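		.PP
		To see the effect of the
		.I bibfiles
		splitter, it can be run by hand on a small file.
		In this sketch, the file
		.I sample.bib
		and its two entries are hypothetical, and ^B again
		stands for a single Control-B character, both in
		the program and in its output:
		.PP
		.nf
		% cat sample.bib
		@Book{one,
		  title = "First"}
		@Book{two,
		  title = "Second"}
		% awk '/^@/&&NR!=1{print "^B"} {print $0} END{print "^B"}' sample.bib
		@Book{one,
		  title = "First"}
		^B
		@Book{two,
		  title = "Second"}
		^B
		.fi
		.PP
		Each entry is followed by a line containing only
		Control-B, as the comment in the script describes.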
	345	.PP
	346	If you have a large number of \*(Bi or \*(Te
	347	files, it is likely that a list of them would be
	348	too long for the UNIX shell to hold in a single
	349	variable, or on a single command line. Thus,
	350	instead of storing the output of
	351	.BR find (1)
	352	in a variable, we proceed more cautiously, and
	353	employ it to produce a list of the required files,
	354	then pipe them to
	355	.BR xargs (1),
	356	which in turn passes up to 100 filenames at a time
	357	to
	358	.BR nawk (1)
	359	for document selection.
	360	.PP
	361	Install this modified
	362	.I mg_get
	363	script in your private directory for executable
	364	programs (e.g.
	365	.IR $HOME/bin ),
	366	create a directory
	367	.I $HOME/mgdata
	368	to hold the index, issue a
	369	.B rehash
	370	command if you are using
	371	.BR csh (1)
	372	or
	373	.BR tcsh (1),
	374	ensure that
	375	.I mg_get
	376	occurs in your search path
	377	.I before
	378	any system-wide one (the command
	379	.BI which " mg_get"
	380	will tell you which version will be selected),
	381	then create the inverted indexes by
	382	.BI mgbuild " bibfiles"
	383	and
	384	.B mgbuild
	385	.IR texfiles .
	386	These commands may take several minutes to run if
	387	you have a lot of \*(Bi or \*(Te files, or a large
	388	home directory tree. Once they are complete, you
	389	can then query the index with the commands
	390	.BI mgquery " bibfiles"
	391	and
	392	.B mgquery
	393	.IR texfiles .
	394	These should respond very rapidly.
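		.PP
		Put together, and assuming that
		.B MGDATA
		should point at the newly created
		.I $HOME/mgdata
		directory (which the steps above imply but do not
		state), a
		.BR csh (1)
		session might look like this sketch:
		.PP
		.nf
		% mkdir $HOME/mgdata
		% setenv MGDATA $HOME/mgdata
		% rehash
		% which mg_get
		% mgbuild bibfiles
		% mgbuild texfiles
		% mgquery bibfiles
		.fi
		.PP
		The path printed by
		.B which
		should be that of your private copy of
		.IR mg_get .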
	395	.PP
	396	In order to keep your index up-to-date, you should
	397	arrange for it to be recreated automatically and
	398	regularly, probably every night. You can do this
	399	with
	400	.BR cron (1).
	401	Use the command
	402	.B "crontab \-e"
	403	to edit your
	404	.I crontab
	405	file and add two lines like this:
	406	.PP
	407	.nf
	408	00 04 * * * mgbuild bibfiles >$HOME/mgdata/bibfiles.log 2>&1
	409	15 04 * * * mgbuild texfiles >$HOME/mgdata/texfiles.log 2>&1
	410	.fi
	411	.PP
	412	Save the file and exit the editor. Now, every
	413	night at 4am and 4:15am,
	414	.BR mgbuild (1)
	415	will reconstruct your inverted indexes, and the
	416	results of the builds will be saved in log files
	417	in your
	418	.I $HOME/mgdata
	419	directory.
	420	.\"=====================================================================
	421	.SH "SEE ALSO"
	422	.na
	423	.BR awk (1),
	424	.BR bibclean (1),
	425	.BR bibtex (1),
	426	.BR compress (1),
	427	.BR csh (1),
	428	.BR grep (1),
	429	.BR gunzip (1),
	430	.BR gzip (1),
	431	.BR mg_compression_dict (1),
	432	.BR mg_fast_comp_dict (1),
	433	.BR mg_get (1),
	434	.BR mg_invf_dict (1),
	435	.BR mg_invf_dump (1),
	436	.BR mg_invf_rebuild (1),
	437	.BR mg_passes (1),
	438	.BR mg_perf_hash_build (1),
	439	.BR mg_text_estimate (1),
	440	.BR mg_weights_build (1),
	441	.BR mgbilevel (1),
	442	.BR mgbuild (1),
	443	.BR mgdictlist (1),
	444	.BR mgfelics (1),
	445	.BR mgquery (1),
	446	.BR mgstat (1),
	447	.BR mgtic (1),
	448	.BR mgticbuild (1),
	449	.BR mgticdump (1),
	450	.BR mgticprune (1),
	451	.BR mgticstat (1),
	452	.BR nawk (1),
	453	.BR tcsh (1),
	454	.BR tex (1),
	455	.BR xargs (1),
	456	.BR xmg (1),
	457	.BR xv (1).
	458	.\"=====================================================================
	459	.\" This is for GNU Emacs file-specific customization:
	460	.\" Local Variables:
	461	.\" fill-column: 50
	462	.\" End:
