mgintro.1@ 16583

Last change on this file since 16583 was 16583, checked in by davidb, 16 years ago
Undoing change commited in r16582
Property svn:executable set to `*`
Property svn:keywords set to `Author Date Id Revision`
File size: 5.8 KB

.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.Id $Id: mgintro.1 16583 2008-07-29 10:20:36Z davidb $
.\"------------------------------------------------------------
.ds r \&\s-1MG\s0
.if n .ds - \%--
.if t .ds - \(em
.\"------------------------------------------------------------
.am SS
.LP
..
.\"------------------------------------------------------------
.TH MGINTRO 1 \*(Dt CITRI
.\"--------------------------------------------------------------
.SH NAME
mgintro \- introduction to the MG system
.\"--------------------------------------------------------------
.SH DESCRIPTION
The MG (Managing Gigabytes) system is a collection of
programs that together form a full-text retrieval system.
A full-text retrieval system allows one to create a
database from a given set of documents and then query
it to retrieve any relevant documents.
It is "full-text" in the sense that every word in the
text is indexed, and a query operates only on this index
to do the searching.
.PP
For example, one could build a database from the book
"Alice in Wonderland", treating each paragraph in the
book as a document.
Having built the "Alice" database, one could issue queries
such as "cat alice grin" and retrieve any paragraphs that
match the query. The matching can be either boolean, in which
the retrieved paragraphs satisfy a boolean expression of
the query terms, e.g. "cat alice grin"; or ranked, in which
the documents most relevant to the query are returned
in relevance order, using some standard heuristic measure.
.\"--------------------------------------------------------------
.SS Motivation
If one wants to find some particular information that
is stored in computer text files, there are a few alternative
courses of action. One can operate directly on the text files
with utilities such as grep, or one can process the text files into
some form of database. Grep is generally limited to identifying
lines by matching regular expressions. If the collection
of files that grep operates on becomes large, then making a
pass over the entire text for each query becomes expensive.
However, grep is simple to use, as no auxiliary files need be created.
.PP
A database consists of some data and indexes into that data. With
indexes, one can query a large database quickly. Standard databases
divide the data into records made up of fields, which means that the
granularity of search is a field. In a full-text system, such as \*r,
there are no fields
(or there is an arbitrarily sized list of word fields per document)
and instead every word is indexed.
Using this method, we can accept free-form information and still search quickly.
The next question is what overhead this database incurs.
In \*r most of the files produced are in compressed form. The
two notable compressed files are the given data and the index, called
an "inverted file". By compressing the files it is possible to make the
database smaller than the source data.
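.PP
As a rough illustration only, the idea of an inverted file can be sketched in a
few lines of Python. This is not how \*r is implemented: the sketch does no
compression or stemming, and the names build_inverted_file and boolean_and
are hypothetical.
.nf
```python
from collections import defaultdict


def build_inverted_file(documents):
    # Map each lower-cased word to the sorted list of document numbers
    # in which it occurs (a toy, uncompressed "inverted file").
    index = defaultdict(set)
    for doc_num, text in enumerate(documents):
        for word in text.lower().split():
            index[word.strip(".,;:!?()")].add(doc_num)
    index.pop("", None)  # discard tokens that were pure punctuation
    return {word: sorted(nums) for word, nums in index.items()}


def boolean_and(index, query):
    # Return the numbers of the documents that contain every query term.
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []
```
.fi
.PP
A query consults only the index, never the document text, which is why
lookups stay fast as the collection grows.
\*r additionally compresses both the text and the inverted file, which
this sketch does not attempt.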
.\"--------------------------------------------------------------
.SS Typical Usage
The most common use for \*r
has been as a search database on Unix mail files.
However, any set of text data can be used; one just needs to determine
what constitutes a document (see
.BR mgintro++ (1)).
\*r has also been used on large collections such
as Comact (Commonwealth Acts of Australia), which is around 132 megabytes,
and on collections of up to around 2 gigabytes, such as TREC
(a mixture of collections such as the Wall Street Journal
and Associated Press).
.\"--------------------------------------------------------------
.SS Getting Started with \*r
The first thing to do is install the package;
please follow the INSTALL instructions.
Having done this, it is necessary to set a couple of environment variables.
MGDATA should be set to a directory that will hold a subdirectory for
each database that you build. For example:
.IP
.B mkdir ~/mgdata; setenv MGDATA ~/mgdata
.LP
If you want to try building some sample databases, there is
some sample data available, such as the "Alice In Wonderland" book. To make sure
this is accessible you should set the environment variable MGSAMPLE.
For example:
.IP
.B setenv MGSAMPLE ~/mg/SampleData
.LP
Here, "~/mg/SampleData" should contain the file alice.z.
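.PP
The setenv commands above use csh syntax. For users of sh-compatible
shells, an equivalent might look like the following sketch; it assumes the
same example paths as above, which may differ on your system.
.nf
```shell
# sh-compatible equivalent of the csh commands above (example paths only).
mkdir -p "$HOME/mgdata"
export MGDATA="$HOME/mgdata"
export MGSAMPLE="$HOME/mg/SampleData"
```
.fi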
.PP
To build the Alice database (which will be stored in the $MGDATA/alice subdirectory),
type the command
.IP
.B mgbuild alice
.LP
If all went well and status messages
were printed indicating that the build completed, type
.IP
.B mgquery alice
.LP
to query the database.
Type a few words at the prompt and press return;
some relevant documents (paragraphs of Alice) should be retrieved.
Type ".set query ranked" to run ranked queries. Please refer to the
.BR mgquery (1)
man page for more information on its commands and options.
.PP
The next thing to do is to use \*r
on a more personal database. If you have
your mail stored in subdirectories of ~/Mail, as is the case with the
typical setup of
.BR elm (1),
then type
.IP
.B mgbuild allfiles
.LP
If, however, you keep all your mail in ~/mbox or ~/sentmail, then type
.IP
.B mgbuild mailfiles
.LP
.\"--------------------------------------------------------------
.SH AVAILABILITY
The \*r software for SunOS 4, Solaris, HP-UX, and MIPS
can be obtained by ftp from munnari.oz.au [128.250.1.21], in the directory
/pub/mg.
.\"--------------------------------------------------------------
.SH SEE ALSO
.na
.BR mgintro++ (1),
.BR mgbuild (1),
.BR mgquery (1)
.br
"Guide To The \*r System", in Appendix A of the book:
.PP
.RS
.nf
Ian H. Witten, Alistair Moffat, and Timothy C. Bell
.I "Managing Gigabytes: Compressing and Indexing Documents and Images"
Van Nostrand Reinhold
1994
xiv + 429 pages
US$54.95
ISBN 0-442-01863-0
Library of Congress catalog number TA1637 .W58 1994.
.fi
.RE
