source: trunk/indexers/mg/docs/mgintro.1@ 3745

Last change on this file since 3745 was 3745, checked in by mdewsnip, 21 years ago

Addition of MG package for search and retrieval

.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.Id $Id: mgintro.1 3745 2003-02-20 21:20:24Z mdewsnip $
.\"------------------------------------------------------------
.ds r \&\s-1MG\s0
.if n .ds - \%--
.if t .ds - \(em
.\"------------------------------------------------------------
.am SS
.LP
..
.\"------------------------------------------------------------
.TH MGINTRO 1 \*(Dt CITRI
.\"--------------------------------------------------------------
.SH NAME
mgintro \- introduction to the MG system
.\"--------------------------------------------------------------
.SH DESCRIPTION
The MG (Managing Gigabytes) system is a collection of
programs which together form a full-text retrieval system.
A full-text retrieval system allows one to create a
database from a given set of documents and then run queries
against it to retrieve the relevant documents.
It is "full-text" in the sense that every word in the
text is indexed, and a query searches only this index.
.PP
For example, one could build a database from the book
"Alice in Wonderland", treating each paragraph of the book
as a document.
Having built the "Alice" database, one could issue a query
such as "cat alice grin" and retrieve the paragraphs which
match it. The matching can be either boolean, in which case
the retrieved paragraphs satisfy a boolean expression over
the query terms, e.g. "cat alice grin"; or ranked, in which case
the documents most relevant to the query are returned
in order of relevance, using some standard heuristic measure.
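.PP
The difference between the two matching modes can be sketched in a few
lines of Python. This is only an illustration, not how \*r is implemented:
the function names are hypothetical, the boolean mode is assumed to require
every query term to be present, and the ranking score is a plain count of
query-term occurrences standing in for the unspecified heuristic measure.
.PP
.nf
```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and split it into alphabetic words.
    return re.findall(r"[a-z]+", text.lower())


def boolean_and(paragraphs, query):
    # Assumed boolean mode: keep paragraphs containing every query term.
    terms = set(tokenize(query))
    return [i for i, p in enumerate(paragraphs) if terms <= set(tokenize(p))]


def ranked(paragraphs, query):
    # Assumed ranking: score a paragraph by how often the query terms
    # occur in it, and return matching paragraph numbers, best first.
    terms = tokenize(query)
    scores = []
    for i, p in enumerate(paragraphs):
        counts = Counter(tokenize(p))
        score = sum(counts[t] for t in terms)
        if score > 0:
            scores.append((score, i))
    # Highest score first; ties are broken by paragraph number.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores]


paragraphs = [
    "Alice saw the cat grin",
    "The cat vanished and the cat came back",
    "Alice fell down the hole",
]
print(boolean_and(paragraphs, "cat alice grin"))
print(ranked(paragraphs, "cat alice grin"))
```
.fi
.PP
With these three sample paragraphs, the boolean query returns only
paragraph 0, while the ranked query returns all three in the order 0, 1, 2.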
.\"--------------------------------------------------------------
.SS Motivation
To find particular information stored in computer text files,
one has a few alternative courses of action. One can operate
directly on the text files with utilities such as grep,
or one can process the text files into
some form of database. Grep is generally limited to identifying
lines that match a regular expression. If the collection
of files which grep operates on becomes large, then making a
pass over the entire text for every query becomes expensive.
However, grep is simple to use, as no auxiliary files need to be created.
.PP
A database consists of some data and indexes into that data. With
indexes, one can query a large database quickly. Standard databases
divide the data into records made up of fields, so the granularity
of search is a field. In a full-text system, such as \*r,
there are no fields
(or there is an arbitrary sized list of word fields per document)
and instead every word is indexed.
Using this method, we can accept free-form information and yet be fast on searches.
The next question is how much overhead this database carries.
In \*r most of the files produced are in a compressed form. The
two notable compressed files are the given data and the index, called
an "inverted file". By compressing the files it is possible to make the
database smaller than the source data.
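.PP
The idea behind an inverted file can be illustrated with a short Python
sketch. It is a toy under stated assumptions: the names are hypothetical,
and the index is held uncompressed in memory, whereas \*r keeps its
inverted file compressed on disk.
.PP
.nf
```python
import re
from collections import defaultdict


def build_inverted_file(documents):
    # Map each word to the sorted list of document numbers containing it.
    postings = defaultdict(set)
    for doc_num, text in enumerate(documents):
        for word in re.findall(r"[a-z]+", text.lower()):
            postings[word].add(doc_num)
    return {word: sorted(nums) for word, nums in postings.items()}


def lookup(index, words):
    # Intersect the posting lists of the given words. Only the index is
    # consulted; the document text is never scanned at query time.
    result = None
    for word in words:
        nums = set(index.get(word, []))
        result = nums if result is None else result & nums
    return sorted(result or [])


documents = [
    "the cat sat",
    "alice and the cat",
    "the grin of the cat",
]
index = build_inverted_file(documents)
print(index["cat"])
print(lookup(index, ["cat", "grin"]))
```
.fi
.PP
Because a query touches only the posting lists of its own words, its cost
does not grow with a pass over the whole text, which is the advantage over
grep described above.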
.\"--------------------------------------------------------------
.SS Typical Usage
The most common use of \*r
has been as a search database over Unix mail files.
However, any set of text data can be used; one just needs to determine
what constitutes a document (see
.BR mgintro++ (1)).
\*r has also been used on large collections such
as Comact (Commonwealth Acts of Australia), which is around 132 megabytes,
and on collections of up to around 2 gigabytes, such as TREC
(a mixture of collections such as the Wall Street Journal
and Associated Press).
.\"--------------------------------------------------------------
.SS Getting Started with \*r
The first thing to do is install the package;
please follow the INSTALL instructions.
Having done this, it is necessary to set a couple of environment variables.
MGDATA should be set to a directory which will hold a subdirectory for
each database that you build. For example:
.IP
.B mkdir ~/mgdata; setenv MGDATA ~/mgdata
.LP
If you want to try building some sample databases, there is
some sample data available, such as the "Alice In Wonderland" book. To make sure
it is accessible, set the environment variable MGSAMPLE.
For example:
.IP
.B setenv MGSAMPLE ~/mg/SampleData
.LP
Here, "~/mg/SampleData" should contain "alice.z".
.PP
To build the Alice database (which will be stored in the $MGDATA/alice subdirectory),
type the command
.IP
.B mgbuild alice
.LP
If all went well, status messages will have been printed
indicating that the build completed. Then type
.IP
.B mgquery alice
.LP
to query the database.
Type a few words at the prompt and press return;
some relevant documents (paragraphs of Alice) should be retrieved.
Type ".set query ranked" to run ranked queries. Please refer to the
.BR mgquery (1)
man page for more information on its commands and options.
.PP
The next step is to use \*r
on a more personal database. If you have
your mail stored in subdirectories of ~/Mail, as is the case with the
typical setup of
.BR elm (1),
then type
.IP
.B mgbuild allfiles
.LP
If, however, you keep all your mail in ~/mbox or ~/sentmail, then type
.IP
.B mgbuild mailfiles
.LP
.\"--------------------------------------------------------------
.SH AVAILABILITY
The \*r software for SunOS 4, Solaris, HPUX, and MIPS
can be obtained by ftp from munnari.oz.au [128.250.1.21], in the directory
/pub/mg.
.\"--------------------------------------------------------------
.SH SEE ALSO
.na
.BR mgintro++ (1),
.BR mgbuild (1),
.BR mgquery (1)
.br
"Guide To The \*r System", in Appendix A of the book:
.PP
.RS
.nf
Ian H. Witten, Alistair Moffat, and Timothy C. Bell
.I "Managing Gigabytes: Compressing and Indexing Documents and Images"
Van Nostrand Reinhold
1994
xiv + 429 pages
US$54.95
ISBN 0-442-01863-0
Library of Congress catalog number TA1637 .W58 1994.
.fi
.RE