.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.Id $Id: mgintro.1 3745 2003-02-20 21:20:24Z mdewsnip $
.\"------------------------------------------------------------
.ds r \&\s-1MG\s0
.if n .ds - \%--
.if t .ds - \(em
.\"------------------------------------------------------------
.am SS
.LP
..
.\"------------------------------------------------------------
.TH MGINTRO 1 \*(Dt CITRI
.\"--------------------------------------------------------------
.SH NAME
mgintro \- introduction to the MG system
.\"--------------------------------------------------------------
.SH DESCRIPTION
The MG (Managing Gigabytes) system is a collection of
programs that together form a full-text retrieval system.
A full-text retrieval system lets one build a
database from a given set of documents and then query
it to retrieve the relevant documents.
It is "full-text" in the sense that every word in the
text is indexed, and a query is answered by searching this index
rather than by scanning the text itself.
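.PP
As a rough illustration of what "every word is indexed" means, the following shell sketch builds a toy inverted index with awk, treating each blank-line-separated paragraph of a text file as a document, and then answers a boolean AND query from that index alone.
It is not how MG stores or searches its index (the real inverted file is compressed); the function names build_index and match_all and the "word: numbers" output format are made up for this example.

```shell
# build_index FILE
# Toy inverted index: for every word, list the paragraph numbers it occurs in.
# Output format (an assumption of this sketch): "word: n1 n2 ..."
build_index() {
    awk 'BEGIN { RS = "" }                      # paragraph mode: one record per paragraph
    {
        n = split(tolower($0), w, /[^a-z0-9]+/)
        for (i = 1; i <= n; i++) {
            if (w[i] == "" || ((w[i], NR) in seen)) continue
            seen[w[i], NR] = 1                  # record each word once per paragraph
            docs[w[i]] = docs[w[i]] " " NR
        }
    }
    END { for (t in docs) print t ":" docs[t] }' "$1" | sort
}

# match_all INDEX TERM...
# Boolean AND: print the paragraph numbers that contain every TERM,
# reading only the index, never the original text.
match_all() {
    index_file=$1; shift
    awk -v terms="$*" '
        BEGIN {
            n = split(tolower(terms), want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in wanted)) { wanted[want[i]] = 1; k++ }
        }
        {
            word = substr($1, 1, length($1) - 1)    # strip the trailing ":"
            if (word in wanted)
                for (i = 2; i <= NF; i++) hits[$i]++
        }
        END { for (d in hits) if (hits[d] == k) print d }
    ' "$index_file" | sort -n
}

# Example use (file names are placeholders):
#   build_index book.txt > book.idx
#   match_all book.idx cat alice grin
```

The point of the sketch is that match_all never opens the original text, so the cost of a query depends on the index entries for the query terms rather than on the size of the collection.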
.PP
For example, one could build a database from the book
"Alice in Wonderland", treating each paragraph of the book
as a document.
Having built the "Alice" database, one could issue queries
such as "cat alice grin" and retrieve the paragraphs that
match the query.
The matching can be boolean, in which case the retrieved
paragraphs satisfy a boolean expression over the query terms
(for example, they contain all of "cat", "alice" and "grin"),
or ranked, in which case the documents most relevant to the query
are returned in order of relevance according to a standard heuristic measure.
.\"--------------------------------------------------------------
.SS Motivation
To find a particular piece of information stored in
computer text files, one has a few alternative
courses of action.
One can operate directly on the text files
with utilities such as grep, or one can process the text files into
some form of database.
Grep is generally limited to identifying
lines that match a regular expression.
If the collection of files that grep operates on becomes large,
the repeated passes over the entire text for each query become expensive.
On the other hand, grep is simple to use, as no auxiliary files need to be created.
.PP
A database consists of some data and indexes into that data.
With indexes, a large database can be queried quickly.
Standard databases
divide the data into records made up of fields, so the granularity
of search is a field.
In a full-text system such as \*r
there are no fields
(or, equivalently, each document has an arbitrarily long list of word fields);
instead, every word is indexed.
With this method, the system can accept free-form information and still search quickly.
The next question is how much overhead such a database incurs.
In \*r most of the files produced are stored in compressed form.
The two most notable compressed files are the source data and the index, called
an "inverted file".
Because these files are compressed, the
database can be smaller than the source data.
.\"--------------------------------------------------------------
.SS Typical Usage
The most common use of \*r
has been as a searchable database of UNIX mail files.
However, any collection of text can be used; one only needs to decide
what constitutes a document (see
.BR mgintro++ (1)).
\*r has also been used on large collections such
as Comact (the Commonwealth Acts of Australia), which is around 132 megabytes,
and on TREC
(a mixture of collections such as the Wall Street Journal
and Associated Press) at sizes up to around 2 gigabytes.
.\"--------------------------------------------------------------
.SS Getting Started with \*r
The first step is to install the package
by following the INSTALL instructions.
Next, set a couple of environment variables.
MGDATA should name a directory that will hold one subdirectory for
each database that you build.
For example, in csh syntax:
.IP
.B mkdir ~/mgdata; setenv MGDATA ~/mgdata
.LP
Sample data, such as the book "Alice in Wonderland", is provided
for building sample databases.
To make it accessible, set the environment variable MGSAMPLE
to the directory that contains it.
For example:
.IP
.B setenv MGSAMPLE ~/mg/SampleData
.LP
Here, ~/mg/SampleData should contain the file alice.z.
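.PP
The setenv command is specific to csh-style shells.
For users of sh, bash or ksh, a minimal equivalent might look like the following; the directories are the same illustrative paths used in the examples, not locations that MG requires.

```shell
# Bourne-shell equivalent of the csh setenv examples.
# The paths are the illustrative ones from this page; adjust them to
# wherever the databases and the sample data actually live.
mkdir -p "$HOME/mgdata"
MGDATA="$HOME/mgdata"
MGSAMPLE="$HOME/mg/SampleData"
export MGDATA MGSAMPLE
```

Placing these lines in the shell start-up file (for example ~/.profile) makes the settings available in every new login session.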
.PP
To build the Alice database (which will be stored in the $MGDATA/alice subdirectory),
type the command
.IP
.B mgbuild alice
.LP
Once the status messages indicate that the build has completed, type
.IP
.B mgquery alice
.LP
to query the database.
Type a few words at the prompt and press return;
some relevant documents (paragraphs of Alice) should be retrieved.
Type ".set query ranked" to switch to ranked queries.
See
.BR mgquery (1)
for more information on its commands and options.
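.PP
A build can fail simply because the environment described above is incomplete.
The following sketch shows one way to check for that before running mgbuild; check_mg_env is a hypothetical helper written for this page, not a program shipped with MG, and it tests only the two conditions stated in this section.

```shell
# check_mg_env: hypothetical helper (not part of MG) that verifies the
# environment described in this section before a build is attempted.
check_mg_env() {
    if [ -z "${MGDATA:-}" ] || [ ! -d "$MGDATA" ]; then
        echo "MGDATA is unset or is not a directory" >&2
        return 1
    fi
    if [ -z "${MGSAMPLE:-}" ] || [ ! -f "$MGSAMPLE/alice.z" ]; then
        echo "MGSAMPLE is unset or does not contain alice.z" >&2
        return 1
    fi
    return 0
}

# Typical use: build and then query only if the checks pass.
#   check_mg_env && mgbuild alice && mgquery alice
```

The check on alice.z applies only to the sample database; for other collections only the MGDATA test is relevant.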
.PP
The next step is to use \*r
on a more personal database.
If your mail is stored in subdirectories of ~/Mail,
as in the typical setup of
.BR elm (1),
then type
.IP
.B mgbuild allfiles
.LP
If, however, you keep all your mail in ~/mbox or ~/sentmail, then type
.IP
.B mgbuild mailfiles
.LP
.\"--------------------------------------------------------------
.SH AVAILABILITY
The \*r software for SunOS 4, Solaris, HP-UX, and MIPS
can be obtained by ftp from munnari.oz.au [128.250.1.21], in the directory
/pub/mg.
.\"--------------------------------------------------------------
.SH SEE ALSO
.na
.BR mgintro++ (1),
.BR mgbuild (1),
.BR mgquery (1)
.br
"Guide To The \*r System", in Appendix A of the book:
.PP
.RS
.nf
Ian H. Witten, Alistair Moffat, and Timothy C. Bell
.I "Managing Gigabytes: Compressing and Indexing Documents and Images"
Van Nostrand Reinhold
1994
xiv + 429 pages
US$54.95
ISBN 0-442-01863-0
Library of Congress catalog number TA1637 .W58 1994.
.fi
.RE