.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.Id $Id: mgintro.1 16583 2008-07-29 10:20:36Z davidb $
.\"------------------------------------------------------------
.ds r \&\s-1MG\s0
.if n .ds - \%--
.if t .ds - \(em
.\"------------------------------------------------------------
.am SS
.LP
..
.\"------------------------------------------------------------
.TH MGINTRO 1 \*(Dt CITRI
.\"--------------------------------------------------------------
.SH NAME
mgintro \- introduction to the MG system
.\"--------------------------------------------------------------
.SH DESCRIPTION
The MG (Managing Gigabytes) system is a collection of
programs that together make up a full-text retrieval system.
A full-text retrieval system allows one to build a
database from a given set of documents and then query
it to retrieve any relevant documents.
It is "full-text" in the sense that every word in the
text is indexed, and a query searches using only this index.
.PP
For example, one could build a database from the book
"Alice in Wonderland", treating each paragraph of the book
as a document.
Having built the "Alice" database, one could issue queries
such as "cat alice grin" and retrieve any paragraphs that
match the query. The matching can be boolean, in which case
the retrieved paragraphs satisfy a boolean expression over
the query terms, e.g. "cat alice grin"; or it can be
ranked, in which case the documents most relevant to the query
are returned in order of relevance, using some standard heuristic measure.
.\"--------------------------------------------------------------
.SS Motivation
To find particular information stored in text files on a computer,
one has a few alternative
courses of action. One can operate directly on the text files
with utilities such as grep, or one can process the text files into
some form of database. Grep is generally limited to identifying
lines by matching regular expressions. If the collection
of files that grep operates on becomes large, then repeated
passes over the entire text for each query become expensive.
However, grep is simple to use, as no auxiliary files need to be created.
.PP
A database consists of some data and indexes into that data. Indexes
allow a large database to be queried quickly. Standard databases
divide the data into records made up of fields, so the granularity
of search is a field. In a full-text system, such as \*r,
there are no fields
(or, equivalently, each document has an arbitrarily long list of word fields)
and instead every word is indexed.
Using this method, the system can accept free-form information and still search quickly.
The next question is how much overhead the database imposes.
In \*r most of the files produced are stored in compressed form. The
two notable compressed files are the source text and the index, called
an "inverted file". Because the files are compressed, the
database can be smaller than the source data.
.\"--------------------------------------------------------------
.SS Typical Usage
The most common use for \*r
has been as a searchable database of Unix mail files.
However, any set of text data can be used; one just needs to determine
what constitutes a document (see
.BR mgintro++ (1)).
\*r has also been used on large collections such
as Comact (Commonwealth Acts of Australia), which is around 132 megabytes,
and on collections of up to around 2 gigabytes for TREC
(a mixture of collections such as the Wall Street Journal
and Associated Press).
.\"--------------------------------------------------------------
.SS Getting Started with \*r
The first thing to do is install the package;
please follow the INSTALL instructions.
Having done this, it is necessary to set a couple of environment variables.
MGDATA should be set to a directory that will hold a subdirectory for
each database that you build. For example:
.IP
.B mkdir ~/mgdata; setenv MGDATA ~/mgdata
.LP
If you want to try building some sample databases, there is
some sample data such as the "Alice In Wonderland" book. To make sure
it is accessible, set the environment variable MGSAMPLE.
For example:
.IP
.B setenv MGSAMPLE ~/mg/SampleData
.LP
Here, "~/mg/SampleData" should contain alice.z.
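.PP
The examples above use
.BR csh (1)
syntax. For a Bourne-style shell such as
.BR sh (1),
a minimal equivalent, assuming the same illustrative directories
(adjust them to match your installation), would be:
.IP
.nf
.B mkdir $HOME/mgdata; MGDATA=$HOME/mgdata; export MGDATA
.B MGSAMPLE=$HOME/mg/SampleData; export MGSAMPLE
.fi
.LP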
.PP
To build the Alice database (which will be stored in the $MGDATA/alice subdirectory),
type the command
.IP
.B mgbuild alice
.LP
Assuming all went well and some status messages
were printed indicating that the build completed, type
.IP
.B mgquery alice
.LP
to query the database.
You can type a few words at the prompt and press return;
some relevant documents (Alice paragraphs) should be retrieved.
Type ".set query ranked" to do ranked queries. Please refer to the
.BR mgquery (1)
man page for more information on its commands and options.
.PP
The next thing to do is to use \*r
on a more personal database. If you have
your mail stored in subdirectories of ~/Mail, as is the case with the
typical setup of
.BR elm (1),
then type
.IP
.B mgbuild allfiles
.LP
If, however, you keep all your mail in ~/mbox or ~/sentmail, then type
.IP
.B mgbuild mailfiles
.LP
.\"--------------------------------------------------------------
.SH AVAILABILITY
The \*r software for SunOS 4, Solaris, HPUX, and MIPS
can be obtained by FTP from munnari.oz.au [128.250.1.21], in the directory
/pub/mg.
.\"--------------------------------------------------------------
.SH SEE ALSO
.na
.BR mgintro++ (1),
.BR mgbuild (1),
.BR mgquery (1)
.br
"Guide To The \*r System", in Appendix A of the book:
.PP
.RS
.nf
Ian H. Witten, Alistair Moffat, and Timothy C. Bell
.I "Managing Gigabytes: Compressing and Indexing Documents and Images"
Van Nostrand Reinhold
1994
xiv + 429 pages
US$54.95
ISBN 0-442-01863-0
Library of Congress catalog number TA1637 .W58 1994.
.fi
.RE