1.\"------------------------------------------------------------
2.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
3.de Id
4.ds Rv \\$3
5.ds Dt \\$4
6..
7.Id $Id: mgquery.1 16583 2008-07-29 10:20:36Z davidb $
8.\"------------------------------------------------------------
9.TH mgquery 1 \*(Dt CITRI
10.SH NAME
11mgquery \- query program for the mg system
12.SH SYNOPSIS
13.B mgquery
14[
15.B \-h
16]
17[
18.B \-D
19]
20[
21.BI \-f " name"
22]
23[
24.BI \-d " directory"
25]
26.if n .ti +9n
27[
28.I collection-name
29]
30.SH DESCRIPTION
31.B mgquery
32enables users to run Boolean or ranked queries against a database
33generated by the
34.BR mg (1)
35system. It accepts queries from
36.I stdin
37and sends the retrieved documents to
38.IR stdout .
39Information on the resource usage of
40.B mgquery
41as it processes queries can be obtained interactively.
42.SH OPTIONS
43Options may appear in any order, but the
44.IR collection-name ,
45if specified, must be last.
46.TP "\w'\fB\-d\fP \fIdirectory\fP'u+2n"
47.B \-h
48This displays a usage line on
49.IR stderr .
50.TP
51.B \-D
52This option causes the entire text to be decompressed and sent to
53.IR stdout .
54.TP
55.BI \-f " name"
56This specifies the base name of the document collection that will be
57used. If a collection with the specified base
58.I name
59does not exist, an error message will be displayed and
60.B mgquery
61will exit.
62.TP
63.BI \-d " directory"
64This specifies the directory where the document collection can be found.
65.SH USAGE
66Prior to processing the command line arguments, the
67.B mgquery
68program attempts to read in a startup script called
69.IR ./.mgrc .
70If that fails, it attempts to read in the file
71.IR $HOME/.mgrc .
72The startup file can only contain commands\(emno queries are
73permitted in the
74.I .mgrc
75file. Lines starting with \*(lq\fB#\fP\*(rq in the file are comments.
76The most common use for the
77.I .mgrc
78file is to personalise the initial values of the predefined parameters
79with
80.B .set
81commands.
82.LP
83The input to
84.B mgquery
85consists of a series of input lines. The backslash
86character
87.RB (\*(lq \e \*(rq)
88is used at the end of lines to indicate
89that input continues on the next line.
90.LP
91Input lines on which the first character is a dot
92.RB (\*(lq . \*(rq)
93are commands to the
94.B mgquery
95program. Input lines that do not start with a dot are queries.
96.LP
97A query consists of two parts. One part is a Boolean or ranked query
98that identifies documents. The second part is a post-processing
99pattern matching operation. Any text between the first speech mark
100(\*(lq) and the last speech mark (\*(rq) is considered to be a
101post-processing pattern.
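The input rules described above can be sketched in Python. This is only an illustration, not the code used by mgquery. The function names join_continuations, classify and split_post_pattern are hypothetical, the speech marks are assumed to be plain double quotes, and treating everything outside the speech marks as the search part is also an assumption.

```python
def join_continuations(lines):
    """Join each line that ends in a backslash with the line that follows it."""
    joined, pending = [], ""
    for line in lines:
        if line.endswith("\\"):
            pending += line[:-1]       # drop the backslash and keep reading
        else:
            joined.append(pending + line)
            pending = ""
    if pending:                        # continuation with no following line
        joined.append(pending)
    return joined


def classify(line):
    """A leading dot marks a command; anything else is a query."""
    return "command" if line.startswith(".") else "query"


def split_post_pattern(query):
    """Return (search part, post-processing pattern or None)."""
    first, last = query.find('"'), query.rfind('"')
    if first == -1 or first == last:   # fewer than two speech marks
        return query.strip(), None
    pattern = query[first + 1:last]
    # Assumption: whatever lies outside the speech marks is the search part.
    rest = (query[:first] + query[last + 1:]).strip()
    return rest, pattern
```

For instance, the line `cat & dog "and.*the"` would be classified as a query whose post-processing pattern is `and.*the`.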
102.SH COMMANDS
103The
104.B mgquery
105program can accept the following commands.
106.TP 17
107.B .help
108Display several pages of help text.
109.TP
110.B .quit
111Quit the program.
112.TP
113.B .warranty
114Display the
115.BR mg (1)
116warranty.
117.TP
118.B .conditions
119Display the conditions of use and distribution of
120.BR mg (1).
121.TP
122.BI ".set " "name value"
123Set the parameter
124.I name
125to the specified
126.IR value .
127If the parameter is a Boolean
128.I value
129and the
130.I value
131is omitted, the parameter will be inverted (i.e., if it was
132.IR true ,
133then it will change to
134.IR false ;
135if it was
136.IR false ,
137then it will change to
138.IR true ).
139.TP
140.BI ".unset " name
141Delete the parameter
142.I name
143from the currently-defined parameters.
144.TP
145.B .reset
146Reset the parameters to the state that they had after the processing
147of the
148.B mgquery
149command line.
150.TP
151.B .display
152Display the values of all the currently-defined parameters.
153.TP
154.B .push
155Push the currently-defined parameters onto a stack.
156.TP
157.B .pop
158Pop a set of parameters off the stack, replacing the currently-defined
159ones.
160.TP
161.BI ".output " arg
162This is used to specify where to send the text of the documents. Once
163the
164.B .output
165command is specified, all subsequent output will be sent to the place
166specified by
167.IR arg .
168If
169.I arg
170is not specified, subsequent output will be directed to
171.IR stdout .
172.I Arg
173may be any of the following.
174.RS
175.TP 13
176.BI "> " filename
177Send output to the specified file.
178.TP
179.BI ">> " filename
180Append output to the specified file.
181.TP
182.BI "| " command
183Pipe the output to
184.IR command ,
185which is executed by
186.IR sh .
187.RE
188.TP
189.BI ".input " arg
190This is used to specify where input (queries and commands) comes
191from. Once the
192.B .input
193command is specified, all subsequent input will come from the place
194specified by
195.IR arg .
196If
197.I arg
198is not specified, subsequent input will come from
199.IR stdin .
200.RS
201.TP 13
202.BI "< " filename
203Get input from the specified file.
204.TP
205.BI "| " command
206The input comes from the standard output of
207.IR command ,
208which is executed by
209.IR sh .
210.RE
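The behaviour of the parameter commands (.set, .unset, .reset, .push and .pop) can be modelled with a small Python class. This is a toy model written for illustration, not mgquery's real data structure. The class name Params is hypothetical. Two behaviours are assumptions, because the text above does not cover them: .pop on an empty stack does nothing, and .set without a value on a non-Boolean parameter is an error.

```python
class Params:
    """Toy model of the parameter commands; all names are illustrative."""

    def __init__(self, initial):
        self.current = dict(initial)
        self.startup = dict(initial)   # state after command-line processing
        self.stack = []

    def set(self, name, value=None):
        if value is None:
            # With no value, a Boolean parameter is inverted.
            if isinstance(self.current.get(name), bool):
                self.current[name] = not self.current[name]
            else:
                # Assumption: a value is required for non-Boolean parameters.
                raise ValueError(f"a value is required for {name}")
        else:
            self.current[name] = value

    def unset(self, name):
        self.current.pop(name, None)

    def reset(self):
        self.current = dict(self.startup)

    def push(self):
        self.stack.append(dict(self.current))

    def pop(self):
        if self.stack:                 # assumption: empty stack is a no-op
            self.current = self.stack.pop()
```

A typical use of .push and .pop is to change a few parameters for one query and then restore the earlier settings.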
211.SH PARAMETERS
212The following parameters are predefined and have special
213significance. Each parameter will be followed by its default
214value. Parameters are initialised before the
215.I .mgrc
216file is read or the command line arguments are processed.
217.TP 17
218.BI accumulator_method " `array'"
219This parameter is used during ranking, and specifies how the
220weight for each document should be accumulated. The following
221methods are available:
222.IR array ,
223.IR splay_tree ,
224.IR hash_table ,
225and
226.IR list .
227.TP
228.BI briefstats " `off'"
229This is a Boolean parameter that determines whether the
230totals for disk, memory and time usage statistics will be
231displayed at the end of each query.
232.IR Note :
233this takes precedence over the parameters
234.BR diskstats ,
235.BR memstats " and " timestats .
236This parameter may take the values
237.IR yes ", " no ", "
238.IR true ", " false ", "
239.IR on " or " off .
240.TP
241.BI buffer " `1048576'"
242When the documents are being read in, they are read into a
243buffer of this size and then displayed from this buffer. If
244the documents are larger than this buffer, the buffer is
245expanded automatically. Having a large buffer gives a very
246slight performance improvement, because it allows the order of
247disk operations to be optimised. The buffer size is measured
248in bytes.
249.TP
250.BI diskstats " `off'"
251This is a Boolean parameter that determines whether the disk
252usage statistics for the preceding query will be displayed
253after each query. This parameter may take the values
254.IR yes ", " no ", "
255.IR true ", " false ", "
256.IR on " or " off .
257.TP
258.BI doc_sepstr " `---------------------------------- %n\en'"
259This specifies the string that will be used to separate
260documents when they are displayed for `Boolean' or `docnums'
261queries. The standard C escape character sequences
262may be used to place special characters in the
263string. For example, a newline would be `\en'. To include a `%',
264use the sequence `%%'. To include the
265.BR mg (1)
266document number, use the sequence `%n'. The following escape character
267sequences are available:
268.nf
269.ta 1.7iL
270.B Sequence	Meaning
271`\e\e'	backslash
272`\eb'	backspace
273`\ef'	formfeed
274`\en'	newline
275`\er'	carriage return
276`\et'	tab
277`\e"'	speech marks
278`\e''	quote mark
279`\ex\fIhh\fP'	ASCII code in hexadecimal
280`\ennn'	ASCII code in octal
281.fi
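As a rough illustration of how a separator string could be expanded, the Python sketch below handles the escape sequences listed above together with `%%', `%n' and `%w' (the last is used by ranked_doc_sepstr). The function name expand_sepstr is hypothetical. The sketch assumes two hexadecimal digits after `\ex', one to three octal digits, and that unrecognised sequences are copied through unchanged; mgquery itself may behave differently.

```python
import re

# Single-character escapes from the table above.
SIMPLE = {"\\": "\\", "b": "\b", "f": "\f", "n": "\n",
          "r": "\r", "t": "\t", '"': '"', "'": "'"}

# Alternatives, in order: \xhh, \nnn (octal), \<char>, %<char>.
TOKEN = re.compile(r"\\x([0-9A-Fa-f]{2})|\\([0-7]{1,3})|\\(.)|%(.)", re.S)


def expand_sepstr(template, docnum, weight=None):
    """Expand escapes and % sequences in a separator string."""
    def repl(m):
        hex_code, oct_code, esc, pct = m.groups()
        if hex_code is not None:
            return chr(int(hex_code, 16))
        if oct_code is not None:
            return chr(int(oct_code, 8))
        if esc is not None:
            return SIMPLE.get(esc, m.group(0))
        if pct == "%":
            return "%"
        if pct == "n":
            return str(docnum)
        if pct == "w" and weight is not None:
            return str(weight)
        return m.group(0)              # unknown sequence: leave unchanged
    return TOKEN.sub(repl, template)
```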
282.TP
283.BI expert " `false'"
284If this is
285.IR true ,
286then much of the dialogue output is suppressed. This parameter may
287take the values
288.IR yes ", " no ", "
289.IR true ", " false ", "
290.IR on " or " off .
291.TP
292.BI hash_tbl_size " `1000'"
293One of the options during ranking queries is to use a hash
294table to accumulate the weights for each document. The hash
295table is a simple chained type. This parameter specifies the
296size of the hash table and may take any value between 8 and
297268435456 (2^28).
298.TP
299.BI heads_length " `50'"
300When the mode is
301.BR heads ,
302this specifies the number of characters that will be output for each
303document.
304.TP
305.BI maxdocs " `all'"
306The maximum number of documents to display in response to a
307query. This parameter may take on a numeric value between 1
308and 4294967295 (2^32 - 1) or the word
309.IR all .
310.TP
311.BI maxparas " `1000'"
312The maximum number of paragraphs to identify during a ranked
313query with paragraph indexing. After the paragraphs have been
314identified, the paragraphs are converted into documents, and
315because some of the paragraphs may refer to the same documents
316the final number of answers may be less than
317.BR maxparas .
318The
319.B maxdocs
320parameter will then be applied. This parameter may take on a numeric
321value between 1 and 4294967295 (2^32 - 1).
322.TP
323.BI max_accumulators " `50000'"
324This parameter limits the number of different paragraph and
325document numbers to be accumulated during ranked queries when
326the parameter
327.B accumulator_method
328is set to
329.IR splay_tree ,
330.IR hash_table ,
331or
332.IR list .
333This parameter may take any value between 8 and 268435456 (2^28).
334.TP
335.BI max_terms " `all'"
336This parameter limits the number of terms that will actually
337be used during a ranked query. If more terms than the number
338specified by
339.B max_terms
340are entered, then the extra terms will be discarded. If
341.B sorted_terms
342is on, then the limiting will be done after the terms have been
343sorted. This parameter may take any value between 1 and 4294967295
344(2^32 - 1), or the word
345.IR all .
346.TP
347.BI memstats " `off'"
348This is a Boolean parameter that determines whether the memory
349usage statistics for the preceding query will be displayed
350after each query. This parameter may take the values
351.IR yes ", " no ", "
352.IR true ", " false ", "
353.IR on " or " off .
354.TP
355.BI mgdir " `.'"
356This is set to the directory where the
357.BR mg (1)
358data files may be found. If
359the environment variable
360.B MGDATA
361exists, then this is instead initialised to the value of
362.BR MGDATA .
363The value of this parameter may be changed, either in the
364.I .mgrc
365file with a
366.BI ".set mgdir "directory
367command, or from the command line using the
368.BI \-d " directory"
369option. Once the \*(lq\fB>\fP\*(rq prompt appears, changing this
370parameter will have no effect.
371.TP
372.BI mgname " `bible'"
373This is set to the name of the
374.BR mg (1)
375collection that is to be used for the session. The value of this
376parameter may be changed, either in the
377.I .mgrc
378file with a
379.BI ".set mgname "name
380command, or from the command line using the
381.BI \-f " name"
382option. Once the \*(lq\fB>\fP\*(rq prompt appears, changing this
383parameter will have no effect.
384.TP
385.BI mode " `text'"
386This specifies how documents should be displayed when they
387are retrieved. It may take six different values:
388.IR text ,
389.IR hilite ,
390.IR docnums ,
391.IR heads ,
392.IR silent ,
393or
394.IR count .
395.I text
396displays the contents of the document.
397.I hilite
398displays the contents of the document and highlights any of the
399stemmed query terms.
400.I docnums
401displays only the document numbers.
402.I heads
403is used to print out the head of each document.
404.I silent
405retrieves all the documents but displays nothing except how many
406documents were retrieved. This mode is intended to be used in timing
407experiments.
408.I count
409does the minimum
410amount of work required to determine how many documents would
411be retrieved, but does not retrieve them.
412.TP
413.BI optimise_type " `1'"
414There are three types of Boolean query optimisation (parse tree
415rearrangement). Type 0 leaves the parse tree unaltered. Type 1 optimises
416for AND of terms and AND of OR of terms. Type 2 converts the tree
417into DNF (an experiment :-).
418.TP
419.BI pager " `more'"
420This is the name of the program that will be used to display
421the help and the retrieved documents. If the environment
422variable
423.B PAGER
424is defined, then
425.B pager
426takes on that value.
427.TP
428.BI hilite_style " `bold'"
429This specifies the type of highlighting method.
430It may take one of two different values:
431.IR bold
432or
433.IR underline .
434.TP
435.BI para_sepstr " `\en######## PARAGRAPH %n ########\en'"
436This specifies the string that will be used to separate paragraphs.
437The standard C escape character sequences may be used to place special
438characters in the string. For example, a newline would be written
439as `\en'. To include a `%', use the sequence `%%'. To include the
440paragraph number within the document, use the sequence `%n'.
441.TP
442.BI para_start " `***** Weight = %w *****\en'"
443This specifies the string that will be used at the head of paragraphs
444for a paragraph-level index following a ranked query. The standard
445C-language escape character sequences may be used to place special
446characters in the string. For example, a newline would be written as
447`\en'. To include a `%', use the sequence `%%'. To include the
448paragraph weight, use the sequence `%w'.
449.TP
450.BI qfreq " `true'"
451This determines whether the ranked queries will take into account the
452number of times each query term is specified. When this is
453.IR true ,
454the number of times a term appears in the query is used in the
455ranking. When this is
456.IR false ,
457all query terms are assumed to occur only once. This parameter may
458take the values
459.IR yes ", " no ", "
460.IR true ", " false ", "
461.IR on " or " off .
462.TP
463.BI query " `Boolean'"
464This specifies the type of queries that will be entered.
465It can take four different values:
466.IR Boolean ,
467.IR ranked ,
468.IR docnums " or " approx-ranked .
469.I Boolean
470is for Boolean queries.
471The
472.BR yacc (1)
473grammar for Boolean queries is as follows.
474.IP
475.nf
476 query : or;
477.IP
478 or : or '|' and
479 | and ;
480.IP
481 and : and '&' not
482 | and not
483 | not ;
484.IP
485 not : term
486 | '!' not ;
487.IP
488 term : TERM
489 | '(' or ')' ;
490.fi
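The Boolean grammar above can be read as a small recursive-descent parser. The Python sketch below is one possible rendering, not the parser that yacc generates for mgquery. The left-recursive rules are rewritten as loops, which keeps the operators left-associative. The tokenizer is an assumption: the grammar does not define TERM, so a term is taken here to be a run of word characters. The names tokenize and parse are hypothetical.

```python
import re

# Assumption: a TERM is a run of word characters.
TOKEN = re.compile(r"\s*([|&!()]|\w+)")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse(text):
    """Parse a Boolean query into nested tuples."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def parse_or():                    # or : or '|' and | and
        node = parse_and()
        while peek() == "|":
            take()
            node = ("or", node, parse_and())
        return node

    def parse_and():                   # and : and '&' not | and not | not
        node = parse_not()
        while peek() not in (None, "|", ")"):
            if peek() == "&":          # the '&' is optional
                take()
            node = ("and", node, parse_not())
        return node

    def parse_not():                   # not : term | '!' not
        if peek() == "!":
            take()
            return ("not", parse_not())
        return parse_term()

    def parse_term():                  # term : TERM | '(' or ')'
        tok = take()
        if tok == "(":
            node = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        if tok is None or tok in ("|", "&", "!", ")"):
            raise ValueError(f"expected a term, got {tok!r}")
        return ("term", tok)

    tree = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return tree
```

Because the grammar allows two operands to be written side by side, `cat dog` parses to the same tree as `cat & dog`.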
491.IP
492.IR ranked " and " approx-ranked
493are for queries ranked by the cosine measure.
494.I approx-ranked
495uses only the low-precision document lengths, and therefore only
496produces an approximation to full cosine ranking.
497.IP
498.nf
499 query : TERM
500 | query TERM ;
501.fi
502.IP
503.I docnums
504allows the entry of document numbers. Multiple numbers separated by
505spaces may be specified, or ranges separated by hyphens.
506.IP
507.nf
508 query : range
509 | query range ;
510.IP
511 range : num
512 | num '-' num ;
513.fi
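The docnums grammar above can be sketched in a few lines of Python. This is an illustration only. The function name parse_docnums is hypothetical, ranges are assumed to include both end points, white space around the hyphen is assumed to be allowed, and a range whose first number exceeds its second is assumed to select nothing.

```python
import re

# range : num | num '-' num
RANGE = re.compile(r"(\d+)(?:\s*-\s*(\d+))?")


def parse_docnums(query):
    """Expand a docnums query such as '3 7-9' into a list of numbers."""
    numbers, pos = [], 0
    query = query.strip()
    while pos < len(query):
        m = RANGE.match(query, pos)
        if not m:
            raise ValueError(f"unexpected input at position {pos}")
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else start
        # Assumption: both end points of a range are included.
        numbers.extend(range(start, end + 1))
        pos = m.end()
        while pos < len(query) and query[pos].isspace():
            pos += 1
    return numbers
```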
514.TP
515.BI ranked_doc_sepstr " `-------------------------------- %n %w\en'"
516This specifies the string that will be used to separate documents when
517they are displayed for `ranked' or `approx-ranked' queries. The
518standard C escape character sequences may be used to place special
519characters in the string. For example, a newline would be written as
520`\en'. To include a `%', use the sequence `%%'. To include the
521.BR mg (1)
522document number, use the sequence `%n'. To include the document
523weight, use the sequence `%w'.
524.TP
525.BI sizestats " `false'"
526If this is
527.IR true ,
528then various numbers are output at the end of each query indicating
529what went on during the query. This parameter may take the values
530.IR yes ", " no ", "
531.IR true ", " false ", "
532.IR on " or " off .
533.TP
534.BI skip_dump " `skips.%d'"
535If this parameter is set, then a file will be produced in the current
536directory during ranked queries on skipped inverted files when
537.B accumulator_method
538is set to
539.IR splay_tree ,
540.IR hash_table ,
541or
542.IR list .
543The name of the file is the value of this parameter. A `%d' in the
544file name will be replaced with the process id of
545.BR mgquery .
546This file will contain information about the usage of skips during the
547query processing. This option is expensive; use
548.B .unset skip_dump
549to obtain optimal performance.
550.TP
551.BI sorted_terms " `on'"
552This specifies whether or not the terms should be sorted into
553increasing order of occurrence in documents so that the least-often occurring
554terms are processed first when ranked queries are being done. When
555this is
556.IR true ,
557the terms are sorted. When this is
558.IR false ,
559the terms are not sorted, and are instead processed in order of
560occurrence. This parameter may take the values
561.IR yes ", " no ", "
562.IR true ", " false ", "
563.IR on " or " off .
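The interaction between sorted_terms and max_terms can be sketched as follows. This is a simplified Python illustration and not the code used by mgquery. The function name select_terms and the doc_freq mapping (term to number of documents containing it) are hypothetical. Terms with equal counts are assumed to keep their query order.

```python
def select_terms(query_terms, doc_freq, sorted_terms=True, max_terms=None):
    """Order the query terms and discard any beyond max_terms."""
    terms = list(query_terms)
    if sorted_terms:
        # Least-often occurring terms first; Python's sort is stable,
        # so ties keep the order in which they were entered.
        terms.sort(key=lambda term: doc_freq.get(term, 0))
    if max_terms is not None:          # None stands for the word `all'
        terms = terms[:max_terms]      # limiting happens after sorting
    return terms
```

With sorting on, a limit of two terms keeps the two rarest terms; with sorting off, it keeps the first two terms as entered.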
564.TP
565.BI stop_at_max_accum " `on'"
566This specifies what should happen when the maximum number of
567accumulators set by
568.B max_accumulators
569is reached. When this is
570.IR true ,
571the processing of terms is stopped at the completion of the current
572term. When this is
573.IR false ,
574processing continues but no new accumulators are created. This
575parameter may take the values
576.IR yes ", " no ", "
577.IR true ", " false ", "
578.IR on " or " off .
579.TP
580.BI terminator " `'"
581This specifies the string that will be output after the last document
582from the previous query has been output. The standard C escape
583character sequences may be used to place special characters in the
584string. For example, a newline would be written as `\en'. To include
585a `%', use the sequence `%%'.
586.TP
587.BI timestats " `false'"
588If this is
589.IR true ,
590then the time to process a query is displayed in both real time and
591CPU time. This parameter may take the values
592.IR yes ", " no ", "
593.IR true ", " false ", "
594.IR on " or " off .
595.TP
596.BI verbatim " `off'"
597This is a Boolean parameter that determines whether the program
598should attempt to do a regular-expression match on the retrieved
599text. If verbatim is
600.I on
601and a post-processing string is specified with the query, then the
602post-processing string will be searched for in the documents just
603before they are displayed. If the string is found, the document will
604be displayed; if not, the document will not be displayed. If verbatim
605is
606.IR off ,
607the post-processing string will be considered a regular expression
608as in
609.BR egrep (1)
610or
611.BR vi (1).
612E.g., if verbatim is
613.IR on ,
614\*(lq\fBand.*the\fP\*(rq will look for the 8-character sequence
615\*(lq\fBand.*the\fP\*(rq. If verbatim is
616.IR off ,
617\*(lq\fBand.*the\fP\*(rq will look for the sequence
618\*(lq\fBand\fP\*(rq followed somewhere later in the document by the
619sequence \*(lq\fBthe\fP\*(rq. This parameter may take the values
620.IR yes ", " no ", "
621.IR true ", " false ", "
622.IR on " or " off .
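The difference between the two settings of verbatim can be shown with a short Python sketch. It is an approximation: Python's re module stands in for egrep-style regular expressions, and the two syntaxes are not identical. The function name passes_post_filter is hypothetical, and letting `.' match across line breaks is an assumption based on the phrase "somewhere later in the document".

```python
import re


def passes_post_filter(document, pattern, verbatim):
    """Return True if the document should be displayed."""
    if verbatim:
        # Literal search: the pattern must appear exactly as written.
        return pattern in document
    # Regular-expression search; re.S lets '.' span line breaks (assumption).
    return re.search(pattern, document, re.S) is not None
```

For the pattern `and.*the`, the text "sand on the beach" is displayed only when verbatim is off, because it contains "and" followed later by "the" but not the literal 8-character sequence.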
623.SH ENVIRONMENT
624.TP "\w'\fBMGDATA\fP'u+2n"
625.SB MGDATA
626If this environment variable exists, then its value is used as the
627default directory where the
628.BR mg (1)
629collection files are. If this variable does not exist, then the
630directory \*(lq\fB.\fP\*(rq is used by default. The command line
631option
632.BI \-d " directory"
633overrides the directory in
634.BR MGDATA .
635.SH FILES
636.TP 20
637.I .mgrc
638.B mgquery
639startup file
640.TP
641.B help.mg
642Help file for
643.BR mgquery .
644The contents of this file are displayed with the
645.B .help
646command.
647.TP
648.B *.invf
649Inverted file.
650.TP
651.B *.invf.dict
652The `on-disk' stemmed dictionary.
653.TP
654.B *.text
655Compressed documents.
656.TP
657.B *.text.dict
658Compression dictionary.
659.TP
660.B *.text.idx
661Index into the compressed documents.
662.TP
663.B *.text.idx.wgt
664Interleaved index into the compressed documents and document weights.
665.TP
666.B *.weight.approx
667Approximate document weights.
668.SH "SEE ALSO"
669.na
670.BR egrep (1),
671.BR mg (1),
672.BR mg_compression_dict (1),
673.BR mg_fast_comp_dict (1),
674.BR mg_get (1),
675.BR mg_invf_dict (1),
676.BR mg_invf_dump (1),
677.BR mg_invf_rebuild (1),
678.BR mg_passes (1),
679.BR mg_perf_hash_build (1),
680.BR mg_text_estimate (1),
681.BR mg_weights_build (1),
682.BR mgbilevel (1),
683.BR mgbuild (1),
684.BR mgdictlist (1),
685.BR mgfelics (1),
686.BR mgstat (1),
687.BR mgtic (1),
688.BR mgticbuild (1),
689.BR mgticdump (1),
690.BR mgticprune (1),
691.BR mgticstat (1),
692.BR vi (1),
693.BR yacc (1).