.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.Id $Id: mg_compression_dict.1 16583 2008-07-29 10:20:36Z davidb $
.\"------------------------------------------------------------
.TH mg_compression_dict 1 \*(Dt CITRI
.SH NAME
mg_compression_dict \- build a compression dictionary.
.SH SYNOPSIS
.B mg_compression_dict
[
.B \-h
]
[
.BR \-C " |"
.BR \-P " |"
.B \-S
]
.if n .ti +9n
[
.BR \-0 " |"
.BR \-1 " |"
.BR \-2 " |"
.B \-3
]
[
.BR \-H " |"
.BR \-B " |"
.BR \-D " |"
.BR \-Y " |"
.B \-M
]
.if n .ti +9n
.if t .ti +.5i
[
.BI \-l " lookback"
]
[
.BI \-k " mem"
]
[
.BI \-d " directory"
]
.BI \-f " name"
.SH DESCRIPTION
.B mg_compression_dict
builds a compression dictionary based on the statistics gathered
during the first pass over the text. The options to the program are
mainly concerned with limiting the amount of memory the dictionary
will use and with how the text compressor will cope with any novel
words found during the compression phase.
.SH OPTIONS
Options may appear in any order.
.TP "\w'\fB\-d\fP \fIdirectory\fP'u+2n"
.B \-h
This displays a usage line on
.IR stderr .
.TP
.B \-C
Build a complete dictionary from the statistics file. If during the
text compression phase a novel word is found, then the compressor will
produce an error message and stop.
.TP
.B \-P
Build a partial dictionary from the statistics file. This dictionary
assumes that the statistics file is based on the entire text. The
statistics of words not included in the dictionary are used to
calculate the escape probability. If novel words are being coded
character by character, then there may not be a Huffman code for every
possible character. This means that the compressor may fail if a novel
word contains a novel character.
.TP
.B \-S
Build a seed dictionary from the statistics file. This dictionary
assumes that the statistics file is based on only a portion of the
text to be compressed. The probability of a novel word is based on the
number of words that have only occurred once. If novel words are being
coded character by character, then the Huffman codes for characters are
based on the frequency of characters in the dictionary.
.TP
.B \-0
All words from the statistics file are included in the built
dictionary.
.TP
.B \-1
Words are included in the dictionary until the dictionary reaches the
desired size. Words are selected for the dictionary based on the order
they occurred in the source text.
.TP
.B \-2
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first.
.TP
.B \-3
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first. Words are then shuffled back and forth between the
`keep' and `discard' lists to find the `optimal' set of words that
should be in the dictionary.
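
The frequency-ordered selection described for \-2 (and used as the starting point for \-3) can be sketched as follows. This is only an illustration, not the program's actual code: the function name select_words, the per-word cost of "length plus one byte", and the alphabetical final tie-break are all assumptions made to keep the example deterministic.

```python
# Hypothetical sketch of frequency-ordered word selection under a size budget.
# Names and the cost model are assumptions, not taken from the mg sources.

def select_words(freqs, budget_bytes):
    # Most frequent first; on a frequency tie, the shortest word first.
    # The final alphabetical key is an added assumption for determinism.
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], len(kv[0]), kv[0]))
    kept = []
    used = 0
    for word, _count in ranked:
        cost = len(word) + 1  # assumed cost: word bytes plus one length byte
        if used + cost > budget_bytes:
            break  # dictionary has reached the desired size
        kept.append(word)
        used += cost
    return kept


if __name__ == "__main__":
    stats = {"the": 10, "of": 10, "zebra": 1, "a": 3}
    print(select_words(stats, 10))
```

The \-3 refinement, which moves words between the keep and discard lists, is not modelled here because the man page does not describe its criterion.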
.TP
.B \-H
This specifies that novel words will be coded character by character
using Huffman codes.
.TP
.B \-B
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using binary codes. The binary code represents their occurrence
position in the auxiliary dictionary.
.TP
.B \-D
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using delta codes. The delta code represents their occurrence position
in the auxiliary dictionary.
.TP
.B \-Y
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using a combination of gamma and binary codes. The code represents
their occurrence position in the auxiliary dictionary. This generally
produces better compression than
.B \-B
or
.BR \-D .
.TP
.B \-M
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using a combination of gamma and binary codes. The code represents
their occurrence position in the auxiliary dictionary. This method is
adaptive within documents, and generally produces better compression
than
.BR \-B ,
.B \-D
or
.BR \-Y .
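
For readers unfamiliar with the gamma and delta codes named above, the sketch below shows the textbook Elias gamma and Elias delta codes for a positive integer such as a one-based position in the auxiliary dictionary. It assumes that these are the codes the man page means, and it returns bit strings for readability; the bit conventions and the exact gamma/binary hybrid used by \-Y and \-M may differ in the real program. The function names are hypothetical.

```python
# Textbook Elias gamma and delta codes, returned as strings of '0'/'1'.
# Assumption: the "gamma" and "delta" codes in the man page are these codes.

def elias_gamma(n):
    # floor(log2 n) zero bits, followed by n in binary.
    if n < 1:
        raise ValueError("n must be >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_delta(n):
    # Gamma code of the bit length of n, then n without its leading 1 bit.
    if n < 1:
        raise ValueError("n must be >= 1")
    binary = bin(n)[2:]
    return elias_gamma(len(binary)) + binary[1:]


if __name__ == "__main__":
    for position in (1, 2, 10, 1000):
        print(position, elias_gamma(position), elias_delta(position))
```

Both codes give short codewords to small numbers, which suits positions near the start of the auxiliary dictionary; delta codes become shorter than gamma codes as the numbers grow.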
.TP
.BI \-l " lookback"
The generated dictionary is designed to be front coded when it is
loaded into memory. Under normal circumstances, a front-coded
dictionary would require scanning from the beginning in order to find
any particular word. However, every
.I lookback
words in the dictionary, the whole word is stored and a pointer to that
word maintained. E.g., if
.I lookback
is 4, then every fourth word is stored in its entirety.
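
The effect of lookback on a front-coded dictionary can be illustrated with a small sketch. It assumes a sorted word list and an in-memory layout of (prefix length, suffix) pairs; the function names, this layout, and the treatment of a lookback below 1 as "store every word whole" are assumptions for illustration and not the format mg actually uses.

```python
# Hypothetical sketch of front coding with periodic whole-word entries.

def common_prefix_len(a, b):
    # Number of leading characters shared by a and b.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_code(sorted_words, lookback):
    step = max(lookback, 1)  # assumption: lookback < 1 stores every word whole
    entries = []
    prev = ""
    for i, word in enumerate(sorted_words):
        if i % step == 0:
            entries.append((0, word))  # whole word, acts as a restart point
        else:
            k = common_prefix_len(prev, word)
            entries.append((k, word[k:]))  # shared prefix length plus suffix
        prev = word
    return entries


def decode_at(entries, index, lookback):
    # Rebuild one word by scanning only from the nearest whole-word entry.
    step = max(lookback, 1)
    start = index - index % step
    word = entries[start][1]
    for k, suffix in entries[start + 1:index + 1]:
        word = word[:k] + suffix
    return word


if __name__ == "__main__":
    words = ["jab", "jabber", "jabbing", "jack", "jacket", "jackpot"]
    coded = front_code(words, 4)
    print(coded)
    print(decode_at(coded, 5, 4))
```

With a lookback of 4, decoding any word touches at most four entries, at the cost of storing every fourth word in full.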
.TP
.BI \-k " mem"
This limits the amount of memory to use for the generated
dictionary. Words are selected for the dictionary based on the text
statistics, and whether
.BR \-0 , " \-1" , " \-2"
or
.B \-3
is specified. The memory is calculated assuming a lookback of 0,
irrespective of what actual lookback is specified. This means that if
a non-zero lookback is given, the dictionary will actually occupy
less space than specified by
.BR \-k .
.TP
.BI \-d " directory"
This specifies the directory where the document collection can be found.
.TP
.BI \-f " name"
This specifies the base name of the document collection.
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable exists, then its value is used as the
default directory where the
.BR mg (1)
collection files are. If this variable does not exist, then the
directory \*(lq\fB.\fP\*(rq is used by default. The command line
option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
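
The precedence between \-d, MGDATA and the current directory can be summarised in a short sketch. The function name and its parameters are hypothetical; only the order of precedence comes from the text above.

```python
# Hypothetical sketch of how the collection directory is chosen.
import os


def collection_dir(cli_directory=None, environ=None):
    # environ is injectable so the logic can be exercised without
    # touching the real process environment.
    env = os.environ if environ is None else environ
    if cli_directory is not None:
        return cli_directory  # -d overrides MGDATA
    # MGDATA is used if it exists; otherwise fall back to "."
    return env.get("MGDATA", ".")


if __name__ == "__main__":
    print(collection_dir())
```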
.SH FILES
.TP 20
.B *.text.stats
Statistics about the source text.
.TP
.B *.text.dict
Compression dictionary for the source text.
.SH "SEE ALSO"
.na
.BR mg (1),
.BR mg_fast_comp_dict (1),
.BR mg_get (1),
.BR mg_invf_dict (1),
.BR mg_invf_dump (1),
.BR mg_invf_rebuild (1),
.BR mg_passes (1),
.BR mg_perf_hash_build (1),
.BR mg_text_estimate (1),
.BR mg_weights_build (1),
.BR mgbilevel (1),
.BR mgbuild (1),
.BR mgdictlist (1),
.BR mgfelics (1),
.BR mgquery (1),
.BR mgstat (1),
.BR mgtic (1),
.BR mgticbuild (1),
.BR mgticdump (1),
.BR mgticprune (1),
.BR mgticstat (1).