source: main/tags/2.80/indexers/mgpp/text/mgpp_compression_dict.1@ 24540

Last change on this file since 24540 was 3365, checked in by kjdon, 22 years ago

Initial revision

.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.\"------------------------------------------------------------
.TH mgpp_compression_dict 1 \*(Dt CITRI
.SH NAME
mgpp_compression_dict \- build a compression dictionary.
.SH SYNOPSIS
.B mgpp_compression_dict
[
.B \-h
]
[
.BR \-C " |"
.BR \-P " |"
.B \-S
]
.if n .ti +9n
[
.BR \-0 " |"
.BR \-1 " |"
.BR \-2 " |"
.B \-3
]
[
.BR \-H " |"
.BR \-B " |"
.BR \-D " |"
.B \-Y
]
.if n .ti +9n
.if t .ti +.5i
[
.BI \-l " lookback"
]
[
.BI \-k " mem"
]
[
.BI \-d " directory"
]
.BI \-f " name"
.SH DESCRIPTION
.B mgpp_compression_dict
builds a compression dictionary based on the statistics gathered
during the first pass over the text. The options to the program are
mainly concerned with limiting the amount of memory the dictionary
will use and with how the text compressor will cope with any novel
words found during the compression phase.
.SH OPTIONS
Options may appear in any order.
.TP "\w'\fB\-d\fP \fIdirectory\fP'u+2n"
.B \-h
This displays a usage line on
.IR stderr .
.TP
.B \-C
Build a complete dictionary from the statistics file. If during the
text compression phase a novel word is found, then the compressor will
produce an error message and stop.
.TP
.B \-P
Build a partial dictionary from the statistics file. This dictionary
assumes that the statistics file is based on the entire text. The
statistics of words not included in the dictionary are used to
calculate the escape probability. If novel words are being coded
character by character, then there may not be a Huffman code for every
possible character. This means that the compressor may fail if a novel
word contains a novel character.
.TP
.B \-S
Build a seed dictionary from the statistics file. This dictionary
assumes that the statistics file is based on only a portion of the
text to be compressed. The probability of a novel word is based on the
number of words that have only occurred once. If novel words are being
coded character by character, then the Huffman codes for characters are
based on the frequency of characters in the dictionary.
.TP
.B \-0
All words from the statistics file are included in the built
dictionary.
.TP
.B \-1
Words are included in the dictionary until the dictionary reaches the
desired size. Words are selected for the dictionary based on the order
they occurred in the source text.
.TP
.B \-2
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first.
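.IP
The ordering described for
.B \-2
can be sketched as follows. This is an illustration in Python, not the program's own code: the function name, the cost of one byte per character plus one byte per word, and the alphabetical final tie-break are all assumptions made for the example.
.nf
```python
def select_words(freqs, budget):
    """Pick words by descending frequency, shortest first on ties.

    freqs maps word -> occurrence count; budget is a size limit in the
    hypothetical cost units described above.
    """
    # Sort key: highest count first, then shortest word, then
    # alphabetical order (assumed) so the result is deterministic.
    ranked = sorted(freqs.items(),
                    key=lambda kv: (-kv[1], len(kv[0]), kv[0]))
    chosen, used = [], 0
    for word, _count in ranked:
        cost = len(word) + 1  # hypothetical cost model
        if used + cost > budget:
            break  # the dictionary has reached the desired size
        chosen.append(word)
        used += cost
    return chosen


counts = {"the": 10, "of": 10, "a": 10, "zebra": 1}
selected = select_words(counts, budget=9)
```
.fi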
.TP
.B \-3
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first. Words are then shuffled back and forth between the
`keep' and `discard' lists to find the `optimal' set of words that
should be in the dictionary.
.TP
.B \-H
This specifies that novel words will be coded character by character
using Huffman codes.
.TP
.B \-B
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using binary codes. The binary code represents their occurrence
position in the auxiliary dictionary.
.TP
.B \-D
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using delta codes. The delta code represents their occurrence position
in the auxiliary dictionary.
.TP
.B \-Y
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using a combination of gamma and binary codes. The code represents
their occurrence position in the auxiliary dictionary. This generally
produces better compression than
.B \-B
or
.BR \-D .
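.IP
The gamma and delta codes mentioned above are variable-length codes for positive integers. The Python sketch below shows the standard Elias gamma and delta codes as bit strings. Whether the compressor uses exactly these bit conventions, or numbers auxiliary dictionary positions from 1, is an assumption; the function names are illustrative only.
.nf
```python
def gamma(n):
    """Elias gamma code of n (n >= 1) as a string of bits.

    The code is (bit length - 1) zeros followed by n in binary.
    """
    if n < 1:
        raise ValueError("gamma codes are defined for n >= 1")
    bits = format(n, "b")
    return "0" * (len(bits) - 1) + bits


def delta(n):
    """Elias delta code of n (n >= 1) as a string of bits.

    The bit length of n is gamma coded, then the bits of n follow
    without their leading 1.
    """
    if n < 1:
        raise ValueError("delta codes are defined for n >= 1")
    bits = format(n, "b")
    return gamma(len(bits)) + bits[1:]


codes = [(n, gamma(n), delta(n)) for n in (1, 5, 9)]
```
.fi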
.TP
.BI \-l " lookback"
The generated dictionary is designed to be front coded when it is
loaded into memory. Under normal circumstances, a front-coded
dictionary would require scanning from the beginning in order to find
any particular word. However, every
.I lookback
words in the dictionary, the whole word is stored and a pointer to that
word maintained. E.g., if
.I lookback
is 4, then every fourth word is stored in its entirety.
.TP
.BI \-k " mem"
This limits the amount of memory to use for the generated
dictionary. Words are selected for the dictionary based on the text
statistics, and whether
.BR \-0 , " \-1" , " \-2"
or
.B \-3
is specified. The memory is calculated assuming a lookback of 0,
irrespective of what actual lookback is specified. This means that if
a non-zero lookback is given, the dictionary will actually occupy
less space than specified by
.BR \-k .
.TP
.BI \-d " directory"
This specifies the directory where the document collection can be found.
.TP
.BI \-f " name"
This specifies the base name of the document collection.
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable exists, then its value is used as the
default directory where the mgpp
collection files are. If this variable does not exist, then the
directory \*(lq\fB.\fP\*(rq is used by default. The command line
option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
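.PP
The precedence described above can be summarised with a short Python sketch. The function and parameter names are hypothetical and only illustrate the rule; they are not part of the program.
.nf
```python
import os


def collection_dir(cli_directory=None, environ=None):
    """Return the directory to search for collection files.

    cli_directory stands for the value given with the directory
    option, or None when that option was not given.
    """
    environ = os.environ if environ is None else environ
    if cli_directory is not None:
        return cli_directory           # the option overrides MGDATA
    return environ.get("MGDATA", ".")  # fall back to "."


chosen = collection_dir(None, {"MGDATA": "/data/collections"})
```
.fi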
.SH FILES
.TP 20
.B *.text.stats
Statistics about the source text.
.TP
.B *.text.dict
Compression dictionary for the source text.
.SH "SEE ALSO"
.na
.BR mgpp_fast_comp_dict (1),
.BR mgpp_invf_dict (1),
.BR mgpp_passes (1),
.BR mgpp_perf_hash_build (1),
.BR mgpp_stem_idx (1),
.BR mgpp_weights_build (1)