.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.\"------------------------------------------------------------
.TH mgpp_compression_dict 1 \*(Dt CITRI
.SH NAME
mgpp_compression_dict \- build a compression dictionary.
.SH SYNOPSIS
.B mgpp_compression_dict
[
.B \-h
]
[
.BR \-C " |"
.BR \-P " |"
.B \-S
]
.if n .ti +9n
[
.BR \-0 " |"
.BR \-1 " |"
.BR \-2 " |"
.B \-3
]
[
.BR \-H " |"
.BR \-B " |"
.BR \-D " |"
.B \-Y
]
.if n .ti +9n
.if t .ti +.5i
[
.BI \-l " lookback"
]
[
.BI \-k " mem"
]
[
.BI \-d " directory"
]
.BI \-f " name"
.SH DESCRIPTION
.B mgpp_compression_dict
builds a compression dictionary based on the statistics gathered
during the first pass over the text. The options to the program are
mainly concerned with limiting the amount of memory the dictionary
will use and with how the text compressor will cope with any novel
words found during the compression phase.
.SH OPTIONS
Options may appear in any order.
.TP "\w'\fB\-d\fP \fIdirectory\fP'u+2n"
.B \-h
This displays a usage line on
.IR stderr .
.TP
.B \-C
Build a complete dictionary from the statistics file. If during the
text compression phase a novel word is found, then the compressor will
produce an error message and stop.
.TP
.B \-P
Build a partial dictionary from the statistics file. This dictionary
assumes that the statistics file is based on the entire text. The
statistics of words not included in the dictionary are used to
calculate the escape probability. If novel words are being coded
character by character, then there may not be a Huffman code for every
possible character. This means that the compressor may fail if a novel
word contains a novel character.
.TP
.B \-S
Build a seed dictionary from the statistics file. This dictionary
assumes that the statistics file is based on only a portion of the
text to be compressed. The probability of a novel word is based on the
number of words that have only occurred once. If novel words are being
coded character by character, then the Huffman codes for characters are
based on the frequency of characters in the dictionary.
.TP
.B \-0
All words from the statistics file are included in the built
dictionary.
.TP
.B \-1
Words are included in the dictionary until the dictionary reaches the
desired size. Words are selected for the dictionary in the order in
which they occurred in the source text.
.TP
.B \-2
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first.
.TP
.B \-3
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first. Words are then shuffled back and forth between the
`keep' and `discard' lists to find the `optimal' set of words that
should be in the dictionary.
.TP
.B \-H
This specifies that novel words will be coded character by character
using Huffman codes.
.TP
.B \-B
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using binary codes. The binary code represents their occurrence
position in the auxiliary dictionary.
.TP
.B \-D
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using delta codes. The delta code represents their occurrence position
in the auxiliary dictionary.
.TP
.B \-Y
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using a combination of gamma and binary codes. The code represents
their occurrence position in the auxiliary dictionary. This generally
produces better compression than
.B \-B
or
.BR \-D .
.TP
.BI \-l " lookback"
The generated dictionary is designed to be front coded when it is
loaded into memory. Under normal circumstances, a front-coded
dictionary would require scanning from the beginning in order to find
any particular word. However, every
.I lookback
words in the dictionary, the whole word is stored and a pointer to that
word is maintained. For example, if
.I lookback
is 4, then every fourth word is stored in its entirety.
.TP
.BI \-k " mem"
This limits the amount of memory to use for the generated
dictionary. Words are selected for the dictionary based on the text
statistics and on whether
.BR \-0 , " \-1" , " \-2"
or
.B \-3
is specified. The memory is calculated assuming a lookback of 0,
irrespective of the lookback actually specified. This means that if
a non-zero lookback is given, the dictionary will actually occupy
less space than specified by
.BR \-k .
.TP
.BI \-d " directory"
This specifies the directory where the document collection can be found.
.TP
.BI \-f " name"
This specifies the base name of the document collection.
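.PP
The front coding described under
.B \-l
can be illustrated with a minimal sketch.
It is not the mgpp implementation and says nothing about the on-disk
format; the function names, the (prefix length, suffix) tuple layout
and the sample word list are hypothetical and chosen for illustration
only.
The sketch assumes a sorted word list and a lookback of at least 1.
.PP
.nf
```python
from os.path import commonprefix


def front_code(words, lookback):
    """Front code a sorted word list (illustrative layout only).

    Each entry is (prefix_len, suffix). Every lookback-th entry has
    prefix_len 0, which means the word is stored in full.
    Assumes lookback >= 1.
    """
    coded = []
    prev = ""
    for i, word in enumerate(words):
        if i % lookback == 0:
            shared = 0  # restart point: keep the whole word
        else:
            shared = len(commonprefix([prev, word]))
        coded.append((shared, word[shared:]))
        prev = word
    return coded


def lookup(coded, lookback, index):
    """Decode the word at index, starting at the nearest full word."""
    start = index - index % lookback
    word = ""
    for shared, suffix in coded[start:index + 1]:
        word = word[:shared] + suffix
    return word


words = ["compress", "compressed", "compression", "compressor",
         "dictionaries", "dictionary"]
coded = front_code(words, 4)
```
.fi
.PP
In this sketch, looking up index 5 with a lookback of 4 starts decoding
at entry 4, the nearest word stored in full, rather than at entry 0.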
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable exists, then its value is used as the
default directory where the mgpp
collection files are. If this variable does not exist, then the
directory \*(lq\fB.\fP\*(rq is used by default. The command line
option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
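.PP
The precedence between
.BR \-d ,
.B MGDATA
and the current directory can be summarised with a short sketch.
The function name and its parameters are hypothetical; this only
restates the rule given above and is not taken from the mgpp source.
.PP
.nf
```python
import os


def collection_dir(cli_directory=None, environ=None):
    """Return the directory to search for collection files.

    Precedence: the -d argument, then MGDATA, then ".".
    """
    if environ is None:
        environ = os.environ
    if cli_directory is not None:
        return cli_directory  # -d overrides MGDATA
    return environ.get("MGDATA", ".")
```
.fi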
.SH FILES
.TP 20
.B *.text.stats
Statistics about the source text.
.TP
.B *.text.dict
Compression dictionary for the source text.
.SH "SEE ALSO"
.na
.BR mgpp_fast_comp_dict (1),
.BR mgpp_invf_dict (1),
.BR mgpp_passes (1),
.BR mgpp_perf_hash_build (1),
.BR mgpp_stem_idx (1),
.BR mgpp_weights_build (1)