.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.Id $Id: mg_compression_dict.1 3745 2003-02-20 21:20:24Z mdewsnip $
.\"------------------------------------------------------------
.TH mg_compression_dict 1 \*(Dt CITRI
.SH NAME
mg_compression_dict \- build a compression dictionary.
.SH SYNOPSIS
.B mg_compression_dict
[
.B \-h
]
[
.BR \-C " |"
.BR \-P " |"
.B \-S
]
.if n .ti +9n
[
.BR \-0 " |"
.BR \-1 " |"
.BR \-2 " |"
.B \-3
]
[
.BR \-H " |"
.BR \-B " |"
.BR \-D " |"
.BR \-Y " |"
.B \-M
]
.if n .ti +9n
.if t .ti +.5i
[
.BI \-l " lookback"
]
[
.BI \-k " mem"
]
[
.BI \-d " directory"
]
.BI \-f " name"
.SH DESCRIPTION
.B mg_compression_dict
builds a compression dictionary based on the statistics gathered
during the first pass over the text. The options to the program are
mainly concerned with limiting the amount of memory the dictionary
will use and with how the text compressor will cope with any novel
words found during the compression phase.
.SH OPTIONS
Options may appear in any order.
.TP "\w'\fB\-d\fP \fIdirectory\fP'u+2n"
.B \-h
This displays a usage line on
.IR stderr .
.TP
.B \-C
Build a complete dictionary from the statistics file. If during the
text compression phase a novel word is found, then the compressor will
produce an error message and stop.
.TP
.B \-P
Build a partial dictionary from the statistics file. This dictionary
assumes that the statistics file is based on the entire text. The
statistics of words not included in the dictionary are used to
calculate the escape probability. If novel words are being coded
character by character, then there may not be a Huffman code for every
possible character. This means that the compressor may fail if a novel
word contains a novel character.
.TP
.B \-S
Build a seed dictionary from the statistics file. This dictionary
assumes that the statistics file is based on only a portion of the
text to be compressed. The probability of a novel word is based on the
number of words that have only occurred once. If novel words are being
coded character by character, then the Huffman codes for characters are
based on the frequency of characters in the dictionary.
.TP
.B \-0
All words from the statistics file are included in the built
dictionary.
.TP
.B \-1
Words are included in the dictionary until the dictionary reaches the
desired size. Words are selected for the dictionary in the order in
which they occurred in the source text.
.TP
.B \-2
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first.
.TP
.B \-3
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first. Words are then shuffled back and forth between the
`keep' and `discard' lists to find the `optimal' set of words that
should be in the dictionary.
.TP
.B \-H
This specifies that novel words will be coded character by character
using Huffman codes.
.TP
.B \-B
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using binary codes. The binary code represents their occurrence
position in the auxiliary dictionary.
.TP
.B \-D
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using delta codes. The delta code represents their occurrence position
in the auxiliary dictionary.
.TP
.B \-Y
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using a combination of gamma and binary codes. The code represents
their occurrence position in the auxiliary dictionary. This generally
produces better compression than
.B \-B
or
.BR \-D .
.TP
.B \-M
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using a combination of gamma and binary codes. The code represents
their occurrence position in the auxiliary dictionary. This method is
adaptive within documents, and generally produces better compression
than
.BR \-B ,
.B \-D
or
.BR \-Y .
.TP
.BI \-l " lookback"
The generated dictionary is designed to be front coded when it is
loaded into memory. Under normal circumstances, a front-coded
dictionary would require scanning from the beginning in order to find
any particular word. However, once in every
.I lookback
words in the dictionary, the whole word is stored and a pointer to that
word is maintained. E.g., if
.I lookback
is 4, then every fourth word is stored in its entirety.
.TP
.BI \-k " mem"
This limits the amount of memory to use for the generated
dictionary. Words are selected for the dictionary based on the text
statistics, and whether
.BR \-0 , " \-1" , " \-2"
or
.B \-3
is specified. The memory is calculated assuming a lookback of 0,
irrespective of what actual lookback is specified. This means that if
a non-zero lookback is given, the dictionary will actually occupy
less space than specified by
.BR \-k .
.TP
.BI \-d " directory"
This specifies the directory where the document collection can be found.
.TP
.BI \-f " name"
This specifies the base name of the document collection.
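.PP
The front coding described under
.B \-l
can be illustrated with a small Python sketch. It is not the code used by
.BR mg_compression_dict ;
the function names, the in-memory layout and the treatment of
.I lookback
as a positive block size (as in the ``every fourth word'' example) are
assumptions made for illustration only.
.PP
.nf
```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_code(words, lookback):
    """Front code a sorted word list (hypothetical layout).

    Each entry is (shared_prefix_length, suffix). Every lookback-th
    word is stored whole and its index is recorded as a pointer.
    Assumes lookback >= 1.
    """
    entries, pointers = [], []
    prev = ""
    for i, word in enumerate(words):
        if i % lookback == 0:
            pointers.append(i)            # whole word: a restart point
            entries.append((0, word))
        else:
            k = common_prefix_len(prev, word)
            entries.append((k, word[k:]))
        prev = word
    return entries, pointers


def lookup(entries, lookback, index):
    """Rebuild the word at index, scanning only from the nearest
    preceding whole word instead of from the start of the list."""
    start = (index // lookback) * lookback
    word = entries[start][1]
    for i in range(start + 1, index + 1):
        k, suffix = entries[i]
        word = word[:k] + suffix
    return word


words = ["compress", "compressed", "compression",
         "compressor", "dict", "dictionary"]
entries, pointers = front_code(words, 4)
```
.fi
.PP
With a block size of 4, recovering any word needs at most three suffix
steps after the nearest whole word, at the cost of storing one word in
four in full.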
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable exists, then its value is used as the
default directory where the
.BR mg (1)
collection files are. If this variable does not exist, then the
directory \*(lq\fB.\fP\*(rq is used by default. The command line
option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
.SH FILES
.TP 20
.B *.text.stats
Statistics about the source text.
.TP
.B *.text.dict
Compression dictionary for the source text.
.SH "SEE ALSO"
.na
.BR mg (1),
.BR mg_fast_comp_dict (1),
.BR mg_get (1),
.BR mg_invf_dict (1),
.BR mg_invf_dump (1),
.BR mg_invf_rebuild (1),
.BR mg_passes (1),
.BR mg_perf_hash_build (1),
.BR mg_text_estimate (1),
.BR mg_weights_build (1),
.BR mgbilevel (1),
.BR mgbuild (1),
.BR mgdictlist (1),
.BR mgfelics (1),
.BR mgquery (1),
.BR mgstat (1),
.BR mgtic (1),
.BR mgticbuild (1),
.BR mgticdump (1),
.BR mgticprune (1),
.BR mgticstat (1).