.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.\"------------------------------------------------------------
.TH mgpp_compression_dict 1 \*(Dt CITRI
.SH NAME
mgpp_compression_dict \- build a compression dictionary.
.SH SYNOPSIS
.B mgpp_compression_dict
[
.B \-h
]
[
.BR \-C " |"
.BR \-P " |"
.B \-S
]
.if n .ti +9n
[
.BR \-0 " |"
.BR \-1 " |"
.BR \-2 " |"
.B \-3
]
[
.BR \-H " |"
.BR \-B " |"
.BR \-D " |"
.B \-Y
]
.if n .ti +9n
.if t .ti +.5i
[
.BI \-l " lookback"
]
[
.BI \-k " mem"
]
[
.BI \-d " directory"
]
.BI \-f " name"
.SH DESCRIPTION
.B mgpp_compression_dict
builds a compression dictionary based on the statistics gathered
during the first pass over the text. The options to the program are
mainly concerned with limiting the amount of memory the dictionary
will use and with how the text compressor will cope with any novel
words found during the compression phase.
.SH OPTIONS
Options may appear in any order.
.TP "\w'\fB\-d\fP \fIdirectory\fP'u+2n"
.B \-h
This displays a usage line on
.IR stderr .
.TP
.B \-C
Build a complete dictionary from the statistics file. If a novel word
is found during the text compression phase, the compressor will
produce an error message and stop.
.TP
.B \-P
Build a partial dictionary from the statistics file. This dictionary
assumes that the statistics file is based on the entire text. The
statistics of words not included in the dictionary are used to
calculate the escape probability. If novel words are being coded
character by character, then there may not be a Huffman code for every
possible character. This means that the compressor may fail if a novel
word contains a novel character.
.TP
.B \-S
Build a seed dictionary from the statistics file. This dictionary
assumes that the statistics file is based on only a portion of the
text to be compressed. The probability of a novel word is based on the
number of words that have only occurred once. If novel words are being
coded character by character, then the Huffman codes for characters are
based on the frequency of characters in the dictionary.
.TP
.B \-0
All words from the statistics file are included in the built
dictionary.
.TP
.B \-1
Words are included in the dictionary until the dictionary reaches the
desired size. Words are selected for the dictionary in the order in
which they occurred in the source text.
.TP
.B \-2
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first.
.TP
.B \-3
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first. Words are then shuffled back and forth between the
`keep' and `discard' lists to find the `optimal' set of words that
should be in the dictionary.
.TP
.B \-H
This specifies that novel words will be coded character by character
using Huffman codes.
.TP
.B \-B
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using binary codes. The binary code represents their occurrence
position in the auxiliary dictionary.
.TP
.B \-D
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using delta codes. The delta code represents their occurrence position
in the auxiliary dictionary.
.TP
.B \-Y
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using a combination of gamma and binary codes. The code represents
their occurrence position in the auxiliary dictionary. This generally
produces better compression than
.B \-B
or
.BR \-D .
.TP
.BI \-l " lookback"
The generated dictionary is designed to be front coded when it is
loaded into memory. Under normal circumstances, a front-coded
dictionary would require scanning from the beginning in order to find
any particular word. However, every
.I lookback
words in the dictionary, the whole word is stored and a pointer to that
word is maintained. E.g., if
.I lookback
is 4, then every fourth word is stored in its entirety.
.TP
.BI \-k " mem"
This limits the amount of memory to use for the generated
dictionary. Words are selected for the dictionary based on the text
statistics, and on whether
.BR \-0 , " \-1" , " \-2"
or
.B \-3
is specified. The memory is calculated assuming a lookback of 0,
irrespective of what actual lookback is specified. This means that if
a non-zero lookback is given, the dictionary will actually occupy
less space than specified by
.BR \-k .
.TP
.BI \-d " directory"
This specifies the directory where the document collection can be found.
.TP
.BI \-f " name"
This specifies the base name of the document collection.
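.PP
As a conceptual illustration of the front coding described under
.BR \-l ,
the table below uses an invented word list and a
.I lookback
of 4. Each entry records how many leading characters it shares with
the previous word, followed by the remaining characters. This is a
sketch of the idea only; the layout actually used by the program is
not described on this page.
.PP
.RS
.nf
word          shared   stored
compress      0        compress
compressed    8        ed
compression   8        ion
compressor    8        or
compute       0        compute
.fi
.RE
.PP
The first and fifth words are stored in full, so a lookup can start
from either of them instead of scanning from the beginning. The fifth
word shares a prefix with its predecessor but is still stored whole,
because it falls on a lookback boundary.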
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable exists, then its value is used as the
default directory where the mgpp
collection files are. If this variable does not exist, then the
directory \*(lq\fB.\fP\*(rq is used by default. The command line
option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
.SH FILES
.TP 20
.B *.text.stats
Statistics about the source text.
.TP
.B *.text.dict
Compression dictionary for the source text.
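.SH EXAMPLES
The following invocation is an illustrative sketch that combines only
the options described above. The collection name
.I demo
and the directory
.I /tmp/collection
are hypothetical, and the value given to
.B \-k
is a placeholder, since this page does not state its unit. The
statistics file for the collection is assumed to exist already.
.PP
.RS
.nf
mgpp_compression_dict \-S \-2 \-H \-k 1024 \-d /tmp/collection \-f demo
.fi
.RE
.PP
Under these assumptions, the command would build a seed dictionary
.RB ( \-S ),
select the most frequent words until the memory limit is reached
.RB ( \-2 " and " \-k ),
and arrange for novel words to be coded character by character with
Huffman codes
.RB ( \-H ).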
.SH "SEE ALSO"
.na
.BR mgpp_fast_comp_dict (1),
.BR mgpp_invf_dict (1),
.BR mgpp_passes (1),
.BR mgpp_perf_hash_build (1),
.BR mgpp_stem_idx (1),
.BR mgpp_weights_build (1)