mgpp_compression_dict.1@ 24540

Last change on this file since 24540 was 3365, checked in by kjdon, 22 years ago
Initial revision
Property svn:keywords set to `Author Date Id Revision`
File size: 5.8 KB

.\"------------------------------------------------------------
.\" Id - set Rv (revision) and Dt (date) using the RCS Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.\"------------------------------------------------------------
.TH mgpp_compression_dict 1 \*(Dt CITRI
.SH NAME
mgpp_compression_dict \- build a compression dictionary.
.SH SYNOPSIS
.B mgpp_compression_dict
[
.B \-h
]
[
.BR \-C " \|"
.BR \-P " \|"
.B \-S
]
.if n .ti +9n
[
.BR \-0 " \|"
.BR \-1 " \|"
.BR \-2 " \|"
.B \-3
]
[
.BR \-H " \|"
.BR \-B " \|"
.BR \-D " \|"
.B \-Y
]
.if n .ti +9n
.if t .ti +.5i
[
.BI \-l " lookback"
]
[
.BI \-k " mem"
]
[
.BI \-d " directory"
]
.BI \-f " name"
.SH DESCRIPTION
.B mgpp_compression_dict
builds a compression dictionary based on the statistics gathered
during the first pass over the text. The options to the program are
mainly concerned with limiting the amount of memory the dictionary
will use and with how the text compressor will cope with any novel
words found during the compression phase.
.SH OPTIONS
Options may appear in any order.
.TP "\w'\fB\-d\fP \fIdirectory\fP'u+2n"
.B \-h
This displays a usage line on
.IR stderr .
.TP
.B \-C
Build a complete dictionary from the statistics file. If during the
text compression phase a novel word is found, then the compressor will
produce an error message and stop.
.TP
.B \-P
Build a partial dictionary from the statistics file. This dictionary
assumes that the statistics file is based on the entire text. The
statistics of words not included in the dictionary are used to
calculate the escape probability. If novel words are being coded
character by character, then there may not be a Huffman code for every
possible character. This means that the compressor may fail if a novel
word contains a novel character.
.TP
.B \-S
Build a seed dictionary from the statistics file. This dictionary
assumes that the statistics file is based on only a portion of the
text to be compressed. The probability of a novel word is based on the
number of words that have only occurred once. If novel words are being
coded character by character, then the Huffman codes for characters are
based on the frequency of characters in the dictionary.
.TP
.B \-0
All words from the statistics file are included in the built
dictionary.
.TP
.B \-1
Words are included in the dictionary until the dictionary reaches the
desired size. Words are selected for the dictionary in the order in
which they occurred in the source text.
.TP
.B \-2
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first.
.TP
.B \-3
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first. Words are then shuffled back and forth between the
`keep' and `discard' lists to find the `optimal' set of words that
should be in the dictionary.
.TP
.B \-H
This specifies that novel words will be coded character by character
using Huffman codes.
.TP
.B \-B
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using binary codes. The binary code represents the word's position
in the auxiliary dictionary.
.TP
.B \-D
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using delta codes. The delta code represents the word's position
in the auxiliary dictionary.
.TP
.B \-Y
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using a combination of gamma and binary codes. The code represents
the word's position in the auxiliary dictionary. This generally
produces better compression than
.B \-B
or
.BR \-D .
.TP
.BI \-l " lookback"
The generated dictionary is designed to be front coded when it is
loaded into memory. Under normal circumstances, a front-coded
dictionary would require scanning from the beginning in order to find
any particular word. However, every
.I lookback
words in the dictionary, the whole word is stored and a pointer to that
word is maintained. E.g., if
.I lookback
is 4, then every fourth word is stored in its entirety.
.TP
.BI \-k " mem"
This limits the amount of memory to use for the generated
dictionary. Words are selected for the dictionary based on the text
statistics, and on whether
.BR \-0 , " \-1" , " \-2"
or
.B \-3
is specified. The memory is calculated assuming a lookback of 0,
irrespective of the lookback actually specified. This means that if
a non-zero lookback is given, the dictionary will actually occupy
less space than specified by
.BR \-k .
.TP
.BI \-d " directory"
This specifies the directory where the document collection can be found.
.TP
.BI \-f " name"
This specifies the base name of the document collection.
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable exists, then its value is used as the
default directory where the mgpp
collection files are located. If this variable does not exist, then the
directory \(lq\fB.\fP\(rq is used by default. The command line
option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
.SH FILES
.TP 20
.B *.text.stats
Statistics about the source text.
.TP
.B *.text.dict
Compression dictionary for the source text.
.SH "SEE ALSO"
.na
.BR mgpp_fast_comp_dict (1),
.BR mgpp_invf_dict (1),
.BR mgpp_passes (1),
.BR mgpp_perf_hash_build (1),
.BR mgpp_stem_idx (1),
.BR mgpp_weights_build (1)
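
The lookback scheme described under -l can be illustrated with a minimal Python sketch. It is not the mgpp implementation: the function names, the (shared-prefix length, suffix) entry layout and the sample word list are all assumptions made for illustration, and lookback is assumed to be at least 1.

```python
from os.path import commonprefix


def front_code(words, lookback):
    """Front-code a sorted word list, storing every `lookback`-th word in full.

    Each entry is (shared_prefix_length, suffix). This layout is an
    assumption for illustration, not the on-disk mgpp format.
    """
    entries = []
    prev = ""
    for i, word in enumerate(words):
        if i % lookback == 0:
            # Block start: the whole word is stored, with no shared prefix.
            entries.append((0, word))
        else:
            shared = len(commonprefix([prev, word]))
            entries.append((shared, word[shared:]))
        prev = word
    return entries


def lookup(entries, lookback, index):
    """Recover word `index` by decoding forward from the nearest whole word."""
    start = index - index % lookback
    word = entries[start][1]
    for shared, suffix in entries[start + 1:index + 1]:
        word = word[:shared] + suffix
    return word


# Hypothetical sample data.
words = sorted(["compress", "compressed", "compression",
                "compressor", "dictionary", "dictionaries"])
entries = front_code(words, 4)
print(entries)
print(lookup(entries, 4, 3))
```

With a lookback of 4, a lookup decodes at most three entries after a whole word instead of scanning from the start of the dictionary. A larger lookback stores fewer whole words but makes each lookup scan further.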

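
The selection rule described under -2 (most frequent words first, shortest word first on a tie, until the size limit is reached) might be sketched as follows. The per-word cost of "length plus one byte" and the rule of stopping at the first word that does not fit are assumptions, not details taken from the mgpp source; `select_words` and the sample frequencies are hypothetical.

```python
def select_words(freqs, budget_bytes):
    """Pick words by descending frequency, shortest first on ties,
    until an assumed byte budget would be exceeded."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], len(kv[0])))
    chosen, used = [], 0
    for word, _count in ranked:
        cost = len(word) + 1  # assumed cost: the characters plus one length byte
        if used + cost > budget_bytes:
            break  # assumed stopping rule: stop at the first word that does not fit
        chosen.append(word)
        used += cost
    return chosen


# Hypothetical statistics: word -> occurrence count.
freqs = {"the": 50, "of": 50, "compression": 7, "dictionary": 7, "zebra": 1}
print(select_words(freqs, 20))
```

Words left out by such a budget are the ones the compressor would later treat as novel, which is where the -H, -B, -D and -Y options come into play.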