mg_compression_dict.1@ 1847

Last change on this file since 1847 was 856, checked in by sjboddie, 24 years ago
Rodgers new C++ mg
Property svn:executable set to *
Property svn:keywords set to Author Date Id Revision
File size: 5.8 KB

.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.Id $Id: mg_compression_dict.1 856 2000-01-14 02:26:25Z sjboddie $
.\"------------------------------------------------------------
.TH mg_compression_dict 1 \*(Dt CITRI
.SH NAME
mg_compression_dict \- build a compression dictionary.
.SH SYNOPSIS
.B mg_compression_dict
[
.B \-h
]
[
.BR \-C " \|"
.BR \-P " \|"
.B \-S
]
.if n .ti +9n
[
.BR \-0 " \|"
.BR \-1 " \|"
.BR \-2 " \|"
.B \-3
]
[
.BR \-H " \|"
.BR \-B " \|"
.BR \-D " \|"
.B \-Y
]
.if n .ti +9n
.if t .ti +.5i
[
.BI \-l " lookback"
]
[
.BI \-k " mem"
]
[
.BI \-d " directory"
]
.BI \-f " name"
.SH DESCRIPTION
.B mg_compression_dict
builds a compression dictionary based on the statistics gathered
during the first pass over the text. The options to the program are
mainly concerned with limiting the amount of memory the dictionary
will use and with how the text compressor will cope with any novel
words found during the compression phase.
.SH OPTIONS
Options may appear in any order.
.TP "\w'\fB\-d\fP \fIdirectory\fP'u+2n"
.B \-h
This displays a usage line on
.IR stderr .
.TP
.B \-C
Build a complete dictionary from the statistics file. If during the
text compression phase a novel word is found, then the compressor will
produce an error message and stop.
.TP
.B \-P
Build a partial dictionary from the statistics file. This dictionary
assumes that the statistics file is based on the entire text. The
statistics of words not included in the dictionary are used to
calculate the escape probability. If novel words are being coded
character by character, then there may not be a Huffman code for every
possible character. This means that the compressor may fail if a novel
word contains a novel character.
.TP
.B \-S
Build a seed dictionary from the statistics file. This dictionary
assumes that the statistics file is based on only a portion of the
text to be compressed. The probability of a novel word is based on the
number of words that have only occurred once. If novel words are being
coded character by character, then the Huffman codes for characters are
based on the frequency of characters in the dictionary.
.TP
.B \-0
All words from the statistics file are included in the built
dictionary.
.TP
.B \-1
Words are included in the dictionary until the dictionary reaches the
desired size. Words are selected for the dictionary in the order in
which they occurred in the source text.
.TP
.B \-2
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first.
.TP
.B \-3
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first. Words are then shuffled back and forth between the
`keep' and `discard' lists to find the `optimal' set of words that
should be in the dictionary.
.TP
.B \-H
This specifies that novel words will be coded character by character
using Huffman codes.
.TP
.B \-B
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using binary codes. The binary code represents their occurrence
position in the auxiliary dictionary.
.TP
.B \-D
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using delta codes. The delta code represents their occurrence position
in the auxiliary dictionary.
.TP
.B \-Y
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using a combination of gamma and binary codes. The code represents
their occurrence position in the auxiliary dictionary. This generally
produces better compression than
.B \-B
or
.BR \-D .
.TP
.BI \-l " lookback"
The generated dictionary is designed to be front coded when it is
loaded into memory. Under normal circumstances, a front-coded
dictionary would require scanning from the beginning in order to find
any particular word. However, every
.I lookback
words in the dictionary, the whole word is stored and a pointer to that
word maintained. E.g., if
.I lookback
is 4, then every fourth word is stored in its entirety.
.TP
.BI \-k " mem"
This limits the amount of memory to use for the generated
dictionary. Words are selected for the dictionary based on the text
statistics, and whether
.BR \-0 , " \-1" , " \-2"
or
.B \-3
is specified. The memory is calculated assuming a lookback of 0,
irrespective of what actual lookback is specified. This means that if
a non-zero lookback is given, the dictionary will actually occupy
less space than specified by
.BR \-k .
.TP
.BI \-d " directory"
This specifies the directory where the document collection can be found.
.TP
.BI \-f " name"
This specifies the base name of the document collection.
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable exists, then its value is used as the
default directory where the mg
collection files are. If this variable does not exist, then the
directory \(lq\fB.\fP\(rq is used by default. The command line
option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
.SH FILES
.TP 20
.B *.text.stats
Statistics about the source text.
.TP
.B *.text.dict
Compression dictionary for the source text.
.SH "SEE ALSO"
.na
.BR mg_fast_comp_dict (1),
.BR mg_invf_dict (1),
.BR mg_passes (1),
.BR mg_perf_hash_build (1),
.BR mg_stem_idx (1),
.BR mg_weights_build (1)
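
The description of the -l option may be easier to follow with a small worked sketch. The Python below is only an illustration of front coding with periodic whole-word entries, not the mg implementation: the function names front_code and lookup, the (shared, suffix) tuple layout, and the requirement that lookback be at least 1 are all assumptions made for this example. How the real program treats a lookback of 0 is not modelled.

```python
def shared_prefix_len(a, b):
    """Return the number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_code(words, lookback):
    """Front-code a sorted word list (hypothetical layout, for illustration).

    Each entry is (shared, suffix): `shared` characters are copied from the
    previous word and `suffix` is appended. Every `lookback`-th word is
    stored whole, i.e. with shared == 0.
    """
    if lookback < 1:
        raise ValueError("this sketch assumes lookback >= 1")
    entries = []
    prev = ""
    for i, word in enumerate(words):
        # Whole-word entries act as restart points for decoding.
        shared = 0 if i % lookback == 0 else shared_prefix_len(prev, word)
        entries.append((shared, word[shared:]))
        prev = word
    return entries


def lookup(entries, lookback, index):
    """Decode the word at `index`, scanning only from the nearest whole word."""
    start = index - index % lookback
    word = ""
    for shared, suffix in entries[start:index + 1]:
        word = word[:shared] + suffix
    return word


if __name__ == "__main__":
    words = ["compress", "compression", "compressor",
             "dictionaries", "dictionary"]
    entries = front_code(words, 4)
    for i, entry in enumerate(entries):
        print(i, entry, lookup(entries, 4, i))
```

With a lookback of 4, decoding any word touches at most four entries, rather than every entry from the start of the dictionary. A smaller lookback shortens that scan but stores more words in full.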
