mgpp_compression_dict.1@ 3365

Last change on this file since 3365 was 3365, checked in by kjdon, 22 years ago
Initial revision
Property svn:keywords set to `Author Date Id Revision`
File size: 5.8 KB

.\"------------------------------------------------------------
.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
.de Id
.ds Rv \\$3
.ds Dt \\$4
..
.\"------------------------------------------------------------
.TH mgpp_compression_dict 1 \*(Dt CITRI
.SH NAME
mgpp_compression_dict \- build a compression dictionary
.SH SYNOPSIS
.B mgpp_compression_dict
[
.B \-h
]
[
.BR \-C " \|"
.BR \-P " \|"
.B \-S
]
.if n .ti +9n
[
.BR \-0 " \|"
.BR \-1 " \|"
.BR \-2 " \|"
.B \-3
]
[
.BR \-H " \|"
.BR \-B " \|"
.BR \-D " \|"
.B \-Y
]
.if n .ti +9n
.if t .ti +.5i
[
.BI \-l " lookback"
]
[
.BI \-k " mem"
]
[
.BI \-d " directory"
]
.BI \-f " name"
.SH DESCRIPTION
.B mgpp_compression_dict
builds a compression dictionary based on the statistics gathered
during the first pass over the text. The options to the program are
mainly concerned with limiting the amount of memory the dictionary
will use and with how the text compressor will cope with any novel
words found during the compression phase.
.SH OPTIONS
Options may appear in any order.
.TP "\w'\fB\-d\fP \fIdirectory\fP'u+2n"
.B \-h
This displays a usage line on
.IR stderr .
.TP
.B \-C
Build a complete dictionary from the statistics file. If during the
text compression phase a novel word is found, then the compressor will
produce an error message and stop.
.TP
.B \-P
Build a partial dictionary from the statistics file. This dictionary
assumes that the statistics file is based on the entire text. The
statistics of words not included in the dictionary are used to
calculate the escape probability. If novel words are being coded
character by character, then there may not be a Huffman code for every
possible character. This means that the compressor may fail if a novel
word contains a novel character.
.TP
.B \-S
Build a seed dictionary from the statistics file. This dictionary
assumes that the statistics file is based on only a portion of the
text to be compressed. The probability of a novel word is based on the
number of words that have occurred only once. If novel words are being
coded character by character, then the Huffman codes for characters are
based on the frequency of characters in the dictionary.
.TP
.B \-0
All words from the statistics file are included in the built
dictionary.
.TP
.B \-1
Words are included in the dictionary until the dictionary reaches the
desired size. Words are selected for the dictionary in the order in
which they occurred in the source text.
.TP
.B \-2
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first.
.TP
.B \-3
Words are included in the dictionary until the dictionary reaches the
desired size. The most frequent words are included in the dictionary
first; where there is a tie for frequency, the shortest word is
included first. Words are then shuffled back and forth between the
`keep' and `discard' lists to find the `optimal' set of words that
should be in the dictionary.
.TP
.B \-H
This specifies that novel words will be coded character by character
using Huffman codes.
.TP
.B \-B
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using binary codes. The binary code represents their occurrence
position in the auxiliary dictionary.
.TP
.B \-D
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using delta codes. The delta code represents their occurrence position
in the auxiliary dictionary.
.TP
.B \-Y
This specifies that an auxiliary dictionary will be built by the
compressor. Each novel word found will be placed at the end of the
auxiliary dictionary. Novel words will be coded in the compressed text
using a combination of gamma and binary codes. The code represents
their occurrence position in the auxiliary dictionary. This generally
produces better compression than
.B \-B
or
.BR \-D .
.TP
.BI \-l " lookback"
The generated dictionary is designed to be front coded when it is
loaded into memory. Under normal circumstances, a front-coded
dictionary would require scanning from the beginning in order to find
any particular word. However, every
.I lookback
words in the dictionary, the whole word is stored and a pointer to that
word maintained. E.g., if
.I lookback
is 4, then every fourth word is stored in its entirety.
.TP
.BI \-k " mem"
This limits the amount of memory to use for the generated
dictionary. Words are selected for the dictionary based on the text
statistics, and whether
.BR \-0 , " \-1" , " \-2"
or
.B \-3
is specified. The memory is calculated assuming a lookback of 0,
irrespective of what actual lookback is specified. This means that if
a non-zero lookback is given, the dictionary will actually occupy
less space than specified by
.BR \-k .
.TP
.BI \-d " directory"
This specifies the directory where the document collection can be found.
.TP
.BI \-f " name"
This specifies the base name of the document collection.
.SH ENVIRONMENT
.TP "\w'\fBMGDATA\fP'u+2n"
.SB MGDATA
If this environment variable exists, then its value is used as the
default directory where the mgpp
collection files are. If this variable does not exist, then the
directory \(lq\fB.\fP\(rq is used by default. The command line
option
.BI \-d " directory"
overrides the directory in
.BR MGDATA .
.SH FILES
.TP 20
.B *.text.stats
Statistics about the source text.
.TP
.B *.text.dict
Compression dictionary for the source text.
.SH "SEE ALSO"
.na
.BR mgpp_fast_comp_dict (1),
.BR mgpp_invf_dict (1),
.BR mgpp_passes (1),
.BR mgpp_perf_hash_build (1),
.BR mgpp_stem_idx (1),
.BR mgpp_weights_build (1)
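.ad
.SH EXAMPLES
.\" The material in this section is an illustrative sketch. The directory
.\" /data/collections, the collection name demo and the word list are
.\" hypothetical and do not come from the mgpp distribution.
The following is a sketch under stated assumptions rather than a
recommended configuration. Assuming a collection with the hypothetical
base name
.I demo
in the hypothetical directory
.IR /data/collections ,
this invocation would build a seed dictionary containing every word in the
statistics file, code novel words character by character with Huffman
codes, and store every fourth word in full:
.PP
.RS
.nf
mgpp_compression_dict \-S \-0 \-H \-l 4 \-d /data/collections \-f demo
.fi
.RE
.PP
To illustrate the effect of
.BR "\-l 4" ,
assume the dictionary holds the six made-up words below in sorted order.
With front coding, each word is stored as the number of leading characters
it shares with the preceding word, followed by the remaining suffix; every
fourth word is stored whole so that a search can start from it. The exact
layout used by mgpp is not described here, so the table shows only the idea:
.PP
.RS
.nf
.ta 14n 24n
word	shared	stored
compress	\-	compress
compressed	8	ed
compression	8	ion
compressor	8	or
compute	\-	compute
computer	7	r
.fi
.RE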
