mg_compression_dict.1@ 3745

Last change on this file since 3745 was 3745, checked in by mdewsnip, 21 years ago
Addition of MG package for search and retrieval
Property svn:executable set to `*`
Property svn:keywords set to `Author Date Id Revision`
File size: 6.5 KB

Rev	Line
[3745]	1	.\"------------------------------------------------------------
	2	.\" Id - set Rv,revision, and Dt, Date using rcs-Id tag.
	3	.de Id
	4	.ds Rv \\$3
	5	.ds Dt \\$4
	6	..
	7	.Id $Id: mg_compression_dict.1 3745 2003-02-20 21:20:24Z mdewsnip $
	8	.\"------------------------------------------------------------
	9	.TH mg_compression_dict 1 \*(Dt CITRI
	10	.SH NAME
	11	mg_compression_dict \- build a compression dictionary.
	12	.SH SYNOPSIS
	13	.B mg_compression_dict
	14	[
	15	.B \-h
	16	]
	17	[
	18	.BR \-C " \|"
	19	.BR \-P " \|"
	20	.B \-S
	21	]
	22	.if n .ti +9n
	23	[
	24	.BR \-0 " \|"
	25	.BR \-1 " \|"
	26	.BR \-2 " \|"
	27	.B \-3
	28	]
	29	[
	30	.BR \-H " \|"
	31	.BR \-B " \|"
	32	.BR \-D " \|"
	33	.BR \-Y " \|"
	34	.B \-M
	35	]
	36	.if n .ti +9n
	37	.if t .ti +.5i
	38	[
	39	.BI \-l " lookback"
	40	]
	41	[
	42	.BI \-k " mem"
	43	]
	44	[
	45	.BI \-d " directory"
	46	]
	47	.BI \-f " name"
	48	.SH DESCRIPTION
	49	.B mg_compression_dict
	50	builds a compression dictionary based on the statistics gathered
	51	during the first pass over the text. The options to the program are
	52	mainly concerned with limiting the amount of memory the dictionary
	53	will use and with how the text compressor will cope with any novel
	54	words found during the compression phase.
	55	.SH OPTIONS
	56	Options may appear in any order.
	57	.TP "\w'\fB\-d\fP \fIdirectory\fP'u+2n"
	58	.B \-h
	59	This displays a usage line on
	60	.IR stderr .
	61	.TP
	62	.B \-C
	63	Build a complete dictionary from the statistics file. If during the
	64	text compression phase a novel word is found, then the compressor will
	65	produce an error message and stop.
	66	.TP
	67	.B \-P
	68	Build a partial dictionary from the statistics file. This dictionary
	69	assumes that the statistics file is based on the entire text. The
	70	statistics of words not included in the dictionary are used to
	71	calculate the escape probability. If novel words are being coded
	72	character by character, then there may not be a Huffman code for every
	73	possible character. This means that the compressor may fail if a novel
	74	word contains a novel character.
	75	.TP
	76	.B \-S
	77	Build a seed dictionary from the statistics file. This dictionary
	78	assumes that the statistics file is based on only a portion of the
	79	text to be compressed. The probability of a novel word is based on the
	80	number of words that have only occurred once. If novel words are being
	81	coded character by character, then the Huffman codes for characters are
	82	based on the frequency of characters in the dictionary.
	83	.TP
	84	.B \-0
	85	All words from the statistics file are included in the built
	86	dictionary.
	87	.TP
	88	.B \-1
	89	Words are included in the dictionary until the dictionary reaches the
	90	desired size. Words are selected for the dictionary based on the order
	91	they occurred in the source text.
	92	.TP
	93	.B \-2
	94	Words are included in the dictionary until the dictionary reaches the
	95	desired size. The most frequent words are included in the dictionary
	96	first; where there is a tie for frequency, the shortest word is
	97	included first.
	98	.TP
	99	.B \-3
	100	Words are included in the dictionary until the dictionary reaches the
	101	desired size. The most frequent words are included in the dictionary
	102	first; where there is a tie for frequency, the shortest word is
	103	included first. Words are then shuffled back and forth between the
	104	`keep' and `discard' lists to find the `optimal' set of words that
	105	should be in the dictionary.
	106	.TP
	107	.B \-H
	108	This specifies that novel words will be coded character by character
	109	using Huffman codes.
	110	.TP
	111	.B \-B
	112	This specifies that an auxiliary dictionary will be built by the
	113	compressor. Each novel word found will be placed at the end of the
	114	auxiliary dictionary. Novel words will be coded in the compressed text
	115	using binary codes. The binary code represents their occurrence
	116	position in the auxiliary dictionary.
	117	.TP
	118	.B \-D
	119	This specifies that an auxiliary dictionary will be built by the
	120	compressor. Each novel word found will be placed at the end of the
	121	auxiliary dictionary. Novel words will be coded in the compressed text
	122	using delta codes. The delta code represents their occurrence position
	123	in the auxiliary dictionary.
	124	.TP
	125	.B \-Y
	126	This specifies that an auxiliary dictionary will be built by the
	127	compressor. Each novel word found will be placed at the end of the
	128	auxiliary dictionary. Novel words will be coded in the compressed text
	129	using a combination of gamma and binary codes. The code represents
	130	their occurrence position in the auxiliary dictionary. This generally
	131	produces better compression than
	132	.B \-B
	133	or
	134	.BR \-D .
	135	.TP
	136	.B \-M
	137	This specifies that an auxiliary dictionary will be built by the
	138	compressor. Each novel word found will be placed at the end of the
	139	auxiliary dictionary. Novel words will be coded in the compressed text
	140	using a combination of gamma and binary codes. The code represents
	141	their occurrence position in the auxiliary dictionary. This method is
	142	adaptive within documents, and generally produces better compression
	143	than
	144	.BR \-B ,
	145	.B \-D
	146	or
	147	.BR \-Y .
	148	.TP
	149	.BI \-l " lookback"
	150	The generated dictionary is designed to be front coded when it is
	151	loaded into memory. Under normal circumstances, a front-coded
	152	dictionary would require scanning from the beginning in order to find
	153	any particular word. However, every
	154	.I lookback
	155	words in the dictionary, the whole word is stored and a pointer to that
	156	word maintained. E.g., if
	157	.I lookback
	158	is 4, then every fourth word is stored in its entirety.
	159	.TP
	160	.BI \-k " mem"
	161	This limits the amount of memory to use for the generated
	162	dictionary. Words are selected for the dictionary based on the text
	163	statistics, and whether
	164	.BR \-0 , " \-1" , " \-2"
	165	or
	166	.B \-3
	167	is specified. The memory is calculated assuming a lookback of 0,
	168	irrespective of what actual lookback is specified. This means that if
	169	a non-zero lookback is given, the dictionary will actually occupy
	170	less space than specified by
	171	.BR \-k .
	172	.TP
	173	.BI \-d " directory"
	174	This specifies the directory where the document collection can be found.
	175	.TP
	176	.BI \-f " name"
	177	This specifies the base name of the document collection.
	178	.SH ENVIRONMENT
	179	.TP "\w'\fBMGDATA\fP'u+2n"
	180	.SB MGDATA
	181	If this environment variable exists, then its value is used as the
	182	default directory where the
	183	.BR mg (1)
	184	collection files are. If this variable does not exist, then the
	185	directory \(lq\fB.\fP\(rq is used by default. The command line
	186	option
	187	.BI \-d " directory"
	188	overrides the directory in
	189	.BR MGDATA .
	190	.SH FILES
	191	.TP 20
	192	.B *.text.stats
	193	Statistics about the source text.
	194	.TP
	195	.B *.text.dict
	196	Compression dictionary for the source text.
	197	.SH "SEE ALSO"
	198	.na
	199	.BR mg (1),
	200	.BR mg_fast_comp_dict (1),
	201	.BR mg_get (1),
	202	.BR mg_invf_dict (1),
	203	.BR mg_invf_dump (1),
	204	.BR mg_invf_rebuild (1),
	205	.BR mg_passes (1),
	206	.BR mg_perf_hash_build (1),
	207	.BR mg_text_estimate (1),
	208	.BR mg_weights_build (1),
	209	.BR mgbilevel (1),
	210	.BR mgbuild (1),
	211	.BR mgdictlist (1),
	212	.BR mgfelics (1),
	213	.BR mgquery (1),
	214	.BR mgstat (1),
	215	.BR mgtic (1),
	216	.BR mgticbuild (1),
	217	.BR mgticdump (1),
	218	.BR mgticprune (1),
	219	.BR mgticstat (1).
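
The description of the -l option can be illustrated with a small sketch of front coding with a lookback interval. This is not MG's actual in-memory layout or code. The entry format (shared-prefix length plus suffix), the `anchors` list standing in for the maintained pointers, and all function names are assumptions made for illustration. The sketch assumes a sorted word list and a lookback of at least 1. It stores every lookback-th word whole, so a lookup can binary search the whole words and then scan at most lookback entries instead of scanning from the start.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical entry format: (shared_prefix_length, suffix).
# A prefix length of 0 means the suffix is the whole word.
Entry = Tuple[int, str]


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_code(words: List[str], lookback: int) -> Tuple[List[Entry], List[int]]:
    """Front-code a sorted word list.

    Words at positions 0, lookback, 2*lookback, ... are stored whole and
    their positions recorded in `anchors` (the "pointers" in this sketch).
    """
    if lookback < 1:
        raise ValueError("this sketch needs lookback >= 1")
    entries: List[Entry] = []
    anchors: List[int] = []
    prev = ""
    for i, word in enumerate(words):
        if i % lookback == 0:
            entries.append((0, word))      # whole word
            anchors.append(i)
        else:
            k = common_prefix_len(prev, word)
            entries.append((k, word[k:]))  # only the differing tail
        prev = word
    return entries, anchors


def word_at(entries: List[Entry], anchors: List[int], lookback: int, index: int) -> str:
    """Decode the word at `index`, starting from the nearest earlier anchor."""
    start = anchors[index // lookback]
    word = ""
    for k, suffix in entries[start:index + 1]:
        word = word[:k] + suffix
    return word


def find(entries: List[Entry], anchors: List[int], lookback: int, target: str) -> int:
    """Return the position of `target`, or -1 if it is absent.

    Binary search over the whole words, then decode at most `lookback`
    entries within the chosen block.
    """
    whole = [entries[a][1] for a in anchors]
    block = bisect_right(whole, target) - 1
    if block < 0:
        return -1
    start = anchors[block]
    word = ""
    for i in range(start, min(start + lookback, len(entries))):
        k, suffix = entries[i]
        word = word[:k] + suffix
        if word == target:
            return i
    return -1


if __name__ == "__main__":
    words = sorted(["compress", "compression", "compressor",
                    "dictate", "dictionary", "dictum"])
    entries, anchors = front_code(words, lookback=4)
    print(entries)
    print(find(entries, anchors, 4, "dictum"))
```

With a lookback of 4, as in the manual's example, every fourth word is kept in full. A smaller lookback gives shorter scans at the cost of storing more whole words.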
