source: trunk/gsdl/packages/mg-1.3d/docs/mgmerge.README@ 13

Last change on this file since 13 was 13, checked in by rjmcnab, 26 years ago

	1	A CRASH COURSE GUIDE TO MGMERGE
	2	===============================
	3
	4	Shane Hudson
	5	15 November 1994
	6
	7
	8	This document is intended as a note to the maintainers of mgmerge
	9	outlining the mgmerge utility, the changes
	10	that had to be made to the existing mg code to make mgmerge
	11	work, and the new source code for mgmerge.
	12
	13
	14	** NOTE: **
	15	All the NEW/CHANGED files are in the file "new_mg.tar" which
	16	you should have received with this; files that were
	17	not changed are NOT in the tar file so it will have to be
	18	extracted "on top of" the existing code. You'll figure it out :)
	19
	20
	21	** DESCRIPTION: **
	22	mgmerge adds documents to an existing mg database
	23	without the need to rebuild the database from scratch.
	24	It works by building a temporary new database from the
	25	new documents, then merging the "old" and "new" databases
	26	to give one merged database.
	27
	28
	29	The source consists of:
	30	mgmerge.sh        The main script, a lot like mgbuild.sh
	31	mg_get_merge.sh   Script to retrieve new documents, a lot like
	32	                  mg_get.sh
	33	mg_merge.h        Header file used by mg_invf_merge.c
	34	                  and mg_text_merge.c
	35	mg_text_merge.c   Program to append one compressed text file
	36	                  to another
	37	mg_invf_merge.c   Program to merge two inverted files and
	38	                  stemmed dictionaries
	39
	40	See the man pages for mgmerge, mg_get_merge, mg_text_merge and
	41	mg_invf_merge for a brief overview.
	42
	43	mg_get_merge.sh should really be put into mg_get.sh, with a "-merge"
	44	option if it is being called by the mgmerge utility, but I was
	45	too lazy to change it to that.
	46
	47
	48	The files updated by mgmerge are:
	49
	50	*.text                with mg_text_merge
	51	*.text.idx            with mg_text_merge
	52	*.text.idx.wgt        with mg_weights_build (after merging)
	53	*.invf                with mg_invf_merge
	54	*.invf.idx            with mg_invf_merge
	55	*.invf.dict           with mg_invf_merge
	56	*.weight              with mg_invf_merge or mg_weights_build
	57	*.invf.dict.blocked   with mg_invf_dict (after merging)
	58
	59	The .text.stats file and .text.dict file remain the same.
	60	Any other database files (e.g. *.invf.dict.hash)
	61	will be out of date after a merge and will need to be recomputed.
	62
	63	--------------------------------------------------------------------------
	64
	65	MG_TEXT_MERGE
	66	=============
	67
	68	The two "merging" utilities are mg_text_merge and mg_invf_merge.
	69
	70	mg_text_merge simply appends the compressed text for the new
	71	documents to the old compressed text file.
	72	For this to succeed, the two files should have been created
	73	with the same parameters to mg_passes and the same compression
	74	model must be used. So *.text.dict from the old (static) database
	75	is used as the model for compressing the new documents.
	76	The default "-C" option to mg_compression_dict in mgbuild was
	77	changed to "-S" so novel words in the new documents can be coded.
	78	The option must be the same in mgmerge -- it is "-S" there too.
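
The append step can be pictured with a small sketch. This is not the
mg_text_merge source: the in-memory layout (a byte string for the
compressed text plus a list of per-document start offsets) and the name
append_text are assumptions made purely for illustration. The point is
that the old text is never touched; the new index entries just have to
be shifted by the length of the old text.

```python
def append_text(old_text, old_idx, new_text, new_idx):
    """Append new compressed documents to an old compressed text.

    old_idx and new_idx hold the start offset of each document within
    its own text, so the new offsets are shifted by len(old_text).
    """
    base = len(old_text)
    merged_text = old_text + new_text
    merged_idx = list(old_idx) + [base + off for off in new_idx]
    return merged_text, merged_idx


# Two old documents ("AAAA", "BBB") and two new ones ("CC", "DDD").
old_text, old_idx = b"AAAABBB", [0, 4]
new_text, new_idx = b"CCDDD", [0, 2]
text, idx = append_text(old_text, old_idx, new_text, new_idx)
```

Because both texts are assumed to be coded with the same model, the
bytes of the new documents can be decoded exactly like the old ones.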
	79
	80	---------------
	81
	82	EFFECT on FILE COMPRESSION PERFORMANCE
	83
	84	Since the compression model is not being updated, compression
	85	performance for the text will slowly degrade. The only solution
	86	is a periodic rebuild from scratch.
	87
	88	---------------------------------------------------------------------------
	89
	90	MG_INVF_MERGE
	91	=============
	92
	93	mg_invf_merge updates the stemmed dictionary and
	94	inverted file.
	95	For reasons of simplicity, the stemmed dictionaries merged
	96	are in the ".invf.dict" format, NOT the ".invf.dict.blocked"
	97	format.
	98	So if a database is to be merged the .invf.dict file should
	99	not be deleted after the .invf.dict.blocked file used by mgquery
	100	has been created.
	101	Also for simplicity, level 3 inverted files are not supported, nor
	102	are skipped format files.
	103
	104
	105
	106	Merging the inverted files requires every entry in each file
	107	to be decoded and merged if an entry for the same term appears
	108	in the other inverted file.
	109	This can be a slow process if the old inverted file is very large,
	110	especially since each entry is decoded bit-by-bit with
	111	the bit-level I/O routines.
	112	As was noted on page 94 of the book "Managing Gigabytes" (hereafter
	113	called "MG"), switching from Bernoulli coding (called "Bblock" in
	114	the mg source code) to gamma or delta coding for the
	115	inter-document gaps would be an advantage since they require no
	116	parameters.
	117	With Bblock coding, N (the number of documents) is a parameter and
	118	as N changes whenever documents are added, every entry must
	119	be decoded and recoded.
	120	My approach was to keep Bblock coding, but keep the N parameter
	121	constant by recording the artificial N used to code document
	122	gaps (called "Nstatic") as well as the real number of documents.
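
To see why N matters, here is a rough sketch of Golomb-style gap coding
in which the parameter b is derived from N and the term's document
frequency f_t. It is not the BIO_Bblock code from mg: the function
names are made up, bits are kept in a Python string for readability,
and the formula b = 0.69 * N / f_t (rounded) is assumed from the usual
description of this coding.

```python
def golomb_b(n_docs, f_t):
    """Golomb parameter from N and the term's document frequency f_t."""
    return int(0.5 + 0.6931471 * n_docs / f_t) or 1


def encode_gaps(gaps, b):
    """Encode document gaps (each >= 1) as a string of '0'/'1'."""
    k = (b - 1).bit_length()              # ceil(log2(b))
    cutoff = (1 << k) - b
    out = []
    for gap in gaps:
        q, r = divmod(gap - 1, b)
        out.append("1" * q + "0")         # quotient in unary
        if r < cutoff:                    # remainder in truncated binary
            out.append(format(r, "b").zfill(k - 1))
        elif k > 0:
            out.append(format(r + cutoff, "b").zfill(k))
    return "".join(out)


def decode_gaps(bits, b, count):
    """Decode `count` gaps; b must be the value used when encoding."""
    k = (b - 1).bit_length()
    cutoff = (1 << k) - b
    pos, gaps = 0, []
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1                          # skip the terminating 0
        r = 0
        if k > 0:
            r = int(bits[pos:pos + k - 1] or "0", 2)
            pos += k - 1
            if r >= cutoff:               # long codeword: one more bit
                r = 2 * r + int(bits[pos]) - cutoff
                pos += 1
        gaps.append(q * b + r + 1)
    return gaps


gaps = [3, 5, 12, 21]                     # a term occurring in 4 documents
b_static = golomb_b(50, 4)                # entry coded when N was 50
bits = encode_gaps(gaps, b_static)
b_real = golomb_b(60, 4)                  # N after 10 documents were added
```

Decoding bits with b_static gives back the original gaps, while decoding
with b_real does not, which is why the N used for coding has to be
remembered separately once the real N moves on.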
	123
	124	Then, any entry in the old inverted file (called "IFold") that
	125	is not merged with an entry in the new inverted file (called "IFnew")
	126	can be copied directly to the merged inverted file (called "IFmerge").
	127	This can give a significant increase in speed, since usually the
	128	size of IFnew is very small if only a few documents are being added.
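
The fast merge can be sketched roughly as follows. This is only an
illustration, not mg_invf_merge itself: inverted files are modelled as
Python dicts from term to an encoded entry, the encode/decode pair is a
trivial stand-in for the real bit-level coding, and all names are
hypothetical.

```python
def fast_merge(if_old, if_new, n_old, encode, decode):
    """Merge two inverted files held as {term: encoded entry}.

    Document numbers in if_new are local (1..n_new) and are shifted by
    n_old. Returns the merged file and the number of old entries that
    were copied across without being decoded.
    """
    merged, copied = {}, 0
    for term in sorted(set(if_old) | set(if_new)):
        if term not in if_new:
            merged[term] = if_old[term]   # copied verbatim, never decoded
            copied += 1
            continue
        docs = decode(if_old[term]) if term in if_old else []
        docs += [n_old + d for d in decode(if_new[term])]
        merged[term] = encode(docs)
    return merged, copied


# Stand-in codec: comma-separated document numbers.
def encode(docs):
    return ",".join(map(str, docs)).encode()


def decode(entry):
    return [int(x) for x in entry.split(b",")]


if_old = {"cat": encode([1, 3]), "dog": encode([2]), "emu": encode([1, 2, 3])}
if_new = {"dog": encode([1]), "fox": encode([1, 2])}
if_merge, copied = fast_merge(if_old, if_new, 3, encode, decode)
```

Here only "dog" needs decoding on the old side; "cat" and "emu" are
copied as they are, which is where the speed comes from when IFnew is
small.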
	129
	130	To store Nstatic, a field "static_num_of_docs" was added to
	131	the invf_dict_header and stem_dict_header structs defined
	132	in invf.h.
	133	Any program that decodes inverted file entries also had to be changed.
	134	(The existing mg source code that I changed has been returned with
	135	this document.)
	136	Most changes were simply altering a line that
	137	called BIO_Bblock_Init() so it used the static_num_of_docs
	138	value rather than num_of_docs.
	139
	140	Since the file format has changed, any existing collection will
	141	have to be rebuilt first, even if using mgmerge with it is not
	142	intended.
	143
	144	-------------
	145
	146	EFFECT on INVERTED FILE COMPRESSION PERFORMANCE
	147
	148	As more fast merges accumulate on a database, Nstatic drifts further
	149	from the real N and compression performance will decline.
	150	The solution is an option in mg_invf_merge to do a "slow"
	151	merge where every inverted file entry is decoded and recoded using
	152	the real value of N.
	153	This is not a lot slower than a fast merge when compared to the
	154	cost of completely rebuilding the collection from scratch, and
	155	the option can be used periodically (when, say, 40% of the
	156	inverted file has been added since the previous slow merge).
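
One way to decide when a slow merge is due is sketched below. The 40%
figure is the example given above; reading it as "fraction of the
current inverted file added since the last slow merge", and the
function name and arguments, are assumptions for illustration only.

```python
def needs_slow_merge(size_at_last_slow_merge, current_size, percent=40):
    """True once at least `percent` of the current inverted file has
    been added since the previous slow merge (sizes in bytes)."""
    added = current_size - size_at_last_slow_merge
    return added * 100 >= percent * current_size


due = needs_slow_merge(9_000_000, 15_000_000)      # 6 Mb of 15 Mb is new
not_due = needs_slow_merge(9_000_000, 10_000_000)  # 1 Mb of 10 Mb is new
```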
	157
	158	-----------------
	159
	160	THE EXACT WEIGHTS FILE
	161
	162	Recomputing exact document weights requires a complete scan over the
	163	merged inverted file. This is a waste since the weights for old
	164	documents will (hopefully) not change much anyway, and the weights
	165	for new documents can be computed exactly when merging the inverted files,
	166	at no extra time cost.
	167	So mg_invf_merge computes the weights for new documents and
	168	updates the .weight file.
	169	The weights for old documents are left untouched.
	170	mgmerge has an option ("-w") to recompute the weights file.
	171	This is not much more expensive in terms of time, and is only
	172	periodically needed.
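
A toy illustration of why the old weights drift is given below. The
weight formula (square root of the sum of squared f_dt * ln(1 + N/f_t)
terms) is an assumption standing in for whatever mg actually computes,
and the names are hypothetical; the only point is that old weights
depend on N and f_t, both of which move when documents are added.

```python
import math


def doc_weights(docs):
    """docs: list of {term: f_dt}. Returns one weight per document."""
    n = len(docs)
    f_t = {}
    for doc in docs:
        for term in doc:
            f_t[term] = f_t.get(term, 0) + 1
    return [
        math.sqrt(sum((f * math.log(1 + n / f_t[t])) ** 2
                      for t, f in doc.items()))
        for doc in docs
    ]


old_docs = [{"cat": 2, "dog": 1}, {"dog": 3}, {"emu": 1, "cat": 1}]
new_docs = [{"dog": 1, "fox": 2}]

before = doc_weights(old_docs)              # weights file before merging
exact = doc_weights(old_docs + new_docs)    # full recomputation
after_merge = before + exact[len(old_docs):]  # old kept, new exact
```

In a collection this tiny the old weights move a lot; with thousands of
documents and a handful added, the same effect is much smaller.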
	173
	174	The effect of leaving the old weights unchanged is difficult to
	175	assess since the "correct" ranking of documents for a query is
	176	subjective.
	177	The frequency with which a rebuild of the weights file with the
	178	"-w" option should be done is hard to guess.
	179	But if most merges involve only a few documents, it makes sense
	180	to not bother recomputing it since the change in old weights values
	181	is very small and the weights only stray from their "true"
	182	values slowly over accumulated merging.
	183
	184	-------------------------------------------------------------------------
	185
	186	EXPERIMENTAL RESULTS
	187	====================
	188
	189	With some small test collections (only a few Mb of text),
	190	a slow merge and a weights file rebuild,
	191	mgmerge typically took under 20% of the time of a complete
	192	rebuild. Using the default options (fast merge, don't rebuild weights)
	193	it took around 10% or more.
	194	For larger collections, the savings were greater, especially if the
	195	added text is very small.
	196	For example, when the last 6Kb of text of the GNUBib collection was
	197	added to the rest of it (14Mb) using mgmerge, the total mgmerge time
	198	was around 20 seconds whereas a complete mgbuild took 280 seconds.
	199	Only one larger collection was tested (resource constraints prevented
	200	testing on any really large collections): two short (one-line)
	201	documents were added to the Gutenberg collection, which comprised
	202	nearly 74Mb of source text and had an inverted file size of 9Mb.
	203	mgmerge took 42 seconds (or 142 seconds with a slow merge and rebuilding the
	204	weights file), and I'd estimate that an mgbuild on the same machine would
	205	have taken at least 20 minutes.
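
Put as fractions of a full rebuild, the timings above work out as
follows. The Gutenberg rebuild time is only the 20-minute estimate
given above, so those two fractions are upper bounds at best; the
function name is just for illustration.

```python
def fraction_of_rebuild(merge_seconds, rebuild_seconds):
    """Merge time as a fraction of the full rebuild time."""
    return merge_seconds / rebuild_seconds


gnubib = fraction_of_rebuild(20, 280)               # about 7%
gutenberg_fast = fraction_of_rebuild(42, 20 * 60)   # about 3.5% or less
gutenberg_slow = fraction_of_rebuild(142, 20 * 60)  # about 12% or less
```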
	206
	207	To summarise, mgmerge is not as fast as some sophisticated method
	208	such as the fixed-length blocks described in section 5.7 of "MG",
	209	but its huge advantage is that it required almost no changes to
	210	the existing mg code and is still far better than using mgbuild
	211	every time a document needs to be added to a collection.
	212
	213	----------------------------------------------------------------------------
	214
	215	OTHER NOTES
	216	===========
	217
	218	Since it works like mgbuild, mgmerge needs mg_get_merge
	219	to return the SAME text each time "mg_get_merge -text .." is called.
	220
	221	----
	222
	223	One other thing discovered (but not fixed) was a bug in "invf.pass2.c":
	224	It crashes on parsed words longer than 128 characters.
	225	So if any text with words >128 chars is piped to mg_passes,
	226	the "-N1" and "-N2" options (the memory-efficient inversion method)
	227	had better be used and not the "-I1"/"-I2" option (which would only
	228	work on small collections anyway, being the memory-inefficient
	229	method).
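
Until that bug is fixed, the input could be checked before choosing the
inversion options. The sketch below assumes a "word" is a maximal run
of ASCII letters and digits, which may not match the mg parser exactly;
the function name is hypothetical.

```python
import re


def has_long_word(text, limit=128):
    """True if any run of letters/digits is longer than `limit`."""
    return any(len(m.group()) > limit
               for m in re.finditer(r"[A-Za-z0-9]+", text))


risky = has_long_word("short words then " + "x" * 129)
safe = has_long_word("a" * 128 + " " + "b" * 5)
```

If the check returns True, stick to the "-N1"/"-N2" options as
described above.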
	230
