source: trunk/indexers/mg/docs/mgmerge.README @ 3745
Last change: r3745, checked in by mdewsnip (addition of MG package for search and retrieval)

A CRASH COURSE GUIDE TO MGMERGE
===============================

Shane Hudson
15 November 1994


This document is intended as a note to the maintainers of mgmerge,
outlining the mgmerge utility, the changes that had to be made to
the existing mg code to make mgmerge work, and the new source code
for mgmerge.


**** NOTE: ****
All the NEW/CHANGED files are in the file "new_mg.tar", which you
should have received with this document; files that were not changed
are NOT in the tar file, so it will have to be extracted "on top of"
the existing code. You'll figure it out :)


**** DESCRIPTION: ****
mgmerge adds documents to an existing mg database without the need
to rebuild the database from scratch. It works by building a
temporary new database from the new documents, then merging the
"old" and "new" databases to give one merged database.


The source consists of:

  mgmerge.sh        The main script, a lot like mgbuild.sh
  mg_get_merge.sh   Script to retrieve new documents, a lot like
                    mg_get.sh
  mg_merge.h        Header file used by mg_invf_merge.c and
                    mg_text_merge.c
  mg_text_merge.c   Program to append one compressed text file
                    to another
  mg_invf_merge.c   Program to merge two inverted files and
                    stemmed dictionaries

See the man pages for mgmerge, mg_get_merge, mg_text_merge and
mg_invf_merge for a brief overview.

mg_get_merge.sh should really be folded into mg_get.sh, with a
"-merge" option used when it is called by the mgmerge utility, but
I was too lazy to change it to that.


The files updated by mgmerge are:

  *.text               with mg_text_merge
  *.text.idx           with mg_text_merge
  *.text.idx.wgt       with mg_weights_build (after merging)
  *.invf               with mg_invf_merge
  *.invf.idx           with mg_invf_merge
  *.invf.dict          with mg_invf_merge
  *.weight             with mg_invf_merge or mg_weights_build
  *.invf.dict.blocked  with mg_invf_dict (after merging)

The *.text.stats file and *.text.dict file remain the same.
Any other database files (e.g., *.invf.dict.hash) will be out of
date after a merge and will need to be recomputed.

--------------------------------------------------------------------------

MG_TEXT_MERGE
=============

The two "merging" utilities are mg_text_merge and mg_invf_merge.

mg_text_merge simply appends the compressed text for the new
documents to the old compressed text file. For this to succeed,
the two files must have been created with the same parameters to
mg_passes, and the same compression model must be used, so
*.text.dict from the old (static) database is used as the model
for compressing the new documents. The default "-C" option to
mg_compression_dict in mgbuild was changed to "-S" so that novel
words in the new documents can be coded. mgmerge must use the same
option -- it is "-S" there too.

---------------

EFFECT on FILE COMPRESSION PERFORMANCE

Since the compression model is not being updated, compression
performance for the text will slowly degrade. The only solution is
a periodic rebuild from scratch.

---------------------------------------------------------------------------

MG_INVF_MERGE
=============

mg_invf_merge updates the stemmed dictionary and inverted file.
For reasons of simplicity, the stemmed dictionaries merged are in
the ".invf.dict" format, NOT the ".invf.dict.blocked" format. So
if a database is to be merged, the .invf.dict file should not be
deleted after the .invf.dict.blocked file used by mgquery has been
created. Also for simplicity, level 3 inverted files are not
supported, nor are skipped-format files.


Merging the inverted files requires every entry in each file to be
decoded, and merged if an entry for the same term appears in the
other inverted file. This can be a slow process if the old
inverted file is very large, especially since each entry is
decoded bit by bit with the bit-level I/O routines.
As was noted on page 94 of the book "Managing Gigabytes"
(hereafter called "MG"), switching from Bernoulli coding (called
"Bblock" in the mg source code) to gamma or delta coding for the
inter-document gaps would be an advantage, since they require no
parameters. With Bblock coding, N (the number of documents) is a
parameter, and as N changes whenever documents are added, every
entry must be decoded and recoded.
My approach was to keep Bblock coding, but keep the N parameter
constant by recording the artificial N used to code document gaps
(called "Nstatic") as well as the real number of documents.

Then, any entry in the old inverted file (called "IFold") that is
not merged with an entry in the new inverted file (called "IFnew")
can be copied directly to the merged inverted file (called
"IFmerge"). This gives a significant increase in speed, since
IFnew is usually very small when only a few documents are being
added.

To store Nstatic, a field "static_num_of_docs" was added to the
invf_dict_header and stem_dict_header structs defined in invf.h.
Any program that decodes inverted file entries also had to be
changed. (The existing mg source code that I changed has been
returned with this document.) Most changes were simply altering a
line that called BIO_Bblock_Init() so that it used the
static_num_of_docs value rather than num_of_docs.

Since the file format has changed, any existing collection will
have to be rebuilt first, even if there is no intention of ever
using mgmerge with it.

-------------

EFFECT on INVERTED FILE COMPRESSION PERFORMANCE

As more merges are cumulatively performed on a database, the
compression performance will decline. The solution is an option in
mg_invf_merge to do a "slow" merge, in which every inverted file
entry is decoded and recoded using the real value of N.
This is not a lot slower than a fast merge when compared to the
cost of completely rebuilding the collection from scratch, and the
option can be used periodically (when, say, 40% of the inverted
file has been added since the previous slow merge).

-----------------

THE EXACT WEIGHTS FILE

Recomputing exact document weights requires a complete scan over
the merged inverted file. This is a waste, since the weights for
old documents will (hopefully) not change much anyway, and the
weights for new documents can be computed exactly while the
inverted files are merged, at no extra time cost. So mg_invf_merge
computes the weights for new documents and updates the .weight
file; the weights for old documents are left untouched.
mgmerge has an option to recompute the weights file. This is not
much more expensive in terms of time, and is only periodically
needed.

The effect of leaving the old weights unchanged is difficult to
assess, since the "correct" ranking of documents for a query is
subjective. How often the weights file should be rebuilt with the
"-w" option is hard to guess. But if most merges involve only a
few documents, it makes sense not to bother recomputing it, since
the change in the old weight values is very small and the weights
only stray from their "true" values slowly over accumulated
merging.

-------------------------------------------------------------------------

EXPERIMENTAL RESULTS
====================

With some small test collections (only a few Mb of text), a slow
merge with a weights file rebuild typically took under 20% of the
time of a complete rebuild; using the default options (fast merge,
don't rebuild weights) took around 10% or more.
For larger collections the savings were greater, especially when
the added text was very small. For example, when the last 6Kb of
text of the GNUBib collection was added to the rest of it (14Mb)
using mgmerge, the total mgmerge time was around 20 seconds,
whereas a complete mgbuild took 280 seconds.
Only one larger collection was tested (resource constraints
prevented testing on any really large collections): two short
(one-line) documents were added to the Gutenberg collection, which
comprised nearly 74Mb of source text and had an inverted file of
9Mb. mgmerge took 42 seconds (or 142 seconds with a slow merge and
a weights file rebuild), and I'd estimate that an mgbuild on the
same machine would have taken at least 20 minutes.

To summarise, mgmerge is not as fast as a sophisticated method
such as the fixed-length blocks described in section 5.7 of "MG",
but its huge advantage is that it required almost no changes to
the existing mg code, and it is still far better than running
mgbuild every time a document needs to be added to a collection.

----------------------------------------------------------------------------

OTHER NOTES
===========

Since it works like mgbuild, mgmerge needs mg_get_merge to return
the SAME text each time "mg_get_merge -text .." is called.

----

One other thing discovered (but not fixed) was a bug in
"invf.pass2.c": it crashes on parsed words longer than 128
characters. So if any text with words longer than 128 characters
is piped to mg_passes, the "-N1" and "-N2" options (the
memory-efficient inversion method) had better be used, and not the
"-I1"/"-I2" options (which would only work on small collections
anyway, being the memory-inefficient method).