source: trunk/indexers/mg/docs/mgmerge.README @ 3745
Last change: r3745, checked in by mdewsnip (addition of MG package for search and retrieval)

A CRASH COURSE GUIDE TO MGMERGE
===============================

Shane Hudson
15 November 1994


This document is intended as a note to the maintainers of mgmerge,
outlining the mgmerge utility, the changes that had to be made to
the existing mg code to make mgmerge work, and the new source code
for mgmerge.


**** NOTE: ****
All the NEW/CHANGED files are in the file "new_mg.tar", which you
should have received with this document; files that were not changed
are NOT in the tar file, so it will have to be extracted "on top of"
the existing code. You'll figure it out :)


**** DESCRIPTION: ****
mgmerge adds documents to an existing mg database without the need
to rebuild the database from scratch. It works by building a
temporary new database from the new documents, then merging the
"old" and "new" databases to give one merged database.


The source consists of:

  mgmerge.sh        The main script, a lot like mgbuild.sh
  mg_get_merge.sh   Script to retrieve new documents, a lot like
                    mg_get.sh
  mg_merge.h        Header file used by mg_invf_merge.c and
                    mg_text_merge.c
  mg_text_merge.c   Program to append one compressed text file
                    to another
  mg_invf_merge.c   Program to merge two inverted files and
                    stemmed dictionaries

See the man pages for mgmerge, mg_get_merge, mg_text_merge and
mg_invf_merge for a brief overview.

mg_get_merge.sh should really be folded into mg_get.sh, with a
"-merge" option used when it is called by the mgmerge utility, but
I was too lazy to change it to that.


The files updated by mgmerge are:

  *.text               with mg_text_merge
  *.text.idx           with mg_text_merge
  *.text.idx.wgt       with mg_weights_build (after merging)
  *.invf               with mg_invf_merge
  *.invf.idx           with mg_invf_merge
  *.invf.dict          with mg_invf_merge
  *.weight             with mg_invf_merge or mg_weights_build
  *.invf.dict.blocked  with mg_invf_dict (after merging)

The *.text.stats file and *.text.dict file remain the same.
Any other database files (e.g., *.invf.dict.hash) will be out of
date after a merge and will need to be recomputed.

--------------------------------------------------------------------------

MG_TEXT_MERGE
=============

The two "merging" utilities are mg_text_merge and mg_invf_merge.

mg_text_merge simply appends the compressed text for the new
documents to the old compressed text file. For this to succeed,
the two files must have been created with the same parameters to
mg_passes, and the same compression model must be used, so
*.text.dict from the old (static) database is used as the model
for compressing the new documents. The default "-C" option to
mg_compression_dict in mgbuild was changed to "-S" so that novel
words in the new documents can be coded. mgmerge must use the same
option -- it is "-S" there too.

---------------

EFFECT on FILE COMPRESSION PERFORMANCE

Since the compression model is not being updated, compression
performance for the text will slowly degrade. The only solution is
a periodic rebuild from scratch.

---------------------------------------------------------------------------

MG_INVF_MERGE
=============

mg_invf_merge updates the stemmed dictionary and inverted file.
For reasons of simplicity, the stemmed dictionaries merged are in
the ".invf.dict" format, NOT the ".invf.dict.blocked" format. So
if a database is to be merged, the .invf.dict file should not be
deleted after the .invf.dict.blocked file used by mgquery has been
created. Also for simplicity, level 3 inverted files are not
supported, nor are skipped-format files.


Merging the inverted files requires every entry in each file to be
decoded, and merged if an entry for the same term appears in the
other inverted file. This can be a slow process if the old
inverted file is very large, especially since each entry is
decoded bit by bit with the bit-level I/O routines.
As was noted on page 94 of the book "Managing Gigabytes"
(hereafter called "MG"), switching from Bernoulli coding (called
"Bblock" in the mg source code) to gamma or delta coding for the
inter-document gaps would be an advantage, since they require no
parameters. With Bblock coding, N (the number of documents) is a
parameter, and as N changes whenever documents are added, every
entry must be decoded and recoded.
My approach was to keep Bblock coding, but keep the N parameter
constant by recording the artificial N used to code document gaps
(called "Nstatic") as well as the real number of documents.

Then, any entry in the old inverted file (called "IFold") that is
not merged with an entry in the new inverted file (called "IFnew")
can be copied directly to the merged inverted file (called
"IFmerge"). This gives a significant increase in speed, since
IFnew is usually very small when only a few documents are being
added.

To store Nstatic, a field "static_num_of_docs" was added to the
invf_dict_header and stem_dict_header structs defined in invf.h.
Any program that decodes inverted file entries also had to be
changed. (The existing mg source code that I changed has been
returned with this document.) Most changes were simply altering a
line that called BIO_Bblock_Init() so that it used the
static_num_of_docs value rather than num_of_docs.

Since the file format has changed, any existing collection will
have to be rebuilt first, even if there is no intention of ever
using mgmerge with it.

-------------

EFFECT on INVERTED FILE COMPRESSION PERFORMANCE

As more merges are cumulatively performed on a database, the
compression performance will decline. The solution is an option in
mg_invf_merge to do a "slow" merge, in which every inverted file
entry is decoded and recoded using the real value of N.
This is not a lot slower than a fast merge when compared to the
cost of completely rebuilding the collection from scratch, and the
option can be used periodically (when, say, 40% of the inverted
file has been added since the previous slow merge).

-----------------

THE EXACT WEIGHTS FILE

Recomputing exact document weights requires a complete scan over
the merged inverted file. This is a waste, since the weights for
old documents will (hopefully) not change much anyway, and the
weights for new documents can be computed exactly while the
inverted files are merged, at no extra time cost. So mg_invf_merge
computes the weights for new documents and updates the .weight
file; the weights for old documents are left untouched.
mgmerge has an option to recompute the weights file. This is not
much more expensive in terms of time, and is only periodically
needed.

The effect of leaving the old weights unchanged is difficult to
assess, since the "correct" ranking of documents for a query is
subjective. How often the weights file should be rebuilt with the
"-w" option is hard to guess. But if most merges involve only a
few documents, it makes sense not to bother recomputing it, since
the change in the old weight values is very small and the weights
only stray from their "true" values slowly over accumulated
merging.

-------------------------------------------------------------------------

EXPERIMENTAL RESULTS
====================

With some small test collections (only a few Mb of text), a slow
merge with a weights file rebuild typically took under 20% of the
time of a complete rebuild; using the default options (fast merge,
don't rebuild weights) took around 10% or more.
For larger collections the savings were greater, especially when
the added text was very small. For example, when the last 6Kb of
text of the GNUBib collection was added to the rest of it (14Mb)
using mgmerge, the total mgmerge time was around 20 seconds,
whereas a complete mgbuild took 280 seconds.
Only one larger collection was tested (resource constraints
prevented testing on any really large collections): two short
(one-line) documents were added to the Gutenberg collection, which
comprised nearly 74Mb of source text and had an inverted file of
9Mb. mgmerge took 42 seconds (or 142 seconds with a slow merge and
a weights file rebuild), and I'd estimate that an mgbuild on the
same machine would have taken at least 20 minutes.

To summarise, mgmerge is not as fast as a sophisticated method
such as the fixed-length blocks described in section 5.7 of "MG",
but its huge advantage is that it required almost no changes to
the existing mg code, and it is still far better than running
mgbuild every time a document needs to be added to a collection.

----------------------------------------------------------------------------

OTHER NOTES
===========

Since it works like mgbuild, mgmerge needs mg_get_merge to return
the SAME text each time "mg_get_merge -text .." is called.

----

One other thing discovered (but not fixed) was a bug in
"invf.pass2.c": it crashes on parsed words longer than 128
characters. So if any text with words longer than 128 characters
is piped to mg_passes, the "-N1" and "-N2" options (the
memory-efficient inversion method) had better be used, and not the
"-I1"/"-I2" options (which would only work on small collections
anyway, being the memory-inefficient method).