1 | TITLE
|
---|
2 | Parsing of Long Words
|
---|
3 | APPLICATION
|
---|
4 | mg-1, mg-2
|
---|
5 | TYPE
|
---|
6 | bug
|
---|
7 | REPORT
|
---|
8 | [email protected] - May 11th 1994
|
---|
9 | FIX
|
---|
10 | [email protected] - August 9th 1994
|
---|
11 | CLAIM
|
---|
12 | Mg didn't handle long words properly; it crashed.
|
---|
13 | PROBLEM
|
---|
14 | Invf passes calls PARSE_LONG_WORD [words.h] which uses a limit of
|
---|
15 | MAXLONGWORD on iterating thru the string and storing into
|
---|
16 | a word. MAXLONGWORD = 8192.
|
---|
17 | However, mg strings generally store the length in the first
|
---|
18 | byte limiting them to 255 characters. The word which was passed
|
---|
19 | to PARSE_LONG_WORD was an allocated string of MAXSTEMLEN = 255,
|
---|
20 | which is as large as we should get anyway. Thus when accessing
|
---|
21 | a larger word than 255 chars, PARSE_LONG_WORD would allow it
|
---|
22 | (less than 8192) and would try storing beyond the array limit.
|
---|
23 | SOLUTION
|
---|
24 | The author can't remember why PARSE_LONG_WORD was used and what
|
---|
25 | the significance of MAXLONGWORD = 8192 is.
|
---|
26 | So PARSE_LONG_WORD has been changed to PARSE_STEM_WORD which
|
---|
27 | uses MAXSTEMLEN as its limit.
|
---|
28 | FILES
|
---|
29 | * words.h
|
---|
30 | * invf.pass1.c
|
---|
31 | * invf.pass2.c
|
---|
32 | * ivf.pass1.c
|
---|
33 | * ivf.pass2.c
|
---|
34 | * query.ranked.c
|
---|
35 | *************************************************************
|
---|
36 | TITLE
|
---|
37 | Use of Lovins stemmer
|
---|
38 | APPLICATION
|
---|
39 | mg-1
|
---|
40 | TYPE
|
---|
41 | improve
|
---|
42 | REPORT
|
---|
43 | local - 1994
|
---|
44 | FIX
|
---|
45 | [email protected] - 1994
|
---|
46 | CLAIM
|
---|
47 | Stemming was done naively.
|
---|
48 | PROBLEM
|
---|
49 | Only a few types of words and their endings
|
---|
50 | were considered.
|
---|
51 | SOLUTION
|
---|
52 | Replacement with a more elaborate "known" stemmer by Lovins.
|
---|
53 | The algorithm is described in:
|
---|
54 | J.B. Lovins, "Development of a Stemming Algorithm",
|
---|
55 | Mechanical Translation and Computational Linguistics, Vol 11,1968.
|
---|
56 | FILES
|
---|
57 | * stem.c
|
---|
58 | * stem.h
|
---|
59 | *************************************************************
|
---|
60 | TITLE
|
---|
61 | Different term parsing
|
---|
62 | APPLICATION
|
---|
63 | mg-1
|
---|
64 | TYPE
|
---|
65 | bug
|
---|
66 | REPORT
|
---|
67 | [email protected] - 23 Aug 1994
|
---|
68 | FIX
|
---|
69 | [email protected] - 23 Aug 1994
|
---|
70 | CLAIM
|
---|
71 | Boolean queries did not extract words/terms using the
|
---|
72 | same method as is done at inverted-file creation and
|
---|
73 | as is used for rank query parsing.
|
---|
74 | PROBLEM
|
---|
75 | The hand-written lex. analyser, query_lex, which is called by
|
---|
76 | the boolean query parser was not calling a common
|
---|
77 | word-extraction routine as used by the rest of mg.
|
---|
78 | This would be ok if the code did the same things - but they didn't.
|
---|
79 | Query_lex, for instance, did NOT place any limit on the
|
---|
80 | number of digits in a term.
|
---|
81 | Of even more concern, it would allow arbitrary sized words
|
---|
82 | although it used Pascal style strings which store the length
|
---|
83 | in the first byte and can therefore only be 255 characters in length.
|
---|
84 | SOLUTION
|
---|
85 | Query_lex in "query.bool.y", was modified to call the routine
|
---|
86 | PARSE_STEM_WORD which is also used by text-inversion routines and
|
---|
87 | ranking query routines.
|
---|
88 | Now all terms are extracted by the same routine.
|
---|
89 | To do this, the end of the line buffer had to be noted as
|
---|
90 | PARSE_STEM_WORD requires a pointer to the end - which is the
|
---|
91 | safe thing to do (don't want to run over the end).
|
---|
92 | This meant I had to find the length of the query line buffer.
|
---|
93 | This was allocated in the file "read_line.c" by the routine,
|
---|
94 | "readline". Its size was the literal number 1024.
|
---|
95 | This was changed to a constant and placed in "read_line.h".
|
---|
96 | The definition for PARSE_STEM_WORD can be found in "words.h".
|
---|
97 | FILES
|
---|
98 | * query.bool.y
|
---|
99 | * query.bool.c (by bison)
|
---|
100 | * read_line.c
|
---|
101 | * read_line.h
|
---|
102 | *************************************************************
|
---|
103 | TITLE
|
---|
104 | Highlighting of query terms
|
---|
105 | APPLICATION
|
---|
106 | mg-1
|
---|
107 | TYPE
|
---|
108 | extend
|
---|
109 | REPORT
|
---|
110 | [email protected] - Aug 94
|
---|
111 | FIX
|
---|
112 | [email protected] - Sep 94
|
---|
113 | CLAIM
|
---|
114 | Difficult to feel happy that the query-result returned is
|
---|
115 | satisfying the query - need to look hard to find the queried words.
|
---|
116 | Need to show words in results using some highlighting method.
|
---|
117 | PROBLEM
|
---|
118 | No highlighting of query terms in results.
|
---|
119 | SOLUTION
|
---|
120 | Mgquery was previously outputting the decompressed text to a pager
|
---|
121 | such as "less(1)" or "more(1)".
|
---|
122 | (Except when redirected or piped elsewhere :)
|
---|
123 | So what was needed was some sort of highlight pager that instead of
|
---|
124 | displaying the text would also use some means for highlighting the
|
---|
125 | stemmed query words.
|
---|
126 | Two common forms of highlighting were chosen: underline and bolding.
|
---|
127 | These are supported by "less(1)" and possibly by "more(1)" by
|
---|
128 | using the backspace character.
|
---|
129 | A highlight pager will also need to know which words need to be
|
---|
130 | highlighted. Therefore, the code was modified to build up a
|
---|
131 | string of the stemmed query words for passing to the highlight pager.
|
---|
132 | Design Options:
|
---|
133 | ---------------
|
---|
134 | * Could do text filtering in mgquery before passing out to pager.
|
---|
135 | Instead I pipe to a separate process, the "hilite_words" pager,
|
---|
136 | which filters and pipes into less/more.
|
---|
137 | * Could do different highlighting or a combination.
|
---|
138 | * Could use a different structure for storing the query words other
|
---|
139 | than the hash-table I used.
|
---|
140 | FILES
|
---|
141 | * Makefile - to include hilite_words target
|
---|
142 | * mg_hilite_words.c
|
---|
143 | * mgquery.c
|
---|
144 | * mgquery.1
|
---|
145 | * query.bool.y
|
---|
146 | * query.ranked.c
|
---|
147 | * environment.c
|
---|
148 | * environment.h
|
---|
149 | * backend.h
|
---|
150 | *************************************************************
|
---|
151 | TITLE
|
---|
152 | Mg_compression_dict did premature free
|
---|
153 | APPLICATION
|
---|
154 | mg-1
|
---|
155 | TYPE
|
---|
156 | bug
|
---|
157 | REPORT
|
---|
158 | [email protected] - 23 Sep 94
|
---|
159 | FIX
|
---|
160 | [email protected] - 23 Sep 94
|
---|
161 | CLAIM
|
---|
162 | mg_compression_dict dumped core in
|
---|
163 | file: mg_compression_dict.c
|
---|
164 | function: Write_data
|
---|
165 | line: int codelen = hd->clens[i];
|
---|
166 | PROBLEM
|
---|
167 | Huffman data, hd, was freed *before* it was accessed again.
|
---|
168 | SOLUTION
|
---|
169 | The freeing of hd has been moved to after all accesses
|
---|
170 | (just before returning).
|
---|
171 | FILES
|
---|
172 | * mg_compression_dict.c
|
---|
173 | *************************************************************
|
---|
174 | TITLE
|
---|
175 | Boolean tree optimising rewrite
|
---|
176 | APPLICATION
|
---|
177 | mg-1
|
---|
178 | TYPE
|
---|
179 | bug
|
---|
180 | REPORT
|
---|
181 | [email protected] - 23 Sep 94
|
---|
182 | FIX
|
---|
183 | [email protected] - Oct 94
|
---|
184 | CLAIM
|
---|
185 | "I am still getting core dump in "and" queries in mgquery,
|
---|
186 | where the first word does not exist, but the second one does."
|
---|
187 | PROBLEM
|
---|
188 | Having freed a particular node, it tried to refree it and
|
---|
189 | access one of its fields.
|
---|
190 |
|
---|
191 | I.e. code-fragment...
|
---|
192 |
|
---|
193 | FreeNode(curr); /* where curr = CHILD(base) for 1st term in list */
|
---|
194 | FreeNodes(next);
|
---|
195 | FreeNodes(CHILD(base));
|
---|
196 | /* but CHILD(base) has already been freed above */
|
---|
197 | /* if the node was the first one in the list */
|
---|
198 |
|
---|
199 | SOLUTION
|
---|
200 | A number of things in the code seemed a bit dubious to me.
|
---|
201 | So I have rewritten the boolean optimising stage and abstracted out
|
---|
202 | the various stages - each file starts with "bool".
|
---|
203 | Boolean query optimising seems to be a tricky problem.
|
---|
204 | It is not clear that putting an expression into a certain form will
|
---|
205 | actually simplify it and whether simplification means faster querying.
|
---|
206 | I have converted a given boolean expression into DNF
|
---|
207 | (Disjunctive Normal Form). "And not" nodes, which are readily apparent
|
---|
208 | in DNF, are converted to "diff" nodes. I have only applied the idempotency
|
---|
209 | laws involving TRUE and FALSE, and not the ones requiring matching of
|
---|
210 | expressions - it is a potentially more complicated problem.
|
---|
211 | The optimiser has been tested by playing with "bool_tester", and if you are
|
---|
212 | having a crash or problem in a boolean query it would be worth testing the
|
---|
213 | query on the "bool_tester." The token "*" stands for TRUE (or all documents)
|
---|
214 | and the token "_" stands for FALSE (or no documents). This should show the
|
---|
215 | expression before and after optimisation in an ascii tree bracketting format.
|
---|
216 | FILES
|
---|
217 | * bool_tree.c
|
---|
218 | * bool_parser.y
|
---|
219 | * bool_optimiser.c
|
---|
220 | * bool_query.c
|
---|
221 | * bool_tester.c
|
---|
222 | * term_lists.c
|
---|
223 | *************************************************************
|
---|
224 | TITLE
|
---|
225 | Mgtic pixel placement
|
---|
226 | APPLICATION
|
---|
227 | mg-1
|
---|
228 | TYPE
|
---|
229 | bug
|
---|
230 | REPORT
|
---|
231 | Bruce McKenzie - [email protected] (21st Oct 1994)
|
---|
232 | FIX
|
---|
233 | [email protected]
|
---|
234 | CLAIM
|
---|
235 | mgtic crashed on certain files.
|
---|
236 | PROBLEM
|
---|
237 | Placing pixels outside of bitmap.
|
---|
238 | SOLUTION
|
---|
239 | Changed the putpixel routine to truncate at borders of the image.
|
---|
240 | FILES
|
---|
241 | * mgtic.c
|
---|
242 | *************************************************************
|
---|
243 | TITLE
|
---|
244 | Improved boolean tree optimising
|
---|
245 | APPLICATION
|
---|
246 | mg-1
|
---|
247 | TYPE
|
---|
248 | improve
|
---|
249 | REPORT
|
---|
250 | [email protected] - 12/Dec/94
|
---|
251 | FIX
|
---|
252 | [email protected] - 21/Dec/94, 14/Mar/95
|
---|
253 | CLAIM
|
---|
254 | Optimising by conversion to DNF is not necessarily such
|
---|
255 | a good idea - can actually slow things down.
|
---|
256 | PROBLEM
|
---|
257 | The distributive law used in converting to DNF
|
---|
258 | duplicates expressions.
|
---|
259 | SOLUTION
|
---|
260 | Introduce a query environment variable, optimise_type = 0 | 1 | 2.
|
---|
261 | Type 0 does nothing to the parse tree.
|
---|
262 | Type 2 does the DNF conversion.
|
---|
263 | Type 1 is the new default and does the following...
|
---|
264 | Do simple tree rearrangement like flattening.
|
---|
265 | Optimise for CNF queries.
|
---|
266 | FILES
|
---|
267 | * bool_query.c, .h
|
---|
268 | * bool_optimiser.c
|
---|
269 | * environment.c
|
---|
270 | * invf_get.c
|
---|
271 | * bool_tree.c, .h
|
---|
272 | * bool_tester.c
|
---|
273 | * lists.h
|
---|
274 | *************************************************************
|
---|
275 | TITLE
|
---|
276 | Similarity variants
|
---|
277 | APPLICATION
|
---|
278 | mg-2
|
---|
279 | TYPE
|
---|
280 | extend
|
---|
281 | REPORT
|
---|
282 | [email protected]/alistair - June 1994
|
---|
283 | FIX
|
---|
284 | [email protected] - July 1994 .. Feb 1995
|
---|
285 | CLAIM
|
---|
286 | Can only use one type of similarity measure - the
|
---|
287 | standard cosine measure.
|
---|
288 | PROBLEM
|
---|
289 | See CITRI/TR-95-3 for more details.
|
---|
290 | The standard measure can be broken up into 7 components.
|
---|
291 | The 7 components are
|
---|
292 | Each one of these components has a number of alternatives.
|
---|
293 | The overall measure, S_qd, can also be altered.
|
---|
294 | Thus the particular similarity measure used can be specified
|
---|
295 | by an 8 dimensional vector.
|
---|
296 | What is desired is to be able to specify to mgquery an option
|
---|
297 | and a 8-digit string representing this vector (assuming that
|
---|
298 | any one alternative can have at most 9 (not using zero) variants).
|
---|
299 | SOLUTION
|
---|
300 | The programs which had to be modified were:
|
---|
301 | (i) mgquery,
|
---|
302 | (ii) mg_weights_build.
|
---|
303 | The other mg programs in existence store the text, indexing info,
|
---|
304 | and the basic statistics such as N, n, ft, fdt.
|
---|
305 | Other programs which had to be created were:
|
---|
306 | (i) mg_fmd_build,
|
---|
307 | (ii) mg_wt_build.
|
---|
308 | Mg_fmd_build will create the file to store the f_md statistic,
|
---|
309 | where f_md is the largest (maximum) f_dt of any term in document, d.
|
---|
310 | Mg_wt_build will create the file to store the w_t primitive.
|
---|
311 | It only creates this for the w_t variants 6-9 which would require
|
---|
312 | extra passes of invf at query time if they were not stored here.
|
---|
313 | For details on similarity changes for mgquery and mg_weights_build,
|
---|
314 | please see the other modification entries.
|
---|
315 | FILES
|
---|
316 | * mg_fmd_build.c
|
---|
317 | * mg_wt_build.c
|
---|
318 | * build_lib.c, build_lib.h
|
---|
319 | *************************************************************
|
---|
320 | TITLE
|
---|
321 | Similarity variants for mgquery
|
---|
322 | APPLICATION
|
---|
323 | mg-2
|
---|
324 | TYPE
|
---|
325 | extend
|
---|
326 | REPORT
|
---|
327 | [email protected]/alistair - June 1994
|
---|
328 | FIX
|
---|
329 | [email protected] - July 1994 .. Feb 1995
|
---|
330 | CLAIM
|
---|
331 | "mgquery" needs to be altered to allow modification of
|
---|
332 | the similarity measure.
|
---|
333 | PROBLEM
|
---|
334 | See CITRI/TR-95-3 for more details.
|
---|
335 | SOLUTION
|
---|
336 |
|
---|
337 | Most of the similarity measures, Sqd, are of the
|
---|
338 | form: Aqd
|
---|
339 | -----
|
---|
340 | Bqd
|
---|
341 | where Bqd is an expression involving Wd and possibly Wq,
|
---|
342 | where Aqd is a sum over the common document/query terms
|
---|
343 | of w_qt and w_dt.
|
---|
344 | Building of Aqd
|
---|
345 | ===============
|
---|
346 | The calculation of Aqd is done in the file build_Aqd.c .
|
---|
347 | The functions for doing this used to be in the file invf_get.c .
|
---|
348 | Build_Aqd.c contains 4 different functions for building Aqd, each
|
---|
349 | of them building a different data structure:
|
---|
350 | (i) Array, (ii) Splay Tree, (iii) Hash Table, (iv) List Table.
|
---|
351 | Each of these routines seems to have been construction by modifications
|
---|
352 | to duplicated code. This is often the easiest way to construct variants
|
---|
353 | but is quite difficult to maintain consistency.
|
---|
354 | As the aim of the exercise was to try out different sim. measures for
|
---|
355 | retrieval effectiveness, I only modified the code that constructed
|
---|
356 | an array. This routine was called "CosineDecode"; I changed it to
|
---|
357 | "build_Aqd_Array." This change reflects the fact that we are only
|
---|
358 | calculating Aqd and this need not be used for the Cosine measure.
|
---|
359 | The other routines: "CosineDecodeSplay," "CosineDecodeHash," and
|
---|
360 | "CosineDecodeList" have been left unaltered - they need to be updated
|
---|
361 | in the future which would be best be done by abstracting out common code.
|
---|
362 | By the stage of building Aqd, the query terms have been looked up in
|
---|
363 | the inverted file dictionary and put into a list.
|
---|
364 | This list of common terms is traversed to lookup the corresponding
|
---|
365 | invf entries. Before the invf entry is processed, all query and term
|
---|
366 | relevant statistics and primitive quantities are calculated.
|
---|
367 | For example, fqt, ft, wt, rqt, wqt, Wq-partial-sum.
|
---|
368 | To save unnecessary calculations, there is a test for each value
|
---|
369 | to see whether it is needed e.g. "if (sim->variant_needs & NEEDS_wt) ... ".
|
---|
370 | Aside: Variant Needs
|
---|
371 | --------------------
|
---|
372 | The idea behind the "variant_needs" field is to be able to have all
|
---|
373 | the code in the one place for each possible variant and this code would
|
---|
374 | get the information at the correct time/place only when it is needed.
|
---|
375 | The overhead is a "bit-and" and "test" for each component.
|
---|
376 | The important concern though, is that the "variant_needs" must be
|
---|
377 | accurate i.e. it should be carefully maintained.
|
---|
378 | Each possible need is stored as a bit position in a constant/macro
|
---|
379 | of the form "NEEDS_component" e.g. NEEDS_wt, NEEDS_nt
|
---|
380 | More recently (Jan/95), I have found that the _need_ing of a component
|
---|
381 | may be relevant to a particular purpose, that is, it may be needed in
|
---|
382 | one situation and not another. This was the case for Wd.
|
---|
383 | For rdt#6, Wd was needed, however, it might not be needed for Bqd, the
|
---|
384 | denominator of Sqd. So I changed from NEEDS_Wd to NEEDS_Wd_for_Sqd and
|
---|
385 | NEEDS_Wd_for_rdt.
|
---|
386 | --------------------
|
---|
387 | To handle the different calculations of the variants, I wrote macros
|
---|
388 | based around a switch statement. All these macros are stored in the file
|
---|
389 | "similarity.h" . The point of doing this is to centralise most of the changes
|
---|
390 | and cut down on the number of files which have to be altered if a new
|
---|
391 | way of calculating a primitive is to be added.
|
---|
392 | This is achieved by having a data record called, "Similarity_variant" whose
|
---|
393 | fields includes all the statistics and the similarity primitives.
|
---|
394 | So the standard procedure is to see if something is needed and if so, then
|
---|
395 | extract it from an mg file or calculate it using a macro - most of the
|
---|
396 | input and output to the macro is done via the "Similarity_variant" structure.
|
---|
397 | As well as calculating Aqd, it is also a convenient place to calculate Wq.
|
---|
398 | Previously, for Cosine measure, Wq was not needed because each Aqd was directly
|
---|
399 | divided by Wq and thus would not change the ordering. However, there are
|
---|
400 | some Sqd's which divide by a sum involving Wq and thus need Wq.
|
---|
401 | Calculating Sqd and Ordering Documents
|
---|
402 | ======================================
|
---|
403 | The file query.ranked.c contains the code for calculating Sqd using the
|
---|
404 | approximate Wd in order to do the ranking of the documents.
|
---|
405 | The mg-1 file was cleaned up and modified slightly.
|
---|
406 | All the heap data structures and routines were taken out and placed in
|
---|
407 | their own file, heap_weights.c/.h .
|
---|
408 | The major components/steps in the ranking process were abstracted into
|
---|
409 | macros and functions:
|
---|
410 | calc_MaxParas
|
---|
411 | insert_heap_entry
|
---|
412 | insert_greater_heap_entry
|
---|
413 | approx_guided_insert
|
---|
414 | fill_initial_heap
|
---|
415 | add_heap_remainders
|
---|
416 | change_heap_to_exact
|
---|
417 | add_remainder_exact_weights
|
---|
418 | Make_Exact_Root
|
---|
419 | build_doc_list
|
---|
420 | Aside: Zeroing of an Aqd Element
|
---|
421 | -------------------------
|
---|
422 | One interesting change concerned the use of zeroing out an Aqd element
|
---|
423 | so as to mark it as being used in the heap.
|
---|
424 | In the heap, the approx. Sqd is stored based on Aqd/Wd-approx.
|
---|
425 | Later when Aqd is required, it can be extracted from the approx. Sqd
|
---|
426 | by muliplication of Wd-approx i.e. Aqd = Sqd-approx * Wd-approx.
|
---|
427 | This, however, is not always possible for the various Sqd variants.
|
---|
428 | So, instead of zeroing out Aqd, I decided just to make it negative.
|
---|
429 | -------------------------
|
---|
430 | In some Sqd cases, Wd and Wd-approx. are not required. In which case,
|
---|
431 | step 3 which calls on "change_heap_to_exact", "Heap_Build" and
|
---|
432 | "add_remainder_exact_weights" is not required.
|
---|
433 |
|
---|
434 | FILES
|
---|
435 | * build_Aqd.c
|
---|
436 | * query.ranked.c
|
---|
437 | * similarity.c/.h (in libmg)
|
---|
438 | * heap_weights.c/.h
|
---|
439 | * backend.c/.h
|
---|
440 | *************************************************************
|
---|
441 | TITLE
|
---|
442 | Similarity variants for mg_weights_build
|
---|
443 | APPLICATION
|
---|
444 | mg-2
|
---|
445 | TYPE
|
---|
446 | extend
|
---|
447 | REPORT
|
---|
448 | [email protected]/alistair - June 1994
|
---|
449 | FIX
|
---|
450 | [email protected] - July 1994 .. Feb 1995
|
---|
451 | CLAIM
|
---|
452 | "mg_weights_build" needs to be altered to allow modification of
|
---|
453 | the similarity measure.
|
---|
454 | PROBLEM
|
---|
455 | See CITRI/TR-95-3 for more details.
|
---|
456 | SOLUTION
|
---|
457 | The weight files which are generated for a particular similarity
|
---|
458 | measure have their names extended by a suffix.
|
---|
459 | In the case of Wd#1 no weights are generated.
|
---|
460 | In the case of the standard cosine weights, Wd#2, a 3 letter suffix
|
---|
461 | is used to represent .
|
---|
462 | In the case of the other Wd variants, a one letter suffix is used
|
---|
463 | to represent which Wd variant it is.
|
---|
464 | In each case, the variant input (e.g. -q 22222222) should be the whole
|
---|
465 | similarity variant string and the relevant fields will be extracted out.
|
---|
466 | This is done for consistency in code and interface.
|
---|
467 | The code is fairly similar to the original.
|
---|
468 | A dependency check has been added so that the dates of files and the
|
---|
469 | type of needed files is verified before building.
|
---|
470 | The dependencies include, invf dictionary, invf index, invf, fmd and wt files.
|
---|
471 | The major change here, is the possible use of fmd and/or wt files.
|
---|
472 | Later when I was having to write mg_fmd_build and mg_wt_build,
|
---|
473 | I decided to abstract out some macros, namely:
|
---|
474 | Get_ft, Get_ft_Ft, loop_invf_entry,
|
---|
475 | which were put into "build_lib.c" .
|
---|
476 | "loop_invf_entry" takes a function/macro name as a parameter and applies
|
---|
477 | it to the sim record (with field fdt set), current doc number and modifies
|
---|
478 | the return value.
|
---|
479 | FILES
|
---|
480 | * mg_weights_build.c
|
---|
481 | * build_lib.c
|
---|
482 | *************************************************************
|
---|
483 | TITLE
|
---|
484 | Mgstat with non-existent files
|
---|
485 | APPLICATION
|
---|
486 | mg-1
|
---|
487 | TYPE
|
---|
488 | bug
|
---|
489 | REPORT
|
---|
490 | [email protected] - 16 May 1994
|
---|
491 | FIX
|
---|
492 | [email protected] - 10 Aug 1994
|
---|
493 | CLAIM
|
---|
494 | NaNs and Infinites would be printed out by mgstat
|
---|
495 | if unable to open .text or .text.dict file.
|
---|
496 | PROBLEM
|
---|
497 | The NaNs etc. were output in the column stating
|
---|
498 | the percentage size of the file compared with the
|
---|
499 | number of input bytes of the source text data.
|
---|
500 | If it couldn't read the .text file with its
|
---|
501 | header describing the number of source text bytes, then
|
---|
502 | in working out the percentage it would divide by zero.
|
---|
503 | Also due to some bad control flow, it wouldn't attempt to
|
---|
504 | open the .text file if it failed when opening
|
---|
505 | the .text.dict file.
|
---|
506 | SOLUTION
|
---|
507 | Only printout the percentage if we can read the header
|
---|
508 | from the .text file.
|
---|
509 | Read in text header irrespective of text dictionary file.
|
---|
510 | FILES
|
---|
511 | * mgstat.c
|
---|
512 | *************************************************************
|
---|
513 | TITLE
|
---|
514 | Boolean tree optimisations
|
---|
515 | APPLICATION
|
---|
516 | mg-2
|
---|
517 | TYPE
|
---|
518 | extend
|
---|
519 | REPORT
|
---|
520 | (i) [email protected] - 28/Sep/94
|
---|
521 | (ii) [email protected] - 12/Dec/94
|
---|
522 | FIX
|
---|
523 | (i) [email protected] - 18/Oct/94
|
---|
524 | (ii) [email protected] - 21/Dec/94
|
---|
525 | CLAIM
|
---|
526 | The initial prompt for investigating the optimisation of
|
---|
527 | boolean queries is noted in the mg-1 mod14.txt.
|
---|
528 | The code for optimising seemed to have a number of faults.
|
---|
529 | PROBLEM
|
---|
530 | Boolean optimisation was unreliable.
|
---|
531 | SOLUTION
|
---|
532 | Initially (in case (i) above, see mg-1/mod14.txt), I rewrote
|
---|
533 | all the boolean tree and optimising code. I converted the boolean
|
---|
534 | expression into DNF. I did this after reading some notes about
|
---|
535 | the steps involved in optimisation and they suggested standardising
|
---|
536 | in some normal form. I thought that DNF would be appropriate so that
|
---|
537 | all the terms are converted to be part of "and" expressions and be
|
---|
538 | evaluated quickly using skipping.
|
---|
539 | This, however, can suffer quite badly if the distributive law is
|
---|
540 | applied to often and the query expands in size. If there was
|
---|
541 | some sort of cacheing of invf entries, then it might not be so
|
---|
542 | bad otherwise there is quite an overhead on reading the same
|
---|
543 | invf entry more than once.
|
---|
544 |
|
---|
545 | As it happens, CNF queries are reasonably common, where the user
|
---|
546 | queries with a conjunction of disjunctions of similes:
|
---|
547 | e.g. (car | automobile | vehicle) & (fast | quick | speedy)
|
---|
548 | This sort of CNF query expands a hell of a lot !
|
---|
549 | So after speaking with Justin who wanted to benchmark Atlas with Mg on
|
---|
550 | these sort of queries, I looked up the MG book for other ideas.
|
---|
551 |
|
---|
552 | The method that I implemented was the following:
|
---|
553 |
|
---|
554 | -----------------------------------------------------------------
|
---|
555 | Steps of tree modifications:
|
---|
556 | Gets literals by pushing the nots down, detecting T/F at leaves
|
---|
557 | and collapsing the tree by detecting 'and' of 'and's and 'or' of
|
---|
558 | 'or's.
|
---|
559 | Next it looks at the or nodes and if all the children are terms
|
---|
560 | then it marks the or-node as such.
|
---|
561 | Finally, the or-term-nodes are sorted by using the sum of their
|
---|
562 | ft's for comparison.
|
---|
563 | Steps at query evaluation:
|
---|
564 | If it comes across an 'and' of 'or-terms' then the evaluation is
|
---|
565 | done noting the distributive law.
|
---|
566 | I.e. a & (b | c | d) = (a & b) | (a & c) | (a & d)
|
---|
567 | Assuming 'a' is the c-set of documents.
|
---|
568 | All of 'a' is tested against 'b' and matching ones are marked.
|
---|
569 | Next, all the unmarked members of 'a' are tested against 'c'.
|
---|
570 | Likewise for 'd'.
|
---|
571 | Now all the marked members of our c-set are kept.
|
---|
572 | When we do the testing, we can use the skipping in the invf entries.
|
---|
573 | -----------------------------------------------------------------
|
---|
574 |
|
---|
575 |
|
---|
576 | After doing this, I added the choice of which type of optimisation
|
---|
577 | the user wanted by adding query-environment-variable, "optimise_type".
|
---|
578 | Type 0 = no parse tree modification.
|
---|
579 | Type 1 = Or-term recognition and CNF query evaluation optimisation.
|
---|
580 | Type 2 = Put into DNF form. [generally not recommended]
|
---|
581 | FILES
|
---|
582 | * bool_tree.c
|
---|
583 | * bool_parser.y
|
---|
584 | * bool_optimiser.c
|
---|
585 | * bool_query.c
|
---|
586 | * bool_tester.c
|
---|
587 | * term_lists.c
|
---|
588 | * query_env.c
|
---|
589 | * invf_get.c [GetDocsOp]
|
---|
590 | *************************************************************
|
---|
591 | TITLE
|
---|
592 | nonexistent HOME bug
|
---|
593 | APPLICATION
|
---|
594 | mg-1, mg-2
|
---|
595 | TYPE
|
---|
596 | bug
|
---|
597 | REPORT
|
---|
598 | [email protected] - 2/May/95
|
---|
599 | FIX
|
---|
600 | [email protected] - 2/May/95
|
---|
601 | CLAIM
|
---|
602 | "The big problem was that mgquery crashes when the HOME environment
|
---|
603 | variable is not set, which is the case when it is run by the www server."
|
---|
604 | [...] "I expect it happens when looking for $HOME/.mgrc."
|
---|
605 | PROBLEM
|
---|
606 | The result of getenv("HOME")" was used directly in
|
---|
607 | a sprintf call. If the environment variable HOME
|
---|
608 | was not in existence then null would be used.
|
---|
609 | In some C libraries sprintf will convert the 0
|
---|
610 | string into the string "(null)" on others it will core dump.
|
---|
611 | (For example, Solaris seems to core dump, sunos 4 seems ok).
|
---|
612 | SOLUTION
|
---|
613 | The result from getenv("HOME")" is tested before
|
---|
614 | being used.
|
---|
615 | FILES
|
---|
616 | * commands.c
|
---|
617 | *************************************************************
|
---|
618 | TITLE
|
---|
619 | mgquery collection name preference
|
---|
620 | APPLICATION
|
---|
621 | mg-1, mg-2
|
---|
622 | TYPE
|
---|
623 | improve
|
---|
624 | REPORT
|
---|
625 | [email protected] - 2/May/95
|
---|
626 | FIX
|
---|
627 | [email protected] - 4/May/95
|
---|
628 | CLAIM
|
---|
629 | Surely something must override mquery's preference for ./bib.
|
---|
630 | If MGDATA is set correctly, I think it should prefer that collection,
|
---|
631 | and -d should definitely override it.
|
---|
632 | I could always say -d . if I really wanted ./bib.
|
---|
633 | PROBLEM
|
---|
634 | Currently the priority is:
|
---|
635 | 1. Check if ./name is a directory,
|
---|
636 | If so then use it as the collection directory.
|
---|
637 | 2. Check if ./name.text is a file,
|
---|
638 | If so then use ./ as the collection directory.
|
---|
639 | 3. Check if mgdir/name is a directory,
|
---|
640 | If so then use mgdir/name as the collection directory.
|
---|
641 | 4. Otherwise,
|
---|
642 | Use mgdir/name as the database file prefix.
|
---|
643 | This would be the case if one used "-f alice/alice".
|
---|
644 | However, one would then not specify a final name argument
|
---|
645 | and we'd never get here. Go figure ???
|
---|
646 | SOLUTION
|
---|
647 | Moved step 3 to the top instead.
|
---|
648 | FILES
|
---|
649 | * mgquery.c [search_for_collection()]
|
---|
650 | *************************************************************
|
---|
651 | TITLE
|
---|
652 | Printout of query terms
|
---|
653 | APPLICATION
|
---|
654 | mg-1, mg-2
|
---|
655 | TYPE
|
---|
656 | extend
|
---|
657 | REPORT
|
---|
658 | [email protected] - April 95
|
---|
659 | FIX
|
---|
660 | [email protected] - April 95
|
---|
661 | CLAIM
|
---|
662 | No easy way to find out the parsed and stemmed words
|
---|
663 | used in the query. Would like to know these words
|
---|
664 | so I can call a separate highlighting program to
|
---|
665 | highlight these words.
|
---|
666 | PROBLEM
|
---|
667 | No facility available.
|
---|
668 | SOLUTION
|
---|
669 | A ".queryterms" mgquery command was added which lists
|
---|
670 | out the parsed/stemmed queryterms of the last query.
|
---|
671 | FILES
|
---|
672 | * commands.c (added CmdQueryTerms)
|
---|
673 | *************************************************************
|
---|
674 | TITLE
|
---|
675 | mg_getrc
|
---|
676 | APPLICATION
|
---|
677 | mg-1, mg-2
|
---|
678 | TYPE
|
---|
679 | extend
|
---|
680 | REPORT
|
---|
681 | [email protected] - 2/May/95
|
---|
682 | FIX
|
---|
683 | -
|
---|
684 | CLAIM
|
---|
685 | Repeated code had to be written for different named
|
---|
686 | gets but really the same type of parsing required.
|
---|
687 | E.g. one might want to use a standard method for inserting
|
---|
688 | ^Bs between paragraphs for different books. One doesn't
|
---|
689 | want to write duplicate code for each different named book,
|
---|
690 | rather note that each book should be filtered "book" style.
|
---|
691 | PROBLEM
|
---|
692 | There was no way of abstracting out types of filters from
|
---|
693 | the name of an instance of a collection.
|
---|
694 | SOLUTION
|
---|
695 | Allow information to be given with <name, type, files>.
|
---|
696 | This extra info can be provided in a mg_getrc file.
|
---|
697 | See man page for mg_get for details.
|
---|
698 | FILES
|
---|
699 | * mg_get.sh
|
---|
700 | *************************************************************
|
---|
701 | TITLE
|
---|
702 | TREC DocNo file
|
---|
703 | APPLICATION
|
---|
704 | mg-2
|
---|
705 | TYPE
|
---|
706 | improve
|
---|
707 | REPORT
|
---|
708 | [email protected] - 26/May/95
|
---|
709 | FIX
|
---|
710 | [email protected] - 26/May/95
|
---|
711 | CLAIM
|
---|
712 | MG has problems dealing with trec docnos for trec disk 3.
|
---|
713 | PROBLEM
|
---|
714 | Trec DocNos file didn't have a wide enough field
|
---|
715 | to handle disk 3.
|
---|
716 | SOLUTION
|
---|
717 | Allow different width fields for file.
|
---|
718 | It is still fixed width but a number in the header
|
---|
719 | says how wide the field is.
|
---|
720 | FILES
|
---|
721 | * passes/mg.special.c
|
---|
722 | * query/mgquery.c
|
---|
723 | *************************************************************
|
---|
724 | TITLE
|
---|
725 | Boolean optimiser #1 with `!'
|
---|
726 | APPLICATION
|
---|
727 | mg-1, mg-2
|
---|
728 | TYPE
|
---|
729 | bug
|
---|
730 | REPORT
|
---|
731 | [email protected] - 20/7/95
|
---|
732 | FIX
|
---|
733 | [email protected] - 27/7/95
|
---|
734 | CLAIM
|
---|
735 | Complained about not-nodes.
|
---|
736 | e.g. complained about "croquet & !hedgehog"
|
---|
737 | PROBLEM
|
---|
738 | Boolean optimiser type#1 didn't convert
|
---|
739 | "and not"s into diff nodes.
|
---|
740 | SOLUTION
|
---|
741 | Added code to convert '&!' to '-'.
|
---|
742 | FILES
|
---|
743 | * mg/bool_optimiser.c [mg-1]
|
---|
744 | * query/bool_optimiser.c [mg-2]
|
---|
745 | *************************************************************
|
---|
746 | TITLE
|
---|
747 | Autoconfiguring mg-1
|
---|
748 | APPLICATION
|
---|
749 | mg-1
|
---|
750 | TYPE
|
---|
751 | improve
|
---|
752 | REPORT
|
---|
753 | many people - 94/95
|
---|
754 | FIX
|
---|
755 | [email protected] - Aug/95
|
---|
756 | CLAIM
|
---|
757 | Portability is limited by setting up c-macros just for particular
|
---|
758 | machines and operating systems.
|
---|
759 | People had to make changes for HP, Next, Linux, Dec Alpha, ...
|
---|
760 | PROBLEM
|
---|
761 | Porting was only targetting at the machines that the author had
|
---|
762 | access to.
|
---|
763 | SOLUTION
|
---|
764 | Use GNU's autoconfigure program.
|
---|
765 | This allows checking of the systems features/characteristics.
|
---|
766 | It also allows some checking for specific machines/OS - although
|
---|
767 | I have not utilised this option.
|
---|
768 | I used GNU's tar-1.11.8 as an example to base my changes on.
|
---|
769 | I also used autoscan to generate the initial "configure.in".
|
---|
770 | The "Makefile.in"s were done very similarly to GNU tar's.
|
---|
771 | The "config.h" and "sysfuncs.h" files were scrapped and
|
---|
772 | rewritten. The new "config.h" file is generated by the configure
|
---|
773 | script - it contains all the #define's for the system features.
|
---|
774 | The "sysfuncs.h" file wraps up a number of system headers.
|
---|
775 | For example, some systems use , while some use ;
|
---|
776 | which one is included is decided in "sysfuncs.h".
|
---|
777 | I have also used GNU tar's use of ansi2knr in its Makefiles.
|
---|
778 | This should hopefully allow the package to work on a system with
|
---|
779 | only a K&R C compiler.
|
---|
780 | However, there are probably problems with what I have done.
|
---|
781 | I am concerned about <stdarg.h> for example.
|
---|
782 | I also noticed that "ansi2knr" require function definitions as
|
---|
783 | the GNU coding style recommends ie. with function name the first
|
---|
784 | string on the line. This prompted me to run all the package's code
|
---|
785 | thru GNU's indent.
|
---|
786 | Setting up the configure changes is difficult. It really seems
|
---|
787 | necessary to try the package out on many target machines so one
|
---|
788 | can know what is necessary.
|
---|
789 | A simple check target for the main Makefile has been written.
|
---|
790 | It is used to see if the installation is working - it does
|
---|
791 | not test much of the functionality of mg.
|
---|
792 | It does cmp's on data files and diff's on query/result files.
|
---|
793 | FILES
|
---|
794 | Most of the files in the distribution.
|
---|
795 | *************************************************************
|
---|
796 | TITLE
|
---|
797 | Consistent use of stderr
|
---|
798 | APPLICATION
|
---|
799 | mg-1
|
---|
800 | TYPE
|
---|
801 | improve
|
---|
802 | REPORT
|
---|
803 | [email protected] - 16 May 1994
|
---|
804 | FIX
|
---|
805 | [email protected] - 11 August 1994
|
---|
806 | CLAIM
|
---|
807 | Inconsistent use of stdout/stderr in usage messages.
|
---|
808 | PROBLEM
|
---|
809 | Sometimes used "printf" and sometimes used "fprintf(stderr"
|
---|
810 | in usage messages.
|
---|
811 | SOLUTION
|
---|
812 | All should now use "fprintf(stderr" in usage messages.
|
---|
813 | FILES
|
---|
814 | * mg_compression_dict.c
|
---|
815 | * mg_compression_dict.1
|
---|
816 | * mg_fast_comp_dict.c
|
---|
817 | * mg_fast_comp_dict.1
|
---|
818 | * mg_invf_dict.c
|
---|
819 | * mg_invf_dict.1
|
---|
820 | * mg_invf_dump.c
|
---|
821 | * mg_invf_dump.1
|
---|
822 | * mg_invf_rebuild.c
|
---|
823 | * mg_invf_rebuild.1
|
---|
824 | * mg_perf_hash_build.c
|
---|
825 | * mg_perf_hash_build.1
|
---|
826 | * mg_text_estimate.c
|
---|
827 | * mg_text_estimate.1
|
---|
828 | * mg_weights_build.c
|
---|
829 | * mg_weights_build.1
|
---|
830 | *************************************************************
|
---|
831 | TITLE
|
---|
832 | xmg bug
|
---|
833 | APPLICATION
|
---|
834 | mg-1
|
---|
835 | TYPE
|
---|
836 | bug
|
---|
837 | REPORT
|
---|
838 | [email protected] - 22 April 1994
|
---|
839 | FIX
|
---|
840 | [email protected] - 22 April 1994
|
---|
841 | CLAIM
|
---|
842 | "Serious problem in xmg, which I fear occurs whenever a query
|
---|
843 | doesn't return anything."
|
---|
844 | PROBLEM
|
---|
845 | ??
|
---|
846 | SOLUTION
|
---|
847 | [xmg.sh 201] set rank 0
|
---|
848 | FILES
|
---|
849 | * xmg.sh
|
---|
850 | *************************************************************
|
---|
851 | TITLE
|
---|
852 | Unnecessary loading of text
|
---|
853 | APPLICATION
|
---|
854 | mg-1
|
---|
855 | TYPE
|
---|
856 | bug
|
---|
857 | REPORT
|
---|
858 | [email protected] - ?? August 1994
|
---|
859 | FIX
|
---|
860 | [email protected] - 12 August 1994
|
---|
861 | CLAIM
|
---|
862 | Mg was loading and uncompressing text when the
|
---|
863 | query did not require the text.
|
---|
864 | PROBLEM
|
---|
865 | There was no test for the query mode
|
---|
866 | before loading and uncompressing the text.
|
---|
867 | SOLUTION
|
---|
868 | Only load/uncompress text if query mode
|
---|
869 | is for text, headers or silent(for timing).
|
---|
870 | FILES
|
---|
871 | * mgquery.c
|
---|
872 | *************************************************************
|
---|
873 | TITLE
|
---|
874 | Man page errors
|
---|
875 | APPLICATION
|
---|
876 | mg-1
|
---|
877 | TYPE
|
---|
878 | bug
|
---|
879 | REPORT
|
---|
880 | [email protected] - 16 May 1994
|
---|
881 | FIX
|
---|
882 | [email protected] - 16 May 1994
|
---|
883 | CLAIM
|
---|
884 | Man page errors.
|
---|
885 | PROBLEM
|
---|
886 | See below.
|
---|
887 | SOLUTION
|
---|
888 | "The mg_make_fast_dict.1 file has been renamed mg_fast_comp_dict.1,
|
---|
889 | and all mg_make_fast_dict strings changed to mg_fast_comp_dict in all
|
---|
890 | man pages.
|
---|
891 | A large number of errors of spelling, typography, spacing, fonts,
|
---|
892 | grammar, omitted words, slang, punctuation, missing man page
|
---|
893 | cross-references, and man-page style have been corrected."
|
---|
894 | FILES
|
---|
895 | * mg_compression_dict.1
|
---|
896 | * mg_fast_comp_dict.1
|
---|
897 | * mg_get.1
|
---|
898 | * mg_invf_dict.1
|
---|
899 | * mg_invf_dump.1
|
---|
900 | * mg_invf_rebuild.1
|
---|
901 | * mg_passes.1
|
---|
902 | * mg_perf_hash_build.1
|
---|
903 | * mg_text_estimate.1
|
---|
904 | * mg_weights_build.1
|
---|
905 | * mgbilevel.1
|
---|
906 | * mgbuild.1
|
---|
907 | * mgdictlist.1
|
---|
908 | * mgfelics.1
|
---|
909 | * mgquery.1
|
---|
910 | * mgstat.1
|
---|
911 | * mgtic.1
|
---|
912 | * mgticbuild.1
|
---|
913 | * mgticdump.1
|
---|
914 | * mgticprune.1
|
---|
915 | * mgticstat.1
|
---|
916 | * xmg.1
|
---|
917 | *************************************************************
|
---|
918 | TITLE
|
---|
919 | Man page overview
|
---|
920 | APPLICATION
|
---|
921 | mg-1
|
---|
922 | TYPE
|
---|
923 | extend
|
---|
924 | REPORT
|
---|
925 | [email protected] -
|
---|
926 | FIX
|
---|
927 | [email protected] - 17 August 1994
|
---|
928 | CLAIM
|
---|
929 | "Write new mg.1 file to give a brief overview of mg, with samples
|
---|
930 | of how to use it. Otherwise, users are likely to be completely
|
---|
931 | overwhelmed by the number of programs (about 20) which might need to
|
---|
932 | be used, when in reality, only 2 or 3 are likely to be run by end
|
---|
933 | users."
|
---|
934 | SOLUTION
|
---|
935 | It was thought that mg.1, written by Nelson Beebe, was very useful
|
---|
936 | but a bit too comprehensive for an introduction.
|
---|
937 | Therefore, two man files, mgintro.1 and mgintro++.1 were written
|
---|
938 | with the basic stuff in mgintro.1 and slightly more advanced stuff
|
---|
939 | in mgintro++.1 .
|
---|
940 | FILES
|
---|
941 | * mg.1
|
---|
942 | * mgintro.1
|
---|
943 | * mgintro++.1
|
---|
944 | *************************************************************
|
---|
945 | TITLE
|
---|
946 | Parse errors not bus errors
|
---|
947 | APPLICATION
|
---|
948 | mg-1
|
---|
949 | TYPE
|
---|
950 | bug
|
---|
951 | REPORT
|
---|
952 | [email protected] - 2 Jun 94
|
---|
953 | FIX
|
---|
954 | [email protected] - 19 Aug 94
|
---|
955 | CLAIM
|
---|
956 | "These two queries
|
---|
957 | (which I typed in before I knew what I was doing!!)
|
---|
958 | > The Queen of Hearts, she made some tarts
|
---|
959 | > "The Queen of Hearts" and "she made some tarts"
|
---|
960 | produced the following result:
|
---|
961 | mgquery : parse error
|
---|
962 | Bus error
|
---|
963 | "
|
---|
964 | PROBLEM
|
---|
965 | What is expected to happen under boolean querying:
|
---|
966 | Query1:
|
---|
967 | > The Queen of Hearts, she made some tarts
|
---|
968 | will produce a parse error due to the comma which
|
---|
969 | is not a valid TERM.
|
---|
970 | Query2:
|
---|
971 | > "The Queen of Hearts" and "she made some tarts"
|
---|
972 | will store a post-processing string
|
---|
973 | of ''The Queen of Hearts" and "she made some tarts'' and
|
---|
974 | will have a main boolean query of the empty string.
|
---|
975 | This is because the postprocessing string takes in
|
---|
976 | everything between the first quote and the last one.
|
---|
977 | An empty string is illegal for the boolean grammar and
|
---|
978 | hence a parse error.
|
---|
979 | The problem stems from the fact that the processing of
|
---|
980 | the parse tree is carried out, even though we have a
|
---|
981 | parse error. In the case of using an empty string to build
|
---|
982 | a parse tree, it is likely to leave the parse tree undefined.
|
---|
983 | SOLUTION
|
---|
984 | As soon as we find out that there is a parse-error,
|
---|
985 | we abandon any processing of the parse tree.
|
---|
986 | FILES
|
---|
987 | * query.bool.y
|
---|
988 | * query.bool.c (generated from query.bool.y)
|
---|
989 | *************************************************************
|
---|
990 | TITLE
|
---|
991 | Perfect hashing on small vocab
|
---|
992 | APPLICATION
|
---|
993 | mg-1
|
---|
994 | TYPE
|
---|
995 | bug
|
---|
996 | REPORT
|
---|
997 | [email protected] - July 1994
|
---|
998 | FIX
|
---|
999 | [email protected] - July 1994
|
---|
1000 | CLAIM
|
---|
1001 | Mg could not handle small collections in the case
|
---|
1002 | where there was only a small number of unique words.
|
---|
1003 | The perfect hash function would report an error.
|
---|
1004 | PROBLEM
|
---|
1005 | Rounding of the arithmetic during the calculation of the
|
---|
1006 | parameters of the perfect hash function was resulting in a
|
---|
1007 | combination of values such that the probability of a hash
|
---|
1008 | function being found was very small. This led to the limit
|
---|
1009 | on the generation loop being exceeded, and eventual
|
---|
1010 | failure.
|
---|
1011 | SOLUTION
|
---|
1012 | By using ceiling rather than floor when converting from a
|
---|
1013 | floating point value to an integer parameter, the arithmetic
|
---|
1014 | is now correct for all lexicon sizes, and the probability of
|
---|
1015 | each iteration successfully generating a hash function is
|
---|
1016 | sufficiently great that with _very_ high probability the
|
---|
1017 | execution loop counter will not be exceeded unless there
|
---|
1018 | genuinely is no hash function (for example, if the lexicon
|
---|
1019 | contains two words the same there cannot be a hash
|
---|
1020 | function).
|
---|
1021 | FILES
|
---|
1022 | * perf_hash.c
|
---|
1023 | *************************************************************
|
---|