TITLE
Parsing of Long Words
APPLICATION
mg-1, mg-2
TYPE
bug
REPORT
[email protected] - May 11th 1994
FIX
[email protected] - August 9th 1994
CLAIM
Mg didn't handle long words properly; it crashed.
PROBLEM
The invf passes call PARSE_LONG_WORD [words.h], which uses a limit of
MAXLONGWORD when iterating through the string and storing into
a word. MAXLONGWORD = 8192.
However, mg strings generally store the length in the first
byte, limiting them to 255 characters. The word which was passed
to PARSE_LONG_WORD was an allocated string of MAXSTEMLEN = 255,
which is as large as we should get anyway. Thus when given
a word longer than 255 chars, PARSE_LONG_WORD would allow it
(less than 8192) and would try storing beyond the array limit.
SOLUTION
The author can't remember why PARSE_LONG_WORD was used or what
the significance of MAXLONGWORD = 8192 is.
So PARSE_LONG_WORD has been changed to PARSE_STEM_WORD, which
uses MAXSTEMLEN as its limit.
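The effect of capping the copy at MAXSTEMLEN can be sketched as below.
This is only an illustration: parse_stem_word here is a hypothetical
function, not the PARSE_STEM_WORD macro from words.h, and it assumes the
destination buffer has room for MAXSTEMLEN + 1 bytes and that a word is
a run of alphanumeric characters.

```c
#include <ctype.h>
#include <stddef.h>

#define MAXSTEMLEN 255  /* largest length a one-byte prefix can record */

/*
 * Hypothetical sketch: copy one alphanumeric word from [s, end) into
 * word, storing the length in word[0] and the characters in word[1..].
 * The caller must supply at least MAXSTEMLEN + 1 bytes. Copying stops
 * after MAXSTEMLEN characters, so an over-long run is split instead of
 * being written past the end of the buffer.
 * Returns a pointer to the first input byte that was not consumed.
 */
const unsigned char *parse_stem_word(unsigned char *word,
                                     const unsigned char *s,
                                     const unsigned char *end)
{
    size_t len = 0;

    while (s < end && len < MAXSTEMLEN && isalnum(*s)) {
        word[++len] = *s++;
    }
    word[0] = (unsigned char)len;
    return s;
}
```

With a 600-character run as input, the first call stores 255 characters
and returns a pointer to the 256th, which the caller can parse next.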
FILES
* words.h
* invf.pass1.c
* invf.pass2.c
* ivf.pass1.c
* ivf.pass2.c
* query.ranked.c
*************************************************************
TITLE
Use of Lovins stemmer
APPLICATION
mg-1
TYPE
improve
REPORT
local - 1994
FIX
[email protected] - 1994
CLAIM
Stemming was done naively.
PROBLEM
Only a few types of words and their endings
were considered.
SOLUTION
Replacement with a more elaborate "known" stemmer by Lovins.
The algorithm is described in:
J.B. Lovins, "Development of a Stemming Algorithm",
Mechanical Translation and Computational Linguistics, Vol 11, 1968.
FILES
* stem.c
* stem.h
*************************************************************
TITLE
Different term parsing
APPLICATION
mg-1
TYPE
bug
REPORT
[email protected] - 23 Aug 1994
FIX
[email protected] - 23 Aug 1994
CLAIM
Boolean queries did not extract words/terms using the
same method as is used at inverted-file creation and
for ranked query parsing.
PROBLEM
The hand-written lexical analyser, query_lex, which is called by
the boolean query parser, was not calling the common
word-extraction routine used by the rest of mg.
This would be OK if the two pieces of code did the same things - but they didn't.
Query_lex, for instance, did NOT place any limit on the
number of digits in a term.
Of even more concern, it would allow arbitrarily long words
although it used Pascal-style strings, which store the length
in the first byte and can therefore hold at most 255 characters.
SOLUTION
Query_lex, in "query.bool.y", was modified to call the routine
PARSE_STEM_WORD, which is also used by the text-inversion routines and
the ranked query routines.
Now all terms are extracted by the same routine.
To do this, the end of the line buffer had to be noted, as
PARSE_STEM_WORD requires a pointer to the end - which is the
safe thing to do (we don't want to run over the end).
This meant I had to find the length of the query line buffer.
This was allocated in the file "read_line.c" by the routine
"readline". Its size was the literal number 1024.
This was changed to a named constant and placed in "read_line.h".
The definition of PARSE_STEM_WORD can be found in "words.h".
FILES
* query.bool.y
* query.bool.c (by bison)
* read_line.c
* read_line.h
*************************************************************
TITLE
Highlighting of query terms
APPLICATION
mg-1
TYPE
extend
REPORT
[email protected] - Aug 94
FIX
[email protected] - Sep 94
CLAIM
It is difficult to be confident that the query result returned
satisfies the query - one needs to look hard to find the queried words.
Need to show the words in the results using some highlighting method.
PROBLEM
No highlighting of query terms in results.
SOLUTION
Mgquery was previously outputting the decompressed text to a pager
such as "less(1)" or "more(1)".
(Except when redirected or piped elsewhere :)
So what was needed was some sort of highlight pager that, instead of
just displaying the text, would also use some means of highlighting the
stemmed query words.
Two common forms of highlighting were chosen: underlining and bolding.
These are supported by "less(1)" and possibly by "more(1)" by
using the backspace character.
A highlight pager will also need to know which words need to be
highlighted. Therefore, the code was modified to build up a
string of the stemmed query words for passing to the highlight pager.
Design Options:
---------------
* Could do text filtering in mgquery before passing out to pager.
  Instead I pipe to a separate process, the "hilite_words" pager,
  which filters and pipes into less/more.
* Could do different highlighting or a combination.
* Could use a different structure for storing the query words other
  than the hash-table I used.
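Backspace highlighting follows the usual overstrike convention: an
underlined character is written as '_', backspace, character, and a bold
character as character, backspace, character. The sketch below shows the
encoding only. highlight_word and the hilite enum are hypothetical names
and are not taken from mg_hilite_words.c.

```c
#include <stddef.h>

enum hilite { HILITE_UNDERLINE, HILITE_BOLD };

/*
 * Hypothetical sketch: write word into dst using backspace
 * overstriking, three output bytes per input character.
 *   underline:  '_'  '\b'  c
 *   bold:        c   '\b'  c
 * The output is NUL-terminated and truncated at a character boundary
 * if dst is too small. Returns the number of bytes written, not
 * counting the terminating NUL.
 */
size_t highlight_word(char *dst, size_t dstsize,
                      const char *word, enum hilite style)
{
    size_t n = 0;

    /* need room for three bytes plus the terminating NUL */
    for (; *word != '\0' && n + 4 <= dstsize; word++) {
        dst[n++] = (style == HILITE_UNDERLINE) ? '_' : *word;
        dst[n++] = '\b';
        dst[n++] = *word;
    }
    if (dstsize > 0) {
        dst[n] = '\0';
    }
    return n;
}
```

Piping such output through a pager that understands overstriking, such
as less, shows the word underlined or in bold.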
FILES
* Makefile - to include hilite_words target
* mg_hilite_words.c
* mgquery.c
* mgquery.1
* query.bool.y
* query.ranked.c
* environment.c
* environment.h
* backend.h
*************************************************************
TITLE
Mg_compression_dict did premature free
APPLICATION
mg-1
TYPE
bug
REPORT
[email protected] - 23 Sep 94
FIX
[email protected] - 23 Sep 94
CLAIM
mg_compression_dict dumped core in
  file: mg_compression_dict.c
  function: Write_data
  line: int codelen = hd->clens[i];
PROBLEM
Huffman data, hd, was freed *before* it was accessed again.
SOLUTION
The freeing of hd has been moved to after all accesses
(just before returning).
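The ordering of the fix can be sketched as follows. The huff_data
structure and both functions are hypothetical stand-ins; only the clens
field and the "int codelen = hd->clens[i];" access come from the report.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the Huffman data used by Write_data. */
typedef struct {
    int num_codes;
    int *clens;     /* code length of each symbol */
} huff_data;

/* Allocate a huff_data holding a copy of lens[0..n-1]; NULL on failure. */
huff_data *make_huff_data(const int *lens, int n)
{
    huff_data *hd = malloc(sizeof *hd);
    if (hd == NULL) {
        return NULL;
    }
    hd->clens = malloc((size_t)n * sizeof *hd->clens);
    if (hd->clens == NULL) {
        free(hd);
        return NULL;
    }
    for (int i = 0; i < n; i++) {
        hd->clens[i] = lens[i];
    }
    hd->num_codes = n;
    return hd;
}

/*
 * Sum the code lengths, then release hd. The frees come after the
 * last access to hd, just before returning.
 */
long sum_and_release(huff_data *hd)
{
    long total = 0;

    for (int i = 0; i < hd->num_codes; i++) {
        int codelen = hd->clens[i];
        total += codelen;
    }
    free(hd->clens);
    free(hd);
    return total;
}
```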
FILES
* mg_compression_dict.c
*************************************************************
TITLE
Boolean tree optimising rewrite
APPLICATION
mg-1
TYPE
bug
REPORT
[email protected] - 23 Sep 94
FIX
[email protected] - Oct 94
CLAIM
"I am still getting core dump in "and" queries in mgquery,
where the first word does not exist, but the second one does."
PROBLEM
Having freed a particular node, the code tried to free it again and
to access one of its fields.

I.e. code-fragment...

  FreeNode(curr); /* where curr = CHILD(base) for 1st term in list */
  FreeNodes(next);
  FreeNodes(CHILD(base));
  /* but CHILD(base) has already been freed above */
  /* if the node was the first one in the list */

SOLUTION
A number of things in the code seemed a bit dubious to me.
So I have rewritten the boolean optimising stage and abstracted out
the various stages - each file starts with "bool".
Boolean query optimising seems to be a tricky problem.
It is not clear that putting an expression into a certain form will
actually simplify it, or whether simplification means faster querying.
I have converted a given boolean expression into DNF
(Disjunctive Normal Form). "And not" nodes, which are readily apparent
in DNF, are converted to "diff" nodes. I have only applied the idempotency
laws involving TRUE and FALSE, and not the ones requiring matching of
expressions - that is a potentially more complicated problem.
The optimiser has been tested by playing with "bool_tester", and if you are
having a crash or problem in a boolean query it would be worth testing the
query with "bool_tester". The token "*" stands for TRUE (or all documents)
and the token "_" stands for FALSE (or no documents). This should show the
expression before and after optimisation in an ASCII tree bracketing format.
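The laws involving TRUE and FALSE can be sketched on a small binary
tree as below. This is not the code in bool_optimiser.c: the node type
and function names are hypothetical, and the rules applied are
x & TRUE = x, x & FALSE = FALSE, x | TRUE = TRUE and x | FALSE = x. Each
discarded node is freed exactly once, which is the property the original
code lacked.

```c
#include <stdlib.h>

typedef enum { N_TERM, N_TRUE, N_FALSE, N_AND, N_OR } node_kind;

typedef struct node {
    node_kind kind;
    int term;                   /* term id, used when kind == N_TERM */
    struct node *left, *right;  /* both set for N_AND and N_OR */
} node;

node *new_node(node_kind kind, int term, node *left, node *right)
{
    node *n = malloc(sizeof *n);
    if (n == NULL) {
        abort();
    }
    n->kind = kind;
    n->term = term;
    n->left = left;
    n->right = right;
    return n;
}

void free_tree(node *n)
{
    if (n == NULL) {
        return;
    }
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}

/* Replace n by its child keep; free n itself and the other child. */
static node *collapse(node *n, node *keep, node *drop)
{
    free_tree(drop);
    free(n);
    return keep;
}

/* Simplify bottom-up and return the (possibly different) root. */
node *simplify(node *n)
{
    if (n->kind != N_AND && n->kind != N_OR) {
        return n;
    }
    n->left = simplify(n->left);
    n->right = simplify(n->right);

    /* AND: FALSE absorbs and TRUE is neutral. OR: the reverse. */
    node_kind absorb = (n->kind == N_AND) ? N_FALSE : N_TRUE;
    node_kind neutral = (n->kind == N_AND) ? N_TRUE : N_FALSE;

    if (n->left->kind == absorb)   return collapse(n, n->left, n->right);
    if (n->right->kind == absorb)  return collapse(n, n->right, n->left);
    if (n->left->kind == neutral)  return collapse(n, n->right, n->left);
    if (n->right->kind == neutral) return collapse(n, n->left, n->right);
    return n;
}
```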
FILES
* bool_tree.c
* bool_parser.y
* bool_optimiser.c
* bool_query.c
* bool_tester.c
* term_lists.c
*************************************************************
TITLE
Mgtic pixel placement
APPLICATION
mg-1
TYPE
bug
REPORT
Bruce McKenzie - [email protected] (21st Oct 1994)
FIX
[email protected]
CLAIM
mgtic crashed on certain files.
PROBLEM
Placing pixels outside of bitmap.
SOLUTION
Changed the putpixel routine to truncate at borders of the image.
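A bounds-checked pixel store might look like the sketch below. It
assumes that "truncate" means writes outside the image are dropped, and
the bitmap layout and the function signature are hypothetical rather
than those of mgtic.c.

```c
#include <stddef.h>

/* Hypothetical bitmap: one byte per pixel, stored row by row. */
typedef struct {
    int width, height;
    unsigned char *pixels;
} bitmap;

/*
 * Store value at (x, y) if that position lies inside the bitmap.
 * Returns 1 if the pixel was stored and 0 if it was clipped.
 */
int putpixel(bitmap *bm, int x, int y, unsigned char value)
{
    if (x < 0 || y < 0 || x >= bm->width || y >= bm->height) {
        return 0;   /* outside the image: ignore the write */
    }
    bm->pixels[(size_t)y * (size_t)bm->width + (size_t)x] = value;
    return 1;
}
```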
FILES
* mgtic.c
*************************************************************
TITLE
Improved boolean tree optimising
APPLICATION
mg-1
TYPE
improve
REPORT
[email protected] - 12/Dec/94
FIX
[email protected] - 21/Dec/94, 14/Mar/95
CLAIM
Optimising by conversion to DNF is not necessarily such
a good idea - it can actually slow things down.
PROBLEM
The distributive law used in converting to DNF
duplicates expressions.
SOLUTION
Introduce a query environment variable, optimise_type = 0 | 1 | 2.
Type 0 does nothing to the parse tree.
Type 2 does the DNF conversion.
Type 1 is the new default and does the following:
  * Do simple tree rearrangement such as flattening.
  * Optimise for CNF queries.
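To see how much the distributive law can duplicate, take an AND of
OR-groups. (a | b) & (c | d) has 4 term leaves, but its DNF,
(a & c) | (a & d) | (b & c) | (b & d), has 8. The helper below is a
hypothetical illustration that counts the leaves for the general case,
assuming all terms are distinct and nothing is simplified afterwards. It
is meant for small inputs and does not guard against overflow.

```c
/*
 * Number of term leaves in the DNF obtained by fully distributing an
 * AND of `clauses` OR-groups, each holding `terms` distinct terms.
 * There are terms^clauses conjunctions, each with `clauses` leaves.
 * The original expression has only clauses * terms leaves.
 */
unsigned long dnf_leaves(unsigned clauses, unsigned terms)
{
    unsigned long conjunctions = 1;

    for (unsigned i = 0; i < clauses; i++) {
        conjunctions *= terms;
    }
    return conjunctions * clauses;
}
```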
FILES
* bool_query.c, .h
* bool_optimiser.c
* environment.c
* invf_get.c
* bool_tree.c, .h
* bool_tester.c
* lists.h
*************************************************************
TITLE
Mgstat with non-existent files
APPLICATION
mg-1
TYPE
bug
REPORT
[email protected] - 16 May 1994
FIX
[email protected] - 10 Aug 1994
CLAIM
NaNs and Infinities would be printed out by mgstat
if it was unable to open the .text or .text.dict file.
PROBLEM
The NaNs etc. were output in the column stating
the percentage size of the file compared with the
number of input bytes of the source text data.
If it couldn't read the .text file with its
header describing the number of source text bytes, then
in working out the percentage it would divide by zero.
Also, due to some bad control flow, it wouldn't attempt to
open the .text file if it failed when opening
the .text.dict file.
SOLUTION
Only print out the percentage if we can read the header
from the .text file.
Read in the text header irrespective of the text dictionary file.
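The guard on the percentage column can be sketched as below.
format_percentage is a hypothetical helper, and printing "-" when the
source size is unknown is an assumption. The report only says that the
percentage is not printed in that case.

```c
#include <stdio.h>

/*
 * Hypothetical sketch: format file_bytes as a percentage of
 * source_bytes. When the source size is unknown (zero, because the
 * .text header could not be read) no division is attempted and a
 * placeholder is written instead.
 * Returns 1 if a percentage was written, 0 otherwise.
 */
int format_percentage(char *buf, size_t bufsize,
                      double file_bytes, double source_bytes)
{
    if (source_bytes > 0.0) {
        snprintf(buf, bufsize, "%.2f%%", 100.0 * file_bytes / source_bytes);
        return 1;
    }
    snprintf(buf, bufsize, "-");
    return 0;
}
```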
FILES
* mgstat.c
*************************************************************
TITLE
nonexistent HOME bug
APPLICATION
mg-1, mg-2
TYPE
bug
REPORT
[email protected] - 2/May/95
FIX
[email protected] - 2/May/95
CLAIM
"The big problem was that mgquery crashes when the HOME environment
variable is not set, which is the case when it is run by the www server."
[...] "I expect it happens when looking for $HOME/.mgrc."
PROBLEM
The result of getenv("HOME") was used directly in
a sprintf call. If the environment variable HOME
did not exist then a null pointer would be used.
In some C libraries sprintf will convert the null
pointer into the string "(null)"; in others it will dump core.
(For example, Solaris seems to dump core, SunOS 4 seems OK.)
SOLUTION
The result from getenv("HOME") is tested before
being used.
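A minimal sketch of the check is given below. The function names are
hypothetical and this is not the code in commands.c. Only getenv("HOME")
and the $HOME/.mgrc path come from the report, and snprintf is used here
instead of sprintf so that the buffer size is respected as well.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Build "<home>/.mgrc" in buf.
 * Returns 1 on success, 0 if home is NULL or the path does not fit.
 */
int build_rc_path(char *buf, size_t bufsize, const char *home)
{
    if (home == NULL) {
        return 0;   /* HOME not set: never hand NULL to printf("%s") */
    }
    int n = snprintf(buf, bufsize, "%s/.mgrc", home);
    return n >= 0 && (size_t)n < bufsize;
}

/* Same, taking the directory from the HOME environment variable. */
int home_rc_path(char *buf, size_t bufsize)
{
    return build_rc_path(buf, bufsize, getenv("HOME"));
}
```

A caller would skip reading the startup file when home_rc_path
returns 0.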
FILES
* commands.c
*************************************************************
TITLE
mgquery collection name preference
APPLICATION
mg-1, mg-2
TYPE
improve
REPORT
[email protected] - 2/May/95
FIX
[email protected] - 4/May/95
CLAIM
Surely something must override mgquery's preference for ./bib.
If MGDATA is set correctly, I think it should prefer that collection,
and -d should definitely override it.
I could always say -d . if I really wanted ./bib.
PROBLEM
Currently the priority is:
1. Check if ./name is a directory.
   If so then use it as the collection directory.
2. Check if ./name.text is a file.
   If so then use ./ as the collection directory.
3. Check if mgdir/name is a directory.
   If so then use mgdir/name as the collection directory.
4. Otherwise,
   use mgdir/name as the database file prefix.
   This would be the case if one used "-f alice/alice".
   However, one would then not specify a final name argument
   and we'd never get here. Go figure ???
SOLUTION
Moved step 3 to the top instead.
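The new search order can be sketched with POSIX stat(2) as below.
choose_collection is a hypothetical function, not the
search_for_collection() in mgquery.c, and long paths are simply
truncated by snprintf here.

```c
#include <stdio.h>
#include <sys/stat.h>

static int is_dir(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

static int is_file(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISREG(st.st_mode);
}

/*
 * Pick the collection for `name`, trying mgdir first.
 * Writes the chosen directory (or file prefix, for case 4) to out
 * and returns the number of the rule that matched:
 *   1: mgdir/name is a directory
 *   2: ./name is a directory
 *   3: ./name.text is a file, so use ./
 *   4: fall back to mgdir/name as the database file prefix
 */
int choose_collection(const char *mgdir, const char *name,
                      char *out, size_t outsize)
{
    char probe[1024];

    snprintf(out, outsize, "%s/%s", mgdir, name);
    if (is_dir(out)) {
        return 1;
    }
    snprintf(probe, sizeof probe, "./%s", name);
    if (is_dir(probe)) {
        snprintf(out, outsize, "%s", probe);
        return 2;
    }
    snprintf(probe, sizeof probe, "./%s.text", name);
    if (is_file(probe)) {
        snprintf(out, outsize, "./");
        return 3;
    }
    return 4;   /* out still holds mgdir/name */
}
```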
FILES
* mgquery.c [search_for_collection()]
*************************************************************
TITLE
Printout of query terms
APPLICATION
mg-1, mg-2
TYPE
extend
REPORT
[email protected] - April 95
FIX
[email protected] - April 95
CLAIM
No easy way to find out the parsed and stemmed words
used in the query. Would like to know these words
so I can call a separate highlighting program to
highlight them.
PROBLEM
No facility available.
SOLUTION
A ".queryterms" mgquery command was added which lists
the parsed/stemmed query terms of the last query.
FILES
* commands.c (added CmdQueryTerms)
*************************************************************
TITLE
mg_getrc
APPLICATION
mg-1, mg-2
TYPE
extend
REPORT
[email protected] - 2/May/95
FIX
-
CLAIM
Repeated code had to be written for differently named
gets even though the same type of parsing was required.
E.g. one might want to use a standard method for inserting
^Bs between paragraphs for different books. One doesn't
want to write duplicate code for each differently named book,
but rather note that each book should be filtered "book" style.
PROBLEM
There was no way of abstracting out types of filters from
the name of an instance of a collection.
SOLUTION
Allow information to be given as <name, type, files>.
This extra info can be provided in a mg_getrc file.
See the man page for mg_get for details.
FILES
* mg_get.sh
*************************************************************
TITLE
Boolean optimiser #1 with `!'
APPLICATION
mg-1, mg-2
TYPE
bug
REPORT
[email protected] - 20/7/95
FIX
[email protected] - 27/7/95
CLAIM
Complained about not-nodes.
e.g. complained about "croquet & !hedgehog"
PROBLEM
Boolean optimiser type#1 didn't convert
"and not"s into diff nodes.
SOLUTION
Added code to convert '&!' to '-'.
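The '&!' to '-' rewrite can be sketched on a small tree as below. The
node type and function name are hypothetical and do not come from
bool_optimiser.c. The sketch assumes that a NOT node keeps its operand
in left, that NOT nodes are heap-allocated, and that an AND of two NOT
nodes is left alone.

```c
#include <stdlib.h>

typedef enum { B_TERM, B_NOT, B_AND, B_DIFF } bkind;

typedef struct bnode {
    bkind kind;
    int term;                    /* term id, used when kind == B_TERM */
    struct bnode *left, *right;  /* B_NOT uses left only */
} bnode;

/*
 * Rewrite every AND(x, NOT(y)) as DIFF(x, y), working bottom-up and
 * in place. AND(NOT(y), x) is handled by swapping the children
 * first, since AND is commutative. The NOT node is freed once its
 * operand has been moved up.
 */
void and_not_to_diff(bnode *n)
{
    if (n == NULL) {
        return;
    }
    and_not_to_diff(n->left);
    and_not_to_diff(n->right);

    if (n->kind != B_AND) {
        return;
    }
    if (n->left->kind == B_NOT && n->right->kind != B_NOT) {
        bnode *tmp = n->left;
        n->left = n->right;
        n->right = tmp;
    }
    if (n->right->kind == B_NOT && n->left->kind != B_NOT) {
        bnode *not_node = n->right;
        n->kind = B_DIFF;
        n->right = not_node->left;
        free(not_node);
    }
}
```

Applied to "croquet & !hedgehog", this yields the diff node
"croquet - hedgehog".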
FILES
* mg/bool_optimiser.c [mg-1]
* query/bool_optimiser.c [mg-2]
*************************************************************
TITLE
Consistent use of stderr
APPLICATION
mg-1
TYPE
improve
REPORT
[email protected] - 16 May 1994
FIX
[email protected] - 11 August 1994
CLAIM
Inconsistent use of stdout/stderr in usage messages.
PROBLEM
Sometimes used "printf" and sometimes used "fprintf(stderr"
in usage messages.
SOLUTION
All should now use "fprintf(stderr" in usage messages.
FILES
* mg_compression_dict.c
* mg_compression_dict.1
* mg_fast_comp_dict.c
* mg_fast_comp_dict.1
* mg_invf_dict.c
* mg_invf_dict.1
* mg_invf_dump.c
* mg_invf_dump.1
* mg_invf_rebuild.c
* mg_invf_rebuild.1
* mg_perf_hash_build.c
* mg_perf_hash_build.1
* mg_text_estimate.c
* mg_text_estimate.1
* mg_weights_build.c
* mg_weights_build.1
*************************************************************
TITLE
xmg bug
APPLICATION
mg-1
TYPE
bug
REPORT
[email protected] - 22 April 1994
FIX
[email protected] - 22 April 1994
CLAIM
"Serious problem in xmg, which I fear occurs whenever a query
doesn't return anything."
PROBLEM
??
SOLUTION
[xmg.sh 201] set rank 0
FILES
* xmg.sh
*************************************************************
TITLE
Unnecessary loading of text
APPLICATION
mg-1
TYPE
bug
REPORT
[email protected] - ?? August 1994
FIX
[email protected] - 12 August 1994
CLAIM
Mg was loading and uncompressing text when the
query did not require the text.
PROBLEM
There was no test for the query mode
before loading and uncompressing the text.
SOLUTION
Only load/uncompress text if the query mode
is text, headers or silent (for timing).
FILES
* mgquery.c
*************************************************************
TITLE
Man page errors
APPLICATION
mg-1
TYPE
bug
REPORT
[email protected] - 16 May 1994
FIX
[email protected] - 16 May 1994
CLAIM
Man page errors.
PROBLEM
See below.
SOLUTION
"The mg_make_fast_dict.1 file has been renamed mg_fast_comp_dict.1,
and all mg_make_fast_dict strings changed to mg_fast_comp_dict in all
man pages.
A large number of errors of spelling, typography, spacing, fonts,
grammar, omitted words, slang, punctuation, missing man page
cross-references, and man-page style have been corrected."
FILES
* mg_compression_dict.1
* mg_fast_comp_dict.1
* mg_get.1
* mg_invf_dict.1
* mg_invf_dump.1
* mg_invf_rebuild.1
* mg_passes.1
* mg_perf_hash_build.1
* mg_text_estimate.1
* mg_weights_build.1
* mgbilevel.1
* mgbuild.1
* mgdictlist.1
* mgfelics.1
* mgquery.1
* mgstat.1
* mgtic.1
* mgticbuild.1
* mgticdump.1
* mgticprune.1
* mgticstat.1
* xmg.1
*************************************************************
TITLE
Man page overview
APPLICATION
mg-1
TYPE
extend
REPORT
[email protected] -
FIX
[email protected] - 17 August 1994
CLAIM
"Write new mg.1 file to give a brief overview of mg, with samples
of how to use it. Otherwise, users are likely to be completely
overwhelmed by the number of programs (about 20) which might need to
be used, when in reality, only 2 or 3 are likely to be run by end
users."
SOLUTION
It was thought that mg.1, written by Nelson Beebe, was very useful
but a bit too comprehensive for an introduction.
Therefore, two man files, mgintro.1 and mgintro++.1, were written,
with the basic material in mgintro.1 and slightly more advanced material
in mgintro++.1.
FILES
* mg.1
* mgintro.1
* mgintro++.1
*************************************************************
TITLE
Parse errors not bus errors
APPLICATION
mg-1
TYPE
bug
REPORT
[email protected] - 2 Jun 94
FIX
[email protected] - 19 Aug 94
CLAIM
"These two queries
(which I typed in before I knew what I was doing!!)
> The Queen of Hearts, she made some tarts
> "The Queen of Hearts" and "she made some tarts"
produced the following result:
mgquery : parse error
Bus error
"
PROBLEM
What is expected to happen under boolean querying:
Query1:
> The Queen of Hearts, she made some tarts
will produce a parse error due to the comma, which
is not a valid TERM.
Query2:
> "The Queen of Hearts" and "she made some tarts"
will store a post-processing string
of ''The Queen of Hearts" and "she made some tarts'' and
will have a main boolean query of the empty string.
This is because the post-processing string takes in
everything between the first quote and the last one.
An empty string is illegal in the boolean grammar and
hence produces a parse error.
The problem stems from the fact that the processing of
the parse tree is carried out even though we have a
parse error. In the case of using an empty string to build
a parse tree, it is likely to leave the parse tree undefined.
SOLUTION
As soon as we find out that there is a parse error,
we abandon any processing of the parse tree.
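The control flow of the fix can be sketched as below. parse_query is a
toy stand-in for the generated parser, not the grammar in query.bool.y:
it merely rejects an empty query or one containing a comma, and on
failure it leaves the tree unset, which is the situation described
above. All names here are hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *text;   /* placeholder for a real parse tree */
} parse_tree;

/*
 * Toy parser: returns 0 on success and nonzero on a parse error.
 * On error *tree is left untouched, so it must not be used.
 */
static int parse_query(const char *query, parse_tree *tree)
{
    if (query[0] == '\0' || strchr(query, ',') != NULL) {
        return 1;
    }
    tree->text = query;
    return 0;
}

/*
 * Run a boolean query. Returns -1 as soon as a parse error is
 * reported, without looking at the tree. Otherwise the tree is
 * processed (here, just measured as a placeholder).
 */
int run_boolean_query(const char *query)
{
    parse_tree tree;

    if (parse_query(query, &tree) != 0) {
        return -1;  /* abandon: the tree contents are undefined */
    }
    return (int)strlen(tree.text);
}
```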
FILES
* query.bool.y
* query.bool.c (generated from query.bool.y)
*************************************************************
TITLE
Perfect hashing on small vocab
APPLICATION
mg-1
TYPE
bug
REPORT
[email protected] - July 1994
FIX
[email protected] - July 1994
CLAIM
Mg could not handle small collections in the case
where there was only a small number of unique words.
The perfect hash function would report an error.
PROBLEM
Rounding of the arithmetic during the calculation of the
parameters of the perfect hash function was resulting in a
combination of values such that the probability of a hash
function being found was very small. This led to the limit
on the generation loop being exceeded, and eventual
failure.
SOLUTION
By using ceiling rather than floor when converting from a
floating point value to an integer parameter, the arithmetic
is now correct for all lexicon sizes, and the probability of
each iteration successfully generating a hash function is
sufficiently great that with _very_ high probability the
execution loop counter will not be exceeded unless there
genuinely is no hash function (for example, if the lexicon
contains two identical words there cannot be a hash
function).
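The difference between the two roundings is largest for tiny lexicons.
The sketch below is an illustration only: the report does not say which
parameter was affected, so the "ratio times number of words" formula and
the ratio 1.23 used in the example are assumptions, and the helper names
are hypothetical.

```c
/* Round x down to an integer; valid for non-negative x. */
int floor_to_int(double x)
{
    return (int)x;  /* the cast truncates toward zero */
}

/* Round x up to the next integer without needing libm. */
int ceil_to_int(double x)
{
    int v = (int)x;
    return ((double)v < x) ? v + 1 : v;
}

/*
 * Hypothetical parameter: a table size of ratio * num_words,
 * rounded up so that small lexicons do not lose the extra slack.
 */
int table_size(int num_words, double ratio)
{
    return ceil_to_int(ratio * (double)num_words);
}
```

For 2 words and a ratio of 1.23 the product is 2.46: rounding down
gives 2, which leaves no slack at all, while rounding up gives 3.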
FILES
* perf_hash.c
*************************************************************