source: trunk/indexers/mg/SampleData/MODIFICATIONS@ 3745

Last change on this file since 3745 was 3745, checked in by mdewsnip, 21 years ago

Addition of MG package for search and retrieval

1TITLE
2 Parsing of Long Words
3APPLICATION
4 mg-1, mg-2
5TYPE
6 bug
7REPORT
8 [email protected] - May 11th 1994
9FIX
10 [email protected] - August 9th 1994
11CLAIM
12 Mg didn't handle long words properly; it crashed.
13PROBLEM
14 The invf passes call PARSE_LONG_WORD [words.h], which uses a limit of
15 MAXLONGWORD when iterating through the string and storing it into
16 a word. MAXLONGWORD = 8192.
17 However, mg strings generally store the length in the first
18 byte, limiting them to 255 characters. The word which was passed
19 to PARSE_LONG_WORD was an allocated string of MAXSTEMLEN = 255,
20 which is as large as we should get anyway. Thus when given
21 a word longer than 255 chars, PARSE_LONG_WORD would allow it
22 (less than 8192) and would try storing beyond the array limit.
23SOLUTION
24 The author can't remember why PARSE_LONG_WORD was used and what
25 the significance of MAXLONGWORD = 8192 is.
26 So PARSE_LONG_WORD has been changed to PARSE_STEM_WORD which
27 uses MAXSTEMLEN as its limit.
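The effect of the smaller limit can be illustrated with a minimal sketch. The function name parse_stem_word_sketch and its signature are assumptions made for this example; it is not the PARSE_STEM_WORD macro from words.h. Only the MAXSTEMLEN value of 255 and the length-in-first-byte string layout are taken from the entry above.

```c
#include <ctype.h>
#include <stddef.h>

#define MAXSTEMLEN 255 /* limit named in the entry above */

/*
 * Hypothetical sketch: copy the alphanumeric run that starts at s (and
 * ends before end) into word, a length-prefixed buffer that must hold
 * MAXSTEMLEN + 1 bytes. At most MAXSTEMLEN characters are stored, so
 * the write can never go past the end of word.
 * Returns a pointer to the first character that was not consumed.
 */
const unsigned char *parse_stem_word_sketch(unsigned char *word,
                                            const unsigned char *s,
                                            const unsigned char *end)
{
    size_t len = 0;

    while (s < end && isalnum(*s) && len < MAXSTEMLEN) {
        word[++len] = *s++;
    }
    word[0] = (unsigned char)len; /* length lives in the first byte */
    return s;
}
```

With a 400-character run as input, the sketch stores 255 characters and leaves the rest unconsumed, instead of writing past the buffer as the 8192 limit allowed.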
28FILES
29 * words.h
30 * invf.pass1.c
31 * invf.pass2.c
32 * ivf.pass1.c
33 * ivf.pass2.c
34 * query.ranked.c
35*************************************************************
36TITLE
37 Use of Lovins stemmer
38APPLICATION
39 mg-1
40TYPE
41 improve
42REPORT
43 local - 1994
44FIX
45 [email protected] - 1994
46CLAIM
47 Stemming was done naively.
48PROBLEM
49 Only a few types of words and their endings
50 were considered.
51SOLUTION
52 Replacement with a more elaborate "known" stemmer by Lovins.
53 The algorithm is described in:
54 J.B. Lovins, "Development of a Stemming Algorithm",
55 Mechanical Translation and Computational Linguistics, Vol 11,1968.
56FILES
57 * stem.c
58 * stem.h
59*************************************************************
60TITLE
61 Different term parsing
62APPLICATION
63 mg-1
64TYPE
65 bug
66REPORT
67 [email protected] - 23 Aug 1994
68FIX
69 [email protected] - 23 Aug 1994
70CLAIM
71 Boolean queries did not extract words/terms using the
72 same method as is done at inverted-file creation and
73 as is used for rank query parsing.
74PROBLEM
75 The hand-written lex. analyser, query_lex, which is called by
76 the boolean query parser was not calling a common
77 word-extraction routine as used by the rest of mg.
78 This would be ok if the code did the same things - but they didn't.
79 Query_lex, for instance, did NOT place any limit on the
80 number of digits in a term.
81 Of even more concern, it would allow arbitrary sized words
82 although it used Pascal style strings which store the length
83 in the first byte and can therefore only be 255 characters in length.
84SOLUTION
85 Query_lex in "query.bool.y", was modified to call the routine
86 PARSE_STEM_WORD which is also used by text-inversion routines and
87 ranking query routines.
88 Now all terms are extracted by the same routine.
89 To do this, the end of the line buffer had to be noted as
90 PARSE_STEM_WORD requires a pointer to the end - which is the
91 safe thing to do (don't want to run over the end).
92 This meant I had to find the length of the query line buffer.
93 This was allocated in the file "read_line.c" by the routine,
94 "readline". Its size was the literal number 1024.
95 This was changed to a constant and placed in "read_line.h".
96 The definition for PARSE_STEM_WORD can be found in "words.h".
97FILES
98 * query.bool.y
99 * query.bool.c (by bison)
100 * read_line.c
101 * read_line.h
102*************************************************************
103TITLE
104 Highlighting of query terms
105APPLICATION
106 mg-1
107TYPE
108 extend
109REPORT
110 [email protected] - Aug 94
111FIX
112 [email protected] - Sep 94
113CLAIM
114 Difficult to feel happy that the query-result returned is
115 satisfying the query - need to look hard to find the queried words.
116 Need to show words in results using some highlighting method.
117PROBLEM
118 No highlighting of query terms in results.
119SOLUTION
120 Mgquery was previously outputting the decompressed text to a pager
121 such as "less(1)" or "more(1)".
122 (Except when redirected or piped elsewhere :)
123 So what was needed was some sort of highlight pager that, instead of
124 just displaying the text, would also use some means of highlighting the
125 stemmed query words.
126 Two common forms of highlighting were chosen: underline and bolding.
127 These are supported by "less(1)" and possibly by "more(1)" by
128 using the backspace character.
129 A highlight pager will also need to know which words need to be
130 highlighted. Therefore, the code was modified to build up a
131 string of the stemmed query words for passing to the highlight pager.
132 Design Options:
133 ---------------
134 * Could do text filtering in mgquery before passing out to pager.
135 Instead I pipe to a separate process, the "hilite_words" pager,
136 which filters and pipes into less/more.
137 * Could do different highlighting or a combination.
138 * Could use a different structure for storing the query words other
139 than the hash-table I used.
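Backspace highlighting can be sketched as follows. This is not the code in mg_hilite_words.c; the function names are hypothetical. It relies on the overstrike convention that less(1) understands: a character, a backspace, and the same character again is shown as bold, and an underscore, a backspace, and a character is shown as underlined.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch: write word into out as overstrike bold, so each
 * character c becomes the three bytes c, backspace, c.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or 0 if out is too small.
 */
size_t overstrike_bold(const char *word, char *out, size_t outsize)
{
    size_t n = strlen(word);
    size_t i, j = 0;

    if (outsize < 3 * n + 1)
        return 0;
    for (i = 0; i < n; i++) {
        out[j++] = word[i];
        out[j++] = '\b';
        out[j++] = word[i];
    }
    out[j] = '\0';
    return j;
}

/* Same idea for underlining: each character c becomes '_', backspace, c. */
size_t overstrike_underline(const char *word, char *out, size_t outsize)
{
    size_t n = strlen(word);
    size_t i, j = 0;

    if (outsize < 3 * n + 1)
        return 0;
    for (i = 0; i < n; i++) {
        out[j++] = '_';
        out[j++] = '\b';
        out[j++] = word[i];
    }
    out[j] = '\0';
    return j;
}
```

A filter built on these helpers would apply them only to words whose stems appear in the list of query words, and pass all other text through unchanged.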
140FILES
141 * Makefile - to include hilite_words target
142 * mg_hilite_words.c
143 * mgquery.c
144 * mgquery.1
145 * query.bool.y
146 * query.ranked.c
147 * environment.c
148 * environment.h
149 * backend.h
150*************************************************************
151TITLE
152 Mg_compression_dict did premature free
153APPLICATION
154 mg-1
155TYPE
156 bug
157REPORT
158 [email protected] - 23 Sep 94
159FIX
160 [email protected] - 23 Sep 94
161CLAIM
162 mg_compression_dict dumped core in
163 file: mg_compression_dict.c
164 function: Write_data
165 line: int codelen = hd->clens[i];
166PROBLEM
167 Huffman data, hd, was freed *before* it was accessed again.
168SOLUTION
169 The freeing of hd has been moved to after all accesses
170 (just before returning).
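The ordering rule behind this fix can be sketched as below. The struct and function names are hypothetical stand-ins, not the real types from mg_compression_dict.c; only the clens field name comes from the crash report above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the Huffman data; only clens is named above. */
typedef struct {
    int num_codes;
    char *clens; /* code length of each symbol */
} huff_sketch;

/* Allocate a table in which every symbol has the same code length. */
huff_sketch *huff_sketch_new(int num_codes, char codelen)
{
    huff_sketch *hd = malloc(sizeof *hd);

    if (hd == NULL)
        return NULL;
    hd->clens = malloc((size_t)num_codes);
    if (hd->clens == NULL) {
        free(hd);
        return NULL;
    }
    memset(hd->clens, codelen, (size_t)num_codes);
    hd->num_codes = num_codes;
    return hd;
}

/*
 * Read every code length first, then free. Freeing hd before the loop
 * would make hd->clens[i] a use-after-free, which is the kind of bug
 * described in the entry above.
 */
long total_code_bits_then_free(huff_sketch *hd)
{
    long total = 0;
    int i;

    for (i = 0; i < hd->num_codes; i++) {
        int codelen = hd->clens[i];
        total += codelen;
    }
    free(hd->clens); /* all accesses are done, so it is now safe to free */
    free(hd);
    return total;
}
```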
171FILES
172 * mg_compression_dict.c
173*************************************************************
174TITLE
175 Boolean tree optimising rewrite
176APPLICATION
177 mg-1
178TYPE
179 bug
180REPORT
181 [email protected] - 23 Sep 94
182FIX
183 [email protected] - Oct 94
184CLAIM
185 "I am still getting core dump in "and" queries in mgquery,
186 where the first word does not exist, but the second one does."
187PROBLEM
188 Having freed a particular node, it tried to refree it and
189 access one of its fields.
190
191 I.e. code-fragment...
192
193 FreeNode(curr); /* where curr = CHILD(base) for 1st term in list */
194 FreeNodes(next);
195 FreeNodes(CHILD(base));
196 /* but CHILD(base) has already been freed above */
197 /* if the node was the first one in the list */
198
199SOLUTION
200 A number of things in the code seemed a bit dubious to me.
201 So I have rewritten the boolean optimising stage and abstracted out
202 the various stages - each file starts with "bool".
203 Boolean query optimising seems to be a tricky problem.
204 It is not clear that putting an expression into a certain form will
205 actually simplify it and whether simplification means faster querying.
206 I have converted a given boolean expression into DNF
207 (Disjunctive Normal Form). "And not" nodes, which are readily apparent
208 in DNF, are converted to "diff" nodes. I have only applied the idempotency
209 laws involving TRUE and FALSE, and not the ones requiring matching of
210 expressions - it is a potentially more complicated problem.
211 The optimiser has been tested by playing with "bool_tester", and if you are
212 having a crash or problem in a boolean query it would be worth testing the
213 query on "bool_tester". The token "*" stands for TRUE (or all documents)
214 and the token "_" stands for FALSE (or no documents). This should show the
215 expression before and after optimisation in an ASCII tree bracketing format.
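The TRUE/FALSE laws mentioned above can be sketched on a small binary tree. This is an illustration only, not the code in bool_optimiser.c: the node layout and all names are assumptions, and the DNF conversion and the "diff" rewrite are left out. The rules applied are X & * = X, X & _ = _, X | * = * and X | _ = X.

```c
#include <stdlib.h>

/* Hypothetical node kinds: a term, "*" (all docs), "_" (no docs), and, or. */
typedef enum { N_TERM, N_ALL, N_NONE, N_AND, N_OR } kind_sketch;

typedef struct node_sketch {
    kind_sketch kind;
    int term; /* term number, used by N_TERM only */
    struct node_sketch *left, *right;
} node_sketch;

node_sketch *mk_sketch(kind_sketch kind, int term,
                       node_sketch *left, node_sketch *right)
{
    node_sketch *n = malloc(sizeof *n);

    if (n == NULL)
        abort(); /* keep the sketch short: no recovery on out-of-memory */
    n->kind = kind;
    n->term = term;
    n->left = left;
    n->right = right;
    return n;
}

void free_tree_sketch(node_sketch *n)
{
    if (n == NULL)
        return;
    free_tree_sketch(n->left);
    free_tree_sketch(n->right);
    free(n);
}

static int is_const_sketch(const node_sketch *n)
{
    return n->kind == N_ALL || n->kind == N_NONE;
}

/* Simplify bottom-up; subtrees that are dropped are freed exactly once. */
node_sketch *simplify_sketch(node_sketch *n)
{
    node_sketch *keep, *drop;
    int keep_other;

    if (n == NULL || (n->kind != N_AND && n->kind != N_OR))
        return n;
    n->left = simplify_sketch(n->left);
    n->right = simplify_sketch(n->right);

    /* "and" and "or" commute, so move a constant child to the right. */
    if (is_const_sketch(n->left)) {
        node_sketch *tmp = n->left;
        n->left = n->right;
        n->right = tmp;
    }
    if (!is_const_sketch(n->right))
        return n;

    /* X & * and X | _ keep X; X & _ and X | * keep the constant. */
    keep_other = (n->kind == N_AND) == (n->right->kind == N_ALL);
    keep = keep_other ? n->left : n->right;
    drop = keep_other ? n->right : n->left;
    free_tree_sketch(drop);
    free(n);
    return keep;
}
```

For example, the tree for (t1 & *) | _ reduces to the single term t1, and t1 & _ reduces to "_".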
216FILES
217 * bool_tree.c
218 * bool_parser.y
219 * bool_optimiser.c
220 * bool_query.c
221 * bool_tester.c
222 * term_lists.c
223*************************************************************
224TITLE
225 Mgtic pixel placement
226APPLICATION
227 mg-1
228TYPE
229 bug
230REPORT
231 Bruce McKenzie - [email protected] (21st Oct 1994)
232FIX
233 [email protected]
234CLAIM
235 mgtic crashed on certain files.
236PROBLEM
237 Placing pixels outside of bitmap.
238SOLUTION
239 Changed the putpixel routine to truncate at borders of the image.
240FILES
241 * mgtic.c
242*************************************************************
243TITLE
244 Improved boolean tree optimising
245APPLICATION
246 mg-1
247TYPE
248 improve
249REPORT
250 [email protected] - 12/Dec/94
251FIX
252 [email protected] - 21/Dec/94, 14/Mar/95
253CLAIM
254 Optimising by conversion to DNF is not necessarily such
255 a good idea - can actually slow things down.
256PROBLEM
257 The distributive law used in converting to DNF
258 duplicates expressions.
259SOLUTION
260 Introduce a query environment variable, optimise_type = 0 | 1 | 2.
261 Type 0 does nothing to the parse tree.
262 Type 2 does the DNF conversion.
263 Type 1 is the new default and does the following...
264 Do simple tree rearrangement like flattening.
265 Optimise for CNF queries.
266FILES
267 * bool_query.c, .h
268 * bool_optimiser.c
269 * environment.c
270 * invf_get.c
271 * bool_tree.c, .h
272 * bool_tester.c
273 * lists.h
274*************************************************************
275TITLE
276 Mgstat with non-existent files
277APPLICATION
278 mg-1
279TYPE
280 bug
281REPORT
282 [email protected] - 16 May 1994
283FIX
284 [email protected] - 10 Aug 1994
285CLAIM
286 NaNs and Infinities would be printed out by mgstat
287 if unable to open .text or .text.dict file.
288PROBLEM
289 The NaNs etc. were output in the column stating
290 the percentage size of the file compared with the
291 number of input bytes of the source text data.
292 If it couldn't read the .text file with its
293 header describing the number of source text bytes, then
294 in working out the percentage it would divide by zero.
295 Also due to some bad control flow, it wouldn't attempt to
296 open the .text file if it failed when opening
297 the .text.dict file.
298SOLUTION
299 Only print out the percentage if we can read the header
300 from the .text file.
301 Read in text header irrespective of text dictionary file.
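A minimal sketch of the guarded percentage output follows. The function name, its parameters and the output format are assumptions made for illustration; they are not taken from mgstat.c.

```c
#include <stdio.h>

/*
 * Hypothetical sketch: format one size line. The percentage column is
 * only produced when the .text header was read and reports a non-zero
 * number of source bytes, so no division by zero can occur.
 * Returns the value of snprintf.
 */
int format_size_line(char *out, size_t outsize, unsigned long file_bytes,
                     int have_text_header, unsigned long source_bytes)
{
    if (!have_text_header || source_bytes == 0)
        return snprintf(out, outsize, "%lu bytes", file_bytes);

    return snprintf(out, outsize, "%lu bytes %.1f%%", file_bytes,
                    100.0 * (double)file_bytes / (double)source_bytes);
}
```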
302FILES
303 * mgstat.c
304*************************************************************
305TITLE
306 nonexistent HOME bug
307APPLICATION
308 mg-1, mg-2
309TYPE
310 bug
311REPORT
312 [email protected] - 2/May/95
313FIX
314 [email protected] - 2/May/95
315CLAIM
316"The big problem was that mgquery crashes when the HOME environment
317 variable is not set, which is the case when it is run by the www server."
318 [...] "I expect it happens when looking for $HOME/.mgrc."
319PROBLEM
320 The result of getenv("HOME") was used directly in
321 a sprintf call. If the environment variable HOME
322 was not set then a null pointer would be used.
323 In some C libraries sprintf will convert the null
324 pointer into the string "(null)"; in others it will core dump.
325 (For example, Solaris seems to core dump, SunOS 4 seems OK.)
326SOLUTION
327 The result from getenv("HOME") is tested before
328 being used.
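The test can be sketched like this. The helper name build_rc_path is hypothetical and the code is not copied from commands.c; the .mgrc file name comes from the report above. The sketch also uses snprintf rather than sprintf so that an overlong HOME cannot overflow the buffer, which is an addition of this example.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical sketch: build "<home>/.mgrc" in out.
 * Returns 1 on success, 0 if home is NULL (HOME not set) or the
 * result would not fit in out.
 */
int build_rc_path(const char *home, char *out, size_t outsize)
{
    int n;

    if (home == NULL)
        return 0; /* never pass a null pointer to a %s conversion */
    n = snprintf(out, outsize, "%s/.mgrc", home);
    return n >= 0 && (size_t)n < outsize;
}
```

A caller would use it as build_rc_path(getenv("HOME"), buf, sizeof buf) and simply skip reading the rc file when it returns 0.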
329FILES
330 * commands.c
331*************************************************************
332TITLE
333 mgquery collection name preference
334APPLICATION
335 mg-1, mg-2
336TYPE
337 improve
338REPORT
339 [email protected] - 2/May/95
340FIX
341 [email protected] - 4/May/95
342CLAIM
343 Surely something must override mgquery's preference for ./bib.
344 If MGDATA is set correctly, I think it should prefer that collection,
345 and -d should definitely override it.
346 I could always say -d . if I really wanted ./bib.
347PROBLEM
348Currently the priority is:
349 1. Check if ./name is a directory,
350 If so then use it as the collection directory.
351 2. Check if ./name.text is a file,
352 If so then use ./ as the collection directory.
353 3. Check if mgdir/name is a directory,
354 If so then use mgdir/name as the collection directory.
355 4. Otherwise,
356 Use mgdir/name as the database file prefix.
357 This would be the case if one used "-f alice/alice".
358 However, one would then not specify a final name argument
359 and we'd never get here. Go figure ???
360SOLUTION
361Moved step 3 to the top instead.
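The new search order can be sketched as follows, assuming a POSIX system with stat(2). This is not the real search_for_collection() from mgquery.c: the function name, signature, return convention and the 1024-byte path buffers are all assumptions for illustration.

```c
#include <stdio.h>
#include <sys/stat.h>

static int is_dir_sketch(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

static int is_file_sketch(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISREG(st.st_mode);
}

/* Copy s to out; return rule on success or 0 if out is too small. */
static int emit_sketch(char *out, size_t outsize, const char *s, int rule)
{
    int n = snprintf(out, outsize, "%s", s);
    return (n >= 0 && (size_t)n < outsize) ? rule : 0;
}

/*
 * Hypothetical sketch of the reordered search. Writes the chosen
 * collection directory (rules 1 to 3) or file prefix (rule 4) to out
 * and returns the number of the rule that matched, or 0 on overflow.
 */
int search_for_collection_sketch(const char *mgdir, const char *name,
                                 char *out, size_t outsize)
{
    char in_mgdir[1024], local[1024], local_text[1024];
    int a = snprintf(in_mgdir, sizeof in_mgdir, "%s/%s", mgdir, name);
    int b = snprintf(local, sizeof local, "./%s", name);
    int c = snprintf(local_text, sizeof local_text, "./%s.text", name);

    if (a < 0 || b < 0 || c < 0 ||
        a >= (int)sizeof in_mgdir || b >= (int)sizeof local ||
        c >= (int)sizeof local_text)
        return 0;

    if (is_dir_sketch(in_mgdir))     /* formerly step 3, now tried first */
        return emit_sketch(out, outsize, in_mgdir, 1);
    if (is_dir_sketch(local))        /* ./name is a directory */
        return emit_sketch(out, outsize, local, 2);
    if (is_file_sketch(local_text))  /* ./name.text is a file */
        return emit_sketch(out, outsize, "./", 3);
    return emit_sketch(out, outsize, in_mgdir, 4); /* file prefix */
}
```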
362FILES
363 * mgquery.c [search_for_collection()]
364*************************************************************
365TITLE
366 Printout of query terms
367APPLICATION
368 mg-1, mg-2
369TYPE
370 extend
371REPORT
372 [email protected] - April 95
373FIX
374 [email protected] - April 95
375CLAIM
376 No easy way to find out the parsed and stemmed words
377 used in the query. Would like to know these words
378 so I can call a separate highlighting program to
379 highlight these words.
380PROBLEM
381 No facility available.
382SOLUTION
383 A ".queryterms" mgquery command was added which lists
384 out the parsed/stemmed queryterms of the last query.
385FILES
386 * commands.c (added CmdQueryTerms)
387*************************************************************
388TITLE
389 mg_getrc
390APPLICATION
391 mg-1, mg-2
392TYPE
393 extend
394REPORT
395 [email protected] - 2/May/95
396FIX
397 -
398CLAIM
399 Repeated code had to be written for differently named
400 gets, even though the same type of parsing was required.
401 E.g. one might want to use a standard method for inserting
402 ^Bs between paragraphs for different books. One doesn't
403 want to write duplicate code for each different named book,
404 rather note that each book should be filtered "book" style.
405PROBLEM
406 There was no way of abstracting out types of filters from
407 the name of an instance of a collection.
408SOLUTION
409 Allow information to be given with <name, type, files>.
410 This extra info can be provided in a mg_getrc file.
411 See man page for mg_get for details.
412FILES
413 * mg_get.sh
414*************************************************************
415TITLE
416 Boolean optimiser #1 with `!'
417APPLICATION
418 mg-1, mg-2
419TYPE
420 bug
421REPORT
422 [email protected] - 20/7/95
423FIX
424 [email protected] - 27/7/95
425CLAIM
426 Complained about not-nodes.
427 e.g. complained about "croquet & !hedgehog"
428PROBLEM
429 Boolean optimiser type#1 didn't convert
430 "and not"s into diff nodes.
431SOLUTION
432 Added code to convert '&!' to '-'.
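The '&!' to '-' conversion can be sketched on a small tree. The node layout and names are hypothetical and are not the types used in bool_optimiser.c. For brevity the sketch only handles a not-node that is the right child of an and-node.

```c
#include <stdlib.h>

typedef enum { B_TERM, B_AND, B_NOT, B_DIFF } bkind_sketch;

typedef struct bnode_sketch {
    bkind_sketch kind;
    const char *word; /* B_TERM only */
    struct bnode_sketch *left, *right; /* B_NOT uses left only */
} bnode_sketch;

bnode_sketch *bnode_new_sketch(bkind_sketch kind, const char *word,
                               bnode_sketch *left, bnode_sketch *right)
{
    bnode_sketch *n = malloc(sizeof *n);

    if (n == NULL)
        abort(); /* keep the sketch short: no recovery on out-of-memory */
    n->kind = kind;
    n->word = word;
    n->left = left;
    n->right = right;
    return n;
}

void bnode_free_sketch(bnode_sketch *n)
{
    if (n == NULL)
        return;
    bnode_free_sketch(n->left);
    bnode_free_sketch(n->right);
    free(n);
}

/* Rewrite AND(x, NOT(y)) as DIFF(x, y), in place, throughout the tree. */
void and_not_to_diff_sketch(bnode_sketch *n)
{
    if (n == NULL)
        return;
    and_not_to_diff_sketch(n->left);
    and_not_to_diff_sketch(n->right);

    if (n->kind == B_AND && n->right != NULL && n->right->kind == B_NOT) {
        bnode_sketch *not_node = n->right;

        n->kind = B_DIFF;
        n->right = not_node->left; /* y takes the place of NOT(y) */
        free(not_node);            /* free only the not-node itself */
    }
}
```

Applied to the tree for "croquet & !hedgehog", this yields a single diff node with the terms croquet and hedgehog as its children.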
433FILES
434 * mg/bool_optimiser.c [mg-1]
435 * query/bool_optimiser.c [mg-2]
436*************************************************************
437TITLE
438 Consistent use of stderr
439APPLICATION
440 mg-1
441TYPE
442 improve
443REPORT
444 [email protected] - 16 May 1994
445FIX
446 [email protected] - 11 August 1994
447CLAIM
448 Inconsistent use of stdout/stderr in usage messages.
449PROBLEM
450 Sometimes used "printf" and sometimes used "fprintf(stderr"
451 in usage messages.
452SOLUTION
453 All should now use "fprintf(stderr" in usage messages.
454FILES
455 * mg_compression_dict.c
456 * mg_compression_dict.1
457 * mg_fast_comp_dict.c
458 * mg_fast_comp_dict.1
459 * mg_invf_dict.c
460 * mg_invf_dict.1
461 * mg_invf_dump.c
462 * mg_invf_dump.1
463 * mg_invf_rebuild.c
464 * mg_invf_rebuild.1
465 * mg_perf_hash_build.c
466 * mg_perf_hash_build.1
467 * mg_text_estimate.c
468 * mg_text_estimate.1
469 * mg_weights_build.c
470 * mg_weights_build.1
471*************************************************************
472TITLE
473 xmg bug
474APPLICATION
475 mg-1
476TYPE
477 bug
478REPORT
479 [email protected] - 22 April 1994
480FIX
481 [email protected] - 22 April 1994
482CLAIM
483 "Serious problem in xmg, which I fear occurs whenever a query
484 doesn't return anything."
485PROBLEM
486 ??
487SOLUTION
488 [xmg.sh 201] set rank 0
489FILES
490 * xmg.sh
491*************************************************************
492TITLE
493 Unnecessary loading of text
494APPLICATION
495 mg-1
496TYPE
497 bug
498REPORT
499 [email protected] - ?? August 1994
500FIX
501 [email protected] - 12 August 1994
502CLAIM
503 Mg was loading and uncompressing text when the
504 query did not require the text.
505PROBLEM
506 There was no test for the query mode
507 before loading and uncompressing the text.
508SOLUTION
509 Only load/uncompress text if query mode
510 is for text, headers or silent (for timing).
511FILES
512 * mgquery.c
513*************************************************************
514TITLE
515 Man page errors
516APPLICATION
517 mg-1
518TYPE
519 bug
520REPORT
521 [email protected] - 16 May 1994
522FIX
523 [email protected] - 16 May 1994
524CLAIM
525 Man page errors.
526PROBLEM
527 See below.
528SOLUTION
529 "The mg_make_fast_dict.1 file has been renamed mg_fast_comp_dict.1,
530 and all mg_make_fast_dict strings changed to mg_fast_comp_dict in all
531 man pages.
532 A large number of errors of spelling, typography, spacing, fonts,
533 grammar, omitted words, slang, punctuation, missing man page
534 cross-references, and man-page style have been corrected."
535FILES
536 * mg_compression_dict.1
537 * mg_fast_comp_dict.1
538 * mg_get.1
539 * mg_invf_dict.1
540 * mg_invf_dump.1
541 * mg_invf_rebuild.1
542 * mg_passes.1
543 * mg_perf_hash_build.1
544 * mg_text_estimate.1
545 * mg_weights_build.1
546 * mgbilevel.1
547 * mgbuild.1
548 * mgdictlist.1
549 * mgfelics.1
550 * mgquery.1
551 * mgstat.1
552 * mgtic.1
553 * mgticbuild.1
554 * mgticdump.1
555 * mgticprune.1
556 * mgticstat.1
557 * xmg.1
558*************************************************************
559TITLE
560 Man page overview
561APPLICATION
562 mg-1
563TYPE
564 extend
565REPORT
566 [email protected] -
567FIX
568 [email protected] - 17 August 1994
569CLAIM
570 "Write new mg.1 file to give a brief overview of mg, with samples
571 of how to use it. Otherwise, users are likely to be completely
572 overwhelmed by the number of programs (about 20) which might need to
573 be used, when in reality, only 2 or 3 are likely to be run by end
574 users."
575SOLUTION
576 It was thought that mg.1, written by Nelson Beebe, was very useful
577 but a bit too comprehensive for an introduction.
578 Therefore, two man files, mgintro.1 and mgintro++.1 were written
579 with the basic stuff in mgintro.1 and slightly more advanced stuff
580 in mgintro++.1 .
581FILES
582 * mg.1
583 * mgintro.1
584 * mgintro++.1
585*************************************************************
586TITLE
587 Parse errors not bus errors
588APPLICATION
589 mg-1
590TYPE
591 bug
592REPORT
593 [email protected] - 2 Jun 94
594FIX
595 [email protected] - 19 Aug 94
596CLAIM
597 "These two queries
598 (which I typed in before I knew what I was doing!!)
599 > The Queen of Hearts, she made some tarts
600 > "The Queen of Hearts" and "she made some tarts"
601 produced the following result:
602 mgquery : parse error
603 Bus error
604 "
605PROBLEM
606 What is expected to happen under boolean querying:
607 Query1:
608 > The Queen of Hearts, she made some tarts
609 will produce a parse error due to the comma which
610 is not a valid TERM.
611 Query2:
612 > "The Queen of Hearts" and "she made some tarts"
613 will store a post-processing string
614 of ''The Queen of Hearts" and "she made some tarts'' and
615 will have a main boolean query of the empty string.
616 This is because the postprocessing string takes in
617 everything between the first quote and the last one.
618 An empty string is illegal for the boolean grammar and
619 hence a parse error.
620 The problem stems from the fact that the processing of
621 the parse tree is carried out, even though we have a
622 parse error. In the case of using an empty string to build
623 a parse tree, it is likely to leave the parse tree undefined.
624SOLUTION
625 As soon as we find out that there is a parse-error,
626 we abandon any processing of the parse tree.
627FILES
628 * query.bool.y
629 * query.bool.c (generated from query.bool.y)
630*************************************************************
631TITLE
632 Perfect hashing on small vocab
633APPLICATION
634 mg-1
635TYPE
636 bug
637REPORT
638 [email protected] - July 1994
639FIX
640 [email protected] - July 1994
641CLAIM
642 Mg could not handle small collections in the case
643 where there was only a small number of unique words.
644 The perfect hash function would report an error.
645PROBLEM
646 Rounding of the arithmetic during the calculation of the
647 parameters of the perfect hash function was resulting in a
648 combination of values such that the probability of a hash
649 function being found was very small. This led to the limit
650 on the generation loop being exceeded, and eventual
651 failure.
652SOLUTION
653 By using ceiling rather than floor when converting from a
654 floating point value to an integer parameter, the arithmetic
655 is now correct for all lexicon sizes, and the probability of
656 each iteration successfully generating a hash function is
657 sufficiently great that with _very_ high probability the
658 execution loop counter will not be exceeded unless there
659 genuinely is no hash function (for example, if the lexicon
660 contains two words the same there cannot be a hash
661 function).
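The difference between the two roundings can be sketched as below. The entry does not say which parameter was affected or how it is computed, so the ratio and the parameter here are purely hypothetical; the sketch only shows how floor and ceiling diverge for a very small lexicon.

```c
/*
 * Hypothetical sketch: convert a non-negative floating point value to
 * an integer parameter, rounding down (the old behaviour) or rounding
 * up (the new behaviour). Written without libm so it links anywhere.
 */
long floor_to_long_sketch(double x)
{
    return (long)x; /* truncation equals floor for non-negative x */
}

long ceil_to_long_sketch(double x)
{
    long n = (long)x;
    return ((double)n < x) ? n + 1 : n;
}

/* Assumed parameter: some ratio times the number of words in the lexicon. */
long param_sketch(double ratio, long num_words, int round_up)
{
    double x = ratio * (double)num_words;
    return round_up ? ceil_to_long_sketch(x) : floor_to_long_sketch(x);
}
```

With an assumed ratio of 1.23 and a lexicon of two words, rounding down gives 2 and rounding up gives 3. For large lexicons the one-unit difference is negligible, but for tiny ones it is a large relative change.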
662FILES
663 * perf_hash.c
664*************************************************************