source: indexers/trunk/mg/MODIFICATIONS@ 18132

Last change on this file since 18132 was 16583, checked in by davidb, 16 years ago

Undoing change commited in r16582

1TITLE
2 Parsing of Long Words
3APPLICATION
4 mg-1, mg-2
5TYPE
6 bug
7REPORT
8 [email protected] - May 11th 1994
9FIX
10 [email protected] - August 9th 1994
11CLAIM
12 Mg didn't handle long words properly; it crashed.
13PROBLEM
14 The invf passes call PARSE_LONG_WORD [words.h], which uses a limit of
15 MAXLONGWORD when iterating through the string and storing into
16 a word. MAXLONGWORD = 8192.
17 However, mg strings generally store the length in the first
18 byte limiting them to 255 characters. The word which was passed
19 to PARSE_LONG_WORD was an allocated string of MAXSTEMLEN = 255,
20 which is as large as we should get anyway. Thus when accessing
21 a larger word than 255 chars, PARSE_LONG_WORD would allow it
22 (less than 8192) and would try storing beyond the array limit.
23SOLUTION
24 The author can't remember why PARSE_LONG_WORD was used and what
25 the significance of MAXLONGWORD = 8192 is.
26 So PARSE_LONG_WORD has been changed to PARSE_STEM_WORD which
27 uses MAXSTEMLEN as its limit.
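A minimal sketch of bounded word extraction into a length-prefixed buffer, as described above. The function name parse_stem_word and its signature are hypothetical; the real PARSE_STEM_WORD is a macro in words.h and may well differ in detail.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAXSTEMLEN 255  /* longest word a one-byte length prefix can describe */

/* Hypothetical helper: copy one alphanumeric word from [s, end) into word[],
 * a buffer of MAXSTEMLEN + 1 bytes whose first byte holds the length.
 * Copying stops after MAXSTEMLEN characters, so word[] is never overrun.
 * Returns a pointer to the first character that was not consumed. */
const unsigned char *
parse_stem_word(unsigned char *word, const unsigned char *s,
                const unsigned char *end)
{
  size_t len = 0;
  assert(word != NULL && s != NULL && s <= end);
  while (s < end && len < MAXSTEMLEN && isalnum(*s))
    word[++len] = *s++;
  word[0] = (unsigned char) len;
  return s;
}
```

With a limit of MAXLONGWORD (8192) the same loop would write past a 256-byte buffer, which is the crash reported above.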
28FILES
29 * words.h
30 * invf.pass1.c
31 * invf.pass2.c
32 * ivf.pass1.c
33 * ivf.pass2.c
34 * query.ranked.c
35*************************************************************
36TITLE
37 Use of Lovins stemmer
38APPLICATION
39 mg-1
40TYPE
41 improve
42REPORT
43 local - 1994
44FIX
45 [email protected] - 1994
46CLAIM
47 Stemming was done naively.
48PROBLEM
49 Only a few types of words and their endings
50 were considered.
51SOLUTION
52 Replacement with a more elaborate "known" stemmer by Lovins.
53 The algorithm is described in:
54 J.B. Lovins, "Development of a Stemming Algorithm",
55 Mechanical Translation and Computational Linguistics, Vol 11,1968.
56FILES
57 * stem.c
58 * stem.h
59*************************************************************
60TITLE
61 Different term parsing
62APPLICATION
63 mg-1
64TYPE
65 bug
66REPORT
67 [email protected] - 23 Aug 1994
68FIX
69 [email protected] - 23 Aug 1994
70CLAIM
71 Boolean queries did not extract words/terms using the
72 same method as is done at inverted-file creation and
73 as is used for rank query parsing.
74PROBLEM
75 The hand-written lex. analyser, query_lex, which is called by
76 the boolean query parser was not calling a common
77 word-extraction routine as used by the rest of mg.
78 This would be ok if the code did the same things - but they didn't.
79 Query_lex, for instance, did NOT place any limit on the
80 number of digits in a term.
81 Of even more concern, it would allow arbitrary sized words
82 although it used Pascal style strings which store the length
83 in the first byte and can therefore only be 255 characters in length.
84SOLUTION
85 Query_lex in "query.bool.y", was modified to call the routine
86 PARSE_STEM_WORD which is also used by text-inversion routines and
87 ranking query routines.
88 Now all terms are extracted by the same routine.
89 To do this, the end of the line buffer had to be noted as
90 PARSE_STEM_WORD requires a pointer to the end - which is the
91 safe thing to do (don't want to run over the end).
92 This meant I had to find the length of the query line buffer.
93 This was allocated in the file "read_line.c" by the routine,
94 "readline". Its size was the literal number 1024.
95 This was changed to a constant and placed in "read_line.h".
96 The definition for PARSE_STEM_WORD can be found in "words.h".
97FILES
98 * query.bool.y
99 * query.bool.c (by bison)
100 * read_line.c
101 * read_line.h
102*************************************************************
103TITLE
104 Highlighting of query terms
105APPLICATION
106 mg-1
107TYPE
108 extend
109REPORT
110 [email protected] - Aug 94
111FIX
112 [email protected] - Sep 94
113CLAIM
114 Difficult to feel happy that the query-result returned is
115 satisfying the query - need to look hard to find the queried words.
116 Need to show words in results using some highlighting method.
117PROBLEM
118 No highlighting of query terms in results.
119SOLUTION
120 Mgquery was previously outputting the decompressed text to a pager
121 such as "less(1)" or "more(1)".
122 (Except when redirected or piped elsewhere :)
123 So what was needed was some sort of highlight pager that, as well as
124 displaying the text, would also use some means of highlighting the
125 stemmed query words.
126 Two common forms of highlighting were chosen: underline and bolding.
127 These are supported by "less(1)" and possibly by "more(1)" by
128 using the backspace character.
129 A highlight pager will also need to know which words need to be
130 highlighted. Therefore, the code was modified to build up a
131 string of the stemmed query words for passing to the highlight pager.
132 Design Options:
133 ---------------
134 * Could do text filtering in mgquery before passing out to pager.
135 Instead I pipe to a separate process, the "hilite_words" pager,
136 which filters and pipes into less/more.
137 * Could do different highlighting or a combination.
138 * Could use a different structure for storing the query words other
139 than the hash-table I used.
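A small sketch of the backspace highlighting described above. less(1) displays a character overstruck with itself as bold and a character overstruck onto an underscore as underlined. The function name overstrike is hypothetical and is not taken from mg_hilite_words.c.

```c
#include <assert.h>
#include <stddef.h>

/* Write word[0..len) into out[] using backspace overstriking:
 *   bold:      c \b c
 *   underline: _ \b c
 * out[] must hold at least 3 * len + 1 bytes.
 * Returns the number of bytes written, excluding the terminating NUL. */
size_t overstrike(char *out, const char *word, size_t len, int underline)
{
  size_t n = 0;
  assert(out != NULL && word != NULL);
  for (size_t i = 0; i < len; i++) {
    out[n++] = underline ? '_' : word[i];
    out[n++] = '\b';
    out[n++] = word[i];
  }
  out[n] = '\0';
  return n;
}
```

A filter built this way would copy ordinary text through unchanged and call overstrike only for words whose stem is in the query-term table.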
140FILES
141 * Makefile - to include hilite_words target
142 * mg_hilite_words.c
143 * mgquery.c
144 * mgquery.1
145 * query.bool.y
146 * query.ranked.c
147 * environment.c
148 * environment.h
149 * backend.h
150*************************************************************
151TITLE
152 Mg_compression_dict did premature free
153APPLICATION
154 mg-1
155TYPE
156 bug
157REPORT
158 [email protected] - 23 Sep 94
159FIX
160 [email protected] - 23 Sep 94
161CLAIM
162 mg_compression_dict dumped core in
163 file: mg_compression_dict.c
164 function: Write_data
165 line: int codelen = hd->clens[i];
166PROBLEM
167 Huffman data, hd, was freed *before* it was accessed again.
168SOLUTION
169 The freeing of hd has been moved to after all accesses
170 (just before returning).
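The shape of the fix can be sketched as follows. The struct layout and the function name are hypothetical; only the field access hd->clens[i] comes from the report above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical cut-down version of the Huffman data record. */
typedef struct { int num_codes; int *clens; } huff_data;

/* Sum the code lengths, freeing hd only after the last access to it. */
long total_code_length(huff_data *hd)
{
  long total = 0;
  assert(hd != NULL && hd->clens != NULL);
  for (int i = 0; i < hd->num_codes; i++)
    total += hd->clens[i];   /* every read of hd happens before ... */
  free(hd->clens);           /* ... the frees, which now come last */
  free(hd);
  return total;
}
```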
171FILES
172 * mg_compression_dict.c
173*************************************************************
174TITLE
175 Boolean tree optimising rewrite
176APPLICATION
177 mg-1
178TYPE
179 bug
180REPORT
181 [email protected] - 23 Sep 94
182FIX
183 [email protected] - Oct 94
184CLAIM
185 "I am still getting core dump in "and" queries in mgquery,
186 where the first word does not exist, but the second one does."
187PROBLEM
188 Having freed a particular node, it tried to refree it and
189 access one of its fields.
190
191 I.e. code-fragment...
192
193 FreeNode(curr); /* where curr = CHILD(base) for 1st term in list */
194 FreeNodes(next);
195 FreeNodes(CHILD(base));
196 /* but CHILD(base) has already been freed above */
197 /* if the node was the first one in the list */
198
199SOLUTION
200 A number of things in the code seemed a bit dubious to me.
201 So I have rewritten the boolean optimising stage and abstracted out
202 the various stages - each file starts with "bool".
203 Boolean query optimising seems to be a tricky problem.
204 It is not clear that putting an expression into a certain form will
205 actually simplify it and whether simplification means faster querying.
206 I have converted a given boolean expression into DNF
207 (Disjunctive Normal Form). "And not" nodes, which are readily apparent
208 in DNF, are converted to "diff" nodes. I have only applied the idempotency
209 laws involving TRUE and FALSE, and not the ones requiring matching of
210 expressions - it is a potentially more complicated problem.
211 The optimiser has been tested by playing with "bool_tester", and if you are
212 having a crash or problem in a boolean query it would be worth testing the
213 query on the "bool_tester." The token "*" stands for TRUE (or all documents)
214 and the token "_" stands for FALSE (or no documents). This should show the
215 expression before and after optimisation in an ascii tree bracketing format.
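As an illustration of the laws involving TRUE and FALSE, here is a minimal sketch on a binary tree. The node type and the function name simplify are hypothetical and much simpler than the structures in bool_tree.c; nodes that drop out of the tree are left for the caller to free.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { N_TRUE, N_FALSE, N_TERM, N_AND, N_OR } kind;
typedef struct node { kind k; struct node *left, *right; } node;

/* Apply, bottom-up:
 *   x & TRUE = x     x & FALSE = FALSE
 *   x | FALSE = x    x | TRUE  = TRUE
 * Returns the root of the simplified tree. */
node *simplify(node *n)
{
  assert(n != NULL);
  if (n->k != N_AND && n->k != N_OR)
    return n;
  n->left = simplify(n->left);
  n->right = simplify(n->right);
  kind absorbing = (n->k == N_AND) ? N_FALSE : N_TRUE;
  kind neutral = (n->k == N_AND) ? N_TRUE : N_FALSE;
  if (n->left->k == absorbing) return n->left;
  if (n->right->k == absorbing) return n->right;
  if (n->left->k == neutral) return n->right;
  if (n->right->k == neutral) return n->left;
  return n;
}
```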
216FILES
217 * bool_tree.c
218 * bool_parser.y
219 * bool_optimiser.c
220 * bool_query.c
221 * bool_tester.c
222 * term_lists.c
223*************************************************************
224TITLE
225 Mgtic pixel placement
226APPLICATION
227 mg-1
228TYPE
229 bug
230REPORT
231 Bruce McKenzie - [email protected] (21st Oct 1994)
232FIX
233 [email protected]
234CLAIM
235 mgtic crashed on certain files.
236PROBLEM
237 Placing pixels outside of bitmap.
238SOLUTION
239 Changed the putpixel routine to truncate at borders of the image.
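A minimal sketch of a clipping putpixel, assuming a bitmap stored as one byte per pixel in row-major order. The bitmap type here is hypothetical; mgtic.c uses its own representation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bitmap: one byte per pixel, rows stored one after another. */
typedef struct { int width, height; unsigned char *pixels; } bitmap;

/* Set pixel (x, y) if it lies inside the image; otherwise do nothing.
 * Returns 1 if the pixel was written, 0 if it was clipped. */
int putpixel(bitmap *bm, int x, int y, unsigned char value)
{
  assert(bm != NULL && bm->pixels != NULL);
  if (x < 0 || y < 0 || x >= bm->width || y >= bm->height)
    return 0;                 /* outside the image: drop the write */
  bm->pixels[(size_t) y * bm->width + x] = value;
  return 1;
}
```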
240FILES
241 * mgtic.c
242*************************************************************
243TITLE
244 Improved boolean tree optimising
245APPLICATION
246 mg-1
247TYPE
248 improve
249REPORT
250 [email protected] - 12/Dec/94
251FIX
252 [email protected] - 21/Dec/94, 14/Mar/95
253CLAIM
254 Optimising by conversion to DNF is not necessarily such
255 a good idea - can actually slow things down.
256PROBLEM
257 The distributive law used in converting to DNF
258 duplicates expressions.
259SOLUTION
260 Introduce a query environment variable, optimise_type = 0 | 1 | 2.
261 Type 0 does nothing to the parse tree.
262 Type 2 does the DNF conversion.
263 Type 1 is the new default and does the following...
264 Do simple tree rearrangement like flattening.
265 Optimise for CNF queries.
266FILES
267 * bool_query.c, .h
268 * bool_optimiser.c
269 * environment.c
270 * invf_get.c
271 * bool_tree.c, .h
272 * bool_tester.c
273 * lists.h
274*************************************************************
275TITLE
276 Similarity variants
277APPLICATION
278 mg-2
279TYPE
280 extend
281REPORT
282 [email protected]/alistair - June 1994
283FIX
284 [email protected] - July 1994 .. Feb 1995
285CLAIM
286 Can only use one type of similarity measure - the
287 standard cosine measure.
288PROBLEM
289 See CITRI/TR-95-3 for more details.
290 The standard measure can be broken up into 7 components.
291 The 7 components are
292 Each one of these components has a number of alternatives.
293 The overall measure, S_qd, can also be altered.
294 Thus the particular similarity measure used can be specified
295 by an 8 dimensional vector.
296 What is desired is to be able to specify to mgquery an option
297 and an 8-digit string representing this vector (assuming that
298 any one alternative can have at most 9 (not using zero) variants).
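Parsing such a string might look like the sketch below. The name parse_variant and the constant SIM_DIMS are hypothetical; the only assumption taken from the text is eight digits, each in the range 1 to 9.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_DIMS 8   /* 7 components plus the overall measure S_qd */

/* Parse e.g. "22222222" into SIM_DIMS variant numbers, each 1..9.
 * Returns 0 on success, -1 if s is not exactly SIM_DIMS digits '1'..'9'. */
int parse_variant(const char *s, int variant[SIM_DIMS])
{
  assert(s != NULL && variant != NULL);
  for (int i = 0; i < SIM_DIMS; i++) {
    if (s[i] < '1' || s[i] > '9')
      return -1;             /* also catches a string that ends early */
    variant[i] = s[i] - '0';
  }
  return s[SIM_DIMS] == '\0' ? 0 : -1;
}
```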
299SOLUTION
300 The programs which had to be modified were:
301 (i) mgquery,
302 (ii) mg_weights_build.
303 The other mg programs in existence store the text, indexing info,
304 and the basic statistics such as N, n, ft, fdt.
305 Other programs which had to be created were:
306 (i) mg_fmd_build,
307 (ii) mg_wt_build.
308 Mg_fmd_build will create the file to store the f_md statistic,
309 where f_md is the largest (maximum) f_dt of any term in document, d.
310 Mg_wt_build will create the file to store the w_t primitive.
311 It only creates this for the w_t variants 6-9 which would require
312 extra passes of invf at query time if they were not stored here.
313 For details on similarity changes for mgquery and mg_weights_build,
314 please see the other modification entries.
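The f_md statistic can be illustrated with a short sketch. The posting record and the function name update_fmd are hypothetical; mg_fmd_build reads compressed invf entries rather than an array like this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decoded posting: term occurs f_dt times in document doc. */
typedef struct { unsigned long doc; unsigned long f_dt; } posting;

/* Fold a run of postings into f_md[], where f_md[d] is the largest f_dt
 * seen so far for document d. f_md[] has num_docs entries, initially 0. */
void update_fmd(unsigned long *f_md, size_t num_docs,
                const posting *p, size_t n)
{
  assert(f_md != NULL && p != NULL);
  for (size_t i = 0; i < n; i++) {
    assert(p[i].doc < num_docs);
    if (p[i].f_dt > f_md[p[i].doc])
      f_md[p[i].doc] = p[i].f_dt;
  }
}
```

Calling this once per invf entry leaves the maximum within-document frequency for every document after a single pass.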
315FILES
316 * mg_fmd_build.c
317 * mg_wt_build.c
318 * build_lib.c, build_lib.h
319*************************************************************
320TITLE
321 Similarity variants for mgquery
322APPLICATION
323 mg-2
324TYPE
325 extend
326REPORT
327 [email protected]/alistair - June 1994
328FIX
329 [email protected] - July 1994 .. Feb 1995
330CLAIM
331 "mgquery" needs to be altered to allow modification of
332 the similarity measure.
333PROBLEM
334 See CITRI/TR-95-3 for more details.
335SOLUTION
336
337 Most of the similarity measures, Sqd, are of the
338 form: Aqd
339 -----
340 Bqd
341 where Bqd is an expression involving Wd and possibly Wq,
342 where Aqd is a sum over the common document/query terms
343 of w_qt and w_dt.
344 Building of Aqd
345 ===============
346 The calculation of Aqd is done in the file build_Aqd.c .
347 The functions for doing this used to be in the file invf_get.c .
348 Build_Aqd.c contains 4 different functions for building Aqd, each
349 of them building a different data structure:
350 (i) Array, (ii) Splay Tree, (iii) Hash Table, (iv) List Table.
351 Each of these routines seems to have been constructed by modifying
352 duplicated code. This is often the easiest way to construct variants,
353 but it makes consistency quite difficult to maintain.
354 As the aim of the exercise was to try out different sim. measures for
355 retrieval effectiveness, I only modified the code that constructed
356 an array. This routine was called "CosineDecode"; I changed it to
357 "build_Aqd_Array." This change reflects the fact that we are only
358 calculating Aqd and this need not be used for the Cosine measure.
359 The other routines: "CosineDecodeSplay," "CosineDecodeHash," and
360 "CosineDecodeList" have been left unaltered - they need to be updated
361 in the future, which would best be done by abstracting out common code.
362 By the stage of building Aqd, the query terms have been looked up in
363 the inverted file dictionary and put into a list.
364 This list of common terms is traversed to lookup the corresponding
365 invf entries. Before the invf entry is processed, all query and term
366 relevant statistics and primitive quantities are calculated.
367 For example, fqt, ft, wt, rqt, wqt, Wq-partial-sum.
368 To save unnecessary calculations, there is a test for each value
369 to see whether it is needed e.g. "if (sim->variant_needs & NEEDS_wt) ... ".
370 Aside: Variant Needs
371 --------------------
372 The idea behind the "variant_needs" field is to be able to have all
373 the code in the one place for each possible variant and this code would
374 get the information at the correct time/place only when it is needed.
375 The overhead is a "bit-and" and "test" for each component.
376 The important concern though, is that the "variant_needs" must be
377 accurate i.e. it should be carefully maintained.
378 Each possible need is stored as a bit position in a constant/macro
379 of the form "NEEDS_component" e.g. NEEDS_wt, NEEDS_nt
380 More recently (Jan/95), I have found that the _need_ing of a component
381 may be relevant to a particular purpose, that is, it may be needed in
382 one situation and not another. This was the case for Wd.
383 For rdt#6, Wd was needed, however, it might not be needed for Bqd, the
384 denominator of Sqd. So I changed from NEEDS_Wd to NEEDS_Wd_for_Sqd and
385 NEEDS_Wd_for_rdt.
386 --------------------
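A sketch of the "variant_needs" idea. The NEEDS_ names come from the text above, but the bit positions, the cut-down record and the w_t formula (a plain ratio) are placeholders chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One bit per possible need; the positions here are hypothetical. */
#define NEEDS_wt          (1u << 0)
#define NEEDS_nt          (1u << 1)
#define NEEDS_Wd_for_Sqd  (1u << 2)
#define NEEDS_Wd_for_rdt  (1u << 3)

/* Hypothetical subset of the Similarity_variant record. */
typedef struct { unsigned variant_needs; double N, ft, wt; } sim_variant;

/* Compute w_t only when the chosen variant uses it.
 * The cost of the check is one bit-and and one test. */
void prepare_term(sim_variant *sim)
{
  assert(sim != NULL);
  if (sim->variant_needs & NEEDS_wt)
    sim->wt = sim->N / sim->ft;   /* placeholder formula, not a real variant */
}
```

Splitting NEEDS_Wd into two bits lets a caller test "is Wd needed for rdt" and "is Wd needed for Sqd" independently, at the cost of keeping both bits accurate.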
387 To handle the different calculations of the variants, I wrote macros
388 based around a switch statement. All these macros are stored in the file
389 "similarity.h" . The point of doing this is to centralise most of the changes
390 and cut down on the number of files which have to be altered if a new
391 way of calculating a primitive is to be added.
392 This is achieved by having a data record called, "Similarity_variant" whose
393 fields include all the statistics and the similarity primitives.
394 So the standard procedure is to see if something is needed and if so, then
395 extract it from an mg file or calculate it using a macro - most of the
396 input and output to the macro is done via the "Similarity_variant" structure.
397 As well as calculating Aqd, it is also a convenient place to calculate Wq.
398 Previously, for Cosine measure, Wq was not needed because each Aqd was directly
399 divided by Wq and thus would not change the ordering. However, there are
400 some Sqd's which divide by a sum involving Wq and thus need Wq.
401 Calculating Sqd and Ordering Documents
402 ======================================
403 The file query.ranked.c contains the code for calculating Sqd using the
404 approximate Wd in order to do the ranking of the documents.
405 The mg-1 file was cleaned up and modified slightly.
406 All the heap data structures and routines were taken out and placed in
407 their own file, heap_weights.c/.h .
408 The major components/steps in the ranking process were abstracted into
409 macros and functions:
410 calc_MaxParas
411 insert_heap_entry
412 insert_greater_heap_entry
413 approx_guided_insert
414 fill_initial_heap
415 add_heap_remainders
416 change_heap_to_exact
417 add_remainder_exact_weights
418 Make_Exact_Root
419 build_doc_list
420 Aside: Zeroing of an Aqd Element
421 -------------------------
422 One interesting change concerned the use of zeroing out an Aqd element
423 so as to mark it as being used in the heap.
424 In the heap, the approx. Sqd is stored based on Aqd/Wd-approx.
425 Later when Aqd is required, it can be extracted from the approx. Sqd
426 by multiplication by Wd-approx, i.e. Aqd = Sqd-approx * Wd-approx.
427 This, however, is not always possible for the various Sqd variants.
428 So, instead of zeroing out Aqd, I decided just to make it negative.
429 -------------------------
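The sign trick can be sketched as below, assuming that every accumulated Aqd is strictly positive, so that a negative value can only mean "already in the heap". The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Mark an accumulator as being in the heap by flipping its sign.
 * Unlike zeroing, this keeps the magnitude for later use. */
void mark_in_heap(float *aqd)
{
  assert(aqd != NULL && *aqd > 0);
  *aqd = -*aqd;
}

/* Has this accumulator been marked? */
int in_heap(float aqd)
{
  return aqd < 0;
}

/* Recover Aqd whether or not it has been marked. */
float aqd_value(float aqd)
{
  return aqd < 0 ? -aqd : aqd;
}
```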
430 In some Sqd cases, Wd and Wd-approx. are not required. In which case,
431 step 3 which calls on "change_heap_to_exact", "Heap_Build" and
432 "add_remainder_exact_weights" is not required.
433
434FILES
435 * build_Aqd.c
436 * query.ranked.c
437 * similarity.c/.h (in libmg)
438 * heap_weights.c/.h
439 * backend.c/.h
440*************************************************************
441TITLE
442 Similarity variants for mg_weights_build
443APPLICATION
444 mg-2
445TYPE
446 extend
447REPORT
448 [email protected]/alistair - June 1994
449FIX
450 [email protected] - July 1994 .. Feb 1995
451CLAIM
452 "mg_weights_build" needs to be altered to allow modification of
453 the similarity measure.
454PROBLEM
455 See CITRI/TR-95-3 for more details.
456SOLUTION
457 The weight files which are generated for a particular similarity
458 measure have their names extended by a suffix.
459 In the case of Wd#1 no weights are generated.
460 In the case of the standard cosine weights, Wd#2, a 3 letter suffix
461 is used to represent .
462 In the case of the other Wd variants, a one letter suffix is used
463 to represent which Wd variant it is.
464 In each case, the variant input (e.g. -q 22222222) should be the whole
465 similarity variant string and the relevant fields will be extracted out.
466 This is done for consistency in code and interface.
467 The code is fairly similar to the original.
468 A dependency check has been added so that the dates of files and the
469 type of needed files is verified before building.
470 The dependencies include, invf dictionary, invf index, invf, fmd and wt files.
471 The major change here, is the possible use of fmd and/or wt files.
472 Later when I was having to write mg_fmd_build and mg_wt_build,
473 I decided to abstract out some macros, namely:
474 Get_ft, Get_ft_Ft, loop_invf_entry,
475 which were put into "build_lib.c" .
476 "loop_invf_entry" takes a function/macro name as a parameter and applies
477 it to the sim record (with field fdt set), current doc number and modifies
478 the return value.
479FILES
480 * mg_weights_build.c
481 * build_lib.c
482*************************************************************
483TITLE
484 Mgstat with non-existent files
485APPLICATION
486 mg-1
487TYPE
488 bug
489REPORT
490 [email protected] - 16 May 1994
491FIX
492 [email protected] - 10 Aug 1994
493CLAIM
494 NaNs and Infinites would be printed out by mgstat
495 if unable to open .text or .text.dict file.
496PROBLEM
497 The NaNs etc. were output in the column stating
498 the percentage size of the file compared with the
499 number of input bytes of the source text data.
500 If it couldn't read the .text file with its
501 header describing the number of source text bytes, then
502 in working out the percentage it would divide by zero.
503 Also due to some bad control flow, it wouldn't attempt to
504 open the .text file if it failed when opening
505 the .text.dict file.
506SOLUTION
507 Only print out the percentage if we can read the header
508 from the .text file.
509 Read in text header irrespective of text dictionary file.
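A minimal sketch of the guarded percentage column. The function name, its parameters and the "-" placeholder are hypothetical; mgstat.c may print something different when the header is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format the "percentage of source text" column into buf.
 * If the .text header could not be read, or reports zero source bytes,
 * write "-" instead of dividing by zero.
 * Returns the value returned by snprintf. */
int format_percent(char *buf, size_t buflen, unsigned long file_bytes,
                   unsigned long source_bytes, int have_text_header)
{
  assert(buf != NULL && buflen > 0);
  if (!have_text_header || source_bytes == 0)
    return snprintf(buf, buflen, "%s", "-");
  return snprintf(buf, buflen, "%.2f%%", 100.0 * file_bytes / source_bytes);
}
```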
510FILES
511 * mgstat.c
512*************************************************************
513TITLE
514 Boolean tree optimisations
515APPLICATION
516 mg-2
517TYPE
518 extend
519REPORT
520 (i) [email protected] - 28/Sep/94
521 (ii) [email protected] - 12/Dec/94
522FIX
523 (i) [email protected] - 18/Oct/94
524 (ii) [email protected] - 21/Dec/94
525CLAIM
526 The initial prompt for investigating the optimisation of
527 boolean queries is noted in the mg-1 mod14.txt.
528 The code for optimising seemed to have a number of faults.
529PROBLEM
530 Boolean optimisation was unreliable.
531SOLUTION
532 Initially (in case (i) above, see mg-1/mod14.txt), I rewrote
533 all the boolean tree and optimising code. I converted the boolean
534 expression into DNF. I did this after reading some notes about
535 the steps involved in optimisation and they suggested standardising
536 in some normal form. I thought that DNF would be appropriate so that
537 all the terms are converted to be part of "and" expressions and be
538 evaluated quickly using skipping.
539 This, however, can suffer quite badly if the distributive law is
540 applied too often and the query expands in size. If there was
541 some sort of caching of invf entries, then it might not be so
542 bad; otherwise there is quite an overhead in reading the same
543 invf entry more than once.
544
545 As it happens, CNF queries are reasonably common, where the user
546 queries with a conjunction of disjunctions of synonyms:
547 e.g. (car | automobile | vehicle) & (fast | quick | speedy)
548 This sort of CNF query expands a hell of a lot !
549 So after speaking with Justin who wanted to benchmark Atlas with Mg on
550 these sort of queries, I looked up the MG book for other ideas.
551
552 The method that I implemented was the following:
553
554 -----------------------------------------------------------------
555 Steps of tree modifications:
556 Gets literals by pushing the nots down, detecting T/F at leaves
557 and collapsing the tree by detecting 'and' of 'and's and 'or' of
558 'or's.
559 Next it looks at the or nodes and if all the children are terms
560 then it marks the or-node as such.
561 Finally, the or-term-nodes are sorted by using the sum of their
562 ft's for comparison.
563 Steps at query evaluation:
564 If it comes across an 'and' of 'or-terms' then the evaluation is
565 done noting the distributive law.
566 I.e. a & (b | c | d) = (a & b) | (a & c) | (a & d)
567 Assuming 'a' is the c-set of documents.
568 All of 'a' is tested against 'b' and matching ones are marked.
569 Next, all the unmarked members of 'a' are tested against 'c'.
570 Likewise for 'd'.
571 Now all the marked members of our c-set are kept.
572 When we do the testing, we can use the skipping in the invf entries.
573 -----------------------------------------------------------------
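The evaluation steps above can be sketched as follows, assuming each set is simply a sorted array of document numbers. The doc_list type and the function name are hypothetical. Real invf entries are compressed, and the inner while loop is where skipping would be used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical representation: a sorted array of document numbers. */
typedef struct { const unsigned long *docs; size_t n; } doc_list;

/* Evaluate a & (t[0] | t[1] | ...). Members of the candidate set a are
 * tested against each term in turn; a member that is already marked is
 * not tested again. Marked members are copied to out[], which needs room
 * for a->n entries. marked[] is scratch space of a->n bytes.
 * Returns the number of documents kept. */
size_t and_of_or_terms(const doc_list *a, const doc_list *t, size_t nterms,
                       unsigned char *marked, unsigned long *out)
{
  size_t kept = 0;
  assert(a != NULL && t != NULL && marked != NULL && out != NULL);
  for (size_t i = 0; i < a->n; i++)
    marked[i] = 0;
  for (size_t k = 0; k < nterms; k++) {
    size_t j = 0;
    for (size_t i = 0; i < a->n; i++) {
      if (marked[i])
        continue;
      while (j < t[k].n && t[k].docs[j] < a->docs[i])
        j++;                           /* a real invf entry could skip here */
      if (j < t[k].n && t[k].docs[j] == a->docs[i])
        marked[i] = 1;
    }
  }
  for (size_t i = 0; i < a->n; i++)
    if (marked[i])
      out[kept++] = a->docs[i];
  return kept;
}
```

No intermediate result for (a & b), (a & c) or (a & d) is built, so the query does not expand the way it does under DNF conversion.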
574
575
576 After doing this, I added the choice of which type of optimisation
577 the user wanted by adding query-environment-variable, "optimise_type".
578 Type 0 = no parse tree modification.
579 Type 1 = Or-term recognition and CNF query evaluation optimisation.
580 Type 2 = Put into DNF form. [generally not recommended]
581FILES
582 * bool_tree.c
583 * bool_parser.y
584 * bool_optimiser.c
585 * bool_query.c
586 * bool_tester.c
587 * term_lists.c
588 * query_env.c
589 * invf_get.c [GetDocsOp]
590*************************************************************
591TITLE
592 nonexistent HOME bug
593APPLICATION
594 mg-1, mg-2
595TYPE
596 bug
597REPORT
598 [email protected] - 2/May/95
599FIX
600 [email protected] - 2/May/95
601CLAIM
602"The big problem was that mgquery crashes when the HOME environment
603 variable is not set, which is the case when it is run by the www server."
604 [...] "I expect it happens when looking for $HOME/.mgrc."
605PROBLEM
606 The result of getenv("HOME") was used directly in
607 a sprintf call. If the environment variable HOME
608 was not set then a null pointer would be passed.
609 In some C libraries sprintf will convert the null
610 pointer into the string "(null)"; in others it will core dump.
611 (For example, Solaris seems to core dump, sunos 4 seems ok).
612SOLUTION
613 The result from getenv("HOME") is tested before
614 being used.
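A minimal sketch of the check. The function name mgrc_path is hypothetical; the file name .mgrc comes from the report above. Taking the home directory as a parameter keeps the function easy to test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Build "<home>/.mgrc" in buf.
 * Returns 0 on success, -1 if home is NULL (HOME not set) or the path
 * does not fit in buflen bytes. */
int mgrc_path(char *buf, size_t buflen, const char *home)
{
  assert(buf != NULL);
  if (home == NULL)
    return -1;                /* never hand a null pointer to %s */
  int n = snprintf(buf, buflen, "%s/.mgrc", home);
  return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

A caller would pass getenv("HOME") from <stdlib.h> as the last argument and skip reading the rc file when -1 is returned.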
615FILES
616 * commands.c
617*************************************************************
618TITLE
619 mgquery collection name preference
620APPLICATION
621 mg-1, mg-2
622TYPE
623 improve
624REPORT
625 [email protected] - 2/May/95
626FIX
627 [email protected] - 4/May/95
628CLAIM
629 Surely something must override mgquery's preference for ./bib.
630 If MGDATA is set correctly, I think it should prefer that collection,
631 and -d should definitely override it.
632 I could always say -d . if I really wanted ./bib.
633PROBLEM
634Currently the priority is:
6351. Check if ./name is a directory,
636 If so then use it as the collection directory.
6372. Check if ./name.text is a file,
638 If so then use ./ as the collection directory.
6393. Check if mgdir/name is a directory,
640 If so then use mgdir/name as the collection directory.
6414. Otherwise,
642 Use mgdir/name as the database file prefix.
643 This would be the case if one used "-f alice/alice".
644 However, one would then not specify a final name argument
645 and we'd never get here. Go figure ???
646SOLUTION
647Moved step 3 to the top instead.
648FILES
649 * mgquery.c [search_for_collection()]
650*************************************************************
651TITLE
652 Printout of query terms
653APPLICATION
654 mg-1, mg-2
655TYPE
656 extend
657REPORT
658 [email protected] - April 95
659FIX
660 [email protected] - April 95
661CLAIM
662 No easy way to find out the parsed and stemmed words
663 used in the query. Would like to know these words
664 so I can call a separate highlighting program to
665 highlight these words.
666PROBLEM
667 No facility available.
668SOLUTION
669 A ".queryterms" mgquery command was added which lists
670 out the parsed/stemmed queryterms of the last query.
671FILES
672 * commands.c (added CmdQueryTerms)
673*************************************************************
674TITLE
675 mg_getrc
676APPLICATION
677 mg-1, mg-2
678TYPE
679 extend
680REPORT
681 [email protected] - 2/May/95
682FIX
683 -
684CLAIM
685 Repeated code had to be written for different named
686 gets but really the same type of parsing required.
687 E.g. one might want to use a standard method for inserting
688 ^Bs between paragraphs for different books. One doesn't
689 want to write duplicate code for each different named book,
690 rather note that each book should be filtered "book" style.
691PROBLEM
692 There was no way of abstracting out types of filters from
693 the name of an instance of a collection.
694SOLUTION
695 Allow information to be given with <name, type, files>.
696 This extra info can be provided in a mg_getrc file.
697 See man page for mg_get for details.
698FILES
699 * mg_get.sh
700*************************************************************
701TITLE
702 TREC DocNo file
703APPLICATION
704 mg-2
705TYPE
706 improve
707REPORT
708 [email protected] - 26/May/95
709FIX
710 [email protected] - 26/May/95
711CLAIM
712 MG has problems dealing with trec docnos for trec disk 3.
713PROBLEM
714 Trec DocNos file didn't have a wide enough field
715 to handle disk 3.
716SOLUTION
717 Allow different width fields for file.
718 It is still fixed width but a number in the header
719 says how wide the field is.
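One way to picture "fixed width, with the width in the header" is the sketch below. The layout (a single leading byte holding the width, followed by records of exactly that many bytes) is an assumption made for illustration and is not the actual file format used by mg.special.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: buf[0] is the field width w, followed by
 * consecutive records of exactly w bytes each.
 * Returns a pointer to record i and stores w in *width,
 * or returns NULL if the header is invalid or i is out of range. */
const unsigned char *docno_record(const unsigned char *buf, size_t buflen,
                                  size_t i, size_t *width)
{
  assert(buf != NULL && width != NULL);
  if (buflen < 1 || buf[0] == 0)
    return NULL;
  *width = buf[0];
  size_t avail = (buflen - 1) / *width;   /* number of whole records */
  return i < avail ? buf + 1 + i * *width : NULL;
}
```

Because every record still has the same width, record i can be found by arithmetic alone; only the width itself is no longer compiled in.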
720FILES
721 * passes/mg.special.c
722 * query/mgquery.c
723*************************************************************
724TITLE
725 Boolean optimiser #1 with `!'
726APPLICATION
727 mg-1, mg-2
728TYPE
729 bug
730REPORT
731 [email protected] - 20/7/95
732FIX
733 [email protected] - 27/7/95
734CLAIM
735 Complained about not-nodes.
736 e.g. complained about "croquet & !hedgehog"
737PROBLEM
738 Boolean optimiser type#1 didn't convert
739 "and not"s into diff nodes.
740SOLUTION
741 Added code to convert '&!' to '-'.
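The conversion can be sketched on a binary tree as follows. The node type and the function name and_not_to_diff are hypothetical; the trees in bool_optimiser.c are n-ary and more involved.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { B_TERM, B_NOT, B_AND, B_DIFF } bkind;
typedef struct bnode { bkind k; struct bnode *left, *right; } bnode;

/* Rewrite a binary "x & !y" node in place as "x - y".
 * The NOT node is bypassed, not freed; that is left to the caller.
 * Returns 1 if the node was rewritten, 0 otherwise. */
int and_not_to_diff(bnode *n)
{
  assert(n != NULL);
  if (n->k == B_AND && n->right != NULL && n->right->k == B_NOT) {
    n->k = B_DIFF;
    n->right = n->right->left;   /* the operand of the NOT */
    return 1;
  }
  return 0;
}
```

For "croquet & !hedgehog" this yields "croquet - hedgehog", which can be evaluated without ever materialising the set of all documents that lack "hedgehog".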
742FILES
743 * mg/bool_optimiser.c [mg-1]
744 * query/bool_optimiser.c [mg-2]
745*************************************************************
746TITLE
747 Autoconfiguring mg-1
748APPLICATION
749 mg-1
750TYPE
751 improve
752REPORT
753 many people - 94/95
754FIX
755 [email protected] - Aug/95
756CLAIM
757 Portability is limited by setting up c-macros just for particular
758 machines and operating systems.
759 People had to make changes for HP, Next, Linux, Dec Alpha, ...
760PROBLEM
761 Porting was targeted only at the machines that the author had
762 access to.
763SOLUTION
764 Use GNU's autoconf program.
765 This allows checking of the systems features/characteristics.
766 It also allows some checking for specific machines/OS - although
767 I have not utilised this option.
768 I used GNU's tar-1.11.8 as an example to base my changes on.
769 I also used autoscan to generate the initial "configure.in".
770 The "Makefile.in"s were done very similarly to GNU tar's.
771 The "config.h" and "sysfuncs.h" files were scrapped and
772 rewritten. The new "config.h" file is generated by the configure
773 script - it contains all the #define's for the system features.
774 The "sysfuncs.h" file wraps up a number of system headers.
775 For example, some systems use , while some use ;
776 which one is included is decided in "sysfuncs.h".
777 I have also used GNU tar's use of ansi2knr in its Makefiles.
778 This should hopefully allow the package to work on a system with
779 only a K&R C compiler.
780 However, there are probably problems with what I have done.
781 I am concerned about <stdarg.h> for example.
782 I also noticed that "ansi2knr" requires function definitions as
783 the GNU coding style recommends, i.e. with the function name the first
784 string on the line. This prompted me to run all the package's code
785 through GNU's indent.
786 Setting up the configure changes is difficult. It really seems
787 necessary to try the package out on many target machines so one
788 can know what is necessary.
789 A simple check target for the main Makefile has been written.
790 It is used to see if the installation is working - it does
791 not test much of the functionality of mg.
792 It does cmp's on data files and diff's on query/result files.
793FILES
794 Most of the files in the distribution.
795*************************************************************
796TITLE
797 Consistent use of stderr
798APPLICATION
799 mg-1
800TYPE
801 improve
802REPORT
803 [email protected] - 16 May 1994
804FIX
805 [email protected] - 11 August 1994
806CLAIM
807 Inconsistent use of stdout/stderr in usage messages.
808PROBLEM
809 Sometimes used "printf" and sometimes used "fprintf(stderr"
810 in usage messages.
811SOLUTION
812 All should now use "fprintf(stderr" in usage messages.
813FILES
814 * mg_compression_dict.c
815 * mg_compression_dict.1
816 * mg_fast_comp_dict.c
817 * mg_fast_comp_dict.1
818 * mg_invf_dict.c
819 * mg_invf_dict.1
820 * mg_invf_dump.c
821 * mg_invf_dump.1
822 * mg_invf_rebuild.c
823 * mg_invf_rebuild.1
824 * mg_perf_hash_build.c
825 * mg_perf_hash_build.1
826 * mg_text_estimate.c
827 * mg_text_estimate.1
828 * mg_weights_build.c
829 * mg_weights_build.1
830*************************************************************
831TITLE
832 xmg bug
833APPLICATION
834 mg-1
835TYPE
836 bug
837REPORT
838 [email protected] - 22 April 1994
839FIX
840 [email protected] - 22 April 1994
841CLAIM
842 "Serious problem in xmg, which I fear occurs whenever a query
843 doesn't return anything."
844PROBLEM
845 ??
846SOLUTION
847 [xmg.sh 201] set rank 0
848FILES
849 * xmg.sh
850*************************************************************
851TITLE
852 Unnecessary loading of text
853APPLICATION
854 mg-1
855TYPE
856 bug
857REPORT
858 [email protected] - ?? August 1994
859FIX
860 [email protected] - 12 August 1994
861CLAIM
862 Mg was loading and uncompressing text when the
863 query did not require the text.
864PROBLEM
865 There was no test for the query mode
866 before loading and uncompressing the text.
867SOLUTION
868 Only load/uncompress the text if the query mode
869 is text, headers or silent (for timing).
870FILES
871 * mgquery.c
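
A minimal sketch of the test described in the SOLUTION, assuming the query mode is held in an enumeration. The mode names are hypothetical and do not come from mgquery.c. The last two modes stand for any output that does not need the document text.

```c
#include <stdbool.h>

/* Hypothetical query modes; the names are illustrative only. */
enum query_mode {
    MODE_TEXT,
    MODE_HEADERS,
    MODE_SILENT,
    MODE_DOCNUMS,
    MODE_COUNT
};

/* Returns true only for the modes that need the document text to be
   loaded and uncompressed. */
bool needs_text(enum query_mode mode)
{
    switch (mode) {
    case MODE_TEXT:
    case MODE_HEADERS:
    case MODE_SILENT:    /* silent mode still decodes the text, for timing */
        return true;
    default:
        return false;
    }
}
```

The caller would guard the load/uncompress step with this test and skip it entirely for every other mode.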
872*************************************************************
873TITLE
874 Man page errors
875APPLICATION
876 mg-1
877TYPE
878 bug
879REPORT
880 [email protected] - 16 May 1994
881FIX
882 [email protected] - 16 May 1994
883CLAIM
884 Man page errors.
885PROBLEM
886 See below.
887SOLUTION
888 "The mg_make_fast_dict.1 file has been renamed mg_fast_comp_dict.1,
889 and all mg_make_fast_dict strings changed to mg_fast_comp_dict in all
890 man pages.
891 A large number of errors of spelling, typography, spacing, fonts,
892 grammar, omitted words, slang, punctuation, missing man page
893 cross-references, and man-page style have been corrected."
894FILES
895 * mg_compression_dict.1
896 * mg_fast_comp_dict.1
897 * mg_get.1
898 * mg_invf_dict.1
899 * mg_invf_dump.1
900 * mg_invf_rebuild.1
901 * mg_passes.1
902 * mg_perf_hash_build.1
903 * mg_text_estimate.1
904 * mg_weights_build.1
905 * mgbilevel.1
906 * mgbuild.1
907 * mgdictlist.1
908 * mgfelics.1
909 * mgquery.1
910 * mgstat.1
911 * mgtic.1
912 * mgticbuild.1
913 * mgticdump.1
914 * mgticprune.1
915 * mgticstat.1
916 * xmg.1
917*************************************************************
918TITLE
919 Man page overview
920APPLICATION
921 mg-1
922TYPE
923 extend
924REPORT
925 [email protected] -
926FIX
927 [email protected] - 17 August 1994
928CLAIM
929 "Write new mg.1 file to give a brief overview of mg, with samples
930 of how to use it. Otherwise, users are likely to be completely
931 overwhelmed by the number of programs (about 20) which might need to
932 be used, when in reality, only 2 or 3 are likely to be run by end
933 users."
934SOLUTION
935 mg.1, written by Nelson Beebe, was thought to be very useful
936 but a bit too comprehensive for an introduction.
937 Therefore two man pages, mgintro.1 and mgintro++.1, were written,
938 with the basic material in mgintro.1 and slightly more advanced
939 material in mgintro++.1.
940FILES
941 * mg.1
942 * mgintro.1
943 * mgintro++.1
944*************************************************************
945TITLE
946 Parse errors not bus errors
947APPLICATION
948 mg-1
949TYPE
950 bug
951REPORT
952 [email protected] - 2 Jun 94
953FIX
954 [email protected] - 19 Aug 94
955CLAIM
956 "These two queries
957 (which I typed in before I knew what I was doing!!)
958 > The Queen of Hearts, she made some tarts
959 > "The Queen of Hearts" and "she made some tarts"
960 produced the following result:
961 mgquery : parse error
962 Bus error
963 "
964PROBLEM
965 What is expected to happen under boolean querying:
966 Query1:
967 > The Queen of Hearts, she made some tarts
968 will produce a parse error due to the comma which
969 is not a valid TERM.
970 Query2:
971 > "The Queen of Hearts" and "she made some tarts"
972 will store a post-processing string
973 of ''The Queen of Hearts" and "she made some tarts'' and
974 will have a main boolean query of the empty string.
975 This is because the postprocessing string takes in
976 everything between the first quote and the last one.
977 An empty string is illegal for the boolean grammar and
978 hence a parse error.
979 The problem stems from the fact that the parse tree is
980 still processed even though a parse error has occurred.
981 When the parse tree is built from an empty string, it is
982 likely to be left undefined.
983SOLUTION
984 As soon as a parse error is detected,
985 any processing of the parse tree is abandoned.
986FILES
987 * query.bool.y
988 * query.bool.c (generated from query.bool.y)
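
The fix can be sketched as follows, assuming a parser that, like yacc's yyparse(), returns 0 on success and non-zero on a parse error. The parser below is a stand-in written for this example. It is not the grammar in query.bool.y, and all names are hypothetical.

```c
#include <string.h>

/* Hypothetical parse tree: one term is enough for this sketch. */
struct parse_tree {
    const char *term;
};

/* Stand-in for the generated parser. Returns 0 on success and non-zero
   on a parse error. On error the tree is left untouched, i.e. undefined. */
static int parse_query(const char *query, struct parse_tree *tree)
{
    if (query[0] == '\0')
        return 1;                    /* an empty query is illegal */
    if (strchr(query, ',') != NULL)
        return 1;                    /* a comma is not a valid term */
    tree->term = query;
    return 0;
}

/* Returns the length of the parsed term, or -1 if the query was
   abandoned because of a parse error. */
int run_query(const char *query)
{
    struct parse_tree tree;          /* undefined until parsing succeeds */

    if (parse_query(query, &tree) != 0)
        return -1;                   /* parse error: never touch the tree */
    return (int)strlen(tree.term);
}
```

The early return is the point of the sketch: the tree is read only after the parser has reported success.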
989*************************************************************
990TITLE
991 Perfect hashing on small vocab
992APPLICATION
993 mg-1
994TYPE
995 bug
996REPORT
997 [email protected] - July 1994
998FIX
999 [email protected] - July 1994
1000CLAIM
1001 Mg could not handle small collections in the case
1002 where there was only a small number of unique words.
1003 The perfect hash function would report an error.
1004PROBLEM
1005 Rounding of the arithmetic during the calculation of the
1006 parameters of the perfect hash function was resulting in a
1007 combination of values such that the probability of a hash
1008 function being found was very small. This led to the limit
1009 on the generation loop being exceeded, and eventual
1010 failure.
1011SOLUTION
1012 By using ceiling rather than floor when converting from a
1013 floating point value to an integer parameter, the arithmetic
1014 is now correct for all lexicon sizes, and the probability of
1015 each iteration successfully generating a hash function is
1016 sufficiently great that with _very_ high probability the
1017 execution loop counter will not be exceeded unless there
1018 genuinely is no hash function (for example, if the lexicon
1019 contains two identical words there cannot be a hash
1020 function).
1021FILES
1022 * perf_hash.c
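
The effect of the rounding direction can be illustrated with a small sketch. The entry does not say which parameter in perf_hash.c was affected, so the table_size function and the ratios used with it are assumptions made only for this example.

```c
/* Round a non-negative value down to an integer (the old behaviour). */
int round_down(double x)
{
    return (int)x;                   /* truncation equals floor for x >= 0 */
}

/* Round a non-negative value up to an integer (the new behaviour). */
int round_up(double x)
{
    int n = (int)x;
    return ((double)n < x) ? n + 1 : n;
}

/* Hypothetical integer parameter derived from the number of unique
   words and a floating point ratio. */
int table_size(double ratio, int num_words)
{
    return round_up(ratio * num_words);
}
```

For a small lexicon the difference is proportionally large: with a ratio of 1.5 and 3 words, rounding down gives 4 and rounding up gives 5. For a large lexicon the same one-unit difference is negligible.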
1023*************************************************************