Context Navigation

← Previous Revision
Latest Revision
Next Revision →
Blame
Revision Log

MODIFICATIONS@ 18132

Last change on this file since 18132 was 16583, checked in by davidb, 16 years ago
Undoing change commited in r16582
Property svn:executable set to ``* Property svn:keywords set to `Author Date Id Revision`
File size: 34.3 KB

Line
1	TITLE
2	Parsing of Long Words
3	APPLICATION
4	mg-1, mg-2
5	TYPE
6	bug
7	REPORT
8	[email protected] - May 11th 1994
9	FIX
10	[email protected] - August 9th 1994
11	CLAIM
12	Mg didn't handle long words properly; it crashed.
13	PROBLEM
14	Invf passes calls PARSE_LONG_WORD [words.h] which uses a limit of
15	MAXLONGWORD on iterating thru the string and storing into
16	a word. MAXLONGWORD = 8192.
17	However, mg strings generally store the length in the first
18	byte limiting them to 255 characters. The word which was passed
19	to PARSE_LONG_WORD was an allocated string of MAXSTEMLEN = 255,
20	which is as large as we should get anyway. Thus when accessing
21	a larger word than 255 chars, PARSE_LONG_WORD would allow it
22	(less than 8192) and would try storing beyond the array limit.
23	SOLUTION
24	The author can't remember why PARSE_LONG_WORD was used and what
25	the significance of MAXLONGWORD = 8192 is.
26	So PARSE_LONG_WORD has been changed to PARSE_STEM_WORD which
27	uses MAXSTEMLEN as its limit.
28	FILES
29	* words.h
30	* invf.pass1.c
31	* invf.pass2.c
32	* ivf.pass1.c
33	* ivf.pass2.c
34	* query.ranked.c
35	*************************************************************
36	TITLE
37	Use of Lovins stemmer
38	APPLICATION
39	mg-1
40	TYPE
41	improve
42	REPORT
43	local - 1994
44	FIX
45	[email protected] - 1994
46	CLAIM
47	Stemming was done naively.
48	PROBLEM
49	Only a few types of words and their endings
50	were considered.
51	SOLUTION
52	Replacement with a more elaborate "known" stemmer by Lovins.
53	The algorithm is described in:
54	J.B. Lovins, "Development of a Stemming Algorithm",
55	Mechanical Translation and Computational Linguistics, Vol 11,1968.
56	FILES
57	* stem.c
58	* stem.h
59	*************************************************************
60	TITLE
61	Different term parsing
62	APPLICATION
63	mg-1
64	TYPE
65	bug
66	REPORT
67	[email protected] - 23 Aug 1994
68	FIX
69	[email protected] - 23 Aug 1994
70	CLAIM
71	Boolean queries did not extract words/terms using the
72	same method as is done at inverted-file creation and
73	as is used for rank query parsing.
74	PROBLEM
75	The hand-written lex. analyser, query_lex, which is called by
76	the boolean query parser was not calling a common
77	word-extraction routine as used by the rest of mg.
78	This would be ok if the code did the same things - but they didn't.
79	Query_lex, for instance, did NOT place any limit on the
80	number of digits in a term.
81	Of even more concern, it would allow arbitrary sized words
82	although it used Pascal style strings which store the length
83	in the first byte and can therefore only be 255 characters in length.
84	SOLUTION
85	Query_lex in "query.bool.y", was modified to call the routine
86	PARSE_STEM_WORD which is also used by text-inversion routines and
87	ranking query routines.
88	Now all terms are extracted by the same routine.
89	To do this, the end of the line buffer had to be noted as
90	PARSE_STEM_WORD requires a pointer to the end - which is the
91	safe thing to do (don't want to run over the end).
92	This meant I had to find the length of the query line buffer.
93	This was allocated in the file "read_line.c" by the routine,
94	"readline". Its size was the literal number 1024.
95	This was changed to a constant and placed in "read_line.h".
96	The definition for PARSE_STEM_WORD can be found in "words.h".
97	FILES
98	* query.bool.y
99	* query.bool.c (by bison)
100	* read_line.c
101	* read_line.h
102	*************************************************************
103	TITLE
104	Highlighting of query terms
105	APPLICATION
106	mg-1
107	TYPE
108	extend
109	REPORT
110	[email protected] - Aug 94
111	FIX
112	[email protected] - Sep 94
113	CLAIM
114	Difficult to feel happy that the query-result returned is
115	satisfying the query - need to look hard to find the queried words.
116	Need to show words in results using some highlighting method.
117	PROBLEM
118	No highlighting of query terms in results.
119	SOLUTION
120	Mgquery was previously outputting the decompressed text to a pager
121	such as "less(1)" or "more(1)".
122	(Except when redirected or piped elsewhere :)
123	So what was needed was some sort of highlight pager that instead of
124	displaying the text would also use some means for highlighting the
125	stemmed query words.
126	Two common forms of highlighting were chosen: underline and bolding.
127	These are supported by "less(1)" and possibly by "more(1)" by
128	using the backspace character.
129	A highlight pager will also need to know which words need to be
130	highlighted. Therefore, the code was modified to build up a
131	string of the stemmed query words for passing to the highlight pager.
132	Design Options:
133	---------------
134	* Could do text filtering in mgquery before passing out to pager.
135	Instead I pipe to a separate process, the "hilite_words" pager,
136	which filters and pipes into less/more.
137	* Could do different highlighting or a combination.
138	* Could use a different structure for storing the query words other
139	than the hash-table I used.
140	FILES
141	* Makefile - to include hilite_words target
142	* mg_hilite_words.c
143	* mgquery.c
144	* mgquery.1
145	* query.bool.y
146	* query.ranked.c
147	* environment.c
148	* environment.h
149	* backend.h
150	*************************************************************
151	TITLE
152	Mg_compression_dict did premature free
153	APPLICATION
154	mg-1
155	TYPE
156	bug
157	REPORT
158	[email protected] - 23 Sep 94
159	FIX
160	[email protected] - 23 Sep 94
161	CLAIM
162	mg_compression_dict dumped core in
163	file: mg_compression_dict.c
164	function: Write_data
165	line: int codelen = hd->clens[i];
166	PROBLEM
167	Huffman data, hd, was freed before it was accessed again.
168	SOLUTION
169	The freeing of hd has been moved to after all accesses
170	(just before returning).
171	FILES
172	* mg_compression_dict.c
173	*************************************************************
174	TITLE
175	Boolean tree optimising rewrite
176	APPLICATION
177	mg-1
178	TYPE
179	bug
180	REPORT
181	[email protected] - 23 Sep 94
182	FIX
183	[email protected] - Oct 94
184	CLAIM
185	"I am still getting core dump in "and" queries in mgquery,
186	where the first word does not exist, but the second one does."
187	PROBLEM
188	Having freed a particular node, it tried to refree it and
189	access one of its fields.
190
191	I.e. code-fragment...
192
193	FreeNode(curr); /* where curr = CHILD(base) for 1st term in list */
194	FreeNodes(next);
195	FreeNodes(CHILD(base));
196	/* but CHILD(base) has already been freed above */
197	/* if the node was the first one in the list */
198
199	SOLUTION
200	A number of things in the code seemed a bit dubious to me.
201	So I have rewritten the boolean optimising stage and abstracted out
202	the various stages - each file starts with "bool".
203	Boolean query optimising seems to be a tricky problem.
204	It is not clear that putting an expression into a certain form will
205	actually simplify it and whether simplification means faster querying.
206	I have converted a given boolean expression into DNF
207	(Disjunctive Normal Form). "And not" nodes, which are readily apparent
208	in DNF, are converted to "diff" nodes. I have only applied the idempotency
209	laws involving TRUE and FALSE, and not the ones requiring matching of
210	expressions - it is a potentially more complicated problem.
211	The optimiser has been tested by playing with "bool_tester", and if you are
212	having a crash or problem in a boolean query it would be worth testing the
213	query on the "bool_tester." The token "*" stands for TRUE (or all documents)
214	and the token "_" stands for FALSE (or no documents). This should show the
215	expression before and after optimisation in an ascii tree bracketting format.
216	FILES
217	* bool_tree.c
218	* bool_parser.y
219	* bool_optimiser.c
220	* bool_query.c
221	* bool_tester.c
222	* term_lists.c
223	*************************************************************
224	TITLE
225	Mgtic pixel placement
226	APPLICATION
227	mg-1
228	TYPE
229	bug
230	REPORT
231	Bruce McKenzie - [email protected] (21st Oct 1994)
232	FIX
233	[email protected]
234	CLAIM
235	mgtic crashed on certain files.
236	PROBLEM
237	Placing pixels outside of bitmap.
238	SOLUTION
239	Changed the putpixel routine to truncate at borders of the image.
240	FILES
241	* mgtic.c
242	*************************************************************
243	TITLE
244	Improved boolean tree optimising
245	APPLICATION
246	mg-1
247	TYPE
248	improve
249	REPORT
250	[email protected] - 12/Dec/94
251	FIX
252	[email protected] - 21/Dec/94, 14/Mar/95
253	CLAIM
254	Optimising by conversion to DNF is not necessarily such
255	a good idea - can actually slow things down.
256	PROBLEM
257	The distributive law used in converting to DNF
258	duplicates expressions.
259	SOLUTION
260	Introduce a query environment variable, optimise_type = 0 \| 1 \| 2.
261	Type 0 does nothing to the parse tree.
262	Type 2 does the DNF conversion.
263	Type 1 is the new default and does the following...
264	Do simple tree rearrangement like flattening.
265	Optimise for CNF queries.
266	FILES
267	* bool_query.c, .h
268	* bool_optimiser.c
269	* environment.c
270	* invf_get.c
271	* bool_tree.c, .h
272	* bool_tester.c
273	* lists.h
274	*************************************************************
275	TITLE
276	Similarity variants
277	APPLICATION
278	mg-2
279	TYPE
280	extend
281	REPORT
282	[email protected]/alistair - June 1994
283	FIX
284	[email protected] - July 1994 .. Feb 1995
285	CLAIM
286	Can only use one type of similarity measure - the
287	standard cosine measure.
288	PROBLEM
289	See CITRI/TR-95-3 for more details.
290	The standard measure can be broken up into 7 components.
291	The 7 components are
292	Each one of these components has a number of alternatives.
293	The overall measure, S_qd, can also be altered.
294	Thus the particular similarity measure used can be specified
295	by an 8 dimensional vector.
296	What is desired is to be able to specify to mgquery an option
297	and a 8-digit string representing this vector (assuming that
298	any one alternative can have at most 9 (not using zero) variants).
299	SOLUTION
300	The programs which had to be modified were:
301	(i) mgquery,
302	(ii) mg_weights_build.
303	The other mg programs in existence store the text, indexing info,
304	and the basic statistics such as N, n, ft, fdt.
305	Other programs which had to be created were:
306	(i) mg_fmd_build,
307	(ii) mg_wt_build.
308	Mg_fmd_build will create the file to store the f_md statistic,
309	where f_md is the largest (maximum) f_dt of any term in document, d.
310	Mg_wt_build will create the file to store the w_t primitive.
311	It only creates this for the w_t variants 6-9 which would require
312	extra passes of invf at query time if they were not stored here.
313	For details on similarity changes for mgquery and mg_weights_build,
314	please see the other modification entries.
315	FILES
316	* mg_fmd_build.c
317	* mg_wt_build.c
318	* build_lib.c, build_lib.h
319	*************************************************************
320	TITLE
321	Similarity variants for mgquery
322	APPLICATION
323	mg-2
324	TYPE
325	extend
326	REPORT
327	[email protected]/alistair - June 1994
328	FIX
329	[email protected] - July 1994 .. Feb 1995
330	CLAIM
331	"mgquery" needs to be altered to allow modification of
332	the similarity measure.
333	PROBLEM
334	See CITRI/TR-95-3 for more details.
335	SOLUTION
336
337	Most of the similarity measures, Sqd, are of the
338	form: Aqd
339	-----
340	Bqd
341	where Bqd is an expression involving Wd and possibly Wq,
342	where Aqd is a sum over the common document/query terms
343	of w_qt and w_dt.
344	Building of Aqd
345	===============
346	The calculation of Aqd is done in the file build_Aqd.c .
347	The functions for doing this used to be in the file invf_get.c .
348	Build_Aqd.c contains 4 different functions for building Aqd, each
349	of them building a different data structure:
350	(i) Array, (ii) Splay Tree, (iii) Hash Table, (iv) List Table.
351	Each of these routines seems to have been construction by modifications
352	to duplicated code. This is often the easiest way to construct variants
353	but is quite difficult to maintain consistency.
354	As the aim of the exercise was to try out different sim. measures for
355	retrieval effectiveness, I only modified the code that constructed
356	an array. This routine was called "CosineDecode"; I changed it to
357	"build_Aqd_Array." This change reflects the fact that we are only
358	calculating Aqd and this need not be used for the Cosine measure.
359	The other routines: "CosineDecodeSplay," "CosineDecodeHash," and
360	"CosineDecodeList" have been left unaltered - they need to be updated
361	in the future which would be best be done by abstracting out common code.
362	By the stage of building Aqd, the query terms have been looked up in
363	the inverted file dictionary and put into a list.
364	This list of common terms is traversed to lookup the corresponding
365	invf entries. Before the invf entry is processed, all query and term
366	relevant statistics and primitive quantities are calculated.
367	For example, fqt, ft, wt, rqt, wqt, Wq-partial-sum.
368	To save unnecessary calculations, there is a test for each value
369	to see whether it is needed e.g. "if (sim->variant_needs & NEEDS_wt) ... ".
370	Aside: Variant Needs
371	--------------------
372	The idea behind the "variant_needs" field is to be able to have all
373	the code in the one place for each possible variant and this code would
374	get the information at the correct time/place only when it is needed.
375	The overhead is a "bit-and" and "test" for each component.
376	The important concern though, is that the "variant_needs" must be
377	accurate i.e. it should be carefully maintained.
378	Each possible need is stored as a bit position in a constant/macro
379	of the form "NEEDS_component" e.g. NEEDS_wt, NEEDS_nt
380	More recently (Jan/95), I have found that the _need_ing of a component
381	may be relevant to a particular purpose, that is, it may be needed in
382	one situation and not another. This was the case for Wd.
383	For rdt#6, Wd was needed, however, it might not be needed for Bqd, the
384	denominator of Sqd. So I changed from NEEDS_Wd to NEEDS_Wd_for_Sqd and
385	NEEDS_Wd_for_rdt.
386	--------------------
387	To handle the different calculations of the variants, I wrote macros
388	based around a switch statement. All these macros are stored in the file
389	"similarity.h" . The point of doing this is to centralise most of the changes
390	and cut down on the number of files which have to be altered if a new
391	way of calculating a primitive is to be added.
392	This is achieved by having a data record called, "Similarity_variant" whose
393	fields includes all the statistics and the similarity primitives.
394	So the standard procedure is to see if something is needed and if so, then
395	extract it from an mg file or calculate it using a macro - most of the
396	input and output to the macro is done via the "Similarity_variant" structure.
397	As well as calculating Aqd, it is also a convenient place to calculate Wq.
398	Previously, for Cosine measure, Wq was not needed because each Aqd was directly
399	divided by Wq and thus would not change the ordering. However, there are
400	some Sqd's which divide by a sum involving Wq and thus need Wq.
401	Calculating Sqd and Ordering Documents
402	======================================
403	The file query.ranked.c contains the code for calculating Sqd using the
404	approximate Wd in order to do the ranking of the documents.
405	The mg-1 file was cleaned up and modified slightly.
406	All the heap data structures and routines were taken out and placed in
407	their own file, heap_weights.c/.h .
408	The major components/steps in the ranking process were abstracted into
409	macros and functions:
410	calc_MaxParas
411	insert_heap_entry
412	insert_greater_heap_entry
413	approx_guided_insert
414	fill_initial_heap
415	add_heap_remainders
416	change_heap_to_exact
417	add_remainder_exact_weights
418	Make_Exact_Root
419	build_doc_list
420	Aside: Zeroing of an Aqd Element
421	-------------------------
422	One interesting change concerned the use of zeroing out an Aqd element
423	so as to mark it as being used in the heap.
424	In the heap, the approx. Sqd is stored based on Aqd/Wd-approx.
425	Later when Aqd is required, it can be extracted from the approx. Sqd
426	by muliplication of Wd-approx i.e. Aqd = Sqd-approx * Wd-approx.
427	This, however, is not always possible for the various Sqd variants.
428	So, instead of zeroing out Aqd, I decided just to make it negative.
429	-------------------------
430	In some Sqd cases, Wd and Wd-approx. are not required. In which case,
431	step 3 which calls on "change_heap_to_exact", "Heap_Build" and
432	"add_remainder_exact_weights" is not required.
433
434	FILES
435	* build_Aqd.c
436	* query.ranked.c
437	* similarity.c/.h (in libmg)
438	* heap_weights.c/.h
439	* backend.c/.h
440	*************************************************************
441	TITLE
442	Similarity variants for mg_weights_build
443	APPLICATION
444	mg-2
445	TYPE
446	extend
447	REPORT
448	[email protected]/alistair - June 1994
449	FIX
450	[email protected] - July 1994 .. Feb 1995
451	CLAIM
452	"mg_weights_build" needs to be altered to allow modification of
453	the similarity measure.
454	PROBLEM
455	See CITRI/TR-95-3 for more details.
456	SOLUTION
457	The weight files which are generated for a particular similarity
458	measure have their names extended by a suffix.
459	In the case of Wd#1 no weights are generated.
460	In the case of the standard cosine weights, Wd#2, a 3 letter suffix
461	is used to represent .
462	In the case of the other Wd variants, a one letter suffix is used
463	to represent which Wd variant it is.
464	In each case, the variant input (e.g. -q 22222222) should be the whole
465	similarity variant string and the relevant fields will be extracted out.
466	This is done for consistency in code and interface.
467	The code is fairly similar to the original.
468	A dependency check has been added so that the dates of files and the
469	type of needed files is verified before building.
470	The dependencies include, invf dictionary, invf index, invf, fmd and wt files.
471	The major change here, is the possible use of fmd and/or wt files.
472	Later when I was having to write mg_fmd_build and mg_wt_build,
473	I decided to abstract out some macros, namely:
474	Get_ft, Get_ft_Ft, loop_invf_entry,
475	which were put into "build_lib.c" .
476	"loop_invf_entry" takes a function/macro name as a parameter and applies
477	it to the sim record (with field fdt set), current doc number and modifies
478	the return value.
479	FILES
480	* mg_weights_build.c
481	* build_lib.c
482	*************************************************************
483	TITLE
484	Mgstat with non-existent files
485	APPLICATION
486	mg-1
487	TYPE
488	bug
489	REPORT
490	[email protected] - 16 May 1994
491	FIX
492	[email protected] - 10 Aug 1994
493	CLAIM
494	NaNs and Infinites would be printed out by mgstat
495	if unable to open .text or .text.dict file.
496	PROBLEM
497	The NaNs etc. were output in the column stating
498	the percentage size of the file compared with the
499	number of input bytes of the source text data.
500	If it couldn't read the .text file with its
501	header describing the number of source text bytes, then
502	in working out the percentage it would divide by zero.
503	Also due to some bad control flow, it wouldn't attempt to
504	open the .text file if it failed when opening
505	the .text.dict file.
506	SOLUTION
507	Only printout the percentage if we can read the header
508	from the .text file.
509	Read in text header irrespective of text dictionary file.
510	FILES
511	* mgstat.c
512	*************************************************************
513	TITLE
514	Boolean tree optimisations
515	APPLICATION
516	mg-2
517	TYPE
518	extend
519	REPORT
520	(i) [email protected] - 28/Sep/94
521	(ii) [email protected] - 12/Dec/94
522	FIX
523	(i) [email protected] - 18/Oct/94
524	(ii) [email protected] - 21/Dec/94
525	CLAIM
526	The initial prompt for investigating the optimisation of
527	boolean queries is noted in the mg-1 mod14.txt.
528	The code for optimising seemed to have a number of faults.
529	PROBLEM
530	Boolean optimisation was unreliable.
531	SOLUTION
532	Initially (in case (i) above, see mg-1/mod14.txt), I rewrote
533	all the boolean tree and optimising code. I converted the boolean
534	expression into DNF. I did this after reading some notes about
535	the steps involved in optimisation and they suggested standardising
536	in some normal form. I thought that DNF would be appropriate so that
537	all the terms are converted to be part of "and" expressions and be
538	evaluated quickly using skipping.
539	This, however, can suffer quite badly if the distributive law is
540	applied to often and the query expands in size. If there was
541	some sort of cacheing of invf entries, then it might not be so
542	bad otherwise there is quite an overhead on reading the same
543	invf entry more than once.
544
545	As it happens, CNF queries are reasonably common, where the user
546	queries with a conjunction of disjunctions of similes:
547	e.g. (car \| automobile \| vehicle) & (fast \| quick \| speedy)
548	This sort of CNF query expands a hell of a lot !
549	So after speaking with Justin who wanted to benchmark Atlas with Mg on
550	these sort of queries, I looked up the MG book for other ideas.
551
552	The method that I implemented was the following:
553
554	-----------------------------------------------------------------
555	Steps of tree modifications:
556	Gets literals by pushing the nots down, detecting T/F at leaves
557	and collapsing the tree by detecting 'and' of 'and's and 'or' of
558	'or's.
559	Next it looks at the or nodes and if all the children are terms
560	then it marks the or-node as such.
561	Finally, the or-term-nodes are sorted by using the sum of their
562	ft's for comparison.
563	Steps at query evaluation:
564	If it comes across an 'and' of 'or-terms' then the evaluation is
565	done noting the distributive law.
566	I.e. a & (b \| c \| d) = (a & b) \| (a & c) \| (a & d)
567	Assuming 'a' is the c-set of documents.
568	All of 'a' is tested against 'b' and matching ones are marked.
569	Next, all the unmarked members of 'a' are tested against 'c'.
570	Likewise for 'd'.
571	Now all the marked members of our c-set are kept.
572	When we do the testing, we can use the skipping in the invf entries.
573	-----------------------------------------------------------------
574
575
576	After doing this, I added the choice of which type of optimisation
577	the user wanted by adding query-environment-variable, "optimise_type".
578	Type 0 = no parse tree modification.
579	Type 1 = Or-term recognition and CNF query evaluation optimisation.
580	Type 2 = Put into DNF form. [generally not recommended]
581	FILES
582	* bool_tree.c
583	* bool_parser.y
584	* bool_optimiser.c
585	* bool_query.c
586	* bool_tester.c
587	* term_lists.c
588	* query_env.c
589	* invf_get.c [GetDocsOp]
590	*************************************************************
591	TITLE
592	nonexistent HOME bug
593	APPLICATION
594	mg-1, mg-2
595	TYPE
596	bug
597	REPORT
598	[email protected] - 2/May/95
599	FIX
600	[email protected] - 2/May/95
601	CLAIM
602	"The big problem was that mgquery crashes when the HOME environment
603	variable is not set, which is the case when it is run by the www server."
604	[...] "I expect it happens when looking for $HOME/.mgrc."
605	PROBLEM
606	The result of getenv("HOME")" was used directly in
607	a sprintf call. If the environment variable HOME
608	was not in existence then null would be used.
609	In some C libraries sprintf will convert the 0
610	string into the string "(null)" on others it will core dump.
611	(For example, Solaris seems to core dump, sunos 4 seems ok).
612	SOLUTION
613	The result from getenv("HOME")" is tested before
614	being used.
615	FILES
616	* commands.c
617	*************************************************************
618	TITLE
619	mgquery collection name preference
620	APPLICATION
621	mg-1, mg-2
622	TYPE
623	improve
624	REPORT
625	[email protected] - 2/May/95
626	FIX
627	[email protected] - 4/May/95
628	CLAIM
629	Surely something must override mquery's preference for ./bib.
630	If MGDATA is set correctly, I think it should prefer that collection,
631	and -d should definitely override it.
632	I could always say -d . if I really wanted ./bib.
633	PROBLEM
634	Currently the priority is:
635	1. Check if ./name is a directory,
636	If so then use it as the collection directory.
637	2. Check if ./name.text is a file,
638	If so then use ./ as the collection directory.
639	3. Check if mgdir/name is a directory,
640	If so then use mgdir/name as the collection directory.
641	4. Otherwise,
642	Use mgdir/name as the database file prefix.
643	This would be the case if one used "-f alice/alice".
644	However, one would then not specify a final name argument
645	and we'd never get here. Go figure ???
646	SOLUTION
647	Moved step 3 to the top instead.
648	FILES
649	* mgquery.c [search_for_collection()]
650	*************************************************************
651	TITLE
652	Printout of query terms
653	APPLICATION
654	mg-1, mg-2
655	TYPE
656	extend
657	REPORT
658	[email protected] - April 95
659	FIX
660	[email protected] - April 95
661	CLAIM
662	No easy way to find out the parsed and stemmed words
663	used in the query. Would like to know these words
664	so I can call a separate highlighting program to
665	highlight these words.
666	PROBLEM
667	No facility available.
668	SOLUTION
669	A ".queryterms" mgquery command was added which lists
670	out the parsed/stemmed queryterms of the last query.
671	FILES
672	* commands.c (added CmdQueryTerms)
673	*************************************************************
674	TITLE
675	mg_getrc
676	APPLICATION
677	mg-1, mg-2
678	TYPE
679	extend
680	REPORT
681	[email protected] - 2/May/95
682	FIX
683	-
684	CLAIM
685	Repeated code had to be written for different named
686	gets but really the same type of parsing required.
687	E.g. one might want to use a standard method for inserting
688	^Bs between paragraphs for different books. One doesn't
689	want to write duplicate code for each different named book,
690	rather note that each book should be filtered "book" style.
691	PROBLEM
692	There was no way of abstracting out types of filters from
693	the name of an instance of a collection.
694	SOLUTION
695	Allow information to be given with <name, type, files>.
696	This extra info can be provided in a mg_getrc file.
697	See man page for mg_get for details.
698	FILES
699	* mg_get.sh
700	*************************************************************
701	TITLE
702	TREC DocNo file
703	APPLICATION
704	mg-2
705	TYPE
706	improve
707	REPORT
708	[email protected] - 26/May/95
709	FIX
710	[email protected] - 26/May/95
711	CLAIM
712	MG has problems dealing with trec docnos for trec disk 3.
713	PROBLEM
714	Trec DocNos file didn't have a wide enough field
715	to handle disk 3.
716	SOLUTION
717	Allow different width fields for file.
718	It is still fixed width but a number in the header
719	says how wide the field is.
720	FILES
721	* passes/mg.special.c
722	* query/mgquery.c
723	*************************************************************
724	TITLE
725	Boolean optimiser #1 with `!'
726	APPLICATION
727	mg-1, mg-2
728	TYPE
729	bug
730	REPORT
731	[email protected] - 20/7/95
732	FIX
733	[email protected] - 27/7/95
734	CLAIM
735	Complained about not-nodes.
736	e.g. complained about "croquet & !hedgehog"
737	PROBLEM
738	Boolean optimiser type#1 didn't convert
739	"and not"s into diff nodes.
740	SOLUTION
741	Added code to convert '&!' to '-'.
742	FILES
743	* mg/bool_optimiser.c [mg-1]
744	* query/bool_optimiser.c [mg-2]
745	*************************************************************
746	TITLE
747	Autoconfiguring mg-1
748	APPLICATION
749	mg-1
750	TYPE
751	improve
752	REPORT
753	many people - 94/95
754	FIX
755	[email protected] - Aug/95
756	CLAIM
757	Portability is limited by setting up c-macros just for particular
758	machines and operating systems.
759	People had to make changes for HP, Next, Linux, Dec Alpha, ...
760	PROBLEM
761	Porting was only targetting at the machines that the author had
762	access to.
763	SOLUTION
764	Use GNU's autoconfigure program.
765	This allows checking of the systems features/characteristics.
766	It also allows some checking for specific machines/OS - although
767	I have not utilised this option.
768	I used GNU's tar-1.11.8 as an example to base my changes on.
769	I also used autoscan to generate the initial "configure.in".
770	The "Makefile.in"s were done very similarly to GNU tar's.
771	The "config.h" and "sysfuncs.h" files were scrapped and
772	rewritten. The new "config.h" file is generated by the configure
773	script - it contains all the #define's for the system features.
774	The "sysfuncs.h" file wraps up a number of system headers.
775	For example, some systems use , while some use ;
776	which one is included is decided in "sysfuncs.h".
777	I have also used GNU tar's use of ansi2knr in its Makefiles.
778	This should hopefully allow the package to work on a system with
779	only a K&R C compiler.
780	However, there are probably problems with what I have done.
781	I am concerned about <stdarg.h> for example.
782	I also noticed that "ansi2knr" require function definitions as
783	the GNU coding style recommends ie. with function name the first
784	string on the line. This prompted me to run all the package's code
785	thru GNU's indent.
786	Setting up the configure changes is difficult. It really seems
787	necessary to try the package out on many target machines so one
788	can know what is necessary.
789	A simple check target for the main Makefile has been written.
790	It is used to see if the installation is working - it does
791	not test much of the functionality of mg.
792	It does cmp's on data files and diff's on query/result files.
793	FILES
794	Most of the files in the distribution.
795	*************************************************************
796	TITLE
797	Consistent use of stderr
798	APPLICATION
799	mg-1
800	TYPE
801	improve
802	REPORT
803	[email protected] - 16 May 1994
804	FIX
805	[email protected] - 11 August 1994
806	CLAIM
807	Inconsistent use of stdout/stderr in usage messages.
808	PROBLEM
809	Sometimes used "printf" and sometimes used "fprintf(stderr"
810	in usage messages.
811	SOLUTION
812	All should now use "fprintf(stderr" in usage messages.
813	FILES
814	* mg_compression_dict.c
815	* mg_compression_dict.1
816	* mg_fast_comp_dict.c
817	* mg_fast_comp_dict.1
818	* mg_invf_dict.c
819	* mg_invf_dict.1
820	* mg_invf_dump.c
821	* mg_invf_dump.1
822	* mg_invf_rebuild.c
823	* mg_invf_rebuild.1
824	* mg_perf_hash_build.c
825	* mg_perf_hash_build.1
826	* mg_text_estimate.c
827	* mg_text_estimate.1
828	* mg_weights_build.c
829	* mg_weights_build.1
830	*************************************************************
831	TITLE
832	xmg bug
833	APPLICATION
834	mg-1
835	TYPE
836	bug
837	REPORT
838	[email protected] - 22 April 1994
839	FIX
840	[email protected] - 22 April 1994
841	CLAIM
842	"Serious problem in xmg, which I fear occurs whenever a query
843	doesn't return anything."
844	PROBLEM
845	??
846	SOLUTION
847	[xmg.sh 201] set rank 0
848	FILES
849	* xmg.sh
850	*************************************************************
851	TITLE
852	Unnecessary loading of text
853	APPLICATION
854	mg-1
855	TYPE
856	bug
857	REPORT
858	[email protected] - ?? August 1994
859	FIX
860	[email protected] - 12 August 1994
861	CLAIM
862	Mg was loading and uncompressing text when the
863	query did not require the text.
864	PROBLEM
865	There was no test for the query mode
866	before loading and uncompressing the text.
867	SOLUTION
868	Only load/uncompress text if query mode
869	is for text, headers or silent(for timing).
870	FILES
871	* mgquery.c
872	*************************************************************
873	TITLE
874	Man page errors
875	APPLICATION
876	mg-1
877	TYPE
878	bug
879	REPORT
880	[email protected] - 16 May 1994
881	FIX
882	[email protected] - 16 May 1994
883	CLAIM
884	Man page errors.
885	PROBLEM
886	See below.
887	SOLUTION
888	"The mg_make_fast_dict.1 file has been renamed mg_fast_comp_dict.1,
889	and all mg_make_fast_dict strings changed to mg_fast_comp_dict in all
890	man pages.
891	A large number of errors of spelling, typography, spacing, fonts,
892	grammar, omitted words, slang, punctuation, missing man page
893	cross-references, and man-page style have been corrected."
894	FILES
895	* mg_compression_dict.1
896	* mg_fast_comp_dict.1
897	* mg_get.1
898	* mg_invf_dict.1
899	* mg_invf_dump.1
900	* mg_invf_rebuild.1
901	* mg_passes.1
902	* mg_perf_hash_build.1
903	* mg_text_estimate.1
904	* mg_weights_build.1
905	* mgbilevel.1
906	* mgbuild.1
907	* mgdictlist.1
908	* mgfelics.1
909	* mgquery.1
910	* mgstat.1
911	* mgtic.1
912	* mgticbuild.1
913	* mgticdump.1
914	* mgticprune.1
915	* mgticstat.1
916	* xmg.1
917	*************************************************************
918	TITLE
919	Man page overview
920	APPLICATION
921	mg-1
922	TYPE
923	extend
924	REPORT
925	[email protected] -
926	FIX
927	[email protected] - 17 August 1994
928	CLAIM
929	"Write new mg.1 file to give a brief overview of mg, with samples
930	of how to use it. Otherwise, users are likely to be completely
931	overwhelmed by the number of programs (about 20) which might need to
932	be used, when in reality, only 2 or 3 are likely to be run by end
933	users."
934	SOLUTION
935	It was thought that mg.1, written by Nelson Beebe, was very useful
936	but a bit too comprehensive for an introduction.
937	Therefore, two man files, mgintro.1 and mgintro++.1 were written
938	with the basic stuff in mgintro.1 and slightly more advanced stuff
939	in mgintro++.1 .
940	FILES
941	* mg.1
942	* mgintro.1
943	* mgintro++.1
944	*************************************************************
945	TITLE
946	Parse errors not bus errors
947	APPLICATION
948	mg-1
949	TYPE
950	bug
951	REPORT
952	[email protected] - 2 Jun 94
953	FIX
954	[email protected] - 19 Aug 94
955	CLAIM
956	"These two queries
957	(which I typed in before I knew what I was doing!!)
958	> The Queen of Hearts, she made some tarts
959	> "The Queen of Hearts" and "she made some tarts"
960	produced the following result:
961	mgquery : parse error
962	Bus error
963	"
964	PROBLEM
965	What is expected to happen under boolean querying:
966	Query1:
967	> The Queen of Hearts, she made some tarts
968	will produce a parse error due to the comma which
969	is not a valid TERM.
970	Query2:
971	> "The Queen of Hearts" and "she made some tarts"
972	will store a post-processing string
973	of ''The Queen of Hearts" and "she made some tarts'' and
974	will have a main boolean query of the empty string.
975	This is because the postprocessing string takes in
976	everything between the first quote and the last one.
977	An empty string is illegal for the boolean grammar and
978	hence a parse error.
979	The problem stems from the fact that the processing of
980	the parse tree is carried out, even though we have a
981	parse error. In the case of using an empty string to build
982	a parse tree, it is likely to leave the parse tree undefined.
983	SOLUTION
984	As soon as we find out that there is a parse-error,
985	we abandon any processing of the parse tree.
986	FILES
987	* query.bool.y
988	* query.bool.c (generated from query.bool.y)
989	*************************************************************
990	TITLE
991	Perfect hashing on small vocab
992	APPLICATION
993	mg-1
994	TYPE
995	bug
996	REPORT
997	[email protected] - July 1994
998	FIX
999	[email protected] - July 1994
1000	CLAIM
1001	Mg could not handle small collections in the case
1002	where there was only a small number of unique words.
1003	The perfect hash function would report an error.
1004	PROBLEM
1005	Rounding of the arithmetic during the calculation of the
1006	parameters of the perfect hash function was resulting in a
1007	combination of values such that the probability of a hash
1008	function being found was very small. This led to the limit
1009	on the generation loop being exceeded, and eventual
1010	failure.
1011	SOLUTION
1012	By using ceiling rather than floor when converting from a
1013	floating point value to an integer parameter, the arithmetic
1014	is now correct for all lexicon sizes, and the probability of
1015	each iteration successfully generating a hash function is
1016	sufficiently great that with _very_ high probability the
1017	execution loop counter will not be exceeded unless there
1018	genuinely is no hash function (for example, if the lexicon
1019	contains two words the same there cannot be a hash
1020	function).
1021	FILES
1022	* perf_hash.c
1023	*************************************************************

Note: See TracBrowser for help on using the repository browser.

Download in other formats:

Original Format