MODIFICATIONS@ 3745

Last change on this file since 3745 was 3745, checked in by mdewsnip, 21 years ago
Addition of MG package for search and retrieval
Property svn:executable set to *
Property svn:keywords set to Author Date Id Revision
File size: 19.6 KB

1	TITLE
2	Parsing of Long Words
3	APPLICATION
4	mg-1, mg-2
5	TYPE
6	bug
7	REPORT
8	[email protected] - May 11th 1994
9	FIX
10	[email protected] - August 9th 1994
11	CLAIM
12	Mg didn't handle long words properly; it crashed.
13	PROBLEM
14	The invf passes call PARSE_LONG_WORD [words.h], which uses a limit of
15	MAXLONGWORD when iterating through the string and storing into
16	a word. MAXLONGWORD = 8192.
17	However, mg strings generally store the length in the first
18	byte limiting them to 255 characters. The word which was passed
19	to PARSE_LONG_WORD was an allocated string of MAXSTEMLEN = 255,
20	which is as large as we should get anyway. Thus when accessing
21	a larger word than 255 chars, PARSE_LONG_WORD would allow it
22	(less than 8192) and would try storing beyond the array limit.
23	SOLUTION
24	The author can't remember why PARSE_LONG_WORD was used and what
25	the significance of MAXLONGWORD = 8192 is.
26	So PARSE_LONG_WORD has been changed to PARSE_STEM_WORD which
27	uses MAXSTEMLEN as its limit.
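
The overrun described above can be avoided by bounding the copy with the same limit the length byte imposes. The following is only a sketch: parse_stem_word_sketch is a hypothetical function, not the real PARSE_STEM_WORD macro from words.h, and it assumes a buffer of MAXSTEMLEN + 1 bytes (one length byte plus up to 255 characters) and that a word is a run of alphanumeric characters.

```c
#include <ctype.h>
#include <stddef.h>

#define MAXSTEMLEN 255 /* the length is stored in one byte, as in the entry above */

/*
 * Hypothetical helper, NOT the real PARSE_STEM_WORD macro.
 * Copies alphanumeric characters from [*pos, end) into word[1..],
 * stores the count in word[0], and stops after MAXSTEMLEN characters
 * so that the MAXSTEMLEN + 1 byte buffer is never overrun.
 */
static void parse_stem_word_sketch(unsigned char word[MAXSTEMLEN + 1],
                                   const unsigned char **pos,
                                   const unsigned char *end)
{
    size_t len = 0;
    const unsigned char *p = *pos;

    while (p < end && isalnum(*p) && len < MAXSTEMLEN) {
        word[++len] = *p++;
    }
    word[0] = (unsigned char)len; /* Pascal-style length prefix */
    *pos = p; /* an over-long word is split here rather than overflowing */
}
```

With this shape, a run of more than 255 word characters is returned in pieces of at most 255, which is one possible policy; the real macro may treat over-long words differently.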
28	FILES
29	* words.h
30	* invf.pass1.c
31	* invf.pass2.c
32	* ivf.pass1.c
33	* ivf.pass2.c
34	* query.ranked.c
35	*************************************************************
36	TITLE
37	Use of Lovins stemmer
38	APPLICATION
39	mg-1
40	TYPE
41	improve
42	REPORT
43	local - 1994
44	FIX
45	[email protected] - 1994
46	CLAIM
47	Stemming was done naively.
48	PROBLEM
49	Only a few types of words and their endings
50	were considered.
51	SOLUTION
52	Replacement with a more elaborate "known" stemmer by Lovins.
53	The algorithm is described in:
54	J.B. Lovins, "Development of a Stemming Algorithm",
55	Mechanical Translation and Computational Linguistics, Vol. 11, 1968.
56	FILES
57	* stem.c
58	* stem.h
59	*************************************************************
60	TITLE
61	Different term parsing
62	APPLICATION
63	mg-1
64	TYPE
65	bug
66	REPORT
67	[email protected] - 23 Aug 1994
68	FIX
69	[email protected] - 23 Aug 1994
70	CLAIM
71	Boolean queries did not extract words/terms using the
72	same method as is done at inverted-file creation and
73	as is used for rank query parsing.
74	PROBLEM
75	The hand-written lexical analyser, query_lex, which is called by
76	the boolean query parser, was not calling the common
77	word-extraction routine used by the rest of mg.
78	This would be ok if the code did the same things - but they didn't.
79	Query_lex, for instance, did NOT place any limit on the
80	number of digits in a term.
81	Of even more concern, it would allow arbitrary sized words
82	although it used Pascal style strings which store the length
83	in the first byte and can therefore only be 255 characters in length.
84	SOLUTION
85	Query_lex in "query.bool.y", was modified to call the routine
86	PARSE_STEM_WORD which is also used by text-inversion routines and
87	ranking query routines.
88	Now all terms are extracted by the same routine.
89	To do this, the end of the line buffer had to be noted as
90	PARSE_STEM_WORD requires a pointer to the end - which is the
91	safe thing to do (don't want to run over the end).
92	This meant I had to find the length of the query line buffer.
93	This was allocated in the file "read_line.c" by the routine,
94	"readline". Its size was the literal number 1024.
95	This was changed to a constant and placed in "read_line.h".
96	The definition for PARSE_STEM_WORD can be found in "words.h".
97	FILES
98	* query.bool.y
99	* query.bool.c (by bison)
100	* read_line.c
101	* read_line.h
102	*************************************************************
103	TITLE
104	Highlighting of query terms
105	APPLICATION
106	mg-1
107	TYPE
108	extend
109	REPORT
110	[email protected] - Aug 94
111	FIX
112	[email protected] - Sep 94
113	CLAIM
114	Difficult to feel happy that the query-result returned is
115	satisfying the query - need to look hard to find the queried words.
116	Need to show words in results using some highlighting method.
117	PROBLEM
118	No highlighting of query terms in results.
119	SOLUTION
120	Mgquery was previously outputting the decompressed text to a pager
121	such as "less(1)" or "more(1)".
122	(Except when redirected or piped elsewhere :)
123	So what was needed was some sort of highlight pager that, in addition to
124	displaying the text, would also use some means for highlighting the
125	stemmed query words.
126	Two common forms of highlighting were chosen: underline and bolding.
127	These are supported by "less(1)" and possibly by "more(1)" by
128	using the backspace character.
129	A highlight pager will also need to know which words need to be
130	highlighted. Therefore, the code was modified to build up a
131	string of the stemmed query words for passing to the highlight pager.
132	Design Options:
133	---------------
134	* Could do text filtering in mgquery before passing out to pager.
135	Instead I pipe to a separate process, the "hilite_words" pager,
136	which filters and pipes into less/more.
137	* Could do different highlighting or a combination.
138	* Could use a different structure for storing the query words other
139	than the hash-table I used.
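
The backspace highlighting mentioned in this entry follows the usual overstrike convention that pagers such as less(1) interpret: underscore, backspace, character for underline, and character, backspace, character for bold. A minimal sketch of producing such sequences is given below. The function and type names are hypothetical and are not taken from mg_hilite_words.c.

```c
#include <stddef.h>

enum hilite_style { HILITE_UNDERLINE, HILITE_BOLD };

/*
 * Hypothetical helper: writes the overstruck form of s into dst.
 * cap is the capacity of dst including the terminating NUL.
 * Returns the number of bytes written (excluding the NUL),
 * or 0 if the result does not fit.
 */
static size_t overstrike(char *dst, size_t cap, const char *s,
                         enum hilite_style style)
{
    size_t n = 0;

    if (cap == 0) {
        return 0;
    }
    for (; *s != '\0'; s++) {
        if (n + 4 > cap) { /* three bytes for this character plus the NUL */
            return 0;
        }
        dst[n++] = (style == HILITE_UNDERLINE) ? '_' : *s;
        dst[n++] = '\b';
        dst[n++] = *s;
    }
    dst[n] = '\0';
    return n;
}
```

Each highlighted character costs three bytes, so a filter built this way has to size its output buffer accordingly.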
140	FILES
141	* Makefile - to include hilite_words target
142	* mg_hilite_words.c
143	* mgquery.c
144	* mgquery.1
145	* query.bool.y
146	* query.ranked.c
147	* environment.c
148	* environment.h
149	* backend.h
150	*************************************************************
151	TITLE
152	Mg_compression_dict did premature free
153	APPLICATION
154	mg-1
155	TYPE
156	bug
157	REPORT
158	[email protected] - 23 Sep 94
159	FIX
160	[email protected] - 23 Sep 94
161	CLAIM
162	mg_compression_dict dumped core in
163	file: mg_compression_dict.c
164	function: Write_data
165	line: int codelen = hd->clens[i];
166	PROBLEM
167	Huffman data, hd, was freed before it was accessed again.
168	SOLUTION
169	The freeing of hd has been moved to after all accesses
170	(just before returning).
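
The ordering fix can be illustrated with a small sketch. The struct layout and function names here are assumptions made for illustration; only the field name clens and the line "int codelen = hd->clens[i];" come from the report above.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the Huffman data used by Write_data. */
typedef struct {
    int num_codes;
    int *clens; /* code length per symbol */
} huff_data_sketch;

static huff_data_sketch *make_huff(int n)
{
    huff_data_sketch *hd;
    int i;

    if (n <= 0) {
        return NULL;
    }
    hd = malloc(sizeof *hd);
    if (hd == NULL) {
        return NULL;
    }
    hd->clens = malloc((size_t)n * sizeof *hd->clens);
    if (hd->clens == NULL) {
        free(hd);
        return NULL;
    }
    hd->num_codes = n;
    for (i = 0; i < n; i++) {
        hd->clens[i] = i + 1; /* dummy code lengths for the sketch */
    }
    return hd;
}

/* Sums the code lengths; returns -1 if the data could not be built. */
static long write_data_sketch(int n)
{
    huff_data_sketch *hd = make_huff(n);
    long total = 0;
    int i;

    if (hd == NULL) {
        return -1;
    }
    /* The bug pattern was a free of hd at this point, before the loop. */
    for (i = 0; i < hd->num_codes; i++) {
        int codelen = hd->clens[i];
        total += codelen;
    }
    /* Free only after the last access, just before returning. */
    free(hd->clens);
    free(hd);
    return total;
}
```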
171	FILES
172	* mg_compression_dict.c
173	*************************************************************
174	TITLE
175	Boolean tree optimising rewrite
176	APPLICATION
177	mg-1
178	TYPE
179	bug
180	REPORT
181	[email protected] - 23 Sep 94
182	FIX
183	[email protected] - Oct 94
184	CLAIM
185	"I am still getting core dump in "and" queries in mgquery,
186	where the first word does not exist, but the second one does."
187	PROBLEM
188	Having freed a particular node, it tried to free it again and
189	access one of its fields.
190
191	I.e. code-fragment...
192
193	FreeNode(curr); /* where curr = CHILD(base) for 1st term in list */
194	FreeNodes(next);
195	FreeNodes(CHILD(base));
196	/* but CHILD(base) has already been freed above */
197	/* if the node was the first one in the list */
198
199	SOLUTION
200	A number of things in the code seemed a bit dubious to me.
201	So I have rewritten the boolean optimising stage and abstracted out
202	the various stages - each file starts with "bool".
203	Boolean query optimising seems to be a tricky problem.
204	It is not clear that putting an expression into a certain form will
205	actually simplify it and whether simplification means faster querying.
206	I have converted a given boolean expression into DNF
207	(Disjunctive Normal Form). "And not" nodes, which are readily apparent
208	in DNF, are converted to "diff" nodes. I have only applied the idempotency
209	laws involving TRUE and FALSE, and not the ones requiring matching of
210	expressions - it is a potentially more complicated problem.
211	The optimiser has been tested by playing with "bool_tester", and if you are
212	having a crash or problem in a boolean query it would be worth testing the
213	query on the "bool_tester." The token "*" stands for TRUE (or all documents)
214	and the token "_" stands for FALSE (or no documents). This should show the
215	expression before and after optimisation in an ASCII tree bracketing format.
216	FILES
217	* bool_tree.c
218	* bool_parser.y
219	* bool_optimiser.c
220	* bool_query.c
221	* bool_tester.c
222	* term_lists.c
223	*************************************************************
224	TITLE
225	Mgtic pixel placement
226	APPLICATION
227	mg-1
228	TYPE
229	bug
230	REPORT
231	Bruce McKenzie - [email protected] (21st Oct 1994)
232	FIX
233	[email protected]
234	CLAIM
235	mgtic crashed on certain files.
236	PROBLEM
237	Placing pixels outside of bitmap.
238	SOLUTION
239	Changed the putpixel routine to truncate at borders of the image.
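
A bounds-checked pixel write might look like the sketch below. It assumes one byte per pixel and reads "truncate at borders" as silently dropping writes that fall outside the image; the actual bitmap layout and policy in mgtic.c may differ, and the names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical bitmap: one byte per pixel, row-major. */
typedef struct {
    int width;
    int height;
    unsigned char *pixels;
} bitmap_sketch;

/*
 * Writes v at (x, y) if the coordinate lies inside the bitmap.
 * Returns 1 if the pixel was written, 0 if it was clipped.
 */
static int putpixel_clipped(bitmap_sketch *bm, int x, int y, unsigned char v)
{
    if (x < 0 || y < 0 || x >= bm->width || y >= bm->height) {
        return 0; /* outside the image: ignore instead of corrupting memory */
    }
    bm->pixels[(size_t)y * (size_t)bm->width + (size_t)x] = v;
    return 1;
}
```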
240	FILES
241	* mgtic.c
242	*************************************************************
243	TITLE
244	Improved boolean tree optimising
245	APPLICATION
246	mg-1
247	TYPE
248	improve
249	REPORT
250	[email protected] - 12/Dec/94
251	FIX
252	[email protected] - 21/Dec/94, 14/Mar/95
253	CLAIM
254	Optimising by conversion to DNF is not necessarily such
255	a good idea - can actually slow things down.
256	PROBLEM
257	The distributive law used in converting to DNF
258	duplicates expressions.
259	SOLUTION
260	Introduce a query environment variable, optimise_type = 0 | 1 | 2.
261	Type 0 does nothing to the parse tree.
262	Type 2 does the DNF conversion.
263	Type 1 is the new default and does the following...
264	Do simple tree rearrangement like flattening.
265	Optimise for CNF queries.
266	FILES
267	* bool_query.c, .h
268	* bool_optimiser.c
269	* environment.c
270	* invf_get.c
271	* bool_tree.c, .h
272	* bool_tester.c
273	* lists.h
274	*************************************************************
275	TITLE
276	Mgstat with non-existent files
277	APPLICATION
278	mg-1
279	TYPE
280	bug
281	REPORT
282	[email protected] - 16 May 1994
283	FIX
284	[email protected] - 10 Aug 1994
285	CLAIM
286	NaNs and Infinities would be printed out by mgstat
287	if unable to open .text or .text.dict file.
288	PROBLEM
289	The NaNs etc. were output in the column stating
290	the percentage size of the file compared with the
291	number of input bytes of the source text data.
292	If it couldn't read the .text file with its
293	header describing the number of source text bytes, then
294	in working out the percentage it would divide by zero.
295	Also due to some bad control flow, it wouldn't attempt to
296	open the .text file if it failed when opening
297	the .text.dict file.
298	SOLUTION
299	Only print out the percentage if we can read the header
300	from the .text file.
301	Read in text header irrespective of text dictionary file.
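
The guard can be sketched as follows. The function name, its parameters and the output layout are illustrative assumptions, not the actual mgstat.c code; the point is only that the division is skipped when the number of source bytes is unknown or zero.

```c
#include <stdio.h>

/*
 * Hypothetical helper: formats one line of a size report into buf.
 * The percentage column is printed only when the text header was read
 * and the source byte count is non-zero, so no NaN or Inf can appear.
 * Returns the value returned by snprintf.
 */
static int format_size_line(char *buf, size_t cap, const char *name,
                            unsigned long file_bytes,
                            int have_text_header,
                            unsigned long source_bytes)
{
    if (have_text_header && source_bytes > 0) {
        double pct = 100.0 * (double)file_bytes / (double)source_bytes;
        return snprintf(buf, cap, "%s %lu %.1f%%", name, file_bytes, pct);
    }
    return snprintf(buf, cap, "%s %lu -", name, file_bytes);
}
```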
302	FILES
303	* mgstat.c
304	*************************************************************
305	TITLE
306	nonexistent HOME bug
307	APPLICATION
308	mg-1, mg-2
309	TYPE
310	bug
311	REPORT
312	[email protected] - 2/May/95
313	FIX
314	[email protected] - 2/May/95
315	CLAIM
316	"The big problem was that mgquery crashes when the HOME environment
317	variable is not set, which is the case when it is run by the www server."
318	[...] "I expect it happens when looking for $HOME/.mgrc."
319	PROBLEM
320	The result of getenv("HOME") was used directly in
321	a sprintf call. If the environment variable HOME
322	was not set then NULL would be used.
323	In some C libraries sprintf will convert the NULL
324	string into the string "(null)"; on others it will core dump.
325	(For example, Solaris seems to core dump, SunOS 4 seems ok).
326	SOLUTION
327	The result from getenv("HOME") is tested before
328	being used.
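
A minimal sketch of the check is shown below. The function name is hypothetical and the real code in commands.c may be organised differently; the value of getenv("HOME") is passed in as a parameter so that the NULL case is easy to exercise. The sketch also uses snprintf rather than sprintf so that an over-long HOME cannot overflow the buffer, which goes slightly beyond the fix described above.

```c
#include <stdio.h>

/*
 * Hypothetical helper: builds "<home>/.mgrc" in buf.
 * home is expected to be the result of getenv("HOME") and may be NULL.
 * Returns 1 on success, 0 if home is NULL or the path does not fit.
 */
static int build_mgrc_path(char *buf, size_t cap, const char *home)
{
    int n;

    if (home == NULL) {
        return 0; /* HOME not set: never pass NULL to a %s conversion */
    }
    n = snprintf(buf, cap, "%s/.mgrc", home);
    return n >= 0 && (size_t)n < cap;
}
```

A caller would use it as build_mgrc_path(path, sizeof path, getenv("HOME")) and simply skip reading the rc file when it returns 0.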
329	FILES
330	* commands.c
331	*************************************************************
332	TITLE
333	mgquery collection name preference
334	APPLICATION
335	mg-1, mg-2
336	TYPE
337	improve
338	REPORT
339	[email protected] - 2/May/95
340	FIX
341	[email protected] - 4/May/95
342	CLAIM
343	Surely something must override mgquery's preference for ./bib.
344	If MGDATA is set correctly, I think it should prefer that collection,
345	and -d should definitely override it.
346	I could always say -d . if I really wanted ./bib.
347	PROBLEM
348	Currently the priority is:
349	1. Check if ./name is a directory,
350	If so then use it as the collection directory.
351	2. Check if ./name.text is a file,
352	If so then use ./ as the collection directory.
353	3. Check if mgdir/name is a directory,
354	If so then use mgdir/name as the collection directory.
355	4. Otherwise,
356	Use mgdir/name as the database file prefix.
357	This would be the case if one used "-f alice/alice".
358	However, one would then not specify a final name argument
359	and we'd never get here. Go figure ???
360	SOLUTION
361	Moved step 3 to the top instead.
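
The resulting priority order can be written down as a small decision function. This is a sketch only: the enum, the function name and the use of three precomputed flags are assumptions, and the real search_for_collection() in mgquery.c performs the file-system checks itself.

```c
/* Hypothetical outcome labels for the four cases listed above. */
typedef enum {
    USE_MGDIR_NAME_DIR,   /* mgdir/name is a directory (former step 3) */
    USE_LOCAL_NAME_DIR,   /* ./name is a directory (former step 1) */
    USE_CURRENT_DIR,      /* ./name.text is a file (former step 2) */
    USE_MGDIR_NAME_PREFIX /* fall back to mgdir/name as a file prefix */
} collection_choice;

/* Applies the new order: the mgdir check comes before the local ones. */
static collection_choice choose_collection(int mgdir_name_is_dir,
                                           int local_name_is_dir,
                                           int local_text_is_file)
{
    if (mgdir_name_is_dir) {
        return USE_MGDIR_NAME_DIR;
    }
    if (local_name_is_dir) {
        return USE_LOCAL_NAME_DIR;
    }
    if (local_text_is_file) {
        return USE_CURRENT_DIR;
    }
    return USE_MGDIR_NAME_PREFIX;
}
```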
362	FILES
363	* mgquery.c [search_for_collection()]
364	*************************************************************
365	TITLE
366	Printout of query terms
367	APPLICATION
368	mg-1, mg-2
369	TYPE
370	extend
371	REPORT
372	[email protected] - April 95
373	FIX
374	[email protected] - April 95
375	CLAIM
376	No easy way to find out the parsed and stemmed words
377	used in the query. Would like to know these words
378	so I can call a separate highlighting program to
379	highlight these words.
380	PROBLEM
381	No facility available.
382	SOLUTION
383	A ".queryterms" mgquery command was added which lists
384	out the parsed/stemmed queryterms of the last query.
385	FILES
386	* commands.c (added CmdQueryTerms)
387	*************************************************************
388	TITLE
389	mg_getrc
390	APPLICATION
391	mg-1, mg-2
392	TYPE
393	extend
394	REPORT
395	[email protected] - 2/May/95
396	FIX
397	-
398	CLAIM
399	Repeated code had to be written for differently named
400	gets even though the same type of parsing was required.
401	E.g. one might want to use a standard method for inserting
402	^Bs between paragraphs for different books. One doesn't
403	want to write duplicate code for each different named book,
404	rather note that each book should be filtered "book" style.
405	PROBLEM
406	There was no way of abstracting out types of filters from
407	the name of an instance of a collection.
408	SOLUTION
409	Allow information to be given with <name, type, files>.
410	This extra info can be provided in a mg_getrc file.
411	See man page for mg_get for details.
412	FILES
413	* mg_get.sh
414	*************************************************************
415	TITLE
416	Boolean optimiser #1 with `!'
417	APPLICATION
418	mg-1, mg-2
419	TYPE
420	bug
421	REPORT
422	[email protected] - 20/7/95
423	FIX
424	[email protected] - 27/7/95
425	CLAIM
426	Complained about not-nodes.
427	e.g. complained about "croquet & !hedgehog"
428	PROBLEM
429	Boolean optimiser type#1 didn't convert
430	"and not"s into diff nodes.
431	SOLUTION
432	Added code to convert '&!' to '-'.
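
The rewrite of "a & !b" into "a - b" can be sketched on a simple binary tree. The node type, tags and function names below are hypothetical; the real optimiser in bool_optimiser.c works on its own tree representation, so this shows the idea only. Every N_AND node is assumed to have two non-NULL children and every N_NOT node a single child in left.

```c
#include <stdlib.h>

typedef enum { N_TERM, N_AND, N_NOT, N_DIFF } node_tag;

/* Hypothetical binary tree node; N_NOT uses only the left child. */
typedef struct bool_node {
    node_tag tag;
    const char *term; /* set for N_TERM only */
    struct bool_node *left;
    struct bool_node *right;
} bool_node;

static bool_node *mk(node_tag tag, const char *term,
                     bool_node *left, bool_node *right)
{
    bool_node *n = malloc(sizeof *n);
    if (n == NULL) {
        abort();
    }
    n->tag = tag;
    n->term = term;
    n->left = left;
    n->right = right;
    return n;
}

/* Rewrites AND(a, NOT(b)) and AND(NOT(b), a) as DIFF(a, b), bottom-up. */
static void and_not_to_diff(bool_node *n)
{
    if (n == NULL) {
        return;
    }
    and_not_to_diff(n->left);
    and_not_to_diff(n->right);
    if (n->tag != N_AND) {
        return;
    }
    if (n->left->tag == N_NOT && n->right->tag != N_NOT) {
        bool_node *tmp = n->left; /* move the negated operand to the right */
        n->left = n->right;
        n->right = tmp;
    }
    if (n->left->tag != N_NOT && n->right->tag == N_NOT) {
        bool_node *not_node = n->right;
        n->tag = N_DIFF;
        n->right = not_node->left; /* DIFF takes the un-negated operand */
        free(not_node);
    }
}
```

An AND whose operands are both negated is left untouched here, since it has no positive operand to subtract from.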
433	FILES
434	* mg/bool_optimiser.c [mg-1]
435	* query/bool_optimiser.c [mg-2]
436	*************************************************************
437	TITLE
438	Consistent use of stderr
439	APPLICATION
440	mg-1
441	TYPE
442	improve
443	REPORT
444	[email protected] - 16 May 1994
445	FIX
446	[email protected] - 11 August 1994
447	CLAIM
448	Inconsistent use of stdout/stderr in usage messages.
449	PROBLEM
450	Sometimes used "printf" and sometimes used "fprintf(stderr"
451	in usage messages.
452	SOLUTION
453	All should now use "fprintf(stderr" in usage messages.
454	FILES
455	* mg_compression_dict.c
456	* mg_compression_dict.1
457	* mg_fast_comp_dict.c
458	* mg_fast_comp_dict.1
459	* mg_invf_dict.c
460	* mg_invf_dict.1
461	* mg_invf_dump.c
462	* mg_invf_dump.1
463	* mg_invf_rebuild.c
464	* mg_invf_rebuild.1
465	* mg_perf_hash_build.c
466	* mg_perf_hash_build.1
467	* mg_text_estimate.c
468	* mg_text_estimate.1
469	* mg_weights_build.c
470	* mg_weights_build.1
471	*************************************************************
472	TITLE
473	xmg bug
474	APPLICATION
475	mg-1
476	TYPE
477	bug
478	REPORT
479	[email protected] - 22 April 1994
480	FIX
481	[email protected] - 22 April 1994
482	CLAIM
483	"Serious problem in xmg, which I fear occurs whenever a query
484	doesn't return anything."
485	PROBLEM
486	??
487	SOLUTION
488	[xmg.sh 201] set rank 0
489	FILES
490	* xmg.sh
491	*************************************************************
492	TITLE
493	Unnecessary loading of text
494	APPLICATION
495	mg-1
496	TYPE
497	bug
498	REPORT
499	[email protected] - ?? August 1994
500	FIX
501	[email protected] - 12 August 1994
502	CLAIM
503	Mg was loading and uncompressing text when the
504	query did not require the text.
505	PROBLEM
506	There was no test for the query mode
507	before loading and uncompressing the text.
508	SOLUTION
509	Only load/uncompress text if query mode
510	is for text, headers or silent (for timing).
511	FILES
512	* mgquery.c
513	*************************************************************
514	TITLE
515	Man page errors
516	APPLICATION
517	mg-1
518	TYPE
519	bug
520	REPORT
521	[email protected] - 16 May 1994
522	FIX
523	[email protected] - 16 May 1994
524	CLAIM
525	Man page errors.
526	PROBLEM
527	See below.
528	SOLUTION
529	"The mg_make_fast_dict.1 file has been renamed mg_fast_comp_dict.1,
530	and all mg_make_fast_dict strings changed to mg_fast_comp_dict in all
531	man pages.
532	A large number of errors of spelling, typography, spacing, fonts,
533	grammar, omitted words, slang, punctuation, missing man page
534	cross-references, and man-page style have been corrected."
535	FILES
536	* mg_compression_dict.1
537	* mg_fast_comp_dict.1
538	* mg_get.1
539	* mg_invf_dict.1
540	* mg_invf_dump.1
541	* mg_invf_rebuild.1
542	* mg_passes.1
543	* mg_perf_hash_build.1
544	* mg_text_estimate.1
545	* mg_weights_build.1
546	* mgbilevel.1
547	* mgbuild.1
548	* mgdictlist.1
549	* mgfelics.1
550	* mgquery.1
551	* mgstat.1
552	* mgtic.1
553	* mgticbuild.1
554	* mgticdump.1
555	* mgticprune.1
556	* mgticstat.1
557	* xmg.1
558	*************************************************************
559	TITLE
560	Man page overview
561	APPLICATION
562	mg-1
563	TYPE
564	extend
565	REPORT
566	[email protected] -
567	FIX
568	[email protected] - 17 August 1994
569	CLAIM
570	"Write new mg.1 file to give a brief overview of mg, with samples
571	of how to use it. Otherwise, users are likely to be completely
572	overwhelmed by the number of programs (about 20) which might need to
573	be used, when in reality, only 2 or 3 are likely to be run by end
574	users."
575	SOLUTION
576	It was thought that mg.1, written by Nelson Beebe, was very useful
577	but a bit too comprehensive for an introduction.
578	Therefore, two man files, mgintro.1 and mgintro++.1 were written
579	with the basic stuff in mgintro.1 and slightly more advanced stuff
580	in mgintro++.1.
581	FILES
582	* mg.1
583	* mgintro.1
584	* mgintro++.1
585	*************************************************************
586	TITLE
587	Parse errors not bus errors
588	APPLICATION
589	mg-1
590	TYPE
591	bug
592	REPORT
593	[email protected] - 2 Jun 94
594	FIX
595	[email protected] - 19 Aug 94
596	CLAIM
597	"These two queries
598	(which I typed in before I knew what I was doing!!)
599	> The Queen of Hearts, she made some tarts
600	> "The Queen of Hearts" and "she made some tarts"
601	produced the following result:
602	mgquery : parse error
603	Bus error
604	"
605	PROBLEM
606	What is expected to happen under boolean querying:
607	Query1:
608	> The Queen of Hearts, she made some tarts
609	will produce a parse error due to the comma which
610	is not a valid TERM.
611	Query2:
612	> "The Queen of Hearts" and "she made some tarts"
613	will store a post-processing string
614	of ''The Queen of Hearts" and "she made some tarts'' and
615	will have a main boolean query of the empty string.
616	This is because the postprocessing string takes in
617	everything between the first quote and the last one.
618	An empty string is illegal for the boolean grammar and
619	hence a parse error.
620	The problem stems from the fact that the processing of
621	the parse tree is carried out, even though we have a
622	parse error. In the case of using an empty string to build
623	a parse tree, it is likely to leave the parse tree undefined.
624	SOLUTION
625	As soon as we find out that there is a parse-error,
626	we abandon any processing of the parse tree.
627	FILES
628	* query.bool.y
629	* query.bool.c (generated from query.bool.y)
630	*************************************************************
631	TITLE
632	Perfect hashing on small vocab
633	APPLICATION
634	mg-1
635	TYPE
636	bug
637	REPORT
638	[email protected] - July 1994
639	FIX
640	[email protected] - July 1994
641	CLAIM
642	Mg could not handle small collections in the case
643	where there was only a small number of unique words.
644	The perfect hash function would report an error.
645	PROBLEM
646	Rounding of the arithmetic during the calculation of the
647	parameters of the perfect hash function was resulting in a
648	combination of values such that the probability of a hash
649	function being found was very small. This led to the limit
650	on the generation loop being exceeded, and eventual
651	failure.
652	SOLUTION
653	By using ceiling rather than floor when converting from a
654	floating point value to an integer parameter, the arithmetic
655	is now correct for all lexicon sizes, and the probability of
656	each iteration successfully generating a hash function is
657	sufficiently great that with _very_ high probability the
658	execution loop counter will not be exceeded unless there
659	genuinely is no hash function (for example, if the lexicon
660	contains two words the same there cannot be a hash
661	function).
662	FILES
663	* perf_hash.c
664	*************************************************************
