MODIFICATIONS@ 7627

Last change on this file since 7627 was 3745, checked in by mdewsnip, 21 years ago
Addition of MG package for search and retrieval
Property svn:executable set to *
Property svn:keywords set to Author Date Id Revision
File size: 19.6 KB

Rev	Line
[3745]	1	TITLE
	2	Parsing of Long Words
	3	APPLICATION
	4	mg-1, mg-2
	5	TYPE
	6	bug
	7	REPORT
	8	[email protected] - May 11th 1994
	9	FIX
	10	[email protected] - August 9th 1994
	11	CLAIM
	12	Mg didn't handle long words properly; it crashed.
	13	PROBLEM
	14	The invf passes call PARSE_LONG_WORD [words.h], which uses a limit of
	15	MAXLONGWORD when iterating through the string and storing into
	16	a word. MAXLONGWORD = 8192.
	17	However, mg strings generally store the length in the first
	18	byte, limiting them to 255 characters. The word which was passed
	19	to PARSE_LONG_WORD was an allocated string of MAXSTEMLEN = 255,
	20	which is as large as we should get anyway. Thus when accessing
	21	a larger word than 255 chars, PARSE_LONG_WORD would allow it
	22	(less than 8192) and would try storing beyond the array limit.
	23	SOLUTION
	24	The author can't remember why PARSE_LONG_WORD was used and what
	25	the significance of MAXLONGWORD = 8192 is.
	26	So PARSE_LONG_WORD has been changed to PARSE_STEM_WORD which
	27	uses MAXSTEMLEN as its limit.
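The bounded copy described above can be sketched as follows. This is only an illustration: parse_stem_word is a hypothetical function, not the real PARSE_STEM_WORD macro in words.h, and the rule that a word is a run of alphanumeric bytes is an assumption made for the example. It shows the two limits the fix relies on, the end-of-buffer pointer and the 255-character cap implied by the one-byte length prefix.

```c
#include <ctype.h>
#include <stddef.h>

#define MAXSTEMLEN 255 /* largest length a one-byte prefix can hold */

/* Hypothetical helper, not the real PARSE_STEM_WORD macro.
 * Copies the run of alphanumeric bytes starting at s into word, a
 * length-prefixed buffer of MAXSTEMLEN + 1 bytes: word[0] holds the
 * length and word[1..] the characters. Copying stops at the first
 * non-alphanumeric byte, at end, or after MAXSTEMLEN characters.
 * Returns a pointer to the first byte that was not consumed. */
const unsigned char *parse_stem_word(unsigned char *word,
                                     const unsigned char *s,
                                     const unsigned char *end)
{
    size_t len = 0;

    while (s < end && len < MAXSTEMLEN && isalnum(*s)) {
        word[++len] = *s++;
    }
    word[0] = (unsigned char)len;
    return s;
}
```

With this shape, a 300-character run is stored as a 255-character word and the returned pointer is left at the 256th character, so nothing is written past the end of the array.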
	28	FILES
	29	* words.h
	30	* invf.pass1.c
	31	* invf.pass2.c
	32	* ivf.pass1.c
	33	* ivf.pass2.c
	34	* query.ranked.c
	35	*************************************************************
	36	TITLE
	37	Use of Lovins stemmer
	38	APPLICATION
	39	mg-1
	40	TYPE
	41	improve
	42	REPORT
	43	local - 1994
	44	FIX
	45	[email protected] - 1994
	46	CLAIM
	47	Stemming was done naively.
	48	PROBLEM
	49	Only a few types of words and their endings
	50	were considered.
	51	SOLUTION
	52	Replacement with a more elaborate "known" stemmer by Lovins.
	53	The algorithm is described in:
	54	J.B. Lovins, "Development of a Stemming Algorithm",
	55	Mechanical Translation and Computational Linguistics, Vol. 11, 1968.
	56	FILES
	57	* stem.c
	58	* stem.h
	59	*************************************************************
	60	TITLE
	61	Different term parsing
	62	APPLICATION
	63	mg-1
	64	TYPE
	65	bug
	66	REPORT
	67	[email protected] - 23 Aug 1994
	68	FIX
	69	[email protected] - 23 Aug 1994
	70	CLAIM
	71	Boolean queries did not extract words/terms using the
	72	same method as is done at inverted-file creation and
	73	as is used for rank query parsing.
	74	PROBLEM
	75	The hand-written lex. analyser, query_lex, which is called by
	76	the boolean query parser was not calling a common
	77	word-extraction routine as used by the rest of mg.
	78	This would be OK if the two pieces of code did the same things, but they didn't.
	79	Query_lex, for instance, did NOT place any limit on the
	80	number of digits in a term.
	81	Of even more concern, it would allow arbitrary sized words
	82	although it used Pascal style strings which store the length
	83	in the first byte and can therefore only be 255 characters in length.
	84	SOLUTION
	85	Query_lex in "query.bool.y", was modified to call the routine
	86	PARSE_STEM_WORD which is also used by text-inversion routines and
	87	ranking query routines.
	88	Now all terms are extracted by the same routine.
	89	To do this, the end of the line buffer had to be noted as
	90	PARSE_STEM_WORD requires a pointer to the end - which is the
	91	safe thing to do (don't want to run over the end).
	92	This meant I had to find the length of the query line buffer.
	93	This was allocated in the file "read_line.c" by the routine,
	94	"readline". Its size was the literal number 1024.
	95	This was changed to a constant and placed in "read_line.h".
	96	The definition for PARSE_STEM_WORD can be found in "words.h".
	97	FILES
	98	* query.bool.y
	99	* query.bool.c (by bison)
	100	* read_line.c
	101	* read_line.h
	102	*************************************************************
	103	TITLE
	104	Highlighting of query terms
	105	APPLICATION
	106	mg-1
	107	TYPE
	108	extend
	109	REPORT
	110	[email protected] - Aug 94
	111	FIX
	112	[email protected] - Sep 94
	113	CLAIM
	114	Difficult to feel happy that the query-result returned is
	115	satisfying the query - need to look hard to find the queried words.
	116	Need to show words in results using some highlighting method.
	117	PROBLEM
	118	No highlighting of query terms in results.
	119	SOLUTION
	120	Mgquery was previously outputting the decompressed text to a pager
	121	such as "less(1)" or "more(1)".
	122	(Except when redirected or piped elsewhere :)
	123	So what was needed was some sort of highlight pager that, as well as
	124	displaying the text, would also use some means for highlighting the
	125	stemmed query words.
	126	Two common forms of highlighting were chosen: underline and bolding.
	127	These are supported by "less(1)" and possibly by "more(1)" by
	128	using the backspace character.
	129	A highlight pager will also need to know which words need to be
	130	highlighted. Therefore, the code was modified to build up a
	131	string of the stemmed query words for passing to the highlight pager.
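Backspace highlighting of this kind can be sketched as below. This is an illustration only, not the code in mg_hilite_words.c: overstrike is a hypothetical name, and the sketch assumes the usual overstrike convention that a pager such as less(1) renders "c, backspace, c" as bold and "_, backspace, c" as underlined.

```c
#include <stddef.h>

/* Hypothetical helper: writes text[0..len) into out as overstrike
 * sequences and NUL-terminates the result.
 *   bold:      c  BS c
 *   underline: _  BS c
 * out must hold at least 3 * len + 1 bytes.
 * Returns the number of bytes written, excluding the NUL. */
size_t overstrike(char *out, const char *text, size_t len, int underline)
{
    size_t n = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        out[n++] = underline ? '_' : text[i];
        out[n++] = '\b';
        out[n++] = text[i];
    }
    out[n] = '\0';
    return n;
}
```

A filter built this way would pass every word through unchanged except those found in the table of stemmed query words, which it would emit through a routine like the one above.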
	132	Design Options:
	133	---------------
	134	* Could do text filtering in mgquery before passing out to pager.
	135	Instead I pipe to a separate process, the "hilite_words" pager,
	136	which filters and pipes into less/more.
	137	* Could do different highlighting or a combination.
	138	* Could use a different structure for storing the query words other
	139	than the hash-table I used.
	140	FILES
	141	* Makefile - to include hilite_words target
	142	* mg_hilite_words.c
	143	* mgquery.c
	144	* mgquery.1
	145	* query.bool.y
	146	* query.ranked.c
	147	* environment.c
	148	* environment.h
	149	* backend.h
	150	*************************************************************
	151	TITLE
	152	Mg_compression_dict did premature free
	153	APPLICATION
	154	mg-1
	155	TYPE
	156	bug
	157	REPORT
	158	[email protected] - 23 Sep 94
	159	FIX
	160	[email protected] - 23 Sep 94
	161	CLAIM
	162	mg_compression_dict dumped core in
	163	file: mg_compression_dict.c
	164	function: Write_data
	165	line: int codelen = hd->clens[i];
	166	PROBLEM
	167	Huffman data, hd, was freed before it was accessed again.
	168	SOLUTION
	169	The freeing of hd has been moved to after all accesses
	170	(just before returning).
	171	FILES
	172	* mg_compression_dict.c
	173	*************************************************************
	174	TITLE
	175	Boolean tree optimising rewrite
	176	APPLICATION
	177	mg-1
	178	TYPE
	179	bug
	180	REPORT
	181	[email protected] - 23 Sep 94
	182	FIX
	183	[email protected] - Oct 94
	184	CLAIM
	185	"I am still getting core dump in "and" queries in mgquery,
	186	where the first word does not exist, but the second one does."
	187	PROBLEM
	188	Having freed a particular node, it tried to free it again and
	189	access one of its fields.
	190
	191	I.e. code-fragment...
	192
	193	FreeNode(curr); /* where curr = CHILD(base) for 1st term in list */
	194	FreeNodes(next);
	195	FreeNodes(CHILD(base));
	196	/* but CHILD(base) has already been freed above */
	197	/* if the node was the first one in the list */
	198
	199	SOLUTION
	200	A number of things in the code seemed a bit dubious to me.
	201	So I have rewritten the boolean optimising stage and abstracted out
	202	the various stages - each file starts with "bool".
	203	Boolean query optimising seems to be a tricky problem.
	204	It is not clear that putting an expression into a certain form will
	205	actually simplify it and whether simplification means faster querying.
	206	I have converted a given boolean expression into DNF
	207	(Disjunctive Normal Form). "And not" nodes, which are readily apparent
	208	in DNF, are converted to "diff" nodes. I have only applied the idempotency
	209	laws involving TRUE and FALSE, and not the ones requiring matching of
	210	expressions - it is a potentially more complicated problem.
	211	The optimiser has been tested by playing with "bool_tester", and if you are
	212	having a crash or problem in a boolean query it would be worth testing the
	213	query on the "bool_tester." The token "*" stands for TRUE (or all documents)
	214	and the token "_" stands for FALSE (or no documents). This should show the
	215	expression before and after optimisation in an ASCII tree bracketing format.
	216	FILES
	217	* bool_tree.c
	218	* bool_parser.y
	219	* bool_optimiser.c
	220	* bool_query.c
	221	* bool_tester.c
	222	* term_lists.c
	223	*************************************************************
	224	TITLE
	225	Mgtic pixel placement
	226	APPLICATION
	227	mg-1
	228	TYPE
	229	bug
	230	REPORT
	231	Bruce McKenzie - [email protected] (21st Oct 1994)
	232	FIX
	233	[email protected]
	234	CLAIM
	235	mgtic crashed on certain files.
	236	PROBLEM
	237	Placing pixels outside of bitmap.
	238	SOLUTION
	239	Changed the putpixel routine to truncate at borders of the image.
	240	FILES
	241	* mgtic.c
	242	*************************************************************
	243	TITLE
	244	Improved boolean tree optimising
	245	APPLICATION
	246	mg-1
	247	TYPE
	248	improve
	249	REPORT
	250	[email protected] - 12/Dec/94
	251	FIX
	252	[email protected] - 21/Dec/94, 14/Mar/95
	253	CLAIM
	254	Optimising by conversion to DNF is not necessarily such
	255	a good idea - can actually slow things down.
	256	PROBLEM
	257	The distributive law used in converting to DNF
	258	duplicates expressions.
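The growth caused by the distributive law can be estimated with a small sketch. The types and names here are hypothetical and are not taken from bool_tree.c; the sketch assumes a binary, negation-free tree. It counts how many conjunctions the DNF of an expression contains: an OR adds the counts of its children, while an AND multiplies them, because every conjunction on the left is paired with every conjunction on the right.

```c
/* Hypothetical node types for the sketch (no NOT or DIFF nodes). */
enum node_kind { TERM, AND, OR };

struct node {
    enum node_kind kind;
    const struct node *left;  /* unused for TERM */
    const struct node *right; /* unused for TERM */
};

/* Returns the number of conjunctions in the DNF of the tree at n. */
unsigned long dnf_conjuncts(const struct node *n)
{
    unsigned long l, r;

    if (n->kind == TERM)
        return 1;
    l = dnf_conjuncts(n->left);
    r = dnf_conjuncts(n->right);
    return n->kind == OR ? l + r : l * r;
}
```

For example, (a | b) & (c | d) expands to four conjunctions, and ANDing in a third two-way OR doubles that to eight, so a query that is already in CNF can grow quickly when forced into DNF.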
	259	SOLUTION
	260	Introduce a query environment variable, optimise_type = 0 | 1 | 2.
	261	Type 0 does nothing to the parse tree.
	262	Type 2 does the DNF conversion.
	263	Type 1 is the new default and does the following...
	264	Do simple tree rearrangement like flattening.
	265	Optimise for CNF queries.
	266	FILES
	267	* bool_query.c, .h
	268	* bool_optimiser.c
	269	* environment.c
	270	* invf_get.c
	271	* bool_tree.c, .h
	272	* bool_tester.c
	273	* lists.h
	274	*************************************************************
	275	TITLE
	276	Mgstat with non-existent files
	277	APPLICATION
	278	mg-1
	279	TYPE
	280	bug
	281	REPORT
	282	[email protected] - 16 May 1994
	283	FIX
	284	[email protected] - 10 Aug 1994
	285	CLAIM
	286	NaNs and Infinities would be printed out by mgstat
	287	if unable to open .text or .text.dict file.
	288	PROBLEM
	289	The NaNs etc. were output in the column stating
	290	the percentage size of the file compared with the
	291	number of input bytes of the source text data.
	292	If it couldn't read the .text file with its
	293	header describing the number of source text bytes, then
	294	in working out the percentage it would divide by zero.
	295	Also due to some bad control flow, it wouldn't attempt to
	296	open the .text file if it failed when opening
	297	the .text.dict file.
	298	SOLUTION
	299	Only print out the percentage if we can read the header
	300	from the .text file.
	301	Read in text header irrespective of text dictionary file.
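The guard can be sketched as follows. The function name, its parameters and the output format are hypothetical and are not taken from mgstat.c; the extra check for a zero byte count is an assumption added for safety rather than something the entry states.

```c
#include <stdio.h>

/* Hypothetical helper: formats the percentage column into buf.
 * Returns 1 if a percentage was written, or 0 (with buf emptied) when
 * the .text header was not read or reports zero source bytes, so the
 * caller can leave the column blank instead of printing NaN or Inf. */
int format_percentage(char *buf, size_t size, unsigned long file_bytes,
                      unsigned long source_bytes, int have_text_header)
{
    if (!have_text_header || source_bytes == 0) {
        if (size > 0)
            buf[0] = '\0';
        return 0;
    }
    snprintf(buf, size, "%6.2f%%", 100.0 * file_bytes / source_bytes);
    return 1;
}
```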
	302	FILES
	303	* mgstat.c
	304	*************************************************************
	305	TITLE
	306	nonexistent HOME bug
	307	APPLICATION
	308	mg-1, mg-2
	309	TYPE
	310	bug
	311	REPORT
	312	[email protected] - 2/May/95
	313	FIX
	314	[email protected] - 2/May/95
	315	CLAIM
	316	"The big problem was that mgquery crashes when the HOME environment
	317	variable is not set, which is the case when it is run by the www server."
	318	[...] "I expect it happens when looking for $HOME/.mgrc."
	319	PROBLEM
	320	The result of getenv("HOME") was used directly in
	321	a sprintf call. If the environment variable HOME
	322	did not exist then a null pointer would be used.
	323	In some C libraries sprintf will convert the null
	324	pointer into the string "(null)"; in others it will core dump.
	325	(For example, Solaris seems to core dump, sunos 4 seems ok).
	326	SOLUTION
	327	The result from getenv("HOME") is tested before
	328	being used.
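A minimal sketch of such a test is shown below. The function names are hypothetical and this is not the code in commands.c; the ".mgrc" file name comes from the report quoted above. The sketch also uses snprintf rather than sprintf so that an over-long HOME cannot overflow the buffer, which goes beyond what the entry describes.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: writes "<home>/.mgrc" into buf.
 * Returns 1 on success, 0 if home is NULL or the path does not fit. */
int rc_path_from_home(const char *home, char *buf, size_t size)
{
    int n;

    if (home == NULL)
        return 0; /* HOME is not set: skip the per-user rc file */
    n = snprintf(buf, size, "%s/.mgrc", home);
    return n >= 0 && (size_t)n < size;
}

/* Convenience wrapper that reads HOME from the environment. */
int mgrc_path(char *buf, size_t size)
{
    return rc_path_from_home(getenv("HOME"), buf, size);
}
```

A caller would attempt to open the rc file only when mgrc_path returns 1, which covers the case of a web server running mgquery without HOME set.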
	329	FILES
	330	* commands.c
	331	*************************************************************
	332	TITLE
	333	mgquery collection name preference
	334	APPLICATION
	335	mg-1, mg-2
	336	TYPE
	337	improve
	338	REPORT
	339	[email protected] - 2/May/95
	340	FIX
	341	[email protected] - 4/May/95
	342	CLAIM
	343	Surely something must override mgquery's preference for ./bib.
	344	If MGDATA is set correctly, I think it should prefer that collection,
	345	and -d should definitely override it.
	346	I could always say -d . if I really wanted ./bib.
	347	PROBLEM
	348	Currently the priority is:
	349	1. Check if ./name is a directory,
	350	If so then use it as the collection directory.
	351	2. Check if ./name.text is a file,
	352	If so then use ./ as the collection directory.
	353	3. Check if mgdir/name is a directory,
	354	If so then use mgdir/name as the collection directory.
	355	4. Otherwise,
	356	Use mgdir/name as the database file prefix.
	357	This would be the case if one used "-f alice/alice".
	358	However, one would then not specify a final name argument
	359	and we'd never get here. Go figure ???
	360	SOLUTION
	361	Moved step 3 to the top instead.
	362	FILES
	363	* mgquery.c [search_for_collection()]
	364	*************************************************************
	365	TITLE
	366	Printout of query terms
	367	APPLICATION
	368	mg-1, mg-2
	369	TYPE
	370	extend
	371	REPORT
	372	[email protected] - April 95
	373	FIX
	374	[email protected] - April 95
	375	CLAIM
	376	No easy way to find out the parsed and stemmed words
	377	used in the query. Would like to know these words
	378	so I can call a separate highlighting program to
	379	highlight these words.
	380	PROBLEM
	381	No facility available.
	382	SOLUTION
	383	A ".queryterms" mgquery command was added which lists
	384	out the parsed/stemmed query terms of the last query.
	385	FILES
	386	* commands.c (added CmdQueryTerms)
	387	*************************************************************
	388	TITLE
	389	mg_getrc
	390	APPLICATION
	391	mg-1, mg-2
	392	TYPE
	393	extend
	394	REPORT
	395	[email protected] - 2/May/95
	396	FIX
	397	-
	398	CLAIM
	399	Repeated code had to be written for differently named
	400	gets even though the same type of parsing was required.
	401	E.g. one might want to use a standard method for inserting
	402	^Bs between paragraphs for different books. One doesn't
	403	want to write duplicate code for each different named book,
	404	rather note that each book should be filtered "book" style.
	405	PROBLEM
	406	There was no way of abstracting out types of filters from
	407	the name of an instance of a collection.
	408	SOLUTION
	409	Allow information to be given with <name, type, files>.
	410	This extra info can be provided in a mg_getrc file.
	411	See man page for mg_get for details.
	412	FILES
	413	* mg_get.sh
	414	*************************************************************
	415	TITLE
	416	Boolean optimiser #1 with `!'
	417	APPLICATION
	418	mg-1, mg-2
	419	TYPE
	420	bug
	421	REPORT
	422	[email protected] - 20/7/95
	423	FIX
	424	[email protected] - 27/7/95
	425	CLAIM
	426	Complained about not-nodes.
	427	e.g. complained about "croquet & !hedgehog"
	428	PROBLEM
	429	Boolean optimiser type#1 didn't convert
	430	"and not"s into diff nodes.
	431	SOLUTION
	432	Added code to convert '&!' to '-'.
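The conversion can be sketched on a simplified tree. The types and names below are hypothetical: the sketch assumes binary nodes, which may not match the real structures in bool_optimiser.c, and it only handles a negated right operand.

```c
#include <stddef.h>

/* Hypothetical node types for the sketch. */
enum op { OP_TERM, OP_AND, OP_NOT, OP_DIFF };

struct bnode {
    enum op op;
    struct bnode *left;  /* the only child of an OP_NOT node */
    struct bnode *right; /* NULL for OP_NOT and OP_TERM */
};

/* Rewrites every "a & !b" into "a - b", in place and bottom-up.
 * The bypassed OP_NOT node is unlinked but not freed; in this sketch
 * the caller owns all node storage. */
void and_not_to_diff(struct bnode *n)
{
    if (n == NULL || n->op == OP_TERM)
        return;
    and_not_to_diff(n->left);
    and_not_to_diff(n->right);
    if (n->op == OP_AND && n->right != NULL && n->right->op == OP_NOT) {
        n->op = OP_DIFF;
        n->right = n->right->left; /* skip over the NOT node */
    }
}
```

Applied to "croquet & !hedgehog", this leaves a single diff node whose children are the two terms, so no free-standing not-node remains for later stages to complain about.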
	433	FILES
	434	* mg/bool_optimiser.c [mg-1]
	435	* query/bool_optimiser.c [mg-2]
	436	*************************************************************
	437	TITLE
	438	Consistent use of stderr
	439	APPLICATION
	440	mg-1
	441	TYPE
	442	improve
	443	REPORT
	444	[email protected] - 16 May 1994
	445	FIX
	446	[email protected] - 11 August 1994
	447	CLAIM
	448	Inconsistent use of stdout/stderr in usage messages.
	449	PROBLEM
	450	Sometimes used "printf" and sometimes used "fprintf(stderr"
	451	in usage messages.
	452	SOLUTION
	453	All should now use "fprintf(stderr" in usage messages.
	454	FILES
	455	* mg_compression_dict.c
	456	* mg_compression_dict.1
	457	* mg_fast_comp_dict.c
	458	* mg_fast_comp_dict.1
	459	* mg_invf_dict.c
	460	* mg_invf_dict.1
	461	* mg_invf_dump.c
	462	* mg_invf_dump.1
	463	* mg_invf_rebuild.c
	464	* mg_invf_rebuild.1
	465	* mg_perf_hash_build.c
	466	* mg_perf_hash_build.1
	467	* mg_text_estimate.c
	468	* mg_text_estimate.1
	469	* mg_weights_build.c
	470	* mg_weights_build.1
	471	*************************************************************
	472	TITLE
	473	xmg bug
	474	APPLICATION
	475	mg-1
	476	TYPE
	477	bug
	478	REPORT
	479	[email protected] - 22 April 1994
	480	FIX
	481	[email protected] - 22 April 1994
	482	CLAIM
	483	"Serious problem in xmg, which I fear occurs whenever a query
	484	doesn't return anything."
	485	PROBLEM
	486	??
	487	SOLUTION
	488	[xmg.sh 201] set rank 0
	489	FILES
	490	* xmg.sh
	491	*************************************************************
	492	TITLE
	493	Unnecessary loading of text
	494	APPLICATION
	495	mg-1
	496	TYPE
	497	bug
	498	REPORT
	499	[email protected] - ?? August 1994
	500	FIX
	501	[email protected] - 12 August 1994
	502	CLAIM
	503	Mg was loading and uncompressing text when the
	504	query did not require the text.
	505	PROBLEM
	506	There was no test for the query mode
	507	before loading and uncompressing the text.
	508	SOLUTION
	509	Only load/uncompress text if query mode
	510	is for text, headers or silent (for timing).
	511	FILES
	512	* mgquery.c
	513	*************************************************************
	514	TITLE
	515	Man page errors
	516	APPLICATION
	517	mg-1
	518	TYPE
	519	bug
	520	REPORT
	521	[email protected] - 16 May 1994
	522	FIX
	523	[email protected] - 16 May 1994
	524	CLAIM
	525	Man page errors.
	526	PROBLEM
	527	See below.
	528	SOLUTION
	529	"The mg_make_fast_dict.1 file has been renamed mg_fast_comp_dict.1,
	530	and all mg_make_fast_dict strings changed to mg_fast_comp_dict in all
	531	man pages.
	532	A large number of errors of spelling, typography, spacing, fonts,
	533	grammar, omitted words, slang, punctuation, missing man page
	534	cross-references, and man-page style have been corrected."
	535	FILES
	536	* mg_compression_dict.1
	537	* mg_fast_comp_dict.1
	538	* mg_get.1
	539	* mg_invf_dict.1
	540	* mg_invf_dump.1
	541	* mg_invf_rebuild.1
	542	* mg_passes.1
	543	* mg_perf_hash_build.1
	544	* mg_text_estimate.1
	545	* mg_weights_build.1
	546	* mgbilevel.1
	547	* mgbuild.1
	548	* mgdictlist.1
	549	* mgfelics.1
	550	* mgquery.1
	551	* mgstat.1
	552	* mgtic.1
	553	* mgticbuild.1
	554	* mgticdump.1
	555	* mgticprune.1
	556	* mgticstat.1
	557	* xmg.1
	558	*************************************************************
	559	TITLE
	560	Man page overview
	561	APPLICATION
	562	mg-1
	563	TYPE
	564	extend
	565	REPORT
	566	[email protected] -
	567	FIX
	568	[email protected] - 17 August 1994
	569	CLAIM
	570	"Write new mg.1 file to give a brief overview of mg, with samples
	571	of how to use it. Otherwise, users are likely to be completely
	572	overwhelmed by the number of programs (about 20) which might need to
	573	be used, when in reality, only 2 or 3 are likely to be run by end
	574	users."
	575	SOLUTION
	576	It was thought that mg.1, written by Nelson Beebe, was very useful
	577	but a bit too comprehensive for an introduction.
	578	Therefore, two man files, mgintro.1 and mgintro++.1 were written
	579	with the basic stuff in mgintro.1 and slightly more advanced stuff
	580	in mgintro++.1.
	581	FILES
	582	* mg.1
	583	* mgintro.1
	584	* mgintro++.1
	585	*************************************************************
	586	TITLE
	587	Parse errors not bus errors
	588	APPLICATION
	589	mg-1
	590	TYPE
	591	bug
	592	REPORT
	593	[email protected] - 2 Jun 94
	594	FIX
	595	[email protected] - 19 Aug 94
	596	CLAIM
	597	"These two queries
	598	(which I typed in before I knew what I was doing!!)
	599	> The Queen of Hearts, she made some tarts
	600	> "The Queen of Hearts" and "she made some tarts"
	601	produced the following result:
	602	mgquery : parse error
	603	Bus error
	604	"
	605	PROBLEM
	606	What is expected to happen under boolean querying:
	607	Query1:
	608	> The Queen of Hearts, she made some tarts
	609	will produce a parse error due to the comma which
	610	is not a valid TERM.
	611	Query2:
	612	> "The Queen of Hearts" and "she made some tarts"
	613	will store a post-processing string
	614	of ''The Queen of Hearts" and "she made some tarts'' and
	615	will have a main boolean query of the empty string.
	616	This is because the postprocessing string takes in
	617	everything between the first quote and the last one.
	618	An empty string is illegal for the boolean grammar and
	619	hence a parse error.
	620	The problem stems from the fact that the processing of
	621	the parse tree is carried out, even though we have a
	622	parse error. In the case of using an empty string to build
	623	a parse tree, it is likely to leave the parse tree undefined.
	624	SOLUTION
	625	As soon as we find out that there is a parse-error,
	626	we abandon any processing of the parse tree.
	627	FILES
	628	* query.bool.y
	629	* query.bool.c (generated from query.bool.y)
	630	*************************************************************
	631	TITLE
	632	Perfect hashing on small vocab
	633	APPLICATION
	634	mg-1
	635	TYPE
	636	bug
	637	REPORT
	638	[email protected] - July 1994
	639	FIX
	640	[email protected] - July 1994
	641	CLAIM
	642	Mg could not handle small collections in the case
	643	where there was only a small number of unique words.
	644	The perfect hash function would report an error.
	645	PROBLEM
	646	Rounding of the arithmetic during the calculation of the
	647	parameters of the perfect hash function was resulting in a
	648	combination of values such that the probability of a hash
	649	function being found was very small. This led to the limit
	650	on the generation loop being exceeded, and eventual
	651	failure.
	652	SOLUTION
	653	By using ceiling rather than floor when converting from a
	654	floating point value to an integer parameter, the arithmetic
	655	is now correct for all lexicon sizes, and the probability of
	656	each iteration successfully generating a hash function is
	657	sufficiently great that with _very_ high probability the
	658	execution loop counter will not be exceeded unless there
	659	genuinely is no hash function (for example, if the lexicon
	660	contains two words the same there cannot be a hash
	661	function).
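The effect of the rounding direction can be illustrated with a small sketch. The entry does not say which parameter was affected or what scaling it used, so the function names and the ratio used in the example are purely hypothetical and are not taken from perf_hash.c. The point is only that for a very small lexicon, truncating a scaled size can leave the parameter no larger than the number of words, while rounding up always leaves some slack.

```c
/* Hypothetical: scales a lexicon size n by ratio and rounds down.
 * Assumes ratio * n is non-negative, where truncation equals floor. */
int floor_param(double ratio, int n)
{
    return (int)(ratio * n);
}

/* Hypothetical: same scaling, rounded up. Implemented without ceil()
 * so the sketch has no dependency on the maths library. */
int ceil_param(double ratio, int n)
{
    double x = ratio * n;
    int p = (int)x;

    return ((double)p < x) ? p + 1 : p;
}
```

With an assumed ratio of 1.23 and a two-word lexicon, floor_param gives 2 whereas ceil_param gives 3; for large lexicons the two differ by at most one and the choice matters far less.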
	662	FILES
	663	* perf_hash.c
	664	*************************************************************
