Context Navigation

← Previous Revision
Latest Revision
Next Revision →
Normal
Revision Log

MODIFICATIONS@ 13670

Last change on this file since 13670 was 3745, checked in by mdewsnip, 21 years ago
Addition of MG package for search and retrieval
Property svn:executable set to ``* Property svn:keywords set to `Author Date Id Revision`
File size: 34.3 KB

Rev	Line
[3745]	1	TITLE
	2	Parsing of Long Words
	3	APPLICATION
	4	mg-1, mg-2
	5	TYPE
	6	bug
	7	REPORT
	8	[email protected] - May 11th 1994
	9	FIX
	10	[email protected] - August 9th 1994
	11	CLAIM
	12	Mg didn't handle long words properly; it crashed.
	13	PROBLEM
	14	Invf passes calls PARSE_LONG_WORD [words.h] which uses a limit of
	15	MAXLONGWORD on iterating thru the string and storing into
	16	a word. MAXLONGWORD = 8192.
	17	However, mg strings generally store the length in the first
	18	byte limiting them to 255 characters. The word which was passed
	19	to PARSE_LONG_WORD was an allocated string of MAXSTEMLEN = 255,
	20	which is as large as we should get anyway. Thus when accessing
	21	a larger word than 255 chars, PARSE_LONG_WORD would allow it
	22	(less than 8192) and would try storing beyond the array limit.
	23	SOLUTION
	24	The author can't remember why PARSE_LONG_WORD was used and what
	25	the significance of MAXLONGWORD = 8192 is.
	26	So PARSE_LONG_WORD has been changed to PARSE_STEM_WORD which
	27	uses MAXSTEMLEN as its limit.
	28	FILES
	29	* words.h
	30	* invf.pass1.c
	31	* invf.pass2.c
	32	* ivf.pass1.c
	33	* ivf.pass2.c
	34	* query.ranked.c
	35	*************************************************************
	36	TITLE
	37	Use of Lovins stemmer
	38	APPLICATION
	39	mg-1
	40	TYPE
	41	improve
	42	REPORT
	43	local - 1994
	44	FIX
	45	[email protected] - 1994
	46	CLAIM
	47	Stemming was done naively.
	48	PROBLEM
	49	Only a few types of words and their endings
	50	were considered.
	51	SOLUTION
	52	Replacement with a more elaborate "known" stemmer by Lovins.
	53	The algorithm is described in:
	54	J.B. Lovins, "Development of a Stemming Algorithm",
	55	Mechanical Translation and Computational Linguistics, Vol 11,1968.
	56	FILES
	57	* stem.c
	58	* stem.h
	59	*************************************************************
	60	TITLE
	61	Different term parsing
	62	APPLICATION
	63	mg-1
	64	TYPE
	65	bug
	66	REPORT
	67	[email protected] - 23 Aug 1994
	68	FIX
	69	[email protected] - 23 Aug 1994
	70	CLAIM
	71	Boolean queries did not extract words/terms using the
	72	same method as is done at inverted-file creation and
	73	as is used for rank query parsing.
	74	PROBLEM
	75	The hand-written lex. analyser, query_lex, which is called by
	76	the boolean query parser was not calling a common
	77	word-extraction routine as used by the rest of mg.
	78	This would be ok if the code did the same things - but they didn't.
	79	Query_lex, for instance, did NOT place any limit on the
	80	number of digits in a term.
	81	Of even more concern, it would allow arbitrary sized words
	82	although it used Pascal style strings which store the length
	83	in the first byte and can therefore only be 255 characters in length.
	84	SOLUTION
	85	Query_lex in "query.bool.y", was modified to call the routine
	86	PARSE_STEM_WORD which is also used by text-inversion routines and
	87	ranking query routines.
	88	Now all terms are extracted by the same routine.
	89	To do this, the end of the line buffer had to be noted as
	90	PARSE_STEM_WORD requires a pointer to the end - which is the
	91	safe thing to do (don't want to run over the end).
	92	This meant I had to find the length of the query line buffer.
	93	This was allocated in the file "read_line.c" by the routine,
	94	"readline". Its size was the literal number 1024.
	95	This was changed to a constant and placed in "read_line.h".
	96	The definition for PARSE_STEM_WORD can be found in "words.h".
	97	FILES
	98	* query.bool.y
	99	* query.bool.c (by bison)
	100	* read_line.c
	101	* read_line.h
	102	*************************************************************
	103	TITLE
	104	Highlighting of query terms
	105	APPLICATION
	106	mg-1
	107	TYPE
	108	extend
	109	REPORT
	110	[email protected] - Aug 94
	111	FIX
	112	[email protected] - Sep 94
	113	CLAIM
	114	Difficult to feel happy that the query-result returned is
	115	satisfying the query - need to look hard to find the queried words.
	116	Need to show words in results using some highlighting method.
	117	PROBLEM
	118	No highlighting of query terms in results.
	119	SOLUTION
	120	Mgquery was previously outputting the decompressed text to a pager
	121	such as "less(1)" or "more(1)".
	122	(Except when redirected or piped elsewhere :)
	123	So what was needed was some sort of highlight pager that instead of
	124	displaying the text would also use some means for highlighting the
	125	stemmed query words.
	126	Two common forms of highlighting were chosen: underline and bolding.
	127	These are supported by "less(1)" and possibly by "more(1)" by
	128	using the backspace character.
	129	A highlight pager will also need to know which words need to be
	130	highlighted. Therefore, the code was modified to build up a
	131	string of the stemmed query words for passing to the highlight pager.
	132	Design Options:
	133	---------------
	134	* Could do text filtering in mgquery before passing out to pager.
	135	Instead I pipe to a separate process, the "hilite_words" pager,
	136	which filters and pipes into less/more.
	137	* Could do different highlighting or a combination.
	138	* Could use a different structure for storing the query words other
	139	than the hash-table I used.
	140	FILES
	141	* Makefile - to include hilite_words target
	142	* mg_hilite_words.c
	143	* mgquery.c
	144	* mgquery.1
	145	* query.bool.y
	146	* query.ranked.c
	147	* environment.c
	148	* environment.h
	149	* backend.h
	150	*************************************************************
	151	TITLE
	152	Mg_compression_dict did premature free
	153	APPLICATION
	154	mg-1
	155	TYPE
	156	bug
	157	REPORT
	158	[email protected] - 23 Sep 94
	159	FIX
	160	[email protected] - 23 Sep 94
	161	CLAIM
	162	mg_compression_dict dumped core in
	163	file: mg_compression_dict.c
	164	function: Write_data
	165	line: int codelen = hd->clens[i];
	166	PROBLEM
	167	Huffman data, hd, was freed before it was accessed again.
	168	SOLUTION
	169	The freeing of hd has been moved to after all accesses
	170	(just before returning).
	171	FILES
	172	* mg_compression_dict.c
	173	*************************************************************
	174	TITLE
	175	Boolean tree optimising rewrite
	176	APPLICATION
	177	mg-1
	178	TYPE
	179	bug
	180	REPORT
	181	[email protected] - 23 Sep 94
	182	FIX
	183	[email protected] - Oct 94
	184	CLAIM
	185	"I am still getting core dump in "and" queries in mgquery,
	186	where the first word does not exist, but the second one does."
	187	PROBLEM
	188	Having freed a particular node, it tried to refree it and
	189	access one of its fields.
	190
	191	I.e. code-fragment...
	192
	193	FreeNode(curr); /* where curr = CHILD(base) for 1st term in list */
	194	FreeNodes(next);
	195	FreeNodes(CHILD(base));
	196	/* but CHILD(base) has already been freed above */
	197	/* if the node was the first one in the list */
	198
	199	SOLUTION
	200	A number of things in the code seemed a bit dubious to me.
	201	So I have rewritten the boolean optimising stage and abstracted out
	202	the various stages - each file starts with "bool".
	203	Boolean query optimising seems to be a tricky problem.
	204	It is not clear that putting an expression into a certain form will
	205	actually simplify it and whether simplification means faster querying.
	206	I have converted a given boolean expression into DNF
	207	(Disjunctive Normal Form). "And not" nodes, which are readily apparent
	208	in DNF, are converted to "diff" nodes. I have only applied the idempotency
	209	laws involving TRUE and FALSE, and not the ones requiring matching of
	210	expressions - it is a potentially more complicated problem.
	211	The optimiser has been tested by playing with "bool_tester", and if you are
	212	having a crash or problem in a boolean query it would be worth testing the
	213	query on the "bool_tester." The token "*" stands for TRUE (or all documents)
	214	and the token "_" stands for FALSE (or no documents). This should show the
	215	expression before and after optimisation in an ascii tree bracketting format.
	216	FILES
	217	* bool_tree.c
	218	* bool_parser.y
	219	* bool_optimiser.c
	220	* bool_query.c
	221	* bool_tester.c
	222	* term_lists.c
	223	*************************************************************
	224	TITLE
	225	Mgtic pixel placement
	226	APPLICATION
	227	mg-1
	228	TYPE
	229	bug
	230	REPORT
	231	Bruce McKenzie - [email protected] (21st Oct 1994)
	232	FIX
	233	[email protected]
	234	CLAIM
	235	mgtic crashed on certain files.
	236	PROBLEM
	237	Placing pixels outside of bitmap.
	238	SOLUTION
	239	Changed the putpixel routine to truncate at borders of the image.
	240	FILES
	241	* mgtic.c
	242	*************************************************************
	243	TITLE
	244	Improved boolean tree optimising
	245	APPLICATION
	246	mg-1
	247	TYPE
	248	improve
	249	REPORT
	250	[email protected] - 12/Dec/94
	251	FIX
	252	[email protected] - 21/Dec/94, 14/Mar/95
	253	CLAIM
	254	Optimising by conversion to DNF is not necessarily such
	255	a good idea - can actually slow things down.
	256	PROBLEM
	257	The distributive law used in converting to DNF
	258	duplicates expressions.
	259	SOLUTION
	260	Introduce a query environment variable, optimise_type = 0 \| 1 \| 2.
	261	Type 0 does nothing to the parse tree.
	262	Type 2 does the DNF conversion.
	263	Type 1 is the new default and does the following...
	264	Do simple tree rearrangement like flattening.
	265	Optimise for CNF queries.
	266	FILES
	267	* bool_query.c, .h
	268	* bool_optimiser.c
	269	* environment.c
	270	* invf_get.c
	271	* bool_tree.c, .h
	272	* bool_tester.c
	273	* lists.h
	274	*************************************************************
	275	TITLE
	276	Similarity variants
	277	APPLICATION
	278	mg-2
	279	TYPE
	280	extend
	281	REPORT
	282	[email protected]/alistair - June 1994
	283	FIX
	284	[email protected] - July 1994 .. Feb 1995
	285	CLAIM
	286	Can only use one type of similarity measure - the
	287	standard cosine measure.
	288	PROBLEM
	289	See CITRI/TR-95-3 for more details.
	290	The standard measure can be broken up into 7 components.
	291	The 7 components are
	292	Each one of these components has a number of alternatives.
	293	The overall measure, S_qd, can also be altered.
	294	Thus the particular similarity measure used can be specified
	295	by an 8 dimensional vector.
	296	What is desired is to be able to specify to mgquery an option
	297	and a 8-digit string representing this vector (assuming that
	298	any one alternative can have at most 9 (not using zero) variants).
	299	SOLUTION
	300	The programs which had to be modified were:
	301	(i) mgquery,
	302	(ii) mg_weights_build.
	303	The other mg programs in existence store the text, indexing info,
	304	and the basic statistics such as N, n, ft, fdt.
	305	Other programs which had to be created were:
	306	(i) mg_fmd_build,
	307	(ii) mg_wt_build.
	308	Mg_fmd_build will create the file to store the f_md statistic,
	309	where f_md is the largest (maximum) f_dt of any term in document, d.
	310	Mg_wt_build will create the file to store the w_t primitive.
	311	It only creates this for the w_t variants 6-9 which would require
	312	extra passes of invf at query time if they were not stored here.
	313	For details on similarity changes for mgquery and mg_weights_build,
	314	please see the other modification entries.
	315	FILES
	316	* mg_fmd_build.c
	317	* mg_wt_build.c
	318	* build_lib.c, build_lib.h
	319	*************************************************************
	320	TITLE
	321	Similarity variants for mgquery
	322	APPLICATION
	323	mg-2
	324	TYPE
	325	extend
	326	REPORT
	327	[email protected]/alistair - June 1994
	328	FIX
	329	[email protected] - July 1994 .. Feb 1995
	330	CLAIM
	331	"mgquery" needs to be altered to allow modification of
	332	the similarity measure.
	333	PROBLEM
	334	See CITRI/TR-95-3 for more details.
	335	SOLUTION
	336
	337	Most of the similarity measures, Sqd, are of the
	338	form: Aqd
	339	-----
	340	Bqd
	341	where Bqd is an expression involving Wd and possibly Wq,
	342	where Aqd is a sum over the common document/query terms
	343	of w_qt and w_dt.
	344	Building of Aqd
	345	===============
	346	The calculation of Aqd is done in the file build_Aqd.c .
	347	The functions for doing this used to be in the file invf_get.c .
	348	Build_Aqd.c contains 4 different functions for building Aqd, each
	349	of them building a different data structure:
	350	(i) Array, (ii) Splay Tree, (iii) Hash Table, (iv) List Table.
	351	Each of these routines seems to have been construction by modifications
	352	to duplicated code. This is often the easiest way to construct variants
	353	but is quite difficult to maintain consistency.
	354	As the aim of the exercise was to try out different sim. measures for
	355	retrieval effectiveness, I only modified the code that constructed
	356	an array. This routine was called "CosineDecode"; I changed it to
	357	"build_Aqd_Array." This change reflects the fact that we are only
	358	calculating Aqd and this need not be used for the Cosine measure.
	359	The other routines: "CosineDecodeSplay," "CosineDecodeHash," and
	360	"CosineDecodeList" have been left unaltered - they need to be updated
	361	in the future which would be best be done by abstracting out common code.
	362	By the stage of building Aqd, the query terms have been looked up in
	363	the inverted file dictionary and put into a list.
	364	This list of common terms is traversed to lookup the corresponding
	365	invf entries. Before the invf entry is processed, all query and term
	366	relevant statistics and primitive quantities are calculated.
	367	For example, fqt, ft, wt, rqt, wqt, Wq-partial-sum.
	368	To save unnecessary calculations, there is a test for each value
	369	to see whether it is needed e.g. "if (sim->variant_needs & NEEDS_wt) ... ".
	370	Aside: Variant Needs
	371	--------------------
	372	The idea behind the "variant_needs" field is to be able to have all
	373	the code in the one place for each possible variant and this code would
	374	get the information at the correct time/place only when it is needed.
	375	The overhead is a "bit-and" and "test" for each component.
	376	The important concern though, is that the "variant_needs" must be
	377	accurate i.e. it should be carefully maintained.
	378	Each possible need is stored as a bit position in a constant/macro
	379	of the form "NEEDS_component" e.g. NEEDS_wt, NEEDS_nt
	380	More recently (Jan/95), I have found that the _need_ing of a component
	381	may be relevant to a particular purpose, that is, it may be needed in
	382	one situation and not another. This was the case for Wd.
	383	For rdt#6, Wd was needed, however, it might not be needed for Bqd, the
	384	denominator of Sqd. So I changed from NEEDS_Wd to NEEDS_Wd_for_Sqd and
	385	NEEDS_Wd_for_rdt.
	386	--------------------
	387	To handle the different calculations of the variants, I wrote macros
	388	based around a switch statement. All these macros are stored in the file
	389	"similarity.h" . The point of doing this is to centralise most of the changes
	390	and cut down on the number of files which have to be altered if a new
	391	way of calculating a primitive is to be added.
	392	This is achieved by having a data record called, "Similarity_variant" whose
	393	fields includes all the statistics and the similarity primitives.
	394	So the standard procedure is to see if something is needed and if so, then
	395	extract it from an mg file or calculate it using a macro - most of the
	396	input and output to the macro is done via the "Similarity_variant" structure.
	397	As well as calculating Aqd, it is also a convenient place to calculate Wq.
	398	Previously, for Cosine measure, Wq was not needed because each Aqd was directly
	399	divided by Wq and thus would not change the ordering. However, there are
	400	some Sqd's which divide by a sum involving Wq and thus need Wq.
	401	Calculating Sqd and Ordering Documents
	402	======================================
	403	The file query.ranked.c contains the code for calculating Sqd using the
	404	approximate Wd in order to do the ranking of the documents.
	405	The mg-1 file was cleaned up and modified slightly.
	406	All the heap data structures and routines were taken out and placed in
	407	their own file, heap_weights.c/.h .
	408	The major components/steps in the ranking process were abstracted into
	409	macros and functions:
	410	calc_MaxParas
	411	insert_heap_entry
	412	insert_greater_heap_entry
	413	approx_guided_insert
	414	fill_initial_heap
	415	add_heap_remainders
	416	change_heap_to_exact
	417	add_remainder_exact_weights
	418	Make_Exact_Root
	419	build_doc_list
	420	Aside: Zeroing of an Aqd Element
	421	-------------------------
	422	One interesting change concerned the use of zeroing out an Aqd element
	423	so as to mark it as being used in the heap.
	424	In the heap, the approx. Sqd is stored based on Aqd/Wd-approx.
	425	Later when Aqd is required, it can be extracted from the approx. Sqd
	426	by muliplication of Wd-approx i.e. Aqd = Sqd-approx * Wd-approx.
	427	This, however, is not always possible for the various Sqd variants.
	428	So, instead of zeroing out Aqd, I decided just to make it negative.
	429	-------------------------
	430	In some Sqd cases, Wd and Wd-approx. are not required. In which case,
	431	step 3 which calls on "change_heap_to_exact", "Heap_Build" and
	432	"add_remainder_exact_weights" is not required.
	433
	434	FILES
	435	* build_Aqd.c
	436	* query.ranked.c
	437	* similarity.c/.h (in libmg)
	438	* heap_weights.c/.h
	439	* backend.c/.h
	440	*************************************************************
	441	TITLE
	442	Similarity variants for mg_weights_build
	443	APPLICATION
	444	mg-2
	445	TYPE
	446	extend
	447	REPORT
	448	[email protected]/alistair - June 1994
	449	FIX
	450	[email protected] - July 1994 .. Feb 1995
	451	CLAIM
	452	"mg_weights_build" needs to be altered to allow modification of
	453	the similarity measure.
	454	PROBLEM
	455	See CITRI/TR-95-3 for more details.
	456	SOLUTION
	457	The weight files which are generated for a particular similarity
	458	measure have their names extended by a suffix.
	459	In the case of Wd#1 no weights are generated.
	460	In the case of the standard cosine weights, Wd#2, a 3 letter suffix
	461	is used to represent .
	462	In the case of the other Wd variants, a one letter suffix is used
	463	to represent which Wd variant it is.
	464	In each case, the variant input (e.g. -q 22222222) should be the whole
	465	similarity variant string and the relevant fields will be extracted out.
	466	This is done for consistency in code and interface.
	467	The code is fairly similar to the original.
	468	A dependency check has been added so that the dates of files and the
	469	type of needed files is verified before building.
	470	The dependencies include, invf dictionary, invf index, invf, fmd and wt files.
	471	The major change here, is the possible use of fmd and/or wt files.
	472	Later when I was having to write mg_fmd_build and mg_wt_build,
	473	I decided to abstract out some macros, namely:
	474	Get_ft, Get_ft_Ft, loop_invf_entry,
	475	which were put into "build_lib.c" .
	476	"loop_invf_entry" takes a function/macro name as a parameter and applies
	477	it to the sim record (with field fdt set), current doc number and modifies
	478	the return value.
	479	FILES
	480	* mg_weights_build.c
	481	* build_lib.c
	482	*************************************************************
	483	TITLE
	484	Mgstat with non-existent files
	485	APPLICATION
	486	mg-1
	487	TYPE
	488	bug
	489	REPORT
	490	[email protected] - 16 May 1994
	491	FIX
	492	[email protected] - 10 Aug 1994
	493	CLAIM
	494	NaNs and Infinites would be printed out by mgstat
	495	if unable to open .text or .text.dict file.
	496	PROBLEM
	497	The NaNs etc. were output in the column stating
	498	the percentage size of the file compared with the
	499	number of input bytes of the source text data.
	500	If it couldn't read the .text file with its
	501	header describing the number of source text bytes, then
	502	in working out the percentage it would divide by zero.
	503	Also due to some bad control flow, it wouldn't attempt to
	504	open the .text file if it failed when opening
	505	the .text.dict file.
	506	SOLUTION
	507	Only printout the percentage if we can read the header
	508	from the .text file.
	509	Read in text header irrespective of text dictionary file.
	510	FILES
	511	* mgstat.c
	512	*************************************************************
	513	TITLE
	514	Boolean tree optimisations
	515	APPLICATION
	516	mg-2
	517	TYPE
	518	extend
	519	REPORT
	520	(i) [email protected] - 28/Sep/94
	521	(ii) [email protected] - 12/Dec/94
	522	FIX
	523	(i) [email protected] - 18/Oct/94
	524	(ii) [email protected] - 21/Dec/94
	525	CLAIM
	526	The initial prompt for investigating the optimisation of
	527	boolean queries is noted in the mg-1 mod14.txt.
	528	The code for optimising seemed to have a number of faults.
	529	PROBLEM
	530	Boolean optimisation was unreliable.
	531	SOLUTION
	532	Initially (in case (i) above, see mg-1/mod14.txt), I rewrote
	533	all the boolean tree and optimising code. I converted the boolean
	534	expression into DNF. I did this after reading some notes about
	535	the steps involved in optimisation and they suggested standardising
	536	in some normal form. I thought that DNF would be appropriate so that
	537	all the terms are converted to be part of "and" expressions and be
	538	evaluated quickly using skipping.
	539	This, however, can suffer quite badly if the distributive law is
	540	applied to often and the query expands in size. If there was
	541	some sort of cacheing of invf entries, then it might not be so
	542	bad otherwise there is quite an overhead on reading the same
	543	invf entry more than once.
	544
	545	As it happens, CNF queries are reasonably common, where the user
	546	queries with a conjunction of disjunctions of similes:
	547	e.g. (car \| automobile \| vehicle) & (fast \| quick \| speedy)
	548	This sort of CNF query expands a hell of a lot !
	549	So after speaking with Justin who wanted to benchmark Atlas with Mg on
	550	these sort of queries, I looked up the MG book for other ideas.
	551
	552	The method that I implemented was the following:
	553
	554	-----------------------------------------------------------------
	555	Steps of tree modifications:
	556	Gets literals by pushing the nots down, detecting T/F at leaves
	557	and collapsing the tree by detecting 'and' of 'and's and 'or' of
	558	'or's.
	559	Next it looks at the or nodes and if all the children are terms
	560	then it marks the or-node as such.
	561	Finally, the or-term-nodes are sorted by using the sum of their
	562	ft's for comparison.
	563	Steps at query evaluation:
	564	If it comes across an 'and' of 'or-terms' then the evaluation is
	565	done noting the distributive law.
	566	I.e. a & (b \| c \| d) = (a & b) \| (a & c) \| (a & d)
	567	Assuming 'a' is the c-set of documents.
	568	All of 'a' is tested against 'b' and matching ones are marked.
	569	Next, all the unmarked members of 'a' are tested against 'c'.
	570	Likewise for 'd'.
	571	Now all the marked members of our c-set are kept.
	572	When we do the testing, we can use the skipping in the invf entries.
	573	-----------------------------------------------------------------
	574
	575
	576	After doing this, I added the choice of which type of optimisation
	577	the user wanted by adding query-environment-variable, "optimise_type".
	578	Type 0 = no parse tree modification.
	579	Type 1 = Or-term recognition and CNF query evaluation optimisation.
	580	Type 2 = Put into DNF form. [generally not recommended]
	581	FILES
	582	* bool_tree.c
	583	* bool_parser.y
	584	* bool_optimiser.c
	585	* bool_query.c
	586	* bool_tester.c
	587	* term_lists.c
	588	* query_env.c
	589	* invf_get.c [GetDocsOp]
	590	*************************************************************
	591	TITLE
	592	nonexistent HOME bug
	593	APPLICATION
	594	mg-1, mg-2
	595	TYPE
	596	bug
	597	REPORT
	598	[email protected] - 2/May/95
	599	FIX
	600	[email protected] - 2/May/95
	601	CLAIM
	602	"The big problem was that mgquery crashes when the HOME environment
	603	variable is not set, which is the case when it is run by the www server."
	604	[...] "I expect it happens when looking for $HOME/.mgrc."
	605	PROBLEM
	606	The result of getenv("HOME")" was used directly in
	607	a sprintf call. If the environment variable HOME
	608	was not in existence then null would be used.
	609	In some C libraries sprintf will convert the 0
	610	string into the string "(null)" on others it will core dump.
	611	(For example, Solaris seems to core dump, sunos 4 seems ok).
	612	SOLUTION
	613	The result from getenv("HOME")" is tested before
	614	being used.
	615	FILES
	616	* commands.c
	617	*************************************************************
	618	TITLE
	619	mgquery collection name preference
	620	APPLICATION
	621	mg-1, mg-2
	622	TYPE
	623	improve
	624	REPORT
	625	[email protected] - 2/May/95
	626	FIX
	627	[email protected] - 4/May/95
	628	CLAIM
	629	Surely something must override mquery's preference for ./bib.
	630	If MGDATA is set correctly, I think it should prefer that collection,
	631	and -d should definitely override it.
	632	I could always say -d . if I really wanted ./bib.
	633	PROBLEM
	634	Currently the priority is:
	635	1. Check if ./name is a directory,
	636	If so then use it as the collection directory.
	637	2. Check if ./name.text is a file,
	638	If so then use ./ as the collection directory.
	639	3. Check if mgdir/name is a directory,
	640	If so then use mgdir/name as the collection directory.
	641	4. Otherwise,
	642	Use mgdir/name as the database file prefix.
	643	This would be the case if one used "-f alice/alice".
	644	However, one would then not specify a final name argument
	645	and we'd never get here. Go figure ???
	646	SOLUTION
	647	Moved step 3 to the top instead.
	648	FILES
	649	* mgquery.c [search_for_collection()]
	650	*************************************************************
	651	TITLE
	652	Printout of query terms
	653	APPLICATION
	654	mg-1, mg-2
	655	TYPE
	656	extend
	657	REPORT
	658	[email protected] - April 95
	659	FIX
	660	[email protected] - April 95
	661	CLAIM
	662	No easy way to find out the parsed and stemmed words
	663	used in the query. Would like to know these words
	664	so I can call a separate highlighting program to
	665	highlight these words.
	666	PROBLEM
	667	No facility available.
	668	SOLUTION
	669	A ".queryterms" mgquery command was added which lists
	670	out the parsed/stemmed queryterms of the last query.
	671	FILES
	672	* commands.c (added CmdQueryTerms)
	673	*************************************************************
	674	TITLE
	675	mg_getrc
	676	APPLICATION
	677	mg-1, mg-2
	678	TYPE
	679	extend
	680	REPORT
	681	[email protected] - 2/May/95
	682	FIX
	683	-
	684	CLAIM
	685	Repeated code had to be written for different named
	686	gets but really the same type of parsing required.
	687	E.g. one might want to use a standard method for inserting
	688	^Bs between paragraphs for different books. One doesn't
	689	want to write duplicate code for each different named book,
	690	rather note that each book should be filtered "book" style.
	691	PROBLEM
	692	There was no way of abstracting out types of filters from
	693	the name of an instance of a collection.
	694	SOLUTION
	695	Allow information to be given with <name, type, files>.
	696	This extra info can be provided in a mg_getrc file.
	697	See man page for mg_get for details.
	698	FILES
	699	* mg_get.sh
	700	*************************************************************
	701	TITLE
	702	TREC DocNo file
	703	APPLICATION
	704	mg-2
	705	TYPE
	706	improve
	707	REPORT
	708	[email protected] - 26/May/95
	709	FIX
	710	[email protected] - 26/May/95
	711	CLAIM
	712	MG has problems dealing with trec docnos for trec disk 3.
	713	PROBLEM
	714	Trec DocNos file didn't have a wide enough field
	715	to handle disk 3.
	716	SOLUTION
	717	Allow different width fields for file.
	718	It is still fixed width but a number in the header
	719	says how wide the field is.
	720	FILES
	721	* passes/mg.special.c
	722	* query/mgquery.c
	723	*************************************************************
	724	TITLE
	725	Boolean optimiser #1 with `!'
	726	APPLICATION
	727	mg-1, mg-2
	728	TYPE
	729	bug
	730	REPORT
	731	[email protected] - 20/7/95
	732	FIX
	733	[email protected] - 27/7/95
	734	CLAIM
	735	Complained about not-nodes.
	736	e.g. complained about "croquet & !hedgehog"
	737	PROBLEM
	738	Boolean optimiser type#1 didn't convert
	739	"and not"s into diff nodes.
	740	SOLUTION
	741	Added code to convert '&!' to '-'.
	742	FILES
	743	* mg/bool_optimiser.c [mg-1]
	744	* query/bool_optimiser.c [mg-2]
	745	*************************************************************
	746	TITLE
	747	Autoconfiguring mg-1
	748	APPLICATION
	749	mg-1
	750	TYPE
	751	improve
	752	REPORT
	753	many people - 94/95
	754	FIX
	755	[email protected] - Aug/95
	756	CLAIM
	757	Portability is limited by setting up c-macros just for particular
	758	machines and operating systems.
	759	People had to make changes for HP, Next, Linux, Dec Alpha, ...
	760	PROBLEM
	761	Porting was only targetting at the machines that the author had
	762	access to.
	763	SOLUTION
	764	Use GNU's autoconfigure program.
	765	This allows checking of the systems features/characteristics.
	766	It also allows some checking for specific machines/OS - although
	767	I have not utilised this option.
	768	I used GNU's tar-1.11.8 as an example to base my changes on.
	769	I also used autoscan to generate the initial "configure.in".
	770	The "Makefile.in"s were done very similarly to GNU tar's.
	771	The "config.h" and "sysfuncs.h" files were scrapped and
	772	rewritten. The new "config.h" file is generated by the configure
	773	script - it contains all the #define's for the system features.
	774	The "sysfuncs.h" file wraps up a number of system headers.
	775	For example, some systems use , while some use ;
	776	which one is included is decided in "sysfuncs.h".
	777	I have also used GNU tar's use of ansi2knr in its Makefiles.
	778	This should hopefully allow the package to work on a system with
	779	only a K&R C compiler.
	780	However, there are probably problems with what I have done.
	781	I am concerned about <stdarg.h> for example.
	782	I also noticed that "ansi2knr" require function definitions as
	783	the GNU coding style recommends ie. with function name the first
	784	string on the line. This prompted me to run all the package's code
	785	thru GNU's indent.
	786	Setting up the configure changes is difficult. It really seems
	787	necessary to try the package out on many target machines so one
	788	can know what is necessary.
	789	A simple check target for the main Makefile has been written.
	790	It is used to see if the installation is working - it does
	791	not test much of the functionality of mg.
	792	It does cmp's on data files and diff's on query/result files.
	793	FILES
	794	Most of the files in the distribution.
	795	*************************************************************
	796	TITLE
	797	Consistent use of stderr
	798	APPLICATION
	799	mg-1
	800	TYPE
	801	improve
	802	REPORT
	803	[email protected] - 16 May 1994
	804	FIX
	805	[email protected] - 11 August 1994
	806	CLAIM
	807	Inconsistent use of stdout/stderr in usage messages.
	808	PROBLEM
	809	Sometimes used "printf" and sometimes used "fprintf(stderr"
	810	in usage messages.
	811	SOLUTION
	812	All should now use "fprintf(stderr" in usage messages.
	813	FILES
	814	* mg_compression_dict.c
	815	* mg_compression_dict.1
	816	* mg_fast_comp_dict.c
	817	* mg_fast_comp_dict.1
	818	* mg_invf_dict.c
	819	* mg_invf_dict.1
	820	* mg_invf_dump.c
	821	* mg_invf_dump.1
	822	* mg_invf_rebuild.c
	823	* mg_invf_rebuild.1
	824	* mg_perf_hash_build.c
	825	* mg_perf_hash_build.1
	826	* mg_text_estimate.c
	827	* mg_text_estimate.1
	828	* mg_weights_build.c
	829	* mg_weights_build.1
	830	*************************************************************
	831	TITLE
	832	xmg bug
	833	APPLICATION
	834	mg-1
	835	TYPE
	836	bug
	837	REPORT
	838	[email protected] - 22 April 1994
	839	FIX
	840	[email protected] - 22 April 1994
	841	CLAIM
	842	"Serious problem in xmg, which I fear occurs whenever a query
	843	doesn't return anything."
	844	PROBLEM
	845	??
	846	SOLUTION
	847	[xmg.sh 201] set rank 0
	848	FILES
	849	* xmg.sh
	850	*************************************************************
	851	TITLE
	852	Unnecessary loading of text
	853	APPLICATION
	854	mg-1
	855	TYPE
	856	bug
	857	REPORT
	858	[email protected] - ?? August 1994
	859	FIX
	860	[email protected] - 12 August 1994
	861	CLAIM
	862	Mg was loading and uncompressing text when the
	863	query did not require the text.
	864	PROBLEM
	865	There was no test for the query mode
	866	before loading and uncompressing the text.
	867	SOLUTION
	868	Only load/uncompress text if query mode
	869	is for text, headers or silent(for timing).
	870	FILES
	871	* mgquery.c
	872	*************************************************************
	873	TITLE
	874	Man page errors
	875	APPLICATION
	876	mg-1
	877	TYPE
	878	bug
	879	REPORT
	880	[email protected] - 16 May 1994
	881	FIX
	882	[email protected] - 16 May 1994
	883	CLAIM
	884	Man page errors.
	885	PROBLEM
	886	See below.
	887	SOLUTION
	888	"The mg_make_fast_dict.1 file has been renamed mg_fast_comp_dict.1,
	889	and all mg_make_fast_dict strings changed to mg_fast_comp_dict in all
	890	man pages.
	891	A large number of errors of spelling, typography, spacing, fonts,
	892	grammar, omitted words, slang, punctuation, missing man page
	893	cross-references, and man-page style have been corrected."
	894	FILES
	895	* mg_compression_dict.1
	896	* mg_fast_comp_dict.1
	897	* mg_get.1
	898	* mg_invf_dict.1
	899	* mg_invf_dump.1
	900	* mg_invf_rebuild.1
	901	* mg_passes.1
	902	* mg_perf_hash_build.1
	903	* mg_text_estimate.1
	904	* mg_weights_build.1
	905	* mgbilevel.1
	906	* mgbuild.1
	907	* mgdictlist.1
	908	* mgfelics.1
	909	* mgquery.1
	910	* mgstat.1
	911	* mgtic.1
	912	* mgticbuild.1
	913	* mgticdump.1
	914	* mgticprune.1
	915	* mgticstat.1
	916	* xmg.1
	917	*************************************************************
	918	TITLE
	919	Man page overview
	920	APPLICATION
	921	mg-1
	922	TYPE
	923	extend
	924	REPORT
	925	[email protected] -
	926	FIX
	927	[email protected] - 17 August 1994
	928	CLAIM
	929	"Write new mg.1 file to give a brief overview of mg, with samples
	930	of how to use it. Otherwise, users are likely to be completely
	931	overwhelmed by the number of programs (about 20) which might need to
	932	be used, when in reality, only 2 or 3 are likely to be run by end
	933	users."
	934	SOLUTION
	935	It was thought that mg.1, written by Nelson Beebe, was very useful
	936	but a bit too comprehensive for an introduction.
	937	Therefore, two man files, mgintro.1 and mgintro++.1 were written
	938	with the basic stuff in mgintro.1 and slightly more advanced stuff
	939	in mgintro++.1 .
	940	FILES
	941	* mg.1
	942	* mgintro.1
	943	* mgintro++.1
	944	*************************************************************
	945	TITLE
	946	Parse errors not bus errors
	947	APPLICATION
	948	mg-1
	949	TYPE
	950	bug
	951	REPORT
	952	[email protected] - 2 Jun 94
	953	FIX
	954	[email protected] - 19 Aug 94
	955	CLAIM
	956	"These two queries
	957	(which I typed in before I knew what I was doing!!)
	958	> The Queen of Hearts, she made some tarts
	959	> "The Queen of Hearts" and "she made some tarts"
	960	produced the following result:
	961	mgquery : parse error
	962	Bus error
	963	"
	964	PROBLEM
	965	What is expected to happen under boolean querying:
	966	Query1:
	967	> The Queen of Hearts, she made some tarts
	968	will produce a parse error due to the comma which
	969	is not a valid TERM.
	970	Query2:
	971	> "The Queen of Hearts" and "she made some tarts"
	972	will store a post-processing string
	973	of ''The Queen of Hearts" and "she made some tarts'' and
	974	will have a main boolean query of the empty string.
	975	This is because the postprocessing string takes in
	976	everything between the first quote and the last one.
	977	An empty string is illegal for the boolean grammar and
	978	hence a parse error.
	979	The problem stems from the fact that the processing of
	980	the parse tree is carried out, even though we have a
	981	parse error. In the case of using an empty string to build
	982	a parse tree, it is likely to leave the parse tree undefined.
	983	SOLUTION
	984	As soon as we find out that there is a parse-error,
	985	we abandon any processing of the parse tree.
	986	FILES
	987	* query.bool.y
	988	* query.bool.c (generated from query.bool.y)
	989	*************************************************************
	990	TITLE
	991	Perfect hashing on small vocab
	992	APPLICATION
	993	mg-1
	994	TYPE
	995	bug
	996	REPORT
	997	[email protected] - July 1994
	998	FIX
	999	[email protected] - July 1994
	1000	CLAIM
	1001	Mg could not handle small collections in the case
	1002	where there was only a small number of unique words.
	1003	The perfect hash function would report an error.
	1004	PROBLEM
	1005	Rounding of the arithmetic during the calculation of the
	1006	parameters of the perfect hash function was resulting in a
	1007	combination of values such that the probability of a hash
	1008	function being found was very small. This led to the limit
	1009	on the generation loop being exceeded, and eventual
	1010	failure.
	1011	SOLUTION
	1012	By using ceiling rather than floor when converting from a
	1013	floating point value to an integer parameter, the arithmetic
	1014	is now correct for all lexicon sizes, and the probability of
	1015	each iteration successfully generating a hash function is
	1016	sufficiently great that with _very_ high probability the
	1017	execution loop counter will not be exceeded unless there
	1018	genuinely is no hash function (for example, if the lexicon
	1019	contains two words the same there cannot be a hash
	1020	function).
	1021	FILES
	1022	* perf_hash.c
	1023	*************************************************************

Note: See TracBrowser for help on using the repository browser.

Download in other formats:

Original Format