[3745] | 1 | TITLE
|
---|
| 2 | Parsing of Long Words
|
---|
| 3 | APPLICATION
|
---|
| 4 | mg-1, mg-2
|
---|
| 5 | TYPE
|
---|
| 6 | bug
|
---|
| 7 | REPORT
|
---|
| 8 | [email protected] - May 11th 1994
|
---|
| 9 | FIX
|
---|
| 10 | [email protected] - August 9th 1994
|
---|
| 11 | CLAIM
|
---|
| 12 | Mg didn't handle long words properly; it crashed.
|
---|
| 13 | PROBLEM
|
---|
| 14 | Invf passes calls PARSE_LONG_WORD [words.h] which uses a limit of
|
---|
| 15 | MAXLONGWORD on iterating thru the string and storing into
|
---|
| 16 | a word. MAXLONGWORD = 8192.
|
---|
| 17 | However, mg strings generally store the length in the first
|
---|
| 18 | byte limiting them to 255 characters. The word which was passed
|
---|
| 19 | to PARSE_LONG_WORD was an allocated string of MAXSTEMLEN = 255,
|
---|
| 20 | which is as large as we should get anyway. Thus when accessing
|
---|
| 21 | a larger word than 255 chars, PARSE_LONG_WORD would allow it
|
---|
| 22 | (less than 8192) and would try storing beyond the array limit.
|
---|
| 23 | SOLUTION
|
---|
| 24 | The author can't remember why PARSE_LONG_WORD was used and what
|
---|
| 25 | the significance of MAXLONGWORD = 8192 is.
|
---|
| 26 | So PARSE_LONG_WORD has been changed to PARSE_STEM_WORD which
|
---|
| 27 | uses MAXSTEMLEN as its limit.
|
---|
| 28 | FILES
|
---|
| 29 | * words.h
|
---|
| 30 | * invf.pass1.c
|
---|
| 31 | * invf.pass2.c
|
---|
| 32 | * ivf.pass1.c
|
---|
| 33 | * ivf.pass2.c
|
---|
| 34 | * query.ranked.c
|
---|
| 35 | *************************************************************
|
---|
| 36 | TITLE
|
---|
| 37 | Use of Lovins stemmer
|
---|
| 38 | APPLICATION
|
---|
| 39 | mg-1
|
---|
| 40 | TYPE
|
---|
| 41 | improve
|
---|
| 42 | REPORT
|
---|
| 43 | local - 1994
|
---|
| 44 | FIX
|
---|
| 45 | [email protected] - 1994
|
---|
| 46 | CLAIM
|
---|
| 47 | Stemming was done naively.
|
---|
| 48 | PROBLEM
|
---|
| 49 | Only a few types of words and their endings
|
---|
| 50 | were considered.
|
---|
| 51 | SOLUTION
|
---|
| 52 | Replacement with a more elaborate "known" stemmer by Lovins.
|
---|
| 53 | The algorithm is described in:
|
---|
| 54 | J.B. Lovins, "Development of a Stemming Algorithm",
|
---|
| 55 | Mechanical Translation and Computational Linguistics, Vol 11,1968.
|
---|
| 56 | FILES
|
---|
| 57 | * stem.c
|
---|
| 58 | * stem.h
|
---|
| 59 | *************************************************************
|
---|
| 60 | TITLE
|
---|
| 61 | Different term parsing
|
---|
| 62 | APPLICATION
|
---|
| 63 | mg-1
|
---|
| 64 | TYPE
|
---|
| 65 | bug
|
---|
| 66 | REPORT
|
---|
| 67 | [email protected] - 23 Aug 1994
|
---|
| 68 | FIX
|
---|
| 69 | [email protected] - 23 Aug 1994
|
---|
| 70 | CLAIM
|
---|
| 71 | Boolean queries did not extract words/terms using the
|
---|
| 72 | same method as is done at inverted-file creation and
|
---|
| 73 | as is used for rank query parsing.
|
---|
| 74 | PROBLEM
|
---|
| 75 | The hand-written lex. analyser, query_lex, which is called by
|
---|
| 76 | the boolean query parser was not calling a common
|
---|
| 77 | word-extraction routine as used by the rest of mg.
|
---|
| 78 | This would be ok if the code did the same things - but they didn't.
|
---|
| 79 | Query_lex, for instance, did NOT place any limit on the
|
---|
| 80 | number of digits in a term.
|
---|
| 81 | Of even more concern, it would allow arbitrary sized words
|
---|
| 82 | although it used Pascal style strings which store the length
|
---|
| 83 | in the first byte and can therefore only be 255 characters in length.
|
---|
| 84 | SOLUTION
|
---|
| 85 | Query_lex in "query.bool.y", was modified to call the routine
|
---|
| 86 | PARSE_STEM_WORD which is also used by text-inversion routines and
|
---|
| 87 | ranking query routines.
|
---|
| 88 | Now all terms are extracted by the same routine.
|
---|
| 89 | To do this, the end of the line buffer had to be noted as
|
---|
| 90 | PARSE_STEM_WORD requires a pointer to the end - which is the
|
---|
| 91 | safe thing to do (don't want to run over the end).
|
---|
| 92 | This meant I had to find the length of the query line buffer.
|
---|
| 93 | This was allocated in the file "read_line.c" by the routine,
|
---|
| 94 | "readline". Its size was the literal number 1024.
|
---|
| 95 | This was changed to a constant and placed in "read_line.h".
|
---|
| 96 | The definition for PARSE_STEM_WORD can be found in "words.h".
|
---|
| 97 | FILES
|
---|
| 98 | * query.bool.y
|
---|
| 99 | * query.bool.c (by bison)
|
---|
| 100 | * read_line.c
|
---|
| 101 | * read_line.h
|
---|
| 102 | *************************************************************
|
---|
| 103 | TITLE
|
---|
| 104 | Highlighting of query terms
|
---|
| 105 | APPLICATION
|
---|
| 106 | mg-1
|
---|
| 107 | TYPE
|
---|
| 108 | extend
|
---|
| 109 | REPORT
|
---|
| 110 | [email protected] - Aug 94
|
---|
| 111 | FIX
|
---|
| 112 | [email protected] - Sep 94
|
---|
| 113 | CLAIM
|
---|
| 114 | Difficult to feel happy that the query-result returned is
|
---|
| 115 | satisfying the query - need to look hard to find the queried words.
|
---|
| 116 | Need to show words in results using some highlighting method.
|
---|
| 117 | PROBLEM
|
---|
| 118 | No highlighting of query terms in results.
|
---|
| 119 | SOLUTION
|
---|
| 120 | Mgquery was previously outputting the decompressed text to a pager
|
---|
| 121 | such as "less(1)" or "more(1)".
|
---|
| 122 | (Except when redirected or piped elsewhere :)
|
---|
| 123 | So what was needed was some sort of highlight pager that instead of
|
---|
| 124 | displaying the text would also use some means for highlighting the
|
---|
| 125 | stemmed query words.
|
---|
| 126 | Two common forms of highlighting were chosen: underline and bolding.
|
---|
| 127 | These are supported by "less(1)" and possibly by "more(1)" by
|
---|
| 128 | using the backspace character.
|
---|
| 129 | A highlight pager will also need to know which words need to be
|
---|
| 130 | highlighted. Therefore, the code was modified to build up a
|
---|
| 131 | string of the stemmed query words for passing to the highlight pager.
|
---|
| 132 | Design Options:
|
---|
| 133 | ---------------
|
---|
| 134 | * Could do text filtering in mgquery before passing out to pager.
|
---|
| 135 | Instead I pipe to a separate process, the "hilite_words" pager,
|
---|
| 136 | which filters and pipes into less/more.
|
---|
| 137 | * Could do different highlighting or a combination.
|
---|
| 138 | * Could use a different structure for storing the query words other
|
---|
| 139 | than the hash-table I used.
|
---|
| 140 | FILES
|
---|
| 141 | * Makefile - to include hilite_words target
|
---|
| 142 | * mg_hilite_words.c
|
---|
| 143 | * mgquery.c
|
---|
| 144 | * mgquery.1
|
---|
| 145 | * query.bool.y
|
---|
| 146 | * query.ranked.c
|
---|
| 147 | * environment.c
|
---|
| 148 | * environment.h
|
---|
| 149 | * backend.h
|
---|
| 150 | *************************************************************
|
---|
| 151 | TITLE
|
---|
| 152 | Mg_compression_dict did premature free
|
---|
| 153 | APPLICATION
|
---|
| 154 | mg-1
|
---|
| 155 | TYPE
|
---|
| 156 | bug
|
---|
| 157 | REPORT
|
---|
| 158 | [email protected] - 23 Sep 94
|
---|
| 159 | FIX
|
---|
| 160 | [email protected] - 23 Sep 94
|
---|
| 161 | CLAIM
|
---|
| 162 | mg_compression_dict dumped core in
|
---|
| 163 | file: mg_compression_dict.c
|
---|
| 164 | function: Write_data
|
---|
| 165 | line: int codelen = hd->clens[i];
|
---|
| 166 | PROBLEM
|
---|
| 167 | Huffman data, hd, was freed *before* it was accessed again.
|
---|
| 168 | SOLUTION
|
---|
| 169 | The freeing of hd has been moved to after all accesses
|
---|
| 170 | (just before returning).
|
---|
| 171 | FILES
|
---|
| 172 | * mg_compression_dict.c
|
---|
| 173 | *************************************************************
|
---|
| 174 | TITLE
|
---|
| 175 | Boolean tree optimising rewrite
|
---|
| 176 | APPLICATION
|
---|
| 177 | mg-1
|
---|
| 178 | TYPE
|
---|
| 179 | bug
|
---|
| 180 | REPORT
|
---|
| 181 | [email protected] - 23 Sep 94
|
---|
| 182 | FIX
|
---|
| 183 | [email protected] - Oct 94
|
---|
| 184 | CLAIM
|
---|
| 185 | "I am still getting core dump in "and" queries in mgquery,
|
---|
| 186 | where the first word does not exist, but the second one does."
|
---|
| 187 | PROBLEM
|
---|
| 188 | Having freed a particular node, it tried to refree it and
|
---|
| 189 | access one of its fields.
|
---|
| 190 |
|
---|
| 191 | I.e. code-fragment...
|
---|
| 192 |
|
---|
| 193 | FreeNode(curr); /* where curr = CHILD(base) for 1st term in list */
|
---|
| 194 | FreeNodes(next);
|
---|
| 195 | FreeNodes(CHILD(base));
|
---|
| 196 | /* but CHILD(base) has already been freed above */
|
---|
| 197 | /* if the node was the first one in the list */
|
---|
| 198 |
|
---|
| 199 | SOLUTION
|
---|
| 200 | A number of things in the code seemed a bit dubious to me.
|
---|
| 201 | So I have rewritten the boolean optimising stage and abstracted out
|
---|
| 202 | the various stages - each file starts with "bool".
|
---|
| 203 | Boolean query optimising seems to be a tricky problem.
|
---|
| 204 | It is not clear that putting an expression into a certain form will
|
---|
| 205 | actually simplify it and whether simplification means faster querying.
|
---|
| 206 | I have converted a given boolean expression into DNF
|
---|
| 207 | (Disjunctive Normal Form). "And not" nodes, which are readily apparent
|
---|
| 208 | in DNF, are converted to "diff" nodes. I have only applied the idempotency
|
---|
| 209 | laws involving TRUE and FALSE, and not the ones requiring matching of
|
---|
| 210 | expressions - it is a potentially more complicated problem.
|
---|
| 211 | The optimiser has been tested by playing with "bool_tester", and if you are
|
---|
| 212 | having a crash or problem in a boolean query it would be worth testing the
|
---|
| 213 | query on the "bool_tester." The token "*" stands for TRUE (or all documents)
|
---|
| 214 | and the token "_" stands for FALSE (or no documents). This should show the
|
---|
| 215 | expression before and after optimisation in an ascii tree bracketting format.
|
---|
| 216 | FILES
|
---|
| 217 | * bool_tree.c
|
---|
| 218 | * bool_parser.y
|
---|
| 219 | * bool_optimiser.c
|
---|
| 220 | * bool_query.c
|
---|
| 221 | * bool_tester.c
|
---|
| 222 | * term_lists.c
|
---|
| 223 | *************************************************************
|
---|
| 224 | TITLE
|
---|
| 225 | Mgtic pixel placement
|
---|
| 226 | APPLICATION
|
---|
| 227 | mg-1
|
---|
| 228 | TYPE
|
---|
| 229 | bug
|
---|
| 230 | REPORT
|
---|
| 231 | Bruce McKenzie - [email protected] (21st Oct 1994)
|
---|
| 232 | FIX
|
---|
| 233 | [email protected]
|
---|
| 234 | CLAIM
|
---|
| 235 | mgtic crashed on certain files.
|
---|
| 236 | PROBLEM
|
---|
| 237 | Placing pixels outside of bitmap.
|
---|
| 238 | SOLUTION
|
---|
| 239 | Changed the putpixel routine to truncate at borders of the image.
|
---|
| 240 | FILES
|
---|
| 241 | * mgtic.c
|
---|
| 242 | *************************************************************
|
---|
| 243 | TITLE
|
---|
| 244 | Improved boolean tree optimising
|
---|
| 245 | APPLICATION
|
---|
| 246 | mg-1
|
---|
| 247 | TYPE
|
---|
| 248 | improve
|
---|
| 249 | REPORT
|
---|
| 250 | [email protected] - 12/Dec/94
|
---|
| 251 | FIX
|
---|
| 252 | [email protected] - 21/Dec/94, 14/Mar/95
|
---|
| 253 | CLAIM
|
---|
| 254 | Optimising by conversion to DNF is not necessarily such
|
---|
| 255 | a good idea - can actually slow things down.
|
---|
| 256 | PROBLEM
|
---|
| 257 | The distributive law used in converting to DNF
|
---|
| 258 | duplicates expressions.
|
---|
| 259 | SOLUTION
|
---|
| 260 | Introduce a query environment variable, optimise_type = 0 | 1 | 2.
|
---|
| 261 | Type 0 does nothing to the parse tree.
|
---|
| 262 | Type 2 does the DNF conversion.
|
---|
| 263 | Type 1 is the new default and does the following...
|
---|
| 264 | Do simple tree rearrangement like flattening.
|
---|
| 265 | Optimise for CNF queries.
|
---|
| 266 | FILES
|
---|
| 267 | * bool_query.c, .h
|
---|
| 268 | * bool_optimiser.c
|
---|
| 269 | * environment.c
|
---|
| 270 | * invf_get.c
|
---|
| 271 | * bool_tree.c, .h
|
---|
| 272 | * bool_tester.c
|
---|
| 273 | * lists.h
|
---|
| 274 | *************************************************************
|
---|
| 275 | TITLE
|
---|
| 276 | Similarity variants
|
---|
| 277 | APPLICATION
|
---|
| 278 | mg-2
|
---|
| 279 | TYPE
|
---|
| 280 | extend
|
---|
| 281 | REPORT
|
---|
| 282 | [email protected]/alistair - June 1994
|
---|
| 283 | FIX
|
---|
| 284 | [email protected] - July 1994 .. Feb 1995
|
---|
| 285 | CLAIM
|
---|
| 286 | Can only use one type of similarity measure - the
|
---|
| 287 | standard cosine measure.
|
---|
| 288 | PROBLEM
|
---|
| 289 | See CITRI/TR-95-3 for more details.
|
---|
| 290 | The standard measure can be broken up into 7 components.
|
---|
| 291 | The 7 components are
|
---|
| 292 | Each one of these components has a number of alternatives.
|
---|
| 293 | The overall measure, S_qd, can also be altered.
|
---|
| 294 | Thus the particular similarity measure used can be specified
|
---|
| 295 | by an 8 dimensional vector.
|
---|
| 296 | What is desired is to be able to specify to mgquery an option
|
---|
| 297 | and a 8-digit string representing this vector (assuming that
|
---|
| 298 | any one alternative can have at most 9 (not using zero) variants).
|
---|
| 299 | SOLUTION
|
---|
| 300 | The programs which had to be modified were:
|
---|
| 301 | (i) mgquery,
|
---|
| 302 | (ii) mg_weights_build.
|
---|
| 303 | The other mg programs in existence store the text, indexing info,
|
---|
| 304 | and the basic statistics such as N, n, ft, fdt.
|
---|
| 305 | Other programs which had to be created were:
|
---|
| 306 | (i) mg_fmd_build,
|
---|
| 307 | (ii) mg_wt_build.
|
---|
| 308 | Mg_fmd_build will create the file to store the f_md statistic,
|
---|
| 309 | where f_md is the largest (maximum) f_dt of any term in document, d.
|
---|
| 310 | Mg_wt_build will create the file to store the w_t primitive.
|
---|
| 311 | It only creates this for the w_t variants 6-9 which would require
|
---|
| 312 | extra passes of invf at query time if they were not stored here.
|
---|
| 313 | For details on similarity changes for mgquery and mg_weights_build,
|
---|
| 314 | please see the other modification entries.
|
---|
| 315 | FILES
|
---|
| 316 | * mg_fmd_build.c
|
---|
| 317 | * mg_wt_build.c
|
---|
| 318 | * build_lib.c, build_lib.h
|
---|
| 319 | *************************************************************
|
---|
| 320 | TITLE
|
---|
| 321 | Similarity variants for mgquery
|
---|
| 322 | APPLICATION
|
---|
| 323 | mg-2
|
---|
| 324 | TYPE
|
---|
| 325 | extend
|
---|
| 326 | REPORT
|
---|
| 327 | [email protected]/alistair - June 1994
|
---|
| 328 | FIX
|
---|
| 329 | [email protected] - July 1994 .. Feb 1995
|
---|
| 330 | CLAIM
|
---|
| 331 | "mgquery" needs to be altered to allow modification of
|
---|
| 332 | the similarity measure.
|
---|
| 333 | PROBLEM
|
---|
| 334 | See CITRI/TR-95-3 for more details.
|
---|
| 335 | SOLUTION
|
---|
| 336 |
|
---|
| 337 | Most of the similarity measures, Sqd, are of the
|
---|
| 338 | form: Aqd
|
---|
| 339 | -----
|
---|
| 340 | Bqd
|
---|
| 341 | where Bqd is an expression involving Wd and possibly Wq,
|
---|
| 342 | where Aqd is a sum over the common document/query terms
|
---|
| 343 | of w_qt and w_dt.
|
---|
| 344 | Building of Aqd
|
---|
| 345 | ===============
|
---|
| 346 | The calculation of Aqd is done in the file build_Aqd.c .
|
---|
| 347 | The functions for doing this used to be in the file invf_get.c .
|
---|
| 348 | Build_Aqd.c contains 4 different functions for building Aqd, each
|
---|
| 349 | of them building a different data structure:
|
---|
| 350 | (i) Array, (ii) Splay Tree, (iii) Hash Table, (iv) List Table.
|
---|
| 351 | Each of these routines seems to have been construction by modifications
|
---|
| 352 | to duplicated code. This is often the easiest way to construct variants
|
---|
| 353 | but is quite difficult to maintain consistency.
|
---|
| 354 | As the aim of the exercise was to try out different sim. measures for
|
---|
| 355 | retrieval effectiveness, I only modified the code that constructed
|
---|
| 356 | an array. This routine was called "CosineDecode"; I changed it to
|
---|
| 357 | "build_Aqd_Array." This change reflects the fact that we are only
|
---|
| 358 | calculating Aqd and this need not be used for the Cosine measure.
|
---|
| 359 | The other routines: "CosineDecodeSplay," "CosineDecodeHash," and
|
---|
| 360 | "CosineDecodeList" have been left unaltered - they need to be updated
|
---|
| 361 | in the future which would be best be done by abstracting out common code.
|
---|
| 362 | By the stage of building Aqd, the query terms have been looked up in
|
---|
| 363 | the inverted file dictionary and put into a list.
|
---|
| 364 | This list of common terms is traversed to lookup the corresponding
|
---|
| 365 | invf entries. Before the invf entry is processed, all query and term
|
---|
| 366 | relevant statistics and primitive quantities are calculated.
|
---|
| 367 | For example, fqt, ft, wt, rqt, wqt, Wq-partial-sum.
|
---|
| 368 | To save unnecessary calculations, there is a test for each value
|
---|
| 369 | to see whether it is needed e.g. "if (sim->variant_needs & NEEDS_wt) ... ".
|
---|
| 370 | Aside: Variant Needs
|
---|
| 371 | --------------------
|
---|
| 372 | The idea behind the "variant_needs" field is to be able to have all
|
---|
| 373 | the code in the one place for each possible variant and this code would
|
---|
| 374 | get the information at the correct time/place only when it is needed.
|
---|
| 375 | The overhead is a "bit-and" and "test" for each component.
|
---|
| 376 | The important concern though, is that the "variant_needs" must be
|
---|
| 377 | accurate i.e. it should be carefully maintained.
|
---|
| 378 | Each possible need is stored as a bit position in a constant/macro
|
---|
| 379 | of the form "NEEDS_component" e.g. NEEDS_wt, NEEDS_nt
|
---|
| 380 | More recently (Jan/95), I have found that the _need_ing of a component
|
---|
| 381 | may be relevant to a particular purpose, that is, it may be needed in
|
---|
| 382 | one situation and not another. This was the case for Wd.
|
---|
| 383 | For rdt#6, Wd was needed, however, it might not be needed for Bqd, the
|
---|
| 384 | denominator of Sqd. So I changed from NEEDS_Wd to NEEDS_Wd_for_Sqd and
|
---|
| 385 | NEEDS_Wd_for_rdt.
|
---|
| 386 | --------------------
|
---|
| 387 | To handle the different calculations of the variants, I wrote macros
|
---|
| 388 | based around a switch statement. All these macros are stored in the file
|
---|
| 389 | "similarity.h" . The point of doing this is to centralise most of the changes
|
---|
| 390 | and cut down on the number of files which have to be altered if a new
|
---|
| 391 | way of calculating a primitive is to be added.
|
---|
| 392 | This is achieved by having a data record called, "Similarity_variant" whose
|
---|
| 393 | fields includes all the statistics and the similarity primitives.
|
---|
| 394 | So the standard procedure is to see if something is needed and if so, then
|
---|
| 395 | extract it from an mg file or calculate it using a macro - most of the
|
---|
| 396 | input and output to the macro is done via the "Similarity_variant" structure.
|
---|
| 397 | As well as calculating Aqd, it is also a convenient place to calculate Wq.
|
---|
| 398 | Previously, for Cosine measure, Wq was not needed because each Aqd was directly
|
---|
| 399 | divided by Wq and thus would not change the ordering. However, there are
|
---|
| 400 | some Sqd's which divide by a sum involving Wq and thus need Wq.
|
---|
| 401 | Calculating Sqd and Ordering Documents
|
---|
| 402 | ======================================
|
---|
| 403 | The file query.ranked.c contains the code for calculating Sqd using the
|
---|
| 404 | approximate Wd in order to do the ranking of the documents.
|
---|
| 405 | The mg-1 file was cleaned up and modified slightly.
|
---|
| 406 | All the heap data structures and routines were taken out and placed in
|
---|
| 407 | their own file, heap_weights.c/.h .
|
---|
| 408 | The major components/steps in the ranking process were abstracted into
|
---|
| 409 | macros and functions:
|
---|
| 410 | calc_MaxParas
|
---|
| 411 | insert_heap_entry
|
---|
| 412 | insert_greater_heap_entry
|
---|
| 413 | approx_guided_insert
|
---|
| 414 | fill_initial_heap
|
---|
| 415 | add_heap_remainders
|
---|
| 416 | change_heap_to_exact
|
---|
| 417 | add_remainder_exact_weights
|
---|
| 418 | Make_Exact_Root
|
---|
| 419 | build_doc_list
|
---|
| 420 | Aside: Zeroing of an Aqd Element
|
---|
| 421 | -------------------------
|
---|
| 422 | One interesting change concerned the use of zeroing out an Aqd element
|
---|
| 423 | so as to mark it as being used in the heap.
|
---|
| 424 | In the heap, the approx. Sqd is stored based on Aqd/Wd-approx.
|
---|
| 425 | Later when Aqd is required, it can be extracted from the approx. Sqd
|
---|
| 426 | by muliplication of Wd-approx i.e. Aqd = Sqd-approx * Wd-approx.
|
---|
| 427 | This, however, is not always possible for the various Sqd variants.
|
---|
| 428 | So, instead of zeroing out Aqd, I decided just to make it negative.
|
---|
| 429 | -------------------------
|
---|
| 430 | In some Sqd cases, Wd and Wd-approx. are not required. In which case,
|
---|
| 431 | step 3 which calls on "change_heap_to_exact", "Heap_Build" and
|
---|
| 432 | "add_remainder_exact_weights" is not required.
|
---|
| 433 |
|
---|
| 434 | FILES
|
---|
| 435 | * build_Aqd.c
|
---|
| 436 | * query.ranked.c
|
---|
| 437 | * similarity.c/.h (in libmg)
|
---|
| 438 | * heap_weights.c/.h
|
---|
| 439 | * backend.c/.h
|
---|
| 440 | *************************************************************
|
---|
| 441 | TITLE
|
---|
| 442 | Similarity variants for mg_weights_build
|
---|
| 443 | APPLICATION
|
---|
| 444 | mg-2
|
---|
| 445 | TYPE
|
---|
| 446 | extend
|
---|
| 447 | REPORT
|
---|
| 448 | [email protected]/alistair - June 1994
|
---|
| 449 | FIX
|
---|
| 450 | [email protected] - July 1994 .. Feb 1995
|
---|
| 451 | CLAIM
|
---|
| 452 | "mg_weights_build" needs to be altered to allow modification of
|
---|
| 453 | the similarity measure.
|
---|
| 454 | PROBLEM
|
---|
| 455 | See CITRI/TR-95-3 for more details.
|
---|
| 456 | SOLUTION
|
---|
| 457 | The weight files which are generated for a particular similarity
|
---|
| 458 | measure have their names extended by a suffix.
|
---|
| 459 | In the case of Wd#1 no weights are generated.
|
---|
| 460 | In the case of the standard cosine weights, Wd#2, a 3 letter suffix
|
---|
| 461 | is used to represent .
|
---|
| 462 | In the case of the other Wd variants, a one letter suffix is used
|
---|
| 463 | to represent which Wd variant it is.
|
---|
| 464 | In each case, the variant input (e.g. -q 22222222) should be the whole
|
---|
| 465 | similarity variant string and the relevant fields will be extracted out.
|
---|
| 466 | This is done for consistency in code and interface.
|
---|
| 467 | The code is fairly similar to the original.
|
---|
| 468 | A dependency check has been added so that the dates of files and the
|
---|
| 469 | type of needed files is verified before building.
|
---|
| 470 | The dependencies include, invf dictionary, invf index, invf, fmd and wt files.
|
---|
| 471 | The major change here, is the possible use of fmd and/or wt files.
|
---|
| 472 | Later when I was having to write mg_fmd_build and mg_wt_build,
|
---|
| 473 | I decided to abstract out some macros, namely:
|
---|
| 474 | Get_ft, Get_ft_Ft, loop_invf_entry,
|
---|
| 475 | which were put into "build_lib.c" .
|
---|
| 476 | "loop_invf_entry" takes a function/macro name as a parameter and applies
|
---|
| 477 | it to the sim record (with field fdt set), current doc number and modifies
|
---|
| 478 | the return value.
|
---|
| 479 | FILES
|
---|
| 480 | * mg_weights_build.c
|
---|
| 481 | * build_lib.c
|
---|
| 482 | *************************************************************
|
---|
| 483 | TITLE
|
---|
| 484 | Mgstat with non-existent files
|
---|
| 485 | APPLICATION
|
---|
| 486 | mg-1
|
---|
| 487 | TYPE
|
---|
| 488 | bug
|
---|
| 489 | REPORT
|
---|
| 490 | [email protected] - 16 May 1994
|
---|
| 491 | FIX
|
---|
| 492 | [email protected] - 10 Aug 1994
|
---|
| 493 | CLAIM
|
---|
| 494 | NaNs and Infinites would be printed out by mgstat
|
---|
| 495 | if unable to open .text or .text.dict file.
|
---|
| 496 | PROBLEM
|
---|
| 497 | The NaNs etc. were output in the column stating
|
---|
| 498 | the percentage size of the file compared with the
|
---|
| 499 | number of input bytes of the source text data.
|
---|
| 500 | If it couldn't read the .text file with its
|
---|
| 501 | header describing the number of source text bytes, then
|
---|
| 502 | in working out the percentage it would divide by zero.
|
---|
| 503 | Also due to some bad control flow, it wouldn't attempt to
|
---|
| 504 | open the .text file if it failed when opening
|
---|
| 505 | the .text.dict file.
|
---|
| 506 | SOLUTION
|
---|
| 507 | Only printout the percentage if we can read the header
|
---|
| 508 | from the .text file.
|
---|
| 509 | Read in text header irrespective of text dictionary file.
|
---|
| 510 | FILES
|
---|
| 511 | * mgstat.c
|
---|
| 512 | *************************************************************
|
---|
| 513 | TITLE
|
---|
| 514 | Boolean tree optimisations
|
---|
| 515 | APPLICATION
|
---|
| 516 | mg-2
|
---|
| 517 | TYPE
|
---|
| 518 | extend
|
---|
| 519 | REPORT
|
---|
| 520 | (i) [email protected] - 28/Sep/94
|
---|
| 521 | (ii) [email protected] - 12/Dec/94
|
---|
| 522 | FIX
|
---|
| 523 | (i) [email protected] - 18/Oct/94
|
---|
| 524 | (ii) [email protected] - 21/Dec/94
|
---|
| 525 | CLAIM
|
---|
| 526 | The initial prompt for investigating the optimisation of
|
---|
| 527 | boolean queries is noted in the mg-1 mod14.txt.
|
---|
| 528 | The code for optimising seemed to have a number of faults.
|
---|
| 529 | PROBLEM
|
---|
| 530 | Boolean optimisation was unreliable.
|
---|
| 531 | SOLUTION
|
---|
| 532 | Initially (in case (i) above, see mg-1/mod14.txt), I rewrote
|
---|
| 533 | all the boolean tree and optimising code. I converted the boolean
|
---|
| 534 | expression into DNF. I did this after reading some notes about
|
---|
| 535 | the steps involved in optimisation and they suggested standardising
|
---|
| 536 | in some normal form. I thought that DNF would be appropriate so that
|
---|
| 537 | all the terms are converted to be part of "and" expressions and be
|
---|
| 538 | evaluated quickly using skipping.
|
---|
| 539 | This, however, can suffer quite badly if the distributive law is
|
---|
| 540 | applied to often and the query expands in size. If there was
|
---|
| 541 | some sort of cacheing of invf entries, then it might not be so
|
---|
| 542 | bad otherwise there is quite an overhead on reading the same
|
---|
| 543 | invf entry more than once.
|
---|
| 544 |
|
---|
| 545 | As it happens, CNF queries are reasonably common, where the user
|
---|
| 546 | queries with a conjunction of disjunctions of similes:
|
---|
| 547 | e.g. (car | automobile | vehicle) & (fast | quick | speedy)
|
---|
| 548 | This sort of CNF query expands a hell of a lot !
|
---|
| 549 | So after speaking with Justin who wanted to benchmark Atlas with Mg on
|
---|
| 550 | these sort of queries, I looked up the MG book for other ideas.
|
---|
| 551 |
|
---|
| 552 | The method that I implemented was the following:
|
---|
| 553 |
|
---|
| 554 | -----------------------------------------------------------------
|
---|
| 555 | Steps of tree modifications:
|
---|
| 556 | Gets literals by pushing the nots down, detecting T/F at leaves
|
---|
| 557 | and collapsing the tree by detecting 'and' of 'and's and 'or' of
|
---|
| 558 | 'or's.
|
---|
| 559 | Next it looks at the or nodes and if all the children are terms
|
---|
| 560 | then it marks the or-node as such.
|
---|
| 561 | Finally, the or-term-nodes are sorted by using the sum of their
|
---|
| 562 | ft's for comparison.
|
---|
| 563 | Steps at query evaluation:
|
---|
| 564 | If it comes across an 'and' of 'or-terms' then the evaluation is
|
---|
| 565 | done noting the distributive law.
|
---|
| 566 | I.e. a & (b | c | d) = (a & b) | (a & c) | (a & d)
|
---|
| 567 | Assuming 'a' is the c-set of documents.
|
---|
| 568 | All of 'a' is tested against 'b' and matching ones are marked.
|
---|
| 569 | Next, all the unmarked members of 'a' are tested against 'c'.
|
---|
| 570 | Likewise for 'd'.
|
---|
| 571 | Now all the marked members of our c-set are kept.
|
---|
| 572 | When we do the testing, we can use the skipping in the invf entries.
|
---|
| 573 | -----------------------------------------------------------------
|
---|
| 574 |
|
---|
| 575 |
|
---|
| 576 | After doing this, I added the choice of which type of optimisation
|
---|
| 577 | the user wanted by adding query-environment-variable, "optimise_type".
|
---|
| 578 | Type 0 = no parse tree modification.
|
---|
| 579 | Type 1 = Or-term recognition and CNF query evaluation optimisation.
|
---|
| 580 | Type 2 = Put into DNF form. [generally not recommended]
|
---|
| 581 | FILES
|
---|
| 582 | * bool_tree.c
|
---|
| 583 | * bool_parser.y
|
---|
| 584 | * bool_optimiser.c
|
---|
| 585 | * bool_query.c
|
---|
| 586 | * bool_tester.c
|
---|
| 587 | * term_lists.c
|
---|
| 588 | * query_env.c
|
---|
| 589 | * invf_get.c [GetDocsOp]
|
---|
| 590 | *************************************************************
|
---|
| 591 | TITLE
|
---|
| 592 | nonexistent HOME bug
|
---|
| 593 | APPLICATION
|
---|
| 594 | mg-1, mg-2
|
---|
| 595 | TYPE
|
---|
| 596 | bug
|
---|
| 597 | REPORT
|
---|
| 598 | [email protected] - 2/May/95
|
---|
| 599 | FIX
|
---|
| 600 | [email protected] - 2/May/95
|
---|
| 601 | CLAIM
|
---|
| 602 | "The big problem was that mgquery crashes when the HOME environment
|
---|
| 603 | variable is not set, which is the case when it is run by the www server."
|
---|
| 604 | [...] "I expect it happens when looking for $HOME/.mgrc."
|
---|
| 605 | PROBLEM
|
---|
| 606 | The result of getenv("HOME")" was used directly in
|
---|
| 607 | a sprintf call. If the environment variable HOME
|
---|
| 608 | was not in existence then null would be used.
|
---|
| 609 | In some C libraries sprintf will convert the 0
|
---|
| 610 | string into the string "(null)" on others it will core dump.
|
---|
| 611 | (For example, Solaris seems to core dump, sunos 4 seems ok).
|
---|
| 612 | SOLUTION
|
---|
| 613 | The result from getenv("HOME")" is tested before
|
---|
| 614 | being used.
|
---|
| 615 | FILES
|
---|
| 616 | * commands.c
|
---|
| 617 | *************************************************************
|
---|
| 618 | TITLE
|
---|
| 619 | mgquery collection name preference
|
---|
| 620 | APPLICATION
|
---|
| 621 | mg-1, mg-2
|
---|
| 622 | TYPE
|
---|
| 623 | improve
|
---|
| 624 | REPORT
|
---|
| 625 | [email protected] - 2/May/95
|
---|
| 626 | FIX
|
---|
| 627 | [email protected] - 4/May/95
|
---|
| 628 | CLAIM
|
---|
| 629 | Surely something must override mquery's preference for ./bib.
|
---|
| 630 | If MGDATA is set correctly, I think it should prefer that collection,
|
---|
| 631 | and -d should definitely override it.
|
---|
| 632 | I could always say -d . if I really wanted ./bib.
|
---|
| 633 | PROBLEM
|
---|
| 634 | Currently the priority is:
|
---|
| 635 | 1. Check if ./name is a directory,
|
---|
| 636 | If so then use it as the collection directory.
|
---|
| 637 | 2. Check if ./name.text is a file,
|
---|
| 638 | If so then use ./ as the collection directory.
|
---|
| 639 | 3. Check if mgdir/name is a directory,
|
---|
| 640 | If so then use mgdir/name as the collection directory.
|
---|
| 641 | 4. Otherwise,
|
---|
| 642 | Use mgdir/name as the database file prefix.
|
---|
| 643 | This would be the case if one used "-f alice/alice".
|
---|
| 644 | However, one would then not specify a final name argument
|
---|
| 645 | and we'd never get here. Go figure ???
|
---|
| 646 | SOLUTION
|
---|
| 647 | Moved step 3 to the top instead.
|
---|
| 648 | FILES
|
---|
| 649 | * mgquery.c [search_for_collection()]
|
---|
| 650 | *************************************************************
|
---|
| 651 | TITLE
|
---|
| 652 | Printout of query terms
|
---|
| 653 | APPLICATION
|
---|
| 654 | mg-1, mg-2
|
---|
| 655 | TYPE
|
---|
| 656 | extend
|
---|
| 657 | REPORT
|
---|
| 658 | [email protected] - April 95
|
---|
| 659 | FIX
|
---|
| 660 | [email protected] - April 95
|
---|
| 661 | CLAIM
|
---|
| 662 | No easy way to find out the parsed and stemmed words
|
---|
| 663 | used in the query. Would like to know these words
|
---|
| 664 | so I can call a separate highlighting program to
|
---|
| 665 | highlight these words.
|
---|
| 666 | PROBLEM
|
---|
| 667 | No facility available.
|
---|
| 668 | SOLUTION
|
---|
| 669 | A ".queryterms" mgquery command was added which lists
|
---|
| 670 | out the parsed/stemmed queryterms of the last query.
|
---|
| 671 | FILES
|
---|
| 672 | * commands.c (added CmdQueryTerms)
|
---|
| 673 | *************************************************************
|
---|
| 674 | TITLE
|
---|
| 675 | mg_getrc
|
---|
| 676 | APPLICATION
|
---|
| 677 | mg-1, mg-2
|
---|
| 678 | TYPE
|
---|
| 679 | extend
|
---|
| 680 | REPORT
|
---|
| 681 | [email protected] - 2/May/95
|
---|
| 682 | FIX
|
---|
| 683 | -
|
---|
| 684 | CLAIM
|
---|
| 685 | Repeated code had to be written for different named
|
---|
| 686 | gets but really the same type of parsing required.
|
---|
| 687 | E.g. one might want to use a standard method for inserting
|
---|
| 688 | ^Bs between paragraphs for different books. One doesn't
|
---|
| 689 | want to write duplicate code for each different named book,
|
---|
| 690 | rather note that each book should be filtered "book" style.
|
---|
| 691 | PROBLEM
|
---|
| 692 | There was no way of abstracting out types of filters from
|
---|
| 693 | the name of an instance of a collection.
|
---|
| 694 | SOLUTION
|
---|
| 695 | Allow information to be given with <name, type, files>.
|
---|
| 696 | This extra info can be provided in a mg_getrc file.
|
---|
| 697 | See man page for mg_get for details.
|
---|
| 698 | FILES
|
---|
| 699 | * mg_get.sh
|
---|
| 700 | *************************************************************
|
---|
| 701 | TITLE
|
---|
| 702 | TREC DocNo file
|
---|
| 703 | APPLICATION
|
---|
| 704 | mg-2
|
---|
| 705 | TYPE
|
---|
| 706 | improve
|
---|
| 707 | REPORT
|
---|
| 708 | [email protected] - 26/May/95
|
---|
| 709 | FIX
|
---|
| 710 | [email protected] - 26/May/95
|
---|
| 711 | CLAIM
|
---|
| 712 | MG has problems dealing with trec docnos for trec disk 3.
|
---|
| 713 | PROBLEM
|
---|
| 714 | Trec DocNos file didn't have a wide enough field
|
---|
| 715 | to handle disk 3.
|
---|
| 716 | SOLUTION
|
---|
| 717 | Allow different width fields for file.
|
---|
| 718 | It is still fixed width but a number in the header
|
---|
| 719 | says how wide the field is.
|
---|
| 720 | FILES
|
---|
| 721 | * passes/mg.special.c
|
---|
| 722 | * query/mgquery.c
|
---|
| 723 | *************************************************************
|
---|
| 724 | TITLE
|
---|
| 725 | Boolean optimiser #1 with `!'
|
---|
| 726 | APPLICATION
|
---|
| 727 | mg-1, mg-2
|
---|
| 728 | TYPE
|
---|
| 729 | bug
|
---|
| 730 | REPORT
|
---|
| 731 | [email protected] - 20/7/95
|
---|
| 732 | FIX
|
---|
| 733 | [email protected] - 27/7/95
|
---|
| 734 | CLAIM
|
---|
| 735 | Complained about not-nodes.
|
---|
| 736 | e.g. complained about "croquet & !hedgehog"
|
---|
| 737 | PROBLEM
|
---|
| 738 | Boolean optimiser type#1 didn't convert
|
---|
| 739 | "and not"s into diff nodes.
|
---|
| 740 | SOLUTION
|
---|
| 741 | Added code to convert '&!' to '-'.
|
---|
| 742 | FILES
|
---|
| 743 | * mg/bool_optimiser.c [mg-1]
|
---|
| 744 | * query/bool_optimiser.c [mg-2]
|
---|
| 745 | *************************************************************
|
---|
| 746 | TITLE
|
---|
| 747 | Autoconfiguring mg-1
|
---|
| 748 | APPLICATION
|
---|
| 749 | mg-1
|
---|
| 750 | TYPE
|
---|
| 751 | improve
|
---|
| 752 | REPORT
|
---|
| 753 | many people - 94/95
|
---|
| 754 | FIX
|
---|
| 755 | [email protected] - Aug/95
|
---|
| 756 | CLAIM
|
---|
| 757 | Portability is limited by setting up c-macros just for particular
|
---|
| 758 | machines and operating systems.
|
---|
| 759 | People had to make changes for HP, Next, Linux, Dec Alpha, ...
|
---|
| 760 | PROBLEM
|
---|
| 761 | Porting was only targetting at the machines that the author had
|
---|
| 762 | access to.
|
---|
| 763 | SOLUTION
|
---|
| 764 | Use GNU's autoconfigure program.
|
---|
| 765 | This allows checking of the systems features/characteristics.
|
---|
| 766 | It also allows some checking for specific machines/OS - although
|
---|
| 767 | I have not utilised this option.
|
---|
| 768 | I used GNU's tar-1.11.8 as an example to base my changes on.
|
---|
| 769 | I also used autoscan to generate the initial "configure.in".
|
---|
| 770 | The "Makefile.in"s were done very similarly to GNU tar's.
|
---|
| 771 | The "config.h" and "sysfuncs.h" files were scrapped and
|
---|
| 772 | rewritten. The new "config.h" file is generated by the configure
|
---|
| 773 | script - it contains all the #define's for the system features.
|
---|
| 774 | The "sysfuncs.h" file wraps up a number of system headers.
|
---|
| 775 | For example, some systems use , while some use ;
|
---|
| 776 | which one is included is decided in "sysfuncs.h".
|
---|
| 777 | I have also used GNU tar's use of ansi2knr in its Makefiles.
|
---|
| 778 | This should hopefully allow the package to work on a system with
|
---|
| 779 | only a K&R C compiler.
|
---|
| 780 | However, there are probably problems with what I have done.
|
---|
| 781 | I am concerned about <stdarg.h> for example.
|
---|
| 782 | I also noticed that "ansi2knr" require function definitions as
|
---|
| 783 | the GNU coding style recommends ie. with function name the first
|
---|
| 784 | string on the line. This prompted me to run all the package's code
|
---|
| 785 | thru GNU's indent.
|
---|
| 786 | Setting up the configure changes is difficult. It really seems
|
---|
| 787 | necessary to try the package out on many target machines so one
|
---|
| 788 | can know what is necessary.
|
---|
| 789 | A simple check target for the main Makefile has been written.
|
---|
| 790 | It is used to see if the installation is working - it does
|
---|
| 791 | not test much of the functionality of mg.
|
---|
| 792 | It does cmp's on data files and diff's on query/result files.
|
---|
| 793 | FILES
|
---|
| 794 | Most of the files in the distribution.
|
---|
| 795 | *************************************************************
|
---|
| 796 | TITLE
|
---|
| 797 | Consistent use of stderr
|
---|
| 798 | APPLICATION
|
---|
| 799 | mg-1
|
---|
| 800 | TYPE
|
---|
| 801 | improve
|
---|
| 802 | REPORT
|
---|
| 803 | [email protected] - 16 May 1994
|
---|
| 804 | FIX
|
---|
| 805 | [email protected] - 11 August 1994
|
---|
| 806 | CLAIM
|
---|
| 807 | Inconsistent use of stdout/stderr in usage messages.
|
---|
| 808 | PROBLEM
|
---|
| 809 | Sometimes used "printf" and sometimes used "fprintf(stderr"
|
---|
| 810 | in usage messages.
|
---|
| 811 | SOLUTION
|
---|
| 812 | All should now use "fprintf(stderr" in usage messages.
|
---|
| 813 | FILES
|
---|
| 814 | * mg_compression_dict.c
|
---|
| 815 | * mg_compression_dict.1
|
---|
| 816 | * mg_fast_comp_dict.c
|
---|
| 817 | * mg_fast_comp_dict.1
|
---|
| 818 | * mg_invf_dict.c
|
---|
| 819 | * mg_invf_dict.1
|
---|
| 820 | * mg_invf_dump.c
|
---|
| 821 | * mg_invf_dump.1
|
---|
| 822 | * mg_invf_rebuild.c
|
---|
| 823 | * mg_invf_rebuild.1
|
---|
| 824 | * mg_perf_hash_build.c
|
---|
| 825 | * mg_perf_hash_build.1
|
---|
| 826 | * mg_text_estimate.c
|
---|
| 827 | * mg_text_estimate.1
|
---|
| 828 | * mg_weights_build.c
|
---|
| 829 | * mg_weights_build.1
|
---|
| 830 | *************************************************************
|
---|
| 831 | TITLE
|
---|
| 832 | xmg bug
|
---|
| 833 | APPLICATION
|
---|
| 834 | mg-1
|
---|
| 835 | TYPE
|
---|
| 836 | bug
|
---|
| 837 | REPORT
|
---|
| 838 | [email protected] - 22 April 1994
|
---|
| 839 | FIX
|
---|
| 840 | [email protected] - 22 April 1994
|
---|
| 841 | CLAIM
|
---|
| 842 | "Serious problem in xmg, which I fear occurs whenever a query
|
---|
| 843 | doesn't return anything."
|
---|
| 844 | PROBLEM
|
---|
| 845 | ??
|
---|
| 846 | SOLUTION
|
---|
| 847 | [xmg.sh 201] set rank 0
|
---|
| 848 | FILES
|
---|
| 849 | * xmg.sh
|
---|
| 850 | *************************************************************
|
---|
| 851 | TITLE
|
---|
| 852 | Unnecessary loading of text
|
---|
| 853 | APPLICATION
|
---|
| 854 | mg-1
|
---|
| 855 | TYPE
|
---|
| 856 | bug
|
---|
| 857 | REPORT
|
---|
| 858 | [email protected] - ?? August 1994
|
---|
| 859 | FIX
|
---|
| 860 | [email protected] - 12 August 1994
|
---|
| 861 | CLAIM
|
---|
| 862 | Mg was loading and uncompressing text when the
|
---|
| 863 | query did not require the text.
|
---|
| 864 | PROBLEM
|
---|
| 865 | There was no test for the query mode
|
---|
| 866 | before loading and uncompressing the text.
|
---|
| 867 | SOLUTION
|
---|
| 868 | Only load/uncompress text if query mode
|
---|
| 869 | is for text, headers or silent(for timing).
|
---|
| 870 | FILES
|
---|
| 871 | * mgquery.c
|
---|
| 872 | *************************************************************
|
---|
| 873 | TITLE
|
---|
| 874 | Man page errors
|
---|
| 875 | APPLICATION
|
---|
| 876 | mg-1
|
---|
| 877 | TYPE
|
---|
| 878 | bug
|
---|
| 879 | REPORT
|
---|
| 880 | [email protected] - 16 May 1994
|
---|
| 881 | FIX
|
---|
| 882 | [email protected] - 16 May 1994
|
---|
| 883 | CLAIM
|
---|
| 884 | Man page errors.
|
---|
| 885 | PROBLEM
|
---|
| 886 | See below.
|
---|
| 887 | SOLUTION
|
---|
| 888 | "The mg_make_fast_dict.1 file has been renamed mg_fast_comp_dict.1,
|
---|
| 889 | and all mg_make_fast_dict strings changed to mg_fast_comp_dict in all
|
---|
| 890 | man pages.
|
---|
| 891 | A large number of errors of spelling, typography, spacing, fonts,
|
---|
| 892 | grammar, omitted words, slang, punctuation, missing man page
|
---|
| 893 | cross-references, and man-page style have been corrected."
|
---|
| 894 | FILES
|
---|
| 895 | * mg_compression_dict.1
|
---|
| 896 | * mg_fast_comp_dict.1
|
---|
| 897 | * mg_get.1
|
---|
| 898 | * mg_invf_dict.1
|
---|
| 899 | * mg_invf_dump.1
|
---|
| 900 | * mg_invf_rebuild.1
|
---|
| 901 | * mg_passes.1
|
---|
| 902 | * mg_perf_hash_build.1
|
---|
| 903 | * mg_text_estimate.1
|
---|
| 904 | * mg_weights_build.1
|
---|
| 905 | * mgbilevel.1
|
---|
| 906 | * mgbuild.1
|
---|
| 907 | * mgdictlist.1
|
---|
| 908 | * mgfelics.1
|
---|
| 909 | * mgquery.1
|
---|
| 910 | * mgstat.1
|
---|
| 911 | * mgtic.1
|
---|
| 912 | * mgticbuild.1
|
---|
| 913 | * mgticdump.1
|
---|
| 914 | * mgticprune.1
|
---|
| 915 | * mgticstat.1
|
---|
| 916 | * xmg.1
|
---|
| 917 | *************************************************************
|
---|
| 918 | TITLE
|
---|
| 919 | Man page overview
|
---|
| 920 | APPLICATION
|
---|
| 921 | mg-1
|
---|
| 922 | TYPE
|
---|
| 923 | extend
|
---|
| 924 | REPORT
|
---|
| 925 | [email protected] -
|
---|
| 926 | FIX
|
---|
| 927 | [email protected] - 17 August 1994
|
---|
| 928 | CLAIM
|
---|
| 929 | "Write new mg.1 file to give a brief overview of mg, with samples
|
---|
| 930 | of how to use it. Otherwise, users are likely to be completely
|
---|
| 931 | overwhelmed by the number of programs (about 20) which might need to
|
---|
| 932 | be used, when in reality, only 2 or 3 are likely to be run by end
|
---|
| 933 | users."
|
---|
| 934 | SOLUTION
|
---|
| 935 | It was thought that mg.1, written by Nelson Beebe, was very useful
|
---|
| 936 | but a bit too comprehensive for an introduction.
|
---|
| 937 | Therefore, two man files, mgintro.1 and mgintro++.1 were written
|
---|
| 938 | with the basic stuff in mgintro.1 and slightly more advanced stuff
|
---|
| 939 | in mgintro++.1 .
|
---|
| 940 | FILES
|
---|
| 941 | * mg.1
|
---|
| 942 | * mgintro.1
|
---|
| 943 | * mgintro++.1
|
---|
| 944 | *************************************************************
|
---|
| 945 | TITLE
|
---|
| 946 | Parse errors not bus errors
|
---|
| 947 | APPLICATION
|
---|
| 948 | mg-1
|
---|
| 949 | TYPE
|
---|
| 950 | bug
|
---|
| 951 | REPORT
|
---|
| 952 | [email protected] - 2 Jun 94
|
---|
| 953 | FIX
|
---|
| 954 | [email protected] - 19 Aug 94
|
---|
| 955 | CLAIM
|
---|
| 956 | "These two queries
|
---|
| 957 | (which I typed in before I knew what I was doing!!)
|
---|
| 958 | > The Queen of Hearts, she made some tarts
|
---|
| 959 | > "The Queen of Hearts" and "she made some tarts"
|
---|
| 960 | produced the following result:
|
---|
| 961 | mgquery : parse error
|
---|
| 962 | Bus error
|
---|
| 963 | "
|
---|
| 964 | PROBLEM
|
---|
| 965 | What is expected to happen under boolean querying:
|
---|
| 966 | Query1:
|
---|
| 967 | > The Queen of Hearts, she made some tarts
|
---|
| 968 | will produce a parse error due to the comma which
|
---|
| 969 | is not a valid TERM.
|
---|
| 970 | Query2:
|
---|
| 971 | > "The Queen of Hearts" and "she made some tarts"
|
---|
| 972 | will store a post-processing string
|
---|
| 973 | of ''The Queen of Hearts" and "she made some tarts'' and
|
---|
| 974 | will have a main boolean query of the empty string.
|
---|
| 975 | This is because the postprocessing string takes in
|
---|
| 976 | everything between the first quote and the last one.
|
---|
| 977 | An empty string is illegal for the boolean grammar and
|
---|
| 978 | hence a parse error.
|
---|
| 979 | The problem stems from the fact that the processing of
|
---|
| 980 | the parse tree is carried out, even though we have a
|
---|
| 981 | parse error. In the case of using an empty string to build
|
---|
| 982 | a parse tree, it is likely to leave the parse tree undefined.
|
---|
| 983 | SOLUTION
|
---|
| 984 | As soon as we find out that there is a parse-error,
|
---|
| 985 | we abandon any processing of the parse tree.
|
---|
| 986 | FILES
|
---|
| 987 | * query.bool.y
|
---|
| 988 | * query.bool.c (generated from query.bool.y)
|
---|
| 989 | *************************************************************
|
---|
| 990 | TITLE
|
---|
| 991 | Perfect hashing on small vocab
|
---|
| 992 | APPLICATION
|
---|
| 993 | mg-1
|
---|
| 994 | TYPE
|
---|
| 995 | bug
|
---|
| 996 | REPORT
|
---|
| 997 | [email protected] - July 1994
|
---|
| 998 | FIX
|
---|
| 999 | [email protected] - July 1994
|
---|
| 1000 | CLAIM
|
---|
| 1001 | Mg could not handle small collections in the case
|
---|
| 1002 | where there was only a small number of unique words.
|
---|
| 1003 | The perfect hash function would report an error.
|
---|
| 1004 | PROBLEM
|
---|
| 1005 | Rounding of the arithmetic during the calculation of the
|
---|
| 1006 | parameters of the perfect hash function was resulting in a
|
---|
| 1007 | combination of values such that the probability of a hash
|
---|
| 1008 | function being found was very small. This led to the limit
|
---|
| 1009 | on the generation loop being exceeded, and eventual
|
---|
| 1010 | failure.
|
---|
| 1011 | SOLUTION
|
---|
| 1012 | By using ceiling rather than floor when converting from a
|
---|
| 1013 | floating point value to an integer parameter, the arithmetic
|
---|
| 1014 | is now correct for all lexicon sizes, and the probability of
|
---|
| 1015 | each iteration successfully generating a hash function is
|
---|
| 1016 | sufficiently great that with _very_ high probability the
|
---|
| 1017 | execution loop counter will not be exceeded unless there
|
---|
| 1018 | genuinely is no hash function (for example, if the lexicon
|
---|
| 1019 | contains two words the same there cannot be a hash
|
---|
| 1020 | function).
|
---|
| 1021 | FILES
|
---|
| 1022 | * perf_hash.c
|
---|
| 1023 | *************************************************************
|
---|