| 1 | TITLE
|
---|
| 2 | Parsing of Long Words
|
---|
| 3 | APPLICATION
|
---|
| 4 | mg-1, mg-2
|
---|
| 5 | TYPE
|
---|
| 6 | bug
|
---|
| 7 | REPORT
|
---|
| 8 | [email protected] - May 11th 1994
|
---|
| 9 | FIX
|
---|
| 10 | [email protected] - August 9th 1994
|
---|
| 11 | CLAIM
|
---|
| 12 | Mg didn't handle long words properly; it crashed.
|
---|
| 13 | PROBLEM
|
---|
| 14 | The invf passes call PARSE_LONG_WORD [words.h], which uses a limit of
|
---|
| 15 | MAXLONGWORD when iterating through the string and storing it into
|
---|
| 16 | a word. MAXLONGWORD = 8192.
|
---|
| 17 | However, mg strings generally store the length in the first
|
---|
| 18 | byte, limiting them to 255 characters. The word which was passed
|
---|
| 19 | to PARSE_LONG_WORD was an allocated string of MAXSTEMLEN = 255,
|
---|
| 20 | which is as large as we should get anyway. Thus, when given
|
---|
| 21 | a word longer than 255 chars, PARSE_LONG_WORD would accept it
|
---|
| 22 | (as it was shorter than 8192) and would try to store beyond the array limit.
|
---|
| 23 | SOLUTION
|
---|
| 24 | The author can't remember why PARSE_LONG_WORD was used and what
|
---|
| 25 | the significance of MAXLONGWORD = 8192 is.
|
---|
| 26 | So PARSE_LONG_WORD has been changed to PARSE_STEM_WORD which
|
---|
| 27 | uses MAXSTEMLEN as its limit.
|
---|
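The idea behind the fix can be sketched as below. This is only an illustration: `parse_word_bounded` and `MAX_STEM_LEN` are hypothetical stand-ins for PARSE_STEM_WORD and MAXSTEMLEN, whose real definitions in words.h are not reproduced here, and the sketch treats any run of alphanumeric characters as a word, which may differ from mg's actual rules.

```c
#include <ctype.h>
#include <stddef.h>

#define MAX_STEM_LEN 255  /* assumed to mirror MAXSTEMLEN: the most one length byte can count */

/*
 * Copy alphanumeric characters from [s, end) into `word`, a length-prefixed
 * string: word[0] holds the length, word[1..] the characters.
 * `word` must have room for MAX_STEM_LEN + 1 bytes.
 * Stops at the first non-alphanumeric character, at `end`, or once
 * MAX_STEM_LEN characters have been stored, whichever comes first.
 * Returns a pointer to the first unconsumed input character.
 */
static const unsigned char *
parse_word_bounded(const unsigned char *s, const unsigned char *end,
                   unsigned char *word)
{
    size_t len = 0;

    while (s < end && len < MAX_STEM_LEN && isalnum(*s)) {
        word[++len] = *s++;
    }
    word[0] = (unsigned char)len;
    return s;
}
```

Because the loop is bounded by the same constant that sizes the destination, a 300-character run is cut at 255 characters instead of being written past the end of the array.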
| 28 | FILES
|
---|
| 29 | * words.h
|
---|
| 30 | * invf.pass1.c
|
---|
| 31 | * invf.pass2.c
|
---|
| 32 | * ivf.pass1.c
|
---|
| 33 | * ivf.pass2.c
|
---|
| 34 | * query.ranked.c
|
---|
| 35 | *************************************************************
|
---|
| 36 | TITLE
|
---|
| 37 | Use of Lovins stemmer
|
---|
| 38 | APPLICATION
|
---|
| 39 | mg-1
|
---|
| 40 | TYPE
|
---|
| 41 | improve
|
---|
| 42 | REPORT
|
---|
| 43 | local - 1994
|
---|
| 44 | FIX
|
---|
| 45 | [email protected] - 1994
|
---|
| 46 | CLAIM
|
---|
| 47 | Stemming was done naively.
|
---|
| 48 | PROBLEM
|
---|
| 49 | Only a few types of words and their endings
|
---|
| 50 | were considered.
|
---|
| 51 | SOLUTION
|
---|
| 52 | Replacement with a more elaborate "known" stemmer by Lovins.
|
---|
| 53 | The algorithm is described in:
|
---|
| 54 | J.B. Lovins, "Development of a Stemming Algorithm",
|
---|
| 55 | Mechanical Translation and Computational Linguistics, Vol. 11, 1968.
|
---|
| 56 | FILES
|
---|
| 57 | * stem.c
|
---|
| 58 | * stem.h
|
---|
| 59 | *************************************************************
|
---|
| 60 | TITLE
|
---|
| 61 | Different term parsing
|
---|
| 62 | APPLICATION
|
---|
| 63 | mg-1
|
---|
| 64 | TYPE
|
---|
| 65 | bug
|
---|
| 66 | REPORT
|
---|
| 67 | [email protected] - 23 Aug 1994
|
---|
| 68 | FIX
|
---|
| 69 | [email protected] - 23 Aug 1994
|
---|
| 70 | CLAIM
|
---|
| 71 | Boolean queries did not extract words/terms using the
|
---|
| 72 | same method as is done at inverted-file creation and
|
---|
| 73 | as is used for rank query parsing.
|
---|
| 74 | PROBLEM
|
---|
| 75 | The hand-written lexical analyser, query_lex, which is called by
|
---|
| 76 | the boolean query parser, was not calling the common
|
---|
| 77 | word-extraction routine used by the rest of mg.
|
---|
| 78 | This would be OK if the two pieces of code did the same thing, but they didn't.
|
---|
| 79 | Query_lex, for instance, did NOT place any limit on the
|
---|
| 80 | number of digits in a term.
|
---|
| 81 | Of even more concern, it would allow arbitrary sized words
|
---|
| 82 | although it used Pascal style strings which store the length
|
---|
| 83 | in the first byte and can therefore only be 255 characters in length.
|
---|
| 84 | SOLUTION
|
---|
| 85 | Query_lex in "query.bool.y" was modified to call the routine
|
---|
| 86 | PARSE_STEM_WORD which is also used by text-inversion routines and
|
---|
| 87 | ranking query routines.
|
---|
| 88 | Now all terms are extracted by the same routine.
|
---|
| 89 | To do this, the end of the line buffer had to be noted as
|
---|
| 90 | PARSE_STEM_WORD requires a pointer to the end - which is the
|
---|
| 91 | safe thing to do (don't want to run over the end).
|
---|
| 92 | This meant I had to find the length of the query line buffer.
|
---|
| 93 | This was allocated in the file "read_line.c" by the routine,
|
---|
| 94 | "readline". Its size was the literal number 1024.
|
---|
| 95 | This was changed to a constant and placed in "read_line.h".
|
---|
| 96 | The definition for PARSE_STEM_WORD can be found in "words.h".
|
---|
| 97 | FILES
|
---|
| 98 | * query.bool.y
|
---|
| 99 | * query.bool.c (by bison)
|
---|
| 100 | * read_line.c
|
---|
| 101 | * read_line.h
|
---|
| 102 | *************************************************************
|
---|
| 103 | TITLE
|
---|
| 104 | Highlighting of query terms
|
---|
| 105 | APPLICATION
|
---|
| 106 | mg-1
|
---|
| 107 | TYPE
|
---|
| 108 | extend
|
---|
| 109 | REPORT
|
---|
| 110 | [email protected] - Aug 94
|
---|
| 111 | FIX
|
---|
| 112 | [email protected] - Sep 94
|
---|
| 113 | CLAIM
|
---|
| 114 | Difficult to feel happy that the query-result returned is
|
---|
| 115 | satisfying the query - need to look hard to find the queried words.
|
---|
| 116 | Need to show words in results using some highlighting method.
|
---|
| 117 | PROBLEM
|
---|
| 118 | No highlighting of query terms in results.
|
---|
| 119 | SOLUTION
|
---|
| 120 | Mgquery was previously outputting the decompressed text to a pager
|
---|
| 121 | such as "less(1)" or "more(1)".
|
---|
| 122 | (Except when redirected or piped elsewhere :)
|
---|
| 123 | So what was needed was some sort of highlight pager that, as well as
|
---|
| 124 | displaying the text, would also use some means of highlighting the
|
---|
| 125 | stemmed query words.
|
---|
| 126 | Two common forms of highlighting were chosen: underline and bolding.
|
---|
| 127 | These are supported by "less(1)" and possibly by "more(1)" by
|
---|
| 128 | using the backspace character.
|
---|
| 129 | A highlight pager will also need to know which words need to be
|
---|
| 130 | highlighted. Therefore, the code was modified to build up a
|
---|
| 131 | string of the stemmed query words for passing to the highlight pager.
|
---|
| 132 | Design Options:
|
---|
| 133 | ---------------
|
---|
| 134 | * Could do text filtering in mgquery before passing out to pager.
|
---|
| 135 | Instead I pipe to a separate process, the "hilite_words" pager,
|
---|
| 136 | which filters and pipes into less/more.
|
---|
| 137 | * Could do different highlighting or a combination.
|
---|
| 138 | * Could use a different structure for storing the query words other
|
---|
| 139 | than the hash-table I used.
|
---|
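For reference, backspace highlighting works by overstriking: an underlined character is sent as underscore, backspace, character, and a bold character as character, backspace, character. The sketch below shows the encoding only; `overstrike` and `enum hilite_style` are hypothetical names and are not taken from mg_hilite_words.c.

```c
#include <stddef.h>
#include <string.h>

enum hilite_style { HILITE_UNDERLINE, HILITE_BOLD };

/*
 * Encode `word` into `out` using backspace overstriking:
 *   underline:  '_' '\b' c
 *   bold:        c  '\b' c
 * Returns the number of bytes written (excluding the terminating NUL),
 * or 0 if `out` (of capacity `cap`) is too small.
 */
static size_t
overstrike(char *out, size_t cap, const char *word, enum hilite_style style)
{
    size_t n = strlen(word);
    size_t i, j = 0;

    if (cap < 3 * n + 1)
        return 0;
    for (i = 0; i < n; i++) {
        out[j++] = (style == HILITE_UNDERLINE) ? '_' : word[i];
        out[j++] = '\b';
        out[j++] = word[i];
    }
    out[j] = '\0';
    return j;
}
```

Each highlighted character costs three bytes, so the output buffer needs three times the word length plus one.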
| 140 | FILES
|
---|
| 141 | * Makefile - to include hilite_words target
|
---|
| 142 | * mg_hilite_words.c
|
---|
| 143 | * mgquery.c
|
---|
| 144 | * mgquery.1
|
---|
| 145 | * query.bool.y
|
---|
| 146 | * query.ranked.c
|
---|
| 147 | * environment.c
|
---|
| 148 | * environment.h
|
---|
| 149 | * backend.h
|
---|
| 150 | *************************************************************
|
---|
| 151 | TITLE
|
---|
| 152 | Mg_compression_dict did premature free
|
---|
| 153 | APPLICATION
|
---|
| 154 | mg-1
|
---|
| 155 | TYPE
|
---|
| 156 | bug
|
---|
| 157 | REPORT
|
---|
| 158 | [email protected] - 23 Sep 94
|
---|
| 159 | FIX
|
---|
| 160 | [email protected] - 23 Sep 94
|
---|
| 161 | CLAIM
|
---|
| 162 | mg_compression_dict dumped core in
|
---|
| 163 | file: mg_compression_dict.c
|
---|
| 164 | function: Write_data
|
---|
| 165 | line: int codelen = hd->clens[i];
|
---|
| 166 | PROBLEM
|
---|
| 167 | Huffman data, hd, was freed *before* it was accessed again.
|
---|
| 168 | SOLUTION
|
---|
| 169 | The freeing of hd has been moved to after all accesses
|
---|
| 170 | (just before returning).
|
---|
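The shape of the fix is roughly as follows. This is a cut-down sketch, not the code from mg_compression_dict.c: `struct huff_data` models only a `clens` array and a count, and `total_code_length_and_free` is a hypothetical function standing in for Write_data.

```c
#include <stdlib.h>

/* Hypothetical, cut-down stand-in for the Huffman data; only clens is modelled. */
struct huff_data {
    int   num_codes;
    char *clens;      /* code length for each symbol */
};

/*
 * Sum the code lengths, then release hd.
 * The free happens only after the last read of hd->clens[i];
 * the bug described above is the equivalent of freeing before the loop.
 */
static long
total_code_length_and_free(struct huff_data *hd)
{
    long total = 0;
    int i;

    for (i = 0; i < hd->num_codes; i++) {
        int codelen = hd->clens[i];   /* last access to hd's contents */
        total += codelen;
    }

    free(hd->clens);   /* release the member first ... */
    free(hd);          /* ... then the struct, just before returning */
    return total;
}
```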
| 171 | FILES
|
---|
| 172 | * mg_compression_dict.c
|
---|
| 173 | *************************************************************
|
---|
| 174 | TITLE
|
---|
| 175 | Boolean tree optimising rewrite
|
---|
| 176 | APPLICATION
|
---|
| 177 | mg-1
|
---|
| 178 | TYPE
|
---|
| 179 | bug
|
---|
| 180 | REPORT
|
---|
| 181 | [email protected] - 23 Sep 94
|
---|
| 182 | FIX
|
---|
| 183 | [email protected] - Oct 94
|
---|
| 184 | CLAIM
|
---|
| 185 | "I am still getting core dump in "and" queries in mgquery,
|
---|
| 186 | where the first word does not exist, but the second one does."
|
---|
| 187 | PROBLEM
|
---|
| 188 | Having freed a particular node, the code tried to free it again and
|
---|
| 189 | access one of its fields.
|
---|
| 190 |
|
---|
| 191 | I.e. code-fragment...
|
---|
| 192 |
|
---|
| 193 | FreeNode(curr); /* where curr = CHILD(base) for 1st term in list */
|
---|
| 194 | FreeNodes(next);
|
---|
| 195 | FreeNodes(CHILD(base));
|
---|
| 196 | /* but CHILD(base) has already been freed above */
|
---|
| 197 | /* if the node was the first one in the list */
|
---|
| 198 |
|
---|
| 199 | SOLUTION
|
---|
| 200 | A number of things in the code seemed a bit dubious to me.
|
---|
| 201 | So I have rewritten the boolean optimising stage and abstracted out
|
---|
| 202 | the various stages - each file starts with "bool".
|
---|
| 203 | Boolean query optimising seems to be a tricky problem.
|
---|
| 204 | It is not clear that putting an expression into a certain form will
|
---|
| 205 | actually simplify it and whether simplification means faster querying.
|
---|
| 206 | I have converted a given boolean expression into DNF
|
---|
| 207 | (Disjunctive Normal Form). "And not" nodes, which are readily apparent
|
---|
| 208 | in DNF, are converted to "diff" nodes. I have only applied the idempotency
|
---|
| 209 | laws involving TRUE and FALSE, and not the ones requiring matching of
|
---|
| 210 | expressions - it is a potentially more complicated problem.
|
---|
| 211 | The optimiser has been tested by playing with "bool_tester", and if you are
|
---|
| 212 | having a crash or problem in a boolean query it would be worth testing the
|
---|
| 213 | query on the "bool_tester". The token "*" stands for TRUE (or all documents)
|
---|
| 214 | and the token "_" stands for FALSE (or no documents). This should show the
|
---|
| 215 | expression before and after optimisation in an ASCII tree bracketing format.
|
---|
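The simplifications involving TRUE and FALSE can be illustrated with a small sketch. The node layout and the name `simplify_constants` are hypothetical and much simpler than the real tree in bool_tree.c; only binary "and"/"or" nodes are modelled, and nodes that drop out are unlinked rather than freed.

```c
#include <stddef.h>

enum node_tag { N_TERM, N_TRUE, N_FALSE, N_AND, N_OR };

struct node {
    enum node_tag tag;
    struct node *left, *right;   /* used by N_AND and N_OR only */
};

/*
 * Simplify bottom-up using only the laws that involve constants:
 *   x & TRUE  = x      x & FALSE = FALSE
 *   x | FALSE = x      x | TRUE  = TRUE
 * Returns the root of the simplified tree. N_AND and N_OR nodes are
 * assumed to have two non-NULL children.
 */
static struct node *
simplify_constants(struct node *n)
{
    struct node *l, *r;

    if (n == NULL || (n->tag != N_AND && n->tag != N_OR))
        return n;

    l = n->left  = simplify_constants(n->left);
    r = n->right = simplify_constants(n->right);

    if (n->tag == N_AND) {
        if (l->tag == N_FALSE || r->tag == N_TRUE) return l;
        if (r->tag == N_FALSE || l->tag == N_TRUE) return r;
    } else {
        if (l->tag == N_TRUE || r->tag == N_FALSE) return l;
        if (r->tag == N_TRUE || l->tag == N_FALSE) return r;
    }
    return n;
}
```

In bool_tester notation, an input such as `(x & *) | _` would reduce to plain `x` under these rules. No comparison of subexpressions is needed, which is why these laws are the cheap ones to apply.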
| 216 | FILES
|
---|
| 217 | * bool_tree.c
|
---|
| 218 | * bool_parser.y
|
---|
| 219 | * bool_optimiser.c
|
---|
| 220 | * bool_query.c
|
---|
| 221 | * bool_tester.c
|
---|
| 222 | * term_lists.c
|
---|
| 223 | *************************************************************
|
---|
| 224 | TITLE
|
---|
| 225 | Mgtic pixel placement
|
---|
| 226 | APPLICATION
|
---|
| 227 | mg-1
|
---|
| 228 | TYPE
|
---|
| 229 | bug
|
---|
| 230 | REPORT
|
---|
| 231 | Bruce McKenzie - [email protected] (21st Oct 1994)
|
---|
| 232 | FIX
|
---|
| 233 | [email protected]
|
---|
| 234 | CLAIM
|
---|
| 235 | mgtic crashed on certain files.
|
---|
| 236 | PROBLEM
|
---|
| 237 | Placing pixels outside of bitmap.
|
---|
| 238 | SOLUTION
|
---|
| 239 | Changed the putpixel routine to truncate at borders of the image.
|
---|
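One reading of "truncate at borders" is that pixels falling outside the image are simply dropped. A minimal sketch under that assumption follows; `struct bitmap` is a hypothetical one-byte-per-pixel layout and `putpixel_clipped` a hypothetical name, since the real routine in mgtic.c is not shown here.

```c
#include <stddef.h>

/* Hypothetical one-byte-per-pixel bitmap; mgtic's real bitmap layout is not shown. */
struct bitmap {
    int width, height;
    unsigned char *pixels;   /* width * height bytes, row-major */
};

/*
 * Set pixel (x, y) to `value`, silently ignoring coordinates outside the
 * image. Returns 1 if the pixel was written and 0 if it was clipped.
 */
static int
putpixel_clipped(struct bitmap *bm, int x, int y, unsigned char value)
{
    if (x < 0 || y < 0 || x >= bm->width || y >= bm->height)
        return 0;
    bm->pixels[(size_t)y * (size_t)bm->width + (size_t)x] = value;
    return 1;
}
```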
| 240 | FILES
|
---|
| 241 | * mgtic.c
|
---|
| 242 | *************************************************************
|
---|
| 243 | TITLE
|
---|
| 244 | Improved boolean tree optimising
|
---|
| 245 | APPLICATION
|
---|
| 246 | mg-1
|
---|
| 247 | TYPE
|
---|
| 248 | improve
|
---|
| 249 | REPORT
|
---|
| 250 | [email protected] - 12/Dec/94
|
---|
| 251 | FIX
|
---|
| 252 | [email protected] - 21/Dec/94, 14/Mar/95
|
---|
| 253 | CLAIM
|
---|
| 254 | Optimising by conversion to DNF is not necessarily such
|
---|
| 255 | a good idea - can actually slow things down.
|
---|
| 256 | PROBLEM
|
---|
| 257 | The distributive law used in converting to DNF
|
---|
| 258 | duplicates expressions.
|
---|
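A small worked example of the duplication, using generic terms a, b, c, d that are illustrative only: distributing "and" over "or" turns two clauses of two terms into four conjunctions, with every term now appearing twice.

```latex
(a \lor b) \land (c \lor d)
  \;=\; (a \land c) \lor (a \land d) \lor (b \land c) \lor (b \land d)
```

With n such two-term clauses the DNF has 2^n conjunctions, so a query that was cheap to evaluate in its original form can become much larger after conversion.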
| 259 | SOLUTION
|
---|
| 260 | Introduce a query environment variable, optimise_type = 0 | 1 | 2.
|
---|
| 261 | Type 0 does nothing to the parse tree.
|
---|
| 262 | Type 2 does the DNF conversion.
|
---|
| 263 | Type 1 is the new default and does the following...
|
---|
| 264 | Do simple tree rearrangement like flattening.
|
---|
| 265 | Optimise for CNF queries.
|
---|
| 266 | FILES
|
---|
| 267 | * bool_query.c, .h
|
---|
| 268 | * bool_optimiser.c
|
---|
| 269 | * environment.c
|
---|
| 270 | * invf_get.c
|
---|
| 271 | * bool_tree.c, .h
|
---|
| 272 | * bool_tester.c
|
---|
| 273 | * lists.h
|
---|
| 274 | *************************************************************
|
---|
| 275 | TITLE
|
---|
| 276 | Mgstat with non-existent files
|
---|
| 277 | APPLICATION
|
---|
| 278 | mg-1
|
---|
| 279 | TYPE
|
---|
| 280 | bug
|
---|
| 281 | REPORT
|
---|
| 282 | [email protected] - 16 May 1994
|
---|
| 283 | FIX
|
---|
| 284 | [email protected] - 10 Aug 1994
|
---|
| 285 | CLAIM
|
---|
| 286 | NaNs and Infinities would be printed out by mgstat
|
---|
| 287 | if unable to open .text or .text.dict file.
|
---|
| 288 | PROBLEM
|
---|
| 289 | The NaNs etc. were output in the column stating
|
---|
| 290 | the percentage size of the file compared with the
|
---|
| 291 | number of input bytes of the source text data.
|
---|
| 292 | If it couldn't read the .text file with its
|
---|
| 293 | header describing the number of source text bytes, then
|
---|
| 294 | in working out the percentage it would divide by zero.
|
---|
| 295 | Also due to some bad control flow, it wouldn't attempt to
|
---|
| 296 | open the .text file if it failed when opening
|
---|
| 297 | the .text.dict file.
|
---|
| 298 | SOLUTION
|
---|
| 299 | Only print out the percentage if we can read the header
|
---|
| 300 | from the .text file.
|
---|
| 301 | Read in text header irrespective of text dictionary file.
|
---|
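The guard can be sketched as below. `format_percentage`, its parameters, the "-" placeholder and the "%.2f%%" format are all assumptions made for illustration; mgstat.c's actual output format is not reproduced here.

```c
#include <stdio.h>

/*
 * Format "file_bytes as a percentage of source_bytes" into buf.
 * If the text header could not be read (have_header == 0) or reports
 * zero source bytes, write "-" instead of dividing by zero.
 * Returns buf.
 */
static const char *
format_percentage(char *buf, size_t cap, int have_header,
                  unsigned long file_bytes, unsigned long source_bytes)
{
    if (!have_header || source_bytes == 0)
        snprintf(buf, cap, "-");
    else
        snprintf(buf, cap, "%.2f%%",
                 100.0 * (double)file_bytes / (double)source_bytes);
    return buf;
}
```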
| 302 | FILES
|
---|
| 303 | * mgstat.c
|
---|
| 304 | *************************************************************
|
---|
| 305 | TITLE
|
---|
| 306 | nonexistent HOME bug
|
---|
| 307 | APPLICATION
|
---|
| 308 | mg-1, mg-2
|
---|
| 309 | TYPE
|
---|
| 310 | bug
|
---|
| 311 | REPORT
|
---|
| 312 | [email protected] - 2/May/95
|
---|
| 313 | FIX
|
---|
| 314 | [email protected] - 2/May/95
|
---|
| 315 | CLAIM
|
---|
| 316 | "The big problem was that mgquery crashes when the HOME environment
|
---|
| 317 | variable is not set, which is the case when it is run by the www server."
|
---|
| 318 | [...] "I expect it happens when looking for $HOME/.mgrc."
|
---|
| 319 | PROBLEM
|
---|
| 320 | The result of getenv("HOME") was used directly in
|
---|
| 321 | a sprintf call. If the environment variable HOME
|
---|
| 322 | did not exist then a null pointer would be used.
|
---|
| 323 | In some C libraries sprintf will convert the null
|
---|
| 324 | pointer into the string "(null)"; on others it will core dump.
|
---|
| 325 | (For example, Solaris seems to core dump, SunOS 4 seems OK).
|
---|
| 326 | SOLUTION
|
---|
| 327 | The result from getenv("HOME") is tested before
|
---|
| 328 | being used.
|
---|
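A minimal sketch of the test, assuming the path being built is $HOME/.mgrc as the reporter suspected. `build_mgrc_path` is a hypothetical helper, not the code in commands.c, and it uses snprintf rather than sprintf so that the buffer size is also checked.

```c
#include <stdio.h>

/*
 * Build the path "<home>/.mgrc" into buf, where `home` is normally the
 * result of getenv("HOME"). Returns 1 on success, 0 if `home` is NULL or
 * the path does not fit in buf.
 */
static int
build_mgrc_path(char *buf, size_t cap, const char *home)
{
    int n;

    if (home == NULL)          /* HOME not set: never pass NULL to %s */
        return 0;
    n = snprintf(buf, cap, "%s/.mgrc", home);
    return n >= 0 && (size_t)n < cap;
}
```

A caller would pass the result of getenv("HOME") straight in and skip reading the rc file when 0 comes back.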
| 329 | FILES
|
---|
| 330 | * commands.c
|
---|
| 331 | *************************************************************
|
---|
| 332 | TITLE
|
---|
| 333 | mgquery collection name preference
|
---|
| 334 | APPLICATION
|
---|
| 335 | mg-1, mg-2
|
---|
| 336 | TYPE
|
---|
| 337 | improve
|
---|
| 338 | REPORT
|
---|
| 339 | [email protected] - 2/May/95
|
---|
| 340 | FIX
|
---|
| 341 | [email protected] - 4/May/95
|
---|
| 342 | CLAIM
|
---|
| 343 | Surely something must override mgquery's preference for ./bib.
|
---|
| 344 | If MGDATA is set correctly, I think it should prefer that collection,
|
---|
| 345 | and -d should definitely override it.
|
---|
| 346 | I could always say -d . if I really wanted ./bib.
|
---|
| 347 | PROBLEM
|
---|
| 348 | Currently the priority is:
|
---|
| 349 | 1. Check if ./name is a directory,
|
---|
| 350 | If so then use it as the collection directory.
|
---|
| 351 | 2. Check if ./name.text is a file,
|
---|
| 352 | If so then use ./ as the collection directory.
|
---|
| 353 | 3. Check if mgdir/name is a directory,
|
---|
| 354 | If so then use mgdir/name as the collection directory.
|
---|
| 355 | 4. Otherwise,
|
---|
| 356 | Use mgdir/name as the database file prefix.
|
---|
| 357 | This would be the case if one used "-f alice/alice".
|
---|
| 358 | However, one would then not specify a final name argument
|
---|
| 359 | and we'd never get here. Go figure ???
|
---|
| 360 | SOLUTION
|
---|
| 361 | Moved step 3 to the top instead.
|
---|
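The resulting search order can be sketched as follows. This is an illustration only: `find_collection`, `is_dir` and `is_file` are hypothetical names, the POSIX stat() call is assumed to be available, and the real logic lives in search_for_collection() in mgquery.c.

```c
#include <stdio.h>
#include <sys/stat.h>

static int is_dir(const char *p)
{
    struct stat st;
    return stat(p, &st) == 0 && S_ISDIR(st.st_mode);
}

static int is_file(const char *p)
{
    struct stat st;
    return stat(p, &st) == 0 && S_ISREG(st.st_mode);
}

/*
 * Decide where collection `name` lives, trying mgdir first.
 * Writes the chosen collection directory (or file prefix, for the
 * fallback case) into out and returns which rule matched:
 *   1 = mgdir/name is a directory     (formerly step 3)
 *   2 = ./name is a directory
 *   3 = ./name.text is a file
 *   4 = fallback: mgdir/name as a file prefix
 */
static int
find_collection(const char *mgdir, const char *name, char *out, size_t cap)
{
    char probe[1024];

    snprintf(probe, sizeof probe, "%s/%s", mgdir, name);
    if (is_dir(probe)) { snprintf(out, cap, "%s", probe); return 1; }

    snprintf(probe, sizeof probe, "./%s", name);
    if (is_dir(probe)) { snprintf(out, cap, "%s", probe); return 2; }

    snprintf(probe, sizeof probe, "./%s.text", name);
    if (is_file(probe)) { snprintf(out, cap, "./"); return 3; }

    snprintf(out, cap, "%s/%s", mgdir, name);
    return 4;
}
```

With this order a collection under mgdir wins over a directory of the same name in the current directory, which is what the report asked for.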
| 362 | FILES
|
---|
| 363 | * mgquery.c [search_for_collection()]
|
---|
| 364 | *************************************************************
|
---|
| 365 | TITLE
|
---|
| 366 | Printout of query terms
|
---|
| 367 | APPLICATION
|
---|
| 368 | mg-1, mg-2
|
---|
| 369 | TYPE
|
---|
| 370 | extend
|
---|
| 371 | REPORT
|
---|
| 372 | [email protected] - April 95
|
---|
| 373 | FIX
|
---|
| 374 | [email protected] - April 95
|
---|
| 375 | CLAIM
|
---|
| 376 | No easy way to find out the parsed and stemmed words
|
---|
| 377 | used in the query. Would like to know these words
|
---|
| 378 | so I can call a separate highlighting program to
|
---|
| 379 | highlight these words.
|
---|
| 380 | PROBLEM
|
---|
| 381 | No facility available.
|
---|
| 382 | SOLUTION
|
---|
| 383 | A ".queryterms" mgquery command was added which lists
|
---|
| 384 | out the parsed/stemmed queryterms of the last query.
|
---|
| 385 | FILES
|
---|
| 386 | * commands.c (added CmdQueryTerms)
|
---|
| 387 | *************************************************************
|
---|
| 388 | TITLE
|
---|
| 389 | mg_getrc
|
---|
| 390 | APPLICATION
|
---|
| 391 | mg-1, mg-2
|
---|
| 392 | TYPE
|
---|
| 393 | extend
|
---|
| 394 | REPORT
|
---|
| 395 | [email protected] - 2/May/95
|
---|
| 396 | FIX
|
---|
| 397 | -
|
---|
| 398 | CLAIM
|
---|
| 399 | Repeated code had to be written for differently named
|
---|
| 400 | gets, although really the same type of parsing was required.
|
---|
| 401 | E.g. one might want to use a standard method for inserting
|
---|
| 402 | ^Bs between paragraphs for different books. One doesn't
|
---|
| 403 | want to write duplicate code for each different named book,
|
---|
| 404 | rather note that each book should be filtered "book" style.
|
---|
| 405 | PROBLEM
|
---|
| 406 | There was no way of abstracting out types of filters from
|
---|
| 407 | the name of an instance of a collection.
|
---|
| 408 | SOLUTION
|
---|
| 409 | Allow information to be given with <name, type, files>.
|
---|
| 410 | This extra info can be provided in a mg_getrc file.
|
---|
| 411 | See man page for mg_get for details.
|
---|
| 412 | FILES
|
---|
| 413 | * mg_get.sh
|
---|
| 414 | *************************************************************
|
---|
| 415 | TITLE
|
---|
| 416 | Boolean optimiser #1 with `!'
|
---|
| 417 | APPLICATION
|
---|
| 418 | mg-1, mg-2
|
---|
| 419 | TYPE
|
---|
| 420 | bug
|
---|
| 421 | REPORT
|
---|
| 422 | [email protected] - 20/7/95
|
---|
| 423 | FIX
|
---|
| 424 | [email protected] - 27/7/95
|
---|
| 425 | CLAIM
|
---|
| 426 | Complained about not-nodes.
|
---|
| 427 | e.g. complained about "croquet & !hedgehog"
|
---|
| 428 | PROBLEM
|
---|
| 429 | Boolean optimiser type#1 didn't convert
|
---|
| 430 | "and not"s into diff nodes.
|
---|
| 431 | SOLUTION
|
---|
| 432 | Added code to convert '&!' to '-'.
|
---|
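The rewrite amounts to a small tree transformation, sketched here with a hypothetical node type. `struct bnode` and `and_not_to_diff` are not the names used in bool_optimiser.c, and only a "not" on the right-hand side of a binary "and" is handled.

```c
#include <stddef.h>

enum bnode_tag { B_TERM, B_NOT, B_AND, B_DIFF };

struct bnode {
    enum bnode_tag tag;
    struct bnode *left, *right;   /* B_NOT uses left only */
};

/*
 * Rewrite every "x & !y" in the tree as "x - y" (a diff node), in place.
 * The unlinked NOT node is not freed here, to keep the sketch short.
 */
static void
and_not_to_diff(struct bnode *n)
{
    if (n == NULL)
        return;
    and_not_to_diff(n->left);
    and_not_to_diff(n->right);

    if (n->tag == B_AND && n->right != NULL && n->right->tag == B_NOT) {
        n->tag = B_DIFF;
        n->right = n->right->left;   /* drop the NOT, keep its operand */
    }
}
```

For "croquet & !hedgehog" the root becomes a diff node whose children are the two terms, so no bare not-node is left for the later stages to complain about.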
| 433 | FILES
|
---|
| 434 | * mg/bool_optimiser.c [mg-1]
|
---|
| 435 | * query/bool_optimiser.c [mg-2]
|
---|
| 436 | *************************************************************
|
---|
| 437 | TITLE
|
---|
| 438 | Consistent use of stderr
|
---|
| 439 | APPLICATION
|
---|
| 440 | mg-1
|
---|
| 441 | TYPE
|
---|
| 442 | improve
|
---|
| 443 | REPORT
|
---|
| 444 | [email protected] - 16 May 1994
|
---|
| 445 | FIX
|
---|
| 446 | [email protected] - 11 August 1994
|
---|
| 447 | CLAIM
|
---|
| 448 | Inconsistent use of stdout/stderr in usage messages.
|
---|
| 449 | PROBLEM
|
---|
| 450 | Sometimes used "printf" and sometimes used "fprintf(stderr"
|
---|
| 451 | in usage messages.
|
---|
| 452 | SOLUTION
|
---|
| 453 | All should now use "fprintf(stderr" in usage messages.
|
---|
| 454 | FILES
|
---|
| 455 | * mg_compression_dict.c
|
---|
| 456 | * mg_compression_dict.1
|
---|
| 457 | * mg_fast_comp_dict.c
|
---|
| 458 | * mg_fast_comp_dict.1
|
---|
| 459 | * mg_invf_dict.c
|
---|
| 460 | * mg_invf_dict.1
|
---|
| 461 | * mg_invf_dump.c
|
---|
| 462 | * mg_invf_dump.1
|
---|
| 463 | * mg_invf_rebuild.c
|
---|
| 464 | * mg_invf_rebuild.1
|
---|
| 465 | * mg_perf_hash_build.c
|
---|
| 466 | * mg_perf_hash_build.1
|
---|
| 467 | * mg_text_estimate.c
|
---|
| 468 | * mg_text_estimate.1
|
---|
| 469 | * mg_weights_build.c
|
---|
| 470 | * mg_weights_build.1
|
---|
| 471 | *************************************************************
|
---|
| 472 | TITLE
|
---|
| 473 | xmg bug
|
---|
| 474 | APPLICATION
|
---|
| 475 | mg-1
|
---|
| 476 | TYPE
|
---|
| 477 | bug
|
---|
| 478 | REPORT
|
---|
| 479 | [email protected] - 22 April 1994
|
---|
| 480 | FIX
|
---|
| 481 | [email protected] - 22 April 1994
|
---|
| 482 | CLAIM
|
---|
| 483 | "Serious problem in xmg, which I fear occurs whenever a query
|
---|
| 484 | doesn't return anything."
|
---|
| 485 | PROBLEM
|
---|
| 486 | ??
|
---|
| 487 | SOLUTION
|
---|
| 488 | [xmg.sh 201] set rank 0
|
---|
| 489 | FILES
|
---|
| 490 | * xmg.sh
|
---|
| 491 | *************************************************************
|
---|
| 492 | TITLE
|
---|
| 493 | Unnecessary loading of text
|
---|
| 494 | APPLICATION
|
---|
| 495 | mg-1
|
---|
| 496 | TYPE
|
---|
| 497 | bug
|
---|
| 498 | REPORT
|
---|
| 499 | [email protected] - ?? August 1994
|
---|
| 500 | FIX
|
---|
| 501 | [email protected] - 12 August 1994
|
---|
| 502 | CLAIM
|
---|
| 503 | Mg was loading and uncompressing text when the
|
---|
| 504 | query did not require the text.
|
---|
| 505 | PROBLEM
|
---|
| 506 | There was no test for the query mode
|
---|
| 507 | before loading and uncompressing the text.
|
---|
| 508 | SOLUTION
|
---|
| 509 | Only load/uncompress text if query mode
|
---|
| 510 | is for text, headers or silent (for timing).
|
---|
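The added test is essentially a predicate on the query mode. In the sketch below the enum and its members are hypothetical: text, headers and silent come from the description above, while MODE_DOCNUMS and MODE_COUNT are invented here purely to stand for modes that do not need the text.

```c
/* Hypothetical query modes; mgquery's real mode names are not shown here. */
enum query_mode { MODE_TEXT, MODE_HEADERS, MODE_SILENT, MODE_DOCNUMS, MODE_COUNT };

/* Return 1 if the mode needs the compressed text loaded and decoded. */
static int
mode_needs_text(enum query_mode mode)
{
    switch (mode) {
    case MODE_TEXT:
    case MODE_HEADERS:
    case MODE_SILENT:     /* silent still decodes, so timings stay meaningful */
        return 1;
    default:
        return 0;
    }
}
```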
| 511 | FILES
|
---|
| 512 | * mgquery.c
|
---|
| 513 | *************************************************************
|
---|
| 514 | TITLE
|
---|
| 515 | Man page errors
|
---|
| 516 | APPLICATION
|
---|
| 517 | mg-1
|
---|
| 518 | TYPE
|
---|
| 519 | bug
|
---|
| 520 | REPORT
|
---|
| 521 | [email protected] - 16 May 1994
|
---|
| 522 | FIX
|
---|
| 523 | [email protected] - 16 May 1994
|
---|
| 524 | CLAIM
|
---|
| 525 | Man page errors.
|
---|
| 526 | PROBLEM
|
---|
| 527 | See below.
|
---|
| 528 | SOLUTION
|
---|
| 529 | "The mg_make_fast_dict.1 file has been renamed mg_fast_comp_dict.1,
|
---|
| 530 | and all mg_make_fast_dict strings changed to mg_fast_comp_dict in all
|
---|
| 531 | man pages.
|
---|
| 532 | A large number of errors of spelling, typography, spacing, fonts,
|
---|
| 533 | grammar, omitted words, slang, punctuation, missing man page
|
---|
| 534 | cross-references, and man-page style have been corrected."
|
---|
| 535 | FILES
|
---|
| 536 | * mg_compression_dict.1
|
---|
| 537 | * mg_fast_comp_dict.1
|
---|
| 538 | * mg_get.1
|
---|
| 539 | * mg_invf_dict.1
|
---|
| 540 | * mg_invf_dump.1
|
---|
| 541 | * mg_invf_rebuild.1
|
---|
| 542 | * mg_passes.1
|
---|
| 543 | * mg_perf_hash_build.1
|
---|
| 544 | * mg_text_estimate.1
|
---|
| 545 | * mg_weights_build.1
|
---|
| 546 | * mgbilevel.1
|
---|
| 547 | * mgbuild.1
|
---|
| 548 | * mgdictlist.1
|
---|
| 549 | * mgfelics.1
|
---|
| 550 | * mgquery.1
|
---|
| 551 | * mgstat.1
|
---|
| 552 | * mgtic.1
|
---|
| 553 | * mgticbuild.1
|
---|
| 554 | * mgticdump.1
|
---|
| 555 | * mgticprune.1
|
---|
| 556 | * mgticstat.1
|
---|
| 557 | * xmg.1
|
---|
| 558 | *************************************************************
|
---|
| 559 | TITLE
|
---|
| 560 | Man page overview
|
---|
| 561 | APPLICATION
|
---|
| 562 | mg-1
|
---|
| 563 | TYPE
|
---|
| 564 | extend
|
---|
| 565 | REPORT
|
---|
| 566 | [email protected] -
|
---|
| 567 | FIX
|
---|
| 568 | [email protected] - 17 August 1994
|
---|
| 569 | CLAIM
|
---|
| 570 | "Write new mg.1 file to give a brief overview of mg, with samples
|
---|
| 571 | of how to use it. Otherwise, users are likely to be completely
|
---|
| 572 | overwhelmed by the number of programs (about 20) which might need to
|
---|
| 573 | be used, when in reality, only 2 or 3 are likely to be run by end
|
---|
| 574 | users."
|
---|
| 575 | SOLUTION
|
---|
| 576 | It was thought that mg.1, written by Nelson Beebe, was very useful
|
---|
| 577 | but a bit too comprehensive for an introduction.
|
---|
| 578 | Therefore, two man files, mgintro.1 and mgintro++.1 were written
|
---|
| 579 | with the basic stuff in mgintro.1 and slightly more advanced stuff
|
---|
| 580 | in mgintro++.1 .
|
---|
| 581 | FILES
|
---|
| 582 | * mg.1
|
---|
| 583 | * mgintro.1
|
---|
| 584 | * mgintro++.1
|
---|
| 585 | *************************************************************
|
---|
| 586 | TITLE
|
---|
| 587 | Parse errors not bus errors
|
---|
| 588 | APPLICATION
|
---|
| 589 | mg-1
|
---|
| 590 | TYPE
|
---|
| 591 | bug
|
---|
| 592 | REPORT
|
---|
| 593 | [email protected] - 2 Jun 94
|
---|
| 594 | FIX
|
---|
| 595 | [email protected] - 19 Aug 94
|
---|
| 596 | CLAIM
|
---|
| 597 | "These two queries
|
---|
| 598 | (which I typed in before I knew what I was doing!!)
|
---|
| 599 | > The Queen of Hearts, she made some tarts
|
---|
| 600 | > "The Queen of Hearts" and "she made some tarts"
|
---|
| 601 | produced the following result:
|
---|
| 602 | mgquery : parse error
|
---|
| 603 | Bus error
|
---|
| 604 | "
|
---|
| 605 | PROBLEM
|
---|
| 606 | What is expected to happen under boolean querying:
|
---|
| 607 | Query1:
|
---|
| 608 | > The Queen of Hearts, she made some tarts
|
---|
| 609 | will produce a parse error due to the comma which
|
---|
| 610 | is not a valid TERM.
|
---|
| 611 | Query2:
|
---|
| 612 | > "The Queen of Hearts" and "she made some tarts"
|
---|
| 613 | will store a post-processing string
|
---|
| 614 | of ''The Queen of Hearts" and "she made some tarts'' and
|
---|
| 615 | will have a main boolean query of the empty string.
|
---|
| 616 | This is because the postprocessing string takes in
|
---|
| 617 | everything between the first quote and the last one.
|
---|
| 618 | An empty string is illegal for the boolean grammar and
|
---|
| 619 | hence a parse error.
|
---|
| 620 | The problem stems from the fact that the processing of
|
---|
| 621 | the parse tree is carried out, even though we have a
|
---|
| 622 | parse error. In the case of using an empty string to build
|
---|
| 623 | a parse tree, it is likely to leave the parse tree undefined.
|
---|
| 624 | SOLUTION
|
---|
| 625 | As soon as we find out that there is a parse-error,
|
---|
| 626 | we abandon any processing of the parse tree.
|
---|
| 627 | FILES
|
---|
| 628 | * query.bool.y
|
---|
| 629 | * query.bool.c (generated from query.bool.y)
|
---|
| 630 | *************************************************************
|
---|
| 631 | TITLE
|
---|
| 632 | Perfect hashing on small vocab
|
---|
| 633 | APPLICATION
|
---|
| 634 | mg-1
|
---|
| 635 | TYPE
|
---|
| 636 | bug
|
---|
| 637 | REPORT
|
---|
| 638 | [email protected] - July 1994
|
---|
| 639 | FIX
|
---|
| 640 | [email protected] - July 1994
|
---|
| 641 | CLAIM
|
---|
| 642 | Mg could not handle small collections in the case
|
---|
| 643 | where there was only a small number of unique words.
|
---|
| 644 | The perfect hash function would report an error.
|
---|
| 645 | PROBLEM
|
---|
| 646 | Rounding of the arithmetic during the calculation of the
|
---|
| 647 | parameters of the perfect hash function was resulting in a
|
---|
| 648 | combination of values such that the probability of a hash
|
---|
| 649 | function being found was very small. This led to the limit
|
---|
| 650 | on the generation loop being exceeded, and eventual
|
---|
| 651 | failure.
|
---|
| 652 | SOLUTION
|
---|
| 653 | By using ceiling rather than floor when converting from a
|
---|
| 654 | floating point value to an integer parameter, the arithmetic
|
---|
| 655 | is now correct for all lexicon sizes, and the probability of
|
---|
| 656 | each iteration successfully generating a hash function is
|
---|
| 657 | sufficiently great that with _very_ high probability the
|
---|
| 658 | execution loop counter will not be exceeded unless there
|
---|
| 659 | genuinely is no hash function (for example, if the lexicon
|
---|
| 660 | contains two words the same there cannot be a hash
|
---|
| 661 | function).
|
---|
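The effect of the rounding direction is easiest to see on small values. The helpers below are hypothetical and do not reproduce the parameter formula in perf_hash.c; they only contrast the two conversions for non-negative inputs.

```c
/* Round a non-negative value down, as a plain cast to int does. */
static int
floor_to_int(double x)
{
    return (int)x;
}

/* Round a non-negative value up to the next integer. */
static int
ceil_to_int(double x)
{
    int k = (int)x;
    return ((double)k < x) ? k + 1 : k;
}
```

For a large lexicon the two differ by at most one in a big number, which hardly matters. For a tiny lexicon a computed value such as 0.8 becomes 0 with floor but 1 with ceiling, and a parameter of 0 or one that is too small is the sort of value that can make a suitable hash function very unlikely to be found.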
| 662 | FILES
|
---|
| 663 | * perf_hash.c
|
---|
| 664 | *************************************************************
|
---|