Ticket #848 (closed defect: fixed)

Opened 5 years ago

Last modified 5 years ago

mgpp crash: bit buffer overrun

Reported by: kjdon
Owned by: nobody
Priority: high
Milestone: 3.05 Release
Component: Collection Building
Severity: major
Keywords:
Cc:

Description

Diego had some very messy PDFs containing strings like "fffffffff" (repeated to around 500 characters), which triggered word splitting at max_stem_len, which is 255.
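To make the splitting concrete, here is a minimal sketch of cutting an over-long token into pieces no longer than a maximum length. The function name split_lengths and its parameters are made up for illustration, and cutting greedily from the front is an assumption; the ticket only says that splitting happens at max_stem_len.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the mgpp code: record the lengths of the
   pieces an n-character token is cut into when no piece may exceed
   max_len. Returns the number of pieces, or 0 if they would not fit
   in piece_lens (an empty token also yields 0 pieces). */
static size_t split_lengths(size_t n, size_t max_len,
                            size_t *piece_lens, size_t max_pieces)
{
    size_t count = 0;
    assert(max_len > 0);
    while (n > 0) {
        /* Take as much as allowed, or whatever is left. */
        size_t take = n < max_len ? n : max_len;
        if (count == max_pieces)
            return 0;
        piece_lens[count++] = take;
        n -= take;
    }
    return count;
}
```

Under these assumptions, a 500-character run with a limit of 255 gives pieces of 255 and 245 characters, so the indexer sees a word of exactly the maximum length.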

Building an mgpp index would crash the build, with a bit buffer overrun.

Change History

Changed 5 years ago by kjdon

  • status changed from new to closed
  • resolution set to fixed

The overrun happened when generating the list of gaps for each word: there was not enough space allocated to store all the gaps.

The problem stemmed from a bug in utf8_word_to_unicode: it was comparing inlen instead of outlen with max_output_length. For a 255-character word, inlen was greater than max_output_length, so the conversion never happened. As a result, in another part of the code that adds up how many occurrences of each word there are, all the 255-character words (which were different) got added into the same count. Later on, when they were treated as separate words, the right counts weren't available.
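The difference between the two checks can be sketched as below. This is a simplified stand-in, not the actual utf8_word_to_unicode: the names convert_buggy, convert_fixed and decode_utf8_char are made up, the signature is assumed, and the decoder only handles 1 to 3 byte sequences. Both strings are length-prefixed (element 0 holds the length). The failing words in this ticket were 255 bytes long; the sketch is easiest to try with a short multi-byte word, where inlen and outlen clearly differ.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one UTF-8 sequence (1 to 3 bytes) starting at p.
   Returns the number of bytes consumed, or 0 if the sequence is
   malformed or runs past the available bytes. Overlong encodings
   are not rejected; this is only an illustration. */
static size_t decode_utf8_char(const unsigned char *p, size_t avail,
                               unsigned short *cp)
{
    if (avail >= 1 && p[0] < 0x80) {
        *cp = p[0];
        return 1;
    }
    if (avail >= 2 && (p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
        *cp = (unsigned short)(((p[0] & 0x1F) << 6) | (p[1] & 0x3F));
        return 2;
    }
    if (avail >= 3 && (p[0] & 0xF0) == 0xE0 &&
        (p[1] & 0xC0) == 0x80 && (p[2] & 0xC0) == 0x80) {
        *cp = (unsigned short)(((p[0] & 0x0F) << 12) |
                               ((p[1] & 0x3F) << 6) | (p[2] & 0x3F));
        return 3;
    }
    return 0;
}

/* Fixed variant: the limit is checked against outlen, the number of
   characters produced so far. in[0] is the input length in bytes;
   out[0] receives the output length in characters. out must have
   room for max_output_length + 1 elements. Returns 1 on success,
   0 on failure (out[0] is then 0). */
static int convert_fixed(const unsigned char *in, unsigned short *out,
                         size_t max_output_length)
{
    size_t inlen, pos = 1, outlen = 0;
    assert(in != NULL && out != NULL);
    inlen = in[0];
    out[0] = 0;
    while (pos <= inlen) {
        unsigned short cp;
        size_t used = decode_utf8_char(in + pos, inlen - pos + 1, &cp);
        if (used == 0 || outlen >= max_output_length)
            return 0;
        out[++outlen] = cp;
        pos += used;
    }
    out[0] = (unsigned short)outlen;
    return 1;
}

/* Buggy variant: the limit is checked against inlen, the number of
   input bytes, so a word can be rejected before any conversion. */
static int convert_buggy(const unsigned char *in, unsigned short *out,
                         size_t max_output_length)
{
    assert(in != NULL && out != NULL);
    if ((size_t)in[0] > max_output_length) {
        out[0] = 0;   /* word is left unconverted */
        return 0;
    }
    return convert_fixed(in, out, max_output_length);
}
```

For example, a word of four two-byte characters has inlen 8 and outlen 4. With max_output_length set to 5, convert_buggy rejects it and leaves an empty result, while convert_fixed converts it. Every rejected word ending up with the same empty result is the kind of collision that would lump distinct words into one count.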

Fixing this also exposed an off-by-one error in max_output_length: a mismatch between what the code was expecting and what the calling code was passing. Position 0 of the string/array holds the length. Should that char be included in the length or not? It is now consistently the maximum length of the string, NOT including the length char.
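A minimal sketch of that convention, assuming a one-byte length slot: the buffer holds the maximum length plus one element, and every limit passed around counts characters only. MAX_WORD_LEN, word_buf and store_word are hypothetical names used for illustration, not identifiers from the mgpp source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Maximum number of characters in a word, NOT counting the length slot. */
enum { MAX_WORD_LEN = 255 };

/* Element 0 holds the length, elements 1..MAX_WORD_LEN hold the characters. */
typedef unsigned char word_buf[MAX_WORD_LEN + 1];

/* Store up to max_len characters of src into dst as a length-prefixed
   string and return the number of characters stored. dst must have
   room for max_len + 1 elements. */
static size_t store_word(unsigned char *dst, size_t max_len,
                         const char *src, size_t n)
{
    assert(max_len <= 255);   /* the length must fit in one byte */
    if (n > max_len)
        n = max_len;          /* truncate to the character limit */
    dst[0] = (unsigned char)n;
    memcpy(dst + 1, src, n);
    return n;
}
```

With this layout a caller passes MAX_WORD_LEN (255) as the limit and allocates 256 elements, so both sides agree that the length slot is extra.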
