dodgy text_t functions
|Reported by:||kjdon||Owned by:||kjdon|
|Priority:||moderate||Milestone:||Greenstone 2 wishlist|
From an email from Michael:
We've recently found that these functions (findchar, findlastchar, findword, and maybe others in the text_t class) are a bit dodgy. It's not really clear what should be done about these, and unfortunately I don't have any time to work on them. Hopefully someone there can look into this.
The worst function is findlastchar(). The problem is that it dereferences the "last" iterator -- but in most cases this will be a text_t end() value, in which case Greenstone will crash if the OS is strict about checking memory access. It seems that the best way to fix this is to just change findlastchar() so it decrements the last iterator before starting the loop. However, you can't do this because the last iterator might not be a text_t end() value -- in which case you're not searching the full range that was specified.
The other find functions (e.g. findchar) are a bit different -- the problem with these is the return value. If you're not passing in a text_t end() value as the "last" parameter, there is no way to tell if the character you're searching for is at the end of the range specified -- because you get the same value back in this case or if the character doesn't appear at all.
Things would be much cleaner if these functions just took a text_t as the parameter, instead of two iterators. In this case you could write implementations of these find functions that always worked properly and returned sensible values. I think this would cover about 80% of the usage of these functions too -- in most cases text_t.begin() and text_t.end() are used as the arguments. It's the other cases that are the problem, and I don't know what to do about those. Maybe all we can do is mark the two iterator versions of the find functions as being deprecated, and add comments explaining their shortcomings.
I guess the other approach is that we clearly mark these functions as taking "first" and "last_plus_one" iterators, since this is what is being provided in most cases. If it is clear that the last character isn't being considered then the return value problem goes away, and we can fix findlastchar() as well. In this case I think we just need to change the "last" variable names in the functions (so very inaccurate), add comments to explain what is being expected for this argument, and fix findlastchar().