PDFBox text conversion
Reported by: kjdon | Owned by: nobody
Description
Diego has a PDF file. When you convert it to text using PDFBox, the output is invalid for Lucene. MGPP handles it OK, presumably because it does not try to parse the text as XML.
In the HTML case, &# is converted to &amp;#. In the text case, &# is left as &#.
In the text case, the unescaped &# goes through to Lucene, which then complains because &# should be the start of a numeric character reference (entity).
Can we change the output? Maybe it doesn't make sense to offer conversion to text at all, since we always put the output inside XML.
A simple test file containing the content "Katherine was here &# some chars." also fails with the same error.
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 231; A decimal representation must immediately follow the "&#
MGPP processes it OK.
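The failure can probably be reproduced outside Greenstone with the JDK's built-in SAX parser. The following is a minimal sketch, not the actual Lucene indexing code path: the class name, the `parseError` helper and the `<doc>` wrapper element are all hypothetical, and the real document XML will look different.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class BareAmpHashRepro {

    /** Returns the parser's error message, or null if the text parsed cleanly. */
    static String parseError(String text) throws Exception {
        // Hypothetical wrapper element; it only stands in for the XML
        // that the extracted text ends up embedded in.
        String xml = "<doc>" + text + "</doc>";
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            // DefaultHandler rethrows fatal errors as SAXParseException.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Unescaped "&#" as produced in the text case.
        System.out.println(parseError("Katherine was here &# some chars."));
        // Escaped "&amp;#" as produced in the HTML case.
        System.out.println(parseError("Katherine was here &amp;# some chars."));
    }
}
```

If the first call reports an error and the second returns null, that would suggest the problem is the unescaped ampersand rather than anything specific to the PDF.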
Maybe this is not a plugin-specific thing, and instead we need to encode more before passing text through to Lucene?
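One possible shape for that extra encoding step is sketched below. This is only an illustration, assuming the plain text is escaped once, just before it is embedded in XML. `XmlTextEscaper` and `escapeXml` are hypothetical names, not existing Greenstone code, and the sketch does not deal with control characters that are illegal in XML.

```java
class XmlTextEscaper {

    /** Escapes the characters that are markup-significant in XML text content. */
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Katherine was here &amp;# some chars.
        System.out.println(escapeXml("Katherine was here &# some chars."));
    }
}
```

This would only be safe on the text path. Applying it to output that is already escaped, as in the HTML case, would double-escape it (&amp;# would become &amp;amp;#).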
Note that the conversion output is actually wrong for Diego's document (it looks like + '+1&- ,+-.# *# #9+/ #$ /7- *#$ #$.+ 1 #$ /7- *# ), but regardless of that, it shouldn't cause Lucene to fail.
The file was too big to upload here. I have put it at files.greenstone.org:/greenstone/files/testfiles/Diego-test.pdf