Opened 6 years ago

Last modified 3 years ago

#937 new defect

PDFBox text conversion

Reported by: kjdon
Owned by: nobody
Priority: moderate
Milestone: 3.11 Release
Component: Collection Building
Severity: minor
Keywords:
Cc:

Description (last modified by kjdon)

Diego has a PDF file. When it is converted to text using PDFBox, the output is invalid for Lucene. MGPP handles it OK, presumably because it does not try to parse the text as XML.

&# -> &amp;# in the HTML case; &# -> &# in the text case.

In the text case, the &# goes through to Lucene unchanged, and Lucene then complains because &# should be the start of a character reference.

Can we change the output? Maybe the convert-to-text option doesn't make sense, as we are always putting the output inside XML.

A simple test file containing the content "Katherine was here &# some chars." also fails with the same error.

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 231; A decimal representation must immediately follow the "&#" in a character reference.

MGPP processes it OK.
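The failure can probably be reproduced outside Greenstone with the JDK's own SAX parser. The following is a minimal sketch, not Greenstone code: the class name `RawAmpersandRepro`, the helper `parseError`, and the `<Doc>` wrapper element are all hypothetical and stand in for whatever XML the Lucene indexer actually receives.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class RawAmpersandRepro {
    // Returns null if the XML is well-formed, otherwise the parser's error message.
    static String parseError(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        String text = "Katherine was here &# some chars.";
        // <Doc> is a hypothetical wrapper, not Greenstone's real index format.
        // The raw "&#" inside element content should trigger a parse error.
        System.out.println(parseError("<Doc>" + text + "</Doc>"));
    }
}
```

If the sketch behaves like the ticket describes, the printed message is the parser's complaint about the malformed character reference; the exact column number will differ from the one reported above.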

Maybe this is not a plugin-specific thing, but rather we need to encode more before passing text through to Lucene?
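One way to "encode more" would be to escape XML-significant characters in the converted text before it is wrapped in XML. The sketch below only illustrates that idea; it is not taken from Greenstone or from any changeset, and the class name `XmlEscapeSketch` and method `escapeXml` are hypothetical.

```java
public class XmlEscapeSketch {
    // Escapes the characters that are significant in XML element content,
    // so a stray "&#" can no longer be read as the start of a character reference.
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Katherine was here &amp;# some chars.
        System.out.println(escapeXml("Katherine was here &# some chars."));
    }
}
```

Escaping at the point where text is placed inside XML, rather than inside each plugin, would cover every converter's output at once, which fits the suspicion that the problem is not plugin-specific.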

Note that the conversion output is actually wrong for Diego's document (looks like + '+1&- ,+-.# *# #9+/ #$ /7- *#$ #$.+ 1 #$ /7- *# ), but regardless of that, it shouldn't cause Lucene to fail.

The file was too big to upload here. I have put it at files.greenstone.org:/greenstone/files/testfiles/Diego-test.pdf

Change History (3)

comment:1 by kjdon, 6 years ago

Description: modified (diff)

comment:2 by ak19, 6 years ago

I think this was the solution:

http://trac.greenstone.org/changeset/32089

comment:3 by kjdon, 3 years ago

Milestone: 3.10 Release → 3.11 Release

Ticket retargeted after milestone closed
