Apache Tika for document conversion
|Reported by:||kjdon||Owned by:||nobody|
|Priority:||moderate||Milestone:||Possible 2.88 Release|
|Component:||Collection Building: Plugins||Severity:||major|
We have talked about using Open Office to do document conversion. Doug Carter suggested Apache Tika as an alternative:
I had a lot of trouble using Open Office for conversion, but found a better solution using Apache Tika:
I put a shell wrapper around a Java command line and created an ooxmltohtml.pl script that handles nearly all of the Office 2007 document formats. I hacked gsConvert.pl to include the new doc types and added a single new plugin, OOXMLPlug, to handle the importing.
It works great, but performance is a bit of an issue. If the performance problem were addressed, you could probably dump nearly all of the proprietary document converters.
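For reference, the kind of wrapper described above might look roughly like the sketch below. This is not Doug's actual script (which is a shell wrapper plus ooxmltohtml.pl); it is a minimal Python illustration, assuming the Tika application jar is available locally and using its `--html` command-line option. The jar path and the function names `tika_command` and `convert_to_html` are hypothetical.

```python
import subprocess

# Assumption: location of the Tika application jar; adjust for your install.
TIKA_JAR = "tika-app.jar"


def tika_command(src, jar=TIKA_JAR):
    """Build the java command line that asks Tika for HTML output on stdout."""
    return ["java", "-jar", str(jar), "--html", str(src)]


def convert_to_html(src, dest, jar=TIKA_JAR):
    """Run Tika on src and write the HTML it prints to dest.

    Raises subprocess.CalledProcessError if java exits with a non-zero status.
    """
    with open(dest, "wb") as out:
        subprocess.run(tika_command(src, jar), stdout=out, check=True)
    return dest
```

Calling `convert_to_html("report.docx", "report.html")` would start one JVM for that single document. Starting a JVM per file is generally expensive, so this may be one contributor to the performance issue mentioned above, although the ticket does not say where the time actually goes.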
See also #426.