Ticket #664 (new feature)

Opened 7 years ago

Last modified 6 years ago

Apache Tika for document conversion

Reported by: kjdon
Owned by: nobody
Priority: moderate
Milestone: 2.87 Release
Component: Collection Building: Plugins
Severity: major
Keywords:
Cc:

Description

We have talked about using Open Office to do document conversion. Doug Carter suggested Apache Tika as an alternative:

***

I had a lot of trouble getting Open Office to work for conversion, but found a better solution using Apache Tika:

 http://lucene.apache.org/tika/

I put a shell wrapper around a java command line and created an ooxmltohtml.pl script that handles nearly all of the Office 2007 document formats. I hacked gsConvert.pl to include the new doc types, and added a single new plugin, OOXMLPlug, to handle the importing.

It works great, but performance is a bit of an issue. If the performance problem were addressed, you could probably dump nearly all of the proprietary document converters.

***
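
For illustration only, the kind of wrapper described above might look roughly like the sketch below. It is not Doug's actual script, which is not attached to this ticket. It assumes the standalone tika-app jar and its --html output option; the jar location and both function names are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumption: path to the tika-app jar; adjust for the local install.
TIKA_JAR = Path("tika-app.jar")


def build_tika_command(source, jar=TIKA_JAR, java="java"):
    """Return the argument list that asks tika-app to print `source` as HTML."""
    return [java, "-jar", str(jar), "--html", str(source)]


def convert_to_html(source, target, jar=TIKA_JAR):
    """Run Tika on `source` and save the HTML it writes to stdout in `target`."""
    result = subprocess.run(
        build_tika_command(source, jar),
        capture_output=True,  # collect stdout/stderr instead of printing them
        check=True,           # raise CalledProcessError if the conversion fails
    )
    target = Path(target)
    target.write_bytes(result.stdout)
    return target
```

A call such as convert_to_html("report.docx", "report.html") would start one JVM per document. That start-up cost may be part of the performance issue mentioned above, although the ticket does not say where the time actually goes.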

see also #426

Change History

Changed 7 years ago by kjdon

  • milestone changed from Collection building wishlist to 2.84 Release

Changed 7 years ago by kjdon

  • milestone changed from 2.84 Release to 2.85 Release

Changed 6 years ago by sjm84

  • milestone changed from 2.85 Release to 2.86 Release