Opened 14 years ago

Last modified 13 years ago

#664 new feature

Apache Tika for document conversion

Reported by: kjdon
Owned by: nobody
Priority: moderate
Milestone: Possible 2.88 Release
Component: Collection Building: Plugins
Severity: major
Keywords:
Cc:

Description

We have talked about using OpenOffice to do document conversion. Doug Carter suggested Apache Tika as an alternative:

*

I had a lot of trouble messing with the Open Office for conversion, but found a better solution using Apache Tika:

http://lucene.apache.org/tika/

I put a shell wrapper around a Java command line and created an ooxmltohtml.pl script that handles nearly all of the Office 2007 document formats. I hacked gsConvert.pl to include the new doc types, and added a single new plugin, OOXMLPlug, to handle the importing.

It works great, but performance is a bit of an issue. If the performance problem were addressed, you could probably dump nearly all of the proprietary document converters.

*
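For anyone picking this up, here is a rough sketch of what a wrapper around the Tika command line might look like. It is not Doug's ooxmltohtml.pl (that script is not attached to this ticket), and it is written in Python rather than Perl purely for illustration. The jar location `tika-app.jar` and the function names `build_tika_command` and `convert_to_html` are assumptions; the only Tika-specific piece is the `--html` option of the tika-app command-line tool, which writes HTML to standard output.

```python
import subprocess
from pathlib import Path

# Assumption: where the tika-app jar lives; the ticket does not say.
TIKA_JAR = Path("tika-app.jar")


def build_tika_command(input_file, jar=TIKA_JAR, java="java"):
    """Return the argv list that converts one document to HTML with tika-app."""
    # --html asks the Tika command-line app to write HTML to standard output.
    return [java, "-jar", str(jar), "--html", str(input_file)]


def convert_to_html(input_file, output_file, jar=TIKA_JAR):
    """Run Tika on input_file and save the HTML it prints to output_file."""
    cmd = build_tika_command(input_file, jar)
    # check=True raises CalledProcessError if the java process exits non-zero.
    result = subprocess.run(cmd, capture_output=True, check=True)
    out_path = Path(output_file)
    out_path.write_bytes(result.stdout)
    return out_path
```

A call such as `convert_to_html("report.docx", "report.html")` would start one Java process per document. That per-document startup may be part of the performance issue mentioned above, though the ticket does not say where the time actually goes.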

see also #426

Change History (3)

comment:1 by kjdon, 14 years ago

Milestone: Collection building wishlist → 2.84 Release

comment:2 by kjdon, 14 years ago

Milestone: 2.84 Release → 2.85 Release

comment:3 by sjm84, 13 years ago

Milestone: 2.85 Release → 2.86 Release