Ticket #822 (new defect)

Opened 6 years ago

Last modified 6 years ago

Better processing of epubs

Reported by: ak19 Owned by: nobody
Priority: moderate Milestone: 2.87 Release
Component: Collection Building: Plugins Severity: minor
Keywords: Cc:

Description

Hello Renate,

I think the matter may be complicated by the fact that epub files are zips containing multiple files internally.

There are two ways I can get Greenstone to do something (in fact, my aim was initially to try both of them together, since I thought that's how I could get things to work):

SOLUTION 1. The first way is simply to add the UnknownPlugin and configure it with the following options:
- mime_type: application/xhtml+xml
- process_extension: epub

SOLUTION 2. The second way does not conflict with the first, so leave the UnknownPlugin in the plugin pipeline with the settings described above. In addition, I configured the ZIPPlugin's process_exp option to include epub:
- process_exp: (?i)\.(gz|tgz|z|taz|bz|bz2|zip|jar|tar|epub)$
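As a rough illustration of what adding epub to process_exp changes, the following Python sketch applies the default expression and the extended one to a few file names. The file names and the helper name is_claimed are made up for this example, and Python's re module only stands in for the Perl regex engine that Greenstone actually uses.

```python
import re

# The default pattern and the extended one from Solution 2.
DEFAULT_EXP = r"(?i)\.(gz|tgz|z|taz|bz|bz2|zip|jar|tar)$"
EPUB_EXP = r"(?i)\.(gz|tgz|z|taz|bz|bz2|zip|jar|tar|epub)$"


def is_claimed(filename, process_exp):
    """Return True if the file name matches the given process_exp (hypothetical helper)."""
    return re.search(process_exp, filename) is not None


if __name__ == "__main__":
    # Illustrative file names only.
    for name in ["pride.epub", "PRIDE.EPUB", "notes.zip", "chapter1.xhtml"]:
        print(name, is_claimed(name, DEFAULT_EXP), is_claimed(name, EPUB_EXP))
```

Only the extended expression matches the epub names; because of the (?i) flag, the match is case-insensitive.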

GLI wasn't happy building. Turning up the verbosity level during import showed that the ZIPPlugin was using gzip on the Linux machine where I tested it; gzip simply does not recognize the epub extension and gave up. To understand more of the issue, I instead went to the Gather panel and renamed the epub file's extension to ".zip". After that it no longer mattered whether the ZIPPlugin's process_exp included "epub", but it doesn't hurt to leave it in.

When I hit the Build Collection button, Greenstone processed all the individual XML files making up the chapters of the Pride and Prejudice (P&P) epub that I had just downloaded for this test. The Titles classifier then displayed every single chapter of P&P as a separate document, since each was a separate XML file in the epub. However, this did mean that the contents were now indexed: I can search on "Darcy" (the name of a character in the book) and Greenstone happily returns several chapters as results.

Having investigated a bit further, I find that the "gunzip" program, which the ZIPPlugin uses by default for unrecognised extensions of compressed files, can't process .zip files, and epub files are essentially .zip files. The ZIPPlugin does, however, use "unzip" to process .zip files. I merely added "epub" to the list of extensions processed by "unzip" in the ZIPPlugin script and also added it to the list of file extensions recognised by ZIPPlugin. Now I no longer need to rename the epub file in the Gather panel to give it a zip extension, nor do I need to configure the ZIPPlugin to recognise epub (as it's now in the process_exp by default).
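The difference between the two formats can be seen at the byte level. The Python sketch below is only an illustration under stated assumptions: it builds a tiny zip archive in memory, laid out loosely like an epub (the chapter file names and contents are made up), and shows that a gzip decoder rejects it while a zip reader lists its members. It uses Python's standard gzip and zipfile modules, not the gunzip and unzip programs that the ZIPPlugin calls.

```python
import gzip
import io
import zipfile


def make_fake_epub():
    """Build a tiny zip archive in memory, laid out loosely like an epub."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        # Made-up chapter files, for illustration only.
        zf.writestr("chapter1.xhtml", "<html><body>Darcy</body></html>")
        zf.writestr("chapter2.xhtml", "<html><body>Elizabeth</body></html>")
    return buf.getvalue()


def try_gzip(data):
    """Return True if the bytes decompress as a gzip stream, False otherwise."""
    try:
        gzip.decompress(data)
        return True
    except OSError:  # gzip.BadGzipFile is a subclass of OSError
        return False


def list_zip_members(data):
    """Return the member names of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()


if __name__ == "__main__":
    data = make_fake_epub()
    print(data[:2])                # zip archives start with the bytes "PK"
    print(try_gzip(data))          # the gzip decoder rejects zip data
    print(list_zip_members(data))  # the zip reader sees each chapter file
```

Each chapter shows up as a separate member of the archive, which is why unzipping an epub yields one document per chapter.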

So if you wish to have the results of the 2nd solution outlined above, but without having to rename the epub file in the Gather panel and without having to configure the ZIPPlugin, then use a text editor to edit your Greenstone's perllib/plugins/ZIPPlugin.pm file as follows:

a. Find the bit where it says:

    sub get_default_process_exp {
        return q^(?i)\.(gz|tgz|z|taz|bz|bz2|zip|jar|tar)$^;
    }

And append "|epub" near the end of the expression, so that it becomes:

    sub get_default_process_exp {
        return q^(?i)\.(gz|tgz|z|taz|bz|bz2|zip|jar|tar|epub)$^;
    }

b. Find the bit where ZIPPlugin.pm says:

    } elsif ($file =~ /\.(zip|jar)$/i) {
        $self->unzip ($filename_no_path);

And append "|epub" once more, so that it looks like:

    } elsif ($file =~ /\.(zip|jar|epub)$/i) {
        $self->unzip ($filename_no_path);

I am going to commit the ZIPPlugin file with only the 2nd change (b) above, so that by default the epub file is not unzipped by ZIPPlugin and you can use the UnknownPlugin to treat the entire epub as a single file (as in Solution 1). However, if a GLI user configures the process_exp option of the ZIPPlugin to include epub files, Greenstone will automatically work as in Solution 2.
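The resulting behaviour can be summarised in a small Python sketch. It models only the two pieces discussed in this ticket: the process_exp check and the choice between unzip and gunzip. The function name choose_tool and its return values are hypothetical. The "elsif" in the snippet above suggests the real ZIPPlugin.pm has other branches, which are not modelled here.

```python
import re

# Default expression: epub is NOT included, so ZIPPlugin leaves epubs alone.
DEFAULT_EXP = r"(?i)\.(gz|tgz|z|taz|bz|bz2|zip|jar|tar)$"
# Expression a GLI user might configure to opt in to Solution 2.
USER_EXP = r"(?i)\.(gz|tgz|z|taz|bz|bz2|zip|jar|tar|epub)$"


def choose_tool(filename, process_exp=DEFAULT_EXP):
    """Return the tool this simplified model would pick, or None if unclaimed."""
    if re.search(process_exp, filename) is None:
        # Not claimed by ZIPPlugin; a later plugin such as UnknownPlugin may take it.
        return None
    if re.search(r"\.(zip|jar|epub)$", filename, re.IGNORECASE):
        return "unzip"
    # Simplification: everything else falls through to gunzip in this model.
    return "gunzip"


if __name__ == "__main__":
    print(choose_tool("pride.epub"))            # default configuration
    print(choose_tool("pride.epub", USER_EXP))  # user opted in to epub
```

With the default expression the epub is left for another plugin; with the user-configured expression it is handed to unzip.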

Change History

Changed 6 years ago by ak19

Setting the ZIPPlugin (configured to process epub) with -store_original_file ticked and -associate_tail_re set to "x?html?" does not associate the various html files in my sample epub with any stored epub file. The individual html (chapter) files are still presented individually, instead of all being seen as part of one epub document.

The epub includes a tocx that lists the multiple parts (xhtml or html) making up the document, so it should be possible to create a single hierarchical document from those (x)html parts.

We can consider a separate plugin for epub files if the demand for Greenstone processing this grows.
