source: main/trunk/gli/src/org/greenstone/gatherer/metadata/DocXMLFileManager.java

Last change on this file was 34507, checked in by ak19, 4 years ago

Redoing the work of commit revision 34394: redoing Bugfix 1 for the GLI doc.xml metadata slowdown, which resulted from an earlier bugfix that helped GLI cope with filenames and assigned meta containing non-ASCII chars. The slowdown happened when gathered files got selected in GLI and was fixed in commit 34394, but that fix was not ideal for 2 reasons:

1. It put a new form of filename encoding (hexed unicode) into doc.xml, instead of existing encodings like URL and base64, though those existing encodings weren't the right ones for my first solution.
2. It was specific to Windows, to cope with special chars in filenames, and relied on a new meta field gsdlfullsourcepath being written out to doc.xml by doc.pm. So a built collection moved from Linux to Windows won't show doc.xml meta in GLI, as it won't have the new doc.xml meta field that Windows is expecting.

Have a better solution for 1 that doesn't require the new field. But still can't fix all of point 2, as the existing gsdlsourcefilename meta field in doc.xml can contain Windows Short filenames when the coll is built on Windows, and this won't be backwards compatible on Linux anyway. This problem existed before too, except I didn't realise it until now. But the new solution fixes more issues.

Second step: modified DocXMLFile to no longer use the new field gsdlfullsourcepath, but to return to using the gsdlsourcefilename field. This time, however, the code is optimised to detect a filename match between doc.xml and any file selected in GLI by storing gsdlsourcefilename in its Long filename form whenever doc.xml had stored it in Win 8.3 Short filename form. The Long filename can be obtained for any file that exists by calling getCanonicalPath(). Of course, the full filename was not stored in gsdlsourcefilename, only the filename from the import folder onwards. So to ensure a file by that filename in long form has a chance of existing, the code first prefixes the current collection folder and then checks for existence before obtaining the canonical form. This is then stored in the hashmap in place of any Win short filename. Now a match is more readily found without using any hex encoded unicode filenames stored by doc.pm, and without using the older and inefficient method of making cmd calls to DOS to calculate the Win 8.3 Short filename for each selected file.
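The lookup described in the commit message can be sketched roughly as follows. This is a minimal illustration and not the actual DocXMLFile code: the class name `LongFilenameSketch`, the method `resolveStoredFilename`, and the assumption that the stored value is a path relative to the collection folder are all hypothetical. Only `File.exists()` and `File.getCanonicalPath()` are taken from the description above.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, for illustration only; not part of GLI.
final class LongFilenameSketch {

    /**
     * Prefixes the collection folder to a filename stored in doc.xml and,
     * if that file exists, returns its canonical path. Per the commit message,
     * on Windows this is how a Win 8.3 Short filename becomes its Long form.
     * If the file does not exist or cannot be resolved, the stored value is
     * returned unchanged.
     */
    static String resolveStoredFilename(File collectionDir, String storedFilename) {
        // Assumption: storedFilename is relative to the collection folder
        File candidate = new File(collectionDir, storedFilename);
        if (!candidate.exists()) {
            return storedFilename;
        }
        try {
            return candidate.getCanonicalPath();
        } catch (IOException e) {
            // Fall back to the stored form rather than failing the lookup
            return storedFilename;
        }
    }
}
```

The point of doing this once, when doc.xml is read, is that later comparisons against files selected in GLI need no per-file call out to the operating system.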

  • Property svn:keywords set to Author Date Id Revision
File size: 3.9 KB
/**
 *############################################################################
 * A component of the Greenstone Librarian Interface, part of the Greenstone
 * digital library suite from the New Zealand Digital Library Project at the
 * University of Waikato, New Zealand.
 *
 * Author: Michael Dewsnip, NZDL Project, University of Waikato, NZ
 *
 * Copyright (C) 2004 New Zealand Digital Library Project
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 *############################################################################
 */

package org.greenstone.gatherer.metadata;


import java.io.*;
import java.util.*;
import org.greenstone.gatherer.DebugStream;
import org.greenstone.gatherer.util.Utility;

/** This class is a static class that manages the doc.xml files */
public class DocXMLFileManager
{
    static private ArrayList doc_xml_files = new ArrayList();

    static public void clearDocXMLFiles()
    {
        doc_xml_files.clear();
    }


    static public ArrayList getMetadataExtractedFromFile(File file)
    {
        // Work out the file path relative to the import folder here,
        // avoids making DocXMLFile.java recalculate it each time
        String file_relative_path = file.getAbsolutePath();
        int import_index = file_relative_path.indexOf("import");
        if (import_index != -1) {
            file_relative_path = file_relative_path.substring(import_index + "import".length() + 1);
        }

        // Build up a list of metadata values extracted from this file
        ArrayList metadata_values = new ArrayList();

        // Look at each loaded doc.xml file to see if any have extracted metadata for this file
        for (int i = 0; i < doc_xml_files.size(); i++) {
            DocXMLFile doc_xml_file = (DocXMLFile) doc_xml_files.get(i);
            ///System.err.println("@@@@ Looking at doc.xml file: " + doc_xml_files.get(i));
            metadata_values.addAll(doc_xml_file.getMetadataExtractedFromFile(file, file_relative_path));
        }

        return metadata_values;
    }


    static public void loadDocXMLFiles(File directory, String filename_match)
    {
        // Make sure the directory (archives) exists
        if (directory.exists() == false) {
            return;
        }

        // Look recursively at each subfile of the directory for doc.xml files
        File[] directory_files = directory.listFiles();
        if (directory_files == null) {
            // listFiles() returns null if this is not a directory or an I/O error occurs
            return;
        }
        for (int i = 0; i < directory_files.length; i++) {
            File child_file = directory_files[i];
            if (child_file.isDirectory()) {
                loadDocXMLFiles(child_file, filename_match);
            }
            else if (child_file.getName().equals(filename_match)) {
                // e.g. doc.xml (for regular Greenstone, docmets.xml for Fedora)

                loadDocXMLFile(child_file, filename_match);
            }
        }
    }


    static private void loadDocXMLFile(File doc_xml_file_file, String filename_match)
    {
        String file = doc_xml_file_file.getAbsolutePath();

        // Need to do typecasts in the following to keep Java 1.4 happy
        DocXMLFile doc_xml_file
            = (filename_match.equals("docmets.xml"))
            ? (DocXMLFile) new DocMetsXMLFile(file)
            : (DocXMLFile) new DocGAFile(file);

        try {
            doc_xml_file.skimFile();
            doc_xml_files.add(doc_xml_file);
        }
        catch (Exception exception) {
            // We catch any exceptions here so errors in doc.xml files don't stop the collection from loading
            System.err.println("Error: Could not skim doc.xml file " + doc_xml_file.getAbsolutePath());
            DebugStream.printStackTrace(exception);
        }
    }
}
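
The relative-path step in getMetadataExtractedFromFile can be looked at in isolation. The sketch below is a standalone illustration, not GLI code: the class `ImportRelativePathSketch` and the method `relativeToImport` are hypothetical names. It uses the same first-occurrence `indexOf("import")` approach as the method above and adds one assumed guard for a path that ends at the import folder itself.

```java
// Hypothetical standalone sketch of the "path from the import folder onwards" step.
final class ImportRelativePathSketch {

    private static final String IMPORT = "import";

    /**
     * Returns the part of the path that follows the first occurrence of
     * "import" plus one separator character. The path is returned unchanged
     * if "import" is absent or nothing follows it.
     */
    static String relativeToImport(String absolutePath) {
        int importIndex = absolutePath.indexOf(IMPORT);
        if (importIndex == -1) {
            return absolutePath;
        }
        // Skip over "import" and the separator character after it
        int start = importIndex + IMPORT.length() + 1;
        if (start > absolutePath.length()) {
            // Assumed guard: the path ends at the import folder itself
            return absolutePath;
        }
        return absolutePath.substring(start);
    }
}
```

For example, `relativeToImport("/home/me/collect/demo/import/docs/a.txt")` yields `docs/a.txt`. Because the first occurrence of "import" is used, an earlier path component that happens to contain that string would shift the result.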