Context Navigation

source: main/trunk/greenstone2/perllib/lucenebuildproc.pm@ 33327

Last change on this file since 33327 was 33327, checked in by ak19, 5 years ago
In order to get map coordinate metadata stored correctly in solr, changes were required. These changes revealed that the way in which some index fields were stored in solr but also lucene were not exactly correct and required changing too. 1. Coordinate/CD, CoordShort/CS and GPSMapOverlayLabel/ML meta are now being stored. The schema elements created for these indexed fields notably need to say they're multivalued (multiple values per docOID) and are of type=string rather than type=text_en_splitting as the other meta have been so far. No term related information being stored for them as that doesn't appear important for these indexed fields. 2. Changes to solrbuildproc required and these changes were also repeated into lucenebuildproc: in their code before this commit, <field name=... /> elements were stored once for all meta elements in that field. It sort of worked out so far since the type=text_en_splitting for these fields. This however created the problem that for example all Coordinate meta for a docOID went into a single <field name=CD .../> element separate by spaces rather than a <field name=CD .../> element for each Coordinate meta. We wanted the latter behaviour for CD, CS and ML meta but also for all other indexed meta fields such as TI for titles. But also for indexed fields that include multiple meta in one index such as a hypothetical TT where TT would include dc.Title,ex.Title,text. In that case too we want a <field name=TT /> element for each title meta and for the text meta. 3. The num_processed_bytes calculation is left untouched and still includes the encapsulating <field name=.../> element and has not been changed to be calculated over just the meta data value of each field. This is because not only is it calculated to include the field in super -buildproc.pm classes, but also because the definition of num_processed_bytes in basebuilder.pm is defined as the number of bytes actually passed to (mg) for the current index, where lucene and mgpp buildprocs both include the enclosing element in the calculation which seems deliberate. Further, num_processed_bytes contrasts against num_bytes, declared and defined in basebuildproc.pm too as The actual number of bytes in the collection, normally the same as what's processed during text compression. num_bytes seems to be what Dr Bainbridge had in mind today when he said that actually the enclosing <field/> element shouldn't be included in the calculation of num_processed_bytes. Since the definition of num_processed_bytes seems ambiguous to me now, I leave it alone until discussed with Dr Bainbridge again, as there are many places where it needs changing otherwise.
Property svn:keywords set to `Author Date Id Revision`
File size: 22.1 KB

Line
1	###########################################################################
2	#
3	# lucenebuildproc.pm -- perl wrapper for building index with Lucene
4	# A component of the Greenstone digital library software
5	# from the New Zealand Digital Library Project at the
6	# University of Waikato, New Zealand.
7	#
8	# Copyright (C) 1999 New Zealand Digital Library Project
9	#
10	# This program is free software; you can redistribute it and/or modify
11	# it under the terms of the GNU General Public License as published by
12	# the Free Software Foundation; either version 2 of the License, or
13	# (at your option) any later version.
14	#
15	# This program is distributed in the hope that it will be useful,
16	# but WITHOUT ANY WARRANTY; without even the implied warranty of
17	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
18	# GNU General Public License for more details.
19	#
20	# You should have received a copy of the GNU General Public License
21	# along with this program; if not, write to the Free Software
22	# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
23	#
24	###########################################################################
25
26	package lucenebuildproc;
27
28	# This document processor outputs a document
29	# for lucene to process
30
31	# Use same basic XML structure setup by mgppbuilder/mgppbuildproc
32
33	use mgppbuildproc;
34	use ghtml;
35	use strict;
36	no strict 'refs'; # allow filehandles to be variables and viceversa
37
38
39	use IncrementalBuildUtils;
40	use FileUtils;
41
42	sub BEGIN {
43	@lucenebuildproc::ISA = ('mgppbuildproc');
44	}
45
46
47	sub new {
48	my $class = shift @_;
49	my $self = new mgppbuildproc (@_);
50
51	$self->{'numincdocs'} = 0;
52	$self->{'specified_fields'} = (); # list of fields actually specified in the index, in a map
53	$self->{'allfields_index'} = 0; # do we need allfields index?
54	$self->{'all_metadata_specified'} = 0; # are we indexing all metadata?
55	$self->{'actualsortfields'} = {}; # sort fields that have actually been used
56	$self->{'sortfieldnamemap'} = {}; # mapping between field name and field shortname, eg dc.Title->byTI
57	return bless $self, $class;
58	}
59
60	sub set_index {
61	my $self = shift (@_);
62	my ($index, $indexexparr) = @_;
63
64	$self->mgppbuildproc::set_index($index, $indexexparr);
65
66	# just get the list of index fields without any subcoll stuff
67	my ($fields) = split (/:/, $self->{'index'});
68
69	foreach my $field (split (/;/, $fields)) {
70	if ($field eq "allfields") {
71	$self->{'allfields_index'} = 1;
72	} elsif ($field eq "metadata") {
73	$self->{'all_metadata_specified'} = 1;
74	} else {
75	$field =~ s/^top//;
76	$self->{'specified_fields'} ->{$field} = 1;
77	}
78	}
79	}
80
81	sub set_sections_sort_on_document_metadata {
82	my $self= shift (@_);
83	my ($index_type) = @_;
84
85	$self->{'sections_sort_on_document_metadata'} = $index_type;
86	}
87
88	sub set_sortfields {
89	my $self = shift (@_);
90
91	my ($sortfields) = @_;
92	$self->{'sortfields'} = ();
93	# lets just go through and check for text, allfields, metadata which are only valid for indexes, not for sortfields
94	foreach my $s (@$sortfields) {
95	if ($s !~ /^(text\|allfields\|metadata)$/) {
96	push (@{$self->{'sortfields'}}, $s);
97	}
98	}
99	}
100
101	sub is_incremental_capable
102	{
103	my $self = shift (@_);
104
105	# Unlike MG and MGPP, Lucene supports incremental building
106	return 1;
107	}
108
109
110	sub textedit {
111	my $self = shift (@_);
112	my ($doc_obj,$file,$edit_mode) = @_;
113
114	my $lucenehandle = $self->{'output_handle'};
115	my $outhandle = $self->{'outhandle'};
116
117	# only output this document if it is one to be indexed
118	return if ($doc_obj->get_doc_type() ne "indexed_doc");
119
120	# skip this document if in "compress-text" mode and asked to delete it
121	return if (!$self->get_indexing_text() && ($edit_mode eq "delete"));
122
123	# 0/1 to indicate whether this doc is part of the specified subcollection
124	my $indexed_doc = $self->is_subcollection_doc($doc_obj);
125
126	# this is another document
127	if (($edit_mode eq "add") \|\| ($edit_mode eq "update")) {
128	$self->{'num_docs'} += 1;
129	}
130	else {
131	$self->{'num_docs'} -= 1;
132	}
133
134
135	# get the parameters for the output
136	# split on : just in case there is subcoll and lang stuff
137	my ($fields) = split (/:/, $self->{'index'});
138
139	my $doc_tag_name = $mgppbuildproc::level_map{'document'};
140
141	my $levels = $self->{'levels'};
142	my $ldoc_level = $levels->{'document'};
143	my $lsec_level = $levels->{'section'};
144
145	my $gs2_docOID = $doc_obj->get_OID();
146	my $documenttag = undef;
147	my $documentendtag = undef;
148
149	$documenttag = "<$doc_tag_name xmlns:gs2=\"http://www.greenstone.org/gs2\" file=\"$file\" gs2:docOID=\"$gs2_docOID\" gs2:mode=\"$edit_mode\">\n";
150	$documentendtag = "\n</$doc_tag_name>\n";
151
152	my $sec_tag_name = "";
153	if ($lsec_level)
154	{
155	$sec_tag_name = $mgppbuildproc::level_map{'section'};
156	}
157
158	my $doc_section = 0; # just for this document
159
160	my $text = "";
161	$text .= $documenttag;
162	# get the text for this document
163	my $section = $doc_obj->get_top_section();
164	while (defined $section)
165	{
166	# update a few statistics
167	$doc_section++;
168	$self->{'num_sections'}++;
169
170	my $sec_gs2_id = $self->{'num_sections'};
171	my $sec_gs2_docOID = $gs2_docOID;
172	$sec_gs2_docOID .= ".$section" if ($section ne "");
173
174	# if we are doing subcollections, then some docs shouldn't be indexed.
175	# but we need to put the section tag placeholders in there so the
176	# sections match up with database
177	my $indexed_section = $doc_obj->get_metadata_element($section, "gsdldoctype") \|\| "indexed_section";
178	if (($indexed_doc == 0) \|\| ($indexed_section ne "indexed_section" && $indexed_section ne "indexed_doc")) {
179	if ($sec_tag_name ne "") {
180	$text .= "\n<$sec_tag_name gs2:docOID=\"$sec_gs2_docOID\" gs2:mode=\"ignore\">\n";
181	$text .= "\n</$sec_tag_name>\n"
182	}
183	$section = $doc_obj->get_next_section($section);
184	next;
185	}
186
187	if ($sec_tag_name ne "")
188	{
189	$text .= "\n<$sec_tag_name gs2:docOID=\"$sec_gs2_docOID\" gs2:mode=\"$edit_mode\">\n";
190	}
191
192	if (($edit_mode eq "add") \|\| ($edit_mode eq "update")) {
193	$self->{'num_bytes'} += $doc_obj->get_text_length ($section);
194	}
195	else {
196	# delete
197	$self->{'num_bytes'} -= $doc_obj->get_text_length ($section);
198	}
199
200
201	# collect up all the text for allfields index in here (if there is one)
202	my $allfields_text = "";
203
204	foreach my $field (split (/;/, $fields)) {
205
206	# only deal with this field if it doesn't start with top or
207	# this is the first section
208	my $real_field = $field;
209	next if (($real_field =~ s/^top//) && ($doc_section != 1));
210
211	# process these two later
212	next if ($real_field eq "allfields" \|\| $real_field eq "metadata");
213
214	#individual metadata and or text specified - could be a comma separated list
215	#$specified_fields->{$real_field} = 1;
216	my $shortname="";
217	my $new_field = 0; # have we found a new field name?
218	if (defined $self->{'fieldnamemap'}->{$real_field}) {
219	$shortname = $self->{'fieldnamemap'}->{$real_field};
220	} else {
221	$shortname = $self->create_shortname($real_field);
222	$self->{'fieldnamemap'}->{$real_field} = $shortname;
223	$self->{'fieldnamemap'}->{$shortname} = 1;
224	}
225	my @metadata_list = (); # put any metadata values in here
226	my $section_text = ""; # put the text in here
227	foreach my $submeta (split /,/, $real_field) {
228	if ($submeta eq "text") {
229	# no point in indexing text more than once
230	if ($section_text eq "") {
231	$section_text = $doc_obj->get_text($section);
232	if ($self->{'indexing_text'}) {
233	# we always strip html
234	$section_text = $self->preprocess_text($section_text, 1, "");
235	}
236	else {
237	# leave html stuff in, but escape the tags
238	&ghtml::htmlsafe($section_text);
239	}
240	}
241	}
242	else {
243	$submeta =~ s/^ex\.([^.]+)$/$1/; #strip off ex. iff it's the only metadata set prefix (will leave ex.dc.* intact)
244
245	# its a metadata element
246	my @section_metadata = @{$doc_obj->get_metadata ($section, $submeta)};
247	if ($section ne $doc_obj->get_top_section() && $self->{'indexing_text'} && defined ($self->{'sections_index_document_metadata'})) {
248	if ($self->{'sections_index_document_metadata'} eq "always" \|\| ( scalar(@section_metadata) == 0 && $self->{'sections_index_document_metadata'} eq "unless_section_metadata_exists")) {
249	push (@section_metadata, @{$doc_obj->get_metadata ($doc_obj->get_top_section(), $submeta)});
250	}
251	}
252	push (@metadata_list, @section_metadata);
253	}
254	} # for each field in this one index
255
256
257	# now we add the text and/or metadata into new_text
258	if ($section_text ne "" \|\| scalar(@metadata_list)) {
259	my $new_text = "";
260
261	if ($section_text ne "") {
262
263	if ($self->{'allfields_index'}) {
264	$allfields_text .= "$section_text ";
265	}
266
267	if ($self->{'indexing_text'}) {
268	# add the tag
269	$new_text .= "<$shortname index=\"1\">$section_text</$shortname>";
270	$self->{'allindexfields'}->{$real_field} = 1;
271	} else {
272	$new_text .= "$section_text ";
273	}
274	}
275
276	foreach my $item (@metadata_list) {
277	&ghtml::htmlsafe($item);
278
279	if ($self->{'allfields_index'}) {
280	$allfields_text .= "$item ";
281	}
282
283	if ($self->{'indexing_text'}) {
284	# add the tag
285	$new_text .= "<$shortname index=\"1\">$item</$shortname>";
286	$self->{'allindexfields'}->{$real_field} = 1;
287	} else {
288	$new_text .= "$item ";
289	}
290	} # end for loop processing @metadata_list
291
292	# filter the text
293	$new_text = $self->filter_text ($field, $new_text);
294
295	if (($edit_mode eq "add") \|\| ($edit_mode eq "update")) {
296	$self->{'num_processed_bytes'} += length ($new_text);
297	$text .= "$new_text";
298	}
299	else {
300	# delete
301	$self->{'num_processed_bytes'} -= length ($new_text);
302	}
303	}
304
305	} # foreach field
306
307	if ($self->{'all_metadata_specified'}) {
308
309	my $new_text = "";
310	my $shortname = "";
311	my $metadata = $doc_obj->get_all_metadata ($section);
312	foreach my $pair (@$metadata) {
313	my ($mfield, $mvalue) = (@$pair);
314	# no value
315	next unless defined $mvalue && $mvalue ne "";
316	# we have already indexed this
317	next if defined ($self->{'specified_fields'}->{$mfield});
318	# check fields here, maybe others dont want - change to use dontindex!!
319	next if ($mfield eq "Identifier" \|\| $mfield eq "classifytype" \|\| $mfield eq "assocfilepath");
320	next if ($mfield =~ /^gsdl/);
321
322	&ghtml::htmlsafe($mvalue);
323
324	if (defined $self->{'fieldnamemap'}->{$mfield}) {
325	$shortname = $self->{'fieldnamemap'}->{$mfield};
326	}
327	else {
328	$shortname = $self->create_shortname($mfield);
329	$self->{'fieldnamemap'}->{$mfield} = $shortname;
330	$self->{'fieldnamemap'}->{$shortname} = 1;
331	}
332	$self->{'allindexfields'}->{$mfield} = 1;
333	$new_text .= "<$shortname index=\"1\">$mvalue</$shortname>\n";
334	if ($self->{'allfields_index'}) {
335	$allfields_text .= "$mvalue ";
336	}
337
338	if (!defined $self->{'extraindexfields'}->{$mfield}) {
339	$self->{'extraindexfields'}->{$mfield} = 1;
340	}
341
342	}
343	# filter the text
344	$new_text = $self->filter_text ("metadata", $new_text);
345
346	if (($edit_mode eq "add") \|\| ($edit_mode eq "update")) {
347	$self->{'num_processed_bytes'} += length ($new_text);
348	$text .= "$new_text";
349	}
350	else {
351	# delete
352	$self->{'num_processed_bytes'} -= length ($new_text);
353	}
354	}
355
356	if ($self->{'allfields_index'}) {
357
358	my $new_text = "<ZZ index=\"1\">$allfields_text</ZZ>\n";
359	# filter the text
360	$new_text = $self->filter_text ("allfields", $new_text);
361
362	if (($edit_mode eq "add") \|\| ($edit_mode eq "update")) {
363	$self->{'num_processed_bytes'} += length ($new_text);
364	$text .= "$new_text";
365	}
366	else {
367	# delete
368	$self->{'num_processed_bytes'} -= length ($new_text);
369	}
370	}
371	# only add sort fields for this section if we are indexing this section, we are doing section level indexing or this is the top section
372	if ($self->{'indexing_text'} && ($sec_tag_name ne "" \|\| $doc_section == 1 )) {
373	# add sort fields if there are any
374
375	foreach my $sfield (@{$self->{'sortfields'}}) {
376	# ignore special field rank
377	next if ($sfield eq "rank" \|\| $sfield eq "none");
378	my $sf_shortname;
379	if (defined $self->{'sortfieldnamemap'}->{$sfield}) {
380	$sf_shortname = $self->{'sortfieldnamemap'}->{$sfield};
381	}
382	else {
383	$sf_shortname = $self->create_sortfield_shortname($sfield);
384	$self->{'sortfieldnamemap'}->{$sfield} = $sf_shortname;
385	$self->{'sortfieldnamemap'}->{$sf_shortname} = 1;
386	}
387	my @metadata_list = (); # put any metadata values in here
388	foreach my $submeta (split /,/, $sfield) {
389	$submeta =~ s/^ex\.([^.]+)$/$1/; #strip off ex. iff it's the only metadata set prefix (will leave ex.dc.* intact)
390
391	my @section_metadata = @{$doc_obj->get_metadata ($section, $submeta)};
392	if ($section ne $doc_obj->get_top_section() && defined ($self->{'sections_sort_on_document_metadata'})) {
393	if ($self->{'sections_sort_on_document_metadata'} eq "always" \|\| ( scalar(@section_metadata) == 0 && $self->{'sections_sort_on_document_metadata'} eq "unless_section_metadata_exists")) {
394	push (@section_metadata, @{$doc_obj->get_metadata ($doc_obj->get_top_section(), $submeta)});
395	}
396	}
397	push (@metadata_list, @section_metadata);
398	}
399	# my $new_text = "";
400	# foreach my $item (@metadata_list) {
401	# &ghtml::htmlsafe($item);
402	# $new_text .= "$item"; # should be .="$item "; But will be commenting out and rewriting this entire thing, so it doesn't matter
403	# }
404	# if ($new_text =~ /\S/) {
405	# $new_text = "<$sf_shortname index=\"1\" tokenize=\"0\">$new_text</$sf_shortname>";
406	# # filter the text???
407	# $text .= "$new_text"; # add it to the main text block
408	# $self->{'actualsortfields'}->{$sfield} = 1;
409	# }
410
411	foreach my $item (@metadata_list) {
412	&ghtml::htmlsafe($item);
413	if ($item =~ /\S/) {
414	$item = "<$sf_shortname index=\"1\" tokenize=\"0\">$item</$sf_shortname>";
415	$text .= "$item"; # add it to the main text block
416	}
417	}
418	if(scalar @metadata_list > 0) {
419	$self->{'actualsortfields'}->{$sfield} = 1;
420	}
421	}
422	}
423	$text .= "\n</$sec_tag_name>\n" if ($sec_tag_name ne "");
424
425	$section = $doc_obj->get_next_section($section);
426	} # for each section
427
428	#open (TEXTOUT, ">text.out");
429	#print TEXTOUT "$text\n$documentendtag";
430	#close TEXTOUT;
431
432	print $lucenehandle "$text\n$documentendtag";
433
434	## if ($edit_mode eq "delete") {
435	## print STDERR "$text\n$documentendtag";
436	## }
437
438	}
439
440	sub text {
441	my $self = shift (@_);
442	my ($doc_obj,$file) = @_;
443
444	$self->textedit($doc_obj,$file,"add");
445	}
446
447	sub textreindex
448	{
449	my $self = shift (@_);
450	my ($doc_obj,$file) = @_;
451
452	$self->textedit($doc_obj,$file,"update");
453	}
454
455	sub textdelete
456	{
457	my $self = shift (@_);
458	my ($doc_obj,$file) = @_;
459
460	$self->textedit($doc_obj,$file,"delete");
461	}
462
463
464
465
466
467	# /** We make this builder pretend to be a document processor so we can get
468	# * information back from the plugins.
469	# *
470	# * @param $self A reference to this Lucene builder
471	# * @param $doc_obj A reference to a document object representing what was
472	# * parsed by the GAPlug
473	# * @param $file The name of the file parsed as a string
474	# *
475	# * @author John Thompson, DL Consulting Ltd
476	# */
477	sub process()
478	{
479	my $self = shift (@_);
480	my ($doc_obj, $file) = @_;
481
482	# If this is called from any stage other than an incremental infodb we want
483	# to pass through to the superclass of build
484	if ($self->get_mode() eq "incinfodb")
485	{
486	print STDERR "* Processing a document added using INCINFODB *\n" if ($self->{'verbosity'} > 3);
487	my ($archivedir) = $file =~ /^(.?)(?:\/\|\\)[^\/\\]$/;
488	$archivedir = "" unless defined $archivedir;
489	$archivedir =~ s/\\/\//g;
490	$archivedir =~ s/^\/+//;
491	$archivedir =~ s/\/+$//;
492
493	# Number of files
494	print STDERR "There are " . scalar(@{$doc_obj->get_assoc_files()}) . " associated documents...\n" if ($self->{'verbosity'} > 3);
495
496	# resolve the final filenames of the files associated with this document
497	$self->assoc_files ($doc_obj, $archivedir);
498
499	# is this a paged or a hierarchical document
500	my ($thistype, $childtype) = $self->get_document_type ($doc_obj);
501
502	# Determine the actual docnum by checking if we've processed any
503	# previous incrementally added documents. If so, carry on from there.
504	# Otherwise we set the counter to be the same as the number of
505	# sections encountered during the previous build
506	if ($self->{'numincdocs'} == 0)
507	{
508	$self->{'numincdocs'} = $self->{'starting_num_sections'} + 1;
509	}
510
511	my $section = $doc_obj->get_top_section ();
512	print STDERR "+ top section: '$section'\n" if ($self->{'verbosity'} > 3);
513	my $doc_OID = $doc_obj->get_OID();
514	my $url = "";
515	while (defined $section)
516	{
517	print STDERR "+ processing section: '$section'\n" if ($self->{'verbosity'} > 3);
518	# Attach all the other metadata to this document
519	# output the fact that this document is a document (unless doctype
520	# has been set to something else from within a plugin
521	my $dtype = $doc_obj->get_metadata_element ($section, "doctype");
522	if (!defined $dtype \|\| $dtype !~ /\w/)
523	{
524	#$doc_obj->add_utf8_metadata($section, "doctype", $dtype);
525	$doc_obj->add_utf8_metadata($section, "doctype", "doc");
526	}
527	# output whether this node contains text
528	if ($doc_obj->get_text_length($section) > 0)
529	{
530	$doc_obj->add_utf8_metadata($section, "hastxt", 1);
531	}
532	else
533	{
534	$doc_obj->add_utf8_metadata($section, "hastxt", 0);
535	}
536
537	# output archivedir if at top level
538	if ($section eq $doc_obj->get_top_section())
539	{
540	$doc_obj->add_utf8_metadata($section, "archivedir", $archivedir);
541	$doc_obj->add_utf8_metadata($section, "thistype", $thistype);
542	}
543
544	# output a list of children
545	my $children = $doc_obj->get_children ($section);
546	if (scalar(@$children) > 0)
547	{
548	$doc_obj->add_utf8_metadata($section, "childtype", $childtype);
549	my @contains = ();
550	foreach my $child (@$children)
551	{
552	if ($child =~ /^.*?\.(\d+)$/)
553	{
554	push (@contains, "\".$1");
555	}
556	else
557	{
558	push (@contains, "\".$child");
559	}
560	}
561	$doc_obj->add_utf8_metadata($section, "contains", join(";", @contains));
562	}
563	#output the matching doc number
564	print STDERR "+ docnum=" . $self->{'numincdocs'} . "\n" if ($self->{'verbosity'} > 3);
565	$doc_obj->add_utf8_metadata($section, "docnum", $self->{'numincdocs'});
566
567	$self->{'numincdocs'}++;
568	$section = $doc_obj->get_next_section($section);
569	# if no sections wanted, only add the docs
570	last if ($self->{'db_level'} eq "document");
571	}
572	print STDERR "\n*** incrementally add metadata from document at: " . $file . "\n" if ($self->{'verbosity'} > 3);
573	&IncrementalBuildUtils::addDocument($self->{'collection'}, $self->{'infodbtype'}, $doc_obj, $doc_obj->get_top_section());
574	}
575	else
576	{
577	$self->mgppbuildproc::process(@_);
578	}
579	}
580	# / process() /
581
582
583	# Following methods seem to be no different to those defined in basebuildproc.pm
584	# From inspection, it looks like these ones can be removed
585
586
587	sub get_num_docs {
588	my $self = shift (@_);
589	#rint STDERR "get_num_docs(): $self->{'num_docs'}\n";
590	return $self->{'num_docs'};
591	}
592
593	sub get_num_sections {
594	my $self = shift (@_);
595	#rint STDERR "get_num_sections(): $self->{'num_sections'}\n";
596	return $self->{'num_sections'};
597	}
598
599	# num_bytes is the actual number of bytes in the collection
600	# this is normally the same as what's processed during text compression
601	sub get_num_bytes {
602	my $self = shift (@_);
603	#rint STDERR "get_num_bytes(): $self->{'num_bytes'}\n";
604	return $self->{'num_bytes'};
605	}
606
607
608	# This is similar to mgppbuildproc's preprocess_text but adds extra spaces
609	# Otherwise the removal of tags below might lead to Lucene turning
610	# "...farming</p>\n<p>EDWARD.." into "farmingedward"
611	# (example from demo collection b20cre)
612	# Many thanks to John Thompson, DL Consulting Ltd. (www.dlconsulting.com)
613	sub preprocess_text
614	{
615	my $self = shift (@_);
616	my ($text, $strip_html, $para) = @_;
617	# at this stage, we do not do paragraph tags unless have strip_html -
618	# it will result in a huge mess of non-xml
619	return unless $strip_html;
620
621	my $new_text = $text;
622
623	# if we have <pre> tags, we can have < > inside them, need to delete
624	# the <> before stripping tags
625	$new_text =~ s/<pre>(.*?)<\/pre>/$self->remove_gtlt($1,$para)/gse;
626
627	if ($para eq "") {
628	# just remove all tags
629	$new_text =~ s/<[^>]*>/ /gs;
630	} else {
631	# strip all tags except <p> tags which get turned into $para
632	$new_text =~ s/<([^>]*)>/$self->process_tags($1, $para)/gse;
633	}
634
635	# It's important that we remove name entities because otherwise the text passed to Lucene for indexing
636	# may not be valid XML (eg. if HTML-only entities like   are used)
637	$new_text =~ s/&\w{1,10};//g;
638	# Remove stray '&' characters, except in &#nnnn; or &#xhhhh; entities (which are valid XML)
639	$new_text =~ s/&([^\#])/ $1/g;
640
641	return $new_text;
642	}
643
644	sub delete_assoc_files
645	{
646	my $self = shift (@_);
647	my ($archivedir, $edit_mode) = @_;
648
649	$self->basebuildproc::delete_assoc_files(@_);
650
651	if ($edit_mode eq "delete") {
652	# if we are deleting the doc, then also delete the lucene text version
653	my $assoc_dir = &FileUtils::filenameConcatenate($self->{'build_dir'},"text", $archivedir);
654	if (-d $assoc_dir) {
655	&FileUtils::removeFilesRecursive($assoc_dir);
656	}
657	}
658	}
659
660	sub create_sortfield_shortname {
661	my $self = shift(@_);
662
663	my ($realname) = @_;
664
665	my $index_shortname;
666	# if we have created a shortname for an index on this field, then use it.
667	if (defined $self->{'fieldnamemap'}->{$realname}) {
668	$index_shortname = $self->{'fieldnamemap'}->{$realname};
669	} else {
670	$index_shortname = $self->create_shortname($realname);
671	}
672	return "by".$index_shortname;
673	}
674
675
676	1;
677
678

Note: See TracBrowser for help on using the repository browser.

Download in other formats: