source: main/trunk/greenstone2/perllib/plugins/NutchTextDumpPlugin.pm@34122

Last change on this file since 34122 was 34122, checked in by ak19, 4 years ago
  1. After some testing of building the complete commoncrawl collection, noticed warnings about windows-1252 set by nutch as charset encoding. Attempting to use latin-1 for windows-1252 encodings also in to_utf8(), to decode text in such cases. 2. And when encoding is utf8 (set by nutch as utf8), uncommenting the immediate return statement in the to_utf8() function to take away if(not utf8) conditions that call the function. 3. Tidying up. 4. Tabbed lines in emacs after earlier occasional work on Windows.
###########################################################################
#
# NutchTextDumpPlugin.pm -- plugin for dump.txt files generated by Nutch
#
# A component of the Greenstone digital library software
# from the New Zealand Digital Library Project at the
# University of Waikato, New Zealand.
#
# Copyright (C) 2002 New Zealand Digital Library Project
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
###########################################################################

# This plugin was originally created to process Nutch dump.txt files produced from recrawling commoncrawl (CC)
# results for pages detected by CC as being in Maori.
# It splits each web site's dump.txt into its individual records: as each record represents a web page,
# this produces one Greenstone document per web page.
#
# For a commoncrawl collection of siteID-labelled folders, each containing a dump.txt file,
# set <importOption name="OIDtype" value="dirname"/>
# Then create two List browsing classifiers (with bookshelf_type set to always) on ex.siteID and ex.srcDomain,
# both sorted by ex.srcURL, and an ex.Title classifier.
# For the ex.srcDomain classifier, set removeprefix to: https?\:\/\/(www\.)?
# An alternative is to build that List classifier on ex.basicDomain instead of ex.srcDomain.
# Finally, in the "display" format statement, add the following before the "wrappedSectionText" to
# display the most relevant metadata of each record:
#   <gsf:template name="documentContent">
#     <div id="nutch-dump-txt-record">
#       <h3>Record:</h3>
#       <br/>
#       <dl>
#         <dt>URL:</dt>
#         <dd>
#           <gsf:metadata name="srcURL"/>
#         </dd>
#         <dt>Title:</dt>
#         <dd>
#           <gsf:metadata name="ex.Title"/>
#         </dd>
#         <dt>SiteID:</dt>
#         <dd>
#           <gsf:metadata name="siteID"/>
#         </dd>
#         <dt>Status:</dt>
#         <dd>
#           <gsf:metadata name="status"/>
#         </dd>
#         <dt>ProtocolStatus:</dt>
#         <dd>
#           <gsf:metadata name="protocolStatus"/>
#         </dd>
#         <dt>ParseStatus:</dt>
#         <dd>
#           <gsf:metadata name="parseStatus"/>
#         </dd>
#         <dt>CharEncodingForConversion:</dt>
#         <dd>
#           <gsf:metadata name="CharEncodingForConversion"/>
#         </dd>
#         <dt>OriginalCharEncoding:</dt>
#         <dd>
#           <gsf:metadata name="OriginalCharEncoding"/>
#         </dd>
#       </dl>
#     </div>
#   </gsf:template>

# + DONE: remove illegible values for metadata _rs_ and _csh_ in the example below before
#   committing, in case their encoding affects the loading/reading in of this perl file.
#
# Example record in dump.txt to process:
#   https://www.whanau-tahi.school.nz/ key: nz.school.whanau-tahi.www:https/
#   baseUrl: null
#   status: 2 (status_fetched)
#   fetchTime: 1575199241154
#   prevFetchTime: 1572607225779
#   fetchInterval: 2592000
#   retriesSinceFetch: 0
#   modifiedTime: 0
#   prevModifiedTime: 0
#   protocolStatus: SUCCESS, args=[]
#   signature: d84c84ccf0c86aa16a19e03cb1fc5827
#   parseStatus: success/ok (1/0), args=[]
#   title: Te Kura Kaupapa Māori o Te Whānau Tahi
#   score: 1.0
#   marker _injmrk_ : y
#   marker _updmrk_ : 1572607228-9584
#   marker dist : 0
#   reprUrl: null
#   batchId: 1572607228-9584
#   metadata CharEncodingForConversion : utf-8
#   metadata OriginalCharEncoding : utf-8
#   metadata _rs_ :
#   metadata _csh_ :
#   text:start:
#   Te Kura Kaupapa Māori o Te Whānau Tahi He mihi He mihi Te Kaupapa Ngā Tāngata Te Kākano Te Pihinga Te Tipuranga Te Puāwaitanga Te Tari Te Poari Matua Whakapā mai He mihi He mihi Te Kaupapa Ngā Tāngata Te Kākano Te Pihinga Te Tipuranga Te Puāwaitanga Te Tari Te Poari Matua Whakapā mai TE KURA KAUPAPA MĀORI O TE WHĀNAU TAHI He mihi Kei te mōteatea tonu nei ngā mahara ki te huhua kua mene atu ki te pō, te pōuriuri, te pōtangotango, te pō oti atu rā. Kua rite te wāhanga ki a rātou, hoki mai ki te ao tūroa nei Ko Io Matua Kore te pūtaketanga, te pūkaea, te pūtātara ka rangona whānuitia e te ao. Ko tāna ko ngā whetū, te marama, te haeata ki a Tamanui te rā. He atua i whakateretere mai ai ngā waka i tawhiti nui, i tawhiti roa, i tawhiti mai rā anō. Kei nga ihorei, kei ngā wahapū, kei ngā pukumahara, kei ngā kanohi kai mātārae o tō tātou nei kura Aho Matua, Te Kura Kaupapa Māori o Te Whanau Tahi. Anei rā te maioha ki a koutou katoa e pūmau tonu ki ngā wawata me ngā whakakitenga i whakatakotoria e ngā poupou i te wā i a rātou. Ka whakanuia hoki te toru tekau tau o tēnei kura mai i tōna orokohanga timatanga tae noa ki tēnei wā Ka pūmau tōnu mātou ki te whakatauki o te kura e mea ana “Poipoia ō tātou nei pūmanawa” Takiritia tonutia te ra ki runga i Te Kura Kaupapa Maori o Te Whanau Tahi . Back to Top " Poipoia ō tātou nei pūmanawa -  Making our potential a reality "   ©  Te Kura Kaupapa Māori o Te Whānau Tahi, 2019  Cart ( 0 )
#   text:end:
#
#   https://www.whanau-tahi.school.nz/cart key: nz.school.whanau-tahi.www:https/cart
#   baseUrl: null
#   status: 2 (status_fetched)
#   ...
#
# - Some records may have empty text content between the text:start: and text:end: markers,
#   while other records may be missing these markers along with any text.
# - Metadata is of the form "key : value", but some metadata values themselves contain ":";
#   for example, the "protocolStatus" metadata can have a URL as its value, whose protocol
#   portion contains ":".
# - metadata _rs_ and _csh_ contain illegible values, so this code discards them when storing metadata.
#
# If you provide a keep_urls_file when configuring NutchTextDumpPlugin and the path is
# relative, the plugin will look for it in the collection's etc folder (e.g. etc/urls.txt).
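#
# For illustration, such a keep-urls file might contain one URL per line (hypothetical
# contents, reusing the URLs from the example record above):
#   https://www.whanau-tahi.school.nz/
#   https://www.whanau-tahi.school.nz/cart
# Records whose srcURL is not listed are then discarded in process().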


package NutchTextDumpPlugin;

use SplitTextFile;

use Encode;
use unicode;
use util;

use strict;
no strict 'refs'; # allow filehandles to be variables and vice versa

# TODO:
# + 1. Split each dump.txt file into its individual records as individual docs
# + 2. Store the meta of each individual record/doc
# ? 3. Name each doc siteID.docID, else HASH of internal text. See EmailPlugin?
#      - In SplitTextFile::read(), why is $segment, which counts discarded docs too, used to add the record ID,
#        rather than $count, which only counts included docs? I am referring to the code:
#            $self->add_OID($doc_obj, $id, $segment);
#        The way I've solved this is by setting the OIDtype importOption. Not sure if this is what was required.
# + 4. Keep a map of all URLs seen - whitelist URLs.
# + 5. Implement the optional input file of URLs: if an infile is provided, keep only those records
#      whose URLs are in the map. Only these matching records should become docs.
#   6. Rebuild the full collection of all dump.txt files with this collection design.
#
# TIDY UP:
# + Create util::trim()
# + Add to perl's strings.properties: NutchTextDumpPlugin.keep_urls_file
#
# CLEANUP:
# Remove MetadataRead functions and inheritance
#
# QUESTIONS:
# - encoding = utf-8 is changed to "utf8", as required by the to_utf8(str) method copied from MARCPlugin.
#   Why does that method not convert the string parameter, but instead fail in the decode() step?
#   Is it because the string is already in UTF-8?
# - Problem converting text in the full set of nutch dump.txt files when their encoding is windows-1252.
# - TODOs
#
# - Should I add metadata as "ex."+meta or as meta? e.g. ex.srcURL or srcURL?
# - Want to read in keep_urls_file, maintaining a hashmap of its URLs, only on import; isn't that correct?
#   Then how can I initialise this only once and only during import? The constructor and init() methods are called during buildcol too.
#   For now, I've done it in can_process_this_file(), but there must be a more appropriate place and correct way to do this.
# - Why can't I do doc_obj->get_metadata_element($section, "ex.srcURL") but have to pass "srcURL" and 1 to ignore
#   the namespace?
# - In the collectionConfig file I have to leave out the ex. prefix for all but Title, why?
# - In GLI's browsing classifier sort_leaf options, "ex.srcURL" appears only as "ex.srcurl" (lowercased). Why?
# - On the other hand, in GLI's search indexes, both ex.srcurl and ex.srcURL appear. But only building
#   with an index on ex.srcURL provides a search option in the search box. And then searching on an existing
#   srcURL produces 0 results anyway.
# - Is this all because I am naming my ex metadata wrongly? e.g. ex.srcURL, ex.siteID, ex.srcDomain.
#
# CHECK:
# - title fallback is URL.
# - util::tidy_up_OID() prints a warning. The siteID is the foldername and OIDtype=dirname, so converting a fully
#   numeric siteID to an OID results in a warning message that the siteID is fully numeric and gets 'D' prefixed.
#   Is this warning still necessary?


sub BEGIN {
    @NutchTextDumpPlugin::ISA = ('SplitTextFile');
    unshift (@INC, "$ENV{'GSDLHOME'}/perllib/cpan");
}

my $arguments =
    [ { 'name' => "keep_urls_file",
        'desc' => "{NutchTextDumpPlugin.keep_urls_file}",
        'type' => "string",
        #'deft' => "urls.txt",
        'reqd' => "no" },
      { 'name' => "process_exp",
        'desc' => "{BaseImporter.process_exp}",
        'type' => "regexp",
        'reqd' => "no",
        'deft' => &get_default_process_exp() },
      { 'name' => "split_exp",
        'desc' => "{SplitTextFile.split_exp}",
        'type' => "regexp",
        'reqd' => "no",
        'deft' => &get_default_split_exp() }
      ];
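
# A minimal sketch of how this plugin might be listed in a collection's collect.cfg,
# showing its one plugin-specific option (hypothetical relative path, resolved against
# the collection's etc folder by util::locate_config_file()):
#   plugin NutchTextDumpPlugin -keep_urls_file urls.txt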

my $options = { 'name' => "NutchTextDumpPlugin",
                'desc' => "{NutchTextDumpPlugin.desc}",
                'abstract' => "no",
                'inherits' => "yes",
                'explodes' => "yes",
                'args' => $arguments };

sub new {
    my ($class) = shift (@_);
    my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
    push(@$pluginlist, $class);

    push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});
    push(@{$hashArgOptLists->{"OptList"}},$options);

    my $self = new SplitTextFile($pluginlist, $inputargs, $hashArgOptLists);

    if ($self->{'info_only'}) {
        # don't worry about the options
        return bless $self, $class;
    }

    $self->{'keep_urls_processed'} = 0;
    $self->{'keep_urls'} = undef;

    #return bless $self, $class;
    $self = bless $self, $class;

    # Can only call any methods on $self AFTER the bless operation above
    #$self->setup_keep_urls(); # want to set up the keep_urls hashmap only once, so have to do it here (init is also called by buildcol)

    return $self;
}

# sub init {
#     my $self = shift (@_);
#     my ($verbosity, $outhandle, $failhandle) = @_;
#
#     if(!$self->{'keep_urls_file'}) {
#         my $msg = "NutchTextDumpPlugin INFO: No urls file provided.\n" .
#             "    No records will be filtered.\n";
#         print $outhandle $msg if ($verbosity > 2);
#
#         $self->SUPER::init(@_);
#         return;
#     }
#
#     # read in the keep urls files
#     my $keep_urls_file = &util::locate_config_file($self->{'keep_urls_file'});
#     if (!defined $keep_urls_file)
#     {
#         my $msg = "NutchTextDumpPlugin INFO: Can't locate urls file $keep_urls_file.\n" .
#             "    No records will be filtered.\n";
#
#         print $outhandle $msg;
#
#         $self->{'keep_urls'} = undef;
#         # Not an error if there's no $keep_urls_file: it just means all records
#         # in dump.txt will be processed.
#     }
#     else {
#         #$self->{'keep_urls'} = $self->parse_keep_urls_file($keep_urls_file, $outhandle);
#         #$self->{'keep_urls'} = {};
#         $self->parse_keep_urls_file($keep_urls_file, $outhandle, $failhandle);
#     }
#
#     ## if($self->{'keep_urls'} && $verbosity > 2) {
#     ##     print STDERR "@@@@ keep_urls hash map contains:\n";
#     ##     map { print STDERR $_."=>".$self->{'keep_urls'}->{$_}."\n"; } keys %{$self->{'keep_urls'}};
#     ## }
#     $self->SUPER::init(@_);
# }

sub setup_keep_urls {
    my $self = shift (@_);

    my $verbosity = $self->{'verbosity'};
    my $outhandle = $self->{'outhandle'};
    my $failhandle = $self->{'failhandle'};

    $self->{'keep_urls_processed'} = 1; # flag to track whether this method has been called already during import

    #print $outhandle "@@@@ In NutchTextDumpPlugin::setup_keep_urls()\n";

    if(!$self->{'keep_urls_file'}) {
        my $msg = "NutchTextDumpPlugin INFO: No urls file provided.\n" .
            "    No records will be filtered.\n";
        print $outhandle $msg if ($verbosity > 2);

        return;
    }

    # read in the keep urls files
    my $keep_urls_file = &util::locate_config_file($self->{'keep_urls_file'});
    if (!defined $keep_urls_file)
    {
        # $keep_urls_file is undef here, so report the configured filename instead
        my $msg = "NutchTextDumpPlugin INFO: Can't locate urls file " . $self->{'keep_urls_file'} . ".\n" .
            "    No records will be filtered.\n";

        print $outhandle $msg;

        $self->{'keep_urls'} = undef;
        # TODO: Not a fatal error if $keep_urls_file can't be found: it just means all records
        # in dump.txt will be processed?
    }
    else {
        #$self->{'keep_urls'} = $self->parse_keep_urls_file($keep_urls_file, $outhandle);
        #$self->{'keep_urls'} = {};
        $self->parse_keep_urls_file($keep_urls_file, $outhandle, $failhandle);
    }

    #if(defined $self->{'keep_urls'}) {
    #    print STDERR "@@@@ keep_urls hash map contains:\n";
    #    map { print STDERR $_."=>".$self->{'keep_urls'}->{$_}."\n"; } keys %{$self->{'keep_urls'}};
    #}

}

# TODO: This is an ugly way to do this and a non-intuitive place to do it. Is there a better way?
# Overriding can_process_this_file() in order to avoid setting up the keep_urls hashmap during
# buildcol.pl. We only want to set up the hash during import.
# During buildcol, this method is called with directories, not files, so it returns
# false. When it returns true, we are in import.pl, and we check whether we have
# already set up the keep_urls map. If the keep urls file has not yet been processed,
# we set up the hashmap once.
sub can_process_this_file {
    my $self = shift(@_);
    my ($filename) = @_;
    my $can_process_return_val = $self->SUPER::can_process_this_file(@_);

    # We want to load in the keep_urls_file and create the keep_urls hashmap only once, during import,
    # because the keep urls file can be large, and it and the hashmap serve no purpose during buildcol.pl.
    # Check whether we've already processed the file/built the hashmap, as we don't want to do this
    # more than once, even within just the import cycle.
    if($can_process_return_val && !$self->{'keep_urls_processed'}) { #!defined $self->{'keep_urls'}) {
        $self->setup_keep_urls();
    }

    return $can_process_return_val;

}

sub parse_keep_urls_file {
    my $self = shift (@_);
    my ($urls_file, $outhandle, $failhandle) = @_;

    # https://www.caveofprogramming.com/perl-tutorial/perl-hashes-a-guide-to-associative-arrays-in-perl.html
    # https://stackoverflow.com/questions/1817394/whats-the-difference-between-a-hash-and-hash-reference-in-perl
    $self->{'keep_urls'} = {}; # hash reference init to {}

    # What if it is a very long file of URLs? Need to read a line at a time!
    #my $contents = &FileUtils::readUTF8File($urls_file); # could just call $self->read_file() inherited from SplitTextFile's parent ReadTextFile
    #my @lines = split(/(?:\r?\n)+/, $$textref);

    # Open the file in UTF-8 mode https://stackoverflow.com/questions/2220717/perl-read-file-with-encoding-method
    # and read it in line by line into the map
    my $fh;
    if (open($fh, '<:encoding(UTF-8)', $urls_file)) {
        while (defined (my $line = <$fh>)) {
            $line = &util::trim($line); #$line =~ s/^\s+|\s+$//g; # trim whitespace
            if($line =~ m@^https?://@) { # add only URLs
                $self->{'keep_urls'}->{$line} = 1; # add the url to our perl hash
            }
        }
        close $fh;
    } else {
        my $msg = "NutchTextDumpPlugin ERROR: Unable to open keep_urls_file: \"" .
            $self->{'keep_urls_file'} . "\".\n" .
            "    No records will be filtered.\n";
        print $outhandle $msg;
        print $failhandle $msg;
        # Not fatal. TODO: should it be fatal, rather than quietly processing all records,
        # when the specified keep-urls file exists but can't be opened?
    }

    # if the keep_urls hash is empty, ensure it is undefined from this point onward
    # https://stackoverflow.com/questions/9444915/how-to-check-if-a-hash-is-empty-in-perl
    # Note: dereference the hash reference before calling keys(), otherwise the
    # emptiness check can never succeed
    if(!keys %{$self->{'keep_urls'}}) {
        $self->{'keep_urls'} = undef;
    }

}

# Accept "dump.txt" files (which are in numeric siteID folders),
# and txt files named with a numeric siteID, e.g. "01441.txt",
# in case the dump.txt files have been preprocessed by renaming them this way.
sub get_default_process_exp {
    my $self = shift (@_);

    return q^(?i)((dump|\d+)\.txt)$^;
}


sub get_default_split_exp {

    # The previous line is either a newline or the start of dump.txt;
    # the current line should start with the url protocol and contain " key: .... http(s)/"
    # (\r\n for msdos eol, \n for unix)

    # The regex this method returns is passed into a call to perl's split().
    # Perl's split(), by default, throws away the delimiter.
    # Any capturing group that makes up or is part of the delimiter becomes a separate element returned by split.
    # We want to throw away the empty newlines preceding the first line of a record "https? .... key: https?/"
    # but we want to keep that first line as part of the upcoming record.
    # - To keep the first line of a record, though it becomes its own split-element, use capture groups in the split regex:
    #   https://stackoverflow.com/questions/14907772/split-but-keep-delimiter
    # - To skip the unwanted empty lines preceding the first line of a record, put ?: in front of their capture group
    #   to discard that group:
    #   https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions
    # - Next, use a positive look-ahead ((?= in front of a capture group, versus (?! for a negative look-ahead)
    #   to match but not capture the first line of a record, so the matched look-ahead is retained as the
    #   first line of the next record:
    #   https://stackoverflow.com/questions/14907772/split-but-keep-delimiter
    #   and http://www.regular-expressions.info/lookaround.html
    # - For a non-greedy match, use .*?
    #   https://stackoverflow.com/questions/11898998/how-can-i-write-a-regex-which-matches-non-greedy
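    #
    # A minimal sketch of the intended behaviour (hypothetical $dump_contents variable
    # holding the full text of one dump.txt file):
    #   my $split_exp = &get_default_split_exp();
    #   my @records = split(/$split_exp/, $dump_contents);
    # Each non-empty element of @records then starts with its record's own
    # "https?://... key: ..." line, because the look-ahead matches that line
    # without consuming it.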
    return q^(?:$|\r?\n\r?\n)(?=https?://.+?\skey:\s+.*?https?/)^;

}

# TODO: This method was copied from MARCPlugin.pm, with the immediate return statement
# uncommented for the case where encoding = utf8.
# Move it to a utility perl file, since the code is mostly shared?
# The bulk of this function is based on read_line in multiread.pm.
# Unable to use the original read_line because it expects to get its input
# from a file. Here the line to be converted is passed in as a string.

# TODO:
# Is this function even applicable to NutchTextDumpPlugin?
# I get errors in this method, at the decode step, when the encoding is utf-8.
# I get warnings/errors somewhere in this file (maybe also at decode) when the encoding is windows-1252.
sub to_utf8
{
    my $self = shift (@_);
    my ($encoding, $line) = @_;

    if ($encoding eq "utf8") {
        # nothing needs to be done
        return $line;
    } elsif ($encoding eq "iso_8859_1" || $encoding eq "windows-1252") { # treat windows-1252 like latin-1 here too
        # we'll use ascii2utf8() for this as it's faster than going
        # through convert2unicode()
        #return &unicode::ascii2utf8 (\$line);
        $line = &unicode::ascii2utf8 (\$line);
    } else {

        # everything else uses unicode::convert2unicode
        $line = &unicode::unicode2utf8 (&unicode::convert2unicode ($encoding, \$line));
    }

    # At this point $line is a binary byte string
    # => turn it into a Unicode aware string, so full
    # Unicode aware pattern matching can be used.
    # For instance: 's/\x{0101}//g' or '[[:upper:]]'

    return decode ("utf8", $line);
}
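
# Example call (a sketch: it assumes $encoding has already been normalised from
# "utf-8" to the "utf8" spelling this method expects, as process() below does):
#   $title_meta = $self->to_utf8($encoding, $title_meta);
#   # $title_meta is now a Unicode-aware Perl string, safe for pattern matching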



# do plugin specific processing of doc_obj
# This gets done for each record found by SplitTextFile in the dump.txt files.
sub process {
    my $self = shift (@_);
    my ($textref, $pluginfo, $base_dir, $file, $metadata, $doc_obj, $gli) = @_;

    my $outhandle = $self->{'outhandle'};
    my $filename = &util::filename_cat($base_dir, $file);

    my $cursection = $doc_obj->get_top_section();


    #print STDERR "---------------\nDUMP.TXT\n---------\n", $$textref, "\n------------------------\n";


    # (1) parse out the metadata of this record
    my $metaname;
    my $encoding;
    my $title_meta;

    my $line_index = 0;
    my $text_start_index = -1;
    my @lines = split(/(?:\r?\n)+/, $$textref);

    foreach my $line (@lines) {
        # first line is special and contains the URL (no metaname)
        # and the inverted URL labelled with metaname "key"
        if($line =~ m/^https?/ && $line =~ m/\s+key:\s+/) {
            my @vals = split(/key:/, $line);
            # get url and key, and trim whitespace simultaneously
            my $url = &util::trim($vals[0]);
            my $key = &util::trim($vals[1]);

            # if we have a keep_urls hash, then only process records of whitelisted urls
            if(defined $self->{'keep_urls'} && !$self->{'keep_urls'}->{$url}) {
                # URL not whitelisted, so stop processing this record
                print STDERR "@@@@@@ INFO NutchTextDumpPlugin::process(): discarding record for URL not whitelisted: $url\n"
                    if $self->{'verbosity'} > 3;
                return 0;
            } else {
                print STDERR "@@@@@@ INFO NutchTextDumpPlugin::process(): processing record of whitelisted URL $url...\n"
                    if $self->{'verbosity'} > 3;
            }
            $doc_obj->add_utf8_metadata ($cursection, "ex.srcURL", $url);
            $doc_obj->add_utf8_metadata ($cursection, "ex.key", $key);


            # let's also set the domain from the URL, as that will make a
            # more informative bookshelf label than siteID
            # For the complete domain, keep protocol:// and every non-slash character after it.
            # (This avoids requiring the presence of a subsequent slash.)
            # https://stackoverflow.com/questions/3652527/match-regex-and-assign-results-in-single-line-of-code
            # Can clean up the protocol and www. in the List classifier's bookshelf's remove_prefix option,
            # or can build the classifier on basicDomain instead.

            my ($domain, $basicDomain) = $url =~ m@(^https?://(?:www\.)?([^/]+)).*@;
            #my ($domain, $protocol, $basicdomain) = $url =~ m@((^https?)://([^/]+)).*@; # Works
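            # e.g. for $url = "https://www.whanau-tahi.school.nz/cart", this sets
            #   $domain      = "https://www.whanau-tahi.school.nz"
            #   $basicDomain = "whanau-tahi.school.nz"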
            $doc_obj->add_utf8_metadata ($cursection, "ex.srcDomain", $domain);
            $doc_obj->add_utf8_metadata ($cursection, "ex.basicDomain", $basicDomain);

        }
        # check for full text
        elsif ($line =~ m/text:start:/) {
            $text_start_index = $line_index;
            last; # if we've reached the full text portion, we're past the metadata portion of this record
        }
        elsif($line =~ m/^[^:]+:.+$/) { # look for meta #elsif($line =~ m/^[^:]+:[^:]+$/) { # won't allow protocol://url in metavalue
            # Split on the FIRST ":" only (the third argument limits split to 2 fields),
            # since metadata values may themselves contain ":", e.g. a URL in protocolStatus.
            my ($metaname, $metavalue) = split(/:/, $line, 2);
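            # e.g. "metadata CharEncodingForConversion : utf-8" (from the example record above)
            # splits into metaname "metadata CharEncodingForConversion " and metavalue " utf-8";
            # both get trimmed below, and the "marker "/"metadata " prefix is stripped further down.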

            # skip "metadata _rs_" and "metadata _csh_" as these contain illegible characters for values
            if($metaname !~ m/metadata\s+_(rs|csh)_/) {

                # trim whitespace
                $metaname = &util::trim($metaname);
                $metavalue = &util::trim($metavalue);

                if($metaname eq "title") { # TODO: what to do about "title: null" cases?
                    ##print STDERR "@@@@ Found title: $metavalue\n";
                    #$metaname = "Title"; # will set "title" as "Title" metadata instead
                    # TODO: treat title metadata specially by using character encoding to store correctly?

                    # Won't add Title metadata to docObj until after all meta is processed,
                    # when we'll know the encoding and can process title meta
                    $title_meta = $metavalue;
                    $metavalue = ""; # will force ex.Title metadata to be added AFTER the for loop
                }
                elsif($metaname =~ m/CharEncodingForConversion/) { # TODO: or look for "OriginalCharEncoding"?
                    ##print STDERR "@@@@ Found encoding: $metavalue\n";
                    $encoding = $metavalue; # TODO: should we use this to interpret the text and title in the correct encoding and convert to utf-8?

                    if($encoding eq "utf-8") {
                        $encoding = "utf8"; # method to_utf8() recognises "utf8", not "utf-8"
                    } else {
                        print STDERR "@@@@@@ WARNING NutchTextDumpPlugin::process(): Record's Nutch-assigned CharEncodingForConversion was not utf-8: $encoding\n";
                    }


                }

                # move occurrences of "marker " or "metadata " strings at start of metaname to end
                #$metaname =~ s/^(marker|metadata)\s+(.*)$/$2$1/;
                # remove "marker " or "metadata " strings from start of metaname
                $metaname =~ s/^(marker|metadata)\s+//;
                # remove underscores and all remaining spaces in metaname
                $metaname =~ s/[ _]//g;

                # add meta to docObject if both metaname and metavalue are non-empty strings
                if($metaname ne "" && $metavalue ne "") { # && $metaname ne "rs" && $metaname ne "csh") {
                    $doc_obj->add_utf8_metadata ($cursection, "ex.".$metaname, $metavalue);
                    #print STDERR "Added meta |$metaname| = |$metavalue|\n"; #if $metaname =~ m/ProtocolStatus/i;
                }

            }
        } elsif ($line !~ m/^\s*$/) { # Not expecting any other type of non-empty line (or even empty lines)
            print STDERR "NutchTextDump line not recognised as URL meta, other metadata or text content:\n\t$line\n";
        }

        $line_index++;
    }


    # Add FileFormat metadata
    $doc_obj->add_metadata($cursection, "FileFormat", "NutchDumpTxt");

    # Correct the title metadata using the encoding, now that we finally have $encoding
    # $title_meta = $self->to_utf8($encoding, $title_meta) if $encoding;
    # https://stackoverflow.com/questions/12994100/perl-encode-pm-cannot-decode-string-with-wide-character
    # Error message: "Perl Encode.pm cannot decode string with wide character"
    # "That error message is saying that you have passed in a string that has already been decoded
    # (and contains characters above codepoint 255). You can't decode it again."
    if($title_meta && $title_meta ne "" && $title_meta ne "null") {
        $title_meta = $self->to_utf8($encoding, $title_meta) if ($encoding);
    } else { # if we have "null" as title metadata, set it to the record URL?
        #my $srcURLs = $doc_obj->get_metadata($cursection, "ex.srcURL");
        #print STDERR "@@@@ null title to be replaced with ".$srcURLs->[0]."\n";
        #$title_meta = $srcURLs->[0] if (scalar @$srcURLs > 0);
        my $srcURL = $doc_obj->get_metadata_element($cursection, "srcURL", 1); # TODO: why does ex.srcURL not work, nor srcURL without the 3rd param?
        if(defined $srcURL) {
            print STDERR "@@@@ null/empty title to be replaced with ".$srcURL."\n"
                if $self->{'verbosity'} > 3;
            $title_meta = $srcURL;
        }
    }
    $doc_obj->add_utf8_metadata ($cursection, "Title", $title_meta);


    # When importOption OIDtype = dirname, the base_OID will be that dirname,
    # which was crafted to be the siteID. However, because our siteID is all numeric,
    # a 'D' gets prepended to create the baseOID. Remove the starting 'D' to get the actual siteID.
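    # e.g. a siteID folder named "01441" (hypothetical, matching get_default_process_exp above)
    # gives base OID "D01441", from which we recover the siteID "01441".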
    my $siteID = $self->get_base_OID($doc_obj);
    #print STDERR "BASE OID: " . $self->get_base_OID($doc_obj) . "\n";
    $siteID =~ s/^D//;
    $doc_obj->add_utf8_metadata ($cursection, "ex.siteID", $siteID);


    # (2) parse out text of this record
    # if($text_start_index != -1 && pop(@lines) =~ m/text:end:/) { # we only have text content if there were "text:start:" and "text:end:" markers.
    #     # TODO: are we guaranteed popped line is text:end: and not empty/newline?
    #     @lines = splice(@lines,0,$text_start_index+1); # just keep every line AFTER text:start:, have already removed (popped) "text:end:"
    #
    #     # glue together remaining lines, if there are any, into textref
    #     # https://stackoverflow.com/questions/7406807/find-size-of-an-array-in-perl
    #     if(scalar (@lines) > 0) {
    #         # TODO: do anything with $encoding to convert line to utf-8?
    #         foreach my $line (@lines) {
    #             $line = $self->to_utf8($encoding, $line) if $encoding; #if $encoding ne "utf-8";
    #             $$textref .= $line."\n";
    #         }
    #     }
    #     $$textref = "<pre>\n".$$textref."</pre>";
    # } else {
    #     print STDERR "WARNING: NutchTextDumpPlugin::process: had found a text start marker but not text end marker.\n";
    #     $$textref = "<pre></pre>";
    # }

    # (2) parse out text of this record
    my $no_text = 1;
    if($text_start_index != -1) { # had found a "text:start:" marker, so we should have text content for this record
        # the /s modifier lets . match newlines too, in case a record's text content spans multiple lines
        if($$textref =~ m/text:start:\r?\n(.*?)\r?\ntext:end:/s) {
            $$textref = $1;
            if($$textref !~ m/^\s*$/) {
                $$textref = $self->to_utf8($encoding, $$textref) if ($encoding);
                $$textref = "<pre>\n".$$textref."\n</pre>";
                $no_text = 0;
            }
        }
    }
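    # e.g. for the example record near the top of this file, $$textref now holds:
    #   <pre>
    #   Te Kura Kaupapa Māori o Te Whānau Tahi He mihi He mihi ...
    #   </pre>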
    if($no_text) {
        $$textref = "<pre></pre>";
    }

    # Debugging
    # To avoid "wide character in print" messages when debugging, set binmode of the handle to utf8/encoding
    # https://stackoverflow.com/questions/15210532/use-of-use-utf8-gives-me-wide-character-in-print
    # if ($self->{'verbosity'} > 3) {
    #     if($encoding && $encoding eq "utf8") {
    #         binmode STDERR, ':utf8';
    #     }
    #
    #     print STDERR "TITLE: $title_meta\n";
    #     print STDERR "ENCODING = $encoding\n" if $encoding;
    #     #print STDERR "---------------\nTEXT CONTENT\n---------\n", $$textref, "\n------------------------\n";
    # }


    $doc_obj->add_utf8_text($cursection, $$textref);

    return 1;
}


1;