source: main/trunk/greenstone2/perllib/plugins/NutchTextDumpPlugin.pm@ 34121

Last change on this file since 34121 was 34121, checked in by ak19, 4 years ago
  1. Introducing NutchTextDumpPlugin to process the records (representing web pages' text content) of the dump.txt files produced for each website crawled by Nutch. Created for handling the commoncrawl URLs of interest that we recrawled with Nutch. This first version does everything, but the code requires more cleaning up. 2. Also added a useful util::trim() function as I kept reusing the same code several times.
File size: 29.6 KB
###########################################################################
#
# NutchTextDumpPlugin.pm -- plugin for dump.txt files generated by Nutch
#
# A component of the Greenstone digital library software
# from the New Zealand Digital Library Project at the
# University of Waikato, New Zealand.
#
# Copyright (C) 2002 New Zealand Digital Library Project
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
###########################################################################

# This plugin was originally created to process Nutch dump.txt files produced by recrawling commoncrawl (CC)
# results for pages detected by CC as being in Maori.
# It splits each web site's dump.txt into its individual records: since each record represents a web page,
# this produces one greenstone document per web page.
#
# For a commoncrawl collection of siteID-labelled folders, each containing a dump.txt file,
# set <importOption name="OIDtype" value="dirname"/>
# and create 2 List browsing classifiers (with bookshelf_type set to always) on ex.siteID and ex.srcDomain,
# both sorted by ex.srcURL, as well as an ex.Title classifier.
# For the ex.srcDomain classifier, set removeprefix to: https?\:\/\/(www\.)?
# (a sketch of these collectionConfig.xml entries follows the format statement below)
# Finally, in the "display" format statement, add the following before the "wrappedSectionText" to
# display the most relevant metadata of each record:
#    <gsf:template name="documentContent">
#      <div id="nutch-dump-txt-record">
#        <h3>Record:</h3>
#        <br/>
#        <dl>
#          <dt>URL:</dt>
#          <dd>
#            <gsf:metadata name="srcURL"/>
#          </dd>
#          <dt>Title:</dt>
#          <dd>
#            <gsf:metadata name="ex.Title"/>
#          </dd>
#          <dt>SiteID:</dt>
#          <dd>
#            <gsf:metadata name="siteID"/>
#          </dd>
#          <dt>Status:</dt>
#          <dd>
#            <gsf:metadata name="status"/>
#          </dd>
#          <dt>ProtocolStatus:</dt>
#          <dd>
#            <gsf:metadata name="protocolStatus"/>
#          </dd>
#          <dt>ParseStatus:</dt>
#          <dd>
#            <gsf:metadata name="parseStatus"/>
#          </dd>
#          <dt>CharEncodingForConversion:</dt>
#          <dd>
#            <gsf:metadata name="CharEncodingForConversion"/>
#          </dd>
#          <dt>OriginalCharEncoding:</dt>
#          <dd>
#            <gsf:metadata name="OriginalCharEncoding"/>
#          </dd>
#        </dl>
#      </div>
#    </gsf:template>

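# For reference, a sketch of the collectionConfig.xml entries described above. This is an
# illustration only: the List classifier option names used here (-metadata, -bookshelf_type,
# -sort_leaf_nodes_using, -removeprefix) are assumptions based on the standard Greenstone
# List classifier, so check them in GLI before copying verbatim:
#
#    <importOption name="OIDtype" value="dirname"/>
#
#    <classifier name="List">
#      <option name="-metadata" value="ex.siteID"/>
#      <option name="-bookshelf_type" value="always"/>
#      <option name="-sort_leaf_nodes_using" value="ex.srcURL"/>
#    </classifier>
#    <classifier name="List">
#      <option name="-metadata" value="ex.srcDomain"/>
#      <option name="-bookshelf_type" value="always"/>
#      <option name="-sort_leaf_nodes_using" value="ex.srcURL"/>
#      <option name="-removeprefix" value="https?\:\/\/(www\.)?"/>
#    </classifier>
#    <classifier name="List">
#      <option name="-metadata" value="ex.Title"/>
#    </classifier>
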
# TODO: remove illegible values for metadata _rs_ and _csh_ in the example below before
# committing, in case their encoding affects the loading/reading in of this perl file.
#
# Example record in dump.txt to process:
#   https://www.whanau-tahi.school.nz/ key: nz.school.whanau-tahi.www:https/
#   baseUrl: null
#   status: 2 (status_fetched)
#   fetchTime: 1575199241154
#   prevFetchTime: 1572607225779
#   fetchInterval: 2592000
#   retriesSinceFetch: 0
#   modifiedTime: 0
#   prevModifiedTime: 0
#   protocolStatus: SUCCESS, args=[]
#   signature: d84c84ccf0c86aa16a19e03cb1fc5827
#   parseStatus: success/ok (1/0), args=[]
#   title: Te Kura Kaupapa Māori o Te Whānau Tahi
#   score: 1.0
#   marker _injmrk_ : y
#   marker _updmrk_ : 1572607228-9584
#   marker dist : 0
#   reprUrl: null
#   batchId: 1572607228-9584
#   metadata CharEncodingForConversion : utf-8
#   metadata OriginalCharEncoding : utf-8
#   metadata _rs_ : ᅵ
#   metadata _csh_ :
#   text:start:
#   Te Kura Kaupapa Māori o Te Whānau Tahi He mihi He mihi Te Kaupapa Ngā Tāngata Te Kākano Te Pihinga Te Tipuranga Te Puāwaitanga Te Tari Te Poari Matua Whakapā mai He mihi He mihi Te Kaupapa Ngā Tāngata Te Kākano Te Pihinga Te Tipuranga Te Puāwaitanga Te Tari Te Poari Matua Whakapā mai TE KURA KAUPAPA MĀORI O TE WHĀNAU TAHI He mihi Kei te mōteatea tonu nei ngā mahara ki te huhua kua mene atu ki te pō, te pōuriuri, te pōtangotango, te pō oti atu rā. Kua rite te wāhanga ki a rātou, hoki mai ki te ao tūroa nei Ko Io Matua Kore te pūtaketanga, te pūkaea, te pūtātara ka rangona whānuitia e te ao. Ko tāna ko ngā whetū, te marama, te haeata ki a Tamanui te rā. He atua i whakateretere mai ai ngā waka i tawhiti nui, i tawhiti roa, i tawhiti mai rā anō. Kei nga ihorei, kei ngā wahapū, kei ngā pukumahara, kei ngā kanohi kai mātārae o tō tātou nei kura Aho Matua, Te Kura Kaupapa Māori o Te Whanau Tahi. Anei rā te maioha ki a koutou katoa e pūmau tonu ki ngā wawata me ngā whakakitenga i whakatakotoria e ngā poupou i te wā i a rātou. Ka whakanuia hoki te toru tekau tau o tēnei kura mai i tōna orokohanga timatanga tae noa ki tēnei wā Ka pūmau tōnu mātou ki te whakatauki o te kura e mea ana “Poipoia ō tātou nei pūmanawa” Takiritia tonutia te ra ki runga i Te Kura Kaupapa Maori o Te Whanau Tahi . Back to Top " Poipoia ō tātou nei pūmanawa -  Making our potential a reality "   ©  Te Kura Kaupapa Māori o Te Whānau Tahi, 2019  Cart ( 0 )
#   text:end:
#
#   https://www.whanau-tahi.school.nz/cart key: nz.school.whanau-tahi.www:https/cart
#   baseUrl: null
#   status: 2 (status_fetched)
#   ...
#
# - Some records may have empty text content between the text:start: and text:end: markers,
#   while other records may be missing these markers along with any text.
# - Metadata is of the form "key : value", but some metadata values themselves contain ":":
#   for example, the "protocolStatus" metadata can contain a URL as its value, and a URL's
#   protocol portion contains ":".
# - metadata _rs_ and _csh_ contain illegible values, so this code discards them when storing metadata.
#
# If you provide a keep_urls_file when configuring NutchTextDumpPlugin and its path is relative,
# the plugin will look for the file (such as a urls.txt) in the collection's etc folder.
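#
# An illustrative keep_urls_file (hypothetical contents): one absolute URL per line.
# parse_keep_urls_file() below only keeps lines starting with http:// or https://, so
# blank lines and comment lines are effectively ignored:
#
#    https://www.whanau-tahi.school.nz/
#    https://www.whanau-tahi.school.nz/cart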


package NutchTextDumpPlugin;

use SplitTextFile;

use Encode;
use unicode;
use util;

use strict;
no strict 'refs'; # allow filehandles to be variables and vice versa

# TODO:
# + 1. Split each dump.txt file into its individual records as individual docs
# + 2. Store the meta of each individual record/doc
# ? 3. Name each doc siteID.docID, else HASH of internal text. See EmailPlugin?
#      - In SplitTextFile::read(), why is $segment, which also counts discarded docs, used to add the record ID
#        rather than $count, which only counts included docs? I am referring to this code:
#            $self->add_OID($doc_obj, $id, $segment);
#        The way I've solved this is by setting the OIDtype importOption. Not sure if this is what was required.
# + 4. Keep a map of all URLs seen - whitelist URLs.
# + 5. Implement the optional input file of URLs: if an infile is provided, keep only those records
#      whose URLs are in the map. Only these matching records should become docs.
#   6. Rebuild the full collection of all dump.txt files with this collection design.
#
# TIDY UP:
# + Create util::trim()
# + Add to perl's strings.properties: NutchTextDumpPlugin.keep_urls_file
#
# CLEANUP:
# Remove MetadataRead functions and inheritance
#
# QUESTIONS:
# - encoding = "utf-8" was changed to "utf8" as required by the copied-in to_utf8(str) method. Why does that
#   method not convert the string parameter, but instead fail in the decode() step? Is it because the string
#   is already in UTF8?
# - Should I add metadata as "ex."+meta or as meta? e.g. ex.srcURL or srcURL?
# - I want to read in the keep_urls_file, maintaining a hashmap of its URLs, only on import; isn't that correct?
#   Then how can I initialise this only once and only during import? The constructor and init() methods are
#   called during buildcol too.
#   For now, I've done it in can_process_this_file(), but there must be a more appropriate place and correct way to do this.
# - TODOs
# - Why can't I do doc_obj->get_meta_element($section, "ex.srcURL"), but instead have to pass "srcURL" and 1 to
#   ignore the namespace?
# - In the collectionConfig file I have to leave out the ex. prefix for all but Title. Why?
# - In GLI's browsing classifier sort_leaf options, "ex.srcURL" appears only as "ex.srcurl" (lowercased). Why?
# - On the other hand, in GLI's search indexes, both ex.srcurl and ex.srcURL appear. But only building
#   with an index on ex.srcURL provides a search option in the search box. And then searching on an existing
#   srcURL produces 0 results anyway.
# - Is this all because I am naming my ex metadata names wrongly? e.g. ex.srcURL, ex.siteID, ex.srcDomain.
#
# CHECK:
# - title fallback is URL.
# - util::tidy_up_OID() prints a warning. SiteID is the folder name and OIDtype=dirname, so converting a fully
#   numeric siteID to an OID results in a warning message that the siteID is fully numeric and gets 'D' prefixed.
#   Is this warning still necessary?

# methods defined in superclasses that have the same signature take
# precedence in the order given in the ISA list. We want MetaPlugins to
# call MetadataRead's can_process_this_file_for_metadata(), rather than
# calling BaseImporter's version of the same method, so list inherited
# superclasses in this order.
sub BEGIN {
    @NutchTextDumpPlugin::ISA = ('SplitTextFile');
    unshift (@INC, "$ENV{'GSDLHOME'}/perllib/cpan");
}

my $arguments =
    [ { 'name' => "keep_urls_file",
        'desc' => "{NutchTextDumpPlugin.keep_urls_file}",
        'type' => "string",
        #'deft' => "urls.txt",
        'reqd' => "no" },
      { 'name' => "process_exp",
        'desc' => "{BaseImporter.process_exp}",
        'type' => "regexp",
        'reqd' => "no",
        'deft' => &get_default_process_exp() },
      { 'name' => "split_exp",
        'desc' => "{SplitTextFile.split_exp}",
        'type' => "regexp",
        'reqd' => "no",
        'deft' => &get_default_split_exp() }
    ];

my $options = { 'name' => "NutchTextDumpPlugin",
                'desc' => "{NutchTextDumpPlugin.desc}",
                'abstract' => "no",
                'inherits' => "yes",
                'explodes' => "yes",
                'args' => $arguments };

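# A hypothetical sketch of how this plugin might be configured in a collection's
# collectionConfig.xml, passing in the keep_urls_file option (the filename urls.txt is
# just an example; as noted above, a relative path is looked up in the collection's
# etc folder):
#
#    <plugin name="NutchTextDumpPlugin">
#      <option name="-keep_urls_file" value="urls.txt"/>
#    </plugin>
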
sub new {
    my ($class) = shift (@_);
    my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
    push(@$pluginlist, $class);

    push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});
    push(@{$hashArgOptLists->{"OptList"}},$options);

    my $self = new SplitTextFile($pluginlist, $inputargs, $hashArgOptLists);

    if ($self->{'info_only'}) {
        # don't worry about the options
        return bless $self, $class;
    }

    $self->{'keep_urls_processed'} = 0;
    $self->{'keep_urls'} = undef;
    $self->{'type'} = ""; # TODO: value can be 'ascii' or other. Used in MARCPlugin.pm. Keep this field here?

    #return bless $self, $class;
    $self = bless $self, $class;

    # Can only call any methods on $self AFTER the bless operation above
    #$self->setup_keep_urls(); # want to set up the keep_urls hashmap only once, so have to do it here (init is also called by buildcol)

    return $self;
}

# sub init {
#     my $self = shift (@_);
#     my ($verbosity, $outhandle, $failhandle) = @_;
#
#     if(!$self->{'keep_urls_file'}) {
#         my $msg = "NutchTextDumpPlugin INFO: No urls file provided.\n" .
#             "         No records will be filtered.\n";
#         print $outhandle $msg if ($verbosity > 2);
#
#         $self->SUPER::init(@_);
#         return;
#     }
#
#     # read in the keep urls file
#     my $keep_urls_file = &util::locate_config_file($self->{'keep_urls_file'});
#     if (!defined $keep_urls_file)
#     {
#         my $msg = "NutchTextDumpPlugin INFO: Can't locate urls file " . $self->{'keep_urls_file'} . ".\n" .
#             "         No records will be filtered.\n";
#
#         print $outhandle $msg;
#
#         $self->{'keep_urls'} = undef;
#         # Not an error if there's no $keep_urls_file: it just means all records
#         # in dump.txt will be processed.
#     }
#     else {
#         #$self->{'keep_urls'} = $self->parse_keep_urls_file($keep_urls_file, $outhandle);
#         #$self->{'keep_urls'} = {};
#         $self->parse_keep_urls_file($keep_urls_file, $outhandle, $failhandle);
#     }
#
#     ## if($self->{'keep_urls'} && $verbosity > 2) {
#     ##     print STDERR "@@@@ keep_urls hash map contains:\n";
#     ##     map { print STDERR $_."=>".$self->{'keep_urls'}->{$_}."\n"; } keys %{$self->{'keep_urls'}};
#     ## }
#     $self->SUPER::init(@_);
# }

sub setup_keep_urls {
    my $self = shift (@_);

    my $verbosity = $self->{'verbosity'};
    my $outhandle = $self->{'outhandle'};
    my $failhandle = $self->{'failhandle'};

    $self->{'keep_urls_processed'} = 1; # flag to track whether this method has been called already during import

    #print $outhandle "@@@@ In NutchTextDumpPlugin::setup_keep_urls()\n";

    if(!$self->{'keep_urls_file'}) {
        my $msg = "NutchTextDumpPlugin INFO: No urls file provided.\n" .
            "         No records will be filtered.\n";
        print $outhandle $msg if ($verbosity > 2);

        return;
    }

    # read in the keep urls file
    my $keep_urls_file = &util::locate_config_file($self->{'keep_urls_file'});
    if (!defined $keep_urls_file)
    {
        # print the configured filename, since $keep_urls_file is undefined here
        my $msg = "NutchTextDumpPlugin INFO: Can't locate urls file " . $self->{'keep_urls_file'} . ".\n" .
            "         No records will be filtered.\n";

        print $outhandle $msg;

        $self->{'keep_urls'} = undef;
        # TODO: Not a fatal error if the keep_urls_file can't be found: it just means all records
        # in dump.txt will be processed?
    }
    else {
        #$self->{'keep_urls'} = $self->parse_keep_urls_file($keep_urls_file, $outhandle);
        #$self->{'keep_urls'} = {};
        $self->parse_keep_urls_file($keep_urls_file, $outhandle, $failhandle);
    }

    #if(defined $self->{'keep_urls'}) {
    #    print STDERR "@@@@ keep_urls hash map contains:\n";
    #    map { print STDERR $_."=>".$self->{'keep_urls'}->{$_}."\n"; } keys %{$self->{'keep_urls'}};
    #}

}

# TODO: This is an ugly way to do this and a non-intuitive place to do it. Is there a better way?
# Overriding can_process_this_file() in order to avoid setting up the keep_urls hashmap during
# buildcol.pl. We only want to set up the hash during import.
# During buildcol, this method is called with directories and not files, so it will return
# false as a result. So when it returns true, we're in import.pl, and we check whether we haven't
# already set up the keep_urls map. If the keep urls file has not yet been processed, then we set up
# the hashmap once.
sub can_process_this_file {
    my $self = shift(@_);
    my ($filename) = @_;
    my $can_process_return_val = $self->SUPER::can_process_this_file(@_);

    # We want to load in the keep_urls_file and create the keep_urls hashmap only once, during import,
    # because the keep urls file can be large, and it and the hashmap serve no purpose during buildcol.pl.
    # Check whether we've already processed the file/built the hashmap, as we don't want to do this
    # more than 1 time even within just the import cycle.
    if($can_process_return_val && !$self->{'keep_urls_processed'}) { #!defined $self->{'keep_urls'}) {
        $self->setup_keep_urls();
    }

    return $can_process_return_val;

}

sub parse_keep_urls_file {
    my $self = shift (@_);
    my ($urls_file, $outhandle, $failhandle) = @_;

    # https://www.caveofprogramming.com/perl-tutorial/perl-hashes-a-guide-to-associative-arrays-in-perl.html
    # https://stackoverflow.com/questions/1817394/whats-the-difference-between-a-hash-and-hash-reference-in-perl
    $self->{'keep_urls'} = {}; # hash reference init to {}

    # What if it is a very long file of URLs? Need to read a line at a time!
    #my $contents = &FileUtils::readUTF8File($urls_file); # could just call $self->read_file() inherited from SplitTextFile's parent ReadTextFile
    #my @lines = split(/(?:\r?\n)+/, $$textref);

    # Open the file in UTF-8 mode https://stackoverflow.com/questions/2220717/perl-read-file-with-encoding-method
    # and read it in line by line into the map
    my $fh;
    if (open($fh, '<:encoding(UTF-8)', $urls_file)) {
        while (defined (my $line = <$fh>)) {
            $line = &util::trim($line); #$line =~ s/^\s+|\s+$//g; # trim whitespace
            if($line =~ m@^https?://@) { # add only URLs
                $self->{'keep_urls'}->{$line} = 1; # add the url to our perl hash
            }
        }
        close $fh;
    } else {
        my $msg = "NutchTextDumpPlugin ERROR: Unable to open file keep_urls_file: \"" .
            $self->{'keep_urls_file'} . "\".\n " .
            "         No records will be filtered.\n";
        print $outhandle $msg;
        print $failhandle $msg;
        # Not fatal. TODO: should it be fatal when it can still process all URLs just because
        # it can't find the specified keep-urls.txt file?
    }

    # if the keep_urls hash is empty, ensure it is undefined from this point onward
    # https://stackoverflow.com/questions/9444915/how-to-check-if-a-hash-is-empty-in-perl
    # ($self->{'keep_urls'} is a hash reference, so dereference it before calling keys())
    if(!keys %{$self->{'keep_urls'}}) {
        $self->{'keep_urls'} = undef;
    }
}

sub get_default_process_exp {
    my $self = shift (@_);

    return q^(?i)((dump|\d+)\.txt)$^;
}


sub get_default_split_exp {

    # the prev line is either a newline or the start of dump.txt
    # the current line should start with the url protocol and contain " key: .... http(s)/"
    # \r\n for msdos eol, \n for unix

    #return q^($|\r?\n)https?://\w+\s+key:\s+\w+https?/^;
    #return q^\r?\n(text:end:|metadata _csh_ :)\r?\n\r?\n^;

    #return q^(\r?\n)*https?://\w+\s+key:\s+\w+https?/\s*\r?\n^;

    #return q^(?:$|\r?\n\r?\n)(https?://.+?\skey:\s+.*?https?/)^;

    #return q^($|\r?\n\r?\n)https?://^;

    #return q^\r?\n(text:end:)\r?\n\r?\n^;

    #return q^\r?\n\s*\r?\n|\[\w+\]Record type: USmarc^;

    # split by default throws away the delimiter.
    # Any capturing group that makes up or is part of the delimiter becomes a separate element returned by split.
    # We want to throw away the empty newlines preceding the first line of a record "https? .... key: https?/",
    # but we want to keep that first line as part of the upcoming record.
    # - To keep the first line of a record, though it becomes its own split-element, use capture groups in the split regex:
    #   https://stackoverflow.com/questions/14907772/split-but-keep-delimiter
    # - To skip the unwanted empty lines preceding the first line of a record, use ?: in front of its capture group
    #   to discard that group:
    #   https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions
    # - Next use a positive look-ahead (?= in front of the capture group (vs ?! for negative look-ahead)
    #   to match but not capture the first line of a record (so the matched look-ahead is retained as the
    #   first line of the next record):
    #   https://stackoverflow.com/questions/14907772/split-but-keep-delimiter
    #   and http://www.regular-expressions.info/lookaround.html
    # - For a non-greedy match, use .*?
    #   https://stackoverflow.com/questions/11898998/how-can-i-write-a-regex-which-matches-non-greedy
    return q^(?:$|\r?\n\r?\n)(?=https?://.+?\skey:\s+.*?https?/)^;

}
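
# A minimal illustration of how the expression returned above behaves ($dump_contents is
# hypothetical here; SplitTextFile::read() performs the real split):
#
#    my $split_exp = &get_default_split_exp();
#    my @segments = split(/$split_exp/, $dump_contents);
#    # The empty separator lines between records are discarded, while the lookahead leaves
#    # each record's first line ("https://... key: ...https/") at the start of its own segment.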

# TODO: COPIED METHOD STRAIGHT FROM MarcPlugin.pm - move to a utility perl file?
# The bulk of this function is based on read_line in multiread.pm.
# We are unable to use the original read_line because it expects to get its input
# from a file. Here the line to be converted is passed in as a string.

sub to_utf8
{
    my $self = shift (@_);
    my ($encoding, $line) = @_;

    if ($encoding eq "utf8") {
        # nothing needs to be done
        #return $line;
    } elsif ($encoding eq "iso_8859_1") {
        # we'll use ascii2utf8() for this as it's faster than going
        # through convert2unicode()
        #return &unicode::ascii2utf8 (\$line);
        $line = &unicode::ascii2utf8 (\$line);
    } else {

        # everything else uses unicode::convert2unicode
        $line = &unicode::unicode2utf8 (&unicode::convert2unicode ($encoding, \$line));
    }
    # At this point $line is a binary byte string
    # => turn it into a Unicode-aware string, so full
    # Unicode-aware pattern matching can be used.
    # For instance: 's/\x{0101}//g' or '[[:upper:]]'

    return decode ("utf8", $line);
}
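
# An illustrative call (not taken verbatim from process() below): converting a Latin-1
# encoded title into a Unicode-aware Perl string before adding it as metadata:
#
#    my $unicode_title = $self->to_utf8("iso_8859_1", $raw_title);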


# do plugin specific processing of doc_obj
# This gets done for each record found by SplitTextFile in the dump.txt files.
sub process {
    my $self = shift (@_);
    my ($textref, $pluginfo, $base_dir, $file, $metadata, $doc_obj, $gli) = @_;

    my $outhandle = $self->{'outhandle'};
    my $filename = &util::filename_cat($base_dir, $file);

    my $cursection = $doc_obj->get_top_section();


    #print STDERR "---------------\nDUMP.TXT\n---------\n", $$textref, "\n------------------------\n";


    # (1) parse out the metadata of this record
    my $metaname;
    my $encoding;
    my $title_meta;

    my $line_index = 0;
    my $text_start_index = -1;
    my @lines = split(/(?:\r?\n)+/, $$textref);

    foreach my $line (@lines) {
        # first line is special and contains the URL (no metaname)
        # and the inverted URL labelled with metaname "key"
        if($line =~ m/^https?/ && $line =~ m/\s+key:\s+/) {
            my @vals = split(/key:/, $line);
            my $url = $vals[0];
            my $key = $vals[1];
            # trim whitespace https://perlmaven.com/trim
            $url = &util::trim($url); #=~ s/^\s+|\s+$//g;
            $key = &util::trim($key); #=~ s/^\s+|\s+$//g;

            # if we have a keep_urls hash, then only process records of whitelisted urls
            if(defined $self->{'keep_urls'} && !$self->{'keep_urls'}->{$url}) {
                # URL not whitelisted, so stop processing this record
                print STDERR "@@@@@@ INFO NutchTextDumpPlugin::process(): discarding record for URL not whitelisted: $url\n"
                    if $self->{'verbosity'} > 3;
                return 0;
            } else {
                print STDERR "@@@@@@ INFO NutchTextDumpPlugin::process(): processing record of whitelisted URL $url...\n"
                    if $self->{'verbosity'} > 3;
            }
            $doc_obj->add_utf8_metadata ($cursection, "ex.srcURL", $url);
            $doc_obj->add_utf8_metadata ($cursection, "ex.key", $key);

            # # let's also set the domain from the URL, as that will make a
            # # more informative bookshelf label than siteID
            # my $domain = $url;
            # # remove protocol:// and everything after and including subsequent slash
            # $domain =~ s@^https?://([^/]+).*@$1@;
            # #$domain =~ s@^https?://@@; # remove protocol
            # #$domain =~ s@/.*$@@; # now remove everything after first slash
            # my $protocol = $url;# =~ s@(^https?).*$@@;
            # $protocol =~ s@(^https?).*$@$1@;
            # $domain = $protocol."://".$domain;
            # #$domain =~ s@[\.\-]@@g;
            # #$domain = "pinky";
            # $doc_obj->add_utf8_metadata ($cursection, "ex.srcDomain", $domain);


            # let's also set the domain from the URL, as that will make a
            # more informative bookshelf label than siteID.
            # For the complete domain, keep protocol:// and every non-slash after,
            # without requiring the presence of a subsequent slash
            # https://stackoverflow.com/questions/3652527/match-regex-and-assign-results-in-single-line-of-code
            # Can clean up protocol and www. in the bookshelf's remove_prefix option

            my ($domain, $basicDomain) = $url =~ m@(^https?://(?:www\.)?([^/]+)).*@;

            # For domain, the following removes protocol:// and
            # everything after and including the subsequent slash, without requiring a subsequent slash
            #my ($domain, $protocol, $basicdomain) = $url =~ m@((^https?)://([^/]+)).*@; # Works
            #my ($protocol, $basicdomain) = $url =~ m@(^https?)://([^/]+).*@; # Should work
            #my $domain = $protocol."://".$basicdomain;
            $doc_obj->add_utf8_metadata ($cursection, "ex.srcDomain", $domain);
            $doc_obj->add_utf8_metadata ($cursection, "ex.basicDomain", $basicDomain);

        }
        # check for full text
        elsif ($line =~ m/text:start:/) {
            $text_start_index = $line_index;
            last; # if we've reached the full text portion, we're past the metadata portion of this record
        }
        elsif($line =~ m/^[^:]+:.+$/) { # look for meta #elsif($line =~ m/^[^:]+:[^:]+$/) { # won't allow protocol://url in metavalue
            my @metakeyvalues = split(/:/, $line);
            #my $metaname = $metakeyvalues[0];
            #my $metavalue = $metakeyvalues[1];
            my $metaname = shift(@metakeyvalues);
            # rejoin on ":" so that metavalues which themselves contain ":"
            # (e.g. URLs in protocolStatus) are reconstructed intact
            my $metavalue = join(":", @metakeyvalues);

            # skip "metadata _rs_" and "metadata _csh_" as these contain illegible characters for values
            if($metaname !~ m/metadata\s+_(rs|csh)_/) {

                # trim whitespace
                $metaname = &util::trim($metaname); #=~ s/^\s+|\s+$//g;
                $metavalue = &util::trim($metavalue); #=~ s/^\s+|\s+$//g;

                if($metaname eq "title") { # TODO: what to do about "title: null" cases?
                    ##print STDERR "@@@@ Found title: $metavalue\n";
                    #$metaname = "Title"; # set this as ex.Title metadata
                    # TODO: treat title metadata specially by using character encoding to store correctly?

                    # won't add Title metadata to docObj until after all meta is processed, when we'll know the encoding and can process title meta
                    $title_meta = $metavalue;
                    $metavalue = "";
                }
                elsif($metaname =~ m/CharEncodingForConversion/) { # TODO: or look for "OriginalCharEncoding"?
                    ##print STDERR "@@@@ Found encoding: $metavalue\n";
                    $encoding = $metavalue; # TODO: should we use this to interpret the text and title in the correct encoding and convert to utf-8?

                    if($encoding eq "utf-8") {
                        $encoding = "utf8"; # method to_utf8() recognises "utf8", not "utf-8"
                    } else {
                        print STDERR "@@@@@@ WARNING NutchTextDumpPlugin::process(): Record's Nutch-assigned CharEncodingForConversion was not utf-8: $encoding\n";
                    }

                }

                # move occurrences of "marker " or "metadata " strings at the start of metaname to the end
                #$metaname =~ s/^(marker|metadata)\s+(.*)$/$2$1/;
                # remove "marker " or "metadata " strings from the start of metaname
                $metaname =~ s/^(marker|metadata)\s+//;
                # remove underscores and all remaining spaces in metaname
                $metaname =~ s/[ _]//g;

                # add meta to docObject if both metaname and metavalue are non-empty strings
                if($metaname ne "" && $metavalue ne "") { # && $metaname ne "rs" && $metaname ne "csh") {
                    $doc_obj->add_utf8_metadata ($cursection, "ex.".$metaname, $metavalue);
                    #print STDERR "Added meta |$metaname| = |$metavalue|\n"; #if $metaname =~ m/ProtocolStatus/i;
                }

            }
        } elsif ($line !~ m/^\s*$/) { # Not expecting any other type of non-empty line (or even empty lines)
            print STDERR "NutchTextDump line not recognised as URL meta, other metadata or text content:\n\t$line\n";
        }

        $line_index++;
    }


    # Add FileFormat metadata
    $doc_obj->add_metadata($cursection, "FileFormat", "NutchDumpTxt");

    # Correct the title metadata using the encoding, now that we finally have $encoding
    # $title_meta = $self->to_utf8($encoding, $title_meta) if $encoding;
    # https://stackoverflow.com/questions/12994100/perl-encode-pm-cannot-decode-string-with-wide-character
    # Error message: "Perl Encode.pm cannot decode string with wide character"
    # "That error message is saying that you have passed in a string that has already been decoded
    # (and contains characters above codepoint 255). You can't decode it again."
    if($title_meta && $title_meta ne "" && $title_meta ne "null") {
        $title_meta = $self->to_utf8($encoding, $title_meta) if ($encoding && $encoding ne "utf8");
    } else { # if we have "null" as title metadata, set it to the record URL?
        #my $srcURLs = $doc_obj->get_metadata($cursection, "ex.srcURL");
        #print STDERR "@@@@ null title to be replaced with ".$srcURLs->[0]."\n";
        #$title_meta = $srcURLs->[0] if (scalar @$srcURLs > 0);
        my $srcURL = $doc_obj->get_metadata_element($cursection, "srcURL", 1); # TODO: why does ex.srcURL not work, nor srcURL without the 3rd param?
        if(defined $srcURL) {
            print STDERR "@@@@ null/empty title to be replaced with ".$srcURL."\n"
                if $self->{'verbosity'} > 3;
            $title_meta = $srcURL;
        }
    }
    $doc_obj->add_utf8_metadata ($cursection, "Title", $title_meta);



    my $siteID = $self->get_base_OID($doc_obj);
    #print STDERR "BASE OID: " . $self->get_base_OID($doc_obj) . "\n";
    # remove the 'D' that was inserted by a superclass in front of the all-numeric siteID to become the baseOID:
    $siteID =~ s/^D//;
    $doc_obj->add_utf8_metadata ($cursection, "ex.siteID", $siteID);


    # (2) parse out the text of this record
    # if($text_start_index != -1 && pop(@lines) =~ m/text:end:/) { # we only have text content if there were "text:start:" and "text:end:" markers.
    #     # TODO: are we guaranteed the popped line is text:end: and not empty/newline?
    #     @lines = splice(@lines,0,$text_start_index+1); # just keep every line AFTER text:start:, have already removed (popped) "text:end:"
    #
    #     # glue together the remaining lines, if there are any, into textref
    #     # https://stackoverflow.com/questions/7406807/find-size-of-an-array-in-perl
    #     if(scalar (@lines) > 0) {
    #         # TODO: do anything with $encoding to convert line to utf-8?
    #         foreach my $line (@lines) {
    #             $line = $self->to_utf8($encoding, $line) if $encoding; #if $encoding ne "utf-8";
    #             $$textref .= $line."\n";
    #         }
    #     }
    #     $$textref = "<pre>\n".$$textref."</pre>";
    # } else {
    #     print STDERR "WARNING: NutchTextDumpPlugin::process: had found a text start marker but no text end marker.\n";
    #     $$textref = "<pre></pre>";
    # }

    my $no_text = 1;
    if($text_start_index != -1) { # had found a "text:start:" marker, so we should have text content for this record
        # the /s modifier lets . match newlines too, in case a record's text spans multiple lines
        if($$textref =~ m/text:start:\r?\n(.*?)\r?\ntext:end:/s) {
            $$textref = $1;
            if($$textref !~ m/^\s*$/) {
                $$textref = $self->to_utf8($encoding, $$textref) if ($encoding && $encoding ne "utf8");
                $$textref = "<pre>\n".$$textref."\n</pre>";
                $no_text = 0;
            }
        }
    }
    if($no_text) {
        $$textref = "<pre></pre>";
    }

    # Debugging
    # To avoid "wide character in print" messages when debugging, set binmode of the handle to utf8/encoding
    # https://stackoverflow.com/questions/15210532/use-of-use-utf8-gives-me-wide-character-in-print
    # if ($self->{'verbosity'} > 3) {
    #     if($encoding && $encoding eq "utf8") {
    #         binmode STDERR, ':utf8';
    #     }
    #
    #     print STDERR "TITLE: $title_meta\n";
    #     print STDERR "ENCODING = $encoding\n" if $encoding;
    #     #print STDERR "---------------\nTEXT CONTENT\n---------\n", $$textref, "\n------------------------\n";
    # }


    $doc_obj->add_utf8_text($cursection, $$textref);

    return 1;
}


1;