source: main/trunk/greenstone2/perllib/plugins/NutchTextDumpPlugin.pm@ 34137

Last change on this file since 34137 was 34137, checked in by ak19, 4 years ago

Have only been able to incorporate one of Dr Bainbridge's improvements so far: when there's no title meta, the first title fallback is no longer basicURL but the web page name without its file extension, e.g. domain.com/path/my-web-page.html will get the title 'my web page'. Only if that works out to be the empty string do we resort to basicURL again for the title.

File size: 34.6 KB
###########################################################################
#
# NutchTextDumpPlugin.pm -- plugin for dump.txt files generated by Nutch
#
# A component of the Greenstone digital library software
# from the New Zealand Digital Library Project at the
# University of Waikato, New Zealand.
#
# Copyright (C) 2002 New Zealand Digital Library Project
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
###########################################################################

# This plugin was originally created to process Nutch dump.txt files produced from recrawling commoncrawl (CC)
# results for pages detected by CC as being in Maori.
# It splits each web site's dump.txt into its individual records: as each record represents a web page,
# this produces one greenstone document per web page.
#
# For a commoncrawl collection of siteID-labelled folders, each containing a dump.txt file:
# - set <importOption name="OIDtype" value="dirname"/>
# - Create 2 List browsing classifiers (with bookshelf_type set to always) on ex.siteID and ex.srcDomain,
#   both sorted by ex.srcURL, and an ex.Title classifier.
#   For the ex.srcDomain classifier, set removeprefix to: https?\:\/\/(www\.)?
#   An alternative is to build that List classifier on ex.basicDomain instead of ex.srcDomain.
#   Set this List classifier's "partition_type_within_level" option to "per_letter".
# - Add search indexes on text (default), Title, basicDomain, siteID, Identifier, srcURL (not working)
#
# Finally, in the "display" format statement, add the following before the "wrappedSectionText" to
# display the most relevant metadata of each record:
# <gsf:template name="documentContent">
#   <div id="nutch-dump-txt-record">
#     <h3>Record:</h3>
#     <br/>
#     <dl>
#       <dt>URL:</dt>
#       <dd>
#         <gsf:metadata name="srcURL"/>
#       </dd>
#       <dt>Title:</dt>
#       <dd>
#         <gsf:metadata name="ex.Title"/>
#       </dd>
#       <dt>Identifier:</dt>
#       <dd>
#         <gsf:metadata name="Identifier"/>
#       </dd>
#       <dt>SiteID:</dt>
#       <dd>
#         <gsf:metadata name="siteID"/>
#       </dd>
#       <dt>Status:</dt>
#       <dd>
#         <gsf:metadata name="status"/>
#       </dd>
#       <dt>ProtocolStatus:</dt>
#       <dd>
#         <gsf:metadata name="protocolStatus"/>
#       </dd>
#       <dt>ParseStatus:</dt>
#       <dd>
#         <gsf:metadata name="parseStatus"/>
#       </dd>
#       <dt>CharEncodingForConversion:</dt>
#       <dd>
#         <gsf:metadata name="CharEncodingForConversion"/>
#       </dd>
#       <dt>OriginalCharEncoding:</dt>
#       <dd>
#         <gsf:metadata name="OriginalCharEncoding"/>
#       </dd>
#     </dl>
#   </div>

# + DONE: remove illegible values for metadata _rs_ and _csh_ in the example below before
# committing, in case their encoding affects the loading/reading in of this perl file.
#
# Example record in dump.txt to process:
#   https://www.whanau-tahi.school.nz/ key: nz.school.whanau-tahi.www:https/
#   OR: http://yutaka.it-n.jp/apa/750010010.html key: jp.it-n.yutaka:http/apa/750010010.html
#   baseUrl: null
#   status: 2 (status_fetched)
#   fetchTime: 1575199241154
#   prevFetchTime: 1572607225779
#   fetchInterval: 2592000
#   retriesSinceFetch: 0
#   modifiedTime: 0
#   prevModifiedTime: 0
#   protocolStatus: SUCCESS, args=[]
#   signature: d84c84ccf0c86aa16a19e03cb1fc5827
#   parseStatus: success/ok (1/0), args=[]
#   title: Te Kura Kaupapa Māori o Te Whānau Tahi
#   score: 1.0
#   marker _injmrk_ : y
#   marker _updmrk_ : 1572607228-9584
#   marker dist : 0
#   reprUrl: null
#   batchId: 1572607228-9584
#   metadata CharEncodingForConversion : utf-8
#   metadata OriginalCharEncoding : utf-8
#   metadata _rs_ :
#   metadata _csh_ :
#   text:start:
#   Te Kura Kaupapa Māori o Te Whānau Tahi He mihi He mihi Te Kaupapa Ngā Tāngata Te Kākano Te Pihinga Te Tipuranga Te Puāwaitanga Te Tari Te Poari Matua Whakapā mai He mihi He mihi Te Kaupapa Ngā Tāngata Te Kākano Te Pihinga Te Tipuranga Te Puāwaitanga Te Tari Te Poari Matua Whakapā mai TE KURA KAUPAPA MĀORI O TE WHĀNAU TAHI He mihi Kei te mōteatea tonu nei ngā mahara ki te huhua kua mene atu ki te pō, te pōuriuri, te pōtangotango, te pō oti atu rā. Kua rite te wāhanga ki a rātou, hoki mai ki te ao tūroa nei Ko Io Matua Kore te pūtaketanga, te pūkaea, te pūtātara ka rangona whānuitia e te ao. Ko tāna ko ngā whetū, te marama, te haeata ki a Tamanui te rā. He atua i whakateretere mai ai ngā waka i tawhiti nui, i tawhiti roa, i tawhiti mai rā anō. Kei nga ihorei, kei ngā wahapū, kei ngā pukumahara, kei ngā kanohi kai mātārae o tō tātou nei kura Aho Matua, Te Kura Kaupapa Māori o Te Whanau Tahi. Anei rā te maioha ki a koutou katoa e pūmau tonu ki ngā wawata me ngā whakakitenga i whakatakotoria e ngā poupou i te wā i a rātou. Ka whakanuia hoki te toru tekau tau o tēnei kura mai i tōna orokohanga timatanga tae noa ki tēnei wā Ka pūmau tōnu mātou ki te whakatauki o te kura e mea ana “Poipoia ō tātou nei pūmanawa” Takiritia tonutia te ra ki runga i Te Kura Kaupapa Maori o Te Whanau Tahi . Back to Top " Poipoia ō tātou nei pūmanawa -  Making our potential a reality "   ©  Te Kura Kaupapa Māori o Te Whānau Tahi, 2019  Cart ( 0 )
#   text:end:
#
#   https://www.whanau-tahi.school.nz/cart key: nz.school.whanau-tahi.www:https/cart
#   baseUrl: null
#   status: 2 (status_fetched)
#   ...
#
# - Some records may have empty text content between the text:start: and text:end: markers,
#   while other records may be missing these markers along with any text.
# - Metadata is of the form "key : value", but some metadata values themselves contain ":". For example,
#   the "protocolStatus" metadata can contain a URL as its value, including a protocol that contains ":".
# - metadata _rs_ and _csh_ contain illegible values, so this code discards them when storing metadata.
#
# If you provide a keep_urls_file when configuring NutchTextDumpPlugin and the path is relative,
# the plugin will look for the file (e.g. a urls.txt) in the collection's etc folder.
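#
# For illustration, a minimal sketch of what a keep_urls_file can contain: one URL per
# line, optionally followed by ",COUNTRYCODE" (e.g. ",NZ" or ",UNKNOWN"), which
# parse_keep_urls_file() below strips off; non-URL lines are ignored. The URLs here
# are made-up examples, except the first, which is the sample record URL above:
#
#   https://www.whanau-tahi.school.nz/
#   https://example.org/some-page.html,NZ
#   http://example.com/another-page.html,UNKNOWN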


package NutchTextDumpPlugin;

use SplitTextFile;

use Encode;
use unicode;
use util;

use strict;
no strict 'refs'; # allow filehandles to be variables and vice versa


# Note on running long imports with nohup: plain
#   nohup command
# seems to work best.
# Not: nohup command > bla.txt 2>&1 &
# nor even: nohup command &
# Output goes to nohup.out (possibly both STDERR and STDOUT; do a quick test first)
# in the folder the command is run from.
# Delete nohup.out before re-running the command.
# nohup only trips up and is unhappy when commands require keyboard input at any stage.
#
#
# TODO:
# Use "od" to print out the byte values of the dump.txt file to check _rs_ and _csh_.
# Also google what Nutch says those fields mean.
#   od -a
#     every byte as an ASCII character
#   od -ab
#     ASCII and byte value:
#     first comes the byte offset and then the ascii characters (sp for space), with the
#     numeric byte values in hex of the individual characters on the line underneath.
#
# + 1. Split each dump.txt file into its individual records as individual docs.
# + 2. Store the meta of each individual record/doc.
# ? 3. Name each doc siteID.docID, else HASH of internal text. See EmailPlugin?
#   + In SplitTextFile::read(), why is $segment (which also counts discarded docs) used to add
#     the record ID, rather than $count (which only counts included docs)? I am referring to the code:
#       $self->add_OID($doc_obj, $id, $segment);
#     Because that way we get persistent URLs, regardless of whitelist urls file content!
#     The way I've solved this is by setting the OIDtype importOption. Not sure if this is what was required.
# + 4. Keep a map of all URLs seen - whitelist URLs.
# + 5. Implement the optional input file of URLs: if an infile is provided, keep only those records
#      whose URLs are in the map. Only these matching records should become docs.
# 6. Rebuild the full collection of all dump.txt files with this collection design.
#
# TIDY UP:
# + Create util::trim()
# + Add to perl's strings.properties: NutchTextDumpPlugin.keep_urls_file
#
# CLEANUP:
# + Remove MetadataRead functions and inheritance
#
# QUESTIONS:
# - encoding = utf-8, changed to "utf8" as required by the copied to_utf8(str) method. Why does it not
#   convert the string parameter, but instead fail in the decode() step? Is it because the string is already in UTF8?
# - Problem converting text in the full set of nutch dump.txt files when the encoding is windows-1252 or Shift-JIS.
# - TODOs
#
# CHECK:
# + title fallback is the URL. Remove the domain/all folder prefixes (unless nothing remains), and
#   convert underscores and hyphens to spaces.
# + util::tidy_up_OID() prints a warning. SiteID is the foldername and OIDtype=dirname, so the fully numeric
#   siteID-to-OID conversion results in a warning message that the siteID is fully numeric and gets 'D' prefixed.
#   Is this warning still necessary?
# - Ask about binmode usage (for debugging) in this file

# To get all the isMRI results, I ran Robo-3T against our mongodb as
# in the instructions at http://trac.greenstone.org/browser/other-projects/maori-lang-detection/MoreReading/mongodb.txt
# Then I launched Robo-3T and connected to the mongodb.
#
# Then, in the "ateacrawldata" database, I ran the following queries
# to get a URL listing of all the Webpages where isMRI = true, as determined
# by apache openNLP:
#
#   db.getCollection('Webpages').find({isMRI:true}).count();
#   7830
#
#   db.getCollection('Webpages').find({isMRI:true},{URL: 1, _id: 0});
#
# Then I set Robo-3T's output display to show 8000 results on a page, and copied the results into the file below.
#
# I cleaned out all the JSON from the results using regex in Notepad++.
# This then becomes our urls.txt file, which I put into the cc nutch crawl
# GS3 collection's etc folder under the name isMRI_urls.txt,
# to consider processing only webpages apache Open-NLP detected as isMRI
# into our collection.
# Remember to configure the NutchTextDumpPlugin with option "keep_urls_file" = isMRI_urls.txt to make use of this.
#
# + ex meta -> don't add with ex. prefix
# + check for and call to setup_keep_urls(): moved into process() rather than doing this in a more convoluted way in can_process_this_file()
# + util::tidy_up_oid() -> print callstack to find why it's called on every segment
# X- binmode STDERR: work out what the default mode on STDERR is and reset to that after printing debug messages in utf8 binmode
# - test collection to check various encodings with and without the to_utf8() function - tested collection 00436 in collection cctest3.
#   The srcURL .../divrey/shaar.htm (Identifier: D00436s184) is in Hebrew and described as being in char encoding iso-8859-8.
#   But when I paste the build output (when using NutchTextDumpPlugin.pm_debug_iso-8859-8)
#   into emacs, the text for this record reads and scrolls R to L in emacs.
#   When previewing the text in the full text section in GS3, it reads L to R.
#   The digits used in the text seem to match, occurring in reverse order from each other between emacs and GS3 preview.
#   Building displays error messages if to_utf8() is called to decode this record's title meta or full text
#   using the discovered encoding.

sub BEGIN {
    @NutchTextDumpPlugin::ISA = ('SplitTextFile');
    unshift (@INC, "$ENV{'GSDLHOME'}/perllib/cpan");
}

my $arguments =
    [ { 'name' => "keep_urls_file",
        'desc' => "{NutchTextDumpPlugin.keep_urls_file}",
        'type' => "string",
        #'deft' => "urls.txt",
        'reqd' => "no" },
      { 'name' => "process_exp",
        'desc' => "{BaseImporter.process_exp}",
        'type' => "regexp",
        'reqd' => "no",
        'deft' => &get_default_process_exp() },
      { 'name' => "split_exp",
        'desc' => "{SplitTextFile.split_exp}",
        'type' => "regexp",
        'reqd' => "no",
        'deft' => &get_default_split_exp() }
      ];

my $options = { 'name' => "NutchTextDumpPlugin",
                'desc' => "{NutchTextDumpPlugin.desc}",
                'abstract' => "no",
                'inherits' => "yes",
                'explodes' => "yes",
                'args' => $arguments };

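# For illustration, a hedged sketch of enabling this plugin with its keep_urls_file
# option in a GS3 collection's collectionConfig.xml (isMRI_urls.txt is the example
# file name from the notes above, assumed to sit in the collection's etc folder):
#
#   <plugin name="NutchTextDumpPlugin">
#     <option name="-keep_urls_file" value="isMRI_urls.txt"/>
#   </plugin>
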
sub new {
    my ($class) = shift (@_);
    my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
    push(@$pluginlist, $class);

    push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});
    push(@{$hashArgOptLists->{"OptList"}},$options);

    my $self = new SplitTextFile($pluginlist, $inputargs, $hashArgOptLists);

    if ($self->{'info_only'}) {
        # don't worry about the options
        return bless $self, $class;
    }

    $self->{'keep_urls_processed'} = 0;
    $self->{'keep_urls'} = undef;

    #return bless $self, $class;
    $self = bless $self, $class;
    # Can only call $self->method() AFTER the bless operation above, so do any such setup from this point onward.
    return $self;
}


sub setup_keep_urls {
    my $self = shift (@_);

    my $verbosity = $self->{'verbosity'};
    my $outhandle = $self->{'outhandle'};
    my $failhandle = $self->{'failhandle'};

    $self->{'keep_urls_processed'} = 1; # flag to track whether this method has been called already during import

    #print $outhandle "@@@@ In NutchTextDumpPlugin::setup_keep_urls() - this method should only be called once and only during import.pl\n";

    if(!$self->{'keep_urls_file'}) {
        my $msg = "NutchTextDumpPlugin INFO: No urls file provided.\n" .
            "    No records will be filtered.\n";
        print $outhandle $msg if ($verbosity > 2);

        return;
    }

    # read in the keep urls file
    my $keep_urls_file = &util::locate_config_file($self->{'keep_urls_file'});
    if (!defined $keep_urls_file)
    {
        # print the configured file name here, since $keep_urls_file is undef at this point
        my $msg = "NutchTextDumpPlugin INFO: Can't locate urls file " . $self->{'keep_urls_file'} . ".\n" .
            "    No records will be filtered.\n";

        print $outhandle $msg;

        $self->{'keep_urls'} = undef;
        # TODO: Not a fatal error if the keep_urls_file can't be found: it just means all records
        # in dump.txt will be processed?
    }
    else {
        #$self->{'keep_urls'} = $self->parse_keep_urls_file($keep_urls_file, $outhandle);
        #$self->{'keep_urls'} = {};
        $self->parse_keep_urls_file($keep_urls_file, $outhandle, $failhandle);
    }

    #if(defined $self->{'keep_urls'}) {
    #    print STDERR "@@@@ keep_urls hash map contains:\n";
    #    map { print STDERR $_."=>".$self->{'keep_urls'}->{$_}."\n"; } keys %{$self->{'keep_urls'}};
    #}

}



sub parse_keep_urls_file {
    my $self = shift (@_);
    my ($urls_file, $outhandle, $failhandle) = @_;

    # https://www.caveofprogramming.com/perl-tutorial/perl-hashes-a-guide-to-associative-arrays-in-perl.html
    # https://stackoverflow.com/questions/1817394/whats-the-difference-between-a-hash-and-hash-reference-in-perl
    $self->{'keep_urls'} = {}; # hash reference init to {}

    # What if it is a very long file of URLs? Need to read a line at a time!
    #my $contents = &FileUtils::readUTF8File($urls_file); # could just call $self->read_file() inherited from SplitTextFile's parent ReadTextFile
    #my @lines = split(/(?:\r?\n)+/, $$textref);

    # Open the file in UTF-8 mode https://stackoverflow.com/questions/2220717/perl-read-file-with-encoding-method
    # and read it in line by line into the map
    my $fh;
    if (open($fh,'<:encoding(UTF-8)', $urls_file)) {
        while (defined (my $line = <$fh>)) {
            $line = &util::trim($line); #$line =~ s/^\s+|\s+$//g; # trim whitespace

            if($line =~ m@^https?://@) { # add only URLs
                # remove any ",COUNTRYCODE" at the end
                # (the country code can be NZ but also UNKNOWN, so it's not always 2 chars)
                $line =~ s/,[A-Z]+$//;
                #print STDERR "LINE: |$line|\n";
                $self->{'keep_urls'}->{$line} = 1; # add the url to our perl hash
            }
        }
        close $fh;
    } else {
        my $msg = "NutchTextDumpPlugin ERROR: Unable to open file keep_urls_file: \"" .
            $self->{'keep_urls_file'} . "\".\n " .
            "    No records will be filtered.\n";
        print $outhandle $msg;
        print $failhandle $msg;
        # Not fatal. TODO: should it be fatal when it can still process all URLs just because
        # it can't find the specified keep-urls.txt file?
    }

    # If the keep_urls hash is empty, ensure it is undefined from this point onward.
    # Use if(!keys %hash) to SECURELY test for an empty hash:
    # https://stackoverflow.com/questions/9444915/how-to-check-if-a-hash-is-empty-in-perl
    #
    # But one may not do: keys $hashref, only: keys %hash.
    # The way to dereference a hashref and get the keys of the hashmap it refers to
    # is at https://www.thegeekstuff.com/2010/06/perl-hash-reference/
    #   keys %{ $hash_ref };
    my $hashmap_ref = $self->{'keep_urls'};
    my %urls_map = %$hashmap_ref;
    if(!keys %urls_map) {
        $self->{'keep_urls'} = undef;
    }

}

# Accept "dump.txt" files (which are in numeric siteID folders),
# and txt files with a numeric siteID, e.g. "01441.txt",
# if I preprocessed dump.txt files by renaming them this way.
sub get_default_process_exp {
    my $self = shift (@_);

    return q^(?i)((dump|\d+)\.txt)$^;
}
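
# For illustration: with the default process_exp above, filenames like "dump.txt",
# "01441.txt" or "DUMP.TXT" are accepted (the (?i) makes the match case-insensitive,
# and the pattern only needs to match at the end of the filename), while e.g.
# "notes.txt" is rejected, since its ".txt" is preceded by neither "dump" nor digits.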


sub get_default_split_exp {

    # the previous line is either a newline or the start of dump.txt
    # the current line should start with the url protocol and contain " key: .... http(s)/"
    # \r\n for msdos eol, \n for unix

    # The regex return value of this method is passed into a call to perl split.
    # Perl's split(), by default, throws away the delimiter.
    # Any capturing group that makes up or is part of the delimiter becomes a separate element returned by split.
    # We want to throw away the empty newlines preceding the first line of a record "https? .... key: https?/",
    # but we want to keep that first line as part of the upcoming record.
    # - To keep the first line of a record, though it becomes its own split-element, use capture groups in the split regex:
    #   https://stackoverflow.com/questions/14907772/split-but-keep-delimiter
    # - To skip the unwanted empty lines preceding the first line of a record, use ?: in front of its capture group
    #   to discard that group:
    #   https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions
    # - Next use a positive look-ahead ((?= in front of a capture group, vs ?! for negative look-ahead)
    #   to match but not capture the first line of a record (so the look-ahead match is retained as the
    #   first line of the next record):
    #   https://stackoverflow.com/questions/14907772/split-but-keep-delimiter
    #   and http://www.regular-expressions.info/lookaround.html
    # - For a non-greedy match, use .*?
    #   https://stackoverflow.com/questions/11898998/how-can-i-write-a-regex-which-matches-non-greedy
    return q^(?:$|\r?\n\r?\n)(?=https?://.+?\skey:\s+.*?https?/)^;

}
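
# For illustration, a minimal sketch (not code this plugin itself runs) of what happens
# when SplitTextFile hands the split_exp above to Perl's split(); $two_records is an
# assumed scalar holding two back-to-back dump.txt records:
#
#   my $split_exp = &get_default_split_exp();
#   my @segments = split(/$split_exp/, $two_records);
#   # Each non-empty element of @segments now begins with its own
#   # "https?://... key: ..." line: that line is matched inside the (?=...)
#   # look-ahead, so it is not consumed as part of the delimiter, while the
#   # blank separator lines between records are thrown away.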

# TODO: Copied method from MARCPlugin.pm and uncommented the return statement for when encoding = utf8.
# Move to a utility perl file, since the code is mostly shared?
# The bulk of this function is based on read_line in multiread.pm.
# Unable to use the original read_line because it expects to get its input
# from a file. Here the line to be converted is passed in as a string.

# TODO:
# Is this function even applicable to NutchTextDumpPlugin?
# I get errors in this method in the decode step when the encoding is utf-8.
# I get warnings/errors somewhere in this file (maybe also at decode) when the encoding is windows-1252.

sub to_utf8
{
    my $self = shift (@_);
    my ($encoding, $line) = @_;

    if ($encoding eq "utf8") {
        # nothing needs to be done
        return $line;
    } elsif ($encoding eq "iso_8859_1" || $encoding eq "windows-1252") { # TODO: do this also for windows-1252?
        # we'll use ascii2utf8() for this as it's faster than going
        # through convert2unicode()
        #return &unicode::ascii2utf8 (\$line);
        $line = &unicode::ascii2utf8 (\$line);
    } else {

        # everything else uses unicode::convert2unicode
        $line = &unicode::unicode2utf8 (&unicode::convert2unicode ($encoding, \$line));
    }
    # At this point $line is a binary byte string
    # => turn it into a Unicode aware string, so full
    # Unicode aware pattern matching can be used.
    # For instance: 's/\x{0101}//g' or '[[:upper:]]'

    return decode ("utf8", $line);
}
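
# For illustration, a hedged sketch of how to_utf8() above is meant to be used
# (the variables here are assumed, not real plugin state):
#
#   my $raw_title = $some_bytes;   # bytes in the encoding Nutch reported for the record
#   my $unicode_title = $self->to_utf8("windows-1252", $raw_title);
#   # $unicode_title is now a Unicode-aware Perl string, safe for full Unicode
#   # pattern matching such as $unicode_title =~ m/[[:upper:]]/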



# do plugin specific processing of doc_obj
# This gets done for each record found by SplitTextFile in the dump.txt files.
sub process {
    my $self = shift (@_);
    my ($textref, $pluginfo, $base_dir, $file, $metadata, $doc_obj, $gli) = @_;

    # Only load the urls from the keep_urls_file into a hash if we've not done so before.
    # Although this method is called on each dump.txt file found, we want to setup_keep_urls()
    # only once per collection and only during import, not buildcol. It's best to do the check and the
    # setup_keep_urls() call here, because this subroutine, process(), is only called during import() and not during buildcol.
    # During buildcol, can_process_this_file() is not called on dump.txt files but on folders (the archives folder).
    # Only if this plugin's can_process_this_file() is called on a dump.txt will this process() be called
    # on each segment of the dump.txt file.
    # So this is the best spot to ensure we've setup_keep_urls(), if we haven't already:

    if(!$self->{'keep_urls_processed'}) {
        $self->setup_keep_urls();
    }


    my $outhandle = $self->{'outhandle'};
    my $filename = &util::filename_cat($base_dir, $file);


    my $cursection = $doc_obj->get_top_section();

    # https://perldoc.perl.org/functions/binmode.html
    # "To mark FILEHANDLE as UTF-8, use :utf8 or :encoding(UTF-8). :utf8 just marks the data as UTF-8 without further checking,
    # while :encoding(UTF-8) checks the data for actually being valid UTF-8. More details can be found in PerlIO::encoding."
    # https://stackoverflow.com/questions/27801561/turn-off-binmodestdout-utf8-locally
    # Is there anything useful here:
    # https://perldoc.perl.org/PerlIO/encoding.html and https://stackoverflow.com/questions/21452621/binmode-encoding-handling-malformed-data
    # https://stackoverflow.com/questions/1348639/how-can-i-reinitialize-perls-stdin-stdout-stderr
    # https://metacpan.org/pod/open::layers
    # if() { # Google: "what is perl choosing to make the default char encoding for the file handle". Does it take a hint from somewhere, like env vars? Look for env vars.
    #     # Is there a perl env var to use, to check char enc? If set to utf-8, do this:
    #     binmode(STDERR, ':utf8'); ## FOR DEBUGGING! To avoid "wide character in print" messages, but modifies globally for the process!
    # }
    # Then move this if-block to the BEGIN blocks of all perl process files.

    #print STDERR "---------------\nDUMP.TXT\n---------\n", $$textref, "\n------------------------\n";


    # (1) parse out the metadata of this record
    my $metaname;
    my $encoding;
    my $title_meta;

    my $line_index = 0;
    my $text_start_index = -1;
    my @lines = split(/(?:\r?\n)+/, $$textref);

    foreach my $line (@lines) {
        #$line =~ s@\{@\\{@g; # escape open curly braces for newer perl

        # the first line is special and contains the URL (no metaname)
        # and the inverted URL labelled with metaname "key"
        if($line =~ m/^https?/ && $line =~ m/\s+key:\s+/) {
            my @vals = split(/key:/, $line);
            # get url and key, and trim whitespace simultaneously
            my $url = &util::trim($vals[0]);
            my $key = &util::trim($vals[1]);

            # if we have a keep_urls hash, then only process records of whitelisted urls
            if(defined $self->{'keep_urls'} && !$self->{'keep_urls'}->{$url}) {
                # URL not whitelisted, so stop processing this record
                print STDERR "@@@@@@ INFO NutchTextDumpPlugin::process(): discarding record for URL not whitelisted: $url\n"
                    if $self->{'verbosity'} > 3;
                return 0;
            } else {
                print STDERR "@@@@@@ INFO NutchTextDumpPlugin::process(): processing record of whitelisted URL $url...\n"
                    if $self->{'verbosity'} > 3;
            }
            $doc_obj->add_utf8_metadata ($cursection, "srcURL", $url);
            $doc_obj->add_utf8_metadata ($cursection, "key", $key);


            # let's also set the domain from the URL, as that will make a
            # more informative bookshelf label than siteID.
            # For the complete domain, keep protocol:// and every non-slash character after it.
            # (This avoids requiring the presence of a subsequent slash.)
            # https://stackoverflow.com/questions/3652527/match-regex-and-assign-results-in-single-line-of-code
            # Can clean up protocol and www. in the List classifier's bookshelf's remove_prefix option,
            # or can build the classifier on basicDomain instead.

            my ($domain, $basicDomain) = $url =~ m@(^https?://(?:www\.)?([^/]+)).*@;
            #my ($domain, $protocol, $basicdomain) = $url =~ m@((^https?)://([^/]+)).*@; # Works
            $doc_obj->add_utf8_metadata ($cursection, "srcDomain", $domain);
            $doc_obj->add_utf8_metadata ($cursection, "basicDomain", $basicDomain);

        }
        # check for full text
        elsif ($line =~ m/text:start:/) {
            $text_start_index = $line_index;
            last; # if we've reached the full text portion, we're past the metadata portion of this record
        }
        elsif($line =~ m/^[^:]+:.+$/) { # look for meta (earlier attempt m/^[^:]+:[^:]+$/ wouldn't allow protocol://url in the metavalue)
            # split on the first ":" only (limit 2), since metavalues may themselves contain ":",
            # e.g. a "protocolStatus" value can contain a URL including its "://"
            my @metakeyvalues = split(/:/, $line, 2);

            my $metaname = shift(@metakeyvalues);
            my $metavalue = join("", @metakeyvalues);

            # skip "metadata _rs_" and "metadata _csh_" as these contain illegible characters for values
            if($metaname !~ m/metadata\s+_(rs|csh)_/) {

                # trim whitespace
                $metaname = &util::trim($metaname);
                $metavalue = &util::trim($metavalue);

                if($metaname eq "title") { # TODO: what to do about "title: null" cases?
                    ##print STDERR "@@@@ Found title: $metavalue\n";
                    #$metaname = "Title"; # will set "title" as "Title" metadata instead
                    # TODO: treat title metadata specially by using the character encoding to store it correctly?

                    # Won't add Title metadata to docObj until after all meta is processed,
                    # when we'll know the encoding and can process the title meta
                    $title_meta = $metavalue;
                    $metavalue = ""; # will force ex.Title metadata to be added AFTER the for loop
                }
                elsif($metaname =~ m/CharEncodingForConversion/) { # TODO: or look for "OriginalCharEncoding"?
                    ##print STDERR "@@@@ Found encoding: $metavalue\n";
                    $encoding = $metavalue; # TODO: should we use this to interpret the text and title in the correct encoding and convert to utf-8?

                    if($encoding eq "utf-8") {
                        $encoding = "utf8"; # method to_utf8() recognises "utf8" not "utf-8"
                    } else {
                        my $srcURL = $doc_obj->get_metadata_element($cursection, "srcURL");
                        print STDERR "@@@@@@ WARNING NutchTextDumpPlugin::process(): Record's Nutch-assigned CharEncodingForConversion was not utf-8 but $encoding\n\tfor record: $srcURL\n";
                    }

                }

                # move occurrences of "marker " or "metadata " strings at the start of metaname to the end
                #$metaname =~ s/^(marker|metadata)\s+(.*)$/$2$1/;
                # remove "marker " or "metadata " strings from the start of metaname
                $metaname =~ s/^(marker|metadata)\s+//;
                # remove underscores and all remaining spaces in metaname
                $metaname =~ s/[ _]//g;

                # add meta to docObject if both metaname and metavalue are non-empty strings
                if($metaname ne "" && $metavalue ne "") {
                    # when no namespace is provided, as here, it is added as ex. meta.
                    # Don't explicitly prefix ex., as things become convoluted when retrieving meta
                    $doc_obj->add_utf8_metadata ($cursection, $metaname, $metavalue);
                    #print STDERR "Added meta |$metaname| = |$metavalue|\n"; #if $metaname =~ m/ProtocolStatus/i;
                }

            }
        } elsif ($line !~ m/^\s*$/) { # Not expecting any other type of non-empty line (or even empty lines)
            print STDERR "NutchTextDump line not recognised as URL meta, other metadata or text content:\n\t$line\n";
        }

        $line_index++;
    }


    # Add fileFormat as metadata
    $doc_obj->add_metadata($cursection, "FileFormat", "NutchDumpTxt");

    # Correct title metadata using encoding, if we have $encoding at last
    # https://stackoverflow.com/questions/12994100/perl-encode-pm-cannot-decode-string-with-wide-character
    # Error message: "Perl Encode.pm cannot decode string with wide character"
    # "That error message is saying that you have passed in a string that has already been decoded
    # (and contains characters above codepoint 255). You can't decode it again."
    if($title_meta && $title_meta ne "" && $title_meta ne "null") {
        #$title_meta = $self->to_utf8($encoding, $title_meta) if ($encoding);
    } else { # if we have "null" as title metadata, set it to the record URL?
        my $srcURL = $doc_obj->get_metadata_element($cursection, "srcURL");
        if(defined $srcURL) {
            # Use the web page name without its file ext for the doc title, if the web page name is present,
            # else use basicURL for the title instead of srcURL,
            # else many docs get classified under the "Htt" bucket for https

            my ($basicURL) = $srcURL =~ m@^https?://(?:www\.)?(.*)$@;
            my ($pageName) = $basicURL =~ m@([^/]+)$@;
            if (!$pageName) {
                $pageName = $basicURL;
            } else {
                # remove any file extension (anchored at the end of the page name)
                $pageName =~ s@\.[^\.]+$@@;
                # replace _ and - with spaces
                $pageName =~ s@[_\-]@ @g;
            }

            print STDERR "@@@@ null/empty title for $basicURL to be replaced with: $pageName\n"
                if $self->{'verbosity'} > 3;
            $title_meta = $pageName;
        }
    }
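
    # For illustration, the title fallback above performs the transformation described
    # in the change note at the top of this file:
    #   srcURL   https://www.domain.com/path/my-web-page.html
    #   basicURL domain.com/path/my-web-page.html
    #   pageName "my web page"   (extension stripped; _ and - become spaces)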

    $doc_obj->add_utf8_metadata ($cursection, "Title", $title_meta);


    # When importOption OIDtype = dirname, the base_OID will be that dirname,
    # which was crafted to be the siteID. However, because our siteID is all numeric,
    # a D gets prepended to create baseOID. Remove the starting 'D' to get the actual siteID.
    my $siteID = $self->get_siteID($doc_obj, $file);
    #print STDERR "BASE OID: " . $siteID . "\n";
    $siteID =~ s/^D//;
    $doc_obj->add_utf8_metadata ($cursection, "siteID", $siteID);


    # (2) parse out the text of this record -- earlier line-based attempt, kept commented out for reference:
    # if($text_start_index != -1 && pop(@lines) =~ m/text:end:/) { # we only have text content if there were "text:start:" and "text:end:" markers.
    #     # TODO: are we guaranteed the popped line is text:end: and not empty/newline?
    #     @lines = splice(@lines,0,$text_start_index+1); # just keep every line AFTER text:start:, have already removed (popped) "text:end:"
    #
    #     # glue together the remaining lines, if there are any, into textref
    #     # https://stackoverflow.com/questions/7406807/find-size-of-an-array-in-perl
    #     if(scalar (@lines) > 0) {
    #         # TODO: do anything with $encoding to convert line to utf-8?
    #         foreach my $line (@lines) {
    #             $line = $self->to_utf8($encoding, $line) if $encoding; #if $encoding ne "utf-8";
    #             $$textref .= $line."\n";
    #         }
    #     }
    #     $$textref = "<pre>\n".$$textref."</pre>";
    # } else {
    #     print STDERR "WARNING: NutchTextDumpPlugin::process: had found a text start marker but not a text end marker.\n";
    #     $$textref = "<pre></pre>";
    # }

    # (2) parse out the text of this record
    my $no_text = 1;
    if($text_start_index != -1) { # had found a "text:start:" marker, so we should have text content for this record

        # the /s modifier lets . match newlines, in case the text spans multiple lines
        if($$textref =~ m/text:start:\r?\n(.*?)\r?\ntext:end:/s) {
            $$textref = $1;
            if($$textref !~ m/^\s*$/) {
                #$$textref = $self->to_utf8($encoding, $$textref) if ($encoding);
                $$textref = "<pre>\n".$$textref."\n</pre>";
                $no_text = 0;
            }
        }
    }
    if($no_text) {
        $$textref = "<pre></pre>";
    }

    # Debugging
    # To avoid "wide character in print" messages when debugging, set binmode of the handle to utf8/encoding
    # https://stackoverflow.com/questions/15210532/use-of-use-utf8-gives-me-wide-character-in-print
    # if ($self->{'verbosity'} > 3) {
    #     if($encoding && $encoding eq "utf8") {
    #         binmode STDERR, ':utf8';
    #     }

    #     print STDERR "TITLE: $title_meta\n";
    #     print STDERR "ENCODING = $encoding\n" if $encoding;
    #     #print STDERR "---------------\nTEXT CONTENT\n---------\n", $$textref, "\n------------------------\n";
    # }


    $doc_obj->add_utf8_text($cursection, $$textref);

    return 1;
}

# Returns siteID when the file in import is of the form siteID.txt.
# Returns siteID when import contains siteID/dump.txt (as happens when OIDtype=dirname).
# Returns whatever the baseOID is in other situations; not sure if that's meaningful, but anything
# other than siteID/dump.txt and siteID.txt shouldn't have passed the can_process_this_file() test anyway.
sub get_siteID {
    my $self = shift(@_);
    my ($doc_obj, $file) = @_;

    my $siteID;
    if ($file =~ /(\d+)\.txt$/) {
        # file name without extension is the site ID, e.g. 00001.txt
        $siteID = $1;
    }
    else { # if($doc_obj->{'OIDtype'} eq "dirname") or even otherwise, just use baseOID
        # baseOID is the same as the site ID when OIDtype is configured to dirname, because docs are stored as 00001/dump.txt
        # siteID has no real meaning in other cases
        $siteID = $self->{'dirname_siteID'} || $self->get_base_OID($doc_obj);

    }
    if(!$self->{'siteID'} || $siteID ne $self->{'siteID'}) {
        $self->{'siteID'} = $siteID;
    }
    return $self->{'siteID'};
}
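
# For illustration (values assumed, following the OIDtype=dirname setup described at
# the top of this file): a record in import/00001/dump.txt gets baseOID "D00001" (the
# all-numeric dirname 00001 with 'D' prepended by util::tidy_up_OID()), so get_siteID()
# returns "D00001" and process() strips the leading 'D' to store siteID meta "00001".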


# SplitTextFile::get_base_OID() has the side-effect of calling SUPER::add_OID()
# in order to initialise segment IDs.
# This then ultimately results in calling util::tidy_up_OID(), which prints warning messages
# about siteIDs forming all-numeric baseOIDs that require the D prefix prepended.
# In cases where the site ID is the same as the baseOID and is needed to set siteID meta, we want to avoid
# the warning messages, but don't want to prevent the important side-effects of SplitTextFile::get_base_OID().
# So instead of overriding this method to calculate and store the baseOID the first time and return
# the stored value on subsequent calls (which would lose the desirable side-effect that comes from
# ALWAYS calling super's get_base_OID(), even when there's a stored value), we just always store
# the return value before returning it. Then we push the check that first tests for a stored value
# to use, else forces it to be computed by calling this get_base_OID(), onto a separate function that
# calls this one: get_siteID(). Problem solved.
sub get_base_OID {
    my $self = shift(@_);
    my ($doc_obj) = @_;

    #if(!defined $self->{'dirname_siteID'}) { # DON'T DO THIS: loses the essential side-effect of always calling super's get_base_OID()
    # this method is overridden, so it's not just called by this NutchTextDumpPlugin

    $self->{'dirname_siteID'} = $self->SUPER::get_base_OID($doc_obj); # store for NutchTextDumpPlugin's internal use
    #}
    return $self->{'dirname_siteID'}; # return superclass return value as always
}
1;