root/other-projects/maori-lang-detection/hdfs-cc-work/conf/regex-urlfilter.GS_TEMPLATE @ 33666

Revision 33666, 2.0 KB (checked in by ak19, 2 months ago)

Having finished sending all the crawl data to mongodb:

1. Recrawled the 2 sites which I had earlier noted required recrawling, 00152 and 00332. 00152 required changes to how it needed to be crawled: MP3 files needed to be blocked, as there were HBase error messages about key values being too large.
2. Modified the regex-urlfilter.GS_TEMPLATE file to block mp3 files in general for future crawls too (at the location in the file where jpg etc. were already blocked by Nutch's default regex url filters).
3. Further had to restrict the 00152 site to only be crawled under its /maori/ sub-domain. Since the seedURL maori.html was not under a /maori/ url, this revealed that the CCWETProcessor code didn't yet allow filters to accept seedURLs in cases where the crawl was restricted to a subdomain (as expressed in the conf/sites-too-big-to-exhaustively-crawl file) but the seedURL didn't match those restricting regex filters. So now, in such cases, CCWETProcessor adds the non-matching seedURLs to the filters too (so we get just the single page of each such seedURL), besides a filter on the requested subdomain, so we still follow all linked pages that match the subdomain expression (a rough sketch of the resulting per-site filters follows below).
4. Added to_crawl.tar.gz to svn: the tarball of the to_crawl sites that I actually ran Nutch over, i.e. all the site folders with the seedURL.txt and regex-urlfilter.txt files that batchcrawl.sh runs over. This didn't use the latest version of the sites folder and blacklist/whitelist files generated by CCWETProcessor, since that version was regenerated after the final modifications to CCWETProcessor, which came after crawling was finished. But to_crawl.tar.gz does have a manually modified 00152, with the correct regex-urlfilter file, and uses the newer regex-urlfilter.GS_TEMPLATE file that blocks mp3 files.
5. crawledNode6.tar.gz now contains the dump output for sites 00152 and 00332, which were crawled on node6 today (after which their processed dump.txt results were added into MongoDB).
6. MoreReading/mongodb.txt now contains the results of some queries I ran against the total nutch-crawled data.
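As an illustration of point 3, the per-site regex-urlfilter.txt that CCWETProcessor/batchcrawl.sh ends up with for such a restricted site might contain appended filters along the following lines. This is only a sketch: the exact expressions, and the assumption that the /maori/ area sits under the same .../fox/ path as the seed page named in the template comments below, are illustrative rather than taken from the actual generated file.

# accept the seedURL page itself, even though it lies outside the restricted area
+^http://loquevendra318\.com/fox/maori\.html$
# accept (and so follow links within) everything under the requested /maori/ area
+^http://loquevendra318\.com/fox/maori/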

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.
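# (Illustrative example, not part of the stock Nutch template: with the rules
# below, a hypothetical http://example.com/logo.png is rejected by the suffix
# rule, while http://example.com/page.html matches none of the rules in this
# template and so is only fetched if a '+' filter appended at the end of this
# file accepts it.)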

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#
# GS: Added mp3 to the file types to skip, as we get no text out of them
# and they end up too large to download. (This resulted in "key value too large"
# error messages, as for site 00152, http://loquevendra318.com/fox/maori.html,
# where the audio files are not in Maori anyway, being reused for all languages
# on the site.)
#
-\.(mp3|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
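# (Illustrative: a hypothetical http://example.com/search?q=kupu contains '?'
# and '=' and is skipped by the rule above.)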

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
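# (Illustrative: a hypothetical http://example.com/a/b/a/c/a/ repeats the
# segment /a three times and is skipped by the rule above.)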

# accept anything else
#+.

# batchcrawl.sh will automatically append regex url filters below for each site to be crawled