source: gs3-extensions/maori-lang-detection/hdfs-cc-work/conf/regex-urlfilter.GS_TEMPLATE @ 33596

Last change on this file since 33596 was 33596, checked in by ak19, 5 years ago

Adding the nutch-site.xml and regex-urlfilter.GS_TEMPLATE template files, which need to go into apache-nutch-2.3.1/nutch when setting this up for crawls

File size: 1.7 KB
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
#+.

# batchcrawl.sh will automatically append regex url filters below for each site to be crawled
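
The first-match behaviour described in the comments of this file can be sketched in Python as below. This is only an illustration, not Nutch's own code: Python's re module stands in for Java regular expressions, patterns are assumed to be applied unanchored (a "find" anywhere in the URL), and the final '+' rule is a hypothetical example of a per-site filter, since the file does not show what batchcrawl.sh actually appends. The names parse_rules and accepts are likewise made up for this sketch.

```python
import re

# The four '-' rules are copied from the template. The last '+' rule is a
# HYPOTHETICAL per-site filter; the real appended rules are not shown here.
RULES_TEXT = r"""
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+^https?://([a-z0-9-]+\.)*example\.org/
"""


def parse_rules(text):
    """Turn filter lines into (include, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return the verdict of the first matching rule; ignore the URL if none match."""
    for include, regex in rules:
        if regex.search(url):  # assumption: unanchored, find-style matching
            return include
    return False


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in (
        "http://www.example.org/about.html",
        "http://www.example.org/logo.png",
        "http://www.example.org/search?q=maori",
        "http://www.example.org/a/b/a/c/a/",
        "http://other.example.com/page.html",
    ):
        print(accepts(url, RULES), url)
```

Because the catch-all "+." rule is commented out in this template, a URL that matches none of the '-' rules is still ignored unless one of the appended per-site '+' rules matches it, which is what the last URL in the sketch exercises.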