regex-urlfilter.GS_TEMPLATE@ 33596

Last change on this file since 33596 was 33596, checked in by ak19, 5 years ago
Adding in the nutch-site.xml and regex-urlfilter.GS_TEMPLATE template file that need to go into apache-nutch-2.3.1/nutch when setting this up for crawls
File size: 1.7 KB

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
#+.

# batchcrawl.sh will automatically append regex url filters below for each site to be crawled
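
The first-match rule described in the file's comments can be sketched in Python as follows. This is a minimal illustration, not Nutch's implementation: it assumes Python's `re.search` behaves like the regex matching Nutch applies for these particular patterns, the helper names `parse_rules` and `accepts` are invented for the sketch, and the `+` rule for `example.org` is a hypothetical stand-in for whatever batchcrawl.sh actually appends.

```python
import re

# The active rules of the template, in file order.
TEMPLATE = r"""
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
#+.
"""

# Hypothetical per-site rule, standing in for what batchcrawl.sh appends.
SITE_RULE = r"+^https?://([a-z0-9-]+\.)*example\.org/"


def parse_rules(text):
    """Return a list of (include, compiled_pattern) pairs in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Apply the first matching rule; a URL that matches nothing is ignored."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


template_only = parse_rules(TEMPLATE)
with_site = parse_rules(TEMPLATE + "\n" + SITE_RULE + "\n")

for url in (
    "http://example.org/page.html",
    "http://example.org/logo.png",
    "http://example.org/search?q=1",
    "http://example.org/a/b/a/c/a/",
    "mailto:someone@example.org",
):
    print(url, accepts(template_only, url), accepts(with_site, url))
```

Because the catch-all `+.` line is commented out, the template on its own accepts nothing in this sketch; a URL is only included once a site-specific `+` rule is appended and none of the earlier `-` rules match it first.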
