automaton-urlfilter.txt @ 66

Last change on this file since 66 was 66, checked in by waue, 15 years ago
NutchEz - an easy way to nutch
File size: 1.3 KB

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-(file|ftp|mailto):.*

# skip image and other suffixes we can't yet parse
-.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)

# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*

# accept anything else
+.*
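
The first-match rule described in the comments above can be illustrated with a short Python sketch. It is not the Nutch implementation: it swaps in Python's re module for the automaton library, assumes each pattern must match the whole URL, and the names RULES_TEXT, parse_rules and accepts are hypothetical. The query-character rule is written as -.*[?*!@=].* here, which is an assumption about the intended pattern.

```python
import re

# Rules in the same '+'/'-' prefixed format as the filter file.
# Assumption: the query-character rule reads -.*[?*!@=].*
RULES_TEXT = r"""
# skip file: ftp: and mailto: urls
-(file|ftp|mailto):.*
# skip image and other suffixes we can't yet parse
-.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)
# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*
# accept anything else
+.*
"""


def parse_rules(text):
    """Return a list of (include, compiled_regex) pairs in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Apply the first rule whose pattern matches the whole URL."""
    for include, regex in rules:
        if regex.fullmatch(url):
            return include
    return False  # no pattern matched, so the URL is ignored


rules = parse_rules(RULES_TEXT)
for url in ("http://example.com/index.html",
            "http://example.com/logo.gif",
            "http://example.com/search?q=nutch",
            "ftp://example.com/file.txt"):
    print(url, "included" if accepts(url, rules) else "ignored")
```

Because the first matching rule wins, the order of the lines matters: moving +.* to the top would accept every URL before any of the skip rules were consulted.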