source: nutchez-0.1/tomcat/webapps/ROOT/WEB-INF/classes/automaton-urlfilter.txt @ 66

Last change on this file since 66 was 66, checked in by waue, 15 years ago

NutchEz - an easy way to nutch

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default URL filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, and mailto: URLs
-(file|ftp|mailto):.*

# skip image and other suffixes we can't yet parse
-.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)

# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*

# accept anything else
+.*
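
The first-match rule described in the file's comments can be illustrated with a rough sketch. This is not Nutch code: it uses Python's `re` module in place of the automaton library the filter relies on, and it assumes each pattern must match the whole URL (which the leading and trailing `.*` in the rules suggest). The names `RULES` and `accepts` are hypothetical, chosen only for this illustration.

```python
import re

# Rules copied from the filter file, in file order.
# Each entry is (sign, pattern): '+' includes the URL, '-' ignores it.
RULES = [
    ("-", r"(file|ftp|mailto):.*"),
    ("-", r".*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe)"),
    ("-", r".*[?*!@=].*"),
    ("+", r".*"),
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is '+', else False.

    Assumption: a pattern must match the entire URL, so re.fullmatch
    is used. A URL that matches no rule is ignored.
    """
    for sign, pattern in rules:
        if re.fullmatch(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for url in [
        "http://example.com/index.html",
        "ftp://example.com/file.txt",
        "http://example.com/logo.gif",
        "http://example.com/search?q=nutch",
    ]:
        print(url, "->", "included" if accepts(url) else "ignored")
```

Because the suffix list spells out only specific casings and extensions, a URL such as `http://example.com/logo.Gif` falls through to the final `+.*` rule under this sketch and is included.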