source: nutchez-0.1/conf/regex-urlfilter.txt @ 95

Last change on this file since 95 was 66, checked in by waue, 16 years ago

NutchEz - an easy way to nutch

  • Property svn:executable set to *
File size: 1.7 KB
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The url filter file used by the crawl command.

# Better for intranet crawling.
# If you enable the MY.DOMAIN.NAME rule below, be sure to change
# MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip ftp: and mailto: urls (file: urls are not skipped by this rule)
-^(ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters
# (stricter variant, disabled: also skips '?' and '=' as probable queries)
#-[?*!@=]
-[*!@]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops (disabled)
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME (disabled)
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
#+^http://([a-z0-9]*\.)*.*/

# skip everything else (disabled)
#-.

# accept anything else
+.*
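
The first-match rule described in the file's comments can be hard to picture from the patterns alone. The following is a minimal sketch in Python, not Nutch's own implementation: it parses the active (uncommented) rules from this file and applies them in order. The names `RULES_TEXT`, `parse_rules`, and `accepts` are made up for illustration, the sample URLs use a placeholder host, and the sketch assumes a pattern may match anywhere in the URL (search semantics) and that Python's `re` module treats these particular patterns the same way the Java regex engine would.

```python
import re

# Active rules copied from the file; names and layout here are illustrative only.
RULES_TEXT = r"""
# comment lines and blank lines are ignored
-^(ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[*!@]
+.*
"""


def parse_rules(text):
    """Return a list of (accept, compiled_pattern) pairs in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Apply rules in order; the first pattern that matches decides."""
    for accept, pattern in rules:
        if pattern.search(url):  # assumption: match anywhere in the URL
            return accept
    return False  # no pattern matched, so the URL is ignored


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in (
        "http://example.com/index.html",
        "http://example.com/logo.png",
        "ftp://example.com/pub/file",
        "http://example.com/search?q=nutch",
    ):
        print(url, "->", "included" if accepts(rules, url) else "ignored")
```

Because the final rule is `+.*`, any URL that survives the three `-` rules is included. With that line swapped for the commented-out `-.` rule, the same URLs would all be ignored unless an earlier `+` rule matched them.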