Changeset 74
Timestamp: Jun 3, 2009, 2:46:36 PM (16 years ago)
Location: nutchez-0.1/conf
Files: 3 edited
nutchez-0.1/conf/crawl-urlfilter.txt
r66 r74
 29  29
 30  30  # skip image and other suffixes we can't yet parse
 31      -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
     31  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|swf)$
 32  32
 33  33  # skip URLs containing certain characters as probable queries, etc.
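The effect of adding `js` and `swf` to the suffix rule can be checked with a short script. This is only a sketch: it assumes a line starting with `-` excludes any URL in which the pattern is found, and it uses Python's `re` module as a stand-in for the Java regex engine Nutch runs on. The sample URLs and the helper names `make_skip_rule` and `is_skipped` are hypothetical.

```python
import re

# Suffix alternations copied from the r66 and r74 versions of the rule.
OLD_SUFFIXES = ("gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm"
                "|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP")
NEW_SUFFIXES = OLD_SUFFIXES + "|js|swf"


def make_skip_rule(suffixes):
    """Compile the regex part of an exclude rule (the leading '-' is not regex)."""
    return re.compile(r"\.(" + suffixes + r")$")


def is_skipped(rule, url):
    """Assumption: a URL is skipped when the pattern is found anywhere in it."""
    return rule.search(url) is not None


old_rule = make_skip_rule(OLD_SUFFIXES)
new_rule = make_skip_rule(NEW_SUFFIXES)

# Hypothetical URLs, for illustration only.
samples = [
    "http://www.nchc.org.tw/logo.png",
    "http://www.nchc.org.tw/menu.js",
    "http://www.hadoop.tw/intro.swf",
    "http://www.hadoop.tw/data.json",
]

for url in samples:
    print(url, "r66 skip:", is_skipped(old_rule, url),
          "r74 skip:", is_skipped(new_rule, url))
```

Because the pattern is anchored with `$`, a suffix such as `.json` is not caught by the new `js` alternative.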
nutchez-0.1/conf/nutch-site.xml
r71 r74
 28  28  <property>
 29  29  <name>plugin.includes</name>
 30      <value>protocol-http|urlfilter-regex|parse-(text|html|js|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|swf|zip)|index-(more|basic|anchor)|query-(more|basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
     30  <value>protocol-http|urlfilter-regex|parse-(text|html|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|zip)|index-(more|basic|anchor)|query-(more|basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 31  31  <description> Regular expression naming plugin directory names</description>
 32  32  </property>
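The `plugin.includes` change drops `js` and `swf` from the `parse-(...)` group. The sketch below compares which plugin names each value selects. It assumes the value is matched against the whole plugin directory name, which the changeset itself does not state, and `is_included` is a hypothetical helper, not a Nutch API.

```python
import re

# plugin.includes values copied from r71 and r74.
OLD_INCLUDES = (
    r"protocol-http|urlfilter-regex"
    r"|parse-(text|html|js|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|swf|zip)"
    r"|index-(more|basic|anchor)|query-(more|basic|site|url)"
    r"|response-(json|xml)|summary-basic|scoring-opic"
    r"|urlnormalizer-(pass|regex|basic)"
)
NEW_INCLUDES = (
    r"protocol-http|urlfilter-regex"
    r"|parse-(text|html|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|zip)"
    r"|index-(more|basic|anchor)|query-(more|basic|site|url)"
    r"|response-(json|xml)|summary-basic|scoring-opic"
    r"|urlnormalizer-(pass|regex|basic)"
)


def is_included(includes, plugin_name):
    """Assumption: the regex must match the entire plugin directory name."""
    return re.fullmatch(includes, plugin_name) is not None


for name in ["parse-html", "parse-js", "parse-swf", "parse-pdf"]:
    print(name, "r71:", is_included(OLD_INCLUDES, name),
          "r74:", is_included(NEW_INCLUDES, name))
```

Read together with the `crawl-urlfilter.txt` change, the two edits look consistent: `.js` and `.swf` URLs are filtered out, and the matching parser plugins are no longer included.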
nutchez-0.1/conf/sav/n.url.txt
r72 r74
  1      www.nchc.org.tw
  2      www.hadoop.tw
      1  http://www.nchc.org.tw
      2  http://www.hadoop.tw
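The r74 seed list spells out the `http://` scheme on each entry. The same normalization could be scripted as below. This is a minimal sketch: `with_scheme` is a hypothetical helper that is not part of Nutch or this changeset, and the default of `http` is taken from the new seed lines.

```python
def with_scheme(seed, default_scheme="http"):
    """Return the seed unchanged if it already has a scheme, else prepend one."""
    seed = seed.strip()
    if "://" in seed:
        return seed
    return f"{default_scheme}://{seed}"


# Seed entries as they appear in r72.
old_seeds = ["www.nchc.org.tw", "www.hadoop.tw"]
new_seeds = [with_scheme(s) for s in old_seeds]

for seed in new_seeds:
    print(seed)
```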