Changeset 74


Timestamp: Jun 3, 2009, 2:46:36 PM
Author: waue
Message: let it work
Location: nutchez-0.1/conf
Files: 3 edited

  • nutchez-0.1/conf/crawl-urlfilter.txt

    r66  r74
    29   29
    30   30   # skip image and other suffixes we can't yet parse
    31        -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
         31   -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|swf)$
    32   32
    33   33   # skip URLs containing certain characters as probable queries, etc.
  • nutchez-0.1/conf/nutch-site.xml

    r71  r74
    28   28   <property>
    29   29     <name>plugin.includes</name>
    30          <value>protocol-http|urlfilter-regex|parse-(text|html|js|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|swf|zip)|index-(more|basic|anchor)|query-(more|basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
         30     <value>protocol-http|urlfilter-regex|parse-(text|html|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|zip)|index-(more|basic|anchor)|query-(more|basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    31   31     <description> Regular expression naming plugin directory names</description>
    32   32    </property>
  • nutchez-0.1/conf/sav/n.url.txt

    r72  r74
    1         www.nchc.org.tw
    2         www.hadoop.tw
         1    http://www.nchc.org.tw
         2    http://www.hadoop.tw
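
The effect of two of the edits above can be checked with a short sketch. It assumes that an exclusion rule in crawl-urlfilter.txt applies when its pattern is found anywhere in the URL, and that seed entries need an explicit scheme so that the protocol-http plugin can be selected; both are assumptions about the crawler's behavior, not something this changeset states. The sample URLs and the helper names (`is_excluded`, `newly_excluded`, `has_scheme`) are hypothetical.

```python
import re
from urllib.parse import urlparse

# Suffix rules copied from crawl-urlfilter.txt at r66 and r74, without the
# leading "-" that marks each rule as an exclusion.
OLD_RULE = r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"
NEW_RULE = r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|swf)$"

# Hypothetical URLs, for illustration only.
SAMPLE_URLS = [
    "http://www.nchc.org.tw/index.html",
    "http://www.nchc.org.tw/js/menu.js",
    "http://www.nchc.org.tw/banner.swf",
    "http://www.nchc.org.tw/logo.png",
]


def is_excluded(url, rule):
    """True if the exclusion rule is found in the URL (assumed find-style matching)."""
    return re.search(rule, url) is not None


def newly_excluded(urls):
    """URLs that the r66 rule let through but the r74 rule skips."""
    return [u for u in urls if is_excluded(u, NEW_RULE) and not is_excluded(u, OLD_RULE)]


def has_scheme(seed):
    """True if a seed entry carries an explicit URL scheme such as http."""
    return urlparse(seed).scheme != ""


if __name__ == "__main__":
    print(newly_excluded(SAMPLE_URLS))
    for seed in ("www.nchc.org.tw", "http://www.nchc.org.tw"):
        print(seed, has_scheme(seed))
```

Under these assumptions, only the `.js` and `.swf` URLs change status between r66 and r74, which is consistent with parse-js and parse-swf being dropped from plugin.includes in the same changeset.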