Apache Nutch README

Important note: Due to licensing issues we cannot provide two libraries that
are normally provided with PDFBox (jai_core.jar, jai_codec.jar), the parser
library we use for parsing PDF files. If you encounter unexpected problems when
working with PDF files, please:

 1. Download the two missing libraries from:
    http://pdfbox.cvs.sourceforge.net/viewvc/pdfbox/pdfbox/external/

 2. Put them in the directory src/plugin/parse-pdf/lib.
 3. Follow the instructions in the file src/plugin/parse-pdf/plugin.xml.
 4. Rebuild Nutch.


Interesting files include:

  docs/api/index.html
    Javadocs for the Nutch software.

  CHANGES.txt
    Log of changes to Nutch.


For the latest information about Nutch, please visit our website at:

   http://lucene.apache.org/nutch/

and our wiki, at:

   http://wiki.apache.org/nutch/

To get started using Nutch, read the tutorial:

   http://lucene.apache.org/nutch/tutorial.html

Export Control

This distribution includes cryptographic software. The country in which you
currently reside may have restrictions on the import, possession, use, and/or
re-export to another country, of encryption software. BEFORE using any encryption
software, please check your country's laws, regulations and policies concerning the
import, possession, or use, and re-export of encryption software, to see if this is
permitted. See <http://www.wassenaar.org/> for more information.

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has
classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which
includes information security software using or performing cryptographic functions with
asymmetric algorithms. The form and manner of this Apache Software Foundation
distribution makes it eligible for export under the License Exception ENC Technology
Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations,
Section 740.13) for both object code and source code.

The following provides more details on the included cryptographic software:

Apache Nutch uses the PDFBox API in its parse-pdf plugin for extracting textual content
and metadata from encrypted PDF files. See http://incubator.apache.org/pdfbox/ for more
details on PDFBox.