Nutch 1.x (using Nutch 1.11 as an example)
Crawl web pages and store them locally
bin/crawl urls crawl 2
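Once the crawl finishes, it can help to confirm that something was actually fetched before indexing. The following is a minimal sketch, assuming it is run from the Nutch 1.x home directory and that the crawl directory is named crawl, as in the command above. The CRAWL_DIR variable is only an illustrative name.

```shell
# Assumption: run from the Nutch 1.x home directory after the crawl above.
CRAWL_DIR=crawl

if [ -x bin/nutch ] && [ -d "$CRAWL_DIR/crawldb" ]; then
  # Print CrawlDb statistics (URL counts grouped by status).
  bin/nutch readdb "$CRAWL_DIR/crawldb" -stats
  # List the segment directories that the crawl rounds produced.
  ls "$CRAWL_DIR/segments"
else
  echo "Run this from the Nutch home directory after the crawl has finished" >&2
fi
```

If the statistics show no fetched URLs, the indexing step has nothing to send to Solr.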
Build the index
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
Nutch 2.x (using Nutch 2.2.1 as an example)
MySQL
Change the character encoding in my.ini or my.cnf:
[mysqld]
character-set-server=utf8
[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
The mapping between data fields and table columns is configured in gora-sql-mapping.xml.
To give Ivy MySQL support, add the following dependencies to ivy/ivy.xml:
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default"/>
Configure the Nutch data connection settings in gora.properties:
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
# MySQL username
gora.sqlstore.jdbc.user=xxxx
# MySQL password
gora.sqlstore.jdbc.password=xxxx
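Set the storage class and the batch ID in conf/nutch-site.xml:
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data. Currently the following stores are available:</description>
</property>
<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>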
Then set up the sites to crawl.
Run the crawl, which writes the crawled data to the database:
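Setting up the sites to crawl usually means creating a seed list. The following is a minimal sketch, assuming the seed directory is named urls (the name the crawl commands in this post pass to Nutch). The file name seed.txt and the URL are placeholders, not values from the original setup.

```shell
# Create the seed directory that the crawl command reads.
mkdir -p urls

# One URL per line. The URL below is only a placeholder seed.
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

# Show the seed list that will be injected.
cat urls/seed.txt
```

Which discovered links Nutch goes on to follow is also controlled by the URL filter rules in conf/regex-urlfilter.txt, so restricting a crawl to one site typically involves editing that file as well.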
bin/nutch crawl urls -depth 3 -topN 5
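To check that the crawl wrote rows to MySQL, a row count is often enough. The following is a sketch under two assumptions: the database is nutch, as in the JDBC URL above, and the table is named webpage, which should be verified against gora-sql-mapping.xml. The helper count_pages is a hypothetical name introduced here for illustration.

```shell
# Hypothetical helper: count the rows Nutch stored in MySQL.
# $1 = MySQL username (the same one configured in gora.properties).
count_pages() {
  # "nutch" comes from the JDBC URL; "webpage" is an assumed table name,
  # check gora-sql-mapping.xml for the actual one.
  mysql -u "$1" -p nutch -e 'SELECT COUNT(*) FROM webpage;'
}

# Usage (prompts for the MySQL password):
#   count_pages xxxx
```

A count of zero after a crawl usually points to the seed list, the URL filters, or the storage configuration rather than to MySQL itself.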
Solr
