Wednesday, May 4, 2016

Building Apache Nutch Job & Running the WebCrawler in Hadoop

This post explains how to build the Nutch job from the Apache Nutch source code and run it in Hadoop.


1) Set up Apache Nutch

Download the Apache Nutch source code from http://nutch.apache.org/downloads.html to a Linux machine.

[root@rvm ~]# cd /opt

[root@rvm opt]# mkdir nutch_build


[root@rvm opt]# cd nutch_build/


[root@rvm nutch_build]# wget apache.mirror.digitalpacific.com.au/nutch/1.11/apache-nutch-1.11-src.tar.gz
--2016-05-03 17:51:02--  http://apache.mirror.digitalpacific.com.au/nutch/1.11/apache-nutch-1.11-src.tar.gz
Resolving apache.mirror.digitalpacific.com.au... 101.0.120.90, 2401:fc00:0:20e::a0
Connecting to apache.mirror.digitalpacific.com.au|101.0.120.90|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3807144 (3.6M) [application/x-gzip]
Saving to: ‘apache-nutch-1.11-src.tar.gz’

100%[========================================>] 3,807,144   1.34M/s   in 2.7s

2016-05-03 17:51:07 (1.34 MB/s) - ‘apache-nutch-1.11-src.tar.gz’

[root@rvm nutch_build]# ls
apache-nutch-1.11-src.tar.gz



[root@rvm nutch_build]# tar -xvzf apache-nutch-1.11-src.tar.gz -C /opt/nutch_build/

[root@rvm nutch_build]# ls
apache-nutch-1.11  apache-nutch-1.11-src.tar.gz




2) Install Apache Ant

Ensure Java is installed.

[root@rvm nutch_build]# java -version
openjdk version "1.8.0_45"
OpenJDK Runtime Environment (build 1.8.0_45-b13)
OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)
[root@rvm nutch_build]#


Download the Apache ANT Binaries. 

[root@rvm nutch_build]# pwd
/opt/nutch_build


[root@rvm nutch_build]# ls
apache-nutch-1.11  apache-nutch-1.11-src.tar.gz


[root@rvm nutch_build]# wget mirror.ventraip.net.au/apache//ant/binaries/apache-ant-1.9.7-bin.tar.gz
--2016-05-03 18:13:57--  http://mirror.ventraip.net.au/apache//ant/binaries/apache-ant-1.9.7-bin.tar.gz
Resolving mirror.ventraip.net.au... 103.252.152.2, 2400:8f80:0:11::1
Connecting to mirror.ventraip.net.au|103.252.152.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5601575 (5.3M) [application/x-gzip]
Saving to: ‘apache-ant-1.9.7-bin.tar.gz’

100%[=========================================>] 5,601,575   1.90M/s   in 2.8s

2016-05-03 18:14:02 (1.90 MB/s) - ‘apache-ant-1.9.7-bin.tar.gz’

[root@rvm nutch_build]# ls
apache-ant-1.9.7-bin.tar.gz  apache-nutch-1.11  apache-nutch-1.11-src.tar.gz
[root@rvm nutch_build]#


Move the downloaded Ant archive to the Java home directory and extract it there.

[root@rvm nutch_build]# mv apache-ant-1.9.7-bin.tar.gz /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/

[root@rvm nutch_build]# cd /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/

[root@rvm java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64]# tar xzf apache-ant-1.9.7-bin.tar.gz


[root@rvm java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64]# rm -rf apache-ant-1.9.7-bin.tar.gz
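The build commands in this post call Ant by its full path. As an optional sketch, assuming Ant was extracted to the location used above, the following exports ANT_HOME and puts its bin directory on the PATH so that ant can be called by name. ANT_HOME is the conventional variable name for this; the paths are copied from this post and will differ on other systems.

```shell
# Assumed locations, copied from the steps above; adjust to your system.
JAVA_HOME=/usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64
ANT_HOME="$JAVA_HOME/apache-ant-1.9.7"
export JAVA_HOME ANT_HOME

# Put Ant's bin directory ahead of everything else on the PATH.
export PATH="$ANT_HOME/bin:$PATH"

# Show which ant would be used, or say so if none is found.
command -v ant || echo "ant not found on PATH (check ANT_HOME)"
```

These exports last only for the current shell session. To make them permanent, they could go into the shell profile of the user who runs the build.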



3) Building Apache Nutch

Set the JAVA_HOME and NUTCH_JAVA_HOME. 

[root@rvm apache-nutch-1.11]# pwd
/opt/nutch_build/apache-nutch-1.11


[root@rvm apache-nutch-1.11]# export JAVA_HOME=/usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/


[root@rvm apache-nutch-1.11]# export NUTCH_JAVA_HOME=/usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/
[root@rvm apache-nutch-1.11]#


Add the property http.agent.name in /opt/nutch_build/apache-nutch-1.11/conf/nutch-site.xml 

[root@rvm nutch_build]# pwd
/opt/nutch_build


[root@rvm nutch_build]# cat apache-nutch-1.11/conf/nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
         <configuration>

        </configuration>


[root@rvm nutch_build]# vi apache-nutch-1.11/conf/nutch-site.xml


 

[root@rvm nutch_build]# cat apache-nutch-1.11/conf/nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
      <property>
                   <name>http.agent.name</name>
                    <value>WebCrawler</value>
                    <description></description>
       </property>
</configuration>
[root@rvm nutch_build]#
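If you prefer not to edit the file in vi, the same content can be written with a here-document. This is only a sketch: write_nutch_site is a hypothetical helper name, and the demo writes to a scratch file rather than to the real conf directory.

```shell
# Hypothetical helper: writes a minimal nutch-site.xml that sets http.agent.name.
# $1 = target file, $2 = agent name
write_nutch_site() {
  conf_file=$1
  agent_name=$2
  cat > "$conf_file" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$agent_name</value>
    <description></description>
  </property>
</configuration>
EOF
}

# Demo against a scratch file. In the setup above the target would be
# /opt/nutch_build/apache-nutch-1.11/conf/nutch-site.xml
scratch=$(mktemp)
write_nutch_site "$scratch" WebCrawler
cat "$scratch"
```

Note that the helper overwrites the target file, so any other properties already in nutch-site.xml would be lost.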


Run the build script using Ant. The build takes more than 45 minutes and downloads the required JARs from external repositories, so make sure the system has internet access.

[root@rvm apache-nutch-1.11]# /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/apache-ant-1.9.7/bin/ant runtime

The Nutch job file is generated under /opt/nutch_build/apache-nutch-1.11/runtime/deploy.

[root@rvm deploy]# pwd
/opt/nutch_build/apache-nutch-1.11/runtime/deploy


[root@rvm deploy]# ls
apache-nutch-1.11.job  bin
[root@rvm deploy]#



4) Modifying the Nutch code to use the class org.apache.nutch.crawl.Crawl

Older versions of Nutch had a class, org.apache.nutch.crawl.Crawl, that performed all the crawling operations through a single API call. It has been removed in the latest Nutch versions. If your application uses org.apache.nutch.crawl.Crawl, you can still build the job with that class included.

To do that, download Crawl.java from Apache. Below, I fetch Crawl.java from the Nutch 1.7 branch.

[root@rvm crawl]# pwd
/opt/nutch_build/apache-nutch-1.11/src/java/org/apache/nutch/crawl


[root@rvm crawl]# wget http://svn.apache.org/viewvc/nutch/branches/branch-1.7/src/java/org/apache/nutch/crawl/Crawl.java?view=co -O Crawl.java

--2016-05-03 23:14:50--  http://svn.apache.org/viewvc/nutch/branches/branch-1.7/src/java/org/apache/nutch/crawl/Crawl.java?view=co
Resolving svn.apache.org... 209.188.14.144
Connecting to svn.apache.org|209.188.14.144|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘Crawl.java’

    [ <=>                                                                                                                                                ] 5,895       --.-K/s   in 0.04s

2016-05-03 23:14:53 (137 KB/s) - ‘Crawl.java’


In /opt/nutch_build/apache-nutch-1.11/src/java/org/apache/nutch/crawl/Crawl.java, remove the references to Solr. Delete the following lines from the code:

a)
import org.apache.nutch.indexer.solr.SolrDeleteDuplicates;

b)
else if ("-solr".equals(args[i])) {
        solrUrl = args[i + 1];
        i++;
      }

c)
if (solrUrl != null) {
        // index, dedup & merge
        FileStatus[] fstats = fs.listStatus(segments, HadoopFSUtil.getPassDirectoriesFilter(fs));
     
        IndexingJob indexer = new IndexingJob(getConf());
        indexer.index(crawlDb, linkDb,
                Arrays.asList(HadoopFSUtil.getPaths(fstats)));

        SolrDeleteDuplicates dedup = new SolrDeleteDuplicates();
        dedup.setConf(getConf());
        dedup.dedup(solrUrl);
      }

d)
LOG.info("solrUrl=" + solrUrl);

e)

if (solrUrl == null) {
      LOG.warn("solrUrl is not set, indexing will be skipped...");
    }
    else {
        // for simplicity assume that SOLR is used
        // and pass its URL via conf
        getConf().set("solr.server.url", solrUrl);
    }

f)
 String solrUrl = null;

g) Modify the line
System.out.println
      ("Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]");
to
System.out.println
      ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
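Before rebuilding, it can help to confirm that no Solr references were missed. The following is a minimal sketch using grep; has_solr_refs is a hypothetical helper name, and the demo runs on a scratch file rather than the real Crawl.java.

```shell
# Hypothetical helper: prints every line (with its line number) that still
# mentions "solr", ignoring case. Exit status is 0 when a match is found.
has_solr_refs() {
  grep -n -i 'solr' "$1"
}

# Demo on a scratch file containing one leftover reference.
sample=$(mktemp)
printf 'String solrUrl = null;\nint depth = 5;\n' > "$sample"
has_solr_refs "$sample" || echo "no Solr references left"

# Against the real file from this post:
# has_solr_refs /opt/nutch_build/apache-nutch-1.11/src/java/org/apache/nutch/crawl/Crawl.java
```

Any line that the helper prints still needs to be removed or edited as described in steps a) to g).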


Rebuild the code.

[root@rvm apache-nutch-1.11]# pwd
/opt/nutch_build/apache-nutch-1.11


[root@rvm apache-nutch-1.11]# /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/apache-ant-1.9.7/bin/ant runtime




5) Running the generated Nutch job from Hadoop

Create the input and output directories in HDFS.

[root@rvm apache-nutch-1.11]# hadoop fs -mkdir /tmp/testNutchJob
[root@rvm apache-nutch-1.11]# hadoop fs -mkdir /tmp/testNutchJob/input
[root@rvm apache-nutch-1.11]# hadoop fs -mkdir /tmp/testNutchJob/output
[root@rvm apache-nutch-1.11]#


Create a file that contains the set of seed URLs and load it into HDFS.

[root@rvm apache-nutch-1.11]# vi /opt/nutch_build/urllist.txt
[root@rvm apache-nutch-1.11]#
[root@rvm apache-nutch-1.11]# cat /opt/nutch_build/urllist.txt
http://www.ibm.com/
http://www.ibm.com/developerworks/
 


[root@rvm apache-nutch-1.11]#
[root@rvm apache-nutch-1.11]# hadoop fs -put /opt/nutch_build/urllist.txt /tmp/testNutchJob/input
[root@rvm apache-nutch-1.11]#
[root@rvm apache-nutch-1.11]# hadoop fs -tail /tmp/testNutchJob/input/urllist.txt
http://www.ibm.com/
http://www.ibm.com/developerworks/
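The seed file and upload steps above can also be scripted. This is a sketch under a few assumptions: it writes the seed list to a scratch file instead of /opt/nutch_build/urllist.txt, it uses the -p option of hadoop fs -mkdir to create missing parent directories in one call, and it runs the Hadoop commands only when a hadoop client is found on the PATH.

```shell
# Write the two seed URLs from this post to a scratch file.
seed_file=$(mktemp)
cat > "$seed_file" <<'EOF'
http://www.ibm.com/
http://www.ibm.com/developerworks/
EOF

# Upload only when a Hadoop client is available on this machine.
if command -v hadoop >/dev/null 2>&1; then
  # -p creates /tmp/testNutchJob and /tmp/testNutchJob/input if they are missing.
  hadoop fs -mkdir -p /tmp/testNutchJob/input
  # Give the file a fixed name in HDFS, independent of the scratch file name.
  hadoop fs -put "$seed_file" /tmp/testNutchJob/input/urllist.txt
fi
```

If urllist.txt already exists in HDFS from the manual steps, the put command will refuse to overwrite it, so remove the old copy first.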



Run the Nutch job from Hadoop as the hdfs user.

The org.apache.nutch.crawl.Crawl class takes the arguments <urlDirContainingSeedURL> [-dir d] [-threads n] [-depth i] [-topN N]. Refer: https://wiki.apache.org/nutch/bin/nutch%20crawl


[root@rvm apache-nutch-1.11]# su hdfs

[hdfs@rvm apache-nutch-1.11]$
[hdfs@rvm apache-nutch-1.11]$ hadoop jar /opt/nutch_build/apache-nutch-1.11/runtime/deploy/apache-nutch-1.11.job org.apache.nutch.crawl.Crawl /tmp/testNutchJob/input -dir /tmp/testNutchJob/output -depth 2 -topN 10
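To make the arguments easier to change, the same command can be assembled from variables. This is only a sketch: the variable names are my own, the values are the ones used in this post, and the command is executed only when hadoop is found on the PATH. Here -depth is the number of crawl rounds and -topN is the maximum number of pages fetched per round.

```shell
# Values taken from this post; the variable names are illustrative only.
NUTCH_JOB=/opt/nutch_build/apache-nutch-1.11/runtime/deploy/apache-nutch-1.11.job
SEED_DIR=/tmp/testNutchJob/input     # HDFS directory holding the seed URL file
OUT_DIR=/tmp/testNutchJob/output     # HDFS directory for crawldb, linkdb, segments
DEPTH=2                              # number of crawl rounds
TOPN=10                              # maximum pages fetched per round

crawl_cmd="hadoop jar $NUTCH_JOB org.apache.nutch.crawl.Crawl $SEED_DIR -dir $OUT_DIR -depth $DEPTH -topN $TOPN"

# Print the command for review, then run it if a Hadoop client is available.
echo "$crawl_cmd"
if command -v hadoop >/dev/null 2>&1; then
  # Unquoted on purpose so the string splits into arguments;
  # this assumes none of the paths contain spaces.
  $crawl_cmd
fi
```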




6) Viewing the crawled data

Copy the generated output from the Hadoop file system to the Linux file system.

[hdfs@rvm apache-nutch-1.11]$ hadoop fs -copyToLocal /tmp/testNutchJob/output /tmp

[hdfs@rvm apache-nutch-1.11]$ cd /tmp/output/
 
[hdfs@rvm output]$ ls
crawldb  linkdb  segments
[hdfs@rvm output]$


The command below converts the crawled output from sequence file format into HTML output for testing.

[hdfs@rvm bin]$ su root
Password:
[root@rvm bin]#
[root@rvm bin]# ./nutch commoncrawldump -outputDir /tmp/commoncrawlOutput -segment /tmp/output/segments
[root@rvm bin]#

If you want to change the Nutch configuration, you can open apache-nutch-1.11.job manually with 7-Zip and update the nutch-site.xml inside it.

In my next post, I will cover how to set these properties and the URL filter dynamically.