Thursday, July 13, 2017

Crawling Tor using Apache Nutch


In this post, I’ll cover crawling and indexing hidden sites. This might be useful for research, and you can have your own search engine just to play with. There are already a couple of projects that have done this; one of the well-known ones is Ahmia. This post does not make any improvements on that project. I just wanted to have something of my own. I might end up doing some research project with the data, but I can’t think of any ideas right now.

Ubuntu Server 14.04 64-bit - I have a Proxmox template that I just clone. You should be able to use newer versions.
Apache Nutch 1.13 - There are two versions of Nutch. 1.x and 2.x. Version 1.13 was recently released so I’m just using that.
Elasticsearch 2.3.3 - Nutch 1.13 works correctly with version 2.3.3.
Kibana 4.5.4 - According to the support matrix, Kibana 4.5.4 works with Elasticsearch 2.3.3.
Docker - We’ll use the rotating HAProxy -> Tor container (mattes/rotating-proxy). This will distribute our requests over multiple Tor connections.

We will start the rotating proxy. Nutch will be configured to use the rotating proxy port and to save data into Elasticsearch. We will also configure it to only crawl .onion domains, and provide a seed URL list: the seed URLs are where Nutch will begin crawling. Nutch will take the data and put it in Elasticsearch. The data we’ll get is text only; we won’t be getting any images. Finally, we will use Kibana to browse the data and do our searches.

We will install Java first.

apt-get update
apt-get install python-software-properties
add-apt-repository ppa:webupd8team/java
apt-get update
apt-get install oracle-java8-installer -y

Add the following line into your /etc/environment file:
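For the Oracle Java 8 installer, the line to add is typically the JAVA_HOME assignment. The path below is an assumption; check where the installer actually put Java on your machine:

```shell
JAVA_HOME="/usr/lib/jvm/java-8-oracle"
```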

Now we’ll install Docker and pull the rotating proxy container.

curl -fsSL -o
docker pull mattes/rotating-proxy:latest
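For reference, Docker’s convenience-script install at the time generally looked like the following; treat the URL and filename as my reconstruction rather than the exact command used:

```shell
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
```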

Next, we’ll download and configure Nutch.

tar xvf apache-nutch-1.13-bin.tar.gz

We need to modify a few values inside apache-nutch-1.13/conf/nutch-site.xml. The http.agent name needs to be Mozilla/5.0 (you can modify the other http.agent values if you need to). The proxy host needs to point at the local machine, since we’re running the rotating-proxy container there, and http.proxy.port needs to be 5566. The Elasticsearch host should be set to the local machine as well, since we’re running that on the same box.
That’s all we need to modify. You can adjust other settings to your liking if you want.
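As a concrete sketch, a nutch-site.xml along these lines should cover the values above. Property names are from Nutch 1.x; the elastic.* keys come from the indexer-elastic plugin, and the values here (localhost, port 9300) are assumptions to verify against your setup:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Identify the crawler to sites we fetch -->
  <property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0</value>
  </property>
  <!-- Send all fetches through the rotating HAProxy/Tor container -->
  <property>
    <name>http.proxy.host</name>
    <value>127.0.0.1</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>5566</value>
  </property>
  <!-- Where the indexer-elastic plugin writes documents -->
  <property>
    <name>elastic.host</name>
    <value>127.0.0.1</value>
  </property>
  <property>
    <name>elastic.port</name>
    <value>9300</value>
  </property>
</configuration>
```

Make sure plugin.includes also lists the indexer-elastic plugin if it isn’t enabled by default in your build.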

We need to edit apache-nutch-1.13/conf/regex-urlfilter.txt too. We only want to follow hidden sites (.onion).
Comment out the bottom line, which is ‘+.’
Add the following line underneath
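The rule to add is a regex that accepts only .onion URLs; the exact pattern below is my assumption, in the same spirit as what Ahmia does. Nutch applies Java regexes from regex-urlfilter.txt, but grep -E is close enough to sanity-check the pattern:

```shell
# regex-urlfilter.txt change, sketched:
#   comment out the catch-all:   # +.
#   add an .onion-only rule:     +^https?://([a-zA-Z0-9-]+\.)*[a-zA-Z0-9]+\.onion
pattern='^https?://([a-zA-Z0-9-]+\.)*[a-zA-Z0-9]+\.onion'
echo 'http://someonionaddressexample.onion/' | grep -E "$pattern"        # kept
echo 'https://example.com/' | grep -qE "$pattern" || echo 'filtered out'
```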

Now we need to generate our seed URL file. For our seed, we’ll just use the Hidden Wiki. You can also use hidden-site lists found in other places, such as Ahmia or Pastebin.

Inside apache-nutch-1.13/bin, create a folder called urls. Add a seed.txt file which contains the following line:
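Sketched as commands, with a placeholder standing in for the Hidden Wiki address, which you’ll need to supply yourself:

```shell
# Create the seed list that Nutch starts crawling from
mkdir -p apache-nutch-1.13/bin/urls
# Replace the placeholder with the Hidden Wiki (or any other .onion) address
printf 'http://replace-with-a-real-address.onion/\n' > apache-nutch-1.13/bin/urls/seed.txt
cat apache-nutch-1.13/bin/urls/seed.txt
```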

That’s all. We’re done with Nutch for now.

Finally, Elasticsearch and Kibana.

tar xvf kibana-4.5.4-linux-x64.tar.gz

We just need to configure Kibana. In kibana-4.5.4-linux-x64/config/kibana.yml, uncomment the host setting so Kibana listens on all interfaces.
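Presumably the setting in question is server.host, which is commented out by default in Kibana 4.x; binding it to all interfaces is what makes the UI reachable from other machines:

```yaml
# kibana-4.5.4-linux-x64/config/kibana.yml
server.host: "0.0.0.0"
```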

The binary file for Kibana is kibana-4.5.4-linux-x64/bin/kibana
The binary file for Elasticsearch is elasticsearch-2.3.3/bin/elasticsearch
Running everything:
First we’ll get proxy up and running and test it out.

docker run -d -p 5566:5566 -p 4444:4444 --env tors=25 mattes/rotating-proxy
curl --proxy
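The curl arguments above were presumably the local proxy port plus some IP-echo or Tor-check service; a hypothetical version:

```shell
curl --proxy 127.0.0.1:5566 https://check.torproject.org/
```

Running it several times should show traffic leaving through different Tor circuits, since HAProxy round-robins across the tors=25 instances.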

Start elasticsearch and then kibana:
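Given the binary paths listed above, the start commands would be along these lines (Elasticsearch 2.x’s -d flag daemonizes it; start it before Kibana):

```shell
./elasticsearch-2.3.3/bin/elasticsearch -d
./kibana-4.5.4-linux-x64/bin/kibana
```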

Kibana runs on port 5601. You should be able to visit the IP address of the machine you’re working on at port 5601 in a browser and see Kibana load.

Finally, we can begin crawling with Nutch.

Inside of bin folder under Nutch, there is a crawl script.
These are the arguments you can provide:
crawl [-i|--index] [-D "key=value"] [-w|--wait] <Seed Dir> <Crawl Dir> <Num Rounds>

We’ll run:
./crawl -i urls/ onioncrawl/ 3

-i so we can index our data
urls/ is our directory with seed.txt file.
onioncrawl/ is the directory Nutch will create to store data
3 is the number of rounds. It’s basically the crawl depth; a lower number means a shallower crawl.

In Kibana webUI, go to Settings and add ‘nutch’ as your index and set time-field name to ‘tstamp’. Click ‘Create’. Go to Discover page and you should see some data.

[Screenshot: the Kibana UI]

[Screenshot: a search for “information security”]

[Screenshot: content containing “information security”, showing the fields that are indexed]

I had a couple of issues. First, I haven’t optimized Nutch or Elasticsearch, and I am running everything on a single machine, which may hurt performance. Optimization is something I’ll look into in the future. The second problem was Nutch crashing while putting the data into Elasticsearch. Here’s the error:
ERROR CleaningJob: Job failed!
       at org.apache.hadoop.mapred.JobClient.runJob(
       at org.apache.nutch.indexer.CleaningJob.delete(
       at org.apache.nutch.indexer.CleaningJob.main(

Look at “Fix” section.

In the past I was able to crawl about 10 gigs of data without any issues; at that time I didn’t have the Elasticsearch part working. If there are improvements that can be made, leave them in the comments. I can’t say that I’m too familiar with Nutch. I was just familiar enough to get it running.

Fix: rebuild Nutch from source (this assumes a checkout of the Nutch repository in a folder called nutch):

apt-get install ant
cd nutch; ant runtime

Go make some coffee.

“BUILD SUCCESSFUL” is what you’re looking for.

cd runtime/local

In this folder, you’ll see conf and bin. Copy nutch-site.xml and regex-urlfilter.txt into the conf folder. Copy the urls folder into the bin folder.

We can delete the old nutch Elasticsearch index by running this:
curl -XDELETE 'localhost:9200/nutch?pretty'

You’ll have to run crawl a bit differently since the argument format is different.
crawl [-i|--index] [-D "key=value"] [-w|--wait] [-s <Seed Dir>] <Crawl Dir> <Num Rounds>

We’ll run:
./crawl -i -s urls/ onioncrawl/ 4

I haven’t had another crash like I did previously! The final message I got was: “Finished loop with 4 iterations”.

I used a lot of different resources to figure out what I was doing. I’ll put a bunch of them here. You may or may not find all of them useful.

I will put preconfigured files here:
You don’t have to use any files from the repo, I am uploading them to make setup easier for myself in the future.

Resources for this research are provided by Living Lab IUPUI and IUPUI.
It’s always nice to have a fast internet connection through the university. :-D

Leave a comment if there are mistakes in this post.


  1. Hi,
    I am looking for the files in the github repo you mentioned and couldnt find any files except README. Are the files public?

    1. Sorry about that. I had some issues with github while trying to upload that so i didn't. I'll find another place to upload the files soon.


  3. I'm on Mac and I am having trouble with Apache Ant. How do you run "ant" in the Nutch directory? I downloaded Ant from the Apache site but I can't figure out how to run it in the Nutch directory.
