
Monday 11 February 2019

Meet StormCrawler users: Q&A with Pixray (Germany)

We are opening a series of Q&A blog posts with Maik Piel, who tells us about the use of StormCrawler at Pixray.

Q: What do you guys do at Pixray? Why do you need web crawling?


We are experts in image tracking on the web. We work for image rights holders to protect their pictures on the web, as well as for brands and manufacturers to monitor sales channels. Our customers range from news and picture agencies and individual photographers to e-commerce companies and luxury brands. Web crawling is one of the core building blocks of our platform, next to a massive picture matching platform, various APIs and our customer portals.

Q: What sort of crawls do you do? How big are they?


We do three kinds of scans: broad scans across complete regions of the web (like the EU or North America), deep scans on single domains and also near-realtime discovery scans on thousands of selected domains. For all of these different scans, we employ customized versions of StormCrawler to match the very distinct requirements in crawling patterns. Obviously, the biggest crawls are the broad regional scans, including more than 10 billion URLs and tens of millions of different domains.

Q: What software stack do you use? e.g. SC + ES + Grafana? Hardware used?


Adapted and extended versions of StormCrawler, as well as Elasticsearch and Kibana. We couple our crawling infrastructure with the rest of our platform through RabbitMQ. Our crawler runs on Ubuntu servers, each with an Intel Core i7, 32 GB of RAM and 4 TB of disk space, running both Apache Storm and Elasticsearch. In the future, we will split the storage (Elasticsearch) and the computation (Storm) layers onto separate hardware. We are also looking at options to employ container and service orchestration frameworks to scale our crawler infrastructure dynamically.

Q: Why did you choose StormCrawler?


We initially built our crawler on Apache Nutch. Needless to say, Nutch is a great and robust platform, but once you grow beyond a certain point you start to see limitations. The biggest one is the low responsiveness to changes and the uneven system utilization caused by the long generate/crawl/update cycles. It sometimes took us 24 hours or more until we could see the effect of a change we had made to the software. Furthermore, we found it troublesome to get valid statistics out of Nutch in real time. StormCrawler solves all that for us. Every config or code change that we commit shows its effect immediately, and you get statistics very, very easily. There is no long-cycle batching anymore in StormCrawler, which gives us very even and continuous crawling and reduces our need for massive queuing of results to ensure an even utilization of downstream infrastructure. Kibana gives us great real-time insights into the crawl database. With Nutch, we had to run analysis jobs taking around 4 hours, even if we just needed the status of a single URL.

Q: What do you like the most / least in StormCrawler?


Besides the points mentioned above, we have to praise StormCrawler's extensibility. In our different setups, we have both made changes to existing code in the StormCrawler project and written large amounts of our own code. The structure Apache Storm imposes is great. Components are very cleanly decoupled, and it is easy to introduce custom functionality by just writing new Spouts and Bolts and linking them into the topology. For our use case we, of course, had to deal with pictures, which StormCrawler itself does not do: we just created our own Bolts for that. For our near-realtime discovery crawler, we needed an engine that calculates the revisit date for a URL based on various factors instead of using a static value; again, we could just create a specific Spout for that.
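
To give a flavour of what this extensibility looks like in practice, here is a minimal, hypothetical sketch of a custom Bolt in the spirit of what Maik describes. The class, field and stream names are invented for illustration and are not Pixray's actual code; only the Apache Storm API calls are the standard ones.

    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    /** Hypothetical bolt which keeps only URLs that look like images, so that a
     *  downstream bolt can fetch and fingerprint the pictures. */
    public class ImageFilterBolt extends BaseRichBolt {

        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            String url = input.getStringByField("url");
            // crude check on the extension - a real bolt would inspect the Content-Type
            if (url.matches("(?i).+\\.(jpe?g|png|gif)")) {
                collector.emit(input, new Values(url));
            }
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("url"));
        }
    }

Linking such a component into the pipeline is then just a matter of adding it to the topology definition alongside the standard StormCrawler bolts.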

Q: Anything in particular you'd like to have in a future release?


It would be great to have a built-in way to prioritize different TLDs within the StormCrawler spouts. We have built a custom solution for that which we might contribute back to StormCrawler at some point.



Wednesday 29 March 2017

Need billions of web pages? Don't bother crawling...

How big did you say?

I am often contacted by prospective clients who want help crawling the web on a very large scale, and I regularly come across questions such as this one on StackOverflow. What people want to achieve with web data varies greatly from one case to the next: some need to extract specific data from as many pages as possible, some want to build search engines, while others wish to test the accuracy of a machine learning model on real data.

Luckily, there are resources available for large scale web crawling, both on the platform side (e.g. Amazon Web Services) and the software side (StormCrawler, Apache Nutch). However, large scale crawling (think billions of pages and hundreds of servers) is costly, complex and time-consuming. At DigitalPebble, we help our clients with such tasks, but what I often recommend as an initial step is to have a look at CommonCrawl.

CommonCrawl to the rescue

CommonCrawl is a non-profit organisation which provides web crawl data for free. Their datasets are used by various organisations, both in academia and industry, as can be seen on the examples page. The applications range from machine learning to natural language processing and computational linguistics. For instance, at DigitalPebble we have used the CommonCrawl dataset for some of our clients for information extraction (phone numbers and contact details publicly available), machine learning (to check the accuracy of a classifier on real, big, messy data) as well as lexicometry (getting frequencies of anchor tags). I should also mention that CommonCrawl themselves are clients of ours: we developed Apache Nutch resources for them and also ran their February 2016 web crawl. We also contributed to the setup of their news crawl (see below).

CommonCrawl provides two types of datasets, both hosted on Amazon S3 as part of the Amazon Public Datasets program.

Web crawl


The main dataset is released on a monthly basis and consists of billions of web pages stored in WARC format on AWS S3. The latest release had 3.08 billion web pages and about 250 TiB of uncompressed content: that’s a lot of data to play with, and it comes for free!

These pages are mainly HTML documents, but there are also a few PDFs and images. Until recently, the coverage was very US-centric and the datasets contained mostly the same URLs from one release to the next, but this is no longer the case as European domain names and the top 1 million Alexa domains are now crawled (see details on http://commoncrawl.org/2017/03/february-2017-crawl-archive-now-available/). Interestingly, CommonCrawl use Apache Nutch to generate their datasets, albeit with a few home-made modifications.

Basically, each release is split into 100 segments. Each segment has three types of files: WARC, WAT and WET. As explained on the Get Started page:

  • WARC files store the raw crawl data
  • WAT files store computed metadata for the data stored in the WARC
  • WET files store extracted plaintext from the data stored in the WARC

Note that WAT and WET are in the WARC format too! In fact, the WARC format is nothing more than an envelope with metadata and content. In the case of the WARC files, that content is the HTTP requests and responses, whereas for the WET files, it is simply the plain text extracted from the WARCs. The WAT files contain a JSON representation of metadata extracted from the WARCs e.g. title, links etc…
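
To make the envelope idea more concrete, here is a rough sketch in plain Java (no external libraries) which walks through a downloaded .warc.gz, .wat.gz or .wet.gz file and prints the type and target URI of each record. It only peeks at the headers and makes no attempt to parse the payloads properly; for real work, use one of the WARC libraries pointed to on the Get Started page.

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    /** Quick look at the WARC headers of a CommonCrawl file (WARC, WAT or WET). */
    public class WarcPeek {
        public static void main(String[] args) throws Exception {
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(args[0])),
                    StandardCharsets.ISO_8859_1))) {
                String line;
                String type = null;
                while ((line = reader.readLine()) != null) {
                    if (line.startsWith("WARC-Type:")) {
                        type = line.substring("WARC-Type:".length()).trim();
                    } else if (line.startsWith("WARC-Target-URI:")) {
                        // print one line per record: its type and the URL it refers to
                        String uri = line.substring("WARC-Target-URI:".length()).trim();
                        System.out.println(type + "\t" + uri);
                    }
                }
            }
        }
    }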

So, not only have CommonCrawl given you loads of web data for free, they’ve also made your life easier by preprocessing the data for you. For many tasks, the content of the WAT or WET files will be sufficient and you won’t have to process the WARC files.

This should not only help you simplify your code but also make the whole processing faster. We recently ran an experiment on CommonCrawl where we needed to extract anchor text from HTML pages. We initially wrote some MapReduce code to extract the binary content of the pages from their WARC representation, processed the HTML with JSoup and reduced on the anchor text. Processing a single WARC segment took roughly 100 minutes on a 10-node EMR cluster. We then simplified the extraction logic, took the WAT files as input and the processing time dropped to 17 minutes on the same cluster. This gain was partly due to not having to parse the web pages, but also to the fact that WAT files are a lot smaller than their WARC counterparts.
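
For the record, the WAT-based version of that extraction essentially boils down to reading one JSON document per record and pulling out the links. The sketch below uses Jackson and is an illustration rather than our actual MapReduce code; in particular, the JSON pointer path reflects how I remember the WAT layout (Envelope > Payload-Metadata > HTTP-Response-Metadata > HTML-Metadata > Links) and should be checked against a real WAT record.

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    /** Sketch of the WAT-based approach: given the JSON payload of a single WAT
     *  record, print the anchor text of every outgoing link. */
    public class WatAnchors {

        private static final ObjectMapper MAPPER = new ObjectMapper();

        public static void printAnchors(String watJson) throws Exception {
            JsonNode root = MAPPER.readTree(watJson);
            // JSON pointer below is an assumption - verify it against an actual WAT record
            JsonNode links = root.at(
                "/Envelope/Payload-Metadata/HTTP-Response-Metadata/HTML-Metadata/Links");
            for (JsonNode link : links) {
                JsonNode text = link.get("text");
                if (text != null && !text.asText().isEmpty()) {
                    System.out.println(text.asText());
                }
            }
        }
    }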

News dataset


Unlike the main web crawl, the news dataset is released continuously. As its name suggests, it consists exclusively of news pages and articles, as described on http://commoncrawl.org/2016/10/news-dataset-available/. There are between 3 and 5 WARC files (1 GB each) generated daily, corresponding to 300 to 400 thousand pages. In total, over 25 million news pages have been crawled to date. The dataset contains WARC files only, so you will have to write some code to extract the text and metadata yourself.

The news dataset is generated using our very own StormCrawler and the code of the news crawl is publicly available on CommonCrawl’s GitHub account.

Resources

The Get Started page on the CommonCrawl website contains useful pointers to libraries and code in various programming languages to process the datasets. There is also a list of tutorials and presentations.

It is also worth noting that CommonCrawl provides an index per release, allowing you to search for URLs (including wildcards) and retrieve the exact file and offset at which the content of a URL is stored, e.g.


{ "urlkey": "org,apache)/", "timestamp": "20170220105827", "status": "200", "url": "http://apache.org/", "filename": "crawl-data/CC-MAIN-2017-09/segments/1487501170521.30/warc/CC-MAIN-20170219104610-00206-ip-10-171-10-108.ec2.internal.warc.gz", "length": "13315", "mime": "text/html", "offset": "14131184", "digest": "KJREISJSKKGH6UX5FXGW46KROTC6MBEM" }

This is useful but only if you are interested in a limited number of URLs which you know in advance. In many cases, what you know in advance is what you want to extract, not where it will be extracted from. For situations such as these, you will need distributed batch-processing using MapReduce in Apache Hadoop or Apache Spark.

As hinted above, I tend to use AWS EMR (ElasticMapReduce). Running the code in AWS makes sense as the datasets are stored on S3, so access is fast and there is no transfer cost; the EC2 instances also have the credentials pre-set, so no additional configuration is needed to access the data. There is an extra cost in using EMR, but it saves me from having to configure Hadoop. In addition, I usually store the output of the reduce steps in an S3 bucket so that nothing is kept on HDFS and I can use spot instances to keep the cost down: if they get terminated, nothing is lost. Of course, other platforms (Azure, Google) or alternatives to EMR (Hortonworks HDP) can be used instead.

Finally, I implement the logic with MapReduce in Java, thanks to libraries such as warc-hadoop which deal with the low-level access to WARC files. If you need to process CommonCrawl with existing frameworks and libraries such as Apache UIMA, Tika or GATE, our good old open source project Behemoth could help, as it can ingest WARCs too!

Conclusion

As we’ve seen, CommonCrawl is an awesome resource and should be the first thing you try before embarking on web scale crawling (although if you must, DigitalPebble would be happy to help). It is large, it is free, it is relatively easy to process and a lot of effort has been put into making your life easier.

Web data are big, messy and often don’t give the results you expect. Processing the CommonCrawl dataset is a great way of checking your assumptions at a fraction of the cost of a web scale crawl. It also saves you time, as the fetch politeness has been handled for you. On the minus side, you will only be able to process content allowed by robots.txt directives, as CommonCrawl’s crawler is polite (but then yours should be too).

I hope you will give CommonCrawl a try and if you find it useful, you can donate to the project.

Tuesday 3 January 2017

The Battle of the Crawlers: Apache Nutch vs StormCrawler

Happy New Year everyone!

For this first blog post of 2017, we'll compare the performance of StormCrawler and Apache Nutch. As you probably know, these are open source solutions for distributed web crawling; we provided an overview of both last year, as well as a performance comparison when crawling a single website.

StormCrawler has been steadily gaining in popularity over the last 18 months, and a frequent question from prospective users is how fast it is compared to Nutch. Last year's blog post provided some insights into this, but now we'll go one step further by crawling not a single website but a thousand. The benchmark will still be run on a single server, though it will cover several million pages.

Disclaimer: I am a committer on Apache Nutch and the author of StormCrawler.

Meet the contestants

Please have a look at our previous blog post for a more detailed description of both projects. This Q and A should also be useful. 

Apache Nutch is a well-established web crawler based on Apache Hadoop. As such, it operates in batches, with the various aspects of web crawling done as separate steps (e.g. generate a list of URLs to fetch, fetch them, parse the web pages and update the data structures).

In this benchmark, we'll use the 1.x version of Nutch. There is a 2.x branch but as we saw in a previous benchmark, it is a lot slower. It also lacks some of the functionalities of 1.x and is not actively maintained.

StormCrawler, on the other hand, is based on Apache Storm, a distributed stream processing platform. All the web crawling operations are done continuously and at the same time.

What we can assume (and observed previously) is that StormCrawler should be more efficient, as Nutch does not fetch web pages continuously but only as one of its various batch steps. On top of that, some of its operations - mainly the ones that deal with the crawldb, the data structure used by Nutch - take longer and longer as the size of the crawl grows.


The battleground

We ran the benchmark on a dedicated server provided by OVH with the following specs:

Intel Xeon E5-1630v3, 4c/8t, 3.7/3.8 GHz
64 GB DDR4 ECC RAM, 2133 MHz
2 x 480 GB SSD in RAID 0

Ubuntu 16.10 server

We installed the following software:

Apache Storm 1.0.2
Elasticsearch 2.4
Kibana 4.6.3

StormCrawler 1.3-SNAPSHOT

Hadoop 2.7.3
Apache Nutch 1.13-SNAPSHOT

Finally, the resources and configurations for the benchmark can be found in the first release of https://github.com/DigitalPebble/stormcrawlerfight.

We followed the published recommendations for the configuration of Elasticsearch on SSDs and gave it 10 GB of RAM to run on.

Apache Nutch 

The configuration for Nutch can be found in the GitHub repo under the nutch directory. This should allow you to reproduce the benchmark if you wish to do so.

The main changes to the crawl script, apart from the addition of a contribution I recently made to Nutch, were to:

  • set the number of fetch threads to 500
  • change the max size of the fetchlist to 50,000,000
  • use 4 reducer tasks
  • remove the link inversion and dedup steps

The latter was done in order to keep the crawl to the essential steps. We left the fetch time limit set to 3 hours; the aim was to avoid long tails in the fetching step, where the process is busy fetching from only a handful of slow servers.

In order to optimise the crawl, we limited the number of URLs per hostname in the fetchlist to 100, which guarantees a good distribution of URLs and, again, prevents the long-tail phenomenon commonly observed with Nutch. We also tried to avoid the conundrum whereby setting too short a duration for the fetching step requires more crawl iterations, meaning that more generate and update steps are necessary.


Note: we initially intended to index the documents into Elasticsearch; however, this step proved unreliable and caused errors with Hadoop. We ended up deactivating the indexing step in the script, which should benefit Nutch in the comparison with StormCrawler.

We ran 10 crawl iterations between 2016.12.16 11:18:37 CET and 2016.12.17 19:29:30 CET; the breakdown of times per step is as follows:



Iteration   Step         Time
1           Generation   0:00:38
            Fetcher      0:00:45
            Parse        0:00:26
            Update       0:00:24
2           Generation   0:00:40
            Fetcher      0:23:26
            Parse        0:01:53
            Update       0:00:30
3           Generation   0:00:52
            Fetcher      0:55:46
            Parse        0:08:24
            Update       0:01:07
4           Generation   0:01:15
            Fetcher      1:08:36
            Parse        0:19:00
            Update       0:02:01
5           Generation   0:02:03
            Fetcher      2:14:20
            Parse        0:47:59
            Update       0:04:34
6           Generation   0:03:55
            Fetcher      4:20:02
            Parse        1:30:58
            Update       0:08:46
7           Generation   0:06:44
            Fetcher      3:52:36
            Parse        1:09:40
            Update       0:08:15
8           Generation   0:08:31
            Fetcher      3:48:35
            Parse        1:04:27
            Update       0:08:32
9           Generation   0:10:00
            Fetcher      3:51:57
            Parse        1:18:38
            Update       0:09:27
10          Generation   0:11:44
            Fetcher      3:32:35
            Parse        0:59:13
            Update       0:09:39
Total                    33:08:53

What you can observe is that the generate and update steps do take an increasingly longer time, as mentioned above.

The final stats from the last update step were:


db_fetched      10,626,298
db_gone         686,834
db_redir_perm   123,087
db_redir_temp   217,191
db_unfetched    64,678,627


which gives us a total of 11,653,410 URLs processed (fetched + gone + redirections) in a total time of 1,930 minutes.



On average, Nutch fetched 6,038 URLs per minute.

The graph below shows the bandwidth usage of the server when the Nutch crawl was running.

Network graph of Nutch crawl

This is a good illustration of the batch nature of Nutch, where the fetching is only one part of the whole process.

Let's now see how StormCrawler fared in a similar situation.

StormCrawler

StormCrawler can use different backends for storing the status of the URLs (the equivalent of what the crawldb does in Nutch). For this benchmark, we used the Elasticsearch module of StormCrawler, as it is the most commonly used. This means that we won't just be storing the content of the web pages in Elasticsearch; we'll also use it to store the status of the URLs and to display metrics about the crawl with Kibana.
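
For those unfamiliar with it, wiring such a topology by hand looks roughly like the sketch below. Treat it as an approximation: the StormCrawler class names are quoted from memory and may differ slightly from the version used here (the actual resources are in the stormcrawlerfight repository mentioned above), but it shows how the Elasticsearch module provides both the spout that feeds URLs into the pipeline and the bolt that writes their status back.

    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    import com.digitalpebble.stormcrawler.Constants;
    import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
    import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
    import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
    import com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt;
    import com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout;
    import com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt;

    public class BenchmarkTopology {
        public static TopologyBuilder build() {
            TopologyBuilder builder = new TopologyBuilder();
            // the spout pulls URLs due for fetching from the ES status index
            builder.setSpout("spout", new AggregationSpout());
            builder.setBolt("partitioner", new URLPartitionerBolt()).shuffleGrouping("spout");
            builder.setBolt("fetch", new FetcherBolt()).fieldsGrouping("partitioner", new Fields("key"));
            builder.setBolt("parse", new JSoupParserBolt()).localOrShuffleGrouping("fetch");
            builder.setBolt("index", new IndexerBolt()).localOrShuffleGrouping("parse");
            // components report URL status changes on a dedicated 'status' stream,
            // which this bolt writes back to the ES status index
            builder.setBolt("status", new StatusUpdaterBolt())
                   .fieldsGrouping("fetch", Constants.StatusStreamName, new Fields("url"))
                   .fieldsGrouping("parse", Constants.StatusStreamName, new Fields("url"))
                   .fieldsGrouping("index", Constants.StatusStreamName, new Fields("url"));
            return builder;
        }
    }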

We ran the crawl for over two and a half days and got the following values in the status index:



DISCOVERED    188,396,525
FETCHED       32,656,149
ERROR         2,901,502
REDIRECTION   2,050,757
FETCH_ERROR   1,335,437

This means a total of 38,943,845 web pages processed (everything except DISCOVERED) over 3,977 minutes, i.e. an average of 9,792 pages per minute.

The network graph looked like this:

Network graph of StormCrawler crawl

which, apart from an unexplained and possibly unrelated spike on Christmas Day, shows pretty solid use of the bandwidth. Whereas Nutch was often around the 50M mark, StormCrawler's usage is lower but constant.

The metrics stored in Elasticsearch and displayed with Kibana gave a similar impression:

StormCrawler metrics displayed with Kibana
Interestingly, the Storm UI indicated that the bottleneck of the pipeline was the update step, which is not unusual given the 'write-heavy' nature of StormCrawler.

Conclusion


This benchmark, as set out above, shows that StormCrawler was around 60% more efficient than Apache Nutch. We also found StormCrawler to run more reliably than Nutch, but this could be due to a misconfiguration of Apache Hadoop on the test server. We had to omit the indexing step from the Nutch crawl script because of reliability issues, whereas the StormCrawler topology did index the documents successfully; this would have added to Nutch's processing time.



The main explanation lies in the design of the crawlers: Nutch achieves greater peaks in its fetching step but does not fetch continuously, as StormCrawler does. I had previously compared Nutch to a sumo and StormCrawler to a ninja, but the parable of the tortoise and the hare would be just as appropriate.


It is important to bear in mind that the raw performance of the crawlers is just one aspect of an overall comparison. One should also consider the frequency of releases and contributions, as well as more subjective aspects such as ease of use and versatility. There is also, of course, the question of the functionalities provided. To be fair to Nutch, it currently does things that StormCrawler does not yet support, such as document deduplication and scoring. On the other hand, StormCrawler has a few aces up its sleeve too, with XPath extraction, sitemap processing and live monitoring with Kibana.


As is often said in similar situations: “your mileage may vary”. The figures given here depend on the particular seed list and hardware; you might get different results for your specific use case. Since the resources and configurations of the benchmark are publicly available, you can reproduce and extend it as you wish.


Hopefully we’ll run more benchmarks in the future. These could cover larger scale crawling in fully distributed mode and/or compare different backends for StormCrawler (e.g. Redis+RabbitMQ vs Elasticsearch).


Happy crawling!