Monday 29 May 2017

What's new in StormCrawler 1.5

StormCrawler 1.5 has just been released! It is an important road mark with the move to Elasticsearch 5.x and the implementation of long-awaited features such as the Selenium-based protocol. The code has been improved in many ways and despite the seemingly low number of lines below, this new release is a mammoth one!

The project, in general, is in very good health, with more and more organisations using it in production, and an increased visibility, reflected by the growing number of questions on StackOverflow.

Here are the main changes in 1.5.

CORE DEPENDENCIES UPGRADES
  • Apache Storm 1.1.0 (#450)

CORE MODULE
  • HTTP Protocol: implement cookie handling (#32)
  • java.util.zip.ZipException: Not in GZIP format thrown on redirs with httpclient (#455)
  • Selenium-based protocol implementation (#144) which I described in a separate blog post
  • Indicate whether RobotsRules come from cache or have been fetched (#460)
  • Memory issues when ByteArrayBuffer gets instantiated with a large value despite maxLength being set (#462)
  • FetcherBolt to dump URLs being fetched to log (#464)
  • Override sitemapsAutoDiscovery settings per URL (#469)

Knowing whether RobotsRules come from the cache gives us more insights into the behaviour of the crawlers as we can display the ratio of cache vs live (see illustration below)


as well as pages fetched vs robots fetched.

  

ELASTICSEARCH
  • Utility class to export URL and metadata from ES index to file (#444)
  • Fixed sampling with aggregation spout in ES5
  • Upgrade to Elasticsearch 5.3 (#221 and  #451)
  • Optimise nextFetchDate to speed up queries to Elasticsearch (#429 and  #452)
  • Delete gone pages from index (#253)
  • metrics - remove filtering (#281)

One of the main changes related to Elasticsearch is the removal of ElasticsearchSpout and the introduction of CollapsingSpout, which uses the brand new FieldCollapsing in Elasticsearch. We also fixed a concurrency issue in the StatusUpdaterBolt (9fefac8), improved the efficiency of the spouts by getting them to process results in a separate thread (1b0fb42), which combined with the optimisation of nextFetchDate (see above) and the fix of the sampling in AggregationSpout, means that the Elasticsearch module is more efficient than ever.

The move to Elasticsearch 5.x was not without difficulties but the result justifies the effort. I described in a separate post the common pitfalls of upgrading an existing topology to Elasticsearch 5.

Coming next?
As usual, it is hard to guess what the next release will be made of as the project is driven by its community.

Having said that, I'd expect the Selenium-based protocol to get improved as users start to use it. It is also likely that we'll move away from Apache HttpClient library (#443). As mentioned in the previous release, we'll probably upgrade to the next release of crawler-commons, which will have a brand new SAX-based Sitemap parser.

In the meantime and as usual, thanks to all contributors and users and happy crawling!




No comments:

Post a Comment

Note: only a member of this blog may post a comment.