Wednesday 4 November 2015

What's new in Storm-Crawler 0.7

Storm-Crawler 0.7 was released yesterday. This release fixes some bugs and provides numerous improvements, so we advise users to upgrade to it. Here are the main changes:

  • AbstractIndexingBolt to use status stream in declareOutputFields #190
  • Change Status to ERROR when FETCH_ERROR above threshold #202
  • FetcherBolt tracks cause of error in metadata
  • Add default config file in resources #193
  • FileSpout chokes on very large files #196
  • Use Maven-Shade everywhere #199
  • Ack tick tuples #194
  • Remove PrinterBolt and IndexerBolt, added StdOutStatusUpdater #187
  • Upgraded Tika to 1.11

This release contains many improvements to the Elasticsearch module:

  • Added README with a getting started section
  • IndexerBolt uses url as doc ID
  • ESSpout : maxSecSinceQueriedDate param to avoid deep paging
  • ElasticSearchSpout can sort randomly, giving a better diversity of URLs
  • ElasticSearchSpout implements de/activate, counter for time spent querying, configurable result size
  • Simple Kibana dashboards for metrics and status indices
  • Metadata as structured object. Implements #197
  • ESSpout: more metrics (acked, failed, ES queries and docs)
  • ESSeedInjector topology
  • Index init script uses ttl for metrics
  • Upgraded ES version to 1.7.2

The Solr module has also received some attention:

  • solr-metadata #210
  • Cleaning some documentation and typo issues
  • Remove outdated configuration options for solr module

We also improved the metrics by adding a PerSecondReducer (#209), which is used by the FetcherBolts to provide pages-per-second and bytes-per-second metrics. The metric names and code were also improved, notably the gauges for ESSpout and FetcherBolt.
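
To give an idea of what a per-second metric involves, here is a minimal sketch in plain Java. It is not the actual PerSecondReducer code from Storm-Crawler and does not use Storm's metrics API: the class name `PerSecondRate` and its methods are hypothetical, and the clock is passed in explicitly so that the example stays deterministic. The idea is simply to accumulate a count (pages or bytes) and divide it by the time elapsed since the last report.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical illustration of a per-second rate metric.
 * NOT the Storm-Crawler PerSecondReducer; names and design are assumptions.
 */
class PerSecondRate {
    private long total = 0;          // units (pages, bytes...) seen in the current window
    private long windowStartNanos;   // start of the current reporting window

    PerSecondRate(long nowNanos) {
        this.windowStartNanos = nowNanos;
    }

    /** Records some units, e.g. 1 page or the number of bytes fetched. */
    void add(long amount) {
        total += amount;
    }

    /** Returns units per second since the last call, then starts a new window. */
    double rateAndReset(long nowNanos) {
        long elapsedNanos = nowNanos - windowStartNanos;
        double rate = elapsedNanos <= 0 ? 0.0 : total / (elapsedNanos / 1e9);
        total = 0;
        windowStartNanos = nowNanos;
        return rate;
    }

    public static void main(String[] args) {
        PerSecondRate pages = new PerSecondRate(0L);
        pages.add(30);
        pages.add(20);
        // 50 pages over a 10 second window
        System.out.println(pages.rateAndReset(TimeUnit.SECONDS.toNanos(10)));
    }
}
```

In a real topology the current time would come from something like `System.nanoTime()`, and the value would be reported at each metrics interval rather than printed.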

These changes, combined with the Kibana dashboard templates, make it easy to monitor a crawl and get additional insights into its behaviour, as illustrated below.



Of course, thanks to Storm's pluggable and versatile metrics mechanism, it is relatively easy to send metrics to other backends such as AWS CloudWatch.

Thanks to the various users and contributors who helped with this release.