I need to extract text from web pages using StormCrawler. I was able to start the crawler with a seeds.txt file, but when I run it, it prints only the number of characters in the content, like this:
content 1177 chars
url https://sle-dashboard.iana.org/chart/root-zone-file-accuracy/?granularity=daily&last=7
domain iana.org
format html
description
title
I tried to implement a bolt and add it to the topology, but it does not seem to be recognised at all when I run the crawler:
....
builder.setBolt("parse", new JSoupParserBolt()).localOrShuffleGrouping("feeds");
// Integrate TextExtractionBolt after the JSoupParserBolt ("parse")
TextExtractionBolt textExtractionBolt = new TextExtractionBolt();
builder.setBolt("textExtractor", textExtractionBolt).localOrShuffleGrouping("parse");
builder.setBolt("shunt", new RedirectionBolt()).localOrShuffleGrouping("textExtractor");
builder.setBolt("tika", new ParserBolt()).localOrShuffleGrouping("shunt");
builder.setBolt("index", new StdOutIndexer()).localOrShuffleGrouping("shunt").localOrShuffleGrouping("tika");
Fields furl = new Fields("url");
// can also use MemoryStatusUpdater for simple recursive crawls
builder.setBolt("status", new StatusUpdaterBolt())
        .fieldsGrouping("fetch", Constants.StatusStreamName, furl)
        .fieldsGrouping("sitemap", Constants.StatusStreamName, furl)
        .fieldsGrouping("feeds", Constants.StatusStreamName, furl)
        .fieldsGrouping("parse", Constants.StatusStreamName, furl)
        .fieldsGrouping("textExtractor", Constants.StatusStreamName, furl)
        .fieldsGrouping("tika", Constants.StatusStreamName, furl)
        .fieldsGrouping("index", Constants.StatusStreamName, furl);
What am I missing? Please help.