I’ve been working for some time with a group of researchers on a tool that fetches tweets from Twitter and processes them. The first prototype “worked”, but it became painful to maintain because we used raw sockets to connect the different components. The architecture was similar to this:
The controller created tasks (information about the desired tweets), and the trackers’ job was to download them and send them back to the controller to be stored and processed. We used multiple trackers because of Twitter’s rate limits, and we still had terrible bottlenecks and missing data.
Now we want to rewrite the whole project, and I’m looking for a good approach to improve the tool’s performance. The first idea on the table is putting RabbitMQ between the controller and the trackers, serializing everything to an external database or to HDFS. Another idea is using Apache Flume. Here are my questions about these two options:
- Is RabbitMQ suitable for this kind of task?
- Does the RabbitMQ server itself act as a worker? Does it process tasks by default?
- When using Apache Flume, is it possible to define multiple Twitter agents? I know it is easy to define a single agent that downloads from Twitter, but I would need multiple instances (trackers) running on different nodes.
- Is it possible to replace an agent’s keywords dynamically in Apache Flume?
- The last one is a broad question, sorry for that. Are there any other alternatives apart from RabbitMQ and Apache Flume?
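To make the work-queue idea concrete, here is a rough in-process sketch of what I have in mind, using Python's standard `queue` module as a stand-in for the broker (with RabbitMQ, the controller would publish tasks to a queue and each tracker would consume from it on its own node; the task fields and counts here are just illustrative):

```python
import queue
import threading

# Stand-in for the broker: a shared task queue and a result queue.
tasks = queue.Queue()
results = queue.Queue()

def tracker(worker_id):
    """Tracker: pull a task, 'download' the tweets, report back."""
    while True:
        task = tasks.get()
        if task is None:          # poison pill: shut this tracker down
            tasks.task_done()
            break
        # A real tracker would call the Twitter API here with the
        # task's keyword, honoring its own rate-limit window.
        results.put({"worker": worker_id,
                     "keyword": task["keyword"],
                     "tweets": []})
        tasks.task_done()

# Controller: start several trackers and hand out tasks.
workers = [threading.Thread(target=tracker, args=(i,)) for i in range(3)]
for w in workers:
    w.start()
for kw in ["python", "rabbitmq", "flume"]:
    tasks.put({"keyword": kw})
for _ in workers:
    tasks.put(None)               # one poison pill per tracker
for w in workers:
    w.join()

processed = [results.get() for _ in range(results.qsize())]
print(len(processed))
```

The point is that trackers never talk to each other or to the controller over raw sockets; they only pull from and push to queues, which is the decoupling I hope RabbitMQ would give us across machines.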