I need to write a server application that fetches mails from different mail servers/mailboxes and then needs to process/analyze these mails.
Traditionally, I would do this multi-threaded, launching a thread for fetching mails (or maybe one per mailbox) and then processing the mails.
We are moving more and more to servers where we have 8+ cores, so I would like to make use of these cores as much as possible (and not use one at 100% and leave the seven others untouched). So conceptually, as an example, it would be nice if I could write the application in such a way that two cores are “continuously” fetching emails and four cores are “continuously” processing/analyzing the emails (since processing and analyzing mails is more CPU intensive than fetching mails).
This seems like a good concept, but after studying some parallel patterns, I’m not really sure how this is best implemented. None of the patterns really fit. I’m working in VS2012, native C++, but I guess from a design point of view this does not really matter and just some pointers on how to organize this would be great!
The actor model of concurrency seems like it might be a good fit for this.
The Model
In case you’re not familiar with this model, it goes as follows:
Actors are threads that run in a loop. Each actor has a producer-consumer message queue; external code and other actors communicate with an actor by sending a message to it (queuing it in its message queue).
An actor’s thread will block waiting for a message in its message queue, and when one appears the actor will deal with it, then loop back to process or wait for the next message. Repeat.
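To make the loop described above concrete, here is a minimal sketch in C++11. The `Actor` class, its `send` method and the `demo_count_messages` function are names made up for illustration, not part of any library, and treating a message as a plain `std::function<void()>` is a simplifying assumption.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical minimal actor: one thread draining one message queue.
class Actor {
public:
    // Assumption: a message is just a callable that the actor's thread runs.
    typedef std::function<void()> Message;

    Actor() : done_(false), worker_(&Actor::run, this) {}

    // Asks the loop to stop once the queue is empty, then waits for the thread.
    ~Actor() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        ready_.notify_one();
        worker_.join();
    }

    // Called by external code or by other actors: queue a message and wake the actor.
    void send(Message msg) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            inbox_.push(std::move(msg));
        }
        ready_.notify_one();
    }

private:
    void run() {
        for (;;) {
            Message msg;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                // Block until a message arrives or shutdown is requested.
                ready_.wait(lock, [this] { return done_ || !inbox_.empty(); });
                if (inbox_.empty()) return;  // shutdown requested and nothing left
                msg = std::move(inbox_.front());
                inbox_.pop();
            }
            msg();  // handle the message outside the lock, then loop
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<Message> inbox_;
    bool done_;
    std::thread worker_;  // declared last so the queue exists before the thread starts
};

// Sends three messages to one actor and returns how many were handled.
int demo_count_messages() {
    int handled = 0;  // only touched by the actor's thread until it is joined
    {
        Actor counter;
        for (int i = 0; i < 3; ++i)
            counter.send([&handled] { ++handled; });
    }  // the destructor drains the queue and joins the thread
    assert(handled == 3);
    return handled;
}
```

Because only the actor's own thread runs the handlers, state owned by an actor needs no locking of its own; the queue is the only shared data.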
Note: “Actors” are sometimes called “Agents”, but that term is misapplied. See the comment thread below for more.
Architecture
You could create actors specifically for downloading messages (say one per mail server/mailbox) and other actors for processing the e-mails once they’ve been downloaded.
Connecting the two, you could have a single routing actor that receives references to the downloaded mail files from the fetching actors. It either sends each reference to an available processing actor or, if all the processing actors are busy, spins up another one to process it. When a processing actor finishes, it sends a message back to the routing actor saying it is done, so the routing actor knows it can send that actor another mail to process.
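The routing actor's bookkeeping can be sketched separately from the threading. In the sketch below, `Router`, `on_mail_fetched`, `on_worker_done` and the `dispatch` callback are all hypothetical names; the callback stands in for "send a message to that processing actor", and the file names are placeholders. The two handlers are assumed to run on the router's own thread, which is why there is no locking.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical routing state. In an actor setup both handlers would be invoked
// from the router's message loop, so only one thread ever touches this object.
class Router {
public:
    // dispatch(worker, path) stands in for sending a message to a processing actor.
    typedef std::function<void(std::size_t, const std::string&)> Dispatch;

    explicit Router(Dispatch dispatch)
        : dispatch_(std::move(dispatch)), worker_count_(0) {}

    // Message from a fetching actor: a downloaded mail file is ready.
    void on_mail_fetched(const std::string& path) {
        std::size_t worker;
        if (!idle_.empty()) {
            worker = idle_.back();     // reuse a processing actor that reported back
            idle_.pop_back();
        } else {
            worker = worker_count_++;  // all busy: "spin up" another one
        }
        dispatch_(worker, path);
    }

    // Message from a processing actor: it has finished and can take more work.
    void on_worker_done(std::size_t worker) { idle_.push_back(worker); }

    std::size_t worker_count() const { return worker_count_; }

private:
    Dispatch dispatch_;
    std::vector<std::size_t> idle_;
    std::size_t worker_count_;
};

// Routes three mails and returns the worker index chosen for each one.
std::vector<std::size_t> demo_routing() {
    std::vector<std::size_t> assigned;
    Router router([&assigned](std::size_t worker, const std::string&) {
        assigned.push_back(worker);
    });
    router.on_mail_fetched("mail-a.eml");  // no idle worker yet: creates worker 0
    router.on_mail_fetched("mail-b.eml");  // worker 0 is busy: creates worker 1
    router.on_worker_done(0);              // worker 0 reports back
    router.on_mail_fetched("mail-c.eml");  // reuses worker 0
    assert(router.worker_count() == 2);
    return assigned;
}
```

As written, the number of processing actors grows without limit whenever everyone is busy. On an 8-core machine you would probably want to cap it and hold mails in a pending queue instead, but that is a design choice beyond what is described above.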
I’m betting by this point there’s a library for actors for C++ [UPDATE: See comment by @rwong below]. If all else fails you could try Erlang 😉
I’m not sure how the C++ threading libraries map threads to cores (a single core or several), but if threads don’t do it for you, you could take the same concept and use discrete processes instead of threads, with some sort of message-passing framework for the communication.
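On the threads-versus-cores question: in the mainstream implementations, including the one shipped with VS2012, a `std::thread` is an ordinary operating system thread, and the OS scheduler is free to run such threads on different cores. A small sketch of asking the standard library how many hardware threads it sees; `default_worker_count` is a made-up helper name, and falling back to 1 is an assumption.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: how many worker threads to start by default.
unsigned default_worker_count() {
    // hardware_concurrency() is only a hint and may return 0 when unknown.
    unsigned cores = std::thread::hardware_concurrency();
    unsigned count = cores != 0 ? cores : 1;  // assumption: fall back to one worker
    assert(count >= 1);
    return count;
}
```

The result could be used as the upper bound on processing actors, leaving the fetching actors out of the count since they spend most of their time waiting on the network.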
Edit: I’m betting you’ll have a bottleneck at the network, though, so it might not even make sense to want to occupy all the cores at once (unless the processing takes a loooong time).
Edit: Expanded answer and corrected terminology (Agent -> Actor)