I got the following question (originally about RavenDB, but I generalized it a bit):
I'm currently working on an open source project where I need background processing. The main scenarios are:
- Processing data from a queue of incoming messages, like processing incoming mail that's put in a queue.
- Processing data from a lot of different web services.
I've worked with scheduling frameworks like Quartz.NET before to schedule processing, but in this case I'm looking at much bigger amounts of processing. It would be nice to add more workers depending on the load, like RavenDB does.
I think my main question is what's your experience when building background workers? What should I think about? Is there any framework that can help me?
The first thing to understand is that for data processing, actually implementing queuing is going to be a losing proposition. By far the largest cost for most data processing tasks is IO, and the best way to reduce that cost is batching. Queues don’t really work for this scenario because they make it hard to process a batch of changes in one shot. Queues are natural for “pull from queue, process, move to next message”, which isn’t good when you are processing large amounts of information.
The way this is implemented in RavenDB is that I have ensured that there is a cheap way to query by “last updated timestamp”. That means I am able to issue queries such as:
Give me the next batch of updated documents since update point 121.
Those queries are very cheap (they are fully indexed queries at the storage level).
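To make the shape of such a query concrete, here is a minimal sketch in Python. It is not RavenDB's storage code: `ChangeLog`, `put` and `next_batch` are hypothetical names, and the sketch assumes that every write is stamped with a monotonically increasing integer update point. Keeping those points in sorted order is what makes "everything after point N" a cheap lookup.

```python
import bisect


class ChangeLog:
    """Toy in-memory store (hypothetical): each write gets the next update point."""

    def __init__(self):
        self._points = []     # sorted update points, one per document
        self._ids = []        # document id stored at the same position
        self._docs = {}       # doc_id -> (update_point, document)
        self._last_point = 0

    def put(self, doc_id, document):
        # Remove the previous entry so a document appears only once,
        # at its latest update point.
        if doc_id in self._docs:
            old_point, _ = self._docs[doc_id]
            i = bisect.bisect_left(self._points, old_point)
            del self._points[i]
            del self._ids[i]
        self._last_point += 1
        self._points.append(self._last_point)
        self._ids.append(doc_id)
        self._docs[doc_id] = (self._last_point, document)
        return self._last_point

    def next_batch(self, since, limit=100):
        """Return up to `limit` (update_point, doc_id, document) tuples
        whose update point is strictly greater than `since`."""
        start = bisect.bisect_right(self._points, since)
        end = start + limit
        return [
            (point, doc_id, self._docs[doc_id][1])
            for point, doc_id in zip(self._points[start:end], self._ids[start:end])
        ]
```

With this in place, "give me the next batch of updated documents since update point 121" would be `log.next_batch(since=121)`, and the cost depends on the size of the batch rather than on the size of the store.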
Following that, each data processing task merely needs to keep track of the last update point that it processed. Things get a little more complex when you assume that there can be periods of time where no activity happens, since you want to avoid polling in that scenario.
With RavenDB, if a processing task doesn’t find anything to process, it goes to sleep. We made this work by raising a notification whenever the database changes, which wakes the waiting tasks. This approach allows us to process data efficiently without waiting for scheduled tasks (which result in update delays), without polling (which consumes additional resources) and without complex logic (scheduling, determining what changed, queues, etc).
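The sleep-and-wake loop can be sketched roughly as follows. This is an illustration in Python and not RavenDB's implementation: `NotifyingStore`, `wait_for_changes` and `run_task` are made-up names, and the update point is simplified to a position in an append-only list. The task keeps only its last processed point, and a condition variable plays the role of the "database changed" notification.

```python
import threading


class NotifyingStore:
    """Toy append-only store (hypothetical) that wakes sleeping tasks on every write."""

    def __init__(self):
        self._changes = []                   # change at index i has update point i + 1
        self._cond = threading.Condition()
        self._closed = False

    def put(self, document):
        with self._cond:
            self._changes.append(document)
            self._cond.notify_all()          # the "database changed" notification

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()          # let sleeping tasks exit

    def next_batch(self, since, limit):
        with self._cond:
            return self._changes[since:since + limit]

    def wait_for_changes(self, since):
        """Sleep until something exists past `since` or the store is closed.
        Returns False only when the store is closed and fully drained."""
        with self._cond:
            self._cond.wait_for(lambda: len(self._changes) > since or self._closed)
            return len(self._changes) > since


def run_task(store, process_batch, batch_size=100):
    last_processed = 0                       # the only state the task tracks
    while store.wait_for_changes(last_processed):
        batch = store.next_batch(last_processed, batch_size)
        process_batch(batch)                 # one call per batch, not per message
        last_processed += len(batch)
    return last_processed
```

A task would typically run on its own thread, for example `threading.Thread(target=run_task, args=(store, handler)).start()`. While the store is idle the thread is blocked inside `wait_for`, so there is no timer and no polling loop.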
I find this to be quite an elegant solution.