Sometimes you just don’t want or need the complexity of a fancy distributed transaction manager or complicated RPCs to coordinate a concurrent job amongst a farm of servers.
Here is a very simple and efficient way to achieve a lock/release type of concurrency control in a farm-like architecture using nothing but a few SQL sentences and your relational database.
Imagine the following scenario: A web app that runs on multiple servers/instances where users activity is balanced. All servers are clones, and you want to keep it that way so adding/removing servers from the farm remains easy and safe; and you need to run some background scheduled job, not triggered by user activities.
To keep things simple (operations wise), you want to keep those batch jobs inside your app and you don’t want/can’t afford to integrate a fancy multi-server transaction manager.
For example, you want to implement a background job that periodically checks the price of stocks and sends an email to all subscribed users. Since all the servers in your farm are clones, this job will be running on all of them, and that’s a good thing, because if one fails any of the other can pick it up. But on the other hand you don’t want to send 25 emails to each user every time, if you have 25 servers in your farm.
So our goal is to ensure only one server performs the job, but if it doesn’t, any other can realize it and do it himself.
The solution is really simple, we define a table in our database with the following structure:
CREATE TABLE job_status ( job_id varchar(254) NOT NULL, running_at varchar(254) NULL, last_run_started datetime NULL, next_run datetime NULL, PRIMARY KEY(job_id) )
And pre-insert one row for each job (only one in our example, but it’s nice to have support for several from scratch, right?) with only the job_id field set (all other fields can start as NULL).
Assuming we code our batch job as an endless loop with a pause between runs, when each cycle starts we should execute the following statement:
UPDATE job_status SET running_at=<NODE_IP_OR_ADDRESS>, last_run=<CURRENT_DATETIME>, next_run=<CALCULATED_DATETIME_NEXT_RUN> WHERE job_id=<JOB_IDENTIFICATION> AND ( (running_at IS NULL AND next_run <= <CURRENT_DATETTIME>) OR (running_at IS NOT NULL AND last_run_started <= <CALCULATED_TIMEOUT>) )
- <NODE_IP_OR_ADDRESS>: Identification of the node in the farm where the job is running
- <CURRENT_DATETIME>: Current system date and time
- <CALCULATED_DATETIME_NEXT_RUN>: The date and time when the job should run again assuming this run finishes correctly.
- <JOB_IDENTIFICATION>: A string that uniquely indentifies the job
- <CALCULATED_TIMEOUT>: A calculated date and time (usually the current system time minus a pre-defined value) upon which the job is considered “dead” and should be picked up by another node.
In most relational DBMS you get the number of updated rows for each UPDATE command as a result of the UPDATE itself (or you can embed a second query in the same database transaction that returns it). If the number of updated rows was 1, the executing node should go ahead and execute the job now, if the number was 0 it should go back to sleep until the next iteration.
The UPDATE “locks” the job (setting the running_at field to the executing node address) only if the job was due to be executed and no other node had started it, or if the job was in fact being executed by another node, but so long ago it was considered timed out. Combine this with the ACID guarantees of a RDBMS and we got ourselves a multi-node semaphore.
To wrap things up, if the current node actually got to execute the job, it should finish the iteration by releasing the lock with:
UPDATE job_status SET running_at=NULL WHERE job_id=<JOB_IDENTIFICATION>
Nice, isn’t it? No messy RPC, no fancy transaction manager and our app has scheduled jobs with farm support.