Originally Written by Bill Kayser
About the simplest thing you can do with a relational database is tell it to delete some data:
delete from accounts where accounts.cancelled = true
You don’t have to be a DBA to figure out what that does. If only everything was that simple!
If you’re using MySQL, as you begin collecting more and more data the challenges of deleting data slowly unfold. Delete operations can be very expensive and cause contention. Deleting too many records at a time can impact other sessions trying to modify the table.
You can sometimes work around this by deleting in chunks:
repeat until done: delete from accounts where accounts.cancelled = true limit 500
But this approach can lead to re-scanning rows unnecessarily and locking rows in long running transactions. And it only gets more complicated from there:
- How do you monitor progress and tune your script?
- What if you are moving data?
- How do you avoid introducing replication lag when deleting rows at a high rate?
The most common patterns we encounter for purging data are deleting by timestamps and deleting orphan data. There’s an excellent utility in the free Percona Toolkit called pt-archiver that provides many features to help you write purge scripts efficiently and with no impact on your application in production. In particular, it allows you to explicitly identify the index to traverse when doing deletes.
Here’s a simple example deleting rows of the errors table that are older than one week and not flagged for archive:
pt-archiver \ --source u=dbuser,D=production,t=errors,i=index_on_timestamp \ --purge \ --limit 1000 \ --commit-each \ --bulk-delete \ --where "timestamp < now() - INTERVAL 1 WEEK AND archive = 0"
This specifies that rows should be deleted 1,000 at a time, descending the timestamp index, ensuring that each iteration starts scanning where the last row was deleted. It does this by first querying for the error IDs to purge, and then deleting them separately. In this case that avoids re-scanning rows where archive is not 0.
Building the cleansweep Ruby gem
While pt-archiver works great for purges with single table scans, New Relic engineers weren’t able to use it for more complex operations deleting orphan data using joins instead of subqueries. We also missed all the tools for scripting, scheduling, monitoring, and notification we use for our Ruby on Rails tasks.
I decided to build a Ruby gem—called cleansweep—to give us many of pt-archiver’s benefits while also leveraging features of ActiveRecord that simplified the construction of purge jobs, allowed us to easily handle purging orphan data, and was more easily integrated into our existing Ruby tools.
Using the cleansweep gem, you specify a purge in Ruby like this:
copier = CleanSweep::PurgeRunner.new \ model: Error, index: 'index_on_timestamp' do | model | model.where(timestamp < ? and archive = ?', 1.week.ago, false) end copier.execute_in_batches
This creates a PurgeRunner instance and yields to an Arel scope you can use to specify the where clause. It then iterates through the table deleting in chunks, traversing the timestamp index.
We delete orphan rows by joining with tables that reference the data we want to delete:
copier = CleanSweep::PurgeRunner.new model: Error do | model | model.joins('left join applications app on errors.app_id = app.id') .where('app.id is null’) end
In this version I don’t specify an index. By default, the cleansweep gem will use the primary key or the first unique index, and traverse that index in one direction. It does this in chunks, first querying for the rows and then deleting them. You can preview the exact queries by specifying:
You can use this output to examine query plans and tune the criteria:
Initial Query: SELECT `errors`.`id` FROM `errors` FORCE INDEX(primary) left join applications on errors.app_id = app_id WHERE (app.id IS NULL) ORDER BY `errors`.`id` ASC LIMIT 500 Chunk Query: SELECT `errors`.`id` FROM `errors` FORCE INDEX(primary) left join applications on errors.app_id = app_id WHERE (app.id IS NULL) AND (`errors`.`id` > 1001) ORDER BY `errors`.`id` ASC LIMIT 500 Delete Statement: DELETE FROM `errors` WHERE (`errors`.`id` in ( 1, 3, 4, … 1001))
Being able to specify the scope becomes especially helpful as the criteria get more complex. For instance, perhaps you want to delete by a timestamp, but only on certain accounts. Or maybe you want to delete orphan data, but you only need to look at a certain type of error.
More bells and whistles
The cleansweep gem also offers a number of other bells and whistles:
- Option to throttle rate of deletes by sleeping between chunks
- Built-in custom instrumentation for monitoring with New Relic
- Option for printing out progress statistics at an interval you specify
- Ability to copy rows into another table instead of deleting them
- Ability to purge rows in one table using IDs in another table. At New Relic, we use this to purge satellite tables by building a temporary table of IDs and creating cleansweep instances for each table we need to purge that references the IDs in the temp table
- Ability to suspend when the replication lag exceeds a certain time threshold
- Ability to suspend when the history list size exceeds a certain threshold
- Ability to traverse an index in reverse order, or not traverse at all
- Accept a logger instance to use for logging. We use this to pass in a facade to our remote logger
This only scratches the surface of all the things the cleansweep gem can do. You can find details, documentation, and more examples at http://bkayser.github.com/cleansweep.