
Writing MySQL Purge Jobs in Ruby with the Cleansweep Gem


Originally Written by Bill Kayser

About the simplest thing you can do with a relational database is tell it to delete some data:

delete from accounts where accounts.cancelled = true

You don’t have to be a DBA to figure out what that does. If only everything were that simple!

If you’re using MySQL, the challenges of deleting data slowly unfold as you collect more and more of it. Delete operations can be very expensive and cause contention, and deleting too many records at a time can impact other sessions trying to modify the table.

You can sometimes work around this by deleting in chunks:

repeat until done:
delete from accounts where accounts.cancelled = true limit 500
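The loop above can be sketched in plain Ruby. This is a hedged simulation, not a real purge script: an in-memory array stands in for the accounts table, and `purge_in_chunks` is a hypothetical helper invented for illustration, but the control flow matches what a script issuing `DELETE ... LIMIT 500` statements would do, stopping once a round deletes fewer rows than the limit.

```ruby
CHUNK_SIZE = 500

# Simulates "repeat until done: DELETE ... WHERE cancelled = true LIMIT 500"
# against an in-memory array of row hashes. Returns the number of rows purged.
def purge_in_chunks(rows)
  total = 0
  loop do
    # One "DELETE ... LIMIT 500" round trip: find up to CHUNK_SIZE matches...
    chunk = rows.each_index.select { |i| rows[i][:cancelled] }.first(CHUNK_SIZE)
    # ...and remove them (highest index first so earlier indices stay valid).
    chunk.reverse_each { |i| rows.delete_at(i) }
    total += chunk.size
    # Fewer rows deleted than the limit means the table is clean: stop.
    break if chunk.size < CHUNK_SIZE
  end
  total
end

accounts = 1200.times.map { |i| { id: i, cancelled: i.odd? } }
puts purge_in_chunks(accounts)  # => 600
```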

But this approach can lead to re-scanning rows unnecessarily and to locking rows in long-running transactions. And it only gets more complicated from there:

  • How do you monitor progress and tune your script?
  • What if you are moving data?
  • How do you avoid introducing replication lag when deleting rows at a high rate?

The most common patterns we encounter for purging data are deleting by timestamps and deleting orphan data. There’s an excellent utility in the free Percona Toolkit called pt-archiver that provides many features to help you write purge scripts efficiently and with no impact on your application in production. In particular, it allows you to explicitly identify the index to traverse when doing deletes.

Here’s a simple example deleting rows of the errors table that are older than one week and not flagged for archive:

pt-archiver \
--source u=dbuser,D=production,t=errors,i=index_on_timestamp \
--purge \
--limit 1000 \
--commit-each \
--bulk-delete \
--where "timestamp < now() - INTERVAL 1 WEEK AND archive = 0"

This specifies that rows should be deleted 1,000 at a time, descending the timestamp index, ensuring that each iteration starts scanning where the last row was deleted. It does this by first querying for the error IDs to purge, and then deleting them separately. In this case that avoids re-scanning rows where archive is not 0.

Building the cleansweep Ruby gem

While pt-archiver works great for purges with single-table scans, New Relic engineers weren’t able to use it for more complex operations, such as deleting orphan data using joins instead of subqueries. We also missed the tools for scripting, scheduling, monitoring, and notification that we use for our Ruby on Rails tasks.

I decided to build a Ruby gem, called cleansweep, to give us many of pt-archiver’s benefits while also leveraging ActiveRecord features that simplify the construction of purge jobs, make it easy to purge orphan data, and integrate smoothly with our existing Ruby tools.

Using the cleansweep gem, you specify a purge in Ruby like this:

copier = CleanSweep::PurgeRunner.new model: Error,
    index: 'index_on_timestamp' do | model |
  model.where('timestamp < ? and archive = ?', 1.week.ago, false)
end

This creates a PurgeRunner instance and yields the model to the block, where you build an ActiveRecord scope specifying the where clause. It then iterates through the table, deleting in chunks while traversing the timestamp index.

We delete orphan rows by joining with tables that reference the data we want to delete:

copier = CleanSweep::PurgeRunner.new model: Error do | model |
  model.joins('left join applications app on errors.app_id = app.id')
       .where('app.id is null')
end

In this version I don’t specify an index. By default, the cleansweep gem will use the primary key or the first unique index, and traverse that index in one direction. It does this in chunks, first querying for the rows and then deleting them. You can preview the exact queries by specifying:

puts copier.print_queries

You can use this output to examine query plans and tune the criteria:

Initial Query:
SELECT  `errors`.`id`
FROM `errors` FORCE INDEX(primary)
left join applications app on errors.app_id = app.id
WHERE (app.id IS NULL)
ORDER BY `errors`.`id` ASC
Chunk Query:
SELECT  `errors`.`id`
FROM `errors` FORCE INDEX(primary)
left join applications app on errors.app_id = app.id
WHERE (app.id IS NULL) AND (`errors`.`id` > 1001)
ORDER BY `errors`.`id` ASC
Delete Statement:
DELETE FROM `errors`
WHERE (`errors`.`id` in ( 1, 3, 4, … 1001))
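The pattern in these queries can be modeled in plain Ruby to show why rows are never re-scanned. This is a hedged, database-free simulation: the array and the `purge_orphans` helper are invented for illustration, but each round mirrors one chunk query (ids past the last one seen, in index order) followed by one delete statement on exactly those ids.

```ruby
CHUNK = 3

# Simulates chunked index traversal: select the next ids strictly greater
# than the last id seen, delete them by id, repeat until nothing matches.
def purge_orphans(rows)
  last_id = -1
  deleted = []
  loop do
    # "Chunk Query": next matching ids past the high-water mark, in id order.
    ids = rows.select { |r| r[:app_id].nil? && r[:id] > last_id }
              .map { |r| r[:id] }
              .first(CHUNK)
    break if ids.empty?
    last_id = ids.last
    # "Delete Statement": DELETE FROM errors WHERE id IN (ids).
    rows.reject! { |r| ids.include?(r[:id]) }
    deleted.concat(ids)
  end
  deleted
end

errors = [
  { id: 1, app_id: nil }, { id: 2, app_id: 7 }, { id: 3, app_id: nil },
  { id: 4, app_id: nil }, { id: 5, app_id: 7 }, { id: 6, app_id: nil },
]
p purge_orphans(errors)  # => [1, 3, 4, 6]
```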

Being able to specify the scope becomes especially helpful as the criteria get more complex. For instance, perhaps you want to delete by a timestamp, but only on certain accounts. Or maybe you want to delete orphan data, but you only need to look at a certain type of error.

More bells and whistles

The cleansweep gem also offers a number of other bells and whistles:

  • Option to throttle rate of deletes by sleeping between chunks
  • Built-in custom instrumentation for monitoring with New Relic
  • Option for printing out progress statistics at an interval you specify
  • Ability to copy rows into another table instead of deleting them
  • Ability to purge rows in one table using IDs in another table. At New Relic, we use this to purge satellite tables by building a temporary table of IDs and creating cleansweep instances for each table we need to purge that references the IDs in the temp table
  • Ability to suspend when the replication lag exceeds a certain time threshold
  • Ability to suspend when the history list size exceeds a certain threshold
  • Ability to traverse an index in reverse order, or not traverse at all
  • Ability to accept a logger instance for logging. We use this to pass in a facade to our remote logger
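Several of these options can be combined on a single runner. The sketch below is illustrative only: the option names follow the gem’s README at the time of writing, so verify them against the current documentation before relying on them.

    # Illustrative sketch -- check the cleansweep README for the current API.
    copier = CleanSweep::PurgeRunner.new model: Error,
        chunk_size: 2_000,      # rows per chunk
        sleep: 0.5,             # throttle: seconds to pause between chunks
        report: 10.minutes,     # print progress statistics at this interval
        max_repl_lag: 10,       # suspend while replication lag exceeds 10s
        max_history: 100_000,   # suspend while the history list is too long
        reverse: true,          # traverse the index in reverse order
        logger: Rails.logger    # any Logger-compatible instance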

This only scratches the surface of all the things the cleansweep gem can do. You can find details, documentation, and more examples at http://bkayser.github.com/cleansweep.



Published at DZone with permission of Fredric Paul, DZone MVB. See the original article here.

