Questions to Resolve While Writing Data Synchronization Code
My home laptop fell a few days ago and the power button stopped working until I got new parts shipped and replaced. Everything went back to normal or so I thought until I opened Google Chrome on another system I was using, and found that the bookmarks I had added in the last few days had completely disappeared. For some reason, the synching in Google Chrome decided to delete the bookmarks.
It is always fun to criticize companies for the business decisions they make (is Google+ the new Orkut or is it the new Buzz?), but when it comes to engineering, you always want to give a company like Google the benefit of the doubt. Maybe I missed something or had deleted everything accidentally before the synching happened.
But it got me thinking about automatic synchronization and how in many respects, it is a difficult challenge. Not only from the engineering aspect, but also with respect to requirements. As I shall explain below, doing automatic synchronization requires you to make many assumptions about the data and the user that may not be fully warranted. Some synching would be better served by not going to full automation, but allow the user to resolve conflicts. Also a commitment to “do no harm” would be helpful.
Let us start with the example of a computer “A” synching with another computer “B” and the information being synchronized is a (name, phone) pair. The easiest case is if all editing is done on one computer, say A. So if any additions, updates or deletions are done on A, they can be synced to B. But what if editing can be done on both sides? Something like adding a record and then synching it seems simple:
If you add a record on A, on synching, it will add it to B.
If you add a record on B, on synching, it will add it to A.
But what if you added (John Smith, (111) 222-3333) on both A and B? If you just follow the above rules, you will have a duplicate. So you not only need to just check for a new record, but also compare field by field if you want to have a clean database. For a <name, phone> record, this doesn’t seem like a big deal. But if you had a large address book with several fields, having duplicates could result in one record of the duplicates having some important information and other missing it or having something else.
Modifying the example, suppose you had two people enter the exact same customer order on A and B. Going by the previous paragraph, that would be a duplicate order and should be merged into a single order. But wait, maybe it is not a duplicate! Maybe they are two separate orders of the same item for the same customer. So you cannot know whether something is a duplicate by simply doing a field-by-field comparison of the record, because they will be identical in every respect. You may need to flag the extra record as a potential duplicate and need a manual intervention by someone in your sales department. Or maybe, someone at the customer’s purchasing department gets an email to confirm that it is not a duplicate. This is not simply a theoretical exercise: Doing this saves you and the customer unnecessary shipping expense and valuable time.
What about modification of the same record in different systems? A naive implementation could look at which record was updated last and then replace the other. Something better would be to merge the records and then use last updated time for fields that have changes on both sides. But here we are making assumptions about what the user intended to do. You can only get a true answer if you show the conflict, explain what you are doing to do and then get a confirmation. Now, it is entirely possible that in real life, this is not a big deal and users do not want the hassle of dealing with conflicts. They may just decide that some data loss is acceptable and the time needed to sort things out is not worth it. That is fine, but once again, we are reverting to the user as the ultimate arbitrator in how things should operate.
Now to delete, which is what I started this post with. One possible engineering problem has to do with absolute deletes. By absolute delete, I mean that you delete something (instead of marking it as Delete) and it is truly wiped off from your database and cannot be retrieved through normal programming means (i.e., other than restoring a backup). Now, when you have “A” containing a record and “B” missing that record, you can come up with two stories:
- The record was added in A and therefore should be added to B.
- (OR) The record existed in B and was deleted. So, it should be deleted on A too.
How would you know which is correct? The only way is if there was some way to track what happens to records in B. But if you are doing an absolute delete of records, B will contain no trace of a deleted record. So you come up with something roundabout like this:
It looks like we did a sync between A and B on Tuesday. Then on Wednesday, we found a record that was created on Monday on A, but does not exist on B. Therefore, the record was copied to B on Tuesday and was then deleted. So let us delete it on A.
Problem is, you have to be 100% sure that your sync actually worked on Tuesday. That it did copy the record. In other words, you have to verify the information of the record being copied to B. So you need extra metadata somewhere to track that.
The Google Chrome example that I mentioned above falls into a different class of synchronization. It is not about A and B synching with each other. It is about A and B synching with the cloud separately. So the way it seemed to have worked in my case was:
Tuesday: Synched with A
Wednesday: B trying to sync. B is missing some records. Synch => records deleted on server.
Thursday: A trying to sync. What is it doing with those deleted records? Synch => records deleted on A [Wrong!].
My big assumption is that Chrome knows that a person can have an account on more than one computer and is not synching to and from one computer to the cloud. If that is the case, it may be the following disastrous story (I am speculating here).
Tuesday: John Smith synched.
Wednesday: John trying to sync with many records missing. Synch => records deleted on server.
Thursday: John trying to sync, but records that were deleted yesterday still in the same folder. Synch => records deleted on John’s system.
Here on the server, you are tracking the records that were deleted. So if they were deleted, they should stay deleted(!) With bookmarks, the problem is probably exacerbated because for all I know, only the name, URL and location is saved and the time of creation or modification is not stored. (This could be answered by looking at the internals of Chrome, I suppose.)
But even the time of creation cannot completely save you! Let’s say that for whatever reason, the time on your computer gets changed to a time that was *before* the last synched time. If you add new records and then sync, the sync program will look at the modified time on the record and think that they were created before the last sync. The sync should have taken care of those, and if something was deleted on the server, then it should be deleted locally. So you might need a metadata field like “modified-since-last-sync”.
What can you do? The “simple and safest” solution is never delete data, allow duplicates to be created and get the user to step in to resolve conflicts. That does avoid the problem of data loss, but introduces too much friction, especially the fact that you can never delete any data as it will always get synched back!
A “less safe, but more efficient” is do it all, but allow the user to roll back deletions. This requires that absolute deletions be banned. Also requires a good interface to notify users what happened in the last sync and be helpful enough to non-technical users.
A more engineering-heavy solution is to have lots of metadata around the synching process. The idea is that you need to track the correlation between a record on the server and the same record on different computers. When activities happen on a record, you need to be able to track it with respect to the last sync. If time was constant and reliable, we could have used it to resolve conflicts. But without that, we need something else that can give an indication of “relative time”, i.e., the order in which things happened. You cannot have absolute deletions because you need to track that record too. You also need an identifier for the sync itself. This has to be also device-time-independent and know which device did the sync. This has to be robust in case your sync fails in the midst of an operation.
I, of course, do not mean this article to suggest that this is an unsolved problem. Dropbox seems to have done it right. But the most important thing, in my opinion, that Dropbox gets right is this:
Dropbox keeps a one-month history of your work. Any changes can be undone, and files can be undeleted.
This is excellent. Users make mistakes. However good your engineering is, it is no match for what users do. You need to give them a rollback option.