How It Works

So how does this database synchronization work? Here’s the scheme, which I made up myself, so it may be completely beyond the pale. We shall find out.

The first piece is a table called sync_hosts that stores the ids of other hosts, the time they were last synced, and whether or not a sync is in progress.

Then each table indicated in the config file has two additional tables created to store metadata. These tables are suffixed with _map and _sync, so if the table is called “table”, then the two additional tables are called “table_map” and “table_sync”. In this initial version, it will be left to the user to ensure that table name conflicts are avoided.

The sync table

The sync table contains the primary keys of the original table, plus three additional columns: “created” (timestamp), “modified” (timestamp), and “deleted” (tinyint(1)). These columns contain, as one might suspect, the creation time, modification time, and whether or not the row has been deleted.

The map table

The map table contains two copies of the primary keys, plus a remote host id. This table allows one to look up what the primary keys of a row are on a remote host. This is necessary because each host will be creating different rows at different times, so the primary keys will only be the same by coincidence.

How it all works

The basic scheme is this. We install triggers on all the tables such that whenever a row is created, modified, or deleted in the original table, a corresponding row is created or updated in the sync table.

Then, when it is time to sync, we get all the rows from the sync table that have been created or updated since the last sync time. We take these rows over to the remote host, look them up in the map table, and make the updates (or create the rows if they aren’t found in the map table). We then do the entire operation in reverse, so that all of the updates from the remote host are brought to the local host as well. When we are doing our updates, we check the sync table of the destination host so that we don’t write over any changes that are newer than ours.

That’s about it! Things will surely get more complicated as we get deeper into this, and we may have to make more rules to make it work. Or it may not work at all! Who knows?

dbsync

Time for a new project! This one is an old one, I wrote it in Java a long time ago… maybe 2009 I would say. I never quite got it working, and I think part of the problem was that I had no automated testing infrastructure.

Automated testing is something that has been elusive in my career. People like to talk about it, particularly during interviews, but very few people actually do it. I’ve only worked at one company that was really serious about it. Other places I’ve tried to introduce it myself, only to be told “we don’t have time for that”. It seems to me that if you don’t have time to do it, then you certainly don’t have time not to do it. Because that will end up taking more time in the long run. But we seem to get by without it somehow.

This project is a perfect candidate, though. It is complex enough that bugs will become intractable without some kind of testing infrastructure, and it lends itself well to verification. The basic premise of the project is to synchronize two relational databases. To simplify this task there are a few rules:

  • The databases are expected to have the same table structure
  • Synchronization is limited to a configured list of tables
  • There are no circular foreign key references
  • Synchronization follows a “last update wins” model

I am sure other rules will be added as things get more complicated, but that’s a start. The reason it lends itself well to testing is because we can set up our test databases with a script; synchronize them; and then dump the results to a text file that can be compared to an expected result. We may need to refine this approach, but as with our list of rules, it is a good start. Follow the progress here and on github.