Monday, January 15, 2007

TOD

The rails development mail list at RHG is busy at any day, holidays included: trunk is broken, again ... Please stop bouncing Dev ... new build delivered to QA ... [SVN COMMIT] rhg r40941 ... IE+HTTPS+XML+NoCache=Bad ... Another topic often seen there is TOD for Tip Of [the] Day. Here is an example of such tip:

Prefer atomic migrations

There is a potential issue with any db migration that tries to do more then one thing. For example, if a migration adds three indexes and the second one fails, because a key is misspelled, the migration is stuck. When the key name is corrected, the migration still won’t pass on a DB where it was originally run (read on forty workstations of your fellow coworkers who just updated the code and run migrations ). This time it would fail on the first index creation, because it was created by the failing migration and was not rolled back due to the nature of rails migrations.

The obvious solution for such problems is to make all migrations atomic. If there is a need to add three indexes, there should be three migrations. Instead of adding a table, populating it with data, and adding an index, all within a single migration, there should be three of them, with data population wrapped in a transaction.

DB migrations with a twist

As any project using Rails, we try to stay within the existing framework boundaries as much as we can but we are not afraid to push them if we see that there is a better way for us to adopt. The way we handle db migrations in our project have been driven by these forces:

  • Dozens of developers working on the same rails source tree
  • Multiple branches with frequent merges
  • Multiple DBs defined in database.yml
  • Migrations are used to populate DBs for tests
  • No-outage policy for the site

Test DB recreation via migrations

We have accumulated hundreds of migrations over the life of the project. If we were running all of them to recreate the test DB before running tests, it would take more then an hour to finish the loading task. To bring that time down to just a couple of minutes, a process called db migrations consolidation is being used. Every time a build is deployed in production, all migrations up to the latest one in production are executed and the DB is dumped to a SQL file. A single migration, loading the generated SQL, then replaces all those migrations. The number used for the migration is the one that was already applied in production, so the new migration is never executed in production but on empty DBs like the one for tests.


Multiple DBs

We use DB partitioning for various reasons (from secure DBs with all data encrypted to read-only referential DBs). There is not much needed, from the migration point, to support multiple DBs. We have a rake task to read database.yml and to purge all DBs defined there before populating them via migrations. Another code included by migrations helps to run migration actions on a specific DB. Example:

run_on_db('secondary_db') { drop_index :x, :y }

We have multiple environments. In some we use a full set of DBs while in others it is a single DB with all entries in the config file pointing to it.

For population of some referential DBs with large quantities of data we have chosen to not use rails migrations at all but to store, deliver and deploy it via a separate process.


Many developers working on different branches

It was the biggest issue for us. At early stages name clashing of migrations was a norm. We tried to mitigate it by adding a special file (next_migration) where people could reserve the migration number they would use next. It did not work well. Some developers were forgetting to claim the number, other would do so but, due to the nature of rails migrations their migrations, would not run on other developers' machines because some new migrations were added after that, etc. Finally we gave up the standard way and created a gem which contained these migration enhancements:
  • Using timestamps for migration numbers. It reduced name clashing to none.
  • Using a migration tracking table to support migrations with numbers below the last migration run. There is no need to renumber migrations after merges anymore.
There is nothing new in this approach and some people have been using similar implementations (see François 's blog for example). It seems to work pretty good for us and the problems we had with the standard way went away since we started using the migrations gem.


Minimal or no site outages

Keeping db migrations backwards compatible comes with a cost in terms of development time and until recently we cheated on that since the site was not in the full production mode. But it all changed a couple of weeks ago when we were told that our goal should be to not bringing down the whole site for applying a new release.

From the practical point of view, it means making sure these two conditions are met:
  • The old code can work with the new data during migrations and, potentially, long after in case the new code is rolled back (we don’t roll back data if a release fails because of the code).
  • The new code code works with both new data structures and old data structures because there is no guarantee there will be no changes to the old data between the time when a db migration is applied and the new code is rolled to all nodes. For example, if a field is moved from one model to another (= a column from one table to another) the approach to just copy all data between tables and then ignore the old field is not acceptable. The reason is that at the time a new model comes live, there might be some updates to the old field. So, the right solution is to make the model intelligent enough to lookup in both places.

Our data structure retention period is one production release. I.e. an unused table column cannot be dropped in the same release when it made obsolete, but only in a release after that.

There might be rear exceptions when there would be no way to make changes backward compatible, but those are treated as special cases and go through peer->DBAs->management->QA reviews before being approved for production.

So far, we have not had a really challenged migration to meet the no-outage requirement. It would be a good team exercise to brain-storm the best solution when the time comes.