An ActiveRecord Model Backed by Two Datastores

I recently encountered a MySQL table backing an ActiveRecord model that grew so unexpectedly fast that it ran out of space. Our engineering team did what anybody would do in an emergency: moved all the existing data somewhere with more space and created a new, empty table so we wouldn’t lose new writes.

Due to the nature of this data and the code that references it, we couldn’t immediately move all queries to the new storage location. We had to keep both datasets for a while, and we needed existing references to the ActiveRecord model to keep working.

Note: Our “fix” was purely temporary. What I’m about to describe is a gross hack that got us through the crisis, and we immediately began work to properly collapse these two datasets into one with a better runway for future growth. It’s a useful example of Ruby’s flexibility, but please don’t put what follows into production unless you absolutely have to.

What we did was swap out the ActiveRecord model for a bare class that delegates the messages it receives to both MySQL datastores, presenting a unified view of two totally different tables on two different hosts.

Here’s how we did it.

Delegating everything

All of our code was referencing TheModel and expecting it to behave like an ActiveRecord model. So the first step was to hide the actual model as an inner class of a class inheriting from BasicObject (an object that has almost no methods defined). Then we created a second inner class that referenced the new dataset. Finally, we took all the existing, shared behavior from the original model and put it in a module to be included by both.

class TheModel < BasicObject
  module ExistingBehavior  # methods that used to be inside TheModel
    def self.included(model)
      model.class_eval do  # method calls that used to be in the class
                           # e.g. `belongs_to :user`
      end
    end
  end
  class New < ::ActiveRecord::Base
    establish_connection "original_datastore"  # Connect to the right db
    self.table_name = :records
    include ExistingBehavior
  end
  class Old < ::ActiveRecord::Base
    include ExistingBehavior
    establish_connection "other_datastore"  # Connect to the right db
    self.table_name = :records_other
  end
end
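
The leading :: on ::ActiveRecord::Base matters. Inside a class that inherits from BasicObject, a bare constant name is looked up only through that class and BasicObject, so top-level constants are not found unless the lookup is anchored at the root. A minimal sketch of that behavior in plain Ruby, using a hypothetical Bare class that is not part of our code:

```ruby
# Hypothetical class used only to illustrate constant lookup under BasicObject.
class Bare < BasicObject
  # A bare constant name is resolved through Bare and BasicObject only,
  # so a top-level constant such as String is not found here.
  def self.bare_lookup
    String
  rescue ::NameError => e
    e.class
  end

  # Anchoring the lookup at the root with `::` works.
  def self.rooted_lookup
    ::String
  end
end

p Bare.bare_lookup                                # the NameError class
p Bare.rooted_lookup                              # the String class
p BasicObject.instance_methods.include?(:to_s)    # BasicObject lacks even to_s
```

The same rule applies to any other top-level constant referenced inside TheModel, not just ActiveRecord.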

Reading from two places

What we need is a way to make calls like TheModel.all or TheModel.where(some_condition).all actually read from two different datastores. Proxying to just one datastore isn’t too hard, because all we need to do is delegate a few calls on the TheModel class to the inner model:

class TheModel < BasicObject
  class << self
    delegate :all,
             :where,  # This isn't a complete list
             :first,  # of creation methods, but
             :last,   # it'll do for now
             :new,
             :create,
             :create!,
             :to => :'TheModel::New'
  end
  class New < ::ActiveRecord::Base
  end
end

Now if you run TheModel.all you’re actually calling TheModel::New.all. And any chained messages you add on to the result of that call will go to the right place.

But what we really want is to have TheModel.all return TheModel::New.all + TheModel::Old.all. Let’s start with a naive approach:

class TheModel < BasicObject
  class << self                              # Keep delegating constructor methods
    delegate :new, :create, :create!,        # to the new dataset
             :to => :'TheModel::New'
  end

  def self.method_missing(method_name, *args, &block)  # all other calls get sent to
    New.send(method_name, *args, &block) +             # both models and joined
    Old.send(method_name, *args, &block)
  end
end

But that assumes every operation immediately concatenates two values. That’s not true: sometimes we want to chain successive where calls. To make this possible we need some kind of intermediate object that holds state until we’re ready to concatenate the results:

class TheModel < BasicObject
  def self.method_missing(method_name, *args, &block)
    FindInTwoPlaces.new(New, Old).send(method_name, *args, &block)
  end

  class FindInTwoPlaces
    def initialize(new_dataset, old_dataset)
      @new_dataset, @old_dataset = new_dataset, old_dataset
    end

    def all
      @new_dataset.all + @old_dataset.all
    end

    def method_missing(method_name, *args, &block)
      FindInTwoPlaces.new(                              # All calls except `all`
        @new_dataset.send(method_name, *args, &block),  # just create a new instance
        @old_dataset.send(method_name, *args, &block)
      )
    end
  end
end

This FindInTwoPlaces class is initialized with two datastores. Initially those are just the bare models representing the complete datastores. But as we chain messages, each datastore object gets further refined. If you call TheModel.where("id=1") you get a FindInTwoPlaces instance that contains TheModel::New.where("id=1") and TheModel::Old.where("id=1"). When you then call .all on that object, the two results get concatenated together.
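
To try the chaining behavior without a database, here is a rough sketch in plain Ruby. FakeRelation is a hypothetical in-memory stand-in for an ActiveRecord relation, and the sample rows are made up. The respond_to_missing? method is an addition that is not in the code above; it is there only because it is the usual companion to method_missing.

```ruby
# Hypothetical in-memory stand-in for an ActiveRecord relation.
class FakeRelation
  def initialize(rows)
    @rows = rows
  end

  # Returns a narrower FakeRelation, mimicking a chained `where`.
  def where(conditions)
    FakeRelation.new(@rows.select { |row| conditions.all? { |k, v| row[k] == v } })
  end

  # Materializes the rows as a plain array.
  def all
    @rows.dup
  end
end

class FindInTwoPlaces
  def initialize(new_dataset, old_dataset)
    @new_dataset, @old_dataset = new_dataset, old_dataset
  end

  # Terminal method: concatenate, new dataset first.
  def all
    @new_dataset.all + @old_dataset.all
  end

  # Every other message refines both sides and wraps them again.
  def method_missing(method_name, *args, &block)
    FindInTwoPlaces.new(
      @new_dataset.send(method_name, *args, &block),
      @old_dataset.send(method_name, *args, &block)
    )
  end

  # Not in the original code; added so respond_to? stays truthful.
  def respond_to_missing?(method_name, include_private = false)
    @new_dataset.respond_to?(method_name, include_private) || super
  end
end

new_rows = FakeRelation.new([{ id: 3, user: "ann" }, { id: 4, user: "bob" }])
old_rows = FakeRelation.new([{ id: 1, user: "ann" }, { id: 2, user: "cat" }])

query = FindInTwoPlaces.new(new_rows, old_rows).where(user: "ann")
p query.class   # still a FindInTwoPlaces; nothing has been concatenated yet
p query.all     # matching rows from the new dataset, then from the old one
```

Nothing is combined until a terminal method such as all is called; each intermediate call only narrows the two sides.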

So far so good, but we’re getting a lot of complexity for just the .all class method. We can do better by defining all of the final methods that we will want to use and proxying all other messages to a new intermediate object. For now let’s just implement all, count, and to_a:

class TheModel < BasicObject
  def self.method_missing(method_name, *args, &block)
    FindInTwoPlaces.new(New, Old).send(method_name, *args, &block)
  end

  class FindInTwoPlaces
    def initialize(new_dataset, old_dataset)
      @new_dataset, @old_dataset = new_dataset, old_dataset
    end

    def all
      @new_dataset.all + @old_dataset.all
    end

    def count
      @new_dataset.count + @old_dataset.count
    end

    def to_a
      @new_dataset.all + @old_dataset.all
    end

    def method_missing(method_name, *args, &block)
      FindInTwoPlaces.new(
        @new_dataset.send(method_name, *args, &block),
        @old_dataset.send(method_name, *args, &block)
      )
    end
  end
end

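Now TheModel.where("id=1").order("created_at DESC").limit(2).all works: it returns the matching records from the new dataset followed by those from the old one. Keep in mind that order and limit are applied to each dataset separately, so this can return up to four records. However, if you call .first at the end instead of .all, it fails, because method_missing wraps the two results in another FindInTwoPlaces instead of returning a record. Let’s fix that.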

class TheModel < BasicObject
  def self.method_missing(method_name, *args, &block)
    FindInTwoPlaces.new(New, Old).send(method_name, *args, &block)
  end

  class FindInTwoPlaces
    def initialize(new_dataset, old_dataset)
      @new_dataset, @old_dataset = new_dataset, old_dataset
    end

    def first
      @new_dataset.first || @old_dataset.first
    end

    def all
      @new_dataset.all + @old_dataset.all
    end

    def count
      @new_dataset.count + @old_dataset.count
    end

    def to_a
      @new_dataset.all + @old_dataset.all
    end

    def method_missing(method_name, *args, &block)
      FindInTwoPlaces.new(
        @new_dataset.send(method_name, *args, &block),
        @old_dataset.send(method_name, *args, &block)
      )
    end
  end
end

Here we’ve introduced our first application-specific decision. It may be that you want one dataset prioritized over another. For my case here I need the newer one if it exists. We’ve also introduced our first operation that isn’t concatenative. If we implement all possible final methods we’ll have to be careful to use + or || or other operators as is appropriate.
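
Ordering and limiting are a similar case. Because order and limit are forwarded to each dataset separately, concatenated results are sorted within each dataset but not across them, and a limit of n can yield up to 2n rows. If a single global order and limit matter, one option is to re-sort and truncate the combined array in Ruby. A rough sketch, with plain hashes standing in for records; the merge_sorted helper and the sample rows are hypothetical and not part of our code:

```ruby
# Hypothetical helper: combine per-dataset results into one globally
# ordered, globally limited list. Each input is assumed to have been
# limited to `limit` rows by its own query already.
def merge_sorted(new_rows, old_rows, limit:)
  (new_rows + old_rows)
    .sort_by { |row| row[:created_at] }
    .reverse          # newest first, mirroring "created_at DESC"
    .first(limit)
end

# Made-up sample data, two rows per dataset.
new_rows = [
  { id: 4, created_at: Time.utc(2014, 3, 2) },
  { id: 3, created_at: Time.utc(2014, 1, 5) }
]
old_rows = [
  { id: 2, created_at: Time.utc(2014, 2, 1) },
  { id: 1, created_at: Time.utc(2013, 12, 25) }
]

p merge_sorted(new_rows, old_rows, limit: 2).map { |row| row[:id] }
```

This works because the global top n is always contained in the union of each dataset's own top n.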

Let’s implement all of the rest of the methods that might appear as the final part of an ActiveRecord query chain. And to save on some typing I’m going to go ahead and refactor them into lists of method names grouped by operator. I’ll also be fixing a bug that exists in the above implementations by ensuring that all arguments to these methods get properly forwarded to the internal datastore objects.

class TheModel < BasicObject
  def self.method_missing(method_name, *args, &block)
    FindInTwoPlaces.new(New, Old).send(method_name, *args, &block)
  end

  class FindInTwoPlaces < ::Struct.new(:new_dataset, :old_dataset)  # Making this a struct lets us skip
    %w{all count to_a pluck}.each do |m|                            # writing an initializer and it
      define_method m do |*args, &block|                            # gives us accessors for free
        new_dataset.send(m, *args, &block) + old_dataset.send(m, *args, &block)
      end
    end

    %w{first last}.each do |m|
      define_method m do |*args, &block|
        new_dataset.send(m, *args, &block) || old_dataset.send(m, *args, &block)
      end
    end

    def empty?
      new_dataset.empty? && old_dataset.empty?
    end

    def method_missing(method_name, *args, &block)
      FindInTwoPlaces.new(
        new_dataset.send(method_name, *args, &block),
        old_dataset.send(method_name, *args, &block)
      )
    end
  end
end

You may notice I haven’t implemented the entire API of an ActiveRecord model. That’s because the full API includes a ton more methods including both .forty_two and .forty_two!.

If we needed all of those operations we’d probably have deeper problems because the objects in an application need to communicate over the narrowest API possible to keep the app simple. However, blindly passing them off to method_missing will have indeterminate results. So we should explicitly disallow their use.

Here, then, is the final version of our wacky meta-model. It presents two totally separate ActiveRecord datastores (possibly on different hosts or even using different database technology) as a single, unified datastore:

class TheModel < BasicObject
  class << self
    delegate :new, :create, :create!, :to => :'TheModel::New'
  end

  def self.method_missing(method_name, *args, &block)
    FindInTwoPlaces.new(New, Old).send(method_name, *args, &block)
  end

  module ExistingBehavior
    def self.included(model)
      model.class_eval do
      end
    end
  end
  class New < ::ActiveRecord::Base
    establish_connection "original_datastore"
    self.table_name = :records
    include ExistingBehavior
  end
  class Old < ::ActiveRecord::Base
    include ExistingBehavior
    establish_connection "other_datastore"
    self.table_name = :records_other
  end

  FindInTwoPlaces = ::Struct.new(:new_dataset, :old_dataset) do

    ## Concatenating operations
    %w{
      all
      count
      delete_all
      destroy_all
      explain
      ids
      pluck
      sum
      to_a
      update_all
    }.each do |m|
      define_method m do |*args, &block|
        new_dataset.send(m, *args, &block) + old_dataset.send(m, *args, &block)
      end
    end

    ## Selecting a value from one or the other
    %w{
      any?
      first
      include?
      last
      many?
    }.each do |m|
      define_method m do |*args, &block|
        new_dataset.send(m, *args, &block) || old_dataset.send(m, *args, &block)
      end
    end

    ## Impossible operations
    ## (first and last are handled above, so they are not listed here)
    %w{
      average
      calculate
      create
      create_with
      delete
      destroy
      exec_explain
      exists?
      fifth
      find
      find_by
      find_each
      find_in_batches
      find_or_create_by
      find_or_initialize_by
      first_or_create
      first_or_initialize
      forty_two
      fourth
      fourth!
      lock
      maximum
      minimum
      readonly
      second
      take
      third
      update
    }.each do |m|
      define_method m do |*args, &block|
        raise "Sorry, #{m} isn't possible via FindInTwoPlaces, do it by hand"
      end
    end

    def method_missing(method_name, *args, &block)
      self.class.new(
        new_dataset.send(method_name, *args, &block),
        old_dataset.send(method_name, *args, &block)
      )
    end
  end
end
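
One thing to watch for with the Struct-based version: Struct mixes in Enumerable, so names such as select, map, and reject are already defined on FindInTwoPlaces and never reach method_missing. The sketch below, in plain Ruby with a hypothetical Pair struct, shows one way to check which names are shadowed. Any that you rely on would need an explicit definition, the same way the final methods are listed above.

```ruby
# Hypothetical struct used only to show which methods Struct already defines.
Pair = Struct.new(:new_dataset, :old_dataset) do
  # Stand-in for the forwarding logic; it just reports what it received.
  def method_missing(method_name, *args, &block)
    "forwarded #{method_name}"
  end
end

pair = Pair.new([1, 2], [3, 4])

# `where` is not defined on Struct, so it falls through to method_missing.
puts pair.where("id = 1")

# `select` is already defined, so it iterates over the struct's two members
# instead of being forwarded to the datasets.
p pair.select { |member| member.include?(3) }

# Which of these query-method names would the struct intercept?
names = %i[select map reject group_by where order limit]
p names.select { |name| Pair.method_defined?(name) }
```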

Again, please don’t use this in production unless you absolutely have to.
