Friday, April 20, 2007

[PLUGIN SEED] - Data sharder

In the spirit of beating a dead horse, I figured I would stick up this very incomplete plugin that I was fooling around with about 10 months ago.

The idea revolved around splitting up really huge tables into multiple databases, all while keeping "peer" data colocated. The 'parent' records get evenly distributed among databases based on ID, and all child records live with the parent so that you can eagerly load associations via the :include directive on AR finders.

Limitations/Pitfalls: It's probably only useful for very special purpose situations where you're perfoming almost no ad-hoc queries, sorting, etc. The reason is that ad-hoc queries must be joined with software after performing the query on all databases. Clearly this creates trouble with sorted, paginated queries.

Basically throwing it out there as a conversation piece to see if anyone else is exploring the idea of splitting up massive tables.


Here's some illustrative snippets:

class Forum < ActiveRecord::Base
sharded_among :forums
has_many :topics
end



class Topic < ActiveRecord::Base
sharded_with :forum
belongs_to :forum
end



development:
adapter: mysql
database: primary_development
username: root
password:
host: localhost
forums:
- adapter: mysql
database: forum1_development
username: root
password:
host: localhost
- adapter: mysql
database: forum2_development
username: root
password:
host: localhost
- adapter: mysql
database: forum3_development
username: root
password:
host: localhost



Resultant console/mysql:


>> @forum1 = Forum.create :title => "Forum 1"
=> ..
>> @forum2 = Forum.create :title => "Forum 2"
=> ..
>> @forum3 = Forum.create :title => "Forum 3"
=> ..

>> @forum1.topics.create :title => "forum 1, topic 1"
=> ..
>> @forum1.topics.create :title => "forum 1, topic 2"
=> ..

>> @forum2.topics.create :title => "forum 2, topic 1"
=> ..
>> @forum2.topics.create :title => "forum 2, topic 2"
=> ..

>> @forum3.topics.create :title => 'Forum 3, topic 1'
=> ..
>> @forum3.topics.create :title => 'Forum 3, topic 2'
=> ..


mysql> use forum1_development;
Database changed

mysql> select * from forums;
+----+---------+
| id | title |
+----+---------+
| 3 | Forum 3 |
+----+---------+
1 row in set (0.00 sec)

mysql> select * from topics;
+----+------------------+----------+
| id | title | forum_id |
+----+------------------+----------+
| 1 | Forum 3, topic 1 | 3 |
| 2 | Forum 3, topic 2 | 3 |
+----+------------------+----------+

mysql> use forum2_development;
Database changed

mysql> select * from forums;
+----+---------+
| id | title |
+----+---------+
| 1 | Forum 1 |
| 4 | Forum 1 |
+----+---------+
2 rows in set (0.00 sec)

mysql> select * from topics;
+----+------------------+----------+
| id | title | forum_id |
+----+------------------+----------+
| 1 | forum 1, topic 1 | 4 |
| 2 | forum 1, topic 2 | 4 |
+----+------------------+----------+
2 rows in set (0.00 sec)

mysql> use forum3_development;
Database changed

mysql> select * from forums;
+----+---------+
| id | title |
+----+---------+
| 2 | Forum 2 |
| 5 | Forum 2 |
+----+---------+
2 rows in set (0.00 sec)

mysql> select * from topics;
+----+------------------+----------+
| id | title | forum_id |
+----+------------------+----------+
| 1 | forum 2, topic 1 | 5 |
| 2 | forum 2, topic 2 | 5 |
+----+------------------+----------+
2 rows in set (0.00 sec)


Can do eager joins because child objects are sharded with their parents.

epf-lap:/Users/epf/ws/triplebeam $ ruby script/console
Loading development environment.
>> f = Forum.find 3, :include => :topics
Forum Columns (0.002027) SHOW FIELDS FROM forums
Topic Columns (0.001440) SHOW FIELDS FROM topics
Forum Load Including Associations (0.001279) SELECT forums.`id` AS t0_r0, forums.`title` AS t0_r1, topics.`id` AS t1_r0, topics.`title` AS t1_r1, topics.`forum_id` AS t1_r2 FROM forums LEFT OUTER JOIN topics ON topics.forum_id = forums.id WHERE (forums.id = 3)

?> f = Forum.find 1, :include => :topics
Forum Load Including Associations (0.000430) SELECT forums.`id` AS t0_r0, forums.`title` AS t0_r1, topics.`id` AS t1_r0, topics.`title` AS t1_r1, topics.`forum_id` AS t1_r2 FROM forums LEFT OUTER JOIN topics ON topics.forum_id = forums.id WHERE (forums.id = 1)



If you want to play with it, download it here

1 comment:

Jon Frisby said...

This is something my company has been spending a lot of time thinking about, and so here are some things you might want to consider:

- It may be beneficial to allow the user of the plugin to decide what column to hash on, as a common scenario is going to be things like breaking up a user table where the primary access pattern is to look the user up by login/email/whatever.

- Giving the user control over the hashing function itself is a good thing. It would allow more complex scenarios, such as where I have M logical hash buckets, divided across N physical machines, with some lookup mechanism to map hash buckets to physical machine (this lets me add more physical machines later on by just migrating some buckets around).

- We're still looking for a good general solution to the dependant-data problem too. You can essentially generalize it down to access patterns/use-cases for a model. Partitioning or sharding data works great when you have one use case of the lookup-by-key scenario but as you noted relationships throw a monkeywrench into that.

I've considered the possibility of having "pointer records" for secondary access patterns. For example, if my primary use case is to look up a user record by the login identifier, but I also need to support looking them up by PK (id), and by email then I can create lookup tables that contain (secondary_lookup_column, primary_lookup_column), indexed and hashed by "secondary_lookup_column". So one lookup table with (id, login) and another with (email, login).

This means you have to issue two queries to find a row when using a secondary access pattern (and it doesn't address bulk fetches at all), but that may be an acceptable tradeoff if either pattern is reasonably infrequent -- given that you're getting the ability to horizontally scale your database tier.


-JF