Monday 27 September 2010

Some MySQL projects I think are cool - Spider Storage Engine

One thing that has puzzled me about MySQL Server is that it became famous for sharded scale-out deployments in well known web sites and yet has no visible support for such deployments. The MySQL killer feature for some time has been built-in asynchronous replication and gigabytes of blogs have been written about how to setup, use, debug and optimise replication, but when it comes to 'sharding' there is nothing built in. Perhaps to have attempted to implement something would have artificially constrained user's imaginations, whereas having no support at all has allowed 1,000 solutions to sprout? Perhaps there just wasn't MySQL developer bandwidth available, or perhaps it just wasn't the best use of the available time. In any case, it remains unclaimed territory to this day.

On first hearing of the Federated storage engine some years ago, I mistakenly assumed that this could be the basis of some MySQL scale-out solution. Perhaps a layer of front end 'proxy' MySQLDs could federate tables from a layer of backend MySQLDs giving some level of distribution transparency to sharded data. However as I found out, the Federated engine was not designed with such a scenario in mind. It has a certain internal elegance and simplicity, but unfortunately it is a little too simple for anything other than light duties.

The Spider storage engine extends the Federated concept of a table definition being a 'link' to a table on a remote MySQL server. However, it also integrates with the table partitioning features of MySQL 5.1, allowing each partition of a table to be specified as a 'link' to a table on a remote MySQL server. This effectively allows the built-in partitioning mechanisms of MySQLD (PARTITION BY RANGE/LIST/HASH) to be used to shard/partition rows across multiple MySQL servers transparently.

One of the major drawbacks of the Federated engine was that it had very little support for 'pushing conditions' to the MySQLD instance storing the source tables. This meant that well behaved selective queries issued on the 'front-end' MySQLD instance could result in non-selective queries being issued to the 'back-end' MySQLD instances, and large volumes of data being unnecessarily transferred back to the 'front-end' MySQLD where query processing then discarded it.

Spider attempts to improve this situation by pushing conditions down to the MySQLDs containing the source data. Combined with the partition pruning available from the MySQLD partitioning engine this should significantly reduce the amount of redundant data transferred in some cases.

So I think Spider is a pretty cool project. Like MySQL Cluster, it bears the burden of making MySQLD more data-distribution-aware and I think they're doing great work. It'd be great to hear stories about how Spider is being used, especially if anyone is using it *with* MySQL Cluster.

No comments: