Wednesday, 12 October 2011

Some MySQL projects I think are cool - Shard-Query

I've already described Justin Swanhart's Flexviews project as something I think is cool. Since then Justin appears to have been working more on Shard-Query which I also think is cool, perhaps even more so than Flexviews.

On the page linked above, Shard-Query is described using the following statements :

"Shard-Query is a distributed parallel query engine for MySQL"
"ShardQuery is a PHP class which is intended to make working with a partitioned dataset easier"
"ParallelPipelining - MPP distributed query engines runs fragments of queries in parallel, combining the results at the end. Like map/reduce except it speaks SQL directly."

The things I like from the above description :
  • Distributed
  • Parallel
  • MySQL
  • Partitioned
  • Fragments
  • Map/Reduce
  • SQL
The things that scare me :
My fear of PHP is most likely groundless, based on experiences circa 1998. I suspect it runs much of the real money-earning web, and perhaps brings my scribblings to you. However, the applicability of Shard-Query seems so general, that to actually (or apparently) limit it to web-oriented use cases seems a shame. In any case I am not hipster enough to know which language would be better - OCaml? Dart? Only joking. I suppose that if the MySQL Proxy could do something along these lines then the language debate would be moot.

I am likely to fall foul of the lack-of-original-content test if I quote too much from the Shard-Query website, but the How-it-works section seems relevant here.

How it works

  • The query is parsed using
  • A modified version of the query is executed on each shard.
  • The queries are executed in parallel using
  • The results from each shard are combined together
  • A version of the original query is then executed over the combined results
  • All aggregation is done on the slaves (pushed down)
  • Queries with inlists can be made into parallel queries.
  • A callback can be used for QueryRouting. You provide a partition column, and a callback which returns information pointing to the correct shard. The most convenient way to do this is with Shard-Key-Mapper

Query rewriting rules

The core of Shard-Query are the query rewriting rules, which Justin introduces in a blog post entitled 8 substitution rules for running SQL in parallel. These transforms and substitutions allow Shard-Query to execute a user supplied query across multiple database shards. A single query (SELECT) can be mapped into a query to be applied to some, or all shards, and further queries to be used to merge the results of the per-shard queries into a final result.

Compared to a single system query, the consistency of the view that the sharded query executes against is less well defined, but this may well be acceptable for some applications.

On a single server

A single MySQL instance offers inter-query parallelism, but currently has very limited intra-query parallelism, Shard-Query can circumvent this by splitting a single query into multiple sharded sub-queries which can run in different MySQLD threads (as they are each submitted by different clients) to give intra-query parallelism. To me this seems more of a cool side effect and proof of reasonable implementation efficiency, than a real reason to use Shard-Query. Perhaps someone out there has the perfect use case for this.

Across multiple servers

The big-name MySQL users scale-out with MySQL, storing subsets of data on separate MySQL instances. Shard-Query allows SQL queries spanning all shards to be executed. This is what scaled-out MySQL has been waiting for.

I don't think it would be a good idea to run heavy traffic through Shard-Query to access a set of sharded MySQL instances yet, but Shard-Query gives a great way to perform occasional queries across all shards. This could be great for reporting and perhaps some light mining for patterns, trends etc. The ability to query across live real time data may be a real gain.

Loose coupling and availability

Scaling out via sharding standalone database servers has many difficulties, but the independence of each shard can benefit availability relative to a more tightly coupled distributed system. The loose coupling of the MySQL instances means that it's far less likely that the failure of one shard will drag others down with it, increasing system availability. Shard-Query can give the loosely coupled shards a smoother facade. The limited set of capabilities that Shard-Query gives over the set of shards may well be more than good enough. Note that 'good enough' is a recurring theme in this 'MySQL projects I think are cool' series. Often 'best' results in expensive or unnecessary compromises as a side-effect of trying to please everybody all the time.

Yet another MySQL sharded scaleout design

Looking at Shard-Query, and the recent MySQL-NoSQL Api developments, it seems like a modern MySQL sharded scaleout design might make use of :
  • MySQL SQL Apis (PHP, JDBC, ODBC, Ruby, Python, ....)
  • NoSQL access mechanisms
    (HandlerSocket, Memcached(1,2))
  • ShardQuery for SQL reporting / analysis
Per-instance efficiency can be maximised by using the NoSQL access Apis, single-instance SQL is still available if required for the application, and a global SQL view is also available.

This combination of scalability, efficiency and SQL query-ability could be a sweet spot in the increasingly confusing multi-dimensional space of high throughput distributed databases.

1 comment:

Justin Swanhart said...

Shard-Query includes a MySQL proxy module and can be invoked through gearman by any language gearman supports. So no need to be afraid of PHP.

--Justin (Shard-Query author)