Monday, 27 September 2010

Some MySQL projects I think are cool - OpenQuery Graph Engine (OQG)

This project was announced a year or so ago by Antony Curtis who used to work for MySQL AB. Having met Antony a few times I was intrigued to see what he was up to. The quote on the OpenQuery website describes it well :
The Open Query GRAPH engine (OQGRAPH) is a computation engine allowing hierarchies and more complex graph structures to be handled in a relational fashion. In a nutshell, tree structures and friend-of-a-friend style searches can now be done using standard SQL syntax, and results joined onto other tables.

That sounds cool, and it's the first time I've heard of a MySQL 'Computation engine' plugin. Delving further into the manual gives some insight, and there's some unexpected twists there :
  • OQG is a storage engine, but data stored is not persistent w.r.t. server crashes.
  • All tables have the same schema, storing details of graph 'edges'.
  • The fixed schema has a magic column called 'latch'
  • Depending on the constant value of latch used in a SELECT statement on the table, the engine will return different 'pseudo results'.
The last fact is the coolest one. As far as I understand it :
  • SELECT where latch = NULL AND ... allows queries on the graph as though it were a list of edges (as the data was entered).
  • SELECT where latch = 0 [AND ...] allows queries on the graph as though it were a list of nodes.
  • SELECT where latch = 1 [AND ...] allows Dijkstra's shortest path algorithm to be applied to the graph
  • SELECT where latch = 2 [AND ...] allows a breadth-first search to be applied to the graph
This is a superb hack! I imagine the OQG engine internally has an in-memory graph structure which is maintained as edges are added via the INSERT Api. The SELECT Api then gives access to different views of the underlying graph and even allows complex parameterised functions to be applied to the graph, giving results as a set of rows which can be decoded into the required result. It's not pretty, but it's an extremely pragmatic approach to embedding graph access and operations within a database.

It's also undeniable that the use of magic numbers and the 'latch' column adds a certain arcane wackiness that charms this reader. It's definitely a MySQL-style solution, continuing the tradition of MyISAM, Blackhole, Federated etc where good-enough gets to market before best, and 20% of the implementation effort delivers 80% of the functionality.

Once again I'd be interested to hear about how this is actually being used, and what sort of difference it is making.

Each of these three cool projects enable new solutions individually and expand the dimensions of what is possible using MySQL. In combination they open up a vast expanse of potential. One of the best things about them is that they all happened outside the confines of MySQL / Sun / Oracle. Hopefully they will get the success they deserve so that we can have more cool new projects in future.

Some MySQL projects I think are cool - Spider Storage Engine

One thing that has puzzled me about MySQL Server is that it became famous for sharded scale-out deployments in well known web sites and yet has no visible support for such deployments. The MySQL killer feature for some time has been built-in asynchronous replication and gigabytes of blogs have been written about how to setup, use, debug and optimise replication, but when it comes to 'sharding' there is nothing built in. Perhaps to have attempted to implement something would have artificially constrained user's imaginations, whereas having no support at all has allowed 1,000 solutions to sprout? Perhaps there just wasn't MySQL developer bandwidth available, or perhaps it just wasn't the best use of the available time. In any case, it remains unclaimed territory to this day.

On first hearing of the Federated storage engine some years ago, I mistakenly assumed that this could be the basis of some MySQL scale-out solution. Perhaps a layer of front end 'proxy' MySQLDs could federate tables from a layer of backend MySQLDs giving some level of distribution transparency to sharded data. However as I found out, the Federated engine was not designed with such a scenario in mind. It has a certain internal elegance and simplicity, but unfortunately it is a little too simple for anything other than light duties.

The Spider storage engine extends the Federated concept of a table definition being a 'link' to a table on a remote MySQL server. However, it also integrates with the table partitioning features of MySQL 5.1, allowing each partition of a table to be specified as a 'link' to a table on a remote MySQL server. This effectively allows the built-in partitioning mechanisms of MySQLD (PARTITION BY RANGE/LIST/HASH) to be used to shard/partition rows across multiple MySQL servers transparently.

One of the major drawbacks of the Federated engine was that it had very little support for 'pushing conditions' to the MySQLD instance storing the source tables. This meant that well behaved selective queries issued on the 'front-end' MySQLD instance could result in non-selective queries being issued to the 'back-end' MySQLD instances, and large volumes of data being unnecessarily transferred back to the 'front-end' MySQLD where query processing then discarded it.

Spider attempts to improve this situation by pushing conditions down to the MySQLDs containing the source data. Combined with the partition pruning available from the MySQLD partitioning engine this should significantly reduce the amount of redundant data transferred in some cases.

So I think Spider is a pretty cool project. Like MySQL Cluster, it bears the burden of making MySQLD more data-distribution-aware and I think they're doing great work. It'd be great to hear stories about how Spider is being used, especially if anyone is using it *with* MySQL Cluster.

Some MySQL projects I think are cool - Flexviews

Most of the time we think of SQL queries as being executed at a point in time and generating a single definitive result, but huge efficiency gains are available when data changes are tracked and derived views are partially updated as needed rather than being fully recomputed periodically. MySQL has support for views on tables, but there is currently no support for materialized views. While thinking about this topic I decided to have another look at Justin Swanhart's Flexviews tool and it's definitely a cool MySQL based project.

Flexviews is an open source set of non-intrusive addons to MySQL enabling materialized views to be defined and maintained as the underlying tables are changed. If you're not sure what a materialized view is or why they can be useful then I recommend reading the intro on the Flexviews site. I was particularly impressed by the documented support for GROUP BY, aggregates and joins.

I have a vague recollection of reading a blog post about an early version of Flexviews which used MySQL triggers to collect data changes on underlying tables, and feeling that it was probably a little flaky. However I now read that the 'Change Data Capture' in recent versions has been factored out into a separate tool called FlexCDC which 'mines' Row-based Binlog entries. This is a far more promising approach, and useful for many other applications. From my own point of view, it makes Flexviews potentially useful for maintaining materialized views of data stored in MySQL Cluster, where Binlog is the only centralised record of change. It also got me thinking that MySQL Cluster already has code inside the Server listening to data changes for writing the Binlog, which could be extended to capture data changes into some other tables if it were a useful feature.

The fact that Flexviews is implemented without Server changes or hooks is impressive and it's a great example of a MySQL 'ecosystem' project. It would be great to read some blog entries about how people are using it and what it is doing for them.