<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-2987855187574329171</id><updated>2012-01-18T15:20:32.459Z</updated><category term='mobile'/><category term='pls'/><category term='active-active'/><category term='sos'/><category term='xacore'/><category term='distributed-systems'/><category term='nortel'/><category term='talking'/><category term='mysql'/><category term='cluster'/><category term='parallel'/><category term='design'/><category term='nosql'/><category term='gsm'/><category term='protel'/><category term='rambling'/><category term='general'/><category term='cpu-design'/><category term='replication'/><category term='message-passing'/><category term='telecoms'/><category term='dms'/><category term='latency-hiding'/><title type='text'>message passing</title><subtitle type='html'>Things that have interested me</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' 
uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>31</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-3519742339745296117</id><published>2011-12-22T17:36:00.003Z</published><updated>2011-12-23T10:47:08.580Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='active-active'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='cluster'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='nosql'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>Eventual Consistency in MySQL Cluster - implementation part 3</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s1600/image2.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:left;cursor:pointer; cursor:hand;width: 250px; height: 203px;" src="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s320/image2.gif" alt="" id="BLOGGER_PHOTO_ID_5689269172198146146" usemap="#mymap" border="0" /&gt;&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;map name="mymap"&gt;&lt;area shape="rect" coords="0,182,249,200" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html"&gt;&lt;area shape="rect" coords="0,166,249,183" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html"&gt;&lt;area shape="rect" coords="0,147,249,166" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html"&gt;&lt;area shape="rect" coords="0,127,249,147" 
href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html"&gt;&lt;area shape="rect" coords="0,109,249,127" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html"&gt;&lt;area shape="rect" coords="0,92,249,109" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html"&gt;&lt;area shape="rect" coords="0,73,249,92" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html"&gt;&lt;area shape="rect" coords="0,59,249,73" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html"&gt;&lt;area shape="rect" coords="0,37,249,59" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;&lt;area shape="rect" coords="0,0,249,37" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;&lt;/map&gt;&lt;br /&gt;As promised, this is the final post in a series looking at eventual consistency with MySQL Cluster asynchronous replication.  This time I'll describe the transaction dependency tracking used with NDB$EPOCH_TRANS and review some of the implementation properties.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Transaction based conflict handling with NDB$EPOCH_TRANS&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;NDB$EPOCH_TRANS is almost exactly the same as NDB$EPOCH, except that when a conflict is detected on a row, the whole user transaction which made the conflicting row change is marked as conflicting, along with any dependent transactions. All of these rejected row operations are then handled using inserts to an exceptions table and realignment operations. 
This helps avoid the row-shear problems described &lt;a href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Including user transaction ids in the Binlog&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Ndb Binlog &lt;a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html"&gt;epoch transactions&lt;/a&gt; contain row events from all the user transactions which committed in an epoch. However, there is no information in the Binlog indicating which user transaction caused each row event. To allow detected conflicts to 'roll back' the other rows modified in the same user transaction, the Slave applying an epoch transaction needs to know which user transaction was responsible for each of the row events in the epoch transaction. This information can now be recorded in the Binlog by using the --ndb-log-transaction-id MySQLD option. Logging Ndb user transaction ids against rows in turn requires a v2 format RBR Binlog, enabled with the --log-bin-use-v1-row-events=0 option. The &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysqlbinlog.html"&gt;mysqlbinlog&lt;/a&gt; tool, run with --verbose, can be used to see per-row transaction information in the Binlog.&lt;br /&gt;&lt;br /&gt;User transaction ids in the Binlog are useful for NDB$EPOCH_TRANS and more. One interesting possibility is to use the user transaction ids and same-row operation dependencies to &lt;a href="http://en.wikipedia.org/wiki/Topological_sorting"&gt;sort&lt;/a&gt; the row events inside an epoch into a partial order. This could enable recovery to a consistent point other than an epoch boundary. 
A project for a rainy day perhaps?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;NDB$EPOCH_TRANS multiple slave passes&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Initially, NDB$EPOCH_TRANS proceeds in the same &lt;a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html"&gt;way&lt;/a&gt; as NDB$EPOCH, attempting to apply replicated row changes, with interpreted code attached to detect conflicts. If no row conflicts are detected, the epoch transaction is committed as normal with the same minimal overhead as NDB$EPOCH. However, if a row conflict is detected, the epoch transaction is rolled back and reapplied.  This is where NDB$EPOCH_TRANS starts to diverge from NDB$EPOCH.&lt;br /&gt;&lt;br /&gt;In this second pass, the user transaction ids of rows with detected conflicts are tracked, along with any inter-transaction dependencies detectable from the Binlog. At the end of the second pass, prior to commit, the set of conflicting user transactions is combined with the user transaction dependency data to get a complete set of conflicting user transactions. The epoch transaction initiated in the second pass is then rolled back and a third pass begins.&lt;br /&gt;&lt;br /&gt;In the third pass, only row events for non-conflicting transactions are applied, though these are still applied with conflict detecting interpreted programs attached in case a further conflict has arisen since the second pass. Conflict handling for row events belonging to conflicting transactions is performed in the same &lt;a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html"&gt;way&lt;/a&gt; as NDB$EPOCH. Prior to commit, the applied row events are checked for further conflicts. If further conflicts have occurred then the epoch transaction is rolled back again and we return to the second pass. 
If no further conflicts have occurred then the epoch transaction is committed.&lt;br /&gt;&lt;br /&gt;These three passes, and associated rollbacks are only externally visible via new counters added to the MySQLD server. From an external observer's point of view, only non-conflicting transactions are committed, and all row events associated with conflicting transactions are handled as conflicts. As an optimisation, when transactional conflicts have been detected, further epochs are handled with just two passes (second and third) to improve efficiency. Once an epoch transaction with no conflicts has been applied, further epochs are initially handled with the more optimistic and efficient first pass.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Dependency tracking implementation&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To build the set of inter-transaction dependencies and conflicts, two hash tables are used. The first is a unique hashmap mapping row event tables and primary keys to transaction ids. If two events for the same table and primary key are found in a single epoch transaction then there is a dependency between those events, specifically the second event depends on the first. If the events belong to different user transactions then there is a dependency between the transactions.&lt;br /&gt;&lt;br /&gt;Transaction dependency detection hash :&lt;br /&gt;&lt;div style="text-align: center;"&gt;{Table, Primary keys} -&amp;gt; {Transaction id}&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;The second hash table is a hashmap of transaction id to an in-conflict marker and a list of dependent user transactions. When transaction dependencies are discovered using the first dependency detection hash, the second hash is modified to reflect the dependency. 
By the end of processing the epoch transaction, all dependencies detectable from the Binlog are described.&lt;br /&gt;&lt;br /&gt;Transaction dependency tracking and conflict marking hash :&lt;br /&gt;&lt;div style="text-align: center;"&gt;{Transaction id} -&amp;gt; {in_conflict, List}&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;As epoch operations are applied and row conflicts are detected, the operation's user transaction id is marked in the dependency hash as in-conflict. When marking a transaction as in-conflict, all of its dependent transactions must also be transitively marked as in-conflict. This is done by a traversal of the dependency tree of the in-conflict transaction.  Due to slave batching, the addition of new dependencies and the marking of conflicting transactions are interleaved, so adding a dependency can result in a sub-tree being marked as in-conflict.&lt;br /&gt;&lt;br /&gt;After the second pass is complete, the transaction dependency hash is used as a simple hash for looking up whether a particular transaction id is in conflict or not :&lt;br /&gt;&lt;br /&gt;Transaction in-conflict lookup hash :&lt;br /&gt;&lt;div style="text-align: center;"&gt;{Transaction id} -&amp;gt; {in_conflict}&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;This is used in the third pass to determine whether to apply each row event, or to proceed straight to conflict handling.&lt;br /&gt;&lt;br /&gt;The size of these hashes and the complexity of the dependency graph are bounded by the size of the epoch transaction.  
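&lt;br /&gt;&lt;br /&gt;As a rough illustration of the two hashes described above, here is a minimal Python sketch. It is not the MySQLD implementation: the class and method names are invented for this example, and it assumes the first hash always records the most recent writer of each row, so that a chain of writers to one row forms a chain of dependencies.&lt;br /&gt;&lt;br /&gt;

```python
from collections import defaultdict


class EpochDependencyTracker:
    """Illustrative model of the two hashes for one epoch transaction.

    Hypothetical names throughout; a sketch, not the MySQLD code.
    """

    def __init__(self):
        # {table, primary key} to {transaction id} of the last writer seen
        self.last_writer = {}
        # {transaction id} to the set of transaction ids that depend on it
        self.dependents = defaultdict(set)
        # transaction ids currently marked in-conflict
        self.in_conflict = set()

    def add_row_event(self, table, pk, trans_id):
        key = (table, pk)
        prev = self.last_writer.get(key)
        if prev is not None and prev != trans_id:
            # A later event on the same row depends on the earlier one
            self.dependents[prev].add(trans_id)
            if prev in self.in_conflict:
                # Dependencies and conflict marks arrive interleaved, so a
                # new dependency can pull a whole sub-tree into conflict
                self.mark_conflict(trans_id)
        self.last_writer[key] = trans_id

    def mark_conflict(self, trans_id):
        # Transitively mark the transaction and all of its dependents
        stack = [trans_id]
        while stack:
            t = stack.pop()
            if t in self.in_conflict:
                continue
            self.in_conflict.add(t)
            stack.extend(self.dependents.get(t, ()))

    def is_in_conflict(self, trans_id):
        # Third pass lookup: apply the row event, or handle it as a conflict
        return trans_id in self.in_conflict


tracker = EpochDependencyTracker()
tracker.add_row_event("t1", 1, "A")
tracker.add_row_event("t1", 1, "B")  # B depends on A (same row)
tracker.add_row_event("t1", 2, "B")
tracker.add_row_event("t1", 2, "C")  # C depends on B
tracker.add_row_event("t1", 3, "D")  # D touches an unrelated row
tracker.mark_conflict("A")           # row conflict detected on A
print(sorted(tracker.in_conflict))   # A, B and C; D is unaffected
```

In this example a row conflict on transaction A implicates B and C through the rows they share, while D would still be applied normally in the third pass.&lt;br /&gt;&lt;br /&gt;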
There is no need to track dependencies across the boundary of two epoch transactions, as any dependencies will be discovered via conflicts on the data committed by the first epoch transaction when attempting to apply the second epoch transaction.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Event counters&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Like the existing conflict detection functions, NDB$EPOCH_TRANS has a row-conflict detection counter called ndb_conflict_epoch_trans.&lt;br /&gt;&lt;br /&gt;Additional counters have been added which specifically track the different events associated with transactional conflict detection.  These can be seen with the usual SHOW GLOBAL STATUS LIKE &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/show-status.html"&gt;syntax&lt;/a&gt;, or via the INFORMATION_SCHEMA &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/status-table.html"&gt;tables&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;ndb_conflict_trans_row_conflict_count&lt;/span&gt;&lt;br /&gt;This is essentially the same as ndb_conflict_epoch_trans - the number of row events with conflict detected.&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;ndb_conflict_trans_row_reject_count&lt;/span&gt;&lt;br /&gt;The number of row events which were handled as in-conflict. 
It will be at least as large as ndb_conflict_trans_row_conflict_count, and will be higher if other rows are implicated by being in a conflicting transaction, or being dependent on a row in a conflicting transaction.&lt;br /&gt;A separate ndb_conflict_trans_row_implicated_count could be constructed as ndb_conflict_trans_row_reject_count - ndb_conflict_trans_row_conflict_count&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;ndb_conflict_trans_reject_count&lt;/span&gt;&lt;br /&gt;The number of discrete user transactions detected as in-conflict.&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;ndb_conflict_trans_conflict_commit_count&lt;/span&gt;&lt;br /&gt;The number of epoch transactions which had transactional conflicts detected during application.&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;ndb_conflict_trans_detect_iter_count&lt;/span&gt;&lt;br /&gt;The number of iterations of the three-pass algorithm that have occurred. Each set of passes counts as one. Normally this would be the same as ndb_conflict_trans_conflict_commit_count. Where further conflicts are found on the third pass, another iteration may be required, which would increase this count. So if this count is larger than ndb_conflict_trans_conflict_commit_count then there have been some conflicts generated concurrently with conflict detection, perhaps suggesting a high conflict rate.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Performance properties of NDB$EPOCH and NDB$EPOCH_TRANS&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I have tried to avoid getting involved in an explanation of Ndb replication in general, which would probably fill a terabyte of posts. 
Comparing replication using NDB$EPOCH and NDB$EPOCH_TRANS relative to Ndb replication with no conflict detection, what can we say?&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Conflict detection logic is pushed down to data nodes for execution&lt;br /&gt;Minimising extra data transfer + locking&lt;/li&gt;&lt;li&gt;Slave operation batching is preserved&lt;br /&gt;Multiple row events are applied together, saving MySQLD &amp;lt;-&amp;gt; data node round trips, using data node parallelism&lt;br /&gt;For both algorithms, one extra MySQLD &amp;lt;-&amp;gt; data node round-trip is required in the no-conflicts case (best case)&lt;/li&gt;&lt;li&gt;NDB$EPOCH : One extra MySQLD &amp;lt;-&amp;gt; data node round-trip is required per *batch* in the all-conflicts case (worst case)&lt;/li&gt;&lt;li&gt;NDB$EPOCH : Minimal impact to Binlog sizes - one extra row event per epoch.&lt;/li&gt;&lt;li&gt;NDB$EPOCH : Minimal overhead to Slave SQL CPU consumption&lt;/li&gt;&lt;li&gt;NDB$EPOCH_TRANS : One extra MySQLD &amp;lt;-&amp;gt; data node round-trip is required per *batch* per *pass* in the all-conflicts case (worst case)&lt;/li&gt;&lt;li&gt;NDB$EPOCH_TRANS : One round of two passes is required for each conflict newly created since the previous pass.&lt;/li&gt;&lt;li&gt;NDB$EPOCH_TRANS : Small impact to Binlog sizes - one extra row event per epoch plus one user transaction id per row event.&lt;/li&gt;&lt;li&gt;NDB$EPOCH_TRANS : Small overhead to Slave SQL CPU consumption in no-conflict case&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Current and intrinsic limitations&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;These functions support automatic conflict detection and handling without schema or application changes, but there are a number of limitations. 
Some limitations are due to the current implementation, some are just intrinsic in the asynchronous distributed consistency problem itself.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Intrinsic limitations&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Reads from the Secondary are tentative&lt;/span&gt;&lt;br /&gt;Data committed on the secondary may later be rolled back. The window of potential rollback is limited, after which Secondary data can be considered stable.  This is described in more detail &lt;a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html"&gt;here&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Writes to the Secondary may be rolled back&lt;/span&gt;&lt;br /&gt;If this occurs, the fact will be recorded on the Primary. Once a committed write is &lt;a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html"&gt;stable&lt;/a&gt; it will not be rolled back.&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Out-of-band dependencies between transactions are out-of-scope&lt;/span&gt;&lt;br /&gt;For example direct communication between two clients creating a dependency between their committed transactions, not observable from their database footprints.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Current implementation limitations&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Detected transaction dependencies are limited to dependencies between binlogged writes&lt;/span&gt; (Insert, Update, Delete)&lt;br /&gt;Reads are not currently included.&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Delete vs Delete+Insert conflicts risk data divergence&lt;/span&gt;&lt;br /&gt;Delete vs Delete conflicts are detected, but currently do not result in conflict handling, so that Delete vs Delete + Insert 
can result in data divergence.&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;With NDB$EPOCH_TRANS, unplanned Primary outages may require manual steps to restore Secondary consistency&lt;/span&gt;&lt;br /&gt;With multiple pending, time-spaced, non-overlapping transactional conflicts, an unexpected failure may need some Binlog processing to ensure consistency.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Want to try it out?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Andrew Morgan has written a great &lt;a href="http://www.clusterdb.com/mysql-cluster/enhanced-conflict-resolution-with-mysql-cluster-active-active-replication/"&gt;post&lt;/a&gt; showing how to set up NDB$EPOCH_TRANS. He's even included non-ascii art.  This is probably the easiest way to get started. NDB$EPOCH is slightly easier to get started with as the --ndb-log-transaction-id (and Binlog v2) options are not required.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Edit 23/12/11 : Added index&lt;/span&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/3519742339745296117/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=3519742339745296117' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/3519742339745296117'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/3519742339745296117'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html' title='Eventual 
Consistency in MySQL Cluster - implementation part 3'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s72-c/image2.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-5904731119010279019</id><published>2011-12-19T13:30:00.001Z</published><updated>2011-12-23T10:46:37.775Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='active-active'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='cluster'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='nosql'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>Eventual consistency in MySQL Cluster - implementation part 2</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s1600/image2.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:left;cursor:pointer; cursor:hand;width: 250px; height: 203px;" src="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s320/image2.gif" alt="" id="BLOGGER_PHOTO_ID_5689269172198146146" usemap="#mymap" border="0" /&gt;&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;map name="mymap"&gt;&lt;area shape="rect" coords="0,182,249,200" 
href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html"&gt;&lt;area shape="rect" coords="0,166,249,183" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html"&gt;&lt;area shape="rect" coords="0,147,249,166" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html"&gt;&lt;area shape="rect" coords="0,127,249,147" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html"&gt;&lt;area shape="rect" coords="0,109,249,127" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html"&gt;&lt;area shape="rect" coords="0,92,249,109" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html"&gt;&lt;area shape="rect" coords="0,73,249,92" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html"&gt;&lt;area shape="rect" coords="0,59,249,73" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html"&gt;&lt;area shape="rect" coords="0,37,249,59" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;&lt;area shape="rect" coords="0,0,249,37" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;&lt;/map&gt;&lt;br /&gt;In previous posts I described how row conflicts are detected using epochs.  
In this post I describe how they are handled.&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;Row based conflict handling with NDB$EPOCH&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Once a row conflict is detected, as well as rejecting the row change, row based conflict handling in the Slave will :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Increment conflict counters&lt;/li&gt;&lt;li&gt;Optionally insert a row into an exceptions table&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;For NDB$EPOCH, conflict detection and handling operates on one Cluster in an Active-Active pair designated as the Primary.  When a Slave MySQLD attached to the Primary Cluster detects a conflict between data stored in the Primary and a replicated event from the Secondary, it needs to realign the Secondary to store the same values for the conflicting data.  Realignment involves injecting an event into the Primary Cluster's Binlog which, when applied idempotently on the Secondary Cluster, will force the row on the Secondary Cluster to take the supplied values.  This requires either a WRITE_ROW event, with all columns, or a DELETE_ROW event with just the primary key columns.  These events can be thought of as &lt;a href="http://en.wikipedia.org/wiki/Compensating_transaction"&gt;compensating&lt;/a&gt; events used to revert the original effect of the rejected events.&lt;br /&gt;&lt;br /&gt;Conflicts are detected by a Slave MySQLD attached to the Primary Cluster, and realignment events must appear in Binlogs recorded by the same MySQLD and/or other Binlogging MySQLDs attached to the Primary Cluster.  
This is achieved using a new &lt;a href="http://dev.mysql.com/doc/ndbapi/en/index.html"&gt;NdbApi&lt;/a&gt; primary key operation type called &lt;span style="font-style: italic;"&gt;refreshTuple&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;When a refreshTuple operation is executed it will :&lt;br /&gt;&lt;ol&gt;&lt;li&gt; Lock the affected row/primary key until transaction commit time, even if it does not exist (much as an Insert would).&lt;/li&gt;&lt;li&gt;Set the affected row's author metacolumn to 0&lt;br /&gt;The refresh is logically a local change&lt;/li&gt;&lt;li&gt;On commit&lt;br /&gt;- Row exists case : Set the row's last committed epoch to the current epoch&lt;br /&gt;- Cause a WRITE_ROW (row exists case) or DELETE_ROW (no row exists) event to be generated by attached Binlogging MySQLDs.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;Locking the row as part of refreshTuple serialises the conflicting epoch transaction with other potentially conflicting local transactions.  Updating the stored epoch and author metacolumns results in the conflicting row conflicting with any further replicated changes occurring while the realignment event is 'in flight'.  The compensating row events are effectively new row changes originating at the Primary cluster which need to be monitored for conflicts in the same way as normal row changes.&lt;br /&gt;&lt;br /&gt;It is important that the Slave running at the Secondary Cluster, where the realignment events will be applied, is running in idempotent mode, so that it can handle the realignment events correctly.  
If this is not the case then WRITE_ROW realignment events may hit 'Row already exists' errors, and DELETE_ROW realignment events may hit 'Row does not exist' errors.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Observations on conflict windows and consistency&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;When a conflict is detected, the refresh process results in the row's epoch and author metacolumns being modified so that the window of potential conflict is extended, until the epoch in which the refresh operation was recorded has itself been reflected.  If ongoing updates at both clusters continually conflict then refresh operations will continue to be generated, and the conflict window will remain open until a refresh operation manages to propagate with no further conflicts occurring.  As with any eventually consistent system, consistency is only guaranteed when the system (or at least the data of interest) is quiescent for a period.&lt;br /&gt;&lt;br /&gt;From the Primary cluster's point of view, the &lt;span style="font-style: italic;"&gt;conflict window length&lt;/span&gt; is the time between committing a local transaction in epoch &lt;span style="font-style: italic;"&gt;n&lt;/span&gt;, and the attached Slave committing a replicated epoch transaction indicating that epoch &lt;span style="font-style: italic;"&gt;n&lt;/span&gt; has been applied at the Secondary.  
Any Secondary-sourced overlapping change applied in this time is in-conflict.&lt;br /&gt;&lt;br /&gt;This &lt;span style="font-style: italic;"&gt;Cluster conflict window&lt;/span&gt; &lt;span style="font-style: italic;"&gt;length&lt;/span&gt; is comprised of :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt; Time between commit of transaction, and next Primary Cluster epoch boundary&lt;br /&gt;(Worst = 1 * &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#ndbparam-ndbd-timebetweenepochs"&gt;&lt;span style="font-style: italic;"&gt;TimeBetweenEpochs&lt;/span&gt;&lt;/a&gt;, Best = 0, Avg = 0.5 * &lt;span style="font-style: italic;"&gt;TimeBetweenEpochs&lt;/span&gt;)&lt;/li&gt;&lt;li&gt;Time required to log event in Primary Cluster's Binlogging MySQLDs Binlog (~negligible)&lt;/li&gt;&lt;li&gt;Time required for Secondary Slave MySQLD IO thread to&lt;br /&gt;- Minimum : Detect new Binlog data - negligible&lt;br /&gt;- Maximum : Consume queued Binlog prior to the new data - unbounded&lt;br /&gt;- Pull new epoch transaction&lt;br /&gt;- Record in Relay log&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Time required for Secondary Slave MySQLD SQL thread to&lt;br /&gt;- Minimum : Detect new events in relay log&lt;br /&gt;- Maximum : Consume queued Relay log prior to new data - unbounded&lt;br /&gt;- Read and apply events&lt;br /&gt;- Potentially multiple batches.&lt;br /&gt;- Commit epoch transaction at Secondary&lt;/li&gt;&lt;li&gt;Time between commit of replicated epoch transaction and next Secondary Cluster epoch boundary&lt;br /&gt;(Worst = 1 * &lt;span style="font-style: italic;"&gt;TimeBetweenEpochs&lt;/span&gt;, Best = 0, Avg = 0.5 * &lt;span style="font-style: italic;"&gt;TimeBetweenEpochs&lt;/span&gt;)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;After this point a Secondary-local commit on the data is possible without conflict&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Time required to log event in Secondary Cluster's Binlogging MySQLDs Binlog 
(~negligible)&lt;/li&gt;&lt;li&gt;Time required for Primary Slave MySQLD IO thread to&lt;br /&gt;- Minimum : Detect new Binlog data&lt;br /&gt;- Maximum : Consume queued Binlog data prior to the new data - unbounded&lt;br /&gt;- Pull new epoch transaction&lt;br /&gt;- Record in Relay log&lt;/li&gt;&lt;li&gt;Time required for Primary Slave MySQLD SQL thread to&lt;br /&gt;- Minimum : Detect new events in relay log&lt;br /&gt;- Maximum : Consume queued Relay log prior to new data - unbounded&lt;br /&gt;- Read and apply events&lt;br /&gt;- Potentially multiple batches.&lt;br /&gt;- For NDB$EPOCH_TRANS, potentially multiple passes&lt;br /&gt;- Commit epoch transaction&lt;br /&gt;- Update max replicated epoch to reflect new maximum.&lt;/li&gt;&lt;li&gt;Further Secondary sourced modifications to the rows are now considered not-in-conflict&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;From the point of view of an external client with access to both Primary and Secondary clusters, the conflict window only extends from the time transaction commit occurs at the Primary to the time the replicated operations are applied at the Secondary, and its commit time Secondary epoch ends. 
Changes committed at the Secondary after this will clearly appear to the Primary to have occurred after its epoch was applied on the Secondary and therefore are not in-conflict.&lt;br /&gt;&lt;br /&gt;Assuming that both Clusters have the same &lt;span style="font-style: italic;"&gt;TimeBetweenEpochs&lt;/span&gt;, we can combine the two epoch boundary waits (one at each Cluster) into a single EpochDelay term and simplify the Cluster conflict window to :&lt;br /&gt;&lt;pre&gt;  Cluster_conflict_window_length = EpochDelay +&lt;br /&gt;                                   P_Binlog_lag +&lt;br /&gt;                                   S_Relay_lag +&lt;br /&gt;                                   S_Binlog_lag +&lt;br /&gt;                                   P_Relay_lag&lt;br /&gt;&lt;br /&gt; Where&lt;br /&gt;    EpochDelay minimum is 0&lt;br /&gt;    EpochDelay avg     is TimeBetweenEpochs&lt;br /&gt;    EpochDelay maximum is 2 * TimeBetweenEpochs&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Substituting the default &lt;span style="font-style: italic;"&gt;TimeBetweenEpochs&lt;/span&gt; value of 100 millis, we get :&lt;br /&gt;&lt;pre&gt;    EpochDelay minimum is 0&lt;br /&gt;    EpochDelay avg     is 100 millis&lt;br /&gt;    EpochDelay maximum is 200 millis&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Note that TimeBetweenEpochs is an epoch-increment trigger delay.  The actual experienced time between epochs can be longer depending on system load.  The various Binlog and Relay log delays can vary from close to zero up to infinity.  
Infinity occurs when replication stops in either direction.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-style: italic;"&gt;Cluster conflict window&lt;/span&gt; length can be thought of as both&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The time taken to detect a conflict with a Primary transaction&lt;/li&gt;&lt;li&gt;The time taken for a committed Secondary transaction to become stable or be reverted&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;We can define a &lt;span style="font-style: italic;"&gt;Client conflict window&lt;/span&gt; &lt;span style="font-style: italic;"&gt;length &lt;/span&gt;as either :&lt;br /&gt;&lt;pre&gt; Primary-&amp;gt;Secondary&lt;br /&gt;&lt;br /&gt;  Client_conflict_window_length = EpochDelay +&lt;br /&gt;                                  P_Binlog_lag +&lt;br /&gt;                                  S_Relay_lag +&lt;br /&gt;                                  EpochDelay&lt;br /&gt;&lt;br /&gt;or&lt;br /&gt;&lt;br /&gt;Secondary-&amp;gt;Primary&lt;br /&gt;&lt;br /&gt;  Client_conflict_window_length = EpochDelay +&lt;br /&gt;                                  S_Binlog_lag +&lt;br /&gt;                                  P_Relay_lag&lt;br /&gt;&lt;br /&gt;Where EpochDelay is defined as above.&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;These definitions are asymmetric.  They represent the time taken by the system to determine that a particular change at one cluster definitely happened-before another change at the other cluster.  The asymmetry is due to the need for the Secondary part of a Primary-&amp;gt;Secondary conflict to be recorded in a different Secondary epoch.  The first definition considers an initial change at the Primary cluster, and a following change at the Secondary.  The second definition is for the inverse case.&lt;br /&gt;&lt;br /&gt;An interesting observation is that for a single pair of near-concurrent updates at different clusters, happened-before depends only on latencies in one direction.  
For example, an update to the Primary at time &lt;span style="font-style: italic;"&gt;Ta&lt;/span&gt;, followed by an update to the Secondary at time &lt;span style="font-style: italic;"&gt;Tb&lt;/span&gt; will not be considered in conflict if:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt; Tb - Ta &amp;gt; Client_conflict_window_length(Primary-&amp;gt;Secondary)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Client_conflict_window_length(Primary-&amp;gt;Secondary)&lt;/span&gt; depends on the &lt;span style="font-style: italic;"&gt;EpochDelay&lt;/span&gt;, the &lt;span style="font-style: italic;"&gt;P_Binlog_lag&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;S_Relay_lag&lt;/span&gt;, but not on the &lt;span style="font-style: italic;"&gt;S_Binlog_lag&lt;/span&gt; or &lt;span style="font-style: italic;"&gt;P_Relay_lag&lt;/span&gt;.  This can mean that high replication latency, or a complete outage in one direction does not always result in increased conflict rates.  However, in the case of multiple sequences of near-concurrent updates at different sites, it probably will.&lt;br /&gt;&lt;br /&gt;A general property of the NDB$EPOCH family is that the conflict rate has some dependency on the replication latency.  Whether two updates to the same row at times &lt;span style="font-style: italic;"&gt;Ta&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;Tb&lt;/span&gt; are considered to be in conflict depends on the relationship between those times and the &lt;span style="font-weight: bold;"&gt;current&lt;/span&gt; system replication latencies.  This can remove the need for highly synchronised real-time clocks as recommended for NDB$MAX, but can mean that the observed conflict rate increases when the system is lagging.  This also implies that more work is required to catch up, which could further affect lag.  
NDB$MAX, by contrast, requires manual timestamp maintenance and will not detect incorrect behaviour, but the basic decision on whether two updates are in-conflict is made at commit time and is independent of the system replication latency.&lt;br /&gt;&lt;br /&gt;In summary :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The &lt;span style="font-style: italic;"&gt;Client_conflict_window_length&lt;/span&gt; in either direction will on average not be less than the &lt;span style="font-style: italic;"&gt;EpochDelay&lt;/span&gt; (100 millis by default)&lt;/li&gt;&lt;li&gt;Clients racing against replication to update both clusters need only beat the current &lt;span style="font-style: italic;"&gt;Client_conflict_window_length&lt;/span&gt; to cause a conflict&lt;/li&gt;&lt;li&gt;Replication latencies in either direction are potentially independent&lt;/li&gt;&lt;li&gt;Detected conflict rates partly depend on replication latencies&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Stability of reads from the Primary Cluster&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In the case of a conflict, the rows at the Primary Cluster will tentatively have replicated operations applied against them by a Slave MySQLD.   These conflicting operations will fail prior to commit as their interpreted precondition checks will fail, so the conflicting rows will not be modified on the Primary.  One effect of this is that a &lt;span style="font-weight: bold;"&gt;read from the Primary Cluster only ever returns stable data&lt;/span&gt;, as conflicting changes are never committed there.  
In contrast, a read from the Secondary Cluster returns data which has been committed, but may be subject to later 'rollback' via refresh operations from the Primary Cluster.&lt;br /&gt;&lt;br /&gt;The same stability-of-reads observation applies to a row change event stream on the Primary Cluster - events for a single key will be received in the order they were committed, and no later-to-be-rolled-back events will be observed in the stream.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Stability of reads from the Secondary Cluster&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;If the Secondary Cluster is also receiving reflected applied epoch information back from the Primary then it will know when its epoch &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; has been applied successfully at the Primary.  Therefore a read of some row &lt;span style="font-style: italic;"&gt;y&lt;/span&gt; on the Secondary can be considered tentative while Max_Replicated_Epoch(Secondary) &amp;lt; row_epoch(&lt;span style="font-style: italic;"&gt;y&lt;/span&gt;), but once Max_Replicated_Epoch(Secondary) &amp;gt;= row_epoch(&lt;span style="font-style: italic;"&gt;y&lt;/span&gt;) then the read can be considered stable.  This is because if the Primary were going to detect a conflict with a Secondary change committed in epoch &lt;span style="font-style: italic;"&gt;x&lt;/span&gt;, then the refresh events associated with the conflict would be recorded in the same Primary epoch as the notification of the application of epoch &lt;span style="font-style: italic;"&gt;x&lt;/span&gt;.  So if the Secondary observes the notification of epoch &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; (and updates Max_Replicated_Epoch accordingly), and row &lt;span style="font-style: italic;"&gt;y&lt;/span&gt; is not modified in the same epoch transaction, then it is stable.  
The time taken to reach stability after a Secondary Cluster commit will be the &lt;span style="font-style: italic;"&gt;Cluster conflict window length.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Perhaps some applications can make better use of the potentially transiently inconsistent Secondary data by categorising their reads from the Secondary as either potentially-inconsistent or stable.  To do this, they need to maintain Max_replicated_epoch(Secondary) (By listening to row change events on the ndb_apply_status table) and read the NDB$GCI_64 metacolumn when reading row data.  A read from the Secondary is stable if all the NDB$GCI_64 values for all rows read are &amp;lt;= the Secondary's Max_Replicated_Epoch.&lt;br /&gt;&lt;br /&gt;In the next post (final post I promise!) I will describe the implementation of the transaction dependency tracking in NDB$EPOCH_TRANS, and review the implementation of both NDB$EPOCH and NDB$EPOCH_TRANS.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Edit 23/12/11 : Added index&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-5904731119010279019?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/5904731119010279019/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=5904731119010279019' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/5904731119010279019'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/5904731119010279019'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html' title='Eventual 
consistency in MySQL Cluster - implementation part 2'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s72-c/image2.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-2635540360364141806</id><published>2011-12-08T00:20:00.006Z</published><updated>2011-12-23T10:46:04.718Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='active-active'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='cluster'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='nosql'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>Eventual consistency in MySQL Cluster - implementation part 1</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s1600/image2.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:left;cursor:pointer; cursor:hand;width: 250px; height: 203px;" src="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s320/image2.gif" alt="" id="BLOGGER_PHOTO_ID_5689269172198146146" usemap="#mymap" border="0" /&gt;&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;map name="mymap"&gt;&lt;area shape="rect" coords="0,182,249,200" 
href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html"&gt;&lt;area shape="rect" coords="0,166,249,183" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html"&gt;&lt;area shape="rect" coords="0,147,249,166" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html"&gt;&lt;area shape="rect" coords="0,127,249,147" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html"&gt;&lt;area shape="rect" coords="0,109,249,127" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html"&gt;&lt;area shape="rect" coords="0,92,249,109" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html"&gt;&lt;area shape="rect" coords="0,73,249,92" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html"&gt;&lt;area shape="rect" coords="0,59,249,73" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html"&gt;&lt;area shape="rect" coords="0,37,249,59" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;&lt;area shape="rect" coords="0,0,249,37" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;&lt;/map&gt;&lt;br /&gt;The last &lt;a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html"&gt;post&lt;/a&gt; described MySQL Cluster epochs and why they provide a good basis for conflict detection, with a few enhancements required.  
This post describes the enhancements.&lt;br /&gt;&lt;br /&gt;The following four mechanisms are required to implement conflict detection via epochs :&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Slaves should 'reflect' information about replicated epochs they have applied&lt;/span&gt;&lt;br /&gt;Applied epoch numbers should be included in the Slave Binlog events returning to the originating cluster, in a Binlog position corresponding to the commit time of the replicated epoch transaction relative to Slave local transactions.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Masters should maintain a maximum replicated epoch&lt;/span&gt;&lt;br /&gt;A cluster should use the reflected epoch information to track which of its epochs has been applied by a Slave cluster.  This will be the maximum of all epochs applied by the Slave.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Masters should track commit-time epoch per row&lt;/span&gt;&lt;br /&gt;To allow per-row detection of conflicts&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Masters should track commit-authorship per row&lt;/span&gt;&lt;br /&gt;To differentiate recent epochs due to replication or conflicting activity.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;'Reflecting' epoch information and maintaining the maximum replicated epoch&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Every epoch transaction in the Binlog contains a special WRITE_ROW event on the mysql.ndb_apply_status table which carries the epoch transaction's epoch number.  This is designed to give an atomically consistent way to determine a Slave cluster's position relative to a Master cluster.  
Normally these WRITE_ROW events are applied by the Slave but not logged in the Slave's Binlog, even when &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/replication-options-slave.html#option_mysqld_log-slave-updates"&gt;--log-slave-updates&lt;/a&gt; is ON.  A new MySQLD option, &lt;a href="http://dev.mysql.com/doc/mysql-cluster-excerpt/5.1/en/mysql-cluster-program-options-mysqld.html"&gt;--ndb-log-apply-status&lt;/a&gt; causes WRITE_ROW events applied to the mysql.ndb_apply_status table to be binlogged at a Slave, even when --log-slave-updates is OFF.  These events are logged with the ServerId of the Slave MySQLD, so that they can be applied on the Master, but will not loop infinitely.&lt;br /&gt;&lt;br /&gt;Allowing this applied epoch information to propagate through a Slave Cluster has the following effects :&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Downstream Clusters become aware of their position relative to all upstream Master clusters, not just their immediate Master cluster.&lt;br /&gt;&lt;span style="font-style: italic;"&gt;They gain extra mysql.ndb_apply_status entries for all upstream Masters.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Circularly replicating clusters become aware of which of their epochs, and epoch transactions, have been applied to all clusters in the circle.&lt;br /&gt;&lt;span style="font-style: italic;"&gt;They gain extra mysql.ndb_apply_status entries for all Binlogging MySQLDs in the loop&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;Effect 1 is useful for replication failover with more than two replication-chained clusters where an intermediate cluster is being routed-around (A-&amp;gt;B-&amp;gt;C) -&amp;gt; (A-&amp;gt;C).   Cluster C knows the correct Binlog file and position to resume from on A, without consulting B.&lt;br /&gt;&lt;br /&gt;Effect 2 could be used to allow clients to wait until their writes have been fully replicated and are globally visible, a kind of synchronous replication.  
More relevantly, effect 2 allows us to maintain a maximum replicated epoch value for detecting conflicts.&lt;br /&gt;&lt;br /&gt;The visible result of using --ndb-log-apply-status on a Slave is that the mysql.ndb_apply_status table on the Master contains extra entries for the Binlogging MySQLDs attached to its Cluster.  The maximum replicated epoch is the maximum of these epoch values.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;    Cluster 1 Epoch transactions in flight in&lt;br /&gt;         a circular configuration&lt;br /&gt;        (Ignoring Cluster 2 epochs)&lt;br /&gt;&lt;br /&gt;                           39       38       37&lt;br /&gt;                     -&amp;gt;----&amp;gt;-----&amp;gt;-----&amp;gt;-----&amp;gt;--&lt;br /&gt;                    /                           \ (Queued epochs 36-26)&lt;br /&gt;          Cluster 1                             Cluster 2&lt;br /&gt;(Queued epochs 23,24) \                           /&lt;br /&gt;                     -&amp;lt;---&amp;lt;------&amp;lt;----&amp;lt;----&amp;lt;----&lt;br /&gt;                          25       26       27&lt;br /&gt;&lt;br /&gt;Current epoch = 40&lt;br /&gt;Max replicated epoch = 22              &lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;A MySQLD acting as a conflict detecting Slave for a cluster needs to know the attached cluster's maximum replicated epoch for conflict detection.  On Slave start, before the Slave starts applying replicated changes to the Ndb storage engine, it scans the &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-replication-schema.html"&gt;mysql.ndb_apply_status&lt;/a&gt; table to find the highest reflected epoch value.   
Rows in mysql.ndb_apply_status whose server ids appear in the &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/change-master-to.html"&gt;CHANGE MASTER&lt;/a&gt; TO IGNORE_SERVER_IDS list, or match the Slave's own server id, are considered to belong to local servers, and the maximum replicated epoch is the maximum epoch value from these rows.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;@ Slave start&lt;br /&gt;&lt;br /&gt; max_replicated_epoch = SELECT MAX(epoch)&lt;br /&gt;                          FROM mysql.ndb_apply_status&lt;br /&gt;                         WHERE server_id IN @@IGNORE_SERVER_IDS;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Once the Max_replicated_epoch has been initialised at slave start, it is updated as each reflected epoch event (WRITE_ROW event to mysql.ndb_apply_status) arrives and is processed by the Slave SQL thread.  The current Max_replicated_epoch can be seen by issuing the command SHOW STATUS LIKE 'Ndb_slave_max_replicated_epoch';.  Note that this is really just a cached copy of the current result of the SELECT MAX(epoch) query from above.  One subtlety is that the max_replicated_epoch is only changed when the Slave commits an epoch transaction, as it is only at this point that we know for sure that any event committed on the other cluster before the replicated epoch was applied has been handled.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Per row last-modified epoch storage&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Each row stored in Ndb has a built-in hidden metadata column called NDB$GCI64.  This column stores the epoch number at which the row was last modified.  For normal system recovery purposes, only the top 32 bits of the 64 bit epoch, called the Global Checkpoint Index or GCI, are used.  NDB$EPOCH needs further bits to be stored per-row.  Epoch values only use a few of the bits in the bottom 32 bits of the epoch, so by default 6 extra bits per row are used to enable a full 64 bit epoch to be stored for each row.  
The actual number of bits used can be controlled by a parameter to NDB$EPOCH.  Where some epoch is not fully expressible in the number of bits available, the bottom 32 bits are saturated, which again errs on the side of safety, potentially causing false conflicts, but ensuring no real conflicts are missed.  The &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-programs-ndb-select-all.html"&gt;ndb_select_all&lt;/a&gt; tool has a --gci64 option which shows each row's stored epoch value.&lt;br /&gt;&lt;br /&gt;A conflict detecting slave detects conflicts between transactions already committed, whose rows have their commit-time epoch numbers, and incoming operations in an epoch transaction, which are considered to have been committed at the epoch given by the current Maximum Replicated Epoch.  An incoming operation is considered to be in-conflict if the row it affects has a last-committed epoch that is greater than the current Maximum Replicated Epoch.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;  in_conflict = (ndb_gci64 &amp;gt; max_replicated_epoch)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;In other words, at the time the change was committed on the other Cluster, that other Cluster was only aware of our changes as-of our epoch (max_replicated_epoch).  Therefore it was unaware of any changes committed in more recent epochs.  If the row being changed has been locally modified since that epoch then there have been concurrent modifications and a conflict has been discovered.&lt;br /&gt;&lt;br /&gt;Note that this mechanism is purely based on monitoring serialisation of updates to rows.  No semantic understanding of row data, or the meaning of applied changes is attempted.  
Even if both clusters update some row to contain exactly the same value it will be considered to be a conflict, as the updates were not serialised with respect to each other.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Per row hidden Author metacolumn&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;One advantage of reusing the row's last-modified epoch number for conflict detection is that it is automatically set on every commit.  However the downside is that when a replicated modification is found to &lt;span style="font-weight: bold;"&gt;not&lt;/span&gt; be in conflict, and is applied, the row's epoch is automatically set to the current value at commit time as normal.  By definition, the current epoch value is always greater than the maximum replicated epoch, and so if a further replicated modification to the same row were to arrive, it would find the row's epoch to be higher than the current maximum replicated epoch, and detect a false conflict.&lt;br /&gt;&lt;br /&gt;In theory we could consider the current maximum replicated epoch to be the row's commit time epoch, but as the per-row epoch is used for other more critical DB recovery purposes it's not safe to abuse it in this way.  Instead we use the observation that if we found a previous row update from some other cluster to be not-in-conflict, then further updates from it are also not-in-conflict.&lt;br /&gt;&lt;br /&gt;To detect this, a new hidden metadata column is introduced called NDB$AUTHOR.  This column is set to zero when a row is modified by any unmodified NdbApi client, including MySQLD, but when a row is modified by the MySQLD Slave SQL thread, it is set to one.  More generally, NDB$AUTHOR could be set to a non-zero identifier of which other cluster sourced an accepted change.  Just setting to one limits us to having one other cluster originating potentially conflicting changes.  
The &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-programs-ndb-select-all.html"&gt;ndb_select_all&lt;/a&gt; tool has a --author option which shows each row's stored Author value.&lt;br /&gt;&lt;br /&gt;By extending the conflict detecting function to examine the NDB$AUTHOR value, we avoid the problem of falsely detecting conflicts when applying consecutive replicated changes.&lt;br /&gt;&lt;pre&gt;  in_conflict = (ndb$author != change_author) &amp;amp;&amp;amp; (ndb_gci64 &amp;gt; max_replicated_epoch)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;We are currently just using 1 to mean 'other author', so this simplifies to :&lt;br /&gt;&lt;pre&gt;  in_conflict = (ndb$author != 1) &amp;amp;&amp;amp; (ndb_gci64 &amp;gt; max_replicated_epoch)&lt;br /&gt;&lt;br /&gt;              = (ndb$author == 0) &amp;amp;&amp;amp; (ndb_gci64 &amp;gt; max_replicated_epoch)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;This conflict detection function is encoded in an &lt;a href="http://dev.mysql.com/doc/ndbapi/en/ndb-ndbinterpretedcode.html"&gt;Ndb interpreted program&lt;/a&gt; and attached to the replicated DELETE and UPDATE NdbApi operations so that it can be quickly and atomically executed at the Ndb data nodes as a predicate prior to applying the operation.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Ndb binlog row event ordering and false conflicts&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The happened-before relationship between reflected epoch events (WRITE_ROW to mysql.ndb_apply_status) and incoming row events is used to determine whether a conflict has occurred.   As described in the last &lt;a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html"&gt;post&lt;/a&gt;, Ndb offers limited ordering guarantees on the row events within an epoch transaction.  The only guarantee is that multiple changes to the same row will be recorded in the order they committed.  
This implies that the relative ordering of the reflected epoch WRITE_ROW event, on some row in mysql.ndb_apply_status, and other row events on other tables sharing the same epoch transaction is meaningless.  The only ordering guarantees between different rows exist at epoch boundaries.&lt;br /&gt;&lt;br /&gt;This means that if we see a reflected epoch WRITE_ROW event somewhere in replicated epoch &lt;span style="font-style: italic;"&gt;j&lt;/span&gt;, then we can only safely assume that this happened before incoming row events in epoch &lt;span style="font-style: italic;"&gt;j+1&lt;/span&gt; and later.  The row events appearing before and after the reflected epoch WRITE_ROW event in epoch &lt;span style="font-style: italic;"&gt;j&lt;/span&gt; may have committed before or after the reflected epoch event.&lt;br /&gt;&lt;br /&gt;The relaxed relative ordering gives us reduced precision in determining happened-before, and to be safe, we must err on the side of assuming that a conflict exists rather than that it does not.  Consider a Master committing a change to row &lt;span style="font-style: italic;"&gt;X&lt;/span&gt;, recorded in epoch &lt;span style="font-style: italic;"&gt;N&lt;/span&gt;.  This is then applied on the Slave in Slave epoch &lt;span style="font-style: italic;"&gt;S&lt;/span&gt;.  If the Slave then commits a local change affecting the same row &lt;span style="font-style: italic;"&gt;X&lt;/span&gt; in the same epoch &lt;span style="font-style: italic;"&gt;S&lt;/span&gt;, this will be returned to the Master in the same Slave epoch transaction, and the Master will be unable to determine whether it occurred before or after its original write to &lt;span style="font-style: italic;"&gt;X&lt;/span&gt; was applied on the Slave, so must assume that it occurred before and is therefore in conflict.  
If the Slave had committed its change in epoch &lt;span style="font-style: italic;"&gt;S+1&lt;/span&gt; or later, the happened-before relationship would be clear and the change would not be considered in conflict.&lt;br /&gt;&lt;br /&gt;These potential false conflicts are the price paid here for the lack of fine-grained event ordering in the Ndb Binlog.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;I'm lost&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There's been a lot of information, or at least a lot of words.  Let's summarise how NDB$EPOCH and NDB$EPOCH_TRANS detect row conflicts by following a change from Cluster A to Cluster B and back :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;@Cluster A&lt;br /&gt;Transactions modify rows, automatically setting their hidden NDB$GCI64 column to the current epoch and their NDB$AUTHOR column to 0&lt;br /&gt;&lt;br /&gt;Binlogging MySQLDs record modified rows in epoch transactions in their Binlogs, together with MySQLD generated mysql.ndb_apply_status WRITE_ROW events&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;@Cluster B&lt;br /&gt;Slave MySQLDs apply replicated epoch transactions along with their generated mysql.ndb_apply_status WRITE_ROW events&lt;br /&gt;&lt;br /&gt;Other clients of Cluster B commit transactions against the same data.&lt;br /&gt;&lt;br /&gt;Binlogging MySQLDs 'reflect' the applied-replicated epoch information by recording the mysql.ndb_apply_status WRITE_ROW events in their Binlogs as a result of --ndb-log-apply-status.&lt;br /&gt;&lt;br /&gt;Binlogging MySQLDs also record the row changes made by local clients.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;@Cluster A&lt;br /&gt;Slave MySQLDs track the incoming reflected epoch mysql.ndb_apply_status WRITE_ROW events to maintain their ndb_slave_max_replicated_epoch variables&lt;br /&gt;&lt;br /&gt;Slave MySQLDs attach NdbApi interpreted programs to UPDATE and DELETE operations as they are applied to the database, comparing the row's stored NDB$GCI64 and NDB$AUTHOR columns with constant values supplied 
in the program.&lt;br /&gt;&lt;br /&gt;If there are no conflicts, the UPDATE and DELETE operations are applied, and the row's NDB$AUTHOR columns are set to one indicating a successful Slave modification&lt;br /&gt;&lt;br /&gt;If there are conflicts then conflict handling for the conflicting rows begins.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Now does that make any sense?  Assuming it does, then next we look at how detected conflicts are handled.&lt;br /&gt;&lt;br /&gt;Once again, another wordy endurance test and we're not finished.  Surely the end must be near?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Edit 23/12/11 : Added index&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-2635540360364141806?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/2635540360364141806/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=2635540360364141806' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/2635540360364141806'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/2635540360364141806'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html' title='Eventual consistency in MySQL Cluster - implementation part 1'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail 
xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s72-c/image2.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-3891833330745782871</id><published>2011-12-07T14:28:00.007Z</published><updated>2011-12-23T10:45:39.617Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='active-active'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='cluster'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='parallel'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>Eventual Consistency in MySQL Cluster - using epochs</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s1600/image2.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:left;cursor:pointer; cursor:hand;width: 250px; height: 203px;" src="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s320/image2.gif" alt="" id="BLOGGER_PHOTO_ID_5689269172198146146" usemap="#mymap" border="0" /&gt;&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;map name="mymap"&gt;&lt;area shape="rect" coords="0,182,249,200" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html"&gt;&lt;area shape="rect" coords="0,166,249,183" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html"&gt;&lt;area shape="rect" coords="0,147,249,166" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html"&gt;&lt;area shape="rect" coords="0,127,249,147" 
href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html"&gt;&lt;area shape="rect" coords="0,109,249,127" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html"&gt;&lt;area shape="rect" coords="0,92,249,109" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html"&gt;&lt;area shape="rect" coords="0,73,249,92" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html"&gt;&lt;area shape="rect" coords="0,59,249,73" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html"&gt;&lt;area shape="rect" coords="0,37,249,59" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;&lt;area shape="rect" coords="0,0,249,37" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;&lt;/map&gt;&lt;br /&gt;Before getting to the details of how eventual consistency is implemented, we need to look at epochs.  Ndb Cluster maintains an internal distributed logical clock known as the epoch, represented as a 64-bit number.  This epoch serves a number of internal functions, and is atomically advanced across all data nodes.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Epochs and consistent distributed state&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Ndb is a parallel database, with multiple internal transaction coordinator components starting, executing and committing transactions against rows stored in different data nodes.  Concurrent transactions only interact where they attempt to lock the same row.  This design minimises unnecessary system-wide synchronisation, enabling linear scalability of reads and writes.&lt;br /&gt;&lt;br /&gt;The stream of changes made to rows stored at a data node is written to a local Redo log for node and system recovery.  
The change stream is also published to NdbApi event listeners, including MySQLD servers recording Binlogs.  Each node's change stream contains the row changes it was involved in, as committed by multiple transactions, and coordinated by multiple independent transaction coordinators, interleaved in a partial order.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;  Incoming independent transactions&lt;br /&gt; affecting multiple rows&lt;br /&gt;&lt;br /&gt;   T3         T4         T7&lt;br /&gt;   T1         T2         T5&lt;br /&gt;&lt;br /&gt;    |         |          |&lt;br /&gt;    V         V          V&lt;br /&gt;&lt;br /&gt; --------  --------  --------&lt;br /&gt; |  1   |  |  2   |  |  3   |&lt;br /&gt; |  TC  |  |  TC  |  |  TC  |   Data nodes with multiple&lt;br /&gt; |      |--|      |--|      |   transaction coordinators&lt;br /&gt; |------|  |------|  |------|   acting on data stored in&lt;br /&gt; |      |  |      |  |      |       different nodes&lt;br /&gt; | DATA |  | DATA |  | DATA |&lt;br /&gt; --------  --------  --------&lt;br /&gt;&lt;br /&gt;    |         |          |&lt;br /&gt;    V         V          V&lt;br /&gt;&lt;br /&gt;   t4        t4          t3&lt;br /&gt;   t1        t7          t2&lt;br /&gt;   t2        t1          t7&lt;br /&gt;             t5&lt;br /&gt;&lt;br /&gt; Outgoing row change event&lt;br /&gt;  streams by causing&lt;br /&gt;     transaction&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;These row event streams are generated independently by each data node in a cluster, but to be useful they need to be correlated together.  For system recovery from a crash, the data nodes need to recover to a cluster-wide consistent state.  A state which contains only whole transactions, and a state which, logically at least, existed at some point in time.  
This correlation could be done by an analysis of the transaction ids and row dependencies of each recorded row change to determine a valid order for the merged event streams, but this would add significant overhead. Instead, the Cluster uses a distributed logical clock known as the epoch to group large sets of committed transactions together.&lt;br /&gt;&lt;br /&gt;Each epoch contains zero or more committed transactions.  Each committed transaction belongs to exactly one epoch.  The epoch clock advances periodically, every 100 milliseconds by &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#ndbparam-ndbd-timebetweenepochs"&gt;default&lt;/a&gt;.  When it is time for a new epoch to start, a distributed protocol known as the Global Commit Protocol (GCP) results in all of the transaction coordinators in the Cluster agreeing on a point in time in the flow of committing transactions at which to change epoch.  This epoch boundary, between the commit of the last transaction in epoch &lt;span style="font-style:italic;"&gt;n&lt;/span&gt;, and the commit of the first transaction in epoch &lt;span style="font-style:italic;"&gt;n+1&lt;/span&gt;, is a cluster-wide consistent point in time.  Obtaining this consistent point in time requires cluster-wide synchronisation, between all transaction coordinators, but it need only happen periodically.&lt;br /&gt;&lt;br /&gt;Furthermore, each node ensures that all events for epoch &lt;span style="font-style:italic;"&gt;n&lt;/span&gt; are published before any events for epoch &lt;span style="font-style:italic;"&gt;n+1&lt;/span&gt; are published.  
Effectively the event streams are sorted by epoch number, and the first time a new epoch is encountered signifies a precise epoch boundary.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt; Incoming independent transactions&lt;br /&gt;&lt;br /&gt;   T3         T4         T7&lt;br /&gt;   T1         T2         T5&lt;br /&gt;&lt;br /&gt;    |         |          |&lt;br /&gt;    V         V          V&lt;br /&gt;&lt;br /&gt; --------  --------  --------&lt;br /&gt; |  1   |  |  2   |  |  3   |&lt;br /&gt; |  TC  |  |  TC  |  |  TC  |   Data nodes with multiple&lt;br /&gt; |      |--|      |--|      |   transaction coordinators&lt;br /&gt; |------|  |------|  |------|   acting on data stored in&lt;br /&gt; |      |  |      |  |      |      different nodes&lt;br /&gt; | DATA |  | DATA |  | DATA |&lt;br /&gt; --------  --------  --------&lt;br /&gt;&lt;br /&gt;    |         |          |&lt;br /&gt;    V         V          V&lt;br /&gt;&lt;br /&gt;  t4(22)    t4(22)      t3(22)            Epoch 22&lt;br /&gt;  ......    ......      ......&lt;br /&gt;  t1(23)    t7(23)      t2(23)            Epoch 23&lt;br /&gt;  t2(23)    t1(23)      t7(23)&lt;br /&gt;            ......&lt;br /&gt;            t5(24)                        Epoch 24&lt;br /&gt;&lt;br /&gt;  Outgoing row change event&lt;br /&gt;  streams by causing transaction&lt;br /&gt;  with epoch numbers in ()&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;When these independent streams are merge-sorted by epoch number we get a unified change stream.  
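&lt;br /&gt;&lt;br /&gt;As a rough illustration of that merge, here is a minimal Python sketch using the three per-node streams from the diagram above.  The tuple layout and the function names are assumptions made for illustration, not Ndb internals, and real row change events carry far more than a transaction label.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
import heapq
from itertools import groupby

# Per-node event streams from the diagram: (epoch, transaction) pairs.
# Each stream is already sorted by epoch. The layout is an assumption.
node1 = [(22, "t4"), (23, "t1"), (23, "t2")]
node2 = [(22, "t4"), (23, "t7"), (23, "t1"), (24, "t5")]
node3 = [(22, "t3"), (23, "t2"), (23, "t7")]


def merge_by_epoch(*streams):
    """Merge epoch-sorted streams into one epoch-ordered stream."""
    # Only the epoch is used as the sort key, so the order of events
    # inside an epoch is not a global commit order.
    return list(heapq.merge(*streams, key=lambda event: event[0]))


def transactions_per_epoch(merged):
    """Group the merged stream into {epoch: set of transactions}."""
    return {
        epoch: {txn.upper() for _, txn in events}
        for epoch, events in groupby(merged, key=lambda event: event[0])
    }


merged = merge_by_epoch(node1, node2, node3)
epochs = transactions_per_epoch(merged)
```
&lt;/pre&gt;&lt;br /&gt;Because only the epoch number is compared, events within one epoch come out in whatever order the per-node streams happen to supply them.&lt;br /&gt;&lt;br /&gt;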
Multiple possible orderings can result.&lt;br /&gt;One partial ordering is shown here :&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;      Events      Transactions&lt;br /&gt;               contained in epoch&lt;br /&gt;&lt;br /&gt;   t4(22)&lt;br /&gt;   t4(22)      {T4,T3}&lt;br /&gt;   t3(22)&lt;br /&gt;&lt;br /&gt;   ......&lt;br /&gt;&lt;br /&gt;   t1(23)&lt;br /&gt;   t2(23)&lt;br /&gt;   t7(23)&lt;br /&gt;   t1(23)      {T1, T2, T7}&lt;br /&gt;   t2(23)&lt;br /&gt;   t7(23)&lt;br /&gt;&lt;br /&gt;   ......&lt;br /&gt;&lt;br /&gt;   t5(24)      {T5}&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Note that we can state from this that T4 -&amp;gt; T1 (happened before), and T1 -&amp;gt; T5.  However we cannot say whether T4 -&amp;gt; T3 or T3 -&amp;gt; T4.  In epoch 23 we see that the row events resulting from T1, T2 and T7 are interleaved.&lt;br /&gt;&lt;br /&gt;Epoch boundaries act as markers in the flow of row events generated by each node, which are then used as consistent points to recover to.  Epoch boundaries also allow a single system-wide unified transaction log to be generated from each node's row change stream, by merge-sorting the per-node row change streams by epoch number.  Note that the order of events within an epoch is still not tightly constrained. As concurrent transactions can only interact via row locks, the order of events on a single row (Table and Primary key value) signifies transaction commit order, but there is by definition no order between transactions affecting independent row sets.&lt;br /&gt;&lt;br /&gt;To record a Binlog of Ndb row changes, MySQLD listens to the row change streams arriving from each data node, and merge-sorts them by epoch into a single, epoch-ordered stream.  When all events for a given epoch have been received, MySQLD records a single Binlog transaction containing all row events for that epoch.  
This Binlog transaction is referred to as an 'Epoch transaction' as it describes all row changes that occurred in an epoch.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Epoch transactions in the Binlog&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Epoch transactions in the Binlog have some interesting properties :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Efficiency&lt;/span&gt; : They can be considered a kind of Binlog group commit, where multiple user transactions are recorded in one Binlog (epoch) transaction.  As an epoch normally contains 100 milliseconds of row changes from a cluster, this is a significant amortisation.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Consistency&lt;/span&gt; : Each epoch transaction contains the row operations which occurred when moving the cluster from epoch boundary consistent state A to epoch boundary consistent state B&lt;br /&gt;Therefore, when applied as a transaction by a slave, the slave will atomically move from consistent state A to consistent state B&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Inter-epoch ordering&lt;/span&gt; : Any row event recorded in epoch &lt;span style="font-style: italic;"&gt;n+1&lt;/span&gt; logically happened after every row event in epoch &lt;span style="font-style: italic;"&gt;n&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Intra-epoch disorder&lt;/span&gt; : Any two row events recorded in epoch &lt;span style="font-style: italic;"&gt;n&lt;/span&gt;, affecting different rows, may have happened in any order.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Intra-epoch key-order&lt;/span&gt; : Any two row events recorded in epoch &lt;span style="font-style: italic;"&gt;n&lt;/span&gt;, affecting the same row, happened in the order they are recorded.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;The ordering properties show that epochs give only a partial order, enough to subdivide the row change 
streams into self-consistent chunks.  Within an epoch, row changes may be interleaved in any way, except that multiple changes to the same row will be recorded in the order they were committed.&lt;br /&gt;&lt;br /&gt;Each epoch transaction contains the row changes for a particular epoch, and that information is recorded in the epoch transaction itself, as an extra WRITE_ROW event on a system table called &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-replication-schema.html"&gt;mysql.ndb_apply_status&lt;/a&gt;.  This WRITE_ROW event contains the binlogging MySQLD's server id and the epoch number.  This event is added so that it will be atomically applied by the Slave along with the rest of the row changes in the epoch transaction, giving an atomically reliable indicator of the replication 'position' of the Slave relative to the Master Cluster in terms of epoch number.  As the epoch number is abstracted from the details of a particular Master MySQLD's binlog files and offsets, it can be used to &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-replication-failover.html"&gt;failover&lt;/a&gt; to an alternative Master.&lt;br /&gt;&lt;br /&gt;We can visualise a MySQL Cluster Binlog as looking something like this.  
Each Binlog transaction contains one 'artificially generated' WRITE_ROW event at the start, and then RBR row events for all row changes that occurred in that epoch.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;    BEGIN&lt;br /&gt;  WRITE_ROW mysql.ndb_apply_status server_id=4, epoch=6998&lt;br /&gt;  WRITE_ROW ...&lt;br /&gt;  UPDATE_ROW ...&lt;br /&gt;  DELETE_ROW ...&lt;br /&gt;  ...&lt;br /&gt;  COMMIT # Consistent state of the database&lt;br /&gt;&lt;br /&gt;  BEGIN&lt;br /&gt;  WRITE_ROW mysql.ndb_apply_status server_id=4, epoch=6999&lt;br /&gt;  ...&lt;br /&gt;  COMMIT # Consistent state of the database&lt;br /&gt;&lt;br /&gt;  BEGIN&lt;br /&gt;  WRITE_ROW mysql.ndb_apply_status server_id=4, epoch=7000&lt;br /&gt;  ...&lt;br /&gt;  COMMIT # Consistent state of the database&lt;br /&gt;  ...&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;A series of epoch transactions, each with a special WRITE_ROW event for recording the epoch on the Slave.  You can see this structure using the &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysqlbinlog.html"&gt;mysqlbinlog&lt;/a&gt; tool with the --verbose option.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Rows tagged with last-commit epoch&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Each row in a MySQL Cluster stores a hidden metadata column which contains the epoch at which a write to the row was last committed.  This information is used internally by the Cluster during node recovery and other operations.  The &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-programs-ndb-select-all.html"&gt;ndb_select_all&lt;/a&gt; tool can be used to see the epoch numbers for rows in a table by supplying the --gci or --gci64 options.  
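&lt;br /&gt;&lt;br /&gt;A toy Python model may help show how such a tag behaves.  The class and attribute names below are hypothetical, and the model ignores everything the real Cluster does (distribution, recovery, the protocol that advances the epoch); it only stamps each write with whatever the current epoch is.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
class EpochTaggedTable:
    """Toy table: each row stores a value and the epoch of its last write."""

    def __init__(self):
        self.current_epoch = 0
        self.rows = {}  # primary key -> (value, last_commit_epoch)

    def advance_epoch(self):
        # The real epoch advances periodically across the cluster;
        # in this toy model the caller advances it by hand.
        self.current_epoch += 1

    def write(self, key, value):
        # Every committed write stamps the row with the current epoch.
        self.rows[key] = (value, self.current_epoch)

    def last_commit_epoch(self, key):
        return self.rows[key][1]


table = EpochTaggedTable()
table.write("x", "first")
table.write("x", "second")   # clock has not advanced since the first write
same_epoch = table.last_commit_epoch("x")
table.advance_epoch()
table.write("x", "third")    # written after the clock advanced
later_epoch = table.last_commit_epoch("x")
```
&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;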
Note that the per-row epoch is not a row version, as two updates to a row in reasonably quick succession will have the same commit epoch.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Epochs and eventual consistency&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Reviewing epochs from the point of view of my previous posts on eventual consistency we see that :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Epochs provide an incrementing logical clock&lt;/li&gt;&lt;li&gt;Epochs are recorded in the Binlog, and therefore shipped to Slaves&lt;/li&gt;&lt;li&gt;Epoch boundaries imply happened-before relationships between events before and after them in the Binlog&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;The properties mean that epochs are almost perfect for monitoring conflict windows in an active-active circular replication setup, with only a few enhancements required.&lt;br /&gt;&lt;br /&gt;I'll describe these enhancements in the next post.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Edit 23/12/11 : Added index&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-3891833330745782871?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/3891833330745782871/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=3891833330745782871' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/3891833330745782871'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/3891833330745782871'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html' title='Eventual Consistency in 
MySQL Cluster - using epochs'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s72-c/image2.gif' height='72' width='72'/><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-254350422842436575</id><published>2011-11-25T12:02:00.005Z</published><updated>2011-11-25T12:22:32.524Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='cluster'/><category scheme='http://www.blogger.com/atom/ns#' term='talking'/><title type='text'>Speaking at Oracle UK User Group conference</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2011.ukoug.org/"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 180px; height: 150px;" src="http://3.bp.blogspot.com/-tY5XvNC2REg/Ts-HPbOAlmI/AAAAAAAAAAQ/e_PHjNNY3LQ/s320/i-am-speaking-at-ukoug-2011-xsmall-copy.gif" alt="" id="BLOGGER_PHOTO_ID_5678906354211788386" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;I will be speaking in the MySQL track of the &lt;a href="http://2011.ukoug.org/"&gt;UK Oracle User Group conference&lt;/a&gt; on 5th December in Birmingham UK.  The title of the session is "Building Highly Available and Scalable, Real Time Services with MySQL Cluster" - full details &lt;a href="http://2011.ukoug.org/default.asp?p=8850&amp;amp;dlgact=shwprs&amp;amp;prs_prsid=6385&amp;amp;day_dayid=56"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I'm not a regular conference attendee, never mind speaker.  
However I'm looking forward to meeting current and potential MySQL users, and also attending some of the talks in the MySQL and other tracks.  Maybe I can learn something about RAC, or Exadata?&lt;br /&gt;&lt;br /&gt;If you are attending and want to talk about MySQL or MySQL Cluster then please track me down and say hello.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Note that this is the first picture I have included in 3 years of posts - maybe I shouldn't wait 3 years for the next one?&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-254350422842436575?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/254350422842436575/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=254350422842436575' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/254350422842436575'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/254350422842436575'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2011/11/speaking-at-oracle-uk-user-group.html' title='Speaking at Oracle UK User Group conference'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-tY5XvNC2REg/Ts-HPbOAlmI/AAAAAAAAAAQ/e_PHjNNY3LQ/s72-c/i-am-speaking-at-ukoug-2011-xsmall-copy.gif' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-9036540843522030105</id><published>2011-10-20T01:05:00.006+01:00</published><updated>2011-12-23T10:45:07.048Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='active-active'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='cluster'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='nosql'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>Eventual Consistency - detecting conflicts</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s1600/image2.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:left;cursor:pointer; cursor:hand;width: 250px; height: 203px;" src="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s320/image2.gif" alt="" id="BLOGGER_PHOTO_ID_5689269172198146146" usemap="#mymap" border="0" /&gt;&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;map name="mymap"&gt;&lt;area shape="rect" coords="0,182,249,200" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html"&gt;&lt;area shape="rect" coords="0,166,249,183" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html"&gt;&lt;area shape="rect" coords="0,147,249,166" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html"&gt;&lt;area shape="rect" coords="0,127,249,147" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html"&gt;&lt;area shape="rect" coords="0,109,249,127" 
href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html"&gt;&lt;area shape="rect" coords="0,92,249,109" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html"&gt;&lt;area shape="rect" coords="0,73,249,92" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html"&gt;&lt;area shape="rect" coords="0,59,249,73" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html"&gt;&lt;area shape="rect" coords="0,37,249,59" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;&lt;area shape="rect" coords="0,0,249,37" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;&lt;/map&gt;&lt;br /&gt;In my &lt;a href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;previous&lt;/a&gt; &lt;a href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html"&gt;posts&lt;/a&gt; I introduced two new conflict detection functions, NDB$EPOCH and NDB$EPOCH_TRANS, without explaining how these functions actually detect conflicts.  To simplify the explanation I'll initially consider two circularly replicating MySQL Servers, A and B, rather than two replicating Clusters, but the principles are the same.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Commit ordering&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Avoiding conflicts requires that data is only modified on one Server at a time.  This can be done by defining Master/Slave roles or Active/Passive partitions etc.  Where this is not done, and data can be modified anywhere, there can be conflicts.  A conflict occurs when the same data is modified at both Servers concurrently, but what does concurrently mean?  On a single server, modifications to the same data are serialised by locking or MVCC mechanisms, so that there is a defined order between them, e.g. 
two modifications MX and MY are committed either in order {MX, MY} or {MY, MX}.&lt;br /&gt;&lt;br /&gt;For the purposes of replication, two modifications MX and MY on the same data are concurrent if the order of commit is different at different servers in the system.  Each server will choose one order, but if they don't all choose the same order then there is a conflict.  Having a different order means that the last modification on each server is different, and therefore the final state of the data can be different on different servers.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://codership.com/"&gt;One way&lt;/a&gt; to avoid conflicts is to get all servers to agree on a commit order before processing an operation - this ensures that all replicas process operations in the same order, waiting if necessary for missing operations to arrive to ensure no commit-order variance.&lt;br /&gt;&lt;br /&gt;Note that commit-ordering is only important between modifications affecting the same data - modifications which do not overlap in their data footprint are unrelated and can be committed in any order.  
A system which totally orders commits may be less efficient than one which only orders conflicting commits.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Happened before&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For the NDB$EPOCH asynchronous conflict detection functions, commit orders are monitored to detect when two modifications to the same data have been committed in different orders.&lt;br /&gt;&lt;br /&gt;Given two modifications MX and MY to the same data, each server will decide a &lt;a href="http://en.wikipedia.org/wiki/Happened-before"&gt;happened before&lt;/a&gt; (denoted -&amp;gt;) relationship between them :&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;MX -&amp;gt; MY  (MX happened before MY)&lt;br /&gt;&lt;br /&gt;or&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;MY -&amp;gt; MX  (MY happened before MX)&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;If all servers agree on order 1, or all servers agree on order 2 then there is no conflict.  If there is any disagreement then there is a conflict.&lt;br /&gt;&lt;br /&gt;In practice, disagreement arises because the same data is modified at both Server A and Server B before the Server A modification is replicated to B and/or vice-versa.&lt;br /&gt;&lt;br /&gt;Sometimes when reading about commit ordering, the reason why commit orders should not diverge is lost - the only reason to care about commit ordering is because it is related to conflicting modifications and the potential for data divergence.&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;Determining happened before from the Binlog&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We assume a steady start state, where both Server A and Server B agree about the state of their data, and no modifications are in-flight.  
If a client of Server A then commits modification MA1 to row X, then from Server A's point of view, MA1 happened before any future modification to row X.&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;MA1 -&amp;gt; M*&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;If a client of Server B commits modification MB1 to row X around the same time (before, or after, or thereabouts), from Server B's point of view, MB1 happened before any future modification to row X.&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;MB1 -&amp;gt; M*&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;Both Servers are correct, and content with their world view.  Note that in general, when committing a modification Mj, a server naturally asserts that from its point of view the modification happened before any as-yet-unseen modification Mk.&lt;br /&gt;&lt;br /&gt;Some time will pass and the replication mechanisms will pull Binlogged changes across and apply them.  When Server B pulls and applies Server A's Binlogged changes, modification MA1 will be applied to row X.  Server B will then naturally be of the opinion that :&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;MB1 -&amp;gt; MA1&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;Independently, Server A will pull Server B's binlogged changes and apply modification MB1 to row X, and will come to the certain opinion that :&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;MA1 -&amp;gt; MB1&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;These happened before relationships are contradictory so there is a conflict.  If nothing is done then A and B will have diverged, with Server A storing the outcome of MB1, and Server B storing the outcome of MA1.&lt;br /&gt;&lt;br /&gt;Note that if the &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/replication-options-slave.html"&gt;--log-slave-updates&lt;/a&gt; server option were on, then Server A's Binlog would have recorded {...MA1...MB1...}, whereas Server B's Binlog would have recorded {...MB1...MA1...}.  
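&lt;br /&gt;&lt;br /&gt;To make the contradiction concrete, here is a minimal Python sketch that compares the relative order of modifications to the same row in two such Binlogs.  The list-of-tuples representation and the function name are assumptions for illustration only; this is not the Binlog format, nor a description of how the Slave really does it.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
from itertools import combinations


def conflicting_pairs(binlog_a, binlog_b):
    """Return pairs of same-row modifications whose order differs.

    Each binlog is a list of (modification, row) tuples in commit order.
    """
    position_b = {mod: index for index, (mod, _) in enumerate(binlog_b)}
    conflicts = []
    # combinations() keeps binlog_a order: first was committed before second.
    for (first, row1), (second, row2) in combinations(binlog_a, 2):
        if row1 != row2:
            continue  # different rows: commit order does not matter
        if first not in position_b or second not in position_b:
            continue  # not yet recorded on the other server
        if position_b[first] > position_b[second]:
            conflicts.append((first, second))
    return conflicts


# Server A recorded MA1 then MB1; Server B recorded MB1 then MA1.
binlog_a = [("MA1", "X"), ("MB1", "X"), ("MA2", "Y")]
binlog_b = [("MB1", "X"), ("MA1", "X"), ("MA2", "Y")]
result = conflicting_pairs(binlog_a, binlog_b)
```
&lt;/pre&gt;&lt;br /&gt;The sketch reports MA1 and MB1 as a conflicting pair on row X, and ignores MA2 because it touches a different row.&lt;br /&gt;&lt;br /&gt;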
By recording when the Slave applies replicated updates in the Binlog, we record the commit order of the replicated updates relative to other local updates, and encode the happened before relationship in the relative positions of events in the Binlog.&lt;br /&gt;&lt;br /&gt;The Binlog is of course transferred between servers, so in a circular replication setup, Server A can become aware of the happened before information from Server B and vice-versa by examining the received Binlogs.  The Slave SQL thread examines Binlogs as it applies them, so it can be extended to extract happened before information, and use it to detect conflicts.&lt;br /&gt;&lt;br /&gt;Recall that Server A asserts that its committed modification to row X (MA1) happened before any as-yet-unseen replicated modification :&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;MA1 -&amp;gt; M*&lt;/blockquote&gt;&lt;br /&gt;Therefore, to detect a conflict, Server A only needs to detect the case where the incoming Binlog from Server B implies that some modification MB* to row X happened before Server A's already committed modification MA1.&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;If Server B's Binlog implies MB* -&amp;gt; MA1  then there has been a conflict&lt;/blockquote&gt;&lt;br /&gt;This is in essence how the NDB$EPOCH functions work - the Binlog is used to capture happened before relationships which are checked to determine whether conflicting concurrent modifications have occurred.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Conflict Windows&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;In the previous example, Server A commits MA1 modifying row X, and Server B commits MB1 also modifying row X.  From Server A's point of view, as soon as it commits MA1, there is potential for a replicated modification from B such as MB1 to be found in-conflict with MA1.  We say that from Server A's point of view a window of potential conflict on row X opened when MA1 was committed.  
Server A monitors Server B's Binlog as it is applied, and when it reaches the point where the commit of MA1 at Server B is recorded, Server A knows that any further MB* recorded in Server B's Binlog after this cannot have happened before MA1, so the window of potential conflict on row X has closed.&lt;br /&gt;&lt;br /&gt;We define the window of potential conflict on a row X as the time between the commit of a modification M1, and the Slave processing of an event in a replicated Binlog indicating that modification M1 has been applied on the other server(s) in the replication loop.&lt;br /&gt;&lt;br /&gt;Any incoming replicated modification M2 also affecting row X while it has an open conflict window is in conflict with M1, as it must appear to have happened-before M1 to the server which committed it.&lt;br /&gt;&lt;br /&gt;Observations about the window of potential conflict :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It is defined per committed modification per disjoint data set&lt;br /&gt;&lt;/li&gt;&lt;li&gt;It can be extended by further modifications to the same data from the same server&lt;br /&gt;The window does not close until all further modifications have been fully replicated&lt;/li&gt;&lt;li&gt;Window duration is dependent on the replication round-trip delay&lt;br /&gt;which can vary greatly&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Once it closes, further modifications to the same data from anywhere are safe, but will each open their own window of potential conflict.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;From the point of view of one Server, conflicts can occur at any time until the conflict window is closed&lt;br /&gt;&lt;/li&gt;&lt;li&gt;From the point of view of one Server, the duration of the window of potential conflict is similar to&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Replication Propagation Delay A to B&lt;/span&gt; + &lt;span style="font-style: italic;"&gt;Replication Propagation Delay B to A&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;These delays may 
not be symmetric.&lt;/li&gt;&lt;li&gt;From the point of view of an external observer/actor, the system will detect two modifications MA1 and MB1 committed at times tMA1 and tMB1 as in-conflict if&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;tMB1 - tMA1 &amp;lt; Replication Propagation Delay A to B&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;( A before B, but not by enough to avoid conflict )&lt;br /&gt;&lt;br /&gt;or&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;tMA1 - tMB1 &amp;lt; Replication Propagation Delay B to A&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;( B before A, but not by enough to avoid conflict )&lt;/li&gt;&lt;li&gt;The window of potential conflict can only be as short as the replication propagation delay between systems, which can tend towards, but never reach, zero.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Tracking conflict windows with a logical clock&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A row's conflict window opens when a modification is committed to it, and closes when the Slave processes an event indicating that the modification was committed on the other server(s).  How can we track all of these independent conflict windows?  If only we had a database :)&lt;br /&gt;&lt;br /&gt;This is solved by maintaining a per-server &lt;a href="http://en.wikipedia.org/wiki/Logical_clock"&gt;logical clock&lt;/a&gt;, which increments periodically.  Each modification to a row sets a hidden metacolumn of the row to the current value of the server's logical clock.  This gives each row a kind of coarse logical timestamp.  When the logical clock increments, an event is included in the Binlog to record the transition.  
Further, all row events for modifications with logical clock value X are stored in the Binlog before any row events for modifications with logical clock value X+1.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt; Server A Binlog events    ClockVal stored in DB&lt;br /&gt;                         by Modification&lt;br /&gt;&lt;br /&gt;...&lt;br /&gt;MA1                       39&lt;br /&gt;MA2                       39&lt;br /&gt;MA3                       39&lt;br /&gt;ClockVal_A = 40&lt;br /&gt;MA4                       40&lt;br /&gt;MA5                       40&lt;br /&gt;ClockVal_A = 41&lt;br /&gt;MA6                       41&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;When a Slave applies the Binlog, the ClockVal events are passed through into its Binlog, and are then made available to the original server in a circular configuration.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt; Server B Binlog events&lt;br /&gt;&lt;br /&gt;...&lt;br /&gt;MB1&lt;br /&gt;MB2&lt;br /&gt;ClockVal_A = 40&lt;br /&gt;MB3&lt;br /&gt;MB4&lt;br /&gt;ClockVal_B = 234&lt;br /&gt;MB5&lt;br /&gt;MB6&lt;br /&gt;ClockVal_A = 41&lt;br /&gt;MB7&lt;br /&gt;...&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Using the Binlog ordering, we can see that ClockVal_A = 40 happened before MB3 and MB4 at Server B.  This implies that MA1, MA2 and MA3 happened before MB3 and MB4 at server B.&lt;br /&gt;&lt;br /&gt;When applying Server B's Binlog to Server A, the Slave at Server A maintains a maximum replicated clock value, which increases as it observes its ClockVal_A events returned.  
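&lt;br /&gt;&lt;br /&gt;As a rough sketch only (this is not the NDB implementation), the bookkeeping at Server A could look something like the Python below.  The event tuples, the row keys assigned to the MB events and the function name are all made up for illustration, and the comparison used is the one given in the text that follows.  Real conflict handling, such as updating row epochs, is left out.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
def scan_returned_binlog(events, row_clockvals, own_server="A"):
    """Replay Server B's Binlog at Server A and collect conflicting row events.

    events: list of ("clock", server, value) or ("row", name, row_key) tuples
            (a hypothetical encoding of the Binlog listing above).
    row_clockvals: dict mapping row key to the clock value stored with the
            row by the latest local modification at Server A.
    """
    max_replicated = 0  # nothing of ours has been seen coming back yet
    conflicts = []
    for event in events:
        if event[0] == "clock":
            _, server, value = event
            # Only our own clock transitions, returned via B, advance the maximum.
            if server == own_server:
                max_replicated = max(max_replicated, value)
        else:
            _, name, row_key = event
            stored = row_clockvals.get(row_key)
            # A row never modified locally has no open conflict window.
            if stored is not None and stored >= max_replicated:
                conflicts.append(name)
    return max_replicated, conflicts


# Server B's Binlog from the listing above, with invented row keys.
server_b_binlog = [
    ("row", "MB1", "w"), ("row", "MB2", "x"),
    ("clock", "A", 40),
    ("row", "MB3", "x"), ("row", "MB4", "y"),
    ("clock", "B", 234),
    ("row", "MB5", "z"), ("row", "MB6", "y"),
    ("clock", "A", 41),
    ("row", "MB7", "y"),
]
# Row x was last modified locally at clock 39, row y at clock 40.
rows_at_a = {"x": 39, "y": 40}

max_seen, conflicts = scan_returned_binlog(server_b_binlog, rows_at_a)
# max_seen is 41 and conflicts is ["MB2", "MB4", "MB6"]
```
&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;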
When applying a row event originating from Server B, the affected row's stored clock value is first compared to the maximum replicated clock value to determine whether the row event from B conflicts with the latest committed change to the row at Server A.&lt;br /&gt;&lt;br /&gt;The two modifications are in conflict if the stored row's clock value is greater than or equal to the maximum replicated clock value.&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;  in_conflict = row_clockval &amp;gt;= maximum_replicated_clockval&lt;/blockquote&gt;&lt;br /&gt;Using a logical clock to track conflict windows has the following benefits :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Automatic update on commit of row modification, opening conflict window&lt;/li&gt;&lt;li&gt;Automatic extension of conflict window on further modification on row with open conflict window.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Automatic closure of conflict window on maximum replicated clock value exceeding row's stored value&lt;/li&gt;&lt;li&gt;Efficient storage cost per row - one clock value.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Efficient runtime processing cost - inequality comparison between maximum replicated clock value and row's stored clock value.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;As you might have guessed, NDB$EPOCH uses the MySQL Cluster epoch values as a logical clock to detect conflicts.  The details of this will have to wait for yet another post.  In my first two posts on this subject I thought, 'one more post and I can finish describing this', but here I am at three posts and still not finished.  Hopefully the next will get more concrete and finally describe the mysterious workings of NDB$EPOCH.  
We're getting closer, honest.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Edit 23/12/11 : Added index&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-9036540843522030105?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/9036540843522030105/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=9036540843522030105' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/9036540843522030105'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/9036540843522030105'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html' title='Eventual Consistency - detecting conflicts'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s72-c/image2.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-1053602935500819157</id><published>2011-10-12T14:00:00.000+01:00</published><updated>2011-10-12T14:36:53.826+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='nosql'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category 
scheme='http://www.blogger.com/atom/ns#' term='parallel'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>Some MySQL projects I think are cool - Shard-Query</title><content type='html'>I've already described Justin Swanhart's Flexviews project as something &lt;a href="http://messagepassing.blogspot.com/2010/09/some-mysql-projects-i-think-are-cool.html"&gt;I think is cool&lt;/a&gt;.  Since then Justin appears to have been working more on &lt;a href="http://code.google.com/p/shard-query/"&gt;Shard-Query&lt;/a&gt; which I also think is cool, perhaps even more so than Flexviews.&lt;br /&gt;&lt;br /&gt;On the page linked above, Shard-Query is described using the following statements :&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;"Shard-Query is a distributed parallel query engine for MySQL"&lt;br /&gt;&lt;/blockquote&gt;&lt;blockquote&gt;"ShardQuery is a PHP class which is intended to make working with a partitioned dataset easier"&lt;/blockquote&gt;&lt;blockquote&gt;"ParallelPipelining  - MPP distributed query engines runs fragments of queries in parallel,  combining the results at the end.  Like map/reduce except it speaks SQL  directly."&lt;br /&gt;&lt;br /&gt;&lt;/blockquote&gt;The things I like from the above description :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Distributed&lt;/li&gt;&lt;li&gt;Parallel&lt;/li&gt;&lt;li&gt;MySQL&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Partitioned&lt;/li&gt;&lt;li&gt;Fragments&lt;/li&gt;&lt;li&gt;Map/Reduce&lt;/li&gt;&lt;li&gt;SQL&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;The things that scare me :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.php.net/"&gt;PHP&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;My fear of PHP is most likely groundless, based on experiences circa 1998.  I suspect it runs much of the real money-earning web, and perhaps brings my scribblings to you.  However, the applicability of Shard-Query seems so general, that to actually (or apparently) limit it to web-oriented use cases seems a shame.  
In any case I am not hipster enough to know which language would be better - OCaml?  Dart?  Only joking.  I suppose that if the &lt;a href="http://forge.mysql.com/wiki/MySQL_Proxy"&gt;MySQL Proxy&lt;/a&gt; could do something along these lines then the language debate would be moot.&lt;br /&gt;&lt;br /&gt;I am likely to fall foul of the lack-of-original-content test if I quote too much from the Shard-Query website, but the How-it-works section seems relevant here.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;How it works&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The query is parsed using &lt;a href="http://code.google.com/p/php-sql-parser" rel="nofollow"&gt;http://code.google.com/p/php-sql-parser&lt;/a&gt;&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;A modified version of the query is executed on each shard.&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The queries are executed &lt;i&gt;in parallel&lt;/i&gt; using &lt;a href="http://gearman.org/" rel="nofollow"&gt;http://gearman.org&lt;/a&gt;&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;a href="http://gearman.org/" rel="nofollow"&gt;&lt;br /&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;The results from each shard are combined together  &lt;/li&gt;&lt;li&gt;A version of the original query is then executed over the combined results&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;All aggregation is done on the slaves (pushed down) &lt;/li&gt;&lt;li&gt;Queries with inlists can be made into parallel queries.   &lt;/li&gt;&lt;li&gt;A  callback can be used for QueryRouting.  You provide a partition column,  and a callback which returns information pointing to the correct shard.   
The most convenient way to do this is with &lt;a href="http://code.google.com/p/shard-key-mapper" rel="nofollow"&gt;Shard-Key-Mapper&lt;/a&gt; &lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Query rewriting rules&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The core of Shard-Query is its set of query rewriting rules, which Justin introduces in&lt;span style="font-size:100%;"&gt; a blog post entitled &lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;span class="subj-link"&gt;&lt;a href="http://swanhart.livejournal.com/134177.html"&gt;8 substitution rules for running SQL in parallel&lt;/a&gt;.  These transforms and substitutions allow Shard-Query to execute a user-supplied query across multiple database shards.  A single query (SELECT) can be mapped into a query to be applied to some, or all shards, and further queries to be used to merge the results of the per-shard queries into a final result.&lt;br /&gt;&lt;br /&gt;Compared to a single system query, the consistency of the view that the sharded query executes against is less well defined, but this may well be acceptable for some applications.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;On a single server&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A single MySQL instance offers inter-query parallelism, but currently has very limited intra-query parallelism.  Shard-Query can circumvent this by splitting a single query into multiple sharded sub-queries which can run in different MySQLD threads (as they are each submitted by different clients) to give intra-query parallelism.  To me this seems more of a cool side effect and proof of reasonable implementation efficiency, than a real reason to use Shard-Query.  Perhaps someone out there has the perfect use case for this.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Across multiple servers&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The big-name MySQL users scale out with MySQL, storing subsets of data on separate MySQL instances.  
Shard-Query allows SQL queries spanning all shards to be executed.  This is what scaled-out MySQL has been waiting for.&lt;br /&gt;&lt;br /&gt;I don't think it would be a good idea to run heavy traffic through Shard-Query to access a set of sharded MySQL instances yet, but Shard-Query gives a great way to perform occasional queries across all shards.  This could be great for reporting and perhaps some light mining for patterns, trends etc.  The ability to query across live real time data may be a real gain.&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;Loose coupling and availability&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Scaling out via sharding standalone database servers has many difficulties, but the independence of each shard can benefit availability relative to a more tightly coupled distributed system.   The loose coupling of the MySQL instances means that it's far less likely that the failure of one shard will drag others down with it, increasing system availability.   Shard-Query can give the loosely coupled shards a smoother facade.  The limited set of capabilities that Shard-Query gives over the set of shards may well be more than good enough.  Note that 'good enough' is a recurring theme in this 'MySQL projects I think are cool' series.  
Often 'best' results in expensive or unnecessary compromises as a side-effect of trying to please everybody all the time.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Yet another MySQL sharded scaleout design&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Looking at Shard-Query, and the recent MySQL-NoSQL Api developments, it seems like a modern MySQL sharded scaleout design might make use of :&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;ul&gt;&lt;li&gt;MySQL SQL &lt;a href="http://en.wikipedia.org/wiki/MySQL#Platforms_and_interfaces"&gt;Apis&lt;/a&gt; (PHP, JDBC, ODBC, Ruby, Python, ....)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;NoSQL access mechanisms&lt;br /&gt;(&lt;a href="http://messagepassing.blogspot.com/2010/10/some-mysql-projects-i-think-are-cool.html"&gt;HandlerSocket&lt;/a&gt;, Memcached(&lt;a href="http://blogs.innodb.com/wp/2011/04/nosql-to-innodb-with-memcached/"&gt;1&lt;/a&gt;,&lt;a href="http://mysqlblog.lenoxway.net/index.php?/archives/14-MySQL-Cluster-and-Memcached-Together-at-Last.html"&gt;2&lt;/a&gt;))&lt;/li&gt;&lt;li&gt;ShardQuery for SQL reporting / analysis&lt;/li&gt;&lt;/ul&gt;Per-instance efficiency can be maximised by using the NoSQL access Apis, single-instance SQL is still available if required for the application, and a global SQL view is also available.&lt;br /&gt;&lt;br /&gt;This combination of scalability, efficiency and SQL query-ability could be a sweet spot in the increasingly confusing multi-dimensional space of high throughput distributed databases.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-1053602935500819157?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/1053602935500819157/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=1053602935500819157' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/1053602935500819157'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/1053602935500819157'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2011/06/some-mysql-projects-i-think-are-cool.html' title='Some MySQL projects I think are cool - Shard-Query'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-1842601198918442493</id><published>2011-10-10T01:26:00.005+01:00</published><updated>2011-12-23T10:44:10.216Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='active-active'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='cluster'/><category scheme='http://www.blogger.com/atom/ns#' term='latency-hiding'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='nosql'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>Eventual consistency with transactions</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s1600/image2.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:left;cursor:pointer; cursor:hand;width: 250px; 
height: 203px;" src="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s320/image2.gif" alt="" id="BLOGGER_PHOTO_ID_5689269172198146146" usemap="#mymap" border="0" /&gt;&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;map name="mymap"&gt;&lt;area shape="rect" coords="0,182,249,200" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html"&gt;&lt;area shape="rect" coords="0,166,249,183" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html"&gt;&lt;area shape="rect" coords="0,147,249,166" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html"&gt;&lt;area shape="rect" coords="0,127,249,147" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html"&gt;&lt;area shape="rect" coords="0,109,249,127" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html"&gt;&lt;area shape="rect" coords="0,92,249,109" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html"&gt;&lt;area shape="rect" coords="0,73,249,92" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html"&gt;&lt;area shape="rect" coords="0,59,249,73" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html"&gt;&lt;area shape="rect" coords="0,37,249,59" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;&lt;area shape="rect" coords="0,0,249,37" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;&lt;/map&gt;&lt;br /&gt;In my last post I &lt;a href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;described&lt;/a&gt; the motivation for the new NDB$EPOCH conflict detection function in &lt;a href="http://en.wikipedia.org/wiki/MySQL_Cluster"&gt;MySQL Cluster&lt;/a&gt;.  
This function detects when a row has been concurrently updated on two asynchronously replicating MySQL Cluster databases, and takes steps to keep the databases in alignment.&lt;br /&gt;&lt;br /&gt;With NDB$EPOCH, conflicts are detected and handled on a row granularity, as opposed to column granularity, as this is the granularity of the epoch metadata used to detect conflicts.  Dealing with conflicts on a row-by-row basis has implications for schema and application design.  The NDB$EPOCH_TRANS function extends NDB$EPOCH, giving stronger consistency guarantees and reducing the impact on applications and schemas.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Concurrency control in a single synchronous system&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;MySQL Cluster is a relational system.  Data is stored in tables with defined schemas of typed columns.  As with any relational system, real-world concepts can be &lt;a href="http://en.wikipedia.org/wiki/Relational_model"&gt;modelled&lt;/a&gt; in a number of ways with different trade offs.  One such consideration is the level of normalisation applied to a data model.  Transactions and concurrency control ensure that some data modelled using multiple tables, rows and columns, appears to any external observer to move instantaneously between stable, self consistent states.  This is a powerful simplification, and eases the complexity burden on application writers.  Each transaction provides the illusion of serialised access to the database.  Multiple transactions can execute in parallel, so long as they do not interfere by accessing the same data.  Where transactions do interfere, some real serialisation can occur.  
In practice, applications depend on the serialisation and atomicity guarantees given by transactions, often in ways not fully made explicit or understood by the application designers.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Concurrency control in independent, asynchronously replicated systems&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Asynchronously replicating writes between two independent systems erodes the guarantees given by single system concurrency control.  Each system maintains its transactional guarantees in parallel, and incorporates modifications from the other system asynchronously, at some time after they were originally committed.  Where the same row is modified on both systems concurrently, two versions of the same row are produced, and there is no longer a single history of values for the given row.  This can cause replicas to diverge.  Note that the window of 'concurrency', or 'potential conflict' is related to the time taken for a committed update to be applied on all replicas.  This is similar, or equivalent to the commit delay experienced by a synchronous 2-phase commit system.&lt;br /&gt;&lt;br /&gt;Conflicts can be detected using some form of conflict detection.  On detecting a conflict, steps can then be taken to avoid divergence, and resolve any unwanted effects of the concurrent writes.&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;Replica divergence, &lt;/span&gt;&lt;span style="font-size:130%;"&gt;external effects and cascading impacts&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Divergence can be avoided if conflicting writes can be merged in some way.  Some write conflicts may be equivalent, associative or otherwise mergeable, especially if the operations are replicated rather than their resulting states.  
However merging requires specific schema and application knowledge to determine how to merge conflicting writes.&lt;br /&gt;&lt;br /&gt;More generally, divergence can be avoided by rejecting one or both conflicting writes.  This is the approach we have taken, with handling of rejected writes delegated to the application, where the knowledge exists to handle them via the exceptions table mechanism.&lt;br /&gt;&lt;br /&gt;However write conflicts are handled, it is important to consider :&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Cascading impacts on dependent operations&lt;br /&gt;Operations based on the results of conflicting operations may themselves require handling to avoid divergence.&lt;/li&gt;&lt;li&gt;Real world / other system effects based on conflicting writes&lt;br /&gt;Maintaining database consistency does not guarantee that real world effects have been correctly compensated.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;A database system does not exist in a vacuum.  Operations are performed to reflect external world, or external system events.  When the effects of an operation are later reverted, the real world effects may also require some compensating actions.  These external world compensating actions are beyond the scope of any DBMS system and are application specific.  In a real application of this technology, this is probably the most important part of the design.&lt;br /&gt;&lt;br /&gt;Any particular conflict originates between two concurrent operations, but once a conflicting operation is committed, other operations can read its results, and commit their own, expanding the impact of the original conflict.  Conflicts are discovered asynchronously, some time after the original operations are committed, so there can be a large number of subsequent operations in the replication pipeline which depend on the conflicting operations at the point they are discovered.  
All of the invalidated subsequent operations must be handled.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Row based conflict detection and data shearing&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Using row-based conflict detection and re-alignment can counteract data divergence so that rows become consistent eventually, but this comes at the cost of eroding the atomicity of committed transactions.  For example, a committed transaction which writes to three rows may, after conflict handling, have none, one, two or all three row changes reverted.&lt;br /&gt;&lt;br /&gt;Within a single system, the two potentially visible states were :&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Before transaction (All rows at version 1 : Av1, Bv1, Cv1)&lt;/li&gt;&lt;li&gt;After transaction (All rows at version 2 : Av2, Bv2, Cv2)&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;With row based conflict detection, ignoring the row variants we're actually conflicting with, there could be :&lt;br /&gt;&lt;br /&gt;Before transaction : (All rows at version 1 :  Av1, Bv1, Cv1)&lt;br /&gt;&lt;br /&gt;After transaction&lt;ol&gt;&lt;li&gt;Av2, Bv2, Cv2  (All rows at version 2)&lt;/li&gt;&lt;li&gt;Av2, Bv2, &lt;span style="font-weight: bold;"&gt;Cv1&lt;/span&gt;  (Cv2 reverted)&lt;/li&gt;&lt;li&gt;Av2, &lt;span style="font-weight: bold;"&gt;Bv1&lt;/span&gt;, Cv2  (Bv2 reverted)&lt;/li&gt;&lt;li&gt;Av2, &lt;span style="font-weight: bold;"&gt;Bv1&lt;/span&gt;, &lt;span style="font-weight: bold;"&gt;Cv1&lt;/span&gt;  (Bv2, Cv2 reverted)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Av1&lt;/span&gt;, Bv2, Cv2  (Av2 reverted)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Av1&lt;/span&gt;, Bv2, &lt;span style="font-weight: bold;"&gt;Cv1&lt;/span&gt;  (Av2, Cv2 reverted)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Av1&lt;/span&gt;, &lt;span style="font-weight: bold;"&gt;Bv1&lt;/span&gt;, Cv2  (Av2, Bv2 reverted)&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: 
bold;"&gt;Av1&lt;/span&gt;, &lt;span style="font-weight: bold;"&gt;Bv1&lt;/span&gt;, &lt;span style="font-weight: bold;"&gt;Cv1&lt;/span&gt;  (Av2, Bv2, Cv2 reverted)&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;Depending on the concept that the distinct rows A,B,C represented, this can vastly increase the complexity of understanding the data.  If A, B and C model entirely separate entities, which just happened to be transactionally updated together then there may be no problem if they fare differently in conflict detection.  If they model portions of the state of a larger entity then reasoning about the state of that entity becomes complex.&lt;br /&gt;&lt;br /&gt;This potential chopping up of changes committed in a transaction can be described as shearing of the data model represented by the schema.  In practice, the potential for shearing between rows implies that for tables with conflicts handled on a row basis, cross row consistency is not available.  This in turn implies that the schema must be modified to ensure that data items which cannot tolerate relative shear are placed in the same row so that they share the same fate and remain self-consistent.  This single-row limit to consistency is native and natural to some NoSQL / key-value / wide column store products, but is a weakening of the normal guarantees in a transactional system.&lt;br /&gt;&lt;br /&gt;Requiring that schemas and applications using conflict detection can tolerate shear between any two rows is quite a heavy burden to place on applications, especially those not written with eventual consistency in mind.  
Is there some way to support optimistic conflict detection without breaking up committed transactions, and shearing rows?&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;Transaction based conflict detection&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;/span&gt;&lt;br /&gt;One way to avoid inter-row shearing is to perform conflict detection on a row-by-row basis, but on discovering a conflict, take action on a transaction basis.  More concretely, when a row conflict is discovered, any other rows written as part of the same transaction should also be considered in-conflict by implication.  This reduces the set of stable states back to the original case - all rows at version 1 or all rows at version 2.&lt;br /&gt;&lt;br /&gt;Where a row is found to be in-conflict with some replicated row operation, a further replicated row operation on the same row should also be found to be in-conflict, until the conflict condition has been cleared.  This property is implicitly implemented in the existing row based conflict detection functions.&lt;br /&gt;&lt;br /&gt;When the scope of a conflict is extended to include all row modifications in a transaction, this implies that all following replicated row operations which affect the same rows must also be in conflict by implication.  To avoid row shearing, these implied-in-conflict rows must implicate the other rows in their transactions, and those rows may in turn implicate other rows.  The overall effect is that a single row conflict must cause its transaction, and all dependent transactions, to be considered to be in conflict.&lt;br /&gt;&lt;br /&gt;From our database-centric point of view, transactions can only become dependent on each other through the data they access in the database.  If transaction X updates rows A and B, and transaction Y then reads row B and updates row C, then we can say that transaction Y has a read-write dependency on transaction X via row B.  
We cannot tell whether there is some other out-of-band communication between transactions.&lt;br /&gt;&lt;br /&gt;By tracking this transaction 'footprint' information, and looking for row overlaps, we can determine transaction dependencies.  This is how the new NDB$EPOCH_TRANS function provides transactional conflict detection.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;NDB$EPOCH_TRANS conflict detection function&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;The NDB$EPOCH_TRANS conflict function uses the same mechanism as the NDB$EPOCH function to detect concurrent updates to the same row across two clusters.  However, once a row conflict has been detected in an operation which is part of a replicated transaction, all other operations in that replicated transaction are considered to be in conflict.  Furthermore, any transactions found to be dependent on that transaction are also considered in conflict.  Once the full set of in-conflict transactions has been determined, the affected rows are handled in the same way as in NDB$EPOCH.&lt;br /&gt;&lt;br /&gt;Specifically :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The replicated operations are not applied&lt;/li&gt;&lt;li&gt;The exceptions table(s) are populated with the affected primary keys&lt;/li&gt;&lt;li&gt;The affected row epochs are updated&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Realignment Binlog events are generated to (eventually) realign the Secondary cluster&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;As with NDB$EPOCH, NDB$EPOCH_TRANS is asymmetric, so the Primary Cluster always wins when a conflict is detected.  As with NDB$EPOCH, this allows applications needing pessimistic properties to obtain them by accessing the Primary Cluster.  Applications which can handle the relaxed consistency of optimism can access either Cluster.  With NDB$EPOCH_TRANS, transactions committed on the Secondary Cluster are guaranteed to be atomic, whether or not they are later found to be in conflict.  
Each committed transaction will either be unaffected by conflict detection, or be completely reverted.  There will be no row shear.&lt;br /&gt;&lt;br /&gt;This slightly stronger optimistic consistency guarantee may ease the implementation of relaxed consistency / eventually consistent applications.  For example, where some concept is modelled by a number of different rows in different tables, any transactional modification will either be atomically applied, or not applied at all, so the relationships between the rows affected by a transaction will be preserved.  The need to flatten a schema into single-row entities is reduced, although careful design is still required to get a good understanding of transaction boundaries, and the behaviour of the overall system when transactions are reverted.&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;Transaction dependency tracking&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;NDB$EPOCH_TRANS is built into the MySQL Cluster Storage Engine.  It is active in the normal MySQL Slave SQL thread, as part of the normal table handler calls made when applying a replicated Binlog.  The NDB$EPOCH_TRANS code in the Ndb storage engine tracks transaction dependencies based on the primary keys accessed by row events in the Binlog, and their transaction ids.  If two row events have the same table and primary key values, then they affect the same row.  If two events affect the same row, and are in different transactions, then the second transaction depends on the first.  In this way, a transaction dependency graph is built by the MySQL Cluster Storage Engine as row events are applied by the Slave from a replicated Binlog.  This graph is then used to find dependencies when a conflict is detected.&lt;br /&gt;&lt;br /&gt;A Binlog only contains WRITE_ROW, UPDATE_ROW and DELETE_ROW events.  This means that we only detect dependencies between transactions which write the same rows.  We do not currently track dependencies between writers and readers. 
 For example :&lt;br /&gt;&lt;br /&gt;Transaction A : {Write row X, Write row &lt;span style="font-weight: bold;"&gt;Y&lt;/span&gt;}&lt;br /&gt;Transaction B : {Read row &lt;span style="font-weight: bold;"&gt;Y&lt;/span&gt;, Write row Z}&lt;br /&gt;&lt;br /&gt;Binlog : {{Tx A : Wr X, Wr Y}, {Tx B : Wr Z}}&lt;br /&gt;&lt;br /&gt;In this example, the dependency of Transaction B on Transaction A is not recorded in the Binlog, and so the Slave is not aware of it.  This would result in the write to row Z not being considered in conflict, when it should be.&lt;br /&gt;&lt;br /&gt;A future improvement is to add selective tracking of reads to the Binlog, so that Write -&amp;gt; Read dependencies will implicate reading transactions when a conflict is discovered.&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;There's more to come&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Another long dry post, best consumed with your favourite drink in hand.  As I mentioned last time, these functions are pushed, and available in the latest releases of MySQL Cluster.  I'd be happy to hear from anyone who wants to try them out and give feedback.  I've been deliberately light with implementation details thus far, as I'm saving those for yet another posting.  I think that some of the implementation details are interesting from a replication point of view, even if you're not interested in these particular conflict detection algorithms.  
You may disagree :)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Edit 23/12/11 : Added index&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-1842601198918442493?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/1842601198918442493/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=1842601198918442493' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/1842601198918442493'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/1842601198918442493'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html' title='Eventual consistency with transactions'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s72-c/image2.gif' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-2589003264580418728</id><published>2011-10-03T13:50:00.005+01:00</published><updated>2011-12-23T10:41:46.068Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='active-active'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='cluster'/><category 
scheme='http://www.blogger.com/atom/ns#' term='latency-hiding'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='nosql'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>Eventual consistency with MySQL</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s1600/image2.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:left;cursor:pointer; cursor:hand;width: 250px; height: 203px;" src="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s320/image2.gif" alt="" id="BLOGGER_PHOTO_ID_5689269172198146146" usemap="#mymap" border="0" /&gt;&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;map name="mymap"&gt;&lt;area shape="rect" coords="0,182,249,200" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html"&gt;&lt;area shape="rect" coords="0,166,249,183" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html"&gt;&lt;area shape="rect" coords="0,147,249,166" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html"&gt;&lt;area shape="rect" coords="0,127,249,147" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html"&gt;&lt;area shape="rect" coords="0,109,249,127" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html"&gt;&lt;area shape="rect" coords="0,92,249,109" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html"&gt;&lt;area shape="rect" coords="0,73,249,92" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html"&gt;&lt;area shape="rect" coords="0,59,249,73" 
href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html"&gt;&lt;area shape="rect" coords="0,37,249,59" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;&lt;area shape="rect" coords="0,0,249,37" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html"&gt;&lt;/map&gt;&lt;br /&gt;&lt;blockquote style="font-style: italic;"&gt;tl;dr : New 'automatic' optimistic conflict detection functions available giving the best of both optimistic and pessimistic replication on the same data&lt;/blockquote&gt;&lt;br /&gt;MySQL replication supports a number of topologies, and one of the most interesting is an active-active, or master-master topology, where two or more Servers accept read and write traffic, with asynchronous replication between them.&lt;br /&gt;&lt;br /&gt;This topology has a number of attractions, including :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Potentially higher availability&lt;/li&gt;&lt;li&gt;Potentially low impact on read/write latency&lt;/li&gt;&lt;li&gt;Service availability insensitive to replication failures&lt;/li&gt;&lt;li&gt;Conceptually simple&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;However, data consistency is hard to maintain in this environment.  Data, and access to it, must usually be partitioned or otherwise controlled, so that the consistency of reads is acceptable, and to avoid lost writes or badly merged concurrent writes.  
Implementing a distributed data access partitioning scheme which can safely handle communication failures is not simple.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Relaxed read consistency&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Relaxed read consistency is a fairly well understood concept, with many Master-Slave topologies deployed where some read traffic is routed to the Slave to offload the Master and get 'read scaling'.&lt;br /&gt;Generally this is acceptable as :&lt;br /&gt;&lt;ol&gt;&lt;li&gt;A Read-only Slave's state is self-consistent.  It is a state which, at least logically, existed on the Master at some time in the past.&lt;/li&gt;&lt;li&gt;The reading application can tolerate some level of read-staleness w.r.t. the most recently committed writes to the Master&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;A surprisingly large number of applications can manage with a stale view as long as it is self-consistent.&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;Read-your-writes&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Applications requiring 'read your writes' consistency (or session consistency) must either read from the Master, or wait until the Slave has replicated up to at least the point in time where the application's last write committed on the Master before reading from it.  It is simpler and less delay-prone to just read from the Master, but this increases the load on the Master, reducing the ability of a system to read-scale.  When the Master is unavailable, some sort of failover is required, and therefore, some sort of recovery process is also required.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Partitioned Active-Active/ Balanced Master-Slave&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Rather than treating a whole replica as either Master or Slave, we can have each replica be both a Master and a Slave.  
The partitioning could be on database level, table level, or some function of the rows contained in tables, perhaps some key prefix which maps to application entities.  Balancing the Master/Slave role in this way allows the request load to be balanced, reducing issues with a single system providing the Master 'role' having to do more work.&lt;br /&gt;&lt;br /&gt;In this configuration, rather than talking about Master and Slave, it makes more sense to talk about some partition of data being 'Active' on one replica, and 'Backup' on the others.  Read requests routed to the Active replica will be guaranteed to get the latest state, whereas the Backup replicas can potentially return stale states.  Write requests should always be routed to the Active replica to avoid potential races between concurrent writes.&lt;br /&gt;&lt;br /&gt;Implementing a partitioned replicated system like this generally requires application knowledge to choose a partitioning scheme where expected transaction footprints align with the partitioning scheme, and cross-partition transactions are rare/non-existent.  Additionally, it requires application modification, or a front-end routing mechanism to ensure that requests are correctly routed.  The routing system must also be designed to re-route in cases of communication or system failure, to ensure availability, and avoid data divergence.  After a failure, recovery must take care to ensure replicas are resynchronised before restoring Active status to partitions in a recovered replica.&lt;br /&gt;&lt;br /&gt;Implementing a partitioned replicated system with request routing, failover and recovery is a complex undertaking.  Additionally, it can be considered a pessimistic system.  For embarrassingly parallel applications, with constrained behaviours, most transactions are non-overlapping in their data footprint in space and (reasonable lengths of) time.  
Enforced routing of requests to a primary replica adds cost and complexity that is most often unnecessary.  Is it possible to take a more optimistic approach?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Optimistic Active-Active replication&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;An &lt;a href="http://en.wikipedia.org/wiki/Optimistic_replication"&gt;optimistic active-active replication system&lt;/a&gt; assumes that conflicting operations are rare, and prefers to handle conflicts after they happen rather than to make conflicts impossible by paying delays or overheads all of the time.  The one-time cost of recovering from a conflict after it happens may be higher than the one-time cost of preventing a conflict, but this can be a win if conflicts are rare enough.&lt;br /&gt;&lt;br /&gt;Practically, optimistic active-active replication involves allowing transactions to execute and commit at all replicas, and asynchronously propagating their effects between replicas.  When applying replicated changes to a replica, checks are made to determine whether any conflicts have occurred.&lt;br /&gt;&lt;br /&gt;Benefits of optimism include :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Local reads - low latency, higher availability&lt;/li&gt;&lt;li&gt;Local writes - low latency, higher availability&lt;/li&gt;&lt;li&gt;No need to route requests, failover, recover&lt;br /&gt;Recovery from network failure is the same as for normal async replication - catch up the backlog.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;A pessimist is never disappointed, as they always expect the worst, but an optimist is occasionally (often?) disappointed.  With active-active replication, this disappointment can include reading stale data, as with relaxed read consistency, or having committed writes later rejected due to a conflict.  This is the price of optimism.  Not all applications are suited to the slings and arrows inherent in optimism.  
Some prefer the safety of a pessimistic outlook.&lt;br /&gt;&lt;br /&gt;Benefits of pessimism include :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Only durable data returned by reads&lt;/li&gt;&lt;li&gt;Committed writes are durable&lt;/li&gt;&lt;/ul&gt;MySQL Cluster replication has supported symmetric optimistic &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-replication-conflict-resolution.html"&gt;conflict detection functions&lt;/a&gt;  since the 6.3 release.  These provide detection of conflicts for  optimistic active-active replication, allowing data to be written on any cluster, and write-write conflicts to be detected for handling.  The  functions use an application defined comparison value to determine when a  conflict has occurred, and optionally, which change should 'win'.  This  is very flexible, but can be difficult to understand, and requires  application and schema changes before it can be used.&lt;br /&gt;&lt;br /&gt;When presented with an either-or decision, why not ask for both?  Is it possible to have the benefits of both optimistic and pessimistic replication?  Can we have them both on the same data at the same time?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Asymmetric optimistic Active-Active replication&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;I have recently been working on new asymmetric conflict detection functions for MySQL Cluster replication.  These functions do not require schema or application modifications.  They are asymmetric in that one data replica is regarded as the Active replica.  However, unlike a pessimistic partitioned replicated system, writes can be made at Active or Backup replicas - they do not have to be routed to the Active replica.  
Writes made at the Backup replica will asynchronously propagate to the Active replica and be applied, but only if they do not conflict with writes made concurrently at the Active replica.&lt;br /&gt;&lt;br /&gt;Having a first class Active replica and a second class Backup replica may seem like a weakness.  However, it allows optimistic and pessimistic replication to be mixed, on the same data for different use-cases.&lt;br /&gt;&lt;br /&gt;Where a pessimistic approach is required, requests can be routed to the Active replica.  At the Active replica, they will be guaranteed to read durable data, and once committed, writes will not be rejected later.&lt;br /&gt;&lt;br /&gt;Where an optimistic approach is acceptable, requests can also be routed to the Backup replica.  At the Backup replica, committed writes may later be rejected, and reads may return data which will later be rejected.  The potential for disappointment is there, and applications must be able to cope with that, but in return, they can read and write locally, with latency and availability independent of network conditions between replicas.&lt;br /&gt;&lt;br /&gt;A well understood application and schema can use pessimistic replication, with request routing, where appropriate, and write-anywhere active-active where the application and schema can cope with the relaxed consistency.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;New conflict functions - NDB$EPOCH, NDB$EPOCH_TRANS&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;The new NDB$EPOCH function implements asymmetric conflict detection, on a row basis.  One replica of a table is considered Active (or Primary), and the other(s) are Backup (or Secondary).  Writes originating from the Backup replica are checked at the Active replica to ensure that they don't conflict with concurrent writes originating at the Active replica.  If they do conflict, then they are rejected, and the Backup is realigned to the Active replica's state.  
In this way, data divergence is avoided, and the replicated system eventually becomes consistent.&lt;br /&gt;&lt;br /&gt;The conflict detection, and the realignment that gives eventual consistency, both occur asynchronously as part of the normal MySQL replication mechanisms.&lt;br /&gt;&lt;br /&gt;As with the existing conflict detection functions, an exceptions table can be defined which will be populated with the primary keys of rows which have experienced a conflict.  This can be used to take application-specific actions when a conflict is detected.&lt;br /&gt;&lt;br /&gt;Unlike the existing conflict detection functions, no schema changes or application changes are required.  However, as with any optimistic replication system, applications must be able to cope with the relaxed consistency on offer.  Applications which cannot cope can still access the data, but should route their requests to Active replicas only, as with a more traditional pessimistic system.&lt;br /&gt;&lt;br /&gt;As these functions build on the existing MySQL Cluster asynchronous replication, the existing features are all available :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Slave batching performance optimisations&lt;br /&gt;&lt;/li&gt;&lt;li&gt;High availability - redundant replication channels&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Transactional replication and progress tracking&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Normal MySQL replication features : DDL replication, Binlog, replicate to other engines etc.&lt;/li&gt;&lt;/ul&gt;Ok, that's long enough for one post - I'll describe NDB$EPOCH_TRANS and  its motivations in a follow-up.  If you're interested in trying this out, then  download the latest versions of MySQL Cluster.  
If you're interested in  the optimistic replication concept in general, I recommend reading Saito  and Shapiro's &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.6907"&gt;survey&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Edit 23/12/11 : Added index&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-2589003264580418728?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/2589003264580418728/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=2589003264580418728' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/2589003264580418728'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/2589003264580418728'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html' title='Eventual consistency with MySQL'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s72-c/image2.gif' height='72' width='72'/><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-447106870440199377</id><published>2011-06-23T23:58:00.000+01:00</published><updated>2011-06-24T00:00:14.039+01:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='latency-hiding'/><category scheme='http://www.blogger.com/atom/ns#' term='nosql'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='parallel'/><title type='text'>Some MySQL projects I think are cool - HandlerSocket Plugin</title><content type='html'>The HandlerSocket &lt;a href="http://github.com/ahiguti/HandlerSocket-Plugin-for-MySQL"&gt;project&lt;/a&gt; is described in &lt;a href="http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html"&gt;Yoshinori Matsunobu&lt;/a&gt;'s blog entry under the title 'Using MySQL as a NoSQL - A story for exceeding 750,000 qps on a commodity server'.   It's a great headline and has generated a lot of buzz.  Quite a few early commentators were a little confused about what it was - a new NoSQL system using InnoDB?  A cache?  In memory only?  Where does Memcached come in?  Does it support the Memcached protocol?  If not, why not?  Why is it called HandlerSocket?&lt;br /&gt;&lt;br /&gt;Inspirations from Memcache may include the focus on simplicity, performance and a simple human readable protocol.  As Yoshinori says, Kazuho Oku has already implemented a MySQLD-embedded Memcached server, no need to do it again.  What's more, the Memcache protocol offers key-value functionality, whereas implementing a new protocol allows more functionality to be exposed.&lt;br /&gt;&lt;span&gt;&lt;br /&gt;The choice of name has come in for some flak.   I believe the etymology is that HandlerSocket exposes the existing MySQL &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/handler.html"&gt;Handler&lt;/a&gt; interface directly over a separate socket.  
Perhaps a more exciting name will appear at some point, but looking at the MySQL Handler documentation gives a good background on the basis of the HandlerSocket Api.&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;HandlerSocket implements more than a Key-Value Api.  It supports indexed data access in general including :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Equality search on any index prefix (returning 0, 1 or more rows)&lt;/li&gt;&lt;li&gt;Inequality search on any index prefix (returning 0, 1 or more rows)&lt;/li&gt;&lt;/ul&gt;This allows far more general use than a simple key-value API.  Composite keys can be used and Secondary indexes searched and maintained.  This exposes much more of the value of a storage engine like InnoDB.&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;SQL and non-SQL access to the same data&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Yoshinori mentions inspiration from &lt;a href="http://dev.mysql.com/doc/ndbapi/en/index.html"&gt;NdbApi&lt;/a&gt; for HandlerSocket - they wanted to get fast indexed access performance without excluding the possibility of performing ad-hoc SQL for reports etc.  This has been one of the unique benefits of Ndb for some time - the ability to operate on the same underlying data via multiple Apis.  Extending this to other MySQL engines (especially InnoDB) is a great idea.  What is surprising is the difference a different access layer implementation can make to throughput.  Who would have thought that so much performance could be consumed by parsing etc?  I suspect that HandlerSocket may create a new benchmark for the MySQL team to optimise parsing towards in future.&lt;br /&gt;&lt;br /&gt;As a MySQL daemon plugin, the HandlerSocket plugin gets to create threads running within a MySQLD server instance.  These threads can then listen on network sockets and use the Storage Engine Api inside the server to perform primitive operations on Storage engines.  
From the point of view of the Storage engine, they are just client request handling threads in the Server accessing data concurrently with 'normal' SQL client threads.&lt;br /&gt;&lt;br /&gt;Using the Storage Engine Api, HandlerSocket gets :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Storage engine independence&lt;/li&gt;&lt;li&gt;Concurrency control as implemented by the engine&lt;/li&gt;&lt;li&gt;Index maintenance as implemented by the engine&lt;/li&gt;&lt;li&gt;Constraint enforcement as implemented by the engine&lt;/li&gt;&lt;li&gt;Engine features such as online backup, crash recovery, compression, encryption etc.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Not going via the 'SQL layer' means that HandlerSocket misses out on :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;SQL functionality (queries, joins, aggregation, UDFs etc.)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Stored procedures&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Trigger activation&lt;/li&gt;&lt;li&gt;Query cache&lt;/li&gt;&lt;li&gt;ACL checks&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Views&lt;/li&gt;&lt;/ul&gt;As with NdbApi, this is often a very good trade-off as many applications don't need these features for their heavy-lifting.  However they can be very useful for less frequent reporting and administration tasks. Supporting consistent access via SQL or some simpler Api to the same data can avoid the need to split caching from an OLTP database and also potentially the need to split an OLTP and analytic database.  Ndb gives more of a 'firewall' between different usage types by physically separating the storage from more complex query processing, but MySQLD could be extended to have more 'workload management' features internally if this were a problem for HandlerSocket.&lt;br /&gt;&lt;br /&gt;The impressive published benchmarks use data which fits entirely in memory buffers, so that InnoDB need only write logs and checkpoints.  
HandlerSocket does not require that all data is in-cache, but the best performance will be achieved if this is the case.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Operation batching/pipelining&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As the &lt;a href="http://github.com/ahiguti/HandlerSocket-Plugin-for-MySQL/blob/master/docs-en/protocol.en.txt"&gt;protocol description&lt;/a&gt; states, requests can be pipelined.  There is no need for individual clients to make blocking synchronous DB requests.  This can vastly increase throughput without excessively straining connection resources.&lt;br /&gt;&lt;br /&gt;Often SQL DB access protocols do not make much use of the potential for batching requests on a single connection. &lt;a href="http://download.oracle.com/javase/6/docs/api/java/sql/package-summary.html"&gt; JDBC&lt;/a&gt; supports batched &lt;a href="http://download.oracle.com/javase/6/docs/api/java/sql/Statement.html#addBatch%28java.lang.String%29"&gt;updates and inserts&lt;/a&gt;, and appears to have support (not often talked about) for batched queries, but not every driver implements these and they're often not implemented in the same way.  Often the SQL approach seems to be 'If you want different sets of data in one round trip, you need to find a way to get them all in one SQL statement'.  This creates a false tension between reducing client-server round trips and avoiding complex joins and unnecessary unions.  Decoupling query boundaries from thread blocking / flow of control changes removes this artificial tension and can simplify applications.  A recent &lt;a href="http://mysqldba.blogspot.com/2010/11/facebook-live-running-mysql-at-scale.html"&gt;'Facebook at Scale'&lt;/a&gt; talk describes how they use the MySQL Client's &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-real-connect.html"&gt;CLIENT_MULTI_STATEMENTS&lt;/a&gt; flag to decouple query boundaries and request latency.  
This pattern is one of the keys to implementing efficient NdbApi clients as well.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Commit grouping&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Another HandlerSocket feature possibly inspired by Ndb is the commit grouping.  Multiple user writes (it's not clear if/how transaction boundaries can be specified) are combined and committed to the engine together.  This amortizes the commit cost across multiple operations.  Where the engine performs expensive durability operations (e.g. fsync) this can improve write throughput.  Writes are group-committed to the Binlog as well, again in a similar way to Ndb.  Another shared advantage of merging multiple client 'transactions' into fewer Binlog 'transactions' is that the Slave also gets to benefit from fewer, larger transactions for a given data change &lt;span style="font-size:100%;"&gt;rate.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;span style="font-size:130%;"&gt;Impact&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;It's not clear yet what the impact of HandlerSocket will be.  It enters a crowded market of SQL, NoSQL, NewSQL, DataGrid and other technologies.  Competitors use JSON over HTTP, or Memcached protocols, support richer or simpler Apis, offer transparent sharding, eventually consistent replication or web-scale Erlang distributed Map Reduce.  Perhaps HandlerSocket is too old-school?&lt;br /&gt;&lt;br /&gt;I think it's a great technology, deserving of success.  Coupled with an external sharding layer, it seems to offer a great way to improve the efficiency of MySQL scale out, without losing the ability to perform ad-hoc SQL etc.  
&lt;/span&gt;&lt;span style="font-size:100%;"&gt;  Time will tell.&lt;/span&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/447106870440199377/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=447106870440199377' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/447106870440199377'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/447106870440199377'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2010/10/some-mysql-projects-i-think-are-cool.html' title='Some MySQL projects I think are cool - HandlerSocket Plugin'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-285707541399929572</id><published>2011-04-02T01:05:00.007+01:00</published><updated>2011-04-02T01:44:36.218+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='cluster'/><category scheme='http://www.blogger.com/atom/ns#' term='latency-hiding'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='parallel'/><category 
scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>Journey upriver to the dark heart of ha_ndbcluster</title><content type='html'>Unlike most other MySQL storage engines, Ndb does not perform all of its work in the MySQLD process.  The Ndb table handler maps Storage Engine Api calls onto &lt;a href="http://dev.mysql.com/doc/ndbapi/en/index.html"&gt;NdbApi&lt;/a&gt; calls, which eventually result in communication with data nodes.  In terms of layers, we have SQL -&amp;gt; Handler Api -&amp;gt; NdbApi -&amp;gt; Communication.  At each of these layer boundaries, the mapping between operations at the upper layer to operations at the lower layer is non trivial, based on runtime state, statistics, optimisations etc.&lt;br /&gt;&lt;br /&gt;The MySQL &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/server-status-variables.html"&gt;status variables&lt;/a&gt; can be used to understand the behaviour of the MySQL Server in terms of user commands processed, and also how these map to some of the Storage Engine Handler Api calls.&lt;br /&gt;&lt;br /&gt;Status variables tracking user commands start with 'Com_'&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;mysql&amp;gt; show status like 'Com\_%';&lt;br /&gt;+---------------------------+-------+&lt;br /&gt;| Variable_name             | Value |&lt;br /&gt;+---------------------------+-------+&lt;br /&gt;| Com_admin_commands        | 0     |&lt;br /&gt;| Com_assign_to_keycache    | 0     |&lt;br /&gt;| Com_alter_db              | 0     |&lt;br /&gt;| Com_alter_db_upgrade      | 0     |&lt;br /&gt;| Com_alter_event           | 0     |&lt;br /&gt;| Com_alter_function        | 0     |&lt;br /&gt;| Com_alter_procedure       | 0     |&lt;br /&gt;| Com_alter_server          | 0     |&lt;br /&gt;| Com_alter_table           | 0     |&lt;br /&gt;| Com_alter_tablespace      | 0     |&lt;br /&gt;| Com_analyze               | 0     |&lt;br /&gt;| Com_backup_table          | 0     
|&lt;br /&gt;| Com_begin                 | 0     |&lt;br /&gt;| Com_binlog                | 0     |&lt;br /&gt;| Com_call_procedure        | 0     |&lt;br /&gt;| Com_change_db             | 1     |&lt;br /&gt;| Com_change_master         | 0     |&lt;br /&gt;| Com_check                 | 0     |&lt;br /&gt;| Com_checksum              | 0     |&lt;br /&gt;.........&lt;br /&gt;| Com_stmt_reset            | 0     |&lt;br /&gt;| Com_stmt_send_long_data   | 0     |&lt;br /&gt;| Com_truncate              | 0     |&lt;br /&gt;| Com_uninstall_plugin      | 0     |&lt;br /&gt;| Com_unlock_tables         | 0     |&lt;br /&gt;| Com_update                | 1     |&lt;br /&gt;| Com_update_multi          | 0     |&lt;br /&gt;| Com_xa_commit             | 0     |&lt;br /&gt;| Com_xa_end                | 0     |&lt;br /&gt;| Com_xa_prepare            | 0     |&lt;br /&gt;| Com_xa_recover            | 0     |&lt;br /&gt;| Com_xa_rollback           | 0     |&lt;br /&gt;| Com_xa_start              | 0     |&lt;br /&gt;+---------------------------+-------+&lt;br /&gt;144 rows in set (0.01 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Status variables tracking Handler (Storage engine) Api calls start with 'Handler_'.&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;mysql&amp;gt; show status like 'Handler\_%';&lt;br /&gt;+----------------------------+-------+&lt;br /&gt;| Variable_name              | Value |&lt;br /&gt;+----------------------------+-------+&lt;br /&gt;| Handler_commit             | 1     |&lt;br /&gt;| Handler_delete             | 0     |&lt;br /&gt;| Handler_discover           | 0     |&lt;br /&gt;| Handler_prepare            | 0     |&lt;br /&gt;| Handler_read_first         | 0     |&lt;br /&gt;| Handler_read_key           | 0     |&lt;br /&gt;| Handler_read_next          | 0     |&lt;br /&gt;| Handler_read_prev          | 0     |&lt;br /&gt;| Handler_read_rnd           | 0     |&lt;br /&gt;| Handler_read_rnd_next      
| 21    |&lt;br /&gt;| Handler_rollback           | 0     |&lt;br /&gt;| Handler_savepoint          | 0     |&lt;br /&gt;| Handler_savepoint_rollback | 0     |&lt;br /&gt;| Handler_update             | 4     |&lt;br /&gt;| Handler_write              | 14    |&lt;br /&gt;+----------------------------+-------+&lt;br /&gt;15 rows in set (0.00 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;The 'Com_%' and 'Handler_%' variables are maintained by the Server for all storage engines.  The server maintains these on a per-session, and global basis.  By default the session status is shown, but the GLOBAL keyword shows the global view, aggregated across all sessions.&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;mysql&amp;gt; show global status like 'Handler\_%';&lt;br /&gt;+----------------------------+--------+&lt;br /&gt;| Variable_name              | Value  |&lt;br /&gt;+----------------------------+--------+&lt;br /&gt;| Handler_commit             | 167    |&lt;br /&gt;| Handler_delete             | 494041 |&lt;br /&gt;| Handler_discover           | 0      |&lt;br /&gt;| Handler_prepare            | 0      |&lt;br /&gt;| Handler_read_first         | 3      |&lt;br /&gt;| Handler_read_key           | 1      |&lt;br /&gt;| Handler_read_next          | 0      |&lt;br /&gt;| Handler_read_prev          | 0      |&lt;br /&gt;| Handler_read_rnd           | 0      |&lt;br /&gt;| Handler_read_rnd_next      | 561132 |&lt;br /&gt;| Handler_rollback           | 6      |&lt;br /&gt;| Handler_savepoint          | 0      |&lt;br /&gt;| Handler_savepoint_rollback | 0      |&lt;br /&gt;| Handler_update             | 24     |&lt;br /&gt;| Handler_write              | 43442  |&lt;br /&gt;+----------------------------+--------+&lt;br /&gt;15 rows in set (0.00 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;The SHOW STATUS command is handy for a quick check, but for more interesting analysis, the 
INFORMATION_SCHEMA tables SESSION_STATUS and GLOBAL_STATUS contain the same data, and support all SQL queries and views.&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;mysql&amp;gt; select * from information_schema.session_status where Variable_name like 'Handler\_%';&lt;br /&gt;+----------------------------+----------------+&lt;br /&gt;| VARIABLE_NAME              | VARIABLE_VALUE |&lt;br /&gt;+----------------------------+----------------+&lt;br /&gt;| HANDLER_COMMIT             | 1              |&lt;br /&gt;| HANDLER_DELETE             | 0              |&lt;br /&gt;| HANDLER_DISCOVER           | 0              |&lt;br /&gt;| HANDLER_PREPARE            | 0              |&lt;br /&gt;| HANDLER_READ_FIRST         | 0              |&lt;br /&gt;| HANDLER_READ_KEY           | 0              |&lt;br /&gt;| HANDLER_READ_NEXT          | 0              |&lt;br /&gt;| HANDLER_READ_PREV          | 0              |&lt;br /&gt;| HANDLER_READ_RND           | 0              |&lt;br /&gt;| HANDLER_READ_RND_NEXT      | 85             |&lt;br /&gt;| HANDLER_ROLLBACK           | 0              |&lt;br /&gt;| HANDLER_SAVEPOINT          | 0              |&lt;br /&gt;| HANDLER_SAVEPOINT_ROLLBACK | 0              |&lt;br /&gt;| HANDLER_UPDATE             | 4              |&lt;br /&gt;| HANDLER_WRITE              | 89             |&lt;br /&gt;+----------------------------+----------------+&lt;br /&gt;15 rows in set (0.40 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Unfortunately the entries in the information_schema are in shouty 1960s CAPITALS.  Also note that some Handler calls are made as part of using the information_schema 'database' to fetch the data.  This is Heisenberg's principle in action!&lt;br /&gt;&lt;br /&gt;To shine some light into the depths of the Ndb storage engine, a set of new status variables has been added to recent cluster-7.0 and cluster-7.1 releases.  
These status variables track activity at the NdbApi and data node communication layers of the stack.  Currently they are divided into 4 subsets :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Global counters&lt;/span&gt;&lt;br /&gt;show status like 'ndb_api_%_count';&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Session counters&lt;/span&gt;&lt;br /&gt;show status like 'ndb_api_%_session';&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Slave counters&lt;/span&gt;&lt;br /&gt;show status like 'ndb_api_%_slave';&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Binlog injector counters&lt;/span&gt;&lt;br /&gt;show status like 'ndb_api_%_injector';&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Unfortunately the mysql-5.1 server does not allow Storage Engines to differentiate GLOBAL or SESSION status variables, so the Global and Session specific versions of these variables are differentiated by name, and are visible from both GLOBAL and SESSION views of status, so it doesn't matter which you look at.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Global counters&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;mysql&amp;gt; show status like 'ndb_api_%_count';&lt;br /&gt;+------------------------------------+------------+&lt;br /&gt;| Variable_name                      | Value      |&lt;br /&gt;+------------------------------------+------------+&lt;br /&gt;| Ndb_api_wait_exec_complete_count   | 10         |&lt;br /&gt;| Ndb_api_wait_scan_result_count     | 19295      |&lt;br /&gt;| Ndb_api_wait_meta_request_count    | 67         |&lt;br /&gt;| Ndb_api_wait_nanos_count           | 6361966340 |&lt;br /&gt;| Ndb_api_bytes_sent_count           | 415704     |&lt;br /&gt;| Ndb_api_bytes_received_count       | 116921552  |&lt;br /&gt;| Ndb_api_trans_start_count          | 14         |&lt;br /&gt;| Ndb_api_trans_commit_count         | 3          |&lt;br /&gt;| Ndb_api_trans_abort_count          | 1      
    |&lt;br /&gt;| Ndb_api_trans_close_count          | 14         |&lt;br /&gt;| Ndb_api_pk_op_count                | 6          |&lt;br /&gt;| Ndb_api_uk_op_count                | 0          |&lt;br /&gt;| Ndb_api_table_scan_count           | 11         |&lt;br /&gt;| Ndb_api_range_scan_count           | 0          |&lt;br /&gt;| Ndb_api_pruned_scan_count          | 0          |&lt;br /&gt;| Ndb_api_scan_batch_count           | 25850      |&lt;br /&gt;| Ndb_api_read_row_count             | 103371     |&lt;br /&gt;| Ndb_api_trans_local_read_row_count | 51625      |&lt;br /&gt;| Ndb_api_event_data_count           | 0          |&lt;br /&gt;| Ndb_api_event_nondata_count        | 0          |&lt;br /&gt;| Ndb_api_event_bytes_count          | 0          |&lt;br /&gt;+------------------------------------+------------+&lt;br /&gt;21 rows in set (0.00 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;These counts are aggregated across all MySQL clients in the Server accessing tables in Ndb, since this MySQL Server was started.  
Note that this *does not* include accesses by clients of other MySQL Servers, or other NdbApi clients.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Session counters&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;mysql&amp;gt; show status like 'ndb_api_%_session';&lt;br /&gt;+--------------------------------------------+----------+&lt;br /&gt;| Variable_name                              | Value    |&lt;br /&gt;+--------------------------------------------+----------+&lt;br /&gt;| Ndb_api_wait_exec_complete_count_session   | 0        |&lt;br /&gt;| Ndb_api_wait_scan_result_count_session     | 38       |&lt;br /&gt;| Ndb_api_wait_meta_request_count_session    | 2        |&lt;br /&gt;| Ndb_api_wait_nanos_count_session           | 11064398 |&lt;br /&gt;| Ndb_api_bytes_sent_count_session           | 872      |&lt;br /&gt;| Ndb_api_bytes_received_count_session       | 230764   |&lt;br /&gt;| Ndb_api_trans_start_count_session          | 1        |&lt;br /&gt;| Ndb_api_trans_commit_count_session         | 0        |&lt;br /&gt;| Ndb_api_trans_abort_count_session          | 0        |&lt;br /&gt;| Ndb_api_trans_close_count_session          | 1        |&lt;br /&gt;| Ndb_api_pk_op_count_session                | 0        |&lt;br /&gt;| Ndb_api_uk_op_count_session                | 0        |&lt;br /&gt;| Ndb_api_table_scan_count_session           | 1        |&lt;br /&gt;| Ndb_api_range_scan_count_session           | 0        |&lt;br /&gt;| Ndb_api_pruned_scan_count_session          | 0        |&lt;br /&gt;| Ndb_api_scan_batch_count_session           | 51       |&lt;br /&gt;| Ndb_api_read_row_count_session             | 204      |&lt;br /&gt;| Ndb_api_trans_local_read_row_count_session | 136      |&lt;br /&gt;+--------------------------------------------+----------+&lt;br /&gt;18 rows in set (0.00 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;These counts are for the current session (MySQL client 
connection)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Slave counters&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;mysql&amp;gt; show status like 'ndb_api_%_slave';&lt;br /&gt;+------------------------------------------+-------+&lt;br /&gt;| Variable_name                            | Value |&lt;br /&gt;+------------------------------------------+-------+&lt;br /&gt;| Ndb_api_wait_exec_complete_count_slave   | 0     |&lt;br /&gt;| Ndb_api_wait_scan_result_count_slave     | 0     |&lt;br /&gt;| Ndb_api_wait_meta_request_count_slave    | 0     |&lt;br /&gt;| Ndb_api_wait_nanos_count_slave           | 0     |&lt;br /&gt;| Ndb_api_bytes_sent_count_slave           | 0     |&lt;br /&gt;| Ndb_api_bytes_received_count_slave       | 0     |&lt;br /&gt;| Ndb_api_trans_start_count_slave          | 0     |&lt;br /&gt;| Ndb_api_trans_commit_count_slave         | 0     |&lt;br /&gt;| Ndb_api_trans_abort_count_slave          | 0     |&lt;br /&gt;| Ndb_api_trans_close_count_slave          | 0     |&lt;br /&gt;| Ndb_api_pk_op_count_slave                | 0     |&lt;br /&gt;| Ndb_api_uk_op_count_slave                | 0     |&lt;br /&gt;| Ndb_api_table_scan_count_slave           | 0     |&lt;br /&gt;| Ndb_api_range_scan_count_slave           | 0     |&lt;br /&gt;| Ndb_api_pruned_scan_count_slave          | 0     |&lt;br /&gt;| Ndb_api_scan_batch_count_slave           | 0     |&lt;br /&gt;| Ndb_api_read_row_count_slave             | 0     |&lt;br /&gt;| Ndb_api_trans_local_read_row_count_slave | 0     |&lt;br /&gt;+------------------------------------------+-------+&lt;br /&gt;18 rows in set (0.00 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Hopefully you are seeing a pattern here.  These counters are the NdbApi operations performed by the Slave SQL thread as part of replicating Binlogs into Ndb tables.  
These counts will only increase from zero if the MySQLD is acting, or has acted as a Slave, and has accessed tables stored in Ndb.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Binlog Injector counters&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;mysql&amp;gt; show status like 'ndb_api_%_injector';&lt;br /&gt;+--------------------------------------+-------+&lt;br /&gt;| Variable_name                        | Value |&lt;br /&gt;+--------------------------------------+-------+&lt;br /&gt;| Ndb_api_event_data_count_injector    | 0     |&lt;br /&gt;| Ndb_api_event_nondata_count_injector | 0     |&lt;br /&gt;| Ndb_api_event_bytes_count_injector   | 0     |&lt;br /&gt;+--------------------------------------+-------+&lt;br /&gt;3 rows in set (0.01 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;These counts track the data change events received by the Ndb Binlog Injector thread.  The Binlog Injector is responsible for recording Cluster changes in the Binlog, but even when Binlogs are not being written, it receives events related to schema changes and other system management functions, so these counts can be non zero on Servers which are not writing a Binlog.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Counter definitions&lt;br /&gt;&lt;/span&gt;As you've hopefully noticed, there is naming overlap between each set of status counters.  The same events are being counted, and recorded globally, per-session, against the Slave SQL thread and against the Ndb Binlog injector thread.&lt;br /&gt;&lt;br /&gt;So what do these different counts actually mean?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_wait_exec_complete_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of times a user thread has blocked waiting for some batch of primary key, or secondary unique hash key operations to complete.  
From the point of view of the MySQL Server, this is idle time, waiting for data nodes to send some response.  An alternative name for it could be 'round trip count', and minimising it through operation batching is a good way to reduce response time and increase throughput.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_wait_scan_result_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of times a user thread has blocked waiting for some scan operation to complete.  It could be waiting for a batch of scan results, or waiting for an acknowledgement of a scan close.  In any case, it indicates time spent waiting on communication related to scan processing.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_wait_meta_request_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of times a user thread has blocked waiting for some metadata operation to complete.  This is quite a catch-all term, which can include DDL (Create/Drop table etc), and some transaction initialisation steps.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_wait_nanos_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of nanoseconds a user thread has blocked in one of the three scenarios above.  This is kind of an 'IO_WAIT' time for Ndb operations.  It tracks how long the thread was blocked waiting for the data nodes to complete their operations.  The resolution is nanoseconds, but this requires support from the operating system.  On operating systems with lower resolution, this count will be coarser, and some operations may complete with zero observed wait time.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_bytes_sent_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of bytes sent to the Ndb data nodes.  This includes all request types, rows inserted etc.  
It does not include regular heartbeating as that generally adds too much noise to make the counters useful.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_bytes_received_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of bytes received from the Ndb data nodes.  This includes all request types, rows read etc.  It does not include regular heartbeating as that generally adds too much noise to make the counters useful.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_trans_start_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of NdbApi transactions started.  Note that, as an optimisation, NdbApi transactions are not always started immediately on a BEGIN statement.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_trans_commit_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of NdbApi transactions which have been explicitly committed.  Not all transactions started are committed or aborted; some are started and then simply closed.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_trans_abort_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of NdbApi transactions which have been explicitly aborted.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_trans_close_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of NdbApi transactions which have been closed.  It should closely track the number which have been started.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_pk_op_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of Primary Key (pk) operations which have been executed.  Each pk operation affects zero or one rows.  This includes read, insert, update, write, delete.  
Note that operations on tables with Blobs can also generate pk and uk operations.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_uk_op_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of Unique key (uk) operations which have been executed.  Each uk operation affects zero or one rows.  This includes read, update, write, delete by unique key.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_table_scan_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of table scans which have been started.  Table scans can have pushed-down filters, so although they must access all data in a table, they might not return it all to the MySQL Server.  Also, started scans may be stopped before all table data is accessed.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_range_scan_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of range (Ordered Index) scans which have been started.  Range scans take bounds and pushed-down filters, so may access and/or return zero to all rows to the MySQL Server.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_pruned_scan_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of scans which have been successfully pruned to one partition of the scanned table/index.  See my previous Blog entries for details about scan partition pruning.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_scan_batch_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of batches of rows returned to the MySQL Server from scans.  Each scanned table/index fragment returns matching rows in batches, whose size is controlled by the batchsize parameters.  NdbApi handles one batch from each fragment at a time.  Fetching the next batch from a fragment requires a round-trip to the data nodes, although multiple fragments can be asked for their next batch in one trip.  
Minimising the Ndb_api_scan_batch_count, by increasing batchsize and improving scan selectivity, can improve throughput, latency and efficiency.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_read_row_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of rows returned to the MySQL Server from Primary Key reads, Unique Key reads, and Table and Index scans.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_trans_local_read_row_count[_session|_slave]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of rows returned to the MySQL Server from the same data node where the reading transaction has its Transaction Coordinator (TC).  Where the TC and the data read reside on the same data node, one hop in the data reading control protocols is avoided.  This is the goal of transaction hinting and distribution awareness, and its effectiveness can be checked by comparing &lt;span style="font-style: italic;"&gt;Ndb_api_trans_local_read_row_count&lt;/span&gt; to &lt;span style="font-style: italic;"&gt;Ndb_api_read_row_count.&lt;/span&gt;  The higher the proportion of local reads, the better.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_event_data_count[_injector]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of data change events (row insert, delete, update notifications) received from the data nodes.  On a Binlogging MySQL Server, this count can give a measure of the rate of data change in a cluster in terms of rows/second.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_event_nondata_count[_injector]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The number of non-data events (table alter/drop notifications etc) received from the data nodes.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ndb_api_event_bytes_count[_injector]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The total number of bytes of event data (data and nondata events) received from the data nodes.  
This gives another measure of the Cluster change rate in terms of bytes/second.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;At NdbApi level&lt;br /&gt;&lt;/span&gt;As the names suggest, all of these counters are reflecting things happening in the NdbApi implementation.  The data collection is built into NdbApi, and is therefore also available to any NdbApi client, not just the MySQL Server.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Using the counters&lt;br /&gt;&lt;/span&gt;During implementation of these counters, I found it easiest to create a temporary table to store 'base' values for the counters, and define a view containing the difference between the current values and the base values.  This made it easy to see the effect on the counters of various different SQL statements, slave operations etc.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Setting up a baseline and a view&lt;br /&gt;&lt;/span&gt;Please excuse my schoolboy SQL :&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;mysql&amp;gt; create table test.counts_base(variable_name varchar(255) primary key, variable_value bigint);&lt;br /&gt;Query OK, 0 rows affected (0.04 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; replace into test.counts_base (variable_name, variable_value) select * from information_schema.session_status where variable_name like 'ndb_api%';&lt;br /&gt;Query OK, 60 rows affected (0.01 sec)&lt;br /&gt;Records: 60  Duplicates: 0  Warnings: 0&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; create view test.counts_diff as select test.counts_base.variable_name, information_schema.session_status.variable_value - test.counts_base.variable_value as diff from test.counts_base, information_schema.session_status where test.counts_base.variable_name = information_schema.session_status.variable_name and (information_schema.session_status.variable_value - test.counts_base.variable_value) &amp;gt; 0;&lt;br /&gt;Query OK, 0 rows affected (0.05 sec)&lt;br 
/&gt;&lt;br /&gt;mysql&amp;gt; select * from test.counts_diff where variable_name like 'ndb_api%_session';&lt;br /&gt;Empty set (0.05 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Looking at the effects of a SQL statement&lt;/span&gt;&lt;br /&gt;First check that the baseline is ok :&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;mysql&amp;gt; select * from test.counts_diff where variable_name like 'ndb_api%_session';&lt;br /&gt;Empty set (0.05 sec)&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Now run the statement :&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;mysql&amp;gt; select count(1) from demo_table;&lt;br /&gt;+----------+&lt;br /&gt;| count(1) |&lt;br /&gt;+----------+&lt;br /&gt;|   103343 |&lt;br /&gt;+----------+&lt;br /&gt;1 row in set (0.00 sec)&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Now look at the difference :&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;mysql&amp;gt; select * from test.counts_diff where variable_name like 'ndb_api%_session';&lt;br /&gt;+--------------------------------------------+--------+&lt;br /&gt;| variable_name                              | diff   |&lt;br /&gt;+--------------------------------------------+--------+&lt;br /&gt;| NDB_API_WAIT_SCAN_RESULT_COUNT_SESSION     |      4 |&lt;br /&gt;| NDB_API_WAIT_META_REQUEST_COUNT_SESSION    |      1 |&lt;br /&gt;| NDB_API_WAIT_NANOS_COUNT_SESSION           | 619143 |&lt;br /&gt;| NDB_API_BYTES_SENT_COUNT_SESSION           |    116 |&lt;br /&gt;| NDB_API_BYTES_RECEIVED_COUNT_SESSION       |    268 |&lt;br /&gt;| NDB_API_TRANS_START_COUNT_SESSION          |      1 |&lt;br /&gt;| NDB_API_TRANS_CLOSE_COUNT_SESSION          |      1 |&lt;br /&gt;| NDB_API_TABLE_SCAN_COUNT_SESSION           |      1 |&lt;br /&gt;| NDB_API_SCAN_BATCH_COUNT_SESSION           |      2 |&lt;br /&gt;| NDB_API_READ_ROW_COUNT_SESSION             |      2 |&lt;br /&gt;| 
NDB_API_TRANS_LOCAL_READ_ROW_COUNT_SESSION |      1 |&lt;br /&gt;+--------------------------------------------+--------+&lt;br /&gt;11 rows in set (0.05 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;The select count(1) statement blocked 4 times on scan results, sent 116 bytes and received 268 bytes of data.  Two batches of rows were received, and two rows were received in total.  One of these rows was from the same node as the transaction's transaction coordinator.&lt;br /&gt;This indicates that select count(1) is optimised in the Ndb handler !&lt;br /&gt;&lt;br /&gt;Let's try a more tricky select count.  First we must reset the baseline to get a 'clean' difference.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;mysql&amp;gt; replace into test.counts_base (variable_name, variable_value) select * from information_schema.session_status where variable_name like 'ndb_api%';&lt;br /&gt;Query OK, 120 rows affected (0.01 sec)&lt;br /&gt;Records: 60  Duplicates: 60  Warnings: 0&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; select * from test.counts_diff where variable_name like 'ndb_api%_session';&lt;br /&gt;Empty set (0.05 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt;  select count(length(string_value) &amp;gt; 10) from demo_table;&lt;br /&gt;+----------------------------------+&lt;br /&gt;| count(length(string_value) &amp;gt; 10) |&lt;br /&gt;+----------------------------------+&lt;br /&gt;|                           103343 |&lt;br /&gt;+----------------------------------+&lt;br /&gt;1 row in set (6.77 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; select * from test.counts_diff where variable_name like 'ndb_api%_session';&lt;br /&gt;+--------------------------------------------+------------+&lt;br /&gt;| variable_name                              | diff       |&lt;br /&gt;+--------------------------------------------+------------+&lt;br /&gt;| NDB_API_WAIT_SCAN_RESULT_COUNT_SESSION     |      21139 |&lt;br /&gt;| 
NDB_API_WAIT_NANOS_COUNT_SESSION           | 5322402052 |&lt;br /&gt;| NDB_API_BYTES_SENT_COUNT_SESSION           |     441628 |&lt;br /&gt;| NDB_API_BYTES_RECEIVED_COUNT_SESSION       |  109026900 |&lt;br /&gt;| NDB_API_TRANS_START_COUNT_SESSION          |          1 |&lt;br /&gt;| NDB_API_TRANS_CLOSE_COUNT_SESSION          |          1 |&lt;br /&gt;| NDB_API_TABLE_SCAN_COUNT_SESSION           |          1 |&lt;br /&gt;| NDB_API_SCAN_BATCH_COUNT_SESSION           |      25836 |&lt;br /&gt;| NDB_API_READ_ROW_COUNT_SESSION             |     103343 |&lt;br /&gt;| NDB_API_TRANS_LOCAL_READ_ROW_COUNT_SESSION |      51735 |&lt;br /&gt;+--------------------------------------------+------------+&lt;br /&gt;10 rows in set (0.05 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;This select count waited for scan results ~21 thousand times, sent ~440kB of data, and received ~109MB of data from the data nodes.  103,343 rows were read in ~25 thousand scan batches, and roughly half came from the same node as the transaction coordinator.  5.3 seconds of the 6.77 second runtime were spent waiting for responses from the data nodes.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Other uses&lt;br /&gt;&lt;/span&gt;Hopefully this gives some notion of the possibilities with these new counters.  
Some other ideas :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Debug slow queries&lt;/li&gt;&lt;li&gt;Optimise data distribution and table partitioning&lt;/li&gt;&lt;li&gt;Get real post-execution costs for queries, DML etc.&lt;/li&gt;&lt;li&gt;Understand how data transfer for Blobs is batched&lt;/li&gt;&lt;li&gt;Check bulk insert/deletes are functioning efficiently&lt;/li&gt;&lt;li&gt;Verify Ndb slave batching is in operation&lt;/li&gt;&lt;li&gt;Draw cool graphs&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-285707541399929572?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/285707541399929572/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=285707541399929572' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/285707541399929572'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/285707541399929572'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2011/04/journey-upriver-to-dark-heart-of.html' title='Journey upriver to the dark heart of ha_ndbcluster'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-8096555334143424493</id><published>2011-03-28T00:41:00.003+01:00</published><updated>2011-03-28T01:15:43.512+01:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='cluster'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>MySQL Cluster online scaling</title><content type='html'>Most people looking at a diagram showing the Cluster architecture soon want to know if the system can scale online.  Api nodes such as MySQLD processes can be added online, and the storage capacity of existing data nodes can be increased online, but it was not always possible to add new data nodes to the cluster without an initial system restart requiring a backup and restore.&lt;br /&gt;&lt;br /&gt;An online add node and data repartitioning feature was finally implemented in MySQL Cluster 7.0.  It's not clear how often users actually do scale their Clusters online, but it certainly is a cool thing to be able to do.&lt;br /&gt;&lt;br /&gt;There are two parts to the feature :&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Online add an empty data node to an existing cluster&lt;/li&gt;&lt;li&gt;Online rebalance existing data across the existing and new data nodes&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;Adding an empty data node to a cluster sounds trivial, but is actually fairly complex given the cluster's distributed configuration, ring heartbeating etc.  Stewart Smith did some preparatory work on this a few years ago, and this was revisited for the feature.&lt;br /&gt;&lt;br /&gt;Rebalancing existing table data to make use of the new storage capacity is more challenging.  How does this work?  More importantly, how does this work online, while transactions are starting and committing, and queries are running?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;As an aside, the definition of 'Online' used here is that multiple distributed clients continue to start and commit transactions reading and writing data to the cluster.  
Some things, like concurrent DDL, may be blocked during these operations.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To understand the online data rebalancing mechanism, we need to go into more detail on the native data distribution mechanisms introduced in the last post.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Ndb native partitioning variants&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There are currently three variants of Ndb's native partitioning function :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Linear Key&lt;/li&gt;&lt;li&gt;Key &lt;/li&gt;&lt;li&gt;HashMap&lt;/li&gt;&lt;/ul&gt;Linear Key is used where the table is created with PARTITION BY LINEAR KEY.  Key is used where the table is created with PARTITION BY KEY prior to Cluster 7.0.  HashMap is used where the table is created with PARTITION BY KEY in releases starting at Cluster 7.0.&lt;br /&gt;&lt;br /&gt;These partitioning functions can be decomposed into two functions, where the first, an MD5 hash of the partition/distribution key, is fixed.  MD5 is no longer considered secure, but for the purpose of 'balancing' rows across fragments it is more than adequate.&lt;br /&gt;&lt;br /&gt;The variant part of these algorithms is how the MD5 hash of the key is mapped to a partition/fragment number.  This has important implications for how repartitioning works.&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;/code&gt;&lt;div style="text-align: center;"&gt;&lt;code&gt;fragment_num( dist_key_cols ) = mapping_fn( md5 ( dist_key_cols ) )&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&lt;/code&gt;&lt;/div&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;Linear Key Mapping Function&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The Linear Key scheme is part of an old design for online repartitioning.  The old design was intended to minimise the amount of data transfer required when repartitioning tables by ensuring that an existing partition could be cleanly 'split in two' so that half of its data could be migrated.  
This is a good policy, but Linear Key had the downside of requiring a power-of-2 number of partitions.&lt;br /&gt;&lt;br /&gt;Where a non power-of-2 number of data nodes existed, this caused problems.  Additionally, if repartitioning with this scheme had ever been implemented, it would have required fragments to be 'split in two' to expand the system.  The power-of-2 requirement of this scheme was listed as a Cluster limitation in the early days, though this has not been a problem since the non-linear Key scheme became default.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Key Mapping Function&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The Key scheme is simpler than Linear Key, and simply takes the hash result modulo the number of fragments of the table to determine which fragment a row resides in.  This removes the power-of-2 restriction that Linear Key required, so rows can be evenly balanced across any number of nodes.  However, it is not amenable to online reorganisation, as changing the number of table fragments changes the modulo division value, which can result in most of the resulting partition values changing.  This means that a reorganisation using this scheme could result in excessive data transfer.&lt;br /&gt;&lt;br /&gt;e.g.&lt;code&gt;&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;div style="text-align: center;"&gt;&lt;code&gt;mapping_fn_key(x) = x % num_frags&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;md5(dist_key_cols) = 23&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;num_frags = 4,  23 % 4 = 3&lt;/code&gt;&lt;br /&gt;&lt;code&gt;num_frags = 6,  23 % 6 = 5&lt;/code&gt;&lt;br /&gt;&lt;code&gt;num_frags = 8,  23 % 8 = 7&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&lt;/code&gt;&lt;/div&gt;&lt;code&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;As the system expands, too much data is being moved.  
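&lt;br /&gt;&lt;br /&gt;To put a rough number on this, here is a small Python sketch (not Ndb code) that counts what fraction of hash values would change fragment under the Key scheme when the fragment count grows.  The function name moved_fraction and the sample range of 240 hash values are assumptions made purely for illustration.&lt;br /&gt;&lt;pre&gt;
```python
from fractions import Fraction


def mapping_fn_key(x, num_frags):
    """Key scheme: hash value modulo the number of fragments."""
    return x % num_frags


def moved_fraction(old_frags, new_frags, sample=240):
    """Fraction of hash values 0..sample-1 whose fragment number changes."""
    moved = sum(
        1
        for x in range(sample)
        if mapping_fn_key(x, old_frags) != mapping_fn_key(x, new_frags)
    )
    return Fraction(moved, sample)


print(moved_fraction(4, 6))  # growing from 4 to 6 fragments
print(moved_fraction(6, 8))  # growing from 6 to 8 fragments
```
&lt;/pre&gt;Under these assumptions two thirds of the values change fragment when going from 4 to 6 fragments, and three quarters when going from 6 to 8, whereas only 2/6 and 2/8 respectively would have to move to keep the fragments balanced.&lt;br /&gt;&lt;br /&gt;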
This is expensive, slow, requires extra storage etc.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;HashMap Mapping Function&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In MySQL Cluster 7.0, the HashMap distribution scheme was added and became the default.  It is used when PARTITION BY KEY() is explicitly given, or implied if no partitioning specification is given.&lt;br /&gt;The HashMap scheme uses a mapping table from md5 hash result to fragment number.  The hash result used is 32 bits, which would require a large lookup table, so we first shrink it down to something more manageable (n) by modulo division by n.  n = 240 is the default number, though the implementation supports any modulo value.&lt;br /&gt;The resulting number is then used to index a lookup table to get the fragment id which will store the row.&lt;br /&gt;&lt;br /&gt;e.g.&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;/code&gt;&lt;div style="text-align: right;"&gt;&lt;div style="text-align: center;"&gt;&lt;code&gt;mapping_fn_hashmap(x) = lookup_tab [ x % mod_val ]&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;fragment_number = lookup_tab [ md5( dist_key_cols ) % mod_val ]&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;fragment_number = lookup_tab [ md5( dist_key_cols ) % 240 ]&lt;/code&gt;&lt;br /&gt;&lt;/div&gt;&lt;code&gt;&lt;/code&gt;&lt;/div&gt;&lt;br /&gt;The lookup table adds another layer of indirection between the hash result (which is fixed for any given key), and the fragment number, whose range can increase over time.&lt;br /&gt;&lt;br /&gt;Assuming mod_val is 240, and we start with 4 fragments, then the 240 entry lookup table will have 60 entries with 0, 60 with 1, 60 with 2 and 60 with 3.  As an aside, these will be sequenced as 0,1,2,3,0,1,2,3,0.... 
so that the actual default distribution will be exactly the same as with the KEY scheme.&lt;br /&gt;&lt;br /&gt;If we want to spread the table data over 6 fragments, then we can change the table to use a new hashmap lookup table, where 2/6 of the existing 240 values are changed to refer to the new fragment numbers.  The other 4/6 are unchanged.  Expanding again to 8 fragments, we can change to another new hashmap, where 2/8 of the existing 240 values are changed, and the other 6/8 are unchanged.  In each case, the minimum amount of data is affected to maintain balance.&lt;br /&gt;&lt;br /&gt;Changing the hashmap is easy, the real work is in moving the data while it's being operated upon, but what the hashmap gives is a way to move only the minimum amount of data required when adding nodes.  Only the data that has to move is moved, the rest stays where it is.  The data distribution randomisation given by the MD5 function is unaffected, so system balance is maintained.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Choice of HashMap mod_val&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As the default mod_val of 240 is significantly higher than common fragment counts, and because it factors well, most configurations will remain well balanced, despite being reorganised.&lt;br /&gt;&lt;br /&gt;e.g. 
Assuming 2-node increments, the minimum with NoOfReplicas=2&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;code&gt;            240 / 2  = 120 &lt;/code&gt;&lt;br /&gt;&lt;code&gt;            &lt;/code&gt;&lt;code&gt;240 / 4  = 60  &lt;/code&gt;&lt;br /&gt;&lt;code&gt;            &lt;/code&gt;&lt;code&gt;240 / 6  = 40  &lt;/code&gt;&lt;br /&gt;&lt;code&gt;            &lt;/code&gt;&lt;code&gt;240 / 8  = 30  &lt;/code&gt;&lt;br /&gt;&lt;code&gt;            &lt;/code&gt;&lt;code&gt;240 / 10 = 24   &lt;/code&gt;&lt;br /&gt;&lt;code&gt;            &lt;/code&gt;&lt;code&gt;240 / 12 = 20   &lt;/code&gt;&lt;br /&gt;&lt;code&gt;            &lt;/code&gt;&lt;code&gt;240 / 14 = 17.1&lt;/code&gt;&lt;br /&gt;&lt;code&gt;            &lt;/code&gt;&lt;code&gt;240 / 16 = 15   &lt;/code&gt;&lt;br /&gt;&lt;code&gt;            &lt;/code&gt;&lt;code&gt;240 / 18 = 13.3&lt;/code&gt;&lt;br /&gt;&lt;code&gt;            &lt;/code&gt;&lt;code&gt;240 / 20 = 12   &lt;/code&gt;&lt;br /&gt;&lt;code&gt;            &lt;/code&gt;&lt;code&gt;240 / 22 = 10.9&lt;/code&gt;&lt;br /&gt;&lt;code&gt;            &lt;/code&gt;&lt;code&gt;240 / 24 = 10   &lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;For most of these fragment counts, 240 divides cleanly into whole numbers (of lookup table entries), meaning that data should be well balanced across the table fragments when the data is repartitioned.  Where there is not an integer result (e.g. 240/14), we would have most partitions with 17 lookup entries, and two with 18 lookup entries.  The imbalance between them would be (18/17)-1 = ~6%.  If this were problematic, then a different mod_val could be used.  A higher mod_val gives smaller partition imbalances, but requires more memory to store.  If necessary, the lookup table could be expanded in size by any integer factor (e.g. 2,3,4..) 
online to make it large enough to factor better for some desired data node count.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Moving rows online&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The HashMap gives fine grained control of data placement, but how does the reorg happen online?&lt;br /&gt;&lt;br /&gt;Table Reorganisation is similar to Node recovery in some ways, in that the data is copied via fragment scans of the existing fragments, while at the same time, synchronous triggers are used to forward changes made to the existing fragment rows.  The triggers and scans only copy data for rows which are to be moved,&lt;br /&gt;&lt;br /&gt;i.e. where&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;/code&gt;&lt;div style="text-align: center;"&gt;&lt;code&gt;new_hashmap_lookup[ md5( dist_key_cols ) % 240 ]&lt;br /&gt;!=&lt;br /&gt;old_hashmap_lookup[ md5( dist_key_cols ) % 240 ]&lt;/code&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;Other rows are left where they are.&lt;br /&gt;&lt;br /&gt;With this mechanism, the new fragments are populated with rows from the existing fragments while read and write transactions continue.  Once the fragment scans complete, the new fragments continue to be maintained by the synchronous triggers.&lt;br /&gt;&lt;br /&gt;A future GCP boundary is chosen to be the 'cutover' point, and at this GCI, the new HashMap starts getting used for new transaction processing, and the new fragments start being used.  Triggers are set up to propagate changes from the new fragments back to the pre-existing fragments, so that any older transactions using the old hashmap definition will see consistent data changes.&lt;br /&gt;&lt;br /&gt;Once all transactions using rows from the pre-existing fragments have committed, the synchronous triggers are dropped, and the pre-existing fragments are scanned again, deleting the moved rows.  
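&lt;br /&gt;&lt;br /&gt;As an aside, the 'which rows move' rule used by the copy scans, the triggers and the final delete scan can be sketched in a few lines of Python.  This is only an illustration: the names initial_hashmap, expand_hashmap and row_moves are made up here, and the policy for choosing which lookup entries to reassign is an assumption, not necessarily what Ndb does.&lt;br /&gt;&lt;pre&gt;
```python
from collections import Counter

MOD_VAL = 240  # default hashmap size mentioned in the post


def initial_hashmap(num_frags, mod_val=MOD_VAL):
    """Lookup table sequenced 0,1,2,...,0,1,2,... over num_frags fragments."""
    return [i % num_frags for i in range(mod_val)]


def expand_hashmap(old_map, new_frags):
    """Build a map over new_frags fragments, changing as few entries as possible.

    Assumes len(old_map) is a multiple of new_frags.  Which entries get
    reassigned is an arbitrary choice made for this sketch.
    """
    target = len(old_map) // new_frags      # entries each fragment should own
    old_frags = max(old_map) + 1
    new_ids = list(range(old_frags, new_frags))
    kept = Counter()
    spare = 0
    new_map = []
    for frag in old_map:
        if kept[frag] == target:
            # Old fragment already keeps its full share: give this entry away.
            new_map.append(new_ids[spare % len(new_ids)])
            spare += 1
        else:
            kept[frag] += 1
            new_map.append(frag)
    return new_map


def row_moves(hash_value, old_map, new_map):
    """True if a row with this hash value belongs to a different fragment now."""
    return new_map[hash_value % len(new_map)] != old_map[hash_value % len(old_map)]


map4 = initial_hashmap(4)
map6 = expand_hashmap(map4, 6)
moved = sum(row_moves(h, map4, map6) for h in range(MOD_VAL))
print(moved, "of", MOD_VAL, "hashmap entries change fragment")
```
&lt;/pre&gt;With these assumptions, going from 4 to 6 fragments changes 80 of the 240 entries (2/6 of them), and every fragment ends up owning 40 entries.&lt;br /&gt;&lt;br /&gt;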
Once this delete scan completes, the reorganisation is done.&lt;br /&gt;&lt;br /&gt;Primary and Unique key operations in Ndb are short lived, and at Hashmap cutover, it doesn't take long until all old operations have committed.  However, ordered index and table scans are slower and may not complete for some time.  Both old and new row copies are maintained until all scans started using the old distribution have completed, so that ongoing transactions need not be aborted as part of the online reorg.&lt;br /&gt;&lt;br /&gt;At the same time as the hashmaps are cut over at a GCI boundary, any NdbApi event subscribers listening to data change events on the table, for example attached MySQLDs recording Binlogs, start receiving events for the moved rows from the new fragments, and stop receiving them from the old fragments.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Transient storage use&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;When adding data nodes, the reorganisation uses no extra storage space on existing data nodes.  On new data nodes, only the space used for the moved data is used.  After the reorganisation completes, the space formerly used on pre-existing data nodes can be used for new data, so the system capacity is increased.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Transactional behaviour&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Table reorganisation can take some time when there is a lot of data to move.  A node or cluster failure during a reorganisation could leave the system in a transient state which would be difficult to recover from.  One of the internal infrastructure changes in Cluster 7.0 was making all DDL operations transactional.  This means that they are atomic w.r.t. failures, including node and system failures.  This applies to CREATE/DROP/ALTER of TABLE/INDEX/TABLESPACE etc.&lt;br /&gt;This also applies to table reorganisation as it is a form of ALTER TABLE.  
If the reorganisation fails, or a node fails, or the cluster fails at some point during the reorganisation, then as part of system recovery, the reorganisation will be rolled back, or completed, if it had committed at the time of failure.&lt;br /&gt;&lt;br /&gt;So that covers online table reorganisation.  I've been meaning to write about it for some time, though somehow these entries always seem to be more like adverts than technical info.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-8096555334143424493?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/8096555334143424493/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=8096555334143424493' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/8096555334143424493'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/8096555334143424493'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2011/03/mysql-cluster-online-scaling.html' title='MySQL Cluster online scaling'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-6639369915798676758</id><published>2011-03-26T00:43:00.004Z</published><updated>2011-10-12T11:42:34.207+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category 
scheme='http://www.blogger.com/atom/ns#' term='cluster'/><category scheme='http://www.blogger.com/atom/ns#' term='message-passing'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='parallel'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>Data distribution in MySQL Cluster</title><content type='html'>MySQL Cluster distributes rows amongst the data nodes in a cluster, and also provides data replication.  How does this work?  What are the trade offs?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Table fragments&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Tables are &lt;span style="font-style: italic;"&gt;'horizontally fragmented&lt;/span&gt;' into table fragments each containing a disjoint subset of the rows of the table.  The union of rows in all table fragments is the set of rows in the table.  Rows are always identified by their primary key.  Tables with no primary key are given a hidden primary key by MySQLD.&lt;br /&gt;&lt;br /&gt;By default, one table fragment is created for each data node in the cluster at the time the table is created.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Node groups and Fragment replicas&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The data nodes in a cluster are logically divided into &lt;span style="font-style: italic;"&gt;Node groups.&lt;/span&gt;  The size of each Node group is controlled by the NoOfReplicas parameter.  All data nodes in a Node group store the same data.  In other words, where the NoOfReplicas parameter is two or greater, each table fragment has a number of replicas, stored on multiple separate data nodes in the same nodegroup for availability.&lt;br /&gt;&lt;br /&gt;One replica of each fragment is considered &lt;span style="font-style: italic;"&gt;primary&lt;/span&gt;, and the other(s) are considered &lt;span style="font-style: italic;"&gt;backup&lt;/span&gt; replicas.  
Normally, each node contains a mix of primary and backup fragments for every table, which encourages system balance.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Which replica to use?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The primary fragment replica is used to serialise locking between transactions concurrently accessing the same row.  Write operations update &lt;span style="font-weight: bold;"&gt;all&lt;/span&gt; fragment replicas synchronously, ensuring no committed data loss on node failure.  Read operations normally access the primary fragment replica, ensuring consistency.  Reads with a special lock mode can access the backup fragment replicas.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Primary key read protocol&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;When an NdbApi client (for example a MySQLD process) wants to read a row by primary key, it sends a read request to a data node acting as a &lt;span style="font-style: italic;"&gt;Transaction Coordinator&lt;/span&gt; (TC).&lt;br /&gt;The TC node will determine which fragment the row would be stored in from the primary key, decide which replica to access (usually the primary), and send a read request to the data node containing that fragment replica.  
The data node containing the fragment replica then sends the row's data (if present) directly back to the requesting NdbApi client, and also sends a read acknowledgement or failure notification back to the TC node, which also propagates it back to the NdbApi client.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Minimising inter data node hops&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The 'critical path' for this protocol in terms of potential inter-data-node hops is four hops :&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;Client -&amp;gt; TC -&amp;gt; Fragment -&amp;gt; TC -&amp;gt; Client&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;To minimise remote client experienced latency, ideally two inter-node hops can be avoided by having the TC node and the Fragment replica(s) on the same node.  This requires controlling the choice of node for TC based on the primary key of the data which will be read.  Where a transaction only reads rows stored on the same node as its TC, this can improve latency and system efficiency.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Distribution awareness&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;From NdbApi, users can specify a table and key when starting a transaction.  The transaction will then choose a TC data node based on where the corresponding row's primary fragment replica is located in the system.  This mechanism is sometimes referred to as &lt;span style="font-style: italic;"&gt;'transaction hinting'&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The Ndb handler in MySQLD generally waits for the first primary key lookup in a user session before starting an NdbApi transaction, so that it can choose a TC node based on this.  This is a best-effort attempt at having the data node acting as TC colocated with the accessed data.  
This feature is usually referred to as &lt;span style="font-style: italic;"&gt;'Distribution Awareness'&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Write operations also benefit from distribution awareness, but not to the same extent in systems with NoOfReplicas &amp;gt; 1.  Write operations must update all fragment replicas, which must be stored on different nodes, in the same nodegroup, so for NoOfReplicas &amp;gt; 1, distribution awareness avoids inter-node-group communication, and some intra-node-group communication, but some inter-data-node communication is always required.  In a system with good data partitioning and distribution awareness, most read transactions will access only one data node, and write transactions will result in messaging between the data nodes of a single node group.  Messaging between node groups will be minimal.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Distribution keys&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;By default, the whole of a table's primary key is used to determine which fragment replica will store a row.  However, any subset of the columns in the primary key can be used.  The key columns used to determine the row distribution are called the &lt;span style="font-style: italic;"&gt;'distribution key'&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Where a table's primary key contains only one column, the distribution key must be the full primary key.  Where the primary key has more than one column, the distribution key can be different to (a subset of) the primary key.&lt;br /&gt;&lt;br /&gt;From MySQLD, a distribution key can be set using the normal &lt;span style="font-family:courier new;"&gt;PARTITION BY KEY(&amp;lt;keys&amp;gt;)&lt;/span&gt; syntax.  
The effect of using a distribution key which is a subset of the primary key is that rows with different primary key values, but the same distribution key values are guaranteed to be stored in the same table fragment.&lt;br /&gt;&lt;br /&gt;For example, if we create a table :&lt;br /&gt;&lt;code&gt;&lt;br /&gt;CREATE TABLE user_accounts (user_id               BIGINT,&lt;br /&gt;                                                        account_type     VARCHAR(255),&lt;br /&gt;                                                        username             VARCHAR(60),&lt;br /&gt;                                                        state          INT,&lt;br /&gt;                           PRIMARY KEY (user_id, account_type))&lt;br /&gt;                              engine = ndb partition by key (user_id);&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Then insert some rows :&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;INSERT INTO user_accounts VALUES (22, "Twitter", "Bader", 2),&lt;br /&gt;                                                                  (22, "Facebook", "Bd77", 2),&lt;br /&gt;                                                                  (22, "Flickr", "BadB", 3),&lt;br /&gt;                                                                  (23, "Facebook", "JJ", 2);&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Then we know that all rows with the same value(s) for the distribution key (user_id), will be stored on the same fragment.  If we know that individual transactions are likely to access rows with the same distribution key value then this will increase the effectiveness of distribution awareness.  Many schemas are &lt;span style="font-style: italic;"&gt;'partitionable'&lt;/span&gt; like this, though not all.&lt;br /&gt;&lt;br /&gt;Note that partitioning is a performance hint in Ndb - correctness is not affected in any way, and transactions can always span table fragments on the same or different data nodes.  
This allows applications to take advantage of the performance advantages of distribution awareness without requiring that all transactions affect only one node etc as required by simpler 'sharding' mechanisms.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Correlated distribution keys across tables&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A further guarantee from Ndb is that two tables with the same number of fragments, and the same &lt;span style="font-weight: bold;"&gt;number and type&lt;/span&gt; of distribution keys will have rows distributed in the same way.&lt;br /&gt;&lt;br /&gt;For example, if we add another table :&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;CREATE TABLE user_prefs (user_id         BIGINT,&lt;br /&gt;                                                  type               VARCHAR(60),&lt;br /&gt;                                                  value    VARCHAR(255),&lt;br /&gt;                        PRIMARY KEY (user_id, type))&lt;br /&gt;                           engine = ndb partition by key (user_id);&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Then insert some rows :&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;INSERT INTO user_prefs VALUES (22, "Coffee",  "Milk + 6 sugars"),&lt;br /&gt;                                                            (22, "Eggs",    "Over easy"),&lt;br /&gt;                                                            (23, "Custard", "With skin");&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Then we know that the rows with the same user_id in the user_prefs and user_accounts tables will be stored on the same data node.  Again, this helps with distribution awareness.  
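&lt;br /&gt;&lt;br /&gt;The placement rule behind this can be sketched in Python.  The function fragment_num below follows the shape of the formula used on this blog (an MD5 hash of the distribution key, reduced modulo 240, then a lookup), but the byte encoding of the key, the fragment count of 4 and the lookup table contents are all assumptions for illustration; Ndb hashes its own internal key format.&lt;br /&gt;&lt;pre&gt;
```python
import hashlib

NUM_FRAGS = 4   # assumed fragment count, the same for both tables
MOD_VAL = 240
LOOKUP = [i % NUM_FRAGS for i in range(MOD_VAL)]


def fragment_num(dist_key_cols):
    """Map distribution key values to a fragment number.

    Only the distribution key is hashed, so other primary key columns
    and the table name play no part in the result.
    """
    encoded = repr(tuple(dist_key_cols)).encode("utf-8")  # illustrative encoding
    digest = hashlib.md5(encoded).digest()
    hash32 = int.from_bytes(digest[:4], "little")
    return LOOKUP[hash32 % MOD_VAL]


accounts = [(22, "Twitter"), (22, "Facebook"), (22, "Flickr"), (23, "Facebook")]
prefs = [(22, "Coffee"), (22, "Eggs"), (23, "Custard")]

for user_id, account_type in accounts:
    print("user_accounts", user_id, account_type, "fragment", fragment_num([user_id]))
for user_id, pref_type in prefs:
    print("user_prefs", user_id, pref_type, "fragment", fragment_num([user_id]))
```
&lt;/pre&gt;Every row for user_id 22 reports the same fragment number in both tables, whatever its other primary key column holds.&lt;br /&gt;&lt;br /&gt;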
In this example, we are ensuring that rows related to a single user, as identified by a common user_id, will be located on one data node, maximising system efficiency, and minimising latency.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Ordered index scan pruning&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;MySQL Cluster supports arbitrary ordered indexes.  Ordered indexes are defined on one or more columns and support range scan operations.  Range scans are defined by supplying optional lower and upper bounds.  All rows between these bounds are returned.&lt;br /&gt;&lt;br /&gt;Each Ndb ordered index is implemented as a number of in-memory tree structures (&lt;span style="font-style: italic;"&gt;index fragments&lt;/span&gt;), distributed with the fragments of the indexed table.  Each index fragment contains the index entries for the local table fragment.  Having ordered indexes local to the table fragments makes index maintenance more efficient, but means that there may not be much locality to exploit when scanning as rows in a range may be spread across all index fragments of an index.&lt;br /&gt;&lt;br /&gt;The only case where an ordered index scan does not need to scan all index fragments is where it is known that all rows in the range will be found in one table fragment.&lt;br /&gt;This is the case where both :&lt;br /&gt;&lt;ol&gt;&lt;li&gt; The ordered index has &lt;span style="font-weight: bold;"&gt;all&lt;/span&gt; of the table's distribution keys as a&lt;span style="font-weight: bold;"&gt; prefix&lt;/span&gt;&lt;/li&gt;&lt;li&gt;The range is &lt;span style="font-weight: bold;"&gt;contained within one value&lt;/span&gt; of the table's distribution keys&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;NdbApi detects this case when a range scan is defined, and &lt;span style="font-style: italic;"&gt;'prunes'&lt;/span&gt; the scan to one index fragment (and therefore one data node).  
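&lt;br /&gt;&lt;br /&gt;The two conditions can be written down as a small Python check.  The helper can_prune is hypothetical and is not the NdbApi implementation; in particular, treating 'prefix' as 'the first columns of the index, in any order' is my reading of the rule.&lt;br /&gt;&lt;pre&gt;
```python
def can_prune(index_cols, dist_key_cols, equality_cols):
    """Return True if a range scan could be pruned to one index fragment.

    index_cols    -- ordered index columns, in index order
    dist_key_cols -- the table's distribution key columns
    equality_cols -- columns that the scan bounds pin to a single value
    """
    dist = set(dist_key_cols)
    # Condition 1: all distribution key columns form a prefix of the index.
    has_prefix = set(index_cols[:len(dist)]) == dist
    # Condition 2: the range is contained within one distribution key value.
    single_value = dist.issubset(equality_cols)
    return has_prefix and single_value


pk_index = ["user_id", "account_type"]
dist_key = ["user_id"]

print(can_prune(pk_index, dist_key, {"user_id"}))       # user_id = 22
print(can_prune(pk_index, dist_key, {"account_type"}))  # account_type = "Facebook"
print(can_prune(pk_index, dist_key, set()))             # user_id between 20 and 30
```
&lt;/pre&gt;Only the first call returns True, since it is the only one that pins user_id to a single value.&lt;br /&gt;&lt;br /&gt;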
For all other cases, all index fragments must be scanned.&lt;br /&gt;&lt;br /&gt;Continuing the example above, assuming an ordered index on the primary key, the following ordered index scans can be pruned :&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt; SELECT * FROM user_accounts WHERE user_id = 22;&lt;br /&gt; SELECT * FROM user_accounts WHERE user_id = 22 AND account_type LIKE 'F%';&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;However, the following ordered index scans cannot be pruned, as matching rows are not guaranteed to be stored in one table fragment :&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt; SELECT * FROM user_accounts WHERE account_type = "Facebook";&lt;br /&gt; SELECT * FROM user_accounts WHERE user_id &amp;gt; 20 AND user_id &amp;lt; 30;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;MySQLD partitioning variants and manually controlling distribution&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Since MySQL 5.1, table partitioning has been supported.   Tables can be partitioned based on functions of the distribution keys such as :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;  KEY&lt;/li&gt;&lt;li&gt;  LINEAR KEY&lt;/li&gt;&lt;li&gt;  HASH&lt;/li&gt;&lt;li&gt;  RANGE&lt;/li&gt;&lt;li&gt;  LIST&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;For engines other than Ndb, partitioning is implemented in the Server, with each partition implemented as a separate table in the Storage engine.  Ndb implements these partition functions &lt;span style="font-style: italic;"&gt;natively&lt;/span&gt;, using them to control data distribution across table fragments in a single table.&lt;br /&gt;&lt;br /&gt;From Ndb's point of view, KEY and LINEAR KEY are native partitioning functions.  Ndb knows how to determine which table fragment to use for a row from a table's distribution key, based on an MD5 hash of the distribution key.&lt;br /&gt;&lt;br /&gt;HASH, RANGE and LIST are not natively supported by Ndb.  
When accessing tables defined using these functions, MySQLD must supply information to NdbApi to indicate which fragments to access.  For example before primary key insert, update, delete and read operations, the table fragment to perform the operation on must be supplied.  From MySQLD, the partitioning layer supplies this information.&lt;br /&gt;&lt;br /&gt;Any NdbApi application can use the same mechanisms to manually control data distribution across table fragments. At the NdbApi level this is referred to as &lt;span style="font-style: italic;"&gt;'User Defined'&lt;/span&gt; partitioning.  This feature is rarely used.  One downside of using User Defined partitioning is that online data redistribution is not supported.  I'll discuss Online data redistribution in a future post here.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;&lt;br /&gt;Edited on 12/10/11 to fix formatting imbalance &lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-6639369915798676758?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/6639369915798676758/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=6639369915798676758' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/6639369915798676758'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/6639369915798676758'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2011/03/data-distribution-in-mysql-cluster.html' title='Data distribution in MySQL Cluster'/><author><name>Frazer 
Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-2552179066895473968</id><published>2011-01-26T22:51:00.005Z</published><updated>2011-01-27T00:19:34.644Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='cluster'/><category scheme='http://www.blogger.com/atom/ns#' term='latency-hiding'/><category scheme='http://www.blogger.com/atom/ns#' term='nosql'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='parallel'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>Low latency distributed parallel joins</title><content type='html'>When MySQL AB bought Sun Microsystems in 2008 (or did Sun buy MySQL?), most of the MySQL team merged with the existing Database Technology Group (DBTG) within Sun.  The DBTG group had been busy working on JavaDB, Postgres and other DB related projects as well as 'High Availability DB' (HADB), which was Sun's name for the database formerly known as Clustra.&lt;br /&gt;&lt;br /&gt;Clustra originated as a University research project which spun out into a startup company and was then acquired by Sun around the era of dot-com.  A number of technical papers describing aspects of Clustra's design and history can be found &lt;a href="http://www.google.com/search?q=clustra"&gt;online,&lt;/a&gt; and it is in many ways similar to Ndb Cluster, not just in their shared Scandinavian roots.  Both are shared-nothing parallel databases originally aimed at the Telecoms market, supporting high availability and horizontal scalability.  
Clustra has an impressive feature set and many years of development behind it, but limited exposure to general purpose use.&lt;br /&gt;&lt;br /&gt;At the time of the MySQL acquisition, HADB/Clustra was embedded in a number of Sun products as a session store and metadata repository, but was not available for external customers for general purpose use.  Shortly afterwards, a decision was made to move HADB into a 'sustaining' model, and most of the ex-HADB team then became available to work on other projects.  MySQL has greatly benefited from the injection of skills and enthusiasm from the Sun DBTG across a number of different teams, which is maybe not well known to those outside the company.&lt;br /&gt;&lt;br /&gt;In the Cluster team, one project which has really benefited is the SPJ (Select Project Join) project, which couldn't have happened without the expertise and energy of the ex-Clustra/HADB team working on it.&lt;br /&gt;&lt;br /&gt;The SPJ project started around the time of the last MySQL Developers conference in Riga in September 2008.  
The intention at the time was to look at ways of efficiently supporting more complex queries, specifically involving table joins, reducing unnecessary data transfer, communication latencies and context switches and increasing parallelism.&lt;br /&gt;&lt;br /&gt;The main insights at the start of the project included :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Join mechanisms should be based on &lt;span style="font-weight: bold;"&gt;linking&lt;/span&gt; existing NdbApi single-table access primitives&lt;/li&gt;&lt;li&gt;Join result sets need not and should not be fully materialised at the data nodes&lt;/li&gt;&lt;li&gt;Join mechanisms need not be fully general or fully capable initially, as full generality/capability is already available with the existing Apis&lt;/li&gt;&lt;li&gt;Small joins should be targeted (Number of rows, number of tables)&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;These points have simplified the project scope greatly, and allowed many painful and costly detours to be avoided.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;SQL execution in MySQL Cluster&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As &lt;a href="http://messagepassing.blogspot.com/2009/09/ndb-software-architecture.html"&gt;described&lt;/a&gt; in a previous post, Ndb Cluster was originally designed to quickly execute small queries with a high update rate at low latency.  Larger more complex queries were executed by a separate query processor.  The emphasis in this design is that complex queries are possible, but not necessarily fast or efficient.  A main goal is that complex queries do not adversely affect the properties of the high volume, low latency requests.&lt;br /&gt;&lt;br /&gt;All data access in MySQL Cluster is via the NdbApi interface.  
The NdbApi gives access to data stored in tables in the Cluster via four primitives :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Read row by primary key&lt;/li&gt;&lt;li&gt;Read row by secondary unique key&lt;/li&gt;&lt;li&gt;Scan a range of rows in an ordered index with optional conditions&lt;/li&gt;&lt;li&gt;Scan all rows in a table with optional conditions&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Each of these primitives operates on a single table, and any table joins must be done by the NdbApi user.  Different primary key/unique key operations and individual scan operations run in parallel across the data nodes in a cluster.  NdbApi allows different operations and scan requests to be batched together to minimise latency due to communication delays, and it is essential to use this batching to get minimal latencies with Ndb.&lt;br /&gt;&lt;br /&gt;In MySQL Cluster, attached MySQL Servers act as query processors, and MySQL's SQL execution engine breaks complex queries down into calls to its generic Storage Engine (SE) Api, which also deals with data access one table at a time.&lt;br /&gt;&lt;br /&gt;The Ndb storage engine then further decomposes these SE Api calls into NdbApi primitive operations on individual tables.&lt;br /&gt;&lt;br /&gt;MySQL supports SQL queries by performing the SE Api calls to read data, then comparing and matching results, sorting, buffering and reformatting.  This works very well and gives MySQL Cluster great SQL functionality and compatibility, although users may find that the latency of their individual queries is not as low as with other engines such as MyISAM and InnoDB, which do not have to perform inter-process communication to implement their SE Api calls.&lt;br /&gt;&lt;br /&gt;For minimum latency, Ndb requires that the MySQL Server makes efficient requests for data, requesting as much data as possible at once, and not using the data until it is essential to make forward progress - e.g. 
when there is a real data dependency.&lt;br /&gt;&lt;br /&gt;MySQL features such as Insert, Update and Delete batching, and Batched Key Access minimise the MySQLD to Data node round trips required to execute certain types of operations, but they are unable to help when there are real data dependencies in a query.  For example when the server needs to read some value from table t1 to know which rows to read from table t2 then there is no alternative but to read the t1 rows into memory before issuing any reads from t2.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Linked Operations&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To reduce the need for extra Api to Data node round trips for every data dependency, we must allow operations to be linked.   If we can describe the data dependency as a link between NdbApi operations, then it can be resolved amongst the data nodes.  For example, rather than stating :&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt; SQL &gt; select t1.b, t2.c from t1,t2 where t1.pk=22 and t1.b=t2.pk;&lt;br /&gt;        ndbapi &gt; read column b from t1 where pk = 22;&lt;br /&gt;           &lt;br /&gt;&lt;br /&gt;                   [round trip]&lt;br /&gt;           &lt;br /&gt;&lt;br /&gt;                   (b = 15)&lt;br /&gt;        ndbapi &gt; read column c from t2 where pk = 15;&lt;br /&gt;          &lt;br /&gt;&lt;br /&gt;                   [round trip]&lt;br /&gt;          &lt;br /&gt;&lt;br /&gt;                   (c = 30)&lt;br /&gt;        [ return b = 15, c = 30 ]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;We would state the join/operation linkage at the ndbapi level :&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;        ndbapi &gt; read column &lt;span style="font-weight: bold;"&gt;@b&lt;/span&gt;:=b from t1 where pk = 22;&lt;br /&gt;                 read column c from t2 where pk=&lt;span style="font-weight: bold;"&gt;@b&lt;/span&gt;;&lt;br /&gt;          &lt;br /&gt;&lt;br /&gt;                   [round trip]&lt;br /&gt;      
    &lt;br /&gt;&lt;br /&gt;                   (b = 15, c = 30)&lt;br /&gt;        [ return b = 15, c = 30 ]&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;We allow read operations to be parameterised on the results of previous operations, and have the linking of the operations, the flow of results into parameters, handled by the data nodes.  The data dependency still results in some execution serialisation at the data node layer, but not at the api layer, so data dependencies within queries needn't result in extra round trips between the MySQL server and the data nodes.  Where the dependent data happens to be on the same data node, the dependency can be resolved with no inter-process communication at all.&lt;br /&gt;&lt;br /&gt;Viewing the database software as a stack, the execution of the join is being 'pushed down' the stack, to a lower layer.  For this reason, the SPJ functionality is also sometimes referred to as pushed-down joins or just pushed joins.  Pushing functionality closer to the data can result in improved performance due to lower latency, reduced data transfer etc.  In the case of MySQL Cluster, it can avoid inter-process communication, as well as enable parallelism across the data nodes.&lt;br /&gt;&lt;br /&gt;In theory, linking can occur between any of the 4 primitive operation types :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Primary key access (PK)&lt;/li&gt;&lt;li&gt;Unique key access (UK)&lt;/li&gt;&lt;li&gt;Ordered index range scan (OI)&lt;br /&gt;(Range bounds and optional conditions parameterised)&lt;/li&gt;&lt;li&gt;Table scan (TS)&lt;br /&gt;(Optional conditions parameterised)&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;In practice, linking the cardinality (0|1) operations (Primary key, Unique key) together is simpler than linking with the scans.  
In turn, linking a scan to a cardinality (0|1) operation is simpler than linking a scan to another scan.&lt;br /&gt;&lt;br /&gt;Linking a table scan to a table scan results in a cross-join and is probably going to be unpleasantly expensive for anything other than small tables.&lt;br /&gt;&lt;br /&gt;The initial SPJ implementation supports combinations of Primary/Unique key operations linked together with at most one ordered index scan.&lt;br /&gt;&lt;br /&gt;A future implementation will support multiple ordered index scans in a single request.  This is more complex to handle due to the buffering required of the different scan result sets, and the resulting result ordering versus efficiency tradeoffs.&lt;br /&gt;&lt;br /&gt;The SPJ Api is implemented as an extension to the existing NdbApi, with similar primitive concepts, but with the addition of the means to link the primitives together.  As with the existing NdbApi, the usage pattern is along these lines :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Define operation(s)&lt;/li&gt;&lt;li&gt;Define further linked operation(s)&lt;/li&gt;&lt;li&gt;Execute() // One round trip to the data nodes&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Examine results&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;In terms of batching, a tree of linked operations, with one const-parameterised root operation, and one or more child operations, is considered to be a single operation.  
Multiple SPJ operations, each actually a tree of primitive operations, can be executed simultaneously in a batch, along with other 'basic' NdbApi operations.&lt;br /&gt;&lt;br /&gt;Where a scan is included, the scan can be advanced using the normal nextResult() mechanism, which also advances the results returned by any cardinality (0|1) child operations.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;NoJoins - Not only Joins&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;While the SPJ extensions are described here in terms of joins, at the NdbApi level they are really 'linked operations'.  One design goal which is not completely aligned with the join concept was to allow scans of multiple different tables to be parameterised on a single root operation.  For example :&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;read &lt;span style="font-weight: bold;"&gt;@eid&lt;/span&gt;:= entity_id from map_table where username="jan";&lt;/li&gt;&lt;li&gt;scan blog_titles from blog_posts where entity_id=&lt;span style="font-weight: bold;"&gt;@eid&lt;/span&gt;;&lt;/li&gt;&lt;li&gt;scan latest_tweets from twitter_feed where entity_id = &lt;span style="font-weight: bold;"&gt;@eid&lt;/span&gt;;&lt;/li&gt;&lt;li&gt;scan share_prices from stock_feed where entity_id = &lt;span style="font-weight: bold;"&gt;@eid&lt;/span&gt;;&lt;/li&gt;&lt;li&gt;....&lt;/li&gt;&lt;/ul&gt;Here there is a data dependency between the first lookup and n peer child scans.  I want to read all of this data in one round trip, but I don't necessarily want to have to express this in a single 'join' query.  If we had a more relational/SQL oriented Api we might have had to create some unholy union of the different results, with masses of repeated values or nulls, or repeat the first lookup for each of &lt;span style="font-style: italic;"&gt;n&lt;/span&gt; two-way joins. 
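&lt;br /&gt;&lt;br /&gt;As a rough illustration of the round-trip saving, the lookup-plus-peer-scans pattern can be modelled in a few lines of Python.  This is a toy model, not NdbApi code: the table contents, the class and every method name are invented for the sketch, and a 'round trip' is just a counter :&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
# Toy model only: plain Python dicts and lists stand in for Ndb tables.
# None of these names are part of the real NdbApi.

MAP_TABLE = {"jan": 7}  # username to entity_id
BLOG_POSTS = [(7, "Pushed joins"), (8, "Other"), (7, "Batching")]
TWITTER_FEED = [(7, "hello"), (9, "unrelated")]


class ToyCluster:
    """Counts API-to-data-node round trips for two access styles."""

    def __init__(self):
        self.round_trips = 0

    def _lookup(self, username):
        return MAP_TABLE.get(username)

    def _scan(self, table, entity_id):
        return [value for (eid, value) in table if eid == entity_id]

    def unlinked(self, username, tables):
        # The API must see the entity_id before it can define the scans.
        self.round_trips += 1
        eid = self._lookup(username)
        # The independent scans can at least be batched into one trip.
        self.round_trips += 1
        return [self._scan(t, eid) for t in tables]

    def linked(self, username, tables):
        # Root lookup plus parameterised child scans sent as one request;
        # the dependency is resolved on the "data node" side.
        self.round_trips += 1
        eid = self._lookup(username)
        return [self._scan(t, eid) for t in tables]


cluster = ToyCluster()
plain = cluster.unlinked("jan", [BLOG_POSTS, TWITTER_FEED])
trips_unlinked = cluster.round_trips

cluster = ToyCluster()
linked = cluster.linked("jan", [BLOG_POSTS, TWITTER_FEED])
trips_linked = cluster.round_trips

print(trips_unlinked, trips_linked)
print(linked)
```
&lt;/pre&gt;Both styles return the same per-table result lists, each kept separate rather than merged into one row set; in this model the linked form simply needs one round trip where the unlinked form needs two.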
&lt;br /&gt;&lt;br /&gt;With the linked operation concept, we can clearly state that the child scans are parameterised by the first lookup, &lt;span style="font-style: italic;"&gt;without&lt;/span&gt; having to introduce some further unnatural coupling between the rows returned by each scan, which are otherwise independent.&lt;br /&gt;&lt;br /&gt;So although SPJ is named after and described as supporting joins, it doesn't mean that you have to be 'join-oriented' or a SQL Samurai to benefit from it.  It may be quite useful for efficiently traversing graphs, hierarchies and other links between rows where the concept of a 'join' is quite alien.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Check it out&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A mysql-5.1-cluster-7.1 source tree with the SPJ enhancements can be downloaded from &lt;a href="ftp://ftp.mysql.com/pub/mysql/download/cluster_telco/mysql-5.1.44-ndb-7.1.3-spj-preview/mysql-cluster-gpl-7.1.3-spj-preview.tar.gz"&gt;here&lt;/a&gt;.   You can see the NdbApi extensions in the storage/ndb/include/ndbapi  directory of the source tree.  This source also includes extensions to  the MySQL Ndb handler to make use of the new SPJ Api for SQL queries,  which I hope to describe a little next time.  
If you want to download and try  out SPJ then see some of the other blog &lt;a href="http://www.google.com/search?q=spj+cluster"&gt;posts&lt;/a&gt; about how to get started with it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-2552179066895473968?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/2552179066895473968/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=2552179066895473968' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/2552179066895473968'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/2552179066895473968'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2011/01/low-latency-distributed-parallel-joins.html' title='Low latency distributed parallel joins'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-6516222635586568823</id><published>2010-09-27T11:45:00.004+01:00</published><updated>2010-09-27T11:55:24.065+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><title type='text'>Some MySQL projects I think are cool - OpenQuery Graph Engine (OQG)</title><content type='html'>This project was announced a year or so ago by Antony Curtis who used to work for MySQL AB. 
Having met Antony a few times, I was intrigued to see what he was up to. The quote on the &lt;a href="http://openquery.com/"&gt;OpenQuery&lt;/a&gt; website describes it well :&lt;br /&gt;&lt;blockquote&gt;The Open Query GRAPH engine (OQGRAPH) is a &lt;em&gt;computation engine&lt;/em&gt; allowing hierarchies and more complex graph structures to be handled in a relational fashion. In a nutshell, tree structures and friend-of-a-friend style searches can now be done using standard SQL syntax, and results joined onto other tables.&lt;/blockquote&gt;&lt;br /&gt;That sounds cool, and it's the first time I've heard of a MySQL 'Computation engine' plugin.  Delving further into the &lt;a href="http://openquery.com/graph/doc"&gt;manual&lt;/a&gt; gives some insight, and there are some unexpected twists there :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;OQG is a storage engine, but data stored is not persistent w.r.t. server crashes.&lt;/li&gt;&lt;li&gt;All tables have the same schema, storing details of graph 'edges'.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The fixed schema has a magic column called 'latch'&lt;/li&gt;&lt;li&gt;Depending on the constant value of latch used in a SELECT statement on the table, the engine will return different 'pseudo results'.&lt;/li&gt;&lt;/ul&gt;The last fact is the coolest one.  As far as I understand it :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;SELECT where latch = NULL AND ... allows queries on the graph as though it were a list of edges (as the data was entered).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;SELECT where latch = 0 [AND ...] allows queries on the graph as though it were a list of nodes.&lt;/li&gt;&lt;li&gt;SELECT where latch = 1 [AND ...] allows &lt;a href="http://en.wikipedia.org/wiki/Dijkstra%27s_algorithm"&gt;Dijkstra's shortest path algorithm&lt;/a&gt; to be applied to the graph&lt;/li&gt;&lt;li&gt;SELECT where latch = 2 [AND ...] 
allows a &lt;a href="http://en.wikipedia.org/wiki/Breadth_first_search"&gt;breadth-first search&lt;/a&gt; to be applied to the graph&lt;/li&gt;&lt;/ul&gt;This is a superb hack! I imagine the OQG engine internally has an in-memory graph structure which is maintained as edges are added via the INSERT Api. The SELECT Api then gives access to different views of the underlying graph and even allows complex parameterised functions to be applied to the graph, giving results as a set of rows which can be decoded into the required result. It's not pretty, but it's an extremely pragmatic approach to embedding graph access and operations within a database.&lt;br /&gt;&lt;br /&gt;It's also undeniable that the use of magic numbers and the 'latch' column adds a certain arcane wackiness that charms this reader. It's definitely a MySQL-style solution, continuing the tradition of MyISAM, Blackhole, Federated etc where good-enough gets to market before best, and 20% of the implementation effort delivers 80% of the functionality.&lt;br /&gt;&lt;br /&gt;Once again I'd be interested to hear about how this is actually being used, and what sort of difference it is making.&lt;br /&gt;&lt;br /&gt;Each of these three cool projects enables new solutions individually and expands the dimensions of what is possible using MySQL. In combination they open up a vast expanse of potential. One of the best things about them is that they all happened outside the confines of MySQL / Sun / Oracle. 
Hopefully they will get the success they deserve so that we can have more cool new projects in future.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-6516222635586568823?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/6516222635586568823/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=6516222635586568823' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/6516222635586568823'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/6516222635586568823'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2010/09/some-mysql-projects-i-think-are-cool_1227.html' title='Some MySQL projects I think are cool - OpenQuery Graph Engine (OQG)'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-4778124966172480320</id><published>2010-09-27T11:39:00.002+01:00</published><updated>2010-09-27T11:43:27.717+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='parallel'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>Some MySQL projects I think are cool - Spider Storage Engine</title><content type='html'>One thing that has puzzled me about MySQL Server is that it 
became famous for sharded scale-out deployments in well known web sites and yet has no visible support for such deployments. The MySQL killer feature for some time has been built-in asynchronous replication and gigabytes of blogs have been written about how to set up, use, debug and optimise replication, but when it comes to 'sharding' there is nothing built in. Perhaps to have attempted to implement something would have artificially constrained users' imaginations, whereas having no support at all has allowed 1,000 solutions to sprout? Perhaps there just wasn't MySQL developer bandwidth available, or perhaps it just wasn't the best use of the available time. In any case, it remains unclaimed territory to this day.&lt;br /&gt;&lt;br /&gt;On first hearing of the &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/federated-storage-engine.html"&gt;Federated&lt;/a&gt; storage engine some years ago, I mistakenly assumed that this could be the basis of some MySQL scale-out solution. Perhaps a layer of front end 'proxy' MySQLDs could federate tables from a layer of backend MySQLDs giving some level of distribution transparency to sharded data. However as I found out, the Federated engine was not designed with such a scenario in mind. It has a certain internal elegance and simplicity, but unfortunately it is a little too simple for anything other than light duties.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://spiderformysql.com/"&gt;Spider&lt;/a&gt; storage engine extends the Federated concept of a table definition being a 'link' to a table on a remote MySQL server. However, it also integrates with the table &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/partitioning.html"&gt;partitioning&lt;/a&gt; features of MySQL 5.1, allowing each partition of a table to be specified as a 'link' to a table on a remote MySQL server. 
This effectively allows the built-in partitioning mechanisms of MySQLD (PARTITION BY RANGE/LIST/HASH) to be used to shard/partition rows across multiple MySQL servers transparently.&lt;br /&gt;&lt;br /&gt;One of the major drawbacks of the Federated engine was that it had very little support for 'pushing conditions' to the MySQLD instance storing the source tables. This meant that well behaved selective queries issued on the 'front-end' MySQLD instance could result in non-selective queries being issued to the 'back-end' MySQLD instances, and large volumes of data being unnecessarily transferred back to the 'front-end' MySQLD where query processing then discarded it.&lt;br /&gt;&lt;br /&gt;Spider attempts to improve this situation by pushing conditions down to the MySQLDs containing the source data. Combined with the partition pruning available from the MySQLD partitioning engine this should significantly reduce the amount of redundant data transferred in some cases.&lt;br /&gt;&lt;br /&gt;So I think Spider is a pretty cool project. Like MySQL Cluster, it bears the burden of making MySQLD more data-distribution-aware and I think they're doing great work. 
It'd be great to hear stories about how Spider is being used, especially if anyone is using it *with* MySQL Cluster.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-4778124966172480320?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/4778124966172480320/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=4778124966172480320' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/4778124966172480320'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/4778124966172480320'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2010/09/some-mysql-projects-i-think-are-cool_27.html' title='Some MySQL projects I think are cool - Spider Storage Engine'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-8316955527243718765</id><published>2010-09-27T11:34:00.002+01:00</published><updated>2010-09-27T11:37:40.832+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><title type='text'>Some MySQL projects I think are cool - Flexviews</title><content type='html'>Most of the time we think of SQL queries as being executed at a point in time and generating a single definitive result, but huge efficiency gains are 
available when data changes are tracked and derived views are partially updated as needed rather than being fully recomputed periodically. MySQL has support for views on tables, but there is currently no support for materialized views. While thinking about this topic I decided to have another look at Justin Swanhart's &lt;a href="http://flexviews.sourceforge.net/"&gt;Flexviews&lt;/a&gt; tool and it's definitely a cool MySQL based project.&lt;br /&gt;&lt;br /&gt;Flexviews is an open source set of non-intrusive addons to MySQL enabling materialized views to be defined and maintained as the underlying tables are changed. If you're not sure what a materialized view is or why they can be useful then I recommend reading the intro on the Flexviews site. I was particularly impressed by the documented support for GROUP BY, aggregates and joins.&lt;br /&gt;&lt;br /&gt;I have a vague recollection of reading a blog post about an early version of Flexviews which used MySQL triggers to collect data changes on underlying tables, and feeling that it was probably a little flaky. However I now read that the 'Change Data Capture' in recent versions has been factored out into a separate tool called FlexCDC which 'mines' Row-based Binlog entries. This is a far more promising approach, and useful for many other applications. From my own point of view, it makes Flexviews potentially useful for maintaining materialized views of data stored in MySQL Cluster, where Binlog is the only centralised record of change. It also got me thinking that MySQL Cluster already has code inside the Server listening to data changes for writing the Binlog, which could be extended to capture data changes into some other tables if it were a useful feature.&lt;br /&gt;&lt;br /&gt;The fact that Flexviews is implemented without Server changes or hooks is impressive and it's a great example of a MySQL 'ecosystem' project. 
It would be great to read some blog entries about how people are using it and what it is doing for them.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-8316955527243718765?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/8316955527243718765/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=8316955527243718765' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/8316955527243718765'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/8316955527243718765'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2010/09/some-mysql-projects-i-think-are-cool.html' title='Some MySQL projects I think are cool - Flexviews'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-6330347659477238978</id><published>2010-03-25T09:48:00.009Z</published><updated>2010-03-25T23:49:50.429Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='cluster'/><category scheme='http://www.blogger.com/atom/ns#' term='rambling'/><category scheme='http://www.blogger.com/atom/ns#' term='nosql'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='general'/><category 
scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>ACID tradeoffs, modularity, plugins, Drizzle</title><content type='html'>Most software people are aware of the &lt;a href="http://en.wikipedia.org/wiki/ACID"&gt;ACID&lt;/a&gt; acronym coined by Jim Gray.  With the growth of the web and open source, the scaling and complexity constraints imposed on DBMS implementations supporting ACID are more visible, and new (or at least new terms for known) compromises and tradeoffs are being discussed widely.  The better known &lt;a href="http://en.wikipedia.org/wiki/NoSQL"&gt;NoSQL&lt;/a&gt; systems are giving insight by example into particular choices of tradeoffs.&lt;br /&gt;&lt;br /&gt;Working at MySQL, I have often been surprised at the variety of potential alternatives when implementing a DBMS, and the number of applications which don't need the full set of ACID letters in the strictest form. The original MySQL storage engine, &lt;a href="http://en.wikipedia.org/wiki/MyISAM"&gt;MyISAM&lt;/a&gt; is one of the first and most successful examples of an 'ACID remix'.  The people drawn to DBMS development work often have a perfectionist streak, which can cause them to tend to prefer 'nothing' over 'imperfect'.  MyISAM was and still is a flag-bearer for '&lt;a href="http://en.wikipedia.org/wiki/Principle_of_good_enough"&gt;good enough&lt;/a&gt;'.  Perhaps we should be less modest and call it 'more than good enough'.&lt;br /&gt;&lt;br /&gt;One seldom discussed benefit of MySQL's storage engine architecture is that pressure to make 'The One True Storage Engine' is reduced.  DBMS products with one fixed database engine need to optimise for all supported use cases.  This is a great engineering challenge, but increases design effort, requirements for configuration and auto-tuning, constraints on any design change or reoptimisation etc.  With MySQL, there are multiple existing storage engines, each with a (sub)set of target use-cases in mind.  
A single MySQL server can maintain and access tables in different storage engines, each tuned as closely as possible to the use-case for the data, without adding complexity to unrelated engines.  Engines can be wildly optimised for a narrow use case as there are plausible alternative engines available for other use cases.&lt;br /&gt;&lt;br /&gt;I understand that one aim of the &lt;a href="http://drizzle.org/"&gt;Drizzle&lt;/a&gt; project is to extend the modularity of the MySQL Server on multiple axes, allowing diversity to flourish.  As a one-time Java coder, who enjoyed the pleasures of &lt;a href="http://www.drdobbs.com/184410856;jsessionid=SVWNPGOEDVC3LQE1GHOSKH4ATMY32JVN?pgno=1"&gt;design-by-interface&lt;/a&gt;, I can see the attraction.  While the effort is guided by an actual need for modularity and real examples of alternative plugins, it can be a great force multiplier.  There is always the risk of modularity for its own sake - a branch of Architecture Astronautics.  Sure symptoms, which I may have suffered from in the past, include the class names &lt;a href="http://discuss.joelonsoftware.com/default.asp?joel.3.219431.12"&gt;FactoryFactory&lt;/a&gt;..., PolicyPolicy, or &lt;anything&gt;[Anything]Broker.&lt;br /&gt;&lt;br /&gt;Another good vibe from Drizzle is the &lt;a href="http://en.wikipedia.org/wiki/Microkernel"&gt;microkernel&lt;/a&gt; concept, although I would say that there's some terminological abuse occurring here!  Perhaps it could more reasonably be said that MySQL has a TeraKernel and Drizzle has a MegaKernel?  In any case the motivations are good.  Decoupling the huge chunks of functionality glued together inside MySQLD is great for long-term software integrity, understanding dependencies, finding (and introducing) bugs, and might make it easier to start adding functionality again.  Replication seems especially ripe ground for alternative plugins.  User authentication is another often requested 'chunk'.  
It will take longer to crystallise interfaces for more deeply embedded areas like the query Optimizer/Executor, but if these interfaces are arising from a real need then that can drive the API design.&lt;br /&gt;&lt;br /&gt;One aspect of storage engine modularity that is not often mentioned is that some MySQL storage engines also moonlight with other products.  The Berkeley database (BDB) is probably the oldest and most promiscuous, embedded in DNS daemons, LDAP servers and all sorts of other places.  Ndb is unusual in that it can be used from separate MySQLD and other NdbApi processes at the same time.  InnoDB has also recently added an &lt;a href="http://www.innodb.com/wp/products/embedded-innodb/"&gt;embedded&lt;/a&gt; variant. This trend will accelerate, especially when some of the distributed NoSQL systems start supporting 'pluggable local storage' APIs.   I imagine that a NoSQL local storage engine API could be somewhat simpler to implement than the MySQL SE API, at least to start with!&lt;br /&gt;&lt;/anything&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-6330347659477238978?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/6330347659477238978/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=6330347659477238978' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/6330347659477238978'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/6330347659477238978'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2010/03/acid-tradeoffs-modularity-plugins.html' title='ACID tradeoffs, 
modularity, plugins, Drizzle'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-7301010401806054876</id><published>2009-09-28T22:43:00.004+01:00</published><updated>2009-09-28T23:29:09.009+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='cluster'/><category scheme='http://www.blogger.com/atom/ns#' term='message-passing'/><category scheme='http://www.blogger.com/atom/ns#' term='parallel'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>Ndb software architecture</title><content type='html'>I'm sure that someone else can describe the actual history of Ndb development much better, but here's my limited and vague understanding.&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Ndb is developed in an environment (Ericsson &lt;a href="http://en.wikipedia.org/wiki/AXE_telephone_exchange"&gt;AXE&lt;/a&gt; telecoms switch) where Ericsson's &lt;a href="http://en.wikipedia.org/wiki/PLEX_%28programming_language%29"&gt;PLEX&lt;/a&gt; is the language of choice&lt;/span&gt;&lt;br /&gt;PLEX supports multiple state machines (known as blocks) sending messages (known as signals) between them with some system-level conventions for starting up, restart and message classes. Blocks maintain internal state and define signal handling routines for different signal types. Very little abstraction within a block beyond subroutines is supported. (I'd love to hear some more detail on PLEX and how it has evolved). 
This architecture maps directly to the AXE processor design (APZ), which is unusual in having signal buffers implemented directly in silicon rather than software. This hard-coding drove Ndb's initial maximum supported signal size of 25 x 32-bit words.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;An emulated PLEX environment (VM) is made available on Unix systems, written in C++&lt;/span&gt;&lt;br /&gt;The VM runs as a Unix process.  PLEX code for blocks is interpreted. Signals are routed between blocks by the VM. This allows development and deployment of PLEX-based systems on standard Unix systems. It also allows PLEX-based systems to easily interact with Unix software. Each VM instance is a single-threaded process routing incoming signals to the signal handling functions in each block class.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;A PLEX to C++ translation system is designed&lt;/span&gt;&lt;br /&gt;Blocks are mapped to large C++ classes with signal handling methods and per-block global state mapped to member variables. The limited labelling and abstraction encoded in the PLEX source are mapped to C style code within C++ classes.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;The VM environment is 'branched' from the original PLEX/AXE environment and starts to evolve independently as a base for Ndb.&lt;/span&gt;&lt;br /&gt;It offers access to more OS services such as communication, disk IO etc. PLEX interpretation functionality is removed as all relevant PLEX code has been mapped to native C++.  
VM instances can communicate with each other over various channels and form a distributed system.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul style="font-style: italic;"&gt;&lt;li&gt;At some point in the timeline around here the Ndb team and product leave Ericsson&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Over time, common block functionality is abstracted into base and utility classes&lt;/span&gt;.&lt;br /&gt;Hardware and system-convention sourced constraints are eased and the level of abstraction is raised.  New blocks are designed and implemented without a PLEX heritage, making use of C++ abstraction facilities.  Existing blocks are refactored.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Multi-threaded Ndbd (ndbmtd) is introduced, with groups of block instances running on different threads&lt;/span&gt;.&lt;br /&gt;Rather than being a radical design, it's a move back towards the original PLEX design point of 1 block instance per processor.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Today, Ndb executes a model of blocks communicating via signals. Signals are no longer limited to 25 words. In single-threaded Ndb (ndbd), all blocks share a single thread, with separate threads used for inter-VM communication setup and disk IO.  In multi-threaded Ndb (ndbmtd), block instances are grouped, and different functional groups share threads. 
In all cases, each block instance remains single-threaded, although the thread may be shared with other blocks.&lt;br /&gt;&lt;br /&gt;The blocks and signals model is reminiscent of &lt;a href="http://en.wikipedia.org/wiki/Erlang_%28programming_language%29"&gt;Erlang&lt;/a&gt; and Hoare's &lt;a href="http://en.wikipedia.org/wiki/Communicating_sequential_processes"&gt;CSP&lt;/a&gt; – where concurrency is modelled as serial (or sequential) processes communicating with explicit messages, as opposed to a shared-memory model where communication occurs via memory with correctness controlled by locks, memory barriers and atomic instructions.  It can also be considered similar to &lt;a href="http://en.wikipedia.org/wiki/Message_Passing_Interface"&gt;MPI&lt;/a&gt; and the &lt;a href="http://en.wikipedia.org/wiki/Active_object"&gt;Active object&lt;/a&gt; / &lt;a href="http://en.wikipedia.org/wiki/Actor_model"&gt;Actor&lt;/a&gt; model.&lt;br /&gt;&lt;br /&gt;Using explicit messaging for synchronisation/communication has costs – at runtime a given algorithm may require more data copying.  At design time, potential concurrency must be explicitly designed-in with messaging and state changes.  Mapping sequential algorithms to message passing state machines may require a bigger code transformation than mapping to a naive multithread safe shared-memory and locks implementation.&lt;br /&gt;&lt;br /&gt;However I believe that these costs are generally paid off by the benefit of improved code clarity. Inter state-machine synchronisation becomes clearly visible, making synchronisation costs easier to visualise and understand. With explicit messaging as the main mechanism for inter-thread and inter-process communication, there is only a small &lt;span style="font-style: italic;"&gt;kernel&lt;/span&gt; of multithreaded code to be implemented, proved correct and optimised. The bulk of the code can be implemented in single threaded style. 
There is no need for diverse libraries of multithread-optimised data structures. Processor and system architecture specific code and tradeoffs are minimised.&lt;br /&gt;&lt;br /&gt;Internally, Ndb's VM supports only asynchronous messages between blocks. Using an asynchronous message passing style has many benefits. As the sending thread does not block awaiting a response to a message sent, it can work on other jobs, perhaps including the message just sent. This allows it to make the best use of warm instruction and data caches, reduces voluntary context switches and can reduce the likelihood of deadlock. Blocking IO (network, disk) is outsourced to a pool of threads. The signal processing thread(s) never block, except when no signals are available to process. The responsiveness of the system can be ensured by using prioritised job queues to determine the job to execute next and minimising the time spent processing individual jobs. From a formal point of view the number of possible multithreaded interactions is vastly reduced as thread-interleaving is only significant at signal processing boundaries. These limitations can make it easier to reason about the correctness and timing properties of the system.&lt;br /&gt;&lt;br /&gt;However, coding in this asynchronous, event-driven style can be demanding. Any blocking operations (disk access, blocking communications, requests to other threads or processes etc.) must be implemented as an asynchronous request and response pair. This style can have an abstraction-dissolving property as many published data structures and algorithms are implemented assuming a synchronous model and making much use of the caller's stack for state storage and managing control flow. It can be difficult to design abstractions for the asynchronous style which don't leak so much messy detail as to be pointless. 
Additionally, the asynchronous style tends to flatten a system – as the need to return control to the lowest-level call point whenever concurrency is possible acts as a force against deep layers of abstraction. Side effects of this can include a tendency for error handling code to be non-localised to the source of the error. However, that is part of the charm of working on the system. The C++ environment gives a wide set of tools for designing such abstractions, and each improvement made simplifies future work.&lt;br /&gt;&lt;br /&gt;Comments, corrections?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-7301010401806054876?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/7301010401806054876/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=7301010401806054876' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/7301010401806054876'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/7301010401806054876'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2009/09/ndb-software-architecture.html' title='Ndb software architecture'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-7082026650680258056</id><published>2009-09-10T16:15:00.013+01:00</published><updated>2009-09-10T18:17:26.152+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='cluster'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>MySQL Cluster development</title><content type='html'>&lt;a href="http://www.mysql.com/products/database/cluster/"&gt;MySQL Cluster&lt;/a&gt; is the name given to one or more MySQL Server processes, connected to an Ndb Cluster database.  From the point of view of the MySQL Server processes, the Ndb Cluster is a &lt;a href="http://dev.mysql.com/tech-resources/articles/storage-engine/part_1.html"&gt;Storage Engine&lt;/a&gt;, implementing transactional storage of tables containing rows.  From the point of view of the Ndb Cluster database, the MySQL Server processes are API nodes, performing DDL and DML transactions on tables stored in the cluster.  Both exist independently – Ndb Cluster can be used without attached MySQL Server processes, but almost all users of Ndb Cluster connect at least one MySQL Server for DDL and administration.&lt;br /&gt;&lt;br /&gt;Ndb stands for Network DataBase.  This is a telecoms phrase where &lt;span style="font-style: italic;"&gt;Network&lt;/span&gt; usually refers to a fixed or wireless telephone network, rather than the database &lt;a href="http://en.wikipedia.org/wiki/Network_database"&gt;topology&lt;/a&gt; definition of the term.  Ndb was originally designed as a platform for implementing databases required to operate telecoms networks - HLR, VLR, Number Portability, Fraud Detection etc.  
At the time Ndb was first designed, Network Databases were generally implemented in-house on exotic 'switch' hardware by telecoms equipment vendors, often with hard-coded schemas and very inflexible query capabilities.  These databases were expensive to develop and maintain, but had superb reliability and exceptional performance on minimal spec. hardware.  The aim of the original Ndb design was to couple these desirable properties with more general purpose database functionality and deliver the result on a more standard hardware and OS stack.&lt;br /&gt;&lt;br /&gt;I first discovered Ndb Cluster around 2001, when looking at potential designs for the next generation of an existing HLR database.  I read the paper by Mikael Ronström in &lt;a href="http://www.ericsson.com/ericsson/corpinfo/publications/review/"&gt;Ericsson Review&lt;/a&gt; (No 4, 1997), which gives a good overview of the Ndb functionality.  This paper describes functionality in the present tense when in fact some of the features described are yet to be implemented in 2009!   This sort of optimism and vision has helped Ndb to survive and thrive over the years.    The Ericsson Review paper was written while Ndb was one of multiple telecoms-database projects at Ericsson.   
Since then the Ndb product and team were spun out as a separate company, before being sold to MySQL AB in 2003 as a result of the dot com affair.&lt;br /&gt;&lt;br /&gt;Ndb was originally designed for:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;High throughput&lt;/span&gt; – sustaining tens to hundreds of thousands of transactions per second&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Low latency&lt;/span&gt; – bounded transaction latencies which can be reliably factored into end-to-end latency budgets, implying main-memory storage&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;High update to read ratio&lt;/span&gt; – 50/50 as the norm&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Transactional properties&lt;/span&gt; : Atomicity, Consistency, Isolation, Durability&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Fault tolerance + HA &lt;/span&gt;– No single point of failure, automatic failover and recovery with minimal user or application involvement.   Online upgrade.  N-way synchronous and asynchronous replication.  Fail-fast fault isolation.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Persistence&lt;/span&gt; – disk checkpointing  and logging with automated recovery&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Scalability&lt;/span&gt; – Parallel query execution.  Distributed system can utilise &gt; 1 system's resources.  Capacity can be expanded horizontally with extra systems.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;In the original Ndb design, high volume low latency transactions are submitted directly to the cluster using simple access primitives on the client.  More complex queries are submitted to a separate query processor which itself uses combinations of the simpler primitives to access the cluster.  
An early example of a higher-level query processor was created by Martin Sköld who extended an Object Oriented query processor to create '&lt;a href="http://www.ep.liu.se/ea/cis/1999/001/cis99001.pdf"&gt;QDB&lt;/a&gt;' which could perform queries against data stored in Ndb. Numerous high level front-end processors have been implemented since.&lt;br /&gt;&lt;br /&gt;Using MySQLD as a higher-level query processing front end we come to the architecture of MySQL Cluster, with MySQLD providing SQL based access to data stored in the cluster.  In this sense MySQLD and Ndb cluster are a perfect fit and were designed for each other before they first met!   Despite MySQLD being the default and most prominent front end to Ndb cluster, a number of others exist including several open and closed-source LDAP servers (&lt;a href="http://www.symas.com/openldap-mysql.shtml"&gt;OpenLDAP&lt;/a&gt;, &lt;a href="https://www.opends.org/wiki/page/EnableNDBBackend"&gt;OpenDS&lt;/a&gt;), several Java APIs and an Apache module giving HTTP access to data stored in Ndb.&lt;br /&gt;&lt;br /&gt;The separation of &lt;span style="font-style: italic;"&gt;low level, simple, fast&lt;/span&gt; access and &lt;span style="font-style: italic;"&gt;higher level, more flexible&lt;/span&gt; access allows MySQL Cluster to offer many benefits of a full RDBMS without always incurring the drawback of over-generality.  This fits well with many large transaction processing systems, where most heavy transaction processing does not require the full flexibility of the RDBMS, but some less frequent analysis does.  
Separating the central database engine (which in Ndb is referred to as the &lt;span style="font-style: italic;"&gt;kernel &lt;/span&gt;) from the query processing layer can also help with workload management – even the most complex queries are subdivided into manageable components and resources can be shared fairly.&lt;br /&gt;&lt;br /&gt;The original Ndb design was &lt;b&gt;not&lt;/b&gt; aimed at : &lt;ul&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Disk resident storage&lt;/span&gt;&lt;br /&gt;Where data  larger-than-aggregate-system-memory-capacity can be stored on disk.   This functionality was later added in the &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-disk-data.html"&gt;MySQL 5.1&lt;/a&gt; timeframe&lt;br /&gt; &lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Complex query processing&lt;/span&gt;&lt;br /&gt;Where  multiple tables are joined.  This was always &lt;b&gt;possible&lt;/b&gt;, but not  always &lt;b&gt;efficient&lt;/b&gt;.  Improving the efficiency of MySQL and Ndb  on complex query processing is ongoing work - as it is in all  actively developed RDBMS, for some definition of complex :).  &lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;Storing large rows&lt;/span&gt;&lt;br /&gt;Ndb  currently has a per-row size limit of around 8kB, ignoring Blob and  Text column types.  &lt;/li&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;One size fits all&lt;br /&gt;&lt;/span&gt;Being a drop-in replacement for an  existing MySQL engine such as MyISAM or InnoDB&lt;br /&gt;Many initial users  were not aware of the history of Ndb, and expected it to be (MySQL +  InnoDB/MyISAM) + 'Clustering'.  Issuing 'ALTER TABLE xxx  ENGINE=ndbcluster;' appeared to be all that was required to gain  fault tolerance, but the performance of queries on the resulting  tables was not always as expected! 
&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Since the initial integration of Ndb Cluster with MySQLD in 2003+, there have been many improvements to bring Ndb closer in behaviour to the most popular MySQL engines, and to optimise MySQLD for Ndb's strengths, including : &lt;ul&gt;&lt;li&gt;Support for Autoincrement and primary key-less  tables  &lt;/li&gt;&lt;li&gt;Synchronisation of schemas across  connected MySQLD instances  &lt;/li&gt;&lt;li&gt;Support for MySQL character sets and  collations&lt;/li&gt;&lt;li&gt;Storage and retrieval of Blob and  Text columns  &lt;/li&gt;&lt;li&gt;Support for pushed-down filter  conditions  &lt;/li&gt;&lt;li&gt;Support for batching of operations  &lt;/li&gt;&lt;li&gt;Integration with MySQL  asynchronous replication  &lt;/li&gt;&lt;li&gt;'Distribution awareness' in MySQLD  for efficiency &lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;These improvements have required work in the Ndb &lt;span style="font-style: italic;"&gt;table handler&lt;/span&gt; - the code which maps MySQL storage engine API calls from the generic SQL layer to the underlying storage engine.  Some improvements have also required enhancements in the storage engine API and Server, for example a new API to expose conditions (WHERE or HAVING clause predicates) to the storage engine, enabling it to perform more efficient filtering.  These changes add complexity to MySQLD and the storage engine API, but as they are implemented generically, they can be reused by other engines.  The pushed conditions API is now being used by the &lt;a href="http://spiderformysql.com/"&gt;Spider&lt;/a&gt; engine for similar reasons to Ndb – e.g. to push filtering functionality as close to the data as possible.  The &lt;a href="http://forge.mysql.com/wiki/Batched_Key_Access"&gt;Batched Key Access&lt;/a&gt; (BKA) improvements made to the MySQLD join executor benefit Ndb, but also benefit MyISAM and InnoDB to a lesser extent.  
This &lt;span style="font-style: italic;"&gt;Functionality push-down&lt;/span&gt; pattern – increasing the granularity and complexity of work items which can be passed to the storage engine – will continue and benefit all storage engines.&lt;br /&gt;&lt;br /&gt;The next large step to be taken by the MySQL Server team in this direction is referred to as &lt;a href="http://forge.mysql.com/worklog/task.php?id=4292"&gt;Query Fragment Pushdown&lt;/a&gt;, where MySQLD can pass parts of queries to a storage engine for execution.  Storage engines which support SQL natively could perhaps use their own implementation-aware optimisation and execution engines to efficiently evaluate query fragments.  For Ndb, we are designing composite primitives at the NdbApi level for evaluating query fragments more efficiently - in parallel and closer to the data.  This work will increase the number of query types that Ndb can handle efficiently, increasing the number of applications where Ndb is a good fit.&lt;br /&gt;&lt;br /&gt;For an in-depth description of the original Ndb requirements, design approach and some specific design solutions, Mikael's PhD &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.884"&gt;thesis&lt;/a&gt; is the place to go.  This is probably the best source of information on the &lt;span style="font-style: italic;"&gt;design philosophy&lt;/span&gt; of Ndb Cluster.  
However as it is a frozen document it does not reflect the current state of the system, and as it is an academic paper, it does not describe the lower level, more software engineering oriented aspects of the system implementation.&lt;br /&gt;&lt;br /&gt;I hope to cover some of these aspects in a future post.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-7082026650680258056?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/7082026650680258056/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=7082026650680258056' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/7082026650680258056'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/7082026650680258056'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2009/09/mysql-cluster-development.html' title='MySQL Cluster development'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-5743008579005395157</id><published>2009-04-28T00:51:00.005+01:00</published><updated>2009-04-29T00:58:19.460+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='latency-hiding'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='parallel'/><category 
scheme='http://www.blogger.com/atom/ns#' term='cpu-design'/><title type='text'>Latency hiding patterns in CPUs</title><content type='html'>Latency is a major factor in CPU design.  Most general purpose CPUs execute a sequence of instructions with the logical convention that one instruction completes before the next starts.  However, memory access latency and even the latency of pure computation bottleneck the achievable processing throughput.  To circumvent this, a number of techniques are used to parallelise computation and communication.&lt;br /&gt;&lt;br /&gt;Conductors have capacitance and resistance and therefore take time to switch between voltage levels.   This manifests as a propagation delay proportional to the length of the conductor.   As CPU clock speeds increase, the time between each rising clock edge reduces and the maximum length of conductor that can be charged in that time shrinks.   CPUs and other data-path designs must be designed to ensure that no signal propagation path is near the tolerance for propagation delay at the designed maximum clock speed.&lt;br /&gt;&lt;br /&gt;In practice this means that there is a trade-off between clock frequency and circuit size.   Large circuits can accomplish more per clock cycle, but cannot be clocked as high as smaller circuits.  To continue increasing clock rates with a fixed feature size, a circuit must be broken up into smaller and smaller sub-circuits with synchronous boundaries.  Even this is not enough in some cases as the clock signals themselves are &lt;span style="font-style: italic;"&gt;skewed&lt;/span&gt; by the propagation latencies across the chip.  In some cases this is solved by having &lt;span style="font-style: italic;"&gt;islands &lt;/span&gt;of synchronous logic which communicate asynchronously, as no synchronous clock can span the whole circuit. &lt;br /&gt;&lt;br /&gt;So even within a chip, there can be a large and growing difference between local and remote communication latency.  
Ignoring this simplifies designs but forces lowest-common-denominator performance.  This trade-off between complexity and latency awareness and tolerance is repeated at many layers of the system stack.&lt;br /&gt;&lt;br /&gt;Even within a synchronous logic &lt;span style="font-style: italic;"&gt;island&lt;/span&gt; on a chip, there is latency to be hidden.  Instruction fetch, decode and execute take time, and &lt;span style="font-style: italic;"&gt;pipelining &lt;/span&gt;&lt;span&gt;is used &lt;/span&gt;in CPU designs to maximise throughput despite irreducible latency.   The various steps of instruction decode, register selection, ALU activation etc. are split out with latches between them so that the maximum propagation delay in any stage of the pipeline can be minimised.  For example, a datapath with a 400-gate worst-case propagation delay could theoretically be split into 4 stages, each with a 100-gate worst-case propagation delay, allowing a 4 times faster clock speed.&lt;br /&gt;&lt;br /&gt;With pipelining in a CPU, the first instruction to be processed takes at least as long to complete as it would without pipelining, but after that, a new instruction can be completed on every clock cycle - theoretically allowing 4 times as many instructions to be completed per unit time.  In practice, memory latency and bandwidth limits and dependencies between instructions mean that the pipeline can stall, or bubbles can form, reducing the achieved throughput.   
However, these problems can often be alleviated, and pipelines have been very successful in CPU design, with instruction fetch-execute-retire pipelines comprising as many as 40 or more stages and with multiple instances of stages in &lt;span style="font-style: italic;"&gt;super-scalar &lt;/span&gt;designs.&lt;br /&gt;&lt;br /&gt;So pipelining in general allows repetitive serial task processing to be parallelised with potential benefits :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Increased parallelism between instances of serial tasks&lt;br /&gt;    &lt;span style="font-style: italic;"&gt;Allowing greater throughput&lt;/span&gt;&lt;br /&gt;  &lt;/li&gt;&lt;li&gt;Benefits of task-specificity (instruction and data caching benefits)&lt;br /&gt;    &lt;span style="font-style: italic;"&gt;Potentially improving efficiency&lt;/span&gt;&lt;br /&gt;  &lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;and the following caveats :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Dependencies between tasks must be understood and honoured&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Sometimes this reduces or eliminates parallelism&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Pipeline stalls or bubbles due to job starvation or dependencies will reduce throughput&lt;br /&gt;&lt;span style="font-style: italic;"&gt;The pipeline must be kept fed.&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Achievable throughput is bottlenecked by the slowest stage&lt;br /&gt;&lt;span style="font-style: italic;"&gt;As with a production line, the slowest worker sets the pace.&lt;br /&gt;As with a production line, slow workers can be duplicated to improve throughput.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Individual task latency can increase&lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Pipelining as a general technique can apply to any system that processes tasks of a regular form which can be split into stages.   
The &lt;a href="http://www.eecs.harvard.edu/%7Emdw/proj/seda/"&gt;SEDA&lt;/a&gt; system from Harvard is a general purpose server framework for implementing server processes as pipelined stages.  The software setting allows more flexible and dynamic tradeoffs to be made between pipeline lengths and widths.  It also offers a flexible way to separate asynchronous and synchronous steps.&lt;br /&gt;&lt;br /&gt;Other interesting latency hiding techniques seen in CPUs are mostly associated with hiding memory access latency.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-5743008579005395157?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/5743008579005395157/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=5743008579005395157' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/5743008579005395157'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/5743008579005395157'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2009/04/latency-hiding-patterns-in-cpus.html' title='Latency hiding patterns in CPUs'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-7652641909599211661</id><published>2009-04-28T00:33:00.003+01:00</published><updated>2009-04-28T00:50:12.586+01:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='latency-hiding'/><category scheme='http://www.blogger.com/atom/ns#' term='message-passing'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed-systems'/><title type='text'>Latency hiding patterns</title><content type='html'>I want to write an entry about latency hiding patterns.  Unfortunately, my previous attempts became too long and boring even for me.  This time I'm going to try something smaller and less rich to get the blog-flow more regular.&lt;br /&gt;&lt;br /&gt;I'm interested in latency hiding patterns as I repeatedly see them being implemented at all levels of systems from the silicon to the top of the application stack.  Often latency hiding techniques are what deliver better-than-Moore's-law performance improvements, and they can have a great impact on usability as well as throughput and system efficiency.&lt;br /&gt;&lt;br /&gt;I would describe latency hiding techniques as things that :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Maximise the value of communication&lt;/li&gt;&lt;li&gt;Maximise the concurrency of communication and computation&lt;/li&gt;&lt;/ul&gt;Communication value is maximised by avoiding communication, and when it cannot be avoided, minimising the overheads.&lt;br /&gt;Communication and computation concurrency is maximised by maximising the independence of communication and computation.  This requires understanding the dependencies between computation and communication.&lt;br /&gt;&lt;br /&gt;Right, all very abstract.  
Let's hope some examples are more interesting.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-7652641909599211661?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/7652641909599211661/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=7652641909599211661' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/7652641909599211661'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/7652641909599211661'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2009/04/latency-hiding-patterns.html' title='Latency hiding patterns'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-4632656885231916531</id><published>2009-04-16T23:13:00.003+01:00</published><updated>2009-04-17T00:36:03.144+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='xacore'/><category scheme='http://www.blogger.com/atom/ns#' term='nortel'/><category scheme='http://www.blogger.com/atom/ns#' term='dms'/><category scheme='http://www.blogger.com/atom/ns#' term='protel'/><title type='text'>Protel II</title><content type='html'>In the &lt;a href="http://messagepassing.blogspot.com/2009/04/protel-i.html"&gt;first&lt;/a&gt; post I described the basic Protel language.  
In the mid 1990s it was extended with object oriented capabilities.  This was done in a number of phases and was tied into the development of a project at Nortel called Generic Services Framework (GSF).  This was an object oriented reimplementation of 'call processing' on the DMS with wide scope and a huge development team.  Nortel even created an 'Object Center' somewhere in Canada with a helpline that confused designers could call to get OO Protel advice.  I suspect that today it would be a 'Center of Object Excellence'.  I heard that the GSF project was not an unqualified success, but it did drive the evolution of Protel-2 which was used later in other products.  In retrospect it seems that GSF and Protel-2 were more motivated by the Objects-with-everything zeitgeist than any particularly compelling benefits.&lt;br /&gt;&lt;br /&gt;Protel-2 supports single inheritance with a common root class called $OBJECT.  Methods can be explicitly declared to be overridable in base classes, similarly to C++'s virtual keyword.  It does not support operator overloading, or overloading method names with different signatures.  It supports fully abstract classes.&lt;br /&gt;Like &lt;a href="http://en.wikipedia.org/wiki/Eiffel_(programming_language)"&gt;Eiffel&lt;/a&gt;, Protel-2 allows parameterless methods which return a value to be called without parentheses.  This allows data members to be refactored to be read directly or via an accessor function method without changing the callers.  I'm not sure how valuable this is in practice, especially as it does not affect assignment.  Perhaps some Eiffel practitioners have experience of finding this useful?  Protel-2 methods can be declared to be read-only with respect to the object instances they operate on, in a similar way to specifying const-ness in C++.&lt;br /&gt;&lt;br /&gt;The GENERIC keyword in Protel-2 allows class definitions to be parameterised by type.  
This allows the creation of type-safe generic collection classes and data structures.  The type parameterisation is similar to the Generics mechanism in Java, in that it is effectively a compile-time-only mechanism.  The compiler generates a single underlying class implementation and checks type-correctness at compile time.  A side-effect is that all access to the parameterised class must be made via a pointer.  This fits well with the requirement for online load-replacing of implementations etc., but it offers much reduced power compared to C++'s code-generation style templating mechanisms.&lt;br /&gt;&lt;br /&gt;As with most C++ implementations, Protel-2 objects contain a vtbl pointer in the first word of their data followed by data members of superclasses and the class instance.  On XA-Core, this sometimes presented a problem in that the normal transactional memory ownership mechanism could create a bottleneck when used for the vtbl pointer, but often data in the rest of the class should use the transactional memory mechanism.  To deal with this, special libraries were created to allow the object header to be stored in WRITEBLOCKING memory and the rest of the object to be stored in BLOCKING memory.&lt;br /&gt;&lt;br /&gt;That's scraping the bottom of the barrel on Protel-2 information in my head.  I thought there was more there (or maybe just something more interesting).  
At least I've written it down.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-4632656885231916531?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/4632656885231916531/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=4632656885231916531' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/4632656885231916531'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/4632656885231916531'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2009/04/protel-ii.html' title='Protel II'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-229800497367959909</id><published>2009-04-12T23:11:00.005+01:00</published><updated>2009-04-17T00:44:17.083+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='nortel'/><category scheme='http://www.blogger.com/atom/ns#' term='sos'/><category scheme='http://www.blogger.com/atom/ns#' term='dms'/><category scheme='http://www.blogger.com/atom/ns#' term='protel'/><title type='text'>Protel I</title><content type='html'>Protel is the PRocedure Oriented Type Enforcing Language used for most DMS software.  Currently there's not much information available about it online.  
An early paper describing the language is referenced &lt;a href="http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?tp=&amp;amp;arnumber=762490&amp;amp;isnumber=16515"&gt;here&lt;/a&gt;, but is hidden behind a subscription-only portal.  Wikipedia offers a very minimal definition of the &lt;a href="http://en.wikipedia.org/wiki/Protel"&gt;term&lt;/a&gt; but little else.  I think there's some good stuff that should be recorded so I'll attempt to describe what I found interesting.&lt;br /&gt;&lt;br /&gt;So what is Protel like?  I'm told that it's similar to &lt;a href="http://en.wikipedia.org/wiki/Modula-2"&gt;Modula-2&lt;/a&gt; and even &lt;a href="http://en.wikipedia.org/wiki/Modula_3"&gt;Modula-3&lt;/a&gt; and it's true that it shares explicit BEGIN / END block syntax with &lt;a href="http://en.wikipedia.org/wiki/Pascal_%28programming_language%29"&gt;Pascal&lt;/a&gt;, and all code is divided into modules.&lt;br /&gt;One of the most basic differences between Protel and these languages is its use of the composite symbol '-&gt;' or 'Gozinta' (Goes into) for assignment.  This eliminates any confusion between assignment and equality testing.  This 'removal of ambiguity' is a key pattern in the design of Protel.  Similar manifestations of the pattern are the rules :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;No operator precedence rules&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Unlike C, Protel assigns no built-in relative precedence to various operators.  All expressions are evaluated left-to-right, and the programmer must use brackets &lt;span style="font-weight: bold;"&gt;explicitly&lt;/span&gt; to specify non l2r evaluation.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;No&lt;span style="font-style: italic;"&gt; &lt;/span&gt;preprocessor&lt;br /&gt;&lt;span style="font-style: italic;"&gt;There is no standard preprocessor and therefore no macro language.  The compiler supports some limited compile time expression evaluation including sizeof() for types etc.  
This avoids context specific semantics for source code and non-visible code expansion&lt;/span&gt;&lt;/li&gt;&lt;li&gt;No support for pointer arithmetic&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Where a pointer is to be treated as referring to some element of an array, the descriptor mechanism, which includes bounds checking, is to be used rather than pointer arithmetic.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Control structure specific end-of-block keywords&lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Rather than a single END keyword, Protel employs ENDBLOCK, ENDPROC, ENDWHILE, etc. to aid code readability.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;No 'keystroke reduction mechanisms'&lt;br /&gt;&lt;span style="font-style: italic;"&gt;C's ++, += etc.&lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;These rules can make life harder when writing code and increase verbosity, but can aid readability and reduce the amount of non-local knowledge needed to understand code.&lt;br /&gt;&lt;br /&gt;Protel modules contain one or more source code files which can export definitions for use by other modules.  Different source files within a module can be arranged into trees which control compile order within a module.  Modules often have multiple levels of interface source files - most public and general APIs in the top level, more private and specific APIs in the lower levels.  Access to definitions in each interface file can be controlled independently if required.&lt;br /&gt;&lt;br /&gt;Basic Protel supports built-in and user defined types, pointers, arrays, an 'array slice' descriptor mechanism, and a novel extensible fixed-size type called an Area as well as some type-reflection capabilities.&lt;br /&gt;&lt;br /&gt;A Protel DESCriptor is used to refer to a range of elements in an array of &amp;lt;type&amp;gt;.  
It is used in the same way as an array - with a subscript as an lvalue or rvalue to an expression (though the usual meaning of those terms is confused by the Gozinta operator!).  The compiler is aware of the &amp;lt;type&amp;gt; of the slice being DESCribed, and by inference the size of the elements.  In the storage allocated to the DESC itself, it stores a pointer to the zeroth element and an upper bound in terms of elements.  In this way it can provide bounds checking on accesses through the DESC.  When an out-of-bounds exception is hit, the actual upper bound and the supplied subscript are available in the exception report, often allowing debugging straight from the trace.  The array slice abstraction can be a nice way to deal with zero-copy in a protocol stack.&lt;br /&gt;&lt;br /&gt;Protel offers the BIND keyword which can be used to define a local-scope alias to some variable instance.  Its use is encouraged to reduce keystrokes, and it is also useful for indicating to the reader and the compiler that some dereferencing operations need only be performed once even though the referenced value is used multiple times.  A side effect of its use is that it reduces the tension between short, easily typed and long, descriptive variable names, allowing long descriptive names to be shortened in use when necessary.  Of course this can add to ambiguity and confusion.&lt;br /&gt;&lt;br /&gt;Protel supports typed procedure pointers with an explicit type classifier - PROCVAR.  I suspect that this type classifier exists to improve code readability rather than for any syntactic necessity as PTR to PROC can be used similarly.  PROCVARs are used heavily in SOS code to allow applications to override behaviours and extend OS behaviour.  SOS has unique terms for the use of procedure pointers :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;GATE&lt;br /&gt;&lt;span style="font-style: italic;"&gt;SOS name for a Procedure Variable that is expected to be set by some other module.  
It is a 'gate' to some other implementation.  Usually gates are defined in lower-level modules.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;TARGET&lt;br /&gt;&lt;span style="font-style: italic;"&gt;SOS name for a procedure implementation referenced from a GATE.  This is the 'target' of a call to a 'gate'.  Usually targets are defined in higher-level modules.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;ASPECT&lt;br /&gt;&lt;span style="font-style: italic;"&gt;SOS name for a structure containing a number of procedure variables.  This can be thought of as an 'interface' in the Java sense - a set of method signatures.  Often a lower-level module would provide an API for a higher-level module to add some functionality including some data and perhaps an 'aspect' of procedure variables.&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;As well as supporting standard composite structures similar to a C struct, Protel supports the concept of an Area which can be used to implement a form of type-inheritance.  An AREA is similar to a struct, but contains as part of its definition a storage size in bits, as well as zero or more members.  At compile time, it is checked that the declared members' types fit within the size given, and instances will be created with the size given.  Other modules can declare further areas which REFINE this area, and add definitions to the original AREA.  The compiler will check that the complete set of member definitions continues to fit in the bit size of the original AREA.  This mechanism can be used to create trees of hierarchically related data types, which is very useful for code modularity and extensibility as well as more basic optionality similar to a C union.  Putting procedure pointers into the Area gives a rather rough extensible virtual-method mechanism in-language.  
However, most DMS software designers used AREA refinements for hierarchically varying data rather than allowing control flow overrides.&lt;br /&gt;&lt;br /&gt;Protel offers some reflection capability via the TYPEDESC operator.  It is applied to a type and returns a structure which can be used to determine type names, bit offsets etc.  SOS supports an online extensible data dictionary which uses TYPEDESC to track types and their relationships.  It is version aware and is used by Table Control and for data reformatting between software releases.&lt;br /&gt;&lt;br /&gt;Combining PROCVARs and REFINEd AREAs allowed extensible systems to be built fairly easily without OO techniques built into the language.  However, the explicit nature of the PROCVARs, the requirement to define up-front bitsizes for refinable areas and the general micromanagement required to define, initialise and use 'object' hierarchies made from these components discouraged most designers from using them in this way.  Providing tools at this atomic level encouraged each designer to try their own combination of hard coding, ProcVars, extensible areas, pointers-to-extension-structs, pointers-to-data-structs-with-procvars etc. More manual visualisation effort was required to grasp these mechanisms than would be required for an equivalent language with OO extensions.&lt;br /&gt;&lt;br /&gt;I think PROTEL was fairly state-of-the-art when it was introduced for DMS software.  Especially considering its planned use for telecoms switching equipment, it is a very general purpose language, not visibly oriented towards telecoms.  It has a fairly clean split between language features and runtime libraries.  Perhaps if it had been more widely known of beyond the confines of Nortel/BNR then it could have enjoyed some life of its own?  
I believe it is still being actively used - these days SOS images run in virtualised environments with code written by outsourced employees paid by a broken company, but I imagine there must still be patches getting written.  However the outlook looks bleak.  With Nortel on the rocks and apparently no interesting information about Protel available on the web (except this blog of course :) ), it looks like it could vanish after 30 years.&lt;br /&gt;&lt;br /&gt;In the mid to late nineties, Nortel added object orientation to Protel, I'll talk briefly about that in a &lt;a href="http://messagepassing.blogspot.com/2009/04/protel-ii.html"&gt;future&lt;/a&gt; post.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-229800497367959909?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/229800497367959909/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=229800497367959909' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/229800497367959909'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/229800497367959909'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2009/04/protel-i.html' title='Protel I'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-5994422688653973890</id><published>2009-03-28T23:43:00.003Z</published><updated>2009-03-29T02:04:30.474+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='pls'/><category scheme='http://www.blogger.com/atom/ns#' term='nortel'/><category scheme='http://www.blogger.com/atom/ns#' term='sos'/><category scheme='http://www.blogger.com/atom/ns#' term='protel'/><title type='text'>What is SOS? Part III</title><content type='html'>&lt;a href="http://messagepassing.blogspot.com/2008/12/what-is-sos.html"&gt;Part 1&lt;/a&gt;  &lt;a href="http://messagepassing.blogspot.com/2009/01/what-is-sos-part-ii.html"&gt;Part 2&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Online code patching and extension&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A SOS system is comprised of modules, which contain code and various data segments and are vaguely similar to shared libraries or DLLs.  Each function in a module has a function pointer stored at some known offset in a code header segment.  The actual machine code for the function is stored in a different segment of the module.  This indirection requires that all procedure calls involve a pointer dereference, but gives the flexibility to change the implementation of any procedure at any time.  The code and data segments in a module also include limited 'spare' space, so that a number of extra global data variables and functions can be added to a module online.  This, coupled with the ability to load completely new modules with arbitrary code and data, makes a SOS system completely runtime-patchable, with all behaviours modifiable online.  Online upgrade of a running module is referred to as load-replacement.  
In development it is used to test and debug code, and in deployment it is used to patch code and to add small new features.&lt;br /&gt;&lt;br /&gt;Run-time code modification is made manageable by the cooperation of the standard source code control system (PLS), the Protel language compiler and linker, and SOS.  At compile and link time, metadata about the header contents and sizes, and the version of source compiled, is included in the module file.  When SOS is asked to load-replace the module, it compares the new module with the existing module and will only allow the load-replace if it can be done safely.  When a module is replaced, SOS updates its module metadata with the new module's version etc.&lt;br /&gt;&lt;br /&gt;SOS also includes a patch management system which tracks the state of applied patches.  Patches use the basic module load-replacement system in a controlled and automated way to :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Load-replace existing modules to add hooks into existing code&lt;/li&gt;&lt;li&gt;Load new modules to contain modified functionality and state storage space&lt;/li&gt;&lt;li&gt;Execute patch application and removal steps&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;SOS tracks inter-patch dependencies, can generally unapply and reapply patches at runtime, and allows different deployments to run different sets of patches.  This is especially useful when patches are used to implement features and functionality specific to a single user.&lt;br /&gt;&lt;br /&gt;Writing SOS patches is quite an art, and whole teams that wrote nothing but patches existed in Nortel's good times.  Often the patch specialists were very technically capable and innovative, being aware of the innards of SOS and able to deal with the extra dimensions of design visualisation required to consider patch application and later removal.  
However, extended exposure to writing patches in the convoluted style required for safe application and removal tends to corrode a designer's sense of elegance and abstraction.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;'Relational' data access system&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;SOS includes a table based data access layer called 'Table control'.  This layer supports interactive and batch access to tables.  Tables include key columns and non-key columns with a flexible type system, including inheritance.  Tables are statically defined with a good deal of flexibility in the implementation of the mapping onto the underlying data source.  Table control supports separate data representations at 'External', 'Logical', 'Data' and 'Physical' layers.  These abstractions give great freedom to decouple the external, user-visible view of the data from the internal, constraint-optimised storage of the data.  Table control was initially designed to give a standard way to store and retrieve DMS configuration information, but over time, in different products, it came to be used to give standardised access to huge databases of mobile subscriber information etc.  The Table Control API was later built on to implement external data provisioning and management systems, and is a large part of the online software upgrade process.&lt;br /&gt;&lt;br /&gt;Despite being table and column oriented, table control is only 'relational' in a limited sense.  There is no SQL-style declarative language for querying data stored in tables, and no standard way to 'join' tables.  However, foreign key constraints can be enforced, and DMS supports a basic scripting language which can be used to write scripts to automate cross-table analysis and maintenance.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;User model&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;SOS supports multiple interactive user sessions, connected via telnet or older technologies.  
Users can have various permissions with respect to commands, table access etc., and all user activity can be logged.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Online software upgrade&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Theoretically, module load-replacement can be used to upgrade software, but in practice it is only used for bug fixes and small features with economic or time-pressure reasons for in-release delivery.  Writing all code to be online replaceable against old code adds an excessive burden to the design and test cycles.&lt;br /&gt;&lt;br /&gt;SOS supports online upgrade of duplex systems via the ONP (One Night(mare) Process).  What happens is :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Hardware sync-match is dropped, splitting the system into two separate systems, one active, the other inactive.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The inactive side is rebooted with the new software&lt;/li&gt;&lt;li&gt;Bulk personality and state data from the active side is transferred across to the inactive side, potentially involving data reformats for changed or extended schemas.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Once all bulk data is transferred, transfer of up-to-date state data starts&lt;/li&gt;&lt;li&gt;Once all components agree that state transfer is reasonably up-to-date&lt;br /&gt;- New activity on the active side is stopped&lt;br /&gt;- All remaining state is transferred&lt;br /&gt;- Inactive side becomes active side (SWitch of ACTivity, or SWACT).  
IO systems are reconnected to the new active side.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Users perform acceptance testing on the new system for a limited period&lt;/li&gt;&lt;li&gt;If users decide to revert :&lt;br /&gt;- Modified state is transferred back to the newly inactive side&lt;br /&gt;- SWACT is performed in reverse&lt;/li&gt;&lt;li&gt;If users decide to continue :&lt;br /&gt;- Newly inactive side is dropped and hardware sync-match is restored.&lt;/li&gt;&lt;/ul&gt;This upgrade mechanism is complex and error-prone, but it offers online upgrade with minimal service outage (of the order of 4 seconds) at the cost of a temporary loss of redundancy.&lt;br /&gt;&lt;br /&gt;Application designers need to consider :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Data reformats&lt;br /&gt;Table control can automate some conversions which map onto type promotions.  More complex conversions can be performed in user-code callbacks.&lt;/li&gt;&lt;li&gt;State transfer&lt;br /&gt;Essential state can be transferred around SWACT using user-code callbacks&lt;/li&gt;&lt;li&gt;Protocol compatibility&lt;br /&gt;Newer software versions must support old protocol versions until all parties can deal with newer versions.&lt;/li&gt;&lt;li&gt;Upgrade-abort implications&lt;br /&gt;If the upgrade is reverted then data and states which only exist in the new version must be avoided or dealt with.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;SOS applications generally support direct upgrade over 3 versions.  In a DMS system comprising a number of smaller SOS based systems, generally the upgrades start at the leaves of the tree of systems (peripherals), and work back towards the Computing Module (CM).  
This implies that each system must be willing to accept old-version protocol interactions from systems higher in the tree than itself, but need not worry about protocol versions for systems lower in the tree (assuming the usual computer-science leaves-at-the-bottom tree layout).  Given that individual system complexity increases as you go 'up' the tree to the root, this is a good arrangement.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Online multithreaded breakpoint capable debugger&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;SOS supports interactive use, and one of the applications available to users is an interactive debugger with breakpoint and tracepoint support.  This tool allows a running system to be inspected, and code and data to be modified on the fly.  Breakpoints and tracepoints can be made data-conditional and can use thread (process) ids to be thread-conditional.  The debugger also has some knowledge of symbols and offsets within modules.&lt;br /&gt;Full debugging access with breakpoints is not usually made available for deployed systems as the risk of accidental damage is too great.&lt;br /&gt;&lt;br /&gt;Well, that's been a quick tour of SOS.  It's an interesting system with very little documentation outside of Nortel.  My own memories of it are fading fast, so please excuse mistakes and the lack of detail here.  
I don't intend to blog about the system in-general any more, but may cover some specific details that are of interest.&lt;br /&gt;&lt;br /&gt;(Well of interest to me, as no-one else seems interested so far :) )&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;span style="font-weight: bold;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-5994422688653973890?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/5994422688653973890/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=5994422688653973890' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/5994422688653973890'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/5994422688653973890'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2009/03/what-is-sos-part-iii.html' title='What is SOS? 
Part III'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-7005314548088129952</id><published>2009-01-06T23:37:00.006Z</published><updated>2009-04-17T00:52:07.015+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='pls'/><category scheme='http://www.blogger.com/atom/ns#' term='nortel'/><category scheme='http://www.blogger.com/atom/ns#' term='sos'/><category scheme='http://www.blogger.com/atom/ns#' term='protel'/><title type='text'>What is SOS? part II</title><content type='html'>&lt;span style="display: block;" id="formatbar_Buttons"&gt;&lt;span class="down" style="display: block;" id="formatbar_CreateLink" title="Link" onmouseover="ButtonHoverOn(this);" onmouseout="ButtonHoverOff(this);" onmouseup="" onmousedown="CheckFormatting(event);FormatbarButton('richeditorframe', this, 8);ButtonMouseDown(this);"&gt;&lt;span&gt;&lt;a href="http://messagepassing.blogspot.com/2008/12/what-is-sos.html"&gt;Part 1&lt;/a&gt;  &lt;a href="http://messagepassing.blogspot.com/2009/03/what-is-sos-part-iii.html"&gt;Part 3&lt;/a&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;br /&gt;Resource Ownership Model&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;SOS implements an extensible resource ownership model.  System resources such as memory, IPC 'mailboxes', semaphores and any user-defined resource are owned by either a loaded module or a running process.   Process death results in process owned resources being freed.   Module unload results in module owned resources being freed.    
Processes are owned by Modules.&lt;br /&gt;&lt;br /&gt;Since all processes in SOS share memory by default (more like threads in other OSes), a mechanism for cleaning up resources on process death is very useful.&lt;br /&gt;&lt;br /&gt;Applications can arrange to have their own resources owned by a process or a module using a mechanism confusingly called 'Events'.  This mechanism guarantees a callback on process death or module unload allowing arbitrary cleanup.  This is essential as processes can exit at any point with no stack cleanup due to exceptions (divide by zero, bus error etc.).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Prioritised proportional scheduling&lt;br /&gt;&lt;/span&gt;&lt;span&gt;&lt;br /&gt;SOS processes are each placed in a process class.  Each class has a guaranteed share of CPU where all process class shares sum to 100% of available CPU.  Additionally, all process classes have a relative priority.&lt;br /&gt;&lt;br /&gt;The scheduler chooses which process to run next based on answering the question "What is the highest priority class with both a runnable process and time left in its CPU share?".&lt;br /&gt;&lt;br /&gt;The guaranteed share mechanism is used to ensure that all process classes regularly get some share of CPU time even in a heavily loaded system, while still providing priority to high urgency tasks.  This allows the system to avoid throughput or latency degradation as load increases.&lt;br /&gt;&lt;br /&gt;The placement of processes into process classes is generally fixed at design time.  The set of process class guaranteed shares (called a scheduler template) is also generally fixed, but a different template is used during restarts than in normal operation.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Application control of scheduler pre-emption&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The SOS scheduler implements pre-emptive multitasking on a single CPU core, changing the running process periodically.  
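&lt;br /&gt;&lt;br /&gt;As a rough illustration of the selection rule described under 'Prioritised proportional scheduling' (run the highest priority class that has both a runnable process and time left in its CPU share), here is a minimal Python sketch.  It is not SOS code: the ProcessClass type, the pick_next and new_period functions, and the use of whole 'ticks' to stand in for a class's percentage share are all assumptions made for the example.&lt;br /&gt;&lt;pre&gt;

```python
# Hypothetical sketch only: names, priorities and tick counts are invented
# for illustration and do not come from SOS itself.

class ProcessClass:
    def __init__(self, name, priority, share_ticks):
        self.name = name
        self.priority = priority        # larger number means more urgent
        self.share_ticks = share_ticks  # guaranteed ticks per scheduling period
        self.remaining = share_ticks    # ticks left in the current period
        self.runnable = []              # names of runnable processes


def pick_next(classes):
    """Return (class, process) for the highest priority class that has both
    a runnable process and share left, or None if no class qualifies."""
    eligible = [c for c in classes if c.runnable and c.remaining != 0]
    if not eligible:
        return None
    best = max(eligible, key=lambda c: c.priority)
    best.remaining -= 1                 # charge one tick to the chosen class
    return best, best.runnable[0]


def new_period(classes):
    """Refill every class's share at the start of a scheduling period."""
    for c in classes:
        c.remaining = c.share_ticks
```

&lt;/pre&gt;In this sketch a low priority class still gets to run once every higher priority class has used up its share for the period, which is the property the guaranteed share mechanism is described as providing.&lt;br /&gt;&lt;br /&gt;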
Processes can temporarily stop pre-emption occurring via the rather wordy setunpreemptable() call.  This can be used to demarcate critical regions in code, where atomic actions are performed.    All other processes can only observe the state of memory before the call to setunpreemptable(), or after a call to setpreemptable().&lt;br /&gt;&lt;br /&gt;Since all other processes are excluded from running while the current process is unpreemptable, it is important that only bounded amounts of computation are performed while unpreemptable.  To enforce this, SOS implements an unpreemptable timer, of the order of tens of milliseconds which stops the running process with an exception if it remains unpreemptable for longer than this.  For this reason, activities like blocking IO and dynamic memory allocation cannot reliably be performed while unpreemptable.&lt;br /&gt;&lt;br /&gt;Most applications are collections of message driven state machines, and run unpre-emptably in bursts, processing messages, updating state and sending responses.  
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display: block;" id="formatbar_Buttons"&gt;&lt;span class="down" style="display: block;" id="formatbar_CreateLink" title="Link" onmouseover="ButtonHoverOn(this);" onmouseout="ButtonHoverOff(this);" onmouseup="" onmousedown="CheckFormatting(event);FormatbarButton('richeditorframe', this, 8);ButtonMouseDown(this);"&gt;&lt;span&gt;As well as giving low-overhead mutual exclusion, making applications non-blocking with control over pre-emption maximises the instruction cache hit rate and improves data cache locality.&lt;br /&gt;&lt;br /&gt;That's enough for now, I'm boring myself !&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-7005314548088129952?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/7005314548088129952/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=7005314548088129952' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/7005314548088129952'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/7005314548088129952'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2009/01/what-is-sos-part-ii.html' title='What is SOS? 
part II'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-8338233732248252606</id><published>2008-12-12T16:36:00.011Z</published><updated>2009-04-17T00:54:03.895+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='pls'/><category scheme='http://www.blogger.com/atom/ns#' term='nortel'/><category scheme='http://www.blogger.com/atom/ns#' term='sos'/><category scheme='http://www.blogger.com/atom/ns#' term='protel'/><title type='text'>What is SOS?</title><content type='html'>&lt;a href="http://messagepassing.blogspot.com/2009/01/what-is-sos-part-ii.html"&gt;Part 2&lt;/a&gt;  &lt;a href="http://messagepassing.blogspot.com/2009/03/what-is-sos-part-iii.html"&gt;Part 3&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;SOS is either the Switch, Support or Service Operating System which runs on a number of the components making up a Digital Multiplex Switch (DMS).&lt;br /&gt;Work began on the system around 1979.  
It is mostly written in, and highly coupled to, the PROTEL language and the PLS (Product Library System) SCCM tool.&lt;br /&gt;It is a pre-emptive multitasking operating system with some bullet-pointable features :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Composed entirely of runtime reloadable modules&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Multiple memory pools with different durability + protection characteristics&lt;/li&gt;&lt;li&gt;Multiple levels of system restart with restart escalation&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Strong and extensible resource ownership model&lt;/li&gt;&lt;li&gt;Prioritised proportional scheduling&lt;/li&gt;&lt;li&gt;Online code patching and extension&lt;/li&gt;&lt;li&gt;Built-in relational style database system&lt;/li&gt;&lt;li&gt;Support for multiuser interactive use&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Support for online upgrade to new version&lt;/li&gt;&lt;li&gt;Contains online multithreaded trace/breakpoint debugger&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;One of the most retro features is having no per-process memory protection.  All processes run in a shared address space which makes them similar to modern-day threads within a single process.  Chaos is somewhat contained by the support for write-protected memory.  One advantage of not having per-process address spaces is that processor caches and TLBs do not need to be flushed when context switching.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Module based&lt;/span&gt;&lt;br /&gt;The PROTEL language allows a system to be split into modules, each with multiple source code files.  All definitions in the files are contained in the scope of the module.   Each source file can be marked as either public interface, private interface or implementation.  Modules import definitions from other modules' public and permitted private interfaces.  
The Module concept provides a component-level encapsulation, independent of OO or other abstraction mechanisms used in the code itself.&lt;br /&gt;&lt;br /&gt;SOS allows modules to be loaded at runtime.  SOS also allows modules to be 'replaced' at runtime.  This involves overwriting the object code of the module while only making safe modifications to the module's exported procedure entry points and global data.  This is the basis of the online code patch system which allows any object code to be replaced while processes execute over it.&lt;br /&gt;&lt;br /&gt;Each module can define an entry procedure.  This is called when the system is performing a restart and allows the module to take different initialisation actions depending on the restart type.&lt;br /&gt;&lt;br /&gt;A SOS system is comprised of a set of modules and an initialisation order.  At the various restarts, the SOS system iterates through the modules in initialisation order, calling their entry procedures.&lt;br /&gt;&lt;br /&gt;To allow different types of systems sharing the same source modules to be easily defined, sets of modules, and their dependencies can be grouped together to form larger components.  A system can then be specified in terms of these larger components.  The inter-component and inter-module dependencies are then used together with some hints to compute the module initialisation order and build a system image.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Multiple memory types&lt;br /&gt;&lt;/span&gt;The DMS model is unusual in that memory is expected to provide sufficient persistence for most data, with disk based recovery only occasionally required.  
This is a reasonable assumption given fault tolerant redundant memory with redundant power supplies, arrays of lead acid batteries etc.&lt;br /&gt;A number of basic memory types are defined by SOS, with a number of variants for special purposes.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;PSPROT&lt;br /&gt;Program Store, protected.  Used for object code.  Write protected.  Loaded from and Saved to a SOS image on disk/tape.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;DSSAVE&lt;br /&gt;Data Store.  Not initialised by operating system reboot or restart.  Not part of a SOS image.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;DSPROT&lt;br /&gt;Data Store, write protected.  Used for configuration or otherwise slow changing data.  Loaded from and Saved to a SOS image on disk/tape.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;DSPERM&lt;br /&gt;Data Store, permanently allocated, wiped by some restarts.  Not part of a SOS image.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;DSTEMP&lt;br /&gt;Data Store, temporarily allocated, wiped by most restarts.  Not part of a SOS image.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;The DSSAVE memory is limited in size but is useful for tracking system debugging state across multiple OS reboots.  Most applications have no need for it.&lt;br /&gt;DSPROT is written to by transiently removing write protection during the write.  If a write is attempted while write protection is active, the writing process gets an exception.  Special handling is required while DSPROT is being backed up to ensure a consistent snapshot is taken.&lt;br /&gt;DSPERM remains allocated across all restart types but is reset on some (see below).  
This gives it the interesting property that a pointer to allocated DSPERM must be stored in DSPROT memory to ensure that the allocated memory can be 'found' again after a restart.&lt;br /&gt;DSTEMP is deallocated and reset across all restart types.&lt;br /&gt;&lt;br /&gt;The memory types tie into the set of restart types supported by the operating system (below).&lt;br /&gt;One of the main benefits of this system is that it orients application designers towards thinking of their application in terms of multiple levels of state, and the benefits of throwing state away to recover from error situations.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Multiple levels of system restart&lt;/span&gt;&lt;br /&gt;SOS defines three levels of restart, in addition to the Initial Program Load :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Initial Program Load (IPL)&lt;br /&gt;This is performed only once when a module is initially loaded&lt;/li&gt;&lt;li&gt;Reload restart&lt;br /&gt;This is the most severe restart type and occurs as part of a reboot, or when an assertion failure or user request demands it.&lt;br /&gt;DSPERM memory is reset, DSTEMP memory is deallocated and reset.  All modules' entry procedures are called.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Cold restart&lt;br /&gt;This is the second-most-severe restart type and occurs when requested by the user, or when a number of Warm restarts have failed to clear a problem.  DSTEMP memory is deallocated and reset.&lt;/li&gt;&lt;li&gt;Warm restart&lt;br /&gt;This is the least severe restart type and occurs when requested by the user or when the system determines that a number of failure indicators suggest ill health.  DSTEMP memory is deallocated and reset.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;By placing different parts of a module's state in different memory types, and reallocating/reinitialising the state in the module's entry procedure, applications can cooperate with the system's restart escalation mechanism.  
One of the Call Processing (CALLP) applications written on SOS uses Warm Restart to drop connecting calls, but keep connected calls, and Cold restart to drop all calls.&lt;br /&gt;&lt;br /&gt;Low level modules in SOS monitor system health indicators (number of process deaths, exceptions while in a critical region, system load etc.) and if there is a perceived problem will trigger a warm restart of the system.&lt;br /&gt;&lt;br /&gt;If the warm restart fails, or the system does not recover correctly after a number of warm restarts, the restart type is escalated to a cold restart.  Modules are generally designed to re-initialise more state during a cold restart (which as a result, generally takes longer to accomplish).&lt;br /&gt;&lt;br /&gt;If multiple cold restarts fail, the system escalates to a Reload restart, which, again, reinitialises more state, taking longer.&lt;br /&gt;&lt;br /&gt;If all attempts to restart the running system fail, a reboot can be attempted which reloads the system image from disk and performs a reload restart on it.&lt;br /&gt;&lt;br /&gt;If this fails, previously backed up images are tried.&lt;br /&gt;&lt;br /&gt;In this way, the system automatically escalates recovery efforts, resetting more and more state each time, eventually trying previous images.  The driving philosophy is to *never* give up trying to recover.  
Never wait for a friendly user to press a key, or intervene.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;What makes this different?&lt;/span&gt;&lt;br /&gt;SOS is curious in the ways it differs from the Operating Systems in common use today but it is also similar in a number of ways :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Written in high level language&lt;/li&gt;&lt;li&gt;Written for general purpose CPU and memory model&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Multitasking&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Supports interactive use&lt;/li&gt;&lt;/ul&gt;These features are not particularly noteworthy for a modern general purpose OS, but for one designed in 1979 for a telecoms switch they are unusual.  Other telecoms software at the time tended to be more :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Written in assembly language &lt;span style="font-style: italic;"&gt;and/or&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Written in telecoms-specific DSL with severe expressivity limitations&lt;/li&gt;&lt;li&gt;Designed for telecoms specific CPUs and hardware&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Cooperatively scheduled&lt;/li&gt;&lt;li&gt;Very limited interactivity&lt;/li&gt;&lt;/ul&gt;I believe SOS was ahead of its time in being fairly general purpose, powerful and flexible.&lt;br /&gt;&lt;br /&gt;Well done if you got this far, I'll continue boring on about SOS in another post...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-8338233732248252606?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/8338233732248252606/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=8338233732248252606' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/2987855187574329171/posts/default/8338233732248252606'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/8338233732248252606'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2008/12/what-is-sos.html' title='What is SOS?'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-7585986611436136108</id><published>2008-12-09T00:49:00.005Z</published><updated>2009-03-30T14:16:41.468+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='xacore'/><category scheme='http://www.blogger.com/atom/ns#' term='nortel'/><category scheme='http://www.blogger.com/atom/ns#' term='sos'/><title type='text'>Hardware Transactional Memory II</title><content type='html'>In my &lt;a href="http://messagepassing.blogspot.com/2008/12/hardware-transactional-memory-i.html"&gt;last entry&lt;/a&gt;, I introduced Nortel's XA-Core platform which I believe was one of the first commercially successful HTM machines.  This time I want to talk about the hardware.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Modular Architecture&lt;/span&gt;&lt;br /&gt;An XA-Core system is comprised of various card types including Processing Elements (PEs), Shared Memory cards (SM) and IO processors (IOP).  These components are connected by a 'Gigabit Interconnect' (GI) which in practice is a set of point-to-point optical links with agreeable 'hot pluggable' and optical isolation properties.  
All of these card types exist in various versions and live in a standardish DMS rack with redundant power and cooling.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Processing Element&lt;/span&gt;&lt;br /&gt;The PE card has a number of large chips and a few smaller ones (probably most circuit boards do :)) :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Two lockstep PowerPC CPUs, initially PPC603.&lt;/li&gt;&lt;li&gt;A 'Hippo' or 'Rhino' chip acting as the CPU -&gt; Memory interface.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;One custom 'PIGI' chip interfacing the PE with the GI&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;The processors run in lockstep with a fairly standard comparator mechanism to check them.  From each processor's perspective, the Hippo/Rhino chip looks like main memory.  Among other things, the Hippo/Rhino and PIGI chips provide :&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Lock-step comparison of CPU outputs&lt;/li&gt;&lt;li&gt;Mapping of PPC bus requests to GI protocol requests, including transaction identifiers etc.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Mapping of 32-bit PE address space to 40-bit Shared Memory Address space.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;The PE board additionally has some local memory, referred to as 'Scratch' which applications can use as a non-persistent workspace.  While the PE has lockstepped CPUs, it is not possible to run the PE with only one CPU functioning.  If either CPU fails, the whole PE is isolated.  This avoids the requirement for a post-mismatch fault detection algorithm.&lt;br /&gt;&lt;br /&gt;The minimum configuration is two PE cards, giving tolerance of one failure.  Extra PE cards can be configured to give different n+m fault tolerance configurations.&lt;br /&gt;&lt;br /&gt;When PE cards are inserted, they perform a self test, a number of initialisation steps are performed and then they begin executing the SOS scheduler loop, taking work.  When a PE card is hot-pulled or fails, any outstanding memory transaction is rolled back.  
Some other PE can then pick up the aborted work from wherever in shared memory the original PE found the work.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Shared Memory card&lt;/span&gt;&lt;br /&gt;The SM cards contain some fairly fast memory accessed via a custom 'SMOAC' (Shared Memory Ownership and Access Controller) chip.&lt;br /&gt;&lt;br /&gt;The SMOAC chip maintains the memory ownership information that is necessary to enforce the transactional semantics of memory access.   Ownership information is maintained for every 32-bytes (PPC cache line) of memory in the system using ownership information sent with cache-line read and write requests from the PE.&lt;br /&gt;&lt;br /&gt;To support rollback of unwanted memory transactions, &lt;span style="font-style: italic;"&gt;every&lt;/span&gt; cache line is duplicated within an SM card.  Every cache line has an Active copy (last committed) and an Update copy (dirty, yet to be committed).  This &lt;span style="font-style: italic;"&gt;doubles&lt;/span&gt; the amount of memory required, although it probably simplifies the hardware design and theoretically allows arbitrarily large transactions.&lt;br /&gt;&lt;br /&gt;Logical memory is mapped onto the SM cards in 32MByte blocks.  The normal configuration is that every block is mapped onto at least two SM cards, and sometimes three.  This allows for one or two SM card failures to be tolerated.  
Combined with the two copies of memory required for the transaction mechanism, this means that each byte of logical shared memory requires four to six bytes of physical memory.&lt;br /&gt;&lt;br /&gt;When SM cards are inserted, SOS decides which blocks should be copied to the new card, and begins a background task to copy the blocks across.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Input/Output Processor&lt;/span&gt;&lt;br /&gt;The IO processor cards are used to connect the XA-Core to the outside world, including terminals, disk + tape and the rest of the DMS system.  The cards themselves contain single PPC CPUs (no lockstepped redundancy here) and ASICs to interface to the GI and provide some DMA capability.  IOPs are deployed in pairs so that no IO facility is completely lost due to an IOP failure.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;Weird / Cool things&lt;br /&gt;&lt;/span&gt;&lt;ul&gt;&lt;li&gt;Fault tolerant, single system image, shared memory multiprocessing&lt;br /&gt;&lt;span style="font-style: italic;"&gt;I don't think many examples exist where a single-system-image SMP can handle an arbitrary processor failure without a crash.&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Existing correct code runs correctly in parallel&lt;br /&gt;&lt;span style="font-style: italic;"&gt;But may be serialised due to contention on shared memory access&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Even IO is transactional&lt;br /&gt;&lt;span style="font-style: italic;"&gt;This puts pressure on IO latencies, and requires good batching to minimise IO overheads.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Easier identification of transient CPU faults&lt;br /&gt;&lt;span style="font-style: italic;"&gt;When a mismatch is detected within a PE, the failing operation can be safely rolled back and retried on the same PE multiple times.  
This can be used to help distinguish hard faults from transient / temporal faults.&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Online System split is possible for upgrade&lt;br /&gt;&lt;span style="font-style: italic;"&gt;To support online software and data upgrade, the system can de-duplicate memory, assign a PE to the 'other' half of the memory and boot it from a system image on disk.  This gives two systems running on one machine.  At cutover time, most PEs and IOPs are quickly migrated from the old to the new side.  Eventually memory can be re-duplicated from the new half.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;No 'standard-SMP' cache-coherency glue logic required&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Not so cool things&lt;br /&gt;&lt;/span&gt;&lt;ul&gt;&lt;li&gt;Four to six times memory hardware overhead&lt;br /&gt; &lt;span style="font-style: italic;"&gt;Perhaps some trade-off between maximum transaction size and hardware complexity could have been made?&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Expensive memory required to contain latency&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Large, complex custom ASICs required&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Pushes out time-to-market, reduces time-in-market for modifications.  Expensive.&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;OS cooperation required to assist with transaction demarcation, ensuring forward progress, IO handling, bringing PEs, IOPs and SMs on + offline.&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Requires cooperation from OS owners.&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;So that's my tour of the XA-Core hardware.  
Please comment if you have corrections or further questions; I may be able to dredge up some more details.&lt;br /&gt;Next time I'll talk about some of the modifications made to the SOS operating system to make it run on this platform.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-7585986611436136108?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/7585986611436136108/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=7585986611436136108' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/7585986611436136108'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/7585986611436136108'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2008/12/hardware-transactional-memory-ii.html' title='Hardware Transactional Memory II'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-5379521974942102649</id><published>2008-12-05T19:26:00.005Z</published><updated>2009-03-30T14:15:15.349+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='xacore'/><category scheme='http://www.blogger.com/atom/ns#' term='nortel'/><category scheme='http://www.blogger.com/atom/ns#' term='sos'/><category scheme='http://www.blogger.com/atom/ns#' term='protel'/><title type='text'>Hardware 
Transactional Memory I</title><content type='html'>The first multiprocessor I worked on was Nortel's XA-Core platform. This exotic platform was a replacement for the 'Computing Module' (CM) of their &lt;a href="http://en.wikipedia.org/wiki/Digital_Multiplex_System"&gt;DMS&lt;/a&gt; telecoms switching platform.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Background&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;Previous CM generations are built on a pair of CPUs (Motorola 88k, 68k, BNR NT40) run in lockstep through a comparator for fault tolerance. The software running on these includes a multitasking OS (SOS) and a huge amount of call processing, database, hardware support and other telecoms code written in the proprietary &lt;a href="http://en.wikipedia.org/wiki/Protel"&gt;PROTEL&lt;/a&gt; language, starting around 1979.  SOS supports write-protectable memory, but not per-process memory protection, so the memory map resembles a heavily multithreaded process. Shared data is commonly used with an assumption of a strictly ordered memory model.  Heavy use is made of a single-global-lock to enforce mutual exclusion between processes to the extent that the bulk of the computation time is spent with a single process holding the global lock in 'jumbo' timeslices of tens of milliseconds.  
Much of the large code base is &gt; 10 years old and in a 'frozen' state where changes are not possible.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The problem&lt;/span&gt;&lt;br /&gt;How to increase CM computation capacity beyond the incremental improvements available from successive generations of CPUs without a huge software rewriting and revalidation effort and while maintaining CPU and memory fault tolerance?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The solution (&lt;/span&gt;&lt;a href="http://patft.uspto.gov/netacgi/nph-Parser?Sect2=PTO1&amp;amp;Sect2=HITOFF&amp;amp;p=1&amp;amp;u=%2Fnetahtml%2Fsearch-bool.html&amp;amp;r=1&amp;amp;f=G&amp;amp;l=50&amp;amp;d=PALL&amp;amp;RefSrch=yes&amp;amp;Query=PN%2F5918248"&gt;XA-Core patent&lt;/a&gt;)&lt;br /&gt;Create a fault tolerant SMP platform with replicated hardware transactional memory.  Modify the OS so that a process claiming the 'single global lock' implicitly sets the boundaries on a memory transaction.  Handle inter-process memory access contention by rolling back one of the contenders. Handle CPU failure by rolling back in-progress memory transactions.&lt;br /&gt;The achievable level of parallelism is then limited by the memory access patterns of the concurrently running processes at the cache-line level.&lt;br /&gt;Code can still be written using the 'single CPU multitasking OS with big-global-lock' approach. Incremental improvements to available parallelism can be made by changing the data access patterns of the parallel processes. 
Tools exist to monitor contention between competing processes and map it to stack traces and/or data structures.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The interesting details and issues&lt;br /&gt;&lt;/span&gt;The actual hardware used, the transaction handling in the operating system, handling IO, application modifications required etc.&lt;br /&gt;&lt;br /&gt;In the spirit of actually completing some blog entries, I'll continue this post later.&lt;br /&gt;&lt;br /&gt;To be &lt;a href="http://messagepassing.blogspot.com/2008/12/hardware-transactional-memory-ii.html"&gt;continued&lt;/a&gt;...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-5379521974942102649?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/5379521974942102649/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=5379521974942102649' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/5379521974942102649'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/5379521974942102649'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2008/12/hardware-transactional-memory-i.html' title='Hardware Transactional Memory I'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-7938545754164157048</id><published>2008-06-10T15:21:00.010+01:00</published><updated>2009-03-30T14:13:55.856+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gsm'/><category scheme='http://www.blogger.com/atom/ns#' term='mobile'/><category scheme='http://www.blogger.com/atom/ns#' term='design'/><category scheme='http://www.blogger.com/atom/ns#' term='telecoms'/><title type='text'>Design fossils</title><content type='html'>Mobile telephone networks have a kind of gritty Service Oriented Architecture.  The Services are defined in GSM, UMTS and other standards.  The protocols have hard unfashionable names like SS7, TCAP and MAP, and the bytes call themselves octets and don't waste their entropy transporting 'markup'.  Even so, the basic SOA design and benefits are still visible and these benefits have enabled multi{national, operator, vendor, protocol} mobile communication to become a basic utility.&lt;br /&gt;&lt;br /&gt;The scalability, reliability and success of the GSM network design and implementation seem to me to be an under-celebrated achievement of software and systems engineering.  Perhaps the price of success is to become invisible.  As the pace of technical development accelerates,  major achievements of the past lose their significance and appear to be inevitable and mundane increments on the road to now.&lt;br /&gt;&lt;br /&gt;Mobile telephony network technology is reasonably mature, and most software and system design effort goes into reducing the cost per subscriber and adding speculative features.  However there is a huge resource of distilled design expertise, proven design patterns and traces of design evolution captured in these telephony systems as they have existed, and as they exist now.  
Unfortunately much of this information is encoded in odd and underknown languages, sealed in peculiar and fruity source code control systems and fading in the minds of ageing and modestly eccentric employees.  Fortunately, this adds to the charm of investigating.&lt;br /&gt;&lt;br /&gt;I intend to write up some of the things I have found interesting before I too forget....&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-7938545754164157048?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/7938545754164157048/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=7938545754164157048' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/7938545754164157048'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/7938545754164157048'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2008/06/design-fossils.html' title='Design fossils'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2987855187574329171.post-7825659519738789095</id><published>2008-06-09T15:48:00.003+01:00</published><updated>2008-06-09T15:55:35.762+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='general'/><title type='text'>Intention</title><content type='html'>To periodically post about systems I have worked on, am working on 
and am interested in.  To share references to interesting blog entries, articles, papers, books, designs, patterns and systems.  Hopefully not to share too much detail of my daily habits and emotional state.&lt;br /&gt;&lt;br /&gt;We shall see...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2987855187574329171-7825659519738789095?l=messagepassing.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://messagepassing.blogspot.com/feeds/7825659519738789095/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2987855187574329171&amp;postID=7825659519738789095' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/7825659519738789095'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2987855187574329171/posts/default/7825659519738789095'/><link rel='alternate' type='text/html' href='http://messagepassing.blogspot.com/2008/06/intention.html' title='Intention'/><author><name>Frazer Clement</name><uri>http://www.blogger.com/profile/05435364450772586515</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
