Comments on "message passing: Low latency distributed parallel joins"

Frazer Clement (2016-10-25 10:26 +01:00):

Here are some ageing slides describing some of the concepts above (batching, data locality, pruning, use of EXPLAIN etc.).

Frazer Clement (2016-10-25 10:23 +01:00):

> 3. For the above problem, I think batching is the way to go for optimization, so I imagine an
> algorithm somewhat like this:
> ------
> declare tmp_ret_table {userid primary key, city, age, ...}
> declare tmp_index_table { tbl_nodeid, city, age, userid }
> declare fetchfactor: some constant
> declare batched_ids: varchar
>
> while total-rows in tmp_ret_table < n
>   select top (n * fetchfactor) nodeid("tbl", city, age, userid), city, age, userid
>   into tmp_index_table
>   from tbl_index where userid > bound1 order by userid -- nodeid is a sql function
>
>   for each tbl_nodeid in tmp_index_table, generate the batched_ids, formatted as (pk1, pk1...)
>   insert into tmp_ret_table
>   select userid, ...
>   from node(nodeid).tbl where pk in batched_ids and search_predicate
>   delete * from tmp_ret_table
>
>   if no more rows for all nodeid break
> while-end
>
> trim extra rows (if > n) in tmp_ret_table
>
> return tmp_ret_table
> ------

Right, well that is kind of what happens, but the responsibility is split across layers, and there is no join required:

1. Use an ordered index to find *batches of* candidate rows matching the predicate > bound1.
   The fetch factor above is effectively the ndb_batch_size [session] variable.

2. Use an always-local read to get the indexed data from the table tbl.
   (Ordered indexes are always co-located with the indexed rows: each table fragment has a co-located index fragment indexing only the rows it contains.)

3. Use the data node filter executor to filter out rows not matching the predicate in the data node, as far as possible.

4. Return candidate rows to MySQLD.

5. MySQLD applies 'full' filtering [which may reduce the candidates further].

6. MySQLD outputs one row at a time until the LIMIT is hit. Once a batch is exhausted, another batch is requested from the data nodes.

BTW, just to confuse things further, MySQLD supports a separate 'Batched Key Access' mechanism (on MyISAM, InnoDB, *) which works something like you describe, getting a batch of keys from table 1 to look up in table 2 as an optimisation. This can also be used with Ndb, under the control of the optimizer. IIRC, InnoDB will attempt to e.g. sort the batched keys to get a better disk/index access pattern. For Ndb, we will send the keys in a large batch, saving latency.
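The batch size and Batched Key Access behaviour mentioned above can be inspected and adjusted from a MySQL session. A rough sketch: the variable and switch names are real, but the values shown are purely illustrative, and whether ndb_batch_size is settable at session scope depends on the version:

```sql
-- Batch size used when fetching rows from the data nodes (bytes):
SHOW VARIABLES LIKE 'ndb_batch_size';
SET SESSION ndb_batch_size = 32768;

-- Enable Batched Key Access (requires Multi-Range Read):
SET SESSION optimizer_switch = 'mrr=on,mrr_cost_based=off,batched_key_access=on';
```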
Batched Key Access is totally separate from AQL / SPJ, which is Ndb only.

> The 3rd question is, how do you think about the outlined algorithm, how does mysql cluster's
> optimization implement it?

Hopefully answered above!

Thanks for the questions - nice to hear that you are looking at MySQL Cluster, Ndb + SPJ!

I suggest reading about our ordered indexes, as they are more relevant to executing queries of the type you mention.

Also, the EXPLAIN, EXPLAIN EXTENDED and SHOW WARNINGS commands give a first step towards seeing how MySQL Cluster executes particular queries.

Frazer

Frazer Clement (2016-10-25 10:22 +01:00):

> So my question is:
> 1. How do the two different forms of queries perform on a MySQL Cluster? And which one will
> win, or will they tie? Why?

b1 is not really used, as mentioned above.

Perhaps your question is rather about how MySQL Cluster would handle it if you made the unique index table explicit, as you have done above, and added ordered indexes to e.g. the UserId column etc.

I guess I would first try to determine what the gain of b1 is over a1. Given that the secondary 'index' contains no data that is not already in the base table, all it provides is more efficient access methods for individual key values. As that is not required here, it seems like extra work for no gain, in which case I would expect it to be slower.

> 2. As seen from schemas a and b, clearly the related data are not co-located, since one is
> partitioned by userid, the other by (cityid, age, userid). So how will SPJ or AQL try to
> optimize for performance?

The ordered indices in the schema would probably be more relevant to discuss here. I would expect that query a1 would be executed without using SPJ, as it may just be a range scan on an ordered index on UserId with a pushed predicate, e.g. scan the UserId ordered index on table tbl starting at bound1 and return rows which pass the [partial] predicate function x.

The scan will probably run in parallel across all LDM instances holding an index fragment, as it will not be prunable. No joins are required. Each fragment scan will return a batch of rows. Perhaps more than top(n) rows will be returned to the MySQLD, but it can discard those it does not need. Only m * batchsize more rows than needed will be scanned, where m = parallelism.

More generally, SPJ/AQL will:
 1. Consume the output of query planning (an ordered NLJ of tables/indices) with limited feedback into the process
 2.
Identify query fragments that can be pushed to the data nodes (perhaps the whole query)
 3. Prepare these fragments
 4. Execute these fragments as the standard NLJ executor iterates the NLJ

From the MySQLD point of view, it is a standard NLJ (with MySQL enhancements for Batched Key Access).

On the Ndb side of things we will of course use batching, so when we are asked for the first row from a query fragment we will execute the pushed portion to produce *at least* one row. The extra rows are buffered in NdbApi until the MySQLD iterates onto them. It may never do this.

So we will do batched, pushed, parallel joining etc. when necessary. However, there is currently not a lot of feedback from our SPJ to the MySQLD query optimizer/planner about more or less efficient plans.

And down in the data nodes, data locality is just a matter of latency. Both local and remote lookups are performed using asynchronous message passing, so there are no different code paths etc. Of course local lookups are more efficient and lower latency, but the code is agnostic.

Hope this answers the question.

Frazer Clement (2016-10-25 10:21 +01:00):

Hi JX,
> “Where the dependent data happens to...Hi JX,<br /><br /><br />> “Where the dependent data happens to be on the same data node, the dependency can be resolved<br />> with no inter-process communication at all.", however this is often not the case, I mean, the<br />> data is not co-located. Then the performance will be bad?<br /><br />Well inter-process communication will be required, which will increase latency, response time etc. However throughput need not be linearly affected if there are parallel queries etc.<br /><br /><br />> I have a few questions. In the era of web app, it is quite common to have pattern like:<br />> select top (n) x, ... from tbl where search_predicate and x > bound1 order by x,<br />> where x can be from a small set, e.g. {[UserID], [city, age], ...}. This query will provide<br />> the base data for paged search result sorted by some field choosen ad hoc by users from a<br />> small set.<br />><br />> This will in turn require a schema like:<br />> a: create tabe tbl { userid, city, age, ... primary key (city, age, userid), unique key<br />> (userid) } engine ndb,<br />> Now since this is on NDB engine, the above is actually equivalent to:<br />> b: create tabe tbl { userid, city, age, ... primary key (city, age, userid)}<br />> create table tbl_index { userid, city, age, primary key (userid)}<br />> (see: http://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-ndbd-definition.html, unique<br />> constraint is implemented as separate table)<br /><br />Yes<br /><br /><br />> Now the query will have two forms correspondingly:<br />><br />> a1:<br />> select top (n) userid, ... from tbl where search_predicate and userid > bound1 order by<br />> userid,<br />><br />> b1:<br />> select top (n) userid, ... from tbl_index, tbl on tbl_index.userid = tbl.user<br />> where search_predicate and userid > bound1 order by userid<br />><br /><br />We really only 'join' with the tbl_index when we are using it to access the table (tbl). 
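For instance (a sketch with a hypothetical literal value), a unique-key point lookup like the following is internally a read of the hidden unique index table to find the primary key, followed by a primary-key read of tbl:

```sql
-- Internally: lookup in the hidden unique index table on userid,
-- then a primary-key read of tbl on (city, age, userid).
SELECT city, age FROM tbl WHERE userid = 42;
```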
We only use the tbl_index to access the table (tbl) when we have a single (or a small set) of unique index values to look up. In this case we have an inequality on the userid (userid > bound1), so we would rather use an *ordered index*. Ordered indices are always defined on the main table (tbl), never on a secondary unique index table.

So I don't think b1 would occur using MySQLD.

To be continued...

Anonymous (2016-10-18 15:05 +01:00):

Hi,
“Where the dependent data happens to be on the same data node, the dependency can be resolved with no inter-process communication at all.” However, this is often not the case, I mean, the data is not co-located. Then the performance will be bad?

I have a few questions. In the era of web apps, it is quite common to have a pattern like:
select top (n) x, ... from tbl where search_predicate and x > bound1 order by x,
where x can be from a small set, e.g. {[UserID], [city, age], ...}. This query will provide the base data for a paged search result sorted by some field chosen ad hoc by users from a small set.

This will in turn require a schema like:
a: create table tbl { userid, city, age, ... primary key (city, age, userid), unique key (userid) } engine ndb
Now since this is on the NDB engine, the above is actually equivalent to:
b: create table tbl { userid, city, age, ... primary key (city, age, userid)}
   create table tbl_index { userid, city, age, primary key (userid)}
(see: http://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-ndbd-definition.html - a unique constraint is implemented as a separate table)

Now the query will have two forms correspondingly:

a1:
select top (n) userid, ... from tbl where search_predicate and userid > bound1 order by userid

b1:
select top (n) userid, ... from tbl_index, tbl on tbl_index.userid = tbl.user
   where search_predicate and userid > bound1 order by userid

So my question is:
1. How do the two different forms of queries perform on a MySQL Cluster? And which one will win, or will they tie? Why?
2. As seen from schemas a and b, clearly the related data are not co-located, since one is partitioned by userid, the other by (cityid, age, userid). So how will SPJ or AQL try to optimize for performance?
3.
For the above problem, I think batching is the way to go for optimization, so I imagine an algorithm somewhat like this:
------
declare tmp_ret_table {userid primary key, city, age, ...}
declare tmp_index_table { tbl_nodeid, city, age, userid }
declare fetchfactor: some constant
declare batched_ids: varchar

while total-rows in tmp_ret_table < n
  select top (n * fetchfactor) nodeid("tbl", city, age, userid), city, age, userid
  into tmp_index_table
  from tbl_index where userid > bound1 order by userid -- nodeid is a sql function

  for each tbl_nodeid in tmp_index_table, generate the batched_ids, formatted as (pk1, pk1...)
  insert into tmp_ret_table
  select userid, ... from node(nodeid).tbl where pk in batched_ids and search_predicate
  delete * from tmp_ret_table

  if no more rows for all nodeid break
while-end

trim extra rows (if > n) in tmp_ret_table

return tmp_ret_table
------
The 3rd question is: what do you think about the outlined algorithm, and how does MySQL Cluster's optimization implement it?

Thanks, JX
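MySQL has no TOP (n) syntax, so anyone reproducing the a1 query shape above against MySQL Cluster would write it with LIMIT. A sketch, where the predicate and bound values are purely illustrative:

```sql
-- a1 in MySQL syntax: a range scan on an ordered index on userid,
-- with the predicate pushed down to the data nodes where possible.
SELECT userid, city, age
FROM tbl
WHERE city = 'Oslo' AND age > 30   -- search_predicate (illustrative)
  AND userid > 1000                -- bound1 (illustrative)
ORDER BY userid
LIMIT 10;                          -- top (n)
```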