
MySQL Group Replication:

'Synchronous',
multi-master,
auto-everything
Ulf Wendel, MySQL/Oracle

The speaker says...


MySQL 5.7 introduces a new kind of replication:
MySQL Group Replication. At the time of writing
(10/2014) MySQL Group Replication is available as a
preview release on labs.mysql.com. In common user
terms it features (virtually) synchronous,
multi-master, auto-everything replication.

Proper wording...
An eager update everywhere system based
on the database state machine approach
atop of a group communication system
offering virtual synchrony and
reliable total ordering messaging.
MySQL Group Replication offers
generalized snapshot isolation.

The speaker says...


And here is a more technical description....

WHAT ?!
Hmm, how does it compare?

The speaker says...


The technical description given for MySQL Group
Replication may sound confusing because it combines
elements from distributed systems and database
systems theory. Between roughly 1996 and 2006 the two
research communities jointly formulated the
replication method implemented by MySQL Group
Replication.
As a web developer or MySQL DBA you are not
expected to know distributed systems theory inside
out. Yet to understand the properties of MySQL Group
Replication and to get the most out of it, we'll have to
touch some of the concepts.
Let's see first how the new stuff compares to the
existing MySQL solutions.

Goals of distributed databases

Availability
  Cluster as a whole unaffected by loss of nodes
Scalability
  Geographic distribution
  Scale size in terms of users and data
  Database specific: read and/or write load
Distribution Transparency
  Access, Location, Migration, Relocation (while in use)
  Replication
  Concurrency, Failure

The speaker says...


MySQL Group Replication is about building a
distributed database. To catalog it and compare it
with the existing MySQL solutions in this area, we
can ask what the goals of distributed databases are.
The goals lead to some criteria that are used to give
a first, brief overview.
Goal: a distributed database cluster strives for
maximum availability and scalability while
maintaining distribution transparency.
Criteria: availability, scalability, distribution
transparency.

MySQL clustering cheat sheet

MySQL Replication
  Availability: Primary = SPoF, no auto failover
  Scalability: Reads
  Scale on WAN: Asynchronous
  Distribution Transparency: R/W splitting

MySQL Cluster
  Availability: Shared nothing, auto failover
  Scalability: Partial replication, node limit
  Scale on WAN: Synchronous (WAN option)
  Distribution Transparency: SQL: yes (low level: no)

MySQL Fabric
  Availability: SPoF monitored, auto failover
  Scalability: Partial replication, no node limit
  Scale on WAN: Asynchronous (depends)
  Distribution Transparency: Special clients, no distributed queries

The speaker says...


Already today MySQL has three solutions to build a
distributed MySQL cluster: MySQL Replication, MySQL
Cluster and MySQL Fabric. Each system has different
optimizations, none can achieve all the goals of a
distributed cluster at once. Some goals are
orthogonal.
Take MySQL Cluster. MySQL Cluster is a shared
nothing system. Data storage is redundant, nodes fail
independently. Transparent sharding (partial
replication) ensures read and write scalability until
the maximum number of nodes is reached. Great for
clients: any SQL node runs any SQL, synchronous
updates become visible immediately everywhere.
But, it won't scale on slow WAN connections.

How Group Replication fits in

MySQL Cluster
  Availability: Shared nothing, auto failover
  Scalability: Partial replication, node limit
  Scale on WAN: Synchronous (WAN option)
  Distribution Transparency: SQL: yes (low level: no)

MySQL Group Replication
  Availability: Shared nothing, auto failover/join
  Scalability: Full replication, read and some write scalability
  Scale on WAN: (Virtually) Synchronous
  Distribution Transparency: All nodes run all SQL

(MySQL Replication and MySQL Fabric: see the cheat sheet above.)

The speaker says...


MySQL Group Replication has many of the desirable
properties of MySQL Cluster. It's strong on availability
and client friendly due to the distribution
transparency. No complex client or application logic
is required to use the cluster. So, how do the two
differ?
Unlike MySQL Cluster, MySQL Group Replication
supports the InnoDB storage engine. InnoDB is the
dominant storage engine for web applications. This
makes MySQL Group Replication a very attractive
choice for small clusters (3-7 nodes) running Drupal
or WordPress in LAN settings! Also, Group
Replication is not synchronous in the technical sense.
For practical matters it is.

Group Replication (vs. Cluster)

Availability
  Nodes fail independently
  Cluster continues operation in case of node failures
Scalability
  Geographic distribution: n/a, needs fast messaging
  All nodes accept writes, mild write scalability
  All nodes accept reads, full read scalability
Distribution Transparency
  Full replication: all nodes have all the data
  Fail-stop model: developer freed from worrying about consistency

The speaker says...


Another major difference between MySQL Cluster
and MySQL Group Replication is the use of partial
replication versus full replication. MySQL Cluster has
transparent sharding (partial replication) built in. On
the inside, on the level of so-called MySQL Cluster
data nodes, not every node has all the data. Writes
don't add work to all nodes of the cluster but only to a
subset of them. Partial replication is the only known
solution to write scalability. With MySQL Group
Replication all nodes have all the data. Writes can be
executed concurrently on different nodes but each
write must be coordinated with every other node.
Time to dig deeper >:).

Eager update everywhere... ?!

A developers categorization...
Where are transactions run? When does synchronization happen?

Eager + Primary Copy: (MySQL semi-synch Replication)
Eager + Update Everywhere: MySQL Cluster, MySQL Group Replication,
  3rd party: Galera
Lazy + Primary Copy: MySQL Replication/Fabric, 3rd party: Tungsten
Lazy + Update Everywhere: MySQL Cluster Replication

The speaker says...


I've described MySQL Group Replication as an
eager update everywhere system. The term comes
from a categorization of database replication
systems by two questions:
- where can transactions be run?
- when are transactions synchronized between
nodes?
The answers to the questions tell a developer which
challenges to expect. The answers determine which
additional tasks an application must handle when it's
run on a cluster instead of a single server.

Lazy causes work...

[Diagram: a client sets price = 1.23 on one node; while replication is
still in flight, the other nodes return price = 1.00, price = 1.23 and
price = 0.98 respectively.]

The speaker says...


When you try to scale an application by running it on a
lazy (asynchronous) replication cluster instead of a
single server you will soon have users complaining
about outdated and incorrect data. Depending on
which node the application connects to after a write,
a user may or may not see his own updates. This can
neither happen on a single server system nor on an
eager (synchronous) replication cluster. Lazy
replication causes extra work for the developer.
BTW, have a look at PECL/mysqlnd_ms. It abstracts
the problem of consistency for you. Things like
read-your-writes boil down to a single function call.

Primary Copy causes work...

[Diagram: all writes go to the Primary; the Copies replicate from it and
serve reads only.]

The speaker says...


Judging from the developer perspective only, primary
copy is an undesired replication solution. In a
primary copy system only one node accepts writes.
The other nodes copy the updates performed on the
primary. Because of the read-write splitting, the
replication system does not need to coordinate
conflicting operations. Great for the replication
system author, bad for the developer. As a developer
you must ensure that all write operations are
directed to the primary node... Again, have a look at
PECL/mysqlnd_ms.
MySQL Replication follows this approach. Worse,
MySQL Replication is a lazy primary copy system.

Love: Eager Update Everywhere

[Diagram: every node accepts both reads and writes; after a write, all
nodes show price = 1.23.]

The speaker says...


From a developer perspective an eager update
anywhere system, like MySQL Group Replication, is
indistinguishable from a single node. The only extra
work it brings you is load balancing, but that is the
case with any cluster. An eager update anywhere
cluster improves distribution transparency and
removes the risk of reading stale data. Transparency
and flexibility are improved because any transaction
can be directed to any replica. (Sometimes
synchronization happens as part of the commit, thus
strong consistency can be achieved.) Fault tolerance
is better than with Primary Copy. There is no single
point of failure - a single primary - that can cause a
total outage of the cluster. Nodes may fail
individually without bringing the cluster down.

HOW? Distributed + DB?


Database state machine?

The speaker says...


In the mid-1990s two observations made the
database and distributed system theory communities
wonder if they could develop a joint replication
approach.
First, Gray et al. (database community) showed that
the common two-phase locking has an expected
deadlock rate that grows with the third power of the
number of replicas.
Second, Schiper and Raynal noted that transactions
have common properties with group communication
principles (distributed systems) such as ordering,
agreement/'all-or-nothing' and even durability.

Three building blocks

State machine replication
  trivial to understand
Atomic Broadcast
  database meets distributed systems community
  OMG, how easy state machine replication is to implement!
Deferred Update Database Replication
  database meets distributed systems community
  how we gain high availability and high performance
  what those MySQL Replication team blogs talk about ;-)

The speaker says...


Finally, in 1999 Pedone, Guerraoui and Schiper
published the paper The Database State Machine
Approach. The paper combines two well known
building blocks for replication with a messaging
primitive common in the distributed systems world:
atomic broadcast.
MySQL Group Replication is slightly different from
this 1999 version, following more closely a later
refinement from 2005 plus a bit of additional
ease-of-use. However, by the end of this chapter you
will have learned how MySQL Cluster and MySQL
Group Replication differ beyond InnoDB support and
built-in sharding.

State machine replication

[Diagram: the input "Set A = 1" is fed to three replicas; each replica
runs the same state machine and produces the same output, A = 1.]

The speaker says...


The first building block is trivial: a state machine. A
state machine takes some input and produces some
output. Assume your state machines are
deterministic. Then, if you have a set of replicas all
running the same state machine and they all get the
same input, they all will produce the same output.
As an aside: state machine replication is also known
as active replication. Active means that every replica
executes all the operations; active adds compute
load to every replica. With passive replication, also
called primary-backup replication, one replica
(primary) executes the operations and forwards the
results to the others. Passive suffers from primary
availability and possibly network bandwidth limits.
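The active replication idea can be sketched in a few lines of Python. All names here are illustrative, not MySQL code: deterministic replicas fed the same ordered input end up in the same state.

```python
# Minimal sketch of state machine replication: deterministic replicas
# that receive the same ordered input produce the same output state.

class Replica:
    def __init__(self):
        self.state = {}

    def apply(self, op):
        # A deterministic operation: ("set", key, value)
        kind, key, value = op
        if kind == "set":
            self.state[key] = value

# Three replicas, same ordered input -> identical output state.
replicas = [Replica() for _ in range(3)]
input_log = [("set", "A", 1), ("set", "B", 2)]

for op in input_log:          # agreement: every replica gets every input
    for r in replicas:        # order: everyone applies in the same order
        r.apply(op)

assert all(r.state == {"A": 1, "B": 2} for r in replicas)
```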

Requirement: Agreement

[Diagram: the input "Set A = 1" reaches only some replicas; a replica
that misses the input outputs A = NULL while the others output A = 1.]
The speaker says...


Here's more trivia about the state machine
replication approach. There are two requirements for
it to work. Quite obviously, every replica has to
receive all input to come to the same output. And
the precondition for receiving input is that the replica
is still alive.
In academic words the requirement is: agreement.
Every non-faulty replica receives every request.
Non-faulty replicas must agree on the input.

Requirement: Order

1) Set A = 1
2) Set B = 1
3) Set B = A * 2

Replica with input order 1, 2, 3: A = 1, B = 2
Replica with input order 1, 3, 2: A = 1, B = 1
Replica with input order 3, 1, 2: A = 1, B = 1

The speaker says...


The second trivial requirement for state machine
replication is ordering. To produce the same output
any two state machines must execute the very same
input including the ordering of input operations.
The academic wording goes: if a replica processes
request r1 before r2, then no replica processes
request r2 before r1. Note that if operations
commute, some reordering may still lead to correct
output. The sequence A = 1, B = 1, B = A * 2 and
the sequence B = 1, A = 1, B = A * 2 produce the
same output.
(Unrelated here: the database scaling talk touches on
the fancy commutative replicated data types found in
Riak.)
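The ordering requirement can be demonstrated with a tiny Python sketch of the slide's A/B example (purely illustrative): the same three operations applied in different orders yield different states, because Set B = A * 2 does not commute with the others.

```python
# Why ordering matters: the same three operations applied in different
# orders yield different final states unless the operations commute.

def run(ops):
    state = {"A": 0, "B": 0}
    for op in ops:
        op(state)
    return state

set_a = lambda s: s.update(A=1)                   # 1) Set A = 1
set_b = lambda s: s.update(B=1)                   # 2) Set B = 1
set_b_from_a = lambda s: s.update(B=s["A"] * 2)   # 3) Set B = A * 2

print(run([set_a, set_b, set_b_from_a]))  # {'A': 1, 'B': 2}
print(run([set_a, set_b_from_a, set_b]))  # {'A': 1, 'B': 1}
```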

Atomic Broadcast

Distributed systems messaging abstraction
Meets all replicated state machine requirements

Agreement
  If a site delivers a message m then every site delivers m
Order
  No two sites deliver any two messages in different orders
Termination
  If a site broadcasts message m and does not fail, then every site
  eventually delivers m
  We need this in asynchronous environments

The speaker says...


State machine replication is the first building block
for understanding the database state machine
approach. The second building block is a messaging
abstraction from the distributed systems world called
atomic broadcast. Atomic broadcast provides all the
properties required for state machine replication:
agreement and ordering. It adds a property needed
for communication in an asynchronous system, such
as a system communicating via network messages:
termination.
All in all, this greatly simplifies state machine
replication and contributes to a simple, layered
design.
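As an illustration only - not how Corosync works, which uses a Totem token ring as discussed at the end of this talk - here is a minimal Python sketch of the ordering property using a fixed sequencer: one process stamps every message, and receivers deliver strictly in stamp order even when messages arrive reordered. Agreement and termination would need retransmission and failure detection, omitted here.

```python
# Sketch of total-order delivery via a fixed sequencer. Illustrative
# only: real atomic broadcast also handles message loss and failures.
import heapq

class Sequencer:
    def __init__(self):
        self.next_seq = 0

    def stamp(self, msg):
        self.next_seq += 1
        return (self.next_seq, msg)

class Receiver:
    def __init__(self):
        self.pending = []        # min-heap keyed on sequence number
        self.next_expected = 1
        self.delivered = []

    def receive(self, stamped):
        heapq.heappush(self.pending, stamped)
        # Deliver in sequence order, even if messages arrived reordered.
        while self.pending and self.pending[0][0] == self.next_expected:
            _, msg = heapq.heappop(self.pending)
            self.delivered.append(msg)
            self.next_expected += 1

seq = Sequencer()
m1, m2 = seq.stamp("set A=1"), seq.stamp("set B=1")
r = Receiver()
r.receive(m2)   # arrives out of order
r.receive(m1)
print(r.delivered)  # ['set A=1', 'set B=1']
```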

Delivery, durability, group

[Diagram: a client talks to a replica group; only group members exchange
messages, an outsider (Mr. X) cannot. A message is sent first and
possibly delivered second.]

The speaker says...


The atomic broadcast properties given are literally
copied from the original paper describing the
database state machine replication approach. There
are two things in it not explained yet. First, atomic
broadcast defines properties in terms of message
delivery. The delivery property not only ensures total
ordering despite slow transport but also covers
message loss (MySQL desires uniform agreement
here, something better than Corosync) and even the
crash and recovery of processors (durability)! A
recovering processor must first deliver outstanding
messages before it continues. Second, note that
atomic broadcast introduces the notion of a group.
Only (correct) members of a group can exchange
messages.

Deferred Update: the best?

[Generic functional model: Client Request -> Execution on one replica ->
Server Coordination -> Agreement -> Client Response.]

The speaker says...


We are almost there. The third building block of the
database state machine replication is deferred
update database replication. The slide shows a
generic functional model used by Pedone and
Schiper in 2010 to illustrate their choice of deferred
update. The argument goes that deferred update
combines the best of the two most prominent object
replication techniques: active and passive
replication. Only the combination of the best of the
two will give both high availability and high
performance.
Translation: MySQL Group Replication can - in theory
- have higher overall throughput than MySQL
Replication. Do you love the theory ;-) ? As a DBA...

Active Replication (SM)

[Functional model: the client sends the operation to all replicas ->
requests get ordered -> every replica executes -> all reply to the
client.]

The speaker says...


In an active replication system, a pure state machine
replication system, the client operations are
forwarded to all replicas and each replica individually
executes the operation. The two challenges are to
ensure all replicas execute requests in the same
order and all replicas decide the same. Recall that
we are talking about multi-threaded database servers
here.
A downside is that every replica has to execute the
operation. If the operation is expensive in terms of
CPU, this can be a waste of CPU time.

Passive Replication

[Functional model: the client sends the operation to the primary -> only
the primary executes -> the primary forwards changes to the backups ->
the primary replies to the client.]

The speaker says...


The alternative is passive replication or
primary-backup replication. Here, the client talks to
only one server, the primary. Only the primary server
executes client operations. After computation of the
result, the primary forwards the changes to the
backups which apply them.
The problem here is that the primary determines the
system's throughput. None of the backups can
contribute its computing power to the overall system
throughput.

Multi-primary (pass.) replication

What we want...
  for performance: more than one primary
  for scalability: no distributed locking
  ... and of course: transactions

Two-staged transaction protocol
  [Diagram: a client runs its transaction on one primary (transaction
  processing); then all primaries jointly decide the outcome (transaction
  termination).]

The speaker says...


Multi-primary (passive) replication has all the
ingredients desired.
Transaction processing is two-staged. First, a client
picks any replica to execute a transaction. This
replica becomes the primary of the transaction. The
transaction executes locally; this stage is called
transaction processing. In the second stage, during
transaction termination, the primaries jointly decide
whether the transaction can commit or must abort.
Because updates are not immediately applied,
database folks call this deferred update - our last
building block.

Deferred Update DB Replication

Deterministic certification
  Reads execute locally, updates get certified
  Certification ensures transaction serializability
  Replicas decide independently about certification result

[Diagram: reads are served by the local primary; on write, the readset,
writeset and updates (Rs/Ws/U) are forwarded to all primaries.]

The speaker says...


One property of transactions is isolation. Isolation is
also known as serializability: the concurrent execution
of transactions should be equivalent to a serial
execution of the same transactions. In a Deferred
Update system, read transactions are processed and
terminated on one replica and serialized locally.
Updates must be certified. After the transaction
processing, the readset, writeset and updates are
sent to all other replicas. The servers then decide in
a deterministic procedure whether (one-copy)
serializability holds, i.e. whether the transaction
commits. Because it's a deterministic procedure, the
servers can certify transactions independently!

Options for termination

Atomic Broadcast based
  this is what is used by MySQL, by DBSM
Optimization: Reordering (atop of Atomic Broadcast)
  in theory it means less transaction aborts
Optimization limit: Generic Broadcast based
  this has issues, which make it nasty
Atomic Commit based
  more transaction aborts than atomic broadcast

The speaker says...


There are several ways of implementing the
termination protocol and the certification. There are
two truly distinct choices: atomic broadcast and
atomic commit. Atomic commit causes more
transaction aborts than atomic broadcast. So, it's out
and atomic broadcast remains.
Atomic broadcast can in theory be further
optimized towards less transaction aborts using
reordering. For practical matters, this is about
where the optimizations end. A weaker (and possibly
faster) generic broadcast causes problems in the
transactional model. For databases, it could be an
over-optimization.

Generic certification test

Transactions have a state
  Executing, Committing, Committed, Aborted
Reads are handled locally
Updates are sent to all replicas
  Readset and writeset are forwarded
On each replica: search for 'conflicting' transactions
  Can be serialized with all previous transactions? Commit!
  Commit? Abort local transactions that overlap with the update

The speaker says...


No matter what termination procedure is used, the
basic procedure for certification in the deferred
update model is always the same. Updates/writes
need certification. The data read and the data
written by a transaction is forwarded to all other
replicas.
Every replica searches for potentially 'conflicting'
transactions, the details depend on the termination
procedure. A transaction is decided to commit if it
does not violate serializability with all previous
transactions. Any local transaction currently running
and conflicting with the update is aborted.
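The generic certification test can be sketched in Python. The function name and data shapes are invented for illustration, and MySQL's actual test differs (as explained later); the point is that the check is deterministic, so every replica reaches the same verdict on its own.

```python
# Sketch of deferred-update certification: the committing transaction's
# readset is checked for overlap against the writesets of concurrently
# committed transactions (a write-read conflict means abort).

def certify(committing_readset, concurrent_committed_writesets):
    for writeset in concurrent_committed_writesets:
        if committing_readset & writeset:   # write-read conflict
            return "ABORT"
    return "COMMIT"

# T read {x, y}; a concurrent committed transaction wrote {y} -> abort.
print(certify({"x", "y"}, [{"y"}]))    # ABORT
print(certify({"x"}, [{"y"}, {"z"}]))  # COMMIT
```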

Database State Machine

Deferred Update Database Replication as a state machine
Atomic Broadcast based termination

[Layered architecture: MySQL (Plugin Services, transaction hooks) ->
Plugins -> MySQL Group Replication (Capture, Apply, Recover; Replication
Protocol incl. termination protocol/certifier) -> Group Communication
System.]

The speaker says...


The Database State Machine Approach combines all
the bits and pieces. Let's do a bottom-up summary.
Atomic broadcast not only frees the database
developer from bothering about networking APIs, it
also solves the nasty bits of communicating in an
asynchronous network. It provides properties that
meet the requirements of state machine
replication. A deterministic state machine is what
one needs to implement the termination protocol
within deferred update replication. Deferred update
replication does not use distributed locking, which
Gray proved problematic, and it combines the best of
active and passive replication. Side effects: simple
replication protocol, layered code.

The termination algorithm

Updates are sent to all replicas
  Readset and writeset are forwarded
Step 1 - On each replica: certify
  Is there any committed transaction that conflicts?
  (In the original paper: check for write-read conflicts between the
  committing transaction and committed transactions. Does the committing
  transaction's readset overlap with any committed transaction's
  writeset? Works slightly differently in MySQL.)
Step 2 - On each replica: commitment
  Apply transactions decided to commit
  Handle concurrent local transactions: remote wins

The speaker says...


The termination process has two logical steps, just
like the general one presented earlier. The very
details of how exactly two transactions are checked
for conflicts in the first step don't matter here.
MySQL Group Replication is using a refinement of the
algorithm tailored to its own needs. As a developer
all you need to know is: a remote transaction always
wins no matter how expensive local transactions are.
And, keep conflicting writes on one replica. It's faster.
The puzzling bit on the slide is the rule to check
a committing transaction against any committed
transaction for conflicts. Any!? Not any...
only concurrent ones.

What's concurrent?

Any other transaction that precedes the current one

Recall: total ordering
Recall: asynchronous, delay between broadcast and delivery

[Diagram: two replicas broadcast transactions; the total order is
assigned on delivery, so a transaction broadcast first may be delivered
second.]

The speaker says...


The definition of what concurrent means is a bit
tricky. It's defined through a negation and that's
confusing at first look but becomes hopefully
clear on the next slide.
Concurrent to a transaction is any other transaction
that precedes it. If we know the order of all
transactions in the entire cluster, then we can tell
which transactions precede one another.
Atomic broadcast ensures total order on delivery.
Some implementations decide on ordering when
sending and that number (logical clock) could be
used. Any logical clock works.

Certify against all previous?

Broadcast: Transaction 4 is based on all previous transactions up to 2.

[Diagram: the broadcast of transaction 4 carries its base version 2;
replicas deliver it in total order.]

Certification when 4 is delivered:
  Check conflicts with trx > 2 and trx < 4

The speaker says...


The slide has an example of how to find any other
transaction that precedes one. When a transaction
enters the committing state and is broadcast, the
broadcast includes the logical time (= total order
number on the slide) of the latest transaction
committed on the replica.
Eventually the transaction is delivered on all sites.
Upon delivery the certification considers all
transactions that happened after the logical time of
the to-be-certified transaction. All those transactions
precede the one to be certified; they executed
concurrently at different replicas. We don't have to
look further into the past. Further in the past is stuff
that's been decided on already.
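The certification window can be sketched in a few lines of Python (illustrative names, not MySQL's implementation): given the base version carried in the broadcast and the transaction's own total-order number, only the deliveries in between need to be checked.

```python
# Sketch of "which transactions must be certified against?", following
# the slide: transaction 4 was broadcast with base version 2, so only
# transactions delivered with total-order numbers > 2 and < 4 count.

def concurrent_window(base_version, own_order, delivered_orders):
    return [t for t in delivered_orders if base_version < t < own_order]

delivered = [1, 2, 3]                      # total order of delivered trx
print(concurrent_window(2, 4, delivered))  # [3] -> certify against trx 3
```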

TIME TO BREATHE
MySQL is different anyway...

The speaker says...


Good news! The algorithm used by MySQL Group
Replication is different and simpler. For correctness,
the precedes relation is still relevant. But it comes
for free...

A developers view on commit

[Diagram: the client runs BEGIN ... COMMIT against one replica; the
replica executes the transaction, broadcasts it (t(3)), returns the
result, and all replicas certify; remote replicas then apply.]

The speaker says...


We are not done with the theory yet but let's do
some slides that take the developer's perspective.
Assuming you have to scale a PHP application,
assuming a small cluster of a handful of MySQL
servers is enough and assuming these servers are
co-located on racks, then MySQL Group Replication is
your best possible choice.
Did you get this from the theory? Replication is
'synchronous'. On commit you wait only for the
server you are connected to. Once your transaction
is broadcast, you are done. You don't wait for the
other servers to execute the transaction. With
uniform atomic broadcast, once your transaction is
broadcast, it cannot get lost. (That's why I torture
you with theory.)

MySQL Replication

[Diagram: the client runs BEGIN ... COMMIT against the master; the master
executes, writes the binary log and returns OK; slaves asynchronously
fetch and apply.]

The speaker says...


If your network is slow, or mother earth, the speed of
light and network message round trip time add too
much to your transaction execution time, then
asynchronous MySQL Replication is a better choice.
In MySQL Replication the master (primary) never
waits for the network. Not even to broadcast
updates. Slaves asynchronously pull changes.
Despite pushing work on the developer this approach
has the downside that a hardware crash on the
master can cause transaction loss. Slaves may or
may not have pulled the latest data.

MySQL Semi-sync Replication

[Diagram: the client runs BEGIN ... COMMIT against the master; the master
executes, writes the binary log and waits for the first slave ACK before
returning OK; slaves fetch and apply.]

The speaker says...


In the times of MySQL 5.0 the MySQL Community
suggested that, to avoid transaction loss, the master
should wait for one slave to acknowledge it has
fetched the update from the master. The fact that it's
fetched does not mean that it's been applied. The
update may not be visible to clients yet.
There is a back and forth about whether database
replication should be asynchronous or not. It depends
on your needs.
Back to theory after this break.

Back to theory!
Virtual Synchrony?

Virtual Synchrony
Groups and views

A turbo-charged version of Atomic Broadcast

[Diagram: processes P1-P3 exchange messages M1-M3 in group view
G1 = {P1, P2, P3}; after a view change (VC), P4 joins and M4 is delivered
in view G2 = {P1, P2, P3, P4}.]

The speaker says...


Good news! Virtual Synchrony and Atomic Broadcast
are almost the same. Our Atomic Broadcast definition
assumes a static group. Adding group members,
removing members or detecting failed ones is not
covered.
Virtual Synchrony handles all these membership
changes. Whenever an existing group agrees on
changes, a new view is installed through a view
change (VC) event.
(The term 'virtual': it's not synchronous, there is a
delay. We don't want to wait out short message
delays. Yet, the system appears to be synchronous to
most real-life observers.)

Virtual Synchrony
View changes act as a message barrier

That's a case causing troubles in Two-Phase Commit

[Diagram: P1-P4 run in view G2 = {P1, P2, P3, P4}; P4 is suspected and a
view change (VC) installs G3 = {P1, P2, P3}; messages M5-M8 are delivered
within a view, never across the barrier.]

The speaker says...


View changes are message barriers. If the group
members suspect a member to have failed they
install a new view.
Maybe the former member was not dead but just too
slow to respond, or disconnected for a brief period.
False alarm. The former member then tries to
broadcast some updates. Virtual Synchrony ensures
that the updates will not be seen by the remaining
members. Furthermore, the former member will
realize that it was excluded.
Some GCS implementing virtual synchrony even
provide abstractions that ensure a joining member
learns all updates it missed (state transfer) before it
joins the group.

Auto-everything: failover

MySQL Group Replication has a pluggable GCS API
  Split brain handling? Depends on GCS and/or GCS config
  Default GCS is Corosync

[Diagram: six MySQL nodes form a group; failed nodes are detected and
the group continues.]

The speaker says...


Good news! The Virtual Synchrony group
membership advantages are fully exposed at the
user level: node failures are detected and handled
automatically. PECL/mysqlnd_ms can help you on the
client side. It's a minor tweak to have it
automatically learn about the remaining MySQL
servers. Expect an update release soon.
MySQL Group Replication works with any Group
Communication System that can be accessed from C
and implements Virtual Synchrony. The default
choice is Corosync. Split brain handling is GCS
dependent. MySQL follows the view change
notifications of the GCS.

Auto-everything: joining

Elastic cluster grows and shrinks on demand
  State transfer done via asynchronous replication channel

[Diagram: a Joiner node picks a Donor from the group; state transfer
flows from the Donor to the Joiner.]

The speaker says...


Good news! When adding a server you don't fiddle
with the very details. You start the server, tell it to
join the cluster and wait for it to catch up. The server
picks a donor, begins fetching updates using much of
the existing MySQL Replication code infrastructure
and that's it.

Back to theory!
Generalized Snapshot Isolation

Deferred Update tweak
  Transaction read set does not need to be broadcast
  Readset is hard to extract and can be huge
  Weaker serializability level than 1SR
  Sufficient for InnoDB default isolation

[Diagram: reads are served locally; on write, only the versions, writeset
and updates (V/Ws/U) are forwarded to all primaries.]

The speaker says...


Good news! This is the last bit of theory. The original
Database State Machine proposal was followed by a
simpler to implement proposal in 2005. If the
cluster's serialization level is marginally lowered to
snapshot isolation, certification becomes easier.
Generalized snapshot isolation can be achieved
without having to broadcast the readset of
transactions. Recording the readset of a transaction
is difficult in most existing databases. Also, readsets
can be huge.
Snapshot isolation is an isolation level for
multi-version concurrency control. MVCC? InnoDB!
Somehow... Whatever, this is the base algorithm of
MySQL Group Replication's termination.

Snapshot Isolation

Concurrent and write conflict? First committer wins!
Reads use the snapshot from the beginning of the transaction

First committer wins:
  T1: BEGIN(v1), W(v1, x=1), COMMIT! -> x:v2=1
  T2: BEGIN(v1), W(v1, x=2), ..., COMMIT? -> abort

Concurrent write (both started at version 1), conflict (both change x)

The speaker says...


In Snapshot Isolation, transactions take a snapshot
when they begin. All reads return data from this
snapshot. Although any other concurrent transaction
may update the underlying data while the
transaction still runs, the change is invisible; the
transaction runs in isolation. If two concurrent
transactions change the same data item they
conflict. In case of conflicts, the first committer wins.
MVCC requires that as part of the update of a data
item its version is incremented. Future transactions
will base their snapshot on the new version.

The actual termination protocol

[Diagram: a replica broadcasts Write(v2, x=1); during certification every
replica compares the write's version against the latest version of the
object recorded in its certification index (e.g. object x -> latest
version 13) and answers OK or abort.]

The speaker says...


Every replica checks the version of a write during
certification. It compares the write's data item
version number with the latest it knows of. If the
version is higher than or equal to the one found in the
replica's certification index, the write is accepted. A
lower number indicates that someone has already
updated the data item before. Because the first
committer must win, a write showing a lower version
number than the one in the certification index must
abort.
(The certification index fills over time and is
truncated periodically by MySQL. MySQL reports the
size through Performance Schema tables.)
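The version check described above can be sketched in Python. This is illustrative only; the real certification index, its keys and its truncation are MySQL internals not modeled here.

```python
# Sketch of version-based certification: each replica keeps an index
# mapping data items to their latest known version. A write carrying a
# version >= the indexed one is accepted and advances the index; a lower
# version means someone else committed first, so the write must abort.

certification_index = {"x": 13}

def certify_write(item, write_version):
    latest = certification_index.get(item, 0)
    if write_version >= latest:
        certification_index[item] = write_version + 1
        return "OK"
    return "ABORT"

print(certify_write("x", 13))  # OK (first committer, index moves to 14)
print(certify_write("x", 13))  # ABORT (index already at 14)
```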

Hmm...
Does it work?

It's a preview, there are limits

General
  InnoDB only
  Corosync lacks uniform agreement
  No rules to prevent split-brain (it's a preview, you're allowed to fool
  yourself if you misconfigure the GCS!)
Isolation level
  Primary Key based
  Foreign Keys and Unique Keys not supported yet
  No concurrent DDL

That's it, folks!


Questions?

The speaker says...


(Oh, a question. Flips slide)

Network messages pffft!

MySQL super hero at Facebook:

@markcallaghan (Sep 30): For MySQL sync replication, when all commits
originate from 1 master is there 1 network round trip or 2?
http://mysqlhighavailability.com/mysql-group-replication-helloworld

@Ulf_Wendel: @markcallaghan AFAIK, on the logical level, there should be
one. Some of your questions might depend on the GCS used. The GCS is
pluggable.

@markcallaghan: @Ulf_Wendel @h_ingo Henrik tells me it is "certification
based" so I remain confused

GCS != MySQL Semi-sync

It's many round trips; how many depends on the GCS
  Default GCS is Corosync, Corosync is Totem Ring
  Corosync uses a privilege-based approach for total ordering
  Many options: fixed sequencer, moving sequencer, ...
Where you run your updates only impacts the collision rate

[Diagram: three MySQL nodes, each attached to a Corosync daemon; the
Corosync daemons form a ring.]

The speaker says...


No Mark, MySQL Group Replication cannot be
understood as a replacement for MySQL Semi-sync
Replication. The question about network round trips
is hard to answer. Atomic Broadcast and Virtual
Synchrony stack many subprotocols together. Let's
consider a stable group, no network failure, Totem.
Totem orders messages using a token that circulates
along a virtual ring of all members. Whoever has the
token has the privilege to broadcast. Others wait
for the token to appear. Atomic Broadcast gives us
all-or-nothing messaging. It takes at least another
full round on the ring to be sure the broadcast has
been received by all. How many round trips are that?
Welcome to distributed systems...

THE END

Contact: ulf.wendel@oracle.com

The speaker says...


Thank you for your attendance!
Upcoming shows:
Talk&Show! - YourPlace, any time