【Distributed System】Replication

Posted by 西维蜀黍 on 2022-01-28, Last Modified on 2023-09-29

Why

Replication means keeping a copy of the same data on multiple machines that are connected via a network. As discussed in the introduction to Part II, there are several reasons why you might want to replicate data:

  1. To keep data geographically close to your users (and thus reduce latency)
  2. To allow the system to continue working even if some of its parts have failed (and thus increase availability)
  3. To scale out the number of machines that can serve read queries (and thus increase read throughput)

Replication

Replication in computing can refer to:

  • Data replication, where the same data is stored on multiple storage devices
  • Computation replication, where the same computing task is executed many times. Computational tasks may be:
    • Replicated in space, where tasks are executed on separate devices
    • Replicated in time, where tasks are executed repeatedly on a single device

Replication in space or in time is often linked to scheduling algorithms.

Data replication

Replication means keeping a copy of the same data on multiple machines that are connected via a network.

There are several reasons why you might want to replicate data:

  • To keep data geographically close to your users (and thus reduce latency)
  • To allow the system to continue working even if some of its parts have failed (and thus increase availability)
  • To scale out the number of machines that can serve read queries (and thus increase read throughput)

How to Replicate

If the data that you’re replicating does not change over time, then replication is easy: you just need to copy the data to every node once, and you’re done.

All of the difficulty in replication lies in handling changes to replicated data, and that’s what this chapter is about.

We will discuss three popular algorithms for replicating changes between nodes:

  1. single-leader
  2. multi-leader
  3. leaderless replication

Almost all distributed databases use one of these three approaches. They all have various pros and cons, which we will examine in detail.

There are many trade-offs to consider with replication: for example, whether to use synchronous or asynchronous replication, and how to handle failed replicas. Those are often configuration options in databases, and although the details vary by data‐base, the general principles are similar across many different implementations.

Leaders and Followers

Each node that stores a copy of the database is called a replica.

Every write to the database needs to be processed by every replica; otherwise, the replicas would no longer contain the same data. The most common solution for this is called leader-based replication (also known as active/passive or master–slave replication)

Leader-based Replication (Active/passive or Master–slave Replication)

Refer to https://swsmile.info/post/Architecture-leader-based-Replication/

Multi-leader Replication (Master–master or Active/active Replication)

Refer to https://swsmile.info/post/Architecture-multi-leader-Replication/

Leaderless Replication

Refer to https://swsmile.info/post/Architecture-leaderless-Replication/

Implementation of Replication Logs

Position/statement-based Replication

In the simplest case, the leader logs every write request (statement) that it executes and sends that statement log to its followers. For a relational database, this means that every INSERT, UPDATE, or DELETE statement is forwarded to followers, and each follower parses and executes that SQL statement as if it had been received from a client.

Analysis

Although this may sound reasonable, there are various ways in which this approach to replication can break down:

  • Any statement that calls a nondeterministic function, such as NOW() to get the current date and time or RAND() to get a random number, is likely to generate a different value on each replica.
  • If statements use an auto-incrementing column, or if they depend on the existing data in the database (e.g., UPDATE … WHERE <some condition>), they must be executed in exactly the same order on each replica, or else they may have a different effect. This can be limiting when there are multiple concurrently executing transactions.
  • Statements that have side effects (e.g., triggers, stored procedures, user-defined functions) may result in different side effects occurring on each replica, unless the side effects are absolutely deterministic.

It is possible to work around those issues—for example, the leader can replace any nondeterministic function calls with a fixed return value when the statement is logged so that the followers all get the same value. However, because there are so many edge cases, other replication methods are now generally preferred.

Logical (row-based) log replication

An alternative is to use different log formats for replication and for the storage engine, which allows the replication log to be decoupled from the storage engine internals. This kind of replication log is called a logical log, to distinguish it from the storage engine’s (physical) data representation.

A logical log for a relational database is usually a sequence of records describing writes to database tables at the granularity of a row:

  • For an inserted row, the log contains the new values of all columns.
  • For a deleted row, the log contains enough information to uniquely identify the row that was deleted. Typically this would be the primary key, but if there is no primary key on the table, the old values of all columns need to be logged.
  • For an updated row, the log contains enough information to uniquely identify the updated row, and the new values of all columns (or at least the new values of all columns that changed).

A transaction that modifies several rows generates several such log records, followed by a record indicating that the transaction was committed. MySQL’s binlog (when configured to use row-based replication) uses this approach.

Since a logical log is decoupled from the storage engine internals, it can more easily be kept backward compatible, allowing the leader and the follower to run different versions of the database software, or even different storage engines.

Transaction-based Replication

//TODO

Computer scientists further describe replication as being either:

  • Active replication, which is performed by processing the same request at every replica
  • Passive replication, which involves processing every request on a single replica and transferring the result to the other replicas

When one leader replica is designated via leader election to process all the requests, the system is using a primary-backup or primary-replica scheme, which is predominant in high-availability clusters. In comparison, if any replica can process a request and distribute a new state, the system is using a multi-primary or multi-master scheme. In the latter case, some form of distributed concurrency control must be used, such as a distributed lock manager.

Replication models in distributed systems

Three widely cited models exist for data replication, each having its own properties and performance:

  • Transactional replication: used for replicating transactional data, such as a database. The one-copy serializability model is employed, which defines valid outcomes of a transaction on replicated data in accordance with the overall ACID (atomicity, consistency, isolation, durability) properties that transactional systems seek to guarantee.
  • State machine replication: assumes that the replicated process is a deterministic finite automaton and that atomic broadcast of every event is possible. It is based on distributed consensus and has a great deal in common with the transactional replication model. This is sometimes mistakenly used as a synonym of active replication. State machine replication is usually implemented by a replicated log consisting of multiple subsequent rounds of the Paxos algorithm. This was popularized by Google’s Chubby system, and is the core behind the open-source Keyspace data store.
  • Virtual synchrony: involves a group of processes which cooperate to replicate in-memory data or to coordinate actions. The model defines a distributed entity called a process group. A process can join a group and is provided with a checkpoint containing the current state of the data replicated by group members. Processes can then send multicasts to the group and will see incoming multicasts in the identical order. Membership changes are handled as a special multicast that delivers a new “membership view” to the processes in the group.

Reference