Availability and ScalabilityWe were introduced to Sharding in the previous article as a strategy for optimizing data consistency when scaling our service. Now we turn our attention to optimizing for availability. In this article, we are going to look at a strategy called Conflict-Free Replicated Data Types.
Conflict-Free Replicated Data TypesAs we have seen in previous articles, we can achieve higher service availability through the use of service replication. We can accomplish the same goal with Conflict-Free Replicated Data Types (CRDTs) to distribute data on multiple replicas for improved availability.
- any replica can be modified without coordinating with any other replica.
- when any two replicas have received the same set of updates, they reach the same state, deterministically, by adopting mathematically sound rules to guarantee state convergence.
CRDTs provide a high-availability strategy that employs asynchronous replication to achieve strong eventual consistency. CRDTs avoid the probabilistic synchronization issues used in other partial quorum technologies. CRDTs can use state replication or commutativity of operations to achieve this, which allows entity replicas to apply any incoming state or operation and still guarantee that all the entity replicas converge to a consistent state.
Replication mechanismWe have mentioned that when a change is made to any CRDT, those changes are asynchronously copied to all the other entity replicas. We can choose between two mechanisms to accomplish this.
Peer to PeerIn a peer-to-peer replication system, each peer is responsible for updating its replicas directly. This approach requires that a CRDT be aware of each of its peers and apply updates each individually.
In situations where many replicas are active, coherency is delayed as each replica must receive and process each notification. The peer-to-peer approach is most appropriate when short-term replica inconsistency is considered acceptable by the application or the replica pool is small.
Client-ServerThe second approach is the client-server model. In this strategy, one CFRDT is designated as the primary replica. It acts as the central hub (server), and is responsible for receiving the pooled replicas changes and distributing them back across the pool.
As changes occur in any replica, the primary replica is notified and enqueues the changes for delivery. The primary then notifies each replica in the sequence in which the change events were received. By sequencing the operations, we insures that the replicas dont significantly diverge between instances due to notification ordering differences. The client-server approach is most appropriate when the application needs each replica to be as similar as is practical.
Operation vs. State based replicationWhen a change to the CRDT occurs, there are two primary ways in which that data can propagate to its replicas: state-based & operation-based.
State-based replicationIn state-based replication, the entire state is propagated to each replica, and it is the responsibility of each replica to be able to merge its state with the incoming state. This type of CRDT is called a Convergent Replicated Datatypes (CVRDTs). When the CVRDT changes, a copy of its entire state is sent to every other replica in the pool where it is merged with the replica's state. In state-based replication, each CVRDT must be able to provide a merge operation that is responsible for merging the data. The merge operation must also be Associative, Commutative, and Idempotent.
Coming to terms with termsIf it has been a while since you have last saw these terms, here is a short refresher:
Associativity(a+(b+c)=(a+b)+c): Operations can be grouped in any order.
Commutative(a+b=b+a): Operations can be applied in any order.
Idempotence(a+a=a): Duplicate operations don't change the result.
Delta (Δ) Synchronization is a technique which is often employed when the data being synchronized is large. In this approach, only the changes (Δ) in the source replica are transmitted. By only transmitting the delta, we reduce the amount of time it takes to send the data, we reduce the impact on the merge operation, and we decrease the overall impact on network bandwidth.
Operation-based replicationIn operation-based replication, only the operation and any parameters invoked on the source replica is sent to each replica, which then applies that operation to itself. By only sending the operation, we no longer need to support a merge method and can reduce the message size of the payload that needs to be transmitted to each replica reducing transmission overhead. This type of CRDT is called a Commutative Replicated Datatype (CMRDTs).
Common CRDTsAs we mentioned earlier, not every data type can be a CRDT. The following, is a non-comprehensive list of several common CRDT types:
RegisterA register is CRDT that stores a single value. It has two operations, assign and value, which store and retrieve the value respectively. Since the assign operation is not commutative, its alternative LWWW-Register is often used instead.
Last-Write-Wins-Register (LWWW-Register)-With the LWWW-Register, we associate a timestamp with each new value. During update, if the value's timestamp is newer than the replica's value timestamp, the value is updated. Otherwise the update is ignored.
CountersIncrement only counters - As the name implies, this counter's value can only increase. Any increment operation on any replicated instance will increase the current value on the counter.
PN Counter- A PN counter allows both increment and decrement operations. This is often acheived by combining two increment-only counters. The P counter aggregates the increment operations across the replicas. The N counter aggregates all the decrement operations. When the counter value is read, the replica returns the difference between the P counter and the N counter value.
SetsGrow-Only Set (G-Set)- This CRDT is a set data type that only allows elements to be added.
Two-Phase Set (2P-Set)- this CRDT is actually two Grow-Only sets. The second set contains the elements that have been removed from the first set. These elements are referred to as Tombstones and the set is referred to as the Tombstone set. Once an element has been removed and added to the tombstone set, it can't be re-added. An element is only considered a member of the set if it only exists in the first set and not in the tombstone set.
Last-Write-Wins-Element-Set (LWWW-Element-Set)- Like a Two-Phase Set, the Last-Write-Wins-Element set has an Add set and a Remove set. However, all element for both sets also contains a timestamp. An element is only considered a member of the set if it exits in the first set and not the tombstone set, or in the first set with a newer timestamp than in the tomstone set.
Observed-Removed Set (OR-Set)-Observed-Removed sets are LWW-Element sets that have replaced timestamps with unique tags. As elements are added or removed they are accompanied by a unique add tag or remove tag. An element is only a member of the set if its add-tags do not exist in its remove tag list.
Its important to remember that 2P-Sets, LWW-Element-Sets, and OR-sets all contain tombstone sets. As elements are removed the set may appear empty when in fact its tombstone set may have many elements. It is critical to understand and mitigate the potential memory resource ramifications of these CRDTs. As elements are not actually "removed", the tombstone set will continue to grow and consume memory. It is critically important to remember this and take appropriate measures.