White Paper Summaries | AWS Dynamo DB


Hello everyone, and welcome to the start of a new series. In "White Paper Summaries," we will dive into detailed white papers on various topics and summarize them in a simple way. Our first topic is the Amazon Dynamo white paper, which covers everything you need to know about Dynamo. I'll link the white paper in the references section below. Without further ado, let's dive in!

Introduction

This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience.

The paper starts by describing how large Amazon's scale is and that reliability is one of the most important requirements, because even the slightest outage has significant financial consequences and impacts customer trust. It then mentions that Amazon operates a highly decentralized, loosely coupled, service-oriented architecture consisting of hundreds of services, and that in an infrastructure of this scale, networks and hardware can fail and all kinds of unwanted problems can occur.

However, there was a requirement that these services must always be available. The paper gives the example of a shopping cart: even under the worst possible circumstances, a customer should still be able to add and remove items from their cart. The customer doesn't care about network or hardware failures; they just want to add items to the cart.

Therefore, the service responsible for managing shopping carts requires that it can always write to and read from its data store, and that its data needs to be available across multiple data centers.

It then proceeds to talk about Dynamo and its usage within Amazon. Dynamo is used to manage the state of services that have very high reliability requirements and need tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance.

There are many services on Amazon’s platform that only need primary-key access to a data store. For many services, such as those that provide best seller lists, shopping carts, customer preferences, session management, sales rank, and product catalog, the common pattern of using a relational database would lead to inefficiencies and limit scale and availability. Dynamo provides a simple primary-key only interface to meet the requirements of these applications.

System Assumptions & Requirements

Dynamo has a simple key/value interface, is highly available with a clearly defined consistency window, is efficient in its resource usage, and has a simple scale out scheme to address growth in data set size or request rates. Each service that uses Dynamo runs its own Dynamo instances.

Query Model: simple read and write operations to a data item that is uniquely identified by a key. State is stored as binary objects (i.e., blobs) identified by unique keys. Dynamo targets applications that need to store objects that are relatively small (usually less than 1 MB).

Dynamo targets applications that operate with weaker consistency; it does not provide any isolation guarantees and permits only single-key updates.


The paper afterwards talks about SLAs, where the goal is to build a system in which all customers have a good experience, rather than just the majority. Amazon expresses and measures its SLAs at the 99.9th percentile of the distribution, not at the mean or median. It then talks about the impact of storage systems on these SLAs: one of the main design considerations for Dynamo is to give services control over their system properties, such as durability and consistency, and to let services make their own tradeoffs between functionality, performance and cost-effectiveness to meet their requirements.

Design Considerations

In this section the paper starts by mentioning how commercial systems traditionally used synchronous replica coordination to achieve strong data consistency, at the cost of availability; according to the CAP theorem, in the presence of a network partition we cannot achieve both availability and strong consistency. As such, systems and applications need to be aware of which properties can be achieved under which conditions.

Dynamo is designed to be an eventually consistent data store; that is, all updates reach all replicas eventually.

Increasing availability can be achieved by sacrificing consistency: adding replicas and performing the replication asynchronously in the background. The challenge is that this can lead to conflicting changes (especially in a leaderless cluster of nodes where you can write to, and read from, any node). This introduces two questions about conflict resolution: when to resolve conflicts and who resolves them.

An important design consideration is to decide WHEN to perform the process of resolving update conflicts, i.e., whether conflicts should be resolved during reads or writes. Many traditional data stores execute conflict resolution during writes and keep the read complexity simple. In such systems, writes may be rejected if the data store cannot reach all (or a majority of) the replicas at a given time (Quorum). On the other hand, Dynamo targets the design space of an “always writeable” data store (i.e., a data store that is highly available for writes). For a number of Amazon services, rejecting customer updates could result in a poor customer experience. For instance, the shopping cart service must allow customers to add and remove items from their shopping cart even amidst network and server failures. This requirement forces us to push the complexity of conflict resolution to the reads in order to ensure that writes are never rejected.

The next design choice is WHO performs the process of conflict resolution. This can be done by the data store or the application. If conflict resolution is done by the data store, its choices are rather limited. In such cases, the data store can only use simple policies, such as “last write wins”, to resolve conflicting updates. On the other hand, since the application is aware of the data schema it can decide on the conflict resolution method that is best suited for its client’s experience. For instance, the application that maintains customer shopping carts can choose to “merge” the conflicting versions and return a single unified shopping cart. Despite this flexibility, some application developers may not want to write their own conflict resolution mechanisms and choose to push it down to the data store, which in turn chooses a simple policy such as “last write wins”.

Other design considerations include:

  1. Incremental Scalability; the system should be able to scale out one node at a time with minimal impact on the system and its operators

  2. Symmetry & Decentralization; every node should have the same set of responsibilities as its peers, which results in a completely decentralized system

  3. Heterogeneity; the work distribution must be proportional to the capabilities of the individual servers. This is essential in adding new nodes with higher capacity without having to upgrade all hosts at once.

Dynamo is a zero-hop DHT, meaning each node has enough routing information to send a request directly to the right node. This is to prevent multiple network hops from node to node which can degrade performance.


System Architecture

The paper now moves to the fun part. It shows a table of all of the techniques used in Dynamo and the advantages each one provides, summarized below:

| Problem | Technique | Advantage |
| --- | --- | --- |
| Partitioning | Consistent hashing | Incremental scalability |
| High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates |
| Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available |
| Handling permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background |
| Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information |

It then proceeds to dive into each and every one of these techniques. But before that, it explains the API: Dynamo stores objects associated with a key through a simple interface; it exposes two operations, get() and put(). The get(key) operation locates the object replicas associated with the key in the storage system and returns a single object or a list of objects with conflicting versions along with a context. The put(key, context, object) operation determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk. The context encodes system metadata about the object that is opaque to the caller and includes information such as the version of the object. The context information is stored along with the object so that the system can verify the validity of the context object supplied in the put request. Dynamo treats both the key and the object supplied by the caller as an opaque array of bytes. It applies an MD5 hash to the key to generate a 128-bit identifier, which is used to determine the storage nodes that are responsible for serving the key.
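To make this concrete, here is a minimal sketch of the key hashing step and the shape of the interface. The function name and the example key are my own illustrative choices, not Dynamo's actual implementation.

```python
import hashlib

def key_to_id(key: bytes) -> int:
    """Dynamo applies an MD5 hash to the caller-supplied key to produce a 128-bit
    identifier, which determines the storage nodes responsible for serving the key."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")

# The client-facing interface is just two operations (signatures are illustrative):
#   get(key)                  -> a single object, or a list of conflicting versions, plus a context
#   put(key, context, object) -> stores the object; context is the opaque metadata
#                                (e.g. version info) returned by an earlier get()
print(hex(key_to_id(b"cart:12345")))  # 128-bit ring position used for placement
```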

Partitioning

The requirement of scaling incrementally pushed Dynamo's engineers toward a solution that dynamically partitions the data over a set of nodes. Partitioning relies heavily on consistent hashing to distribute the load across multiple hosts. In summary, the output range of a hash function is treated as a fixed circular space or “ring” (i.e. the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its “position” on the ring. Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item’s position.

Thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring. The principal advantage of consistent hashing is that departure or arrival of a node only affects its immediate neighbors and the rest of the nodes are unaffected.

But there are problems with this basic approach:

  1. Non-uniform load distribution due to the random assignment of node positions on the ring;

    one node might end up responsible for a much larger arc of the ring than its neighbors, and more data is assigned to it than it can handle

  2. The basic algorithm is oblivious to heterogeneity in node performance; nodes with different capacities end up with similar shares of the load.

Dynamo uses a variant of consistent hashing to fix the above problems; instead of mapping a node to a single point on the ring, each node gets assigned to multiple points. Effectively, when a new node is added to the system, it is assigned multiple positions (called "tokens") on the ring. This has the following advantages:

  1. If a node becomes unavailable (due to failures or routine maintenance), the load handled by this node is evenly dispersed across the remaining available nodes instead of all the keys moving to the neighboring node.

  2. When a node becomes available again, or a new node is added to the system, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes.

  3. The number of tokens that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.

So we can assign more tokens to nodes with higher capacity and fewer tokens to nodes with less capacity.
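Below is a minimal sketch of consistent hashing with virtual nodes (tokens), assuming MD5 as the ring hash like the paper; the class and method names are my own, not Dynamo's.

```python
import bisect
import hashlib

def ring_position(value: str) -> int:
    # Hash a string onto the ring, a fixed circular space of 128-bit values.
    return int.from_bytes(hashlib.md5(value.encode()).digest(), "big")

class ConsistentHashRing:
    """Consistent hashing where each physical node owns several tokens (virtual nodes)."""

    def __init__(self):
        self._tokens = []          # sorted ring positions
        self._owner = {}           # ring position -> physical node name

    def add_node(self, node: str, num_tokens: int) -> None:
        # Higher-capacity nodes can be given more tokens, so they own more of the ring.
        for i in range(num_tokens):
            pos = ring_position(f"{node}#token{i}")
            bisect.insort(self._tokens, pos)
            self._owner[pos] = node

    def node_for_key(self, key: str) -> str:
        # Walk the ring clockwise: the first token after the key's position owns it,
        # wrapping around to the smallest token if we fall off the end.
        pos = ring_position(key)
        idx = bisect.bisect_right(self._tokens, pos) % len(self._tokens)
        return self._owner[self._tokens[idx]]

# Example: a bigger node gets twice as many tokens, hence roughly twice the keys.
ring = ConsistentHashRing()
ring.add_node("node-A", num_tokens=16)
ring.add_node("node-B", num_tokens=8)
print(ring.node_for_key("cart:12345"))
```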

Replication

Dynamo replicates its data on multiple hosts to achieve availability and durability. This is how:

  1. Data is replicated on N hosts which is configurable when setting up the instance

  2. Each key is assigned to a coordinator node responsible for the key's replication

  3. In addition to locally storing each key within its range, the coordinator replicates these keys at the N-1 clockwise successor nodes in the ring. This results in a system where each node is responsible for the region of the ring between it and its Nth predecessor.

In the paper's figure, key K is replicated across nodes B, C and D. So if B is the coordinator (it got the write request), it will replicate the key to the N-1 clockwise successor nodes (C and D).

You might be wondering: what if the request lands on node C instead? The replicas for key K still live on B, C and D, because the set of nodes responsible for a key is determined by the key's position on the ring, not by which node receives the request. We'll talk soon about how Dynamo ensures that a request for key K usually gets routed to node B the first time, without having to do any hops between nodes.

The list of nodes that is responsible for storing a particular key is called the preference list. The system is designed so that every node in the system can determine which nodes should be in this list for any particular key (Decentralization).

Each preference list contains more than N distinct physical nodes, to account for node failures that may occur.
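As a hedged illustration of how a preference list can be derived (the helper names and token layout below are hypothetical, not Dynamo's code), walk the ring clockwise from the key's position and collect the first N distinct physical nodes, skipping extra tokens that belong to nodes already picked:

```python
import bisect
import hashlib

def ring_position(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest(), "big")

def preference_list(key: str, tokens: list[tuple[int, str]], n: int) -> list[str]:
    """Return the first n distinct physical nodes found walking clockwise from the key;
    these are the nodes that hold the key's replicas."""
    tokens = sorted(tokens)                          # (ring position, physical node)
    positions = [pos for pos, _ in tokens]
    start = bisect.bisect_right(positions, ring_position(key))
    nodes: list[str] = []
    for i in range(len(tokens)):
        _, node = tokens[(start + i) % len(tokens)]
        if node not in nodes:                        # skip duplicate tokens of the same node
            nodes.append(node)
        if len(nodes) == n:
            break
    return nodes

# Hypothetical cluster of five nodes with four tokens each; N=3 replicas per key.
tokens = [(ring_position(f"{node}#t{i}"), node) for node in "ABCDE" for i in range(4)]
print(preference_list("cart:12345", tokens, n=3))
```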

Data Versioning

As we mentioned, Dynamo provides eventual consistency. A put() call may return to its caller before the update has been applied at all the replicas, which can result in scenarios where a subsequent get() operation may return an object that does not have the latest updates. Moreover, under certain failure scenarios (e.g., server outages or network partitions), updates may not arrive at all replicas for an extended period of time.

The paper talks about the shopping cart service at Amazon, where an "Add to Cart" operation should never be forgotten or rejected. If the most recent state of the cart is unavailable, and a user makes changes to an older version of the cart, that change is still meaningful and should be preserved. But at the same time it shouldn’t supersede the currently unavailable state of the cart, which itself may contain changes that should be preserved. When the latest version is not available, the item is added to (or removed from) the older version and the divergent versions are reconciled later.

Dynamo treats the result of each modification as a new and immutable version of the data. It allows for multiple versions of an object to be present in the system at the same time.

Most of the time, new versions subsume the previous version(s) (contain all the previous version data + new data) and the system itself can determine the authoritative version (syntactic reconciliation). Syntactic reconciliation is the process of resolving differences between multiple versions of data based on their structure and content, rather than understanding the actual meaning or context.

The presence of failures combined with concurrent updates results in conflicting versions of an object. In these cases, the system cannot reconcile the multiple versions of the same object and the client must perform the reconciliation (semantic reconciliation).

A typical example of a collapse operation is “merging” different versions of a customer’s shopping cart. Using this reconciliation mechanism, an “add to cart” operation is never lost. However, deleted items can resurface.
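As a hedged illustration of this kind of application-level merge (not Amazon's actual cart logic), taking the union of the items seen across diverging versions preserves every "add to cart", and it is also exactly why an item deleted in only one branch can resurface:

```python
def merge_cart_versions(versions: list[set[str]]) -> set[str]:
    """Merge diverging cart versions by taking the union of their items.
    Every add survives the merge; a concurrent delete can be undone by it."""
    merged: set[str] = set()
    for items in versions:
        merged |= items
    return merged

# Branch 1 deleted "book" while branch 2 still had it: the merge brings "book" back.
print(merge_cart_versions([{"laptop"}, {"laptop", "book", "pen"}]))
```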

Dynamo uses vector (logical) clocks in order to capture causality (what happened before what) between different versions of the same object.

A vector clock is effectively a list of (node, counter) pairs. One vector clock is associated with every version of every object. One can determine whether two versions of an object are on parallel branches or have a causal ordering by examining their vector clocks: if every counter in one clock is less than or equal to the corresponding counter in the other, the first version happened before the second; otherwise the versions have diverged. For example, [2,1] and [2,2] are causally ordered, but [2,1] and [1,2] are diverged.
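Here is a minimal sketch of that comparison, representing a vector clock as a dict of node -> counter (the helper names are mine, not Dynamo's):

```python
def descends(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if clock a is causally at or ahead of clock b,
    i.e. every counter in b is <= the matching counter in a."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict[str, int], b: dict[str, int]) -> str:
    if descends(a, b) and descends(b, a):
        return "identical"
    if descends(b, a):
        return "a happened before b"
    if descends(a, b):
        return "b happened before a"
    return "diverged (concurrent): needs reconciliation"

print(compare({"X": 2, "Y": 1}, {"X": 2, "Y": 2}))  # causally ordered
print(compare({"X": 2, "Y": 1}, {"X": 1, "Y": 2}))  # diverged
```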

In Dynamo, when a client wishes to update an object, it must specify which version it is updating. This is done by passing the context it obtained from an earlier read operation, which contains the vector clock information. Upon processing a read request, if Dynamo has access to multiple branches that cannot be syntactically reconciled, it will return all the objects at the leaves, with the corresponding version information in the context. An update using this context is considered to have reconciled the divergent versions and the branches are collapsed into a single new version. (Remember when we said Dynamo resolves conflicts during reads, not writes.)

Execution of get() and put() Operations

Any storage node in Dynamo is eligible to receive client get and put operations for any key. There are two strategies a client can use to select a node for its read or write requests:

  1. Request goes through a generic load balancer that will select a node based on load information

  2. Use a partition-aware client library that routes requests directly to the appropriate coordinator nodes.

The advantage of the first approach is that the client does not have to link any code specific to Dynamo in its application, whereas the second strategy can achieve lower latency because it skips a potential forwarding step (from the load balancer chosen node to the actual coordinator responsible for the key).

A node handling a read or write operation is known as the coordinator. If the requests are received through a load balancer, requests to access a key may be routed to any random node in the ring. In this scenario, the node that receives the request will not coordinate it if the node is not in the top N of the requested key’s preference list. Instead, that node will forward the request to the first among the top N nodes in the preference list.

Read and write operations involve the first N healthy nodes in the preference list, skipping over those that are down or inaccessible. When all nodes are healthy, the top N nodes in a key’s preference list are accessed. When there are node failures or network partitions, nodes that are lower ranked in the preference list are accessed. (remember when we said the preference lists have more than N nodes for failure handling)

To maintain consistency among its replicas, Dynamo uses a consistency protocol similar to those used in quorum systems. This protocol has two key configurable values: R and W. R is the minimum number of nodes that must participate in a successful read operation. W is the minimum number of nodes that must participate in a successful write operation. Setting R and W such that R + W > N yields a quorum-like system. In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas.

Upon receiving a put() request for a key, the coordinator generates the vector clock for the new version and writes the new version locally. The coordinator then sends the new version (along with metadata) to the N highest-ranked reachable nodes. If at least W-1 of those nodes respond (the coordinator's own local write counts as the first of the W), the write is considered successful.

Similarly, for a get() request, the coordinator requests all existing versions of data for that key from the N highest-ranked reachable nodes in the preference list for that key, and then waits for R responses before returning the result to the client. If the coordinator ends up gathering multiple versions of the data, it returns all the versions it deems to be causally unrelated. The divergent versions are then reconciled and the reconciled version superseding the current versions is written back.
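Here is a rough sketch of the coordinator's quorum logic under the assumptions N=3, R=2, W=2 (so R + W > N). The Replica class and its store/fetch methods are toy stand-ins for real storage nodes reached over the network, not Dynamo's actual code:

```python
N, R, W = 3, 2, 2   # replicas, read quorum, write quorum; R + W > N gives quorum overlap

class Replica:
    """Toy in-memory stand-in for a storage node."""
    def __init__(self):
        self._data = {}
    def store(self, key, version) -> bool:
        self._data[key] = version
        return True
    def fetch(self, key):
        return self._data.get(key)

def coordinator_put(peers: list[Replica], key: str, version) -> bool:
    """Write locally first (counted as the first of W), then send the new version to the
    other highest-ranked reachable replicas and succeed once W-1 of them acknowledge."""
    acks = 1                                  # the coordinator's own local write
    for peer in peers:
        if peer.store(key, version):
            acks += 1
        if acks >= W:
            return True
    return acks >= W

def coordinator_get(peers: list[Replica], key: str) -> list:
    """Ask the replicas for their versions of the key and wait for R responses;
    causally unrelated versions would all be returned for reconciliation."""
    responses = []
    for peer in peers:
        version = peer.fetch(key)
        if version is not None:
            responses.append(version)
        if len(responses) >= R:
            break
    return responses

peers = [Replica() for _ in range(N - 1)]                      # the coordinator's N-1 peers
print(coordinator_put(peers, "cart:12345", ("v1", {"A": 1})))  # True once W writes succeed
print(coordinator_get(peers, "cart:12345"))
```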

Handling Failures: Hinted Handoff

If Dynamo used a traditional quorum approach it would be unavailable during server failures and network partitions, and would have reduced durability even under the simplest of failure conditions.

To remedy this, Dynamo does not enforce strict quorum membership; instead, it uses a “sloppy quorum”: all read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring.

Consider the example Dynamo configuration from the paper's Figure 2 with N=3. In this example, if node A is temporarily down or unreachable during a write operation then a replica that would normally have lived on A will now be sent to node D. This is done to maintain the desired availability and durability guarantees. The replica sent to D will have a hint in its metadata that suggests which node was the intended recipient of the replica (in this case A). Nodes that receive hinted replicas keep them in a separate local database that is scanned periodically. Upon detecting that A has recovered, D will attempt to deliver the replica to A. Once the transfer succeeds, D may delete the object from its local store without decreasing the total number of replicas in the system.
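Below is a hedged toy model of the hinted-handoff idea (all class and method names are illustrative): a replica destined for a down node is written to another node with a hint, kept in a separate local store, and handed back once the intended node recovers.

```python
class HintedNode:
    """Toy model of a node's local storage plus its hinted-replica store."""

    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.data = {}            # key -> value
        self.hinted = []          # (intended_node_name, key, value)

    def write(self, key, value, intended_for=None):
        if intended_for is None:
            self.data[key] = value
        else:
            # Hinted replicas live in a separate local database, scanned periodically.
            self.hinted.append((intended_for, key, value))

    def deliver_hints(self, nodes: dict) -> None:
        """Periodic scan: try to hand hinted replicas back to their intended owners."""
        remaining = []
        for intended, key, value in self.hinted:
            if nodes[intended].up:
                nodes[intended].write(key, value)   # transfer succeeded; hint can be dropped
            else:
                remaining.append((intended, key, value))
        self.hinted = remaining

# A is down during a write, so D stores the replica with a hint pointing at A.
nodes = {name: HintedNode(name) for name in "ABCD"}
nodes["A"].up = False
nodes["D"].write("cart:12345", b"...", intended_for="A")
nodes["A"].up = True                    # A recovers
nodes["D"].deliver_hints(nodes)         # D hands the replica back to A
print("cart:12345" in nodes["A"].data)  # True
```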

Handling Permanent Failures: Replica Synchronization

Hinted handoff works best if the system membership churn is low and node failures are transient.

There are scenarios under which hinted replicas become unavailable before they can be returned to the original replica node. To handle this and other threats to durability, Dynamo implements an anti-entropy (replica synchronization) protocol to keep the replicas synchronized.

To detect inconsistencies between replicas faster and to minimize the amount of transferred data, Dynamo uses Merkle trees. A Merkle tree is a hash tree whose leaves are hashes of the values of individual keys. Parent nodes higher in the tree are hashes of their respective children. The principal advantage of a Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire tree or the entire data set. Moreover, Merkle trees help in reducing the amount of data that needs to be transferred while checking for inconsistencies among replicas. For instance, if the hash values of the roots of two trees are equal, then the values of the leaf nodes in the trees are equal and the nodes require no synchronization. If not, it implies that the values of some replicas are different. In such cases, the nodes may exchange the hash values of children and the process continues until it reaches the leaves of the trees, at which point the hosts can identify the keys that are “out of sync”. Merkle trees minimize the amount of data that needs to be transferred for synchronization and reduce the number of disk reads performed during the synchronization process.

Each node maintains a separate Merkle tree for each key range it hosts. This allows nodes to compare whether the keys within a key range are up-to-date. In this scheme, two nodes exchange the root of the Merkle tree corresponding to the key ranges that they host in common. Subsequently, using the tree traversal scheme described above the nodes determine if they have any differences and perform the appropriate synchronization action. The disadvantage with this scheme is that many key ranges change when a node joins or leaves the system thereby requiring the tree(s) to be recalculated.
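A simplified sketch of this comparison, assuming a binary hash tree built over a power-of-two number of leaves (real Dynamo maintains one tree per hosted key range; the helpers here are my own):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(leaves: list[bytes]) -> list[list[bytes]]:
    """Return tree levels from leaf hashes up to the root.
    Assumes a power-of-two number of leaves for simplicity."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diff_leaves(tree_a, tree_b) -> list[int]:
    """Compare two trees top-down, descending only into subtrees whose hashes differ,
    and return the indices of the leaves (keys) that are out of sync."""
    if tree_a[-1][0] == tree_b[-1][0]:
        return []                                     # roots match: replicas are in sync
    out_of_sync, frontier = [], [(len(tree_a) - 1, 0)]
    while frontier:
        level, idx = frontier.pop()
        if tree_a[level][idx] == tree_b[level][idx]:
            continue                                  # identical subtree: skip it entirely
        if level == 0:
            out_of_sync.append(idx)                   # reached a differing leaf
        else:
            frontier += [(level - 1, 2 * idx), (level - 1, 2 * idx + 1)]
    return out_of_sync

# Four keys in a range; replica B holds a stale value for the key at index 2.
a = build_merkle([b"k0=v0", b"k1=v1", b"k2=v2", b"k3=v3"])
b = build_merkle([b"k0=v0", b"k1=v1", b"k2=OLD", b"k3=v3"])
print(diff_leaves(a, b))   # [2]
```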

Membership and Failure Detection

A gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership. Each node contacts a peer chosen at random every second and the two nodes efficiently reconcile their persisted membership change histories.

When a node starts for the first time, it chooses its set of tokens and maps nodes to their respective token sets. The mapping is persisted on disk and initially contains only the local node and token set.

The mappings stored at different Dynamo nodes are reconciled during the same communication exchange that reconciles the membership change histories. Therefore, partitioning and placement information also propagates via the gossip-based protocol and each storage node is aware of the token ranges handled by its peers.

This allows each node to forward a key’s read/write operations to the right set of nodes directly.
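As a toy sketch of gossip-based reconciliation, assume each node's membership view is a map of node -> (token list, version) and that the entry with the higher version wins; all names and the reconciliation rule are simplifying assumptions of mine, not Dynamo's exact protocol:

```python
import random

def reconcile(view_a: dict, view_b: dict) -> dict:
    """Merge two membership views, keeping the newer entry for each node."""
    merged = dict(view_a)
    for node, (tokens, version) in view_b.items():
        if node not in merged or merged[node][1] < version:
            merged[node] = (tokens, version)
    return merged

def gossip_round(views: dict) -> None:
    """Each node contacts one random peer per round; both end up with the merged view."""
    for name in views:
        peer = random.choice([other for other in views if other != name])
        merged = reconcile(views[name], views[peer])
        views[name] = dict(merged)
        views[peer] = dict(merged)

# Node C has just joined and only A knows about it; a few rounds spread the news.
views = {
    "A": {"A": ([10, 200], 1), "B": ([50, 300], 1), "C": ([120], 1)},
    "B": {"A": ([10, 200], 1), "B": ([50, 300], 1)},
    "D": {"A": ([10, 200], 1), "B": ([50, 300], 1)},
}
for _ in range(3):
    gossip_round(views)
print(all("C" in view for view in views.values()))  # True once the view has propagated
```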

For decentralized failure detection, Dynamo uses a simple gossip-style protocol that enables each node in the system to learn about the arrival (or departure) of other nodes.

When a new node (say X) is added into the system, it gets assigned a number of tokens that are randomly scattered on the ring. For every key range that is assigned to node X, there may be a number of nodes (less than or equal to N) that are currently in charge of handling keys that fall within its token range. Due to the allocation of key ranges to X, some existing nodes no longer need to store some of their keys, and these nodes transfer those keys to X, keeping the cluster evenly distributed.


Summary

The paper wraps up the architecture of Dynamo after going through all of the problems and the techniques that were implemented to provide a highly durable and available key-value store. It then moves on to a more practical part: the implementation, the lessons learned, the problems the team faced, and how they leveraged monitoring to make critical decisions. I won't go through that here because I think their experience should be taken as-is, but it's definitely a good read!

Paper Reference

https://www.cs.cornell.edu/courses/cs5414/2017fa/papers/dynamo.pdf
