- Cassandra High Availability
- Robbie Strickland
- 768字
- 2021-08-06 19:50:31
Hash table fundamentals
Most developers have experience with hash tables in some form, as nearly all programming languages include hash table implementations. Hash tables store data by applying a hash
function to the object, which determines its placement in an underlying array.
While a detailed description of hashing algorithms is out of scope of this book, it is sufficient for you to understand that a hash
function simply maps any input data object (which might be any size) to some expected output. While the input might be large, the output of the hash
function will be a fixed number of bits.
In a typical hash table design, the result of the hash
function is divided by the number of array slots; the remainder then becomes the assigned slot number. Thus, the slot can be computed using hash(o) % n, where o is the object and n is the number of slots. Consider the following hash table with names as keys and addresses as values:
In the preceding diagram, the values in the table on the left represent keys, which are hashed using the hash
function to produce the index of the slot where the value is stored. Our input objects (John
, Jane
, George
, and Sue
), are put through the hash
function, which results in an integer value. This value becomes the index in an array of street addresses. We can look up the street address of a given name by computing its hash, then accessing the resulting array index.
This method works well when the number of slots is stable or when the order of the elements can be managed in a predictable way by a single owner. There are additional complexities in hash table design, specifically around avoiding hash collisions, but the basic concept remains straightforward.
However, the situation gets a bit more complicated when multiple clients of the hash table need to stay in sync. All these clients need to consistently produce the same hash result even as the elements themselves might be moving around. Let's examine the distributed hash table architecture and the means by which it solves this problem.
Distributing hash tables
When we take the basic idea of a hash table and partition it out to multiple nodes, it gives us a distributed hash table (DHT). Each node in the DHT must share the same hash
function so that hash results on one node match the results on all others.
In order to determine the location of a given piece of data in the cluster, we need some means to associate an object with the node that owns it. We can ask every node in the cluster, but this will be problematic for at least two important reasons: first, this strategy doesn't scale well as the overhead will grow with the number of nodes; second, every node in the cluster will have to be available to answer requests in order to definitively state that a given item does not exist. A shared index can address this, but the result will be additional complexity and another point of failure.
Therefore, a key objective of the hash
function in a DHT is to map a key to the node that owns it, such that a request can be made to the correct node. However, the simple hash
function discussed previously is no longer appropriate to map data to a node. The simple hash is problematic in a distributed system because n
translates to the number of nodes in the cluster—and we know that n
changes as nodes are added or removed. To illustrate this, we can modify our hash table to store pointers to machine IP addresses instead of street addresses.
In this case, keys are mapped to a specific machine in the distributed hash table that holds the value for the key. Now, each key in the table can be mapped to its location in the cluster with a simple lookup. However, if we alter the cluster size (by adding or removing nodes), the result of the computation—and therefore the node mapping—changes for every object! Let's see what happens when a node is removed from the cluster.
When a node is removed from the cluster, the result is that the subsequent hash buckets are shifted, which causes the keys to point to different nodes. Note that after removing node 3, the number of buckets is reduced. As previously described, this changes the result of the hash
function, causing the old mappings to become unusable. This will be catastrophic as all key lookups will point to the wrong node.