Commit 9c21a518 authored by antirez

design documents added to the project

Parent dfc5e96c
Redis Cluster Design Proposal (work in progress)
Network layout
==============
- N different Data Nodes. Every node is identified by ip:port.
- A single Configuration Node.
- M different Proxy Nodes (redis-cluster).
- A single Handling Node.
Configuration Node
==================
- Contains information about all the Data nodes in the cluster.
- Contains information about all the Proxy nodes in the cluster.
- Maps the keyspace to different nodes.
The keyspace is divided into 1024 different "hashing slots".
Given a key, compute SHA1(key) and use the last 10 bits of the result as a 10-bit number identifying the key slot (from 0 to 1023).
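For illustration, the slot computation might look like the following Python sketch (the function name key_slot is just for the example):

    import hashlib

    def key_slot(key):
        # SHA1 produces a 160-bit digest; keep only the last 10 bits
        # to obtain a slot number between 0 and 1023.
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest[-2:], "big") & 0x3FF

    print(key_slot("mykey"))  # an integer in the 0-1023 range
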
The Configuration node maps every slot of the keyspace to K different Data Nodes.
The Configuration node can be modified by a single client at a time. Locking is performed using SETNX.
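A minimal sketch of such a lock, assuming the redis-py client (the key name config-lock and the retry policy are illustrative, not part of the proposal):

    import time
    import redis

    config = redis.Redis(host="192.168.1.100", port=6379)  # hypothetical Configuration node

    def modify_configuration(fn):
        # SETNX sets the key only if it does not already exist, so only
        # one client at a time can acquire the lock.
        while not config.setnx("config-lock", "locked"):
            time.sleep(0.1)  # naive retry; stale-lock handling is omitted
        try:
            fn()
        finally:
            config.delete("config-lock")
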
The Configuration node should be replicated as there is a single configuration node for the whole network.
The Configuration node is a standard Redis server, like every other Data node.
Data Nodes
==========
Data nodes just hold data, and are normal Redis processes. No configuration is stored on the nodes, nor are the nodes "active" in the cluster: they just receive normal Redis commands.
Proxy Nodes
===========
Proxy nodes receive requests from clients and route these requests to the right Data nodes.
When a Proxy node is started it needs to know the Configuration node address in order to load the information about the Data nodes and the mapping between the keyspace and the nodes.
On startup a Proxy node will also register itself in the Configuration node, and will make sure to refresh its registration every N seconds (via a key set to EXPIRE) so that it is possible to detect when a Proxy node fails.
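The registration might look like this sketch (the proxy:<ip:port> key naming and the redis-py client are assumptions):

    import time
    import redis

    config = redis.Redis(host="192.168.1.100", port=6379)  # hypothetical Configuration node

    def proxy_heartbeat(my_addr, ttl=30):
        # Register under a key that expires: if the Proxy dies and stops
        # refreshing it, the key disappears and the failure can be detected.
        while True:
            config.setex("proxy:" + my_addr, ttl, "alive")
            time.sleep(ttl / 2)
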
The Proxy node is also in charge of signaling failing Data nodes to the Configuration node, so that the Handling node can take appropriate actions.
When a new Data node joins or leaves the cluster, and in general when the cluster configuration changes, all the Proxy nodes will receive a notification and will reload the configuration from the Configuration node.
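The notification mechanism is not specified in the proposal; one plausible sketch uses Redis pub/sub on the Configuration node (the cluster-config channel name and the reload_configuration helper are hypothetical):

    import redis

    config = redis.Redis(host="192.168.1.100", port=6379)  # hypothetical Configuration node

    def wait_for_config_changes():
        pubsub = config.pubsub()
        pubsub.subscribe("cluster-config")
        for message in pubsub.listen():
            if message["type"] == "message":
                reload_configuration()  # hypothetical: re-read the hashingslot:* keys
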
Handling Node
=============
The Handling node is a special Redis client with the following roles:
- Handles the cluster configuration stored in the Config node.
- Is in charge of adding and removing nodes dynamically from the net.
- Relocates keys on node additions / removals.
- Signals configuration changes to the Proxy nodes.
More details on hashing slots
=============================
The Configuration node holds 1024 keys in the following form:
hashingslot:0
hashingslot:1
...
hashingslot:1023
Every hashing slot is actually a Redis list, containing one or more ip:port pairs. For instance:
hashingslot:10 => 192.168.1.19:6379, 192.168.1.200:6379
This means that keys hashing to slot 10 will be saved in the two Data nodes 192.168.1.19:6379 and 192.168.1.200:6379.
When a client performs a read operation (via a Proxy node), the proxy will contact a random Data node among the Data nodes in charge of the given slot.
For instance a client can ask for the following operation to a given Proxy node:
GET mykey
"mykey" hashes to (for instance) slot 10, so the Proxy will forward the request to either Data node 192.168.1.19:6379 or 192.168.1.200:6379, and then forward back the reply to the client.
When a write operation is performed, it is forwarded to both the Data nodes in the example (and in general to all the data nodes).
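Putting the pieces together, the routing logic of a Proxy might look like this sketch (the redis-py client is an assumption, and connections are opened per request only for brevity):

    import hashlib
    import random
    import redis

    config = redis.Redis(host="192.168.1.100", port=6379)  # hypothetical Configuration node

    def key_slot(key):
        # Same slot function as in the earlier sketch.
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[-2:], "big") & 0x3FF

    def nodes_for_key(key):
        # Every hashingslot:N key holds a list of ip:port pairs.
        addrs = config.lrange("hashingslot:%d" % key_slot(key), 0, -1)
        nodes = []
        for addr in addrs:
            host, port = addr.decode().split(":")
            nodes.append(redis.Redis(host=host, port=int(port)))
        return nodes

    def proxy_get(key):
        # Reads go to a single random Data node in charge of the slot.
        return random.choice(nodes_for_key(key)).get(key)

    def proxy_set(key, value):
        # Writes go to all the Data nodes in charge of the slot.
        for node in nodes_for_key(key):
            node.set(key, value)
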
Adding or removing a node
=========================
When a Data node is added to the cluster, it is added via an LPUSH operation into a Redis list representing a queue of Data nodes that are ready to enter the cluster. This list is held by the Configuration node, and entries can be added manually or via a configuration utility:
LPUSH newnodes 192.168.1.55:6379
The Handling node will check from time to time for new elements in the "newnodes" list. If there are new nodes pending to enter the cluster, they are processed one after the other in this way:
For instance let's assume there are already two Data nodes in the cluster:
192.168.1.1:6379
192.168.1.2:6379
We add a new node 192.168.1.3:6379 via the LPUSH operation.
We can imagine that the 1024 hash slots are assigned equally among the two initial nodes. In order to add the new (third) node, what we have to do is incrementally move about a third of the slots (1024 / 3 ≈ 341) from the two old servers to the new one.
For now we can think that every hash slot is only stored in a single server, to generalize the idea later.
In order to simplify the implementation, every slot can be moved from one Data node to another in a blocking way: read operations continue against all 1024 slots, but write operations against the single slot being moved are delayed until the move from one Data node to the other is completed.
In order to do so the Handling node, before moving a given slot, marks it as "write-locked" in the Configuration node, then asks all the Proxy nodes to refresh their configuration.
Then the slot is moved (1/1024 of all the keys). The Configuration node is modified to reflect the new hashing slot configuration, the slot is unlocked, and the Proxy nodes are notified.
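At a high level the per-slot move could follow this sketch (the writelock:<slot> flag, the cluster-config channel and the move_slot_data helper are illustrative assumptions; a real implementation must also enumerate and copy the keys belonging to the slot):

    import redis

    config = redis.Redis(host="192.168.1.100", port=6379)  # hypothetical Configuration node

    def move_slot(slot, src_addr, dst_addr):
        slot_key = "hashingslot:%d" % slot
        config.set("writelock:%d" % slot, 1)        # mark the slot write-locked
        config.publish("cluster-config", "reload")  # ask Proxies to refresh
        move_slot_data(slot, src_addr, dst_addr)    # hypothetical: copy the slot's keys
        # Update the mapping, then unlock and notify the Proxies again.
        config.lrem(slot_key, 0, src_addr)
        config.rpush(slot_key, dst_addr)
        config.delete("writelock:%d" % slot)
        config.publish("cluster-config", "reload")
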
Implementation details
======================
Every Proxy node should keep persistent connections to all the Data nodes.
Running the Handling node and the Configuration node on the same physical computer is probably a good idea.
- Use N worker children (forked at startup) in order to implement async I/O.
- The swap file is opened at startup and unlink(2)-ed
- The swap file free/used block bitmap is kept in memory.
- When a child is saving in the background or rewriting the append only log, the swap file gets frozen (no writes from the parent).
- When Redis is low on memory, keys not recently used and big enough will be transferred to disk by one of the child processes doing async I/O. Only when the transfer finishes will the parent mark the value as swapped out and free the associated value.
- When Redis is going to process a command, it will first check that all the keys involved are in memory. If not, it will send a request to an async I/O child in order to load these keys into memory. When the operation finishes, Redis will "resume" the client operation (the client structure will hold the arguments of the suspended command; Redis will execute the command and unmask the read/write events on the client socket).
- Async I/O children and the parent communicate via pipes, so while Redis is blocked in the event loop it can be resumed by an async I/O child just writing a message to the pipe (see the sketch after this list).
- Every Redis type should have a function to estimate the maximum space needed to serialize an object.
- The swap file is divided into blocks.
- Even if Redis unblocks the command when not all the keys are loaded, or if a key was swapped out in the meantime, this is not a critical condition, as the key lookup process will (this time synchronously) load the keys into memory if needed. This should happen very rarely or possibly never, so a bug in the async loading stage will not cause incorrect results but just a performance hit.
- The blocks allocation algorithm should try to avoid fragmentation and cache misses.
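As a toy illustration of the pipe-based wakeup described in the list above (a Python sketch, not the actual C implementation):

    import os
    import select

    read_fd, write_fd = os.pipe()
    if os.fork() == 0:
        # Child: pretend an async I/O job just finished, notify the parent.
        os.write(write_fd, b"done")
        os._exit(0)

    # Parent: blocked in its event loop (select here); the child's write
    # to the pipe wakes it up immediately.
    select.select([read_fd], [], [])
    print(os.read(read_fd, 4))  # b'done'
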
@@ -1077,6 +1077,26 @@ proc main {server port} {
[$r zrangebyscore zset 0 10 LIMIT 20 10]
} {{a b} {c d e} {c d e} {}}
test {ZREMRANGE basics} {
$r del zset
$r zadd zset 1 a
$r zadd zset 2 b
$r zadd zset 3 c
$r zadd zset 4 d
$r zadd zset 5 e
list [$r zremrangebyscore zset 2 4] [$r zrange zset 0 -1]
} {3 {a e}}
test {ZREMRANGE from -inf to +inf} {
$r del zset
$r zadd zset 1 a
$r zadd zset 2 b
$r zadd zset 3 c
$r zadd zset 4 d
$r zadd zset 5 e
list [$r zremrangebyscore zset -inf +inf] [$r zrange zset 0 -1]
} {5 {}}
test {SORT against sorted sets} {
$r del zset
$r zadd zset 1 a
......