1. 28 Apr 2017 (1 commit)
    • Regression test for PSYNC2 issue #3899 added. · c180bc7d
      committed by antirez
      Experimentally verified that it can trigger the issue when reverting
      the fix. At least on my system... The bug being time/backlog
      dependent, it is very hard to tell whether this test will trigger the
      problem consistently; however, even if it triggers the problem only
      once in a while, we'll see it in the CI environment at
      http://ci.redis.io.
  2. 27 Apr 2017 (1 commit)
    • PSYNC2: fix master cleanup when caching it. · 469d6e2b
      committed by antirez
      The master client cleanup was incomplete: the resetClient() call was
      missing and the output buffer of the client was not reset, so pending
      commands related to the previous connection could still be sent.

      The first problem caused the client argument vector to be, at times,
      half populated, so that when the correct replication stream arrived
      the protocol got mixed with the arguments, creating invalid commands
      that nobody called.

      Thanks to @yangsiran for also investigating this problem, after
      already providing important design / implementation hints for the
      original PSYNC2 issues (see the referenced GitHub issue).

      Note that this commit adds a new function to the list library of
      Redis in order to be able to reset a list without destroying it.

      Related to issue #3899.
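The commit mentions a new list-library helper that resets a list without destroying it. As a rough illustration (the struct layout and the `listAddNodeTail` helper below are simplified stand-ins for the sketch, not the actual adlist.c code), such a function only needs to free the nodes and zero the bookkeeping fields, leaving the list object itself reusable:

```c
#include <stdlib.h>

/* Minimal doubly linked list, loosely mirroring Redis' adlist shapes.
 * All of this is illustrative, not the real adlist.c API. */
typedef struct listNode {
    struct listNode *prev, *next;
    void *value;
} listNode;

typedef struct list {
    listNode *head, *tail;
    unsigned long len;
    void (*free)(void *ptr);   /* optional per-value destructor */
} list;

/* Helper for the sketch: append a node at the tail. */
int listAddNodeTail(list *l, void *value) {
    listNode *node = malloc(sizeof(*node));
    if (!node) return 0;
    node->value = value;
    node->prev = l->tail;
    node->next = NULL;
    if (l->tail) l->tail->next = node; else l->head = node;
    l->tail = node;
    l->len++;
    return 1;
}

/* Remove every node but keep the list object alive, so the caller can
 * reuse it without a destroy/create cycle. */
void listEmpty(list *l) {
    listNode *current = l->head, *next;
    while (current) {
        next = current->next;
        if (l->free) l->free(current->value);
        free(current);
        current = next;
    }
    l->head = l->tail = NULL;
    l->len = 0;
}
```

The point of keeping the container alive is exactly the caching use case above: the cached master's bookkeeping lists can be cleared in place rather than reallocated.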
  3. 22 Apr 2017 (4 commits)
  4. 21 Apr 2017 (1 commit)
    • Check event loop creation return value. Fix #3951. · 238cebdd
      committed by antirez
      Normally we never check for OOM conditions inside Redis, since the
      allocator will always return a pointer or abort the program on OOM.
      However we have no control over epoll_create(), which may fail on
      kernel OOM (according to the manual page) even if all the parameters
      are correct, so aeCreateEventLoop() may indeed return NULL and this
      condition must be checked.
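A minimal sketch of the described check, assuming a Linux host. The `createEventLoopFd`/`createEventLoopFdOrExit` wrappers are hypothetical names for the sketch; the real fix checks the return value of `aeCreateEventLoop()` during server initialization:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/epoll.h>

/* epoll_create() can fail with ENOMEM or EMFILE even when its argument
 * is valid, so the event-loop constructor can legitimately fail and
 * callers must not assume success. */
int createEventLoopFd(void) {
    return epoll_create(1024);   /* -1 on failure, e.g. kernel OOM */
}

/* Caller-side check: bail out loudly instead of dereferencing a
 * failed event loop later. */
int createEventLoopFdOrExit(void) {
    int epfd = createEventLoopFd();
    if (epfd == -1) {
        perror("Failed creating the event loop");
        exit(1);
    }
    return epfd;
}
```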
  5. 20 Apr 2017 (1 commit)
  6. 19 Apr 2017 (3 commits)
    • Fix getKeysUsingCommandTable() in cluster mode. · 7d9dd80d
      committed by antirez
      Close #3940.
    • PSYNC2: discard pending transactions from cached master. · 189a12af
      committed by antirez
      During the review of the fix for #3899, @yangsiran identified an
      implementation bug: given that the offset is now relative to the
      applied part of the replication log, when we cache a master, the
      successive PSYNC2 request will be made in order to *include* the
      transaction that was not completely processed. This means that we
      need to discard any pending transaction from our replication buffer:
      it will be re-executed.
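The idea can be sketched as follows. The `client` struct, the `CLIENT_MULTI` flag handling, and `cacheMasterSketch()` are simplified stand-ins for the real server code, which discards any queued MULTI/EXEC state while caching the master, since the next PSYNC will re-send the whole transaction:

```c
/* Toy model of the state involved; not the real redisClient layout. */
typedef struct client {
    int flags;
    int multi_count;   /* commands queued inside a MULTI/EXEC block */
} client;

#define CLIENT_MULTI (1<<0)

/* Drop a half-received transaction. */
void discardTransaction(client *c) {
    c->multi_count = 0;
    c->flags &= ~CLIENT_MULTI;
}

/* When caching the master, pending queued commands must not survive:
 * the successive PSYNC2 offset re-includes the whole transaction, so
 * it will be received and executed again from scratch. */
void cacheMasterSketch(client *master) {
    if (master->flags & CLIENT_MULTI) discardTransaction(master);
}
```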
    • Fix PSYNC2 incomplete command bug as described in #3899. · 22be435e
      committed by antirez
      This bug was discovered by @kevinmcgehee and constituted a major
      hidden bug in the PSYNC2 implementation, caused by the propagation
      from the master of incomplete commands to slaves.

      The bug had several results:

      1. Borrowing from Kevin's text in the issue: "Given that slaves
      blindly copy over their master's input into their own replication
      backlog over successive read syscalls, it's possible that with large
      commands or small TCP buffers, partial commands are present in this
      buffer. If the master were to fail before successfully propagating
      the entire command to a slave, the slaves will never execute the
      partial command (since the client is invalidated) but will copy it to
      replication backlog which may relay those invalid bytes to its slaves
      on PSYNC2, corrupting the backlog and possibly other valid commands
      that follow the failover. Simple command boundaries aren't sufficient
      to capture this, either, because in the case of a MULTI/EXEC block,
      if the master successfully propagates a subset of the commands but
      not the EXEC, then the transaction in the backlog becomes corrupt and
      could corrupt other slaves that consume this data."

      2. As identified by @yangsiran later, there is another effect of the
      bug. By the same mechanism as the first problem, a slave having
      another slave could receive a full resynchronization request with an
      already half-applied command in the backlog. Once the RDB is ready,
      it will be sent to the slave, and the replication will continue
      sending to the sub-slave the other half of the command, which is not
      valid.

      The fix, designed by @yangsiran and @antirez, and implemented by
      @antirez, uses a secondary buffer in order to feed the sub-slaves and
      update the replication backlog and offsets only when a given part of
      the query buffer is actually *applied* to the state of the instance,
      that is, when the command gets processed and is not pending in the
      Redis transaction buffer because of the CLIENT_MULTI state.

      Given that the backlog and offsets representation are now in
      agreement with the actually processed commands, both issues 1 and 2
      should no longer be possible.

      Thanks to @kevinmcgehee, @yangsiran and @oranagra for their work in
      identifying and designing a fix for this problem.
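A toy model of the fix's core idea, with all names and fixed-size buffers invented for the sketch (no bounds checks, no real command parsing): bytes from the master first sit in a pending buffer, and are appended to the backlog and counted in the offset only once the commands they belong to have been fully applied, so sub-slaves never observe partial commands:

```c
#include <string.h>

#define BUF_SIZE 1024

/* Illustrative replication state, not the real redisServer fields. */
typedef struct replState {
    char pending[BUF_SIZE]; size_t pending_len;   /* read, not applied */
    char backlog[BUF_SIZE]; size_t backlog_len;   /* applied, feedable */
    long long master_repl_offset;                 /* applied bytes only */
} replState;

/* Raw bytes arriving from the master's socket land in `pending`. */
void feedPending(replState *r, const char *buf, size_t len) {
    memcpy(r->pending + r->pending_len, buf, len);
    r->pending_len += len;
}

/* Called only once `applied` bytes form commands that were fully
 * processed (and are not parked in the MULTI queue): move them to the
 * backlog, advance the offset, keep the unapplied tail in `pending`. */
void applyBytes(replState *r, size_t applied) {
    memcpy(r->backlog + r->backlog_len, r->pending, applied);
    r->backlog_len += applied;
    r->master_repl_offset += applied;
    memmove(r->pending, r->pending + applied, r->pending_len - applied);
    r->pending_len -= applied;
}
```

In this model a crash of the master mid-command leaves the partial bytes in `pending`, never in `backlog`, which is exactly the invariant the fix establishes.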
  7. 18 Apr 2017 (8 commits)
  8. 17 Apr 2017 (2 commits)
  9. 15 Apr 2017 (1 commit)
    • Cluster: discard pong times in the future. · 271733f4
      committed by antirez
      However, we allow for 500 milliseconds of tolerance, in order to
      avoid too often discarding semantically valid info (the node is up)
      because of the natural few milliseconds of desync among servers, even
      when NTP is used.

      Note that in any case we should ping the node from time to time
      regardless, and discover if it's actually down from our point of
      view, since no update is accepted while we have an active ping on the
      node.

      Related to #3929.
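The acceptance rule described above reduces to a one-line predicate; the function and constant names here are invented for the sketch:

```c
#include <stdint.h>

#define PONG_FUTURE_TOLERANCE_MS 500

/* Accept a gossiped pong time only if it is not more than 500 ms ahead
 * of our own clock; anything further in the future is treated as clock
 * desync and discarded. Returns 1 to accept, 0 to discard. */
int pongTimeAcceptable(int64_t reported_pong_ms, int64_t now_ms) {
    return reported_pong_ms <= now_ms + PONG_FUTURE_TOLERANCE_MS;
}
```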
  10. 14 Apr 2017 (6 commits)
    • Test: fix, hopefully, false PSYNC failure like in issue #2715. · 3f068b92
      committed by antirez
      And many other related GitHub issues... all reporting the same
      problem. There was probably just not enough backlog in certain
      unlucky runs. I'll ask people that can reproduce whether they now see
      this as fixed as well.
    • Cluster: always add PFAIL nodes at end of gossip section. · 02777bb2
      committed by antirez
      Relying on the fact that nodes in PFAIL state will be shared around
      by randomly adding them to the gossip section is a weak assumption,
      especially after changes related to sending fewer ping/pong packets.

      We want to always include gossip entries for all the nodes that are
      in PFAIL state, so that the PFAIL -> FAIL state promotion can happen
      much faster and more reliably.

      Related to #3929.
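The described step can be sketched like this; the `node`/`gossipMsg` types and function names are simplified stand-ins for the real clusterMsg structures. After the random entries are picked, a deterministic pass appends every PFAIL node not already present:

```c
#define MAX_GOSSIP_ENTRIES 16

/* Toy node table entry. */
typedef struct node {
    char name[8];
    int pfail;              /* in PFAIL state from our POV */
} node;

/* Toy gossip section under construction. */
typedef struct gossipMsg {
    const node *entries[MAX_GOSSIP_ENTRIES];
    int count;
} gossipMsg;

static int gossipContains(const gossipMsg *m, const node *n) {
    for (int i = 0; i < m->count; i++)
        if (m->entries[i] == n) return 1;
    return 0;
}

/* Deterministically append every PFAIL node the random pass missed, so
 * PFAIL reports always propagate regardless of random selection. */
void gossipAppendPfail(gossipMsg *m, const node *nodes, int numnodes) {
    for (int i = 0; i < numnodes; i++) {
        if (nodes[i].pfail && !gossipContains(m, &nodes[i]) &&
            m->count < MAX_GOSSIP_ENTRIES)
            m->entries[m->count++] = &nodes[i];
    }
}
```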
    • Cluster: fix gossip section ping/pong times encoding. · 8c829d9e
      committed by antirez
      The gossip section times are 32 bit, so they cannot store the
      milliseconds time but just the seconds approximation, which is good
      enough for our uses. At the same time, however, when comparing the
      gossip section times of other nodes with our node's view, we need to
      convert back to milliseconds.

      Related to #3929. Without this change the patch to reduce the traffic
      in the bus messages does not work.
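The conversion pair is trivial but easy to get asymmetric, which is exactly the bug described: the sender truncates to seconds, and the receiver must scale back up before comparing against local millisecond timestamps. The function names here are invented for the sketch:

```c
#include <stdint.h>

/* Sender side: 32-bit field can only hold seconds. */
uint32_t gossipEncodeTime(int64_t ms) { return (uint32_t)(ms / 1000); }

/* Receiver side: scale back to milliseconds before comparing with
 * local mstime()-style values. Sub-second precision is lost, which is
 * acceptable for failure-detection purposes. */
int64_t gossipDecodeTime(uint32_t s) { return (int64_t)s * 1000; }
```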
    • Cluster: add clean-logs command to create-cluster script. · 6878a3fe
      committed by antirez
    • Cluster: decrease ping/pong traffic by trusting other nodes' reports. · 8f7bf284
      committed by antirez
      Clusters of bigger sizes tend to have a lot of traffic in the cluster
      bus just for failure detection: a node will try to get a ping reply
      from another node no later than half the node timeout, in order to
      avoid a false positive.

      However this means that if we have N nodes and the node timeout is
      set to, for instance, M seconds, we'll have to ping N nodes every
      M/2 seconds. These pings will receive the same number of pongs, so a
      total of N*M packets per node. However, given that we have a total
      of N nodes doing this, the total number of messages will be N*N*M.

      In a 100-node cluster with a timeout of 60 seconds, this translates
      to a total of 100*100*30 packets per second, summing all the packets
      exchanged by all the nodes.

      This is, as you can guess, a lot... So this patch changes the
      implementation in a very simple way in order to trust the reports of
      other nodes: if a node A reports a node B as alive at least up to a
      given time, we update our view accordingly.

      The problem with this approach is that it could result in a subset
      of nodes being able to reach a given node X, preventing others from
      detecting that it is actually not reachable from the majority of
      nodes. So the above algorithm is refined by trusting other nodes
      only if we do not currently have a ping pending for node X, and if
      there are no failure reports for that node.

      Since each node pings 10 other nodes every second (one node every
      100 milliseconds), eventually, even trusting the other nodes'
      reports, we will detect if a given node is down from our POV.

      Now, to understand the number of packets that the cluster would
      exchange for failure detection with the patch, we can start by
      considering the random PINGs that the cluster sends anyway as a
      baseline: each node sends 10 packets per second, so the total
      traffic, if no additional packets were sent, including PONG packets,
      would be:

          Total messages per second = N*10*2

      However, trusting other nodes' gossip sections will not always
      prevent pinging nodes for the "half timeout reached" rule. The math
      involved in computing the actual rate as N and M change is quite
      complex and depends also on another parameter, which is the number
      of entries in the gossip section of PING and PONG packets. However,
      it is possible to compare what happens in clusters of different
      sizes experimentally. After applying this patch, a very important
      reduction in the number of packets exchanged is trivial to observe,
      without apparent impact on failure detection performance.

      Actual numbers with different cluster sizes should be published in
      the Redis Cluster documentation in the future.

      Related to #3929.
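The refined trust rule above can be sketched as a single predicate-and-update; the `nodeView` fields and `maybeTrustGossip()` name are simplified stand-ins for the real clusterNode state:

```c
#include <stdint.h>

/* Our local view of a remote node B; illustrative, not clusterNode. */
typedef struct nodeView {
    int64_t pong_received_ms;  /* last pong time from our own POV */
    int ping_pending;          /* we sent B a ping, still unanswered */
    int failure_reports;       /* nodes currently flagging B as PFAIL */
} nodeView;

/* Node A gossips that B was alive at `gossiped_pong_ms`. Trust it only
 * if (a) we have no ping in flight to B ourselves and (b) nobody is
 * reporting B as failing; otherwise keep our own view, so a node
 * unreachable from the majority is still detected. Returns 1 if our
 * view was refreshed. */
int maybeTrustGossip(nodeView *b, int64_t gossiped_pong_ms) {
    if (b->ping_pending || b->failure_reports > 0) return 0;
    if (gossiped_pong_ms <= b->pong_received_ms) return 0; /* stale */
    b->pong_received_ms = gossiped_pong_ms;
    return 1;
}
```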
    • Cluster: collect more specific bus messages stats. · c5d6f577
      committed by antirez
      First step in changing Cluster to use fewer messages. Related to
      issue #3929.
  11. 12 Apr 2017 (2 commits)
  12. 11 Apr 2017 (5 commits)
  13. 10 Apr 2017 (3 commits)
    • Make more obvious why there was issue #3843. · 531647bb
      committed by antirez
    • Merge pull request #3843 from dvirsky/fix_bc_free · 01b6966a
      committed by Salvatore Sanfilippo
      Fixed free of blocked client before referring to it.
    • Fix modules blocking commands awake delay. · ffefc9f9
      committed by antirez
      If a thread unblocks a client blocked in a module command, by using
      the RedisModule_UnblockClient() API, the event loop may not be
      awakened until the next timeout of the multiplexing API or the next
      unrelated I/O operation on other clients. We actually want the
      client to be served ASAP, so a mechanism is needed in order for the
      unblocking API to inform Redis that there is a client to serve ASAP.

      This commit fixes the issue using the old trick of the pipe: when a
      client needs to be unblocked, a byte is written to a pipe. When we
      run the list of clients blocked in modules, we consume all the bytes
      written to the pipe. Writes and reads are performed inside the
      context of the mutex, so no race is possible in which we consume
      bytes that are actually related to a wake-up request for a client
      that should still be put into the list of clients to unblock.

      It was verified that after the fix the server handles the blocked
      clients with the expected short delay.

      Thanks to @dvirsky for understanding there was such a problem and
      reporting it.
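The pipe trick described above can be sketched as follows, using real POSIX `pipe()`, `fcntl()`, `read()` and `write()`; the function names are invented for the sketch, and the mutex around writes/drains mentioned in the commit is omitted for brevity. The read end would be registered in the event loop, so a write from any thread makes the multiplexing call return immediately:

```c
#include <fcntl.h>
#include <unistd.h>

int wakeup_pipe[2];  /* [0] = read end (event loop), [1] = write end */

/* Create the pipe and make the read end non-blocking, so draining
 * stops cleanly when the pipe is empty. Returns 0 on success. */
int setupWakeupPipe(void) {
    if (pipe(wakeup_pipe) == -1) return -1;
    return fcntl(wakeup_pipe[0], F_SETFL, O_NONBLOCK);
}

/* Called by the unblocking API from any thread: one byte per wake-up
 * request. The payload is irrelevant; only readability matters. */
void signalClientReady(void) {
    char b = 'x';
    (void)!write(wakeup_pipe[1], &b, 1);
}

/* Called from the event loop when the read end is readable: consume
 * every pending byte, then process the unblocked-clients list. */
void drainWakeupPipe(void) {
    char buf[128];
    while (read(wakeup_pipe[0], buf, sizeof(buf)) > 0);
}
```

Draining all bytes at once is safe precisely because, as the commit notes, writes and the drain happen under the same mutex as the list manipulation, so no wake-up request can be consumed before its client is queued.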
  14. 08 Apr 2017 (2 commits)