1. 01 Sep, 2014 1 commit
  2. 27 Aug, 2014 3 commits
  3. 23 Jun, 2014 2 commits
  4. 21 Jun, 2014 1 commit
    • A
      Sentinel: send SLAVEOF with MULTI, CLIENT KILL, CONFIG REWRITE. · 7d0992da
      antirez committed
      This implements the new Sentinel-Client protocol for the Sentinel part:
      now instances are reconfigured using a transaction that ensures that the
      config is rewritten in the target instance, and that clients lose the
      connection with the instance, in order to be forced to: ask Sentinel,
      reconnect to the instance, and verify the instance role with the new
      ROLE command.
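
      A minimal sketch of the command sequence this message describes. The
      order of the commands inside the transaction and the `TYPE normal`
      argument to CLIENT KILL are assumptions, and `reconfigure_commands` is
      a hypothetical helper, not a function from sentinel.c.

```python
def reconfigure_commands(master_host=None, master_port=None):
    """Return the commands for one reconfiguration transaction.

    With no host given, the instance is promoted (SLAVEOF NO ONE).
    """
    if master_host is None:
        slaveof = ["SLAVEOF", "NO", "ONE"]
    else:
        slaveof = ["SLAVEOF", master_host, str(master_port)]
    return [
        ["MULTI"],
        slaveof,
        ["CONFIG", "REWRITE"],                 # persist the new role on disk
        ["CLIENT", "KILL", "TYPE", "normal"],  # force clients to ask Sentinel again
        ["EXEC"],
    ]
```

      Wrapping the three commands in MULTI/EXEC means the instance applies
      all of them or none, so a client can never stay connected to an
      instance whose role changed without noticing.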
      7d0992da
  5. 19 Jun, 2014 2 commits
    • A
      Sentinel: send hello messages ASAP after config change. · 93ee0f26
      antirez committed
      Eventual configuration convergence is guaranteed by our periodic hello
      messages to all the instances, however when there are important notices
      to share, better make a phone call. With this commit we force a hello
      message to other Sentinel and Redis instances within the next 100
      milliseconds of a config update, which is practically better than
      waiting a few seconds.
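
      One way to get this effect is to backdate the time of the last hello so
      that the next timer tick treats it as overdue. The sketch below assumes
      a 2000 ms regular hello period and a 100 ms timer tick; both constants
      and all names are illustrative.

```python
PUBLISH_PERIOD_MS = 2000  # assumed regular hello period
TICK_MS = 100             # assumed timer resolution


class Instance:
    def __init__(self, last_pub_time_ms=0):
        self.last_pub_time_ms = last_pub_time_ms


def hello_due(inst, now_ms):
    """True when the periodic hello should be sent on this tick."""
    return now_ms - inst.last_pub_time_ms > PUBLISH_PERIOD_MS


def force_hello_update(inst, now_ms):
    """Backdate the last publish time so the next tick sends a hello."""
    overdue = now_ms - (PUBLISH_PERIOD_MS + 1)
    inst.last_pub_time_ms = min(inst.last_pub_time_ms, overdue)
```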
      93ee0f26
    • A
      Sentinel: handle SRI_PROMOTED flag correctly. · 9b883974
      antirez committed
      A missing check of the SRI_PROMOTED flag caused Sentinel to treat the
      promoted slave, turned into a master during failover, as if it were a
      normal instance.
      
      Normally this problem was not apparent because during real failovers the
      old master is down so the bugged code path was not entered, however with
      manual failovers via the SENTINEL FAILOVER command, the problem was
      easily triggered.
      
      This commit prevents promoted slaves from getting reconfigured, moreover
      we now explicitly check that during a failover the slave turning into a
      master is the one we selected for promotion and not a different one.
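
      The two checks can be sketched as follows. The flag values and helper
      names are illustrative and are not taken from sentinel.c.

```python
SRI_MASTER = 1 << 0    # flag values are illustrative
SRI_SLAVE = 1 << 1
SRI_PROMOTED = 1 << 2


def may_reconfigure(flags):
    """A slave selected for promotion must not be reconfigured."""
    return bool(flags & SRI_SLAVE) and not (flags & SRI_PROMOTED)


def is_expected_promotion(reporting_runid, promoted_runid):
    """Accept a slave-to-master switch only from the slave we selected."""
    return promoted_runid is not None and reporting_runid == promoted_runid
```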
      9b883974
  6. 28 May, 2014 1 commit
  7. 20 May, 2014 1 commit
  8. 08 May, 2014 2 commits
    • A
      Sentinel: log when a failover will be attempted again. · 13d8b2b0
      antirez committed
      When a Sentinel performs a failover (successful or not), or when a
      Sentinel votes for a different Sentinel trying to start a failover, it
      sets a min delay before it will try to get elected for a failover.
      
      While not strictly needed, because if multiple Sentinels try
      to failover the same master at the same time, only one configuration
      will eventually win, this serialization is practically very useful.
      Normal failovers are cleaner: one Sentinel starts to failover, the
      others update their config when the Sentinel performing the failover
      is able to get the selected slave to move from the role of slave to the
      one of master.
      
      However, until now this delay was implicit, so users could see
      Sentinels not reacting for some time after a failed failover, without
      any feedback in the logs for the poor sysadmin waiting for clues.
      
      This commit makes Sentinels more verbose about the delay: when a master
      is down and a failover attempt is not performed because the delay has
      not yet elapsed, something like this will be logged:
      
          Next failover delay: I will not start a failover
          before Thu May  8 16:48:59 2014
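
      A sketch of how such a line could be produced. The 180 second delay in
      the usage lines is an arbitrary example value, the function name is
      hypothetical, and the message is returned as a single line rather than
      wrapped as above.

```python
import calendar
import time


def next_failover_log(last_attempt_s, delay_s, now_s):
    """Return the log line while the delay has not elapsed, else None."""
    not_before = last_attempt_s + delay_s
    if now_s >= not_before:
        return None
    # UTC is used only to keep this example deterministic.
    stamp = time.asctime(time.gmtime(not_before))
    return "Next failover delay: I will not start a failover before " + stamp


last = calendar.timegm((2014, 5, 8, 16, 45, 59, 0, 0, 0))
print(next_failover_log(last, 180, last + 10))
```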
      13d8b2b0
    • A
      Sentinel: generate +config-update-from event when a new config is received. · 909d1883
      antirez committed
      This event makes clear, before the switch-master event is generated,
      that a Sentinel received a configuration update from another Sentinel.
      909d1883
  9. 25 Mar, 2014 4 commits
  10. 21 Mar, 2014 5 commits
    • A
      Sentinel: sentinelRefreshInstanceInfo() minor refactoring. · 0937377a
      antirez committed
      Test the sentinel.tilt condition at the top and return if it is true.
      This allows removing the check for the tilt condition in the remaining
      code paths of the function.
      0937377a
    • A
      9c2063fb
    • A
      Sentinel: down-after-milliseconds is not master-specific. · ffa8f479
      antirez committed
      addReplySentinelRedisInstance() modified so that this field is displayed
      for all kinds of instances: Sentinels, Masters, Slaves.
      ffa8f479
    • A
      Sentinel failure detection implementation improved. · 42091a79
      antirez committed
      Failure detection in Sentinel is ping-pong based. It used to work by
      remembering the last time a valid PONG reply was received, and checking
      if the reception time was too old compared to the current time.
      
      PINGs were sent at a fixed interval of 1 second.
      
      This works in a decent way, but does not scale well when we want to set
      very small values of "down-after-milliseconds" (this is the node
      timeout basically).
      
      This commit reimplements the failure detection, making a number of
      changes. Some changes are inspired by the Redis Cluster failure
      detection code:
      
      * A new last_ping_time field is added in representation of instances.
        If non zero, we have an active ping that was sent at the specified
        time. When a valid reply to ping is received, the field is zeroed
        again.
      * last_ping_time is not reset when we reconnect the link or send a new
        ping, so from our point of view it represents the time we started
        waiting for the instance to reply to our pings without receiving a
        reply.
      * last_ping_time is now used in order to check if the instance is
        timed out. This means that we can have a node timeout of 100
        milliseconds and yet the system will work well since the new check is
        not bound to the period used to send pings.
      * Pings are now sent every second, or more often if the value of
        down-after-milliseconds is less than one second, with a lower limit
        of 10 HZ ping frequency.
      * Link reconnection code was improved. This is used in order to try to
        reconnect the link when we are at 50% of the node timeout without a
        valid reply received yet. However the old code triggered unnecessary
        reconnections when the node timeout was very small. Now that should be
        ok.
      
      The new code passes the tests but more testing is needed and more unit
      tests stressing the failure detector, so currently this is merged only
      in the unstable branch.
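
      The bullet points above can be sketched as a small state machine. This
      is an illustration of the described behavior, not the sentinel.c code:
      the class and method names are made up, and reading the 10 HZ limit as
      "never ping more often than every 100 ms" is an assumption.

```python
class PingState:
    def __init__(self, down_after_ms):
        self.down_after_ms = down_after_ms
        self.last_ping_time = 0  # 0 means no ping is awaiting a reply

    def ping_period_ms(self):
        # Once per second, or more often for sub-second timeouts,
        # capped at 10 pings per second (assumed reading of the limit).
        return max(100, min(1000, self.down_after_ms))

    def on_ping_sent(self, now_ms):
        # Keep the time of the oldest unanswered ping: sending another
        # ping or reconnecting does not reset it.
        if self.last_ping_time == 0:
            self.last_ping_time = now_ms

    def on_valid_reply(self):
        self.last_ping_time = 0

    def is_timed_out(self, now_ms):
        return (self.last_ping_time != 0
                and now_ms - self.last_ping_time > self.down_after_ms)

    def should_reconnect(self, now_ms):
        # Try a fresh link at 50% of the node timeout without a reply.
        return (self.last_ping_time != 0
                and now_ms - self.last_ping_time > self.down_after_ms / 2)
```

      Because the timeout is measured from the first unanswered ping rather
      than from the last PONG, a 100 ms node timeout no longer depends on
      how often pings happen to be sent.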
      42091a79
    • A
      Sentinel: use CLIENT SETNAME when connecting to Redis. · 38241c4b
      antirez committed
      This makes debugging / monitoring of Sentinels simpler since you can
      identify sentinels in CLIENT LIST output of Redis instances.
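
      As an illustration, a connection name could combine a fixed prefix, a
      shortened run ID and the link type. The name layout below is
      hypothetical; only the CLIENT SETNAME command itself comes from the
      commit.

```python
def setname_command(runid, link_type):
    """Build a CLIENT SETNAME command; the name layout is hypothetical."""
    if link_type not in ("cmd", "pubsub"):
        raise ValueError("unknown link type: %r" % (link_type,))
    # Connection names cannot contain spaces, so join parts with dashes.
    return ["CLIENT", "SETNAME", "sentinel-%s-%s" % (runid[:8], link_type)]
```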
      38241c4b
  11. 15 Mar, 2014 2 commits
    • M
      Fix segfault from accessing array out of bounds · 9de07558
      Matt Stancliff committed
      argc == 2; argv[2] == crash
      9de07558
    • A
      Sentinel: be safe under crash-recovery assumptions. · a31a0b43
      antirez committed
      Sentinel's main safety argument is that there are no two configurations
      for the same master with the same version (configuration epoch).
      
      For this to be true Sentinels must be authorized by a majority.
      Additionally Sentinels must do two important things:
      
      * Never vote again for the same epoch.
      * Never exchange an old vote for a fresh one.
      
      The first prerequisite, in a crash-recovery system model, requires
      persisting the master->leader_epoch on durable storage before replying
      to messages. This was not the case.
      
      We also make sure to persist the current epoch in order to never reply
      to stale vote requests from other Sentinels after a recovery.
      
      The configuration is persisted by making use of fsync(), this is
      considered in the context of this code a good enough guarantee that
      after a restart our durable state is restored, however this may not
      always be the case depending on the kind of hardware and operating
      system used.
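
      The "persist, then reply" rule can be sketched like this. The file
      format, the function names and the simplified voting rule (refuse any
      request whose epoch is not newer than the last vote) are assumptions
      made for the example.

```python
import os
import tempfile


def persist_state(path, state):
    """Write the state and fsync it before returning."""
    with open(path, "w") as f:
        for key in sorted(state):
            f.write("%s %d\n" % (key, state[key]))
        f.flush()
        os.fsync(f.fileno())


def request_vote(path, state, req_epoch):
    """Grant a vote at most once per epoch, persisting before the reply."""
    if req_epoch <= state["leader_epoch"]:
        return False                    # already voted in this or a later epoch
    state["current_epoch"] = max(state["current_epoch"], req_epoch)
    state["leader_epoch"] = req_epoch
    persist_state(path, state)          # durable first, reply second
    return True


demo_path = os.path.join(tempfile.mkdtemp(), "sentinel-state.conf")
demo_state = {"current_epoch": 3, "leader_epoch": 3}
first = request_vote(demo_path, demo_state, 4)    # granted and persisted
second = request_vote(demo_path, demo_state, 4)   # refused: same epoch
```

      If the process crashes after `persist_state` returns but before the
      reply is sent, the vote is merely lost to the requester; it can never
      be handed out a second time after a restart.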
      a31a0b43
  12. 14 Mar, 2014 2 commits
    • A
      Sentinel: fake PUBLISH command to receive HELLO messages. · 6b0e36ff
      antirez committed
      Now the way HELLO messages are received is unified.
      Sentinels no longer need to be able to chat via some Redis instance in
      order to converge to the higher configuration for a master: they are
      able to directly exchange configurations.
      
      Note that this commit does not include the (trivial) change needed to
      send HELLO messages to Sentinel instances as well, since by mistake I
      committed that change in the previous commit, which refactored hello
      message processing into a separate function.
      6b0e36ff
    • A
  13. 05 Mar, 2014 1 commit
    • A
      Sentinel: more aggressive failover start desynchronization. · 1606978a
      antirez committed
      Sentinel needs to avoid split brain conditions due to multiple sentinels
      trying to get voted at the exact same time.
      
      So far some desynchronization was provided by fluctuating server.hz,
      that is the frequency of the timer function call. However the
      desynchronization provided in this way was not enough when using many
      Sentinel instances, especially when a large quorum value is used in
      order to force a greater degree of agreement (more than N/2+1).
      
      It was verified that it was likely to trigger a split brain
      condition, forcing the system to try again after a timeout.
      Usually the system will succeed after a few retries, but this is not
      optimal.
      
      This commit desynchronizes instances in a more effective way to make it
      likely that the first attempt will be successful.
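
      The message does not say how the desynchronization is done. One common
      approach, shown here purely as an assumption, is to add a random delay
      to each Sentinel's failover start time; the 1000 ms bound and the names
      are illustrative.

```python
import random

MAX_DESYNC_MS = 1000  # assumed upper bound for the random delay


def failover_start_time(now_ms, rng):
    """Pick a start time spread over the next MAX_DESYNC_MS milliseconds."""
    return now_ms + rng.randrange(MAX_DESYNC_MS)


def majority(n_sentinels):
    """Smallest number of votes that is more than half: N/2+1."""
    return n_sentinels // 2 + 1
```

      With a spread of this kind, the Sentinel that draws the earliest start
      time tends to collect votes before the others ask for them.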
      1606978a
  14. 25 Feb, 2014 5 commits
  15. 20 Feb, 2014 1 commit
  16. 18 Feb, 2014 2 commits
    • A
      Sentinel: SENTINEL_SLAVE_RECONF_RETRY_PERIOD -> RECONF_TIMEOUT · 905c55d5
      antirez committed
      Rename define to match the new meaning.
      905c55d5
    • A
      Sentinel: fix slave promotion timeout. · 1b345ec3
      antirez committed
      If we can't reconfigure a slave in time during failover, go forward
      anyway, as the slave will be fixed by Sentinels in the future, once they
      detect it is misconfigured.
      
      Otherwise a failover in progress may never terminate if for some reason
      the slave is unable to sync with the master while at the same time
      it is not disconnected.
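
      A sketch of the timeout rule: a slave that has not finished
      reconfiguring within the timeout stops blocking the failover. The
      10 second value and the dictionary layout are assumptions made for the
      example.

```python
RECONF_TIMEOUT_MS = 10000  # assumed value


def pending_slaves(slaves, now_ms):
    """Return the slaves that still block the failover from completing."""
    pending = []
    for s in slaves:
        if s["reconf_done"]:
            continue
        if now_ms - s["reconf_sent_ms"] > RECONF_TIMEOUT_MS:
            s["reconf_done"] = True  # stop waiting; to be fixed later
            continue
        pending.append(s)
    return pending
```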
      1b345ec3
  17. 17 Feb, 2014 1 commit
  18. 07 Feb, 2014 1 commit
  19. 03 Feb, 2014 1 commit
    • A
      Move mstime_t define outside sentinel.c. · ddcf1603
      antirez committed
      The define is now used in other parts of Redis 2.8 tree instead of long
      long.
      
      A nice side effect is that now the 2.8 and unstable sentinel.c files
      are identical, as they should be.
      ddcf1603
  20. 31 Jan, 2014 1 commit
  21. 28 Jan, 2014 1 commit