1. 30 Nov, 2017 1 commit
  2. 14 Jul, 2017 2 commits
    • A
      Fix replication of SLAVEOF inside transaction. · 87aabb1a
      Committed by antirez
      In Redis 4.0 replication, with the introduction of PSYNC2, masters and
      slaves replicate commands to cascading slaves and to the replication
      backlog itself in a different way compared to the past.
      
      Masters actually replicate the effects of client commands.
      Slaves just propagate what they receive from masters.
      
      This mechanism can cause problems when the configuration of an instance
      is changed from master to slave inside a transaction. For instance
      we could send to a master instance the following sequence:
      
          MULTI
          SLAVEOF 127.0.0.1 0
          EXEC
          SLAVEOF NO ONE
      
      Before the fixes in this commit, the MULTI command used to be propagated
      into the replication backlog, however after the SLAVEOF command the
      instance is a slave, so the EXEC implementation failed to also propagate
      the EXEC command. When the slaves of the above instance reconnected,
      they were incrementally synchronized by sending just a "MULTI". This put
      the master client (in the slaves) into MULTI state, breaking the
      replication.
      
      Notably even Redis Sentinel uses the above approach in order to guarantee
      that configuration changes are always performed together with rewrites
      of the configuration and with client disconnection. Sentinel does:
      
          MULTI
          SLAVEOF ...
          CONFIG REWRITE
          CLIENT KILL TYPE normal
          EXEC
      
      So this was a really problematic issue. However, even with the fix in
      this commit, which adds the final EXEC to the replication stream in
      case the instance was switched from master to slave during the
      transaction, the result is to increment the slave replication
      offset, so a subsequent reconnection with the new master will not
      permit a successful partial resynchronization: there is no way the new
      master can provide us with the backlog needed, since we incremented our
      offset to a value that the new master cannot have.
      
      However the EXEC implementation waits to emit the MULTI, so that if the
      commands inside the transaction actually do not need to be replicated,
      no commands propagation happens at all. From multi.c:
      
          if (!must_propagate && !(c->cmd->flags & (CMD_READONLY|CMD_ADMIN))) {
              execCommandPropagateMulti(c);
              must_propagate = 1;
          }
      
      The above code already includes the change made by this commit:
      now ADMIN commands also do not trigger the emission of MULTI. It is actually
      not clear why we do not just check for CMD_WRITE... Probably I wrote it this
      way in order to make the code more reliable: better to over-emit MULTI
      than not emitting it in time.
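
The lazy MULTI emission and the effect of the fix can be modeled with a small Python sketch. This is a toy model, not the code in multi.c: the flag table, the `propagate_transaction` function, and the assumption that any SLAVEOF inside the transaction turns the instance into a slave are all illustrative.

```python
# Toy model of what a master appends to its replication stream while
# executing a MULTI/EXEC block. All names here are hypothetical.
READONLY, ADMIN, WRITE = "readonly", "admin", "write"

# Illustrative flag table; PING carries none of the three flags.
TOY_FLAGS = {
    "GET": {READONLY},
    "SET": {WRITE},
    "SLAVEOF": {ADMIN},
    "PING": set(),
}


def propagate_transaction(commands, fixed):
    """Return the commands the toy instance propagates for one transaction.

    fixed=False models the behaviour before the commit, fixed=True after it.
    """
    is_master = True
    multi_emitted = False
    stream = []
    # Flags that do NOT trigger the lazy emission of MULTI.
    quiet = {READONLY, ADMIN} if fixed else {READONLY}
    for name in commands:
        flags = TOY_FLAGS[name]
        if not multi_emitted and not (flags & quiet):
            stream.append("MULTI")
            multi_emitted = True
        if name == "SLAVEOF":
            is_master = False  # assumption: the instance becomes a slave here
        elif WRITE in flags and is_master:
            stream.append(name)
    if multi_emitted and (fixed or is_master):
        # Before the fix, an instance that was no longer a master skipped EXEC.
        stream.append("EXEC")
    return stream
```

With `fixed=False`, a transaction containing only SLAVEOF yields a lone MULTI, which is the broken stream described above; with `fixed=True` nothing is propagated for it.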
      
      So this commit should indeed fix issue #3836 (verified), however it looks
      like some reconsideration of this code path is needed in the long term.
      
      BONUS POINT: The reverse bug.
      
      Even in a read only slave "B", in a replication setup like:
      
          A -> B -> C
      
      There are commands with neither the READONLY nor the ADMIN flag that are also
      not flagged as WRITE commands. An example is the PING command.
      
      So if we send B the following sequence:
      
          MULTI
          PING
          SLAVEOF NO ONE
          EXEC
      
      The result will be the reverse bug, where only EXEC is emitted, but not the
      previous MULTI. This apparently does not create problems in practice,
      but it is yet another acknowledgment of the fact that some work is needed here
      in order to make this code path less surprising.
      
      Note that there are many different approaches we could follow. For instance
      MULTI/EXEC blocks containing administrative commands may be allowed ONLY
      if all the commands are administrative ones, otherwise they could be
      denied. When allowed, the commands could simply never be replicated at all.
      87aabb1a
    • A
      cfdcd440
  3. 06 Jul, 2017 1 commit
  4. 30 Jun, 2017 1 commit
    • A
      Added GEORADIUS(BYMEMBER)_RO variants for read-only operations. · bdd6de96
      Committed by antirez
      Issue #4084 shows how, due to a design error, GEORADIUS is a write command
      because of the STORE option. Because of this it does not work
      on read-only slaves, gets redirected to masters in Redis Cluster even
      when the connection is in READONLY mode, and so forth.
      
      Breaking backward compatibility at this stage, with Redis 4.0 in an
      advanced RC state, is problematic for the user base. The API can be
      fixed in the unstable branch soon if we decide to do so in order to
      be more consistent, and Redis 5.0 can be released with this
      incompatibility in the future. This is still unclear.
      
      However, the ability to scale GEO queries on slaves easily is too
      important, so this commit adds two read-only variants of the GEORADIUS
      and GEORADIUSBYMEMBER commands: GEORADIUS_RO and GEORADIUSBYMEMBER_RO.
      The commands are exactly like the original commands, but they do not
      accept the STORE and STOREDIST options.
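
As a rough illustration of the difference between the two families of commands, the check below rejects the two options that make the original commands write commands. It is a hypothetical sketch in Python, not the server-side parser: `ro_variant_accepts` is an invented helper, and it only looks at the trailing options, not at the key and coordinates.

```python
# Options that turn GEORADIUS / GEORADIUSBYMEMBER into write commands.
FORBIDDEN_IN_RO = {"STORE", "STOREDIST"}


def ro_variant_accepts(options):
    """Return True if the trailing options are acceptable for a *_RO command.

    `options` is the list of optional arguments that follows the mandatory
    ones (for example ["WITHDIST", "COUNT", 10]).
    """
    return not any(str(opt).upper() in FORBIDDEN_IN_RO for opt in options)
```

A client library could use such a check to decide whether a query can be routed to a read-only slave with the `_RO` variant or must go to the master.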
      bdd6de96
  5. 27 Jun, 2017 1 commit
    • A
      RDB modules values serialization format version 2. · 5af0fc0c
      Committed by antirez
      The original RDB serialization format was not parsable without the
      module loaded, because the structure was managed only by the module
      itself. Moreover, RDB is a streaming protocol in the sense that it is
      both produced in an append-only fashion and sometimes sent directly
      to the socket (in the case of diskless replication).
      
      The fact that modules values cannot be parsed without the relevant
      module loaded is a problem in many ways: RDB checking tools must have
      loaded modules even for doing things not involving the value at all,
      like splitting an RDB into N RDBs by key or alike, or just checking the
      RDB for sanity.
      
      In theory module values could be just a blob of data with a prefixed
      length in order for us to be able to skip it. However prefixing the values
      with a length would mean one of the following:
      
      1. To be able to write some data at a previous offset. This breaks
      streaming.
      2. To buffer values before outputting them. This hurts performance.
      3. To have some chunked RDB output format. This breaks simplicity.
      
      Moreover, the above solution still makes module values a totally opaque
      matter, with the following problems:
      
      1. The RDB check tool can just skip the value without being able to at
      least check the general structure. For datasets composed mostly of
      module values this means checking just the outer level of the RDB without
      actually doing any check on most of the data itself.
      2. It is not possible to do any recovery or processing of data for which a
      module no longer exists in the future, or is unknown.
      
      So this commit implements a different solution. The modules RDB
      serialization API is composed of well-defined calls to store integers,
      floats, doubles or strings. After this commit, the parts generated by
      the module API have a one-byte prefix for each of the above emitted
      parts, and there is a final EOF byte as well. So even if we don't know
      exactly how to interpret a module value, we can always parse it at a
      high level, check the overall structure, understand the types used to
      store the information, and easily skip the whole value.
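
The idea of type-prefixed parts closed by an EOF byte can be sketched in Python as follows. This is not the actual RDB encoding: the prefix values, the fixed-width integer and length fields, and the function names are assumptions made for the example. What it shows is that a reader can walk and skip a value without knowing what the module meant by it.

```python
import struct

# Illustrative one-byte type prefixes; the values are assumptions of this sketch.
OP_EOF, OP_INT, OP_DOUBLE, OP_STRING = 0, 1, 2, 3


def encode_module_value(parts):
    """Serialize ints, floats and bytes, prefixing each part with its type."""
    out = bytearray()
    for part in parts:
        if isinstance(part, int):
            out += bytes([OP_INT]) + struct.pack("<q", part)
        elif isinstance(part, float):
            out += bytes([OP_DOUBLE]) + struct.pack("<d", part)
        elif isinstance(part, bytes):
            out += bytes([OP_STRING]) + struct.pack("<I", len(part)) + part
        else:
            raise TypeError("unsupported part: %r" % (part,))
    out.append(OP_EOF)  # final marker closing the value
    return bytes(out)


def skip_module_value(buf, pos=0):
    """Walk one value without interpreting it.

    Returns (position right after the EOF byte, list of part types seen).
    """
    seen = []
    while True:
        op = buf[pos]
        pos += 1
        if op == OP_EOF:
            return pos, seen
        if op in (OP_INT, OP_DOUBLE):
            pos += 8
        elif op == OP_STRING:
            (length,) = struct.unpack_from("<I", buf, pos)
            pos += 4 + length
        else:
            raise ValueError("corrupt module value: unknown prefix %d" % op)
        seen.append(op)
```

A checking tool built this way can validate the structure of every module value and resume parsing at the next key, even when the module itself is not loaded.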
      
      The change is backward compatible: older RDB files can still be loaded
      since the new encoding has a new RDB type: MODULE_2 (of value 7).
      The commit also implements the ability to check RDB files for sanity
      taking advantage of the new feature.
      5af0fc0c
  6. 22 Jun, 2017 1 commit
  7. 15 Jun, 2017 1 commit
  8. 11 May, 2017 5 commits
    • A
      Modules TSC: use atomic var for server.unixtime. · 6b21cebd
      Committed by antirez
      This avoids Helgrind complaining, but we are actually not using
      atomicGet() to get the unixtime value for now: there are too many places
      where it is used, and given that time_t is word-sized it should be safe on
      all the archs we support as it is.
      
      On the other hand, Helgrind, when Redis is compiled with "make helgrind"
      in order to force the __sync macros, will detect the write in
      updateCachedTime() as a read (because atomic functions are used) and
      will not complain about races.
      
      This commit also includes minor refactoring of mutex initializations and
      a "helgrind" target in the Makefile.
      6b21cebd
    • A
      Modules TSC: Add mutex for server.lruclock. · b338f2b9
      Committed by antirez
      Only useful for when no atomic builtins are available.
      b338f2b9
    • A
      Modules TSC: Improve inter-thread synchronization. · 7e9c658d
      Committed by antirez
      More work to do with server.unixtime and similar. Need to write a Helgrind
      suppression file in order to suppress the false positives.
      7e9c658d
    • A
      Modules TSC: Release the GIL for all the time we are blocked. · c4b88495
      Committed by antirez
      Instead of giving the module background operations just a small time to
      run in the beforeSleep() function, we can have the lock released for all
      the time we are blocked in the multiplexing syscall.
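
The pattern of holding a global lock while processing and releasing it only while blocked can be sketched with Python threads. Everything here is a stand-in: `gil` plays the role of the modules lock, and `blocked_in_multiplexing_call` replaces the real multiplexing syscall with a wait on an event so that the example is deterministic.

```python
import threading

gil = threading.Lock()          # stands in for the modules global lock
work_done = threading.Event()
log = []


def module_background_job():
    """A module thread: it touches shared state only while holding the lock."""
    with gil:
        log.append("background job ran")
    work_done.set()


def blocked_in_multiplexing_call(timeout):
    """Stand-in for the blocking syscall; here it just waits for the job."""
    return work_done.wait(timeout)


def event_loop_iteration(timeout=1.0):
    """Called with the lock held; the lock is released only while blocked."""
    gil.release()
    try:
        return blocked_in_multiplexing_call(timeout)
    finally:
        gil.acquire()
```

The main thread is expected to acquire `gil` once at startup; background threads then make progress only during the window in which the loop is blocked.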
      c4b88495
    • A
      Modules TSC: GIL and cooperative multi tasking setup. · 74f3a843
      Committed by antirez
      74f3a843
  9. 20 Apr, 2017 1 commit
    • A
      Fix PSYNC2 incomplete command bug as described in #3899. · a91cc5bc
      Committed by antirez
      This bug was discovered by @kevinmcgehee and constituted a major hidden
      bug in the PSYNC2 implementation, caused by the propagation from the
      master of incomplete commands to slaves.
      
      The bug had several results:
      
      1. Borrowing from Kevin's text in the issue: "Given that slaves blindly
      copy over their master's input into their own replication backlog over
      successive read syscalls, it's possible that with large commands or
      small TCP buffers, partial commands are present in this buffer. If the
      master were to fail before successfully propagating the entire command
      to a slave, the slaves will never execute the partial command (since the
      client is invalidated) but will copy it to replication backlog which may
      relay those invalid bytes to its slaves on PSYNC2, corrupting the
      backlog and possibly other valid commands that follow the failover.
      Simple command boundaries aren't sufficient to capture this, either,
      because in the case of a MULTI/EXEC block, if the master successfully
      propagates a subset of the commands but not the EXEC, then the
      transaction in the backlog becomes corrupt and could corrupt other
      slaves that consume this data."
      
      2. As identified by @yangsiran later, there is another effect of the
      bug. Through the same mechanism as the first problem, a slave that has
      another slave could receive a full resynchronization request with an already
      half-applied command in the backlog. Once the RDB is ready, it will be
      sent to the slave, and the replication will continue sending to the
      sub-slave the other half of the command, which is not valid.
      
      The fix, designed by @yangsiran and @antirez, and implemented by
      @antirez, uses a secondary buffer in order to feed the sub-masters and
      update the replication backlog and offsets, only when a given part of
      the query buffer is actually *applied* to the state of the instance,
      that is, when the command gets processed and the command is not pending
      in the Redis transaction buffer because of CLIENT_MULTI state.
      
      Given that now the backlog and offsets representation are in agreement
      with the actual processed commands, both issue 1 and 2 should no longer
      be possible.
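
The principle of the fix, feeding the backlog only with bytes that were actually applied, can be illustrated with a toy slave in Python. The sketch assumes newline-terminated inline commands instead of the real protocol, and `ToySlave` with its attributes is a hypothetical model rather than the Redis data structures.

```python
class ToySlave:
    """Toy slave that copies to its backlog only fully applied commands."""

    def __init__(self):
        self.querybuf = b""   # raw bytes received from the master
        self.pending = b""    # complete commands not yet applied (inside MULTI)
        self.in_multi = False
        self.backlog = b""    # what sub-slaves are allowed to see

    def feed(self, data):
        """Receive bytes from the master, possibly cutting a command in half."""
        self.querybuf += data
        while b"\n" in self.querybuf:
            line, self.querybuf = self.querybuf.split(b"\n", 1)
            self.pending += line + b"\n"
            name = line.strip().upper()
            if name == b"MULTI":
                self.in_multi = True
            elif name == b"EXEC":
                self.in_multi = False
            if not self.in_multi:
                # The command (or the whole transaction) has been applied:
                # only now do the bytes reach the backlog.
                self.backlog += self.pending
                self.pending = b""
```

A half-received command stays in `querybuf`, and a transaction without its EXEC stays in `pending`, so neither can leak to sub-slaves if the master disappears.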
      
      Thanks to @kevinmcgehee, @yangsiran and @oranagra for their work in
      identifying and designing a fix for this problem.
      a91cc5bc
  10. 18 Apr, 2017 2 commits
    • A
      Fix modules blocking commands awake delay. · f60d6f09
      Committed by antirez
      If a thread unblocks a client blocked in a module command, by using the
      RedisModule_UnblockClient() API, the event loop may not be awakened until
      the next timeout of the multiplexing API or the next unrelated I/O
      operation on other clients. We actually want the client to be served
      ASAP, so a mechanism is needed in order for the unblocking API to inform
      Redis that there is a client to serve ASAP.
      
      This commit fixes the issue using the old trick of the pipe: when a
      client needs to be unblocked, a byte is written in a pipe. When we run
      the list of clients blocked in modules, we consume all the bytes
      written in the pipe. Writes and reads are performed inside the context
      of the mutex, so no race is possible in which we consume the bytes that
      are actually related to an awake request for a client that should still
      be put into the list of clients to unblock.
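
The pipe trick described above can be sketched in Python on a POSIX system. The `Waker` class and its method names are invented for the example; the point is that the write to the pipe and the drain of the pipe both happen under the same mutex that protects the list of clients to unblock.

```python
import os
import select
import threading


class Waker:
    """Toy version of the wake-up pipe used to unblock clients promptly."""

    def __init__(self):
        self.read_fd, self.write_fd = os.pipe()
        os.set_blocking(self.read_fd, False)
        self.lock = threading.Lock()
        self.ready = []  # clients waiting to be unblocked

    def unblock(self, client):
        """Called from any thread: queue the client and wake the event loop."""
        with self.lock:
            self.ready.append(client)
            os.write(self.write_fd, b"x")

    def wait(self, timeout):
        """Stand-in for the multiplexing call: True if a wake-up is pending."""
        readable, _, _ = select.select([self.read_fd], [], [], timeout)
        return bool(readable)

    def handle_unblocked(self):
        """Drain the pipe and take the whole list, both under the mutex."""
        with self.lock:
            try:
                while os.read(self.read_fd, 64):
                    pass
            except BlockingIOError:
                pass  # pipe is empty
            clients, self.ready = self.ready, []
        return clients
```

Because the byte is written while the lock is held, a drained pipe always corresponds to a list that has been fully taken, so no wake-up request is lost.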
      
      It was verified that after the fix the server handles the blocked
      clients with the expected short delay.
      
      Thanks to @dvirsky for understanding there was such a problem and
      reporting it.
      f60d6f09
    • A
      Cluster: hash slots tracking using a radix tree. · c4716d33
      Committed by antirez
      c4716d33
  11. 22 Feb, 2017 1 commit
    • A
      Use SipHash hash function to mitigate HashDoS attempts. · ba647598
      Committed by antirez
      This change attempts to switch to a hash function which mitigates
      the effects of the HashDoS attack (denial of service attack trying
      to force data structures to worst case behavior) while at the same time
      providing Redis with a hash function that does not expect the input
      data to be word aligned, a condition no longer true now that sds.c
      strings have a variable length header.
      
      Note that even when using a hash function
      for which collisions cannot be generated without knowing the seed,
      special implementation details or the exposure of the seed in an
      indirect way (for example the ability to add elements to a Set and
      check the order in which Redis returns them with SMEMBERS) may
      make the attacker's life simpler in the process of trying to guess
      the correct seed. However, the next step would be to switch to a
      log(N) data structure when too many items in a single bucket are
      detected: this seems like overkill in the case of Redis.
      
      SPEED REGRESSION TESTS:
      
      In order to check whether switching from MurmurHash to SipHash had
      an impact on speed, a set of benchmarks involving fast insertion
      of 5 million keys was performed.
      
      The result shows Redis with SipHash in high pipelining conditions
      to be about 4% slower compared to using the previous hash function.
      However this could partially be related to the fact that the current
      implementation does not attempt to hash whole words at a time but
      reads single bytes, in order to have an output which is endian-neutral
      and at the same time working on systems where unaligned memory accesses
      are a problem.
      
      Further x86-specific optimizations should be tested; the function
      may easily reach the same level as MurmurHash2 if a few optimizations
      are performed.
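
For readers unfamiliar with the function, here is a compact Python sketch of seeded SipHash. It uses the reference SipHash-2-4 parameters (2 compression rounds, 4 finalization rounds), which is an assumption of this example since the text above does not say which round counts the Redis implementation uses. Words are assembled from single bytes in little-endian order, which is what makes the result independent of host endianness and alignment.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def _rotl(x, bits):
    return ((x << bits) | (x >> (64 - bits))) & MASK64


def _sipround(v0, v1, v2, v3):
    v0 = (v0 + v1) & MASK64
    v1 = _rotl(v1, 13) ^ v0
    v0 = _rotl(v0, 32)
    v2 = (v2 + v3) & MASK64
    v3 = _rotl(v3, 16) ^ v2
    v0 = (v0 + v3) & MASK64
    v3 = _rotl(v3, 21) ^ v0
    v2 = (v2 + v1) & MASK64
    v1 = _rotl(v1, 17) ^ v2
    v2 = _rotl(v2, 32)
    return v0, v1, v2, v3


def siphash_2_4(key, data):
    """Hash `data` (bytes) with a 16-byte secret `key`; returns a 64-bit int."""
    k0 = int.from_bytes(key[:8], "little")
    k1 = int.from_bytes(key[8:16], "little")
    v0 = k0 ^ 0x736F6D6570736575
    v1 = k1 ^ 0x646F72616E646F6D
    v2 = k0 ^ 0x6C7967656E657261
    v3 = k1 ^ 0x7465646279746573
    full = len(data) - len(data) % 8
    for i in range(0, full, 8):
        # Build each word byte by byte instead of reading memory as a word.
        m = int.from_bytes(data[i:i + 8], "little")
        v3 ^= m
        for _ in range(2):
            v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
        v0 ^= m
    # Last block: remaining bytes plus the input length in the top byte.
    last = int.from_bytes(data[full:], "little") | ((len(data) & 0xFF) << 56)
    v3 ^= last
    for _ in range(2):
        v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
    v0 ^= last
    v2 ^= 0xFF
    for _ in range(4):
        v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
    return v0 ^ v1 ^ v2 ^ v3
```

In a hash table, the 16-byte key would be a random seed chosen at startup, so an attacker who does not know it cannot precompute colliding inputs.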
      ba647598
  12. 27 Jan, 2017 1 commit
    • A
      serverPanic(): allow printf() alike formatting. · dc83ddf0
      Committed by antirez
      This is of great interest because it allows us to print
      information that could be useful when debugging, like in the
      following example:
      
          serverPanic("Unexpected encoding for object %d, %d",
              obj->type, obj->encoding);
      dc83ddf0
  13. 13 Jan, 2017 1 commit
  14. 12 Jan, 2017 2 commits
  15. 20 Dec, 2016 1 commit
    • A
      Only show Redis logo if logging to stdout / TTY. · 3334a409
      Committed by antirez
      You can still force the logo in the normal logs.
      For motivations, check issue #3112. For me the reason is that
      the logo is nice to have in interactive sessions, but inside the logs
      it loses most of its usefulness, except for letting users recognize
      restarts easily: for this reason the new startup sequence shows a
      one-line ASCII "wave" so that there is still a bit of a visual clue.
      
      Startup logging was modified in order to log events in more obvious
      ways, and to log more events. Also, certain important information is
      now easier to parse/grep since it is printed in field=value style.
      
      The option --always-show-logo in redis.conf was added, defaulting to no.
      3334a409
  16. 16 Dec, 2016 3 commits
  17. 14 Dec, 2016 3 commits
    • A
      Writable slaves expires: fix leak in key tracking. · 9a8bc6d2
      Committed by antirez
      We need to use a dictionary type that frees the key, since we copy the
      keys in the dictionary we use to track expires created in the slave
      side.
      9a8bc6d2
    • A
      INFO: show num of slave-expires keys tracked. · 746d70b0
      Committed by antirez
      746d70b0
    • A
      Replication: fix the infamous key leakage of writable slaves + EXPIRE. · c65dfb43
      Committed by antirez
      BACKGROUND AND USE CASE
      
      Redis slaves are normally read-only, however they support a "writable"
      mode which is very handy when scaling reads on slaves that actually
      need write operations in order to access data. For instance, imagine
      having slaves replicating certain Set keys from the master. When
      accessing the data on the slave, we want to perform intersections between
      such Set values. However we don't want to intersect each time: caching
      the intersection for some time is often a good idea.
      
      To do so, it is possible to set up a slave as a writable slave, and
      perform the intersection on the slave side, perhaps setting a TTL on the
      resulting key so that it will expire after some time.
      
      THE BUG
      
      Problem: in order to have consistent replication, expiring keys in
      Redis replication is up to the master, which synthesizes DEL operations to
      send in the replication stream. However slaves logically expire keys
      by hiding them from read attempts from clients, so that if the master did
      not promptly send a DEL, clients still see logically expired keys
      as non-existing.
      
      Because slaves don't actively expire keys by actually evicting them but
      just mask them from the POV of read operations, if a key is created in a
      writable slave, and an expire is set, the key will be leaked forever:
      
      1. No DEL will be received from the master, which does not know about
      such a key at all.
      
      2. No eviction will be performed by the slave, since it needs to disable
      eviction because it's up to masters, otherwise consistency of data is
      lost.
      
      THE FIX
      
      In order to fix the problem, the slave should be able to tag keys that
      were created in the slave side and have an expire set in some way.
      
      My solution involved using a unique additional dictionary created by
      the writable slave only if needed. The dictionary is obviously keyed by
      the key name that we need to track: all the keys that are set with an
      expire directly by a client writing to the slave are tracked.
      
      The value in the dictionary is a bitmap of all the DBs where such a key
      name needs to be tracked, so that we can use a single dictionary to track
      keys in all the DBs used by the slave (actually this limits the solution
      to the first 64 DBs, but the default with Redis is to use 16 DBs).
      
      This solution incurs only a small complexity and CPU penalty,
      which is actually zero when the feature is not used. The slave-side
      eviction is encapsulated in code which is not coupled with the rest of
      the Redis core, except for the hook to track the keys.
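
The tracking structure described above (one lazily created dictionary, keyed by key name, whose value is a bitmap of DB ids) can be sketched in Python. `SlaveExpireTracker`, its methods, and the layout of the toy databases are assumptions made for illustration, not the Redis implementation.

```python
MAX_TRACKED_DBS = 64  # one bit per DB in the bitmap


class SlaveExpireTracker:
    """Toy model: key name -> bitmap of DB ids where the key must be tracked."""

    def __init__(self):
        self.tracked = None  # created only when first needed

    def remember(self, dbid, key):
        """Track a key that a client set with an expire on the slave."""
        if dbid >= MAX_TRACKED_DBS:
            return False  # beyond the bitmap: not tracked in this sketch
        if self.tracked is None:
            self.tracked = {}
        self.tracked[key] = self.tracked.get(key, 0) | (1 << dbid)
        return True

    def expire_cycle(self, databases, now):
        """Delete tracked keys whose expire time has passed.

        databases[dbid] maps key -> (value, expire_at or None).
        Returns the number of keys deleted.
        """
        deleted = 0
        for key in list(self.tracked or {}):
            bitmap = self.tracked[key]
            for dbid in range(MAX_TRACKED_DBS):
                if not bitmap & (1 << dbid):
                    continue
                entry = databases[dbid].get(key)
                if entry is not None and entry[1] is not None and entry[1] > now:
                    continue  # still alive: keep tracking it
                if entry is not None and entry[1] is not None:
                    del databases[dbid][key]
                    deleted += 1
                bitmap &= ~(1 << dbid)  # gone, expired, or no longer volatile
            if bitmap:
                self.tracked[key] = bitmap
            else:
                del self.tracked[key]
        return deleted
```

A slave that never receives a write with an expire never allocates the dictionary, which matches the goal of paying nothing when the feature is unused.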
      
      TODO
      
      I'm doing the first smoke tests to see if the feature works as expected:
      so far so good. Unit tests should be added before merging into the
      4.0 branch.
      c65dfb43
  18. 30 Nov, 2016 1 commit
  19. 10 Nov, 2016 1 commit
    • A
      PSYNC2: Save replication ID/offset on RDB file. · 28c96d73
      Committed by antirez
      This means that stopping a slave and restarting it will still make it
      able to PSYNC with the master. Moreover the master itself will retain
      its ID/offset, in case it gets turned into a slave, or if a slave
      tries to PSYNC with it with an exactly updated offset (otherwise there is
      no backlog).
      
      This change was possible thanks to PSYNC v2 that makes saving the current
      replication state much simpler.
      28c96d73
  20. 09 Nov, 2016 1 commit
    • A
      PSYNC2: different improvements to Redis replication. · 2669fb83
      Committed by antirez
      The gist of the changes is that now partial resynchronizations between
      slaves and masters (without the need of a full resync with RDB transfer
      and so forth) work in a number of cases where they were impossible
      in the past. For instance:
      
      1. When a slave is promoted to master, the slaves of the old master can
      partially resynchronize with the new master.
      
      2. Chained slaves (slaves of slaves) can be moved to replicate from other
      slaves or the master itself, without requiring a full resync.
      
      3. The master itself, after being turned into a slave, is able to
      partially resynchronize with the new master, when it joins replication
      again.
      
      In order to obtain this, the following main changes were made:
      
      * Slaves also keep a replication backlog, not just masters.
      
      * Same stream replication for all the slaves and sub slaves. The
      replication stream is identical from the top level master to its slaves
      and is also the same from the slaves to their sub-slaves and so forth.
      This means that if a slave is later promoted to master, it has the
      same replication backlog, and can partially resynchronize with its
      slaves (that were previously slaves of the old master).
      
      * A given replication history is no longer identified by the `runid` of
      a Redis node. There is instead a `replication ID` which changes every
      time the instance has a new history no longer coherent with the past
      one. So, for example, slaves publish the same replication history of
      their master, however when they are turned into masters, they publish
      a new replication ID, but still remember the old ID, so that they are
      able to partially resynchronize with slaves of the old master (up to a
      given offset).
      
      * The replication protocol was slightly modified so that a new extended
      +CONTINUE reply from the master is able to inform the slave of a
      replication ID change.
      
      * REPLCONF CAPA is used in order to notify masters that a slave is able
      to understand the new +CONTINUE reply.
      
      * The RDB file was extended with an auxiliary field that is able to
      select a given DB after loading in the slave, so that the slave can
      continue receiving the replication stream from the point it was
      disconnected without requiring the master to insert "SELECT" statements.
      This is useful in order to guarantee the "same stream" property, because
      the slave must be able to accumulate an identical backlog.
      
      * Slave pings to sub-slaves are now sent in a special form, when the
      top-level master is disconnected, in order not to interfere with the
      replication stream. We just use out of band "\n" bytes as in other parts
      of the Redis protocol.
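
The role of the two replication IDs in deciding whether a partial resynchronization is possible can be sketched in Python. This is a simplified model under stated assumptions: the `ReplState` class, its field names, and the convention that the requested offset is the next byte the slave wants are all illustrative and not taken from the text above.

```python
from dataclasses import dataclass


@dataclass
class ReplState:
    replid: str                # current replication ID
    replid2: str               # previous ID, "" when there is none
    second_replid_offset: int  # highest offset still valid for replid2
    backlog_start: int         # offset of the first byte kept in the backlog
    backlog_len: int           # number of bytes kept in the backlog

    def promote(self, new_replid, current_offset):
        """Become a master: start a new history but remember the old one."""
        self.replid2 = self.replid
        self.second_replid_offset = current_offset + 1
        self.replid = new_replid

    def can_continue(self, wanted_replid, wanted_offset):
        """True if a partial resync can be served.

        wanted_offset is the next byte the slave asks for.
        """
        same_history = wanted_replid == self.replid or (
            wanted_replid == self.replid2
            and wanted_offset <= self.second_replid_offset
        )
        backlog_end = self.backlog_start + self.backlog_len
        in_backlog = self.backlog_start <= wanted_offset <= backlog_end
        return same_history and in_backlog
```

After `promote`, a former sibling slave that presents the old ID is accepted only up to the offset at which the histories diverged; anything past that point belongs to the new ID.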
      
      An old design document is available here:
      
      https://gist.github.com/antirez/ae068f95c0d084891305
      
      However the implementation is not identical to the description because
      during the work to implement it, different changes were needed in order
      to make things work well.
      2669fb83
  21. 14 Oct, 2016 1 commit
    • A
      SWAPDB command. · c7a4e694
      Committed by antirez
      This new command swaps two Redis databases, so that immediately all the
      clients connected to a given DB will see the data of the other DB, and
      the other way around. Example:
      
          SWAPDB 0 1
      
      This will swap DB 0 with DB 1. All the clients connected with DB 0 will
      immediately see the new data, exactly like all the clients connected
      with DB 1 will see the data that was formerly of DB 0.
      
      MOTIVATION AND HISTORY
      ---
      
      The command was recently requested by Pedro Melo, but was suggested in
      the past multiple times, and always refused by me.
      
      The reason why it was asked: Imagine you have clients operating in DB 0.
      At the same time, you create a new version of the dataset in DB 1.
      When the new version of the dataset is available, you immediately want
      to swap the two views, so that the clients will transparently use the
      new version of the data. At the same time you'll likely destroy the
      DB 1 dataset (that contains the old data) and start to build a new
      version, to repeat the process.
      
      This is an interesting pattern, but the reason why I always opposed
      implementing this was that FLUSHDB was a blocking command in Redis before
      the Redis 4.0 improvements. Now we have FLUSHDB ASYNC, which releases the
      old data in O(1) from the point of view of the client, and reclaims memory
      incrementally in a different thread.
      
      At this point, the pattern can really be supported without latency
      spikes, so I'm providing this implementation for the users to comment.
      If a very compelling argument is made against this new command,
      it may be removed.
      
      BEHAVIOR WITH BLOCKING OPERATIONS
      ---
      
      If a client is blocking for a list in a given DB, after the swap it will
      still be blocked in the same DB ID, since this is the most logical thing
      to do: if I was blocked for a list push to list "foo", even after the
      swap I still want an LPUSH to reach the key "foo" in the same DB in order
      to unblock.
      
      However an interesting thing happens when a client is, for instance,
      blocked waiting for new elements in list "foo" of DB 0, and then DBs
      0 and 1 are swapped with SWAPDB while DB 1 happens to have
      a list called "foo" containing elements. When this happens, this
      implementation can correctly unblock the client.
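
The swap semantics and the blocked-client corner case can be modeled in a few lines of Python. `ToyServer` is a hypothetical model in which databases are plain dicts and blocked clients are recorded as tuples; it only illustrates the behavior described above.

```python
class ToyServer:
    """Toy model of SWAPDB; databases are dicts, lists are Python lists."""

    def __init__(self, num_dbs=16):
        self.dbs = [{} for _ in range(num_dbs)]
        self.blocked = []  # (client name, db id, key) tuples

    def block_on_list(self, client, dbid, key):
        """Record a client blocked waiting for elements in a list."""
        self.blocked.append((client, dbid, key))

    def swapdb(self, a, b):
        """Swap two databases and return the clients that can be unblocked."""
        self.dbs[a], self.dbs[b] = self.dbs[b], self.dbs[a]
        ready, still_blocked = [], []
        for client, dbid, key in self.blocked:
            # Clients stay attached to the same DB id; only the data moved.
            if dbid in (a, b) and self.dbs[dbid].get(key):
                ready.append(client)
            else:
                still_blocked.append((client, dbid, key))
        self.blocked = still_blocked
        return ready
```

A client blocked on "foo" in DB 0 is returned by `swapdb(0, 1)` when DB 1 held a non-empty list "foo", because after the swap that list is the one visible under DB 0.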
      
      It is possible that there are subtle corner cases that are not covered
      in the implementation, but since the command is self-contained from the
      POV of the implementation and the Redis core, it cannot cause anything
      bad if not used.
      
      Tests and documentation are yet to be provided.
      c7a4e694
  22. 07 Oct, 2016 1 commit
  23. 06 Oct, 2016 2 commits
    • A
      Fix name of misspelled function. · 799208de
      Committed by antirez
      799208de
    • A
      Module: Ability to get context from IO context. · 152c1b68
      Committed by antirez
      It was noted by @dvirsky that it is not possible to use string functions
      when writing the AOF file. This is sometimes critical, since the command
      rewriting may need to be built in the context of the AOF callback, and
      without access to the context, and given the limited types that the AOF
      production functions accept, this can be an issue.
      
      Moreover there are other needs that we can't anticipate regarding the
      ability to use Redis Modules APIs using the context in order to build
      representations to emit AOF / RDB.
      
      Because of this a new API was added that allows the user to get a
      temporary context from the IO context. If obtained, the context is
      automatically released when the RDB / AOF callback returns.
      
      Calling the function to get the context multiple times always returns
      the same one, since it is invalid to have more than a single context.
      152c1b68
  24. 19 Sep, 2016 1 commit
  25. 16 Sep, 2016 3 commits
  26. 15 Sep, 2016 1 commit