• A
    ae.c: introduce the concept of read->write barrier. · 548e478e
    antirez 提交于
    AOF fsync=always, and certain Redis Cluster bus operations, require to
    fsync data on disk before replying with an acknowledge.
    In such case, in order to implement Group Commits, we want to be sure
    that queries that are read in a given cycle of the event loop, are never
    served to clients in the same event loop iteration. This way, by using
    the event loop "before sleep" callback, we can fsync the information
    just one time before returning into the event loop for the next cycle.
    This is much more efficient compared to calling fsync() multiple times.
    
    Unfortunately because of a bug, this was not always guaranteed: the
    actual way the events are installed was the sole thing that could
    control. Normally this problem is hard to trigger when AOF is enabled
    with fsync=always, because we try to flush the output buffers to the
    socekt directly in the beforeSleep() function of Redis. However if the
    output buffers are full, we actually install a write event, and in such
    a case, this bug could happen.
    
    This change to ae.c modifies the event loop implementation to make this
    concept explicit. Write events that are registered with:
    
        AE_WRITABLE|AE_BARRIER
    
    Are guaranteed to never fire after the readable event was fired for the
    same file descriptor. In this way we are sure that data is persisted to
    disk before the client performing the operation receives an
    acknowledged.
    
    However note that this semantics does not provide all the guarantees
    that one may believe are automatically provided. Take the example of the
    blocking list operations in Redis.
    
    With AOF and fsync=always we could have:
    
        Client A doing: BLPOP myqueue 0
        Client B doing: RPUSH myqueue a b c
    
    In this scenario, Client A will get the "a" elements immediately after
    the Client B RPUSH will be executed, even before the operation is persisted.
    However when Client B will get the acknowledge, it can be sure that
    "b,c" are already safe on disk inside the list.
    
    What to note here is that it cannot be assumed that Client A receiving
    the element is a guaranteed that the operation succeeded from the point
    of view of Client B.
    
    This is due to the fact that the barrier exists within the same socket,
    and not between different sockets. However in the case above, the
    element "a" was not going to be persisted regardless, so it is a pretty
    synthetic argument.
    548e478e
ae.c 16.4 KB