• S
    RDMA/ucma: Discard events for IDs not yet claimed by user space · c6b21824
    Sean Hefty 提交于
    Problem reported by Avneesh Pant <avneesh.pant@oracle.com>:
    
        It looks like we are triggering a bug in RDMA CM/UCM interaction.
        The bug specifically hits when we have an incoming connection
        request and the connecting process dies BEFORE the passive end of
        the connection can process the request i.e. it does not call
        rdma_get_cm_event() to retrieve the initial connection event.  We
        were able to triage this further and have some additional
        information now.
    
        In the example below when P1 dies after issuing a connect request
        as the CM id is being destroyed all outstanding connects (to P2)
        are sent a reject message. We see this reject message being
        received on the passive end and the appropriate CM ID created for
        the initial connection message being retrieved in cm_match_req().
        The problem is in the ucma_event_handler() code when this reject
        message is delivered to it and the initial connect message itself
        HAS NOT been delivered to the client. In fact the client has not
        even called rdma_cm_get_event() at this stage so we haven't
        allocated a new ctx in ucma_get_event() and updated the new
        connection CM_ID to point to the new UCMA context.
    
        This results in the reject message not being dropped in
        ucma_event_handler() for the new connection request as the
        (if (!ctx->uid)) block is skipped since the ctx it refers to is
        the listen CM id context which does have a valid UID associated
        with it (I believe the new CMID for the connection initially
        uses the listen CMID -> context when it is created in
        cma_new_conn_id). Thus the assumption that new events for a
        connection can get dropped in ucma_event_handler() is incorrect
        IF the initial connect request has not been retrieved in the
        first case. We end up getting a CM Reject event on the listen CM
        ID and our upper layer code asserts (in fact this event does not
        even have the listen_id set as that only gets set up librdmacm
        for connect requests).
    
    The solution is to verify that the cm_id being reported in the event
    is the same as the cm_id referenced by the ucma context.  A mismatch
    indicates that the ucma context corresponds to the listen.  This fix
    was validated by using a modified version of librdmacm that was able
    to verify the problem and see that the reject message was indeed
    dropped after this patch was applied.
    Signed-off-by: NSean Hefty <sean.hefty@intel.com>
    Signed-off-by: NRoland Dreier <roland@purestorage.com>
    c6b21824
ucma.c 38.5 KB