• B
    scsi: ibmvfc: Set default timeout to avoid crash during migration · ccd27124
    Brian King 提交于
    stable inclusion
    from stable-5.10.14
    commit b45a47e9adfc8fb902acb1f2747e6c0c94af633d
    bugzilla: 48051
    
    --------------------------------
    
    [ Upstream commit 76490729 ]
    
    While testing live partition mobility, we have observed occasional crashes
    of the Linux partition. What we've seen is that during the live migration,
    for specific configurations with large amounts of memory, slow network
    links, and workloads that are changing memory a lot, the partition can end
    up being suspended for 30 seconds or longer. This resulted in the following
    scenario:
    
    CPU 0                          CPU 1
    -------------------------------  ----------------------------------
    scsi_queue_rq                    migration_store
     -> blk_mq_start_request          -> rtas_ibm_suspend_me
      -> blk_add_timer                 -> on_each_cpu(rtas_percpu_suspend_me
                  _______________________________________V
                 |
                 V
        -> IPI from CPU 1
         -> rtas_percpu_suspend_me
                                         -> __rtas_suspend_last_cpu
    
    -- Linux partition suspended for > 30 seconds --
                                          -> for_each_online_cpu(cpu)
                                               plpar_hcall_norets(H_PROD
     -> scsi_dispatch_cmd
                                          -> scsi_times_out
                                           -> scsi_abort_command
                                            -> queue_delayed_work
      -> ibmvfc_queuecommand_lck
       -> ibmvfc_send_event
        -> ibmvfc_send_crq
         - returns H_CLOSED
       <- returns SCSI_MLQUEUE_HOST_BUSY
    -> __blk_mq_requeue_request
    
                                          -> scmd_eh_abort_handler
                                           -> scsi_try_to_abort_cmd
                                             - returns SUCCESS
                                           -> scsi_queue_insert
    
    Normally, the SCMD_STATE_COMPLETE bit would protect against the command
    completion and the timeout, but that doesn't work here, since we don't
    check that at all in the SCSI_MLQUEUE_HOST_BUSY path.
    
    In this case we end up calling scsi_queue_insert on a request that has
    already been queued, or possibly even freed, and we crash.
    
    The patch below simply increases the default I/O timeout to avoid this race
    condition. This is also the timeout value that nearly all IBM SAN storage
    recommends setting as the default value.
    
    Link: https://lore.kernel.org/r/1610463998-19791-1-git-send-email-brking@linux.vnet.ibm.comSigned-off-by: NBrian King <brking@linux.vnet.ibm.com>
    Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: NSasha Levin <sashal@kernel.org>
    Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
    Acked-by: NXie XiuQi <xiexiuqi@huawei.com>
    ccd27124
ibmvfc.c 150.0 KB