    btrfs: fix mount failure due to past and transient device flush error · 09fa6f6e
    Filipe Manana committed
    stable inclusion
    from stable-5.10.72
    commit 63c89930d4b5f6205b4f4f13498a37cee86d80fa
    bugzilla: 182982 https://gitee.com/openeuler/kernel/issues/I4I3L1
    
    Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=63c89930d4b5f6205b4f4f13498a37cee86d80fa
    
    --------------------------------
    
    [ Upstream commit 6b225baa ]
    
    When we get an error flushing one device, during a super block commit, we
    record the error in the device structure, in the field 'last_flush_error'.
    This is used to later check if we should error out the super block commit,
    depending on whether the number of flush errors is greater than or equal
    to the maximum tolerated device failures for a raid profile.
    
    However if we get a transient device flush error, unmount the filesystem
    and later try to mount it, we can fail the mount because we treat that
    past error as critical and consider the device to be missing. Even if it's
    very likely that the error will happen again, as it's probably due to a
    hardware related problem, there may be cases where the error might not
    happen again. One example is during testing, and a test case like the
    new generic/648 from fstests always triggers this. The test cases
    generic/019 and generic/475 also trigger this scenario, but very
    sporadically.
    
    When this happens we get an error like this:
    
      $ mount /dev/sdc /mnt
      mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.
    
      $ dmesg
      (...)
      [12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
      [12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
      [12918.890853] BTRFS error (device sdc): open_ctree failed
    
    The failure happens because when btrfs_check_rw_degradable() is called at
    mount time, or at remount from RO to RW time, it sees a non-zero value in
    a device's ->last_flush_error attribute, and therefore considers that the
    device is 'missing'.
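
    The check described above can be illustrated with a small userspace
    model. This is only a sketch, not the kernel code: struct model_device,
    model_count_missing() and model_rw_degradable() are hypothetical names,
    and the only assumption carried over is that a device with a non-zero
    last_flush_error is counted as missing and compared against the
    maximum tolerated failures, as the dmesg output above suggests.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-device state. */
struct model_device {
    bool present;          /* device was found when opening the filesystem */
    int last_flush_error;  /* non-zero if a flush on this device failed */
};

/* Count devices that are absent or carry a recorded flush error. */
static int model_count_missing(const struct model_device *devs, size_t n)
{
    int missing = 0;

    for (size_t i = 0; i < n; i++) {
        if (!devs[i].present || devs[i].last_flush_error != 0)
            missing++;
    }
    return missing;
}

/* A writable mount is allowed only while missing <= max_tolerated. */
static bool model_rw_degradable(const struct model_device *devs, size_t n,
                                int max_tolerated)
{
    return model_count_missing(devs, n) <= max_tolerated;
}
```

    With a single device, a tolerance of 0 and a stale error value, the
    model refuses the writable mount, which matches the "missing 1
    devices, max tolerance is 0" warning.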
    
    Fix this by setting a device's ->last_flush_error to zero when we close a
    device, making sure the error is not seen on the next mount attempt. We
    only need to track flush errors during the current mount, so that we never
    commit a super block if such errors happened.
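
    The idea of the fix can be sketched in the same spirit. Again this is
    a simplified userspace model with hypothetical names (struct
    model_dev_state and the model_*() helpers), not the actual change to
    volumes.c: the error is recorded while the device is in use and
    cleared when the device is closed, so that a later open starts clean.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-device state. */
struct model_dev_state {
    bool opened;
    int last_flush_error;  /* only meaningful during the current mount */
};

/* Record a flush failure seen during a super block commit. */
static void model_record_flush_error(struct model_dev_state *dev, int err)
{
    dev->last_flush_error = err;
}

/* Closing the device drops the error so it cannot leak into the next mount. */
static void model_close_device(struct model_dev_state *dev)
{
    dev->opened = false;
    dev->last_flush_error = 0;
}

/* Mirrors the mount-time view: a non-zero error means "missing". */
static bool model_treated_as_missing(const struct model_dev_state *dev)
{
    return dev->last_flush_error != 0;
}
```

    In this model, a device that saw a flush error is treated as missing
    only until it is closed, which is the behaviour the fix aims for.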
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
    Signed-off-by: Chen Jun <chenjun102@huawei.com>
    Acked-by: Weilong Chen <chenweilong@huawei.com>
    Signed-off-by: Chen Jun <chenjun102@huawei.com>
    Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
volumes.c 206.0 KB