    xfs: always drain dio before extending aio write submission · 3136e8bb
    Committed by Brian Foster
    XFS supports and typically allows concurrent asynchronous direct I/O
    submission to a single file. One exception to the rule is that file
    extending dio writes that start beyond the current EOF (i.e., writes that
    potentially create a hole at EOF) require exclusive I/O access to the
    file. This is because such writes must zero any pre-existing blocks
    beyond EOF that are exposed by virtue of now residing within EOF as a
    result of the write about to be submitted.
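
    As a rough illustration of the range involved, the sketch below computes
    the byte span that has to read back as zeroes once a write starting at
    `offset` lands beyond a stable EOF. It is a userspace model only; the
    struct and function names are hypothetical and are not the kernel's.

```c
#include <stdint.h>

/* Hypothetical helper type: a byte range [start, start + len). */
struct zero_range {
    int64_t start;
    int64_t len;
};

/*
 * Range that must be zeroed before a write at `offset` extends the file.
 * Assumes `isize` is already stable (no in-flight dio that could change it).
 * A write starting at or before EOF exposes nothing, so len is 0.
 */
static struct zero_range eof_zero_range(int64_t isize, int64_t offset)
{
    struct zero_range r = { isize, 0 };

    if (offset > isize)
        r.len = offset - isize;
    return r;
}
```

    The result is only meaningful if `isize` cannot change underneath the
    caller, which is why the size has to be stabilized first.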
    
    Before EOF zeroing can occur, the current file i_size must be stabilized
    to avoid data corruption. In this scenario, XFS upgrades the iolock to
    exclude any further I/O submission, waits on in-flight I/O to complete
    to ensure i_size is up to date (i_size is updated on dio write
    completion) and restarts the various checks against the state of the
    file. The problem is that this protection sequence is triggered only
    when the iolock is currently held shared. While this is true for async
    dio in most cases, the caller may upgrade the lock in advance based on
    arbitrary circumstances with respect to EOF zeroing. For example, the
    iolock is always acquired exclusively if the start offset is not block
    aligned. This means that even though the iolock is already held
    exclusive for such I/Os, pending I/O is not drained and thus EOF zeroing
    can occur based on an unstable i_size.
    
    This problem has been reproduced as guest data corruption in virtual
    machines with file-backed qcow2 virtual disks hosted on an XFS
    filesystem. The virtual disks must be configured with aio=native mode
    and must not be truncated out to the maximum file size (as some virt
    managers will do).
    
    Update xfs_file_aio_write_checks() to unconditionally drain in-flight
    dio before EOF zeroing can occur. Rather than trigger the wait based on
    iolock state, use a new flag and upgrade the iolock when necessary. Note
    that this results in a full restart of the inode checks even when the
    iolock was already held exclusive when technically it is only required
    to recheck i_size. This should be a rare enough occurrence that it is
    preferable to keep the code simple rather than create an alternate
    restart jump target.
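
    The control flow described above can be sketched as a small userspace
    model. This is not the kernel code: the types, the `model_` functions and
    the `drain_always` switch are hypothetical, and waiting for in-flight dio
    is simulated by applying a pending size to i_size. With `drain_always`
    false the drain happens only on a shared-to-exclusive upgrade (the old
    behavior); with it true a one-shot flag forces the drain regardless of
    lock state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the iolock modes. */
enum iolock_mode { IOLOCK_SHARED, IOLOCK_EXCL };

struct model_inode {
    int64_t i_size;        /* current in-core file size */
    int     inflight_dio;  /* dio writes submitted but not yet completed */
    int64_t pending_size;  /* i_size after in-flight dio completes */
};

/* Simulate waiting for in-flight dio: completion may push i_size out. */
static void model_dio_wait(struct model_inode *ip)
{
    if (ip->inflight_dio > 0 && ip->pending_size > ip->i_size)
        ip->i_size = ip->pending_size;
    ip->inflight_dio = 0;
}

/*
 * Returns the i_size that EOF zeroing would be based on, or -1 if the
 * write does not start beyond EOF and no zeroing is needed.
 * *restarts counts how often the checks were restarted.
 */
static int64_t model_write_checks(struct model_inode *ip, int64_t offset,
                                  enum iolock_mode *iolock,
                                  bool drain_always, int *restarts)
{
    bool drained_dio = false;

    *restarts = 0;
restart:
    if (offset <= ip->i_size)
        return -1;

    if (drain_always) {
        /* Fixed behavior: drain once, upgrading the lock if needed. */
        if (!drained_dio) {
            if (*iolock == IOLOCK_SHARED)
                *iolock = IOLOCK_EXCL;
            model_dio_wait(ip);
            drained_dio = true;
            (*restarts)++;
            goto restart;
        }
    } else if (*iolock == IOLOCK_SHARED) {
        /* Old behavior: drain only as part of a lock upgrade. */
        *iolock = IOLOCK_EXCL;
        model_dio_wait(ip);
        (*restarts)++;
        goto restart;
    }

    /* Zeroing would cover [i_size, offset). */
    return ip->i_size;
}
```

    In this model, an inode with i_size 4096 and one in-flight write that will
    extend it to 16384 shows the difference: a write at offset 8200 with the
    lock already exclusive returns 4096 under the old behavior (zeroing based
    on a stale size), while the fixed behavior drains, restarts once and
    returns -1 because the offset is no longer beyond EOF.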
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Eric Sandeen <sandeen@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_file.c 42.1 KB