    xfs: Don't use unwritten extents for DAX · 1ca19157
    Authored by Dave Chinner
    DAX has a page fault serialisation problem with block allocation.
    Because it allows concurrent page faults and does not have a page
    lock to serialise faults to the same page, it can get two concurrent
    faults to the page that race.
    
    When two read faults race, this isn't a huge problem as the data
    underlying the page is not changing and so "detect and drop" works
    just fine. The issues are to do with write faults.
    
    When two write faults occur, we serialise block allocation in
    get_blocks() so only one fault will allocate the extent. It will,
    however, be marked as an unwritten extent, and that is where the
    problem lies - the DAX fault code cannot differentiate between a
    block that was just allocated and a block that was preallocated and
    needs zeroing. The result is that both write faults end up zeroing
    the block and attempting to convert it back to written.
    
    The problem is that the first fault can zero and convert before the
    second fault starts zeroing, resulting in the zeroing for the second
    fault overwriting the data that the first fault wrote with zeros.
    The second fault then attempts to convert the unwritten extent,
    which is then a no-op because it's already written. Data loss occurs
    as a result of this race.
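
    The interleaving described above can be replayed deterministically in
    a small userspace model. This is only an illustrative sketch, not
    kernel code: struct block, fault_map(), fault_zero_and_convert() and
    user_store() are hypothetical stand-ins for the extent state and the
    DAX fault steps.

```c
#include <stdbool.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Hypothetical stand-in for one DAX-mapped block and its extent state. */
struct block {
	bool unwritten;          /* extent is still flagged unwritten */
	char data[BLOCK_SIZE];   /* block contents */
};

/* Each fault samples the extent state once, when it maps the block. */
static bool fault_map(const struct block *b)
{
	return b->unwritten;
}

/* A fault that saw "unwritten" zeroes the block, then converts it. */
static void fault_zero_and_convert(struct block *b, bool saw_unwritten)
{
	if (saw_unwritten) {
		memset(b->data, 0, BLOCK_SIZE);
		b->unwritten = false;   /* a no-op if already converted */
	}
}

/* The store that triggered the write fault. */
static void user_store(struct block *b, const char *s)
{
	strncpy(b->data, s, BLOCK_SIZE - 1);
	b->data[BLOCK_SIZE - 1] = '\0';
}

/*
 * Replay the bad ordering: both faults see the extent as unwritten,
 * the first one zeroes, converts and stores data, then the second
 * one zeroes again. Returns true if the first fault's data survived.
 */
static bool replay_racy_interleaving(void)
{
	struct block b = { .unwritten = true };
	bool first_saw, second_saw;

	memset(b.data, 0xff, BLOCK_SIZE);   /* stale on-media contents */

	first_saw = fault_map(&b);
	second_saw = fault_map(&b);

	fault_zero_and_convert(&b, first_saw);
	user_store(&b, "hello");
	fault_zero_and_convert(&b, second_saw);

	return strcmp(b.data, "hello") == 0;
}
```

    In this model replay_racy_interleaving() reports that the data did not
    survive, because the second fault acts on a stale view of the extent.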
    
    Because there is no sane locking construct in the page fault code
    that we can use for serialisation across the page faults, we need to
    ensure block allocation and zeroing occurs atomically in the
    filesystem. This means we can still take concurrent page faults and
    the only time they will serialise is in the filesystem
    mapping/allocation callback. The page fault code will always see
    written, initialised extents, so we will be able to remove the
    unwritten extent handling from the DAX code when all filesystems are
    converted.
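
    A minimal userspace sketch of the approach described above, assuming
    a single lock that covers both allocation and zeroing. It is not the
    XFS implementation: struct fs_block and fs_map_for_write() are
    hypothetical names, and a pthread mutex stands in for whatever
    serialisation the filesystem mapping callback uses.

```c
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define BLK_SIZE 16

/* Hypothetical block with the lock that serialises allocation. */
struct fs_block {
	pthread_mutex_t lock;
	bool allocated;
	char data[BLK_SIZE];
};

/*
 * Hypothetical mapping callback: allocate and zero under one lock,
 * so every caller only ever sees a written, initialised block.
 */
static char *fs_map_for_write(struct fs_block *b)
{
	pthread_mutex_lock(&b->lock);
	if (!b->allocated) {
		memset(b->data, 0, BLK_SIZE);
		b->allocated = true;
	}
	pthread_mutex_unlock(&b->lock);
	return b->data;
}

/*
 * Two write faults map the same block. The second mapping happens
 * after the first fault has stored data, and must not zero it again.
 * Returns true if the first fault's data survived.
 */
static bool replay_with_atomic_zeroing(void)
{
	struct fs_block b = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.allocated = false,
	};
	char *first, *second;

	memset(b.data, 0xff, BLK_SIZE);   /* stale on-media contents */

	first = fs_map_for_write(&b);     /* allocates and zeroes once */
	strcpy(first, "hello");           /* 6 bytes, fits in BLK_SIZE */
	second = fs_map_for_write(&b);    /* already written: no zeroing */

	return strcmp(second, "hello") == 0;
}
```

    Because the fault path never zeroes in this model, the caller needs no
    unwritten extent handling of its own.
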
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_iomap.c 25.0 KB