1. 26 11月, 2018 1 次提交
    • T
      ext4: add ext4_sb_bread() to disambiguate ENOMEM cases · fb265c9c
      Theodore Ts'o 提交于
      Today, when sb_bread() returns NULL, this can either be because of an
      I/O error or because the system failed to allocate the buffer.  Since
      it's an old interface, changing would require changing many call
      sites.
      
      So instead we create our own ext4_sb_bread(), which also allows us to
      set the REQ_META flag.
      
      Also fixed a problem in the xattr code where a NULL return in a
      function could also mean that the xattr was not found, which could
      lead to the wrong error getting returned to userspace.
      
      Fixes: ac27a0ec ("ext4: initial copy of files from ext3")
      Cc: stable@kernel.org # 2.6.19
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      fb265c9c
  2. 23 11月, 2018 2 次提交
  3. 22 11月, 2018 7 次提交
    • D
      iomap: readpages doesn't zero page tail beyond EOF · 8c110d43
      Dave Chinner 提交于
      When we read the EOF page of the file via readpages, we need
      to zero the region beyond EOF that we either do not read or
      should not contain data so that mmap does not expose stale data to
      user applications.
      
      However, iomap_adjust_read_range() fails to detect EOF correctly,
      and so fsx on 1k block size filesystems fails very quickly with
      mapreads exposing data beyond EOF. There are two problems here.
      
      Firstly, when calculating the end block of the EOF byte, we have
      to round the size by one to avoid a block aligned EOF from reporting
      a block too large. i.e. a size of 1024 bytes is 1 block, which in
      index terms is block 0. Therefore we have to calculate the end block
      from (isize - 1), not isize.
      
      The second bug is determining if the current page spans EOF, and so
      whether we need split it into two half, one for the IO, and the
      other for zeroing. Unfortunately, the code that checks whether
      we should split the block doesn't actually check if we span EOF, it
      just checks if the read spans the /offset in the page/ that EOF
      sits on. So it splits every read into two if EOF is not page
      aligned, regardless of whether we are reading the EOF block or not.
      
      Hence we need to restrict the "does the read span EOF" check to
      just the page that spans EOF, not every page we read.
      
      This patch results in correct EOF detection through readpages:
      
      xfs_vm_readpages:     dev 259:0 ino 0x43 nr_pages 24
      xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x4f000 count 98304 type hole startoff 0x13c startblock 1368 blockcount 0x4
      iomap_readpage_actor: orig pos 323584 pos 323584, length 4096, poff 0 plen 4096, isize 420864
      xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x50000 count 94208 type hole startoff 0x140 startblock 1497 blockcount 0x5c
      iomap_readpage_actor: orig pos 327680 pos 327680, length 94208, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 331776 pos 331776, length 90112, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 335872 pos 335872, length 86016, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 339968 pos 339968, length 81920, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 344064 pos 344064, length 77824, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 348160 pos 348160, length 73728, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 352256 pos 352256, length 69632, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 356352 pos 356352, length 65536, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 360448 pos 360448, length 61440, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 364544 pos 364544, length 57344, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 368640 pos 368640, length 53248, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 372736 pos 372736, length 49152, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 376832 pos 376832, length 45056, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 380928 pos 380928, length 40960, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 385024 pos 385024, length 36864, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 389120 pos 389120, length 32768, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 393216 pos 393216, length 28672, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 397312 pos 397312, length 24576, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 401408 pos 401408, length 20480, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 405504 pos 405504, length 16384, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 409600 pos 409600, length 12288, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 413696 pos 413696, length 8192, poff 0 plen 4096, isize 420864
      iomap_readpage_actor: orig pos 417792 pos 417792, length 4096, poff 0 plen 3072, isize 420864
      iomap_readpage_actor: orig pos 420864 pos 420864, length 1024, poff 3072 plen 1024, isize 420864
      
      As you can see, it now does full page reads until the last one which
      is split correctly at the block aligned EOF, reading 3072 bytes and
      zeroing the last 1024 bytes. The original version of the patch got
      this right, but it got another case wrong.
      
      The EOF detection crossing really needs to the the original length
      as plen, while it starts at the end of the block, will be shortened
      as up-to-date blocks are found on the page. This means "orig_pos +
      plen" no longer points to the end of the page, and so will not
      correctly detect EOF crossing. Hence we have to use the length
      passed in to detect this partial page case:
      
      xfs_filemap_fault:    dev 259:1 ino 0x43  write_fault 0
      xfs_vm_readpage:      dev 259:1 ino 0x43 nr_pages 1
      xfs_iomap_found:      dev 259:1 ino 0x43 size 0x2cc00 offset 0x2c000 count 4096 type hole startoff 0xb0 startblock 282 blockcount 0x4
      iomap_readpage_actor: orig pos 180224 pos 181248, length 4096, poff 1024 plen 2048, isize 183296
      xfs_iomap_found:      dev 259:1 ino 0x43 size 0x2cc00 offset 0x2cc00 count 1024 type hole startoff 0xb3 startblock 285 blockcount 0x1
      iomap_readpage_actor: orig pos 183296 pos 183296, length 1024, poff 3072 plen 1024, isize 183296
      
      Heere we see a trace where the first block on the EOF page is up to
      date, hence poff = 1024 bytes. The offset into the page of EOF is
      3072, so the range we want to read is 1024 - 3071, and the range we
      want to zero is 3072 - 4095. You can see this is split correctly
      now.
      
      This fixes the stale data beyond EOF problem that fsx quickly
      uncovers on 1k block size filesystems.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      8c110d43
    • D
      vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP · 494633fa
      Dave Chinner 提交于
      It returns EINVAL when the operation is not supported by the
      filesystem. Fix it to return EOPNOTSUPP to be consistent with
      the man page and clone_file_range().
      
      Clean up the inconsistent error return handling while I'm there.
      (I know, lipstick on a pig, but every little bit helps...)
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      494633fa
    • D
      iomap: dio data corruption and spurious errors when pipes fill · 4721a601
      Dave Chinner 提交于
      When doing direct IO to a pipe for do_splice_direct(), then pipe is
      trivial to fill up and overflow as it can only hold 16 pages. At
      this point bio_iov_iter_get_pages() then returns -EFAULT, and we
      abort the IO submission process. Unfortunately, iomap_dio_rw()
      propagates the error back up the stack.
      
      The error is converted from the EFAULT to EAGAIN in
      generic_file_splice_read() to tell the splice layers that the pipe
      is full. do_splice_direct() completely fails to handle EAGAIN errors
      (it aborts on error) and returns EAGAIN to the caller.
      
      copy_file_write() then completely fails to handle EAGAIN as well,
      and so returns EAGAIN to userspace, having failed to copy the data
      it was asked to.
      
      Avoid this whole steaming pile of fail by having iomap_dio_rw()
      silently swallow EFAULT errors and so do short reads.
      
      To make matters worse, iomap_dio_actor() has a stale data exposure
      bug bio_iov_iter_get_pages() fails - it does not zero the tail block
      that it may have been left uncovered by partial IO. Fix the error
      handling case to drop to the sub-block zeroing rather than
      immmediately returning the -EFAULT error.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      4721a601
    • D
      iomap: sub-block dio needs to zeroout beyond EOF · b450672f
      Dave Chinner 提交于
      If we are doing sub-block dio that extends EOF, we need to zero
      the unused tail of the block to initialise the data in it it. If we
      do not zero the tail of the block, then an immediate mmap read of
      the EOF block will expose stale data beyond EOF to userspace. Found
      with fsx running sub-block DIO sizes vs MAPREAD/MAPWRITE operations.
      
      Fix this by detecting if the end of the DIO write is beyond EOF
      and zeroing the tail if necessary.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      b450672f
    • D
      iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents · 0929d858
      Dave Chinner 提交于
      When we write into an unwritten extent via direct IO, we dirty
      metadata on IO completion to convert the unwritten extent to
      written. However, when we do the FUA optimisation checks, the inode
      may be clean and so we issue a FUA write into the unwritten extent.
      This means we then bypass the generic_write_sync() call after
      unwritten extent conversion has ben done and we don't force the
      modified metadata to stable storage.
      
      This violates O_DSYNC semantics. The window of exposure is a single
      IO, as the next DIO write will see the inode has dirty metadata and
      hence will not use the FUA optimisation. Calling
      generic_write_sync() after completion of the second IO will also
      sync the first write and it's metadata.
      
      Fix this by avoiding the FUA optimisation when writing to unwritten
      extents.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0929d858
    • D
      xfs: delalloc -> unwritten COW fork allocation can go wrong · 9230a0b6
      Dave Chinner 提交于
      Long saga. There have been days spent following this through dead end
      after dead end in multi-GB event traces. This morning, after writing
      a trace-cmd wrapper that enabled me to be more selective about XFS
      trace points, I discovered that I could get just enough essential
      tracepoints enabled that there was a 50:50 chance the fsx config
      would fail at ~115k ops. If it didn't fail at op 115547, I stopped
      fsx at op 115548 anyway.
      
      That gave me two traces - one where the problem manifested, and one
      where it didn't. After refining the traces to have the necessary
      information, I found that in the failing case there was a real
      extent in the COW fork compared to an unwritten extent in the
      working case.
      
      Walking back through the two traces to the point where the CWO fork
      extents actually diverged, I found that the bad case had an extra
      unwritten extent in it. This is likely because the bug it led me to
      had triggered multiple times in those 115k ops, leaving stray
      COW extents around. What I saw was a COW delalloc conversion to an
      unwritten extent (as they should always be through
      xfs_iomap_write_allocate()) resulted in a /written extent/:
      
      xfs_writepage:        dev 259:0 ino 0x83 pgoff 0x17000 size 0x79a00 offset 0 length 0
      xfs_iext_remove:      dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/2 offset 32 block 152 count 20 flag 1 caller xfs_bmap_add_extent_delay_real
      xfs_bmap_pre_update:  dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 4503599627239429 count 31 flag 0 caller xfs_bmap_add_extent_delay_real
      xfs_bmap_post_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 121 count 51 flag 0 caller xfs_bmap_add_ex
      
      Basically, Cow fork before:
      
      	0 1            32          52
      	+H+DDDDDDDDDDDD+UUUUUUUUUUU+
      	   PREV		RIGHT
      
      COW delalloc conversion allocates:
      
      	  1	       32
      	  +uuuuuuuuuuuu+
      	  NEW
      
      And the result according to the xfs_bmap_post_update trace was:
      
      	0 1            32          52
      	+H+wwwwwwwwwwwwwwwwwwwwwwww+
      	   PREV
      
      Which is clearly wrong - it should be a merged unwritten extent,
      not an unwritten extent.
      
      That lead me to look at the LEFT_FILLING|RIGHT_FILLING|RIGHT_CONTIG
      case in xfs_bmap_add_extent_delay_real(), and sure enough, there's
      the bug.
      
      It takes the old delalloc extent (PREV) and adds the length of the
      RIGHT extent to it, takes the start block from NEW, removes the
      RIGHT extent and then updates PREV with the new extent.
      
      What it fails to do is update PREV.br_state. For delalloc, this is
      always XFS_EXT_NORM, while in this case we are converting the
      delayed allocation to unwritten, so it needs to be updated to
      XFS_EXT_UNWRITTEN. This LF|RF|RC case does not do this, and so
      the resultant extent is always written.
      
      And that's the bug I've been chasing for a week - a bmap btree bug,
      not a reflink/dedupe/copy_file_range bug, but a BMBT bug introduced
      with the recent in core extent tree scalability enhancements.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      9230a0b6
    • D
      xfs: flush removing page cache in xfs_reflink_remap_prep · 2c307174
      Dave Chinner 提交于
      On a sub-page block size filesystem, fsx is failing with a data
      corruption after a series of operations involving copying a file
      with the destination offset beyond EOF of the destination of the file:
      
      8093(157 mod 256): TRUNCATE DOWN        from 0x7a120 to 0x50000 ******WWWW
      8094(158 mod 256): INSERT 0x25000 thru 0x25fff  (0x1000 bytes)
      8095(159 mod 256): COPY 0x18000 thru 0x1afff    (0x3000 bytes) to 0x2f400
      8096(160 mod 256): WRITE    0x5da00 thru 0x651ff        (0x7800 bytes) HOLE
      8097(161 mod 256): COPY 0x2000 thru 0x5fff      (0x4000 bytes) to 0x6fc00
      
      The second copy here is beyond EOF, and it is to sub-page (4k) but
      block aligned (1k) offset. The clone runs the EOF zeroing, landing
      in a pre-existing post-eof delalloc extent. This zeroes the post-eof
      extents in the page cache just fine, dirtying the pages correctly.
      
      The problem is that xfs_reflink_remap_prep() now truncates the page
      cache over the range that it is copying it to, and rounds that down
      to cover the entire start page. This removes the dirty page over the
      delalloc extent from the page cache without having written it back.
      Hence later, when the page cache is flushed, the page at offset
      0x6f000 has not been written back and hence exposes stale data,
      which fsx trips over less than 10 operations later.
      
      Fix this by changing xfs_reflink_remap_prep() to use
      xfs_flush_unmap_range().
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      2c307174
  4. 21 11月, 2018 4 次提交
    • D
      xfs: extent shifting doesn't fully invalidate page cache · 7f9f71be
      Dave Chinner 提交于
      The extent shifting code uses a flush and invalidate mechainsm prior
      to shifting extents around. This is similar to what
      xfs_free_file_space() does, but it doesn't take into account things
      like page cache vs block size differences, and it will fail if there
      is a page that it currently busy.
      
      xfs_flush_unmap_range() handles all of these cases, so just convert
      xfs_prepare_shift() to us that mechanism rather than having it's own
      special sauce.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      7f9f71be
    • D
      xfs: finobt AG reserves don't consider last AG can be a runt · c0876897
      Dave Chinner 提交于
      The last AG may be very small comapred to all other AGs, and hence
      AG reservations based on the superblock AG size may actually consume
      more space than the AG actually has. This results on assert failures
      like:
      
      XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319
      [   48.932891]  xfs_ag_resv_init+0x1bd/0x1d0
      [   48.933853]  xfs_fs_reserve_ag_blocks+0x37/0xb0
      [   48.934939]  xfs_mountfs+0x5b3/0x920
      [   48.935804]  xfs_fs_fill_super+0x462/0x640
      [   48.936784]  ? xfs_test_remount_options+0x60/0x60
      [   48.937908]  mount_bdev+0x178/0x1b0
      [   48.938751]  mount_fs+0x36/0x170
      [   48.939533]  vfs_kern_mount.part.43+0x54/0x130
      [   48.940596]  do_mount+0x20e/0xcb0
      [   48.941396]  ? memdup_user+0x3e/0x70
      [   48.942249]  ksys_mount+0xba/0xd0
      [   48.943046]  __x64_sys_mount+0x21/0x30
      [   48.943953]  do_syscall_64+0x54/0x170
      [   48.944835]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Hence we need to ensure the finobt per-ag space reservations take
      into account the size of the last AG rather than treat it like all
      the other full size AGs.
      
      Note that both refcountbt and rmapbt already take the size of the AG
      into account via reading the AGF length directly.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      c0876897
    • D
      xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers · d43aaf16
      Dave Chinner 提交于
      When retrying a failed inode or dquot buffer,
      xfs_buf_resubmit_failed_buffers() clears all the failed flags from
      the inde/dquot log items. In doing so, it also drops all the
      reference counts on the buffer that the failed log items hold. This
      means it can drop all the active references on the buffer and hence
      free the buffer before it queues it for write again.
      
      Putting the buffer on the delwri queue takes a reference to the
      buffer (so that it hangs around until it has been written and
      completed), but this goes bang if the buffer has already been freed.
      
      Hence we need to add the buffer to the delwri queue before we remove
      the failed flags from the log items attached to the buffer to ensure
      it always remains referenced during the resubmit process.
      Reported-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      d43aaf16
    • D
      xfs: uncached buffer tracing needs to print bno · d61fa8cb
      Dave Chinner 提交于
      Useless:
      
      xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_unlock:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_submit:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_hold:         dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_iowait:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_iodone:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_iowait_done:  dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_rele:         dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      
      Useful:
      
      
      xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_unlock:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
      xfs_buf_submit:       dev 253:32 bno 0x200b5 nblks 0x1 ...
      xfs_buf_hold:         dev 253:32 bno 0x200b5 nblks 0x1 ...
      xfs_buf_iowait:       dev 253:32 bno 0x200b5 nblks 0x1 ...
      xfs_buf_iodone:       dev 253:32 bno 0x200b5 nblks 0x1 ...
      xfs_buf_iowait_done:  dev 253:32 bno 0x200b5 nblks 0x1 ...
      xfs_buf_rele:         dev 253:32 bno 0x200b5 nblks 0x1 ...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      d61fa8cb
  5. 20 11月, 2018 3 次提交
    • T
      NFSv4: Fix a NFSv4 state manager deadlock · aeabb3c9
      Trond Myklebust 提交于
      Fix a deadlock whereby the NFSv4 state manager can get stuck in the
      delegation return code, waiting for a layout return to complete in
      another thread. If the server reboots before that other thread
      completes, then we need to be able to start a second state
      manager thread in order to perform recovery.
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      aeabb3c9
    • E
      xfs: make xfs_file_remap_range() static · da034bcc
      Eric Biggers 提交于
      xfs_file_remap_range() is only used in fs/xfs/xfs_file.c, so make it
      static.
      
      This addresses a gcc warning when -Wmissing-prototypes is enabled.
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      da034bcc
    • B
      xfs: fix shared extent data corruption due to missing cow reservation · 59e42931
      Brian Foster 提交于
      Page writeback indirectly handles shared extents via the existence
      of overlapping COW fork blocks. If COW fork blocks exist, writeback
      always performs the associated copy-on-write regardless if the
      underlying blocks are actually shared. If the blocks are shared,
      then overlapping COW fork blocks must always exist.
      
      fstests shared/010 reproduces a case where a buffered write occurs
      over a shared block without performing the requisite COW fork
      reservation.  This ultimately causes writeback to the shared extent
      and data corruption that is detected across md5 checks of the
      filesystem across a mount cycle.
      
      The problem occurs when a buffered write lands over a shared extent
      that crosses an extent size hint boundary and that also happens to
      have a partial COW reservation that doesn't cover the start and end
      blocks of the data fork extent.
      
      For example, a buffered write occurs across the file offset (in FSB
      units) range of [29, 57]. A shared extent exists at blocks [29, 35]
      and COW reservation already exists at blocks [32, 34]. After
      accommodating a COW extent size hint of 32 blocks and the existing
      reservation at offset 32, xfs_reflink_reserve_cow() allocates 32
      blocks of reservation at offset 0 and returns with COW reservation
      across the range of [0, 34]. The associated data fork extent is
      still [29, 35], however, which isn't fully covered by the COW
      reservation.
      
      This leads to a buffered write at file offset 35 over a shared
      extent without associated COW reservation. Writeback eventually
      kicks in, performs an overwrite of the underlying shared block and
      causes the associated data corruption.
      
      Update xfs_reflink_reserve_cow() to accommodate the fact that a
      delalloc allocation request may not fully cover the extent in the
      data fork. Trim the data fork extent appropriately, just as is done
      for shared extent boundaries and/or existing COW reservations that
      happen to overlap the start of the data fork extent. This prevents
      shared/010 failures due to data corruption on reflink enabled
      filesystems.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      59e42931
  6. 19 11月, 2018 4 次提交
  7. 18 11月, 2018 2 次提交
    • M
      dax: Fix huge page faults · 0e40de03
      Matthew Wilcox 提交于
      Using xas_load() with a PMD-sized xa_state would work if either a
      PMD-sized entry was present or a PTE sized entry was present in the
      first 64 entries (of the 512 PTEs in a PMD on x86).  If there was no
      PTE in the first 64 entries, grab_mapping_entry() would believe there
      were no entries present, allocate a PMD-sized entry and overwrite the
      PTE in the page cache.
      
      Use xas_find_conflict() instead which turns out to simplify
      both get_unlocked_entry() and grab_mapping_entry().  Also remove a
      WARN_ON_ONCE from grab_mapping_entry() as it will have already triggered
      in get_unlocked_entry().
      
      Fixes: cfc93c6c ("dax: Convert dax_insert_pfn_mkwrite to XArray")
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      0e40de03
    • M
      dax: Fix dax_unlock_mapping_entry for PMD pages · fda490d3
      Matthew Wilcox 提交于
      Device DAX PMD pages do not set the PageHead bit for compound pages.
      Fix for now by retrieving the PMD bit from the entry, but eventually we
      will be passed the page size by the caller.
      Reported-by: NDan Williams <dan.j.williams@intel.com>
      Fixes: 9f32d221 ("dax: Convert dax_lock_mapping_entry to XArray")
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      fda490d3
  8. 17 11月, 2018 4 次提交
  9. 16 11月, 2018 1 次提交
    • D
      rxrpc: Fix life check · 7150ceaa
      David Howells 提交于
      The life-checking function, which is used by kAFS to make sure that a call
      is still live in the event of a pending signal, only samples the received
      packet serial number counter; it doesn't actually provoke a change in the
      counter, rather relying on the server to happen to give us a packet in the
      time window.
      
      Fix this by adding a function to force a ping to be transmitted.
      
      kAFS then keeps track of whether there's been a stall, and if so, uses the
      new function to ping the server, resetting the timeout to allow the reply
      to come back.
      
      If there's a stall, a ping and the call is *still* stalled in the same
      place after another period, then the call will be aborted.
      
      Fixes: bc5e3a54 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
      Fixes: f4d15fb6 ("rxrpc: Provide functions for allowing cleaner handling of signals")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7150ceaa
  10. 14 11月, 2018 1 次提交
  11. 13 11月, 2018 2 次提交
  12. 12 11月, 2018 1 次提交
    • B
      mnt: fix __detach_mounts infinite loop · 1e9c75fb
      Benjamin Coddington 提交于
      Since commit ff17fa56 ("d_invalidate(): unhash immediately")
      immediately unhashes the dentry, we'll never return the mountpoint in
      lookup_mountpoint(), which can lead to an unbreakable loop in
      d_invalidate().
      
      I have reports of NFS clients getting into this condition after the server
      removes an export of an existing mount created through follow_automount(),
      but I suspect there are various other ways to produce this problem if we
      hunt down users of d_invalidate().  For example, it is possible to get into
      this state by using XFS' d_invalidate() call in xfs_vn_unlink():
      
      truncate -s 100m img{1,2}
      
      mkfs.xfs -q -n version=ci img1
      mkfs.xfs -q -n version=ci img2
      
      mkdir -p /mnt/xfs
      mount img1 /mnt/xfs
      
      mkdir /mnt/xfs/sub1
      mount img2 /mnt/xfs/sub1
      
      cat > /mnt/xfs/sub1/foo &
      umount -l /mnt/xfs/sub1
      mount img2 /mnt/xfs/sub1
      
      mount --make-private /mnt/xfs
      
      mkdir /mnt/xfs/sub2
      mount --move /mnt/xfs/sub1 /mnt/xfs/sub2
      rmdir /mnt/xfs/sub1
      
      Fix this by moving the check for an unlinked dentry out of the
      detach_mounts() path.
      
      Fixes: ff17fa56 ("d_invalidate(): unhash immediately")
      Cc: stable@vger.kernel.org
      Reviewed-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NBenjamin Coddington <bcodding@redhat.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      1e9c75fb
  13. 10 11月, 2018 1 次提交
    • V
      ext4: missing !bh check in ext4_xattr_inode_write() · eb6984fa
      Vasily Averin 提交于
      According to Ted Ts'o ext4_getblk() called in ext4_xattr_inode_write()
      should not return bh = NULL
      
      The only time that bh could be NULL, then, would be in the case of
      something really going wrong; a programming error elsewhere (perhaps a
      wild pointer dereference) or I/O error causing on-disk file system
      corruption (although that would be highly unlikely given that we had
      *just* allocated the blocks and so the metadata blocks in question
      probably would still be in the cache).
      
      Fixes: e50e5129 ("ext4: xattr-in-inode support")
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 4.13
      eb6984fa
  14. 09 11月, 2018 7 次提交
    • L
      fuse: fix use-after-free in fuse_direct_IO() · ebacb812
      Lukas Czerner 提交于
      In async IO blocking case the additional reference to the io is taken for
      it to survive fuse_aio_complete(). In non blocking case this additional
      reference is not needed, however we still reference io to figure out
      whether to wait for completion or not. This is wrong and will lead to
      use-after-free. Fix it by storing blocking information in separate
      variable.
      
      This was spotted by KASAN when running generic/208 fstest.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Reported-by: NZorro Lang <zlang@redhat.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Fixes: 744742d6 ("fuse: Add reference counting for fuse_io_priv")
      Cc: <stable@vger.kernel.org> # v4.6
      ebacb812
    • M
      fuse: fix possibly missed wake-up after abort · 2d84a2d1
      Miklos Szeredi 提交于
      In current fuse_drop_waiting() implementation it's possible that
      fuse_wait_aborted() will not be woken up in the unlikely case that
      fuse_abort_conn() + fuse_wait_aborted() runs in between checking
      fc->connected and calling atomic_dec(&fc->num_waiting).
      
      Do the atomic_dec_and_test() unconditionally, which also provides the
      necessary barrier against reordering with the fc->connected check.
      
      The explicit smp_mb() in fuse_wait_aborted() is not actually needed, since
      the spin_unlock() in fuse_abort_conn() provides the necessary RELEASE
      barrier after resetting fc->connected.  However, this is not a performance
      sensitive path, and adding the explicit barrier makes it easier to
      document.
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Fixes: b8f95e5d ("fuse: umount should wait for all requests")
      Cc: <stable@vger.kernel.org> #v4.19
      2d84a2d1
    • M
      fuse: fix leaked notify reply · 7fabaf30
      Miklos Szeredi 提交于
      fuse_request_send_notify_reply() may fail if the connection was reset for
      some reason (e.g. fs was unmounted).  Don't leak request reference in this
      case.  Besides leaking memory, this resulted in fc->num_waiting not being
      decremented and hence fuse_wait_aborted() left in a hanging and unkillable
      state.
      
      Fixes: 2d45ba38 ("fuse: add retrieve request")
      Fixes: b8f95e5d ("fuse: umount should wait for all requests")
      Reported-and-tested-by: syzbot+6339eda9cb4ebbc4c37b@syzkaller.appspotmail.com
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Cc: <stable@vger.kernel.org> #v2.6.36
      7fabaf30
    • A
      gfs2: Fix metadata read-ahead during truncate (2) · e7445ced
      Andreas Gruenbacher 提交于
      The previous attempt to fix for metadata read-ahead during truncate was
      incorrect: for files with a height > 2 (1006989312 bytes with a block
      size of 4096 bytes), read-ahead requests were not being issued for some
      of the indirect blocks discovered while walking the metadata tree,
      leading to significant slow-downs when deleting large files.  Fix that.
      
      In addition, only issue read-ahead requests in the first pass through
      the meta-data tree, while deallocating data blocks.
      
      Fixes: c3ce5aa9 ("gfs2: Fix metadata read-ahead during truncate")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      e7445ced
    • A
      gfs2: Put bitmap buffers in put_super · 10283ea5
      Andreas Gruenbacher 提交于
      gfs2_put_super calls gfs2_clear_rgrpd to destroy the gfs2_rgrpd objects
      attached to the resource group glocks.  That function should release the
      buffers attached to the gfs2_bitmap objects (bi_bh), but the call to
      gfs2_rgrp_brelse for doing that is missing.
      
      When gfs2_releasepage later runs across these buffers which are still
      referenced, it refuses to free them.  This causes the pages the buffers
      are attached to to remain referenced as well.  With enough mount/unmount
      cycles, the system will eventually run out of memory.
      
      Fix this by adding the missing call to gfs2_rgrp_brelse in
      gfs2_clear_rgrpd.
      
      (Also fix a gfs2_rgrp_relse -> gfs2_rgrp_brelse typo in a comment.)
      
      Fixes: 39b0f1e9 ("GFS2: Don't brelse rgrp buffer_heads every allocation")
      Cc: stable@vger.kernel.org # v4.2+
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      10283ea5
    • S
      nfsd: COPY and CLONE operations require the saved filehandle to be set · 01310bb7
      Scott Mayhew 提交于
      Make sure we have a saved filehandle, otherwise we'll oops with a null
      pointer dereference in nfs4_preprocess_stateid_op().
      Signed-off-by: NScott Mayhew <smayhew@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      01310bb7
    • I
      libceph: assume argonaut on the server side · 23c625ce
      Ilya Dryomov 提交于
      No one is running pre-argonaut.  In addition one of the argonaut
      features (NOSRCADDR) has been required since day one (and a half,
      2.6.34 vs 2.6.35) of the kernel client.
      
      Allow for the possibility of reusing these feature bits later.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NSage Weil <sage@redhat.com>
      23c625ce