1. 17 February 2017, 4 commits
    • xfs: split indlen reservations fairly when under reserved · 75d65361
      Authored by Brian Foster
      Certain workloads that punch holes into speculative preallocation can
      cause delalloc indirect reservation splits when the delalloc extent is
      split in two. If further splits occur, an already short-handed extent
      can be split into two in a manner that leaves zero indirect blocks for
      one of the two new extents. This occurs because the shortage is large
      enough that the xfs_bmap_split_indlen() algorithm completely drains the
      requested indlen of one of the extents before it honors the existing
      reservation.
      
      This ultimately results in a warning from xfs_bmap_del_extent(). This
      has been observed during file copies of large, sparse files using 'cp
      --sparse=always.'
      
      To avoid this problem, update xfs_bmap_split_indlen() to explicitly
      apply the reservation shortage fairly between both extents. This smooths
      out the overall indlen shortage and defers the situation where we end up
      with a delalloc extent with zero indlen reservation to extreme
      circumstances.
      Reported-by: Patrick Dung <mpatdung@gmail.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      75d65361
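      The fair-split approach described in the entry above can be modelled in a few
      lines of ordinary C. This is a minimal standalone sketch, not the kernel's
      xfs_bmap_split_indlen(); the function names, the worst-case estimate, and the
      proportional apportioning are illustrative assumptions. The point is only that
      a reservation shortage is charged to both halves of the split rather than
      draining one of them to zero.

      #include <stdint.h>
      #include <stdio.h>

      /*
       * Hypothetical stand-in for the per-extent worst-case indirect block
       * estimate (the kernel derives this from the btree geometry).
       */
      static uint32_t worst_indlen(uint64_t blocks)
      {
          return (uint32_t)((blocks + 127) / 128);    /* crude model */
      }

      /*
       * Split an existing reservation 'ores' between the two halves of a split
       * delalloc extent.  When the combined worst case exceeds 'ores', charge
       * the shortage to both halves in proportion to their requests instead of
       * draining one half completely.
       */
      static void split_indlen_fairly(uint64_t len1, uint64_t len2, uint32_t ores,
                                      uint32_t *indlen1, uint32_t *indlen2)
      {
          uint32_t req1 = worst_indlen(len1);
          uint32_t req2 = worst_indlen(len2);
          uint32_t total = req1 + req2;

          if (total <= ores) {
              *indlen1 = req1;
              *indlen2 = req2;
              return;
          }

          /* Apportion the shortage; any round-off is charged to extent 1. */
          uint32_t shortage = total - ores;
          uint32_t cut2 = (uint32_t)((uint64_t)shortage * req2 / total);
          uint32_t cut1 = shortage - cut2;

          *indlen1 = req1 - (cut1 > req1 ? req1 : cut1);
          *indlen2 = req2 - (cut2 > req2 ? req2 : cut2);
      }

      int main(void)
      {
          uint32_t i1, i2;

          /* An already short-handed extent split in two: ores < req1 + req2. */
          split_indlen_fairly(1024, 4096, 20, &i1, &i2);
          printf("indlen1=%u indlen2=%u\n", i1, i2);   /* both stay non-zero */
          return 0;
      }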
    • xfs: handle indlen shortage on delalloc extent merge · 0e339ef8
      Authored by Brian Foster
      When a delalloc extent is created, it can be merged with pre-existing,
      contiguous, delalloc extents. When this occurs,
      xfs_bmap_add_extent_hole_delay() merges the extents along with the
      associated indirect block reservations. The expectation here is that the
      combined worst case indlen reservation is always less than or equal to
      the indlen reservation for the individual extents.
      
      This is not always the case, however, as existing extents can hold less than
      the expected indlen reservation if the extent was previously split due
      to a hole punch. If a new extent merges with such an extent, the total
      indlen requirement may be larger than the sum of the indlen reservations
      held by both extents.
      
      xfs_bmap_add_extent_hole_delay() assumes that the worst case indlen
      reservation is always available and assigns it to the merged extent
      without consideration for the indlen held by the pre-existing extent. As
      a result, the subsequent xfs_mod_fdblocks() call can attempt an
      unintentional allocation rather than a free (indicated by an ASSERT()
      failure). Further, if the allocation happens to fail in this context,
      the failure goes unhandled and creates a filesystem wide block
      accounting inconsistency.
      
      Fix xfs_bmap_add_extent_hole_delay() to function as designed. Cap the
      indlen reservation assigned to the merged extent to the sum of the
      indlen reservations held by each of the individual extents.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      0e339ef8
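      A similarly hedged sketch of the capping rule above (the names
      merge_delalloc_indlen and worst_indlen are invented for illustration): the
      merged extent gets the worst-case reservation for the combined length, but
      never more than the sum of what the two extents actually held, so the merge
      can only free blocks, never allocate.

      #include <stdint.h>
      #include <stdio.h>

      /* Hypothetical worst-case indirect block estimate for a delalloc extent. */
      static uint32_t worst_indlen(uint64_t blocks)
      {
          return (uint32_t)((blocks + 127) / 128);    /* crude model */
      }

      /* Reservation to assign to the result of merging two contiguous
       * delalloc extents.  The cap at 'oldlen' is the fix described above. */
      static uint32_t merge_delalloc_indlen(uint64_t len1, uint32_t indlen1,
                                            uint64_t len2, uint32_t indlen2)
      {
          uint32_t oldlen = indlen1 + indlen2;
          uint32_t newlen = worst_indlen(len1 + len2);

          return newlen < oldlen ? newlen : oldlen;
      }

      int main(void)
      {
          /* Extent 1 was split by an earlier hole punch and holds less than
           * its worst case, so the uncapped estimate would exceed the total. */
          uint32_t merged = merge_delalloc_indlen(2048, 3, 1024, 8);
          printf("merged indlen=%u (old total=%u)\n", merged, 3 + 8);
          return 0;
      }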
    • xfs: resurrect debug mode drop buffered writes mechanism · 9dbddd7b
      Authored by Brian Foster
      A debug mode write failure mechanism was introduced to XFS in commit
      801cc4e1 ("xfs: debug mode forced buffered write failure") to
      facilitate targeted testing of delalloc indirect reservation management
      from userspace. This code was subsequently rendered ineffective by the
      move to iomap based buffered writes in commit 68a9f5e7 ("xfs:
      implement iomap based buffered write path"). This likely went unnoticed
      because the associated userspace code had not made it into xfstests.
      
      Resurrect this mechanism to facilitate effective indlen reservation
      testing from xfstests. The move to iomap based buffered writes relocated
      the hook this mechanism needs to return write failure from XFS to
      generic code. The failure trigger must remain in XFS. Given that
      limitation, convert this from a write failure mechanism to one that
      simply drops writes without returning failure to userspace. Rename all
      "fail_writes" references to "drop_writes" to illustrate the point. This
      is more hacky than preferred, but still triggers the XFS error handling
      behavior required to drive the indlen tests. This is only available in
      DEBUG mode and for testing purposes only.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      9dbddd7b
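      As a rough userspace model of the mechanism above (the flag, the helper, and
      the file layout are invented here; only the knob's name, drop_writes, comes
      from the commit): a debug-only switch is consulted in the write path, and when
      it is set the write is reported as successful without the data ever being
      copied, leaving the delalloc state behind for the error-handling paths under
      test.

      #include <stdbool.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/types.h>

      /* Debug-only knob; the real one is a sysfs file in DEBUG builds. */
      static bool drop_writes;

      static char file_data[4096];

      /*
       * Toy buffered write: when drop_writes is set, report success without
       * copying anything, so the filesystem's delalloc cleanup paths can be
       * exercised deterministically from a test harness.
       */
      static ssize_t toy_buffered_write(size_t off, const char *buf, size_t len)
      {
          if (off + len > sizeof(file_data))
              return -1;
          if (!drop_writes)
              memcpy(file_data + off, buf, len);
          return (ssize_t)len;            /* "success" either way */
      }

      int main(void)
      {
          toy_buffered_write(0, "hello", 5);
          drop_writes = true;             /* flip the knob, as a test would */
          toy_buffered_write(5, "world", 5);
          printf("file contents: %.10s\n", file_data);   /* "hello" only */
          return 0;
      }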
    • xfs: clear delalloc and cache on buffered write failure · fa7f138a
      Authored by Brian Foster
      The buffered write failure handling code in
      xfs_file_iomap_end_delalloc() has a couple minor problems. First, if
      written == 0, start_fsb is not rounded down and it fails to kill off a
      delalloc block if the start offset is block unaligned. This results in a
      lingering delalloc block and broken delalloc block accounting detected
      at unmount time. Fix this by rounding down start_fsb in the unlikely
      event that written == 0.
      
      Second, it is possible for a failed overwrite of a delalloc extent to
      leave dirty pagecache around over a hole in the file. This is because it is
      possible to hit ->iomap_end() on write failure before the iomap code has
      attempted to allocate pagecache, and thus has no need to clean it up. If
      the targeted delalloc extent was successfully written by a previous
      write, however, then it does still have dirty pages when ->iomap_end()
      punches out the underlying blocks. This ultimately results in writeback
      over a hole. To fix this problem, unconditionally punch out the
      pagecache from XFS before punching out the associated delalloc range.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      fa7f138a
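      A small standalone sketch of the first fix above (the block size and helper
      names are illustrative; the round-down/round-up helpers only mirror what
      macros like XFS_B_TO_FSBT and XFS_B_TO_FSB do): when written == 0, the
      starting offset must be rounded down to a block boundary before deciding
      which delalloc blocks to punch out, otherwise a block-unaligned start skips
      the very block that holds the failed reservation.

      #include <stdint.h>
      #include <stdio.h>

      #define BLOCKSIZE 4096ULL           /* illustrative fs block size */

      /* Byte offset -> fs block, truncating (round down). */
      static uint64_t b_to_fsbt(uint64_t bytes) { return bytes / BLOCKSIZE; }

      /* Byte offset -> fs block, rounding up. */
      static uint64_t b_to_fsb(uint64_t bytes)
      {
          return (bytes + BLOCKSIZE - 1) / BLOCKSIZE;
      }

      int main(void)
      {
          uint64_t offset = 6000;     /* block-unaligned write start */
          uint64_t written = 0;       /* the write failed before copying data */

          /* Buggy: rounding up skips the delalloc block at fsb 1. */
          uint64_t start_buggy = b_to_fsb(offset + written);

          /* Fixed: with written == 0, round down so fsb 1 is punched too. */
          uint64_t start_fixed = written ? b_to_fsb(offset + written)
                                         : b_to_fsbt(offset);

          printf("buggy start_fsb=%llu, fixed start_fsb=%llu\n",
                 (unsigned long long)start_buggy,
                 (unsigned long long)start_fixed);
          return 0;
      }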
  2. 10 February 2017, 6 commits
  3. 07 February 2017, 5 commits
  4. 04 February 2017, 1 commit
  5. 03 February 2017, 7 commits
    • xfs: mark speculative prealloc CoW fork extents unwritten · 5eda4300
      Authored by Darrick J. Wong
      Christoph Hellwig pointed out that there's a potentially nasty race when
      performing simultaneous nearby directio cow writes:
      
      "Thread 1 writes a range from B to c
      
      "                    B --------- C
                                 p
      
      "a little later thread 2 writes from A to B
      
      "        A --------- B
                     p
      
      [editor's note: the 'p' denote cowextsize boundaries, which I added to
      make this more clear]
      
      "but the code preallocates beyond B into the range where thread
      "1 has just written, but ->end_io hasn't been called yet.
      "But once ->end_io is called thread 2 has already allocated
      "up to the extent size hint into the write range of thread 1,
      "so the end_io handler will splice the unintialized blocks from
      "that preallocation back into the file right after B."
      
      We can avoid this race by ensuring that thread 1 cannot accidentally
      remap the blocks that thread 2 allocated (as part of speculative
      preallocation) as part of t2's write preparation in t1's end_io handler.
      The way we make this happen is by taking advantage of the unwritten
      extent flag as an intermediate step.
      
      Recall that when we begin the process of writing data to shared blocks,
      we create a delayed allocation extent in the CoW fork:
      
      D: --RRRRRRSSSRRRRRRRR---
      C: ------DDDDDDD---------
      
      When a thread prepares to CoW some dirty data out to disk, it will now
      convert the delalloc reservation into an /unwritten/ allocated extent in
      the cow fork.  The da conversion code tries to opportunistically
      allocate as much of a (speculatively prealloc'd) extent as possible, so
      we may end up allocating a larger extent than we're actually writing
      out:
      
      D: --RRRRRRSSSRRRRRRRR---
      U: ------UUUUUUU---------
      
      Next, we convert only the part of the extent that we're actively
      planning to write to normal (i.e. not unwritten) status:
      
      D: --RRRRRRSSSRRRRRRRR---
      U: ------UURRUUU---------
      
      If the write succeeds, the end_cow function will now scan the relevant
      range of the CoW fork for real extents and remap only the real extents
      into the data fork:
      
      D: --RRRRRRRRSRRRRRRRR---
      U: ------UU--UUU---------
      
      This ensures that we never obliterate valid data fork extents with
      unwritten blocks from the CoW fork.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      5eda4300
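      A hedged, userspace-only sketch of the end_cow scan described above (the
      extent structure and helper names are invented for illustration): only CoW
      fork extents that have been converted to real, written state are remapped
      into the data fork, while anything still flagged unwritten, such as
      speculative preallocation racing with a neighbouring write, is left alone.

      #include <stdbool.h>
      #include <stdio.h>

      /* Toy CoW fork extent: [start, start+len) in file blocks. */
      struct cow_extent {
          unsigned long long start;
          unsigned long long len;
          bool unwritten;             /* still carries the unwritten flag? */
      };

      /*
       * Walk the CoW fork extents overlapping [off, off+count) and remap only
       * the real (written) ones into the data fork.  Skipping unwritten
       * extents is what prevents valid data fork blocks from being replaced
       * by uninitialised preallocation.
       */
      static void toy_end_cow(const struct cow_extent *ext, int n,
                              unsigned long long off, unsigned long long count)
      {
          for (int i = 0; i < n; i++) {
              if (ext[i].start + ext[i].len <= off || ext[i].start >= off + count)
                  continue;           /* outside the range being finished */
              if (ext[i].unwritten) {
                  printf("skip unwritten [%llu,+%llu)\n", ext[i].start, ext[i].len);
                  continue;
              }
              printf("remap to data  [%llu,+%llu)\n", ext[i].start, ext[i].len);
          }
      }

      int main(void)
      {
          const struct cow_extent fork[] = {
              {  6, 2, true  },       /* speculative prealloc, never written */
              {  8, 2, false },       /* the blocks this write actually hit */
              { 10, 3, true  },       /* trailing prealloc */
          };
          toy_end_cow(fork, 3, 6, 7);
          return 0;
      }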
    • xfs: allow unwritten extents in the CoW fork · 05a630d7
      Authored by Darrick J. Wong
      In the data fork, we only allow extents to perform the following state
      transitions:
      
      delay -> real <-> unwritten
      
      There's no way to move directly from a delalloc reservation to an
      /unwritten/ allocated extent.  However, for the CoW fork we want to be
      able to do the following to each extent:
      
      delalloc -> unwritten -> written -> remapped to data fork
      
      This will help us to avoid a race in the speculative CoW preallocation
      code between a first thread that is allocating a CoW extent and a second
      thread that is remapping part of a file after a write.  In order to do
      this, however, we need two things: first, we have to be able to
      transition from da to unwritten, and second the function that converts
      between real and unwritten has to be made aware of the CoW fork.  Do
      both of those things.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      05a630d7
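      The two state machines described above, written out as a small check (a
      sketch only; the enum and function names are not the XFS ones). It makes the
      asymmetry explicit: the data fork never goes straight from delalloc to
      unwritten, while the CoW fork now does exactly that on its way to being
      remapped.

      #include <stdbool.h>
      #include <stdio.h>

      enum ext_state { DELALLOC, UNWRITTEN, REAL };

      /* Data fork rule: delay -> real <-> unwritten. */
      static bool data_fork_ok(enum ext_state from, enum ext_state to)
      {
          if (from == DELALLOC)
              return to == REAL;
          return (from == REAL && to == UNWRITTEN) ||
                 (from == UNWRITTEN && to == REAL);
      }

      /* CoW fork rule after this patch: delalloc -> unwritten -> real. */
      static bool cow_fork_ok(enum ext_state from, enum ext_state to)
      {
          return (from == DELALLOC && to == UNWRITTEN) ||
                 (from == UNWRITTEN && to == REAL);
      }

      int main(void)
      {
          printf("data fork delalloc->unwritten: %d\n",
                 data_fork_ok(DELALLOC, UNWRITTEN));              /* 0 */
          printf("CoW fork  delalloc->unwritten: %d\n",
                 cow_fork_ok(DELALLOC, UNWRITTEN));               /* 1 */
          printf("CoW fork  unwritten->real:     %d\n",
                 cow_fork_ok(UNWRITTEN, REAL));                   /* 1 */
          return 0;
      }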
    • xfs: verify free block header fields · de14c5f5
      Authored by Darrick J. Wong
      Perform basic sanity checking of the directory free block header
      fields so that we avoid hanging the system on invalid data.
      
      (Granted that just means that now we shutdown on directory write,
      but that seems better than hanging...)
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      de14c5f5
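      A hedged sketch of what such a sanity check amounts to (the header layout,
      magic value, and capacity constant below are stand-ins, not the on-disk XFS
      dir3 free block format): reject headers whose fields are out of range before
      the block is trusted, so corrupt metadata produces an error instead of a
      hang.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      /* Toy directory free-block header, loosely modelled on the on-disk form. */
      struct toy_free_hdr {
          uint32_t magic;
          int32_t  firstdb;   /* first data block covered by this free block */
          int32_t  nvalid;    /* number of valid entries */
          int32_t  nused;     /* number of entries in use */
      };

      #define TOY_FREE_MAGIC   0x46524545u   /* "FREE", a stand-in magic */
      #define TOY_MAX_BESTS    505           /* illustrative per-block capacity */

      static bool toy_free_hdr_ok(const struct toy_free_hdr *hdr)
      {
          if (hdr->magic != TOY_FREE_MAGIC)
              return false;
          if (hdr->firstdb < 0)
              return false;
          if (hdr->nvalid < 0 || hdr->nvalid > TOY_MAX_BESTS)
              return false;
          if (hdr->nused < 0 || hdr->nused > hdr->nvalid)
              return false;
          return true;
      }

      int main(void)
      {
          struct toy_free_hdr bad = { TOY_FREE_MAGIC, 0, 7, 99 };  /* nused > nvalid */
          printf("header ok: %d\n", toy_free_hdr_ok(&bad));
          return 0;
      }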
    • xfs: check for obviously bad level values in the bmbt root · b3bf607d
      Authored by Darrick J. Wong
      We can't handle a bmbt that's taller than BTREE_MAXLEVELS, and there's
      no such thing as a zero-level bmbt (for that we have extents format),
      so if we see this, send back an error code.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      b3bf607d
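      A minimal sketch of the bound check (the constant and the error number are
      illustrative, not the kernel's): a root level of zero or above the maximum
      supported tree height is rejected with an error rather than trusted.

      #include <stdio.h>

      #define TOY_BTREE_MAXLEVELS 9      /* illustrative maximum tree height */
      #define TOY_EFSCORRUPTED    117    /* example errno for corrupt metadata */

      /* Reject obviously bad bmbt root levels read from disk. */
      static int check_bmbt_root_level(unsigned int level)
      {
          if (level == 0 || level > TOY_BTREE_MAXLEVELS)
              return -TOY_EFSCORRUPTED;
          return 0;
      }

      int main(void)
      {
          printf("level 0:  %d\n", check_bmbt_root_level(0));
          printf("level 3:  %d\n", check_bmbt_root_level(3));
          printf("level 64: %d\n", check_bmbt_root_level(64));
          return 0;
      }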
    • xfs: filter out obviously bad btree pointers · d5a91bae
      Authored by Darrick J. Wong
      Don't let anybody load an obviously bad btree pointer.  Since the values
      come from disk, we must return an error, not just ASSERT.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      d5a91bae
    • xfs: fail _dir_open when readahead fails · 7a652bbe
      Authored by Darrick J. Wong
      When we open a directory, we try to readahead block 0 of the directory
      on the assumption that we're going to need it soon.  If the bmbt is
      corrupt, the directory will never be usable and the readahead fails
      immediately, so we might as well prevent the directory from being opened
      at all.  This prevents a subsequent read or modify operation from
      hitting it and taking the fs offline.
      
      NOTE: We're only checking for early failures in the block mapping, not
      the readahead directory block itself.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      7a652bbe
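      A loose model of the behavioural change above (the function names are
      invented; the real patch propagates the block-mapping error from the
      readahead attempt out of the open path): the readahead result stops being
      fire-and-forget, so a corrupt directory fails at open time instead of at a
      later read or modify.

      #include <stdio.h>

      #define TOY_EFSCORRUPTED 117       /* example errno for corrupt metadata */

      /* Pretend mapping block 0 fails because the bmbt is corrupt. */
      static int toy_dir_readahead_block0(void)
      {
          return -TOY_EFSCORRUPTED;
      }

      /* After the patch: an early mapping failure fails the open itself. */
      static int toy_dir_open(void)
      {
          int error = toy_dir_readahead_block0();
          if (error)
              return error;       /* refuse to open an unusable directory */
          return 0;
      }

      int main(void)
      {
          printf("open() -> %d\n", toy_dir_open());
          return 0;
      }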
    • xfs: fix toctou race when locking an inode to access the data map · 4b5bd5bf
      Authored by Darrick J. Wong
      We use di_format and if_flags to decide whether we're grabbing the ilock
      in btree mode (btree extents not loaded) or shared mode (anything else),
      but the state of those fields can be changed by other threads that are
      also trying to load the btree extents -- IFEXTENTS gets set before the
      _bmap_read_extents call and cleared if it fails.
      
      We don't actually need to have IFEXTENTS set until after the bmbt
      records are successfully loaded and validated, which will fix the race
      between multiple threads trying to read the same directory.  The next
      patch strengthens directory bmbt validation by refusing to open the
      directory if reading the bmbt to start directory readahead fails.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      4b5bd5bf
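      A hedged sketch of the ordering change (the structure, flag, and function
      names are stand-ins for the XFS fields): the "extents loaded" flag is
      published only after the records have been read and validated, so a
      concurrent reader that tests the flag sees either nothing loaded or a fully
      valid set, never an in-between state.

      #include <stdbool.h>
      #include <stdio.h>

      struct toy_fork {
          bool extents_loaded;    /* stand-in for the IFEXTENTS flag */
          int  nextents;
      };

      static int read_extents_from_disk(struct toy_fork *fork)
      {
          fork->nextents = 42;    /* pretend the bmbt read succeeded */
          return 0;
      }

      /* Racy variant: the flag is set before the load and cleared on failure,
       * so another thread can observe it while the records are not yet valid. */
      static int load_extents_racy(struct toy_fork *fork)
      {
          fork->extents_loaded = true;
          int error = read_extents_from_disk(fork);
          if (error)
              fork->extents_loaded = false;
          return error;
      }

      /* Fixed variant: publish the flag only after a successful load. */
      static int load_extents_fixed(struct toy_fork *fork)
      {
          int error = read_extents_from_disk(fork);
          if (error)
              return error;
          fork->extents_loaded = true;
          return 0;
      }

      int main(void)
      {
          struct toy_fork f = { false, 0 }, g = { false, 0 };
          load_extents_racy(&f);
          load_extents_fixed(&g);
          printf("racy: loaded=%d  fixed: loaded=%d nextents=%d\n",
                 f.extents_loaded, g.extents_loaded, g.nextents);
          return 0;
      }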
  6. 31 January 2017, 9 commits
  7. 30 January 2017, 3 commits
    • Linux 4.10-rc6 · 566cf877
      Authored by Linus Torvalds
      566cf877
    • drm/i915: Check for NULL i915_vma in intel_unpin_fb_obj() · 39cb2c9a
      Authored by Linus Torvalds
      I've seen this trigger twice now, where the i915_gem_object_to_ggtt()
      call in intel_unpin_fb_obj() returns NULL, resulting in an oops
      immediately afterwards as the (inlined) call to i915_vma_unpin_fence()
      tries to dereference it.
      
      It seems to be some race condition where the object is going away at
      shutdown time, since both times happened when shutting down the X
      server.  The call chains were different:
      
       - VT ioctl(KDSETMODE, KD_TEXT):
      
          intel_cleanup_plane_fb+0x5b/0xa0 [i915]
          drm_atomic_helper_cleanup_planes+0x6f/0x90 [drm_kms_helper]
          intel_atomic_commit_tail+0x749/0xfe0 [i915]
          intel_atomic_commit+0x3cb/0x4f0 [i915]
          drm_atomic_commit+0x4b/0x50 [drm]
          restore_fbdev_mode+0x14c/0x2a0 [drm_kms_helper]
          drm_fb_helper_restore_fbdev_mode_unlocked+0x34/0x80 [drm_kms_helper]
          drm_fb_helper_set_par+0x2d/0x60 [drm_kms_helper]
          intel_fbdev_set_par+0x18/0x70 [i915]
          fb_set_var+0x236/0x460
          fbcon_blank+0x30f/0x350
          do_unblank_screen+0xd2/0x1a0
          vt_ioctl+0x507/0x12a0
          tty_ioctl+0x355/0xc30
          do_vfs_ioctl+0xa3/0x5e0
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x13/0x94
      
       - i915 unpin_work workqueue:
      
          intel_unpin_work_fn+0x58/0x140 [i915]
          process_one_work+0x1f1/0x480
          worker_thread+0x48/0x4d0
          kthread+0x101/0x140
      
      and this patch purely papers over the issue by adding a NULL pointer
      check and a WARN_ON_ONCE() to avoid the oops that would then generally
      make the machine unresponsive.
      
      Other callers of i915_gem_object_to_ggtt() seem to also check for the
      returned pointer being NULL and warn about it, so this clearly has
      happened before in other places.
      
      [ Reported it originally to the i915 developers on Jan 8, applying the
        ugly workaround on my own now after triggering the problem for the
        second time with no feedback.
      
        This is likely to be the same bug reported as
      
           https://bugs.freedesktop.org/show_bug.cgi?id=98829
           https://bugs.freedesktop.org/show_bug.cgi?id=99134
      
        which has a patch for the underlying problem, but it hasn't gotten to
        me, so I'm applying the workaround. ]
      
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Imre Deak <imre.deak@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      39cb2c9a
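      A hedged sketch of the workaround pattern (structure and function names are
      simplified stand-ins for the i915 ones, and the warn-once helper only
      imitates WARN_ON_ONCE()): check the looked-up pointer for NULL, warn once so
      the condition stays visible, and bail out instead of dereferencing.

      #include <stdbool.h>
      #include <stdio.h>

      struct toy_vma { int fence; };

      /* One-shot warning, loosely modelled on the kernel's WARN_ON_ONCE(). */
      static bool toy_warn_on_once(bool cond, const char *what)
      {
          static bool warned;

          if (cond && !warned) {
              warned = true;
              fprintf(stderr, "warn: %s\n", what);
          }
          return cond;
      }

      /* Pretend the object-to-vma lookup lost a race and found nothing. */
      static struct toy_vma *toy_object_to_ggtt(void)
      {
          return NULL;
      }

      static void toy_unpin_fb_obj(void)
      {
          struct toy_vma *vma = toy_object_to_ggtt();

          /* Paper over the race: warn once and return instead of oopsing. */
          if (toy_warn_on_once(vma == NULL, "vma == NULL"))
              return;

          vma->fence = 0;     /* safe to touch only past the check */
      }

      int main(void)
      {
          toy_unpin_fb_obj();
          toy_unpin_fb_obj();     /* the warning fires only the first time */
          return 0;
      }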
    • Merge branch 'parisc-4.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 2c5d9555
      Authored by Linus Torvalds
      Pull two parisc fixes from Helge Deller:
       "One fix to avoid usage of BITS_PER_LONG in user-space exported swab.h
        header which breaks compiling qemu, and one trivial fix for printk
        continuation in the parisc parport driver"
      
      * 'parisc-4.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: Don't use BITS_PER_LONG in userspace-exported swab.h header
        parisc, parport_gsc: Fixes for printk continuation lines
      2c5d9555
  8. 29 January 2017, 5 commits