1. 15 Feb, 2016 5 commits
    • xfs: factor mapping out of xfs_do_writepage · bfce7d2e
      Dave Chinner committed
      Separate out the bufferhead based mapping from the writepage code so
      that we have a clear separation of the page operations and the
      bufferhead state.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      bfce7d2e
    • xfs: xfs_cluster_write is redundant · ad68972a
      Dave Chinner committed
      xfs_cluster_write() is not necessary now that xfs_vm_writepages()
      aggregates writepage calls across a single mapping. This means we no
      longer need to do page lookups in xfs_cluster_write, so writeback
      only needs to look up the page cache once per page being written.
      This also removes a large amount of mostly duplicate code between
      xfs_do_writepage() and xfs_convert_page().
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      ad68972a
    • xfs: Introduce writeback context for writepages · fbcc0256
      Dave Chinner committed
      xfs_vm_writepages() calls generic_writepages to write back a range of
      a file, but then xfs_vm_writepage() clusters pages itself as it does
      not have any context it can pass between ->writepage calls from
      __write_cache_pages().
      
      Introduce a writeback context for xfs_vm_writepages() and call
      __write_cache_pages directly with our own writepage callback so that
      we can pass that context to each writepage invocation. This
      encapsulates the current mapping, whether it is valid or not, the
      current ioend and its IO type and the ioend chain being built.
      
      This requires us to move the ioend submission up to the level where
      the writepage context is declared. This does mean we do not submit
      IO until we have packaged the entire writeback range, but with the block
      plugging in the writepages call this is the way IO is submitted,
      anyway.
      
      It also means that we need to handle discontiguous page ranges.  If
      the pages sent down by write_cache_pages to the writepage callback
      are discontiguous, we need to detect this and put each discontiguous
      page range into individual ioends. This is needed to ensure that the
      ioend accurately represents the range of the file that it covers so
      that file size updates during IO completion set the size correctly.
      Failure to take into account the discontiguous ranges results in
      files being too small when writeback patterns are non-sequential.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      fbcc0256
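The discontiguous-range handling described in this commit can be illustrated with a small Python model. This is only a sketch under stated assumptions: `Ioend` and `build_ioends` are hypothetical names rather than kernel structures, pages are reduced to their page-cache indices, and a 4096-byte page size is assumed.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size for the illustration


@dataclass
class Ioend:
    """Hypothetical stand-in for an ioend: one contiguous byte range of a file."""
    offset: int  # byte offset of the first page in the range
    size: int    # number of bytes covered


def build_ioends(page_indices):
    """Group page indices, in the order writeback delivers them, into ioends.

    A new ioend is started whenever the next page does not directly follow
    the previous one, so each ioend describes exactly one contiguous range.
    """
    ioends = []
    prev = None
    for index in page_indices:
        if prev is not None and index == prev + 1:
            ioends[-1].size += PAGE_SIZE  # extend the current contiguous range
        else:
            ioends.append(Ioend(offset=index * PAGE_SIZE, size=PAGE_SIZE))
        prev = index
    return ioends


ioends = build_ioends([0, 1, 2, 5, 6, 9])
for io in ioends:
    print(f"ioend: offset={io.offset} size={io.size}")
```

Because each modelled ioend ends exactly where its last page ends, a size update derived from `offset + size` never claims bytes that were not part of that IO.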
    • xfs: remove xfs_cancel_ioend · 150d5be0
      Dave Chinner committed
      We currently have code to cancel ioends being built because we
      change bufferhead state as we build the ioend. On error, this needs
      to be unwound and so we have cancelling code that walks the buffers
      on the ioend chain and undoes these state changes.
      
      However, the IO submission path already handles state changes for
      buffers when a submission error occurs, so we don't really need a
      separate cancel function to do this - we can simply submit the
      ioend chain with the specific error and it will be cancelled rather
      than submitted.
      
      Hence we can remove the explicit cancel code and just rely on
      submission to deal with the error correctly.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      150d5be0
    • xfs: remove nonblocking mode from xfs_vm_writepage · 988ef927
      Dave Chinner committed
      Remove the nonblocking optimisation done for mapping lookups during
      writeback. It's not clear that leaving a hole in the writeback range
      just because we couldn't get a lock is really a win, as it makes us
      do another small random IO later on rather than a large sequential
      IO now.
      
      As this gets in the way of sane error handling later on, just remove
      it for the moment; we can re-introduce an equivalent optimisation in
      future if we see problems due to extent map lock contention.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      988ef927
  2. 08 Jan, 2016 1 commit
  3. 03 Nov, 2015 3 commits
    • xfs: DAX does not use IO completion callbacks · 01a155e6
      Dave Chinner committed
      For DAX, we are now doing block zeroing during allocation. This
      means we no longer need a special DAX fault IO completion callback
      to do unwritten extent conversion. Because mmap never extends the
      file size (it SEGVs the process) we don't need a callback to update
      the file size, either. Hence we can remove the completion callbacks
      from the __dax_fault and __dax_mkwrite calls.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      01a155e6
    • xfs: Don't use unwritten extents for DAX · 1ca19157
      Dave Chinner committed
      DAX has a page fault serialisation problem with block allocation.
      Because it allows concurrent page faults and does not have a page
      lock to serialise faults to the same page, it can get two concurrent
      faults to the page that race.
      
      When two read faults race, this isn't a huge problem as the data
      underlying the page is not changing and so "detect and drop" works
      just fine. The issues are to do with write faults.
      
      When two write faults occur, we serialise block allocation in
      get_blocks() so only one fault will allocate the extent. It will,
      however, be marked as an unwritten extent, and that is where the
      problem lies - the DAX fault code cannot differentiate between a
      block that was just allocated and a block that was preallocated and
      needs zeroing. The result is that both write faults end up zeroing
      the block and attempting to convert it back to written.
      
      The problem is that the first fault can zero and convert before the
      second fault starts zeroing, resulting in the zeroing for the second
      fault overwriting the data that the first fault wrote with zeros.
      The second fault then attempts to convert the unwritten extent,
      which is then a no-op because it's already written. Data loss occurs
      as a result of this race.
      
      Because there is no sane locking construct in the page fault code
      that we can use for serialisation across the page faults, we need to
      ensure block allocation and zeroing occurs atomically in the
      filesystem. This means we can still take concurrent page faults and
      the only time they will serialise is in the filesystem
      mapping/allocation callback. The page fault code will always see
      written, initialised extents, so we will be able to remove the
      unwritten extent handling from the DAX code when all filesystems are
      converted.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      1ca19157
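The write-fault race in this commit can be replayed deterministically with a toy Python model. Everything here is an assumption made for illustration: `Block`, `get_blocks_unwritten`, `get_blocks_zeroed` and `zero_and_convert` are hypothetical names, the block is 8 bytes, and the interleaving is fixed by hand rather than produced by real concurrency.

```python
BLOCK_SIZE = 8  # tiny block so the contents are easy to read


class Block:
    def __init__(self):
        self.data = bytearray(b"\xff" * BLOCK_SIZE)  # stale media contents
        self.allocated = False
        self.unwritten = False


def get_blocks_unwritten(block):
    """Old behaviour: allocation is serialised, but the extent stays unwritten."""
    if not block.allocated:
        block.allocated = True
        block.unwritten = True
    return block.unwritten  # what the fault handler observes


def get_blocks_zeroed(block):
    """New behaviour: allocate and zero in one step inside the filesystem."""
    if not block.allocated:
        block.allocated = True
        block.data[:] = bytes(BLOCK_SIZE)
    return False  # the fault handler only ever sees written extents


def zero_and_convert(block):
    block.data[:] = bytes(BLOCK_SIZE)
    block.unwritten = False  # a no-op if already converted


def run(get_blocks):
    blk = Block()
    a_unwritten = get_blocks(blk)   # fault A maps the block
    b_unwritten = get_blocks(blk)   # fault B maps the same block
    if a_unwritten:
        zero_and_convert(blk)       # A zeroes and converts
    blk.data[0:4] = b"AAAA"         # A's store lands
    if b_unwritten:
        zero_and_convert(blk)       # B zeroes again, wiping A's data
    blk.data[4:8] = b"BBBB"         # B's store lands
    return bytes(blk.data)


racy = run(get_blocks_unwritten)
fixed = run(get_blocks_zeroed)
print(racy, fixed)
```

In the modelled old path the first writer's bytes come back as zeros, while in the modelled new path both stores survive because the fault handler never has a reason to zero the block itself.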
    • xfs: fix inode size update overflow in xfs_map_direct() · 3e12dbbd
      Dave Chinner committed
      Both direct IO and DAX pass an offset and count into get_blocks that
      will overflow a s64 variable when an IO goes into the last supported
      block in a file (i.e. at offset 2^63 - 1 FSB bytes). This can be seen
      from the tracing:
      
      xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096
      xfs_gbmap_direct:     [...] offset 0x7ffffffffffff000 count 4096
      xfs_gbmap_direct_none:[...] offset 0x7ffffffffffff000 count 4096
      
      0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that
      overflows the s64 offset and we fail to detect the need for a
      filesize update and an ioend is not allocated.
      
      This is *mostly* avoided for direct IO because such extending IOs
      occur with full block allocation, and so the "IS_UNWRITTEN()" check
      still evaluates as true and we get an ioend that way. However, doing
      single sector extending IOs to this last block will expose the fact
      that file size updates will not occur after the first allocating
      direct IO as the overflow will then be exposed.
      
      There is one further complexity: the DAX page fault path also
      exposes the same issue in block allocation. However, page faults
      cannot extend the file size, so in this case we want to allocate the
      block but do not want to allocate an ioend to enable file size
      update at IO completion. Hence we now need to distinguish between
      the direct IO path allocation and dax fault path allocation to
      avoid leaking ioend structures.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      3e12dbbd
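The overflow arithmetic quoted in this commit can be checked with a short Python sketch. Python integers do not wrap, so `to_s64` is a hypothetical helper that emulates a signed 64-bit variable. The starting file size and the "treat a negative end as extending" check are assumptions for illustration, not a statement of what the patch does.

```python
S64_MAX = (1 << 63) - 1


def to_s64(value):
    """Wrap an integer to a signed 64-bit value (two's complement)."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value > S64_MAX else value


offset = 0x7ffffffffffff000  # offset from the trace in the commit message
count = 4096
isize = offset               # assumed current file size: EOF at the last block

end = to_s64(offset + count)  # what an s64 variable would hold

# Naive check: the wrapped end is negative, so the IO looks like it does
# not extend the file and no size update would be scheduled.
naive_extends = end > isize

# Overflow-aware check (an assumption): a negative end can only come
# from wrapping, so treat it as extending the file.
aware_extends = end > isize or end < 0

print(hex(offset + count), end, naive_extends, aware_extends)
```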
  4. 12 Oct, 2015 2 commits
    • xfs: add missing ilock around dio write last extent alignment · 009c6e87
      Brian Foster committed
      The iomap codepath (via get_blocks()) acquires and releases the inode
      lock in the case of a direct write that requires block allocation. This
      is because xfs_iomap_write_direct() allocates a transaction, which means
      the ilock must be dropped and reacquired after the transaction is
      allocated and reserved.
      
      xfs_iomap_write_direct() invokes xfs_iomap_eof_align_last_fsb() before
      the transaction is created and thus before the ilock is reacquired. This
      can lead to calls to xfs_iread_extents() and reads of the in-core extent
      list without any synchronization (via xfs_bmap_eof() and
      xfs_bmap_last_extent()). xfs_iread_extents() assert fails if the ilock
      is not held, but this is not currently seen in practice as the current
      callers had already invoked xfs_bmapi_read().
      
      What has been seen in practice are reports of crashes down in the
      xfs_bmap_eof() codepath on direct writes due to seemingly bogus pointer
      references from xfs_iext_get_ext(). While an explicit reproducer is not
      currently available to confirm the cause of the problem, crash analysis
      and code inspection from David Jeffery had identified the insufficient
      locking.
      
      xfs_iomap_eof_align_last_fsb() is called from other contexts with the
      inode lock already held, so we cannot acquire it therein.
      __xfs_get_blocks() acquires and drops the ilock with variable flags to
      cover the event that the extent list must be read in. The common case is
      that __xfs_get_blocks() acquires the shared ilock. To provide locking
      around the last extent alignment call without adding more lock cycles to
      the dio path, update xfs_iomap_write_direct() to expect the shared ilock
      held on entry and do the extent alignment under its protection. Demote
      the lock, if necessary, from __xfs_get_blocks() and push the
      xfs_qm_dqattach() call outside of the shared lock critical section.
      Also, add an assert to document that the extent list is always expected
      to be present in this path. Otherwise, we risk a call to
      xfs_iread_extents() while under the shared ilock. This is safe as all
      current callers have executed an xfs_bmapi_read() call under the current
      iolock context.
      Reported-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      009c6e87
    • cancel the setfilesize transation when io error happen · 5cb13dcd
      Zhaohongjiang committed
      When I ran the xfstest/073 case, the remount process blocked waiting
      for the transaction count to drop to zero. I found that an IO error
      had occurred and the setfilesize transaction was not released
      properly. We should cancel the setfilesize transaction in this case.
      
      Reproduction steps:
      1. dd if=/dev/zero of=xfs1.img bs=1M count=2048
      2. mkfs.xfs xfs1.img
      3. losetup -f ./xfs1.img /dev/loop0
      4. mount -t xfs /dev/loop0 /home/test_dir/
      5. mkdir /home/test_dir/test
      6. mkfs.xfs -dfile,name=image,size=2g
      7. mount -t xfs -o loop image /home/test_dir/test
      8. cp a file bigger than 2g to /home/test_dir/test
      9. mount -t xfs -o remount,ro /home/test_dir/test
      
      [ dchinner: moved io error detection to xfs_setfilesize_ioend() after
        transaction context restoration. ]
      Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      5cb13dcd
  5. 28 Aug, 2015 1 commit
    • xfs: return errors from partial I/O failures to files · c9eb256e
      David Jeffery committed
      There is an issue with xfs's error reporting in some cases of I/O partially
      failing and partially succeeding. Calls like fsync() can report success even
      though not all I/O was successful in partial-failure cases such as one disk of
      a RAID0 array being offline.
      
      The issue can occur when there is more than one bio per xfs_ioend struct.
      Each call to xfs_end_bio() for a bio completing will write a value to
      ioend->io_error.  If a successful bio completes after any failed bio, no
      error is reported due to it writing 0 over the error code set by any failed bio.
      The I/O error information is now lost and when the ioend is completed
      only success is reported back up the filesystem stack.
      
      xfs_end_bio() should only set ioend->io_error in the case of BIO_UPTODATE
      being clear.  ioend->io_error is initialized to 0 at allocation so only needs
      to be updated by a failed bio. Also check that ioend->io_error is 0 so that
      the first error reported will be the error code returned.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: David Jeffery <djeffery@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      c9eb256e
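The lost-error problem in this commit comes down to "last writer wins" versus "first error wins". The Python sketch below models both. `Ioend`, `end_bio_buggy`, `end_bio_fixed` and `complete` are hypothetical names, and each bio completion is reduced to a single integer result (0 for success, a negative errno for failure).

```python
EIO = 5  # errno value used for the illustration


class Ioend:
    def __init__(self):
        self.io_error = 0  # initialised to 0 at allocation


def end_bio_buggy(ioend, bio_result):
    # Every completion stores its result, so a later success hides an
    # earlier failure.
    ioend.io_error = bio_result


def end_bio_fixed(ioend, bio_result):
    # Only a failed bio updates the field, and only if no error has been
    # recorded yet, so the first error is the one reported.
    if bio_result != 0 and ioend.io_error == 0:
        ioend.io_error = bio_result


def complete(end_bio, bio_results):
    """Run one ioend through a sequence of bio completions."""
    ioend = Ioend()
    for result in bio_results:
        end_bio(ioend, result)
    return ioend.io_error


results = [-EIO, 0]  # one failed bio followed by one successful bio
print(complete(end_bio_buggy, results), complete(end_bio_fixed, results))
```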
  6. 15 Aug, 2015 1 commit
  7. 14 Aug, 2015 1 commit
  8. 29 Jul, 2015 1 commit
    • block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig committed
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not being persistent
      when bios are queued up, and are not passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4246a0b6
  9. 04 Jun, 2015 4 commits
  10. 02 Jun, 2015 1 commit
    • memcg: add per cgroup dirty page accounting · c4843a75
      Greg Thelen committed
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat() which returns the memcg later
      needed by mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c4843a75
  11. 06 May, 2015 1 commit
    • bio: skip atomic inc/dec of ->bi_cnt for most use cases · dac56212
      Jens Axboe committed
      Struct bio has a reference count that controls when it can be freed.
      The most common use case is allocating the bio, which then returns with a
      single reference to it, doing IO, and then dropping that single
      reference. We can remove this atomic_dec_and_test() in the completion
      path, if nobody else is holding a reference to the bio.
      
      If someone does call bio_get() on the bio, then we flag the bio as
      now having a valid count and that we must properly honor the reference
      count when it's being put.
      Tested-by: Robert Elliott <elliott@hp.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      dac56212
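The reference-counting shortcut in this commit can be sketched as a small Python model. It is not kernel code: the `Bio` class and its `reffed`, `freed` and `atomic_ops` fields are hypothetical, and "atomic" operations are merely counted so the saving is visible.

```python
class Bio:
    def __init__(self):
        self.cnt = 1         # allocation returns a single reference
        self.reffed = False  # set once anyone takes an extra reference
        self.freed = False
        self.atomic_ops = 0  # number of modelled atomic inc/dec operations


def bio_get(bio):
    """Take an extra reference and flag the count as valid from now on."""
    bio.reffed = True
    bio.cnt += 1
    bio.atomic_ops += 1


def bio_put(bio):
    """Drop a reference, skipping the counter when it was never shared."""
    if not bio.reffed:
        bio.freed = True     # sole owner: free without touching the count
        return
    bio.cnt -= 1
    bio.atomic_ops += 1
    if bio.cnt == 0:
        bio.freed = True


# Common case: allocate, do IO, drop the only reference.
common = Bio()
bio_put(common)

# Shared case: someone else holds a reference for a while.
shared = Bio()
bio_get(shared)
bio_put(shared)
still_alive = not shared.freed
bio_put(shared)

print(common.atomic_ops, shared.atomic_ops, still_alive)
```

In the modelled common case the bio is freed with no atomic operations at all, and the counter is consulted only after `bio_get()` has been called.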
  12. 16 Apr, 2015 6 commits
    • xfs: DIO write completion size updates race · b9d59846
      Dave Chinner committed
      xfs_end_io_direct_write() can race with other IO completions when
      updating the in-core inode size. The IO completion processing is not
      serialised for direct IO - it is done either under the
      IOLOCK_SHARED for non-AIO DIO, or without any IOLOCK held at all
      during AIO DIO completion. Hence the non-atomic test-and-set update
      of the in-core inode size is racy and can result in the in-core
      inode size going backwards if the race is hit just right.
      
      If the inode size goes backwards, this can trigger the EOF zeroing
      code to run incorrectly on the next IO, which then will zero data
      that has successfully been written to disk by a previous DIO.
      
      To fix this bug, we need to serialise the test/set updates of the
      in-core inode size. This first patch introduces locking around the
      relevant updates and checks in the DIO path. Because we now have an
      ioend in xfs_end_io_direct_write(), we know exactly when we are
      doing an IO that requires an in-core EOF update, and we know that
      they are not running in interrupt context. As such, we do not need to
      use irqsave() spinlock variants to protect against interrupts while
      the lock is held.
      
      Hence we can use an existing spinlock in the inode to do this
      serialisation and so not need to grow the struct xfs_inode just to
      work around this problem.
      
      This patch does not address the test/set EOF update in
      generic_file_write_direct() for various reasons - that will be done
      as a followup with separate explanation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      b9d59846
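The size-update race in this commit can be shown with a hand-interleaved Python model. It is a sketch only: `Inode`, `racy_completions` and `locked_update` are hypothetical names, the worst-case ordering is written out explicitly instead of relying on real thread timing, and `threading.Lock` merely stands in for the inode spinlock mentioned in the commit message.

```python
import threading


class Inode:
    def __init__(self):
        self.size = 0
        self.lock = threading.Lock()  # stand-in for the inode spinlock


def racy_completions(inode, first_end, second_end):
    """Interleave two unlocked test-and-set updates in the worst-case order."""
    first_extends = first_end > inode.size    # completion 1 tests
    second_extends = second_end > inode.size  # completion 2 tests
    if first_extends:
        inode.size = first_end                # completion 1 sets
    if second_extends:
        inode.size = second_end               # completion 2 sets a stale value


def locked_update(inode, new_end):
    """Test and set under the lock, so the pair cannot be split."""
    with inode.lock:
        if new_end > inode.size:
            inode.size = new_end


racy = Inode()
racy_completions(racy, 8192, 4096)

safe = Inode()
locked_update(safe, 8192)
locked_update(safe, 4096)

print(racy.size, safe.size)
```

In the racy run the size ends at 4096 even though a write ending at 8192 has completed, which is the "going backwards" case the commit message describes. With the test and set held together under the lock, the size stays at 8192.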
    • xfs: DIO writes within EOF don't need an ioend · a06c277a
      Dave Chinner committed
      DIO writes that lie entirely within EOF have nothing to do in IO
      completion. In this case, we don't need no steekin' ioend, and so we
      can avoid allocating an ioend until we have a mapping that spans
      EOF.
      
      This means that IO completion has two contexts - deferred completion
      to the dio workqueue that uses an ioend, and interrupt completion
      that does nothing because there is nothing that can be done in this
      context.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      a06c277a
    • xfs: handle DIO overwrite EOF update completion correctly · 6dfa1b67
      Dave Chinner committed
      Currently a DIO overwrite that extends the EOF (e.g. sub-block IO or
      write into allocated blocks beyond EOF) requires a transaction for
      the EOF update. This is done in IO completion context, but we aren't
      explicitly handling this situation properly and so it can run in
      interrupt context. Ensure that we defer IO that spans EOF correctly
      to the DIO completion workqueue, and now that we have an ioend in IO
      completion we can use the common ioend completion path to do all the
      work.
      
      Note: we do not preallocate the append transaction as we can have
      multiple mapping and allocation calls per direct IO. Hence
      preallocating can still leave us with nested transactions by
      attempting to map and allocate more blocks after we've preallocated
      an append transaction.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      6dfa1b67
    • xfs: DIO needs an ioend for writes · d5cc2e3f
      Dave Chinner committed
      Currently we can only tell DIO completion that an IO requires
      unwritten extent completion. This is done by a hacky non-null
      private pointer passed to IO completion, but the private pointer
      does not actually contain any information that is used.
      
      We also need to pass to IO completion the fact that the IO may be
      beyond EOF and so a size update transaction needs to be done. This
      is currently determined by checks in the io completion, but we need
      to determine if this is necessary at block mapping time as we need
      to defer the size update transactions to a completion workqueue,
      just like unwritten extent conversion.
      
      To do this, first we need to allocate and pass an ioend to IO
      completion. Add this for unwritten extent conversion; we'll do the
      EOF updates in the next commit.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      d5cc2e3f
    • xfs: move DIO mapping size calculation · 1fdca9c2
      Dave Chinner committed
      The mapping size calculation is done last in __xfs_get_blocks(), but
      we are going to need the actual mapping size we will use to map the
      direct IO correctly in xfs_map_direct(). Factor out the calculation
      for code clarity, and move the call to be the first operation in
      mapping the extent to the returned buffer.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      1fdca9c2
    • xfs: factor DIO write mapping from get_blocks · a719370b
      Dave Chinner committed
      Clarify and separate the buffer mapping logic so that the direct IO mapping is
      not tangled up in propagating the extent status to the mapping buffer. This
      makes it easier to extend the direct IO mapping to use an ioend in future.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      a719370b
  13. 12 Apr, 2015 3 commits
  14. 26 Mar, 2015 1 commit
  15. 02 Feb, 2015 1 commit
  16. 28 Nov, 2014 3 commits
  17. 02 Oct, 2014 1 commit
    • xfs: restore buffer_head unwritten bit on ioend cancel · 07d08681
      Brian Foster committed
      xfs_vm_writepage() walks each buffer_head on the page, maps to the block
      on disk and attaches to a running ioend structure that represents the
      I/O submission. A new ioend is created when the type of I/O (unwritten,
      delayed allocation or overwrite) required for a particular buffer_head
      differs from the previous. If a buffer_head is a delalloc or unwritten
      buffer, the associated bits are cleared by xfs_map_at_offset() once the
      buffer_head is added to the ioend.
      
      The process of mapping each buffer_head occurs in xfs_map_blocks() and
      acquires the ilock in blocking or non-blocking mode, depending on the
      type of writeback in progress. If the lock cannot be acquired for
      non-blocking writeback, we cancel the ioend, redirty the page and
      return. Writeback will revisit the page at some later point.
      
      Note that we acquire the ilock for each buffer on the page. Therefore
      during non-blocking writeback, it is possible to add an unwritten buffer
      to the ioend, clear the unwritten state, fail to acquire the ilock when
      mapping a subsequent buffer and cancel the ioend. If this occurs, the
      unwritten status of the buffer sitting in the ioend has been lost. The
      page will eventually hit writeback again, but xfs_vm_writepage() submits
      overwrite I/O instead of unwritten I/O and does not perform unwritten
      extent conversion at I/O completion. This leads to data corruption
      because unwritten extents are treated as holes on reads and zeroes are
      returned instead of reading from disk.
      
      Modify xfs_cancel_ioend() to restore the buffer unwritten bit for ioends
      of type XFS_IO_UNWRITTEN. This ensures that unwritten extent conversion
      occurs once the page is eventually written back.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      07d08681
  18. 23 Sep, 2014 1 commit
    • xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly · 0d085a52
      Dave Chinner committed
      XFS has been having trouble with stray delayed allocation extents
      beyond EOF for a long time. Recent changes to the collapse range
      code have triggered erroneous EBUSY errors on page invalidation for
      block size smaller than page size filesystems. These
      have been caused by dirty buffers beyond EOF on a partial page which
      do not get written to disk during a sync.
      
      The issue is that write-ahead in xfs_cluster_write() finds such a
      partial page and handles it by leaving the page dirty but pushing it
      into a writeback state. This used to work just fine, as the
      write_cache_pages() code would then find the dirty partial page in
      the next mapping tree lookup as the dirty tag is still set.
      
      Unfortunately, when we moved to a mark and sweep approach to
      writeback to fix other writeback sync issues, we broke this. The
      act of marking the page as under writeback now clears the TOWRITE
      tag in the radix tree, even though the page is still dirty. This
      causes the TOWRITE tag to be cleared, and hence the next lookup on
      the mapping tree does not find the dirty partial page and so doesn't
      try to write it again.
      
      This same writeback bug was found recently in ext4 and fixed in
      commit 1c8349a1 ("ext4: fix data integrity sync in ordered mode")
      without communication to the wider filesystem community. We can use
      exactly the same fix here so the TOWRITE flag is not cleared on
      partial page writes.
      
      cc: stable@vger.kernel.org # dependent on 1c8349a1
      Root-cause-found-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      0d085a52
  19. 02 Sep, 2014 1 commit
    • xfs: don't dirty buffers beyond EOF · 22e757a4
      Dave Chinner committed
      generic/263 is failing fsx at this point with a page spanning
      EOF that cannot be invalidated. The operations are:
      
      1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
      1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
      1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
      
      where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
      write attempts to invalidate the cached page over this range, it
      fails with -EBUSY and so any attempt to do page invalidation fails.
      
      The real question is this: Why can't that page be invalidated after
      it has been written to disk and cleaned?
      
      Well, there's data on the first two buffers in the page (1k block
      size, 4k page), but the third buffer on the page (i.e. beyond EOF)
      is failing drop_buffers because its bh->b_state == 0x3, which is
      BH_Uptodate | BH_Dirty.  IOWs, there are dirty buffers beyond EOF. Say
      what?
      
      OK, set_buffer_dirty() is called on all buffers from
      __set_page_dirty_buffers(), regardless of whether the buffer is
      beyond EOF or not, which means that when we get to ->writepage,
      we have buffers marked dirty beyond EOF that we need to clean.
      So, we need to implement our own .set_page_dirty method that
      doesn't dirty buffers beyond EOF.
      
      This is messy because the buffer code is not meant to be shared
      and it has interesting locking issues on the buffer dirty bits.
      So just copy and paste it and then modify it to suit what we need.
      
      Note: the solution the other filesystems and generic block code use,
      marking the buffers clean in ->writepage, does not work for XFS.
      It still leaves dirty buffers beyond EOF and invalidations still
      fail. Hence rather than play whack-a-mole, this patch simply
      prevents those buffers from being dirtied in the first place.
      
      cc: <stable@kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      22e757a4
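The idea of a dirtying method that skips buffers beyond EOF can be sketched in Python using the numbers from this commit message (1k blocks, 4k pages, EOF at 0x5e569). The function name `buffers_to_dirty` is hypothetical, and the rule "dirty a buffer only if it starts before EOF" is an assumption made for illustration rather than a quotation of the patch.

```python
def buffers_to_dirty(page_offset, isize, page_size=4096, block_size=1024):
    """Return one flag per buffer on the page: True if it should be dirtied.

    A buffer is dirtied only when its starting file offset lies before EOF,
    so buffers wholly beyond EOF stay clean.
    """
    buffers_per_page = page_size // block_size
    return [
        page_offset + i * block_size < isize
        for i in range(buffers_per_page)
    ]


# The page spanning EOF in the fsx trace: it starts at 0x5e000, EOF is 0x5e569.
flags = buffers_to_dirty(page_offset=0x5e000, isize=0x5e569)
print(flags)
```

With these inputs the first two buffers are dirtied and the last two are left clean, which matches the description of data on the first two buffers and a stray dirty buffer beyond EOF.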
  20. 25 Jun, 2014 1 commit
    • xfs: global error sign conversion · 2451337d
      Dave Chinner committed
      Convert all the errors in the core XFS code to negative error signs
      like the rest of the kernel and remove all the sign conversion we
      do in the interface layers.
      
      Errors for conversion (and comparison) found via searches like:
      
      $ git grep " E" fs/xfs
      $ git grep "return E" fs/xfs
      $ git grep " E[A-Z].*;$" fs/xfs
      
      Negation points found via searches like:
      
      $ git grep "= -[a-z,A-Z]" fs/xfs
      $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
      $ git grep " -[a-z].*;" fs/xfs
      
      [ with some bits I missed from Brian Foster ]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      2451337d
  21. 22 Jun, 2014 1 commit