1. 16 April 2015, 6 commits
    • xfs: DIO write completion size updates race · b9d59846
      Committed by Dave Chinner
      xfs_end_io_direct_write() can race with other IO completions when
      updating the in-core inode size. The IO completion processing is not
      serialised for direct IO - completions are done either under the
      IOLOCK_SHARED for non-AIO DIO, or without any IOLOCK held at all
      during AIO DIO completion. Hence the non-atomic test-and-set update
      of the in-core inode size is racy and can result in the in-core
      inode size going backwards if the race is hit just right.
      
      If the inode size goes backwards, this can trigger the EOF zeroing
      code to run incorrectly on the next IO, which then will zero data
      that has successfully been written to disk by a previous DIO.
      
      To fix this bug, we need to serialise the test/set updates of the
      in-core inode size. This first patch introduces locking around the
      relevant updates and checks in the DIO path. Because we now have an
      ioend in xfs_end_io_direct_write(), we know exactly when we are
      doing an IO that requires an in-core EOF update, and we know that
      it is not running in interrupt context. As such, we do not need to
      use irqsave() spinlock variants to protect against interrupts while
      the lock is held.
      
      Hence we can use an existing spinlock in the inode to do this
      serialisation and so not need to grow the struct xfs_inode just to
      work around this problem.
      
      This patch does not address the test/set EOF update in
      generic_file_write_direct() for various reasons - that will be done
      as a followup with separate explanation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
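
      A hedged sketch of the serialised update described above; which existing
      spinlock is reused (i_flags_lock here) and the surrounding names are
      assumptions for illustration, not quoted from the patch:

      /* Serialise the racy test/set of the in-core inode size. */
      spin_lock(&ip->i_flags_lock);
      if (offset + size > i_size_read(VFS_I(ip)))
              i_size_write(VFS_I(ip), offset + size);
      spin_unlock(&ip->i_flags_lock);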
    • xfs: DIO writes within EOF don't need an ioend · a06c277a
      Committed by Dave Chinner
      DIO writes that lie entirely within EOF have nothing to do in IO
      completion. In this case, we don't need no steekin' ioend, and so we
      can avoid allocating an ioend until we have a mapping that spans
      EOF.
      
      This means that IO completion has two contexts - deferred completion
      to the dio workqueue that uses an ioend, and interrupt completion
      that does nothing because there is nothing that can be done in this
      context.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
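
      A rough sketch of that split, assuming the 2015-era xfs_alloc_ioend()
      helper; the condition and the ioend type used here are illustrative:

      /* Only IO that reaches or crosses the in-core EOF needs an ioend. */
      if (offset + size > i_size_read(inode))
              bh_result->b_private = xfs_alloc_ioend(inode, XFS_IO_OVERWRITE);
      else
              bh_result->b_private = NULL;    /* interrupt completion: nothing to do */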
    • xfs: handle DIO overwrite EOF update completion correctly · 6dfa1b67
      Committed by Dave Chinner
      Currently a DIO overwrite that extends the EOF (e.g. sub-block IO or
      write into allocated blocks beyond EOF) requires a transaction for
      the EOF update. This is done in IO completion context, but we aren't
      explicitly handling this situation properly and so it can run in
      interrupt context. Ensure that we defer IO that spans EOF correctly
      to the DIO completion workqueue, and now that we have an ioend in IO
      completion we can use the common ioend completion path to do all the
      work.
      
      Note: we do not preallocate the append transaction as we can have
      multiple mapping and allocation calls per direct IO. Hence
      preallocating can still leave us with nested transactions by
      attempting to map and allocate more blocks after we've preallocated
      an append transaction.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: DIO needs an ioend for writes · d5cc2e3f
      Committed by Dave Chinner
      Currently we can only tell DIO completion that an IO requires
      unwritten extent completion. This is done by a hacky non-null
      private pointer passed to IO completion, but the private pointer
      does not actually contain any information that is used.
      
      We also need to pass to IO completion the fact that the IO may be
      beyond EOF and so a size update transaction needs to be done. This
      is currently determined by checks in the IO completion code, but we need
      to determine if this is necessary at block mapping time as we need
      to defer the size update transactions to a completion workqueue,
      just like unwritten extent conversion.
      
      To do this, first we need to allocate and pass an ioend to IO
      completion. Add this for unwritten extent conversion; we'll do the
      EOF updates in the next commit.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: move DIO mapping size calculation · 1fdca9c2
      Committed by Dave Chinner
      The mapping size calculation is done last in __xfs_get_blocks(), but
      we are going to need the actual mapping size we will use to map the
      direct IO correctly in xfs_map_direct(). Factor out the calculation
      for code clarity, and move the call to be the first operation in
      mapping the extent to the returned buffer.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: factor DIO write mapping from get_blocks · a719370b
      Committed by Dave Chinner
      Clarify and separate the buffer mapping logic so that the direct IO mapping is
      not tangled up in propagating the extent status to the mapping buffer. This
      makes it easier to extend the direct IO mapping to use an ioend in future.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  2. 12 April 2015, 3 commits
  3. 26 March 2015, 1 commit
  4. 02 February 2015, 1 commit
  5. 28 November 2014, 3 commits
  6. 02 October 2014, 1 commit
    • xfs: restore buffer_head unwritten bit on ioend cancel · 07d08681
      Committed by Brian Foster
      xfs_vm_writepage() walks each buffer_head on the page, maps to the block
      on disk and attaches to a running ioend structure that represents the
      I/O submission. A new ioend is created when the type of I/O (unwritten,
      delayed allocation or overwrite) required for a particular buffer_head
      differs from the previous. If a buffer_head is a delalloc or unwritten
      buffer, the associated bits are cleared by xfs_map_at_offset() once the
      buffer_head is added to the ioend.
      
      The process of mapping each buffer_head occurs in xfs_map_blocks() and
      acquires the ilock in blocking or non-blocking mode, depending on the
      type of writeback in progress. If the lock cannot be acquired for
      non-blocking writeback, we cancel the ioend, redirty the page and
      return. Writeback will revisit the page at some later point.
      
      Note that we acquire the ilock for each buffer on the page. Therefore
      during non-blocking writeback, it is possible to add an unwritten buffer
      to the ioend, clear the unwritten state, fail to acquire the ilock when
      mapping a subsequent buffer and cancel the ioend. If this occurs, the
      unwritten status of the buffer sitting in the ioend has been lost. The
      page will eventually hit writeback again, but xfs_vm_writepage() submits
      overwrite I/O instead of unwritten I/O and does not perform unwritten
      extent conversion at I/O completion. This leads to data corruption
      because unwritten extents are treated as holes on reads and zeroes are
      returned instead of reading from disk.
      
      Modify xfs_cancel_ioend() to restore the buffer unwritten bit for ioends
      of type XFS_IO_UNWRITTEN. This ensures that unwritten extent conversion
      occurs once the page is eventually written back.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
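
      A hedged sketch of the restore, as it might sit in the buffer walk of
      xfs_cancel_ioend() (loop context omitted, names as used above):

      /* Put back the state xfs_map_at_offset() cleared when adding the buffer. */
      if (ioend->io_type == XFS_IO_UNWRITTEN)
              set_buffer_unwritten(bh);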
  7. 23 September 2014, 1 commit
    • xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly · 0d085a52
      Committed by Dave Chinner
      XFS has been having trouble with stray delayed allocation extents
      beyond EOF for a long time. Recent changes to the collapse range
      code have triggered erroneous EBUSY errors on page invalidation for
      block size smaller than page size filesystems. These
      have been caused by dirty buffers beyond EOF on a partial page which
      do not get written to disk during a sync.
      
      The issue is that write-ahead in xfs_cluster_write() finds such a
      partial page and handles it by leaving the page dirty but pushing it
      into a writeback state. This used to work just fine, as the
      write_cache_pages() code would then find the dirty partial page in
      the next mapping tree lookup as the dirty tag is still set.
      
      Unfortunately, when we moved to a mark and sweep approach to
      writeback to fix other writeback sync issues, we broke this. The
      act of marking the page as under writeback now clears the TOWRITE
      tag in the radix tree, even though the page is still dirty. Hence
      the next lookup on the mapping tree does not find the dirty partial
      page and so doesn't try to write it again.
      
      This same writeback bug was found recently in ext4 and fixed in
      commit 1c8349a1 ("ext4: fix data integrity sync in ordered mode")
      without communication to the wider filesystem community. We can use
      exactly the same fix here so the TOWRITE flag is not cleared on
      partial page writes.
      
      cc: stable@vger.kernel.org # dependent on 1c8349a1
      Root-cause-found-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
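
      A sketch of the writeback marking after the fix, assuming the
      set_page_writeback_keepwrite() helper introduced by 1c8349a1; the
      surrounding XFS logic is paraphrased, not quoted:

      if (clear_dirty) {
              clear_page_dirty_for_io(page);
              set_page_writeback(page);
      } else {
              /* Keep the TOWRITE tag so sync revisits the still-dirty partial page. */
              set_page_writeback_keepwrite(page);
      }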
  8. 02 September 2014, 1 commit
    • xfs: don't dirty buffers beyond EOF · 22e757a4
      Committed by Dave Chinner
      generic/263 is failing fsx at this point with a page spanning
      EOF that cannot be invalidated. The operations are:
      
      1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
      1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
      1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
      
      where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
      write attempts to invalidate the cached page over this range, it
      fails with -EBUSY and so any attempt to do page invalidation fails.
      
      The real question is this: Why can't that page be invalidated after
      it has been written to disk and cleaned?
      
      Well, there's data on the first two buffers in the page (1k block
      size, 4k page), but the third buffer on the page (i.e. beyond EOF)
      is failing drop_buffers because its bh->b_state == 0x3, which is
      BH_Uptodate | BH_Dirty.  IOWs, there are dirty buffers beyond EOF. Say
      what?
      
      OK, set_buffer_dirty() is called on all buffers from
      __set_page_dirty_buffers(), regardless of whether the buffer is
      beyond EOF or not, which means that when we get to ->writepage,
      we have buffers marked dirty beyond EOF that we need to clean.
      So, we need to implement our own .set_page_dirty method that
      doesn't dirty buffers beyond EOF.
      
      This is messy because the buffer code is not meant to be shared
      and it has interesting locking issues on the buffer dirty bits.
      So just copy and paste it and then modify it to suit what we need.
      
      Note: the solution the other filesystems and generic block code use,
      marking the buffers clean in ->writepage, does not work for XFS.
      It still leaves dirty buffers beyond EOF and invalidations still
      fail. Hence rather than play whack-a-mole, this patch simply
      prevents those buffers from being dirtied in the first place.
      
      cc: <stable@kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
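
      A condensed, hedged sketch of the buffer walk in the copied
      .set_page_dirty method, showing only the EOF check that the stock
      buffer.c code lacks (variable setup simplified):

      loff_t offset = (loff_t)page->index << PAGE_CACHE_SHIFT;
      loff_t end_offset = min_t(loff_t, offset + PAGE_CACHE_SIZE,
                                i_size_read(inode));
      struct buffer_head *head = page_buffers(page), *bh = head;

      do {
              if (offset < end_offset)
                      set_buffer_dirty(bh);   /* only dirty buffers inside EOF */
              bh = bh->b_this_page;
              offset += 1 << inode->i_blkbits;
      } while (bh != head);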
  9. 25 June 2014, 1 commit
    • xfs: global error sign conversion · 2451337d
      Committed by Dave Chinner
      Convert all the errors in the core XFS code to negative error values,
      like the rest of the kernel, and remove all the sign conversion we
      do in the interface layers.
      
      Errors for conversion (and comparison) found via searches like:
      
      $ git grep " E" fs/xfs
      $ git grep "return E" fs/xfs
      $ git grep " E[A-Z].*;$" fs/xfs
      
      Negation points found via searches like:
      
      $ git grep "= -[a-z,A-Z]" fs/xfs
      $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
      $ git grep " -[a-z].*;" fs/xfs
      
      [ with some bits I missed from Brian Foster ]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
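
      An illustrative before/after of the sign convention (XFS_ERROR() was the
      old positive-error wrapper):

      return XFS_ERROR(EINVAL);       /* before: positive, negated at the interface layer */
      return -EINVAL;                 /* after: negative, like the rest of the kernel */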
  10. 22 June 2014, 1 commit
  11. 06 June 2014, 1 commit
  12. 20 May 2014, 1 commit
    • xfs: fix infinite loop at xfs_vm_writepage on 32bit system · 8695d27e
      Committed by Jie Liu
      Writing to a file at an offset greater than 16TB on a 32-bit system and
      then triggering page write-back via sync(1) will cause the task to hang.
      
      # block_size=4096
      # offset=$(((2**32 - 1) * $block_size))
      # xfs_io -f -c "pwrite $offset $block_size" /storage/test_file
      # sync
      
      INFO: task sync:2590 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      sync            D c1064a28     0  2590   2097 0x00000000
      .....
      Call Trace:
      [<c1064a28>] ? ttwu_do_wakeup+0x18/0x130
      [<c1066d0e>] ? try_to_wake_up+0x1ce/0x220
      [<c1066dbf>] ? wake_up_process+0x1f/0x40
      [<c104fc2e>] ? wake_up_worker+0x1e/0x30
      [<c15b6083>] schedule+0x23/0x60
      [<c15b3c2d>] schedule_timeout+0x18d/0x1f0
      [<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90
      [<c10515f1>] ? __queue_delayed_work+0x91/0x150
      [<c12a12ef>] ? do_raw_spin_lock+0x3f/0x100
      [<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90
      [<c15b5b5d>] wait_for_completion+0x7d/0xc0
      [<c1066d60>] ? try_to_wake_up+0x220/0x220
      [<c116a4d2>] sync_inodes_sb+0x92/0x180
      [<c116fb05>] sync_inodes_one_sb+0x15/0x20
      [<c114a8f8>] iterate_supers+0xb8/0xc0
      [<c116faf0>] ? fdatawrite_one_bdev+0x20/0x20
      [<c116fc21>] sys_sync+0x31/0x80
      [<c15be18d>] sysenter_do_call+0x12/0x28
      
      This issue can be triggered via xfstests/generic/308.
      
      The reason is that end_index is an unsigned long with maximum value
      '2^32-1=4294967295' on a 32-bit platform, and the given offset causes it
      to wrap to 0, so the following code repeats again and again and the
      task never completes:
      
      end_index = offset >> PAGE_CACHE_SHIFT;
      last_index = (offset - 1) >> PAGE_CACHE_SHIFT;
      if (page->index >= end_index) {
      	unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1);
              /*
               * Just skip the page if it is fully outside i_size, e.g. due
               * to a truncate operation that is in progress.
               */
              if (page->index >= end_index + 1 || offset_into_page == 0) {
      	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      		unlock_page(page);
      		return 0;
      	}
      
      In order to check if a page is fully outside i_size or not, we can fix
      the code logic as below:
      	if (page->index > end_index ||
      	    (page->index == end_index && offset_into_page == 0))
      
      Secondly, there is another similar issue when calculating the
      end offset for mapping the filesystem blocks to the file blocks for
      delalloc.  With the same test as above, running umount(8) will cause a
      kernel panic if CONFIG_XFS_DEBUG is enabled:
      
      XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || \
      	ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 964
      
      kernel BUG at fs/xfs/xfs_message.c:108!
      invalid opcode: 0000 [#1] SMP
      task: edddc100 ti: ec6ee000 task.ti: ec6ee000
      EIP: 0060:[<f83d87cb>] EFLAGS: 00010296 CPU: 1
      EIP is at assfail+0x2b/0x30 [xfs]
      ..............
      Call Trace:
      [<f83d9cd4>] xfs_fs_destroy_inode+0x74/0x120 [xfs]
      [<c115ddf1>] destroy_inode+0x31/0x50
      [<c115deff>] evict+0xef/0x170
      [<c115dfb2>] dispose_list+0x32/0x40
      [<c115ea3a>] evict_inodes+0xca/0xe0
      [<c1149706>] generic_shutdown_super+0x46/0xd0
      [<c11497b9>] kill_block_super+0x29/0x70
      [<c1149a14>] deactivate_locked_super+0x44/0x70
      [<c114a427>] deactivate_super+0x47/0x60
      [<c1161c3d>] mntput_no_expire+0xcd/0x120
      [<c1162ae8>] SyS_umount+0xa8/0x370
      [<c1162dce>] SyS_oldumount+0x1e/0x20
      [<c15be18d>] sysenter_do_call+0x12/0x28
      
      That is because end_offset evaluates to 0 for the same reason as
      above, hence the mapping and conversion of delalloc file blocks to
      filesystem blocks never happened.
      
      This patch fixes both issues.
      Reported-by: Michael L. Semon <mlsemon35@gmail.com>
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  13. 07 May 2014, 3 commits
  14. 17 April 2014, 1 commit
    • xfs: don't map ranges that span EOF for direct IO · 0e1f789d
      Committed by Dave Chinner
      Al Viro tracked down the problem that has caused generic/263 to fail
      on XFS since the test was introduced. It is caused by
      xfs_get_blocks() mapping a single extent that spans EOF without
      marking it as buffer_new(), so the direct IO code does not zero
      the tail of the block at the new EOF. This is a long standing bug
      that has been around for many, many years.
      
      Because xfs_get_blocks() starts the map before EOF, it can't set
      buffer_new(), because that causes the direct IO code to also zero
      unaligned sectors at the head of the IO. This would overwrite valid
      data with zeros, and hence we cannot validly return a single extent
      that spans EOF to direct IO.
      
      Fix this by detecting a mapping that spans EOF and truncating it down
      to EOF. This results in the direct IO code doing the right thing
      for unaligned data blocks before EOF, and then returning to get
      another mapping for the region beyond EOF, which XFS treats correctly
      by setting buffer_new() on it. This makes direct IO behave correctly
      w.r.t. tail block zeroing beyond EOF, and fsx is happy about that.
      
      Again, thanks to Al Viro for finding what I couldn't.
      
      [ dchinner: Fix for __divdi3 build error:
      Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      ]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Tested-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
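
      A hedged sketch of the trim described above (variable names illustrative):

      /*
       * A mapping that starts inside EOF but would extend past it is cut back
       * to EOF; the region beyond EOF then gets its own mapping, which can
       * safely be marked buffer_new().
       */
      if (offset < i_size_read(inode) &&
          offset + mapping_size > i_size_read(inode))
              mapping_size = i_size_read(inode) - offset;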
  15. 14 April 2014, 3 commits
    • xfs: xfs_vm_write_end truncates too much on failure · aad3f375
      Committed by Dave Chinner
      Similar to the write_begin problem, xfs_vm_write_end() will truncate
      back to the old EOF, potentially removing page cache from over the
      top of delalloc blocks with valid data in them. Fix this by
      truncating back to just the start of the failed write.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Tested-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: write failure beyond EOF truncates too much data · 72ab70a1
      Committed by Dave Chinner
      If we fail a write beyond EOF and have to handle it in
      xfs_vm_write_begin(), we truncate the inode back to the current inode
      size. This doesn't take into account the fact that we may have
      already made successful writes to the same page (in the case of block
      size < page size) and hence we can truncate the page cache away from
      blocks with valid data in them. If these blocks are delayed
      allocation blocks, we now have a mismatch between the page cache and
      the extent tree, and this will trigger - at minimum - a delayed
      block count mismatch assert when the inode is evicted from the cache.
      We can also trip over it when block mapping for direct IO - this is
      the most common symptom seen from fsx and fsstress when run from
      xfstests.
      
      Fix it by only truncating away the exact range we are updating state
      for in this write_begin call.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Tested-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
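
      A hedged sketch of the failure handling after this fix;
      truncate_pagecache_range() is the generic helper, the surrounding
      variables are assumed from context:

      if (pos + len > isize) {
              /* Punch out only what this failed write added beyond isize. */
              loff_t start = max_t(loff_t, pos, isize);

              truncate_pagecache_range(inode, start, pos + len);
      }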
    • xfs: kill buffers over failed write ranges properly · 4ab9ed57
      Committed by Dave Chinner
      When a write fails, if we don't clear the delalloc flags from the
      buffers over the failed range, they can persist beyond EOF and cause
      problems. Writeback will see the pages in the page cache, see they
      are dirty and continually retry the write, assuming that the page
      beyond EOF is just racing with a truncate. The page will eventually
      be released due to some other operation (e.g. direct IO), and it
      will not pass through invalidation because it is dirty. Hence it
      will be released with buffer_delay set on it, and trigger warnings
      in xfs_vm_releasepage() and an assert failure in xfs_file_aio_write_direct()
      because invalidation failed and we didn't write the correct amount.
      
      This causes failures on block size < page size filesystems in fsx
      and fsstress workloads run by xfstests.
      
      Fix it by completely trashing any state on the buffer that could be
      used to imply that it contains valid data when the delalloc range
      over the buffer is punched out during the failed write handling.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Tested-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  16. 04 April 2014, 1 commit
  17. 07 March 2014, 1 commit
    • xfs: xfs_check_page_type buffer checks need help · a49935f2
      Committed by Dave Chinner
      xfs_aops_discard_page() was introduced in the following commit:
      
        xfs: truncate delalloc extents when IO fails in writeback
      
      ... to clean up left over delalloc ranges after I/O failure in
      ->writepage(). generic/224 tests for this scenario and occasionally
      reproduces panics on sub-4k blocksize filesystems.
      
      The cause of this is failure to clean up the delalloc range on a
      page where the first buffer does not match one of the expected
      states of xfs_check_page_type(). If a buffer is not unwritten,
      delayed or dirty&mapped, xfs_check_page_type() stops and
      immediately returns 0.
      
      The stress test of generic/224 creates a scenario where the first
      several buffers of a page with delayed buffers are mapped & uptodate
      and some subsequent buffer is delayed. If the ->writepage() happens
      to fail for this page, xfs_aops_discard_page() incorrectly skips
      the entire page.
      
      This then causes later failures either when direct IO maps the range
      and finds the stale delayed buffer, or we evict the inode and find
      that the inode still has a delayed block reservation accounted to
      it.
      
      We can easily fix this xfs_aops_discard_page() failure by making
      xfs_check_page_type() check all buffers, but this breaks
      xfs_convert_page() more than it is already broken. Indeed,
      xfs_convert_page() wants xfs_check_page_type() to tell it if the
      first buffers on the pages are of a type that can be aggregated into
      the contiguous IO that is already being built.
      
      xfs_convert_page() should not be writing random buffers out of a
      page, but the current behaviour will cause it to do so if there are
      buffers that don't match the current specification on the page.
      Hence for xfs_convert_page() we need to:
      
          a) return "not ok" if the first buffer on the page does not
          match the specification provided so we don't write anything;
          and
          b) abort its buffer-add-to-IO loop the moment we come
          across a buffer that does not match the specification.
      
      Hence we need to fix both xfs_check_page_type() and
      xfs_convert_page() to work correctly with pages that have mixed
      buffer types, whilst allowing xfs_aops_discard_page() to scan all
      buffers on the page for a type match.
      Reported-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  18. 10 February 2014, 1 commit
  19. 19 December 2013, 1 commit
  20. 24 November 2013, 1 commit
    • block: Abstract out bvec iterator · 4f024f37
      Committed by Kent Overstreet
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>
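
      For reference, the rename moves the bio position fields under bi_iter,
      e.g.:

      sector = bio->bi_iter.bi_sector;        /* was bio->bi_sector */
      size   = bio->bi_iter.bi_size;          /* was bio->bi_size   */
      idx    = bio->bi_iter.bi_idx;           /* was bio->bi_idx    */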
  21. 31 October 2013, 1 commit
    • xfs: prevent stack overflows from page cache allocation · ad22c7a0
      Committed by Dave Chinner
      Page cache allocation doesn't always go through ->write_begin and
      hence we don't always get the opportunity to set the allocation
      context to GFP_NOFS. Failing to do this means we open up the direct
      reclaim stack to recurse into the filesystem and consume a
      significant amount of stack.
      
      On RHEL6.4 kernels we are seeing ra_submit() and
      generic_file_splice_read() from an nfsd context recursing into the
      filesystem via the inode cache shrinker and evicting inodes. This is
      causing truncation to be run (e.g. EOF block freeing) and causing
      bmap btree block merges and free space btree block splits to occur.
      These btree manipulations are occurring with the call chain already
      30 functions deep and hence there is not enough stack space to
      complete such operations.
      
      To avoid these specific overruns, we need to prevent the page cache
      allocation from recursing via direct reclaim. We can do that because
      the allocation functions take the allocation context from that which
      is stored in the mapping for the inode. We don't set that right now,
      so the default is GFP_HIGHUSER_MOVABLE, which is effectively a
      GFP_KERNEL context. We need it to be the equivalent of GFP_NOFS, so
      when we initialise an inode, set the mapping gfp mask appropriately.
      
      This makes the use of AOP_FLAG_NOFS redundant from other parts of
      the XFS IO path, so get rid of it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
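
      A sketch of the inode-init change, using the generic mapping GFP
      helpers (the exact call site is assumed):

      /* Page cache allocations for this inode now behave like GFP_NOFS. */
      gfp_t gfp_mask = mapping_gfp_mask(inode->i_mapping);

      mapping_set_gfp_mask(inode->i_mapping, gfp_mask & ~__GFP_FS);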
  22. 24 October 2013, 3 commits
    • xfs: decouple inode and bmap btree header files · a4fbe6ab
      Committed by Dave Chinner
      Currently the xfs_inode.h header has a dependency on the definition
      of the BMAP btree records as the inode fork includes an array of
      xfs_bmbt_rec_host_t objects in its definition.
      
      Move all the btree format definitions from xfs_btree.h,
      xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
      xfs_format.h to continue the process of centralising the on-disk
      format definitions. With this done, the xfs inode definitions are no
      longer dependent on btree header files.
      
      This enables a massive culling of unnecessary includes, with close to
      200 #include directives removed from the XFS kernel code base.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: decouple log and transaction headers · 239880ef
      Committed by Dave Chinner
      xfs_trans.h has a dependency on xfs_log.h for a couple of
      structures. Most code that does transactions doesn't need to know
      anything about the log, but this dependency means that they have to
      include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header
      files and clean up the includes to be in dependency order.
      
      In doing this, remove the direct include of xfs_trans_reserve.h from
      xfs_trans.h so that we remove the dependency between xfs_trans.h and
      xfs_mount.h. Hence the xfs_trans.h include can be moved to
      indicate the actual dependencies other header files have on it.
      
      Note that these are kernel only header files, so this does not
      translate to any userspace changes at all.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: create a shared header file for format-related information · 70a9883c
      Committed by Dave Chinner
      All of the buffer operations structures need to be exported
      for xfs_db, so move them all to a common location rather than
      spreading them all over the place. They verify the on-disk
      format, so while xfs_format.h might seem a good place, they are
      not themselves part of the on-disk format.
      
      Hence we need to create a new header file in which we centralise
      these related definitions. Start by moving the buffer operations
      structures, and then also move all the other definitions that have
      crept into xfs_log_format.h and xfs_format.h as there was no other
      shared header file to put them in.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  23. 02 October 2013, 1 commit
  24. 13 September 2013, 1 commit
  25. 04 September 2013, 1 commit
    • direct-io: Implement generic deferred AIO completions · 7b7a8665
      Committed by Christoph Hellwig
      Add support to the core direct-io code to defer AIO completions to user
      context using a workqueue.  This replaces opencoded and less efficient
      code in XFS and ext4 (we save a memory allocation for each direct IO)
      and will be needed to properly support O_(D)SYNC for AIO.
      
      The communication between the filesystem and the direct I/O code requires
      a new buffer head flag, which is a bit ugly but not avoidable until the
      direct I/O code stops abusing the buffer_head structure for communicating
      with the filesystems.
      
      Currently this creates a per-superblock unbound workqueue for these
      completions, which is taken from an earlier patch by Jan Kara.  I'm
      not really convinced about this use and would prefer a "normal" global
      workqueue with a high concurrency limit, but this needs further discussion.
      
      JK: Fixed ext4 part, dynamic allocation of the workqueue.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
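
      A hedged sketch of the per-superblock completion workqueue this adds;
      the queue name and flags are assumptions about the implementation:

      /* Lazily create the queue that deferred DIO completions run on. */
      sb->s_dio_done_wq = alloc_workqueue("dio/%s", WQ_MEM_RECLAIM, 0,
                                          sb->s_id);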