1. 17 Dec 2009, 2 commits
    • dio: fix use-after-free · 06777d30
      Authored by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • cleanup blockdev_direct_IO locking · 1e431f5c
      Authored by Christoph Hellwig
      Currently the locking in blockdev_direct_IO is a mess: we have three
      different locking types and very confusing checks for some of them.  The
      most complicated one is DIO_OWN_LOCKING for reads, which happens to not
      actually be used.
      
      This patch gets rid of DIO_OWN_LOCKING - as mentioned above, the read case
      is unused anyway, and the write side is almost identical to DIO_NO_LOCKING.
      The difference is that DIO_NO_LOCKING always sets the create argument for
      the get_blocks callback to zero, but we can easily move that into the
      actual get_blocks callbacks.  There are four users of the DIO_NO_LOCKING
      mode: gfs already ignores the create argument and thus is fine with the
      new version; ocfs2 only errors out if create was ever set, and we can
      remove this dead code now; the block device code only ever uses create for
      an error message if we are fully beyond the device, which can never
      happen; and last but not least XFS will need the new behaviour for writes.
      
      Now we can replace the lock_type variable with a flags one, where no flags
      means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
      flag.  Separate out the check that disallows filling holes into its own
      flag, although for now both flags always get set at the same time.
      
      Also revamp the documentation of the locking scheme to actually make sense.
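      
      For reference, a minimal sketch of the resulting interface as implied by
      this description (a sketch, not necessarily the exact final mainline
      code):
      
      enum {
              /* need locking between buffered and direct access */
              DIO_LOCKING     = 0x01,
      
              /* filesystem does not support filling holes */
              DIO_SKIP_HOLES  = 0x02,
      };
      
      /* old-style locked callers simply pass both flags */
      static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
                      struct inode *inode, struct block_device *bdev,
                      const struct iovec *iov, loff_t offset,
                      unsigned long nr_segs, get_block_t get_block,
                      dio_iodone_t end_io)
      {
              return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
                              nr_segs, get_block, end_io,
                              DIO_LOCKING | DIO_SKIP_HOLES);
      }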
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  2. 16 Dec 2009, 2 commits
    • direct-io: cleanup blockdev_direct_IO locking · 5fe878ae
      Authored by Christoph Hellwig
      Currently the locking in blockdev_direct_IO is a mess: we have three
      different locking types and very confusing checks for some of them.  The
      most complicated one is DIO_OWN_LOCKING for reads, which happens to not
      actually be used.
      
      This patch gets rid of DIO_OWN_LOCKING - as mentioned above, the read
      case is unused anyway, and the write side is almost identical to
      DIO_NO_LOCKING.  The difference is that DIO_NO_LOCKING always sets the
      create argument for the get_blocks callback to zero, but we can easily
      move that into the actual get_blocks callbacks.  There are four users of
      the DIO_NO_LOCKING mode: gfs already ignores the create argument and thus
      is fine with the new version; ocfs2 only errors out if create was ever
      set, and we can remove this dead code now; the block device code only
      ever uses create for an error message if we are fully beyond the device,
      which can never happen; and last but not least XFS will need the new
      behaviour for writes.
      
      Now we can replace the lock_type variable with a flags one, where no flags
      means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
      flag.  Separate out the check that disallows filling holes into its own
      flag, although for now both flags always get set at the same time.
      
      Also revamp the documentation of the locking scheme to actually make
      sense.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dio: don't zero out the pages array inside struct dio · 23aee091
      Authored by Jeff Moyer
      Intel reported a performance regression caused by the following commit:
      
      commit 848c4dd5
      Author: Zach Brown <zach.brown@oracle.com>
      Date:   Mon Aug 20 17:12:01 2007 -0700
      
          dio: zero struct dio with kzalloc instead of manually
      
          This patch uses kzalloc to zero all of struct dio rather than
          manually trying to track which fields we rely on being zero.  It
          passed aio+dio stress testing and some bug regression testing on
          ext3.
      
          This patch was introduced by Linus in the conversation that led up
          to Badari's minimal fix to manually zero .map_bh.b_state in commit:
      
            6a648fa7
      
          It makes the code a bit smaller.  Maybe a couple fewer cachelines to
          load, if we're lucky:
      
             text    data     bss     dec     hex filename
          3285925  568506 1304616 5159047  4eb887 vmlinux
          3285797  568506 1304616 5158919  4eb807 vmlinux.patched
      
          I was unable to measure a stable difference in the number of cpu
          cycles spent in blockdev_direct_IO() when pushing aio+dio 256K reads
          at ~340MB/s.
      
          So the resulting intent of the patch isn't a performance gain but to
          avoid exposing ourselves to the risk of finding another field like
          .map_bh.b_state where we rely on zeroing but don't enforce it in the
          code.
      
      Zach surmised that zeroing out the page array was what caused most of
      the problem, and suggested the approach taken in the attached patch for
      resolving the issue.  Intel re-tested with this patch and saw a 0.6%
      performance gain (the original regression was 0.5%).
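      
      The approach, roughly (a sketch based on this description; the exact
      mainline change may differ in detail): keep the pages array at the end of
      struct dio and zero only the fields in front of it, instead of the whole
      allocation:
      
      /* instead of: dio = kzalloc(sizeof(*dio), GFP_KERNEL); */
      dio = kmalloc(sizeof(*dio), GFP_KERNEL);
      if (!dio)
              return -ENOMEM;
      /*
       * pages[] sits at the end of struct dio and is fully initialized
       * before it is read, so only the leading fields need zeroing.
       */
      memset(dio, 0, offsetof(struct dio, pages));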
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 26 Nov 2009, 1 commit
    • Fix regression in direct writes performance due to WRITE_ODIRECT flag removal · d9449ce3
      Authored by Vivek Goyal
      There seems to be a regression in the direct write path due to the
      following commit in the for-2.6.33 branch of the block tree.
      
      commit 1af60fbd
      Author: Jeff Moyer <jmoyer@redhat.com>
      Date:   Fri Oct 2 18:56:53 2009 -0400
      
          block: get rid of the WRITE_ODIRECT flag
      
      Marking direct writes as WRITE_SYNC_PLUG instead of WRITE_ODIRECT sets
      the NOIDLE flag in the bio and hence in the request. This tells CFQ not
      to expect more requests from the queue and not to idle on it (despite the
      fact that the queue's think time is low and it is not seeky).
      
      So direct writers lose big time when competing with sequential readers.
      
      Using fio, I have run one direct writer and two sequential readers and
      following are the results with 2.6.32-rc7 kernel and with for-2.6.33
      branch.
      
      Test
      ====
      1 direct writer and 2 sequential reader running simultaneously.
      
      [global]
      directory=/mnt/sdc/fio/
      runtime=10
      
      [seqwrite]
      rw=write
      size=4G
      direct=1
      
      [seqread]
      rw=read
      size=2G
      numjobs=2
      
      2.6.32-rc7
      ==========
      direct writes: aggrb=2,968KB/s
      readers:       aggrb=101MB/s
      
      for-2.6.33 branch
      =================
      direct writes: aggrb=19KB/s
      readers:       aggrb=137MB/s
      
      This patch brings back the WRITE_ODIRECT flag, with the difference that we
      don't set the BIO_RW_UNPLUG flag, so the device is not unplugged after
      submission of the request and an explicit unplug from the submitter is
      required.
      
      That way we fix Jeff's issue of not enough merging taking place in the aio
      path, as well as making sure direct writes get their fair share.
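      
      Concretely, the revived flag might look like this (a sketch assuming the
      2.6.32-era BIO_RW_* bits; WRITE_SYNC_PLUG additionally carries
      BIO_RW_NOIDLE, which is exactly what hurt here):
      
      /* sync and plugged, but with neither BIO_RW_NOIDLE nor BIO_RW_UNPLUG */
      #define WRITE_ODIRECT_PLUG      (WRITE | (1 << BIO_RW_SYNCIO))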
      
      After the fix
      =============
      for-2.6.33 + fix
      ----------------
      direct writes: aggrb=2,728KB/s
      readers:       aggrb=103MB/s
      
      Thanks
      Vivek
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  4. 28 Oct 2009, 2 commits
    • aio: implement request batching · cfb1e33e
      Authored by Jeff Moyer
      Hi,
      
      Some workloads issue batches of small I/O, and the performance is poor
      due to the call to blk_run_address_space for every single iocb.  Nathan
      Roberts pointed this out, and suggested that by deferring this call
      until all I/Os in the iocb array are submitted to the block layer, we
      can realize some impressive performance gains (up to 30% for sequential
      4k reads in batches of 16).
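      
      A simplified sketch of the batching idea (the helper name below is
      hypothetical; the mainline implementation tracks which mappings to kick
      per batch):
      
      /*
       * Submit every iocb in the batch without unplugging, then kick the
       * queue once.  aio_run_iocb_no_unplug() stands in for the per-iocb
       * submission path.
       */
      for (i = 0; i < nr; i++)
              ret = aio_run_iocb_no_unplug(iocbs[i]);
      blk_run_address_space(file->f_mapping);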
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: get rid of the WRITE_ODIRECT flag · 1af60fbd
      Authored by Jeff Moyer
      Hi,
      
      The WRITE_ODIRECT flag is only used in one place, and that code path
      happens to also call blk_run_address_space.  The introduction of this
      flag, then, could result in the device being unplugged twice for every
      I/O.
      
      Further, with the batching changes in the next patch, we don't want an
      O_DIRECT write to imply a queue unplug.
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  5. 23 May 2009, 1 commit
  6. 15 Apr 2009, 1 commit
  7. 06 Apr 2009, 1 commit
  8. 07 Jan 2009, 1 commit
    • fs: truncate blocks outside i_size after O_DIRECT write error · 0f64415d
      Authored by Dmitri Monakhov
      In case of error, an extending write may have instantiated a few blocks
      outside i_size.  We need to trim these blocks.  We have to do it
      *regardless* of blocksize.  At least ext2, ext3 and reiserfs interpret an
      (i_size < biggest block) condition as an error.  Fsck will complain about
      the wrong i_size, and will then fix the error by changing i_size
      according to the biggest block.  This is bad because these blocks contain
      garbage from the previous write attempt, and the result is data
      corruption.
      
      ####TESTCASE_BEGIN
      $touch /mnt/test/BIG_FILE
      ## at this moment /mnt/test/BIG_FILE size and blocks equal to zero
      open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
      write(3, "aaaaaaaaaaaa"..., 104857600) = -1 ENOSPC (No space left on device)
      ## size and blocks shouldn't be changed because the write op failed.
      $stat /mnt/test/BIG_FILE
      File: `/mnt/test/BIG_FILE'
      Size: 0 Blocks: 110896 IO Block: 1024 regular empty file
      <<<<<<<<^^^^^^^^^^^^^^^^^^^^^^^^^^^^^file size is less than biggest block idx
      Device: fe07h/65031d Inode: 14 Links: 1
      Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
      Access: 2007-01-24 20:03:38.000000000 +0300
      Modify: 2007-01-24 20:03:38.000000000 +0300
      Change: 2007-01-24 20:03:39.000000000 +0300
      
      #fsck.ext3 -f /dev/VG/test
      e2fsck 1.39 (29-May-2006)
      Pass 1: Checking inodes, blocks, and sizes
      Inode 14, i_size is 0, should be 56556544. Fix<y>? yes
      Pass 2: Checking directory structure
      ....
      #####TESTCASE_END
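      
      A sketch of the fix in the __blockdev_direct_IO error path, per this
      description (assuming the vmtruncate() API of that era; the exact guard
      conditions may differ):
      
      if (unlikely(retval < 0 && (rw & WRITE))) {
              loff_t isize = i_size_read(inode);
      
              /* trim blocks instantiated beyond i_size by the failed write */
              if (end > isize)
                      vmtruncate(inode, isize);
      }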
      
      [akpm@linux-foundation.org: use i_size_read()]
      Signed-off-by: Dmitri Monakhov <dmonakhov@openvz.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 17 Oct 2008, 1 commit
  10. 27 Jul 2008, 1 commit
  11. 06 Feb 2008, 1 commit
    • Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user · eebd2aa3
      Authored by Christoph Lameter
      Simplify page cache zeroing of segments of pages through 3 functions:
      
      zero_user_segments(page, start1, end1, start2, end2)
      
              Zeros two segments of the page. It takes the positions where to
              start and end the zeroing, which avoids length calculations and
              makes the code clearer.
      
      zero_user_segment(page, start, end)
      
              Same for a single segment.
      
      zero_user(page, start, length)
      
              Length variant for the case where we know the length.
      
      We remove the zero_user_page macro. Issues:
      
      1. It's a macro. Inline functions are preferable.
      
      2. The KM_USER0 macro is only defined for HIGHMEM.
      
         Having to treat this special case everywhere makes the
         code needlessly complex. The parameter for zeroing is always
         KM_USER0 except in one single case that we open code.
      
      Avoiding KM_USER0 means a lot of code no longer has to deal with the
      special casing for HIGHMEM. Dealing with kmap is only necessary for
      HIGHMEM configurations. In those configurations we use KM_USER0 like we
      do for a series of other functions defined in highmem.h.
      
      Since KM_USER0 depends on HIGHMEM, the existing zero_user_page could not
      be an inline function. The zero_user_* functions introduced here can be
      inline because that constant is not used when these functions are called.
      
      Also extract the flushing of the caches to be outside of the kmap.
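      
      A sketch of the new helpers as described (close to, though not
      necessarily identical to, the mainline highmem.h versions):
      
      static inline void zero_user_segments(struct page *page,
              unsigned start1, unsigned end1,
              unsigned start2, unsigned end2)
      {
              void *kaddr = kmap_atomic(page, KM_USER0);
      
              BUG_ON(end1 > PAGE_SIZE || end2 > PAGE_SIZE);
      
              if (end1 > start1)
                      memset(kaddr + start1, 0, end1 - start1);
              if (end2 > start2)
                      memset(kaddr + start2, 0, end2 - start2);
      
              kunmap_atomic(kaddr, KM_USER0);
              /* flushing is done outside the kmap, per the note above */
              flush_dcache_page(page);
      }
      
      static inline void zero_user_segment(struct page *page,
              unsigned start, unsigned end)
      {
              zero_user_segments(page, start, end, 0, 0);
      }
      
      static inline void zero_user(struct page *page,
              unsigned start, unsigned size)
      {
              zero_user_segment(page, start, start + size);
      }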
      
      [akpm@linux-foundation.org: fix nfs and ntfs build]
      [akpm@linux-foundation.org: fix ntfs build some more]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Cc: David Chinner <dgc@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 17 Oct 2007, 1 commit
    • remove ZERO_PAGE · 557ed1fa
      Authored by Nick Piggin
      The commit b5810039 contains the note
      
        A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
        (and thus mapcounted and counted towards shared rss).  These writes to
        the struct page could cause excessive cacheline bouncing on big
        systems.  There are a number of ways this could be addressed if it is
        an issue.
      
      And indeed this cacheline bouncing has shown up on large SGI systems.
      There was a situation where an Altix system was essentially livelocked
      tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
      This situation can be avoided in userspace, but it does highlight the
      potential scalability problem with refcounting ZERO_PAGE, and corner
      cases where it can really hurt (we don't want the system to livelock!).
      
      There are several broad ways to fix this problem:
      1. add back some special casing to avoid refcounting ZERO_PAGE
      2. per-node or per-cpu ZERO_PAGES
      3. remove the ZERO_PAGE completely
      
      I will argue for 3. The others should also fix the problem, but they
      result in more complex code than does 3, with little or no real benefit
      that I can see.
      
      Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
      false optimisation: if an application is performance critical, it would
      not be doing many read faults of new memory, or at least it could be
      expected to write to that memory soon afterwards. If cache or memory use
      is critical, it should not be working with a significant number of
      ZERO_PAGEs anyway (a more compact representation of zeroes should be
      used).
      
      As a sanity check -- measuring on my desktop system, there are never many
      mappings to the ZERO_PAGE (e.g. 2 or 3), thus memory usage here should not
      increase much without it.
      
      When running a make -j4 kernel compile on my dual core system, there are
      about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
      ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
      is torn down without being COWed). So removing ZERO_PAGE will save 1,000
      page faults per second when running kbuild, while keeping it only saves
      less than 1 page clearing operation per second. 1 page clear is cheaper
      than a thousand faults, presumably, so there isn't an obvious loss.
      
      Neither the logical argument nor these basic tests give a guarantee of no
      regressions. However, this is a reasonable opportunity to try to remove
      the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
      we can reintroduce it and just avoid refcounting it.
      
      The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
      much use to them except on benchmarks.  All other users of ZERO_PAGE are
      converted just to use ZERO_PAGE(0) for simplicity. We can look at
      replacing them all and maybe ripping out ZERO_PAGE completely when we are
      more satisfied with this solution.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
  13. 10 Oct 2007, 1 commit
  14. 21 Aug 2007, 1 commit
    • dio: zero struct dio with kzalloc instead of manually · 848c4dd5
      Authored by Zach Brown
      This patch uses kzalloc to zero all of struct dio rather than manually
      trying to track which fields we rely on being zero.  It passed aio+dio
      stress testing and some bug regression testing on ext3.
      
      This patch was introduced by Linus in the conversation that led up to
      Badari's minimal fix to manually zero .map_bh.b_state in commit:
      
        6a648fa7
      
      It makes the code a bit smaller.  Maybe a couple fewer cachelines to
      load, if we're lucky:
      
         text    data     bss     dec     hex filename
      3285925  568506 1304616 5159047  4eb887 vmlinux
      3285797  568506 1304616 5158919  4eb807 vmlinux.patched
      
      I was unable to measure a stable difference in the number of cpu cycles
      spent in blockdev_direct_IO() when pushing aio+dio 256K reads at
      ~340MB/s.
      
      So the resulting intent of the patch isn't a performance gain but to
      avoid exposing ourselves to the risk of finding another field like
      .map_bh.b_state where we rely on zeroing but don't enforce it in the
      code.
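      
      The change itself is essentially a one-liner in the dio allocation path
      (a sketch; the ad-hoc zeroing of individual fields is dropped alongside):
      
      /* before: kmalloc() followed by zeroing selected fields by hand */
      dio = kzalloc(sizeof(*dio), GFP_KERNEL);
      if (!dio)
              return -ENOMEM;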
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 12 Aug 2007, 1 commit
  16. 04 Jul 2007, 1 commit
    • dio: remove bogus refcounting BUG_ON · fcb82f88
      Authored by Zach Brown
      Badari Pulavarty reported a case of this BUG_ON triggering during
      testing.  It's completely bogus and should be removed.
      
      It's trying to notice if we left references to the dio hanging around in
      the sync case.  They should have been dropped as IO completed while this
      path was in dio_await_completion().  This condition will also be
      checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a few
      lines lower.  So to start with, this BUG_ON() is redundant.
      
      More fatally, it's dereferencing the dio after having dropped its
      reference.  It's only safe to dereference the dio after releasing the
      lock if the final reference was just dropped.  Another CPU might free
      the dio in bio completion and reuse the memory after this path drops the
      dio lock but before the BUG_ON() is evaluated.
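      
      The racy pattern, roughly (a sketch of the sync-completion path described
      here; field names follow the dio refcounting code):
      
      spin_lock_irqsave(&dio->bio_lock, flags);
      ret2 = --dio->refcount;
      spin_unlock_irqrestore(&dio->bio_lock, flags);
      /*
       * BUGGY: if ret2 != 0 another path still holds a reference and may
       * free the dio before this dereference of dio->is_async happens.
       */
      BUG_ON(!dio->is_async && ret2 != 0);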
      
      This patch passed aio+dio regression unit tests and aio-stress on ext3.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 10 May 2007, 1 commit
    • fs: convert core functions to zero_user_page · 01f2705d
      Authored by Nate Diller
      It's very common for file systems to need to zero part or all of a page;
      the simplest way is just to use kmap_atomic() and memset().  There's
      actually a library function in include/linux/highmem.h that does exactly
      that, but it's confusingly named memclear_highpage_flush(), which is
      descriptive of *how* it does the work rather than what the *purpose* is.
      So this patchset renames the function to zero_user_page(), and calls it
      from the various places that currently open code it.
      
      This first patch introduces the new function call, and converts all the
      core kernel callsites, both the open-coded ones and the old
      memclear_highpage_flush() ones.  Following this patch is a series of
      conversions for each file system individually, per AKPM, and finally a
      patch deprecating the old call.  The diffstat below shows the entire
      patchset.
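      
      For reference, the renamed helper looks roughly like this (a sketch; the
      km_type parameter matches the later description of its KM_USER0
      argument):
      
      static inline void zero_user_page(struct page *page, unsigned offset,
                                        unsigned size, int km_type)
      {
              void *kaddr = kmap_atomic(page, km_type);
              memset(kaddr + offset, 0, size);
              flush_dcache_page(page);
              kunmap_atomic(kaddr, km_type);
      }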
      
      [akpm@linux-foundation.org: fix a few things]
      Signed-off-by: Nate Diller <nate.diller@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 09 May 2007, 1 commit
  19. 11 Dec 2006, 7 commits
    • [PATCH] dio: lock refcount operations · 5eb6c7a2
      Authored by Zach Brown
      The wait_for_more_bios() function name was poorly chosen.  While looking to
      clean it up I noticed that the dio struct refcounting between the bio
      completion and dio submission paths was racy.
      
      The bio submission path was simply freeing the dio struct if
      atomic_dec_and_test() indicated that it dropped the final reference.
      
      The aio bio completion path was dereferencing its dio struct pointer *after
      dropping its reference* based on the remaining number of references.
      
      These two paths could race and result in the aio bio completion path
      dereferencing a freed dio, though this was not observed in the wild.
      
      This moves the refcount under the bio lock so that bio completion can drop
      its reference and decide to wake all in one atomic step.
      
      Once testing and waking are done under the lock, dio_await_one() can test
      its sleeping condition and mark itself uninterruptible in one step.  It
      gets simpler, and wait_for_more_bios() disappears.
      
      The addition of the interrupt-masking spin lock acquisition in
      dio_bio_submit() looks alarming.  This lock acquisition existed in that
      path before the recent dio completion patch set.  We shouldn't expect a
      significant performance regression from returning to the behaviour that
      existed before the completion cleanup work.
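      
      The completion side then drops its reference and decides whether to wake
      the waiter in one atomic step (a sketch using the dio->bio_lock and
      dio->waiter fields this describes):
      
      /* bio completion: drop our ref and test for a waiter under one lock */
      spin_lock_irqsave(&dio->bio_lock, flags);
      remaining = --dio->refcount;
      /* the waiter holds one reference; wake it once only that one is left */
      if (remaining == 1 && dio->waiter)
              wake_up_process(dio->waiter);
      spin_unlock_irqrestore(&dio->bio_lock, flags);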
      
      This passed 4k block ext3 O_DIRECT fsx and aio-stress on an SMP machine.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] dio: only call aio_complete() after returning -EIOCBQUEUED · 8459d86a
      Authored by Zach Brown
      The only time it is safe to call aio_complete() is when the ->ki_retry
      function returns -EIOCBQUEUED to the AIO core.  direct_io_worker() has
      historically done this by relying on its caller to translate positive return
      codes into -EIOCBQUEUED for the aio case.  It did this by trying to keep
      conditionals in sync.  direct_io_worker() knew when finished_one_bio() was
      going to call aio_complete().  It would reverse the test and wait and free the
      dio in the cases it thought that finished_one_bio() wasn't going to.
      
      Not surprisingly, it ended up getting it wrong.  'ret' could be a negative
      errno from the submission path, but it failed to communicate this to
      finished_one_bio().  direct_io_worker() would return < 0, its callers
      wouldn't raise -EIOCBQUEUED, and aio_complete() would be called.  Further
      on, finished_one_bio()'s tests wouldn't reflect this, and aio_complete()
      would be called a second time, which can manifest as an oops.
      
      The previous cleanups have whittled the sync and async completion paths down
      to the point where we can collapse them and clearly reassert the invariant
      that we must only call aio_complete() after returning -EIOCBQUEUED.
      direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
      drop the dio refcount and the aio bio completion path will only call
      aio_complete() when it is the last to drop the dio refcount.
      direct_io_worker() can ensure that it is the last to drop the reference
      count by waiting for bios to drain.  It does this for sync ops, of course,
      for partial dio writes that must fall back to buffered I/O, and for aio
      ops that saw errors during submission.
      
      This means that operations that end up waiting, even if they were issued as
      aio ops, will not call aio_complete() from dio.  Instead we return the return
      code of the operation and let the aio core call aio_complete().  This is
      purposely done to fix a bug where AIO DIO file extensions would call
      aio_complete() before their callers have a chance to update i_size.
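      
      The collapsed tail of direct_io_worker() then enforces the invariant
      roughly like this (a sketch; only the last reference holder generates the
      result, and anything else must be a queued aio):
      
      spin_lock_irqsave(&dio->bio_lock, flags);
      ret2 = --dio->refcount;
      spin_unlock_irqrestore(&dio->bio_lock, flags);
      if (ret2 == 0) {
              /* last ref: generate the result here, never in aio completion */
              ret = dio_complete(dio, offset, ret);
              kfree(dio);
      } else
              BUG_ON(ret != -EIOCBQUEUED);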
      
      Now that direct_io_worker() is explicitly returning -EIOCBQUEUED, its
      callers no longer have to translate for it.  XFS needs to be careful not
      to free resources that will be used during AIO completion if -EIOCBQUEUED
      is returned.  We maintain the previous behaviour of trying to write fs
      metadata for O_SYNC aio+dio writes.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] dio: remove duplicate bio wait code · 20258b2b
      Authored by Zach Brown
      Now that we have a single refcount and waiting path we can reuse it in the
      async 'should_wait' path.  It continues to rely on the fragile link between
      the conditional in dio_complete_aio() which decides to complete the AIO and
      the conditional in direct_io_worker() which decides to wait and free.
      
      By waiting before dropping the reference, we stop dio_bio_end_aio() from
      calling dio_complete_aio(), which used to wake up the waiter after seeing
      the reference count drop to 0.  We hoist this wake-up into
      dio_bio_end_aio(), which now notices when it has left a single remaining
      reference that is held by the waiter.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] dio: formalize bio counters as a dio reference count · 0273201e
      Authored by Zach Brown
      Previously we had two confusing counts of bio progress.  'bio_count' was
      decremented as bios were processed and freed by the dio core.  It was used to
      indicate final completion of the dio operation.  'bios_in_flight' reflected
      how many bios were between submit_bio() and bio->end_io.  It was used by the
      sync path to decide when to wake up and finish completing bios and was ignored
      by the async path.
      
      This patch collapses the two notions into one notion of a dio reference
      count: bios hold a dio reference when they're between submit_bio and
      bio->end_io.
      
      Since bios_in_flight was only used in the sync path, it is now equivalent
      to dio->refcount - 1, which accounts for direct_io_worker() holding a
      reference for the duration of the operation.
      
      dio_bio_complete() -> finished_one_bio() was called from the sync path after
      finding bios on the list that the bio->end_io function had deposited.
      finished_one_bio() cannot drop the dio reference on behalf of these bios now
      because bio->end_io already has.  The is_async test in finished_one_bio()
      meant that it never actually did anything other than drop the bio_count for
      sync callers.  So we remove its refcount decrement, don't call it from
      dio_bio_complete(), and hoist its call up into the async dio_bio_complete()
      caller after an explicit refcount decrement.  It is renamed dio_complete_aio()
      to reflect the remaining work it actually does.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] dio: call blk_run_address_space() once per op · 17a7b1d7
      Authored by Zach Brown
      We only need to call blk_run_address_space() once after all the bios for the
      direct IO op have been submitted.  This removes the chance of calling
      blk_run_address_space() after spurious wake ups as the sync path waits for
      bios to drain.  It's also one less difference between the sync and async paths.
      
      In the process we remove a redundant dio_bio_submit() that its caller had
      already performed.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] dio: centralize completion in dio_complete() · 6d544bb4
      Authored by Zach Brown
      There have been a lot of bugs recently due to the way direct_io_worker() tries
      to decide how to finish direct IO operations.  In the worst examples it has
      failed to call aio_complete() at all (hang) or called it too many times
      (oops).
      
      This set of patches cleans up the completion phase with the goal of removing
      the complexity that lead to these bugs.  We end up with one path that
      calculates the result of the operation after all of the bios have completed.
      We decide when to generate a result of the operation using that path based on
      the final release of a refcount on the dio structure.
      
      I tried to progress towards the final state in steps that were relatively easy
      to understand.  Each step should compile but I only tested the final result of
      having all the patches applied.
      
      I've tested these on low end PC drives with aio-stress, the direct IO tests I
      could manage to get running in LTP, orasim, and some home-brew functional
      tests.
      
      In http://lkml.org/lkml/2006/9/21/103 IBM reports success with ext2 and ext3
      running DIO LTP tests.  They found an XFS bug, which has since been
      addressed in the patch series.
      
      This patch:
      
      The mechanics which decide the result of a direct IO operation were duplicated
      in the sync and async paths.
      
      The async path didn't check page_errors which can manifest as silently
      returning success when the final pointer in an operation faults and its
      matching file region is filled with zeros.
      
      The sync path and async path differed in whether they passed errors to the
      caller's dio->end_io operation.  The async path was passing errors to it which
      trips an assertion in XFS, though it is apparently harmless.
      
      This centralizes the completion phase of dio ops in one place.  AIO will now
      return EFAULT consistently and all paths fall back to the previously sync
      behaviour of passing the number of bytes 'transferred' to the dio->end_io
      callback, regardless of errors.
      
      dio_await_completion() doesn't have to propagate EIO from non-uptodate
      bios now that it's being propagated through dio_complete() via
      dio->io_error.  This lets it return void, which simplifies its sole caller.
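      
      The centralized completion then reduces to something like this (a sketch
      matching the description: the byte count always goes to ->end_io, and
      page_errors, io_error and the byte count are folded into the return
      value in that order):
      
      static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret)
      {
              ssize_t transferred = dio->result;
      
              /* hand ->end_io the bytes transferred, never an errno */
              if (dio->end_io && dio->result)
                      dio->end_io(dio->iocb, offset, transferred,
                                  dio->map_bh.b_private);
      
              if (ret == 0)
                      ret = dio->page_errors;  /* makes AIO return EFAULT too */
              if (ret == 0)
                      ret = dio->io_error;
              if (ret == 0)
                      ret = transferred;
      
              return ret;
      }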
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] io-accounting: direct-io · 98c4d57d
      Authored by Andrew Morton
      Account for direct-io.
      
      Cc: Jay Lan <jlan@sgi.com>
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Chris Sturtivant <csturtiv@sgi.com>
      Cc: Tony Ernst <tee@sgi.com>
      Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
      Cc: David Wright <daw@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  20. 04 Jul 2006, 1 commit
  21. 23 Jun 2006, 1 commit
    • [PATCH] Kill PF_SYNCWRITE flag · b31dc66a
      Authored by Jens Axboe
      A process flag to indicate whether we are doing sync io is incredibly
      ugly. It also causes performance problems when one does a lot of async
      io and then proceeds to sync it. Part of the io will go out as async,
      and the other part as sync. This causes a disconnect between the
      previously submitted io and the synced io. For io schedulers such as CFQ,
      this will cause lost merges and suboptimal scheduling behaviour.
      
      Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
      the O_DIRECT path just directly indicate that the writes are sync
      by using WRITE_SYNC instead.
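      
      In the O_DIRECT path this amounts to forcing the sync hint at submission
      time (a sketch; the exact spot in the blockdev_direct_IO path may
      differ):
      
      /* O_DIRECT writes are sync by definition; say so per-request */
      if (rw & WRITE)
              rw = WRITE_SYNC;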
      Signed-off-by: Jens Axboe <axboe@suse.de>
  22. 01 Apr 2006, 1 commit
  23. 29 Mar 2006, 1 commit
  24. 27 Mar 2006, 1 commit
  25. 26 Mar 2006, 1 commit
    • [PATCH] direct-io: bug fix in dio handling write error · 174e27c6
      Authored by Chen, Kenneth W
      There is a bug in direct-io's propagation of a write error up to the
      higher I/O layer.  When performing an async O_DIRECT write to a block
      device, if a device error occurs (like a media error, or the disk being
      pulled), the error code is only propagated from the device driver to the
      DIO layer.  The error code stops at finished_one_bio().  The async write,
      however, is supposed to have a corresponding AIO event with an
      appropriate return code (in this case -EIO).  An application which waits
      on the async write event will hang forever, since that AIO event is lost
      (if the app did not use the timeout option in the io_getevents call;
      regardless, an AIO event is lost).
      
      The discovery of the above bug leads to another discovery of a potential race
      window with dio->result.  The fundamental problem is that dio->result is
      overloaded with dual use: an indicator of fall back path for partial dio
      write, and an error indicator used in the I/O completion path.  In the
      event of device error, the setting of -EIO to dio->result clashes with
      value used to track partial write that activates the fall back path.
      
      It was also pointed out that it is impossible to use dio->result to track
      a partial write and at the same time to track an error returned from the
      device driver.  direct_io_worker() can only determine whether it is a
      partial write at the end of io submission, yet in mid-stream of those
      submissions a return code could be coming back from the driver, messing
      up all the subsequent logic.
      
      The proposed fix is to separate the error code returned by the IO
      completion path from the partial IO submit tracking.  A new variable is
      added to the dio structure specifically to track the io error returned in
      the completion path.
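      
      Sketched, the separation looks like this (the field name follows the
      description; bio_error below is a hypothetical stand-in for the error
      reported by the completed bio):
      
      /* new field in struct dio, alongside the overloaded 'result' */
      int io_error;           /* first errno seen in the completion path */
      
      /* bio completion path: record the error without touching result */
      if (bio_error && !dio->io_error)
              dio->io_error = -EIO;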
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Acked-by: Zach Brown <zach.brown@oracle.com>
      Acked-by: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  26. 15 Mar 2006, 1 commit
  27. 04 Feb 2006, 1 commit
    • [PATCH] fix O_DIRECT read of last block in a sparse file · 35dc8161
      Authored by Jeff Moyer
      Currently, if you open a file O_DIRECT, truncate it to a size that is not a
      multiple of the disk block size, and then try to read the last block in the
      file, the read will return 0.  The problem is in do_direct_IO, here:
      
              /* Handle holes */
              if (!buffer_mapped(map_bh)) {
                      char *kaddr;
      
      		...
      
                      if (dio->block_in_file >=
                              i_size_read(dio->inode)>>blkbits) {
                              /* We hit eof */
                              page_cache_release(page);
                              goto out;
                      }
      
      We shift off any remaining bytes in the final block of the I/O, resulting
      in a 0-sized read.  I've attached a patch that fixes this.  I'm not happy
      about how ugly the math is getting, so suggestions are more than welcome.
      
      I've tested this with a simple program that performs the steps outlined for
      reproducing the problem above.  Without the patch, we get a 0-sized result
      from read.  With the patch, we get the correct return value from the short
      read.
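      
      The fix, roughly (a sketch following this description: round i_size up to
      a block boundary before the EOF test, so the partial last block is still
      read):
      
      /* account for a partial final block when testing for EOF */
      loff_t i_size_aligned = ALIGN(i_size_read(dio->inode),
                                    1 << blkbits);
      
      if (dio->block_in_file >= i_size_aligned >> blkbits) {
              /* We hit eof */
              page_cache_release(page);
              goto out;
      }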
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Mingming Cao <cmm@us.ibm.com>
      Cc: Joel Becker <Joel.Becker@oracle.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  28. 10 Jan 2006, 1 commit
  29. 30 Oct 2005, 1 commit
    • [PATCH] core remove PageReserved · b5810039
      Authored by Nick Piggin
      Remove PageReserved() calls from core code by tightening VM_RESERVED
      handling in mm/ to cover PageReserved functionality.
      
      PageReserved special casing is removed from get_page and put_page.
      
      All setting and clearing of PageReserved is retained, and it is now flagged
      in the page_alloc checks to help ensure we don't introduce any refcount
      based freeing of Reserved pages.
      
      MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
      deprecated.  We never completely handled it correctly anyway, and it may
      be reintroduced in future if required (Hugh has a proof of concept).
      
      Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
      arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
      be trivially removed.
      
      Last real user of PageReserved is swsusp, which uses PageReserved to
      determine whether a struct page points to valid memory or not.  This still
      needs to be addressed (a generic page_is_ram() should work).
      
      A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
      thus mapcounted and counted towards shared rss).  These writes to the struct
      page could cause excessive cacheline bouncing on big systems.  There are a
      number of ways this could be addressed if it is an issue.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      
      Refcount bug fix for filemap_xip.c
      Signed-off-by: Carsten Otte <cotte@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  30. 24 Jun 2005, 1 commit
  31. 17 Apr 2005, 1 commit
    • [PATCH] Direct IO async short read fix · 29504ff3
      Authored by Daniel McNeil
      The direct I/O code maps the read request to the file system block.  If
      the file size was not on a block boundary, the result would show the read
      reading past EOF.  This was only happening for the AIO case; the non-AIO
      case truncates the result to match the file size (in direct_io_worker).
      This patch does the same thing for the AIO case: it truncates the result
      to match the file size if the read reads past EOF.
      
      When I/O completes, the result can be truncated to match the file size
      without using i_size_read(), so the aio result now matches the number of
      bytes read up to the end of the file.
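      
      A sketch of the clamp in the AIO completion path (assuming an i_size
      value captured at submission time, as the description implies):
      
      /* dio->i_size was sampled at submission; clamp a read past EOF */
      if (dio->rw == READ && (offset + transferred) > dio->i_size)
              transferred = dio->i_size - offset;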
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>