1. 27 10月, 2010 1 次提交
  2. 10 9月, 2010 1 次提交
    • J
      O_DIRECT: fix the splitting up of contiguous I/O · 7a801ac6
      Jeff Moyer 提交于
      commit c2c6ca41 (direct-io: do not merge logically non-contiguous requests)
      introduced a bug whereby all O_DIRECT I/Os were submitted a page at a time
      to the block layer.  The problem is that the code expected
      dio->block_in_file to correspond to the current page in the dio.  In fact,
      it corresponds to the previous page submitted via submit_page_section.
      This was purely an oversight, as the dio->cur_page_fs_offset field was
      introduced for just this purpose.  This patch simply uses the correct
      variable when calculating whether there is a mismatch between contiguous
      logical blocks and contiguous physical blocks (as described in the
      comments).
      
      I also switched the if conditional following this check to an else if, to
      ensure that we never call dio_bio_submit twice for the same dio (in
      theory, this should not happen, anyway).
      
      I've tested this by running blktrace and verifying that a 64KB I/O was
      submitted as a single I/O.  I also ran the patched kernel through
      xfstests' aio tests using xfs, ext4 (with 1k and 4k block sizes) and btrfs
      and verified that there were no regressions as compared to an unpatched
      kernel.
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Acked-by: NJosef Bacik <jbacik@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: <stable@kernel.org>		[2.6.35.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7a801ac6
  3. 10 8月, 2010 1 次提交
    • C
      sort out blockdev_direct_IO variants · eafdc7d1
      Christoph Hellwig 提交于
      Move the call to vmtruncate to get rid of accessive blocks to the callers
      in prepearation of the new truncate calling sequence.  This was only done
      for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
      was not needed anyway.  Get rid of blockdev_direct_IO_no_locking and
      its _newtrunc variant while at it as just opencoding the two additional
      paramters is shorted than the name suffix.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      eafdc7d1
  4. 27 7月, 2010 2 次提交
  5. 28 5月, 2010 1 次提交
    • N
      fs: introduce new truncate sequence · 7bb46a67
      npiggin@suse.de 提交于
      Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
      setattr > vmtruncate > truncate, have filesystems call their truncate sequence
      from ->setattr if filesystem specific operations are required. vmtruncate is
      deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
      previously should be used.
      
      simple_setattr is introduced for simple in-ram filesystems to implement
      the new truncate sequence. Eventually all filesystems should be converted
      to implement a setattr, and the default code in notify_change should go
      away.
      
      simple_setsize is also introduced to perform just the ATTR_SIZE portion
      of simple_setattr (ie. changing i_size and trimming pagecache).
      
      To implement the new truncate sequence:
      - filesystem specific manipulations (eg freeing blocks) must be done in
        the setattr method rather than ->truncate.
      - vmtruncate can not be used by core code to trim blocks past i_size in
        the event of write failure after allocation, so this must be performed
        in the fs code.
      - convert usage of helpers block_write_begin, nobh_write_begin,
        cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
        variants. These avoid calling vmtruncate to trim blocks (see previous).
      - inode_setattr should not be used. generic_setattr is a new function
        to be used to copy simple attributes into the generic inode.
      - make use of the better opportunity to handle errors with the new sequence.
      
      Big problem with the previous calling sequence: the filesystem is not called
      until i_size has already changed.  This means it is not allowed to fail the
      call, and also it does not know what the previous i_size was. Also, generic
      code calling vmtruncate to truncate allocated blocks in case of error had
      no good way to return a meaningful error (or, for example, atomically handle
      block deallocation).
      
      Cc: Christoph Hellwig <hch@lst.de>
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7bb46a67
  6. 25 5月, 2010 2 次提交
    • J
      direct-io: do not merge logically non-contiguous requests · c2c6ca41
      Josef Bacik 提交于
      Btrfs cannot handle having logically non-contiguous requests submitted.  For
      example if you have
      
      Logical:  [0-4095][HOLE][8192-12287]
      Physical: [0-4095]      [4096-8191]
      
      Normally the DIO code would put these into the same BIO's.  The problem is we
      need to know exactly what offset is associated with what BIO so we can do our
      checksumming and unlocking properly, so putting them in the same BIO doesn't
      work.  So add another check where we submit the current BIO if the physical
      blocks are not contigous OR the logical blocks are not contiguous.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c2c6ca41
    • J
      direct-io: add a hook for the fs to provide its own submit_bio function · facd07b0
      Josef Bacik 提交于
      Because BTRFS can do RAID and such, we need our own submit hook so we can setup
      the bio's in the correct fashion, and handle checksum errors properly.  So there
      are a few changes here
      
      1) The submit_io hook.  This is straightforward, just call this instead of
      submit_bio.
      
      2) Allow the fs to return -ENOTBLK for reads.  Usually this has only worked for
      writes, since writes can fallback onto buffered IO.  But BTRFS needs the option
      of falling back on buffered IO if it encounters a compressed extent, since we
      need to read the entire extent in and decompress it.  So if we get -ENOTBLK back
      from get_block we'll return back and fallback on buffered just like the write
      case.
      
      I've tested these changes with fsx and everything seems to work.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      facd07b0
  7. 17 12月, 2009 2 次提交
    • A
      dio: fix use-after-free · 06777d30
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      06777d30
    • C
      cleanup blockdev_direct_IO locking · 1e431f5c
      Christoph Hellwig 提交于
      Currently the locking in blockdev_direct_IO is a mess, we have three different
      locking types and very confusing checks for some of them.  The most
      complicated one is DIO_OWN_LOCKING for reads, which happens to not actually be
      used.
      
      This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read case
      is unused anyway, and the write side is almost identical to DIO_NO_LOCKING.
      The difference is that DIO_NO_LOCKING always sets the create argument for
      the get_blocks callback to zero, but we can easily move that to the actual
      get_blocks callbacks.  There are four users of the DIO_NO_LOCKING mode:
      gfs already ignores the create argument and thus is fine with the new
      version, ocfs2 only errors out if create were ever set, and we can remove
      this dead code now, the block device code only ever uses create for an
      error message if we are fully beyond the device which can never happen,
      and last but not least XFS will need the new behavour for writes.
      
      Now we can replace the lock_type variable with a flags one, where no flag
      means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
      flag.  Separate out the check for not allowing to fill holes into a separate
      flag, although for now both flags always get set at the same time.
      
      Also revamp the documentation of the locking scheme to actually make sense.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1e431f5c
  8. 16 12月, 2009 2 次提交
    • C
      direct-io: cleanup blockdev_direct_IO locking · 5fe878ae
      Christoph Hellwig 提交于
      Currently the locking in blockdev_direct_IO is a mess, we have three
      different locking types and very confusing checks for some of them.  The
      most complicated one is DIO_OWN_LOCKING for reads, which happens to not
      actually be used.
      
      This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read
      case is unused anyway, and the write side is almost identical to
      DIO_NO_LOCKING.  The difference is that DIO_NO_LOCKING always sets the
      create argument for the get_blocks callback to zero, but we can easily
      move that to the actual get_blocks callbacks.  There are four users of the
      DIO_NO_LOCKING mode: gfs already ignores the create argument and thus is
      fine with the new version, ocfs2 only errors out if create were ever set,
      and we can remove this dead code now, the block device code only ever uses
      create for an error message if we are fully beyond the device which can
      never happen, and last but not least XFS will need the new behavour for
      writes.
      
      Now we can replace the lock_type variable with a flags one, where no flag
      means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
      flag.  Separate out the check for not allowing to fill holes into a
      separate flag, although for now both flags always get set at the same
      time.
      
      Also revamp the documentation of the locking scheme to actually make
      sense.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5fe878ae
    • J
      dio: don't zero out the pages array inside struct dio · 23aee091
      Jeff Moyer 提交于
      Intel reported a performance regression caused by the following commit:
      
      commit 848c4dd5
      Author: Zach Brown <zach.brown@oracle.com>
      Date:   Mon Aug 20 17:12:01 2007 -0700
      
          dio: zero struct dio with kzalloc instead of manually
      
          This patch uses kzalloc to zero all of struct dio rather than
          manually trying to track which fields we rely on being zero.  It
          passed aio+dio stress testing and some bug regression testing on
          ext3.
      
          This patch was introduced by Linus in the conversation that lead up
          to Badari's minimal fix to manually zero .map_bh.b_state in commit:
      
            6a648fa7
      
          It makes the code a bit smaller.  Maybe a couple fewer cachelines to
          load, if we're lucky:
      
             text    data     bss     dec     hex filename
          3285925  568506 1304616 5159047  4eb887 vmlinux
          3285797  568506 1304616 5158919  4eb807 vmlinux.patched
      
          I was unable to measure a stable difference in the number of cpu
          cycles spent in blockdev_direct_IO() when pushing aio+dio 256K reads
          at ~340MB/s.
      
          So the resulting intent of the patch isn't a performance gain but to
          avoid exposing ourselves to the risk of finding another field like
          .map_bh.b_state where we rely on zeroing but don't enforce it in the
          code.
      
      Zach surmised that zeroing out the page array was what caused most of
      the problem, and suggested the approach taken in the attached patch for
      resolving the issue.  Intel re-tested with this patch and saw a 0.6%
      performance gain (the original regression was 0.5%).
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Acked-by: NZach Brown <zach.brown@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23aee091
  9. 26 11月, 2009 1 次提交
    • V
      Fix regression in direct writes performance due to WRITE_ODIRECT flag removal · d9449ce3
      Vivek Goyal 提交于
      There seems to be a regression in direct write path due to following
      commit in for-2.6.33 branch of block tree.
      
      commit 1af60fbd
      Author: Jeff Moyer <jmoyer@redhat.com>
      Date:   Fri Oct 2 18:56:53 2009 -0400
      
          block: get rid of the WRITE_ODIRECT flag
      
      Marking direct writes as WRITE_SYNC_PLUG instead of WRITE_ODIRECT, sets
      the NOIDLE flag in bio and hence in request. This tells CFQ to not expect
      more request from the queue and not idle on it (despite the fact that
      queue's think time is less and it is not seeky).
      
      So direct writers lose big time when competing with sequential readers.
      
      Using fio, I have run one direct writer and two sequential readers and
      following are the results with 2.6.32-rc7 kernel and with for-2.6.33
      branch.
      
      Test
      ====
      1 direct writer and 2 sequential reader running simultaneously.
      
      [global]
      directory=/mnt/sdc/fio/
      runtime=10
      
      [seqwrite]
      rw=write
      size=4G
      direct=1
      
      [seqread]
      rw=read
      size=2G
      numjobs=2
      
      2.6.32-rc7
      ==========
      direct writes: aggrb=2,968KB/s
      readers	     : aggrb=101MB/s
      
      for-2.6.33 branch
      =================
      direct write: aggrb=19KB/s
      readers	      aggrb=137MB/s
      
      This patch brings back the WRITE_ODIRECT flag, with the difference that we
      don't set the BIO_RW_UNPLUG flag so that device is not unplugged after
      submission of request and an explicit unplug from submitter is required.
      
      That way we fix the jeff's issue of not enough merging taking place in aio
      path as well as make sure direct writes get their fair share.
      
      After the fix
      =============
      for-2.6.33 + fix
      ----------------
      direct writes: aggrb=2,728KB/s
      reads: aggrb=103MB/s
      
      Thanks
      Vivek
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      d9449ce3
  10. 28 10月, 2009 2 次提交
    • J
      aio: implement request batching · cfb1e33e
      Jeff Moyer 提交于
      Hi,
      
      Some workloads issue batches of small I/O, and the performance is poor
      due to the call to blk_run_address_space for every single iocb.  Nathan
      Roberts pointed this out, and suggested that by deferring this call
      until all I/Os in the iocb array are submitted to the block layer, we
      can realize some impressive performance gains (up to 30% for sequential
      4k reads in batches of 16).
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      cfb1e33e
    • J
      block: get rid of the WRITE_ODIRECT flag · 1af60fbd
      Jeff Moyer 提交于
      Hi,
      
      The WRITE_ODIRECT flag is only used in one place, and that code path
      happens to also call blk_run_address_space.  The introduction of this
      flag, then, could result in the device being unplugged twice for every
      I/O.
      
      Further, with the batching changes in the next patch, we don't want an
      O_DIRECT write to imply a queue unplug.
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      1af60fbd
  11. 23 5月, 2009 1 次提交
  12. 15 4月, 2009 1 次提交
  13. 06 4月, 2009 1 次提交
  14. 07 1月, 2009 1 次提交
    • D
      fs: truncate blocks outside i_size after O_DIRECT write error · 0f64415d
      Dmitri Monakhov 提交于
      In case of error extending write may have instantiated a few blocks
      outside i_size.  We need to trim these blocks.  We have to do it
      *regardless* to blocksize.  At least ext2, ext3 and reiserfs interpret
      (i_size < biggest block) condition as error.  Fsck will complain about
      wrong i_size.  Then fsck will fix the error by changing i_size according
      to the biggest block.  This is bad because this blocks contain garbage
      from previous write attempt.  And result in data corruption.
      
      ####TESTCASE_BEGIN
      $touch /mnt/test/BIG_FILE
      ## at this moment /mnt/test/BIG_FILE size and blocks equal to zero
      open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
      write(3, "aaaaaaaaaaaa"..., 104857600) = -1 ENOSPC (No space left on device)
      ## size and block sould't be changed because write op failed.
      $stat /mnt/test/BIG_FILE
      File: `/mnt/test/BIG_FILE'
      Size: 0 Blocks: 110896 IO Block: 1024 regular empty file
      <<<<<<<<^^^^^^^^^^^^^^^^^^^^^^^^^^^^^file size is less than biggest block idx
      Device: fe07h/65031d Inode: 14 Links: 1
      Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
      Access: 2007-01-24 20:03:38.000000000 +0300
      Modify: 2007-01-24 20:03:38.000000000 +0300
      Change: 2007-01-24 20:03:39.000000000 +0300
      
      #fsck.ext3 -f /dev/VG/test
      e2fsck 1.39 (29-May-2006)
      Pass 1: Checking inodes, blocks, and sizes
      Inode 14, i_size is 0, should be 56556544. Fix<y>? yes
      Pass 2: Checking directory structure
      ....
      #####TESTCASE_ENDdiff --git a/fs/direct-io.c b/fs/direct-io.c
      index af0558d..4e88bea 100644
      
      [akpm@linux-foundation.org: use i_size_read()]
      Signed-off-by: NDmitri Monakhov <dmonakhov@openvz.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f64415d
  15. 17 10月, 2008 1 次提交
  16. 27 7月, 2008 1 次提交
  17. 06 2月, 2008 1 次提交
    • C
      Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user · eebd2aa3
      Christoph Lameter 提交于
      Simplify page cache zeroing of segments of pages through 3 functions
      
      zero_user_segments(page, start1, end1, start2, end2)
      
              Zeros two segments of the page. It takes the position where to
              start and end the zeroing which avoids length calculations and
      	makes code clearer.
      
      zero_user_segment(page, start, end)
      
              Same for a single segment.
      
      zero_user(page, start, length)
      
              Length variant for the case where we know the length.
      
      We remove the zero_user_page macro. Issues:
      
      1. Its a macro. Inline functions are preferable.
      
      2. The KM_USER0 macro is only defined for HIGHMEM.
      
         Having to treat this special case everywhere makes the
         code needlessly complex. The parameter for zeroing is always
         KM_USER0 except in one single case that we open code.
      
      Avoiding KM_USER0 makes a lot of code not having to be dealing
      with the special casing for HIGHMEM anymore. Dealing with
      kmap is only necessary for HIGHMEM configurations. In those
      configurations we use KM_USER0 like we do for a series of other
      functions defined in highmem.h.
      
      Since KM_USER0 is depends on HIGHMEM the existing zero_user_page
      function could not be a macro. zero_user_* functions introduced
      here can be be inline because that constant is not used when these
      functions are called.
      
      Also extract the flushing of the caches to be outside of the kmap.
      
      [akpm@linux-foundation.org: fix nfs and ntfs build]
      [akpm@linux-foundation.org: fix ntfs build some more]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Cc: David Chinner <dgc@sgi.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eebd2aa3
  18. 17 10月, 2007 1 次提交
    • N
      remove ZERO_PAGE · 557ed1fa
      Nick Piggin 提交于
      The commit b5810039 contains the note
      
        A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
        (and thus mapcounted and count towards shared rss).  These writes to
        the struct page could cause excessive cacheline bouncing on big
        systems.  There are a number of ways this could be addressed if it is
        an issue.
      
      And indeed this cacheline bouncing has shown up on large SGI systems.
      There was a situation where an Altix system was essentially livelocked
      tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
      This situation can be avoided in userspace, but it does highlight the
      potential scalability problem with refcounting ZERO_PAGE, and corner
      cases where it can really hurt (we don't want the system to livelock!).
      
      There are several broad ways to fix this problem:
      1. add back some special casing to avoid refcounting ZERO_PAGE
      2. per-node or per-cpu ZERO_PAGES
      3. remove the ZERO_PAGE completely
      
      I will argue for 3. The others should also fix the problem, but they
      result in more complex code than does 3, with little or no real benefit
      that I can see.
      
      Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
      false optimisation: if an application is performance critical, it would
      not be doing many read faults of new memory, or at least it could be
      expected to write to that memory soon afterwards. If cache or memory use
      is critical, it should not be working with a significant number of
      ZERO_PAGEs anyway (a more compact representation of zeroes should be
      used).
      
      As a sanity check -- mesuring on my desktop system, there are never many
      mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
      increase much without it.
      
      When running a make -j4 kernel compile on my dual core system, there are
      about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
      ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
      is torn down without being COWed). So removing ZERO_PAGE will save 1,000
      page faults per second when running kbuild, while keeping it only saves
      less than 1 page clearing operation per second. 1 page clear is cheaper
      than a thousand faults, presumably, so there isn't an obvious loss.
      
      Neither the logical argument nor these basic tests give a guarantee of no
      regressions. However, this is a reasonable opportunity to try to remove
      the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
      we can reintroduce it and just avoid refcounting it.
      
      The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
      much use to them except on benchmarks.  All other users of ZERO_PAGE are
      converted just to use ZERO_PAGE(0) for simplicity. We can look at
      replacing them all and maybe ripping out ZERO_PAGE completely when we are
      more satisfied with this solution.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus "snif" Torvalds <torvalds@linux-foundation.org>
      557ed1fa
  19. 10 10月, 2007 1 次提交
  20. 21 8月, 2007 1 次提交
    • Z
      dio: zero struct dio with kzalloc instead of manually · 848c4dd5
      Zach Brown 提交于
      This patch uses kzalloc to zero all of struct dio rather than manually
      trying to track which fields we rely on being zero.  It passed aio+dio
      stress testing and some bug regression testing on ext3.
      
      This patch was introduced by Linus in the conversation that lead up to
      Badari's minimal fix to manually zero .map_bh.b_state in commit:
      
        6a648fa7
      
      It makes the code a bit smaller.  Maybe a couple fewer cachelines to
      load, if we're lucky:
      
         text    data     bss     dec     hex filename
      3285925  568506 1304616 5159047  4eb887 vmlinux
      3285797  568506 1304616 5158919  4eb807 vmlinux.patched
      
      I was unable to measure a stable difference in the number of cpu cycles
      spent in blockdev_direct_IO() when pushing aio+dio 256K reads at
      ~340MB/s.
      
      So the resulting intent of the patch isn't a performance gain but to
      avoid exposing ourselves to the risk of finding another field like
      .map_bh.b_state where we rely on zeroing but don't enforce it in the
      code.
      Signed-off-by: NZach Brown <zach.brown@oracle.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      848c4dd5
  21. 12 8月, 2007 1 次提交
  22. 04 7月, 2007 1 次提交
    • Z
      dio: remove bogus refcounting BUG_ON · fcb82f88
      Zach Brown 提交于
      Badari Pulavarty reported a case of this BUG_ON is triggering during
      testing.  It's completely bogus and should be removed.
      
      It's trying to notice if we left references to the dio hanging around in
      the sync case.  They should have been dropped as IO completed while this
      path was in dio_await_completion().  This condition will also be
      checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a few
      lines lower.  So to start this BUG_ON() is redundant.
      
      More fatally, it's dereferencing dio-> after having dropped its
      reference.  It's only safe to dereference the dio after releasing the
      lock if the final reference was just dropped.  Another CPU might free
      the dio in bio completion and reuse the memory after this path drops the
      dio lock but before the BUG_ON() is evaluated.
      
      This patch passed aio+dio regression unit tests and aio-stress on ext3.
      Signed-off-by: NZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fcb82f88
  23. 10 5月, 2007 1 次提交
    • N
      fs: convert core functions to zero_user_page · 01f2705d
      Nate Diller 提交于
      It's very common for file systems to need to zero part or all of a page,
      the simplist way is just to use kmap_atomic() and memset().  There's
      actually a library function in include/linux/highmem.h that does exactly
      that, but it's confusingly named memclear_highpage_flush(), which is
      descriptive of *how* it does the work rather than what the *purpose* is.
      So this patchset renames the function to zero_user_page(), and calls it
      from the various places that currently open code it.
      
      This first patch introduces the new function call, and converts all the
      core kernel callsites, both the open-coded ones and the old
      memclear_highpage_flush() ones.  Following this patch is a series of
      conversions for each file system individually, per AKPM, and finally a
      patch deprecating the old call.  The diffstat below shows the entire
      patchset.
      
      [akpm@linux-foundation.org: fix a few things]
      Signed-off-by: NNate Diller <nate.diller@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01f2705d
  24. 09 5月, 2007 1 次提交
  25. 11 12月, 2006 7 次提交
    • Z
      [PATCH] dio: lock refcount operations · 5eb6c7a2
      Zach Brown 提交于
      The wait_for_more_bios() function name was poorly chosen.  While looking to
      clean it up it I noticed that the dio struct refcounting between the bio
      completion and dio submission paths was racey.
      
      The bio submission path was simply freeing the dio struct if
      atomic_dec_and_test() indicated that it dropped the final reference.
      
      The aio bio completion path was dereferencing its dio struct pointer *after
      dropping its reference* based on the remaining number of references.
      
      These two paths could race and result in the aio bio completion path
      dereferencing a freed dio, though this was not observed in the wild.
      
      This moves the refcount under the bio lock so that bio completion can drop
      its reference and decide to wake all in one atomic step.
      
      Once testing and waking is locked dio_await_one() can test its sleeping
      condition and mark itself uninterruptible under the lock.  It gets simpler
      and wait_for_more_bios() disappears.
      
      The addition of the interrupt masking spin lock acquiry in dio_bio_submit()
      looks alarming.  This lock acquiry existed in that path before the recent
      dio completion patch set.  We shouldn't expect significant performance
      regression from returning to the behaviour that existed before the
      completion clean up work.
      
      This passed 4k block ext3 O_DIRECT fsx and aio-stress on an SMP machine.
      Signed-off-by: NZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5eb6c7a2
    • Z
      [PATCH] dio: only call aio_complete() after returning -EIOCBQUEUED · 8459d86a
      Zach Brown 提交于
      The only time it is safe to call aio_complete() is when the ->ki_retry
      function returns -EIOCBQUEUED to the AIO core.  direct_io_worker() has
      historically done this by relying on its caller to translate positive return
      codes into -EIOCBQUEUED for the aio case.  It did this by trying to keep
      conditionals in sync.  direct_io_worker() knew when finished_one_bio() was
      going to call aio_complete().  It would reverse the test and wait and free the
      dio in the cases it thought that finished_one_bio() wasn't going to.
      
      Not surprisingly, it ended up getting it wrong.  'ret' could be a negative
      errno from the submission path but it failed to communicate this to
      finished_one_bio().  direct_io_worker() would return < 0, it's callers
      wouldn't raise -EIOCBQUEUED, and aio_complete() would be called.  In the
      future finished_one_bio()'s tests wouldn't reflect this and aio_complete()
      would be called for a second time which can manifest as an oops.
      
      The previous cleanups have whittled the sync and async completion paths down
      to the point where we can collapse them and clearly reassert the invariant
      that we must only call aio_complete() after returning -EIOCBQUEUED.
      direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
      drop the dio refcount and the aio bio completion path will only call
      aio_complete() when it is the last to drop the dio refcount.
      direct_io_worker() can ensure that it is the last to drop the reference count
      by waiting for bios to drain.  It does this for sync ops, of course, and for
      partial dio writes that must fall back to buffered and for aio ops that saw
      errors during submission.
      
      This means that operations that end up waiting, even if they were issued as
      aio ops, will not call aio_complete() from dio.  Instead we return the return
      code of the operation and let the aio core call aio_complete().  This is
      purposely done to fix a bug where AIO DIO file extensions would call
      aio_complete() before their callers have a chance to update i_size.
      
      Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers
      no longer have to translate for it.  XFS needs to be careful not to free
      resources that will be used during AIO completion if -EIOCBQUEUED is returned.
       We maintain the previous behaviour of trying to write fs metadata for O_SYNC
      aio+dio writes.
      Signed-off-by: NZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8459d86a
    • Z
      [PATCH] dio: remove duplicate bio wait code · 20258b2b
      Zach Brown 提交于
      Now that we have a single refcount and waiting path we can reuse it in the
      async 'should_wait' path.  It continues to rely on the fragile link between
      the conditional in dio_complete_aio() which decides to complete the AIO and
      the conditional in direct_io_worker() which decides to wait and free.
      
      By waiting before dropping the reference we stop dio_bio_end_aio() from
      calling dio_complete_aio() which used to wake up the waiter after seeing the
      reference count drop to 0.  We hoist this wake up into dio_bio_end_aio() which
      now notices when it's left a single remaining reference that is held by the
      waiter.
      Signed-off-by: NZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      20258b2b
    • Z
      [PATCH] dio: formalize bio counters as a dio reference count · 0273201e
      Zach Brown 提交于
      Previously we had two confusing counts of bio progress.  'bio_count' was
      decremented as bios were processed and freed by the dio core.  It was used to
      indicate final completion of the dio operation.  'bios_in_flight' reflected
      how many bios were between submit_bio() and bio->end_io.  It was used by the
      sync path to decide when to wake up and finish completing bios and was ignored
      by the async path.
      
      This patch collapses the two notions into one notion of a dio reference count.
       bios hold a dio reference when they're between submit_bio and bio->end_io.
      
      Since bios_in_flight was only used in the sync path it is now equivalent to
      dio->refcount - 1 which accounts for direct_io_worker() holding a reference
      for the duration of the operation.
      
      dio_bio_complete() -> finished_one_bio() was called from the sync path after
      finding bios on the list that the bio->end_io function had deposited.
      finished_one_bio() can not drop the dio reference on behalf of these bios now
      because bio->end_io already has.  The is_async test in finished_one_bio()
      meant that it never actually did anything other than drop the bio_count for
      sync callers.  So we remove its refcount decrement, don't call it from
      dio_bio_complete(), and hoist its call up into the async dio_bio_complete()
      caller after an explicit refcount decrement.  It is renamed dio_complete_aio()
      to reflect the remaining work it actually does.
      Signed-off-by: NZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      0273201e
    • Z
      [PATCH] dio: call blk_run_address_space() once per op · 17a7b1d7
      Zach Brown 提交于
      We only need to call blk_run_address_space() once after all the bios for the
      direct IO op have been submitted.  This removes the chance of calling
      blk_run_address_space() after spurious wake ups as the sync path waits for
      bios to drain.  It's also one less difference betwen the sync and async paths.
      
      In the process we remove a redundant dio_bio_submit() that its caller had
      already performed.
      Signed-off-by: NZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      17a7b1d7
    • Z
      [PATCH] dio: centralize completion in dio_complete() · 6d544bb4
      Zach Brown 提交于
      There have been a lot of bugs recently due to the way direct_io_worker() tries
      to decide how to finish direct IO operations.  In the worst examples it has
      failed to call aio_complete() at all (hang) or called it too many times
      (oops).
      
      This set of patches cleans up the completion phase with the goal of removing
      the complexity that lead to these bugs.  We end up with one path that
      calculates the result of the operation after all off the bios have completed.
      We decide when to generate a result of the operation using that path based on
      the final release of a refcount on the dio structure.
      
      I tried to progress towards the final state in steps that were relatively easy
      to understand.  Each step should compile but I only tested the final result of
      having all the patches applied.
      
      I've tested these on low end PC drives with aio-stress, the direct IO tests I
      could manage to get running in LTP, orasim, and some home-brew functional
      tests.
      
      In http://lkml.org/lkml/2006/9/21/103 IBM reports success with ext2 and ext3
      running DIO LTP tests.  They found that XFS bug which has since been addressed
      in the patch series.
      
      This patch:
      
      The mechanics which decide the result of a direct IO operation were duplicated
      in the sync and async paths.
      
      The async path didn't check page_errors which can manifest as silently
      returning success when the final pointer in an operation faults and its
      matching file region is filled with zeros.
      
      The sync path and async path differed in whether they passed errors to the
      caller's dio->end_io operation.  The async path was passing errors to it which
      trips an assertion in XFS, though it is apparently harmless.
      
      This centralizes the completion phase of dio ops in one place.  AIO will now
      return EFAULT consistently and all paths fall back to the previously sync
      behaviour of passing the number of bytes 'transferred' to the dio->end_io
      callback, regardless of errors.
      
      dio_await_completion() doesn't have to propogate EIO from non-uptodate bios
      now that it's being propogated through dio_complete() via dio->io_error.  This
      lets it return void which simplifies its sole caller.
      Signed-off-by: NZach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      6d544bb4
    • A
      [PATCH] io-accounting: direct-io · 98c4d57d
      Andrew Morton 提交于
      Account for direct-io.
      
      Cc: Jay Lan <jlan@sgi.com>
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Chris Sturtivant <csturtiv@sgi.com>
      Cc: Tony Ernst <tee@sgi.com>
      Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
      Cc: David Wright <daw@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      98c4d57d
  26. 04 7月, 2006 1 次提交
  27. 23 6月, 2006 1 次提交
    • J
      [PATCH] Kill PF_SYNCWRITE flag · b31dc66a
      Jens Axboe 提交于
      A process flag to indicate whether we are doing sync io is incredibly
      ugly. It also causes performance problems when one does a lot of async
      io and then proceeds to sync it. Part of the io will go out as async,
      and the other part as sync. This causes a disconnect between the
      previously submitted io and the synced io. For io schedulers such as CFQ,
      this will cause us lost merges and suboptimal behaviour in scheduling.
      
      Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
      the O_DIRECT path just directly indicate that the writes are sync
      by using WRITE_SYNC instead.
      Signed-off-by: NJens Axboe <axboe@suse.de>
      b31dc66a
  28. 01 4月, 2006 1 次提交
  29. 29 3月, 2006 1 次提交