1. 28 Oct, 2011 1 commit
    • direct-io: separate fields only used in the submission path from struct dio · eb28be2b
      Andi Kleen committed
      This large, but largely mechanical, patch moves all fields in struct dio
      that are only used in the submission path into a separate on-stack
      data structure. This has the advantage that the memory is very likely
      cache hot, which is not guaranteed for memory fresh out of kmalloc.
      
      This also gives gcc more optimization potential because it can more easily
      determine that there are no external aliases for these variables.
      
      The sdio initialization is now an initializer instead of a memset.
      This allows gcc to break sdio into individual fields and optimize
      away unnecessary zeroing (after all the functions are inlined).
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
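The split described in the commit above can be illustrated with a minimal user-space sketch. The struct names, fields, and `submit_demo` are hypothetical stand-ins, not the kernel's definitions; the point is only that submission-only state lives on the stack and is set up with an initializer rather than `memset`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-in for the long-lived struct dio. */
struct dio_demo {
    int rw;       /* 0 = read, 1 = write */
    long result;  /* filled in when the I/O completes */
};

/* Hypothetical stand-in for the submission-only state kept on the stack. */
struct dio_submit_demo {
    unsigned blkbits;
    unsigned long block_in_file;
    unsigned long final_block_in_request;
    unsigned long pages_in_io;
};

/* Returns the number of blocks the request spans, or -1 if malloc fails. */
long submit_demo(int rw, unsigned blkbits, unsigned long offset,
                 unsigned long len)
{
    struct dio_demo *dio = malloc(sizeof(*dio));
    /* An initializer rather than memset(): fields not named here are
     * zeroed, and the compiler is free to treat each field as an
     * independent local variable. */
    struct dio_submit_demo sdio = { .blkbits = blkbits };
    long blocks;

    if (!dio)
        return -1;
    assert(sdio.pages_in_io == 0); /* implicitly zero-initialized */

    dio->rw = rw;
    sdio.block_in_file = offset >> blkbits;
    sdio.final_block_in_request = (offset + len) >> blkbits;
    dio->result = (long)(sdio.final_block_in_request - sdio.block_in_file);

    blocks = dio->result;
    free(dio);
    return blocks;
}
```

Only the heap object would need to outlive the submission path; everything in the stack structure disappears when the function returns.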
  2. 27 Jul, 2011 1 commit
  3. 21 Jul, 2011 4 commits
    • fs: move inode_dio_done to the end_io handler · 72c5052d
      Christoph Hellwig committed
      For filesystems that delay their end_io processing we should keep our
      i_dio_count until the processing is done.  Enable this by moving
      the inode_dio_done call to the end_io handler if one exists.  Note that
      the actual move to the workqueue for ext4 and XFS is not done in
      this patch yet, but left to the filesystem maintainers.  At least
      for XFS it's not needed yet either as XFS has an internal equivalent
      to i_dio_count.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: always maintain i_dio_count · df2d6f26
      Christoph Hellwig committed
      Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
      This allows these filesystems to also protect truncate against direct I/O requests
      by using common code.  Right now the only non-DIO_LOCKING filesystem that
      appears to do so is XFS, which uses an opencoded variant of the i_dio_count
      scheme.
      
      Behaviour doesn't change for filesystems never calling inode_dio_wait.
      For ext4 behaviour changes when using the dioread_nonlock option, which
      previously was missing any protection between truncate and direct I/O reads.
      For ocfs2 the handcrafted i_dio_count manipulations are replaced with
      the common code now enabled.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: kill i_alloc_sem · bd5fe6c5
      Christoph Hellwig committed
      i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
      be released by a non-owner, and its write side is always mirrored by
      real exclusion.  Its intended use is to wait for all pending direct I/O
      requests to finish before starting a truncate.
      
      Replace it with a hand-grown construct:
      
       - exclusion for truncates is already guaranteed by i_mutex, so it can
         simply fall away
       - the reader side is replaced by an i_dio_count member in struct inode
         that counts the number of pending direct I/O requests.  Truncate can't
         proceed as long as it's non-zero
       - when i_dio_count reaches zero we wake up a pending truncate using
         wake_up_bit on a new bit in i_flags
       - new references to i_dio_count can't appear while we are waiting for
         it to read zero because the direct I/O count always needs i_mutex
         (or an equivalent like XFS's i_iolock) for starting a new operation.
      
      This scheme is much simpler, and saves the space of a spinlock_t and a
      struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
      system).
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
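The counting scheme in the commit above can be modelled in user space. This is a single-threaded sketch under stated assumptions: `inode_demo` and its helpers are hypothetical, a plain counter of wake-ups stands in for `wake_up_bit`, and the `truncate_waiting` flag stands in for the new bit in `i_flags`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the relevant part of struct inode. */
struct inode_demo {
    atomic_int dio_count;   /* number of pending direct I/O requests */
    bool truncate_waiting;  /* stands in for the wait bit in i_flags */
    int wakeups;            /* counts simulated wake-up calls */
};

void inode_demo_init(struct inode_demo *inode)
{
    atomic_init(&inode->dio_count, 0);
    inode->truncate_waiting = false;
    inode->wakeups = 0;
}

/* Called when a direct I/O request starts (the caller is assumed to
 * hold the equivalent of i_mutex at this point). */
void dio_begin_demo(struct inode_demo *inode)
{
    atomic_fetch_add(&inode->dio_count, 1);
}

/* Called when a request completes; wakes a waiting truncate when the
 * last pending request goes away. */
void dio_done_demo(struct inode_demo *inode)
{
    int old = atomic_fetch_sub(&inode->dio_count, 1);

    assert(old > 0);
    if (old == 1 && inode->truncate_waiting)
        inode->wakeups++;
}

/* Truncate may only proceed once no direct I/O is pending. */
bool truncate_may_proceed_demo(struct inode_demo *inode)
{
    if (atomic_load(&inode->dio_count) != 0) {
        inode->truncate_waiting = true;
        return false;
    }
    inode->truncate_waiting = false;
    return true;
}
```

A real implementation would block in the truncate path instead of returning false; the sketch only shows when the wake-up fires.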
    • fs: simplify handling of zero sized reads in __blockdev_direct_IO · f9b5570d
      Christoph Hellwig committed
      Reject zero sized reads as soon as we know our I/O length, and don't
      bother with locks or allocations that might have to be cleaned up
      otherwise.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  4. 10 Mar, 2011 2 commits
    • block: kill off REQ_UNPLUG · 721a9602
      Jens Axboe committed
      With the plugging now being explicitly controlled by the
      submitter, callers need not pass down unplugging hints
      to the block layer. If they want to unplug, it's because they
      manually plugged on their own - in which case, they should just
      unplug at will.
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: remove per-queue plugging · 7eaceacc
      Jens Axboe committed
      Code has been converted over to the new explicit on-stack plugging,
      and delay users have been converted to use the new API for that.
      So let's kill off the old plugging along with aops->sync_page().
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  5. 21 Jan, 2011 1 commit
  6. 19 Jan, 2011 1 commit
  7. 27 Oct, 2010 1 commit
  8. 10 Sep, 2010 1 commit
    • O_DIRECT: fix the splitting up of contiguous I/O · 7a801ac6
      Jeff Moyer committed
      commit c2c6ca41 (direct-io: do not merge logically non-contiguous requests)
      introduced a bug whereby all O_DIRECT I/Os were submitted a page at a time
      to the block layer.  The problem is that the code expected
      dio->block_in_file to correspond to the current page in the dio.  In fact,
      it corresponds to the previous page submitted via submit_page_section.
      This was purely an oversight, as the dio->cur_page_fs_offset field was
      introduced for just this purpose.  This patch simply uses the correct
      variable when calculating whether there is a mismatch between contiguous
      logical blocks and contiguous physical blocks (as described in the
      comments).
      
      I also switched the if conditional following this check to an else if, to
      ensure that we never call dio_bio_submit twice for the same dio (in
      theory, this should not happen, anyway).
      
      I've tested this by running blktrace and verifying that a 64KB I/O was
      submitted as a single I/O.  I also ran the patched kernel through
      xfstests' aio tests using xfs, ext4 (with 1k and 4k block sizes) and btrfs
      and verified that there were no regressions as compared to an unpatched
      kernel.
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Josef Bacik <jbacik@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: <stable@kernel.org>		[2.6.35.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 10 Aug, 2010 1 commit
    • sort out blockdev_direct_IO variants · eafdc7d1
      Christoph Hellwig committed
      Move the call to vmtruncate to get rid of excessive blocks to the callers
      in preparation for the new truncate calling sequence.  This was only done
      for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
      was not needed anyway.  Get rid of blockdev_direct_IO_no_locking and
      its _newtrunc variant while at it, as just opencoding the two additional
      parameters is shorter than the name suffix.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  10. 27 Jul, 2010 2 commits
  11. 28 May, 2010 1 commit
    • fs: introduce new truncate sequence · 7bb46a67
      npiggin@suse.de committed
      Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
      setattr > vmtruncate > truncate, have filesystems call their truncate sequence
      from ->setattr if filesystem specific operations are required. vmtruncate is
      deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
      previously should be used.
      
      simple_setattr is introduced for simple in-ram filesystems to implement
      the new truncate sequence. Eventually all filesystems should be converted
      to implement a setattr, and the default code in notify_change should go
      away.
      
      simple_setsize is also introduced to perform just the ATTR_SIZE portion
      of simple_setattr (ie. changing i_size and trimming pagecache).
      
      To implement the new truncate sequence:
      - filesystem specific manipulations (eg freeing blocks) must be done in
        the setattr method rather than ->truncate.
      - vmtruncate can not be used by core code to trim blocks past i_size in
        the event of write failure after allocation, so this must be performed
        in the fs code.
      - convert usage of helpers block_write_begin, nobh_write_begin,
        cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
        variants. These avoid calling vmtruncate to trim blocks (see previous).
      - inode_setattr should not be used. generic_setattr is a new function
        to be used to copy simple attributes into the generic inode.
      - make use of the better opportunity to handle errors with the new sequence.
      
      Big problem with the previous calling sequence: the filesystem is not called
      until i_size has already changed.  This means it is not allowed to fail the
      call, and also it does not know what the previous i_size was. Also, generic
      code calling vmtruncate to truncate allocated blocks in case of error had
      no good way to return a meaningful error (or, for example, atomically handle
      block deallocation).
      
      Cc: Christoph Hellwig <hch@lst.de>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
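The advantage of the new sequence named in the commit above (the filesystem sees the request before i_size changes, may fail it, and still knows the old size) can be sketched as follows. All names here are hypothetical; `newsize_ok_demo` only loosely mirrors the role the message gives to inode_newsize_ok, and the block accounting is invented for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, reduced inode: just a size and an allocated-block count. */
struct inode_trunc_demo {
    long long i_size;
    long long blocks;
};

/* Reject sizes the (hypothetical) filesystem cannot represent. */
int newsize_ok_demo(long long newsize, long long max_size)
{
    if (newsize < 0)
        return -EINVAL;
    if (newsize > max_size)
        return -EFBIG;
    return 0;
}

/* A setattr-style size change: validate first, then update i_size,
 * then free blocks past the new end, using the old size if needed. */
int setattr_size_demo(struct inode_trunc_demo *inode, long long newsize,
                      long long max_size, long long block_size)
{
    long long oldsize = inode->i_size;
    int err;

    assert(block_size > 0);
    err = newsize_ok_demo(newsize, max_size);
    if (err)
        return err;            /* i_size has not been touched yet */

    inode->i_size = newsize;
    if (newsize < oldsize)     /* shrinking: drop blocks beyond the new end */
        inode->blocks = (newsize + block_size - 1) / block_size;
    return 0;
}
```

In the older sequence the size had already changed by the time the filesystem was called, so the early error return shown here was not available.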
  12. 25 May, 2010 2 commits
    • direct-io: do not merge logically non-contiguous requests · c2c6ca41
      Josef Bacik committed
      Btrfs cannot handle having logically non-contiguous requests submitted.  For
      example if you have
      
      Logical:  [0-4095][HOLE][8192-12287]
      Physical: [0-4095]      [4096-8191]
      
      Normally the DIO code would put these into the same BIO.  The problem is we
      need to know exactly what offset is associated with what BIO so we can do our
      checksumming and unlocking properly, so putting them in the same BIO doesn't
      work.  So add another check where we submit the current BIO if the physical
      blocks are not contiguous OR the logical blocks are not contiguous.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
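The check described in the commit above amounts to a two-part test before a page is appended to the BIO being built. The sketch below uses hypothetical names and 512-byte sectors; it is a model of the rule, not the code in fs/direct-io.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of the BIO currently being assembled. */
struct bio_state_demo {
    uint64_t next_sector;  /* first sector after the data already queued */
    uint64_t logical_end;  /* first file offset after the data already queued */
};

/* True if the current BIO has to be submitted before the next page can
 * be queued: either the physical or the logical range is not contiguous. */
bool must_submit_first_demo(const struct bio_state_demo *bio,
                            uint64_t page_sector, uint64_t page_file_offset)
{
    assert(bio != NULL);
    return bio->next_sector != page_sector ||
           bio->logical_end != page_file_offset;
}
```

With the layout from the commit message, the first 4096 bytes leave the BIO at sector 8 and file offset 4096. The next extent starts at sector 8 but file offset 8192, so the physical check alone would merge it, while the added logical check forces a submit.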
    • direct-io: add a hook for the fs to provide its own submit_bio function · facd07b0
      Josef Bacik committed
      Because BTRFS can do RAID and such, we need our own submit hook so we can set up
      the bios in the correct fashion, and handle checksum errors properly.  So there
      are a few changes here:
      
      1) The submit_io hook.  This is straightforward, just call this instead of
      submit_bio.
      
      2) Allow the fs to return -ENOTBLK for reads.  Usually this has only worked for
      writes, since writes can fall back onto buffered IO.  But BTRFS needs the option
      of falling back on buffered IO if it encounters a compressed extent, since we
      need to read the entire extent in and decompress it.  So if we get -ENOTBLK back
      from get_block we'll return and fall back on buffered IO just like the write
      case.
      
      I've tested these changes with fsx and everything seems to work.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
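The read fallback in point 2 of the commit above can be sketched as a small decision function. The names are hypothetical and the "compressed extent" condition is reduced to a boolean; only the use of `-ENOTBLK` as the fallback signal comes from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum io_path_demo { PATH_DIRECT, PATH_BUFFERED, PATH_FAILED };

/* Hypothetical block-mapping callback: refuses direct I/O on a
 * compressed extent by returning -ENOTBLK. */
int get_block_demo(bool extent_is_compressed)
{
    return extent_is_compressed ? -ENOTBLK : 0;
}

/* Decide how a read is served: -ENOTBLK means "retry via the page
 * cache", any other error is a genuine failure. */
enum io_path_demo read_path_demo(bool extent_is_compressed)
{
    int ret = get_block_demo(extent_is_compressed);

    assert(ret <= 0); /* 0 on success, negative errno otherwise */
    if (ret == -ENOTBLK)
        return PATH_BUFFERED;
    return ret < 0 ? PATH_FAILED : PATH_DIRECT;
}
```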
  13. 17 Dec, 2009 2 commits
    • dio: fix use-after-free · 06777d30
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • cleanup blockdev_direct_IO locking · 1e431f5c
      Christoph Hellwig committed
      Currently the locking in blockdev_direct_IO is a mess, we have three different
      locking types and very confusing checks for some of them.  The most
      complicated one is DIO_OWN_LOCKING for reads, which happens to not actually be
      used.
      
      This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read case
      is unused anyway, and the write side is almost identical to DIO_NO_LOCKING.
      The difference is that DIO_NO_LOCKING always sets the create argument for
      the get_blocks callback to zero, but we can easily move that to the actual
      get_blocks callbacks.  There are four users of the DIO_NO_LOCKING mode:
      gfs already ignores the create argument and thus is fine with the new
      version, ocfs2 only errors out if create were ever set, and we can remove
      this dead code now, the block device code only ever uses create for an
      error message if we are fully beyond the device which can never happen,
      and last but not least XFS will need the new behaviour for writes.
      
      Now we can replace the lock_type variable with a flags one, where no flag
      means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
      flag.  Separate out the check for not allowing to fill holes into a separate
      flag, although for now both flags always get set at the same time.
      
      Also revamp the documentation of the locking scheme to actually make sense.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
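One plausible reading of the flags scheme in the commit above is sketched below. The flag names carry a DEMO_ prefix because the message only names DIO_LOCKING; the name of the hole flag and the exact rule for the `create` argument are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    DEMO_DIO_LOCKING    = 0x01, /* core code takes the inode lock */
    DEMO_DIO_SKIP_HOLES = 0x02, /* hypothetical name: do not fill holes */
};

/* No flag set corresponds to the old DIO_NO_LOCKING behaviour. */
bool takes_lock_demo(unsigned flags)
{
    return (flags & DEMO_DIO_LOCKING) != 0;
}

/* Value handed to the get_blocks callback as "create".  Reads never
 * allocate; writes allocate unless the hole flag forbids filling a
 * block that lies inside the current file size. */
bool create_arg_demo(unsigned flags, bool is_write,
                     uint64_t block, uint64_t blocks_in_file)
{
    assert(!(flags & ~(unsigned)(DEMO_DIO_LOCKING | DEMO_DIO_SKIP_HOLES)));
    if (!is_write)
        return false;
    if ((flags & DEMO_DIO_SKIP_HOLES) && block < blocks_in_file)
        return false;
    return true;
}
```

Callbacks that must never allocate would ignore or reject `create` themselves, which is how the message describes moving that decision into the get_blocks callbacks.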
  14. 16 Dec, 2009 2 commits
    • direct-io: cleanup blockdev_direct_IO locking · 5fe878ae
      Christoph Hellwig committed
      Currently the locking in blockdev_direct_IO is a mess, we have three
      different locking types and very confusing checks for some of them.  The
      most complicated one is DIO_OWN_LOCKING for reads, which happens to not
      actually be used.
      
      This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read
      case is unused anyway, and the write side is almost identical to
      DIO_NO_LOCKING.  The difference is that DIO_NO_LOCKING always sets the
      create argument for the get_blocks callback to zero, but we can easily
      move that to the actual get_blocks callbacks.  There are four users of the
      DIO_NO_LOCKING mode: gfs already ignores the create argument and thus is
      fine with the new version, ocfs2 only errors out if create were ever set,
      and we can remove this dead code now, the block device code only ever uses
      create for an error message if we are fully beyond the device which can
      never happen, and last but not least XFS will need the new behaviour for
      writes.
      
      Now we can replace the lock_type variable with a flags one, where no flag
      means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
      flag.  Separate out the check for not allowing to fill holes into a
      separate flag, although for now both flags always get set at the same
      time.
      
      Also revamp the documentation of the locking scheme to actually make
      sense.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dio: don't zero out the pages array inside struct dio · 23aee091
      Jeff Moyer committed
      Intel reported a performance regression caused by the following commit:
      
      commit 848c4dd5
      Author: Zach Brown <zach.brown@oracle.com>
      Date:   Mon Aug 20 17:12:01 2007 -0700
      
          dio: zero struct dio with kzalloc instead of manually
      
          This patch uses kzalloc to zero all of struct dio rather than
          manually trying to track which fields we rely on being zero.  It
          passed aio+dio stress testing and some bug regression testing on
          ext3.
      
          This patch was introduced by Linus in the conversation that led up
          to Badari's minimal fix to manually zero .map_bh.b_state in commit:
      
            6a648fa7
      
          It makes the code a bit smaller.  Maybe a couple fewer cachelines to
          load, if we're lucky:
      
             text    data     bss     dec     hex filename
          3285925  568506 1304616 5159047  4eb887 vmlinux
          3285797  568506 1304616 5158919  4eb807 vmlinux.patched
      
          I was unable to measure a stable difference in the number of cpu
          cycles spent in blockdev_direct_IO() when pushing aio+dio 256K reads
          at ~340MB/s.
      
          So the resulting intent of the patch isn't a performance gain but to
          avoid exposing ourselves to the risk of finding another field like
          .map_bh.b_state where we rely on zeroing but don't enforce it in the
          code.
      
      Zach surmised that zeroing out the page array was what caused most of
      the problem, and suggested the approach taken in the attached patch for
      resolving the issue.  Intel re-tested with this patch and saw a 0.6%
      performance gain (the original regression was 0.5%).
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
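The idea in the commit above (zero the fields that rely on it, but skip the large page array) can be sketched with `offsetof`. The structure below is a hypothetical reduction; it assumes the array is the last member so that one `memset` covers everything in front of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_DIO_PAGES 64

/* Hypothetical reduced struct dio: small bookkeeping fields first,
 * the large page-pointer array last. */
struct dio_pages_demo {
    int rw;
    unsigned head;  /* next page to consume */
    unsigned tail;  /* one past the last valid page */
    void *pages[DEMO_DIO_PAGES];
};

/* Allocate and zero only the fields in front of the array.  Entries of
 * pages[] are written before they are read (head and tail track the
 * valid range), so they do not need to be cleared. */
struct dio_pages_demo *dio_pages_alloc_demo(void)
{
    struct dio_pages_demo *dio = malloc(sizeof(*dio));

    if (dio)
        memset(dio, 0, offsetof(struct dio_pages_demo, pages));
    return dio;
}
```

On a typical 64-bit build this clears 16 bytes instead of more than 500, which matches the commit's claim that zeroing the array was the expensive part.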
  15. 26 Nov, 2009 1 commit
    • Fix regression in direct writes performance due to WRITE_ODIRECT flag removal · d9449ce3
      Vivek Goyal committed
      There seems to be a regression in the direct write path due to the following
      commit in the for-2.6.33 branch of the block tree.
      
      commit 1af60fbd
      Author: Jeff Moyer <jmoyer@redhat.com>
      Date:   Fri Oct 2 18:56:53 2009 -0400
      
          block: get rid of the WRITE_ODIRECT flag
      
      Marking direct writes as WRITE_SYNC_PLUG instead of WRITE_ODIRECT sets
      the NOIDLE flag in the bio and hence in the request. This tells CFQ not to
      expect more requests from the queue and not to idle on it (despite the fact
      that the queue's think time is low and it is not seeky).
      
      So direct writers lose big time when competing with sequential readers.
      
      Using fio, I have run one direct writer and two sequential readers and
      following are the results with 2.6.32-rc7 kernel and with for-2.6.33
      branch.
      
      Test
      ====
      1 direct writer and 2 sequential readers running simultaneously.
      
      [global]
      directory=/mnt/sdc/fio/
      runtime=10
      
      [seqwrite]
      rw=write
      size=4G
      direct=1
      
      [seqread]
      rw=read
      size=2G
      numjobs=2
      
      2.6.32-rc7
      ==========
      direct writes: aggrb=2,968KB/s
      readers      : aggrb=101MB/s
      
      for-2.6.33 branch
      =================
      direct writes: aggrb=19KB/s
      readers      : aggrb=137MB/s
      
      This patch brings back the WRITE_ODIRECT flag, with the difference that we
      don't set the BIO_RW_UNPLUG flag so that device is not unplugged after
      submission of request and an explicit unplug from submitter is required.
      
      That way we fix Jeff's issue of not enough merging taking place in the aio
      path as well as make sure direct writes get their fair share.
      
      After the fix
      =============
      for-2.6.33 + fix
      ----------------
      direct writes: aggrb=2,728KB/s
      readers      : aggrb=103MB/s
      
      Thanks
      Vivek
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  16. 28 Oct, 2009 2 commits
    • aio: implement request batching · cfb1e33e
      Jeff Moyer committed
      Hi,
      
      Some workloads issue batches of small I/O, and the performance is poor
      due to the call to blk_run_address_space for every single iocb.  Nathan
      Roberts pointed this out, and suggested that by deferring this call
      until all I/Os in the iocb array are submitted to the block layer, we
      can realize some impressive performance gains (up to 30% for sequential
      4k reads in batches of 16).
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: get rid of the WRITE_ODIRECT flag · 1af60fbd
      Jeff Moyer committed
      Hi,
      
      The WRITE_ODIRECT flag is only used in one place, and that code path
      happens to also call blk_run_address_space.  The introduction of this
      flag, then, could result in the device being unplugged twice for every
      I/O.
      
      Further, with the batching changes in the next patch, we don't want an
      O_DIRECT write to imply a queue unplug.
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  17. 23 May, 2009 1 commit
  18. 15 Apr, 2009 1 commit
  19. 06 Apr, 2009 1 commit
  20. 07 Jan, 2009 1 commit
    • fs: truncate blocks outside i_size after O_DIRECT write error · 0f64415d
      Dmitri Monakhov committed
      In case of an error, an extending write may have instantiated a few blocks
      outside i_size.  We need to trim these blocks.  We have to do it
      *regardless* of blocksize.  At least ext2, ext3 and reiserfs interpret
      the (i_size < biggest block) condition as an error.  Fsck will complain about
      the wrong i_size.  Then fsck will fix the error by changing i_size according
      to the biggest block.  This is bad because these blocks contain garbage
      from the previous write attempt, and the result is data corruption.
      
      ####TESTCASE_BEGIN
      $touch /mnt/test/BIG_FILE
      ## at this moment /mnt/test/BIG_FILE size and blocks equal to zero
      open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
      write(3, "aaaaaaaaaaaa"..., 104857600) = -1 ENOSPC (No space left on device)
      ## size and blocks shouldn't be changed because the write op failed.
      $stat /mnt/test/BIG_FILE
      File: `/mnt/test/BIG_FILE'
      Size: 0 Blocks: 110896 IO Block: 1024 regular empty file
      <<<<<<<<^^^^^^^^^^^^^^^^^^^^^^^^^^^^^file size is less than biggest block idx
      Device: fe07h/65031d Inode: 14 Links: 1
      Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
      Access: 2007-01-24 20:03:38.000000000 +0300
      Modify: 2007-01-24 20:03:38.000000000 +0300
      Change: 2007-01-24 20:03:39.000000000 +0300
      
      #fsck.ext3 -f /dev/VG/test
      e2fsck 1.39 (29-May-2006)
      Pass 1: Checking inodes, blocks, and sizes
      Inode 14, i_size is 0, should be 56556544. Fix<y>? yes
      Pass 2: Checking directory structure
      ....
      #####TESTCASE_END
      
      [akpm@linux-foundation.org: use i_size_read()]
      Signed-off-by: Dmitri Monakhov <dmonakhov@openvz.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 17 Oct, 2008 1 commit
  22. 27 Jul, 2008 1 commit
  23. 06 Feb, 2008 1 commit
    • Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user · eebd2aa3
      Christoph Lameter committed
      Simplify page cache zeroing of segments of pages through 3 functions
      
      zero_user_segments(page, start1, end1, start2, end2)
      
              Zeros two segments of the page. It takes the position where to
              start and end the zeroing which avoids length calculations and
              makes code clearer.
      
      zero_user_segment(page, start, end)
      
              Same for a single segment.
      
      zero_user(page, start, length)
      
              Length variant for the case where we know the length.
      
      We remove the zero_user_page macro. Issues:
      
      1. It's a macro. Inline functions are preferable.
      
      2. The KM_USER0 macro is only defined for HIGHMEM.
      
         Having to treat this special case everywhere makes the
         code needlessly complex. The parameter for zeroing is always
         KM_USER0 except in one single case that we open code.
      
      Avoiding KM_USER0 means a lot of code no longer has to deal
      with the special casing for HIGHMEM. Dealing with
      kmap is only necessary for HIGHMEM configurations. In those
      configurations we use KM_USER0 like we do for a series of other
      functions defined in highmem.h.
      
      Since KM_USER0 depends on HIGHMEM, the existing zero_user_page
      could not be an inline function. The zero_user_* functions introduced
      here can be inline because that constant is not used when these
      functions are called.
      
      Also extract the flushing of the caches to be outside of the kmap.
      
      [akpm@linux-foundation.org: fix nfs and ntfs build]
      [akpm@linux-foundation.org: fix ntfs build some more]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Cc: David Chinner <dgc@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
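The relationship between the three helpers in the commit above can be shown with a user-space analogue. The `_demo` functions below take a plain byte buffer instead of a struct page and leave out the kmap and cache-flush steps; the page size constant is an assumption for the sketch.

```c
#include <assert.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u

/* Zero the byte ranges [start1, end1) and [start2, end2) of a page.
 * Taking start/end positions avoids length arithmetic at call sites. */
void zero_user_segments_demo(unsigned char *page,
                             unsigned start1, unsigned end1,
                             unsigned start2, unsigned end2)
{
    assert(end1 <= DEMO_PAGE_SIZE && end2 <= DEMO_PAGE_SIZE);
    if (end1 > start1)
        memset(page + start1, 0, end1 - start1);
    if (end2 > start2)
        memset(page + start2, 0, end2 - start2);
}

/* Same for a single segment. */
void zero_user_segment_demo(unsigned char *page, unsigned start, unsigned end)
{
    zero_user_segments_demo(page, start, end, 0, 0);
}

/* Length variant for callers that already know the length. */
void zero_user_demo(unsigned char *page, unsigned start, unsigned length)
{
    zero_user_segment_demo(page, start, start + length);
}
```

Building the two simpler helpers on top of the two-segment one keeps a single place where the zeroing happens.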
  24. 17 Oct, 2007 1 commit
    • remove ZERO_PAGE · 557ed1fa
      Nick Piggin committed
      The commit b5810039 contains the note
      
        A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
        (and thus mapcounted and count towards shared rss).  These writes to
        the struct page could cause excessive cacheline bouncing on big
        systems.  There are a number of ways this could be addressed if it is
        an issue.
      
      And indeed this cacheline bouncing has shown up on large SGI systems.
      There was a situation where an Altix system was essentially livelocked
      tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
      This situation can be avoided in userspace, but it does highlight the
      potential scalability problem with refcounting ZERO_PAGE, and corner
      cases where it can really hurt (we don't want the system to livelock!).
      
      There are several broad ways to fix this problem:
      1. add back some special casing to avoid refcounting ZERO_PAGE
      2. per-node or per-cpu ZERO_PAGES
      3. remove the ZERO_PAGE completely
      
      I will argue for 3. The others should also fix the problem, but they
      result in more complex code than does 3, with little or no real benefit
      that I can see.
      
      Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
      false optimisation: if an application is performance critical, it would
      not be doing many read faults of new memory, or at least it could be
      expected to write to that memory soon afterwards. If cache or memory use
      is critical, it should not be working with a significant number of
      ZERO_PAGEs anyway (a more compact representation of zeroes should be
      used).
      
      As a sanity check -- measuring on my desktop system, there are never many
      mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
      increase much without it.
      
      When running a make -j4 kernel compile on my dual core system, there are
      about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
      ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
      is torn down without being COWed). So removing ZERO_PAGE will save 1,000
      page faults per second when running kbuild, while keeping it only saves
      less than 1 page clearing operation per second. 1 page clear is cheaper
      than a thousand faults, presumably, so there isn't an obvious loss.
      
      Neither the logical argument nor these basic tests give a guarantee of no
      regressions. However, this is a reasonable opportunity to try to remove
      the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
      we can reintroduce it and just avoid refcounting it.
      
      The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
      much use to them except on benchmarks.  All other users of ZERO_PAGE are
      converted just to use ZERO_PAGE(0) for simplicity. We can look at
      replacing them all and maybe ripping out ZERO_PAGE completely when we are
      more satisfied with this solution.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
  25. 10 Oct, 2007 1 commit
  26. 21 Aug, 2007 1 commit
    • dio: zero struct dio with kzalloc instead of manually · 848c4dd5
      Zach Brown committed
      This patch uses kzalloc to zero all of struct dio rather than manually
      trying to track which fields we rely on being zero.  It passed aio+dio
      stress testing and some bug regression testing on ext3.
      
      This patch was introduced by Linus in the conversation that led up to
      Badari's minimal fix to manually zero .map_bh.b_state in commit:
      
        6a648fa7
      
      It makes the code a bit smaller.  Maybe a couple fewer cachelines to
      load, if we're lucky:
      
         text    data     bss     dec     hex filename
      3285925  568506 1304616 5159047  4eb887 vmlinux
      3285797  568506 1304616 5158919  4eb807 vmlinux.patched
      
      I was unable to measure a stable difference in the number of cpu cycles
      spent in blockdev_direct_IO() when pushing aio+dio 256K reads at
      ~340MB/s.
      
      So the resulting intent of the patch isn't a performance gain but to
      avoid exposing ourselves to the risk of finding another field like
      .map_bh.b_state where we rely on zeroing but don't enforce it in the
      code.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  27. 12 Aug, 2007 1 commit
  28. 04 Jul, 2007 1 commit
    • dio: remove bogus refcounting BUG_ON · fcb82f88
      Zach Brown committed
      Badari Pulavarty reported a case of this BUG_ON triggering during
      testing.  It's completely bogus and should be removed.
      
      It's trying to notice if we left references to the dio hanging around in
      the sync case.  They should have been dropped as IO completed while this
      path was in dio_await_completion().  This condition will also be
      checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a few
      lines lower.  So, to start, this BUG_ON() is redundant.
      
      More fatally, it's dereferencing dio-> after having dropped its
      reference.  It's only safe to dereference the dio after releasing the
      lock if the final reference was just dropped.  Another CPU might free
      the dio in bio completion and reuse the memory after this path drops the
      dio lock but before the BUG_ON() is evaluated.
      
      This patch passed aio+dio regression unit tests and aio-stress on ext3.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fcb82f88
  29. 10 May, 2007 1 commit
    • N
      fs: convert core functions to zero_user_page · 01f2705d
      Nate Diller committed
      It's very common for file systems to need to zero part or all of a page;
      the simplest way is just to use kmap_atomic() and memset().  There's
      actually a library function in include/linux/highmem.h that does exactly
      that, but it's confusingly named memclear_highpage_flush(), which is
      descriptive of *how* it does the work rather than what the *purpose* is.
      So this patchset renames the function to zero_user_page(), and calls it
      from the various places that currently open code it.
      
      This first patch introduces the new function call, and converts all the
      core kernel callsites, both the open-coded ones and the old
      memclear_highpage_flush() ones.  Following this patch is a series of
      conversions for each file system individually, per AKPM, and finally a
      patch deprecating the old call.  The diffstat below shows the entire
      patchset.
      
      [akpm@linux-foundation.org: fix a few things]
      Signed-off-by: Nate Diller <nate.diller@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01f2705d
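The idea of naming a helper after its purpose ("zero part of a page") rather than its mechanism can be sketched in user-space C. This is an assumption-laden illustration: `demo_zero_page_range()` and `DEMO_PAGE_SIZE` are hypothetical names, and unlike the kernel helper the sketch does no page mapping or cache flushing; it only shows the bounds-checked memset at the core of the operation.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u   /* assumed page size for this sketch */

/*
 * Zero `len` bytes starting at `offset` inside one page-sized buffer.
 * Returns 0 on success, or -1 if the range would run past the page.
 */
int demo_zero_page_range(unsigned char *page, size_t offset, size_t len)
{
    /* Written to avoid overflow in offset + len. */
    if (offset > DEMO_PAGE_SIZE || len > DEMO_PAGE_SIZE - offset)
        return -1;
    memset(page + offset, 0, len);
    return 0;
}
```

A caller that previously open-coded the mapping and the memset would make a single call such as `demo_zero_page_range(buf, 100, 100)`, which is the kind of simplification the patchset is after.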
  30. 09 May, 2007 1 commit
  31. 11 Dec, 2006 1 commit
    • Z
      [PATCH] dio: lock refcount operations · 5eb6c7a2
      Zach Brown committed
      The wait_for_more_bios() function name was poorly chosen.  While looking to
      clean it up I noticed that the dio struct refcounting between the bio
      completion and dio submission paths was racy.
      
      The bio submission path was simply freeing the dio struct if
      atomic_dec_and_test() indicated that it dropped the final reference.
      
      The aio bio completion path was dereferencing its dio struct pointer *after
      dropping its reference* based on the remaining number of references.
      
      These two paths could race and result in the aio bio completion path
      dereferencing a freed dio, though this was not observed in the wild.
      
      This moves the refcount under the bio lock so that bio completion can drop
      its reference and decide to wake all in one atomic step.
      
      Once testing and waking is locked dio_await_one() can test its sleeping
      condition and mark itself uninterruptible under the lock.  It gets simpler
      and wait_for_more_bios() disappears.
      
      The addition of the interrupt masking spin lock acquisition in dio_bio_submit()
      looks alarming.  This lock acquisition existed in that path before the recent
      dio completion patch set.  We shouldn't expect significant performance
      regression from returning to the behaviour that existed before the
      completion clean up work.
      
      This passed 4k block ext3 O_DIRECT fsx and aio-stress on an SMP machine.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: <xfs-masters@oss.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      5eb6c7a2
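The fix described above, dropping a reference and deciding whether to wake the waiter in one locked step, can be sketched in user-space C. Everything here is a hypothetical stand-in: `struct demo_dio`, `demo_dio_put()`, and the wake condition (one reference left and a waiter registered) are assumptions made for illustration, and a pthread mutex takes the place of the kernel spin lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct dio; names are illustrative only. */
struct demo_dio {
    pthread_mutex_t lock;    /* plays the role of the bio lock */
    int refcount;
    bool waiter_sleeping;    /* set by the submitter before it sleeps */
};

/* Everything the caller may still look at after dropping its reference. */
struct demo_put_result {
    int remaining;           /* references left after this drop */
    bool should_wake;        /* a sleeping submitter needs a wakeup */
};

struct demo_put_result demo_dio_put(struct demo_dio *dio)
{
    struct demo_put_result r;

    pthread_mutex_lock(&dio->lock);
    assert(dio->refcount > 0);
    /* Decrement and decide under the same lock, so no other path can
     * change the count between the two steps. */
    r.remaining = --dio->refcount;
    r.should_wake = (r.remaining == 1) && dio->waiter_sleeping;
    pthread_mutex_unlock(&dio->lock);

    /* Past this point only the local copy in r is used.  Unless
     * r.remaining is 0, another thread may free dio at any time. */
    return r;
}
```

Returning a snapshot taken under the lock is the design point: the caller never has to dereference `dio` again to find out what it should do, which is the dereference-after-drop hazard the commit message describes.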