1. 10 9月, 2013 1 次提交
  2. 04 9月, 2013 2 次提交
    • C
      direct-io: Handle O_(D)SYNC AIO · 02afc27f
      Christoph Hellwig 提交于
      Call generic_write_sync() from the deferred I/O completion handler if
      O_DSYNC is set for a write request.  Also make sure various callers
      don't call generic_write_sync if the direct I/O code returns
      -EIOCBQUEUED.
      
      Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from
      Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      02afc27f
    • C
      direct-io: Implement generic deferred AIO completions · 7b7a8665
      Christoph Hellwig 提交于
      Add support to the core direct-io code to defer AIO completions to user
      context using a workqueue.  This replaces opencoded and less efficient
      code in XFS and ext4 (we save a memory allocation for each direct IO)
      and will be needed to properly support O_(D)SYNC for AIO.
      
      The communication between the filesystem and the direct I/O code requires
      a new buffer head flag, which is a bit ugly but not avoidable until the
      direct I/O code stops abusing the buffer_head structure for communicating
      with the filesystems.
      
      Currently this creates a per-superblock unbound workqueue for these
      completions, which is taken from an earlier patch by Jan Kara.  I'm
      not really convinced about this use and would prefer a "normal" global
      workqueue with a high concurrency limit, but this needs further discussion.
      
      JK: Fixed ext4 part, dynamic allocation of the workqueue.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7b7a8665
  3. 08 5月, 2013 1 次提交
  4. 30 4月, 2013 2 次提交
  5. 24 3月, 2013 1 次提交
    • K
      block: Convert some code to bio_for_each_segment_all() · cb34e057
      Kent Overstreet 提交于
      More prep work for immutable bvecs:
      
      A few places in the code were either open coding or using the wrong
      version - fix.
      
      After we introduce the bvec iter, it'll no longer be possible to modify
      the biovec through bio_for_each_segment_all() - it doesn't increment a
      pointer to the current bvec, you pass in a struct bio_vec (not a
      pointer) which is updated with what the current biovec would be (taking
      into account bi_bvec_done and bi_size).
      
      So because of that it's more worthwhile to be consistent about
      bio_for_each_segment()/bio_for_each_segment_all() usage.
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Alexander Viro <viro@zeniv.linux.org.uk>
      cb34e057
  6. 23 2月, 2013 1 次提交
  7. 30 11月, 2012 1 次提交
    • L
      direct-io: don't read inode->i_blkbits multiple times · ab73857e
      Linus Torvalds 提交于
      Since directio can work on a raw block device, and the block size of the
      device can change under it, we need to do the same thing that
      fs/buffer.c now does: read the block size a single time, using
      ACCESS_ONCE().
      
      Reading it multiple times can get different results, which will then
      confuse the code because it actually encodes the i_blksize in
      relationship to the underlying logical blocksize.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab73857e
  8. 09 8月, 2012 1 次提交
  9. 14 7月, 2012 1 次提交
  10. 01 6月, 2012 1 次提交
  11. 24 2月, 2012 1 次提交
    • A
      Restore direct_io / truncate locking API · 37fbf4bf
      Anton Altaparmakov 提交于
      With kernel 3.1, Christoph removed i_alloc_sem and replaced it with
      calls (namely inode_dio_wait() and inode_dio_done()) which are
      EXPORT_SYMBOL_GPL() thus they cannot be used by non-GPL file systems and
      further inode_dio_wait() was pushed from notify_change() into the file
      system ->setattr() method but no non-GPL file system can make this call.
      
      That means non-GPL file systems cannot exist any more unless they do not
      use any VFS functionality related to reading/writing as far as I can
      tell or at least as long as they want to implement direct i/o.
      
      Both Linus and Al (and others) have said on LKML that this breakage of
      the VFS API should not have happened and that the change was simply
      missed as it was not documented in the change logs of the patches that
      did those changes.
      
      This patch changes the two function exports in question to be
      EXPORT_SYMBOL() thus restoring the VFS API as it used to be - accessible
      for all modules.
      
      Christoph, who introduced the two functions and exported them GPL-only
      is CC-ed on this patch to give him the opportunity to object to the
      symbols being changed in this manner if he did indeed intend them to be
      GPL-only and does not want them to become available to all modules.
      Signed-off-by: NAnton Altaparmakov <anton@tuxera.com>
      CC: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      37fbf4bf
  12. 13 1月, 2012 2 次提交
    • A
      dio: optimize cache misses in the submission path · 65dd2aa9
      Andi Kleen 提交于
      Some investigation of a transaction processing workload showed that a
      major consumer of cycles in __blockdev_direct_IO is the cache miss while
      accessing the block size.  This is because it has to walk the chain from
      block_dev to gendisk to queue.
      
      The block size is needed early on to check alignment and sizes.  It's only
      done if the check for the inode block size fails.  But the costly block
      device state is unconditionally fetched.
      
      - Reorganize the code to only fetch block dev state when actually
        needed.
      
      Then do a prefetch on the block dev early on in the direct IO path.  This
      is worth it, because there is substantial code run before we actually
      touch the block dev now.
      
      - I also added some unlikelies to make it clear the compiler that block
        device fetch code is not normally executed.
      
      This gave a small, but measurable improvement on a large database
      benchmark (about 0.3%)
      
      [akpm@linux-foundation.org: coding-style fixes]
      [sfr@canb.auug.org.au: using prefetch requires including prefetch.h]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65dd2aa9
    • T
      fs/direct-io.c: calculate fs_count correctly in get_more_blocks() · ae55e1aa
      Tao Ma 提交于
      In get_more_blocks(), we use dio_count to calcuate fs_count and do some
      tricky things to increase fs_count if dio_count isn't aligned.  But
      actually it still has some corner cases that can't be coverd.  See the
      following example:
      
      	dio_write foo -s 1024 -w 4096
      
      (direct write 4096 bytes at offset 1024).  The same goes if the offset
      isn't aligned to fs_blocksize.
      
      In this case, the old calculation counts fs_count to be 1, but actually we
      will write into 2 different blocks (if fs_blocksize=4096).  The old code
      just works, since it will call get_block twice (and may have to allocate
      and create extents twice for filesystems like ext4).  So we'd better call
      get_block just once with the proper fs_count.
      Signed-off-by: NTao Ma <boyu.mt@taobao.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae55e1aa
  13. 28 10月, 2011 7 次提交
  14. 27 7月, 2011 1 次提交
  15. 21 7月, 2011 4 次提交
    • C
      fs: move inode_dio_done to the end_io handler · 72c5052d
      Christoph Hellwig 提交于
      For filesystems that delay their end_io processing we should keep our
      i_dio_count until the the processing is done.  Enable this by moving
      the inode_dio_done call to the end_io handler if one exist.  Note that
      the actual move to the workqueue for ext4 and XFS is not done in
      this patch yet, but left to the filesystem maintainers.  At least
      for XFS it's not needed yet either as XFS has an internal equivalent
      to i_dio_count.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      72c5052d
    • C
      fs: always maintain i_dio_count · df2d6f26
      Christoph Hellwig 提交于
      Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
      This these filesystems to also protect truncate against direct I/O requests
      by using common code.  Right now the only non-DIO_LOCKING filesystem that
      appears to do so is XFS, which uses an opencoded variant of the i_dio_count
      scheme.
      
      Behaviour doesn't change for filesystems never calling inode_dio_wait.
      For ext4 behaviour changes when using the dioread_nonlock option, which
      previously was missing any protection between truncate and direct I/O reads.
      For ocfs2 that handcrafted i_dio_count manipulations are replaced with
      the common code now enable.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      df2d6f26
    • C
      fs: kill i_alloc_sem · bd5fe6c5
      Christoph Hellwig 提交于
      i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
      be released by a non-owner, and it's write side is always mirrored by
      real exclusion.  It's intended use it to wait for all pending direct I/O
      requests to finish before starting a truncate.
      
      Replace it with a hand-grown construct:
      
       - exclusion for truncates is already guaranteed by i_mutex, so it can
         simply fall way
       - the reader side is replaced by an i_dio_count member in struct inode
         that counts the number of pending direct I/O requests.  Truncate can't
         proceed as long as it's non-zero
       - when i_dio_count reaches non-zero we wake up a pending truncate using
         wake_up_bit on a new bit in i_flags
       - new references to i_dio_count can't appear while we are waiting for
         it to read zero because the direct I/O count always needs i_mutex
         (or an equivalent like XFS's i_iolock) for starting a new operation.
      
      This scheme is much simpler, and saves the space of a spinlock_t and a
      struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
      system).
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      bd5fe6c5
    • C
      fs: simplify handling of zero sized reads in __blockdev_direct_IO · f9b5570d
      Christoph Hellwig 提交于
      Reject zero sized reads as soon as we know our I/O length, and don't
      borther with locks or allocations that might have to be cleaned up
      otherwise.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f9b5570d
  16. 10 3月, 2011 2 次提交
    • J
      block: kill off REQ_UNPLUG · 721a9602
      Jens Axboe 提交于
      With the plugging now being explicitly controlled by the
      submitter, callers need not pass down unplugging hints
      to the block layer. If they want to unplug, it's because they
      manually plugged on their own - in which case, they should just
      unplug at will.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      721a9602
    • J
      block: remove per-queue plugging · 7eaceacc
      Jens Axboe 提交于
      Code has been converted over to the new explicit on-stack plugging,
      and delay users have been converted to use the new API for that.
      So lets kill off the old plugging along with aops->sync_page().
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      7eaceacc
  17. 21 1月, 2011 1 次提交
  18. 19 1月, 2011 1 次提交
  19. 27 10月, 2010 1 次提交
  20. 10 9月, 2010 1 次提交
    • J
      O_DIRECT: fix the splitting up of contiguous I/O · 7a801ac6
      Jeff Moyer 提交于
      commit c2c6ca41 (direct-io: do not merge logically non-contiguous requests)
      introduced a bug whereby all O_DIRECT I/Os were submitted a page at a time
      to the block layer.  The problem is that the code expected
      dio->block_in_file to correspond to the current page in the dio.  In fact,
      it corresponds to the previous page submitted via submit_page_section.
      This was purely an oversight, as the dio->cur_page_fs_offset field was
      introduced for just this purpose.  This patch simply uses the correct
      variable when calculating whether there is a mismatch between contiguous
      logical blocks and contiguous physical blocks (as described in the
      comments).
      
      I also switched the if conditional following this check to an else if, to
      ensure that we never call dio_bio_submit twice for the same dio (in
      theory, this should not happen, anyway).
      
      I've tested this by running blktrace and verifying that a 64KB I/O was
      submitted as a single I/O.  I also ran the patched kernel through
      xfstests' aio tests using xfs, ext4 (with 1k and 4k block sizes) and btrfs
      and verified that there were no regressions as compared to an unpatched
      kernel.
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Acked-by: NJosef Bacik <jbacik@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: <stable@kernel.org>		[2.6.35.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7a801ac6
  21. 10 8月, 2010 1 次提交
    • C
      sort out blockdev_direct_IO variants · eafdc7d1
      Christoph Hellwig 提交于
      Move the call to vmtruncate to get rid of accessive blocks to the callers
      in prepearation of the new truncate calling sequence.  This was only done
      for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
      was not needed anyway.  Get rid of blockdev_direct_IO_no_locking and
      its _newtrunc variant while at it as just opencoding the two additional
      paramters is shorted than the name suffix.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      eafdc7d1
  22. 27 7月, 2010 2 次提交
  23. 28 5月, 2010 1 次提交
    • N
      fs: introduce new truncate sequence · 7bb46a67
      npiggin@suse.de 提交于
      Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
      setattr > vmtruncate > truncate, have filesystems call their truncate sequence
      from ->setattr if filesystem specific operations are required. vmtruncate is
      deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
      previously should be used.
      
      simple_setattr is introduced for simple in-ram filesystems to implement
      the new truncate sequence. Eventually all filesystems should be converted
      to implement a setattr, and the default code in notify_change should go
      away.
      
      simple_setsize is also introduced to perform just the ATTR_SIZE portion
      of simple_setattr (ie. changing i_size and trimming pagecache).
      
      To implement the new truncate sequence:
      - filesystem specific manipulations (eg freeing blocks) must be done in
        the setattr method rather than ->truncate.
      - vmtruncate can not be used by core code to trim blocks past i_size in
        the event of write failure after allocation, so this must be performed
        in the fs code.
      - convert usage of helpers block_write_begin, nobh_write_begin,
        cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
        variants. These avoid calling vmtruncate to trim blocks (see previous).
      - inode_setattr should not be used. generic_setattr is a new function
        to be used to copy simple attributes into the generic inode.
      - make use of the better opportunity to handle errors with the new sequence.
      
      Big problem with the previous calling sequence: the filesystem is not called
      until i_size has already changed.  This means it is not allowed to fail the
      call, and also it does not know what the previous i_size was. Also, generic
      code calling vmtruncate to truncate allocated blocks in case of error had
      no good way to return a meaningful error (or, for example, atomically handle
      block deallocation).
      
      Cc: Christoph Hellwig <hch@lst.de>
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7bb46a67
  24. 25 5月, 2010 2 次提交
    • J
      direct-io: do not merge logically non-contiguous requests · c2c6ca41
      Josef Bacik 提交于
      Btrfs cannot handle having logically non-contiguous requests submitted.  For
      example if you have
      
      Logical:  [0-4095][HOLE][8192-12287]
      Physical: [0-4095]      [4096-8191]
      
      Normally the DIO code would put these into the same BIO's.  The problem is we
      need to know exactly what offset is associated with what BIO so we can do our
      checksumming and unlocking properly, so putting them in the same BIO doesn't
      work.  So add another check where we submit the current BIO if the physical
      blocks are not contigous OR the logical blocks are not contiguous.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c2c6ca41
    • J
      direct-io: add a hook for the fs to provide its own submit_bio function · facd07b0
      Josef Bacik 提交于
      Because BTRFS can do RAID and such, we need our own submit hook so we can setup
      the bio's in the correct fashion, and handle checksum errors properly.  So there
      are a few changes here
      
      1) The submit_io hook.  This is straightforward, just call this instead of
      submit_bio.
      
      2) Allow the fs to return -ENOTBLK for reads.  Usually this has only worked for
      writes, since writes can fallback onto buffered IO.  But BTRFS needs the option
      of falling back on buffered IO if it encounters a compressed extent, since we
      need to read the entire extent in and decompress it.  So if we get -ENOTBLK back
      from get_block we'll return back and fallback on buffered just like the write
      case.
      
      I've tested these changes with fsx and everything seems to work.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      facd07b0
  25. 17 12月, 2009 1 次提交