1. 06 3月, 2012 1 次提交
    • J
      aio: wake up waiters when freeing unused kiocbs · 880641bb
      Jeff Moyer 提交于
      Bart Van Assche reported a hung fio process when either hot-removing
      storage or when interrupting the fio process itself.  The (pruned) call
      trace for the latter looks like so:
      
        fio             D 0000000000000001     0  6849   6848 0x00000004
         ffff880092541b88 0000000000000046 ffff880000000000 ffff88012fa11dc0
         ffff88012404be70 ffff880092541fd8 ffff880092541fd8 ffff880092541fd8
         ffff880128b894d0 ffff88012404be70 ffff880092541b88 000000018106f24d
        Call Trace:
          schedule+0x3f/0x60
          io_schedule+0x8f/0xd0
          wait_for_all_aios+0xc0/0x100
          exit_aio+0x55/0xc0
          mmput+0x2d/0x110
          exit_mm+0x10d/0x130
          do_exit+0x671/0x860
          do_group_exit+0x44/0xb0
          get_signal_to_deliver+0x218/0x5a0
          do_signal+0x65/0x700
          do_notify_resume+0x65/0x80
          int_signal+0x12/0x17
      
      The problem lies with the allocation batching code.  It will
      opportunistically allocate kiocbs, and then trim back the list of iocbs
      when there is not enough room in the completion ring to hold all of the
      events.
      
      In the case above, what happens is that the pruning back of events ends
      up freeing up the last active request and the context is marked as dead,
      so it is thus responsible for waking up waiters.  Unfortunately, the
      code does not check for this condition, so we end up with a hung task.
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Reported-by: NBart Van Assche <bvanassche@acm.org>
      Tested-by: NBart Van Assche <bvanassche@acm.org>
      Cc: <stable@kernel.org>		[3.2.x only]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      880641bb
  2. 14 1月, 2012 1 次提交
    • G
      Unused iocbs in a batch should not be accounted as active. · 69e4747e
      Gleb Natapov 提交于
      Since commit 080d676d ("aio: allocate kiocbs in batches") iocbs are
      allocated in a batch during processing of first iocbs.  All iocbs in a
      batch are automatically added to ctx->active_reqs list and accounted in
      ctx->reqs_active.
      
      If one (not the last one) of iocbs submitted by an user fails, further
      iocbs are not processed, but they are still present in ctx->active_reqs
      and accounted in ctx->reqs_active.  This causes process to stuck in a D
      state in wait_for_all_aios() on exit since ctx->reqs_active will never
      go down to zero.  Furthermore since kiocb_batch_free() frees iocb
      without removing it from active_reqs list the list become corrupted
      which may cause oops.
      
      Fix this by removing iocb from ctx->active_reqs and updating
      ctx->reqs_active in kiocb_batch_free().
      Signed-off-by: NGleb Natapov <gleb@redhat.com>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: stable@kernel.org   # 3.2
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69e4747e
  3. 03 11月, 2011 1 次提交
    • J
      aio: allocate kiocbs in batches · 080d676d
      Jeff Moyer 提交于
      In testing aio on a fast storage device, I found that the context lock
      takes up a fair amount of cpu time in the I/O submission path.  The reason
      is that we take it for every I/O submitted (see __aio_get_req).  Since we
      know how many I/Os are passed to io_submit, we can preallocate the kiocbs
      in batches, reducing the number of times we take and release the lock.
      
      In my testing, I was able to reduce the amount of time spent in
      _raw_spin_lock_irq by .56% (average of 3 runs).  The command I used to
      test this was:
      
         aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384 <dev>
      
      I also tested the patch with various numbers of events passed to
      io_submit, and I ran the xfstests aio group of tests to ensure I didn't
      break anything.
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Daniel Ehrenberg <dehrenberg@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      080d676d
  4. 01 11月, 2011 1 次提交
    • C
      Cross Memory Attach · fcf63409
      Christopher Yeoh 提交于
      The basic idea behind cross memory attach is to allow MPI programs doing
      intra-node communication to do a single copy of the message rather than a
      double copy of the message via shared memory.
      
      The following patch attempts to achieve this by allowing a destination
      process, given an address and size from a source process, to copy memory
      directly from the source process into its own address space via a system
      call.  There is also a symmetrical ability to copy from the current
      process's address space into a destination process's address space.
      
      - Use of /proc/pid/mem has been considered, but there are issues with
        using it:
        - Does not allow for specifying iovecs for both src and dest, assuming
          preadv or pwritev was implemented either the area read from or
        written to would need to be contiguous.
        - Currently mem_read allows only processes who are currently
        ptrace'ing the target and are still able to ptrace the target to read
        from the target. This check could possibly be moved to the open call,
        but its not clear exactly what race this restriction is stopping
        (reason  appears to have been lost)
        - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
        domain socket is a bit ugly from a userspace point of view,
        especially when you may have hundreds if not (eventually) thousands
        of processes  that all need to do this with each other
        - Doesn't allow for some future use of the interface we would like to
        consider adding in the future (see below)
        - Interestingly reading from /proc/pid/mem currently actually
        involves two copies! (But this could be fixed pretty easily)
      
      As mentioned previously use of vmsplice instead was considered, but has
      problems.  Since you need the reader and writer working co-operatively if
      the pipe is not drained then you block.  Which requires some wrapping to
      do non blocking on the send side or polling on the receive.  In all to all
      communication it requires ordering otherwise you can deadlock.  And in the
      example of many MPI tasks writing to one MPI task vmsplice serialises the
      copying.
      
      There are some cases of MPI collectives where even a single copy interface
      does not get us the performance gain we could.  For example in an
      MPI_Reduce rather than copy the data from the source we would like to
      instead use it directly in a mathops (say the reduce is doing a sum) as
      this would save us doing a copy.  We don't need to keep a copy of the data
      from the source.  I haven't implemented this, but I think this interface
      could in the future do all this through the use of the flags - eg could
      specify the math operation and type and the kernel rather than just
      copying the data would apply the specified operation between the source
      and destination and store it in the destination.
      
      Although we don't have a "second user" of the interface (though I've had
      some nibbles from people who may be interested in using it for intra
      process messaging which is not MPI).  This interface is something which
      hardware vendors are already doing for their custom drivers to implement
      fast local communication.  And so in addition to this being useful for
      OpenMPI it would mean the driver maintainers don't have to fix things up
      when the mm changes.
      
      There was some discussion about how much faster a true zero copy would
      go. Here's a link back to the email with some testing I did on that:
      
      http://marc.info/?l=linux-mm&m=130105930902915&w=2
      
      There is a basic man page for the proposed interface here:
      
      http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
      
      This has been implemented for x86 and powerpc, other architecture should
      mainly (I think) just need to add syscall numbers for the process_vm_readv
      and process_vm_writev. There are 32 bit compatibility versions for
      64-bit kernels.
      
      For arch maintainers there are some simple tests to be able to quickly
      verify that the syscalls are working correctly here:
      
      http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgzSigned-off-by: NChris Yeoh <yeohc@au1.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: <linux-man@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fcf63409
  5. 23 3月, 2011 1 次提交
    • R
      aio: wake all waiters when destroying ctx · e91f90bb
      Roland Dreier 提交于
      The test program below will hang because io_getevents() uses
      add_wait_queue_exclusive(), which means the wake_up() in io_destroy() only
      wakes up one of the threads.  Fix this by using wake_up_all() in the aio
      code paths where we want to make sure no one gets stuck.
      
      	// t.c -- compile with gcc -lpthread -laio t.c
      
      	#include <libaio.h>
      	#include <pthread.h>
      	#include <stdio.h>
      	#include <unistd.h>
      
      	static const int nthr = 2;
      
      	void *getev(void *ctx)
      	{
      		struct io_event ev;
      		io_getevents(ctx, 1, 1, &ev, NULL);
      		printf("io_getevents returned\n");
      		return NULL;
      	}
      
      	int main(int argc, char *argv[])
      	{
      		io_context_t ctx = 0;
      		pthread_t thread[nthr];
      		int i;
      
      		io_setup(1024, &ctx);
      
      		for (i = 0; i < nthr; ++i)
      			pthread_create(&thread[i], NULL, getev, ctx);
      
      		sleep(1);
      
      		io_destroy(ctx);
      
      		for (i = 0; i < nthr; ++i)
      			pthread_join(thread[i], NULL);
      
      		return 0;
      	}
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e91f90bb
  6. 10 3月, 2011 3 次提交
  7. 26 2月, 2011 2 次提交
  8. 27 1月, 2011 1 次提交
    • T
      fs/aio: aio_wq isn't used in memory reclaim path · d37adaa1
      Tejun Heo 提交于
      aio_wq isn't used during memory reclaim.  Convert to alloc_workqueue()
      without WQ_MEM_RECLAIM.  It's possible to use system_wq but given that
      the number of work items is determined from userland and the work item
      may block, enforcing strict concurrency limit would be a good idea.
      
      Also, move fput_work to system_wq so that aio_wq is used soley to
      throttle the max concurrency of aio work items and fput_work doesn't
      interact with other work items.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: linux-aio@kvack.org
      d37adaa1
  9. 17 1月, 2011 1 次提交
  10. 14 1月, 2011 2 次提交
  11. 26 10月, 2010 2 次提交
    • A
      new helper: ihold() · 7de9c6ee
      Al Viro 提交于
      Clones an existing reference to inode; caller must already hold one.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7de9c6ee
    • C
      aio: bump i_count instead of using igrab · 306fb097
      Chris Mason 提交于
      The aio batching code is using igrab to get an extra reference on the
      inode so it can safely batch.  igrab will go ahead and take the global
      inode spinlock, which can be a bottleneck on large machines doing lots
      of AIO.
      
      In this case, igrab isn't required because we already have a reference
      on the file handle.  It is safe to just bump the i_count directly
      on the inode.
      
      Benchmarking shows this patch brings IOP/s on tons of flash up by about
      2.5X.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      306fb097
  12. 23 9月, 2010 1 次提交
    • J
      aio: do not return ERESTARTSYS as a result of AIO · a0c42bac
      Jan Kara 提交于
      OCFS2 can return ERESTARTSYS from its write function when the process is
      signalled while waiting for a cluster lock (and the filesystem is mounted
      with intr mount option).  Generally, it seems reasonable to allow
      filesystems to return this error code from its IO functions.  As we must
      not leak ERESTARTSYS (and similar error codes) to userspace as a result of
      an AIO operation, we have to properly convert it to EINTR inside AIO code
      (restarting the syscall isn't really an option because other AIO could
      have been already submitted by the same io_submit syscall).
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0c42bac
  13. 15 9月, 2010 1 次提交
    • J
      aio: check for multiplication overflow in do_io_submit · 75e1c70f
      Jeff Moyer 提交于
      Tavis Ormandy pointed out that do_io_submit does not do proper bounds
      checking on the passed-in iocb array:
      
             if (unlikely(nr < 0))
                     return -EINVAL;
      
             if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(iocbpp)))))
                     return -EFAULT;                      ^^^^^^^^^^^^^^^^^^
      
      The attached patch checks for overflow, and if it is detected, the
      number of iocbs submitted is scaled down to a number that will fit in
      the long.  This is an ok thing to do, as sys_io_submit is documented as
      returning the number of iocbs submitted, so callers should handle a
      return value of less than the 'nr' argument passed in.
      Reported-by: NTavis Ormandy <taviso@cmpxchg8b.com>
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75e1c70f
  14. 06 8月, 2010 1 次提交
  15. 28 5月, 2010 2 次提交
    • A
      get rid of the magic around f_count in aio · d7065da0
      Al Viro 提交于
      __aio_put_req() plays sick games with file refcount.  What
      it wants is fput() from atomic context; it's almost always
      done with f_count > 1, so they only have to deal with delayed
      work in rare cases when their reference happens to be the
      last one.  Current code decrements f_count and if it hasn't
      hit 0, everything is fine.  Otherwise it keeps a pointer
      to struct file (with zero f_count!) around and has delayed
      work do __fput() on it.
      
      Better way to do it: use atomic_long_add_unless( , -1, 1)
      instead of !atomic_long_dec_and_test().  IOW, decrement it
      only if it's not the last reference, leave refcount alone
      if it was.  And use normal fput() in delayed work.
      
      I've made that atomic_long_add_unless call a new helper -
      fput_atomic().  Drops a reference to file if it's safe to
      do in atomic (i.e. if that's not the last one), tells if
      it had been able to do that.  aio.c converted to it, __fput()
      use is gone.  req->ki_file *always* contributes to refcount
      now.  And __fput() became static.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d7065da0
    • J
      aio: fix the compat vectored operations · 9d85cba7
      Jeff Moyer 提交于
      The aio compat code was not converting the struct iovecs from 32bit to
      64bit pointers, causing either EINVAL to be returned from io_getevents, or
      EFAULT as the result of the I/O.  This patch passes a compat flag to
      io_submit to signal that pointer conversion is necessary for a given iocb
      array.
      
      A variant of this was tested by Michael Tokarev.  I have also updated the
      libaio test harness to exercise this code path with good success.
      Further, I grabbed a copy of ltp and ran the
      testcases/kernel/syscall/readv and writev tests there (compiled with -m32
      on my 64bit system).  All seems happy, but extra eyes on this would be
      welcome.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix CONFIG_COMPAT=n build]
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Reported-by: NMichael Tokarev <mjt@tls.msk.ru>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: <stable@kernel.org>		[2.6.35.1]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d85cba7
  16. 16 12月, 2009 1 次提交
  17. 29 10月, 2009 1 次提交
  18. 28 10月, 2009 1 次提交
    • J
      aio: implement request batching · cfb1e33e
      Jeff Moyer 提交于
      Hi,
      
      Some workloads issue batches of small I/O, and the performance is poor
      due to the call to blk_run_address_space for every single iocb.  Nathan
      Roberts pointed this out, and suggested that by deferring this call
      until all I/Os in the iocb array are submitted to the block layer, we
      can realize some impressive performance gains (up to 30% for sequential
      4k reads in batches of 16).
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      cfb1e33e
  19. 23 9月, 2009 1 次提交
  20. 22 9月, 2009 1 次提交
  21. 01 7月, 2009 1 次提交
  22. 20 3月, 2009 2 次提交
  23. 14 1月, 2009 1 次提交
  24. 29 12月, 2008 1 次提交
    • J
      aio: make the lookup_ioctx() lockless · abf137dd
      Jens Axboe 提交于
      The mm->ioctx_list is currently protected by a reader-writer lock,
      so we always grab that lock on the read side for doing ioctx
      lookups. As the workload is extremely reader biased, turn this into
      an rcu hlist so we can make lookup_ioctx() lockless. Get rid of
      the rwlock and use a spinlock for providing update side exclusion.
      
      There's usually only 1 entry on this list, so it doesn't make sense
      to look into fancier data structures.
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      abf137dd
  25. 27 7月, 2008 1 次提交
  26. 26 7月, 2008 1 次提交
    • O
      kill PF_BORROWED_MM in favour of PF_KTHREAD · 246bb0b1
      Oleg Nesterov 提交于
      Kill PF_BORROWED_MM.  Change use_mm/unuse_mm to not play with ->flags, and
      do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users.
      
      No functional changes yet.  But this allows us to do further
      fixes/cleanups.
      
      oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the
      kthreads, this is wrong because of use_mm().  The problem with
      PF_BORROWED_MM is that we need task_lock() to avoid races.  With this
      patch we can check PF_KTHREAD directly, or use a simple lockless helper:
      
      	/* The result must not be dereferenced !!! */
      	struct mm_struct *__get_task_mm(struct task_struct *tsk)
      	{
      		if (tsk->flags & PF_KTHREAD)
      			return NULL;
      		return tsk->mm;
      	}
      
      Note also ecard_task().  It runs with ->mm != NULL, but it's the kernel
      thread without PF_BORROWED_MM.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      246bb0b1
  27. 07 6月, 2008 1 次提交
  28. 30 4月, 2008 1 次提交
  29. 29 4月, 2008 3 次提交
  30. 28 4月, 2008 1 次提交
  31. 11 4月, 2008 1 次提交