1. 26 Oct, 2010 (1 commit)
    • aio: bump i_count instead of using igrab · 306fb097
      Committed by Chris Mason
      The aio batching code is using igrab to get an extra reference on the
      inode so it can safely batch.  igrab will go ahead and take the global
      inode spinlock, which can be a bottleneck on large machines doing lots
      of AIO.
      
      In this case, igrab isn't required because we already have a reference
      on the file handle.  It is safe to just bump the i_count directly
      on the inode.
      
      Benchmarking shows this patch brings IOP/s on tons of flash up by about
      2.5X.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
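The difference between the two paths can be sketched in userspace with C11 atomics. This is a toy model, not kernel code: `toy_inode`, `toy_inode_lock`, `toy_igrab` and `toy_bump_i_count` are hypothetical names, and the spinlock only stands in for the global inode lock the message mentions.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct inode; only the reference count matters. */
struct toy_inode {
    atomic_int i_count;
};

/* Stand-in for the single global lock that every igrab-style call takes. */
static atomic_flag toy_inode_lock = ATOMIC_FLAG_INIT;

/* igrab-like path: all callers on all CPUs serialise on one lock. */
struct toy_inode *toy_igrab(struct toy_inode *inode)
{
    while (atomic_flag_test_and_set_explicit(&toy_inode_lock,
                                             memory_order_acquire))
        ; /* spin until the lock is free */
    atomic_fetch_add_explicit(&inode->i_count, 1, memory_order_relaxed);
    atomic_flag_clear_explicit(&toy_inode_lock, memory_order_release);
    return inode;
}

/*
 * Direct bump: no global lock. This is only safe when the caller already
 * holds a reference (here, via the open file), so the count cannot reach
 * zero underneath it.
 */
void toy_bump_i_count(struct toy_inode *inode)
{
    atomic_fetch_add_explicit(&inode->i_count, 1, memory_order_relaxed);
}
```

Both functions leave the count one higher; the second simply avoids the shared lock, which is where the contention on large machines would come from.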
  2. 23 Sep, 2010 (1 commit)
    • aio: do not return ERESTARTSYS as a result of AIO · a0c42bac
      Committed by Jan Kara
      OCFS2 can return ERESTARTSYS from its write function when the process is
      signalled while waiting for a cluster lock (and the filesystem is mounted
      with the intr mount option).  Generally, it seems reasonable to allow
      filesystems to return this error code from their I/O functions.  As we must
      not leak ERESTARTSYS (and similar error codes) to userspace as a result of
      an AIO operation, we have to properly convert it to EINTR inside AIO code
      (restarting the syscall isn't really an option because other AIO could
      have been already submitted by the same io_submit syscall).
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
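The conversion described above amounts to a small mapping on the result code. A minimal userspace sketch follows; `aio_fixup_result` is a hypothetical helper name, and the restart codes are redefined here because they are kernel-internal (the numeric values are an assumption taken from memory of the kernel's errno definitions).

```c
#include <assert.h>
#include <errno.h>

/* Kernel-internal restart codes, redefined for this sketch (assumed values). */
#ifndef ERESTARTSYS
#define ERESTARTSYS           512
#endif
#ifndef ERESTARTNOINTR
#define ERESTARTNOINTR        513
#endif
#ifndef ERESTARTNOHAND
#define ERESTARTNOHAND        514
#endif
#ifndef ERESTART_RESTARTBLOCK
#define ERESTART_RESTARTBLOCK 516
#endif

/*
 * Hypothetical helper: any "restart the syscall" code becomes -EINTR,
 * since an AIO result cannot trigger a restart. Everything else,
 * including byte counts, passes through unchanged.
 */
long aio_fixup_result(long ret)
{
    switch (ret) {
    case -ERESTARTSYS:
    case -ERESTARTNOINTR:
    case -ERESTARTNOHAND:
    case -ERESTART_RESTARTBLOCK:
        return -EINTR;
    default:
        return ret;
    }
}
```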
  3. 15 Sep, 2010 (1 commit)
    • aio: check for multiplication overflow in do_io_submit · 75e1c70f
      Committed by Jeff Moyer
      Tavis Ormandy pointed out that do_io_submit does not do proper bounds
      checking on the passed-in iocb array:
      
             if (unlikely(nr < 0))
                     return -EINVAL;
      
             if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(iocbpp)))))
                     return -EFAULT;                      ^^^^^^^^^^^^^^^^^^
      
      The attached patch checks for overflow, and if it is detected, the
      number of iocbs submitted is scaled down to a number that will fit in
      the long.  This is an ok thing to do, as sys_io_submit is documented as
      returning the number of iocbs submitted, so callers should handle a
      return value of less than the 'nr' argument passed in.
      Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com>
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
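The scaling-down step can be illustrated with a small clamp. This is a sketch of the idea rather than the patch itself; `clamp_iocb_count` is a hypothetical name, and it assumes the caller has already rejected negative counts, as the quoted code does.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper: reduce nr so that nr * elem_size cannot exceed
 * LONG_MAX. Assumes nr >= 0 and elem_size > 0.
 */
long clamp_iocb_count(long nr, size_t elem_size)
{
    long max_nr = (long)((unsigned long)LONG_MAX / elem_size);

    if (nr > max_nr)
        nr = max_nr;
    return nr;
}
```

A caller that passes an oversized count simply gets fewer entries processed and, as the message notes, must be prepared for a return value smaller than the count it asked for.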
  4. 06 Aug, 2010 (1 commit)
  5. 28 May, 2010 (2 commits)
    • get rid of the magic around f_count in aio · d7065da0
      Committed by Al Viro
      __aio_put_req() plays sick games with file refcount.  What
      it wants is fput() from atomic context; it's almost always
      done with f_count > 1, so they only have to deal with delayed
      work in rare cases when their reference happens to be the
      last one.  Current code decrements f_count and if it hasn't
      hit 0, everything is fine.  Otherwise it keeps a pointer
      to struct file (with zero f_count!) around and has delayed
      work do __fput() on it.
      
      Better way to do it: use atomic_long_add_unless( , -1, 1)
      instead of !atomic_long_dec_and_test().  IOW, decrement it
      only if it's not the last reference, leave refcount alone
      if it was.  And use normal fput() in delayed work.
      
      I've made that atomic_long_add_unless call a new helper -
      fput_atomic().  Drops a reference to file if it's safe to
      do in atomic (i.e. if that's not the last one), tells if
      it had been able to do that.  aio.c converted to it, __fput()
      use is gone.  req->ki_file *always* contributes to refcount
      now.  And __fput() became static.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
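The "decrement unless it is the last reference" idea can be modelled in userspace with a compare-and-swap loop. The names `toy_add_unless` and `toy_fput_atomic` are hypothetical; the kernel's own atomic_long_add_unless() and fput_atomic() are not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Add 'a' to *v unless *v equals 'u'. Returns true if the add happened. */
bool toy_add_unless(atomic_long *v, long a, long u)
{
    long cur = atomic_load(v);

    while (cur != u) {
        /* On failure, cur is refreshed with the value currently in *v. */
        if (atomic_compare_exchange_weak(v, &cur, cur + a))
            return true;
    }
    return false;
}

/*
 * Drop one reference only if it is not the last one. A false return means
 * the count was left at 1 and the final put must be done later from a
 * context that is allowed to sleep.
 */
bool toy_fput_atomic(atomic_long *f_count)
{
    return toy_add_unless(f_count, -1, 1);
}
```

The point of the design is that the count never reaches zero in atomic context, so no pointer to a zero-refcount object is ever kept around.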
    • aio: fix the compat vectored operations · 9d85cba7
      Committed by Jeff Moyer
      The aio compat code was not converting the struct iovecs from 32bit to
      64bit pointers, causing either EINVAL to be returned from io_getevents, or
      EFAULT as the result of the I/O.  This patch passes a compat flag to
      io_submit to signal that pointer conversion is necessary for a given iocb
      array.
      
      A variant of this was tested by Michael Tokarev.  I have also updated the
      libaio test harness to exercise this code path with good success.
      Further, I grabbed a copy of ltp and ran the
      testcases/kernel/syscall/readv and writev tests there (compiled with -m32
      on my 64bit system).  All seems happy, but extra eyes on this would be
      welcome.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix CONFIG_COMPAT=n build]
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Reported-by: Michael Tokarev <mjt@tls.msk.ru>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: <stable@kernel.org>		[2.6.35.1]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
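Why the conversion is needed can be seen from the two layouts side by side. The sketch below is a userspace illustration with hypothetical `toy_` types; it assumes the 32-bit layout holds a 32-bit pointer and a 32-bit length, while the native layout mirrors struct iovec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as written by a 32-bit process: pointer and length are 32 bits. */
struct toy_compat_iovec {
    uint32_t iov_base;
    uint32_t iov_len;
};

/* Native-width layout, mirroring struct iovec. */
struct toy_iovec {
    void  *iov_base;
    size_t iov_len;
};

/*
 * Widen each entry individually. Reading the 32-bit array as if it were
 * the native layout would fuse base and length into one bogus pointer on
 * a 64-bit kernel.
 */
void toy_copy_compat_iovecs(struct toy_iovec *dst,
                            const struct toy_compat_iovec *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i].iov_base = (void *)(uintptr_t)src[i].iov_base;
        dst[i].iov_len  = src[i].iov_len;
    }
}
```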
  6. 16 Dec, 2009 (1 commit)
  7. 29 Oct, 2009 (1 commit)
  8. 28 Oct, 2009 (1 commit)
    • aio: implement request batching · cfb1e33e
      Committed by Jeff Moyer
      Some workloads issue batches of small I/O, and the performance is poor
      due to the call to blk_run_address_space for every single iocb.  Nathan
      Roberts pointed this out, and suggested that by deferring this call
      until all I/Os in the iocb array are submitted to the block layer, we
      can realize some impressive performance gains (up to 30% for sequential
      4k reads in batches of 16).
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
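The effect of deferring the call can be shown with a toy counter. Everything here is hypothetical (`toy_queue`, `toy_kick`, both submit functions), and a single queue is assumed for simplicity; the sketch only counts how often the queue is told to run.

```c
#include <assert.h>

/* Hypothetical stand-in for a device queue. */
struct toy_queue {
    int queued; /* requests handed to the queue */
    int kicks;  /* times the queue was told to start processing */
};

/* Stand-in for the per-request "run the queue now" call. */
static void toy_kick(struct toy_queue *q)
{
    q->kicks++;
}

/* Unbatched: the queue is kicked after every single request. */
void toy_submit_each(struct toy_queue *q, int nr)
{
    for (int i = 0; i < nr; i++) {
        q->queued++;
        toy_kick(q);
    }
}

/* Batched: queue the whole array first, then kick once at the end. */
void toy_submit_batched(struct toy_queue *q, int nr)
{
    for (int i = 0; i < nr; i++)
        q->queued++;
    if (nr > 0)
        toy_kick(q);
}
```

For a batch of 16 requests the first variant kicks the queue 16 times and the second once, which is the saving the message attributes the speedup to.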
  9. 23 Sep, 2009 (1 commit)
  10. 22 Sep, 2009 (1 commit)
  11. 01 Jul, 2009 (1 commit)
  12. 20 Mar, 2009 (2 commits)
  13. 14 Jan, 2009 (1 commit)
  14. 29 Dec, 2008 (1 commit)
    • aio: make the lookup_ioctx() lockless · abf137dd
      Committed by Jens Axboe
      The mm->ioctx_list is currently protected by a reader-writer lock,
      so we always grab that lock on the read side for doing ioctx
      lookups. As the workload is extremely reader biased, turn this into
      an rcu hlist so we can make lookup_ioctx() lockless. Get rid of
      the rwlock and use a spinlock for providing update side exclusion.
      
      There's usually only 1 entry on this list, so it doesn't make sense
      to look into fancier data structures.
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  15. 27 Jul, 2008 (1 commit)
  16. 26 Jul, 2008 (1 commit)
    • kill PF_BORROWED_MM in favour of PF_KTHREAD · 246bb0b1
      Committed by Oleg Nesterov
      Kill PF_BORROWED_MM.  Change use_mm/unuse_mm to not play with ->flags, and
      do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users.
      
      No functional changes yet.  But this allows us to do further
      fixes/cleanups.
      
      oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the
      kthreads, this is wrong because of use_mm().  The problem with
      PF_BORROWED_MM is that we need task_lock() to avoid races.  With this
      patch we can check PF_KTHREAD directly, or use a simple lockless helper:
      
      	/* The result must not be dereferenced !!! */
      	struct mm_struct *__get_task_mm(struct task_struct *tsk)
      	{
      		if (tsk->flags & PF_KTHREAD)
      			return NULL;
      		return tsk->mm;
      	}
      
      Note also ecard_task().  It runs with ->mm != NULL, but it's the kernel
      thread without PF_BORROWED_MM.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 07 Jun, 2008 (1 commit)
  18. 30 Apr, 2008 (1 commit)
  19. 29 Apr, 2008 (3 commits)
  20. 28 Apr, 2008 (1 commit)
  21. 11 Apr, 2008 (2 commits)
  22. 20 Mar, 2008 (1 commit)
    • aio: bad AIO race in aio_complete() leads to process hang · 6cb2a210
      Committed by Quentin Barnes
      My group ran into an AIO process hang on a 2.6.24 kernel, with the process
      sleeping indefinitely in io_getevents(2), waiting for a final wakeup that
      never came.
      
      We ran the tests on x86_64 SMP.  The hang only occurred on a Xeon box
      ("Clovertown") but not a Core2Duo ("Conroe").  On the Xeon, the L2 cache isn't
      shared between all eight processors, but the L2 is shared between both
      processors on the Core2Duo we use.
      
      My analysis of the hang is if you go down to the second while-loop
      in read_events(), what happens on processor #1:
      	1) add_wait_queue_exclusive() adds thread to ctx->wait
      	2) aio_read_evt() to check tail
      	3) if aio_read_evt() returned 0, call [io_]schedule() and sleep
      
      In aio_complete() with processor #2:
      	A) info->tail = tail;
      	B) waitqueue_active(&ctx->wait)
      	C) if waitqueue_active() returned non-0, call wake_up()
      
      The way the code is written, step 1 must be seen by all other processors
      before processor 1 checks for pending events in step 2 (that were recorded by
      step A) and step A by processor 2 must be seen by all other processors
      (checked in step 2) before step B is done.
      
      The race I believed I was seeing is that steps 1 and 2 were
      effectively swapped due to the __list_add() being delayed by the L2
      cache not shared by some of the other processors.  Imagine:
      proc 2: just before step A
      proc 1, step 1: adds to ctx->wait, but is not visible by other processors yet
      proc 1, step 2: checks tail and sees no pending events
      proc 2, step A: updates tail
      proc 1, step 3: calls [io_]schedule() and sleeps
      proc 2, step B: checks ctx->wait, but sees no one waiting, skips wakeup
                      so proc 1 sleeps indefinitely
      
      My patch adds a memory barrier between steps A and B.  It ensures that the
      update in step 1 gets seen on processor 2 before continuing.  If processor 1
      was just before step 1, the memory barrier makes sure that step A (update
      tail) gets seen by the time processor 1 makes it to step 2 (check tail).
      
      Before the patch our AIO process would hang virtually 100% of the time.  After
      the patch, we have yet to see the process ever hang.
      Signed-off-by: Quentin Barnes <qbarnes+linux@yahoo-inc.com>
      Reviewed-by: Zach Brown <zach.brown@oracle.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: <stable@kernel.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      [ We should probably disallow that "if (waitqueue_active()) wake_up()"
        coding pattern, because it's so often buggy wrt memory ordering ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
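The ordering requirement in steps 1-3 and A-C can be sketched with C11 atomics. This is a userspace model with hypothetical names (`toy_ctx`, `toy_complete`, `toy_prepare_wait`); a flag stands in for the wait queue, and the explicit fence on the waiter side is an assumption of the sketch, not a claim about how the kernel's wait-queue code provides its ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_ctx {
    atomic_uint tail;       /* written in step A, read in step 2 */
    atomic_bool has_waiter; /* written in step 1, read in step B */
};

/* Completion side (steps A to C). Returns true if a wakeup must be sent. */
bool toy_complete(struct toy_ctx *ctx, unsigned new_tail)
{
    atomic_store_explicit(&ctx->tail, new_tail, memory_order_relaxed); /* A */
    /* Barrier between A and B: the tail update is ordered before the check. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&ctx->has_waiter, memory_order_relaxed); /* B */
}

/* Waiter side (steps 1 to 3). Returns true if the caller should sleep. */
bool toy_prepare_wait(struct toy_ctx *ctx, unsigned seen_tail)
{
    atomic_store_explicit(&ctx->has_waiter, true, memory_order_relaxed); /* 1 */
    /* Pairs with the fence above: registration is ordered before the check. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&ctx->tail, memory_order_relaxed) == seen_tail; /* 2 */
}
```

With both fences in place, at least one side must observe the other's store: either the waiter sees the new tail and does not sleep, or the completer sees the waiter and sends the wakeup. Without them, both loads may return stale values, which is the lost-wakeup interleaving listed above.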
  23. 09 Feb, 2008 (3 commits)
  24. 30 Jan, 2008 (1 commit)
  25. 06 Dec, 2007 (1 commit)
  26. 19 Oct, 2007 (1 commit)
  27. 17 Oct, 2007 (1 commit)
  28. 09 Oct, 2007 (1 commit)
  29. 11 May, 2007 (1 commit)
    • signal/timer/event: KAIO eventfd support example · 9c3060be
      Committed by Davide Libenzi
      This is an example of how to add eventfd support to the current KAIO code,
      in order to enable KAIO to post readiness events to a pollable fd (hence
      compatible with POSIX select/poll).  The KAIO code simply signals the
      eventfd when events are ready, and this triggers a POLLIN on the fd.  This
      patch uses a reserved-for-future-use member of struct iocb to pass an
      eventfd file descriptor, which KAIO will use to post events every time a
      request completes.  At that point, an io_getevents() call will return the
      completed result in a struct io_event.  I made a quick test program to
      verify the patch, and it runs fine here:
      
      http://www.xmailserver.org/eventfd-aio-test.c
      
      The test program uses poll(2), but it would, of course, work with select and
      epoll too.
      
      This allows scheduling both block I/O and requests on other pollable
      devices, and waiting for results using select/poll/epoll.  In a typical
      scenario, an application would submit KAIO requests using io_submit(), and
      would also use epoll_ctl() on the whole other class of devices (which, with
      the addition of signals, timers and user events, is now pretty much
      complete), and then would:
      
      	epoll_wait(...);
      	for_each_event {
      		if (curr_event_is_kaiofd) {
      			io_getevents();
      			dispatch_aio_events();
      		} else {
      			dispatch_epoll_event();
      		}
      	}
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
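The pollable-fd half of this scheme can be tried without any AIO at all. The Linux-only sketch below uses hypothetical helper names (`toy_post_completions`, `toy_wait_completions`); the write stands in for the signal the kernel would post on completion, and no AIO request is actually submitted.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Stand-in for the kernel side: add n to the eventfd counter. */
int toy_post_completions(int efd, uint64_t n)
{
    return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/*
 * Application side: wait up to timeout_ms for POLLIN, then read and reset
 * the counter. Returns the number of posted events, or 0 on timeout/error.
 */
uint64_t toy_wait_completions(int efd, int timeout_ms)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    uint64_t count = 0;

    if (poll(&pfd, 1, timeout_ms) <= 0 || !(pfd.revents & POLLIN))
        return 0;
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return 0;
    return count;
}
```

In the scenario the message describes, the value read back would tell the application how many completions are waiting, and it would then collect them before dispatching, in the same place the loop above handles the KAIO fd.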
  30. 10 May, 2007 (2 commits)
  31. 08 May, 2007 (1 commit)
  32. 28 Mar, 2007 (1 commit)