1. 11 May 2007, 1 commit
    •
      signal/timer/event: KAIO eventfd support example · 9c3060be
      Authored by Davide Libenzi
      This is an example of how to add eventfd support to the current KAIO code,
      in order to enable KAIO to post readiness events to a pollable fd (hence
      compatible with POSIX select/poll).  The KAIO code simply signals the eventfd
      when events are ready, and this triggers a POLLIN on the fd.  This patch
      uses a reserved-for-future-use member of the struct iocb to pass an eventfd
      file descriptor, which KAIO will use to post events every time a request
      completes.  At that point, an aio_getevents() will return the completed result
      in a struct io_event.  I made a quick test program to verify the patch, and it
      runs fine here:
      
      http://www.xmailserver.org/eventfd-aio-test.c
      
      The test program uses poll(2), but it'd, of course, work with select and epoll
      too.
      
      This makes it possible to schedule both block I/O requests and requests on
      other pollable devices, and to wait for the results using select/poll/epoll.
      In a typical scenario, an application would submit KAIO requests using
      aio_submit(), also use epoll_ctl() on the whole other class of devices (a
      class that, with the addition of signals, timers and user events, is now
      pretty much complete), and then would:
      
      	epoll_wait(...);
      	for_each_event {
      		if (curr_event_is_kaiofd) {
      			aio_getevents();
      			dispatch_aio_events();
      		} else {
      			dispatch_epoll_event();
      		}
      	}
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c3060be
  2. 10 May 2007, 2 commits
  3. 08 May 2007, 1 commit
  4. 28 Mar 2007, 1 commit
  5. 12 Feb 2007, 2 commits
  6. 04 Feb 2007, 1 commit
    •
      [PATCH] aio: fix buggy put_ioctx call in aio_complete - v2 · dee11c23
      Authored by Ken Chen
      An AIO bug was reported in which a sleeping function is called in softirq
      context:
      
      BUG: warning at kernel/mutex.c:132/__mutex_lock_common()
      Call Trace:
           [<a000000100577b00>] __mutex_lock_slowpath+0x640/0x6c0
           [<a000000100577ba0>] mutex_lock+0x20/0x40
           [<a0000001000a25b0>] flush_workqueue+0xb0/0x1a0
           [<a00000010018c0c0>] __put_ioctx+0xc0/0x240
           [<a00000010018d470>] aio_complete+0x2f0/0x420
           [<a00000010019cc80>] finished_one_bio+0x200/0x2a0
           [<a00000010019d1c0>] dio_bio_complete+0x1c0/0x200
           [<a00000010019d260>] dio_bio_end_aio+0x60/0x80
           [<a00000010014acd0>] bio_endio+0x110/0x1c0
           [<a0000001002770e0>] __end_that_request_first+0x180/0xba0
           [<a000000100277b90>] end_that_request_chunk+0x30/0x60
           [<a0000002073c0c70>] scsi_end_request+0x50/0x300 [scsi_mod]
           [<a0000002073c1240>] scsi_io_completion+0x200/0x8a0 [scsi_mod]
           [<a0000002074729b0>] sd_rw_intr+0x330/0x860 [sd_mod]
           [<a0000002073b3ac0>] scsi_finish_command+0x100/0x1c0 [scsi_mod]
           [<a0000002073c2910>] scsi_softirq_done+0x230/0x300 [scsi_mod]
           [<a000000100277d20>] blk_done_softirq+0x160/0x1c0
           [<a000000100083e00>] __do_softirq+0x200/0x240
           [<a000000100083eb0>] do_softirq+0x70/0xc0
      
      See report: http://marc.theaimsgroup.com/?l=linux-kernel&m=116599593200888&w=2
      
      flush_workqueue() is not allowed to be called in softirq context.
      However, aio_complete(), called from an I/O interrupt, can potentially
      call put_ioctx with the last ref count on the ioctx, which triggers the
      bug.  It is simply incorrect to perform ioctx freeing from aio_complete.
      
      The bug can be triggered by a race between io_destroy() and aio_complete().
      A possible scenario:
      
      cpu0                               cpu1
      io_destroy                         aio_complete
        wait_for_all_aios {                __aio_put_req
           ...                                 ctx->reqs_active--;
           if (!ctx->reqs_active)
              return;
        }
        ...
        put_ioctx(ioctx)
      
                                           put_ioctx(ctx);
                                              __put_ioctx
                                                bam! Bug trigger!
      
      The real problem is that the condition check of ctx->reqs_active in
      wait_for_all_aios() is incorrect: access to reqs_active is not
      properly protected by the spin lock.
      
      This patch adds that protective spin lock, and at the same time removes
      all duplicate ref counting for each kiocb, since reqs_active already serves
      as a ref count for each active ioctx.  This also ensures that the buggy
      call to flush_workqueue() in softirq context is eliminated.
      Signed-off-by: "Ken Chen" <kenchen@google.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: <stable@kernel.org>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dee11c23
  7. 31 Dec 2006, 1 commit
    •
      [PATCH] Fix lock inversion aio_kick_handler() · 1ebb1101
      Authored by Zach Brown
      lockdep found an A-B B-C C-A lock inversion in retry-based AIO:
      
      1) The task struct's alloc_lock (A) is acquired in process context with
         interrupts enabled.  An interrupt might arrive and call wake_up() which
         grabs the wait queue's q->lock (B).
      
      2) When performing retry-based AIO the AIO core registers
         aio_wake_function() as the wake function for iocb->ki_wait.  It is called
         with the wait queue's q->lock (B) held and then tries to add the iocb to
         the run list after acquiring the ctx_lock (C).
      
      3) aio_kick_handler() holds the ctx_lock (C) while acquiring the
         alloc_lock (A) via lock_task() and unuse_mm().  Lockdep emits a warning
         saying that we're trying to connect the irq-safe q->lock to the
         irq-unsafe alloc_lock via ctx_lock.
      
      This fixes the inversion by calling unuse_mm() in the AIO kick handling path
      after we've released the ctx_lock.  As Ben LaHaise pointed out, __put_ioctx
      could set ctx->mm to NULL, so we must only access ctx->mm while we have the
      lock.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1ebb1101
  8. 14 Dec 2006, 1 commit
    •
      [PATCH] Use activate_mm() in fs/aio.c:use_mm() · 90aef12e
      Authored by Jeremy Fitzhardinge
      activate_mm() is not the right thing to be using in use_mm().  It should be
      switch_mm().
      
      On normal x86, they're synonymous, but for the Xen patches I'm adding a
      hook which assumes that activate_mm is only used the first time a new mm
      is used after creation (I have another hook for dealing with dup_mm).  I
      think this use of activate_mm() is the only place where it could be used
      a second time on an mm.
      
      From a quick look at the other architectures I think this is OK (most
      simply implement one in terms of the other), but some are doing some
      subtly different stuff between the two.
      Acked-by: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      90aef12e
  9. 08 Dec 2006, 3 commits
  10. 30 Nov 2006, 1 commit
  11. 22 Nov 2006, 2 commits
    •
      WorkStruct: Pass the work_struct pointer instead of context data · 65f27f38
      Authored by David Howells
      Pass the work_struct pointer to the work function rather than context data.
      The work function can use container_of() to work out the data.
      
      For the cases where the container of the work_struct may go away the moment the
      pending bit is cleared, it is made possible to defer the release of the
      structure by deferring the clearing of the pending bit.
      
      To make this work, an extra flag is introduced into the management side of the
      work_struct.  This governs auto-release of the structure upon execution.
      
      Ordinarily, the work queue executor would release the work_struct for further
      scheduling or deallocation by clearing the pending bit prior to jumping to the
      work function.  This means that, unless the driver makes some guarantee itself
      that the work_struct won't go away, the work function may not access anything
      else in the work_struct or its container lest they be deallocated.  This is a
      problem if the auxiliary data is taken away (as done by the last patch).
      
      However, if the pending bit is *not* cleared before jumping to the work
      function, then the work function *may* access the work_struct and its container
      with no problems.  But then the work function must itself release the
      work_struct by calling work_release().
      
      In most cases, automatic release is fine, so this is the default.  Special
      initiators exist for the non-auto-release case (ending in _NAR).
      Signed-Off-By: David Howells <dhowells@redhat.com>
      65f27f38
    •
      WorkStruct: Separate delayable and non-delayable events. · 52bad64d
      Authored by David Howells
      Separate delayable work items from non-delayable work items by splitting them
      into a separate structure (delayed_work), which incorporates a work_struct and
      the timer_list removed from work_struct.
      
      The work_struct struct is huge, and this limits its usefulness.  On a 64-bit
      architecture it's nearly 100 bytes in size.  This reduces that by half for the
      non-delayable type of event.
      Signed-Off-By: David Howells <dhowells@redhat.com>
      52bad64d
  12. 03 Oct 2006, 1 commit
  13. 01 Oct 2006, 2 commits
  14. 27 Jun 2006, 1 commit
  15. 23 Jun 2006, 1 commit
  16. 26 Mar 2006, 1 commit
  17. 09 Jan 2006, 1 commit
  18. 14 Nov 2005, 2 commits
  19. 07 Nov 2005, 1 commit
    •
      [PATCH] aio: remove aio_max_nr accounting race · d55b5fda
      Authored by Zach Brown
      AIO was adding a new context's max requests to the global total before
      testing if that resulting total was over the global limit.  This let
      innocent tasks get their new limit tested along with a racing guilty task
      that was crossing the limit.  This serializes the _nr accounting with a
      spinlock.  It also switches to using unsigned long for the global totals.
      Individual contexts are still limited to an unsigned int's worth of
      requests by the syscall interface.
      
      The problem and fix were verified with a simple program that spun creating
      and destroying a context while holding on to another long lived context.
      Before the patch a task creating a tiny context could get a spurious EAGAIN
      if it raced with a task creating a very large context that overran the
      limit.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d55b5fda
  20. 24 Oct 2005, 1 commit
  21. 18 Oct 2005, 1 commit
  22. 01 Oct 2005, 3 commits
  23. 18 Sep 2005, 1 commit
  24. 10 Sep 2005, 3 commits
    •
      [PATCH] files: files struct with RCU · ab2af1f5
      Authored by Dipankar Sarma
      Patch to eliminate struct files_struct.file_lock spinlock on the reader side
      and use rcu refcounting rcuref_xxx api for the f_count refcounter.  The
      updates to the fdtable are done by allocating a new fdtable structure and
      setting files->fdt to point to the new structure.  The fdtable structure is
      protected by RCU thereby allowing lock-free lookup.  For fd arrays/sets that
      are vmalloced, we use keventd to free them since RCU callbacks can't sleep.  A
      global list of fdtable to be freed is not scalable, so we use a per-cpu list.
      If keventd is already handling the current cpu's work, we use a timer to defer
      queueing of that work.
      
      Since the last publication, this patch has been rewritten to avoid using
      explicit memory barriers and to use the rcu_assign_pointer() and
      rcu_dereference() primitives instead.  This required that the fd information
      be kept in a separate structure (fdtable) and updated atomically.
      Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ab2af1f5
    •
      [PATCH] aio: kiocb locking to serialise retry and cancel · ac0b1bc1
      Authored by Benjamin LaHaise
      Implement a per-kiocb lock to serialise retry operations and cancel.  This
      is done using wait_on_bit_lock() on the KIF_LOCKED bit of kiocb->ki_flags.
      Also, make the cancellation path lock the kiocb and subsequently release
      all references to it if the cancel was successful.  This version includes a
      fix for the deadlock with __aio_run_iocbs.
      Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ac0b1bc1
    •
      [PATCH] change io_cancel return code for no cancel case · 8f58202b
      Authored by Wendy Cheng
      Note that, other than a few exceptions, most current filesystems and/or
      drivers do not have aio cancel specifically defined (the kiocb->ki_cancel
      field is mostly NULL).  However, the sys_io_cancel system call universally
      sets the return code to -EAGAIN.  This gives applications the wrong
      impression that this call is implemented but just never works.  We have
      customer inquiries about this issue.
      
      Changed by Benjamin LaHaise to return EINVAL instead of ENOSYS
      Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
      Acked-by: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      8f58202b
  25. 05 Sep 2005, 1 commit
    •
      [PATCH] uml: fixes performance regression in activate_mm and thus exec() · 1e40cd38
      Authored by Paolo 'Blaisorblade' Giarrusso
      Normally, activate_mm() is called from exec(), and thus it used to be a
      no-op because we use a completely new "MM context" on the host (for
      instance, a new process), and so we didn't need to flush any "TLB entries"
      (which for us are the set of memory mappings for the host process from the
      virtual "RAM" file).
      
      Kernel threads, instead, are usually handled in a different way.  So, when
      for AIO we call use_mm(), things used to break and so Benjamin implemented
      activate_mm().  However, that is only needed for AIO, and could slow down
      exec() inside UML, so be smart: detect being called for AIO (via
      PF_BORROWED_MM) and do the full flush only in that situation.
      
      Also comment the caller so that people won't go breaking UML without
      noticing.  I also rely on the caller's locks for testing current->flags.
      Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      CC: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1e40cd38
  26. 29 Jun 2005, 1 commit
  27. 01 May 2005, 3 commits