1. 26 Jul, 2008 (1 commit)
    • kill PF_BORROWED_MM in favour of PF_KTHREAD · 246bb0b1
      Committed by Oleg Nesterov
      Kill PF_BORROWED_MM.  Change use_mm/unuse_mm to not play with ->flags, and
      do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users.
      
      No functional changes yet.  But this allows us to do further
      fixes/cleanups.
      
      oom_kill/ptrace/etc often check "p->mm != NULL" to filter out
      kthreads, but this is wrong because of use_mm().  The problem with
      PF_BORROWED_MM is that we need task_lock() to avoid races.  With this
      patch we can check PF_KTHREAD directly, or use a simple lockless helper:
      
      	/* The result must not be dereferenced !!! */
      	struct mm_struct *__get_task_mm(struct task_struct *tsk)
      	{
      		if (tsk->flags & PF_KTHREAD)
      			return NULL;
      		return tsk->mm;
      	}
      
      Note also ecard_task().  It runs with ->mm != NULL, but it is a kernel
      thread without PF_BORROWED_MM.
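      As a hedged illustration (not part of the patch), a caller that used to
      filter on ->mm could instead rely on the lockless flag check; the helper
      name below is hypothetical:
      
      	/* Hypothetical caller-side sketch: PF_KTHREAD is set for the whole
      	 * lifetime of a kernel thread, so no task_lock() is needed, and a
      	 * kthread that borrowed an mm via use_mm() is still filtered out. */
      	static inline bool task_is_kthread(struct task_struct *tsk)
      	{
      		return tsk->flags & PF_KTHREAD;
      	}
      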
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 07 Jun, 2008 (1 commit)
  3. 30 Apr, 2008 (1 commit)
  4. 29 Apr, 2008 (3 commits)
  5. 28 Apr, 2008 (1 commit)
  6. 11 Apr, 2008 (2 commits)
  7. 20 Mar, 2008 (1 commit)
    • aio: bad AIO race in aio_complete() leads to process hang · 6cb2a210
      Committed by Quentin Barnes
      My group ran into an AIO process hang on a 2.6.24 kernel, with the process
      sleeping indefinitely in io_getevents(2) waiting for a final wakeup that
      never arrived.
      
      We ran the tests on x86_64 SMP.  The hang only occurred on a Xeon box
      ("Clovertown") but not on a Core2Duo ("Conroe").  On the Xeon, the L2 cache
      isn't shared between all eight processors, but the L2 is shared between the
      two processors on the Core2Duo we use.
      
      My analysis of the hang: if you go down to the second while-loop in
      read_events(), here is what happens on processor #1:
      	1) add_wait_queue_exclusive() adds thread to ctx->wait
      	2) aio_read_evt() to check tail
      	3) if aio_read_evt() returned 0, call [io_]schedule() and sleep
      
      In aio_complete() with processor #2:
      	A) info->tail = tail;
      	B) waitqueue_active(&ctx->wait)
      	C) if waitqueue_active() returned non-0, call wake_up()
      
      The way the code is written, step 1 must be seen by all other processors
      before processor 1 checks for pending events in step 2 (events that were
      recorded by step A), and step A by processor 2 must be seen by all other
      processors (checked in step 2) before step B is done.
      
      The race I believed I was seeing is that steps 1 and 2 were effectively
      swapped because the __list_add() was delayed by an L2 cache that is not
      shared with some of the other processors.  Imagine:
      proc 2: just before step A
      proc 1, step 1: adds to ctx->wait, but is not visible by other processors yet
      proc 1, step 2: checks tail and sees no pending events
      proc 2, step A: updates tail
      proc 1, step 3: calls [io_]schedule() and sleeps
      proc 2, step B: checks ctx->wait, but sees no one waiting, skips wakeup
                      so proc 1 sleeps indefinitely
      
      My patch adds a memory barrier between steps A and B.  It ensures that the
      update in step 1 gets seen on processor 2 before continuing.  If processor 1
      was just before step 1, the memory barrier makes sure that step A (update
      tail) gets seen by the time processor 1 makes it to step 2 (check tail).
      
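      A minimal sketch of the shape of that barrier inside aio_complete()
      (abridged from the discussion above, not the literal patch text):
      
      	info->tail = tail;			/* step A: publish the new event     */
      	smp_mb();				/* order the tail store before the
      						 * waitqueue check below             */
      	if (waitqueue_active(&ctx->wait))	/* step B                            */
      		wake_up(&ctx->wait);		/* step C: wake the sleeper of step 3 */
      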
      Before the patch our AIO process would hang virtually 100% of the time.  After
      the patch, we have yet to see the process ever hang.
      Signed-off-by: Quentin Barnes <qbarnes+linux@yahoo-inc.com>
      Reviewed-by: Zach Brown <zach.brown@oracle.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: <stable@kernel.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      [ We should probably disallow that "if (waitqueue_active()) wake_up()"
        coding pattern, because it's so often buggy wrt memory ordering ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 09 Feb, 2008 (3 commits)
  9. 30 Jan, 2008 (1 commit)
  10. 06 Dec, 2007 (1 commit)
  11. 19 Oct, 2007 (1 commit)
  12. 17 Oct, 2007 (1 commit)
  13. 09 Oct, 2007 (1 commit)
  14. 11 May, 2007 (1 commit)
    • signal/timer/event: KAIO eventfd support example · 9c3060be
      Committed by Davide Libenzi
      This is an example of how to add eventfd support to the current KAIO code,
      in order to enable KAIO to post readiness events to a pollable fd (and hence
      be compatible with POSIX select/poll).  The KAIO code simply signals the
      eventfd when events are ready, and this triggers a POLLIN on the fd.  This
      patch uses a reserved-for-future-use member of struct iocb to pass an
      eventfd file descriptor, which KAIO will use to post events every time a
      request completes.  At that point, an aio_getevents() will return the
      completed result in a struct io_event.  I made a quick test program to
      verify the patch, and it runs fine here:
      
      http://www.xmailserver.org/eventfd-aio-test.c
      
      The test program uses poll(2), but it'd, of course, work with select and epoll
      too.
      
      This allows an application to schedule both block I/O and requests on other
      pollable devices, and to wait for results using select/poll/epoll.  In a
      typical scenario, an application would submit KAIO requests using
      aio_submit(), also use epoll_ctl() on the whole other class of devices
      (which, with the addition of signals, timers and user events, is now pretty
      much complete), and then would:
      
      	epoll_wait(...);
      	for_each_event {
      		if (curr_event_is_kaiofd) {
      			aio_getevents();
      			dispatch_aio_events();
      		} else {
      			dispatch_epoll_event();
      		}
      	}
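      As a hedged userspace sketch of the submission side (assuming a libaio that
      provides io_set_eventfd() and the eventual IOCB_FLAG_RESFD/aio_resfd ABI,
      which may differ in detail from this example patch):
      
      	#include <libaio.h>
      	#include <sys/eventfd.h>
      
      	/* Error handling and buffer setup omitted for brevity. */
      	static int submit_with_eventfd(io_context_t ctx, int fd,
      				       void *buf, size_t len)
      	{
      		int efd = eventfd(0, 0);		/* pollable completion fd  */
      		struct iocb cb, *cbs[1] = { &cb };
      
      		io_prep_pread(&cb, fd, buf, len, 0);	/* ordinary AIO read       */
      		io_set_eventfd(&cb, efd);		/* post completions to efd */
      		io_submit(ctx, 1, cbs);
      		return efd;	/* add to epoll; on POLLIN, call io_getevents() */
      	}
      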
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 10 May, 2007 (2 commits)
  16. 08 May, 2007 (1 commit)
  17. 28 Mar, 2007 (1 commit)
  18. 12 Feb, 2007 (2 commits)
  19. 04 Feb, 2007 (1 commit)
    • [PATCH] aio: fix buggy put_ioctx call in aio_complete - v2 · dee11c23
      Committed by Ken Chen
      An AIO bug was reported where a sleeping function is called in softirq
      context:
      
      BUG: warning at kernel/mutex.c:132/__mutex_lock_common()
      Call Trace:
           [<a000000100577b00>] __mutex_lock_slowpath+0x640/0x6c0
           [<a000000100577ba0>] mutex_lock+0x20/0x40
           [<a0000001000a25b0>] flush_workqueue+0xb0/0x1a0
           [<a00000010018c0c0>] __put_ioctx+0xc0/0x240
           [<a00000010018d470>] aio_complete+0x2f0/0x420
           [<a00000010019cc80>] finished_one_bio+0x200/0x2a0
           [<a00000010019d1c0>] dio_bio_complete+0x1c0/0x200
           [<a00000010019d260>] dio_bio_end_aio+0x60/0x80
           [<a00000010014acd0>] bio_endio+0x110/0x1c0
           [<a0000001002770e0>] __end_that_request_first+0x180/0xba0
           [<a000000100277b90>] end_that_request_chunk+0x30/0x60
           [<a0000002073c0c70>] scsi_end_request+0x50/0x300 [scsi_mod]
           [<a0000002073c1240>] scsi_io_completion+0x200/0x8a0 [scsi_mod]
           [<a0000002074729b0>] sd_rw_intr+0x330/0x860 [sd_mod]
           [<a0000002073b3ac0>] scsi_finish_command+0x100/0x1c0 [scsi_mod]
           [<a0000002073c2910>] scsi_softirq_done+0x230/0x300 [scsi_mod]
           [<a000000100277d20>] blk_done_softirq+0x160/0x1c0
           [<a000000100083e00>] __do_softirq+0x200/0x240
           [<a000000100083eb0>] do_softirq+0x70/0xc0
      
      See report: http://marc.theaimsgroup.com/?l=linux-kernel&m=116599593200888&w=2
      
      flush_workqueue() is not allowed to be called in softirq context.
      However, aio_complete(), called from an I/O interrupt, can potentially call
      put_ioctx with the last reference count on the ioctx and trigger the bug.
      It is simply incorrect to free the ioctx from aio_complete.
      
      The bug is triggerable by a race between io_destroy() and aio_complete().
      A possible scenario:
      
      cpu0                               cpu1
      io_destroy                         aio_complete
        wait_for_all_aios {                __aio_put_req
           ...                                 ctx->reqs_active--;
           if (!ctx->reqs_active)
              return;
        }
        ...
        put_ioctx(ioctx)
      
                                           put_ioctx(ctx);
                                              __put_ioctx
                                                bam! Bug trigger!
      
      The real problem is that the check of ctx->reqs_active in
      wait_for_all_aios() is incorrect: access to reqs_active is not properly
      protected by the spin lock.
      
      This patch adds that protective spin lock, and at the same time removes the
      duplicate ref counting for each kiocb, as reqs_active is already used as a
      ref count for each active ioctx.  This also ensures that the buggy call to
      flush_workqueue() in softirq context is eliminated.
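      A hedged sketch of approximately how the protected wait looks after the
      fix (abridged, not the literal patch text):
      
      	static void wait_for_all_aios(struct kioctx *ctx)
      	{
      		struct task_struct *tsk = current;
      		DECLARE_WAITQUEUE(wait, tsk);
      
      		spin_lock_irq(&ctx->ctx_lock);		/* protect reqs_active      */
      		if (ctx->reqs_active) {
      			add_wait_queue(&ctx->wait, &wait);
      			set_task_state(tsk, TASK_UNINTERRUPTIBLE);
      			while (ctx->reqs_active) {	/* re-checked under the lock */
      				spin_unlock_irq(&ctx->ctx_lock);
      				schedule();
      				set_task_state(tsk, TASK_UNINTERRUPTIBLE);
      				spin_lock_irq(&ctx->ctx_lock);
      			}
      			__set_task_state(tsk, TASK_RUNNING);
      			remove_wait_queue(&ctx->wait, &wait);
      		}
      		spin_unlock_irq(&ctx->ctx_lock);
      	}
      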
      Signed-off-by: N"Ken Chen" <kenchen@google.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: <stable@kernel.org>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 31 Dec, 2006 (1 commit)
    • [PATCH] Fix lock inversion aio_kick_handler() · 1ebb1101
      Committed by Zach Brown
      lockdep found an AB-BC-CA lock inversion in retry-based AIO:
      
      1) The task struct's alloc_lock (A) is acquired in process context with
         interrupts enabled.  An interrupt might arrive and call wake_up() which
         grabs the wait queue's q->lock (B).
      
      2) When performing retry-based AIO the AIO core registers
         aio_wake_function() as the wake function for iocb->ki_wait.  It is called
         with the wait queue's q->lock (B) held and then tries to add the iocb to
         the run list after acquiring the ctx_lock (C).
      
      3) aio_kick_handler() holds the ctx_lock (C) while acquiring the
         alloc_lock (A) via task_lock() and unuse_mm().  Lockdep emits a warning
         saying that we're trying to connect the irq-safe q->lock to the
         irq-unsafe alloc_lock via ctx_lock.
      
      This fixes the inversion by calling unuse_mm() in the AIO kick handling path
      after we've released the ctx_lock.  As Ben LaHaise pointed out __put_ioctx
      could set ctx->mm to NULL, so we must only access ctx->mm while we have the
      lock.
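      A hedged sketch of the resulting ordering in the kick handler (abridged,
      not the literal patch text):
      
      	use_mm(ctx->mm);			/* borrow the mm outside ctx_lock     */
      	spin_lock_irq(&ctx->ctx_lock);
      	requeue = __aio_run_iocbs(ctx);
      	mm = ctx->mm;				/* read while ctx_lock still protects it */
      	spin_unlock_irq(&ctx->ctx_lock);
      	unuse_mm(mm);				/* takes alloc_lock, now outside ctx_lock */
      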
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
      Acked-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  21. 14 Dec, 2006 (1 commit)
    • [PATCH] Use activate_mm() in fs/aio.c:use_mm() · 90aef12e
      Committed by Jeremy Fitzhardinge
      activate_mm() is not the right thing to be using in use_mm().  It should be
      switch_mm().
      
      On normal x86, they're synonymous, but for the Xen patches I'm adding a
      hook which assumes that activate_mm is only used the first time a new mm
      is used after creation (I have another hook for dealing with dup_mm).  I
      think this use of activate_mm() is the only place where it could be used
      a second time on an mm.
      
      From a quick look at the other architectures I think this is OK (most
      simply implement one in terms of the other), but some are doing some
      subtly different stuff between the two.
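      A hedged sketch of the change's shape inside use_mm() (abridged from the
      fs/aio.c of that era):
      
      	task_lock(tsk);
      	active_mm = tsk->active_mm;
      	atomic_inc(&mm->mm_count);
      	tsk->mm = mm;
      	tsk->active_mm = mm;
      	switch_mm(active_mm, mm, tsk);	/* previously activate_mm(active_mm, mm) */
      	task_unlock(tsk);
      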
      Acked-by: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  22. 08 Dec, 2006 (3 commits)
  23. 30 Nov, 2006 (1 commit)
  24. 22 Nov, 2006 (2 commits)
    • WorkStruct: Pass the work_struct pointer instead of context data · 65f27f38
      Committed by David Howells
      Pass the work_struct pointer to the work function rather than context data.
      The work function can use container_of() to work out the data.
      
      For the cases where the container of the work_struct may go away the moment the
      pending bit is cleared, it is made possible to defer the release of the
      structure by deferring the clearing of the pending bit.
      
      To make this work, an extra flag is introduced into the management side of the
      work_struct.  This governs auto-release of the structure upon execution.
      
      Ordinarily, the work queue executor would release the work_struct for further
      scheduling or deallocation by clearing the pending bit prior to jumping to the
      work function.  This means that, unless the driver makes some guarantee itself
      that the work_struct won't go away, the work function may not access anything
      else in the work_struct or its container lest they be deallocated.  This is a
      problem if the auxiliary data is taken away (as done by the last patch).
      
      However, if the pending bit is *not* cleared before jumping to the work
      function, then the work function *may* access the work_struct and its container
      with no problems.  But then the work function must itself release the
      work_struct by calling work_release().
      
      In most cases, automatic release is fine, so this is the default.  Special
      initiators exist for the non-auto-release case (ending in _NAR).
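      A hedged illustration of the new calling convention (my_device and its
      fields are made up for the example):
      
      	struct my_device {
      		struct work_struct work;
      		int pending;
      	};
      
      	static void my_work_fn(struct work_struct *work)
      	{
      		struct my_device *dev = container_of(work, struct my_device, work);
      		/* ... operate on dev->pending ... */
      	}
      
      	/* setup:  INIT_WORK(&dev->work, my_work_fn);
      	 * queue:  schedule_work(&dev->work);          */
      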
      Signed-Off-By: David Howells <dhowells@redhat.com>
    • WorkStruct: Separate delayable and non-delayable events. · 52bad64d
      Committed by David Howells
      Separate delayable work items from non-delayable work items by splitting them
      into a separate structure (delayed_work), which incorporates a work_struct and
      the timer_list removed from work_struct.
      
      The work_struct struct is huge, and this limits its usefulness.  On a 64-bit
      architecture it's nearly 100 bytes in size.  This reduces that by half for the
      non-delayable type of event.
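      A hedged sketch of the resulting split (field names are illustrative, and
      the INIT_* calls are shown in the later two-argument form):
      
      	struct my_state {
      		struct work_struct  now_work;	/* no timer: roughly half the size */
      		struct delayed_work later_work;	/* embeds a work_struct + timer_list */
      	};
      
      	/* INIT_WORK(&st->now_work, run_now_fn);
      	 * INIT_DELAYED_WORK(&st->later_work, run_later_fn);
      	 * schedule_work(&st->now_work);
      	 * schedule_delayed_work(&st->later_work, HZ);	*/
      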
      Signed-Off-By: David Howells <dhowells@redhat.com>
  25. 03 Oct, 2006 (1 commit)
  26. 01 Oct, 2006 (2 commits)
  27. 27 Jun, 2006 (1 commit)
  28. 23 Jun, 2006 (1 commit)
  29. 26 Mar, 2006 (1 commit)