1. 20 May 2014, 6 commits
    • workqueue: convert worker_idr to worker_ida · 7cda9aae
      Lai Jiangshan committed
      We no longer iterate workers via worker_idr and worker_idr is used
      only for allocating/freeing ID, so we can convert it to worker_ida.
      
      By using ida_simple_get/remove(), worker_ida doesn't require external
      synchronization, so we don't need manager_mutex to protect it and the
      ID-removal code is allowed to be moved out from
      worker_detach_from_pool().
      
      In a later patch, worker_detach_from_pool() will be used in rescuers
      which don't have IDs, so we move the ID-removal code out from
      worker_detach_from_pool() into worker_thread().
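      As a rough sketch of the resulting pattern (assuming only the ida
      API named above; the per-pool field layout may differ):
      
          /* allocation: ida_simple_get() is internally synchronized,
           * so no manager_mutex is needed around it */
          id = ida_simple_get(&pool->worker_ida, 0, 0, GFP_KERNEL);
          if (id < 0)
                  goto fail;
      
          /* release: now done from worker_thread() as the worker exits */
          ida_simple_remove(&pool->worker_ida, id);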
      
      tj: Minor description updates.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: separate iteration role from worker_idr · da028469
      Lai Jiangshan committed
      worker_idr has two duties: iterating over the attached workers and
      managing worker IDs.  These duties don't have to be tied together;
      we can separate them and use a list for tracking attached workers
      and iteration.
      
      Before this separation, rescuer workers couldn't be added to
      worker_idr because they can't allocate an ID dynamically:
      ID allocation depends on memory allocation, which rescuers can't
      rely on.
      
      After the separation, we can easily add rescuer workers to the list
      for iteration without any memory allocation.  This is required when
      we attach the rescuer worker to the pool in a later patch.
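      A minimal sketch of the split, with a plain list carrying the
      iteration duty (field names per the description; locking elided):
      
          struct worker {
                  struct list_head        node;   /* anchored at pool->workers */
                  int                     id;     /* still from the ID allocator */
          };
      
          /* attaching takes no memory allocation, so rescuers can do it too */
          list_add_tail(&worker->node, &pool->workers);
      
          /* iteration walks the list instead of the idr */
          list_for_each_entry(w, &pool->workers, node)
                  wake_up_process(w->task);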
      
      tj: Minor description updates.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: destroy worker directly in the idle timeout handler · 3347fc9f
      Lai Jiangshan committed
      Since destroy_worker() neither needs to sleep nor requires
      manager_mutex, it can be called directly from the idle timeout
      handler.  This lets us remove POOL_MANAGE_WORKERS and
      maybe_destroy_worker() and simplify manage_workers().
      
      After POOL_MANAGE_WORKERS is removed, worker_thread() no longer
      needs to test whether it should manage workers after processing
      work items, so we can remove that test branch.
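      A sketch of the resulting handler; too_many_workers() is a real
      helper, while first_idle_worker() is a stand-in name and the expiry
      bookkeeping and timer re-arming of the actual handler are elided:
      
          static void idle_worker_timeout(unsigned long __pool)
          {
                  struct worker_pool *pool = (void *)__pool;
      
                  spin_lock_irq(&pool->lock);
                  while (too_many_workers(pool)) {
                          /* destroy_worker() neither sleeps nor takes
                           * manager_mutex, so calling it here is fine */
                          destroy_worker(first_idle_worker(pool));
                  }
                  spin_unlock_irq(&pool->lock);
          }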
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: async worker destruction · 60f5a4bc
      Lai Jiangshan committed
      worker destruction includes these parts of code:
      	adjust pool's stats
      	remove the worker from idle list
      	detach the worker from the pool
      	kthread_stop() to wait for the worker's task exit
      	free the worker struct
      
      There is no essential work to do after kthread_stop(), which means
      destroy_worker() doesn't need to wait for the worker's task to
      exit; we can remove kthread_stop() and free the worker struct in
      the worker's exit path instead.
      
      However, put_unbound_pool() still needs to sync all the workers'
      destruction before destroying the pool; otherwise, the workers may
      access the already-freed pool while exiting.
      
      So we also move the "detach the worker" code to the exiting path
      and let put_unbound_pool() sync with it via detach_completion.
      
      The code of "detach the worker" is wrapped in a new function
      "worker_detach_from_pool()" although worker_detach_from_pool() is only
      called once (in worker_thread()) after this patch, but we need to wrap
      it for these reasons:
      
        1) The code of "detach the worker" is not short enough to unfold them
           in worker_thread().
        2) the name of "worker_detach_from_pool()" is self-comment, and we add
           some comments above the function.
        3) it will be shared by rescuer in later patch which allows rescuer
           and normal thread use the same attach/detach frameworks.
      
      The worker ID is freed when detaching, which happens before the
      worker is fully dead, so the dying worker's ID may be re-used for a
      new worker.  The dying worker's task is therefore renamed to
      "worker/dying" to avoid two or more workers sharing the same name.
      
      Since "detach the worker" is moved out from destroy_worker(),
      destroy_worker() doesn't require manager_mutex, so the
      "lockdep_assert_held(&pool->manager_mutex)" in destroy_worker() is
      removed, and destroy_worker() is not protected by manager_mutex in
      put_unbound_pool().
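      A condensed sketch of the resulting synchronization; the
      bookkeeping inside the lock is elided and no_workers_left() is a
      stand-in for the real emptiness check:
      
          static void worker_detach_from_pool(struct worker *worker,
                                              struct worker_pool *pool)
          {
                  struct completion *detach_completion = NULL;
      
                  mutex_lock(&pool->manager_mutex);
                  /* ... remove worker from the pool's bookkeeping ... */
                  if (no_workers_left(pool))
                          detach_completion = pool->detach_completion;
                  mutex_unlock(&pool->manager_mutex);
      
                  if (detach_completion)
                          complete(detach_completion);    /* unblocks put_unbound_pool() */
          }
      
      On the other side, put_unbound_pool() points pool->detach_completion
      at an on-stack completion and wait_for_completion()s on it after
      stopping the workers.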
      
      tj: Minor description updates.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: destroy_worker() should destroy idle workers only · 73eb7fe7
      Lai Jiangshan committed
      We used to have a CPU-online failure path where a worker was
      created and then destroyed without ever being started: a worker was
      created for the CPU coming online, and if the online operation
      failed, the worker was shut down without being started.  That
      behavior has since changed; the first worker for an onlining CPU is
      now created and started at the same time.
      
      This means the code already ensures that destroy_worker() destroys
      only idle workers, and we don't want to allow it to destroy any
      non-idle worker in the future; that would be buggy and extremely
      hard to verify.  We should make destroy_worker() refuse non-idle
      workers explicitly.
      
      Since destroy_worker() already destroys only idle workers, this
      patch doesn't change any functionality.  We just need to update the
      comments and the sanity-check code.
      
      In the sanity check code, we will refuse to destroy the worker
      if !(worker->flags & WORKER_IDLE).
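      In sketch form (worker->current_work and worker->scheduled are the
      fields the existing checks already use):
      
          if (WARN_ON(worker->current_work) ||
              WARN_ON(!list_empty(&worker->scheduled)) ||
              WARN_ON(!(worker->flags & WORKER_IDLE)))
                  return;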
      
      A worker that has entered idle must already have been started, so
      the "worker->flags & WORKER_STARTED" check is redundant and is
      removed.  After this removal, WORKER_STARTED has no users left, so
      it is removed too.
      
      In the comments for create_worker(), "Create a new worker which is
      bound..." is changed to "... which is attached..." because we now
      call this behavior attaching.
      
      tj: Minor description / comment updates.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: use manager lock only to protect worker_idr · 9625ab17
      Lai Jiangshan committed
      worker_idr is tightly bound to managers and is always and only
      accessed while holding the manager lock, so we don't need
      pool->lock for it.
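      In sketch form, the pool->lock/unlock pairs around worker_idr
      accesses reduce to an assertion:
      
          lockdep_assert_held(&pool->manager_mutex);
          idr_remove(&pool->worker_idr, worker->id);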
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  2. 13 May 2014, 1 commit
  3. 19 Apr 2014, 2 commits
    • workqueue: simplify wq_update_unbound_numa() by jumping to use_dfl_pwq if the target cpumask equals wq's · 534a3fbb
      Daeseok Youn committed
      
      wq_update_unbound_numa(), when it decides that the newly updated
      cpumask equals the default, checks whether the current pwq is
      already the default one and, if so, skips setting it again.  This
      extra step is unnecessary; we can always jump to use_dfl_pwq
      instead.  Simplify the code by removing the conditional.
      This doesn't make any functional difference.
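      Schematically (the allocation path and the exact checks are
      elided):
      
          if (cpumask_equal(cpumask, wq->unbound_attrs->cpumask))
                  goto use_dfl_pwq;       /* always jump, no pre-check */
          /* ... otherwise allocate and install a node-specific pwq ... */
          return;
      
      use_dfl_pwq:
          spin_lock_irq(&wq->dfl_pwq->pool->lock);
          get_pwq(wq->dfl_pwq);
          spin_unlock_irq(&wq->dfl_pwq->pool->lock);
          old_pwq = numa_pwq_tbl_install(wq, node, wq->dfl_pwq);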
      Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: fix a possible race condition between rescuer and pwq-release · 77668c8b
      Lai Jiangshan committed
      There is a race condition between rescuer_thread() and
      pwq_unbound_release_workfn().
      
      Even after a pwq is scheduled for rescue, the associated work items
      may be consumed by any worker.  If all of them are consumed before the
      rescuer gets to them and the pwq's base ref was put due to attribute
      change, the pwq may be released while still linked on the
      @wq->maydays list, making the rescuer dereference an already-freed
      pwq later.
      
      Make send_mayday() pin the target pwq until the rescuer is done with
      it.
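      A sketch of the pinned mayday path; the rescuer drops the reference
      with put_pwq() once it is done with the pwq:
      
          static void send_mayday(struct work_struct *work)
          {
                  struct pool_workqueue *pwq = get_work_pwq(work);
                  struct workqueue_struct *wq = pwq->wq;
      
                  if (list_empty(&pwq->mayday_node)) {
                          get_pwq(pwq);   /* keep pwq alive for the rescuer */
                          list_add_tail(&pwq->mayday_node, &wq->maydays);
                          wake_up_process(wq->rescuer->task);
                  }
          }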
      
      tj: Updated comment and patch description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v3.10+
  4. 18 Apr 2014, 1 commit
    • workqueue: make rescuer_thread() empty wq->maydays list before exiting · 4d595b86
      Lai Jiangshan committed
      After a @pwq is scheduled for emergency execution, other workers
      may consume the affected work items before the rescuer gets to
      them.  This means a workqueue may have pwqs queued on the
      @wq->maydays list while not having any work item pending or in
      flight.  If destroy_workqueue() executes under such conditions, the
      rescuer may exit without emptying @wq->maydays.
      
      This currently doesn't cause any actual harm.  destroy_workqueue()
      can safely destroy all the involved data structures whether
      @wq->maydays is populated or not, as nobody accesses the list once
      the rescuer exits.
      
      However, this is nasty and makes future development difficult.  Let's
      update rescuer_thread() so that it empties @wq->maydays after seeing
      should_stop to guarantee that the list is empty on rescuer exit.
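      Schematically, the rescuer's exit path becomes (per-pwq processing
      elided):
      
          /* sample should_stop *before* draining so that nothing queued
           * between the check and the drain can be missed */
          should_stop = kthread_should_stop();
      
          spin_lock_irq(&wq_mayday_lock);
          while (!list_empty(&wq->maydays)) {
                  /* ... process each mayday'd pwq ... */
          }
          spin_unlock_irq(&wq_mayday_lock);
      
          if (should_stop)
                  return 0;       /* @wq->maydays is guaranteed empty here */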
      
      tj: Updated comment and patch description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v3.10+
  5. 17 Apr 2014, 1 commit
  6. 26 Mar 2014, 1 commit
  7. 23 Feb 2014, 1 commit
  8. 19 Feb 2014, 1 commit
    • workqueue: ensure @task is valid across kthread_stop() · 5bdfff96
      Lai Jiangshan committed
      When a kworker should die, the kworker is notified through the
      WORKER_DIE flag instead of kthread_should_stop().  This, IIRC, is
      primarily to keep the test synchronized inside the worker_pool
      lock.  WORKER_DIE is first set while holding pool->lock, the lock
      is dropped, and kthread_stop() is called.
      
      Unfortunately, this means there's a slight chance that the target
      kworker may see WORKER_DIE and exit before kthread_stop() gets to
      it, freeing the target task before or during kthread_stop().
      
      Fix it by pinning the target task before setting WORKER_DIE and
      putting it after kthread_stop() is done.
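      In sketch form:
      
          struct task_struct *task = worker->task;
      
          get_task_struct(task);          /* pin before WORKER_DIE is visible */
          spin_lock_irq(&pool->lock);
          worker->flags |= WORKER_DIE;
          spin_unlock_irq(&pool->lock);
      
          kthread_stop(task);             /* task struct can't be freed under us */
          put_task_struct(task);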
      
      tj: Improved patch description and comment.  Moved pinning above
          WORKER_DIE to better signify what it's protecting.
      
      CC: stable@vger.kernel.org
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  9. 12 Jan 2014, 1 commit
  10. 26 Nov 2013, 1 commit
  11. 23 Nov 2013, 4 commits
    • workqueue: fix pool ID allocation leakage and remove BUILD_BUG_ON() in init_workqueues · 4e8b22bd
      Li Bin committed
      When one work starts execution, the high bits of work's data contain
      pool ID. It can represent a maximum of WORK_OFFQ_POOL_NONE. Pool ID
      is assigned WORK_OFFQ_POOL_NONE when the work being initialized
      indicating that no pool is associated and get_work_pool() uses it to
      check the associated pool. So if worker_pool_assign_id() assigns a
      ID greater than or equal WORK_OFFQ_POOL_NONE to a pool, it triggers
      leakage, and it may break the non-reentrance guarantee.
      
      This patch fix this issue by modifying the worker_pool_assign_id()
      function calling idr_alloc() by setting @end param WORK_OFFQ_POOL_NONE.
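      A sketch of the fixed allocator:
      
          static int worker_pool_assign_id(struct worker_pool *pool)
          {
                  int ret;
      
                  /* @end bounds the returned ID strictly below
                   * WORK_OFFQ_POOL_NONE */
                  ret = idr_alloc(&worker_pool_idr, pool, 0,
                                  WORK_OFFQ_POOL_NONE, GFP_KERNEL);
                  if (ret < 0)
                          return ret;
                  pool->id = ret;
                  return 0;
          }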
      
      Furthermore, in the current implementation, the BUILD_BUG_ON() in
      init_workqueues makes no sense. The number of worker pools needed
      cannot be determined at compile time, because the number of backing
      pools for UNBOUND workqueues is dynamic based on the assigned custom
      attributes. So remove it.
      
      tj: Minor comment and indentation updates.
      Signed-off-by: Li Bin <huawei.libin@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: fix comment typo for __queue_work() · 9ef28a73
      Li Bin committed
      It seems the "dying" should be "draining" here.
      Signed-off-by: Li Bin <huawei.libin@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: fix ordered workqueues in NUMA setups · 8a2b7538
      Tejun Heo committed
      An ordered workqueue implements execution ordering by using a
      single pool_workqueue with max_active == 1.  On a given
      pool_workqueue, work items are processed in FIFO order, and
      limiting max_active to 1 forces the queued work items to be
      processed one by one.
      
      Unfortunately, 4c16bd32 ("workqueue: implement NUMA affinity for
      unbound workqueues") accidentally broke this guarantee by applying
      NUMA affinity to ordered workqueues too.  On NUMA setups, an ordered
      workqueue would end up with separate pool_workqueues for different
      nodes.  Each pool_workqueue still limits max_active to 1 but multiple
      work items may be executed concurrently and out of order depending on
      which node they are queued to.
      
      Fix it by using dedicated ordered_wq_attrs[] when creating ordered
      workqueues.  The new attrs match the unbound ones except that no_numa
      is always set thus forcing all NUMA nodes to share the default
      pool_workqueue.
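      In sketch form, mirroring how the unbound attrs are set up:
      
          for (i = 0; i < NR_STD_WORKER_POOLS; i++) {
                  struct workqueue_attrs *attrs;
      
                  BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
                  attrs->nice = std_nice[i];
                  attrs->no_numa = true;  /* one dfl pwq shared by all nodes */
                  ordered_wq_attrs[i] = attrs;
          }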
      
      While at it, add a sanity check to the workqueue creation path
      which verifies that an ordered workqueue has only the default
      pool_workqueue.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Libin <huawei.libin@huawei.com>
      Cc: stable@vger.kernel.org
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: swap set_cpus_allowed_ptr() and PF_NO_SETAFFINITY · 91151228
      Oleg Nesterov committed
      Move the setting of PF_NO_SETAFFINITY up before set_cpus_allowed()
      in create_worker(). Otherwise userland can change ->cpus_allowed
      in between.
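      That is, in sketch form:
      
          /* mark the kthread unmovable first, then bind it */
          worker->task->flags |= PF_NO_SETAFFINITY;
          set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);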
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  12. 29 Aug 2013, 1 commit
    • workqueue: cond_resched() after processing each work item · b22ce278
      Tejun Heo committed
      If !PREEMPT, a kworker running work items back to back can hog CPU.
      This becomes dangerous when a self-requeueing work item which is
      waiting for something to happen races against stop_machine.  Such
      self-requeueing work item would requeue itself indefinitely hogging
      the kworker and CPU it's running on while stop_machine would wait for
      that CPU to enter stop_machine while preventing anything else from
      happening on all other CPUs.  The two would deadlock.
      
      Jamie Liu reports that this deadlock scenario exists around
      scsi_requeue_run_queue() and libata port multiplier support, where one
      port may exclude command processing from other ports.  With the right
      timing, scsi_requeue_run_queue() can end up requeueing itself trying
      to execute an IO which is asked to be retried while another device has
      an exclusive access, which in turn can't make forward progress due to
      stop_machine.
      
      Fix it by invoking cond_resched() after executing each work item.
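      Schematically, in the worker's processing loop:
      
          process_one_work(worker, work);
          /*
           * Without CONFIG_PREEMPT nothing else gets to run on this CPU
           * while work items are processed back to back; yield
           * explicitly so e.g. stop_machine can make progress.
           */
          cond_resched();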
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jamie Liu <jamieliu@google.com>
      References: http://thread.gmane.org/gmane.linux.kernel/1552567
      Cc: stable@vger.kernel.org
      --
       kernel/workqueue.c |    9 +++++++++
       1 file changed, 9 insertions(+)
  13. 24 Aug 2013, 1 commit
  14. 21 Aug 2013, 2 commits
  15. 20 Aug 2013, 1 commit
  16. 01 Aug 2013, 1 commit
    • workqueue: copy workqueue_attrs with all fields · 2865a8fb
      Shaohua Li committed
       $echo '0' > /sys/bus/workqueue/devices/xxx/numa
       $cat /sys/bus/workqueue/devices/xxx/numa
      
      I got 1; it should be 0.  The reason is that copy_workqueue_attrs(),
      called from apply_workqueue_attrs(), doesn't copy the no_numa field.
      
      Fix it by making copy_workqueue_attrs() copy ->no_numa too.  This
      would also make get_unbound_pool() set a pool's ->no_numa attribute
      according to the workqueue attributes used when the pool was
      created.  While harmless, as ->no_numa isn't a pool attribute, this
      is a bit confusing.  Clear it explicitly.
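      A sketch of both halves of the fix:
      
          static void copy_workqueue_attrs(struct workqueue_attrs *to,
                                           const struct workqueue_attrs *from)
          {
                  to->nice = from->nice;
                  cpumask_copy(to->cpumask, from->cpumask);
                  to->no_numa = from->no_numa;    /* the field that was missed */
          }
      
          /* in get_unbound_pool(): ->no_numa isn't a pool attribute */
          pool->attrs->no_numa = false;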
      
      tj: Updated description and comments a bit.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
  17. 25 Jul 2013, 1 commit
    • workqueue: allow work_on_cpu() to be called recursively · c2fda509
      Lai Jiangshan committed
      If @fn calls work_on_cpu() again, lockdep will complain:
      
      > [ INFO: possible recursive locking detected ]
      > 3.11.0-rc1-lockdep-fix-a #6 Not tainted
      > ---------------------------------------------
      > kworker/0:1/142 is trying to acquire lock:
      >  ((&wfc.work)){+.+.+.}, at: [<ffffffff81077100>] flush_work+0x0/0xb0
      >
      > but task is already holding lock:
      >  ((&wfc.work)){+.+.+.}, at: [<ffffffff81075dd9>] process_one_work+0x169/0x610
      >
      > other info that might help us debug this:
      >  Possible unsafe locking scenario:
      >
      >        CPU0
      >        ----
      >   lock((&wfc.work));
      >   lock((&wfc.work));
      >
      >  *** DEADLOCK ***
      
      This is a false-positive lockdep report.  In this situation, the
      two "wfc"s of the two work_on_cpu() calls are different; they are
      both on stack, and flush_work() can't deadlock.
      
      To fix this, we need to avoid the lockdep checking in this case, so
      we introduce an internal __flush_work() which skips the lockdep
      annotation.
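      A sketch of the split; flush_work() keeps the annotation for
      external callers while work_on_cpu() calls __flush_work() directly:
      
          static bool __flush_work(struct work_struct *work)
          {
                  struct wq_barrier barr;
      
                  if (start_flush_work(work, &barr)) {
                          wait_for_completion(&barr.done);
                          destroy_work_on_stack(&barr.work);
                          return true;
                  }
                  return false;
          }
      
          bool flush_work(struct work_struct *work)
          {
                  lock_map_acquire(&work->lockdep_map);
                  lock_map_release(&work->lockdep_map);
                  return __flush_work(work);
          }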
      
      tj: Minor comment adjustment.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Reported-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  18. 15 Jul 2013, 1 commit
    • kernel: delete __cpuinit usage from all core kernel files · 0db0628d
      Paul Gortmaker committed
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the uses of the __cpuinit macros from C files in
      the core kernel directories (kernel, init, lib, mm, and include)
      that don't really have a specific maintainer.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  19. 16 May 2013, 1 commit
    • workqueue: don't perform NUMA-aware allocations on offline nodes in wq_numa_init() · 1be0c25d
      Tejun Heo committed
      wq_numa_init() builds per-node cpumasks which are later used to
      make unbound workqueues NUMA-aware.  The cpumasks are allocated
      using alloc_cpumask_var_node() for all possible nodes.
      Unfortunately, on machines with offline nodes, this leads to
      NUMA-aware allocations on existing but offline nodes, which in turn
      triggers BUG in the memory allocation code.
      
      Fix it by using NUMA_NO_NODE for cpumask allocations for offline
      nodes.
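      The fix in sketch form:
      
          for_each_node(node)
                  BUG_ON(!alloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
                                 node_online(node) ? node : NUMA_NO_NODE));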
      
        kernel BUG at include/linux/gfp.h:323!
        invalid opcode: 0000 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0+ #1
        Hardware name: ProLiant BL465c G7, BIOS A19 12/10/2011
        task: ffff880234608000 ti: ffff880234602000 task.ti: ffff880234602000
        RIP: 0010:[<ffffffff8117495d>]  [<ffffffff8117495d>] new_slab+0x2ad/0x340
        RSP: 0000:ffff880234603bf8  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff880237404b40 RCX: 00000000000000d0
        RDX: 0000000000000001 RSI: 0000000000000003 RDI: 00000000002052d0
        RBP: ffff880234603c28 R08: 0000000000000000 R09: 0000000000000001
        R10: 0000000000000001 R11: ffffffff812e3aa8 R12: 0000000000000001
        R13: ffff8802378161c0 R14: 0000000000030027 R15: 00000000000040d0
        FS:  0000000000000000(0000) GS:ffff880237800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: ffff88043fdff000 CR3: 00000000018d5000 CR4: 00000000000007f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Stack:
         ffff880234603c28 0000000000000001 00000000000000d0 ffff8802378161c0
         ffff880237404b40 ffff880237404b40 ffff880234603d28 ffffffff815edba1
         ffff880237816140 0000000000000000 ffff88023740e1c0
        Call Trace:
         [<ffffffff815edba1>] __slab_alloc+0x330/0x4f2
         [<ffffffff81174b25>] kmem_cache_alloc_node_trace+0xa5/0x200
         [<ffffffff812e3aa8>] alloc_cpumask_var_node+0x28/0x90
         [<ffffffff81a0bdb3>] wq_numa_init+0x10d/0x1be
         [<ffffffff81a0bec8>] init_workqueues+0x64/0x341
         [<ffffffff810002ea>] do_one_initcall+0xea/0x1a0
         [<ffffffff819f1f31>] kernel_init_freeable+0xb7/0x1ec
         [<ffffffff815d50de>] kernel_init+0xe/0xf0
         [<ffffffff815ff89c>] ret_from_fork+0x7c/0xb0
        Code: 45  84 ac 00 00 00 f0 41 80 4d 00 40 e9 f6 fe ff ff 66 0f 1f 84 00 00 00 00 00 e8 eb 4b ff ff 49 89 c5 e9 05 fe ff ff <0f> 0b 4c 8b 73 38 44 89 ff 81 cf 00 00 20 00 4c 89 f6 48 c1 ee
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-Tested-by: Lingzhu Xiang <lxiang@redhat.com>
  20. 15 May 2013, 4 commits
  21. 11 May 2013, 1 commit
    • workqueue: workqueue_congested() shouldn't translate WORK_CPU_UNBOUND into node number · d3251859
      Tejun Heo committed
      df2d5ae4 ("workqueue: map an unbound workqueues to multiple per-node
      pool_workqueues") made unbound workqueues map to multiple per-node
      pool_workqueues and accordingly updated workqueue_congested() so
      that, for unbound workqueues, it maps the specified @cpu to the
      NUMA node number to obtain the matching pool_workqueue to query the
      congested state.
      
      Before this change, workqueue_congested() ignored @cpu for unbound
      workqueues as there was only one pool_workqueue and some users
      (fscache) called it with WORK_CPU_UNBOUND.  After the commit, this
      causes the following oops as WORK_CPU_UNBOUND gets translated to
      garbage by cpu_to_node().
      
        BUG: unable to handle kernel paging request at ffff8803598d98b8
        IP: [<ffffffff81043b7e>] unbound_pwq_by_node+0xa1/0xfa
        PGD 2421067 PUD 0
        Oops: 0000 [#1] SMP
        CPU: 1 PID: 2689 Comm: cat Tainted: GF            3.9.0-fsdevel+ #4
        task: ffff88003d801040 ti: ffff880025806000 task.ti: ffff880025806000
        RIP: 0010:[<ffffffff81043b7e>]  [<ffffffff81043b7e>] unbound_pwq_by_node+0xa1/0xfa
        RSP: 0018:ffff880025807ad8  EFLAGS: 00010202
        RAX: 0000000000000001 RBX: ffff8800388a2400 RCX: 0000000000000003
        RDX: ffff880025807fd8 RSI: ffffffff81a31420 RDI: ffff88003d8016e0
        RBP: ffff880025807ae8 R08: ffff88003d801730 R09: ffffffffa00b4898
        R10: ffffffff81044217 R11: ffff88003d801040 R12: 0000000064206e97
        R13: ffff880036059d98 R14: ffff880038cc8080 R15: ffff880038cc82d0
        FS:  00007f21afd9c740(0000) GS:ffff88003d100000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: ffff8803598d98b8 CR3: 000000003df49000 CR4: 00000000000007e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Stack:
         ffff8800388a2400 0000000000000002 ffff880025807b18 ffffffff810442ce
         ffffffff81044217 ffff880000000002 ffff8800371b4080 ffff88003d112ec0
         ffff880025807b38 ffffffffa00810b0 ffff880036059d88 ffff880036059be8
        Call Trace:
         [<ffffffff810442ce>] workqueue_congested+0xb7/0x12c
         [<ffffffffa00810b0>] fscache_enqueue_object+0xb2/0xe8 [fscache]
         [<ffffffffa007facd>] __fscache_acquire_cookie+0x3b9/0x56c [fscache]
         [<ffffffffa00ad8fe>] nfs_fscache_set_inode_cookie+0xee/0x132 [nfs]
         [<ffffffffa009e112>] do_open+0x9/0xd [nfs]
         [<ffffffff810e804a>] do_dentry_open+0x175/0x24b
         [<ffffffff810e8298>] finish_open+0x41/0x51
      
      Fix it by using smp_processor_id() if @cpu is WORK_CPU_UNBOUND.
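      A sketch of the guarded lookup:
      
          bool workqueue_congested(int cpu, struct workqueue_struct *wq)
          {
                  struct pool_workqueue *pwq;
                  bool ret;
      
                  rcu_read_lock_sched();
                  if (cpu == WORK_CPU_UNBOUND)
                          cpu = smp_processor_id();       /* any valid CPU */
                  if (!(wq->flags & WQ_UNBOUND))
                          pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
                  else
                          pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
                  ret = !list_empty(&pwq->delayed_works);
                  rcu_read_unlock_sched();
                  return ret;
          }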
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: David Howells <dhowells@redhat.com>
      Tested-and-Acked-by: David Howells <dhowells@redhat.com>
  22. 01 May 2013, 1 commit
    • workqueue: include workqueue info when printing debug dump of a worker task · 3d1cb205
      Tejun Heo committed
      One of the problems that arise when converting a dedicated custom
      threadpool to workqueue is that the shared worker pool used by
      workqueue anonymizes each worker, making it more difficult to
      identify what the worker was doing on which target from the output
      of sysrq-t or a debug dump from oops, BUG() and friends.
      
      This patch implements set_worker_desc() which can be called from any
      workqueue work function to set its description.  When the worker task is
      dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep assertion
      and so on - the description will be printed out together with the
      workqueue name and the worker function pointer.
      
      The printing side is implemented by print_worker_info() which is called
      from functions in task dump paths - sched_show_task() and
      dump_stack_print_info().  print_worker_info() can be safely called on
      any task in any state as long as the task struct itself is accessible.
      It uses probe_*() functions to access worker fields.  It may print
      garbage if something went very wrong, but it wouldn't cause (another)
      oops.
      
      The description is currently limited to 24 bytes including the
      terminating \0.  worker->desc_valid and worker->desc[] are added,
      and the 64-byte marker, which was already incorrect before the new
      fields were added, is moved to the correct position.
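      Usage looks roughly like this, mirroring the writeback example in
      the dump below:
      
          void bdi_writeback_workfn(struct work_struct *work)
          {
                  struct bdi_writeback *wb = container_of(to_delayed_work(work),
                                                  struct bdi_writeback, dwork);
      
                  set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
                  /* ... do the actual writeback ... */
          }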
      
      Here's an example dump with writeback updated to set the bdi name as
      worker desc.
      
       Hardware name: Bochs
       Modules linked in:
       Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1
       Workqueue: writeback bdi_writeback_workfn (flush-8:0)
        ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8
        ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0
        ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08
       Call Trace:
        [<ffffffff81c61845>] dump_stack+0x19/0x1b
        [<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
        [<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
        [<ffffffff81200150>] bdi_writeback_workfn+0x2a0/0x3b0
       ...
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 10 Apr 2013, 1 commit
  24. 04 Apr 2013, 1 commit
    • workqueue: avoid false negative WARN_ON() in destroy_workqueue() · 5c529597
      Lai Jiangshan committed
      destroy_workqueue() performs several sanity checks before proceeding
      with destruction of a workqueue.  One of the checks verifies that
      refcnt of each pwq (pool_workqueue) is over 1 as at that point there
      should be no in-flight work items and the only holder of pwq refs is
      the workqueue itself.
      
      This worked fine as a workqueue used to hold only one reference to its
      pwqs; however, since 4c16bd32 ("workqueue: implement NUMA affinity
      for unbound workqueues"), a workqueue may hold multiple references to
      its default pwq triggering this sanity check spuriously.
      
      Fix it by not triggering the pwq->refcnt assertion on default pwqs.
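      Schematically, the check in destroy_workqueue() becomes:
      
          for_each_pwq(pwq, wq) {
                  /* only non-default pwqs must be down to their base ref */
                  if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) ||
                      WARN_ON(pwq->nr_active) ||
                      WARN_ON(!list_empty(&pwq->delayed_works))) {
                          mutex_unlock(&wq->mutex);
                          return;
                  }
          }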
      
      An example spurious WARN trigger follows.
      
       WARNING: at kernel/workqueue.c:4201 destroy_workqueue+0x6a/0x13e()
       Hardware name: 4286C12
       Modules linked in: sdhci_pci sdhci mmc_core usb_storage i915 drm_kms_helper drm i2c_algo_bit i2c_core video
       Pid: 361, comm: umount Not tainted 3.9.0-rc5+ #29
       Call Trace:
        [<c04314a7>] warn_slowpath_common+0x7c/0x93
        [<c04314e0>] warn_slowpath_null+0x22/0x24
        [<c044796a>] destroy_workqueue+0x6a/0x13e
        [<c056dc01>] ext4_put_super+0x43/0x2c4
        [<c04fb7b8>] generic_shutdown_super+0x4b/0xb9
        [<c04fb848>] kill_block_super+0x22/0x60
        [<c04fb960>] deactivate_locked_super+0x2f/0x56
        [<c04fc41b>] deactivate_super+0x2e/0x31
        [<c050f1e6>] mntput_no_expire+0x103/0x108
        [<c050fdce>] sys_umount+0x2a2/0x2c4
        [<c050fe0e>] sys_oldumount+0x1e/0x20
        [<c085ba4d>] sysenter_do_call+0x12/0x38
      
      tj: Rewrote description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
  25. 02 Apr 2013, 3 commits
    • workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity · d55262c4
      Tejun Heo committed
      
      Unbound workqueues are now NUMA aware.  Let's add some control knobs
      and update sysfs interface accordingly.
      
      * Add kernel param workqueue.disable_numa which disables NUMA
        affinity globally.
      
      * Replace sysfs file "pool_id" with "pool_ids" which contain
        node:pool_id pairs.  This change is userland-visible but "pool_id"
        hasn't seen a release yet, so this is okay.
      
      * Add a new sysfs file "numa" which can toggle NUMA affinity on
        individual workqueues.  This is implemented as attrs->no_numa,
        which is special in that it isn't part of a pool's attributes.
        It only affects how apply_workqueue_attrs() picks which pools to
        use.
      
      After "pool_ids" change, first_pwq() doesn't have any user left.
      Removed.
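      The kill switch is an ordinary module parameter; in sketch form:
      
          static bool wq_disable_numa;
          module_param_named(disable_numa, wq_disable_numa, bool, 0444);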
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: implement NUMA affinity for unbound workqueues · 4c16bd32
      Tejun Heo committed
      Currently, an unbound workqueue has a single current, or first, pwq
      (pool_workqueue) to which all new work items are queued.  This
      often isn't optimal on NUMA machines as workers may jump around
      across node boundaries and work items get assigned to workers
      without any regard to NUMA affinity.
      
      This patch implements NUMA affinity for unbound workqueues.  Instead
      of mapping all entries of numa_pwq_tbl[] to the same pwq,
      apply_workqueue_attrs() now creates a separate pwq covering the
      intersecting CPUs for each NUMA node which has online CPUs in
      @attrs->cpumask.  Nodes which don't have intersecting possible CPUs
      are mapped to pwqs covering whole @attrs->cpumask.
      
      As CPUs come up and go down, the pool association is changed
      accordingly.  Changing pool association may involve allocating new
      pools which may fail.  To avoid failing CPU_DOWN, each workqueue
      always keeps a default pwq which covers whole attrs->cpumask which is
      used as fallback if pool creation fails during a CPU hotplug
      operation.
      
      This ensures that all work items issued on a NUMA node are executed
      on the same node as long as the workqueue allows execution on the
      CPUs of the node.
      
      As this maps a workqueue to multiple pwqs and max_active is
      per-pwq, this changes the behavior of max_active.  The limit is now
      per NUMA node instead of global.  While this is an actual change,
      max_active is already per-cpu for per-cpu workqueues and primarily
      used as a safety mechanism rather than for active concurrency
      control.  Concurrency is usually limited from workqueue users by
      the number of concurrently active work items, and this change
      shouldn't matter much.
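      The queueing side then picks the pwq per node; schematically:
      
          if (!(wq->flags & WQ_UNBOUND))
                  pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
          else
                  pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));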
      
      v2: Fixed pwq freeing in apply_workqueue_attrs() error path.  Spotted
          by Lai.
      
      v3: The previous version incorrectly made a workqueue spanning
          multiple nodes spread work items over all online CPUs when some of
          its nodes don't have any desired cpus.  Reimplemented so that NUMA
          affinity is properly updated as CPUs go up and down.  This problem
          was spotted by Lai Jiangshan.
      
      v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
          however, wq may be freed at any time after dfl_pwq is put making
          the clearing use-after-free.  Clear wq->dfl_pwq before putting it.
      
      v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
          @pwq_tbl after success.  Fixed.
      
          Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
          application of new attrs is excluded via CPU hotplug.  Removed.
      
          Documentation on CPU affinity guarantee on CPU_DOWN added.
      
          All changes are suggested by Lai Jiangshan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: introduce put_pwq_unlocked() · dce90d47
      Tejun Heo committed
      Factor out the lock-pool, put_pwq(), unlock sequence into
      put_pwq_unlocked().  The two existing call sites are converted, and
      there will be more with NUMA affinity support.
      
      This is to prepare for NUMA affinity support for unbound workqueues
      and doesn't introduce any functional difference.
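      The helper in sketch form:
      
          static void put_pwq_unlocked(struct pool_workqueue *pwq)
          {
                  if (pwq) {
                          /* pwqs and pools are sched-RCU protected, so
                           * lock/unlock on the spot is safe */
                          spin_lock_irq(&pwq->pool->lock);
                          put_pwq(pwq);
                          spin_unlock_irq(&pwq->pool->lock);
                  }
          }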
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>