1. 02 12月, 2012 2 次提交
    • T
      workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay · 8852aac2
      Tejun Heo 提交于
      8376fe22 ("workqueue: implement mod_delayed_work[_on]()")
      implemented mod_delayed_work[_on]() using the improved
      try_to_grab_pending().  The function is later used, among others, to
      replace [__]candel_delayed_work() + queue_delayed_work() combinations.
      
      Unfortunately, a delayed_work item w/ zero @delay is handled slightly
      differently by mod_delayed_work_on() compared to
      queue_delayed_work_on().  The latter skips timer altogether and
      directly queues it using queue_work_on() while the former schedules
      timer which will expire on the closest tick.  This means, when @delay
      is zero, that [__]cancel_delayed_work() + queue_delayed_work_on()
      makes the target item immediately executable while
      mod_delayed_work_on() may induce delay of upto a full tick.
      
      This somewhat subtle difference breaks some of the converted users.
      e.g. block queue plugging uses delayed_work for deferred processing
      and uses mod_delayed_work_on() when the queue needs to be immediately
      unplugged.  The above problem manifested as noticeably higher number
      of context switches under certain circumstances.
      
      The difference in behavior was caused by missing special case handling
      for 0 delay in mod_delayed_work_on() compared to
      queue_delayed_work_on().  Joonsoo Kim posted a patch to add it -
      ("workqueue: optimize mod_delayed_work_on() when @delay == 0")[1].
      The patch was queued for 3.8 but it was described as optimization and
      I missed that it was a correctness issue.
      
      As both queue_delayed_work_on() and mod_delayed_work_on() use
      __queue_delayed_work() for queueing, it seems that the better approach
      is to move the 0 delay special handling to the function instead of
      duplicating it in mod_delayed_work_on().
      
      Fix the problem by moving 0 delay special case handling from
      queue_delayed_work_on() to __queue_delayed_work().  This replaces
      Joonsoo's patch.
      
      [1] http://thread.gmane.org/gmane.linux.kernel/1379011/focus=1379012Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NAnders Kaseorg <andersk@MIT.EDU>
      Reported-and-tested-by: NZlatko Calusic <zlatko.calusic@iskon.hr>
      LKML-Reference: <alpine.DEB.2.00.1211280953350.26602@dr-wily.mit.edu>
      LKML-Reference: <50A78AA9.5040904@iskon.hr>
      Cc: Joonsoo Kim <js1304@gmail.com>
      8852aac2
    • M
      workqueue: exit rescuer_thread() as TASK_RUNNING · 412d32e6
      Mike Galbraith 提交于
      A rescue thread exiting TASK_INTERRUPTIBLE can lead to a task scheduling
      off, never to be seen again.  In the case where this occurred, an exiting
      thread hit reiserfs homebrew conditional resched while holding a mutex,
      bringing the box to its knees.
      
      PID: 18105  TASK: ffff8807fd412180  CPU: 5   COMMAND: "kdmflush"
       #0 [ffff8808157e7670] schedule at ffffffff8143f489
       #1 [ffff8808157e77b8] reiserfs_get_block at ffffffffa038ab2d [reiserfs]
       #2 [ffff8808157e79a8] __block_write_begin at ffffffff8117fb14
       #3 [ffff8808157e7a98] reiserfs_write_begin at ffffffffa0388695 [reiserfs]
       #4 [ffff8808157e7ad8] generic_perform_write at ffffffff810ee9e2
       #5 [ffff8808157e7b58] generic_file_buffered_write at ffffffff810eeb41
       #6 [ffff8808157e7ba8] __generic_file_aio_write at ffffffff810f1a3a
       #7 [ffff8808157e7c58] generic_file_aio_write at ffffffff810f1c88
       #8 [ffff8808157e7cc8] do_sync_write at ffffffff8114f850
       #9 [ffff8808157e7dd8] do_acct_process at ffffffff810a268f
          [exception RIP: kernel_thread_helper]
          RIP: ffffffff8144a5c0  RSP: ffff8808157e7f58  RFLAGS: 00000202
          RAX: 0000000000000000  RBX: 0000000000000000  RCX: 0000000000000000
          RDX: 0000000000000000  RSI: ffffffff8107af60  RDI: ffff8803ee491d18
          RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
          R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
          R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      Signed-off-by: NMike Galbraith <mgalbraith@suse.de>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      412d32e6
  2. 25 10月, 2012 1 次提交
  3. 21 9月, 2012 1 次提交
  4. 20 9月, 2012 3 次提交
  5. 19 9月, 2012 8 次提交
    • L
      workqueue: remove @delayed from cwq_dec_nr_in_flight() · b3f9f405
      Lai Jiangshan 提交于
      @delayed is now always false for all callers, remove it.
      
      tj: Updated description.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b3f9f405
    • L
      workqueue: fix possible stall on try_to_grab_pending() of a delayed work item · 3aa62497
      Lai Jiangshan 提交于
      Currently, when try_to_grab_pending() grabs a delayed work item, it
      leaves its linked work items alone on the delayed_works.  The linked
      work items are always NO_COLOR and will cause future
      cwq_activate_first_delayed() increase cwq->nr_active incorrectly, and
      may cause the whole cwq to stall.  For example,
      
      state: cwq->max_active = 1, cwq->nr_active = 1
             one work in cwq->pool, many in cwq->delayed_works.
      
      step1: try_to_grab_pending() removes a work item from delayed_works
             but leaves its NO_COLOR linked work items on it.
      
      step2: Later on, cwq_activate_first_delayed() activates the linked
             work item increasing ->nr_active.
      
      step3: cwq->nr_active = 1, but all activated work items of the cwq are
             NO_COLOR.  When they finish, cwq->nr_active will not be
             decreased due to NO_COLOR, and no further work items will be
             activated from cwq->delayed_works. the cwq stalls.
      
      Fix it by ensuring the target work item is activated before stealing
      PENDING in try_to_grab_pending().  This ensures that all the linked
      work items are activated without incorrectly bumping cwq->nr_active.
      
      tj: Updated comment and description.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      3aa62497
    • L
      workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback() · a5b4e57d
      Lai Jiangshan 提交于
      workqueue_cpu_down_callback() is used only if HOTPLUG_CPU=y, so
      hotcpu_notifier() fits better than cpu_notifier().
      
      When HOTPLUG_CPU=y, hotcpu_notifier() and cpu_notifier() are the same.
      
      When HOTPLUG_CPU=n, if we use cpu_notifier(),
      workqueue_cpu_down_callback() will be called during boot to do
      nothing, and the memory of workqueue_cpu_down_callback() and
      gcwq_unbind_fn() will be discarded after boot.
      
      If we use hotcpu_notifier(), we can avoid the no-op call of
      workqueue_cpu_down_callback() and the memory of
      workqueue_cpu_down_callback() and gcwq_unbind_fn() will be discard at
      build time:
      
      $ ls -l kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier
      -rw-rw-r-- 1 laijs laijs 484080 Sep 15 11:31 kernel/workqueue.o.cpu_notifier
      -rw-rw-r-- 1 laijs laijs 478240 Sep 15 11:31 kernel/workqueue.o.hotcpu_notifier
      
      $ size kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier
         text	   data	    bss	    dec	    hex	filename
        18513	   2387	   1221	  22121	   5669	kernel/workqueue.o.cpu_notifier
        18082	   2355	   1221	  21658	   549a	kernel/workqueue.o.hotcpu_notifier
      
      tj: Updated description.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      a5b4e57d
    • L
      workqueue: use __cpuinit instead of __devinit for cpu callbacks · 9fdf9b73
      Lai Jiangshan 提交于
      For workqueue hotplug callbacks, it makes less sense to use __devinit
      which discards the memory after boot if !HOTPLUG.  __cpuinit, which
      discards the memory after boot if !HOTPLUG_CPU fits better.
      
      tj: Updated description.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      9fdf9b73
    • L
      workqueue: rename manager_mutex to assoc_mutex · b2eb83d1
      Lai Jiangshan 提交于
      Now that manager_mutex's role has changed from synchronizing manager
      role to excluding hotplug against manager, the name is misleading.
      
      As it is protecting the CPU-association of the gcwq now, rename it to
      assoc_mutex.
      
      This patch is pure rename and doesn't introduce any functional change.
      
      tj: Updated comments and description.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b2eb83d1
    • L
      workqueue: WORKER_REBIND is no longer necessary for idle rebinding · 5f7dabfd
      Lai Jiangshan 提交于
      Now both worker destruction and idle rebinding remove the worker from
      idle list while it's still idle, so list_empty(&worker->entry) can be
      used to test whether either is pending and WORKER_DIE to distinguish
      between the two instead making WORKER_REBIND unnecessary.
      
      Use list_empty(&worker->entry) to determine whether destruction or
      rebinding is pending.  This simplifies worker state transitions.
      
      WORKER_REBIND is not needed anymore.  Remove it.
      
      tj: Updated comments and description.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      5f7dabfd
    • L
      workqueue: WORKER_REBIND is no longer necessary for busy rebinding · eab6d828
      Lai Jiangshan 提交于
      Because the old unbind/rebinding implementation wasn't atomic w.r.t.
      GCWQ_DISASSOCIATED manipulation which is protected by
      global_cwq->lock, we had to use two flags, WORKER_UNBOUND and
      WORKER_REBIND, to avoid incorrectly losing all NOT_RUNNING bits with
      back-to-back CPU hotplug operations; otherwise, completion of
      rebinding while another unbinding is in progress could clear UNBIND
      prematurely.
      
      Now that both unbind/rebinding are atomic w.r.t. GCWQ_DISASSOCIATED,
      there's no need to use two flags.  Just one is enough.  Don't use
      WORKER_REBIND for busy rebinding.
      
      tj: Updated description.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      eab6d828
    • L
      workqueue: reimplement idle worker rebinding · ea1abd61
      Lai Jiangshan 提交于
      Currently rebind_workers() uses rebinds idle workers synchronously
      before proceeding to requesting busy workers to rebind.  This is
      necessary because all workers on @worker_pool->idle_list must be bound
      before concurrency management local wake-ups from the busy workers
      take place.
      
      Unfortunately, the synchronous idle rebinding is quite complicated.
      This patch reimplements idle rebinding to simplify the code path.
      
      Rather than trying to make all idle workers bound before rebinding
      busy workers, we simply remove all to-be-bound idle workers from the
      idle list and let them add themselves back after completing rebinding
      (successful or not).
      
      As only workers which finished rebinding can on on the idle worker
      list, the idle worker list is guaranteed to have only bound workers
      unless CPU went down again and local wake-ups are safe.
      
      After the change, @worker_pool->nr_idle may deviate than the actual
      number of idle workers on @worker_pool->idle_list.  More specifically,
      nr_idle may be non-zero while ->idle_list is empty.  All users of
      ->nr_idle and ->idle_list are audited.  The only affected one is
      too_many_workers() which is updated to check %false if ->idle_list is
      empty regardless of ->nr_idle.
      
      After this patch, rebind_workers() no longer performs the nasty
      idle-rebind retries which require temporary release of gcwq->lock, and
      both unbinding and rebinding are atomic w.r.t. global_cwq->lock.
      
      worker->idle_rebind and global_cwq->rebind_hold are now unnecessary
      and removed along with the definition of struct idle_rebind.
      
      Changed from V1:
      	1) remove unlikely from too_many_workers(), ->idle_list can be empty
      	   anytime, even before this patch, no reason to use unlikely.
      	2) fix a small rebasing mistake.
      	   (which is from rebasing the orignal fixing patch to for-next)
      	3) add a lot of comments.
      	4) clear WORKER_REBIND unconditionaly in idle_worker_rebind()
      
      tj: Updated comments and description.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      ea1abd61
  6. 18 9月, 2012 1 次提交
    • L
      workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn() · 960bd11b
      Lai Jiangshan 提交于
      busy_worker_rebind_fn() didn't clear WORKER_REBIND if rebinding failed
      (CPU is down again).  This used to be okay because the flag wasn't
      used for anything else.
      
      However, after 25511a47 "workqueue: reimplement CPU online rebinding
      to handle idle workers", WORKER_REBIND is also used to command idle
      workers to rebind.  If not cleared, the worker may confuse the next
      CPU_UP cycle by having REBIND spuriously set or oops / get stuck by
      prematurely calling idle_worker_rebind().
      
        WARNING: at /work/os/wq/kernel/workqueue.c:1323 worker_thread+0x4cd/0x5
       00()
        Hardware name: Bochs
        Modules linked in: test_wq(O-)
        Pid: 33, comm: kworker/1:1 Tainted: G           O 3.6.0-rc1-work+ #3
        Call Trace:
         [<ffffffff8109039f>] warn_slowpath_common+0x7f/0xc0
         [<ffffffff810903fa>] warn_slowpath_null+0x1a/0x20
         [<ffffffff810b3f1d>] worker_thread+0x4cd/0x500
         [<ffffffff810bc16e>] kthread+0xbe/0xd0
         [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10
        ---[ end trace e977cf20f4661968 ]---
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff810b3db0>] worker_thread+0x360/0x500
        PGD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in: test_wq(O-)
        CPU 0
        Pid: 33, comm: kworker/1:1 Tainted: G        W  O 3.6.0-rc1-work+ #3 Bochs Bochs
        RIP: 0010:[<ffffffff810b3db0>]  [<ffffffff810b3db0>] worker_thread+0x360/0x500
        RSP: 0018:ffff88001e1c9de0  EFLAGS: 00010086
        RAX: 0000000000000000 RBX: ffff88001e633e00 RCX: 0000000000004140
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
        RBP: ffff88001e1c9ea0 R08: 0000000000000000 R09: 0000000000000001
        R10: 0000000000000002 R11: 0000000000000000 R12: ffff88001fc8d580
        R13: ffff88001fc8d590 R14: ffff88001e633e20 R15: ffff88001e1c6900
        FS:  0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000000 CR3: 00000000130e8000 CR4: 00000000000006f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process kworker/1:1 (pid: 33, threadinfo ffff88001e1c8000, task ffff88001e1c6900)
        Stack:
         ffff880000000000 ffff88001e1c9e40 0000000000000001 ffff88001e1c8010
         ffff88001e519c78 ffff88001e1c9e58 ffff88001e1c6900 ffff88001e1c6900
         ffff88001e1c6900 ffff88001e1c6900 ffff88001fc8d340 ffff88001fc8d340
        Call Trace:
         [<ffffffff810bc16e>] kthread+0xbe/0xd0
         [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10
        Code: b1 00 f6 43 48 02 0f 85 91 01 00 00 48 8b 43 38 48 89 df 48 8b 00 48 89 45 90 e8 ac f0 ff ff 3c 01 0f 85 60 01 00 00 48 8b 53 50 <8b> 02 83 e8 01 85 c0 89 02 0f 84 3b 01 00 00 48 8b 43 38 48 8b
        RIP  [<ffffffff810b3db0>] worker_thread+0x360/0x500
         RSP <ffff88001e1c9de0>
        CR2: 0000000000000000
      
      There was no reason to keep WORKER_REBIND on failure in the first
      place - WORKER_UNBOUND is guaranteed to be set in such cases
      preventing incorrectly activating concurrency management.  Always
      clear WORKER_REBIND.
      
      tj: Updated comment and description.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      960bd11b
  7. 11 9月, 2012 2 次提交
    • L
      workqueue: fix possible idle worker depletion across CPU hotplug · ee378aa4
      Lai Jiangshan 提交于
      To simplify both normal and CPU hotplug paths, worker management is
      prevented while CPU hoplug is in progress.  This is achieved by CPU
      hotplug holding the same exclusion mechanism used by workers to ensure
      there's only one manager per pool.
      
      If someone else seems to be performing the manager role, workers
      proceed to execute work items.  CPU hotplug using the same mechanism
      can lead to idle worker depletion because all workers could proceed to
      execute work items while CPU hotplug is in progress and CPU hotplug
      itself wouldn't actually perform the worker management duty - it
      doesn't guarantee that there's an idle worker left when it releases
      management.
      
      This idle worker depletion, under extreme circumstances, can break
      forward-progress guarantee and thus lead to deadlock.
      
      This patch fixes the bug by using separate mechanisms for manager
      exclusion among workers and hotplug exclusion.  For manager exclusion,
      POOL_MANAGING_WORKERS which was restored by the previous patch is
      used.  pool->manager_mutex is now only used for exclusion between the
      elected manager and CPU hotplug.  The elected manager won't proceed
      without holding pool->manager_mutex.
      
      This ensures that the worker which won the manager position can't skip
      managing while CPU hotplug is in progress.  It will block on
      manager_mutex and perform management after CPU hotplug is complete.
      
      Note that hotplug may happen while waiting for manager_mutex.  A
      manager isn't either on idle or busy list and thus the hoplug code
      can't unbind/rebind it.  Make the manager handle its own un/rebinding.
      
      tj: Updated comment and description.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      ee378aa4
    • L
      workqueue: restore POOL_MANAGING_WORKERS · 552a37e9
      Lai Jiangshan 提交于
      This patch restores POOL_MANAGING_WORKERS which was replaced by
      pool->manager_mutex by 60373152 "workqueue: use mutex for global_cwq
      manager exclusion".
      
      There's a subtle idle worker depletion bug across CPU hotplug events
      and we need to distinguish an actual manager and CPU hotplug
      preventing management.  POOL_MANAGING_WORKERS will be used for the
      former and manager_mutex the later.
      
      This patch just lays POOL_MANAGING_WORKERS on top of the existing
      manager_mutex and doesn't introduce any synchronization changes.  The
      next patch will update it.
      
      Note that this patch fixes a non-critical anomaly where
      too_many_workers() may return %true spuriously while CPU hotplug is in
      progress.  While the issue could schedule idle timer spuriously, it
      didn't trigger any actual misbehavior.
      
      tj: Rewrote patch description.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      552a37e9
  8. 06 9月, 2012 2 次提交
    • T
      workqueue: fix possible deadlock in idle worker rebinding · ec58815a
      Tejun Heo 提交于
      Currently, rebind_workers() and idle_worker_rebind() are two-way
      interlocked.  rebind_workers() waits for idle workers to finish
      rebinding and rebound idle workers wait for rebind_workers() to finish
      rebinding busy workers before proceeding.
      
      Unfortunately, this isn't enough.  The second wait from idle workers
      is implemented as follows.
      
      	wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND));
      
      rebind_workers() clears WORKER_REBIND, wakes up the idle workers and
      then returns.  If CPU hotplug cycle happens again before one of the
      idle workers finishes the above wait_event(), rebind_workers() will
      repeat the first part of the handshake - set WORKER_REBIND again and
      wait for the idle worker to finish rebinding - and this leads to
      deadlock because the idle worker would be waiting for WORKER_REBIND to
      clear.
      
      This is fixed by adding another interlocking step at the end -
      rebind_workers() now waits for all the idle workers to finish the
      above WORKER_REBIND wait before returning.  This ensures that all
      rebinding steps are complete on all idle workers before the next
      hotplug cycle can happen.
      
      This problem was diagnosed by Lai Jiangshan who also posted a patch to
      fix the issue, upon which this patch is based.
      
      This is the minimal fix and further patches are scheduled for the next
      merge window to simplify the CPU hotplug path.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Original-patch-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      LKML-Reference: <1346516916-1991-3-git-send-email-laijs@cn.fujitsu.com>
      ec58815a
    • T
      workqueue: move WORKER_REBIND clearing in rebind_workers() to the end of the function · 90beca5d
      Tejun Heo 提交于
      This doesn't make any functional difference and is purely to help the
      next patch to be simpler.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      90beca5d
  9. 05 9月, 2012 1 次提交
    • L
      workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic · 96e65306
      Lai Jiangshan 提交于
      The compiler may compile the following code into TWO write/modify
      instructions.
      
      	worker->flags &= ~WORKER_UNBOUND;
      	worker->flags |= WORKER_REBIND;
      
      so the other CPU may temporarily see worker->flags which doesn't have
      either WORKER_UNBOUND or WORKER_REBIND set and perform local wakeup
      prematurely.
      
      Fix it by using single explicit assignment via ACCESS_ONCE().
      
      Because idle workers have another WORKER_NOT_RUNNING flag, this bug
      doesn't exist for them; however, update it to use the same pattern for
      consistency.
      
      tj: Applied the change to idle workers too and updated comments and
          patch description a bit.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      96e65306
  10. 22 8月, 2012 2 次提交
    • T
      workqueue: reimplement cancel_delayed_work() using try_to_grab_pending() · 57b30ae7
      Tejun Heo 提交于
      cancel_delayed_work() can't be called from IRQ handlers due to its use
      of del_timer_sync() and can't cancel work items which are already
      transferred from timer to worklist.
      
      Also, unlike other flush and cancel functions, a canceled delayed_work
      would still point to the last associated cpu_workqueue.  If the
      workqueue is destroyed afterwards and the work item is re-used on a
      different workqueue, the queueing code can oops trying to dereference
      already freed cpu_workqueue.
      
      This patch reimplements cancel_delayed_work() using
      try_to_grab_pending() and set_work_cpu_and_clear_pending().  This
      allows the function to be called from IRQ handlers and makes its
      behavior consistent with other flush / cancel functions.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      57b30ae7
    • T
      workqueue: use irqsafe timer for delayed_work · e0aecdd8
      Tejun Heo 提交于
      Up to now, for delayed_works, try_to_grab_pending() couldn't be used
      from IRQ handlers because IRQs may happen while
      delayed_work_timer_fn() is in progress leading to indefinite -EAGAIN.
      
      This patch makes delayed_work use the new TIMER_IRQSAFE flag for
      delayed_work->timer.  This makes try_to_grab_pending() and thus
      mod_delayed_work_on() safe to call from IRQ handlers.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      e0aecdd8
  11. 21 8月, 2012 4 次提交
    • T
      workqueue: gut system_nrt[_freezable]_wq() · ae930e0f
      Tejun Heo 提交于
      Now that all workqueues are non-reentrant, system[_freezable]_wq() are
      equivalent to system_nrt[_freezable]_wq().  Replace the latter with
      wrappers around system[_freezable]_wq().  The wrapping goes through
      inline functions so that __deprecated can be added easily.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      ae930e0f
    • T
      workqueue: gut flush[_delayed]_work_sync() · 606a5020
      Tejun Heo 提交于
      Now that all workqueues are non-reentrant, flush[_delayed]_work_sync()
      are equivalent to flush[_delayed]_work().  Drop the separate
      implementation and make them thin wrappers around
      flush[_delayed]_work().
      
      * start_flush_work() no longer takes @wait_executing as the only left
        user - flush_work() - always sets it to %true.
      
      * __cancel_work_timer() uses flush_work() instead of wait_on_work().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      606a5020
    • T
      workqueue: make all workqueues non-reentrant · dbf2576e
      Tejun Heo 提交于
      By default, each per-cpu part of a bound workqueue operates separately
      and a work item may be executing concurrently on different CPUs.  The
      behavior avoids some cross-cpu traffic but leads to subtle weirdities
      and not-so-subtle contortions in the API.
      
      * There's no sane usefulness in allowing a single work item to be
        executed concurrently on multiple CPUs.  People just get the
        behavior unintentionally and get surprised after learning about it.
        Most either explicitly synchronize or use non-reentrant/ordered
        workqueue but this is error-prone.
      
      * flush_work() can't wait for multiple instances of the same work item
        on different CPUs.  If a work item is executing on cpu0 and then
        queued on cpu1, flush_work() can only wait for the one on cpu1.
      
        Unfortunately, work items can easily cross CPU boundaries
        unintentionally when the queueing thread gets migrated.  This means
        that if multiple queuers compete, flush_work() can't even guarantee
        that the instance queued right before it is finished before
        returning.
      
      * flush_work_sync() was added to work around some of the deficiencies
        of flush_work().  In addition to the usual flushing, it ensures that
        all currently executing instances are finished before returning.
        This operation is expensive as it has to walk all CPUs and at the
        same time fails to address competing queuer case.
      
        Incorrectly using flush_work() when flush_work_sync() is necessary
        is an easy error to make and can lead to bugs which are difficult to
        reproduce.
      
      * Similar problems exist for flush_delayed_work[_sync]().
      
      Other than the cross-cpu access concern, there's no benefit in
      allowing parallel execution and it's plain silly to have this level of
      contortion for workqueue which is widely used from core code to
      extremely obscure drivers.
      
      This patch makes all workqueues non-reentrant.  If a work item is
      executing on a different CPU when queueing is requested, it is always
      queued to that CPU.  This guarantees that any given work item can be
      executing on one CPU at maximum and if a work item is queued and
      executing, both are on the same CPU.
      
      The only behavior change which may affect workqueue users negatively
      is that non-reentrancy overrides the affinity specified by
      queue_work_on().  On a reentrant workqueue, the affinity specified by
      queue_work_on() is always followed.  Now, if the work item is
      executing on one of the CPUs, the work item will be queued there
      regardless of the requested affinity.  I've reviewed all workqueue
      users which request explicit affinity, and, fortunately, none seems to
      be crazy enough to exploit parallel execution of the same work item.
      
      This adds an additional busy_hash lookup if the work item was
      previously queued on a different CPU.  This shouldn't be noticeable
      under any sane workload.  Work item queueing isn't a very
      high-frequency operation and they don't jump across CPUs all the time.
      In a micro benchmark to exaggerate this difference - measuring the
      time it takes for two work items to repeatedly jump between two CPUs a
      number (10M) of times with busy_hash table densely populated, the
      difference was around 3%.
      
      While the overhead is measureable, it is only visible in pathological
      cases and the difference isn't huge.  This change brings much needed
      sanity to workqueue and makes its behavior consistent with timer.  I
      think this is the right tradeoff to make.
      
      This enables significant simplification of workqueue API.
      Simplification patches will follow.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      dbf2576e
    • V
      workqueue: fix checkpatch issues · 044c782c
      Valentin Ilie 提交于
      Fixed some checkpatch warnings.
      
      tj: adapted to wq/for-3.7 and massaged pr_xxx() format strings a bit.
      Signed-off-by: NValentin Ilie <valentin.ilie@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <1345326762-21747-1-git-send-email-valentin.ilie@gmail.com>
      044c782c
  12. 17 8月, 2012 6 次提交
    • J
      workqueue: use system_highpri_wq for unbind_work · 7635d2fd
      Joonsoo Kim 提交于
      To speed cpu down processing up, use system_highpri_wq.
      As scheduling priority of workers on it is higher than system_wq and
      it is not contended by other normal works on this cpu, work on it
      is processed faster than system_wq.
      
      tj: CPU up/downs care quite a bit about latency these days.  This
          shouldn't hurt anything and makes sense.
      Signed-off-by: NJoonsoo Kim <js1304@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      7635d2fd
    • J
      workqueue: use system_highpri_wq for highpri workers in rebind_workers() · e2b6a6d5
      Joonsoo Kim 提交于
      In rebind_workers(), we do inserting a work to rebind to cpu for busy workers.
      Currently, in this case, we use only system_wq. This makes a possible
      error situation as there is mismatch between cwq->pool and worker->pool.
      
      To prevent this, we should use system_highpri_wq for highpri worker
      to match theses. This implements it.
      
      tj: Rephrased comment a bit.
      Signed-off-by: NJoonsoo Kim <js1304@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      e2b6a6d5
    • J
      workqueue: introduce system_highpri_wq · 1aabe902
      Joonsoo Kim 提交于
      Commit 3270476a ('workqueue: reimplement
      WQ_HIGHPRI using a separate worker_pool') introduce separate worker pool
      for HIGHPRI. When we handle busyworkers for gcwq, it can be normal worker
      or highpri worker. But, we don't consider this difference in rebind_workers(),
      we use just system_wq for highpri worker. It makes mismatch between
      cwq->pool and worker->pool.
      
      It doesn't make error in current implementation, but possible in the future.
      Now, we introduce system_highpri_wq to use proper cwq for highpri workers
      in rebind_workers(). Following patch fix this issue properly.
      
      tj: Even apart from rebinding, having system_highpri_wq generally
          makes sense.
      Signed-off-by: NJoonsoo Kim <js1304@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      1aabe902
    • J
      workqueue: change value of lcpu in __queue_delayed_work_on() · e42986de
      Joonsoo Kim 提交于
      We assign cpu id into work struct's data field in __queue_delayed_work_on().
      In current implementation, when work is come in first time,
      current running cpu id is assigned.
      If we do __queue_delayed_work_on() with CPU A on CPU B,
      __queue_work() invoked in delayed_work_timer_fn() go into
      the following sub-optimal path in case of WQ_NON_REENTRANT.
      
      	gcwq = get_gcwq(cpu);
      	if (wq->flags & WQ_NON_REENTRANT &&
      		(last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {
      
      Change lcpu to @cpu and rechange lcpu to local cpu if lcpu is WORK_CPU_UNBOUND.
      It is sufficient to prevent to go into sub-optimal path.
      
      tj: Slightly rephrased the comment.
      Signed-off-by: NJoonsoo Kim <js1304@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      e42986de
    • J
      workqueue: correct req_cpu in trace_workqueue_queue_work() · b75cac93
      Joonsoo Kim 提交于
      When we do tracing workqueue_queue_work(), it records requested cpu.
      But, if !(@wq->flag & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND,
      requested cpu is changed as local cpu.
      In case of @wq->flag & WQ_UNBOUND, above change is not occured,
      therefore it is reasonable to correct it.
      
      Use temporary local variable for storing requested cpu.
      Signed-off-by: NJoonsoo Kim <js1304@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b75cac93
    • J
      workqueue: use enum value to set array size of pools in gcwq · 330dad5b
      Joonsoo Kim 提交于
      Commit 3270476a ('workqueue: reimplement
      WQ_HIGHPRI using a separate worker_pool') introduce separate worker_pool
      for HIGHPRI. Although there is NR_WORKER_POOLS enum value which represent
      size of pools, definition of worker_pool in gcwq doesn't use it.
      Using it makes code robust and prevent future mistakes.
      So change code to use this enum value.
      Signed-off-by: NJoonsoo Kim <js1304@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      330dad5b
  13. 14 8月, 2012 2 次提交
    • T
      workqueue: add missing wmb() in clear_work_data() · 23657bb1
      Tejun Heo 提交于
      Any operation which clears PENDING should be preceded by a wmb to
      guarantee that the next PENDING owner sees all the changes made before
      PENDING release.
      
      There are only two places where PENDING is cleared -
      set_work_cpu_and_clear_pending() and clear_work_data().  The caller of
      the former already does smp_wmb() but the latter doesn't have any.
      
      Move the wmb above set_work_cpu_and_clear_pending() into it and add
      one to clear_work_data().
      
      There hasn't been any report related to this issue, and, given how
      clear_work_data() is used, it is extremely unlikely to have caused any
      actual problems on any architecture.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      23657bb1
    • T
      workqueue: fix CPU binding of flush_delayed_work[_sync]() · 1265057f
      Tejun Heo 提交于
      delayed_work encodes the workqueue to use and the last CPU in
      delayed_work->work.data while it's on timer.  The target CPU is
      implicitly recorded as the CPU the timer is queued on and
      delayed_work_timer_fn() queues delayed_work->work to the CPU it is
      running on.
      
      Unfortunately, this leaves flush_delayed_work[_sync]() no way to find
      out which CPU the delayed_work was queued for when they try to
      re-queue after killing the timer.  Currently, it chooses the local CPU
      flush is running on.  This can unexpectedly move a delayed_work queued
      on a specific CPU to another CPU and lead to subtle errors.
      
      There isn't much point in trying to save several bytes in struct
      delayed_work, which is already close to a hundred bytes on 64bit with
      all debug options turned off.  This patch adds delayed_work->cpu to
      remember the CPU it's queued for.
      
      Note that if the timer is migrated during CPU down, the work item
      could be queued to the downed global_cwq after this change.  As a
      detached global_cwq behaves like an unbound one, this doesn't change
      much for the delayed_work.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      1265057f
  14. 04 8月, 2012 5 次提交
    • T
      workqueue: implement mod_delayed_work[_on]() · 8376fe22
      Tejun Heo 提交于
      Workqueue was lacking a mechanism to modify the timeout of an already
      pending delayed_work.  delayed_work users have been working around
      this using several methods - using an explicit timer + work item,
      messing directly with delayed_work->timer, and canceling before
      re-queueing, all of which are error-prone and/or ugly.
      
      This patch implements mod_delayed_work[_on]() which behaves similarly
      to mod_timer() - if the delayed_work is idle, it's queued with the
      given delay; otherwise, its timeout is modified to the new value.
      Zero @delay guarantees immediate execution.
      
      v2: Updated to reflect try_to_grab_pending() changes.  Now safe to be
          called from bh context.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      8376fe22
    • T
      workqueue: mark a work item being canceled as such · bbb68dfa
      Tejun Heo 提交于
      There can be two reasons try_to_grab_pending() can fail with -EAGAIN.
      One is when someone else is queueing or deqeueing the work item.  With
      the previous patches, it is guaranteed that PENDING and queued state
      will soon agree making it safe to busy-retry in this case.
      
      The other is if multiple __cancel_work_timer() invocations are racing
      one another.  __cancel_work_timer() grabs PENDING and then waits for
      running instances of the target work item on all CPUs while holding
      PENDING and !queued.  try_to_grab_pending() invoked from another task
      will keep returning -EAGAIN while the current owner is waiting.
      
      Not distinguishing the two cases is okay because __cancel_work_timer()
      is the only user of try_to_grab_pending() and it invokes
      wait_on_work() whenever grabbing fails.  For the first case, busy
      looping should be fine but wait_on_work() doesn't cause any critical
      problem.  For the latter case, the new contender usually waits for the
      same condition as the current owner, so no unnecessarily extended
      busy-looping happens.  Combined, these make __cancel_work_timer()
      technically correct even without irq protection while grabbing PENDING
      or distinguishing the two different cases.
      
      While the current code is technically correct, not distinguishing the
      two cases makes it difficult to use try_to_grab_pending() for other
      purposes than canceling because it's impossible to tell whether it's
      safe to busy-retry grabbing.
      
      This patch adds a mechanism to mark a work item being canceled.
      try_to_grab_pending() now disables irq on success and returns -EAGAIN
      to indicate that grabbing failed but PENDING and queued states are
      gonna agree soon and it's safe to busy-loop.  It returns -ENOENT if
      the work item is being canceled and it may stay PENDING && !queued for
      arbitrary amount of time.
      
      __cancel_work_timer() is modified to mark the work canceling with
      WORK_OFFQ_CANCELING after grabbing PENDING, thus making
      try_to_grab_pending() fail with -ENOENT instead of -EAGAIN.  Also, it
      invokes wait_on_work() iff grabbing failed with -ENOENT.  This isn't
      necessary for correctness but makes it consistent with other future
      users of try_to_grab_pending().
      
      v2: try_to_grab_pending() was testing preempt_count() to ensure that
          the caller has disabled preemption.  This triggers spuriously if
          !CONFIG_PREEMPT_COUNT.  Use preemptible() instead.  Reported by
          Fengguang Wu.
      
      v3: Updated so that try_to_grab_pending() disables irq on success
          rather than requiring preemption disabled by the caller.  This
          makes busy-looping easier and will allow try_to_grap_pending() to
          be used from bh/irq contexts.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      bbb68dfa
    • T
      workqueue: reorganize try_to_grab_pending() and __cancel_timer_work() · 36e227d2
      Tejun Heo 提交于
      * Use bool @is_dwork instead of @timer and let try_to_grab_pending()
        use to_delayed_work() to determine the delayed_work address.
      
      * Move timer handling from __cancel_work_timer() to
        try_to_grab_pending().
      
      * Make try_to_grab_pending() use -EAGAIN instead of -1 for
        busy-looping and drop the ret local variable.
      
      * Add proper function comment to try_to_grab_pending().
      
      This makes the code a bit easier to understand and will ease further
      changes.  This patch doesn't make any functional change.
      
      v2: Use @is_dwork instead of @timer.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      36e227d2
    • T
      workqueue: factor out __queue_delayed_work() from queue_delayed_work_on() · 7beb2edf
      Tejun Heo 提交于
      This is to prepare for mod_delayed_work[_on]() and doesn't cause any
      functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      7beb2edf
    • T
      workqueue: introduce WORK_OFFQ_FLAG_* · b5490077
      Tejun Heo 提交于
      Low WORK_STRUCT_FLAG_BITS bits of work_struct->data contain
      WORK_STRUCT_FLAG_* and flush color.  If the work item is queued, the
      rest point to the cpu_workqueue with WORK_STRUCT_CWQ set; otherwise,
      WORK_STRUCT_CWQ is clear and the bits contain the last CPU number -
      either a real CPU number or one of WORK_CPU_*.
      
      Scheduled addition of mod_delayed_work[_on]() requires an additional
      flag, which is used only while a work item is off queue.  There are
      more than enough bits to represent off-queue CPU number on both 32 and
      64bits.  This patch introduces WORK_OFFQ_FLAG_* which occupy the lower
      part of the @work->data high bits while off queue.  This patch doesn't
      define any actual OFFQ flag yet.
      
      Off-queue CPU number is now shifted by WORK_OFFQ_CPU_SHIFT, which adds
      the number of bits used by OFFQ flags to WORK_STRUCT_FLAG_SHIFT, to
      make room for OFFQ flags.
      
      To avoid shift width warning with large WORK_OFFQ_FLAG_BITS, ulong
      cast is added to WORK_STRUCT_NO_CPU and, just in case, BUILD_BUG_ON()
      to check that there are enough bits to accomodate off-queue CPU number
      is added.
      
      This patch doesn't make any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b5490077