1. 21 Aug, 2012 3 commits
    • workqueue: gut flush[_delayed]_work_sync() · 606a5020
      Tejun Heo committed
      Now that all workqueues are non-reentrant, flush[_delayed]_work_sync()
      are equivalent to flush[_delayed]_work().  Drop the separate
      implementation and make them thin wrappers around
      flush[_delayed]_work().
      
      * start_flush_work() no longer takes @wait_executing as the only left
        user - flush_work() - always sets it to %true.
      
      * __cancel_work_timer() uses flush_work() instead of wait_on_work().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      606a5020
    • workqueue: make all workqueues non-reentrant · dbf2576e
      Tejun Heo committed
      By default, each per-cpu part of a bound workqueue operates separately
      and a work item may be executing concurrently on different CPUs.  The
      behavior avoids some cross-cpu traffic but leads to subtle weirdities
      and not-so-subtle contortions in the API.
      
      * There's no sane usefulness in allowing a single work item to be
        executed concurrently on multiple CPUs.  People just get the
        behavior unintentionally and get surprised after learning about it.
        Most either explicitly synchronize or use non-reentrant/ordered
        workqueue but this is error-prone.
      
      * flush_work() can't wait for multiple instances of the same work item
        on different CPUs.  If a work item is executing on cpu0 and then
        queued on cpu1, flush_work() can only wait for the one on cpu1.
      
        Unfortunately, work items can easily cross CPU boundaries
        unintentionally when the queueing thread gets migrated.  This means
        that if multiple queuers compete, flush_work() can't even guarantee
        that the instance queued right before it is finished before
        returning.
      
      * flush_work_sync() was added to work around some of the deficiencies
        of flush_work().  In addition to the usual flushing, it ensures that
        all currently executing instances are finished before returning.
        This operation is expensive as it has to walk all CPUs and at the
        same time fails to address competing queuer case.
      
        Incorrectly using flush_work() when flush_work_sync() is necessary
        is an easy error to make and can lead to bugs which are difficult to
        reproduce.
      
      * Similar problems exist for flush_delayed_work[_sync]().
      
      Other than the cross-cpu access concern, there's no benefit in
      allowing parallel execution and it's plain silly to have this level of
      contortion for workqueue which is widely used from core code to
      extremely obscure drivers.
      
      This patch makes all workqueues non-reentrant.  If a work item is
      executing on a different CPU when queueing is requested, it is always
      queued to that CPU.  This guarantees that any given work item can be
      executing on one CPU at maximum and if a work item is queued and
      executing, both are on the same CPU.
      
      The only behavior change which may affect workqueue users negatively
      is that non-reentrancy overrides the affinity specified by
      queue_work_on().  On a reentrant workqueue, the affinity specified by
      queue_work_on() is always followed.  Now, if the work item is
      executing on one of the CPUs, the work item will be queued there
      regardless of the requested affinity.  I've reviewed all workqueue
      users which request explicit affinity, and, fortunately, none seems to
      be crazy enough to exploit parallel execution of the same work item.
      
      This adds an additional busy_hash lookup if the work item was
      previously queued on a different CPU.  This shouldn't be noticeable
      under any sane workload.  Work item queueing isn't a very
      high-frequency operation and they don't jump across CPUs all the time.
      In a micro benchmark to exaggerate this difference - measuring the
      time it takes for two work items to repeatedly jump between two CPUs a
      number (10M) of times with busy_hash table densely populated, the
      difference was around 3%.
      
      While the overhead is measurable, it is only visible in pathological
      cases and the difference isn't huge.  This change brings much needed
      sanity to workqueue and makes its behavior consistent with timer.  I
      think this is the right tradeoff to make.
      
      This enables significant simplification of workqueue API.
      Simplification patches will follow.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      dbf2576e
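The queueing rule described in this commit can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct model_work`, `model_pick_cpu()`, `model_queue_work_on()`, `NR_CPUS_MODEL` and `CPU_NONE` are hypothetical names, not kernel code, and the real implementation finds the executing CPU through a busy_hash lookup rather than a stored field.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 4      /* hypothetical CPU count for this model */
#define CPU_NONE      (-1)   /* work item is not running / not queued */

/* Toy stand-in for a work item; not the kernel's struct work_struct. */
struct model_work {
	int running_cpu;   /* CPU currently executing it, or CPU_NONE */
	int queued_cpu;    /* CPU it is queued on, or CPU_NONE */
};

/*
 * Pick the CPU to queue on.  If the item is executing somewhere, that
 * CPU overrides the requested one, so the item can never be executing
 * on two CPUs at the same time.
 */
static int model_pick_cpu(const struct model_work *work, int req_cpu)
{
	assert(req_cpu >= 0 && req_cpu < NR_CPUS_MODEL);
	if (work->running_cpu != CPU_NONE)
		return work->running_cpu;
	return req_cpu;
}

/* Queue the item; returns false if it was already queued. */
static bool model_queue_work_on(struct model_work *work, int req_cpu)
{
	if (work->queued_cpu != CPU_NONE)
		return false;
	work->queued_cpu = model_pick_cpu(work, req_cpu);
	return true;
}
```

In this model, a request for CPU 0 while the item runs on CPU 2 ends up queued on CPU 2, which mirrors the affinity override mentioned in the commit message.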
    • workqueue: fix checkpatch issues · 044c782c
      Valentin Ilie committed
      Fixed some checkpatch warnings.
      
      tj: adapted to wq/for-3.7 and massaged pr_xxx() format strings a bit.
      Signed-off-by: Valentin Ilie <valentin.ilie@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      LKML-Reference: <1345326762-21747-1-git-send-email-valentin.ilie@gmail.com>
      044c782c
  2. 17 Aug, 2012 6 commits
    • workqueue: use system_highpri_wq for unbind_work · 7635d2fd
      Joonsoo Kim committed
      To speed up CPU-down processing, use system_highpri_wq.  Its workers
      run at a higher scheduling priority than those of system_wq and it is
      not contended by other normal work items on this CPU, so work queued
      on it is processed faster than on system_wq.
      
      tj: CPU up/downs care quite a bit about latency these days.  This
          shouldn't hurt anything and makes sense.
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      7635d2fd
    • workqueue: use system_highpri_wq for highpri workers in rebind_workers() · e2b6a6d5
      Joonsoo Kim committed
      In rebind_workers(), a rebind work item is inserted for each busy
      worker so that it re-binds to the CPU.  Currently only system_wq is
      used for this, which can lead to errors because cwq->pool and
      worker->pool may not match.

      To prevent this, use system_highpri_wq for highpri workers so that
      the two match.  This patch implements it.
      
      tj: Rephrased comment a bit.
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e2b6a6d5
    • workqueue: introduce system_highpri_wq · 1aabe902
      Joonsoo Kim committed
      Commit 3270476a ('workqueue: reimplement
      WQ_HIGHPRI using a separate worker_pool') introduced a separate worker
      pool for HIGHPRI.  A busy worker of a gcwq can therefore be either a
      normal worker or a highpri worker.  rebind_workers() does not take
      this difference into account and uses system_wq even for highpri
      workers, which creates a mismatch between cwq->pool and worker->pool.

      This does not cause an error in the current implementation, but it
      could in the future.  Introduce system_highpri_wq so that
      rebind_workers() can use the proper cwq for highpri workers.  The
      following patch fixes this issue properly.
      
      tj: Even apart from rebinding, having system_highpri_wq generally
          makes sense.
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      1aabe902
    • workqueue: change value of lcpu in __queue_delayed_work_on() · e42986de
      Joonsoo Kim committed
      We assign a CPU id to the work struct's data field in
      __queue_delayed_work_on().  In the current implementation, when a work
      item comes in for the first time, the id of the currently running CPU
      is assigned.  If __queue_delayed_work_on() is called for CPU A while
      running on CPU B, __queue_work() invoked from delayed_work_timer_fn()
      takes the following sub-optimal path in the WQ_NON_REENTRANT case.
      
      	gcwq = get_gcwq(cpu);
      	if (wq->flags & WQ_NON_REENTRANT &&
      		(last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {
      
      Change lcpu to @cpu, and fall back to the local CPU only if lcpu is
      WORK_CPU_UNBOUND.  This is sufficient to avoid the sub-optimal path.
      
      tj: Slightly rephrased the comment.
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e42986de
    • workqueue: correct req_cpu in trace_workqueue_queue_work() · b75cac93
      Joonsoo Kim committed
      When tracing workqueue_queue_work(), the requested CPU is recorded.
      However, if !(@wq->flags & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND,
      the requested CPU is changed to the local CPU.
      In the @wq->flags & WQ_UNBOUND case no such change occurs, so it is
      reasonable to correct this.

      Use a temporary local variable to store the requested CPU.
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b75cac93
    • workqueue: use enum value to set array size of pools in gcwq · 330dad5b
      Joonsoo Kim committed
      Commit 3270476a ('workqueue: reimplement
      WQ_HIGHPRI using a separate worker_pool') introduced a separate
      worker_pool for HIGHPRI.  Although the NR_WORKER_POOLS enum value
      represents the number of pools, the definition of the worker_pool
      array in gcwq doesn't use it.  Using it makes the code more robust and
      prevents future mistakes, so change the code to use this enum value.
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      330dad5b
  3. 14 Aug, 2012 2 commits
    • workqueue: add missing wmb() in clear_work_data() · 23657bb1
      Tejun Heo committed
      Any operation which clears PENDING should be preceded by a wmb to
      guarantee that the next PENDING owner sees all the changes made before
      PENDING release.
      
      There are only two places where PENDING is cleared -
      set_work_cpu_and_clear_pending() and clear_work_data().  The caller of
      the former already does smp_wmb() but the latter doesn't have any.
      
      Move the wmb above set_work_cpu_and_clear_pending() into it and add
      one to clear_work_data().
      
      There hasn't been any report related to this issue, and, given how
      clear_work_data() is used, it is extremely unlikely to have caused any
      actual problems on any architecture.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      23657bb1
    • workqueue: fix CPU binding of flush_delayed_work[_sync]() · 1265057f
      Tejun Heo committed
      delayed_work encodes the workqueue to use and the last CPU in
      delayed_work->work.data while it's on timer.  The target CPU is
      implicitly recorded as the CPU the timer is queued on and
      delayed_work_timer_fn() queues delayed_work->work to the CPU it is
      running on.
      
      Unfortunately, this leaves flush_delayed_work[_sync]() no way to find
      out which CPU the delayed_work was queued for when they try to
      re-queue after killing the timer.  Currently, it chooses the local CPU
      flush is running on.  This can unexpectedly move a delayed_work queued
      on a specific CPU to another CPU and lead to subtle errors.
      
      There isn't much point in trying to save several bytes in struct
      delayed_work, which is already close to a hundred bytes on 64bit with
      all debug options turned off.  This patch adds delayed_work->cpu to
      remember the CPU it's queued for.
      
      Note that if the timer is migrated during CPU down, the work item
      could be queued to the downed global_cwq after this change.  As a
      detached global_cwq behaves like an unbound one, this doesn't change
      much for the delayed_work.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      1265057f
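The effect of remembering the target CPU can be illustrated with a toy model. This is a sketch under assumptions, not the kernel code: `struct cpu_dwork`, `model_queue_delayed_on()`, `model_flush_requeue()` and `MODEL_CPU_UNBOUND` are hypothetical names, and the timer handling is reduced to a single assignment.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_CPU_UNBOUND (-1)   /* hypothetical "no specific CPU" marker */

/* Toy stand-in for a delayed work item with a remembered target CPU. */
struct cpu_dwork {
	int cpu;         /* CPU the item was queued for */
	int queued_on;   /* CPU it ends up on, MODEL_CPU_UNBOUND until queued */
};

/* Arm the item for @cpu; only the target is recorded in this model. */
static void model_queue_delayed_on(struct cpu_dwork *dw, int cpu)
{
	assert(dw != NULL);
	dw->cpu = cpu;
	dw->queued_on = MODEL_CPU_UNBOUND;
}

/*
 * Flush path: the timer is killed and the item is re-queued at once.
 * The remembered CPU is used; the CPU running the flush is only a
 * fallback when no specific CPU was requested.
 */
static int model_flush_requeue(struct cpu_dwork *dw, int local_cpu)
{
	assert(dw != NULL);
	dw->queued_on = (dw->cpu == MODEL_CPU_UNBOUND) ? local_cpu : dw->cpu;
	return dw->queued_on;
}
```

Without the stored `cpu` field, the flush path in this model would have nothing but `local_cpu` to go on, which is the unexpected migration the commit message describes.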
  4. 04 Aug, 2012 13 commits
    • workqueue: implement mod_delayed_work[_on]() · 8376fe22
      Tejun Heo committed
      Workqueue was lacking a mechanism to modify the timeout of an already
      pending delayed_work.  delayed_work users have been working around
      this using several methods - using an explicit timer + work item,
      messing directly with delayed_work->timer, and canceling before
      re-queueing, all of which are error-prone and/or ugly.
      
      This patch implements mod_delayed_work[_on]() which behaves similarly
      to mod_timer() - if the delayed_work is idle, it's queued with the
      given delay; otherwise, its timeout is modified to the new value.
      Zero @delay guarantees immediate execution.
      
      v2: Updated to reflect try_to_grab_pending() changes.  Now safe to be
          called from bh context.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      8376fe22
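The mod_timer()-like behavior described above can be modeled in userspace as follows. This is a minimal sketch, not the kernel implementation: `struct model_dwork` and `model_mod_delayed_work()` are hypothetical, time is an abstract tick counter, and the return convention (false when the item was idle, true when it was already pending) is an assumption of the model rather than something stated in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a delayed work item. */
struct model_dwork {
	bool timer_pending;       /* waiting on its timer */
	bool queued;              /* handed over for immediate execution */
	unsigned long expires;    /* absolute expiry time in model ticks */
};

/*
 * If the item is idle it is queued with @delay; otherwise its timeout
 * is replaced by the new value.  A zero @delay queues it immediately.
 * Returns whether the item was already pending (model assumption).
 */
static bool model_mod_delayed_work(struct model_dwork *dw,
				   unsigned long now, unsigned long delay)
{
	bool was_pending;

	assert(dw != NULL);
	was_pending = dw->timer_pending || dw->queued;

	/* Grab the item: cancel the timer or take it off the queue. */
	dw->timer_pending = false;
	dw->queued = false;

	if (delay == 0) {
		dw->queued = true;            /* immediate execution */
	} else {
		dw->timer_pending = true;     /* (re)arm with the new timeout */
		dw->expires = now + delay;
	}
	return was_pending;
}
```

The model collapses the grab and the re-queue into one step; the real code needs try_to_grab_pending() precisely because those two steps can race with other queuers.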
    • workqueue: mark a work item being canceled as such · bbb68dfa
      Tejun Heo committed
      There can be two reasons try_to_grab_pending() can fail with -EAGAIN.
      One is when someone else is queueing or dequeueing the work item.  With
      the previous patches, it is guaranteed that PENDING and queued state
      will soon agree making it safe to busy-retry in this case.
      
      The other is if multiple __cancel_work_timer() invocations are racing
      one another.  __cancel_work_timer() grabs PENDING and then waits for
      running instances of the target work item on all CPUs while holding
      PENDING and !queued.  try_to_grab_pending() invoked from another task
      will keep returning -EAGAIN while the current owner is waiting.
      
      Not distinguishing the two cases is okay because __cancel_work_timer()
      is the only user of try_to_grab_pending() and it invokes
      wait_on_work() whenever grabbing fails.  For the first case, busy
      looping should be fine but wait_on_work() doesn't cause any critical
      problem.  For the latter case, the new contender usually waits for the
      same condition as the current owner, so no unnecessarily extended
      busy-looping happens.  Combined, these make __cancel_work_timer()
      technically correct even without irq protection while grabbing PENDING
      or distinguishing the two different cases.
      
      While the current code is technically correct, not distinguishing the
      two cases makes it difficult to use try_to_grab_pending() for other
      purposes than canceling because it's impossible to tell whether it's
      safe to busy-retry grabbing.
      
      This patch adds a mechanism to mark a work item being canceled.
      try_to_grab_pending() now disables irq on success and returns -EAGAIN
      to indicate that grabbing failed but PENDING and queued states are
      gonna agree soon and it's safe to busy-loop.  It returns -ENOENT if
      the work item is being canceled and it may stay PENDING && !queued for
      arbitrary amount of time.
      
      __cancel_work_timer() is modified to mark the work canceling with
      WORK_OFFQ_CANCELING after grabbing PENDING, thus making
      try_to_grab_pending() fail with -ENOENT instead of -EAGAIN.  Also, it
      invokes wait_on_work() iff grabbing failed with -ENOENT.  This isn't
      necessary for correctness but makes it consistent with other future
      users of try_to_grab_pending().
      
      v2: try_to_grab_pending() was testing preempt_count() to ensure that
          the caller has disabled preemption.  This triggers spuriously if
          !CONFIG_PREEMPT_COUNT.  Use preemptible() instead.  Reported by
          Fengguang Wu.
      
      v3: Updated so that try_to_grab_pending() disables irq on success
          rather than requiring preemption disabled by the caller.  This
          makes busy-looping easier and will allow try_to_grab_pending() to
          be used from bh/irq contexts.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      bbb68dfa
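The distinction between the two failure codes can be sketched with a small state machine. This is a userspace illustration only: `enum model_state`, `struct grab_work`, `model_try_to_grab_pending()` and `model_grab_with_retry()` are hypothetical, the success values 0 (item was idle) and 1 (item was pending and got stolen) are assumptions of the model, and irq disabling is not represented at all.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum model_state {
	MODEL_IDLE,        /* not pending */
	MODEL_QUEUED,      /* PENDING and queued */
	MODEL_IN_TRANSIT,  /* PENDING && !queued, being (de)queued by someone */
	MODEL_CANCELING    /* PENDING && !queued, held by a canceler */
};

struct grab_work {
	enum model_state state;
	int transit_ticks;   /* polls left until a transit settles */
};

static int model_try_to_grab_pending(struct grab_work *work)
{
	assert(work != NULL);
	switch (work->state) {
	case MODEL_IDLE:
		return 0;                     /* claimed an idle item */
	case MODEL_QUEUED:
		work->state = MODEL_IDLE;     /* stolen off the queue */
		return 1;
	case MODEL_IN_TRANSIT:
		if (--work->transit_ticks <= 0)
			work->state = MODEL_QUEUED;
		return -EAGAIN;               /* states agree soon: retry */
	default:
		return -ENOENT;               /* being canceled: do not spin */
	}
}

/* Busy-retry only on -EAGAIN; -ENOENT goes back to the caller. */
static int model_grab_with_retry(struct grab_work *work)
{
	int ret;

	do {
		ret = model_try_to_grab_pending(work);
	} while (ret == -EAGAIN);
	return ret;
}
```

A caller that receives -ENOENT would wait for the canceler to finish instead of spinning, since the item may stay PENDING && !queued for an arbitrary amount of time.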
    • workqueue: reorganize try_to_grab_pending() and __cancel_timer_work() · 36e227d2
      Tejun Heo committed
      * Use bool @is_dwork instead of @timer and let try_to_grab_pending()
        use to_delayed_work() to determine the delayed_work address.
      
      * Move timer handling from __cancel_work_timer() to
        try_to_grab_pending().
      
      * Make try_to_grab_pending() use -EAGAIN instead of -1 for
        busy-looping and drop the ret local variable.
      
      * Add proper function comment to try_to_grab_pending().
      
      This makes the code a bit easier to understand and will ease further
      changes.  This patch doesn't make any functional change.
      
      v2: Use @is_dwork instead of @timer.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      36e227d2
    • workqueue: factor out __queue_delayed_work() from queue_delayed_work_on() · 7beb2edf
      Tejun Heo committed
      This is to prepare for mod_delayed_work[_on]() and doesn't cause any
      functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      7beb2edf
    • workqueue: introduce WORK_OFFQ_FLAG_* · b5490077
      Tejun Heo committed
      Low WORK_STRUCT_FLAG_BITS bits of work_struct->data contain
      WORK_STRUCT_FLAG_* and flush color.  If the work item is queued, the
      rest point to the cpu_workqueue with WORK_STRUCT_CWQ set; otherwise,
      WORK_STRUCT_CWQ is clear and the bits contain the last CPU number -
      either a real CPU number or one of WORK_CPU_*.
      
      Scheduled addition of mod_delayed_work[_on]() requires an additional
      flag, which is used only while a work item is off queue.  There are
      more than enough bits to represent off-queue CPU number on both 32 and
      64bits.  This patch introduces WORK_OFFQ_FLAG_* which occupy the lower
      part of the @work->data high bits while off queue.  This patch doesn't
      define any actual OFFQ flag yet.
      
      Off-queue CPU number is now shifted by WORK_OFFQ_CPU_SHIFT, which adds
      the number of bits used by OFFQ flags to WORK_STRUCT_FLAG_SHIFT, to
      make room for OFFQ flags.
      
      To avoid shift width warning with large WORK_OFFQ_FLAG_BITS, ulong
      cast is added to WORK_STRUCT_NO_CPU and, just in case, BUILD_BUG_ON()
      to check that there are enough bits to accommodate off-queue CPU number
      is added.
      
      This patch doesn't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b5490077
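The off-queue layout described in this commit can be illustrated with a small packing sketch. The bit widths and every `MODEL_*` / `model_*` name below are hypothetical and chosen only for illustration; the kernel's actual widths depend on its configuration.

```c
#include <assert.h>

/* Hypothetical widths, not the kernel's values. */
#define MODEL_FLAG_BITS       8UL   /* low bits: flags and flush color */
#define MODEL_OFFQ_FLAG_BITS  1UL   /* off-queue flags sit just above them */
#define MODEL_OFFQ_FLAG_BASE  MODEL_FLAG_BITS
#define MODEL_OFFQ_CPU_SHIFT  (MODEL_OFFQ_FLAG_BASE + MODEL_OFFQ_FLAG_BITS)

#define MODEL_FLAG_MASK       ((1UL << MODEL_FLAG_BITS) - 1)
#define MODEL_OFFQ_FLAG_MASK \
	(((1UL << MODEL_OFFQ_FLAG_BITS) - 1) << MODEL_OFFQ_FLAG_BASE)

/* Pack an off-queue data word: CPU number, off-queue flags, low flags. */
static unsigned long model_pack_offq(unsigned long cpu,
				     unsigned long offq_flags,
				     unsigned long flags)
{
	assert((flags & ~MODEL_FLAG_MASK) == 0);
	assert(offq_flags < (1UL << MODEL_OFFQ_FLAG_BITS));
	return (cpu << MODEL_OFFQ_CPU_SHIFT) |
	       (offq_flags << MODEL_OFFQ_FLAG_BASE) | flags;
}

/* The CPU number lives above both flag regions. */
static unsigned long model_offq_cpu(unsigned long data)
{
	return data >> MODEL_OFFQ_CPU_SHIFT;
}

/* Extract only the off-queue flag region. */
static unsigned long model_offq_flags(unsigned long data)
{
	return (data & MODEL_OFFQ_FLAG_MASK) >> MODEL_OFFQ_FLAG_BASE;
}
```

Shifting the CPU number by the sum of both flag widths is what makes room for the off-queue flags, as the commit message explains for WORK_OFFQ_CPU_SHIFT.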
    • workqueue: move try_to_grab_pending() upwards · bf4ede01
      Tejun Heo committed
      try_to_grab_pending() will be used by to-be-implemented
      mod_delayed_work[_on]().  Move try_to_grab_pending() and related
      functions above queueing functions.
      
      This patch only moves functions around.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      bf4ede01
    • workqueue: fix zero @delay handling of queue_delayed_work_on() · 715f1300
      Tejun Heo committed
      If @delay is zero and the delayed_work is idle, queue_delayed_work()
      queues it for immediate execution; however, queue_delayed_work_on()
      lacks this logic and always goes through timer regardless of @delay.
      
      This patch moves 0 @delay handling logic from queue_delayed_work() to
      queue_delayed_work_on() so that both functions behave the same.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      715f1300
    • workqueue: unify local CPU queueing handling · 57469821
      Tejun Heo committed
      Queueing functions have been using different methods to determine the
      local CPU.
      
      * queue_work() superfluously uses get/put_cpu() to acquire and hold the
        local CPU across queue_work_on().
      
      * delayed_work_timer_fn() uses smp_processor_id().
      
      * queue_delayed_work() calls queue_delayed_work_on() with -1 @cpu
        which is interpreted as the local CPU.
      
      * flush_delayed_work[_sync]() were using raw_smp_processor_id().
      
      * __queue_work() interprets %WORK_CPU_UNBOUND as local CPU if the
        target workqueue is bound one but nobody uses this.
      
      This patch converts all functions to uniformly use %WORK_CPU_UNBOUND
      to indicate local CPU and use the local binding feature of
      __queue_work().  unlikely() is dropped from %WORK_CPU_UNBOUND handling
      in __queue_work().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      57469821
    • workqueue: set delayed_work->timer function on initialization · d8e794df
      Tejun Heo committed
      delayed_work->timer.function is currently initialized during
      queue_delayed_work_on().  Export delayed_work_timer_fn() and set
      delayed_work timer function during delayed_work initialization
      together with other fields.
      
      This ensures the timer function is always valid on an initialized
      delayed_work.  This is to help mod_delayed_work() implementation.
      
      To detect delayed_work users which diddle with the internal timer,
      trigger WARN if timer function doesn't match on queue.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      d8e794df
    • workqueue: disable irq while manipulating PENDING · 8930caba
      Tejun Heo committed
      Queueing operations use WORK_STRUCT_PENDING_BIT to synchronize access
      to the target work item.  They first try to claim the bit and proceed
      with queueing only after that succeeds and there's a window between
      PENDING being set and the actual queueing where the task can be
      interrupted or preempted.
      
      There's also a similar window in process_one_work() when clearing
      PENDING.  A work item is dequeued, gcwq->lock is released and then
      PENDING is cleared and the worker might get interrupted or preempted
      between releasing gcwq->lock and clearing PENDING.
      
      cancel[_delayed]_work_sync() tries to claim or steal PENDING.  The
      function assumes that a work item with PENDING is either queued or in
      the process of being [de]queued.  In the latter case, it busy-loops
      until either the work item loses PENDING or is queued.  If canceling
      coincides with the above described interrupts or preemptions, the
      canceling task will busy-loop while the queueing or executing task is
      preempted.
      
      This patch keeps irq disabled across claiming PENDING and actual
      queueing and moves PENDING clearing in process_one_work() inside
      gcwq->lock so that busy looping from PENDING && !queued doesn't wait
      for interrupted/preempted tasks.  Note that, in process_one_work(),
      setting last CPU and clearing PENDING got merged into single
      operation.
      
      This removes possible long busy-loops and will allow using
      try_to_grab_pending() from bh and irq contexts.
      
      v2: __queue_work() was testing preempt_count() to ensure that the
          caller has disabled preemption.  This triggers spuriously if
          !CONFIG_PREEMPT_COUNT.  Use preemptible() instead.  Reported by
          Fengguang Wu.
      
      v3: Disable irq instead of preemption.  IRQ will be disabled while
          grabbing gcwq->lock later anyway and this allows using
          try_to_grab_pending() from bh and irq contexts.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      8930caba
    • workqueue: add missing smp_wmb() in process_one_work() · 959d1af8
      Tejun Heo committed
      WORK_STRUCT_PENDING is used to claim ownership of a work item and
      process_one_work() releases it before starting execution.  When
      someone else grabs PENDING, all pre-release updates to the work item
      should be visible and all updates made by the new owner should happen
      afterwards.
      
      Grabbing PENDING uses test_and_set_bit() and thus has a full barrier;
      however, clearing doesn't have a matching wmb.  Given the preceding
      spin_unlock and use of clear_bit, I don't believe this can be a
      problem on an actual machine and there hasn't been any related report
      but it still is theoretically possible for clear_pending to permeate
      upwards and happen before work->entry update.
      
      Add an explicit smp_wmb() before work_clear_pending().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: stable@vger.kernel.org
      959d1af8
    • workqueue: make queueing functions return bool · d4283e93
      Tejun Heo committed
      All queueing functions return 1 on success, 0 if the work item was
      already pending.  Update them to return bool instead.  This signifies
      better that they don't return 0 / -errno.
      
      This is cleanup and doesn't cause any functional difference.
      
      While at it, fix comment opening for schedule_work_on().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      d4283e93
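The bool return convention can be sketched with C11 atomics standing in for the kernel's bit operations. This is a userspace analogy under assumptions: `struct pending_work`, `model_queue_work()`, `model_clear_pending()` and `MODEL_PENDING` are hypothetical names, and the release ordering on the clearing side is only a rough counterpart of the write barrier discussed in the neighboring commits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PENDING 1UL   /* hypothetical stand-in for the PENDING bit */

/* Toy stand-in for a work item's data word. */
struct pending_work {
	atomic_ulong data;
};

/*
 * Try to claim PENDING.  Returns true if this caller claimed it (and
 * would go on to queue the item), false if it was already pending.
 * A bool makes it clear that this is not a 0 / -errno style return.
 */
static bool model_queue_work(struct pending_work *work)
{
	unsigned long old;

	assert(work != NULL);
	old = atomic_fetch_or(&work->data, MODEL_PENDING);
	return !(old & MODEL_PENDING);
}

/*
 * Release PENDING.  Release ordering keeps earlier updates to the item
 * visible to whoever claims the bit next.
 */
static void model_clear_pending(struct pending_work *work)
{
	assert(work != NULL);
	atomic_fetch_and_explicit(&work->data, ~MODEL_PENDING,
				  memory_order_release);
}
```

Callers in this model can write `if (!model_queue_work(&w))` to detect an already pending item without comparing against 0 or 1.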
    • workqueue: reorder queueing functions so that _on() variants are on top · 0a13c00e
      Tejun Heo committed
      Currently, queue/schedule[_delayed]_work_on() are located below the
      counterpart without the _on postfix even though the latter is usually
      implemented using the former.  Swap them.
      
      This is cleanup and doesn't cause any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      0a13c00e
  5. 23 Jul, 2012 1 commit
    • workqueue: fix spurious CPU locality WARN from process_one_work() · 6fec10a1
      Tejun Heo committed
      25511a47 "workqueue: reimplement CPU online rebinding to handle idle
      workers" added CPU locality sanity check in process_one_work().  It
      triggers if a worker is executing on a different CPU without UNBOUND
      or REBIND set.
      
      This works for all normal workers but rescuers can trigger this
      spuriously when they're serving the unbound or a disassociated
      global_cwq - rescuers don't have either flag set and thus its
      gcwq->cpu can be a different value including %WORK_CPU_UNBOUND.
      
      Fix it by additionally testing %GCWQ_DISASSOCIATED.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      LKML-Reference: <20120721213656.GA7783@linux.vnet.ibm.com>
      6fec10a1
  6. 18 Jul, 2012 9 commits
    • workqueue: simplify CPU hotplug code · 8db25e78
      Tejun Heo committed
      With trustee gone, CPU hotplug code can be simplified.
      
      * gcwq_claim/release_management() now grab and release gcwq lock too
        respectively and gained _and_lock and _and_unlock postfixes.
      
      * All CPU hotplug logic was implemented in workqueue_cpu_callback()
        which was called by workqueue_cpu_up/down_callback() for the correct
        priority.  This was because up and down paths shared a lot of logic,
        which is no longer true.  Remove workqueue_cpu_callback() and move
        all hotplug logic into the two actual callbacks.
      
      This patch doesn't make any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
      8db25e78
    • workqueue: remove CPU offline trustee · 628c78e7
      Tejun Heo committed
      With the previous changes, a disassociated global_cwq now can run as
      an unbound one on its own - it can create workers as necessary to
      drain remaining works after the CPU has been brought down and manage
      the number of workers using the usual idle timer mechanism making
      trustee completely redundant except for the actual unbinding
      operation.
      
      This patch removes the trustee and lets a disassociated global_cwq
      manage itself.  Unbinding is moved to a work item (for CPU affinity)
      which is scheduled and flushed from CPU_DOWN_PREPARE.
      
      This patch moves nr_running clearing outside gcwq and manager locks to
      simplify the code.  As nr_running is unused at the point, this is
      safe.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
      628c78e7
    • workqueue: don't butcher idle workers on an offline CPU · 3ce63377
      Tejun Heo committed
      Currently, during CPU offlining, after all pending work items are
      drained, the trustee butchers all workers.  Also, on CPU onlining
      failure, workqueue_cpu_callback() ensures that the first idle worker
      is destroyed.  Combined, these guarantee that an offline CPU doesn't
      have any worker for it once all the lingering work items are finished.
      
      This guarantee isn't really necessary and makes CPU on/offlining more
      expensive than needs to be, especially for platforms which use CPU
      hotplug for powersaving.
      
      This patch removes idle worker butchering from the trustee, so that
      offline CPUs keep their idle workers, and lets a CPU which failed
      onlining keep the created first worker.  The first worker is created
      if the CPU doesn't have any during CPU_DOWN_PREPARE and started right
      away.  If onlining succeeds, the rebind_workers() call in CPU_ONLINE
      will rebind it like any other workers.  If onlining fails, the worker
      is left alone till the next try.
      
      This makes CPU hotplugs cheaper by allowing global_cwqs to keep
      workers across them and simplifies code.
      
      Note that trustee doesn't re-arm idle timer when it's done and thus
      the disassociated global_cwq will keep all workers until it comes back
      online.  This will be improved by further patches.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
      3ce63377
    • workqueue: reimplement CPU online rebinding to handle idle workers · 25511a47
      Tejun Heo committed
      Currently, if there are workers left when a CPU is being brought back
      online, the trustee kills all idle workers and schedules rebind_work
      so that they re-bind to the CPU after the currently executing work is
      finished.  This works for busy workers because concurrency management
      doesn't try to wake them up from scheduler callbacks, which require
      the target task to be on the local run queue.  The busy worker bumps
      concurrency counter appropriately as it clears WORKER_UNBOUND from the
      rebind work item and it's bound to the CPU before returning to the
      idle state.
      
      To reduce CPU on/offlining overhead (as many embedded systems use it
      for powersaving) and simplify the code path, workqueue is planned to
      be modified to retain idle workers across CPU on/offlining.  This
      patch reimplements CPU online rebinding such that it can also handle
      idle workers.
      
      As noted earlier, due to the local wakeup requirement, rebinding idle
      workers is tricky.  All idle workers must be re-bound before scheduler
      callbacks are enabled.  This is achieved by interlocking idle
      re-binding.  Idle workers are requested to re-bind and then hold until
      all idle re-binding is complete so that no bound worker starts
      executing work item.  Only after all idle workers are re-bound and
      parked, CPU_ONLINE proceeds to release them and queue rebind work item
      to busy workers thus guaranteeing scheduler callbacks aren't invoked
      until all idle workers are ready.
      
      worker_rebind_fn() is renamed to busy_worker_rebind_fn() and
      idle_worker_rebind() for idle workers is added.  Rebinding logic is
      moved to rebind_workers() and now called from CPU_ONLINE after
      flushing trustee.  While at it, add CPU sanity check in
      worker_thread().
      
      Note that now a worker may become idle or the manager between trustee
      release and rebinding during CPU_ONLINE.  As the previous patch
      updated create_worker() so that it can be used by regular manager
      while unbound and this patch implements idle re-binding, this is safe.
      
      This prepares for removal of trustee and keeping idle workers across
      CPU hotplugs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
      25511a47
    • T
      workqueue: drop @bind from create_worker() · bc2ae0f5
      Tejun Heo committed
      Currently, create_worker()'s callers are responsible for deciding
      whether the newly created worker should be bound to the associated CPU
      and create_worker() sets WORKER_UNBOUND only for the workers for the
      unbound global_cwq.  Creation during normal operation is always via
      maybe_create_worker() and @bind is true.  For workers created during
      hotplug, @bind is false.
      
      Normal operation path is planned to be used even while the CPU is
      going through hotplug operations or offline and this static decision
      won't work.
      
      Drop @bind from create_worker() and decide whether to bind by looking
      at GCWQ_DISASSOCIATED.  create_worker() will also set WORKER_UNBOUND
      automatically if disassociated.  To avoid flipping GCWQ_DISASSOCIATED
      while create_worker() is in progress, the flag is now allowed to be
      changed only while holding all manager_mutexes on the global_cwq.
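
As a rough illustration of the new decision, here is a minimal sketch of a create_worker() that takes no @bind argument and consults the gcwq flag instead. The classes, the `bound_cpu` field and the bit values are assumptions made for this example and do not match the kernel's definitions; locking and the unbound global_cwq case are left out.

```python
GCWQ_DISASSOCIATED = 1 << 2   # illustrative bit values, not the kernel's
WORKER_UNBOUND = 1 << 7


class GlobalCwq:
    """Toy stand-in for a per-cpu global_cwq."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.flags = 0


class Worker:
    def __init__(self):
        self.flags = 0
        self.bound_cpu = None   # hypothetical field: CPU the worker is bound to


def create_worker(gcwq):
    """Decide binding from the gcwq state rather than a caller-supplied flag."""
    worker = Worker()
    if gcwq.flags & GCWQ_DISASSOCIATED:
        worker.flags |= WORKER_UNBOUND      # CPU is away: run unbound
    else:
        worker.bound_cpu = gcwq.cpu         # normal operation: bind to the CPU
    return worker


gcwq = GlobalCwq(cpu=1)
online_worker = create_worker(gcwq)         # created while associated
gcwq.flags |= GCWQ_DISASSOCIATED
hotplug_worker = create_worker(gcwq)        # created while disassociated
```

The same call site now works in both states, which is what lets the normal operation path be reused during hotplug.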
      
      This requires that GCWQ_DISASSOCIATED is not cleared behind trustee's
      back.  CPU_ONLINE no longer clears DISASSOCIATED before flushing
      trustee, which clears DISASSOCIATED before rebinding remaining workers
      if asked to release.  For cases where trustee isn't around, CPU_ONLINE
      clears DISASSOCIATED after flushing trustee.  Also, now, first_idle
      has UNBOUND set on creation which is explicitly cleared by CPU_ONLINE
      while binding it.  These convolutions will soon be removed by further
      simplification of CPU hotplug path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
      bc2ae0f5
    • T
      workqueue: use mutex for global_cwq manager exclusion · 60373152
      Tejun Heo committed
      POOL_MANAGING_WORKERS is used to ensure that at most one worker takes
      the manager role at any given time on a given global_cwq.  Trustee
      later hitched onto it to assume the manager role, adding a blocking wait for the
      bit.  As trustee already needed a custom wait mechanism, waiting for
      MANAGING_WORKERS was rolled into the same mechanism.
      
      Trustee is scheduled to be removed.  This patch separates out
      MANAGING_WORKERS wait into a per-pool mutex.  Workers use
      mutex_trylock() to test for manager role and trustee uses mutex_lock()
      to claim manager roles.
      
      gcwq_claim/release_management() helpers are added to grab and release
      manager roles of all pools on a global_cwq.  gcwq_claim_management()
      always grabs pool manager mutexes in ascending pool index order and
      uses pool index as lockdep subclass.
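
The locking scheme can be sketched in user space as follows. This is a toy model, not the kernel code: `threading.Lock` stands in for the per-pool mutex, the helper names are hypothetical analogues of the ones mentioned above, and the pool count of two is an assumption. Lockdep subclasses have no equivalent here.

```python
import threading

NR_WORKER_POOLS = 2   # assumption for this sketch


class WorkerPool:
    """Toy stand-in for a worker_pool; only models the manager mutex."""

    def __init__(self, index):
        self.index = index
        self.manager_mutex = threading.Lock()


def try_become_manager(pool):
    """Worker path: non-blocking attempt, analogous to mutex_trylock()."""
    return pool.manager_mutex.acquire(blocking=False)


def claim_management(pools):
    """Trustee path: block on each mutex in ascending pool index order."""
    for pool in sorted(pools, key=lambda p: p.index):
        pool.manager_mutex.acquire()


def release_management(pools):
    """Give up the manager role of every pool."""
    for pool in pools:
        pool.manager_mutex.release()


pools = [WorkerPool(i) for i in range(NR_WORKER_POOLS)]
claim_management(pools)
worker_got_role = try_become_manager(pools[0])        # fails while claimed
release_management(pools)
worker_got_role_later = try_become_manager(pools[0])  # succeeds after release
pools[0].manager_mutex.release()
```

Taking the mutexes in one fixed order is the usual way to keep two claimers from deadlocking against each other.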
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
      60373152
    • T
      workqueue: ROGUE workers are UNBOUND workers · 403c821d
      Tejun Heo committed
      Currently, WORKER_UNBOUND is used to mark workers for the unbound
      global_cwq and WORKER_ROGUE is used to mark workers for disassociated
      per-cpu global_cwqs.  Both are used to make the marked worker skip
      concurrency management and the only place they make any difference is
      in worker_enter_idle() where WORKER_ROGUE is used to skip scheduling
      idle timer, which can easily be replaced with trustee state testing.
      
      This patch replaces WORKER_ROGUE with WORKER_UNBOUND and drops
      WORKER_ROGUE.  This is to prepare for removing trustee and handling
      disassociated global_cwqs as unbound.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
      403c821d
    • T
      workqueue: drop CPU_DYING notifier operation · f2d5a0ee
      Tejun Heo committed
      Workqueue used CPU_DYING notification to mark GCWQ_DISASSOCIATED.
      This was necessary because workqueue's CPU_DOWN_PREPARE happened
      before other DOWN_PREPARE notifiers and workqueue needed to stay
      associated across the rest of DOWN_PREPARE.
      
      After the previous patch, workqueue's DOWN_PREPARE happens after
      others and can set GCWQ_DISASSOCIATED directly.  Drop CPU_DYING and
      let the trustee set GCWQ_DISASSOCIATED after disabling concurrency
      management.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
      f2d5a0ee
    • T
      workqueue: perform cpu down operations from low priority cpu_notifier() · 65758202
      Tejun Heo committed
      Currently, all workqueue cpu hotplug operations run off
      CPU_PRI_WORKQUEUE which is higher than normal notifiers.  This is to
      ensure that workqueue is up and running while bringing up a CPU before
      other notifiers try to use workqueue on the CPU.
      
      Per-cpu workqueues are supposed to remain working and bound to the CPU
      for normal CPU_DOWN_PREPARE notifiers.  This holds mostly true even
      with workqueue offlining running with higher priority because
      workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which
      runs the per-cpu workqueue without concurrency management and without
      explicitly detaching the existing workers.
      
      However, if the trustee needs to create new workers, it creates
      unbound workers which may wander off to other CPUs while
      CPU_DOWN_PREPARE notifiers are in progress.  Furthermore, if the CPU
      down is cancelled, the per-CPU workqueue may end up with workers which
      aren't bound to the CPU.
      
      While reliably reproducible with a convoluted artificial test-case
      involving scheduling and flushing CPU burning work items from CPU down
      notifiers, this isn't very likely to happen in the wild, and, even
      when it happens, the effects are likely to be hidden by the following
      successful CPU down.
      
      Fix it by using different priorities for up and down notifiers - high
      priority for up operations and low priority for down operations.
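
The effect of splitting the priorities can be seen in a small model of a priority-ordered notifier chain. This is a sketch only: the chain class, the callback names and the numeric priorities are hypothetical and chosen just to show the resulting call order, with higher-priority callbacks running first.

```python
WQ_UP_PRI = 5       # hypothetical priority values for illustration
WQ_DOWN_PRI = -5
NORMAL_PRI = 0


class NotifierChain:
    """Toy notifier chain: callbacks run in descending priority order."""

    def __init__(self):
        self._entries = []

    def register(self, priority, callback):
        self._entries.append((priority, callback))
        self._entries.sort(key=lambda entry: entry[0], reverse=True)

    def call(self, event):
        """Return the names of the callbacks that handled the event, in order."""
        order = []
        for _, callback in self._entries:
            name = callback(event)
            if name is not None:
                order.append(name)
        return order


def workqueue_up(event):
    return "workqueue" if event == "CPU_ONLINE" else None


def workqueue_down(event):
    return "workqueue" if event == "CPU_DOWN_PREPARE" else None


def other_subsystem(event):
    return "other"


chain = NotifierChain()
chain.register(NORMAL_PRI, other_subsystem)
chain.register(WQ_UP_PRI, workqueue_up)
chain.register(WQ_DOWN_PRI, workqueue_down)

up_order = chain.call("CPU_ONLINE")          # workqueue first
down_order = chain.call("CPU_DOWN_PREPARE")  # workqueue last
```

With this ordering, other notifiers can rely on the workqueue both while a CPU comes up and throughout their own down-prepare handling.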
      
      Workqueue cpu hotplug operations will soon go through further cleanup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
      65758202
  7. 14 Jul, 2012 2 commits
    • T
      workqueue: reimplement WQ_HIGHPRI using a separate worker_pool · 3270476a
      Tejun Heo committed
      WQ_HIGHPRI was implemented by queueing highpri work items at the head
      of the global worklist.  Other than queueing at the head, they weren't
      handled differently; unfortunately, this could lead to execution
      latency of a few seconds on heavily loaded systems.
      
      Now that workqueue code has been updated to deal with multiple
      worker_pools per global_cwq, this patch reimplements WQ_HIGHPRI using
      a separate worker_pool.  NR_WORKER_POOLS is bumped to two and
      gcwq->pools[0] is used for normal pri work items and ->pools[1] for
      highpri.  Highpri workers get -20 nice level and have an 'H' suffix in
      their names.  Note that this change increases the number of kworkers
      per cpu.
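
A minimal sketch of the two-pool layout, assuming toy classes and an illustrative WQ_HIGHPRI bit value. The `kworker/<cpu>:<id>` name format and the nice level of 0 for normal workers are assumptions of this example; only the 'H' suffix and the -20 nice level come from the description above.

```python
from collections import deque

NR_WORKER_POOLS = 2
WQ_HIGHPRI = 1 << 4          # illustrative flag value, not the kernel's
HIGHPRI_NICE_LEVEL = -20


class WorkerPool:
    """Toy worker_pool with its own worklist."""

    def __init__(self, highpri):
        self.highpri = highpri
        self.worklist = deque()

    def nice_level(self):
        return HIGHPRI_NICE_LEVEL if self.highpri else 0

    def worker_name(self, cpu, worker_id):
        suffix = "H" if self.highpri else ""
        return f"kworker/{cpu}:{worker_id}{suffix}"


class GlobalCwq:
    def __init__(self, cpu):
        self.cpu = cpu
        # pools[0] serves normal work items, pools[1] serves highpri ones.
        self.pools = [WorkerPool(highpri=False), WorkerPool(highpri=True)]

    def queue_work(self, wq_flags, work):
        """Pick the pool from the workqueue flags and queue at the tail."""
        pool = self.pools[1 if wq_flags & WQ_HIGHPRI else 0]
        pool.worklist.append(work)
        return pool


gcwq = GlobalCwq(cpu=0)
normal_pool = gcwq.queue_work(0, "flush-cache")
highpri_pool = gcwq.queue_work(WQ_HIGHPRI, "urgent-io")
highpri_name = highpri_pool.worker_name(gcwq.cpu, 1)
```

Because highpri items no longer share a worklist with normal items, they do not wait behind workers that are busy with normal work.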
      
      POOL_HIGHPRI_PENDING, pool_determine_ins_pos() and highpri chain
      wakeup code in process_one_work() are no longer used and removed.
      
      This allows proper prioritization of highpri work items and removes
      high execution latency of highpri work items.
      
      v2: nr_running indexing bug in get_pool_nr_running() fixed.
      
      v3: Refreshed for the get_pool_nr_running() update in the previous
          patch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Josh Hunt <joshhunt00@gmail.com>
      LKML-Reference: <CAKA=qzaHqwZ8eqpLNFjxnO2fX-tgAOjmpvxgBFjv6dJeQaOW1w@mail.gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      3270476a
    • T
      workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool() · 4ce62e9e
      Tejun Heo committed
      Introduce NR_WORKER_POOLS and for_each_worker_pool() and convert code
      paths which need to manipulate all pools in a gcwq to use them.
      NR_WORKER_POOLS is currently one and for_each_worker_pool() iterates
      over only @gcwq->pool.
      
      Note that nr_running is a per-pool property; it is converted to an array
      with NR_WORKER_POOLS elements and renamed to pool_nr_running.  Note
      that get_pool_nr_running() currently assumes 0 index.  The next patch
      will make use of non-zero index.
      
      The changes in this patch are mechanical and don't cause any
      functional difference.  This is to prepare for multiple pools per
      gcwq.
      
      v2: nr_running indexing bug in get_pool_nr_running() fixed.
      
      v3: Pointer to array is stupid.  Don't use it in get_pool_nr_running()
          as suggested by Linus.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      4ce62e9e
  8. 13 Jul, 2012 4 commits
    • T
      workqueue: separate out worker_pool flags · 11ebea50
      Tejun Heo committed
      GCWQ_MANAGE_WORKERS, GCWQ_MANAGING_WORKERS and GCWQ_HIGHPRI_PENDING
      are per-pool properties.  Add worker_pool->flags and make the above
      three flags per-pool flags.
      
      The changes in this patch are mechanical and don't cause any
      functional difference.  This is to prepare for multiple pools per
      gcwq.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      11ebea50
    • T
      workqueue: use @pool instead of @gcwq or @cpu where applicable · 63d95a91
      Tejun Heo committed
      Modify all functions which deal with per-pool properties to pass
      around @pool instead of @gcwq or @cpu.
      
      The changes in this patch are mechanical and don't cause any
      functional difference.  This is to prepare for multiple pools per
      gcwq.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      63d95a91
    • T
      workqueue: factor out worker_pool from global_cwq · bd7bdd43
      Tejun Heo committed
      Move worklist and all worker management fields from global_cwq into
      the new struct worker_pool.  worker_pool points back to the containing
      gcwq.  worker and cpu_workqueue_struct are updated to point to
      worker_pool instead of gcwq too.
      
      This change is mechanical and doesn't introduce any functional
      difference other than rearranging of fields and an added level of
      indirection in some places.  This is to prepare for multiple pools per
      gcwq.
      
      v2: Comment typo fixes as suggested by Namhyung.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      bd7bdd43
    • T
      workqueue: don't use WQ_HIGHPRI for unbound workqueues · 974271c4
      Tejun Heo committed
      Unbound wqs aren't concurrency-managed and try to execute work items
      as soon as possible.  This is currently achieved by implicitly setting
      %WQ_HIGHPRI on all unbound workqueues; however, WQ_HIGHPRI
      implementation is about to be restructured and this usage won't be
      valid anymore.
      
      Add an explicit chain-wakeup path for unbound workqueues in
      process_one_work() instead of piggy backing on %WQ_HIGHPRI.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      974271c4