  1. 19 May 2015, 1 commit
    • workqueue: wq_pool_mutex protects the attrs-installation · 5b95e1af
      Committed by Lai Jiangshan
      Currently wq_pool_mutex doesn't protect the attrs-installation. As a result,
      ->unbound_attrs, ->numa_pwq_tbl[] and ->dfl_pwq can only be accessed under
      wq->mutex, which causes some inconvenience. For example, wq_update_unbound_numa()
      has to acquire wq->mutex before fetching wq->unbound_attrs->no_numa and the
      old_pwq.
      
      The attrs-installation is a short operation, so this change will not add any
      noticeable latency to other operations which also acquire wq_pool_mutex.
      
      The only attrs-installation code not yet covered is in apply_workqueue_attrs(),
      so this patch changes less code than comments.
      
      It also prepares for the next several patches, which read wq->unbound_attrs,
      wq->numa_pwq_tbl[] and wq->dfl_pwq with only wq_pool_mutex held (see the
      sketch after this entry).
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      5b95e1af
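
      A minimal sketch of the locking rule this change establishes: the installer
      holds both wq_pool_mutex and wq->mutex, so readers may rely on either lock.
      The helper name and body below are illustrative assumptions, not the exact
      kernel code.
      
      	/* Sketch: install new unbound attrs while holding both locks. */
      	static void install_unbound_attrs_sketch(struct workqueue_struct *wq,
      						 const struct workqueue_attrs *attrs)
      	{
      		lockdep_assert_held(&wq_pool_mutex);	/* writers hold wq_pool_mutex... */
      
      		mutex_lock(&wq->mutex);			/* ...and wq->mutex around the install */
      		copy_workqueue_attrs(wq->unbound_attrs, attrs);
      		mutex_unlock(&wq->mutex);
      	}
      	/* Readers of wq->unbound_attrs may hold either wq_pool_mutex or wq->mutex. */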
  2. 13 May 2015, 1 commit
  3. 11 May 2015, 1 commit
  4. 30 April 2015, 1 commit
    • workqueue: Allow modifying low level unbound workqueue cpumask · 042f7df1
      Committed by Lai Jiangshan
      Allow the low-level unbound workqueue cpumask to be modified through
      sysfs. This is done by traversing the entire workqueue list and calling
      apply_wqattrs_prepare() on each unbound workqueue with the new low-level
      mask. Only after all the preparations have succeeded are they committed
      together.
      
      Ordered workqueues are ignored by the low-level unbound workqueue
      cpumask for now; they will be handled in the near future.
      
      All (default and per-node) pwqs are mandatorily constrained by the
      low-level cpumask. If the user-configured cpumask doesn't overlap with
      the low-level cpumask, the low-level cpumask is used for the wq instead
      (see the sketch after this entry).
      
      The comment of wq_calc_node_cpumask() is updated to explicitly require
      that its first argument be the attrs of the default pwq.
      
      The default wq_unbound_cpumask is cpu_possible_mask.  The workqueue
      subsystem doesn't know the best default value; the system manager or
      another subsystem can set it when needed.
      
      Changes from V8:
        merged the code that calculates the attrs of the default pwq.
        minor changes to the code and comments for saving the user-configured attrs.
        removed an unnecessary list_del().
        minor update to the comment of wq_calc_node_cpumask().
        updated the comment of workqueue_set_unbound_cpumask().
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      042f7df1
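
      A minimal sketch of the mask-fallback behaviour described above, assuming a
      hypothetical helper name; the real code computes this inside the attrs/pwq
      preparation path.
      
      	/* Sketch: derive the effective cpumask for an unbound workqueue. */
      	static void wq_effective_cpumask_sketch(struct cpumask *effective,
      						const struct cpumask *user_mask)
      	{
      		/* constrain the user-configured mask by the low-level unbound mask */
      		if (!cpumask_and(effective, user_mask, wq_unbound_cpumask))
      			/* no overlap: fall back to the low-level cpumask */
      			cpumask_copy(effective, wq_unbound_cpumask);
      	}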
  5. 27 April 2015, 2 commits
    • workqueue: Create low-level unbound workqueues cpumask · b05a7928
      Committed by Frederic Weisbecker
      Create a cpumask that limits the affinity of all unbound workqueues.
      This cpumask is controlled through a file at the root of the workqueue
      sysfs directory.
      
      It works at a lower level than the per-WQ_SYSFS workqueue cpumask files,
      so the effective cpumask applied to a given unbound workqueue is the
      intersection of /sys/devices/virtual/workqueue/$WORKQUEUE/cpumask and
      the new /sys/devices/virtual/workqueue/cpumask file (the read side is
      sketched after this entry).
      
      This patch implements the basic infrastructure and the read interface.
      wq_unbound_cpumask is initially set to cpu_possible_mask.
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b05a7928
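
      A hedged sketch of what the read interface might look like: a device
      attribute that prints the global mask with the "%*pb" bitmap format.
      The function name and the exact locking are assumptions.
      
      	/* Sketch: sysfs read handler for the global unbound cpumask. */
      	static ssize_t wq_unbound_cpumask_show(struct device *dev,
      					       struct device_attribute *attr,
      					       char *buf)
      	{
      		int written;
      
      		mutex_lock(&wq_pool_mutex);
      		written = scnprintf(buf, PAGE_SIZE, "%*pb\n",
      				    cpumask_pr_args(wq_unbound_cpumask));
      		mutex_unlock(&wq_pool_mutex);
      
      		return written;
      	}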
    • workqueue: split apply_workqueue_attrs() into 3 stages · 2d5f0764
      Committed by Lai Jiangshan
      Currently apply_workqueue_attrs() performs both pwq allocation and pwq
      installation, so when multiple apply_workqueue_attrs() calls are batched as
      a transaction, we can't guarantee that the transaction succeeds or fails as
      a complete unit.
      
      To solve this, apply_workqueue_attrs() is split into three stages.
      The first stage does the preparation: allocating memory and pwqs.
      The second stage does the attrs installation and pwq installation.
      The third stage frees the allocated memory and the old or unused pwqs.
      
      As a result, a batch of apply_workqueue_attrs() calls can succeed or fail
      as a complete unit (see the sketch after this entry):
      	1) run the first stage for all the workqueues;
      	2) commit them all only when every preparation has succeeded.
      
      This patch prepares for the next patch ("Allow modifying low level
      unbound workqueue cpumask"), which applies attrs to multiple workqueues
      at once.
      
      The patch doesn't change functionality except for two minor adjustments:
      	1) free_unbound_pwq() on the error path is removed; the heavier
      	   put_pwq_unlocked() is used instead since the error path is rare.
      	   This simplifies the code.
      	2) the memory allocation is moved under wq_pool_mutex, which avoids
      	   the need for further splitting.
      
      tj: minor updates to comments.
      Suggested-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      2d5f0764
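
      A hedged sketch of the batched, all-or-nothing pattern the split enables.
      apply_wqattrs_prepare() is named in the messages above; the context
      structure, the commit/cleanup helper names and the omitted locking are
      simplified assumptions.
      
      	/* Sketch: prepare everything first, commit only if all succeed. */
      	static int apply_cpumask_to_all_unbound_sketch(void)
      	{
      		struct apply_wqattrs_ctx *ctx, *n;	/* per-wq prepared state */
      		struct workqueue_struct *wq;
      		LIST_HEAD(ctxs);
      		int ret = 0;
      
      		/* stage 1: prepare every unbound wq; nothing is published yet */
      		list_for_each_entry(wq, &workqueues, list) {
      			if (!(wq->flags & WQ_UNBOUND))
      				continue;
      			ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs);
      			if (!ctx) {
      				ret = -ENOMEM;
      				goto out;	/* abort: only cleanup runs below */
      			}
      			list_add_tail(&ctx->list, &ctxs);
      		}
      
      		/* stage 2: every preparation succeeded, commit them together */
      		list_for_each_entry(ctx, &ctxs, list)
      			apply_wqattrs_commit(ctx);	/* assumed helper name */
      out:
      		/* stage 3: free prepared state and old or unused pwqs */
      		list_for_each_entry_safe(ctx, n, &ctxs, list)
      			apply_wqattrs_cleanup(ctx);	/* assumed helper name */
      		return ret;
      	}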
  6. 06 April 2015, 1 commit
    • workqueue: Reorder sysfs code · 6ba94429
      Committed by Frederic Weisbecker
      The sysfs code usually belongs at the bottom of the file since it deals
      with high-level objects. In the workqueue code it's misplaced, so we
      would need to work around function references to let the sysfs code
      call APIs like apply_workqueue_attrs().
      
      Let's move that block further down the file, almost to the bottom.
      
      Also declare workqueue_sysfs_unregister() just before destroy_workqueue(),
      which references it.
      
      tj: Moved the workqueue_sysfs_unregister() forward declaration to where
          the other forward declarations are.
      Suggested-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      6ba94429
  7. 09 March 2015, 3 commits
    • workqueue: dump workqueues on sysrq-t · 3494fc30
      Committed by Tejun Heo
      Workqueues are used extensively throughout the kernel but sometimes
      it's difficult to debug stalls involving work items because visibility
      into their inner workings is fairly limited.  Although the sysrq-t task
      dump annotates each active worker task with information on the work
      item being executed, it is challenging to find out which work items
      are pending or delayed on which queues and how pools are being
      managed.
      
      This patch implements show_workqueue_state() which dumps all busy
      workqueues and pools and is called from the sysrq-t handler.  At the
      end of sysrq-t dump, something like the following is printed.
      
       Showing busy workqueues and worker pools:
       ...
       workqueue filler_wq: flags=0x0
         pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
           in-flight: 491:filler_workfn, 507:filler_workfn
         pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
           in-flight: 501:filler_workfn
           pending: filler_workfn
       ...
       workqueue test_wq: flags=0x8
         pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
           in-flight: 510(RESCUER):test_workfn BAR(69) BAR(500)
           delayed: test_workfn1 BAR(492), test_workfn2
       ...
       pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=2 manager: 137
       pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=3 manager: 469
       pool 3: cpus=1 node=0 flags=0x0 nice=-20 workers=2 idle: 16
       pool 8: cpus=0-3 flags=0x4 nice=0 workers=2 manager: 62
      
      The above shows that test_wq is executing test_workfn() on pid 510
      which is the rescuer and also that there are two tasks 69 and 500
      waiting for the work item to finish in flush_work().  As test_wq has
      max_active of 1, there are two work items for test_workfn1() and
      test_workfn2() which are delayed till the current work item is
      finished.  In addition, pid 492 is flushing test_workfn1().
      
      The work item for test_workfn() is being executed on pwq of pool 2
      which is the normal priority per-cpu pool for CPU 1.  The pool has
      three workers, two of which are executing filler_workfn() for
      filler_wq and the last one is assuming the manager role trying to
      create more workers.
      
      This extra workqueue state dump will hopefully help chase down hangs
      involving workqueues.
      
      v3: cpulist_pr_cont() replaced with "%*pbl" printf formatting.
      
      v2: As suggested by Andrew, minor formatting change in pr_cont_work(),
          printk()'s replaced with pr_info()'s, and cpumask printing now
          uses cpulist_pr_cont().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      CC: Ingo Molnar <mingo@redhat.com>
      3494fc30
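
      The v3 note above mentions the "%*pbl" printf format for cpumasks. A hedged
      sketch of how one "pool N:" summary line of the dump could be printed with
      it; the helper name is illustrative and the field selection follows the
      sample output above.
      
      	/* Sketch: print one "pool N: cpus=... nice=..." line of the dump. */
      	static void pr_pool_line_sketch(struct worker_pool *pool)
      	{
      		pr_info("pool %d: cpus=%*pbl", pool->id,
      			cpumask_pr_args(pool->attrs->cpumask));
      		if (pool->node != NUMA_NO_NODE)
      			pr_cont(" node=%d", pool->node);
      		pr_cont(" flags=0x%x nice=%d workers=%d\n",
      			pool->flags, pool->attrs->nice, pool->nr_workers);
      	}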
    • workqueue: keep track of the flushing task and pool manager · 2607d7a6
      Committed by Tejun Heo
      Add wq_barrier->task and worker_pool->manager to keep track of the
      flushing task and pool manager respectively.  These are purely
      informational and will be used to implement sysrq dump of workqueues.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      2607d7a6
    • workqueue: make the workqueues list RCU walkable · e2dca7ad
      Committed by Tejun Heo
      The workqueues list is protected by wq_pool_mutex, and a workqueue and
      its subordinate data structures are freed directly on destruction.  We
      want to add the ability to dump workqueues from a sysrq callback, which
      requires walking all workqueues without grabbing wq_pool_mutex.  This
      patch makes freeing of workqueues RCU protected and makes the
      workqueues list walkable while holding the RCU read lock.
      
      Note that pool_workqueues and pools are already sched-RCU protected.
      For consistency, workqueues are also protected with sched-RCU.
      
      While at it, reverse the workqueues list so that a workqueue created
      earlier comes first.  The order of the list isn't functionally
      significant, but this makes the planned sysrq dump list the system
      workqueues first (see the sketch after this entry).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e2dca7ad
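
      A hedged sketch of the kind of walk this enables, using sched-RCU as noted
      above; the dump body is elided and the function name is illustrative.
      
      	/* Sketch: walk all workqueues without grabbing wq_pool_mutex. */
      	static void show_workqueue_state_sketch(void)
      	{
      		struct workqueue_struct *wq;
      
      		rcu_read_lock_sched();	/* workqueues are sched-RCU protected */
      		list_for_each_entry_rcu(wq, &workqueues, list) {
      			/* ... dump wq state; must not sleep in this section ... */
      		}
      		rcu_read_unlock_sched();
      	}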
  8. 05 March 2015, 1 commit
    • workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE · 8603e1b3
      Committed by Tejun Heo
      cancel[_delayed]_work_sync() are implemented using
      __cancel_work_timer() which grabs the PENDING bit using
      try_to_grab_pending() and then flushes the work item with PENDING set
      to prevent the on-going execution of the work item from requeueing
      itself.
      
      try_to_grab_pending() can always grab the PENDING bit without blocking
      except when someone else is doing the above flushing during
      cancelation, in which case it returns -ENOENT.  __cancel_work_timer()
      currently handles this by invoking flush_work().  The assumption is
      that the completion of the work item is what the other canceling task
      is waiting for too, so waiting for the same condition and retrying
      should allow forward progress without excessive busy looping.
      
      Unfortunately, this doesn't work if preemption is disabled or the
      latter task has real time priority.  Let's say task A just got woken
      up from flush_work() by the completion of the target work item.  If,
      before task A starts executing, task B gets scheduled and invokes
      __cancel_work_timer() on the same work item, its try_to_grab_pending()
      will return -ENOENT as the work item is still being canceled by task A
      and flush_work() will also immediately return false as the work item
      is no longer executing.  This puts task B in a busy loop possibly
      preventing task A from executing and clearing the canceling state on
      the work item leading to a hang.
      
      task A			task B			worker
      
      						executing work
      __cancel_work_timer()
        try_to_grab_pending()
        set work CANCELING
        flush_work()
          block for work completion
      						completion, wakes up A
      			__cancel_work_timer()
      			while (forever) {
      			  try_to_grab_pending()
      			    -ENOENT as work is being canceled
      			  flush_work()
      			    false as work is no longer executing
      			}
      
      This patch removes the possible hang by updating __cancel_work_timer()
      to explicitly wait for clearing of CANCELING rather than invoking
      flush_work() after try_to_grab_pending() fails with -ENOENT.
      
      Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com
      
      v3: bit_waitqueue() can't be used for work items defined in vmalloc
          area.  Switched to custom wake function which matches the target
          work item and exclusive wait and wakeup.
      
      v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
          the target bit waitqueue has wait_bit_queue's on it.  Use
          DEFINE_WAIT_BIT() and __wake_up_bit() instead.  Reported by Tomeu
          Vizoso.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Rabin Vincent <rabin.vincent@axis.com>
      Cc: Tomeu Vizoso <tomeu.vizoso@gmail.com>
      Cc: stable@vger.kernel.org
      Tested-by: Jesper Nilsson <jesper.nilsson@axis.com>
      Tested-by: Rabin Vincent <rabin.vincent@axis.com>
      8603e1b3
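
      A hedged sketch of the fixed retry loop: after -ENOENT the task sleeps until
      CANCELING is cleared instead of calling flush_work().  The wait helper below
      is a hypothetical stand-in; per the v3 note above, the real code waits on a
      dedicated waitqueue with a custom wake function because work items may live
      in vmalloc'ed memory.
      
      	/* Sketch: core retry loop of __cancel_work_timer() after the fix. */
      	static bool cancel_work_sketch(struct work_struct *work, bool is_dwork)
      	{
      		unsigned long flags;
      		int ret;
      
      		do {
      			ret = try_to_grab_pending(work, is_dwork, &flags);
      			if (unlikely(ret == -ENOENT)) {
      				/*
      				 * Another task owns CANCELING.  Sleep until it
      				 * is cleared rather than busy-looping through
      				 * flush_work().
      				 */
      				wait_for_canceling_clear(work);	/* hypothetical helper */
      			}
      		} while (unlikely(ret < 0));
      
      		/* ... mark CANCELING, flush the work, clear the bits, wake waiters ... */
      		return ret;
      	}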
  9. 14 February 2015, 1 commit
  10. 17 January 2015, 1 commit
    • workqueue: fix subtle pool management issue which can stall whole worker_pool · 29187a9e
      Committed by Tejun Heo
      A worker_pool's forward progress is guaranteed by the fact that the
      last idle worker assumes the manager role to create more workers, and
      summons the rescuers if creating workers doesn't succeed in a timely
      manner, before proceeding to execute work items.
      
      This manager role is implemented in manage_workers(), which indicates
      whether the worker may proceed to work item execution with its return
      value.  This is necessary because multiple workers may contend for the
      manager role, and, if there already is a manager, others should
      proceed to work item execution.
      
      Unfortunately, the function also indicates that the worker may proceed
      to work item execution if need_to_create_worker() is false at the head
      of the function.  need_to_create_worker() tests the following
      conditions.
      
      	pending work items && !nr_running && !nr_idle
      
      The first and third conditions are protected by pool->lock and thus
      won't change while holding pool->lock; however, nr_running can change
      asynchronously as other workers block and resume, and while it's likely
      to be zero, as someone woke this worker up in the first place, some
      other workers could have become runnable in between, making it non-zero.
      
      If this happens, manage_workers() could return false even with zero
      nr_idle, making the worker, the last idle one, proceed to execute work
      items.  If then all workers of the pool end up blocking on a resource
      which can only be released by a work item which is pending on that
      pool, the whole pool can deadlock as there's no one to create more
      workers or summon the rescuers.
      
      This patch fixes the problem by removing the early exit condition from
      maybe_create_worker() and making manage_workers() return false iff
      there's already another manager, which ensures that the last worker
      doesn't start executing work items.
      
      We could leave the early exit condition alone and just ignore the return
      value, but the only reason it was put there is that manage_workers()
      used to perform both creation and destruction of workers and thus the
      function could be invoked while the pool was trying to reduce the number
      of workers.  Now that manage_workers() is called only when more workers
      are needed, the only cases in which the early exit condition triggers
      are rare race conditions, rendering it pointless (see the sketch after
      this entry).
      
      Tested with simulated workload and modified workqueue code which
      trigger the pool deadlock reliably without this patch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Eric Sandeen <sandeen@sandeen.net>
      Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: stable@vger.kernel.org
      29187a9e
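
      A hedged sketch of the post-fix shape of manage_workers(): it returns false
      only when another worker already holds the manager role.  The arbitration
      mutex name is an assumption based on that era of the code.
      
      	/* Sketch: manage_workers() returns false iff someone else is managing. */
      	static bool manage_workers_sketch(struct worker *worker)
      	{
      		struct worker_pool *pool = worker->pool;
      
      		if (!mutex_trylock(&pool->manager_arb))
      			return false;		/* another manager exists */
      
      		maybe_create_worker(pool);	/* no early-exit check any more */
      
      		mutex_unlock(&pool->manager_arb);
      		return true;	/* caller must re-check before running work items */
      	}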
  11. 09 December 2014, 2 commits
    • workqueue: allow rescuer thread to do more work. · 008847f6
      Committed by NeilBrown
      When there is serious memory pressure, all workers in a pool could be
      blocked, and a new thread cannot be created because it requires memory
      allocation.
      
      In this situation a WQ_MEM_RECLAIM workqueue will wake up the
      rescuer thread to do some work.
      
      The rescuer will only handle requests that are already on ->worklist.
      If max_requests is 1, that means it will handle a single request.
      
      The rescuer will be woken again in 100ms to handle another max_requests
      requests.
      
      I've seen a machine (running a 3.0 based "enterprise" kernel) with
      thousands of requests queued for xfslogd, which has a max_requests of
      1 and is needed for retiring all 'xfs' write requests.  When one of
      the worker pools gets into this state, it progresses extremely slowly
      and possibly never recovers (I only waited an hour or two).
      
      With this patch we leave a pool_workqueue on the mayday list
      until it is clearly no longer in need of assistance.  This allows
      all requests to be handled in a timely fashion.
      
      We keep each pool_workqueue on the mayday list until
      need_to_create_worker() is false and no work for this workqueue is
      found in the pool (see the sketch after this entry).
      
      I have tested this in combination with a (hackish) patch which forces
      all work items to be handled by the rescuer thread.  In that context
      it significantly improves performance.  A similar patch for a 3.0
      kernel significantly improved performance on a heavy work load.
      
      Thanks to Jan Kara for some design ideas, and to Dongsu Park for
      some comments and testing.
      
      tj: Inverted the lock order between wq_mayday_lock and pool->lock with
          a preceding patch and simplified this patch.  Added comment and
          updated changelog accordingly.  Dongsu spotted missing get_pwq()
          in the simplified code.
      
      Cc: Dongsu Park <dongsu.park@profitbricks.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      008847f6
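
      A hedged sketch of the re-queue check described above, as it might appear at
      the end of the rescuer's per-pwq processing while still holding pool->lock
      (wq_mayday_lock nests inside pool->lock after the lock-order inversion in
      the next entry).
      
      	/* Sketch: keep the pwq on the mayday list while help is still needed. */
      	if (need_to_create_worker(pool)) {
      		spin_lock(&wq_mayday_lock);
      		get_pwq(pwq);			/* keep the pwq alive while queued */
      		list_move_tail(&pwq->mayday_node, &wq->maydays);
      		spin_unlock(&wq_mayday_lock);
      	}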
    • workqueue: invert the order between pool->lock and wq_mayday_lock · b2d82909
      Committed by Tejun Heo
      Currently, pool->lock nests inside wq_mayday_lock.  There's no inherent
      reason for this order.  The only place where the two locks are held
      together is pool_mayday_timeout(), and the order was simply decided
      that way.
      
      This nesting order turns out to complicate things for the planned
      rescuer_thread() update.  Let's invert it.  This doesn't cause any
      behavior differences (see the sketch after this entry).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Dongsu Park <dongsu.park@profitbricks.com>
      b2d82909
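
      A minimal sketch of the new nesting as it would appear in
      pool_mayday_timeout(); the body is elided and assumed.
      
      	/* Sketch: pool_mayday_timeout() with the inverted lock order. */
      	spin_lock_irq(&pool->lock);
      	spin_lock(&wq_mayday_lock);	/* now nests inside pool->lock */
      	/* ... scan pool->worklist and send mayday for starved pwqs ... */
      	spin_unlock(&wq_mayday_lock);
      	spin_unlock_irq(&pool->lock);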
  12. 04 December 2014, 1 commit
    • workqueue: cosmetic update in rescuer_thread() · 0479c8c5
      Committed by Tejun Heo
      rescuer_thread() caches &rescuer->scheduled in a local variable,
      scheduled, for convenience.  One WARN_ON_ONCE() was still using
      &rescuer->scheduled directly; replace it with the local variable.
      
      This patch causes no functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      0479c8c5
  13. 06 October 2014, 2 commits
  14. 23 July 2014, 6 commits
    • workqueue: use nr_node_ids instead of wq_numa_tbl_len · ddcb57e2
      Committed by Lai Jiangshan
      They are the same and nr_node_ids is provided by the memory subsystem.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      ddcb57e2
    • workqueue: remove the misnamed out_unlock label in get_unbound_pool() · 3fb1823c
      Committed by Lai Jiangshan
      After the locking was moved up to the caller of get_unbound_pool(),
      the out_unlock label no longer performs any unlock operation and its
      name became misleading, so we just remove the label and substitute
      "return pool" for its only use, "goto out_unlock".
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      3fb1823c
    • workqueue: remove the stale comment in pwq_unbound_release_workfn() · 29b1cb41
      Committed by Lai Jiangshan
      In 75ccf595 ("workqueue: prepare flush_workqueue() for dynamic
      creation and destrucion of unbound pool_workqueues"), a comment
      about the synchronization of the pwq in pwq_unbound_release_workfn()
      was added. The comment claimed that the flush_mutex wasn't strictly
      necessary, which was correct at the time because the pwq was protected
      by workqueue_lock.
      
      It is incorrect now: wq->flush_mutex was renamed to wq->mutex,
      workqueue_lock was removed, and wq->mutex is strictly needed. The
      comment simply wasn't updated when the synchronization changed.
      
      This patch removes the incorrect comment and doesn't add a new one
      explaining why wq->mutex is needed here, which is obvious, and
      wq->pwqs_node carries the "WQ" annotation in its definition, which is
      the better documentation.
      
      The old commit mentioned above also introduced a comment about the
      synchronization in link_pwq(). That comment is removed as well in this
      patch since the whole of link_pwq() is protected by wq->mutex.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      29b1cb41
    • workqueue: move rescuer pool detachment to the end · 13b1d625
      Committed by Lai Jiangshan
      In 51697d39 ("workqueue: use generic attach/detach routine for
      rescuers"), the rescuer detaches itself from the pool before put_pwq()
      so that put_unbound_pool() will not destroy the rescuer-attached
      pool.
      
      This is unnecessary.  worker_detach_from_pool() can be the last
      statement that accesses the pool, just as for regular workers;
      put_unbound_pool() waits for the rescuer to detach and only then frees
      the pool.
      
      So we move worker_detach_from_pool() down to coincide with the
      regular workers.
      
      tj: Minor description update.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      13b1d625
    • workqueue: unfold start_worker() into create_worker() · 051e1850
      Committed by Lai Jiangshan
      Simply unfold the code of start_worker() into create_worker() and
      remove the original start_worker() and create_and_start_worker().
      
      The only trade-off is the added overhead of releasing and re-grabbing
      pool->lock after the new worker is started.  The overhead is
      acceptable since the manager is a slow path.
      
      Because of this new locking behavior, the newly created worker may
      grab the lock before the manager and go on to process work items.  In
      that case, the recheck of need_to_create_worker() may be true, as
      expected, and the manager restarts, which is the correct behavior.
      
      tj: Minor updates to description and comments.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      051e1850
    • workqueue: remove @wakeup from worker_set_flags() · 228f1d00
      Committed by Lai Jiangshan
      worker_set_flags() has only two callers, each specifying %true and
      %false for @wakeup.  Let's push the wake up to the caller and remove
      @wakeup from worker_set_flags().  The caller can use the following
      instead if wakeup is necessary:
      
      	worker_set_flags();
      	if (need_more_worker(pool))
       		wake_up_worker(pool);
      
      This makes the code simpler.  This patch doesn't introduce behavior
      changes.
      
      tj: Updated description and comments.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      228f1d00
  15. 22 July 2014, 1 commit
    • workqueue: remove an unneeded UNBOUND test before waking up the next worker · a489a03e
      Committed by Lai Jiangshan
      In process_one_work():
      
      	if ((worker->flags & WORKER_UNBOUND) && need_more_worker(pool))
      		wake_up_worker(pool);
      
      the first test is unneeded.  Even if the first test is removed, it
      doesn't affect the wake-up logic for WORKER_UNBOUND, and it will not
      introduce any useless wake-ups for normal per-cpu workers since
      nr_running is always >= 1.  It will introduce useless/redundant
      wake-ups for CPU_INTENSIVE, but this case is rare and the next patch
      will also remove this redundant wake-up.
      
      tj: Minor updates to the description and comment.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      a489a03e
  16. 19 July 2014, 1 commit
  17. 15 July 2014, 1 commit
  18. 11 July 2014, 1 commit
  19. 07 July 2014, 1 commit
    • workqueue: zero cpumask of wq_numa_possible_cpumask on init · 5a6024f1
      Committed by Yasuaki Ishimatsu
      When hot-adding and onlining a CPU, a kernel panic occurs, showing the
      following call trace.
      
        BUG: unable to handle kernel paging request at 0000000000001d08
        IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
        PGD 0
        Oops: 0000 [#1] SMP
        ...
        Call Trace:
         [<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
         [<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
         [<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
         [<ffffffff811926f1>] new_slab+0x91/0x300
         [<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
         [<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
         [<ffffffff810a3c78>] ? load_balance+0x218/0x890
         [<ffffffff8101a679>] ? sched_clock+0x9/0x10
         [<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
         [<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
         [<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
         [<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
         [<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
         [<ffffffff8105d0ec>] do_fork+0xbc/0x360
         [<ffffffff8105d3b6>] kernel_thread+0x26/0x30
         [<ffffffff81086652>] kthreadd+0x2c2/0x300
         [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
         [<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
         [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
      
      In my investigation, I found that the root cause is
      wq_numa_possible_cpumask.  All entries of wq_numa_possible_cpumask are
      allocated with alloc_cpumask_var_node() and then used without being
      initialized, so they contain garbage values.
      
      When hot-adding and onlining a CPU, wq_update_unbound_numa() is called.
      wq_update_unbound_numa() calls alloc_unbound_pwq(), and alloc_unbound_pwq()
      calls get_unbound_pool().  In get_unbound_pool(), worker_pool->node is set
      as follows:
      
      	/* if cpumask is contained inside a NUMA node, we belong to that node */
      	if (wq_numa_enabled) {
      		for_each_node(node) {
      			if (cpumask_subset(pool->attrs->cpumask,
      					   wq_numa_possible_cpumask[node])) {
      				pool->node = node;
      				break;
      			}
      		}
      	}
      
      But wq_numa_possible_cpumask[node] does not hold a correct cpumask, so
      the wrong node is selected and the kernel panics.
      
      With this patch, all entries of wq_numa_possible_cpumask are allocated
      with zalloc_cpumask_var_node() so that they start out zeroed, and the
      panic disappears (see the sketch after this entry).
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: bce90380 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
      5a6024f1
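
      A minimal sketch of the fix; the surrounding allocation loop in the NUMA
      init path is abbreviated and assumed.
      
      	/* Sketch: zero-initialize each per-node mask at allocation time. */
      	for_each_node(node)
      		BUG_ON(!zalloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
      						GFP_KERNEL, node));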
  20. 02 July 2014, 2 commits
    • workqueue: stronger test in process_one_work() · 85327af6
      Committed by Lai Jiangshan
      When POOL_DISASSOCIATED is cleared, the running worker's local CPU should
      be the same as pool->cpu without any exception, even during cpu-hotplug.
      
      This patch changes "(proposition_A && proposition_B && proposition_C)"
      to "(proposition_B && proposition_C)".  If the old compound proposition
      was true, the new one must be true too, so this won't hide any possible
      bug which could be hit by the old test (see the sketch after this entry).
      
      tj: Minor description update and dropped the obvious comment.
      
      CC: Jason J. Herne <jjherne@linux.vnet.ibm.com>
      CC: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      85327af6
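
      The strengthened sanity check presumably looks like the following sketch,
      based on the description above; the exact form in process_one_work() may
      differ.
      
      	/*
      	 * Sketch: unless the pool is DISASSOCIATED, a running worker must be
      	 * on pool->cpu, with no exception even during CPU hotplug.
      	 */
      	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
      		     raw_smp_processor_id() != pool->cpu);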
    • workqueue: clear POOL_DISASSOCIATED in rebind_workers() · 3de5e884
      Committed by Lai Jiangshan
      a9ab775b ("workqueue: directly restore CPU affinity of workers
      from CPU_ONLINE") moved pool locking into rebind_workers() but left
      "pool->flags &= ~POOL_DISASSOCIATED" in workqueue_cpu_up_callback().
      
      There is nothing necessarily wrong with it, but there is no benefit
      either.  Let's move it into rebind_workers() and achieve the following
      benefits:
      
        1) better readability, POOL_DISASSOCIATED is cleared in rebind_workers()
           as expected.
      
        2) we can guarantee that, when POOL_DISASSOCIATED is clear, the
           running workers of the pool are on the local CPU (pool->cpu).
      
      tj: Minor description update.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      3de5e884
  21. 24 June 2014, 1 commit
  22. 20 June 2014, 8 commits
    • workqueue: stronger test in process_one_work() · 807407c0
      Committed by Lai Jiangshan
      After the recent changes, when POOL_DISASSOCIATED is cleared, the
      running worker's local CPU should be the same as pool->cpu without any
      exception even during cpu-hotplug.  Update the sanity check in
      process_one_work() accordingly.
      
      This patch changes "(proposition_A && proposition_B && proposition_C)"
      to "(proposition_B && proposition_C)".  If the old compound proposition
      was true, the new one must be true too, so this will not hide any
      possible bug which could be caught by the old test.
      
      tj: Minor updates to the description.
      
      CC: Jason J. Herne <jjherne@linux.vnet.ibm.com>
      CC: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      807407c0
    • workqueue: clear POOL_DISASSOCIATED in rebind_workers() · f05b558d
      Committed by Lai Jiangshan
      The commit a9ab775b ("workqueue: directly restore CPU affinity of
      workers from CPU_ONLINE") moved the pool->lock into rebind_workers()
      without also moving "pool->flags &= ~POOL_DISASSOCIATED".
      
      There is nothing wrong with "pool->flags &= ~POOL_DISASSOCIATED" not
      being moved together, but there isn't any benefit either. We move it
      into rebind_workers() and achieve these benefits:
      
      1) Better readability.  POOL_DISASSOCIATED is cleared in
         rebind_workers() as expected.
      
      2) When POOL_DISASSOCIATED is cleared, we can ensure that all the
         running workers of the pool are on the local CPU (pool->cpu).
      
      tj: Cosmetic updates to the code and description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      f05b558d
    • workqueue: sanity check pool->cpu in wq_worker_sleeping() · 92b69f50
      Committed by Lai Jiangshan
      In theory, pool->cpu equals @cpu in wq_worker_sleeping() once
      worker->flags has been checked.
      
      A "pool->cpu != cpu" sanity check will help us if something goes wrong
      (see the sketch after this entry).
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      92b69f50
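
      A hedged sketch of the added check; its exact placement inside
      wq_worker_sleeping() is assumed.
      
      	/* Sketch: bail out loudly if the worker is not on its pool's CPU. */
      	if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id()))
      		return NULL;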
    • workqueue: clear leftover flags when detached · b62c0751
      Committed by Lai Jiangshan
      When a worker is detached, worker->flags may still have WORKER_UNBOUND
      or WORKER_REBOUND set.  This is OK in all cases:
        1) if it is a normal worker, the worker is about to die, so it doesn't matter.
        2) if it is a rescuer, it may re-attach to a pool with these leftover flag(s);
           that is still correct, except that it may cause an unneeded wakeup.
      
      It is correct but not ideal, so we just clear the leftover flags
      (see the sketch after this entry).
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b62c0751
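
      A minimal sketch of the cleanup at detach time, assumed to live in
      worker_detach_from_pool().
      
      	/* Sketch: drop flags that only make sense while attached to a pool. */
      	worker->flags &= ~(WORKER_UNBOUND | WORKER_REBOUND);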
    • workqueue: remove useless WARN_ON_ONCE() · 25ef0958
      Committed by Lai Jiangshan
      The @cpu is fetched via smp_processor_id() in this function,
      so the check is useless.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      25ef0958
    • workqueue: use schedule_timeout_interruptible() instead of open code · e212f361
      Committed by Lai Jiangshan
      schedule_timeout_interruptible(CREATE_COOLDOWN) is exactly the same as
      the original code.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e212f361
    • workqueue: remove the empty check in too_many_workers() · e6a9a771
      Committed by Lai Jiangshan
      The commit ea1abd61 ("workqueue: reimplement idle worker rebinding")
      used a trick which simply removes all to-be-bound idle workers from the
      idle list and lets them add themselves back after completing rebinding.
      
      This trick caused @worker_pool->nr_idle to deviate from the actual
      number of idle workers on @worker_pool->idle_list.  More specifically,
      nr_idle could be non-zero while ->idle_list was empty.  All users of
      ->nr_idle and ->idle_list were audited.  The only affected one was
      too_many_workers(), which was updated to return %false if ->idle_list
      is empty regardless of ->nr_idle.
      
      The commit/trick was complicated because it tried to simplify an even
      more complicated problem (workers had to rebind themselves).  But the
      commit a9ab775b ("workqueue: directly restore CPU affinity of workers
      from CPU_ONLINE") fixed all these problems, and the mentioned trick is
      useless and gone.
      
      So now @worker_pool->nr_idle is exactly the actual number of workers
      on @worker_pool->idle_list, and too_many_workers() can go back to what
      it was before the trick.  Remove the empty check (the post-removal
      function is sketched after this entry).
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e6a9a771
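
      For reference, a hedged sketch of too_many_workers() with the empty-list
      check gone; the manager accounting and the ratio constant are recalled from
      that era of the code and should be treated as assumptions.
      
      	/* Sketch: decide whether idle workers should start timing out. */
      	static bool too_many_workers_sketch(struct worker_pool *pool)
      	{
      		bool managing = mutex_is_locked(&pool->manager_arb);
      		int nr_idle = pool->nr_idle + managing;	/* the manager counts as idle */
      		int nr_busy = pool->nr_workers - nr_idle;
      
      		return nr_idle > 2 &&
      		       (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
      	}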
    • workqueue: use "pool->cpu < 0" to stand for an unbound pool · 61d0fbb4
      Committed by Lai Jiangshan
      There is a piece of sanity-check code in put_unbound_pool().  Its
      meaning is "if it is not an unbound pool, complain and return".  But
      the code uses "pool->flags & POOL_DISASSOCIATED" imprecisely, since a
      non-unbound pool may also have this flag set.
      
      We should use "pool->cpu < 0" to stand for an unbound pool, so we
      convert the code to that.
      
      Keeping "pool->flags & POOL_DISASSOCIATED" here wouldn't be strictly
      wrong, but it is just noise:
        1) we care about "unbound" here, not "[dis]association".
        2) "pool->cpu < 0" already implies "pool->flags & POOL_DISASSOCIATED".
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      61d0fbb4