  1. 19 May 2015, 1 commit
    • workqueue: wq_pool_mutex protects the attrs-installation · 5b95e1af
      Committed by Lai Jiangshan
      Currently wq_pool_mutex doesn't protect the attrs-installation. As a result,
      ->unbound_attrs, ->numa_pwq_tbl[] and ->dfl_pwq can only be accessed under
      wq->mutex, which causes some inconvenience. For example, wq_update_unbound_numa()
      has to acquire wq->mutex before fetching wq->unbound_attrs->no_numa and the
      old_pwq.
      
      The attrs-installation is a short operation, so this change will not add any
      noticeable latency to other operations which also acquire wq_pool_mutex.
      
      The only attrs-installation code not yet covered is in apply_workqueue_attrs(),
      so this patch changes less code than comments.
      
      It also prepares for the next several patches, which read wq->unbound_attrs,
      wq->numa_pwq_tbl[] and wq->dfl_pwq with only wq_pool_mutex held (see the
      sketch after this entry).
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      5b95e1af
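
      A minimal sketch of the locking rule this change establishes: the installer
      holds both wq_pool_mutex and wq->mutex, so readers may rely on either lock.
      The helper name and body below are illustrative assumptions, not the exact
      kernel code.
      
      	/* Sketch: install new unbound attrs while holding both locks. */
      	static void install_unbound_attrs_sketch(struct workqueue_struct *wq,
      						 const struct workqueue_attrs *attrs)
      	{
      		lockdep_assert_held(&wq_pool_mutex);	/* writers hold wq_pool_mutex... */
      
      		mutex_lock(&wq->mutex);			/* ...and wq->mutex around the install */
      		copy_workqueue_attrs(wq->unbound_attrs, attrs);
      		mutex_unlock(&wq->mutex);
      	}
      	/* Readers of wq->unbound_attrs may hold either wq_pool_mutex or wq->mutex. */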
  2. 13 May 2015, 1 commit
  3. 11 May 2015, 1 commit
  4. 30 April 2015, 1 commit
    • workqueue: Allow modifying low level unbound workqueue cpumask · 042f7df1
      Committed by Lai Jiangshan
      Allow the low-level unbound workqueue cpumask to be modified through
      sysfs. This is done by traversing the entire workqueue list and calling
      apply_wqattrs_prepare() on each unbound workqueue with the new low-level
      mask. Only after all the preparations have succeeded are they committed
      together.
      
      Ordered workqueues are ignored by the low-level unbound workqueue
      cpumask for now; they will be handled in the near future.
      
      All (default and per-node) pwqs are mandatorily constrained by the
      low-level cpumask. If the user-configured cpumask doesn't overlap with
      the low-level cpumask, the low-level cpumask is used for the wq instead
      (see the sketch after this entry).
      
      The comment of wq_calc_node_cpumask() is updated to explicitly require
      that its first argument be the attrs of the default pwq.
      
      The default wq_unbound_cpumask is cpu_possible_mask.  The workqueue
      subsystem doesn't know the best default value; the system manager or
      another subsystem can set it when needed.
      
      Changes from V8:
        merged the code that calculates the attrs of the default pwq.
        minor changes to the code and comments for saving the user-configured attrs.
        removed an unnecessary list_del().
        minor update to the comment of wq_calc_node_cpumask().
        updated the comment of workqueue_set_unbound_cpumask().
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      042f7df1
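
      A minimal sketch of the mask-fallback behaviour described above, assuming a
      hypothetical helper name; the real code computes this inside the attrs/pwq
      preparation path.
      
      	/* Sketch: derive the effective cpumask for an unbound workqueue. */
      	static void wq_effective_cpumask_sketch(struct cpumask *effective,
      						const struct cpumask *user_mask)
      	{
      		/* constrain the user-configured mask by the low-level unbound mask */
      		if (!cpumask_and(effective, user_mask, wq_unbound_cpumask))
      			/* no overlap: fall back to the low-level cpumask */
      			cpumask_copy(effective, wq_unbound_cpumask);
      	}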
  5. 27 April 2015, 2 commits
    • workqueue: Create low-level unbound workqueues cpumask · b05a7928
      Committed by Frederic Weisbecker
      Create a cpumask that limits the affinity of all unbound workqueues.
      This cpumask is controlled through a file at the root of the workqueue
      sysfs directory.
      
      It works at a lower level than the per-WQ_SYSFS workqueue cpumask files,
      so the effective cpumask applied to a given unbound workqueue is the
      intersection of /sys/devices/virtual/workqueue/$WORKQUEUE/cpumask and
      the new /sys/devices/virtual/workqueue/cpumask file (the read side is
      sketched after this entry).
      
      This patch implements the basic infrastructure and the read interface.
      wq_unbound_cpumask is initially set to cpu_possible_mask.
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b05a7928
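
      A hedged sketch of what the read interface might look like: a device
      attribute that prints the global mask with the "%*pb" bitmap format.
      The function name and the exact locking are assumptions.
      
      	/* Sketch: sysfs read handler for the global unbound cpumask. */
      	static ssize_t wq_unbound_cpumask_show(struct device *dev,
      					       struct device_attribute *attr,
      					       char *buf)
      	{
      		int written;
      
      		mutex_lock(&wq_pool_mutex);
      		written = scnprintf(buf, PAGE_SIZE, "%*pb\n",
      				    cpumask_pr_args(wq_unbound_cpumask));
      		mutex_unlock(&wq_pool_mutex);
      
      		return written;
      	}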
    • workqueue: split apply_workqueue_attrs() into 3 stages · 2d5f0764
      Committed by Lai Jiangshan
      Currently apply_workqueue_attrs() performs both pwq allocation and pwq
      installation, so when multiple apply_workqueue_attrs() calls are batched as
      a transaction, we can't guarantee that the transaction succeeds or fails as
      a complete unit.
      
      To solve this, apply_workqueue_attrs() is split into three stages.
      The first stage does the preparation: allocating memory and pwqs.
      The second stage does the attrs installation and pwq installation.
      The third stage frees the allocated memory and the old or unused pwqs.
      
      As a result, a batch of apply_workqueue_attrs() calls can succeed or fail
      as a complete unit (see the sketch after this entry):
      	1) run the first stage for all the workqueues;
      	2) commit them all only when every preparation has succeeded.
      
      This patch prepares for the next patch ("Allow modifying low level
      unbound workqueue cpumask"), which applies attrs to multiple workqueues
      at once.
      
      The patch doesn't change functionality except for two minor adjustments:
      	1) free_unbound_pwq() on the error path is removed; the heavier
      	   put_pwq_unlocked() is used instead since the error path is rare.
      	   This simplifies the code.
      	2) the memory allocation is moved under wq_pool_mutex, which avoids
      	   the need for further splitting.
      
      tj: minor updates to comments.
      Suggested-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      2d5f0764
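
      A hedged sketch of the batched, all-or-nothing pattern the split enables.
      apply_wqattrs_prepare() is named in the messages above; the context
      structure, the commit/cleanup helper names and the omitted locking are
      simplified assumptions.
      
      	/* Sketch: prepare everything first, commit only if all succeed. */
      	static int apply_cpumask_to_all_unbound_sketch(void)
      	{
      		struct apply_wqattrs_ctx *ctx, *n;	/* per-wq prepared state */
      		struct workqueue_struct *wq;
      		LIST_HEAD(ctxs);
      		int ret = 0;
      
      		/* stage 1: prepare every unbound wq; nothing is published yet */
      		list_for_each_entry(wq, &workqueues, list) {
      			if (!(wq->flags & WQ_UNBOUND))
      				continue;
      			ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs);
      			if (!ctx) {
      				ret = -ENOMEM;
      				goto out;	/* abort: only cleanup runs below */
      			}
      			list_add_tail(&ctx->list, &ctxs);
      		}
      
      		/* stage 2: every preparation succeeded, commit them together */
      		list_for_each_entry(ctx, &ctxs, list)
      			apply_wqattrs_commit(ctx);	/* assumed helper name */
      out:
      		/* stage 3: free prepared state and old or unused pwqs */
      		list_for_each_entry_safe(ctx, n, &ctxs, list)
      			apply_wqattrs_cleanup(ctx);	/* assumed helper name */
      		return ret;
      	}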
  6. 06 April 2015, 1 commit
    • workqueue: Reorder sysfs code · 6ba94429
      Committed by Frederic Weisbecker
      The sysfs code usually belongs at the bottom of the file since it deals
      with high-level objects. In the workqueue code it's misplaced, so we
      would need to work around function references to let the sysfs code
      call APIs like apply_workqueue_attrs().
      
      Let's move that block further down the file, almost to the bottom.
      
      Also declare workqueue_sysfs_unregister() just before destroy_workqueue(),
      which references it.
      
      tj: Moved the workqueue_sysfs_unregister() forward declaration to where
          the other forward declarations are.
      Suggested-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      6ba94429
  7. 09 March 2015, 3 commits
    • workqueue: dump workqueues on sysrq-t · 3494fc30
      Committed by Tejun Heo
      Workqueues are used extensively throughout the kernel but sometimes
      it's difficult to debug stalls involving work items because visibility
      into their inner workings is fairly limited.  Although the sysrq-t task
      dump annotates each active worker task with information on the work
      item being executed, it is challenging to find out which work items
      are pending or delayed on which queues and how pools are being
      managed.
      
      This patch implements show_workqueue_state() which dumps all busy
      workqueues and pools and is called from the sysrq-t handler.  At the
      end of sysrq-t dump, something like the following is printed.
      
       Showing busy workqueues and worker pools:
       ...
       workqueue filler_wq: flags=0x0
         pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
           in-flight: 491:filler_workfn, 507:filler_workfn
         pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
           in-flight: 501:filler_workfn
           pending: filler_workfn
       ...
       workqueue test_wq: flags=0x8
         pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
           in-flight: 510(RESCUER):test_workfn BAR(69) BAR(500)
           delayed: test_workfn1 BAR(492), test_workfn2
       ...
       pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=2 manager: 137
       pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=3 manager: 469
       pool 3: cpus=1 node=0 flags=0x0 nice=-20 workers=2 idle: 16
       pool 8: cpus=0-3 flags=0x4 nice=0 workers=2 manager: 62
      
      The above shows that test_wq is executing test_workfn() on pid 510
      which is the rescuer and also that there are two tasks 69 and 500
      waiting for the work item to finish in flush_work().  As test_wq has
      max_active of 1, there are two work items for test_workfn1() and
      test_workfn2() which are delayed till the current work item is
      finished.  In addition, pid 492 is flushing test_workfn1().
      
      The work item for test_workfn() is being executed on pwq of pool 2
      which is the normal priority per-cpu pool for CPU 1.  The pool has
      three workers, two of which are executing filler_workfn() for
      filler_wq and the last one is assuming the manager role trying to
      create more workers.
      
      This extra workqueue state dump will hopefully help chase down hangs
      involving workqueues.
      
      v3: cpulist_pr_cont() replaced with "%*pbl" printf formatting.
      
      v2: As suggested by Andrew, minor formatting change in pr_cont_work(),
          printk()'s replaced with pr_info()'s, and cpumask printing now
          uses cpulist_pr_cont().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      CC: Ingo Molnar <mingo@redhat.com>
      3494fc30
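
      The v3 note above mentions the "%*pbl" printf format for cpumasks. A hedged
      sketch of how one "pool N:" summary line of the dump could be printed with
      it; the helper name is illustrative and the field selection follows the
      sample output above.
      
      	/* Sketch: print one "pool N: cpus=... nice=..." line of the dump. */
      	static void pr_pool_line_sketch(struct worker_pool *pool)
      	{
      		pr_info("pool %d: cpus=%*pbl", pool->id,
      			cpumask_pr_args(pool->attrs->cpumask));
      		if (pool->node != NUMA_NO_NODE)
      			pr_cont(" node=%d", pool->node);
      		pr_cont(" flags=0x%x nice=%d workers=%d\n",
      			pool->flags, pool->attrs->nice, pool->nr_workers);
      	}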
    • workqueue: keep track of the flushing task and pool manager · 2607d7a6
      Committed by Tejun Heo
      Add wq_barrier->task and worker_pool->manager to keep track of the
      flushing task and pool manager respectively.  These are purely
      informational and will be used to implement sysrq dump of workqueues.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      2607d7a6
    • workqueue: make the workqueues list RCU walkable · e2dca7ad
      Committed by Tejun Heo
      The workqueues list is protected by wq_pool_mutex, and a workqueue and
      its subordinate data structures are freed directly on destruction.  We
      want to add the ability to dump workqueues from a sysrq callback, which
      requires walking all workqueues without grabbing wq_pool_mutex.  This
      patch makes freeing of workqueues RCU protected and makes the
      workqueues list walkable while holding the RCU read lock.
      
      Note that pool_workqueues and pools are already sched-RCU protected.
      For consistency, workqueues are also protected with sched-RCU.
      
      While at it, reverse the workqueues list so that a workqueue created
      earlier comes first.  The order of the list isn't functionally
      significant, but this makes the planned sysrq dump list the system
      workqueues first (see the sketch after this entry).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e2dca7ad
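
      A hedged sketch of the kind of walk this enables, using sched-RCU as noted
      above; the dump body is elided and the function name is illustrative.
      
      	/* Sketch: walk all workqueues without grabbing wq_pool_mutex. */
      	static void show_workqueue_state_sketch(void)
      	{
      		struct workqueue_struct *wq;
      
      		rcu_read_lock_sched();	/* workqueues are sched-RCU protected */
      		list_for_each_entry_rcu(wq, &workqueues, list) {
      			/* ... dump wq state; must not sleep in this section ... */
      		}
      		rcu_read_unlock_sched();
      	}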
  8. 05 March 2015, 1 commit
    • workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE · 8603e1b3
      Committed by Tejun Heo
      cancel[_delayed]_work_sync() are implemented using
      __cancel_work_timer() which grabs the PENDING bit using
      try_to_grab_pending() and then flushes the work item with PENDING set
      to prevent the on-going execution of the work item from requeueing
      itself.
      
      try_to_grab_pending() can always grab the PENDING bit without blocking
      except when someone else is doing the above flushing during
      cancelation, in which case it returns -ENOENT.  __cancel_work_timer()
      currently handles this by invoking flush_work().  The assumption is
      that the completion of the work item is what the other canceling task
      is waiting for too, so waiting for the same condition and retrying
      should allow forward progress without excessive busy looping.
      
      Unfortunately, this doesn't work if preemption is disabled or the
      latter task has real time priority.  Let's say task A just got woken
      up from flush_work() by the completion of the target work item.  If,
      before task A starts executing, task B gets scheduled and invokes
      __cancel_work_timer() on the same work item, its try_to_grab_pending()
      will return -ENOENT as the work item is still being canceled by task A
      and flush_work() will also immediately return false as the work item
      is no longer executing.  This puts task B in a busy loop possibly
      preventing task A from executing and clearing the canceling state on
      the work item leading to a hang.
      
      task A			task B			worker
      
      						executing work
      __cancel_work_timer()
        try_to_grab_pending()
        set work CANCELING
        flush_work()
          block for work completion
      						completion, wakes up A
      			__cancel_work_timer()
      			while (forever) {
      			  try_to_grab_pending()
      			    -ENOENT as work is being canceled
      			  flush_work()
      			    false as work is no longer executing
      			}
      
      This patch removes the possible hang by updating __cancel_work_timer()
      to explicitly wait for clearing of CANCELING rather than invoking
      flush_work() after try_to_grab_pending() fails with -ENOENT.
      
      Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com
      
      v3: bit_waitqueue() can't be used for work items defined in vmalloc
          area.  Switched to custom wake function which matches the target
          work item and exclusive wait and wakeup.
      
      v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
          the target bit waitqueue has wait_bit_queue's on it.  Use
          DEFINE_WAIT_BIT() and __wake_up_bit() instead.  Reported by Tomeu
          Vizoso.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Rabin Vincent <rabin.vincent@axis.com>
      Cc: Tomeu Vizoso <tomeu.vizoso@gmail.com>
      Cc: stable@vger.kernel.org
      Tested-by: Jesper Nilsson <jesper.nilsson@axis.com>
      Tested-by: Rabin Vincent <rabin.vincent@axis.com>
      8603e1b3
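
      A hedged sketch of the fixed retry loop: after -ENOENT the task sleeps until
      CANCELING is cleared instead of calling flush_work().  The wait helper below
      is a hypothetical stand-in; per the v3 note above, the real code waits on a
      dedicated waitqueue with a custom wake function because work items may live
      in vmalloc'ed memory.
      
      	/* Sketch: core retry loop of __cancel_work_timer() after the fix. */
      	static bool cancel_work_sketch(struct work_struct *work, bool is_dwork)
      	{
      		unsigned long flags;
      		int ret;
      
      		do {
      			ret = try_to_grab_pending(work, is_dwork, &flags);
      			if (unlikely(ret == -ENOENT)) {
      				/*
      				 * Another task owns CANCELING.  Sleep until it
      				 * is cleared rather than busy-looping through
      				 * flush_work().
      				 */
      				wait_for_canceling_clear(work);	/* hypothetical helper */
      			}
      		} while (unlikely(ret < 0));
      
      		/* ... mark CANCELING, flush the work, clear the bits, wake waiters ... */
      		return ret;
      	}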
  9. 14 February 2015, 1 commit
  10. 17 January 2015, 1 commit
    • workqueue: fix subtle pool management issue which can stall whole worker_pool · 29187a9e
      Committed by Tejun Heo
      A worker_pool's forward progress is guaranteed by the fact that the
      last idle worker assumes the manager role to create more workers, and
      summons the rescuers if creating workers doesn't succeed in a timely
      manner, before proceeding to execute work items.
      
      This manager role is implemented in manage_workers(), which indicates
      whether the worker may proceed to work item execution with its return
      value.  This is necessary because multiple workers may contend for the
      manager role, and, if there already is a manager, others should
      proceed to work item execution.
      
      Unfortunately, the function also indicates that the worker may proceed
      to work item execution if need_to_create_worker() is false at the head
      of the function.  need_to_create_worker() tests the following
      conditions.
      
      	pending work items && !nr_running && !nr_idle
      
      The first and third conditions are protected by pool->lock and thus
      won't change while holding pool->lock; however, nr_running can change
      asynchronously as other workers block and resume, and while it's likely
      to be zero, as someone woke this worker up in the first place, some
      other workers could have become runnable in between, making it non-zero.
      
      If this happens, manage_workers() could return false even with zero
      nr_idle, making the worker, the last idle one, proceed to execute work
      items.  If then all workers of the pool end up blocking on a resource
      which can only be released by a work item which is pending on that
      pool, the whole pool can deadlock as there's no one to create more
      workers or summon the rescuers.
      
      This patch fixes the problem by removing the early exit condition from
      maybe_create_worker() and making manage_workers() return false iff
      there's already another manager, which ensures that the last worker
      doesn't start executing work items.
      
      We could leave the early exit condition alone and just ignore the return
      value, but the only reason it was put there is that manage_workers()
      used to perform both creation and destruction of workers and thus the
      function could be invoked while the pool was trying to reduce the number
      of workers.  Now that manage_workers() is called only when more workers
      are needed, the only cases in which the early exit condition triggers
      are rare race conditions, rendering it pointless (see the sketch after
      this entry).
      
      Tested with simulated workload and modified workqueue code which
      trigger the pool deadlock reliably without this patch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Eric Sandeen <sandeen@sandeen.net>
      Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: stable@vger.kernel.org
      29187a9e
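
      A hedged sketch of the post-fix shape of manage_workers(): it returns false
      only when another worker already holds the manager role.  The arbitration
      mutex name is an assumption based on that era of the code.
      
      	/* Sketch: manage_workers() returns false iff someone else is managing. */
      	static bool manage_workers_sketch(struct worker *worker)
      	{
      		struct worker_pool *pool = worker->pool;
      
      		if (!mutex_trylock(&pool->manager_arb))
      			return false;		/* another manager exists */
      
      		maybe_create_worker(pool);	/* no early-exit check any more */
      
      		mutex_unlock(&pool->manager_arb);
      		return true;	/* caller must re-check before running work items */
      	}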
  11. 09 December 2014, 2 commits
    • workqueue: allow rescuer thread to do more work. · 008847f6
      Committed by NeilBrown
      When there is serious memory pressure, all workers in a pool could be
      blocked, and a new thread cannot be created because it requires memory
      allocation.
      
      In this situation a WQ_MEM_RECLAIM workqueue will wake up the
      rescuer thread to do some work.
      
      The rescuer will only handle requests that are already on ->worklist.
      If max_requests is 1, that means it will handle a single request.
      
      The rescuer will be woken again in 100ms to handle another max_requests
      requests.
      
      I've seen a machine (running a 3.0 based "enterprise" kernel) with
      thousands of requests queued for xfslogd, which has a max_requests of
      1 and is needed for retiring all 'xfs' write requests.  When one of
      the worker pools gets into this state, it progresses extremely slowly
      and possibly never recovers (I only waited an hour or two).
      
      With this patch we leave a pool_workqueue on the mayday list
      until it is clearly no longer in need of assistance.  This allows
      all requests to be handled in a timely fashion.
      
      We keep each pool_workqueue on the mayday list until
      need_to_create_worker() is false and no work for this workqueue is
      found in the pool (see the sketch after this entry).
      
      I have tested this in combination with a (hackish) patch which forces
      all work items to be handled by the rescuer thread.  In that context
      it significantly improves performance.  A similar patch for a 3.0
      kernel significantly improved performance on a heavy work load.
      
      Thanks to Jan Kara for some design ideas, and to Dongsu Park for
      some comments and testing.
      
      tj: Inverted the lock order between wq_mayday_lock and pool->lock with
          a preceding patch and simplified this patch.  Added comment and
          updated changelog accordingly.  Dongsu spotted missing get_pwq()
          in the simplified code.
      
      Cc: Dongsu Park <dongsu.park@profitbricks.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      008847f6
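
      A hedged sketch of the re-queue check described above, as it might appear at
      the end of the rescuer's per-pwq processing while still holding pool->lock
      (wq_mayday_lock nests inside pool->lock after the lock-order inversion in
      the next entry).
      
      	/* Sketch: keep the pwq on the mayday list while help is still needed. */
      	if (need_to_create_worker(pool)) {
      		spin_lock(&wq_mayday_lock);
      		get_pwq(pwq);			/* keep the pwq alive while queued */
      		list_move_tail(&pwq->mayday_node, &wq->maydays);
      		spin_unlock(&wq_mayday_lock);
      	}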
    • workqueue: invert the order between pool->lock and wq_mayday_lock · b2d82909
      Committed by Tejun Heo
      Currently, pool->lock nests inside wq_mayday_lock.  There's no inherent
      reason for this order.  The only place where the two locks are held
      together is pool_mayday_timeout(), and the order was simply decided
      that way.
      
      This nesting order turns out to complicate things for the planned
      rescuer_thread() update.  Let's invert it.  This doesn't cause any
      behavior differences (see the sketch after this entry).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Dongsu Park <dongsu.park@profitbricks.com>
      b2d82909
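
      A minimal sketch of the new nesting as it would appear in
      pool_mayday_timeout(); the body is elided and assumed.
      
      	/* Sketch: pool_mayday_timeout() with the inverted lock order. */
      	spin_lock_irq(&pool->lock);
      	spin_lock(&wq_mayday_lock);	/* now nests inside pool->lock */
      	/* ... scan pool->worklist and send mayday for starved pwqs ... */
      	spin_unlock(&wq_mayday_lock);
      	spin_unlock_irq(&pool->lock);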
  12. 04 December 2014, 1 commit
    • workqueue: cosmetic update in rescuer_thread() · 0479c8c5
      Committed by Tejun Heo
      rescuer_thread() caches &rescuer->scheduled in a local variable,
      scheduled, for convenience.  One WARN_ON_ONCE() was still using
      &rescuer->scheduled directly; replace it with the local variable.
      
      This patch causes no functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      0479c8c5
  13. 06 October 2014, 2 commits
  14. 23 July 2014, 6 commits
    • workqueue: use nr_node_ids instead of wq_numa_tbl_len · ddcb57e2
      Committed by Lai Jiangshan
      They are the same and nr_node_ids is provided by the memory subsystem.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      ddcb57e2
    • workqueue: remove the misnamed out_unlock label in get_unbound_pool() · 3fb1823c
      Committed by Lai Jiangshan
      After the locking was moved up to the caller of get_unbound_pool(),
      the out_unlock label no longer performs any unlock operation and its
      name became misleading, so we just remove the label and substitute
      "return pool" for its only use, "goto out_unlock".
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      3fb1823c
    • workqueue: remove the stale comment in pwq_unbound_release_workfn() · 29b1cb41
      Committed by Lai Jiangshan
      In 75ccf595 ("workqueue: prepare flush_workqueue() for dynamic
      creation and destrucion of unbound pool_workqueues"), a comment
      about the synchronization of the pwq in pwq_unbound_release_workfn()
      was added. The comment claimed that the flush_mutex wasn't strictly
      necessary, which was correct at the time because the pwq was protected
      by workqueue_lock.
      
      It is incorrect now: wq->flush_mutex was renamed to wq->mutex,
      workqueue_lock was removed, and wq->mutex is strictly needed. The
      comment simply wasn't updated when the synchronization changed.
      
      This patch removes the incorrect comment and doesn't add a new one
      explaining why wq->mutex is needed here, which is obvious, and
      wq->pwqs_node carries the "WQ" annotation in its definition, which is
      the better documentation.
      
      The old commit mentioned above also introduced a comment about the
      synchronization in link_pwq(). That comment is removed as well in this
      patch since the whole of link_pwq() is protected by wq->mutex.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      29b1cb41
    • workqueue: move rescuer pool detachment to the end · 13b1d625
      Committed by Lai Jiangshan
      In 51697d39 ("workqueue: use generic attach/detach routine for
      rescuers"), the rescuer detaches itself from the pool before put_pwq()
      so that put_unbound_pool() will not destroy the rescuer-attached
      pool.
      
      This is unnecessary.  worker_detach_from_pool() can be the last
      statement that accesses the pool, just as for regular workers;
      put_unbound_pool() waits for the rescuer to detach and only then frees
      the pool.
      
      So we move worker_detach_from_pool() down to coincide with the
      regular workers.
      
      tj: Minor description update.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      13b1d625
    • workqueue: unfold start_worker() into create_worker() · 051e1850
      Committed by Lai Jiangshan
      Simply unfold the code of start_worker() into create_worker() and
      remove the original start_worker() and create_and_start_worker().
      
      The only trade-off is the added overhead of releasing and re-grabbing
      pool->lock after the new worker is started.  The overhead is
      acceptable since the manager is a slow path.
      
      Because of this new locking behavior, the newly created worker may
      grab the lock before the manager and go on to process work items.  In
      that case, the recheck of need_to_create_worker() may be true, as
      expected, and the manager restarts, which is the correct behavior.
      
      tj: Minor updates to description and comments.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      051e1850
    • workqueue: remove @wakeup from worker_set_flags() · 228f1d00
      Committed by Lai Jiangshan
      worker_set_flags() has only two callers, each specifying %true and
      %false for @wakeup.  Let's push the wake up to the caller and remove
      @wakeup from worker_set_flags().  The caller can use the following
      instead if wakeup is necessary:
      
      	worker_set_flags();
      	if (need_more_worker(pool))
       		wake_up_worker(pool);
      
      This makes the code simpler.  This patch doesn't introduce behavior
      changes.
      
      tj: Updated description and comments.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      228f1d00
  15. 22 July 2014, 1 commit
    • workqueue: remove an unneeded UNBOUND test before waking up the next worker · a489a03e
      Committed by Lai Jiangshan
      In process_one_work():
      
      	if ((worker->flags & WORKER_UNBOUND) && need_more_worker(pool))
      		wake_up_worker(pool);
      
      the first test is unneeded.  Even if the first test is removed, it
      doesn't affect the wake-up logic for WORKER_UNBOUND, and it will not
      introduce any useless wake-ups for normal per-cpu workers since
      nr_running is always >= 1.  It will introduce useless/redundant
      wake-ups for CPU_INTENSIVE, but this case is rare and the next patch
      will also remove this redundant wake-up.
      
      tj: Minor updates to the description and comment.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      a489a03e
  16. 19 July 2014, 1 commit
  17. 15 July 2014, 1 commit
  18. 11 July 2014, 1 commit
  19. 07 July 2014, 1 commit
    • workqueue: zero cpumask of wq_numa_possible_cpumask on init · 5a6024f1
      Committed by Yasuaki Ishimatsu
      When hot-adding and onlining a CPU, a kernel panic occurs, showing the
      following call trace.
      
        BUG: unable to handle kernel paging request at 0000000000001d08
        IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
        PGD 0
        Oops: 0000 [#1] SMP
        ...
        Call Trace:
         [<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
         [<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
         [<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
         [<ffffffff811926f1>] new_slab+0x91/0x300
         [<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
         [<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
         [<ffffffff810a3c78>] ? load_balance+0x218/0x890
         [<ffffffff8101a679>] ? sched_clock+0x9/0x10
         [<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
         [<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
         [<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
         [<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
         [<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
         [<ffffffff8105d0ec>] do_fork+0xbc/0x360
         [<ffffffff8105d3b6>] kernel_thread+0x26/0x30
         [<ffffffff81086652>] kthreadd+0x2c2/0x300
         [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
         [<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
         [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
      
      In my investigation, I found that the root cause is
      wq_numa_possible_cpumask.  All entries of wq_numa_possible_cpumask are
      allocated with alloc_cpumask_var_node() and then used without being
      initialized, so they contain garbage values.
      
      When hot-adding and onlining a CPU, wq_update_unbound_numa() is called.
      wq_update_unbound_numa() calls alloc_unbound_pwq(), and alloc_unbound_pwq()
      calls get_unbound_pool().  In get_unbound_pool(), worker_pool->node is set
      as follows:
      
      	/* if cpumask is contained inside a NUMA node, we belong to that node */
      	if (wq_numa_enabled) {
      		for_each_node(node) {
      			if (cpumask_subset(pool->attrs->cpumask,
      					   wq_numa_possible_cpumask[node])) {
      				pool->node = node;
      				break;
      			}
      		}
      	}
      
      But wq_numa_possible_cpumask[node] does not hold a correct cpumask, so
      the wrong node is selected and the kernel panics.
      
      With this patch, all entries of wq_numa_possible_cpumask are allocated
      with zalloc_cpumask_var_node() so that they start out zeroed, and the
      panic disappears (see the sketch after this entry).
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: bce90380 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
      5a6024f1
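
      A minimal sketch of the fix; the surrounding allocation loop in the NUMA
      init path is abbreviated and assumed.
      
      	/* Sketch: zero-initialize each per-node mask at allocation time. */
      	for_each_node(node)
      		BUG_ON(!zalloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
      						GFP_KERNEL, node));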
  20. 02 July 2014, 2 commits
    • workqueue: stronger test in process_one_work() · 85327af6
      Committed by Lai Jiangshan
      When POOL_DISASSOCIATED is cleared, the running worker's local CPU should
      be the same as pool->cpu without any exception, even during cpu-hotplug.
      
      This patch changes "(proposition_A && proposition_B && proposition_C)"
      to "(proposition_B && proposition_C)".  If the old compound proposition
      was true, the new one must be true too, so this won't hide any possible
      bug which could be hit by the old test (see the sketch after this entry).
      
      tj: Minor description update and dropped the obvious comment.
      
      CC: Jason J. Herne <jjherne@linux.vnet.ibm.com>
      CC: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      85327af6
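
      The strengthened sanity check presumably looks like the following sketch,
      based on the description above; the exact form in process_one_work() may
      differ.
      
      	/*
      	 * Sketch: unless the pool is DISASSOCIATED, a running worker must be
      	 * on pool->cpu, with no exception even during CPU hotplug.
      	 */
      	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
      		     raw_smp_processor_id() != pool->cpu);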
    • workqueue: clear POOL_DISASSOCIATED in rebind_workers() · 3de5e884
      Committed by Lai Jiangshan
      a9ab775b ("workqueue: directly restore CPU affinity of workers
      from CPU_ONLINE") moved pool locking into rebind_workers() but left
      "pool->flags &= ~POOL_DISASSOCIATED" in workqueue_cpu_up_callback().
      
      There is nothing necessarily wrong with it, but there is no benefit
      either.  Let's move it into rebind_workers() and achieve the following
      benefits:
      
        1) better readability, POOL_DISASSOCIATED is cleared in rebind_workers()
           as expected.
      
        2) we can guarantee that, when POOL_DISASSOCIATED is clear, the
           running workers of the pool are on the local CPU (pool->cpu).
      
      tj: Minor description update.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      3de5e884
  21. 24 June 2014, 1 commit
  22. 20 June 2014, 8 commits
    • workqueue: stronger test in process_one_work() · 807407c0
      Committed by Lai Jiangshan
      After the recent changes, when POOL_DISASSOCIATED is cleared, the
      running worker's local CPU should be the same as pool->cpu without any
      exception even during cpu-hotplug.  Update the sanity check in
      process_one_work() accordingly.
      
      This patch changes "(proposition_A && proposition_B && proposition_C)"
      to "(proposition_B && proposition_C)".  If the old compound proposition
      was true, the new one must be true too, so this will not hide any
      possible bug which could be caught by the old test.
      
      tj: Minor updates to the description.
      
      CC: Jason J. Herne <jjherne@linux.vnet.ibm.com>
      CC: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      807407c0
    • workqueue: clear POOL_DISASSOCIATED in rebind_workers() · f05b558d
      Committed by Lai Jiangshan
      The commit a9ab775b ("workqueue: directly restore CPU affinity of
      workers from CPU_ONLINE") moved the pool->lock into rebind_workers()
      without also moving "pool->flags &= ~POOL_DISASSOCIATED".
      
      There is nothing wrong with "pool->flags &= ~POOL_DISASSOCIATED" not
      being moved together, but there isn't any benefit either. We move it
      into rebind_workers() and achieve these benefits:
      
      1) Better readability.  POOL_DISASSOCIATED is cleared in
         rebind_workers() as expected.
      
      2) When POOL_DISASSOCIATED is cleared, we can ensure that all the
         running workers of the pool are on the local CPU (pool->cpu).
      
      tj: Cosmetic updates to the code and description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      f05b558d
    • workqueue: sanity check pool->cpu in wq_worker_sleeping() · 92b69f50
      Committed by Lai Jiangshan
      In theory, pool->cpu equals @cpu in wq_worker_sleeping() once
      worker->flags has been checked.
      
      A "pool->cpu != cpu" sanity check will help us if something goes wrong
      (see the sketch after this entry).
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      92b69f50
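
      A hedged sketch of the added check; its exact placement inside
      wq_worker_sleeping() is assumed.
      
      	/* Sketch: bail out loudly if the worker is not on its pool's CPU. */
      	if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id()))
      		return NULL;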
    • workqueue: clear leftover flags when detached · b62c0751
      Committed by Lai Jiangshan
      When a worker is detached, worker->flags may still have WORKER_UNBOUND
      or WORKER_REBOUND set.  This is OK in all cases:
        1) if it is a normal worker, the worker is about to die, so it doesn't matter.
        2) if it is a rescuer, it may re-attach to a pool with these leftover flag(s);
           that is still correct, except that it may cause an unneeded wakeup.
      
      It is correct but not ideal, so we just clear the leftover flags
      (see the sketch after this entry).
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b62c0751
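
      A minimal sketch of the cleanup at detach time, assumed to live in
      worker_detach_from_pool().
      
      	/* Sketch: drop flags that only make sense while attached to a pool. */
      	worker->flags &= ~(WORKER_UNBOUND | WORKER_REBOUND);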
    • workqueue: remove useless WARN_ON_ONCE() · 25ef0958
      Committed by Lai Jiangshan
      The @cpu is fetched via smp_processor_id() in this function,
      so the check is useless.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      25ef0958
    • workqueue: use schedule_timeout_interruptible() instead of open code · e212f361
      Committed by Lai Jiangshan
      schedule_timeout_interruptible(CREATE_COOLDOWN) is exactly the same as
      the original code.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e212f361
    • workqueue: remove the empty check in too_many_workers() · e6a9a771
      Committed by Lai Jiangshan
      The commit ea1abd61 ("workqueue: reimplement idle worker rebinding")
      used a trick which simply removes all to-be-bound idle workers from the
      idle list and lets them add themselves back after completing rebinding.
      
      This trick caused @worker_pool->nr_idle to deviate from the actual
      number of idle workers on @worker_pool->idle_list.  More specifically,
      nr_idle could be non-zero while ->idle_list was empty.  All users of
      ->nr_idle and ->idle_list were audited.  The only affected one was
      too_many_workers(), which was updated to return %false if ->idle_list
      is empty regardless of ->nr_idle.
      
      The commit/trick was complicated because it tried to simplify an even
      more complicated problem (workers had to rebind themselves).  But the
      commit a9ab775b ("workqueue: directly restore CPU affinity of workers
      from CPU_ONLINE") fixed all these problems, and the mentioned trick is
      useless and gone.
      
      So now @worker_pool->nr_idle is exactly the actual number of workers
      on @worker_pool->idle_list, and too_many_workers() can go back to what
      it was before the trick.  Remove the empty check (the post-removal
      function is sketched after this entry).
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e6a9a771
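
      For reference, a hedged sketch of too_many_workers() with the empty-list
      check gone; the manager accounting and the ratio constant are recalled from
      that era of the code and should be treated as assumptions.
      
      	/* Sketch: decide whether idle workers should start timing out. */
      	static bool too_many_workers_sketch(struct worker_pool *pool)
      	{
      		bool managing = mutex_is_locked(&pool->manager_arb);
      		int nr_idle = pool->nr_idle + managing;	/* the manager counts as idle */
      		int nr_busy = pool->nr_workers - nr_idle;
      
      		return nr_idle > 2 &&
      		       (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
      	}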
    • workqueue: use "pool->cpu < 0" to stand for an unbound pool · 61d0fbb4
      Committed by Lai Jiangshan
      There is a piece of sanity-check code in put_unbound_pool().  Its
      meaning is "if it is not an unbound pool, complain and return".  But
      the code uses "pool->flags & POOL_DISASSOCIATED" imprecisely, since a
      non-unbound pool may also have this flag set.
      
      We should use "pool->cpu < 0" to stand for an unbound pool, so we
      convert the code to that.
      
      Keeping "pool->flags & POOL_DISASSOCIATED" here wouldn't be strictly
      wrong, but it is just noise:
        1) we care about "unbound" here, not "[dis]association".
        2) "pool->cpu < 0" already implies "pool->flags & POOL_DISASSOCIATED".
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      61d0fbb4