1. 30 April 2013 (3 commits)
    • relay: move remove_buf_file inside relay_close_buf · b8d4a5bf
      Authored by Dmitry Monakhov
      Currently the remove_buf_file callback is called from the kobject
      release method. This results in the following issue:
      # blktrace -d /dev/sda1 -d /dev/sda -o test
      
      blktrace_setup()
       dir = create_dir()
       rchan = relay_open(dir,...)
       ->create_buf_file_callback
          buf_file  = debugfs_create_file(dir, )
      
      Userspace will open buf_file.
      Later we make a decision to stop tracing
      blktrace_down()
        relay_close(rchan)  /* just decrement kobj reference  */
                            /* since it is not zero then callback not called */
        debugfs_remove(dir) /* FAIL due to non empty dir   */
      
      Later, user space will close the file and the file will be deleted,
      but the directory still exists:
      user_space_close()
       ->file_release
         ->release_buf_file_callback
           ->debugfs_remove(buf_file)
      ## TESTCASE:
      # blktrace -d /dev/sda1 -d /dev/sda -o test
      # After that the blktrace infrastructure remains broken in an
      # unusable state, so a subsequent "blktrace -d /dev/sda1" will not work.
      
      In fact this is a general issue; blktrace is just one example.
      We cannot reliably remove the parent dir until all users have closed
      the buf_file.
      
      Solution: we don't have to wait that long. The file should be deleted
      inside relay_close_buf().
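      
      A minimal sketch of the shape of the fix (assuming the usual
      kernel/relay.c rchan_buf fields; not the verbatim patch): delete the
      debugfs file as soon as the channel buffer is closed instead of
      waiting for the kref release that only runs once the last reader has
      closed the file.
      
        /* relies on struct rchan_buf from <linux/relay.h> */
        static void relay_close_buf(struct rchan_buf *buf)
        {
            buf->finalized = 1;
            del_timer_sync(&buf->timer);
            /* remove the buf_file now so the parent dir can be removed too */
            buf->chan->cb->remove_buf_file(buf->dentry);
            kref_put(&buf->kref, relay_remove_buf);
        }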
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • partitions/efi.c: replace useless kzalloc's by kmalloc's · ea56505b
      Authored by Philippe De Muyter
      In alloc_read_gpt_entries and alloc_read_gpt_header, the kzalloc'ed
      buffers are either completely overwritten by the following read_lba
      call or freed.  As kmalloc is cheaper than kzalloc, use kmalloc.
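      
      Roughly what one of the two helpers looks like after the change (a
      sketch assuming the usual read_lba()/bdev_logical_block_size() helpers
      in block/partitions/efi.c, not the exact patch):
      
        static gpt_header *alloc_read_gpt_header(struct parsed_partitions *state,
                                                 u64 lba)
        {
            gpt_header *gpt;
            unsigned ssz = bdev_logical_block_size(state->bdev);
        
            /* kmalloc is enough: read_lba() overwrites the whole buffer */
            gpt = kmalloc(ssz, GFP_KERNEL);
            if (!gpt)
                return NULL;
        
            if (read_lba(state, lba, (u8 *)gpt, ssz) < ssz) {
                kfree(gpt);
                return NULL;
            }
            return gpt;
        }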
      Signed-off-by: Philippe De Muyter <phdm@macqel.be>
      Cc: Matt Domsch <Matt_Domsch@dell.com>
      Cc: Panagiotis Issaris <takis@issaris.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read() · 6f8f5c26
      Authored by Gu Zheng
      blkdev_aio_read() tests 'size' to see if it is equal to or greater than
      the count we request (iocb->ki_left).  If so, there is no need to call
      iov_shorten() to reduce the number of segments and the iovec's length,
      so the test should be 'if (size < iocb->ki_left)' instead.
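      
      A sketch of the corrected read path (simplified; assumes the 3.9-era
      blkdev_aio_read() in fs/block_dev.c, where 'size' is the remaining
      device size past 'pos'):
      
        static ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
                                       unsigned long nr_segs, loff_t pos)
        {
            struct file *file = iocb->ki_filp;
            struct inode *bd_inode = file->f_mapping->host;
            loff_t size = i_size_read(bd_inode);
        
            if (pos >= size)
                return 0;
        
            size -= pos;
            /* only shorten the iovec when the device has fewer bytes left
             * than the caller asked for */
            if (size < iocb->ki_left)
                nr_segs = iov_shorten((struct iovec *)iov, nr_segs, size);
            return generic_file_aio_read(iocb, iov, nr_segs, pos);
        }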
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 24 April 2013 (1 commit)
    • block: fix max discard sectors limit · 871dd928
      Authored by James Bottomley
      linux v3.8-rc1 and later support plugging for blkdev_issue_discard via
      commit 0cfbcafc
      ("block: add plug for blkdev_issue_discard").
      
      For example:
      1) DISCARD rq-1 with size 4GB
      2) DISCARD rq-2 with size 1GB
      
      If these two discard requests get merged, the final request size will
      be 5GB.
      
      In this case the request's __data_len field may overflow, as it can
      store at most 4GB (it is an unsigned int).
      
      This issue was observed while doing mkfs.f2fs on a 5GB SD card:
      https://lkml.org/lkml/2013/4/1/292
      
      Info: sector size = 512
      Info: total sectors = 11370496 (in 512bytes)
      Info: zone aligned segment0 blkaddr: 512
      [  257.789764] blk_update_request: bio idx 0 >= vcnt 0
      
      The mkfs process gets stuck in D state and I see the following in dmesg:
      
      [  257.789733] __end_that: dev mmcblk0: type=1, flags=122c8081
      [  257.789764]   sector 4194304, nr/cnr 2981888/4294959104
      [  257.789764]   bio df3840c0, biotail df3848c0, buffer   (null), len
      1526726656
      [  257.789764] blk_update_request: bio idx 0 >= vcnt 0
      [  257.794921] request botched: dev mmcblk0: type=1, flags=122c8081
      [  257.794921]   sector 4194304, nr/cnr 2981888/4294959104
      [  257.794921]   bio df3840c0, biotail df3848c0, buffer   (null), len
      1526726656
      
      This patch fixes this issue.
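      
      A sketch of the idea behind the fix (assumed, not the verbatim patch;
      'q' and 'granularity' as used in blkdev_issue_discard()): cap the
      number of discard sectors handed to a single request so the byte
      count always fits in the request's 32-bit __data_len field.
      UINT_MAX >> 9 is the largest sector count whose byte count (<< 9)
      still fits in an unsigned int.
      
        unsigned int max_discard_sectors =
            min(q->limits.max_discard_sectors, UINT_MAX >> 9);
        
        /* keep the cap aligned to the device's discard granularity */
        max_discard_sectors -= max_discard_sectors % granularity;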
      Reported-by: Max Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: James Bottomley <JBottomley@Parallels.com>
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Tested-by: Max Filippov <jcmvbkbc@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 09 April 2013 (2 commits)
    • blkcg: fix "scheduling while atomic" in blk_queue_bypass_start · e5072664
      Authored by Jun'ichi Nomura
      Since 749fefe6 in v3.7 ("block: lift the initial queue bypass mode
      on blk_register_queue() instead of blk_init_allocated_queue()"),
      the following warning appears when multipath is used with CONFIG_PREEMPT=y.
      
      This patch moves blk_queue_bypass_start() before radix_tree_preload()
      to avoid the sleeping call while preemption is disabled.
      
        BUG: scheduling while atomic: multipath/2460/0x00000002
        1 lock held by multipath/2460:
         #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
        Modules linked in: ...
        Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
        Call Trace:
         [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
         [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
         [<ffffffff814291e6>] schedule+0x64/0x66
         [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
         [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
         [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
         [<ffffffff814289e3>] wait_for_common+0x9d/0xee
         [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
         [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
         [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
         [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
         [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
         [<ffffffff8106fb18>] ? complete+0x21/0x53
         [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
         [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
         [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
         [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
         [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
         [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
         [<ffffffff811d8c41>] elevator_init+0xe1/0x115
         [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
         [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
         [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
         [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
         [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
         [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
         [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
         [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
         [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
         [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
         [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
         [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
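      
      A simplified sketch of the ordering fix in blkcg_activate_policy()
      (details of the real function omitted; the function name below is made
      up for illustration).  radix_tree_preload() returns with preemption
      disabled, while blk_queue_bypass_start() may sleep because it can end
      up in synchronize_rcu(), so the bypass has to be entered before the
      preload, not after it:
      
        static int activate_policy_ordering_sketch(struct request_queue *q)
        {
            bool preloaded;
        
            blk_queue_bypass_start(q);      /* may sleep - do it first */
        
            preloaded = !radix_tree_preload(GFP_KERNEL);    /* preempt off */
        
            /* ... allocate and install per-blkg policy data here ... */
        
            if (preloaded)
                radix_tree_preload_end();   /* re-enables preemption */
            blk_queue_bypass_end(q);
            return 0;
        }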
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Documentation: cfq-iosched: update documentation help for cfq tunables · fdc6fdc5
      Authored by Namjae Jeon
      Add documentation text for the latency, target_latency and group_idle
      tunable parameters in block/cfq-iosched.txt.
      Also fix a few typos.
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
      
      Language somewhat modified by Jens.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 02 April 2013 (22 commits)
    • Merge branch 'writeback-workqueue' of... · 64f8de4d
      Authored by Jens Axboe
      Merge branch 'writeback-workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.10/core
      
      Tejun writes:
      
      -----
      
      This is the pull request for the earlier patchset[1] with the same
      name.  It's only three patches (the first one was committed to
      workqueue tree) but the merge strategy is a bit involved due to the
      dependencies.
      
      * Because the conversion needs features from wq/for-3.10,
        block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
        with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
        workqueue conflicts from flaring up in block tree.
      
      * Resolving the issue that Jan and Dave raised about debugging
        requires arch-wide changes.  The patchset is being worked on[2] but
        it'll have to go through -mm after these changes show up in -next,
        so it is not included in this pull request.
      
      The three commits are located in the following git branch.
      
        git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue
      
      Pulling it into block/for-3.10/core produces a conflict in
      drivers/md/raid5.c between the following two commits.
      
        e3620a3a ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
        2f6db2a7 ("raid5: use bio_reset()")
      
      The conflict is trivial - one removes an "if ()" conditional while the
      other removes "rbi->bi_next = NULL" right above it.  We just need to
      remove both.  The merged branch is available at
      
        git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge
      
      so that you can use it for verification.  The test merge commit has
      proper merge description.
      
      While these changes are a bit of a pain to route, they make the code
      simpler and even show a minute but measurable performance gain[3],
      even on a workload which isn't particularly favorable to showing the
      benefits of this conversion.
      
      ----
      
      Fixed up the conflict.
      
      Conflicts:
      	drivers/md/raid5.c
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • writeback: expose the bdi_wq workqueue · b5c872dd
      Authored by Tejun Heo
      There are cases where userland wants to tweak the priority and
      affinity of writeback flushers.  Expose bdi_wq to userland by setting
      WQ_SYSFS.  It appears under /sys/bus/workqueue/devices/writeback/ and
      allows adjusting maximum concurrency level, cpumask and nice level.
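      
      A minimal sketch of the idea (the init-function name is made up for
      illustration and the flags other than WQ_SYSFS are assumptions):
      passing WQ_SYSFS when the writeback workqueue is created is what makes
      it show up under /sys/bus/workqueue/devices/writeback/.
      
        #include <linux/workqueue.h>
        
        struct workqueue_struct *bdi_wq;
        
        static int __init bdi_wq_init_sketch(void)
        {
            bdi_wq = alloc_workqueue("writeback", WQ_MEM_RECLAIM | WQ_FREEZABLE |
                                     WQ_UNBOUND | WQ_SYSFS, 0);
            return bdi_wq ? 0 : -ENOMEM;
        }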
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • writeback: replace custom worker pool implementation with unbound workqueue · 839a8e86
      Authored by Tejun Heo
      Writeback implements its own worker pool - each bdi can be associated
      with a worker thread which is created and destroyed dynamically.  The
      worker thread for the default bdi is always present and serves as the
      "forker" thread which forks off worker threads for other bdis.
      
      There's no reason for writeback to implement its own worker pool when
      using an unbound workqueue instead is much simpler and more efficient.
      This patch replaces the custom worker pool implementation in writeback
      with an unbound workqueue.
      
      The conversion isn't too complicated, but the following points are
      worth mentioning.
      
      * bdi_writeback->last_active, task and wakeup_timer are removed.
        delayed_work ->dwork is added instead.  Explicit timer handling is
        no longer necessary.  Everything works by either queueing / modding
        / flushing / canceling the delayed_work item.
      
      * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
        bdi_writeback->dwork.  On each execution, it processes
        bdi->work_list and reschedules itself if there are more things to
        do.
      
        The function also handles the low-memory condition, which used to be
        handled by the forker thread.  If the function is running off a
        rescuer thread, it only writes out a limited number of pages so that
        the rescuer can serve other bdis too.  This preserves the flusher
        creation failure behavior of the forker thread.
      
      * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
        bdi_writeback_workfn() about on-going bdi unregistration so that it
        always drains work_list even if it's running off the rescuer.  Note
        that the original code was broken in this regard.  Under memory
        pressure, a bdi could finish unregistration with non-empty
        work_list.
      
      * The default bdi is no longer special.  It now is treated the same as
        any other bdi and bdi_cap_flush_forker() is removed.
      
      * BDI_pending is no longer used.  Removed.
      
      * Some tracepoints become non-applicable.  The following TPs are
        removed - writeback_nothread, writeback_wake_thread,
        writeback_wake_forker_thread, writeback_thread_start,
        writeback_thread_stop.
      
      Everything, including devices coming and going away and rescuer
      operation under simulated memory pressure, seems to work fine in my
      test setup.
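      
      A sketch of what flusher wakeup looks like after the conversion
      (assuming the post-conversion names bdi_wq and wb->dwork; the wrapper
      name is made up): each bdi_writeback carries a delayed_work instead of
      its own thread, so waking the flusher is just a matter of (re)queueing
      that work item.
      
        #include <linux/workqueue.h>
        #include <linux/backing-dev.h>
        
        extern struct workqueue_struct *bdi_wq;    /* assumed visible here */
        
        static void wakeup_flusher_sketch(struct bdi_writeback *wb,
                                          unsigned long delay_jiffies)
        {
            /* queue, or pull an already-queued item earlier;
             * no explicit timer handling is needed any more */
            mod_delayed_work(bdi_wq, &wb->dwork, delay_jiffies);
        }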
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
    • writeback: remove unused bdi_pending_list · 181387da
      Authored by Tejun Heo
      There's no user left.  Remove it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
    • Merge tag 'v3.9-rc5' into wq/for-3.10 · 229641a6
      Authored by Tejun Heo
      Writeback conversion to workqueue will be based on top of wq/for-3.10
      branch to take advantage of custom attrs and NUMA support for unbound
      workqueues.  Mainline currently contains two commits which result in
      non-trivial merge conflicts with wq/for-3.10 and because
      block/for-3.10/core is based on v3.9-rc3 which contains one of the
      conflicting commits, we need a pre-merge-window merge anyway.  Let's
      pull v3.9-rc5 into wq/for-3.10 so that the block tree doesn't suffer
      from workqueue merge conflicts.
      
      The two conflicts and their resolutions:
      
      * e68035fb ("workqueue: convert to idr_alloc()") in mainline changes
        worker_pool_assign_id() to use idr_alloc() instead of the old idr
        interface.  worker_pool_assign_id() goes through multiple locking
        changes in wq/for-3.10 causing the following conflict.
      
        static int worker_pool_assign_id(struct worker_pool *pool)
        {
      	  int ret;
      
        <<<<<<< HEAD
      	  lockdep_assert_held(&wq_pool_mutex);
      
      	  do {
      		  if (!idr_pre_get(&worker_pool_idr, GFP_KERNEL))
      			  return -ENOMEM;
      		  ret = idr_get_new(&worker_pool_idr, pool, &pool->id);
      	  } while (ret == -EAGAIN);
        =======
      	  mutex_lock(&worker_pool_idr_mutex);
      	  ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
      	  if (ret >= 0)
      		  pool->id = ret;
      	  mutex_unlock(&worker_pool_idr_mutex);
        >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
      
      	  return ret < 0 ? ret : 0;
        }
      
        We want locking from the former and idr_alloc() usage from the
        latter, which can be combined to the following.
      
        static int worker_pool_assign_id(struct worker_pool *pool)
        {
      	  int ret;
      
      	  lockdep_assert_held(&wq_pool_mutex);
      
      	  ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
      	  if (ret >= 0) {
      		  pool->id = ret;
      		  return 0;
      	  }
      	  return ret;
         }
      
      * eb283428 ("workqueue: fix possible pool stall bug in
        wq_unbind_fn()") updated wq_unbind_fn() such that it has a single
        larger for_each_std_worker_pool() loop instead of two separate loops
        with a schedule() call in between.  wq/for-3.10 renamed
        pool->assoc_mutex to pool->manager_mutex, causing the following
        conflict (earlier function body and comments omitted for brevity).
      
        static void wq_unbind_fn(struct work_struct *work)
        {
        ...
      		  spin_unlock_irq(&pool->lock);
        <<<<<<< HEAD
      		  mutex_unlock(&pool->manager_mutex);
      	  }
        =======
      		  mutex_unlock(&pool->assoc_mutex);
        >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
      
      		  schedule();
      
        <<<<<<< HEAD
      	  for_each_cpu_worker_pool(pool, cpu)
        =======
        >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
      		  atomic_set(&pool->nr_running, 0);
      
      		  spin_lock_irq(&pool->lock);
      		  wake_up_worker(pool);
      		  spin_unlock_irq(&pool->lock);
      	  }
        }
      
        The resolution is mostly trivial.  We want the control flow of the
        latter with the rename of the former.
      
        static void wq_unbind_fn(struct work_struct *work)
        {
        ...
      		  spin_unlock_irq(&pool->lock);
      		  mutex_unlock(&pool->manager_mutex);
      
      		  schedule();
      
      		  atomic_set(&pool->nr_running, 0);
      
      		  spin_lock_irq(&pool->lock);
      		  wake_up_worker(pool);
      		  spin_unlock_irq(&pool->lock);
      	  }
        }
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: update sysfs interface to reflect NUMA awareness and a kernel param... · d55262c4
      Authored by Tejun Heo
      workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
      
      Unbound workqueues are now NUMA aware.  Let's add some control knobs
      and update sysfs interface accordingly.
      
      * Add kernel param workqueue.numa_disable which disables NUMA affinity
        globally.
      
      * Replace sysfs file "pool_id" with "pool_ids" which contain
        node:pool_id pairs.  This change is userland-visible but "pool_id"
        hasn't seen a release yet, so this is okay.
      
      * Add a new sysfs file "numa" which can toggle NUMA affinity on
        individual workqueues.  This is implemented as attrs->no_numa, which
        is special in that it isn't part of a pool's attributes.  It only
        affects how apply_workqueue_attrs() picks which pools to use.
      
      After the "pool_ids" change, first_pwq() doesn't have any users left.
      Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: implement NUMA affinity for unbound workqueues · 4c16bd32
      Authored by Tejun Heo
      Currently, an unbound workqueue has a single current, or first, pwq
      (pool_workqueue) to which all new work items are queued.  This often
      isn't optimal on NUMA machines as workers may jump around across node
      boundaries and work items get assigned to workers without any regard
      to NUMA affinity.
      
      This patch implements NUMA affinity for unbound workqueues.  Instead
      of mapping all entries of numa_pwq_tbl[] to the same pwq,
      apply_workqueue_attrs() now creates a separate pwq covering the
      intersecting CPUs for each NUMA node which has online CPUs in
      @attrs->cpumask.  Nodes which don't have intersecting possible CPUs
      are mapped to pwqs covering whole @attrs->cpumask.
      
      As CPUs come up and go down, the pool association is changed
      accordingly.  Changing pool association may involve allocating new
      pools which may fail.  To avoid failing CPU_DOWN, each workqueue
      always keeps a default pwq covering the whole attrs->cpumask, which is
      used as a fallback if pool creation fails during a CPU hotplug
      operation.
      
      This ensures that all work items issued on a NUMA node are executed on
      the same node as long as the workqueue allows execution on the CPUs of
      that node.
      
      As this maps a workqueue to multiple pwqs and max_active is per-pwq,
      this changes the behavior of max_active.  The limit is now per NUMA
      node instead of global.  While this is an actual change, max_active is
      already per-cpu for per-cpu workqueues and is primarily used as a
      safety mechanism rather than for active concurrency control.
      Concurrency is usually limited by workqueue users through the number
      of concurrently active work items, so this change shouldn't matter
      much.
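      
      A sketch of the per-node cpumask selection idea (heavily simplified;
      the helper name is made up, and the real code consults the possible
      CPUs of each node rather than just cpumask_of_node()): a node gets its
      own pwq only if it actually intersects attrs->cpumask, otherwise the
      default pwq is used.
      
        static bool node_cpumask_sketch(const struct workqueue_attrs *attrs,
                                        int node, struct cpumask *out)
        {
            cpumask_and(out, attrs->cpumask, cpumask_of_node(node));
            return !cpumask_empty(out);    /* false: fall back to dfl pwq */
        }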
      
      v2: Fixed pwq freeing in apply_workqueue_attrs() error path.  Spotted
          by Lai.
      
      v3: The previous version incorrectly made a workqueue spanning
          multiple nodes spread work items over all online CPUs when some of
          its nodes don't have any desired cpus.  Reimplemented so that NUMA
          affinity is properly updated as CPUs go up and down.  This problem
          was spotted by Lai Jiangshan.
      
      v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
          however, wq may be freed at any time after dfl_pwq is put making
          the clearing use-after-free.  Clear wq->dfl_pwq before putting it.
      
      v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
          @pwq_tbl after success.  Fixed.
      
          Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
          application of new attrs is excluded via CPU hotplug.  Removed.
      
          Documentation on CPU affinity guarantee on CPU_DOWN added.
      
          All changes are suggested by Lai Jiangshan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: introduce put_pwq_unlocked() · dce90d47
      Authored by Tejun Heo
      Factor out lock pool, put_pwq(), unlock sequence into
      put_pwq_unlocked().  The two existing places are converted and there
      will be more with NUMA affinity support.
      
      This is to prepare for NUMA affinity support for unbound workqueues
      and doesn't introduce any functional difference.
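      
      A minimal sketch of the helper (close to the actual one, though
      details may differ): drop a pwq reference while taking and releasing
      the pool lock, tolerating a NULL pwq for the callers' convenience.
      
        static void put_pwq_unlocked(struct pool_workqueue *pwq)
        {
            if (pwq) {
                spin_lock_irq(&pwq->pool->lock);
                put_pwq(pwq);
                spin_unlock_irq(&pwq->pool->lock);
            }
        }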
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: introduce numa_pwq_tbl_install() · 1befcf30
      Authored by Tejun Heo
      Factor out pool_workqueue linking and installation into numa_pwq_tbl[]
      from apply_workqueue_attrs() into numa_pwq_tbl_install().  link_pwq()
      is made safe to call multiple times.  numa_pwq_tbl_install() links the
      pwq, installs it into numa_pwq_tbl[] at the specified node and returns
      the old entry.
      
      @last_pwq is removed from link_pwq() as the return value of the new
      function can be used instead.
      
      This is to prepare for NUMA affinity support for unbound workqueues.
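      
      A sketch of the new helper (approximate; locking annotations and RCU
      details may differ from the real one): link the pwq, publish it in
      numa_pwq_tbl[node], and hand the old entry back to the caller.
      
        static struct pool_workqueue *
        numa_pwq_tbl_install(struct workqueue_struct *wq, int node,
                             struct pool_workqueue *pwq)
        {
            struct pool_workqueue *old_pwq;
        
            lockdep_assert_held(&wq->mutex);
        
            link_pwq(pwq);    /* safe to call more than once */
        
            old_pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
            rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
            return old_pwq;
        }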
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: use NUMA-aware allocation for pool_workqueues · e50aba9a
      Authored by Tejun Heo
      Use kmem_cache_alloc_node() with @pool->node instead of
      kmem_cache_zalloc() when allocating a pool_workqueue so that it's
      allocated on the same node as the associated worker_pool.  As there's
      no kmem_cache_zalloc_node(), move zeroing to init_pwq().
      
      This was suggested by Lai Jiangshan.
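      
      A sketch of the allocation change (assuming pwq_cache and pool->node
      as in kernel/workqueue.c; the wrapper name is made up): allocate the
      pool_workqueue on the pool's node and zero it separately, since there
      is no kmem_cache_zalloc_node().
      
        static struct pool_workqueue *alloc_pwq_sketch(struct worker_pool *pool)
        {
            struct pool_workqueue *pwq;
        
            pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool->node);
            if (pwq)
                memset(pwq, 0, sizeof(*pwq));  /* done in init_pwq() upstream */
            return pwq;
        }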
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq() · f147f29e
      Authored by Tejun Heo
      Break init_and_link_pwq() into init_pwq() and link_pwq() and move
      unbound-workqueue specific handling into apply_workqueue_attrs().
      Also, factor out unbound pool and pool_workqueue allocation into
      alloc_unbound_pwq().
      
      This reorganization is to prepare for NUMA affinity and doesn't
      introduce any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: map an unbound workqueue to multiple per-node pool_workqueues · df2d5ae4
      Authored by Tejun Heo
      Currently, an unbound workqueue has only one "current" pool_workqueue
      associated with it.  It may have multiple pool_workqueues but only the
      first pool_workqueue serves new work items.  For NUMA affinity, we
      want to change this so that there are multiple current pool_workqueues
      serving different NUMA nodes.
      
      Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and
      points to the pool_workqueue to use for each possible node.  This
      replaces first_pwq() in __queue_work() and workqueue_congested().
      
      numa_pwq_tbl[] is currently initialized to point to the same
      pool_workqueue as first_pwq() so this patch doesn't make any behavior
      changes.
      
      v2: Use rcu_dereference_raw() in unbound_pwq_by_node() as the function
          may be called only with wq->mutex held.
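      
      A sketch of the per-node lookup (close to the real helper, though the
      exact RCU/lockdep annotations are omitted).  __queue_work() and
      workqueue_congested() use this instead of first_pwq() so each NUMA
      node can be served by its own pool_workqueue:
      
        static struct pool_workqueue *
        unbound_pwq_by_node(struct workqueue_struct *wq, int node)
        {
            return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
        }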
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: move hot fields of workqueue_struct to the end · 2728fd2f
      Authored by Tejun Heo
      Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and align
      them to a cacheline.  These two fields are used in the work item
      issue path and are thus hot.  The scheduled NUMA affinity support will
      add a dispatch table at the end of workqueue_struct, and relocating
      these two fields will allow us to hit only a single cacheline on hot
      paths.
      
      Note that wq->pwqs isn't moved although it currently is being used in
      the work item issue path for unbound workqueues.  The dispatch table
      mentioned above will replace its use in the issue path, so it will
      become cold once NUMA support is implemented.
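      
      A sketch of the layout idea (field names as in kernel/workqueue.c,
      everything else omitted): the issue-path fields sit at the end of the
      struct, aligned so they share one cacheline.
      
        struct workqueue_struct_sketch {
            /* ... mostly cold fields ... */
        
            /* hot fields used during command issue, aligned to cacheline */
            unsigned int        flags ____cacheline_aligned;
            struct pool_workqueue __percpu *cpu_pwqs;    /* per-cpu pwqs */
        };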
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: make workqueue->name[] fixed len · ecf6881f
      Authored by Tejun Heo
      Currently workqueue->name[] is of flexible length.  We want to use the
      flexible field for something more useful and there isn't much benefit
      in allowing arbitrary name lengths anyway.  Make it a fixed length,
      capped at 24 bytes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: add workqueue->unbound_attrs · 6029a918
      Authored by Tejun Heo
      Currently, when exposing attrs of an unbound workqueue via sysfs, the
      workqueue_attrs of first_pwq() is used as that should equal the
      current state of the workqueue.
      
      The planned NUMA affinity support will make unbound workqueues make
      use of multiple pool_workqueues for different NUMA nodes and the above
      assumption will no longer hold.  Introduce workqueue->unbound_attrs
      which records the current attrs in effect and use it for sysfs instead
      of first_pwq()->attrs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: determine NUMA node of workers according to the allowed cpumask · f3f90ad4
      Authored by Tejun Heo
      When worker tasks are created using kthread_create_on_node(),
      currently only per-cpu ones have the matching NUMA node specified.
      All unbound workers are always created with NUMA_NO_NODE.
      
      Now that an unbound worker pool may have an arbitrary cpumask
      associated with it, this isn't optimal.  Add pool->node which is
      determined by the pool's cpumask.  If the pool's cpumask is contained
      inside a NUMA node proper, the pool is associated with that node, and
      all workers of the pool are created on that node.
      
      This currently only makes a difference for unbound worker pools whose
      cpumask is contained inside a single NUMA node, but it will serve as
      the foundation for making all unbound pools NUMA-affine.
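      
      A sketch of how pool->node can be derived from the pool's cpumask
      (assuming the wq_numa_possible_cpumask[] array from the preparation
      patch; the helper name is made up): if the cpumask fits entirely
      inside one node, workers are created on that node.
      
        static int pool_node_sketch(const struct workqueue_attrs *attrs)
        {
            int node;
        
            for_each_node(node) {
                if (cpumask_subset(attrs->cpumask,
                                   wq_numa_possible_cpumask[node]))
                    return node;
            }
            return NUMA_NO_NODE;    /* spans nodes: no specific node */
        }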
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: drop 'H' from kworker names of unbound worker pools · e3c916a4
      Authored by Tejun Heo
      Currently, all workqueue workers which have a negative nice value have
      'H' postfixed to their names.  This is necessary for per-cpu workers,
      as they use the CPU number instead of pool->id to identify the pool,
      and the 'H' postfix is the only thing distinguishing normal and
      highpri workers.
      
      As workers for unbound pools use pool->id, the 'H' postfix is purely
      informational.  TASK_COMM_LEN is 16 and after the static part and
      delimiters, there are only five characters left for the pool and
      worker IDs.  We're expecting to have more unbound pools with the
      scheduled NUMA awareness support.  Let's drop the non-essential 'H'
      postfix from unbound kworker name.
      
      While at it, restructure kthread_create*() invocation to help future
      NUMA related changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[] · bce90380
      Authored by Tejun Heo
      Unbound workqueues are going to be NUMA-affine.  Add wq_numa_tbl_len
      and wq_numa_possible_cpumask[] in preparation.  The former is the
      highest NUMA node ID + 1 and the latter holds masks of possible CPUs
      for each NUMA node.
      
      This patch only introduces these.  Future patches will make use of
      them.
      
      v2: NUMA initialization moved into wq_numa_init().  Also, the possible
          cpumask array is not created if there aren't multiple nodes on the
          system.  A wq_numa_enabled bool was added.
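      
      A sketch of the two additions (types as they would plausibly appear in
      kernel/workqueue.c; illustrative, not the exact declarations):
      
        static int wq_numa_tbl_len;    /* highest possible NUMA node id + 1 */
        static cpumask_var_t *wq_numa_possible_cpumask;
                                       /* possible CPUs of each NUMA node */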
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: move pwq_pool_locking outside of get/put_unbound_pool() · a892cacc
      Authored by Tejun Heo
      The scheduled NUMA affinity support for unbound workqueues will need
      to walk the workqueues list and perform pool-related operations on
      each workqueue.
      
      Move wq_pool_mutex locking out of get/put_unbound_pool() to their
      callers so that pool operations can be performed while walking the
      workqueues list, which is also protected by wq_pool_mutex.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: fix memory leak in apply_workqueue_attrs() · 4862125b
      Authored by Tejun Heo
      apply_workqueue_attrs() wasn't freeing the temporary attrs variable
      @new_attrs in its success path.  Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: fix unbound workqueue attrs hashing / comparison · 13e2e556
      Authored by Tejun Heo
      29c91e99 ("workqueue: implement attribute-based unbound worker_pool
      management") implemented attrs-based worker_pool matching.  It tried
      to avoid false negatives when comparing cpumasks with a custom hash
      function; unfortunately, the hash and comparison functions fail to
      ignore CPUs which are not possible.  It incorrectly assumed that
      bitmap_copy() skips leftover bits in the last word of the bitmap and
      that cpumask_equal() ignores impossible CPUs.
      
      This patch updates attrs->cpumask handling such that impossible CPUs
      are properly ignored.
      
      * Hash and copy functions no longer do anything special.  They expect
        their callers to clear impossible CPUs.
      
      * alloc_workqueue_attrs() initializes the cpumask to cpu_possible_mask
        instead of setting all bits and explicit cpumask_setall() for
        unbound_std_wq_attrs[] in init_workqueues() is dropped.
      
      * apply_workqueue_attrs() is now responsible for ignoring impossible
        CPUs.  It makes a copy of @attrs and clears impossible CPUs before
        doing anything else.
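      
      A simplified sketch of the approach (the helper name is made up and
      error handling is trimmed): apply_workqueue_attrs() works on a copy of
      @attrs with impossible CPUs masked out, so pool hashing and comparison
      never see bits beyond cpu_possible_mask.
      
        static struct workqueue_attrs *
        copy_possible_attrs_sketch(const struct workqueue_attrs *attrs)
        {
            struct workqueue_attrs *new_attrs;
        
            new_attrs = alloc_workqueue_attrs(GFP_KERNEL);
            if (!new_attrs)
                return NULL;
        
            copy_workqueue_attrs(new_attrs, attrs);
            cpumask_and(new_attrs->cpumask, new_attrs->cpumask,
                        cpu_possible_mask);
            return new_attrs;
        }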
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: fix race condition in unbound workqueue free path · bc0caf09
      Authored by Tejun Heo
      8864b4e5 ("workqueue: implement get/put_pwq()") implemented pwq
      (pool_workqueue) refcnting which frees workqueue when the last pwq
      goes away.  It determined whether it was the last pwq by testing
      whether wq->pwqs is empty.  Unfortunately, the test was done outside
      wq->mutex, so multiple pwq releases could race and try to free the wq
      multiple times, leading to an oops.
      
      Test wq->pwqs emptiness while holding wq->mutex.
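      
      A sketch of the fixed release path (simplified; the function name is
      made up and the remaining cleanup steps are omitted): the emptiness
      test has to happen while wq->mutex is held, so only one releaser can
      observe the list becoming empty.
      
        static void release_pwq_sketch(struct workqueue_struct *wq,
                                       struct pool_workqueue *pwq)
        {
            bool is_last;
        
            mutex_lock(&wq->mutex);
            list_del_rcu(&pwq->pwqs_node);
            is_last = list_empty(&wq->pwqs);
            mutex_unlock(&wq->mutex);
        
            if (is_last)
                kfree(wq);    /* only one releaser can see it empty */
        }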
      Signed-off-by: Tejun Heo <tj@kernel.org>
  5. 01 April 2013 (5 commits)
    • Linux 3.9-rc5 · 07961ac7
      Authored by Linus Torvalds
    • Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma · 0bb44280
      Authored by Linus Torvalds
      Pull slave-dmaengine fixes from Vinod Koul:
       "Two fixes for slave-dmaengine.
      
        The first one is for making slave_id value correct for dw_dmac and
        the other one fixes the endieness in DT parsing"
      
      * 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
        dw_dmac: adjust slave_id accordingly to request line base
        dmaengine: dw_dma: fix endianess for DT xlate function
    • Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · a7b436d3
      Authored by Linus Torvalds
      Pull media fixes from Mauro Carvalho Chehab:
       "For a some fixes for Kernel 3.9:
         - subsystem build fix when VIDEO_DEV=y, VIDEO_V4L2=m and I2C=m
         - compilation fix for arm multiarch preventing IR_RX51 to be selected
         - regression fix at bttv crop logic
         - s5p-mfc/m5mols/exynos: a few fixes for cameras on exynos hardware"
      
      * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
        [media] [REGRESSION] bt8xx: Fix too large height in cropcap
        [media] fix compilation with both V4L2 and I2C as 'm'
        [media] m5mols: Fix bug in stream on handler
        [media] s5p-fimc: Do not attempt to disable not enabled media pipeline
        [media] s5p-mfc: Fix encoder control 15 issue
        [media] s5p-mfc: Fix frame skip bug
        [media] s5p-fimc: send valid m2m ctx to fimc_m2m_job_finish
        [media] exynos-gsc: send valid m2m ctx to gsc_m2m_job_finish
        [media] fimc-lite: Fix the variable type to avoid possible crash
        [media] fimc-lite: Initialize 'step' field in fimc_lite_ctrl structure
        [media] ir: IR_RX51 only works on OMAP2
    • Merge tag 'for-linus-20130331' of git://git.kernel.dk/linux-block · d299c290
      Authored by Linus Torvalds
      Pull block fixes from Jens Axboe:
       "Alright, this time from 10K up in the air.
      
        Collection of fixes that have been queued up since the merge window
        opened, hence postponed until later in the cycle.  The pull request
        contains:
      
         - A bunch of fixes for the xen blk front/back driver.
      
         - A round of fixes for the new IBM RamSan driver, fixing various
           nasty issues.
      
         - Fixes for multiple drivers from Wei Yongjun: bad handling of return
           values and wrong pointer math.
      
         - A fix for loop properly killing partitions when being detached."
      
      * tag 'for-linus-20130331' of git://git.kernel.dk/linux-block: (25 commits)
        mg_disk: fix error return code in mg_probe()
        rsxx: remove unused variable
        rsxx: enable error return of rsxx_eeh_save_issued_dmas()
        block: removes dynamic allocation on stack
        Block: blk-flush: Fixed indent code style
        cciss: fix invalid use of sizeof in cciss_find_cfgtables()
        loop: cleanup partitions when detaching loop device
        loop: fix error return code in loop_add()
        mtip32xx: fix error return code in mtip_pci_probe()
        xen-blkfront: remove frame list from blk_shadow
        xen-blkfront: pre-allocate pages for requests
        xen-blkback: don't store dev_bus_addr
        xen-blkfront: switch from llist to list
        xen-blkback: fix foreach_grant_safe to handle empty lists
        xen-blkfront: replace kmalloc and then memcpy with kmemdup
        xen-blkback: fix dispatch_rw_block_io() error path
        rsxx: fix missing unlock on error return in rsxx_eeh_remap_dmas()
        Adding in EEH support to the IBM FlashSystem 70/80 device driver
        block: IBM RamSan 70/80 error message bug fix.
        block: IBM RamSan 70/80 branding changes.
        ...
    • Revert "lockdep: check that no locks held at freeze time" · dbf520a9
      Authored by Paul Walmsley
      This reverts commit 6aa97070.
      
      Commit 6aa97070 ("lockdep: check that no locks held at freeze time")
      causes problems with NFS root filesystems.  The failures were noticed on
      OMAP2 and 3 boards during kernel init:
      
        [ BUG: swapper/0/1 still has locks held! ]
        3.9.0-rc3-00344-ga937536b #1 Not tainted
        -------------------------------------
        1 lock held by swapper/0/1:
         #0:  (&type->s_umount_key#13/1){+.+.+.}, at: [<c011e84c>] sget+0x248/0x574
      
        stack backtrace:
          rpc_wait_bit_killable
          __wait_on_bit
          out_of_line_wait_on_bit
          __rpc_execute
          rpc_run_task
          rpc_call_sync
          nfs_proc_get_root
          nfs_get_root
          nfs_fs_mount_common
          nfs_try_mount
          nfs_fs_mount
          mount_fs
          vfs_kern_mount
          do_mount
          sys_mount
          do_mount_root
          mount_root
          prepare_namespace
          kernel_init_freeable
          kernel_init
      
      Although the rootfs mounts, the system is unstable.  Here's a transcript
      from a PM test:
      
        http://www.pwsan.com/omap/testlogs/test_v3.9-rc3/20130317194234/pm/37xxevm/37xxevm_log.txt
      
      Here's what the test log should look like:
      
        http://www.pwsan.com/omap/testlogs/test_v3.8/20130218214403/pm/37xxevm/37xxevm_log.txt
      
      Mailing list discussion is here:
      
        http://lkml.org/lkml/2013/3/4/221
      
      Deal with this for v3.9 by reverting the problem commit, until folks can
      figure out the right long-term course of action.
      Signed-off-by: Paul Walmsley <paul@pwsan.com>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Shawn Guo <shawn.guo@linaro.org>
      Cc: <maciej.rutecki@gmail.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ben Chan <benchan@chromium.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 31 March 2013 (1 commit)
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · 13d2080d
      Authored by Linus Torvalds
      Pull SCSI target fixes from Nicholas Bellinger:
       "This includes the bug-fix for a >= v3.8-rc1 regression specific to
        iscsi-target persistent reservation conflict handling (CC'ed to
        stable), and a tcm_vhost patch to drop VIRTIO_RING_F_EVENT_IDX usage
        so that in-flight qemu vhost-scsi-pci device code can detect the
        proper vhost feature bits.
      
        Also, there are two more tcm_vhost patches still being discussed by
        MST and Asias for v3.9 that will be required for the in-flight qemu
        vhost-scsi-pci device patch to function properly, and that should
        (hopefully) be the last target fixes for this round."
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
        target: Fix RESERVATION_CONFLICT status regression for iscsi-target special case
        tcm_vhost: Avoid VIRTIO_RING_F_EVENT_IDX feature bit
  7. 30 March 2013 (6 commits)