提交 · b54e5ed8f285d62c0d242c4ef9da90937994db02 · openeuler / Kernel

17 7月, 2015 1 次提交

block: partition: introduce hd_free_part() · b54e5ed8

由 Ming Lei 提交于 7月 16, 2015

So the helper can be used in both generic partition
case and part0 case.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

b54e5ed8

16 7月, 2015 1 次提交

blk-mq: set default timeout as 30 seconds · e56f698b

由 Ming Lei 提交于 7月 16, 2015

It is reasonable to set default timeout of request as 30 seconds instead of
30000 ticks, which may be 300 seconds if HZ is 100, for example, some arm64
based systems may choose 100 HZ.
Signed-off-by: NMing Lei <ming.lei@canonical.com>
Fixes: c76cbbcf ("blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()"
Signed-off-by: NJens Axboe <axboe@fb.com>

e56f698b

10 7月, 2015 4 次提交

blkcg: fix blkcg_policy_data allocation bug · 06b285bd

由 Tejun Heo 提交于 7月 09, 2015

e48453c3 ("block, cgroup: implement policy-specific per-blkcg
data") updated per-blkcg policy data to be dynamically allocated.
When a policy is registered, its policy data aren't created.  Instead,
when the policy is activated on a queue, the policy data are allocated
if there are blkg's (blkcg_gq's) which are attached to a given blkcg.
This is buggy.  Consider the following scenario.

1. A blkcg is created.  No blkg's attached yet.

2. The policy is registered.  No policy data is allocated.

3. The policy is activated on a queue.  As the above blkcg doesn't
   have any blkg's, it won't allocate the matching blkcg_policy_data.

4. An IO is issued from the blkcg and blkg is created and the blkcg
   still doesn't have the matching policy data allocated.

With cfq-iosched, this leads to an oops.

It also doesn't free policy data on policy unregistration assuming
that freeing of all policy data on blkcg destruction should take care
of it; however, this also is incorrect.

1. A blkcg has policy data.

2. The policy gets unregistered but the policy data remains.

3. Another policy gets registered on the same slot.

4. Later, the new policy tries to allocate policy data on the previous
   blkcg but the slot is already occupied and gets skipped.  The
   policy ends up operating on the policy data of the previous policy.

There's no reason to manage blkcg_policy_data lazily.  The reason we
do lazy allocation of blkg's is that the number of all possible blkg's
is the product of cgroups and block devices which can reach a
surprising level.  blkcg_policy_data is contrained by the number of
cgroups and shouldn't be a problem.

This patch makes blkcg_policy_data to be allocated for all existing
blkcg's on policy registration and freed on unregistration and removes
blkcg_policy_data handling from policy [de]activation paths.  This
makes that blkcg_policy_data are created and removed with the policy
they belong to and fixes the above described problems.
Signed-off-by: NTejun Heo <tj@kernel.org>
Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

06b285bd

blkcg: implement all_blkcgs list · 7876f930

由 Tejun Heo 提交于 7月 09, 2015

Add all_blkcgs list goes through blkcg->all_blkcgs_node and is
protected by blkcg_pol_mutex.  This will be used to fix
blkcg_policy_data allocation bug.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

7876f930

blkcg: blkcg_css_alloc() should grab blkcg_pol_mutex while iterating blkcg_policy[] · 144232b3

由 Tejun Heo 提交于 7月 09, 2015

An entry in blkcg_policy[] is stable while there are non-bypassing
in-flight IOs on a request_queue which has the policy activated.  This
is why most derefs of blkcg_policy[] don't need explicit locking;
however, blkcg_css_alloc() isn't invoked from IO path and thus doesn't
have this protection and may race policies being added and removed.

Fix it by adding explicit blkcg_pol_mutex protection around
blkcg_policy[] iteration in blkcg_css_alloc().
Signed-off-by: NTejun Heo <tj@kernel.org>
Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

144232b3

blkcg: allow blkcg_pol_mutex to be grabbed from cgroup [file] methods · 838f13bf

由 Tejun Heo 提交于 7月 09, 2015

blkcg_pol_mutex primarily protects the blkcg_policy array.  It also
protects cgroup file type [un]registration during policy addition /
removal.  This puts blkcg_pol_mutex outside cgroup internal
synchronization and in turn makes it impossible to grab from blkcg's
cgroup methods as that leads to cyclic dependency.

Another problematic dependency arising from this is through cgroup
interface file deactivation.  Removing a cftype requires removing all
files of the type which in turn involves draining all on-going
invocations of the file methods.  This means that an interface file
implementation can't grab blkcg_pol_mutex as draining can lead to AA
deadlock.

blkcg_reset_stats() is already in this situation.  It currently
trylocks blkcg_pol_mutex and then unwinds and retries the whole
operation on failure, which is cumbersome at best.  It has a lengthy
comment explaining how cgroup internal synchronization is involved and
expected to be updated but as explained above this doesn't need cgroup
internal locking to deadlock.  It's a self-contained AA deadlock.

The described circular dependencies can be easily broken by moving
cftype [un]registration out of blkcg_pol_mutex and protect them with
an outer mutex.  This patch introduces blkcg_pol_register_mutex which
wraps entire policy [un]registration including cftype operations and
shrinks blkcg_pol_mutex critical section.  This also makes the trylock
dancing in blkcg_reset_stats() unnecessary.  Removed.

This patch is necessary for the following blkcg_policy_data allocation
bug fixes.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

838f13bf

07 7月, 2015 3 次提交

block/blk-cgroup.c: free per-blkcg data when freeing the blkcg · a322baad

由 Arianna Avanzini 提交于 7月 07, 2015

Currently, per-blkcg data is freed each time a policy is deactivated,
that is also upon scheduler switch. However, when switching from a
scheduler implementing a policy which requires per-blkcg data to
another one, that same policy might be active on other devices, and
therefore those same per-blkcg data could be still in use.
This commit lets per-blkcg data be freed when the blkcg is freed
instead of on policy deactivation.
Signed-off-by: NArianna Avanzini <avanzini.arianna@gmail.com>
Reported-and-tested-by: NMichael Kaminsky <kaminsky@cs.cmu.edu>
Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
Signed-off-by: NJens Axboe <axboe@fb.com>

a322baad

block: use FIELD_SIZEOF to calculate size of a field · 0762b23d

由 Maninder Singh 提交于 7月 07, 2015

use FIELD_SIZEOF instead of open coding
Signed-off-by: NManinder Singh <maninder1.s@samsung.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

0762b23d

bio integrity: do not assume bio_integrity_pool exists if bioset exists · bb8bd38b

由 Mike Snitzer 提交于 7月 01, 2015

bio_integrity_alloc() and bio_integrity_free() assume that if a bio was
allocated from a bioset that that bioset also had its bio_integrity_pool
allocated using bioset_integrity_create().  This is a very bad
assumption given that bioset_create() and bioset_integrity_create() are
completely disjoint.  Not all callers of bioset_create() have been
trained to also call bioset_integrity_create() -- and they may not care
to be.

Fix this by falling back to kmalloc'ing 'struct bio_integrity_payload'
rather than force all bioset consumers to (wastefully) preallocate a
bio_integrity_pool that they very likely won't actually need (given the
niche nature of the current block integrity support).

Otherwise, a NULL pointer "Kernel BUG" with a trace like the following
will be observed (as seen on s390x using zfcp storage) because dm-io
doesn't use bioset_integrity_create() when creating its bioset:

    [  791.643338] Call Trace:
    [  791.643339] ([<00000003df98b848>] 0x3df98b848)
    [  791.643341]  [<00000000002c5de8>] bio_integrity_alloc+0x48/0xf8
    [  791.643348]  [<00000000002c6486>] bio_integrity_prep+0xae/0x2f0
    [  791.643349]  [<0000000000371e38>] blk_queue_bio+0x1c8/0x3d8
    [  791.643355]  [<000000000036f8d0>] generic_make_request+0xc0/0x100
    [  791.643357]  [<000000000036f9b2>] submit_bio+0xa2/0x198
    [  791.643406]  [<000003ff801f9774>] dispatch_io+0x15c/0x3b0 [dm_mod]
    [  791.643419]  [<000003ff801f9b3e>] dm_io+0x176/0x2f0 [dm_mod]
    [  791.643423]  [<000003ff8074b28a>] do_reads+0x13a/0x1a8 [dm_mirror]
    [  791.643425]  [<000003ff8074b43a>] do_mirror+0x142/0x298 [dm_mirror]
    [  791.643428]  [<0000000000154fca>] process_one_work+0x18a/0x3f8
    [  791.643432]  [<000000000015598a>] worker_thread+0x132/0x3b0
    [  791.643435]  [<000000000015d49a>] kthread+0xd2/0xd8
    [  791.643438]  [<00000000005bc0ca>] kernel_thread_starter+0x6/0xc
    [  791.643446]  [<00000000005bc0c4>] kernel_thread_starter+0x0/0xc
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@fb.com>

bb8bd38b

28 6月, 2015 1 次提交

block: fix bogus EFAULT error from SG_IO ioctl · 2c4cffe8

由 Paolo Bonzini 提交于 6月 26, 2015

Whenever blk_fill_sghdr_rq fails, its errno code is ignored and changed to
EFAULT.  This can cause very confusing errors:

  $ sg_persist -k /dev/sda
  persistent reservation in: pass through os error: Bad address

The fix is trivial, just propagate the return value from
blk_fill_sghdr_rq.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Acked-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

2c4cffe8

26 6月, 2015 1 次提交

Revert "block, dm: don't copy bios for request clones" · 78d8e58a

由 Mike Snitzer 提交于 6月 26, 2015

This reverts commit 5f1b670d.

Justification for revert as reported in this dm-devel post:
https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html

this change should not be pushed to mainline yet.

Firstly, Christoph has a newer version of the patch that fixes silent
data corruption problem:
  https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html

And the new version still depends on LLDDs to always complete requests
to the end when error happens, while block API doesn't enforce such a
requirement. If the assumption is ever broken, the inconsistency between
request and bio (e.g. rq->__sector and rq->bio) will cause silent data
corruption:
  https://www.redhat.com/archives/dm-devel/2015-June/msg00022.htmlReported-by: NJunichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

78d8e58a

21 6月, 2015 1 次提交

cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL · ae994ea9

由 Jens Axboe 提交于 6月 20, 2015

Commit 9470e4a6 only covered the initial bug report, there are
other spots in CFQ where we need to check that we actually have
a valid cfq_group_data structure.

Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
Signed-off-by: NJens Axboe <axboe@fb.com>

ae994ea9

20 6月, 2015 2 次提交

cfq-iosched: fix sysfs oops when attempting to read unconfigured weights · 9470e4a6

由 Jens Axboe 提交于 6月 19, 2015

If none of the devices in the system are using CFQ, then attempting to
read:

/sys/fs/cgroup/blkio/blkio.leaf_weight

will results in a NULL dereference. Check for a valid cfq_group_data
struct before attempting to dereference it.
Reported-by: NAndrey Wagin <avagin@gmail.com>
Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
Signed-off-by: NJens Axboe <axboe@fb.com>

9470e4a6

cfq-iosched: move group scheduling functions under ifdef · 4ceab71b

由 Jens Axboe 提交于 6月 19, 2015

If CFQ_GROUP_IOSCHED is not set, the compiler produces the
following warning:

  CC      block/cfq-iosched.o
  linux/block/cfq-iosched.c:469:2:
    warning: 'cpd_to_cfqgd' defined but not used [-Wunused-function]
    *cpd_to_cfqgd(struct blkcg_policy_data *cpd)
     ^

In reality, two other lookup functions aren't used either if
CFQ_GROUP_IOSCHED isn't set. Move all three under one of the
CFQ_GROUP_IOSCHED sections in the code.
Reported-by: NVladimir Zapolskiy <vz@mleia.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

4ceab71b

11 6月, 2015 1 次提交

block: fix ext_dev_lock lockdep report · 4d66e5e9

由 Dan Williams 提交于 6月 10, 2015

 =================================
 [ INFO: inconsistent lock state ]
 4.1.0-rc7+ #217 Tainted: G           O
 ---------------------------------
 inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
 swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
  (ext_devt_lock){+.?...}, at: [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70
 {SOFTIRQ-ON-W} state was registered at:
   [<ffffffff810bf6b1>] __lock_acquire+0x461/0x1e70
   [<ffffffff810c1947>] lock_acquire+0xb7/0x290
   [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
   [<ffffffff8143a07d>] blk_alloc_devt+0x6d/0xd0  <-- take the lock in process context
[..]
  [<ffffffff810bf64e>] __lock_acquire+0x3fe/0x1e70
  [<ffffffff810c00ad>] ? __lock_acquire+0xe5d/0x1e70
  [<ffffffff810c1947>] lock_acquire+0xb7/0x290
  [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
  [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
  [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
  [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70    <-- take the lock in softirq
  [<ffffffff8143bfec>] part_release+0x1c/0x50
  [<ffffffff8158edf6>] device_release+0x36/0xb0
  [<ffffffff8145ac2b>] kobject_cleanup+0x7b/0x1a0
  [<ffffffff8145aad0>] kobject_put+0x30/0x70
  [<ffffffff8158f147>] put_device+0x17/0x20
  [<ffffffff8143c29c>] delete_partition_rcu_cb+0x16c/0x180
  [<ffffffff8143c130>] ? read_dev_sector+0xa0/0xa0
  [<ffffffff810e0e0f>] rcu_process_callbacks+0x2ff/0xa90
  [<ffffffff810e0dcf>] ? rcu_process_callbacks+0x2bf/0xa90
  [<ffffffff81067e2e>] __do_softirq+0xde/0x600

Neil sees this in his tests and it also triggers on pmem driver unbind
for the libnvdimm tests.  This fix is on top of an initial fix by Keith
for incorrect usage of mutex_lock() in this path: 2da78092 "block:
Fix dev_t minor allocation lifetime".  Both this and 2da78092 are
candidates for -stable.

Fixes: 2da78092 ("block: Fix dev_t minor allocation lifetime")
Cc: <stable@vger.kernel.org>
Cc: Keith Busch <keith.busch@intel.com>
Reported-by: NNeilBrown <neilb@suse.de>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

4d66e5e9

10 6月, 2015 2 次提交

cfq-iosched: fix the setting of IOPS mode on SSDs · 0bb97947

由 Jens Axboe 提交于 6月 10, 2015

A previous commit wanted to make CFQ default to IOPS mode on
non-rotational storage, however it did so when the queue was
initialized and the non-rotational flag is only set later on
in the probe.

Add an elevator hook that gets called off the add_disk() path,
at that point we know that feature probing has finished, and
we can reliably check for the various flags that drivers can
set.

Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs")
Tested-by: NRomain Francoise <romain@orebokech.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

0bb97947

blk-mq: free hctx->ctxs in queue's release handler · c3b4afca

由 Ming Lei 提交于 6月 04, 2015

Now blk_cleanup_queue() can be called before calling
del_gendisk()[1], inside which hctx->ctxs is touched
from blk_mq_unregister_hctx(), but the variable has
been freed by blk_cleanup_queue() at that time.

So this patch moves freeing of hctx->ctxs into queue's
release handler for fixing the oops reported by Stefan.

[1], 6cd18e71 (block: destroy bdi before blockdev is
unregistered)
Reported-by: NStefan Seyfried <stefan.seyfried@googlemail.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org (v4.0)
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

c3b4afca

07 6月, 2015 1 次提交

block, cgroup: implement policy-specific per-blkcg data · e48453c3

由 Arianna Avanzini 提交于 6月 05, 2015

The block IO (blkio) controller enables the block layer to provide service
guarantees in a hierarchical fashion. Specifically, service guarantees
are provided by registered request-accounting policies. As of now, a
proportional-share and a throttling policy are available. They are
implemented, respectively, by the CFQ I/O scheduler and the blk-throttle
subsystem. Unfortunately, as for adding new policies, the current
implementation of the block IO controller is only halfway ready to allow
new policies to be plugged in. This commit provides a solution to make
the block IO controller fully ready to handle new policies.
In what follows, we first describe briefly the current state, and then
list the changes made by this commit.

The throttling policy does not need any per-cgroup information to perform
its task. In contrast, the proportional share policy uses, for each cgroup,
both the weight assigned by the user to the cgroup, and a set of dynamically-
computed weights, one for each device.

The first, user-defined weight is stored in the blkcg data structure: the
block IO controller allocates a private blkcg data structure for each
cgroup in the blkio cgroups hierarchy (regardless of which policy is active).
In other words, the block IO controller internally mirrors the blkio cgroups
with private blkcg data structures.

On the other hand, for each cgroup and device, the corresponding dynamically-
computed weight is maintained in the following, different way. For each device,
the block IO controller keeps a private blkcg_gq structure for each cgroup in
blkio. In other words, block IO also keeps one private mirror copy of the blkio
cgroups hierarchy for each device, made of blkcg_gq structures.
Each blkcg_gq structure keeps per-policy information in a generic array of
dynamically-allocated 'dedicated' data structures, one for each registered
policy (so currently the array contains two elements). To be inserted into the
generic array, each dedicated data structure embeds a generic blkg_policy_data
structure. Consider now the array contained in the blkcg_gq structure
corresponding to a given pair of cgroup and device: one of the elements
of the array contains the dedicated data structure for the proportional-share
policy, and this dedicated data structure contains the dynamically-computed
weight for that pair of cgroup and device.

The generic strategy adopted for storing per-policy data in blkcg_gq structures
is already capable of handling new policies, whereas the one adopted with blkcg
structures is not, because per-policy data are hard-coded in the blkcg
structures themselves (currently only data related to the proportional-
share policy).

This commit addresses the above issues through the following changes:
. It generalizes blkcg structures so that per-policy data are stored in the same
way as in blkcg_gq structures.
Specifically, it lets also the blkcg structure store per-policy data in a
generic array of dynamically-allocated dedicated data structures. We will
refer to these data structures as blkcg dedicated data structures, to
distinguish them from the dedicated data structures inserted in the generic
arrays kept by blkcg_gq structures.
To allow blkcg dedicated data structures to be inserted in the generic array
inside a blkcg structure, this commit also introduces a new blkcg_policy_data
structure, which is the equivalent of blkg_policy_data for blkcg dedicated
data structures.
. It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a
cpd_size field and a cpd_init field, to be initialized by the policy with,
respectively, the size of the blkcg dedicated data structures, and the
address of a constructor function for blkcg dedicated data structures.
. It moves the CFQ-specific fields embedded in the blkcg data structure (i.e.,
the fields related to the proportional-share policy), into a new blkcg
dedicated data structure called cfq_group_data.
Signed-off-by: NPaolo Valente <paolo.valente@unimore.it>
Signed-off-by: NArianna Avanzini <avanzini.arianna@gmail.com>
Acked-by: NTejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

e48453c3

06 6月, 2015 1 次提交

block: Make CFQ default to IOPS mode on SSDs · 41c0126b

由 Tahsin Erdogan 提交于 5月 19, 2015

CFQ idling causes reduced IOPS throughput on non-rotational disks.
Since disk head seeking is not applicable to SSDs, it doesn't really
help performance by anticipating future near-by IO requests.

By turning off idling (and switching to IOPS mode), we allow other
processes to dispatch IO requests down to the driver and so increase IO
throughput.

Following FIO benchmark results were taken on a cloud SSD offering with
idling on and off:

Idling     iops    avg-lat(ms)    stddev            bw
------------------------------------------------------
    On     7054    90.107         38.697     28217KB/s
   Off    29255    21.836         11.730    117022KB/s

fio --name=temp --size=100G --time_based --ioengine=libaio \
    --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
    --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \
    --filename=/dev/sdb --runtime=10 --iodepth=64 --numjobs=10

And the following is from a local SSD run:

Idling     iops    avg-lat(ms)    stddev            bw
------------------------------------------------------
    On    19320    33.043         14.068     77281KB/s
   Off    21626    29.465         12.662     86507KB/s

fio --name=temp --size=5G --time_based --ioengine=libaio \
    --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
    --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \
    --filename=/fio_data --runtime=10 --iodepth=64 --numjobs=10
Reviewed-by: NNauman Rafique <nauman@google.com>
Signed-off-by: NTahsin Erdogan <tahsin@google.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

41c0126b

02 6月, 2015 13 次提交

writeback, blkcg: propagate non-root blkcg congestion state · 482cf79c

由 Tejun Heo 提交于 5月 22, 2015

Now that bdi layer can handle per-blkcg bdi_writeback_congested state,
blk_{set|clear}_congested() can propagate non-root blkcg congestion
state to them.

This can be easily achieved by disabling the root_rl tests in
blk_{set|clear}_congested().  Note that we still need those tests when
!CONFIG_CGROUP_WRITEBACK as otherwise we'll end up flipping root blkcg
wb's congestion state for events happening on other blkcgs.

v2: Updated for bdi_writeback_congested.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

482cf79c

writeback, blkcg: restructure blk_{set|clear}_queue_congested() · d40f75a0

由 Tejun Heo 提交于 5月 22, 2015

blk_{set|clear}_queue_congested() take @q and set or clear,
respectively, the congestion state of its bdi's root wb.  Because bdi
used to be able to handle congestion state only on the root wb, the
callers of those functions tested whether the congestion is on the
root blkcg and skipped if not.

This is cumbersome and makes implementation of per cgroup
bdi_writeback congestion state propagation difficult.  This patch
renames blk_{set|clear}_queue_congested() to
blk_{set|clear}_congested(), and makes them take request_list instead
of request_queue and test whether the specified request_list is the
root one before updating bdi_writeback congestion state.  This makes
the tests in the callers unnecessary and simplifies them.

As there are no external users of these functions, the definitions are
moved from include/linux/blkdev.h to block/blk-core.c.

This patch doesn't introduce any noticeable behavior difference.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

d40f75a0

writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback_congested · ce7acfea

由 Tejun Heo 提交于 5月 22, 2015

A blkg (blkcg_gq) can be congested and decongested independently from
other blkgs on the same request_queue.  Accordingly, for cgroup
writeback support, the congestion status at bdi (backing_dev_info)
should be split and updated separately from matching blkg's.

This patch prepares by adding blkg->wb_congested and associating a
blkg with its matching per-blkcg bdi_writeback_congested on creation.

v2: Updated to associate bdi_writeback_congested instead of
    bdi_writeback.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

ce7acfea

writeback: make backing_dev_info host cgroup-specific bdi_writebacks · 52ebea74

由 Tejun Heo 提交于 5月 22, 2015

For the planned cgroup writeback support, on each bdi
(backing_dev_info), each memcg will be served by a separate wb
(bdi_writeback).  This patch updates bdi so that a bdi can host
multiple wbs (bdi_writebacks).

On the default hierarchy, blkcg implicitly enables memcg.  This allows
using memcg's page ownership for attributing writeback IOs, and every
memcg - blkcg combination can be served by its own wb by assigning a
dedicated wb to each memcg.  This means that there may be multiple
wb's of a bdi mapped to the same blkcg.  As congested state is per
blkcg - bdi combination, those wb's should share the same congested
state.  This is achieved by tracking congested state via
bdi_writeback_congested structs which are keyed by blkcg.

bdi->wb remains unchanged and will keep serving the root cgroup.
cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
looked up while dirtying an inode according to the memcg of the page
being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
by its memcg id.  Once an inode is associated with its wb, it can be
retrieved using inode_to_wb().

Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
pages will keep being associated with bdi->wb.

v3: inode_attach_wb() in account_page_dirtied() moved inside
    mapping_cap_account_dirty() block where it's known to be !NULL.
    Also, an unnecessary NULL check before kfree() removed.  Both
    detected by the kbuild bot.

v2: Updated so that wb association is per inode and wb is per memcg
    rather than blkcg.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: kbuild test robot <fengguang.wu@intel.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

52ebea74

writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK · 89e9b9e0

由 Tejun Heo 提交于 5月 22, 2015

cgroup writeback requires support from both bdi and filesystem sides.
Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
default.  Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
both MEMCG and BLK_CGROUP are enabled.

inode_cgwb_enabled() which determines whether a given inode's both bdi
and fs support cgroup writeback is added.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

89e9b9e0

writeback: separate out include/linux/backing-dev-defs.h · 66114cad

由 Tejun Heo 提交于 5月 22, 2015

With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it.  c files
which need access to more backing-dev details now include
backing-dev.h directly.  This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NJens Axboe <axboe@fb.com>

66114cad

writeback: move backing_dev_info->state into bdi_writeback · 4452226e

由 Tejun Heo 提交于 5月 22, 2015

Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear.  For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi.  To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->state into wb.

* enum bdi_state is renamed to wb_state and the prefix of all enums is
  changed from BDI_ to WB_.

* Explicit zeroing of bdi->state is removed without adding zeoring of
  wb->state as the whole data structure is zeroed on init anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
  uses of bdi->state are mechanically replaced with bdi->wb.state
  introducing no behavior changes.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: drbd-dev@lists.linbit.com
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

4452226e

blkcg: implement bio_associate_blkcg() · 1d933cf0

由 Tejun Heo 提交于 5月 22, 2015

Currently, a bio can only be associated with the io_context and blkcg
of %current using bio_associate_current().  This is too restrictive
for cgroup writeback support.  Implement bio_associate_blkcg() which
associates a bio with the specified blkcg.

bio_associate_blkcg() leaves the io_context unassociated.
bio_associate_current() is updated so that it considers a bio as
already associated if it has a blkcg_css, instead of an io_context,
associated with it.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

1d933cf0

cgroup, block: implement task_get_css() and use it in bio_associate_current() · ec438699

由 Tejun Heo 提交于 5月 22, 2015

bio_associate_current() currently open codes task_css() and
css_tryget_online() to find and pin $current's blkcg css.  Abstract it
into task_get_css() which is implemented from cgroup side.  As a task
is always associated with an online css for every subsystem except
while the css_set update is propagating, task_get_css() retries till
css_tryget_online() succeeds.

This is a cleanup and shouldn't lead to noticeable behavior changes.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

ec438699

blkcg: add blkcg_root_css · 496d5e75

由 Tejun Heo 提交于 5月 22, 2015

Add global constant blkcg_root_css which points to &blkcg_root.css.
This will be used by cgroup writeback support.  If blkcg is disabled,
it's defined as ERR_PTR(-EINVAL).

v2: The declarations moved to include/linux/blk-cgroup.h as suggested
    by Vivek.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NJens Axboe <axboe@fb.com>

496d5e75

blkcg: always create the blkcg_gq for the root blkcg · ec13b1d6

由 Tejun Heo 提交于 5月 22, 2015

Currently, blkcg does a minor optimization where the root blkcg is
created when the first blkcg policy is activated on a queue and
destroyed on the deactivation of the last. On systems where blkcg is
configured but not used, this saves one blkcg_gq struct per queue. On
systems where blkcg is actually used, there's no difference. The only
case where this can lead to any meaninful, albeit still minute, save
in memory consumption is when all blkcg policies are deactivated after
being widely used in the system, which is a hihgly unlikely scenario.

The conditional existence of root blkcg_gq has already created several
bugs in blkcg and became an issue once again for the new per-cgroup
wb_congested mechanism for cgroup writeback support leading to a NULL
dereference when no blkcg policy is active. This is really not worth
bothering with. This patch makes blkcg always allocate and link the
root blkcg_gq and release it only on queue destruction.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NFengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

ec13b1d6

blkcg: move block/blk-cgroup.h to include/linux/blk-cgroup.h · eea8f41c

由 Tejun Heo 提交于 5月 22, 2015

cgroup aware writeback support will require exposing some of blkcg
details.  In preprataion, move block/blk-cgroup.h to
include/linux/blk-cgroup.h.  This patch is pure file move.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

eea8f41c

blk-mq: Shared tag enhancements · f26cdc85

由 Keith Busch 提交于 6月 01, 2015

Storage controllers may expose multiple block devices that share hardware
resources managed by blk-mq. This patch enhances the shared tags so a
low-level driver can access the shared resources not tied to the unshared
h/w contexts. This way the LLD can dynamically add and delete disks and
request queues without having to track all the request_queue hctx's to
iterate outstanding tags.
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

f26cdc85

30 5月, 2015 1 次提交

block: only honor SG gap prevention for merges that contain data · beefa6ba

由 Jens Axboe 提交于 5月 29, 2015

We can safely merge anything that wont generate an SG list entry,
so if the bio is data-less (discard), don't look at potential
SG gaps.
Signed-off-by: NJens Axboe <axboe@fb.com>

beefa6ba

29 5月, 2015 1 次提交

block: discard bdi_unregister() in favour of bdi_destroy() · aad653a0

由 NeilBrown 提交于 5月 19, 2015

bdi_unregister() now contains very little functionality.

It contains a "WARN_ON" if bdi->dev is NULL.  This warning is of no
real consequence as bdi->dev isn't needed by anything else in the function,
and it triggers if
   blk_cleanup_queue() -> bdi_destroy()
is called before bdi_unregister, which happens since
  Commit: 6cd18e71 ("block: destroy bdi before blockdev is unregistered.")

So this isn't wanted.

It also calls bdi_set_min_ratio().  This needs to be called after
writes through the bdi have all been flushed, and before the bdi is destroyed.
Calling it early is better than calling it late as it frees up a global
resource.

Calling it immediately after bdi_wb_shutdown() in bdi_destroy()
perfectly fits these requirements.

So bdi_unregister() can be discarded with the important content moved to
bdi_destroy(), as can the
  writeback_bdi_unregister
event which is already not used.
Reported-by: NMike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org (v4.0)
Fixes: c4db59d3 ("fs: don't reassign dirty inodes to default_backing_dev_info")
Fixes: 6cd18e71 ("block: destroy bdi before blockdev is unregistered.")
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NDan Williams <dan.j.williams@intel.com>
Tested-by: NNicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: NNeilBrown <neilb@suse.de>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

aad653a0

27 5月, 2015 1 次提交

sched/topology: Rename topology_thread_cpumask() to topology_sibling_cpumask() · 06931e62

由 Bartosz Golaszewski 提交于 5月 26, 2015

Rename topology_thread_cpumask() to topology_sibling_cpumask()
for more consistency with scheduler code.
Signed-off-by: NBartosz Golaszewski <bgolaszewski@baylibre.com>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NRussell King <rmk+kernel@arm.linux.org.uk>
Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
Cc: Benoit Cousson <bcousson@baylibre.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/1432645896-12588-2-git-send-email-bgolaszewski@baylibre.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

06931e62

22 5月, 2015 2 次提交

block, dm: don't copy bios for request clones · 5f1b670d

由 Christoph Hellwig 提交于 5月 22, 2015

Currently dm-multipath has to clone the bios for every request sent
to the lower devices, which wastes cpu cycles and ties down memory.

This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
to not complete bios attached to a request, which we set on clone
requests similar to bios in a flush sequence.  With this change I/O
errors on a path failure only get propagated to dm-multipath, which
can then either resubmit the I/O or complete the bios on the original
request.

I've done some basic testing of this on a Linux target with ALUA support,
and it survives path failures during I/O nicely.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

5f1b670d

block: remove management of bi_remaining when restoring original bi_end_io · 326e1dbb

由 Mike Snitzer 提交于 5月 22, 2015

Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
 1) saving a bio's original bi_end_io
 2) wiring up an intermediate bi_end_io
 3) restoring the original bi_end_io from intermediate bi_end_io
 4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called.  For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0.  When bio_inc_remaining() occurred during step
3 it brought it to a value of 2.  When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining.  For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

Also, the bio_inc_remaining() interface has been moved local to bio.c.

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

326e1dbb

20 5月, 2015 2 次提交

block: replace trylock with mutex_lock in blkdev_reread_part() · b04a5636

由 Ming Lei 提交于 5月 06, 2015

The only possible problem of using mutex_lock() instead of trylock
is about deadlock.

If there aren't any locks held before calling blkdev_reread_part(),
deadlock can't be caused by this conversion.

If there are locks held before calling blkdev_reread_part(),
and if these locks arn't required in open, close handler and I/O
path, deadlock shouldn't be caused too.

Both user space's ioctl(BLKRRPART) and md_setup_drive() from
init/do_mounts_md.c belongs to the 1st case, so the conversion is safe
for the two cases.

For loop, the previous patches in this pathset has fixed the ABBA lock
dependency, so the conversion is OK.

For nbd, tx_lock is held when calling the function:

	- both open and release won't hold the lock
	- when blkdev_reread_part() is run, I/O thread has been stopped
	already, so tx_lock won't be acquired in I/O path at that time.
	- so the conversion won't cause deadlock for nbd

For dasd, both dasd_open(), dasd_release() and request function don't
acquire any mutex/semphone, so the conversion should be safe.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NJarod Wilson <jarod@redhat.com>
Acked-by: NJarod Wilson <jarod@redhat.com>
Signed-off-by: NMing Lei <ming.lei@canonical.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

b04a5636

block: export blkdev_reread_part() and __blkdev_reread_part() · be324177

由 Jarod Wilson 提交于 5月 06, 2015

This patch exports blkdev_reread_part() for block drivers, also
introduce __blkdev_reread_part().

For some drivers, such as loop, reread of partitions can be run
from the release path, and bd_mutex may already be held prior to
calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce
__blkdev_reread_part for use in such cases.

CC: Christoph Hellwig <hch@lst.de>
CC: Jens Axboe <axboe@kernel.dk>
CC: Tejun Heo <tj@kernel.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Markus Pargmann <mpa@pengutronix.de>
CC: Stefan Weinhuber <wein@de.ibm.com>
CC: Stefan Haberland <stefan.haberland@de.ibm.com>
CC: Sebastian Ott <sebott@linux.vnet.ibm.com>
CC: Fabian Frederick <fabf@skynet.be>
CC: Ming Lei <ming.lei@canonical.com>
CC: David Herrmann <dh.herrmann@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: nbd-general@lists.sourceforge.net
CC: linux-s390@vger.kernel.org
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJarod Wilson <jarod@redhat.com>
Signed-off-by: NMing Lei <ming.lei@canonical.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

be324177

19 5月, 2015 1 次提交

block: remove unused BIO_RW_BLOCK and BIO_EOF flags · 97ca223c

由 Christoph Hellwig 提交于 4月 24, 2015

Signed-off-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@fb.com>

97ca223c

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功