1. 13 Apr 2016 (4 commits)
  2. 05 Apr 2016 (2 commits)
    • mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage · ea1754a0
      Kirill A. Shutemov authored
      Mostly direct substitution, with the occasional adjustment or removal of
      outdated comments.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
      time ago with the promise that one day it would be possible to implement
      the page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE, and it is a constant source of confusion whether PAGE_CACHE_*
      or PAGE_* constants should be used in a particular case, especially on
      the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straightforward (a hedged before/after sketch
      follows the list below):
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
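      As a minimal hedged illustration of these substitutions (the helper
      below is hypothetical and not taken from the patch; only the macro and
      function renames come from the list above):

          #include <linux/mm.h>
          #include <linux/pagemap.h>

          /* Hypothetical helper showing the converted idioms. */
          static void example_after_conversion(struct page *page, loff_t pos)
          {
                  pgoff_t index = pos >> PAGE_SHIFT;   /* was: pos >> PAGE_CACHE_SHIFT */
                  size_t offset = pos & ~PAGE_MASK;    /* was: pos & ~PAGE_CACHE_MASK  */

                  (void)index;                         /* silence unused warnings in this sketch */
                  (void)offset;
                  put_page(page);                      /* was: page_cache_release(page) */
          }
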
      This patch contains automated changes generated with Coccinelle using
      the script below.  For some reason Coccinelle doesn't patch header
      files, so I've run spatch on them manually.
      
      The only adjustment after Coccinelle is reverting the change to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that Coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 20 Mar 2016 (1 commit)
  4. 16 Mar 2016 (2 commits)
  5. 14 Mar 2016 (4 commits)
  6. 04 Mar 2016 (3 commits)
  7. 28 Feb 2016 (1 commit)
  8. 23 Feb 2016 (1 commit)
    • dm: fix excessive dm-mq context switching · 6acfe68b
      Mike Snitzer authored
      Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
      than if an underlying null_blk device were used directly.  One of the
      reasons for this drop in performance is that blk_insert_clone_request()
      was calling blk_mq_insert_request() with @async=true.  This forced the
      use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
      which ushered in ping-ponging between process context (fio in this case)
      and kblockd's kworker to submit the cloned request.  The ftrace
      function_graph tracer showed:
      
        kworker-2013  =>   fio-12190
        fio-12190    =>  kworker-2013
        ...
        kworker-2013  =>   fio-12190
        fio-12190    =>  kworker-2013
        ...
      
      Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to
      _not_ use kblockd to submit the cloned requests isn't enough to
      eliminate the observed context switches.
      
      In addition to this dm-mq specific blk-core fix, there are 2 DM core
      fixes to dm-mq that (when paired with the blk-core fix) completely
      eliminate the observed context switching:
      
      1)  don't blk_mq_run_hw_queues in blk-mq request completion
      
          Motivated by the desire to reduce the overhead of dm-mq; punting to
          kblockd just increases context switches.
      
          In my testing against a really fast null_blk device there was no benefit
          to running blk_mq_run_hw_queues() on completion (and no other blk-mq
          driver does this).  So hopefully this change doesn't induce the need for
          yet another revert like commit 621739b0 !
      
      2)  use blk_mq_complete_request() in dm_complete_request()
      
          blk_complete_request() doesn't offer the traditional q->mq_ops vs
          .request_fn branching pattern that other historic block interfaces
          do (e.g. blk_get_request).  Using blk_mq_complete_request() for
          blk-mq requests is important for performance.  It should be noted
          that, like blk_complete_request(), blk_mq_complete_request() doesn't
          natively handle partial completions -- but the request-based
          DM-multipath target does provide the required partial completion
          support: dm.c:end_clone_bio() triggers requeueing of the request
          via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.
          (A hedged sketch of this q->mq_ops branching pattern follows the
          list.)
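
      As a hedged sketch of that branching pattern (assuming the 4.5-era
      signatures; this is not the actual dm.c hunk, and dm's surrounding
      bookkeeping is omitted):

          #include <linux/blkdev.h>
          #include <linux/blk-mq.h>

          /* Sketch only: complete blk-mq clones via blk_mq_complete_request()
           * and legacy requests via blk_complete_request(). */
          static void dm_complete_request_sketch(struct request *rq, int error)
          {
                  if (rq->q->mq_ops)
                          blk_mq_complete_request(rq, error);  /* blk-mq path */
                  else
                          blk_complete_request(rq);            /* .request_fn path */
          }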
      
      dm-mq fix #2 is _much_ more important than #1 for eliminating the
      context switches.
      Before: cpu          : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
      After:  cpu          : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472
      
      With these changes multithreaded async read IOPs improved from ~950K
      to ~1350K for this dm-mq stacked on null_blk test-case.  The raw read
      IOPs of the underlying null_blk device for the same workload is ~1950K.
      
      Fixes: 7fb4898e ("block: add blk-mq support to blk_insert_cloned_request()")
      Fixes: bfebd1cd ("dm: add full blk-mq support to request-based DM")
      Cc: stable@vger.kernel.org # 4.1+
      Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Jens Axboe <axboe@kernel.dk>
  9. 19 Feb 2016 (1 commit)
    • block: Add blk_set_runtime_active() · d07ab6d1
      Mika Westerberg authored
      If a block device is left runtime suspended during system suspend, the
      driver's resume hook typically corrects the device's runtime PM status
      back to "active" after it is resumed. However, this is not enough, as
      the queue's runtime PM status is still "suspended". As long as it is in
      this state, blk_pm_peek_request() returns NULL and thus prevents new
      requests from being processed.
      
      Add a new function, blk_set_runtime_active(), that can be used to force
      the queue status back to "active" as needed.
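
      As a hedged sketch of the intended usage (only blk_set_runtime_active()
      comes from this commit; the driver name and how the queue is looked up
      are assumptions):

          #include <linux/blkdev.h>
          #include <linux/pm_runtime.h>

          static int mydrv_resume(struct device *dev)
          {
                  struct request_queue *q = dev_get_drvdata(dev);  /* however the driver finds its queue */

                  /* Correct the device-side runtime PM status. */
                  pm_runtime_disable(dev);
                  pm_runtime_set_active(dev);
                  pm_runtime_enable(dev);

                  /* Correct the queue-side status, so blk_pm_peek_request()
                   * hands out requests again. */
                  blk_set_runtime_active(q);
                  return 0;
          }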
      Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  10. 18 Feb 2016 (1 commit)
  11. 15 Feb 2016 (1 commit)
    • blk-mq: mark request queue as mq asap · 66841672
      Ming Lei authored
      Currently q->mq_ops is widely used to decide whether the queue is mq or
      not, so we should set this 'flag' as early as possible so that both the
      block core and drivers get the correct mq info.
      
      For example, commit 868f2f0b ("blk-mq: dynamic h/w context count") moves
      the hctx initialization before setting q->mq_ops in
      blk_mq_init_allocated_queue(), which causes blk_alloc_flush_queue() to
      think the queue is non-mq and not allocate the command size for the
      per-hctx flush rq.
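
      A hedged sketch of the ordering this implies (not the actual
      blk_mq_init_allocated_queue() hunk):

          #include <linux/blk-mq.h>

          static void mark_queue_mq_first_sketch(struct request_queue *q,
                                                 struct blk_mq_tag_set *set)
          {
                  /* Publish "this is an mq queue" before any hctx setup ... */
                  q->mq_ops = set->ops;

                  /*
                   * ... so that code reached from the hctx initialization,
                   * e.g. blk_alloc_flush_queue(), already sees q->mq_ops set
                   * and sizes the per-hctx flush rq correctly.
                   */
          }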
      
      This patch should fix the problem reported by Sasha.
      
      Cc: Keith Busch <keith.busch@intel.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Ming Lei <tom.leiming@gmail.com>
      Fixes: 868f2f0b ("blk-mq: dynamic h/w context count")
      Signed-off-by: Jens Axboe <axboe@fb.com>
  12. 12 Feb 2016 (3 commits)
  13. 11 Feb 2016 (1 commit)
  14. 10 Feb 2016 (3 commits)
    • blk-mq: dynamic h/w context count · 868f2f0b
      Keith Busch authored
      The hardware's provided queue count may change at runtime with resource
      provisioning. This patch allows a block driver to alter the number of
      h/w queues available when its resource count changes.
      
      The main part is a new blk-mq API to request a new number of h/w queues
      for a given live tag set. The new API freezes all queues using that set,
      then adjusts the allocated count prior to remapping these to CPUs.
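
      A hedged sketch of how a driver might consume such an API (the commit
      text above does not name the function; blk_mq_update_nr_hw_queues() and
      the mydrv_* names are assumptions for illustration):

          #include <linux/blk-mq.h>

          struct mydrv {
                  struct blk_mq_tag_set tag_set;  /* live tag set shared by the driver's queues */
          };

          static void mydrv_resources_changed(struct mydrv *drv, unsigned int new_count)
          {
                  /* Freezes all queues on the set, adjusts the allocated
                   * count, then remaps hardware contexts to CPUs. */
                  blk_mq_update_nr_hw_queues(&drv->tag_set, new_count);
          }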
      
      The bulk of the rest just shifts where h/w contexts and all their
      artifacts are allocated and freed.
      
      The maximum number of h/w contexts is capped at the number of possible
      CPUs, since there is no use for more than that. As such, all
      pre-allocated memory for pointers needs to account for the maximum
      possible rather than the initial number of queues.
      
      A side effect of this is that blk-mq will proceed successfully as long
      as it can allocate at least one h/w context. Previously it would fail
      request queue initialization if fewer than the requested number were
      allocated.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Jon Derrick <jonathan.derrick@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: fix module reference leak on put_disk() call for cgroups throttle · 39a169b6
      Roman Pen authored
      get_disk() and get_gendisk() calls have a non-explicit side effect: they
      increase the reference count on the disk owner module.
      
      The following is the correct sequence for getting a disk reference and
      putting it:
      
          disk = get_gendisk(...);
      
          /* use disk */
      
          owner = disk->fops->owner;
          put_disk(disk);
          module_put(owner);
      
      fs/block_dev.c is aware of this required module_put() call, but e.g.
      blkg_conf_finish(), which is located in block/blk-cgroup.c, does not put
      the module reference.  To see the leak in action, the cgroups throttle
      config can be used.  In the following script I'm removing the throttle
      for /dev/ram0 (actually this is a NOP, because a throttle was never set
      for this device):
      
          # lsmod | grep brd
          brd                     5175  0
          # i=100; while [ $i -gt 0 ]; do echo "1:0 0" > \
              /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device; i=$(($i - 1)); \
          done
          # lsmod | grep brd
          brd                     5175  100
      
      Now the brd module has 100 references.
      
      The issue is fixed by calling module_put() right after put_disk().
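
      A small hedged helper capturing that pattern (not part of the patch;
      just the sequence quoted above wrapped in a hypothetical function):

          #include <linux/blkdev.h>
          #include <linux/module.h>

          static void put_disk_and_owner_sketch(struct gendisk *disk)
          {
                  /* Grab the owner before put_disk() can free the disk. */
                  struct module *owner = disk->fops->owner;

                  put_disk(disk);
                  module_put(owner);
          }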
      Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
      Cc: Gi-Oh Kim <gi-oh.kim@profitbricks.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • kernel/fs: fix I/O wait not accounted for RW O_DSYNC · d57d6115
      Stephane Gasparini authored
      When a process is doing random writes with the O_DSYNC flag, the I/O
      wait is not accounted in the kernel (get_cpu_iowait_time_us).  This
      prevents the governor or the cpufreq driver from accounting for I/O
      wait and thus from selecting the right P-state.
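
      For illustration only (userspace code, not part of the patch): the kind
      of write described above, issued with O_DSYNC, blocks until the data is
      durable and should therefore be charged as iowait:

          #include <fcntl.h>
          #include <sys/types.h>
          #include <unistd.h>

          static int dsync_write(const char *path, const void *buf, size_t len, off_t off)
          {
                  int fd = open(path, O_WRONLY | O_DSYNC);
                  ssize_t ret;

                  if (fd < 0)
                          return -1;
                  ret = pwrite(fd, buf, len, off);  /* returns only after the data reaches stable storage */
                  close(fd);
                  return ret < 0 ? -1 : 0;
          }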
      Signed-off-by: Stephane Gasparini <stephane.gasparini@linux.intel.com>
      Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  15. 05 Feb 2016 (5 commits)
    • block/sd: Return -EREMOTEIO when WRITE SAME and DISCARD are disabled · 0fb5b1fb
      Martin K. Petersen authored
      When a storage device rejects a WRITE SAME command we will disable write
      same functionality for the device and return -EREMOTEIO to the block
      layer. -EREMOTEIO will in turn prevent DM from retrying the I/O and/or
      failing the path.
      
      Yiwen Jiang discovered a small race where WRITE SAME requests issued
      simultaneously would cause -EIO to be returned. This happened because
      any requests being prepared after WRITE SAME had been disabled for the
      device caused us to return BLKPREP_KILL. The latter caused the block
      layer to return -EIO upon completion.
      
      To overcome this we introduce BLKPREP_INVALID which indicates that this
      is an invalid request for the device. blk_peek_request() is modified to
      return -EREMOTEIO in that case.
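
      A hedged sketch of the prep-time check this enables (not the actual
      sd.c hunk; the real check inspects SCSI-device state, queue limits are
      used here only to keep the example self-contained):

          #include <linux/blkdev.h>

          static int prep_write_same_sketch(struct request_queue *q)
          {
                  if (!q->limits.max_write_same_sectors)
                          return BLKPREP_INVALID;  /* completed as -EREMOTEIO, not -EIO */

                  return BLKPREP_OK;
          }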
      Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Suggested-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Hannes Reinicke <hare@suse.de>
      Reviewed-by: Ewan Milne <emilne@redhat.com>
      Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    • cfq-iosched: Allow parent cgroup to preempt its child · 3984aa55
      Jan Kara authored
      Currently we don't allow the sync workload of one cgroup to preempt the
      sync workload of any other cgroup. This is because we want to achieve
      service separation between cgroups. However, in cases where the
      preempting cgroup is an ancestor of the current cgroup, there is no need
      for separation and idling introduces unnecessary overhead. This hurts,
      for example, the case where the workload is isolated within a cgroup but
      journalling threads are in the root cgroup. A simple way to demonstrate
      the issue is using:
      
      dbench4 -c /usr/share/dbench4/client.txt -t 10 -D /mnt 1
      
      on an ext4 filesystem on a plain SATA drive (mounted with barrier=0 to
      make the difference more visible). When all processes are in the root
      cgroup, the reported throughput is 153.132 MB/sec. When the dbench
      process gets its own blkio cgroup, the reported throughput drops to
      26.1006 MB/sec.
      
      Fix the problem by making the check in cfq_should_preempt() more lenient
      and allowing preemption by an ancestor cgroup. This improves the
      throughput reported by dbench4 to 48.9106 MB/sec.
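
      Illustrative only (not the literal cfq-iosched.c hunk; the ancestor
      helper is hypothetical), the relaxed cross-cgroup condition inside
      cfq_should_preempt() would look roughly like:

          if (new_cfqq->cfqg != cfqq->cfqg &&
              !cfqg_is_ancestor(new_cfqq->cfqg, cfqq->cfqg))  /* hypothetical helper */
                  return false;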
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • cfq-iosched: Allow sync noidle workloads to preempt each other · a257ae3e
      Jan Kara authored
      The original idea with preemption of sync noidle queues (introduced in
      commit 718eee05 "cfq-iosched: fairness for sync no-idle queues") was
      that we service all sync noidle queues together, we don't idle on any of
      the queues individually and we idle only if there is no sync noidle
      queue to be served. This intention also matches the original test:
      
      	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
      	   && new_cfqq->service_tree == cfqq->service_tree)
      		return true;
      
      However since at that time cfqq->service_tree was not set for idling
      queues, this test was unreliable and was replaced in commit e4a22919
      "cfq-iosched: fix no-idle preemption logic" by:
      
      	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
      	    cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
      	    new_cfqq->service_tree->count == 1)
      		return true;
      
      That was a reliable test, but it was actually doing something different:
      now we preempt a sync noidle queue only if the new queue is the only one
      busy in the service tree.
      
      These days a cfq queue is kept in the service tree even if it is idling,
      and thus the original check would be safe again. But since we already
      check that the cfq queues are in the same cgroup, of the same priority
      class and workload type (sync noidle), we know that new_cfqq is fine to
      preempt cfqq. So just remove the service tree check (the resulting test
      is sketched below).
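
      Assuming the field names quoted above, the resulting check would read
      roughly (a hedged sketch, not the literal hunk):

          if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
              cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD)
                  return true;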
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • cfq-iosched: Reorder checks in cfq_should_preempt() · 6c80731c
      Jan Kara authored
      Move the check for preemption by the RT class up. There is no functional
      change, but it makes reasoning about the conditions simpler since we can
      be sure both cfq queues are from the same ioprio class.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • cfq-iosched: Don't group_idle if cfqq has big thinktime · e795421e
      Jan Kara authored
      There is no point in idling on a cfq group if the only cfq queue in it
      has too big a think time.
      Signed-off-by: Jan Kara <jack@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  16. 02 Feb 2016 (1 commit)
  17. 31 Jan 2016 (2 commits)
  18. 23 Jan 2016 (2 commits)
  19. 13 Jan 2016 (1 commit)
    • block: split bios to max possible length · e36f6204
      Keith Busch authored
      This splits a bio in the middle of a vector to form the largest possible
      bio at the h/w's desired alignment, and guarantees that the bio being
      split will have some data.
      
      The criterion for splitting is changed from the max sectors to the h/w's
      optimal sector alignment, if it is provided. For h/w that advertises its
      block storage's underlying chunk size, it's a big performance win not to
      submit commands that cross chunk boundaries. If no sector alignment is
      provided, this patch uses the max sectors as before.
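
      A hedged sketch of the split limit described above (assuming the
      4.4-era block helpers; this is not the actual blk-merge.c code):

          #include <linux/bio.h>
          #include <linux/blkdev.h>

          static unsigned int split_limit_sketch(struct request_queue *q, struct bio *bio)
          {
                  /* blk_max_size_offset() honours the advertised chunk size
                   * (q->limits.chunk_sectors) when set, and falls back to the
                   * plain max_sectors limit otherwise. */
                  unsigned int sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
                  unsigned int lbs_mask = queue_logical_block_size(q) - 1;

                  /* Keep the split aligned to the logical block size. */
                  return sectors & ~(lbs_mask >> 9);
          }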
      
      This addresses the performance issue that commit d3805611 attempted to
      fix; that commit was reverted due to a splitting logic error.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: <stable@vger.kernel.org> # 4.4.x-
      Signed-off-by: Jens Axboe <axboe@fb.com>
  20. 10 Jan 2016 (1 commit)