1. 21 Dec 2017, 1 commit
    • block-throttle: avoid double charge · 111be883
      Authored by Shaohua Li
      If a bio is throttled and then split after throttling, the split bio
      can be resubmitted and enter throttling again. This causes part of the
      bio to be charged multiple times. If the cgroup has an IO limit, the
      double charge significantly harms performance. Bio splits have become
      quite common since the arbitrary bio size change.
      
      To fix this, we always set the BIO_THROTTLED flag when a bio is
      throttled. If the bio is cloned or split, we copy the flag to the new
      bio too, to avoid a double charge. However, a cloned bio can be
      redirected to a new disk, and keeping the flag set would then be a
      problem. The observation is that we always set a new disk for the bio
      in this case, so we can clear the flag in bio_set_dev().
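
      A minimal sketch of the idea, assuming the bi_disk/bi_partno bio
      fields and flag helpers of this era's API; treat the exact macro body
      as illustrative, not as the verbatim patch:

          /* blk-throttle marks every bio it has charged */
          bio_set_flag(bio, BIO_THROTTLED);

          /* clones/splits inherit the flag; a new disk resets it */
          #define bio_set_dev(bio, bdev)                                  \
          do {                                                            \
                  /* a different disk voids the old throttle charge */    \
                  if ((bio)->bi_disk != (bdev)->bd_disk)                  \
                          bio_clear_flag(bio, BIO_THROTTLED);             \
                  (bio)->bi_disk   = (bdev)->bd_disk;                     \
                  (bio)->bi_partno = (bdev)->bd_partno;                   \
          } while (0)

      With this, the throttling path can return immediately for any bio
      that still carries BIO_THROTTLED, so a resubmitted split is charged
      only once.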
      
      This issue has existed for a long time; the arbitrary bio size change
      just made it worse, so this should go into stable, at least back to
      v4.2.
      
      V1 -> V2: do not add an extra field to the bio, based on discussion
      with Tejun.
      
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: stable@vger.kernel.org
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 23 Nov 2017, 1 commit
  3. 17 Nov 2017, 1 commit
  4. 26 Oct 2017, 1 commit
    • block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion() · e319e1fb
      Authored by Byungchul Park
      Darrick posted the following warning and Dave Chinner analyzed it:
      
      > ======================================================
      > WARNING: possible circular locking dependency detected
      > 4.14.0-rc1-fixes #1 Tainted: G        W
      > ------------------------------------------------------
      > loop0/31693 is trying to acquire lock:
      >  (&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs]
      >
      > but now in release context of a crosslock acquired at the following:
      >  ((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0
      >
      > which lock already depends on the new lock.
      >
      > the existing dependency chain (in reverse order) is:
      >
      > -> #2 ((complete)&ret.event){+.+.}:
      >        lock_acquire+0xab/0x200
      >        wait_for_completion_io+0x4e/0x1a0
      >        submit_bio_wait+0x7f/0xb0
      >        blkdev_issue_zeroout+0x71/0xa0
      >        xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
      >        xfs_bmapi_write+0x374/0x11f0 [xfs]
      >        xfs_iomap_write_direct+0x2ac/0x430 [xfs]
      >        xfs_file_iomap_begin+0x20d/0xd50 [xfs]
      >        iomap_apply+0x43/0xe0
      >        dax_iomap_rw+0x89/0xf0
      >        xfs_file_dax_write+0xcc/0x220 [xfs]
      >        xfs_file_write_iter+0xf0/0x130 [xfs]
      >        __vfs_write+0xd9/0x150
      >        vfs_write+0xc8/0x1c0
      >        SyS_write+0x45/0xa0
      >        entry_SYSCALL_64_fastpath+0x1f/0xbe
      >
      > -> #1 (&xfs_nondir_ilock_class){++++}:
      >        lock_acquire+0xab/0x200
      >        down_write_nested+0x4a/0xb0
      >        xfs_ilock+0x263/0x330 [xfs]
      >        xfs_setattr_size+0x152/0x370 [xfs]
      >        xfs_vn_setattr+0x6b/0x90 [xfs]
      >        notify_change+0x27d/0x3f0
      >        do_truncate+0x5b/0x90
      >        path_openat+0x237/0xa90
      >        do_filp_open+0x8a/0xf0
      >        do_sys_open+0x11c/0x1f0
      >        entry_SYSCALL_64_fastpath+0x1f/0xbe
      >
      > -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
      >        up_write+0x1c/0x40
      >        xfs_iunlock+0x1d0/0x310 [xfs]
      >        xfs_file_fallocate+0x8a/0x310 [xfs]
      >        loop_queue_work+0xb7/0x8d0
      >        kthread_worker_fn+0xb9/0x1f0
      >
      > Chain exists of:
      >   &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
      >
      >  Possible unsafe locking scenario by crosslock:
      >
      >        CPU0                    CPU1
      >        ----                    ----
      >   lock(&xfs_nondir_ilock_class);
      >   lock((complete)&ret.event);
      >                                lock(&(&ip->i_mmaplock)->mr_lock);
      >                                unlock((complete)&ret.event);
      >
      >                *** DEADLOCK ***
      
      The warning is a false positive, caused by the fact that every
      wait_for_completion() in submit_bio_wait() waits with the same lock
      class.
      
      However, some bios have nothing to do with each other: in the case of
      loop devices, for example, there is no direct connection between the
      bios of an upper device and the bios of the lower (loop) device.
      
      The safest way to assign different lock classes to different devices
      is to do it per gendisk. In other words, this patch assigns a
      lockdep_map per gendisk and uses it when initializing the completion
      in submit_bio_wait().
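
      A rough sketch of the mechanism, assuming the lockdep helpers of this
      series (a lockdep_map field in struct gendisk and
      DECLARE_COMPLETION_ONSTACK_MAP); details are illustrative:

          /* one lock class per disk, created when the disk is allocated */
          #define alloc_disk_node(minors, node_id)                        \
          ({                                                              \
                  static struct lock_class_key __key;                     \
                  struct gendisk *__disk =                                \
                          __alloc_disk_node(minors, node_id);             \
                  if (__disk)                                             \
                          lockdep_init_map(&__disk->lockdep_map,          \
                                           "(gendisk_completion)",        \
                                           &__key, 0);                    \
                  __disk;                                                 \
          })

          int submit_bio_wait(struct bio *bio)
          {
                  /* the on-stack completion now carries this disk's lock
                   * class, so waits on unrelated disks no longer chain */
                  DECLARE_COMPLETION_ONSTACK_MAP(done,
                                  bio->bi_disk->lockdep_map);
                  ...
          }

      Because the lock_class_key is static inside the macro, each
      alloc_disk_node() call site gets its own class, which is exactly the
      granularity needed to keep loop-stacked disks apart.
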
      Analyzed-by: Dave Chinner <david@fromorbit.com>
      Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: amir73il@gmail.com
      Cc: axboe@kernel.dk
      Cc: david@fromorbit.com
      Cc: hch@infradead.org
      Cc: idryomov@gmail.com
      Cc: johan@kernel.org
      Cc: johannes.berg@intel.com
      Cc: kernel-team@lge.com
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-xfs@vger.kernel.org
      Cc: oleg@redhat.com
      Cc: tj@kernel.org
      Link: http://lkml.kernel.org/r/1508921765-15396-10-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 25 Oct 2017, 1 commit
  6. 17 Oct 2017, 1 commit
  7. 12 Oct 2017, 11 commits
  8. 11 Oct 2017, 3 commits
  9. 07 Oct 2017, 1 commit
  10. 26 Sep 2017, 1 commit
  11. 26 Aug 2017, 1 commit
  12. 24 Aug 2017, 1 commit
    • block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Authored by Christoph Hellwig
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different lifetime rules from the gendisk and
      request_queue, and is usually only available when the block device
      node is open.  Other callers need to explicitly create one (e.g. the
      lightnvm passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
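
      The shape of the change, as a hedged sketch (the two fields follow
      the commit title; the surrounding code is illustrative):

          struct bio {
                  struct gendisk  *bi_disk;   /* replaces bi_bdev */
                  unsigned int    bi_partno;  /* partition, for remapping */
                  ...
          };

          /* callers that do hold a block_device set both in one step */
          #define bio_set_dev(bio, bdev)                          \
          do {                                                    \
                  (bio)->bi_disk   = (bdev)->bd_disk;             \
                  (bio)->bi_partno = (bdev)->bd_partno;           \
          } while (0)

      generic_make_request() then resolves bi_partno to the partition's
      start sector during remapping, so everything below that point only
      ever sees the gendisk.
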
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. 10 Aug 2017, 1 commit
  14. 02 Aug 2017, 1 commit
  15. 11 Jul 2017, 1 commit
    • block: call bio_uninit in bio_endio · b222dd2f
      Authored by Shaohua Li
      bio_free() isn't a good place to free cgroup info: in many cases the
      bio is allocated in a special way (for example, on the stack) and
      bio_put(), and hence bio_free(), is never called on it, so we leak
      memory. This patch moves the free to bio_endio(), which should be
      called anyway. The bio_uninit() call in bio_free() is kept, in case
      bio_endio() is never called on the bio.
      
      This assumes ->bi_end_io() doesn't access cgroup info, which seems true
      in my audit.
      
      This along with Christoph's integrity patch should fix the memory leak
      issue.
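
      A minimal sketch of the new call site (the surrounding bio_endio()
      body is elided and illustrative):

          void bio_endio(struct bio *bio)
          {
                  ...
                  /* release cgroup info here; bio_uninit() tolerates being
                   * called again later from bio_free() for bios that do go
                   * through bio_put() */
                  bio_uninit(bio);
                  if (bio->bi_end_io)
                          bio->bi_end_io(bio);
          }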
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  16. 04 Jul 2017, 3 commits
  17. 29 Jun 2017, 1 commit
    • block: provide bio_uninit() for freeing integrity/task associations · 9ae3b3f5
      Authored by Jens Axboe
      Wen reports significant memory leaks with DIF and O_DIRECT:
      
      "With nvme devive + T10 enabled, On a system it has 256GB and started
      logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
      it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
      leaking.
      
      /proc/meminfo | grep SUnreclaim...
      
      SUnreclaim:      6752128 kB
      SUnreclaim:      6874880 kB
      SUnreclaim:      7238080 kB
      ....
      SUnreclaim:     22307264 kB
      SUnreclaim:     22485888 kB
      SUnreclaim:     22720256 kB
      
      When testcases with T10 enabled call into __blkdev_direct_IO_simple,
      the code doesn't free memory allocated by bio_integrity_alloc. The
      patch fixes the issue. HTX has run for 60+ hours without failure."
      
      Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
      doesn't go through the regular bio free. This means that any ancillary
      data allocated with the bio through the stack is not freed. Hence, we
      can leak the integrity data associated with the bio, if the device is
      using DIF/DIX.
      
      Fix this by providing a bio_uninit() and exporting it, so that we can
      use it to free this data. Note that this is a minimal fix for this
      issue. Any current user of bios that are allocated outside of
      bio_alloc_bioset() suffers from this issue, most notably some
      drivers. We will fix those in a more comprehensive patch for 4.13.
      This also means that the commit marked as fixed by this one isn't the
      real culprit; it's just the most obvious one out there.
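
      The pattern this enables for on-stack bios looks roughly like the
      following (a sketch modelled on __blkdev_direct_IO_simple(); local
      names are illustrative):

          struct bio bio;
          struct bio_vec vecs[DIO_INLINE_BIO_VECS];

          bio_init(&bio, vecs, DIO_INLINE_BIO_VECS); /* stack, no bioset */
          /* ... add pages, submit_bio_wait(&bio), consume the result ... */
          bio_uninit(&bio);   /* frees integrity/task data that bio_free()
                                 would never see for a stack bio */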
      
      Fixes: 542ff7bf ("block: new direct I/O implementation")
      Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  18. 28 Jun 2017, 1 commit
  19. 19 Jun 2017, 3 commits
  20. 16 Jun 2017, 1 commit
  21. 09 Jun 2017, 1 commit
  22. 12 Apr 2017, 1 commit
  23. 07 Apr 2017, 1 commit
    • block: trace completion of all bios. · fbbaf700
      Authored by NeilBrown
      Currently only dm and md/raid5 bios trigger
      trace_block_bio_complete().  Now that we have bio_chain() and
      bio_inc_remaining(), it is not possible, in general, for a driver to
      know when the bio is really complete.  Only bio_endio() knows that.
      
      So move the trace_block_bio_complete() call to bio_endio().
      
      Now trace_block_bio_complete() pairs with trace_block_bio_queue().
      Any bio for which a 'queue' event is traced, will subsequently
      generate a 'complete' event.
      
      There are a few cases where completion tracing is not wanted.
      1/ If blk_update_request() has already generated a completion
         trace event at the 'request' level, there is no point generating
         one at the bio level too.  In this case the bi_sector and bi_size
         will have changed, so the bio-level event would be wrong.
      
      2/ If the bio hasn't actually been queued yet, but is being aborted
         early, then a trace event could be confusing.  Some filesystems
         call bio_endio() but do not want tracing.
      
      3/ The bio_integrity code interposes itself by replacing bi_end_io,
         then restoring it and calling bio_endio() again.  This would
         produce two identical trace events if left like that.
      
      To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
      produce the trace event when this is set.
      We address point 1 above by clearing the flag in blk_update_request().
      We address point 2 above by only setting the flag when
      generic_make_request() is called.
      We address point 3 above by clearing the flag after generating a
      completion event.
      
      When bio_split() is used on a bio, particularly in blk_queue_split(),
      there is an extra complication.  A new bio is split off the front,
      and may be handled directly without going through
      generic_make_request().  The old bio, which has been advanced, is
      passed to generic_make_request(), so it will trigger a trace event a
      second time.
      Probably the best result when a split happens is to see a single
      'queue' event for the whole bio, then multiple 'complete' events, one
      for each component.  To achieve this we can:
      - copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
      - avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
      This way, the split-off bio won't create a queue event, and neither
      will the original even if it is re-submitted to
      generic_make_request(), but both will produce completion events, each
      for its own range.
      
      So if generic_make_request() is called (which generates a QUEUED
      event), then bio_endio() will create a single COMPLETE event for each
      range that the bio is split into, unless the driver has explicitly
      requested it not to.
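
      A hedged sketch of the bio_endio() side, using the bio fields of this
      era (bi_bdev, bi_error); the surrounding body is illustrative:

          void bio_endio(struct bio *bio)
          {
                  ...
                  if (bio->bi_bdev &&
                      bio_flagged(bio, BIO_TRACE_COMPLETION)) {
                          trace_block_bio_complete(
                                  bdev_get_queue(bio->bi_bdev),
                                  bio, bio->bi_error);
                          /* clear the flag so the bio_integrity re-entry
                           * does not emit a second, identical event */
                          bio_clear_flag(bio, BIO_TRACE_COMPLETION);
                  }
                  ...
          }
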
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  24. 28 Mar 2017, 1 commit
    • blk-throttle: add a simple idle detection · 9e234eea
      Authored by Shaohua Li
      A cgroup may be assigned a low limit but never dispatch enough IO to
      cross that limit. In such a case, the queue state machine will remain
      in the LIMIT_LOW state and all other cgroups will be throttled
      according to the low limit. This is unfair to the other cgroups. We
      should treat such a cgroup as idle and upgrade the state machine out
      of the LIMIT_LOW state.
      
      We also have downgrade logic. If the state machine upgrades because a
      cgroup is idle (real idle), the state machine will downgrade soon
      afterwards, since the cgroup stays below its low limit. This isn't
      what we want. A more complicated case is a cgroup that isn't idle
      while the queue is in LIMIT_LOW, but once the queue gets upgraded,
      other cgroups dispatch more IO and this cgroup can't dispatch enough,
      so it falls below its low limit and looks idle (fake idle). In this
      case the queue should downgrade soon. The key to deciding whether to
      downgrade is detecting whether the cgroup is truly idle.
      
      Unfortunately it's very hard to determine if a cgroup is really idle.
      This patch uses the 'think time check' idea from CFQ for the purpose.
      Please note, the idea doesn't work for all workloads. For example, a
      workload with IO depth 8 may have 100% disk utilization, hence a
      think time of 0, i.e. not idle. But the same workload could run at
      higher bandwidth with IO depth 16; compared to IO depth 16, the IO
      depth 8 workload is idle. We use the idea to roughly determine if a
      cgroup is idle.
      
      We treat a cgroup as idle if its think time is above a threshold (by
      default 1ms for SSD and 100ms for HD). The idea is that think time
      above the threshold starts to harm performance. HD is much slower, so
      a longer think time is acceptable.
      
      This patch (and the later patches) use 'unsigned long' to track time.
      We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses some
      precision, which should not be a big deal.
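
      A sketch of the bookkeeping, assuming throtl_grp fields such as
      avg_idletime, last_finish_time and idletime_threshold (the names are
      illustrative, not verbatim):

          void blk_throtl_update_idletime(struct throtl_grp *tg)
          {
                  unsigned long now  = ktime_get_ns() >> 10; /* ns -> ~us */
                  unsigned long last = tg->last_finish_time;

                  if (now <= last || last == 0 ||
                      last == tg->checked_last_finish_time)
                          return;

                  /* exponentially weighted moving average of think time */
                  tg->avg_idletime =
                          (tg->avg_idletime * 7 + now - last) >> 3;
                  tg->checked_last_finish_time = last;
          }

          static bool throtl_tg_is_idle(struct throtl_grp *tg)
          {
                  /* idle if the average think time exceeds the threshold
                   * (default ~1ms for SSD, ~100ms for HD) */
                  return tg->avg_idletime > tg->idletime_threshold;
          }
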
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>