Commits · 0e5b935d43f385ab23d2e38e7134b1abb0e7907e · openeuler / raspberrypi-kernel

12 Oct, 2017 · 11 commits

bio_alloc_map_data(): do bmd->iter setup right there · 0e5b935d
Authored by Al Viro on Sep 24, 2017
```
just need to copy it iter instead of iter->nr_segs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```

bio_copy_user_iov(): saner bio size calculation · d16d44eb

Authored by Al Viro on Sep 24, 2017

it's a bounce buffer; we don't *care* how badly the real
source/destination is fragmented, all that matters is the total
size.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

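To illustrate the point about total size, here is a minimal userspace sketch of sizing a bounce buffer from the byte count alone. The name `toy_bounce_pages` and the 4096-byte page size are assumptions made for this example; this is not the kernel code from the commit.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Pages needed for a bounce buffer holding `len` bytes that starts
 * `offset` bytes into its first page. Only the total size matters,
 * not how many segments the user buffer is split into. */
static size_t toy_bounce_pages(size_t offset, size_t len)
{
    assert(offset < TOY_PAGE_SIZE);
    return (offset + len + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
}
```

A buffer of 4096 bytes needs one page at offset 0 and two pages at offset 100, however many iovec segments it was passed in.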

bio_map_user_iov(): get rid of copying iov_iter · 0a0f1513
Authored by Al Viro on Sep 24, 2017
```
we do want *iter advanced
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
bio_copy_from_iter(): get rid of copying iov_iter · 98a09d61
Authored by Al Viro on Sep 24, 2017
```
we want the one passed to it advanced, anyway
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
move more stuff down into bio_copy_user_iov() · 2884d0be
Authored by Al Viro on Sep 24, 2017
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
blk_rq_map_user_iov(): move iov_iter_advance() down · e81cef5d
Authored by Al Viro on Sep 24, 2017
```
... into bio_{map,copy}_user_iov()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
bio_map_user_iov(): get rid of the iov_for_each() · b282cc76
Authored by Al Viro on Sep 23, 2017
```
Use iov_iter_npages()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
bio_map_user_iov(): move alignment check into the main loop · 98f0bc99
Authored by Al Viro on Sep 23, 2017
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```

don't rely upon subsequent bio_add_pc_page() calls failing · e2e115d1

Authored by Al Viro on Sep 23, 2017

... they might actually succeed in some cases (when we are at the
queue-imposed segments limit, the next page is not mergeable with
the last one we'd got in, but the first page covered by the next
iovec *is* mergeable).  Make sure that once it's failed, we are
done with that bio.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

... and with iov_iter_get_pages_alloc() it becomes even simpler · 629e42bc
Authored by Al Viro on Sep 23, 2017
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
bio_map_user_iov(): switch to iov_iter_get_pages()/iov_iter_advance() · 076098e5
Authored by Al Viro on Sep 23, 2017
```
... and to hell with iov_for_each() nonsense
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```

11 Oct, 2017 · 3 commits

bio_copy_user_iov(): don't ignore ->iov_offset · 1cfd0ddd

Authored by Al Viro on Sep 24, 2017

Since "block: support large requests in blk_rq_map_user_iov" we
started to call it with partially drained iter; that works fine
on the write side, but reads create a copy of iter for completion
time.  And that needs to take the possibility of ->iov_offset != 0
into account...

Cc: stable@vger.kernel.org #v4.5+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

more bio_map_user_iov() leak fixes · 2b04e8f6

Authored by Al Viro on Sep 23, 2017

we need to take care of failure exit as well - pages already
in bio should be dropped by analogue of bio_unmap_pages(),
since their refcounts had been bumped only once per reference
in bio.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fix unbalanced page refcounting in bio_map_user_iov · 95d78c28

Authored by Vitaly Mayatskikh on Sep 22, 2017

bio_map_user_iov and bio_unmap_user do unbalanced page refcounting if
the IO vector has small consecutive buffers belonging to the same page.
bio_add_pc_page merges them into one, but the page reference is never
dropped.

Cc: stable@vger.kernel.org
Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

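The imbalance described above can be reproduced with a toy userspace model. Everything here (`toy_page`, `toy_bio`, `toy_add`, `toy_unmap`, `toy_leaked_refs`) is hypothetical and only mimics the counting: each buffer pins its page, buffers on the same page merge into one segment, and unmapping drops one reference per segment.

```c
#include <assert.h>
#include <stddef.h>

struct toy_page { int refcount; };

/* One entry per segment actually stored in the toy bio. */
struct toy_bio {
    struct toy_page *seg[8];
    size_t nr_segs;
};

/* Add a buffer living in `page`. Pinning bumps the refcount; if the
 * buffer merges into the previous segment (same page), no new segment
 * is created. With `fix` set, the duplicate pin is dropped on merge. */
static void toy_add(struct toy_bio *bio, struct toy_page *page, int fix)
{
    page->refcount++;                 /* pin the page for this buffer */
    if (bio->nr_segs && bio->seg[bio->nr_segs - 1] == page) {
        if (fix)
            page->refcount--;         /* merged: drop the extra pin */
        return;
    }
    assert(bio->nr_segs < 8);
    bio->seg[bio->nr_segs++] = page;
}

/* Unmap: drop exactly one reference per stored segment. */
static void toy_unmap(struct toy_bio *bio)
{
    for (size_t i = 0; i < bio->nr_segs; i++)
        bio->seg[i]->refcount--;
    bio->nr_segs = 0;
}

/* References left over after mapping `n` small buffers from one page
 * and then unmapping. */
static int toy_leaked_refs(int n, int fix)
{
    struct toy_page page = { 0 };
    struct toy_bio bio = { { 0 }, 0 };
    for (int i = 0; i < n; i++)
        toy_add(&bio, &page, fix);
    toy_unmap(&bio);
    return page.refcount;
}
```

With three buffers on one page, the unfixed path leaves two references behind, while dropping the pin at merge time leaves none.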

26 Aug, 2017 · 1 commit

md/raid0: attach correct cgroup info in bio · 8a8e6f84

Authored by Shaohua Li on Aug 18, 2017

The discard bio doesn't attach the original bio's cgroup info. A normal
bio is cloned, so it is fine.
Signed-off-by: Shaohua Li <shli@fb.com>

24 Aug, 2017 · 1 commit

block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992

Authored by Christoph Hellwig on Aug 23, 2017

This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

10 Aug, 2017 · 1 commit

block: pass in queue to inflight accounting · d62e26b3

Authored by Jens Axboe on Jun 30, 2017

No functional change in this patch, just in preparation for
basing the inflight mechanism on the queue in question.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

02 Aug, 2017 · 1 commit

block: Add comment to submit_bio_wait() · 3d289d68

Authored by Jan Kara on Aug 02, 2017

submit_bio_wait() does not consume the bio reference. Add a comment about
that.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

11 Jul, 2017 · 1 commit

block: call bio_uninit in bio_endio · b222dd2f

Authored by Shaohua Li on Jul 10, 2017

bio_free isn't a good place to free cgroup info. There are a
lot of cases where a bio is allocated in a special way (for example, on
the stack) and bio_put, hence bio_free, never gets called for it, so we
are leaking memory. This patch moves the free to bio_endio, which should
be called anyway. The bio_uninit call in bio_free is kept, in case
bio_endio never gets called for the bio.

This assumes ->bi_end_io() doesn't access cgroup info, which seems true
in my audit.

This along with Christoph's integrity patch should fix the memory leak
issue.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

04 Jul, 2017 · 3 commits

bio-integrity: stop abusing bi_end_io · 7c20f116

Authored by Christoph Hellwig on Jul 03, 2017

And instead call directly into the integrity code from bio_end_io.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bio-integrity: fix interface for bio_integrity_trim · fbd08e76

Authored by Dmitry Monakhov on Jun 29, 2017

bio_integrity_trim inherited its interface from bio_trim and accepts
offset and size, but this API is error prone because the data offset
must always be in sync with the bio's data offset. That is why we have
an integrity update hook in bio_advance().

So the only meaningful values are: offset == 0, sectors == bio_sectors(bio)
Let's just remove them completely.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bio-integrity: bio_trim should truncate integrity vector accordingly · 376a78ab

Authored by Dmitry Monakhov on Jun 29, 2017

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

29 Jun, 2017 · 1 commit

block: provide bio_uninit() for freeing integrity/task associations · 9ae3b3f5

Authored by Jens Axboe on Jun 28, 2017

Wen reports significant memory leaks with DIF and O_DIRECT:

"With nvme device + T10 enabled, On a system it has 256GB and started
logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
leaking.

/proc/meminfo | grep SUnreclaim...

SUnreclaim:      6752128 kB
SUnreclaim:      6874880 kB
SUnreclaim:      7238080 kB
....
SUnreclaim:     22307264 kB
SUnreclaim:     22485888 kB
SUnreclaim:     22720256 kB

When testcases with T10 enabled call into __blkdev_direct_IO_simple,
code doesn't free memory allocated by bio_integrity_alloc. The patch
fixes the issue. HTX has been run with +60 hours without failure."

Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
doesn't go through the regular bio free. This means that any ancillary
data allocated with the bio through the stack is not freed. Hence, we
can leak the integrity data associated with the bio, if the device is
using DIF/DIX.

Fix this by providing a bio_uninit() and export it, so that we can use
it to free this data. Note that this is a minimal fix for this issue.
Any current user of bio's that are allocated outside of
bio_alloc_bioset() suffers from this issue, most notably some drivers.
We will fix those in a more comprehensive patch for 4.13. This also
means that the commit marked as being fixed by this isn't the real
culprit, it's just the most obvious one out there.

Fixes: 542ff7bf ("block: new direct I/O implementation")
Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

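The leak pattern in this commit can be sketched in userspace. This is a toy model under stated assumptions: `toy_bio`, `toy_integrity_alloc`, `toy_bio_uninit` and `toy_stack_io` are hypothetical names, and `malloc` stands in for the integrity allocation. It only shows why an on-stack object with ancillary heap data needs an explicit uninit step.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_bio { void *integrity; };

static int toy_live_allocs;   /* outstanding ancillary allocations */

static void toy_bio_init(struct toy_bio *bio)
{
    bio->integrity = NULL;
}

static void toy_integrity_alloc(struct toy_bio *bio)
{
    bio->integrity = malloc(16);
    assert(bio->integrity);
    toy_live_allocs++;
}

/* Release ancillary data regardless of where the bio itself lives. */
static void toy_bio_uninit(struct toy_bio *bio)
{
    if (bio->integrity) {
        free(bio->integrity);
        bio->integrity = NULL;
        toy_live_allocs--;
    }
}

/* On-stack bio: it never reaches a regular free path, so the uninit
 * call has to be explicit. Returns the number of allocations that
 * would have leaked at the point the function is about to return. */
static int toy_stack_io(int call_uninit)
{
    struct toy_bio bio;
    int leaked;

    toy_live_allocs = 0;
    toy_bio_init(&bio);
    toy_integrity_alloc(&bio);
    if (call_uninit)
        toy_bio_uninit(&bio);
    leaked = toy_live_allocs;
    toy_bio_uninit(&bio);     /* keep the sketch itself leak-free */
    return leaked;
}
```

Without the explicit call, one allocation is still outstanding when the stack frame goes away; with it, none is.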

28 Jun, 2017 · 1 commit

block: add support for write hints in a bio · cb6934f8

Authored by Jens Axboe on Jun 27, 2017

No functional changes in this patch, we just use up some holes
in the bio and request structures to define a write hint that
we pass down the stack.

Ensure that we don't merge requests that have different life time
hints assigned to them, and that we inherit the write hint when
cloning a bio.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

19 Jun, 2017 · 3 commits

block: remove bio_clone() and all references. · 9b10f6a9

Authored by NeilBrown on Jun 18, 2017

bio_clone() is no longer used.
Only bio_clone_bioset() or bio_clone_fast().
This is for the best, as bio_clone() used fs_bio_set,
and filesystems are unlikely to want to use bio_clone().

So remove bio_clone() and all references.
This includes a fix to some incorrect documentation.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk: make the bioset rescue_workqueue optional. · 47e0fb46

Authored by NeilBrown on Jun 18, 2017

This patch converts bioset_create() to not create a workqueue by
default, so allocations will never trigger punt_bios_to_rescuer().  It
also introduces a new flag BIOSET_NEED_RESCUER which tells
bioset_create() to preserve the old behavior.

All callers of bioset_create() that are inside block device drivers,
are given the BIOSET_NEED_RESCUER flag.

biosets used by filesystems or other top-level users do not
need rescuing as the bio can never be queued behind other
bios.  This includes fs_bio_set, blkdev_dio_pool,
btrfs_bioset, xfs_ioend_bioset, and one allocated by
target_core_iblock.c.

biosets used by md/raid do not need rescuing as
their usage was recently audited and revised to never
risk deadlock.

It is hoped that most, if not all, of the remaining biosets
can end up being the non-rescued version.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Credit-to: Ming Lei <ming.lei@redhat.com> (minor fixes)
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

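As a rough illustration of an opt-in flag replacing a default behavior, here is a userspace sketch. The `TOY_` names and `struct toy_bioset` are hypothetical and are not the kernel's definitions; the sketch only shows that a caller passing no flags gets neither a rescuer nor a bvec pool.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; not the kernel's values. */
enum {
    TOY_BIOSET_NEED_BVECS   = 1 << 0,
    TOY_BIOSET_NEED_RESCUER = 1 << 1,
    TOY_BIOSET_ALL_FLAGS    = TOY_BIOSET_NEED_BVECS | TOY_BIOSET_NEED_RESCUER,
};

struct toy_bioset {
    bool has_bvec_pool;
    bool has_rescuer;
};

/* Features are created only when the caller asks for them. */
static struct toy_bioset toy_bioset_create(int flags)
{
    struct toy_bioset bs;

    assert((flags & ~TOY_BIOSET_ALL_FLAGS) == 0);  /* reject unknown bits */
    bs.has_bvec_pool = (flags & TOY_BIOSET_NEED_BVECS) != 0;
    bs.has_rescuer   = (flags & TOY_BIOSET_NEED_RESCUER) != 0;
    return bs;
}
```

In this model a block-driver caller would pass `TOY_BIOSET_NEED_RESCUER` to keep the old behavior, while a top-level user would pass 0.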

blk: replace bioset_create_nobvec() with a flags arg to bioset_create() · 011067b0

Authored by NeilBrown on Jun 18, 2017

"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().

To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().

Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

16 Jun, 2017 · 1 commit

block: Dedicated error code fixups · a462b950

Authored by Bart Van Assche on Jun 13, 2017

This patch fixes two sparse warnings introduced by the "dedicated
error codes for the block layer V3" patch series. These changes
have not been tested.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

09 Jun, 2017 · 1 commit

block: switch bios to blk_status_t · 4e4cbee9

Authored by Christoph Hellwig on Jun 03, 2017

Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep around at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

12 Apr, 2017 · 1 commit

Revert "block: introduce bio_copy_data_partial" · 50512625

Authored by NeilBrown on Apr 05, 2017

This reverts commit 6f880285.
bio_copy_data_partial() is no longer needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

07 Apr, 2017 · 1 commit

block: trace completion of all bios. · fbbaf700

Authored by NeilBrown on Apr 07, 2017

Currently only dm and md/raid5 bios trigger
trace_block_bio_complete().  Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete.  Only bio_endio() knows that.

So move the trace_block_bio_complete() call to bio_endio().

Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.

There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
   trace event at the 'request' level, there is no point generating
   one at the bio level too.  In this case the bi_sector and bi_size
   will have changed, so the bio level event would be wrong

2/ If the bio hasn't actually been queued yet, but is being aborted
   early, then a trace event could be confusing.  Some filesystems
   call bio_endio() but do not want tracing.

3/ The bio_integrity code interposes itself by replacing bi_end_io,
   then restoring it and calling bio_endio() again.  This would produce
   two identical trace events if left like that.

To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.

When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication.  A new bio is split off the front, and
may be handled directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component.  To achieve this we can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it is re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.

So if generic_make_request() is called (which generates a QUEUED
event), then bio_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

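The flag protocol described in this commit message can be modeled in a few lines of userspace C. This is a sketch, not the kernel code: `toy_bio`, `toy_make_request`, `toy_split` and `toy_endio` are hypothetical stand-ins, and the boolean field plays the role of BIO_TRACE_COMPLETION.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_bio {
    bool trace_completion;   /* stands in for BIO_TRACE_COMPLETION */
};

static int toy_queue_events;
static int toy_complete_events;

/* Submission: emit a 'queue' event only if the flag is not already
 * set, so a re-submitted remainder of a split stays quiet. */
static void toy_make_request(struct toy_bio *bio)
{
    if (!bio->trace_completion) {
        toy_queue_events++;
        bio->trace_completion = true;
    }
}

/* Split: the new front part inherits the flag from the original. */
static struct toy_bio toy_split(const struct toy_bio *bio)
{
    struct toy_bio front = { bio->trace_completion };
    return front;
}

/* Completion: emit 'complete' once, then clear the flag so that a
 * repeated call produces no second event. */
static void toy_endio(struct toy_bio *bio)
{
    if (bio->trace_completion) {
        toy_complete_events++;
        bio->trace_completion = false;
    }
}

/* One bio is queued, split in two, and both parts complete. */
static void toy_scenario(void)
{
    struct toy_bio whole = { false };
    struct toy_bio front;

    toy_queue_events = toy_complete_events = 0;
    toy_make_request(&whole);     /* one 'queue' event */
    front = toy_split(&whole);
    toy_make_request(&whole);     /* remainder re-submitted: no event */
    toy_endio(&front);
    toy_endio(&whole);
    toy_endio(&whole);            /* repeated call: no extra event */
    assert(toy_queue_events == 1);
}
```

Running the scenario yields one 'queue' event and two 'complete' events, one per range, which is the outcome the message argues for.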

28 Mar, 2017 · 1 commit

blk-throttle: add a simple idle detection · 9e234eea

Authored by Shaohua Li on Mar 27, 2017

A cgroup gets assigned a low limit, but the cgroup could never dispatch
enough IO to cross the low limit. In such case, the queue state machine
will remain in LIMIT_LOW state and all other cgroups will be throttled
according to low limit. This is unfair for other cgroups. We should
treat the cgroup idle and upgrade the state machine to lower state.

We also have a downgrade logic. If the state machine upgrades because of
cgroup idle (real idle), the state machine will downgrade soon as the
cgroup is below its low limit. This isn't what we want. A more
complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
when queue gets upgraded to lower state, other cgroups could dispatch
more IO and this cgroup can't dispatch enough IO, so the cgroup is below
its low limit and looks like idle (fake idle). In this case, the queue
should downgrade soon. The key to determining whether we should downgrade
is to detect whether the cgroup is truly idle.

Unfortunately it's very hard to determine if a cgroup is real idle. This
patch uses the 'think time check' idea from CFQ for the purpose. Please
note, the idea doesn't work for all workloads. For example, a workload
with io depth 8 has disk utilization 100%, hence think time is 0, eg,
not idle. But the workload can run higher bandwidth with io depth 16.
Compared to io depth 16, the io depth 8 workload is idle. We use the
idea to roughly determine if a cgroup is idle.

We treat a cgroup idle if its think time is above a threshold (by
default 1ms for SSD and 100ms for HD). The idea is think time above the
threshold will start to harm performance. HD is much slower so a longer
think time is ok.

The patch (and the later patches) uses 'unsigned long' to track time.
We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
precision, which should not be a big deal.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

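The shift-based conversion and the threshold check can be sketched as follows. The helper names are hypothetical; the thresholds in the demo (1000 us and 100000 us) correspond to the 1ms and 100ms defaults quoted in the message. Shifting by 10 divides by 1024 rather than 1000, so results come out about 2.3% low.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Approximate ns -> us conversion: a shift instead of a division. */
static uint64_t toy_ns_to_us(uint64_t ns)
{
    return ns >> 10;
}

/* A group counts as idle when its think time exceeds the threshold. */
static bool toy_is_idle(uint64_t think_time_ns, uint64_t threshold_us)
{
    return toy_ns_to_us(think_time_ns) > threshold_us;
}

static void toy_demo(void)
{
    /* Exactly 1 ms converts to 976 "us", not 1000. */
    assert(toy_ns_to_us(1000000) == 976);
    /* 2 ms of think time is idle against a 1 ms (SSD-like) threshold. */
    assert(toy_is_idle(2000000, 1000));
    /* 50 ms is not idle against a 100 ms (HD-like) threshold. */
    assert(!toy_is_idle(50000000, 100000));
}
```

The precision loss only matters for think times very close to the threshold, which fits the rough nature of the heuristic.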

26 Mar, 2017 · 1 commit

block: remove bio_clone_bioset_partial() · f4595875

Authored by Shaohua Li on Mar 24, 2017

Commit c18a1e09 ("block: introduce bio_clone_bioset_partial()") introduced
bio_clone_bioset_partial() for raid1 write behind IO. Now that write behind
has been rewritten by Ming, we don't need the API any more, so revert the commit.

Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>

25 Mar, 2017 · 1 commit

block: introduce bio_copy_data_partial · 6f880285

Authored by Ming Lei on Mar 17, 2017

Turns out we can use bio_copy_data in raid1's write behind,
and we can make alloc_behind_pages() cleaner and more efficient,
but we need a partial version of bio_copy_data().
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>

23 Mar, 2017 · 1 commit

block: make nr_iovecs unsigned in bio_alloc_bioset() · 7a88fa19

Authored by Dan Carpenter on Mar 23, 2017

There isn't a bug here, but Smatch is not smart enough to know that
"nr_iovecs" can't be negative so it complains about underflows.
Really, it's slightly cleaner to make this parameter unsigned.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

12 Mar, 2017 · 1 commit

blk: Ensure users for current->bio_list can see the full list. · f5fe1b51

Authored by NeilBrown on Mar 10, 2017

Commit 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
changed current->bio_list so that it did not contain *all* of the
queued bios, but only those submitted by the currently running
make_request_fn.

There are two places which walk the list and requeue selected bios,
and others that check if the list is empty.  These are no longer
correct.

So redefine current->bio_list to point to an array of two lists, which
contain all queued bios, and adjust various code to test or walk both
lists.
Signed-off-by: NeilBrown <neilb@suse.com>
Fixes: 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
Signed-off-by: Jens Axboe <axboe@fb.com>

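A small userspace sketch of why code that tests or walks the queued bios has to look at both lists. The types and helpers (`toy_bio`, `toy_list`, `toy_count_all`, `toy_all_empty`) are hypothetical and model only the "array of two lists" shape described above.

```c
#include <assert.h>
#include <stddef.h>

struct toy_bio  { struct toy_bio *next; };
struct toy_list { struct toy_bio *head; };

static void toy_push(struct toy_list *l, struct toy_bio *b)
{
    b->next = l->head;
    l->head = b;
}

/* Walk both lists: looking only at lists[0] would miss bios that are
 * parked on lists[1]. */
static size_t toy_count_all(const struct toy_list lists[2])
{
    size_t n = 0;
    for (int i = 0; i < 2; i++)
        for (const struct toy_bio *b = lists[i].head; b; b = b->next)
            n++;
    return n;
}

/* "Is anything queued?" must likewise test both lists. */
static int toy_all_empty(const struct toy_list lists[2])
{
    return !lists[0].head && !lists[1].head;
}

/* One bio on each list: a check of lists[0] alone would see only one. */
static size_t toy_demo(void)
{
    struct toy_list lists[2] = { { NULL }, { NULL } };
    struct toy_bio a, b;

    assert(toy_all_empty(lists));
    toy_push(&lists[0], &a);
    toy_push(&lists[1], &b);
    assert(!toy_all_empty(lists));
    return toy_count_all(lists);
}
```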

16 Feb, 2017 · 1 commit

block: introduce bio_clone_bioset_partial() · c18a1e09

Authored by Ming Lei on Feb 14, 2017

md still needs bio clone (not the fast version) for behind write,
and it is more efficient to use bio_clone_bioset_partial().

The idea is simple: just copy the bvec range specified by the
parameters.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>

02 Feb, 2017 · 1 commit

block: Update comments that refer to __bio_map_user() and bio_map_user() · 5fad1b64

Authored by Bart Van Assche on Feb 01, 2017

Since __bio_map_user() and bio_map_user() have been removed, update
the comments that still refer to these functions.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
References: commit ddad8dd0 ("block: use blk_rq_map_user_iov to implement blk_rq_map_user")
Cc: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

01 Feb, 2017 · 1 commit

block: fold cmd_type into the REQ_OP_ space · aebf526b

Authored by Christoph Hellwig on Jan 31, 2017

Instead of keeping two levels of indirection for request types, fold it
all into the operations.  The little caveat here is that previously
cmd_type only applied to struct request, while the request and bio op
fields were set to plain REQ_OP_READ/WRITE even for passthrough
operations.

Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
private requests, although it has to add two for each so that we
can communicate the data in/out nature of the request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

09 Dec, 2016 · 1 commit

block: improve handling of the magic discard payload · f9d03f96

Authored by Christoph Hellwig on Dec 08, 2016

Instead of allocating a single unused biovec for discard requests, send
them down without any payload.  Instead we allow the driver to add a
"special" payload using a biovec embedded into struct request (unioned
over other fields never used while in the driver), and overloading
the number of segments for this case.

This has a couple of advantages:

 - we don't have to allocate the bio_vec
 - the amount of special casing for discard requests in the block
   layer is significantly reduced
 - using this same scheme for other request types is trivial,
   which will be important for implementing the new WRITE_ZEROES
   op on devices where it actually requires a payload (e.g. SCSI)
 - we can get rid of playing games with the request length, as
   we'll never touch it and completions will work just fine
 - it will allow us to support ranged discard operations in the
   future by merging non-contiguous discard bios into a single
   request
 - last but not least it removes a lot of code

This patch is the common base for my WIP series for ranged discards and to
remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
so it would be good to get it in quickly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>