Commits · b4c5c60920e3b0c4598f43e7317559f6aec51531 · openanolis / cloud-kernel

24 Jul, 2014 1 commit

zram: avoid lockdep splat by revalidate_disk · b4c5c609

Committed by Minchan Kim on Jul 23, 2014

Sasha reported lockdep warning [1] introduced by [2].

It could be fixed by doing disk revalidation out of the init_lock.  It's
okay because disk capacity change is protected by init_lock so that
revalidate_disk always sees up-to-date value so there is no race.

[1] https://lkml.org/lkml/2014/7/3/735
[2] zram: revalidate disk after capacity change

Fixes 2e32baea ("zram: revalidate disk after capacity change").
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: "Alexander E. Patrakov" <patrakov@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b4c5c609

10 Jul, 2014 1 commit

drbd: fix regression 'out of mem, failed to invoke fence-peer helper' · bbc1c5e8

Committed by Lars Ellenberg on Jul 09, 2014

Since linux kernel 3.13, kthread_run() internally uses
wait_for_completion_killable().  We sometimes may use kthread_run()
while we still have a signal pending, which we used to kick our threads
out of potentially blocking network functions, causing kthread_run() to
mistake that as a new fatal signal and fail.

Fix: flush_signals() before kthread_run().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

bbc1c5e8

04 Jul, 2014 1 commit

zram: revalidate disk after capacity change · 2e32baea

Committed by Minchan Kim on Jul 02, 2014

Alexander reported that mkswap on /dev/zram0 fails if another process
has the block device file open.

The steps are as follows:

0. Reset the unused zram device.
1. Use a program that opens /dev/zram0 with O_RDWR and sleeps
   until killed.
2. While that program sleeps, echo the correct value to
   /sys/block/zram0/disksize.
3. Verify (e.g. in /proc/partitions) that the disk size is applied
   correctly. It is.
4. While that program still sleeps, attempt to mkswap /dev/zram0.
   This fails: mkswap: error: swap area needs to be at least 40 KiB

When I investigated, the size that mkswap obtained via
ioctl(fd, BLKGETSIZE64, xxx) was zero, although zram0 had the right size
after step 2.

The reason is that zram didn't revalidate the disk after changing its
capacity, so the size in the blockdev's inode is not up to date until
every open file on the device is closed.
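
For reference, the size query mentioned above can be reproduced from user space with a few lines of C. This is only a minimal sketch for Linux: the helper name blockdev_size_bytes() is hypothetical, and it is not taken from mkswap or from the patch.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* BLKGETSIZE64 */

/*
 * Hypothetical helper: query the size of a block device in bytes.
 * Returns 0 and stores the size in *size, or -1 if the device cannot
 * be opened or the ioctl fails.
 */
static int blockdev_size_bytes(const char *path, uint64_t *size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* BLKGETSIZE64 writes the device size in bytes into a 64-bit value. */
    int ret = ioctl(fd, BLKGETSIZE64, size);

    close(fd);
    return ret < 0 ? -1 : 0;
}
```

Calling it on /dev/zram0 while another process holds the device open would be one way to observe the stale size described in this report.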

This patch should fix the BUG.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Alexander E. Patrakov <patrakov@gmail.com>
Tested-by: Alexander E. Patrakov <patrakov@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2e32baea

25 Jun, 2014 1 commit

drbd: fix NULL pointer deref in blk_add_request_payload · 54ed4ed8

Committed by Lars Ellenberg on Jun 25, 2014

Discards don't have any payload.
But the scsi layer still expects a bio_vec it can use internally,
see sd_setup_discard_cmnd() and blk_add_request_payload().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

54ed4ed8

23 Jun, 2014 1 commit

rbd: handle parent_overlap on writes correctly · 9638556a

Committed by Ilya Dryomov on Jun 10, 2014

The following check in rbd_img_obj_request_submit()

    rbd_dev->parent_overlap <= obj_request->img_offset

allows the fall through to the non-layered write case even if both
parent_overlap and obj_request->img_offset belong to the same RADOS
object.  This leads to data corruption, because the area to the left of
parent_overlap ends up unconditionally zero-filled instead of being
populated with parent data.  Suppose we want to write 1M to offset 6M
of image bar, which is a clone of foo@snap; object_size is 4M,
parent_overlap is 5M:

    rbd_data.<id>.0000000000000001
     ---------------------|----------------------|------------
    | should be copyup'ed | should be zeroed out | write ...
     ---------------------|----------------------|------------
   4M                    5M                     6M
                    parent_overlap    obj_request->img_offset

4..5M should be copyup'ed from foo, yet it is zero-filled, just like
5..6M is.

Given that the only striping mode kernel client currently supports is
chunking (i.e. stripe_unit == object_size, stripe_count == 1), round
parent_overlap up to the next object boundary for the purposes of the
overlap check.
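
The rounding can be checked with the numbers from the example above. The following is a minimal user-space sketch, not the rbd code itself: the helper names round_up_u64() and write_needs_parent() are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round x up to the next multiple of align (align must be non-zero). */
static uint64_t round_up_u64(uint64_t x, uint64_t align)
{
    return ((x + align - 1) / align) * align;
}

/*
 * Hypothetical helper: a write starting at img_offset has to take the
 * layered (copyup) path if it starts inside an object that still
 * overlaps the parent, i.e. below parent_overlap rounded up to the
 * next object boundary.
 */
static bool write_needs_parent(uint64_t img_offset, uint64_t parent_overlap,
                               uint64_t object_size)
{
    return img_offset < round_up_u64(parent_overlap, object_size);
}
```

With object_size = 4M and parent_overlap = 5M, the rounded overlap is 8M, so the 1M write at offset 6M is treated as layered, while the unrounded comparison (5M <= 6M) would have let it fall through.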

Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

9638556a

18 Jun, 2014 1 commit

floppy: format block0 read error message properly · 1c65df3d

Committed by Jiri Kosina on Jun 18, 2014

If reading block 0 fails, a line without a trailing newline
is printed, causing dmesg to look horrible.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>

1c65df3d

17 Jun, 2014 1 commit

null_blk: fix softirq completions for queue_mode == 1 · d891fa70

Committed by Jens Axboe on Jun 16, 2014

Only blk-mq completions have payload attached to the request, for
request_fn mode we have stored it in req->special. This fixes an
oops with queue_mode=1 and softirq completions.
Signed-off-by: Jens Axboe <axboe@fb.com>

d891fa70

14 Jun, 2014 1 commit

NVMe: Fix START_STOP_UNIT Scsi->NVMe translation. · b8e08084

Committed by Dan McLeran on Jun 06, 2014

This patch contains several fixes for Scsi START_STOP_UNIT. The previous
code did not account for signed vs. unsigned arithmetic, which resulted
in an invalid lowest power state calculation when the device only supports
1 power state.

The code for Power Condition == 2 (Idle) was not following the spec. The
spec calls for setting the device to specific power states, depending
upon Power Condition Modifier, without accounting for the number of
power states supported by the device.

The code for Power Condition == 3 (Standby) was using a hard-coded '0'
which is replaced with the macro POWER_STATE_0.
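
The signed vs. unsigned pitfall mentioned above can be illustrated in user space. This is a sketch of the class of bug only, not the driver's code: the function names are hypothetical, and deriving the lowest state as "reported value minus one" is an assumption made for the example.

```c
#include <stdint.h>

/*
 * Buggy variant: the subtraction yields -1 for npss == 0, but the
 * result is stored in an unsigned type, so it wraps to UINT_MAX.
 */
static unsigned int lowest_state_unsigned(uint8_t npss)
{
    unsigned int lowest = npss - 1;
    return lowest;
}

/*
 * Clamped variant: do the arithmetic in a signed type and never go
 * below power state 0.
 */
static int lowest_state_clamped(uint8_t npss)
{
    int lowest = (int)npss - 1;
    return lowest < 0 ? 0 : lowest;
}
```

For a device reporting a single power state (npss == 0 here), the first function returns a huge bogus index while the second returns 0.
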
Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

b8e08084

13 Jun, 2014 2 commits

NVMe: Use Log Page constants in SCSI emulation · ef351b97

Committed by Matthew Wilcox on Jun 13, 2014

The nvme-scsi file defined its own Log Page constant.  Use the
newly-defined one from the header file instead.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

ef351b97

NVMe: Fix hot cpu notification dead lock · f3db22fe

Committed by Keith Busch on Jun 11, 2014

There is a potential dead lock if a cpu event occurs during nvme probe
since it registered with hot cpu notification. This fixes the race by
having the module register with notification outside of probe rather
than have each device register.

The actual work is done in a scheduled work queue instead of in the
notifier since assigning IO queues has the potential to block if the
driver creates additional queues.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

f3db22fe

12 Jun, 2014 1 commit

null_blk: fix name and description of 'queue_mode' module parameter · 54ae81cd

Committed by Mike Snitzer on Jun 11, 2014

'use_mq' is not the name of the module parameter, 'queue_mode' is.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

54ae81cd

11 Jun, 2014 3 commits

rbd: only set disk to read-only once · 22001f61

Committed by Josh Durgin on Sep 30, 2013

rbd_open(), called every time the device is opened, calls
set_device_ro().  There's no reason to set the device read-only or
read-write every time it is opened. Just do this once during device
setup, using set_disk_ro() instead because the struct block_device
isn't available to us there.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

22001f61

rbd: move calls that may sleep out of spin lock range · 77f33c03

Committed by Josh Durgin on Sep 30, 2013

get_user() and set_disk_ro() may allocate memory, leading to a
potential deadlock if they are called while a spin lock is held.

Move the acquisition and release of rbd_dev->lock from rbd_ioctl()
into rbd_ioctl_set_ro(), so it can occur between get_user() and
set_disk_ro().
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

77f33c03

rbd: add ioctl for rbd · 131fd9f6

Committed by Guangliang Zhao on Sep 24, 2013

When running the following commands:
    [root@ceph0 mnt]# blockdev --setro /dev/rbd1
    [root@ceph0 mnt]# blockdev --getro /dev/rbd1
    0

The blockdev --setro call didn't take effect because rbd doesn't
support the block driver's ioctl.

This resolves:
	http://tracker.ceph.com/issues/6265

Signed-off-by: Guangliang Zhao <guangliang@unitedstack.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

131fd9f6

07 Jun, 2014 2 commits

nbd: zero from and len fields in NBD_CMD_DISCONNECT. · 04cfac4e

Committed by Hani Benhabiles on Jun 06, 2014

The len field is already set to zero, but not the from field, which is
sent as 0xfffffffffffffe00.  This makes no sense, and may confuse
server implementations doing sanity checks (qemu-nbd is an example).
Signed-off-by: Hani Benhabiles <hani@linux.com>
Cc: Paul Clements <paul.clements@us.sios.com>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

04cfac4e

mtip32xx: minor performance enhancements · f45c40a9

Committed by Sam Bradshaw on Jun 06, 2014

This patch adds the following:

1) Compiler hinting in the fast path.
2) A prefetch of port->flags to eliminate moderate cpu stalling later
in mtip_hw_submit_io().
3) Eliminate a redundant rq_data_dir().
4) Reorder members of driver_data to eliminate false cacheline sharing
between irq_workers_active and unal_qdepth.

With some workload and topology configurations, I'm seeing ~1.5%
throughput improvement in small block random read benchmarks as well
as improved latency std. dev.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>

Add include of <linux/prefetch.h>
Signed-off-by: Jens Axboe <axboe@fb.com>

f45c40a9

06 Jun, 2014 6 commits

block: add blk_rq_set_block_pc() · f27b087b

Committed by Jens Axboe on Jun 06, 2014

With the optimizations around not clearing the full request at alloc
time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC
up to the user allocating the request.

Add a blk_rq_set_block_pc() that sets the command type to
REQ_TYPE_BLOCK_PC, and properly initializes the members associated
with this type of request. Update callers to use this function instead
of manipulating rq->cmd_type directly.

Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed
attempt.
Signed-off-by: Jens Axboe <axboe@fb.com>

f27b087b

rbd: fix ida/idr memory leak · ffe312cf

Committed by Ilya Dryomov on May 20, 2014

ida_destroy() needs to be called on module exit to release ida caches.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

ffe312cf

rbd: use reference counts for image requests · 0f2d5be7

Committed by Alex Elder on Apr 26, 2014

Each image request contains a reference count, but to date it has
not actually been used.  (I think this was just an oversight.) A
recent report involving rbd failing an assertion shed light on why
and where we need to use these reference counts.

Every OSD request associated with an object request uses
rbd_osd_req_callback() as its callback function.  That function will
call a helper function (dependent on the type of OSD request) that
will set the object request's "done" flag if appropriate.  If that
"done" flag is set, the object request is passed to
rbd_obj_request_complete().

In rbd_obj_request_complete(), requests are processed in sequential
order.  So if an object request completes before one of its
predecessors in the image request, the completion is deferred.
Otherwise, if it's a completing object's "turn" to be completed, it
is passed to rbd_img_obj_end_request(), which records the result of
the operation, accumulates transferred bytes, and so on.  Next, the
successor to this request is checked and if it is marked "done",
(deferred) completion processing is performed on that request, and
so on.  If the last object request in an image request is completed,
rbd_img_request_complete() is called, which (typically) destroys
the image request.

There is a race here, however.  The instant an object request is
marked "done" it can be provided (by a thread handling completion of
one of its predecessor operations) to rbd_img_obj_end_request(),
which (for the last request) can then lead to the image request
getting torn down.  And this can happen *before* that object has
itself entered rbd_img_obj_end_request().  As a result, once it
*does* enter that function, the image request (and even the object
request itself) may have been freed and become invalid.

All that's necessary to avoid this is to properly count references
to the image requests.  We tear down an image request's object
requests all at once--only when the entire image request has
completed.  So there's no need for an image request to count
references for its object requests.  However, we don't want an
image request to go away until the last of its object requests
has passed through rbd_img_obj_callback().  In other words,
we don't want rbd_img_request_complete() to necessarily
result in the image request being destroyed, because it may
get called before we've finished processing on all of its
object requests.

So the fix is to add a reference to an image request for
each of its object requests.  The reference can be viewed
as representing an object request that has not yet finished
its call to rbd_img_obj_callback().  That is emphasized by
getting the reference right after assigning that as the image
object's callback function.  The corresponding release of that
reference is done at the end of rbd_img_obj_callback(), which
every image object request passes through exactly once.

Cc: stable@vger.kernel.org
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>

0f2d5be7

rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync() · b30a01f2

Committed by Ilya Dryomov on May 22, 2014

osd_request, along with r_request and r_reply messages attached to it
are leaked in __rbd_dev_header_watch_sync() if the requested image
doesn't exist. This is because lingering requests are special and get
an extra ref in the reply path. Fix it by unregistering linger request
on the error path and split __rbd_dev_header_watch_sync() into two
functions to make it maintainable.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

b30a01f2

rbd: make sure we have latest osdmap on 'rbd map' · 30ba1f02

Committed by Ilya Dryomov on May 13, 2014

Given an existing idle mapping (img1), mapping an image (img2) in
a newly created pool (pool2) fails:

    $ ceph osd pool create pool1 8 8
    $ rbd create --size 1000 pool1/img1
    $ sudo rbd map pool1/img1
    $ ceph osd pool create pool2 8 8
    $ rbd create --size 1000 pool2/img2
    $ sudo rbd map pool2/img2
    rbd: sysfs write failed
    rbd: map failed: (2) No such file or directory

This is because client instances are shared by default and we don't
request an osdmap update when bumping a ref on an existing client.  The
fix is to use the mon_get_version request to see if the osdmap we have
is the latest, and block until the requested update is received if it's
not.

Fixes: http://tracker.ceph.com/issues/8184

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

30ba1f02

rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO · 461f758a

Committed by Duan Jiong on Apr 11, 2014

This patch fixes coccinelle error regarding usage of IS_ERR and
PTR_ERR instead of PTR_ERR_OR_ZERO.
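
For readers unfamiliar with the pattern, the idea behind PTR_ERR_OR_ZERO can be sketched in user space. The lowercase helpers below are stand-ins written for this example and only mimic the kernel macros; they are not the kernel's definitions.

```c
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer (mimics ERR_PTR). */
static void *err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* True if ptr lies in the top MAX_ERRNO addresses (mimics IS_ERR). */
static int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)(-MAX_ERRNO);
}

/* Decode the errno value from an error pointer (mimics PTR_ERR). */
static long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/*
 * One call replaces the open-coded
 *     if (is_err(p)) return ptr_err(p); return 0;
 * sequence that coccinelle flags.
 */
static long ptr_err_or_zero(const void *ptr)
{
    return is_err(ptr) ? ptr_err(ptr) : 0;
}
```
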
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>

461f758a

05 Jun, 2014 4 commits

zram: correct offset usage in zram_bio_discard · 38515c73

Committed by Weijie Yang on Jun 04, 2014

We want to skip the physical block(PAGE_SIZE) which is partially covered
by the discard bio, so we check the remaining size and subtract it if
there is a need to goto the next physical block.

The current offset usage in zram_bio_discard is incorrect and can break
the filesystem on top of it.  Consider the following scenario:

Suppose that on some architecture or config PAGE_SIZE is 64K and a
filesystem is set up on a zram disk without PAGE_SIZE alignment.  A
discard bio arrives with offset = 4K and size = 72K.  It should not
discard any physical block, because it only partially covers two physical
blocks.  However, with the current offset usage, it discards the second
physical block and frees its memory, which breaks the filesystem.
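
The intended arithmetic can be checked with the numbers from this scenario. The following is a user-space sketch under the assumption that only fully covered pages may be discarded; the function name is hypothetical and the code is not the zram implementation.

```c
#include <stddef.h>

/*
 * Count the pages that are fully covered by a discard of n bytes that
 * starts offset bytes into a page (offset < page_size). A partially
 * covered first page is skipped by subtracting the bytes that remain
 * in it, not the offset itself.
 */
static size_t full_pages_in_discard(size_t offset, size_t n, size_t page_size)
{
    if (offset) {
        size_t remaining = page_size - offset;

        /* The discard ends inside the first, partially covered page. */
        if (n <= remaining)
            return 0;
        n -= remaining;
    }

    /* Whatever is left covers n / page_size whole pages. */
    return n / page_size;
}
```

With page_size = 64K, offset = 4K and n = 72K, the first page consumes 60K, 12K remain, and no page is fully covered, so nothing should be freed.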

This patch corrects the offset usage in zram_bio_discard.
Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

38515c73

brd: return -ENOSPC rather than -ENOMEM on page allocation failure · 96f8d8e0

Committed by Matthew Wilcox on Jun 04, 2014

brd is effectively a thinly provisioned device.  Thinly provisioned
devices return -ENOSPC when they can't write a new block.  -ENOMEM is an
implementation detail that callers shouldn't know.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Acked-by: Dave Chinner <david@fromorbit.com>
Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

96f8d8e0

brd: add support for rw_page() · a72132c3

Committed by Matthew Wilcox on Jun 04, 2014

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a72132c3

blk-mq: let blk_mq_tag_to_rq() take blk_mq_tags as the main parameter · 0e62f51f

Committed by Jens Axboe on Jun 04, 2014

We currently pass in the hardware queue, and get the tags from there.
But from scsi-mq, with a shared tag space, it's a lot more convenient
to pass in the blk_mq_tags instead as the hardware queue isn't always
directly available. So instead of having to re-map to a given
hardware queue from rq->mq_ctx, just pass in the tags structure.
Signed-off-by: Jens Axboe <axboe@fb.com>

0e62f51f

04 Jun, 2014 9 commits

NVMe: Rename io_timeout to nvme_io_timeout · bd67608a

Committed by Matthew Wilcox on Jun 03, 2014

It's positively immoral to have a global variable called 'io_timeout'.
Keep the module parameter called io_timeout, though.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

bd67608a

NVMe: Use last bytes of f/w rev SCSI Inquiry · dedf4b15

Committed by Keith Busch on Apr 29, 2014

After skipping right-padded spaces, use the last four bytes of the
firmware revision when reporting the Inquiry Product Revision. These
are generally more indicative of what is running.
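
The selection of bytes described above might look like the following user-space sketch. The 8-byte firmware revision and 4-byte product revision widths, the function name, and the handling of revisions shorter than four characters are assumptions made for this example, not details taken from the patch.

```c
#include <stddef.h>
#include <string.h>

#define FW_REV_LEN   8   /* assumed width of the firmware revision field */
#define INQ_REV_LEN  4   /* assumed width of the Inquiry product revision */

/*
 * Copy the last INQ_REV_LEN non-padding bytes of a space-padded
 * firmware revision into out. Shorter revisions are copied from the
 * start and space-padded on the right (an assumption of this sketch).
 */
static void inquiry_revision(const char fr[FW_REV_LEN], char out[INQ_REV_LEN])
{
    size_t len = FW_REV_LEN;

    /* Skip the right-padded spaces. */
    while (len > 0 && fr[len - 1] == ' ')
        len--;

    memset(out, ' ', INQ_REV_LEN);
    if (len >= INQ_REV_LEN)
        memcpy(out, fr + len - INQ_REV_LEN, INQ_REV_LEN);
    else
        memcpy(out, fr, len);
}
```
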
Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

dedf4b15

NVMe: Adhere to request queue block accounting enable/disable · b4e75cbf

Committed by Sam Bradshaw on May 09, 2014

Recently, a new sysfs control "iostats" was added to selectively
enable or disable io statistics collection for request queues.  This
patch hooks that control.

IO statistics collection is rather expensive on large, multi-node
machines with drives pushing millions of iops.  Having the ability to
disable collection if not needed can improve throughput significantly.

As a data point, on a quad E5-4640, I see more than 50% throughput
improvement when io statistics accounting is disabled during heavily
multi-threaded small block random read benchmarks where device
performance is in the million iops+ range.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

b4e75cbf

NVMe: Fix nvme get/put queue semantics · a51afb54

Committed by Keith Busch on May 13, 2014

The routines to get and lock nvme queues required the caller to "put"
or "unlock" them even if getting one returned NULL. This patch fixes that.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

a51afb54

NVMe: Delete NVME_GET_FEAT_TEMP_THRESH · de672b97

Committed by Matthew Wilcox on Jun 03, 2014

This define isn't used, and any code that wanted to use it should use
NVME_FEAT_TEMP_THRESH instead.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

de672b97

NVMe: Make admin timeout a module parameter · 9d43cf64

Committed by Keith Busch on May 13, 2014

Signed-off-by: Keith Busch <keith.busch@intel.com>
[made admin_timeout static]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

9d43cf64

NVMe: Make iod bio timeout a parameter · 61e4ce08

Committed by Keith Busch on May 13, 2014

This was originally set to 4 times the IO timeout, but that was when
the IO timeout was 5 seconds instead of 30. 20 seconds for total time
to failure seemed more reasonable than 2 minutes for most, but other
users have requested to make this a module parameter instead.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[renamed the module parameter to retry_time]
[made retry_time static]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

61e4ce08

NVMe: Prevent possible NULL pointer dereference · 6808c5fb

Committed by Santosh Y on May 29, 2014

kmalloc() used by the nvme_alloc_iod() to allocate memory for 'iod'
can fail. So check the return value.
Signed-off-by: Santosh Y <santosh.sy@samsung.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

6808c5fb

NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD) · 4131f2fc

Committed by Indraneel Mukherjee on May 29, 2014

In GetLogPage the buffer size passed to device is a 0's based value.
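
As a worked example of the zero-based encoding, here is a minimal sketch of how CDW10 could be assembled. It assumes the layout with NUMD in bits 27:16 and the log page identifier in bits 7:0; the helper name is hypothetical and this is not the driver's code.

```c
#include <stdint.h>

/*
 * Build CDW10 for Get Log Page from a log page identifier and a buffer
 * size in bytes. buf_bytes is assumed to be a non-zero multiple of 4.
 * NUMD is the number of dwords minus one (zero-based).
 */
static uint32_t get_log_page_cdw10(uint8_t log_id, uint32_t buf_bytes)
{
    uint32_t numd = buf_bytes / 4 - 1;

    return ((numd & 0xfff) << 16) | log_id;
}
```

For a 512-byte buffer and log page 0x02, the buffer holds 128 dwords, so NUMD is 127 (0x7f) and CDW10 becomes 0x007f0002. Passing 128 instead would ask the device for one dword more than the buffer can hold.
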
Signed-off-by: Indraneel M <indraneel.m@samsung.com>
Reported-by: Shiro Itou <shiro.itou@outlook.com>
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

4131f2fc

30 May, 2014 1 commit

block: virtio_blk: don't hold spin lock during world switch · e8edca6f

Committed by Ming Lei on May 30, 2014

Firstly, it isn't necessary to hold vblk->vq_lock when notifying
the hypervisor about queued I/O.

Secondly, virtqueue_notify() causes a world switch, which may take
a long time on some hypervisors (such as qemu-arm), so it isn't
good to hold the lock and block other vCPUs.

On an arm64 quad-core VM (qemu-kvm), the patch increases I/O
performance considerably with VIRTIO_RING_F_EVENT_IDX enabled:
	- without the patch: 14K IOPS
	- with the patch: 34K IOPS

fio script:
	[global]
	direct=1
	bsrange=4k-4k
	timeout=10
	numjobs=4
	ioengine=libaio
	iodepth=64

	filename=/dev/vdc
	group_reporting=1

	[f1]
	rw=randread

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org # 3.13+
Signed-off-by: Jens Axboe <axboe@fb.com>

e8edca6f

29 May, 2014 4 commits

xen-blkback: defer freeing blkif to avoid blocking xenwatch · 814d04e7

Committed by Valentin Priescu on May 20, 2014

Currently xenwatch blocks in VBD disconnect, waiting for all pending I/O
requests to finish. If the VBD is attached to a hot-swappable disk, then
xenwatch can hang for a long period of time, stalling other watches.

 INFO: task xenwatch:39 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 ffff880057f01bd0 0000000000000246 ffff880057f01ac0 ffffffff810b0782
 ffff880057f01ad0 00000000000131c0 0000000000000004 ffff880057edb040
 ffff8800344c6080 0000000000000000 ffff880058c00ba0 ffff880057edb040
 Call Trace:
 [<ffffffff810b0782>] ? irq_to_desc+0x12/0x20
 [<ffffffff8128f761>] ? list_del+0x11/0x40
 [<ffffffff8147a080>] ? wait_for_common+0x60/0x160
 [<ffffffff8147bcef>] ? _raw_spin_lock_irqsave+0x2f/0x50
 [<ffffffff8147bd49>] ? _raw_spin_unlock_irqrestore+0x19/0x20
 [<ffffffff8147a26a>] schedule+0x3a/0x60
 [<ffffffffa018fe6a>] xen_blkif_disconnect+0x8a/0x100 [xen_blkback]
 [<ffffffff81079f70>] ? wake_up_bit+0x40/0x40
 [<ffffffffa018ffce>] xen_blkbk_remove+0xae/0x1e0 [xen_blkback]
 [<ffffffff8130b254>] xenbus_dev_remove+0x44/0x90
 [<ffffffff81345cb7>] __device_release_driver+0x77/0xd0
 [<ffffffff81346488>] device_release_driver+0x28/0x40
 [<ffffffff813456e8>] bus_remove_device+0x78/0xe0
 [<ffffffff81342c9f>] device_del+0x12f/0x1a0
 [<ffffffff81342d2d>] device_unregister+0x1d/0x60
 [<ffffffffa0190826>] frontend_changed+0xa6/0x4d0 [xen_blkback]
 [<ffffffffa019c252>] ? frontend_changed+0x192/0x650 [xen_netback]
 [<ffffffff8130ae50>] ? cmp_dev+0x60/0x60
 [<ffffffff81344fe4>] ? bus_for_each_dev+0x94/0xa0
 [<ffffffff8130b06e>] xenbus_otherend_changed+0xbe/0x120
 [<ffffffff8130b4cb>] frontend_changed+0xb/0x10
 [<ffffffff81309c82>] xenwatch_thread+0xf2/0x130
 [<ffffffff81079f70>] ? wake_up_bit+0x40/0x40
 [<ffffffff81309b90>] ? xenbus_directory+0x80/0x80
 [<ffffffff810799d6>] kthread+0x96/0xa0
 [<ffffffff81485934>] kernel_thread_helper+0x4/0x10
 [<ffffffff814839f3>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8147c17c>] ? retint_restore_args+0x5/0x6
 [<ffffffff81485930>] ? gs_change+0x13/0x13

With this patch, when there is still pending I/O, the actual disconnect
is done by the last reference holder (last pending I/O request). In this
case, xenwatch doesn't block indefinitely.
Signed-off-by: Valentin Priescu <priescuv@amazon.com>
Reviewed-by: Steven Kady <stevkady@amazon.com>
Reviewed-by: Steven Noonan <snoonan@amazon.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

814d04e7

xen/blkback: disable discard feature if requested by toolstack · c926b701

Committed by Olaf Hering on May 21, 2014

Newer toolstacks may provide a boolean property "discard-enable" in the
backend node. Its purpose is to disable discard for file backed storage
to avoid fragmentation. Recognize this setting also for physical
storage.  If that property exists and is false, do not advertise
"feature-discard" to the frontend.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

c926b701

xen-blkfront: remove type check from blkfront_setup_discard · 1c8cad6c

Committed by Olaf Hering on May 21, 2014

In its initial implementation a check for "type" was added, but only phy
and file are handled. This breaks advertised discard support for other
type values such as qdisk.

Fix and simplify this function: If the backend advertises discard
support it is supposed to implement it properly, so enable
feature_discard unconditionally. If the backend advertises the need for
a certain granularity and alignment then propagate both properties to
the blocklayer. The discard-secure property is a boolean, update the code
to reflect that.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

1c8cad6c

blk-mq: remove alloc_hctx and free_hctx methods · cdef54dd

Committed by Christoph Hellwig on May 28, 2014

There is no need for drivers to control hardware context allocation
now that we do the context to node mapping in common code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

cdef54dd
