1. 31 October 2015, 1 commit
    • rbd: require stable pages if message data CRCs are enabled · bae818ee
      Ronny Hegewald authored
      rbd requires stable pages, as it computes a CRC of the page data
      before it is sent to the OSDs.
      
      But since kernel 3.9 (patch 1d1d1a76
      "mm: only enforce stable page writes if the backing device requires
      it"), block devices are no longer assumed to require stable pages.
      
      This patch sets the necessary flag to get stable pages back for rbd.
      
      In a ceph installation that provides multiple ext4-formatted rbd
      devices, "bad crc" messages appeared regularly in the OSD logs
      before this patch (about 1 message every 1-2 minutes on every OSD
      that provided data for an rbd device). After this patch these
      messages are pretty much gone (only about 1-2 per month per OSD).
      
      Cc: stable@vger.kernel.org # 3.9+, needs backporting
      Signed-off-by: Ronny Hegewald <Ronny.Hegewald@online.de>
      [idryomov@gmail.com: require stable pages only in crc case, changelog]
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
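      A hedged sketch of what the fix amounts to (assuming the 4.3-era
      backing_dev_info API in rbd_init_disk(); not the verbatim patch):

        /* q is the rbd request queue being set up in rbd_init_disk().
         * Stable pages are only needed when data CRCs are computed, i.e.
         * when the "nocrc" option is not set: BDI_CAP_STABLE_WRITES makes
         * the VM wait for writeback instead of letting a page change in
         * flight, so the CRC matches what actually hits the wire. */
        if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
                q->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES;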
  2. 24 October 2015, 2 commits
    • rbd: prevent kernel stack blow up on rbd map · 6d69bb53
      Ilya Dryomov authored
      Mapping an image with a long parent chain (e.g. image foo, whose parent
      is bar, whose parent is baz, etc) currently leads to a kernel stack
      overflow, due to the following recursion in the reply path:
      
        rbd_osd_req_callback()
          rbd_obj_request_complete()
            rbd_img_obj_callback()
              rbd_img_parent_read_callback()
                rbd_obj_request_complete()
                  ...
      
      Limit the parent chain to 16 images, which is ~5K worth of stack.  When
      the above recursion is eliminated, this limit can be lifted.
      
      Fixes: http://tracker.ceph.com/issues/12538
      
      Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Josh Durgin <jdurgin@redhat.com>
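      A hedged sketch of the guard (assumed shape of the probe path, not
      the verbatim patch):

        #define RBD_MAX_PARENT_CHAIN_LEN  16

        /* depth is threaded through the image probe path; each level of
         * the parent chain bumps it before probing the next parent */
        static int rbd_dev_probe_parent(struct rbd_device *rbd_dev, int depth)
        {
                if (!rbd_dev->parent_spec)
                        return 0;
                if (++depth > RBD_MAX_PARENT_CHAIN_LEN) {
                        pr_info("parent chain is too long (%d)\n", depth);
                        return -EINVAL;
                }
                /* ... create the parent rbd_device and probe it, passing
                 * depth along so grandparents are counted too ... */
                return 0;
        }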
    • rbd: don't leak parent_spec in rbd_dev_probe_parent() · 1f2c6651
      Ilya Dryomov authored
      Currently we leak parent_spec and trigger a "parent reference
      underflow" warning if rbd_dev_create() in rbd_dev_probe_parent()
      fails. The problem is that we take the !parent out_err branch,
      which only drops refcounts; the parent_spec that would have been
      freed had we called rbd_dev_unparent() remains, and triggers
      rbd_warn() in rbd_dev_parent_put(): at that point we have
      parent_spec != NULL and parent_ref == 0, so the counter ends up
      at -1 after the decrement.
      
      Redo rbd_dev_probe_parent() to fix this.
      
      Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
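      The crux of the rework, as a hedged sketch (only the unwind path is
      shown; the rest of the function is condensed):

        out_err:
                /* single unwind point: rbd_dev_unparent() drops the
                 * parent ref and frees parent_spec together, so the
                 * counter and the pointer can no longer diverge the way
                 * the old !parent branch allowed */
                rbd_dev_unparent(rbd_dev);
                if (parent)
                        rbd_dev_destroy(parent);
                return ret;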
  3. 23 October 2015, 6 commits
  4. 16 October 2015, 3 commits
    • rbd: use writefull op for object size writes · e30b7577
      Ilya Dryomov authored
      This covers only the simplest case - a write the size of a whole
      object - but it's still useful in tiering setups where EC is used
      for the base tier, as a writefull op can be proxied, saving an
      object promotion.
      
      Even though updating ceph_osdc_new_request() to allow writefull
      should just be a matter of fixing an assert, I didn't do it because
      its only user is cephfs. All other call sites were updated.
      
      Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
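      A hedged sketch of the op selection (not the verbatim hunk):

        /* a write covering a whole object can replace its contents
         * outright, so a cache tier can proxy it to an EC base tier
         * without first promoting the object */
        opcode = (offset == 0 && length == rbd_obj_bytes(&rbd_dev->header)) ?
                        CEPH_OSD_OP_WRITEFULL : CEPH_OSD_OP_WRITE;
        osd_req_op_extent_init(osd_req, 0, opcode, offset, length, 0, 0);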
    • rbd: set max_sectors explicitly · 0d9fde4f
      Ilya Dryomov authored
      Commit 30e2bc08 ("Revert "block: remove artifical max_hw_sectors
      cap"") restored a clamp on max_sectors. It's now 2560 sectors
      instead of 1024, but that's still not good enough: we set
      max_hw_sectors to the rbd object size because we don't want
      object-sized I/Os to be split, and the default object size is 4M.
      
      So, set max_sectors to max_hw_sectors in rbd at queue init time.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
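      The change itself is tiny; roughly (assuming rbd_init_disk()'s
      local segment_size, which holds the object size):

        blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
        q->limits.max_sectors = queue_max_hw_sectors(q);

      Without the second line the block core clamps max_sectors to its
      2560-sector default and object-sized 4M I/Os get split.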
    • NVMe: Fix memory leak on retried commands · 0dfc70c3
      Keith Busch authored
      Resources are reallocated for requeued commands, so unmap and
      release the iod for the failed command.
      
      It's a pretty bad memory leak, and it causes a kernel hang if you
      remove a drive, because of a busy dma pool. You'll get messages
      spewing like this:
      
        nvme 0000:xx:xx.x: dma_pool_destroy prp list 256, ffff880420dec000 busy
      
      and it locks up PCI and the driver, since removal never completes
      while a lock is held.
      
      Cc: stable@vger.kernel.org # 4.0.x-
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
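      A hedged sketch of the release that the requeue path was skipping
      (4.3-era nvme-core.c shapes; surrounding control flow condensed):

        /* a requeued command is given a fresh iod on resubmission, so
         * the old one must be unmapped and released either way --
         * returning early here is what left PRP-list entries behind
         * and kept the dma_pool busy at removal time */
        if (iod->nents)
                dma_unmap_sg(nvmeq->dev->dev, iod->sg, iod->nents,
                             rq_data_dir(req) ?
                                     DMA_TO_DEVICE : DMA_FROM_DEVICE);
        nvme_free_iod(nvmeq->dev, iod);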
  5. 15 October 2015, 1 commit
  6. 13 October 2015, 1 commit
    • nvme: fix 32-bit build warning · 835da3f9
      Arnd Bergmann authored
      Compiling the nvme driver on 32-bit warns about a cast from a __u64
      variable to a pointer:
      
      drivers/block/nvme-core.c: In function 'nvme_submit_io':
      drivers/block/nvme-core.c:1847:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
          (void __user *)io.addr, length, NULL, 0);
      
      The cast here is intentional and safe, so we can shut up the
      gcc warning by adding an intermediate cast to 'uintptr_t'.
      
      I had previously submitted a patch to fix this problem in the
      nvme driver, but it was accepted on the same day that two new
      warnings got added.
      
      For consistency, I also change the third instance of this cast
      to use uintptr_t instead of unsigned long.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: d29ec824 ("nvme: submit internal commands through the block layer")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
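      The pattern as a minimal stand-alone demo (userspace stand-in;
      io_addr plays the role of io.addr from struct nvme_user_io):

        #include <stdint.h>

        static void *u64_to_ptr(uint64_t io_addr)
        {
                /* (void *)io_addr alone warns on 32-bit because the
                 * integer is wider than the pointer; the intermediate
                 * uintptr_t matches the pointer width on both 32- and
                 * 64-bit builds, making the narrowing explicit */
                return (void *)(uintptr_t)io_addr;
        }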
  7. 10 October 2015, 11 commits
  8. 09 October 2015, 1 commit
  9. 08 October 2015, 1 commit
  10. 01 October 2015, 1 commit
  11. 24 September 2015, 7 commits
    • NVMe: Set affinity after allocating request queues · bda4e0fb
      Keith Busch authored
      The asynchronous namespace scanning caused affinity hints to be set
      before the tag set was initialized, so there was no cpu mask to set
      the hint with. This patch moves the affinity hint setting to after
      namespaces are scanned.
      Reported-by: 김경산 <ks0204.kim@samsung.com>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
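      A hedged sketch of the hint-setting helper this reordering makes
      effective (assumed 4.3-era structures, not the verbatim patch):

        static void nvme_set_irq_hints(struct nvme_dev *dev)
        {
                int i;

                for (i = 0; i < dev->online_queues; i++) {
                        struct nvme_queue *nvmeq = dev->queues[i];

                        /* before the tagset is initialized there is no
                         * cpumask here -- the case the bug used to hit */
                        if (!nvmeq->tags || !*nvmeq->tags)
                                continue;
                        irq_set_affinity_hint(
                                dev->entry[nvmeq->cq_vector].vector,
                                blk_mq_tags_cpumask(*nvmeq->tags));
                }
        }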
    • block: loop: support DIO & AIO · bc07c10a
      Ming Lei authored
      There are at least 3 advantages to using direct I/O and AIO on the
      read/write path of loop's backing file:
      
      1) double caching can be avoided, so memory usage decreases a lot
      
      2) unlike user-space direct I/O, there is no cost for pinning pages
      
      3) context switches are avoided while still obtaining good throughput
      - in buffered file reads, top random-I/O throughput is often
      obtained only if requests are submitted concurrently from lots of
      tasks; but for sequential I/O, most of the time requests can be
      served from the page cache, so concurrent submission often
      introduces unnecessary context switches without improving
      throughput much. There has been discussion[1] of using non-blocking
      I/O to improve this for applications.
      - with direct I/O and AIO, concurrent submission can be avoided
      while random read throughput is unaffected
      
      xfstests (-g auto, ext4) basically passes when running with direct
      I/O (aio); the one exception is generic/232, but it fails with loop
      buffered I/O (4.2-rc6-next-20150814) too.
      
      The fio test results for performance follow:
      	4-job fio test inside an ext4 file system over a loop block device
      
      1) How to run
      	- KVM: 4 VCPUs, 2G RAM
      	- linux kernel: 4.2-rc6-next-20150814 (base) with the patchset
      	- the loop block device is backed by one image on an SSD
      	- linux psync, 4 jobs, size 1500M, ext4 over the loop block device
      	- test result: IOPS from fio output
      
      2) Throughput (IOPS) becomes a bit better with direct I/O (aio)
              -------------------------------------------------------------
              test cases          |randread   |read   |randwrite  |write  |
              -------------------------------------------------------------
              base                |8015       |113811 |67442      |106978 |
              -------------------------------------------------------------
              base+loop aio       |8136       |125040 |67811      |111376 |
              -------------------------------------------------------------
      
      - presumably this is caused by more page cache being available to
      the application, or by avoiding one extra page copy in the direct
      I/O case
      3) context switches
              - context switches decreased by ~50% with loop direct I/O (aio)
                compared with loop buffered I/O (4.2-rc6-next-20150814)
      
      4) memory usage from /proc/meminfo
              -------------------------------------------------------------
                                         | Buffers       | Cached
              -------------------------------------------------------------
              base                       | > 760MB       | ~950MB
              -------------------------------------------------------------
              base+loop direct I/O(aio)  | < 5MB         | ~1.6GB
              -------------------------------------------------------------
      
      - so there is much more page cache available for applications with
      direct I/O
      
      [1] https://lwn.net/Articles/612483/
      
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
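      A hedged sketch of the submission path (close to the shape of the
      patch, but condensed; error propagation is simplified):

        static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
        {
                struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);

                cmd->ret = ret;
                blk_mq_complete_request(cmd->rq, 0);
        }

        static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
                             loff_t pos, bool rw)
        {
                struct file *file = lo->lo_backing_file;
                struct bio *bio = cmd->rq->bio;
                struct iov_iter iter;
                int ret;

                /* wrap the request's pages directly: no pinning, no copy */
                iov_iter_bvec(&iter, ITER_BVEC | rw,
                              __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter),
                              bio_segments(bio), blk_rq_bytes(cmd->rq));

                cmd->iocb.ki_pos = pos;
                cmd->iocb.ki_filp = file;
                cmd->iocb.ki_complete = lo_rw_aio_complete;
                cmd->iocb.ki_flags = IOCB_DIRECT;

                ret = (rw == WRITE) ? file->f_op->write_iter(&cmd->iocb, &iter)
                                    : file->f_op->read_iter(&cmd->iocb, &iter);
                if (ret != -EIOCBQUEUED)        /* completed synchronously */
                        cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
                return 0;
        }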
    • block: loop: introduce ioctl command of LOOP_SET_DIRECT_IO · ab1cb278
      Ming Lei authored
      If a loop device is set up via 'mount -o loop', it isn't easy to
      pass in a file descriptor opened with O_DIRECT, so this patch
      introduces a new ioctl command to support direct I/O in that case.
      
      Cc: linux-api@vger.kernel.org
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
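      A userspace sketch of the new command (the device path is
      illustrative):

        #include <fcntl.h>
        #include <sys/ioctl.h>
        #include <unistd.h>
        #include <linux/loop.h>

        int main(void)
        {
                int fd = open("/dev/loop0", O_RDWR);

                if (fd < 0)
                        return 1;
                /* non-zero enables direct I/O against the backing file,
                 * 0 switches back to buffered I/O */
                if (ioctl(fd, LOOP_SET_DIRECT_IO, 1UL) < 0) {
                        close(fd);
                        return 1;
                }
                close(fd);
                return 0;
        }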
    • block: loop: prepare for supporting direct IO · 2e5ab5f3
      Ming Lei authored
      This patch provides an interface for enabling direct I/O from user
      space:
      
      	- userspace (such as losetup) can pass a 'file' that was
      	opened with, or switched via fcntl() to, O_DIRECT
      
      Also, __loop_update_dio() is introduced to check whether direct I/O
      can be used with the current loop settings.
      
      The last big change is to introduce the LO_FLAGS_DIRECT_IO flag,
      so that userspace can tell whether direct I/O is being used to
      access the backing file.
      
      Cc: linux-api@vger.kernel.org
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
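      A hedged sketch of the gating logic (assumed shape, not the
      verbatim helper):

        static void __loop_update_dio(struct loop_device *lo, bool dio)
        {
                struct file *file = lo->lo_backing_file;
                struct inode *inode = file->f_mapping->host;
                unsigned short bsize = 512;
                bool use_dio;

                if (inode->i_sb->s_bdev)
                        bsize = bdev_logical_block_size(inode->i_sb->s_bdev);

                /* direct I/O needs iter ops on the backing file and a
                 * loop offset aligned to the backing logical block size */
                use_dio = dio && file->f_op->read_iter &&
                          file->f_op->write_iter &&
                          !(lo->lo_offset & (bsize - 1));

                if (use_dio)
                        lo->lo_flags |= LO_FLAGS_DIRECT_IO;
                else
                        lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
                lo->use_dio = use_dio;
        }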
    • block: loop: use kthread_work · e03a3d7a
      Ming Lei authored
      The following patch will use dio/aio to submit I/O to the backing
      file; it then no longer needs to schedule I/O concurrently from
      work items, so use kthread_work to cut the context-switch cost
      a lot.
      
      For the non-AIO case, a single thread had been used for a long,
      long time; it was only converted to a workqueue in v4.0, which has
      already caused a performance regression for Fedora live booting.
      In the discussion[1], even though submitting I/O concurrently via
      work items can improve random read I/O throughput, it may hurt
      sequential read I/O performance at the same time, so it is better
      to restore the single-thread behaviour.
      
      For the upcoming AIO support, should loop ever face really high
      performance requirements, multiple hw queues with a per-hwq kthread
      would be a better fit than the current work approach.
      
      [1] http://marc.info/?t=143082678400002&r=1&w=2
      
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
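      A hedged sketch of the conversion (4.3-era kthread_work API; these
      helpers were later renamed kthread_init_worker() etc.):

        /* one dedicated thread per loop device drains the work list,
         * restoring the old single-threaded submission order */
        init_kthread_worker(&lo->worker);
        lo->worker_task = kthread_run(kthread_worker_fn, &lo->worker,
                                      "loop%d", lo->lo_number);
        if (IS_ERR(lo->worker_task))
                return -ENOMEM;

        /* per-command work item, queued from the blk-mq ->queue_rq hook */
        init_kthread_work(&cmd->work, loop_queue_work);
        queue_kthread_work(&lo->worker, &cmd->work);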
    • block: loop: set QUEUE_FLAG_NOMERGES for request queue of loop · 5b5e20f4
      Ming Lei authored
      It doesn't make sense to enable merging, because the I/O submitted
      to the backing file is handled page by page.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
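      The change is essentially one line at queue setup time (hedged):

        /* requests are processed page by page anyway, so merge attempts
         * only burn CPU in the block core */
        queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);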
    • xen/blkback: free requests on disconnection · f929d42c
      Roger Pau Monné authored
      This is due to commit 86839c56
      "xen/block: add multi-page ring support".
      
      When running a guest under UEFI, the following warning comes from
      blkback after the domain is destroyed.
      
      ------------[ cut here ]------------
      WARNING: CPU: 2 PID: 95 at
      /home/julien/works/linux/drivers/block/xen-blkback/xenbus.c:274
      xen_blkif_deferred_free+0x1f4/0x1f8()
      Modules linked in:
      CPU: 2 PID: 95 Comm: kworker/2:1 Tainted: G        W       4.2.0 #85
      Hardware name: APM X-Gene Mustang board (DT)
      Workqueue: events xen_blkif_deferred_free
      Call trace:
      [<ffff8000000890a8>] dump_backtrace+0x0/0x124
      [<ffff8000000891dc>] show_stack+0x10/0x1c
      [<ffff8000007653bc>] dump_stack+0x78/0x98
      [<ffff800000097e88>] warn_slowpath_common+0x9c/0xd4
      [<ffff800000097f80>] warn_slowpath_null+0x14/0x20
      [<ffff800000557a0c>] xen_blkif_deferred_free+0x1f0/0x1f8
      [<ffff8000000ad020>] process_one_work+0x160/0x3b4
      [<ffff8000000ad3b4>] worker_thread+0x140/0x494
      [<ffff8000000b2e34>] kthread+0xd8/0xf0
      ---[ end trace 6f859b7883c88cdd ]---
      
      Request allocation has been moved to connect_ring, which is called
      every time blkback connects to the frontend (this can happen
      multiple times during a blkback instance's life cycle). Request
      freeing, on the other hand, has not been moved, so it is only done
      when the backend instance is destroyed. Due to this mismatch,
      blkback can allocate the request pool multiple times without ever
      freeing it.
      
      In order to fix this, move the freeing of requests to
      xen_blkif_disconnect to restore the symmetry between request
      allocation and freeing.
      Reported-by: Julien Grall <julien.grall@citrix.com>
      Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
      Tested-by: Julien Grall <julien.grall@citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: xen-devel@lists.xenproject.org
      Cc: stable@vger.kernel.org # 4.2
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
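      A hedged sketch of the relocated teardown (condensed; names follow
      the 4.2-era xen-blkback structures):

        static int xen_blkif_disconnect(struct xen_blkif *blkif)
        {
                struct pending_req *req, *n;
                int i = 0, j;

                /* ... existing kthread/irq/ring teardown ... */

                /* the pool connect_ring allocates is now freed here, on
                 * every disconnect, not only at final backend destruction */
                list_for_each_entry_safe(req, n, &blkif->pending_free,
                                         free_list) {
                        list_del(&req->free_list);
                        for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++)
                                kfree(req->segments[j]);
                        for (j = 0; j < MAX_INDIRECT_PAGES; j++)
                                kfree(req->indirect_pages[j]);
                        kfree(req);
                        i++;
                }
                WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));
                blkif->nr_ring_pages = 0;
                return 0;
        }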
  12. 18 September 2015, 1 commit
  13. 09 September 2015, 4 commits
    • zram: unify error reporting · 70864969
      Sergey Senozhatsky authored
      Make zram's syslog error reporting more consistent; we currently
      have somewhat arbitrary error levels in some places. For example,
      critical errors like
        "Error allocating memory for compressed page"
      and
        "Unable to allocate temp memory"
      are reported as KERN_INFO messages.
      
      a) Reassign error levels
      
      Error messages that directly affect zram
      functionality -- pr_err():
      
       Error allocating zram address table
       Error creating memory pool
       Decompression failed! err=%d, page=%u
       Unable to allocate temp memory
       Compression failed! err=%d
       Error allocating memory for compressed page: %u, size=%zu
       Cannot initialise %s compressing backend
       Error allocating disk queue for device %d
       Error allocating disk structure for device %d
       Error creating sysfs group for device %d
       Unable to register zram-control class
       Unable to get major number
      
      Messages that do not affect functionality, but user
      must be warned (because sysfs attrs will be removed in
      this particular case) -- pr_warn():
      
       %d (%s) Attribute %s (and others) will be removed. %s
      
      Messages that do not affect functionality and mostly are
      informative -- pr_info():
      
       Cannot change max compression streams
       Can't change algorithm for initialized device
       Cannot change disksize for initialized device
       Added device: %s
       Removed device: %s
      
      b) Update sysfs_create_group() error message
      
      First, it lacks a trailing new line; add it.  Second, every error message
      in zram_add() has a "for device %d" part, which makes errors more
      informative.  Add missing part to "Error creating sysfs group" message.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
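      The resulting convention, illustrated with fragments (the variables
      are stand-ins for whatever is in scope at each call site):

        /* request processing fails: error */
        pr_err("Error allocating memory for compressed page: %u, size=%zu\n",
               index, clen);
        /* degraded but still functional (sysfs attrs go away): warning */
        pr_warn("Attribute %s (and others) will be removed. %s\n",
                attr_name, help);
        /* purely informative */
        pr_info("Added device: %s\n", zram->disk->disk_name);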
    • zsmalloc: account the number of compacted pages · 860c707d
      Sergey Senozhatsky authored
      Compaction returns to zram the number of migrated objects, which is
      quite uninformative -- we have objects of different sizes, so user
      space cannot obtain any valuable data from that number. Change
      compaction to operate in terms of pages and return to the
      compaction issuer the number of pages that were freed during
      compaction. So from now on we will export a more meaningful value
      in zram<id>/mm_stat -- the number of freed (compacted) pages.
      
      This requires:
       (a) a rename of `num_migrated' to 'pages_compacted'
       (b) an internal API change -- return first_page's fullness_group
           from putback_zspage(), so we know when putback_zspage() did
           free_zspage(). It helps us account compaction stats correctly.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
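      A hedged sketch of the accounting hunk (assumed shape):

        /* putback_zspage() now returns the fullness group, so compaction
         * can tell when the source zspage went empty and was freed, and
         * credit its pages to the pool-wide counter shown in mm_stat */
        if (putback_zspage(pool, class, src_page) == ZS_EMPTY)
                pool->stats.pages_compacted += class->pages_per_zspage;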
    • zsmalloc/zram: introduce zs_pool_stats api · 7d3f3938
      Sergey Senozhatsky authored
      `zs_compact_control' accounts the number of migrated objects, but
      it has a limited lifespan -- we lose it as soon as zs_compaction()
      returns to zram. That worked fine, because (a) zram had its own
      counter of migrated objects and (b) only zram could trigger
      compaction. However, this does not work for automatic pool
      compaction (not issued by zram). To account for objects migrated
      during auto-compaction (issued by the shrinker), we need to store
      this number in zs_pool.
      
      Define a new `struct zs_pool_stats' structure to keep zs_pool's stats
      there.  It provides only `num_migrated', as of this writing, but it
      surely can be extended.
      
      A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to
      caller.
      
      Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
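      The new interface's shape (hedged, condensed):

        struct zs_pool_stats {
                /* number of objects migrated so far; the follow-up
                 * accounting change above renames this to pages_compacted */
                unsigned long num_migrated;
        };

        /* snapshot the pool's stats into caller-provided storage */
        void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);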
    • rbd: plug rbd_dev->header.object_prefix memory leak · d194cd1d
      Ilya Dryomov authored
      We need to free object_prefix when rbd_dev_v2_snap_context() fails,
      but only if this is the first time we are reading in the header.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
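      A hedged sketch of the fixed error path in the v2 header read
      (assumed shape):

        ret = rbd_dev_v2_snap_context(rbd_dev);
        if (ret && first_time) {
                /* object_prefix was read in by this call, so on failure
                 * it is ours to free; on later refreshes it must be kept */
                kfree(rbd_dev->header.object_prefix);
                rbd_dev->header.object_prefix = NULL;
        }
        return ret;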