1. 31 October 2015, 1 commit
    • rbd: require stable pages if message data CRCs are enabled · bae818ee
      Ronny Hegewald authored
      rbd requires stable pages, as it computes a CRC of the page data
      before it is sent to the OSDs.
      
      But since kernel 3.9 (patch 1d1d1a76
      "mm: only enforce stable page writes if the backing device requires
      it"), block devices are no longer assumed to require stable pages.
      
      This patch sets the necessary flag to get stable pages back for rbd.
      
      In a ceph installation that provides multiple ext4-formatted rbd
      devices, "bad crc" messages appeared regularly in the OSD logs
      before this patch (about 1 message every 1-2 minutes on every OSD
      that provided data for an rbd device). After this patch these
      messages are pretty much gone (only about 1-2 per month per OSD).
      
      Cc: stable@vger.kernel.org # 3.9+, needs backporting
      Signed-off-by: Ronny Hegewald <Ronny.Hegewald@online.de>
      [idryomov@gmail.com: require stable pages only in crc case, changelog]
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
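      A hedged sketch of what the fix amounts to (assuming the 4.3-era
      backing_dev_info API in rbd_init_disk(); not the verbatim patch):

        /* q is the rbd request queue being set up in rbd_init_disk().
         * Stable pages are only needed when data CRCs are computed, i.e.
         * when the "nocrc" option is not set: BDI_CAP_STABLE_WRITES makes
         * the VM wait for writeback instead of letting a page change in
         * flight, so the CRC matches what actually hits the wire. */
        if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
                q->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES;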
  2. 24 October 2015, 2 commits
    • rbd: prevent kernel stack blow up on rbd map · 6d69bb53
      Ilya Dryomov authored
      Mapping an image with a long parent chain (e.g. image foo, whose parent
      is bar, whose parent is baz, etc) currently leads to a kernel stack
      overflow, due to the following recursion in the reply path:
      
        rbd_osd_req_callback()
          rbd_obj_request_complete()
            rbd_img_obj_callback()
              rbd_img_parent_read_callback()
                rbd_obj_request_complete()
                  ...
      
      Limit the parent chain to 16 images, which is ~5K worth of stack.  When
      the above recursion is eliminated, this limit can be lifted.
      
      Fixes: http://tracker.ceph.com/issues/12538
      
      Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Josh Durgin <jdurgin@redhat.com>
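      A hedged sketch of the guard (assumed shape of the probe path, not
      the verbatim patch):

        #define RBD_MAX_PARENT_CHAIN_LEN  16

        /* depth is threaded through the image probe path; each level of
         * the parent chain bumps it before probing the next parent */
        static int rbd_dev_probe_parent(struct rbd_device *rbd_dev, int depth)
        {
                if (!rbd_dev->parent_spec)
                        return 0;
                if (++depth > RBD_MAX_PARENT_CHAIN_LEN) {
                        pr_info("parent chain is too long (%d)\n", depth);
                        return -EINVAL;
                }
                /* ... create the parent rbd_device and probe it, passing
                 * depth along so grandparents are counted too ... */
                return 0;
        }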
    • rbd: don't leak parent_spec in rbd_dev_probe_parent() · 1f2c6651
      Ilya Dryomov authored
      Currently we leak parent_spec and trigger a "parent reference
      underflow" warning if rbd_dev_create() in rbd_dev_probe_parent()
      fails. The problem is that we take the !parent out_err branch,
      which only drops refcounts; the parent_spec that would have been
      freed had we called rbd_dev_unparent() remains, and triggers
      rbd_warn() in rbd_dev_parent_put(): at that point we have
      parent_spec != NULL and parent_ref == 0, so the counter ends up
      at -1 after the decrement.
      
      Redo rbd_dev_probe_parent() to fix this.
      
      Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
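      The crux of the rework, as a hedged sketch (only the unwind path is
      shown; the rest of the function is condensed):

        out_err:
                /* single unwind point: rbd_dev_unparent() drops the
                 * parent ref and frees parent_spec together, so the
                 * counter and the pointer can no longer diverge the way
                 * the old !parent branch allowed */
                rbd_dev_unparent(rbd_dev);
                if (parent)
                        rbd_dev_destroy(parent);
                return ret;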
  3. 23 October 2015, 6 commits
  4. 16 October 2015, 3 commits
    • rbd: use writefull op for object size writes · e30b7577
      Ilya Dryomov authored
      This covers only the simplest case - a write the size of a whole
      object - but it's still useful in tiering setups where EC is used
      for the base tier, as a writefull op can be proxied, saving an
      object promotion.
      
      Even though updating ceph_osdc_new_request() to allow writefull
      should just be a matter of fixing an assert, I didn't do it because
      its only user is cephfs. All other call sites were updated.
      
      Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
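      A hedged sketch of the op selection (not the verbatim hunk):

        /* a write covering a whole object can replace its contents
         * outright, so a cache tier can proxy it to an EC base tier
         * without first promoting the object */
        opcode = (offset == 0 && length == rbd_obj_bytes(&rbd_dev->header)) ?
                        CEPH_OSD_OP_WRITEFULL : CEPH_OSD_OP_WRITE;
        osd_req_op_extent_init(osd_req, 0, opcode, offset, length, 0, 0);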
    • rbd: set max_sectors explicitly · 0d9fde4f
      Ilya Dryomov authored
      Commit 30e2bc08 ("Revert "block: remove artifical max_hw_sectors
      cap"") restored a clamp on max_sectors. It's now 2560 sectors
      instead of 1024, but that's still not good enough: we set
      max_hw_sectors to the rbd object size because we don't want
      object-sized I/Os to be split, and the default object size is 4M.
      
      So, set max_sectors to max_hw_sectors in rbd at queue init time.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
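      The change itself is tiny; roughly (assuming rbd_init_disk()'s
      local segment_size, which holds the object size):

        blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
        q->limits.max_sectors = queue_max_hw_sectors(q);

      Without the second line the block core clamps max_sectors to its
      2560-sector default and object-sized 4M I/Os get split.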
    • NVMe: Fix memory leak on retried commands · 0dfc70c3
      Keith Busch authored
      Resources are reallocated for requeued commands, so unmap and
      release the iod for the failed command.
      
      It's a pretty bad memory leak, and it causes a kernel hang if you
      remove a drive, because of a busy dma pool. You'll get messages
      spewing like this:
      
        nvme 0000:xx:xx.x: dma_pool_destroy prp list 256, ffff880420dec000 busy
      
      and it locks up PCI and the driver, since removal never completes
      while a lock is held.
      
      Cc: stable@vger.kernel.org # 4.0.x-
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
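      A hedged sketch of the release that the requeue path was skipping
      (4.3-era nvme-core.c shapes; surrounding control flow condensed):

        /* a requeued command is given a fresh iod on resubmission, so
         * the old one must be unmapped and released either way --
         * returning early here is what left PRP-list entries behind
         * and kept the dma_pool busy at removal time */
        if (iod->nents)
                dma_unmap_sg(nvmeq->dev->dev, iod->sg, iod->nents,
                             rq_data_dir(req) ?
                                     DMA_TO_DEVICE : DMA_FROM_DEVICE);
        nvme_free_iod(nvmeq->dev, iod);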
  5. 15 October 2015, 1 commit
  6. 13 October 2015, 1 commit
    • nvme: fix 32-bit build warning · 835da3f9
      Arnd Bergmann authored
      Compiling the nvme driver on 32-bit warns about a cast from a __u64
      variable to a pointer:
      
      drivers/block/nvme-core.c: In function 'nvme_submit_io':
      drivers/block/nvme-core.c:1847:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
          (void __user *)io.addr, length, NULL, 0);
      
      The cast here is intentional and safe, so we can shut up the
      gcc warning by adding an intermediate cast to 'uintptr_t'.
      
      I had previously submitted a patch to fix this problem in the
      nvme driver, but it was accepted on the same day that two new
      warnings got added.
      
      For consistency, I also change the third instance of this cast
      to use uintptr_t instead of unsigned long.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: d29ec824 ("nvme: submit internal commands through the block layer")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
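      The pattern as a minimal stand-alone demo (userspace stand-in;
      io_addr plays the role of io.addr from struct nvme_user_io):

        #include <stdint.h>

        static void *u64_to_ptr(uint64_t io_addr)
        {
                /* (void *)io_addr alone warns on 32-bit because the
                 * integer is wider than the pointer; the intermediate
                 * uintptr_t matches the pointer width on both 32- and
                 * 64-bit builds, making the narrowing explicit */
                return (void *)(uintptr_t)io_addr;
        }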
  7. 10 October 2015, 11 commits
  8. 09 October 2015, 1 commit
  9. 08 October 2015, 1 commit
  10. 01 October 2015, 1 commit
  11. 24 September 2015, 7 commits
    • NVMe: Set affinity after allocating request queues · bda4e0fb
      Keith Busch authored
      The asynchronous namespace scanning caused affinity hints to be set
      before the tag set was initialized, so there was no cpu mask to set
      the hint with. This patch moves the affinity hint setting to after
      namespaces are scanned.
      Reported-by: 김경산 <ks0204.kim@samsung.com>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
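      A hedged sketch of the hint-setting helper this reordering makes
      effective (assumed 4.3-era structures, not the verbatim patch):

        static void nvme_set_irq_hints(struct nvme_dev *dev)
        {
                int i;

                for (i = 0; i < dev->online_queues; i++) {
                        struct nvme_queue *nvmeq = dev->queues[i];

                        /* before the tagset is initialized there is no
                         * cpumask here -- the case the bug used to hit */
                        if (!nvmeq->tags || !*nvmeq->tags)
                                continue;
                        irq_set_affinity_hint(
                                dev->entry[nvmeq->cq_vector].vector,
                                blk_mq_tags_cpumask(*nvmeq->tags));
                }
        }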
    • block: loop: support DIO & AIO · bc07c10a
      Ming Lei authored
      There are at least 3 advantages to using direct I/O and AIO on the
      read/write path of loop's backing file:
      
      1) double caching can be avoided, so memory usage decreases a lot
      
      2) unlike user-space direct I/O, there is no cost for pinning pages
      
      3) context switches are avoided while still obtaining good throughput
      - in buffered file reads, top random-I/O throughput is often
      obtained only if requests are submitted concurrently from lots of
      tasks; but for sequential I/O, most of the time requests can be
      served from the page cache, so concurrent submission often
      introduces unnecessary context switches without improving
      throughput much. There has been discussion[1] of using non-blocking
      I/O to improve this for applications.
      - with direct I/O and AIO, concurrent submission can be avoided
      while random read throughput is unaffected
      
      xfstests (-g auto, ext4) basically passes when running with direct
      I/O (aio); the one exception is generic/232, but it fails with loop
      buffered I/O (4.2-rc6-next-20150814) too.
      
      The fio test results for performance follow:
      	4-job fio test inside an ext4 file system over a loop block device
      
      1) How to run
      	- KVM: 4 VCPUs, 2G RAM
      	- linux kernel: 4.2-rc6-next-20150814 (base) with the patchset
      	- the loop block device is backed by one image on an SSD
      	- linux psync, 4 jobs, size 1500M, ext4 over the loop block device
      	- test result: IOPS from fio output
      
      2) Throughput (IOPS) becomes a bit better with direct I/O (aio)
              -------------------------------------------------------------
              test cases          |randread   |read   |randwrite  |write  |
              -------------------------------------------------------------
              base                |8015       |113811 |67442      |106978 |
              -------------------------------------------------------------
              base+loop aio       |8136       |125040 |67811      |111376 |
              -------------------------------------------------------------
      
      - presumably this is caused by more page cache being available to
      the application, or by avoiding one extra page copy in the direct
      I/O case
      3) context switches
              - context switches decreased by ~50% with loop direct I/O (aio)
                compared with loop buffered I/O (4.2-rc6-next-20150814)
      
      4) memory usage from /proc/meminfo
              -------------------------------------------------------------
                                         | Buffers       | Cached
              -------------------------------------------------------------
              base                       | > 760MB       | ~950MB
              -------------------------------------------------------------
              base+loop direct I/O(aio)  | < 5MB         | ~1.6GB
              -------------------------------------------------------------
      
      - so there is much more page cache available for applications with
      direct I/O
      
      [1] https://lwn.net/Articles/612483/
      
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
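      A hedged sketch of the submission path (close to the shape of the
      patch, but condensed; error propagation is simplified):

        static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
        {
                struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);

                cmd->ret = ret;
                blk_mq_complete_request(cmd->rq, 0);
        }

        static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
                             loff_t pos, bool rw)
        {
                struct file *file = lo->lo_backing_file;
                struct bio *bio = cmd->rq->bio;
                struct iov_iter iter;
                int ret;

                /* wrap the request's pages directly: no pinning, no copy */
                iov_iter_bvec(&iter, ITER_BVEC | rw,
                              __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter),
                              bio_segments(bio), blk_rq_bytes(cmd->rq));

                cmd->iocb.ki_pos = pos;
                cmd->iocb.ki_filp = file;
                cmd->iocb.ki_complete = lo_rw_aio_complete;
                cmd->iocb.ki_flags = IOCB_DIRECT;

                ret = (rw == WRITE) ? file->f_op->write_iter(&cmd->iocb, &iter)
                                    : file->f_op->read_iter(&cmd->iocb, &iter);
                if (ret != -EIOCBQUEUED)        /* completed synchronously */
                        cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
                return 0;
        }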
    • block: loop: introduce ioctl command of LOOP_SET_DIRECT_IO · ab1cb278
      Ming Lei authored
      If a loop device is set up via 'mount -o loop', it isn't easy to
      pass in a file descriptor opened with O_DIRECT, so this patch
      introduces a new ioctl command to support direct I/O in that case.
      
      Cc: linux-api@vger.kernel.org
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
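      A userspace sketch of the new command (the device path is
      illustrative):

        #include <fcntl.h>
        #include <sys/ioctl.h>
        #include <unistd.h>
        #include <linux/loop.h>

        int main(void)
        {
                int fd = open("/dev/loop0", O_RDWR);

                if (fd < 0)
                        return 1;
                /* non-zero enables direct I/O against the backing file,
                 * 0 switches back to buffered I/O */
                if (ioctl(fd, LOOP_SET_DIRECT_IO, 1UL) < 0) {
                        close(fd);
                        return 1;
                }
                close(fd);
                return 0;
        }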
    • block: loop: prepare for supporting direct IO · 2e5ab5f3
      Ming Lei authored
      This patch provides an interface for enabling direct I/O from user
      space:
      
      	- userspace (such as losetup) can pass a 'file' that was
      	opened with, or switched via fcntl() to, O_DIRECT
      
      Also, __loop_update_dio() is introduced to check whether direct I/O
      can be used with the current loop settings.
      
      The last big change is to introduce the LO_FLAGS_DIRECT_IO flag,
      so that userspace can tell whether direct I/O is being used to
      access the backing file.
      
      Cc: linux-api@vger.kernel.org
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
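      A hedged sketch of the gating logic (assumed shape, not the
      verbatim helper):

        static void __loop_update_dio(struct loop_device *lo, bool dio)
        {
                struct file *file = lo->lo_backing_file;
                struct inode *inode = file->f_mapping->host;
                unsigned short bsize = 512;
                bool use_dio;

                if (inode->i_sb->s_bdev)
                        bsize = bdev_logical_block_size(inode->i_sb->s_bdev);

                /* direct I/O needs iter ops on the backing file and a
                 * loop offset aligned to the backing logical block size */
                use_dio = dio && file->f_op->read_iter &&
                          file->f_op->write_iter &&
                          !(lo->lo_offset & (bsize - 1));

                if (use_dio)
                        lo->lo_flags |= LO_FLAGS_DIRECT_IO;
                else
                        lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
                lo->use_dio = use_dio;
        }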
    • block: loop: use kthread_work · e03a3d7a
      Ming Lei authored
      The following patch will use dio/aio to submit I/O to the backing
      file; it then no longer needs to schedule I/O concurrently from
      work items, so use kthread_work to cut the context-switch cost
      a lot.
      
      For the non-AIO case, a single thread had been used for a long,
      long time; it was only converted to a workqueue in v4.0, which has
      already caused a performance regression for Fedora live booting.
      In the discussion[1], even though submitting I/O concurrently via
      work items can improve random read I/O throughput, it may hurt
      sequential read I/O performance at the same time, so it is better
      to restore the single-thread behaviour.
      
      For the upcoming AIO support, should loop ever face really high
      performance requirements, multiple hw queues with a per-hwq kthread
      would be a better fit than the current work approach.
      
      [1] http://marc.info/?t=143082678400002&r=1&w=2
      
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
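      A hedged sketch of the conversion (4.3-era kthread_work API; these
      helpers were later renamed kthread_init_worker() etc.):

        /* one dedicated thread per loop device drains the work list,
         * restoring the old single-threaded submission order */
        init_kthread_worker(&lo->worker);
        lo->worker_task = kthread_run(kthread_worker_fn, &lo->worker,
                                      "loop%d", lo->lo_number);
        if (IS_ERR(lo->worker_task))
                return -ENOMEM;

        /* per-command work item, queued from the blk-mq ->queue_rq hook */
        init_kthread_work(&cmd->work, loop_queue_work);
        queue_kthread_work(&lo->worker, &cmd->work);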
    • block: loop: set QUEUE_FLAG_NOMERGES for request queue of loop · 5b5e20f4
      Ming Lei authored
      It doesn't make sense to enable merging, because the I/O submitted
      to the backing file is handled page by page.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
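      The change is essentially one line at queue setup time (hedged):

        /* requests are processed page by page anyway, so merge attempts
         * only burn CPU in the block core */
        queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);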
    • xen/blkback: free requests on disconnection · f929d42c
      Roger Pau Monné authored
      This is due to commit 86839c56
      "xen/block: add multi-page ring support".
      
      When running a guest under UEFI, the following warning comes from
      blkback after the domain is destroyed.
      
      ------------[ cut here ]------------
      WARNING: CPU: 2 PID: 95 at
      /home/julien/works/linux/drivers/block/xen-blkback/xenbus.c:274
      xen_blkif_deferred_free+0x1f4/0x1f8()
      Modules linked in:
      CPU: 2 PID: 95 Comm: kworker/2:1 Tainted: G        W       4.2.0 #85
      Hardware name: APM X-Gene Mustang board (DT)
      Workqueue: events xen_blkif_deferred_free
      Call trace:
      [<ffff8000000890a8>] dump_backtrace+0x0/0x124
      [<ffff8000000891dc>] show_stack+0x10/0x1c
      [<ffff8000007653bc>] dump_stack+0x78/0x98
      [<ffff800000097e88>] warn_slowpath_common+0x9c/0xd4
      [<ffff800000097f80>] warn_slowpath_null+0x14/0x20
      [<ffff800000557a0c>] xen_blkif_deferred_free+0x1f0/0x1f8
      [<ffff8000000ad020>] process_one_work+0x160/0x3b4
      [<ffff8000000ad3b4>] worker_thread+0x140/0x494
      [<ffff8000000b2e34>] kthread+0xd8/0xf0
      ---[ end trace 6f859b7883c88cdd ]---
      
      Request allocation has been moved to connect_ring, which is called
      every time blkback connects to the frontend (this can happen
      multiple times during a blkback instance's life cycle). Request
      freeing, on the other hand, has not been moved, so it is only done
      when the backend instance is destroyed. Due to this mismatch,
      blkback can allocate the request pool multiple times without ever
      freeing it.
      
      In order to fix this, move the freeing of requests to
      xen_blkif_disconnect to restore the symmetry between request
      allocation and freeing.
      Reported-by: Julien Grall <julien.grall@citrix.com>
      Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
      Tested-by: Julien Grall <julien.grall@citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: xen-devel@lists.xenproject.org
      Cc: stable@vger.kernel.org # 4.2
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
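      A hedged sketch of the relocated teardown (condensed; names follow
      the 4.2-era xen-blkback structures):

        static int xen_blkif_disconnect(struct xen_blkif *blkif)
        {
                struct pending_req *req, *n;
                int i = 0, j;

                /* ... existing kthread/irq/ring teardown ... */

                /* the pool connect_ring allocates is now freed here, on
                 * every disconnect, not only at final backend destruction */
                list_for_each_entry_safe(req, n, &blkif->pending_free,
                                         free_list) {
                        list_del(&req->free_list);
                        for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++)
                                kfree(req->segments[j]);
                        for (j = 0; j < MAX_INDIRECT_PAGES; j++)
                                kfree(req->indirect_pages[j]);
                        kfree(req);
                        i++;
                }
                WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));
                blkif->nr_ring_pages = 0;
                return 0;
        }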
  12. 18 September 2015, 1 commit
  13. 09 September 2015, 4 commits
    • zram: unify error reporting · 70864969
      Sergey Senozhatsky authored
      Make zram's syslog error reporting more consistent; we currently
      have somewhat arbitrary error levels in some places. For example,
      critical errors like
        "Error allocating memory for compressed page"
      and
        "Unable to allocate temp memory"
      are reported as KERN_INFO messages.
      
      a) Reassign error levels
      
      Error messages that directly affect zram
      functionality -- pr_err():
      
       Error allocating zram address table
       Error creating memory pool
       Decompression failed! err=%d, page=%u
       Unable to allocate temp memory
       Compression failed! err=%d
       Error allocating memory for compressed page: %u, size=%zu
       Cannot initialise %s compressing backend
       Error allocating disk queue for device %d
       Error allocating disk structure for device %d
       Error creating sysfs group for device %d
       Unable to register zram-control class
       Unable to get major number
      
      Messages that do not affect functionality, but user
      must be warned (because sysfs attrs will be removed in
      this particular case) -- pr_warn():
      
       %d (%s) Attribute %s (and others) will be removed. %s
      
      Messages that do not affect functionality and mostly are
      informative -- pr_info():
      
       Cannot change max compression streams
       Can't change algorithm for initialized device
       Cannot change disksize for initialized device
       Added device: %s
       Removed device: %s
      
      b) Update sysfs_create_group() error message
      
      First, it lacks a trailing new line; add it.  Second, every error message
      in zram_add() has a "for device %d" part, which makes errors more
      informative.  Add missing part to "Error creating sysfs group" message.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
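      The resulting convention, illustrated with fragments (the variables
      are stand-ins for whatever is in scope at each call site):

        /* request processing fails: error */
        pr_err("Error allocating memory for compressed page: %u, size=%zu\n",
               index, clen);
        /* degraded but still functional (sysfs attrs go away): warning */
        pr_warn("Attribute %s (and others) will be removed. %s\n",
                attr_name, help);
        /* purely informative */
        pr_info("Added device: %s\n", zram->disk->disk_name);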
    • zsmalloc: account the number of compacted pages · 860c707d
      Sergey Senozhatsky authored
      Compaction returns to zram the number of migrated objects, which is
      quite uninformative -- we have objects of different sizes, so user
      space cannot obtain any valuable data from that number. Change
      compaction to operate in terms of pages and return to the
      compaction issuer the number of pages that were freed during
      compaction. So from now on we will export a more meaningful value
      in zram<id>/mm_stat -- the number of freed (compacted) pages.
      
      This requires:
       (a) a rename of `num_migrated' to 'pages_compacted'
       (b) an internal API change -- return first_page's fullness_group
           from putback_zspage(), so we know when putback_zspage() did
           free_zspage(). It helps us account compaction stats correctly.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
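      A hedged sketch of the accounting hunk (assumed shape):

        /* putback_zspage() now returns the fullness group, so compaction
         * can tell when the source zspage went empty and was freed, and
         * credit its pages to the pool-wide counter shown in mm_stat */
        if (putback_zspage(pool, class, src_page) == ZS_EMPTY)
                pool->stats.pages_compacted += class->pages_per_zspage;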
    • zsmalloc/zram: introduce zs_pool_stats api · 7d3f3938
      Sergey Senozhatsky authored
      `zs_compact_control' accounts the number of migrated objects, but
      it has a limited lifespan -- we lose it as soon as zs_compaction()
      returns to zram. That worked fine, because (a) zram had its own
      counter of migrated objects and (b) only zram could trigger
      compaction. However, this does not work for automatic pool
      compaction (not issued by zram). To account for objects migrated
      during auto-compaction (issued by the shrinker), we need to store
      this number in zs_pool.
      
      Define a new `struct zs_pool_stats' structure to keep zs_pool's stats
      there.  It provides only `num_migrated', as of this writing, but it
      surely can be extended.
      
      A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to
      caller.
      
      Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
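      The new interface's shape (hedged, condensed):

        struct zs_pool_stats {
                /* number of objects migrated so far; the follow-up
                 * accounting change above renames this to pages_compacted */
                unsigned long num_migrated;
        };

        /* snapshot the pool's stats into caller-provided storage */
        void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);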
    • rbd: plug rbd_dev->header.object_prefix memory leak · d194cd1d
      Ilya Dryomov authored
      We need to free object_prefix when rbd_dev_v2_snap_context() fails,
      but only if this is the first time we are reading in the header.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
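      A hedged sketch of the fixed error path in the v2 header read
      (assumed shape):

        ret = rbd_dev_v2_snap_context(rbd_dev);
        if (ret && first_time) {
                /* object_prefix was read in by this call, so on failure
                 * it is ours to free; on later refreshes it must be kept */
                kfree(rbd_dev->header.object_prefix);
                rbd_dev->header.object_prefix = NULL;
        }
        return ret;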