1. 01 10月, 2015 1 次提交
  2. 24 9月, 2015 2 次提交
    • K
      NVMe: Set affinity after allocating request queues · bda4e0fb
      Keith Busch 提交于
      The asynchronous namespace scanning caused affinity hints to be set before
      its tagset initialized, so there was no cpu mask to set the hint. This
      patch moves the affinity hint setting to after namespaces are scanned.
      Reported-by: N김경산 <ks0204.kim@samsung.com>
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      bda4e0fb
    • R
      xen/blkback: free requests on disconnection · f929d42c
      Roger Pau Monne 提交于
      This is due to  commit 86839c56
      "xen/block: add multi-page ring support"
      
      When using an guest under UEFI - after the domain is destroyed
      the following warning comes from blkback.
      
      ------------[ cut here ]------------
      WARNING: CPU: 2 PID: 95 at
      /home/julien/works/linux/drivers/block/xen-blkback/xenbus.c:274
      xen_blkif_deferred_free+0x1f4/0x1f8()
      Modules linked in:
      CPU: 2 PID: 95 Comm: kworker/2:1 Tainted: G        W       4.2.0 #85
      Hardware name: APM X-Gene Mustang board (DT)
      Workqueue: events xen_blkif_deferred_free
      Call trace:
      [<ffff8000000890a8>] dump_backtrace+0x0/0x124
      [<ffff8000000891dc>] show_stack+0x10/0x1c
      [<ffff8000007653bc>] dump_stack+0x78/0x98
      [<ffff800000097e88>] warn_slowpath_common+0x9c/0xd4
      [<ffff800000097f80>] warn_slowpath_null+0x14/0x20
      [<ffff800000557a0c>] xen_blkif_deferred_free+0x1f0/0x1f8
      [<ffff8000000ad020>] process_one_work+0x160/0x3b4
      [<ffff8000000ad3b4>] worker_thread+0x140/0x494
      [<ffff8000000b2e34>] kthread+0xd8/0xf0
      ---[ end trace 6f859b7883c88cdd ]---
      
      Request allocation has been moved to connect_ring, which is called every
      time blkback connects to the frontend (this can happen multiple times during
      a blkback instance life cycle). On the other hand, request freeing has not
      been moved, so it's only called when destroying the backend instance. Due to
      this mismatch, blkback can allocate the request pool multiple times, without
      freeing it.
      
      In order to fix it, move the freeing of requests to xen_blkif_disconnect to
      restore the symmetry between request allocation and freeing.
      Reported-by: NJulien Grall <julien.grall@citrix.com>
      Signed-off-by: NRoger Pau Monné <roger.pau@citrix.com>
      Tested-by: NJulien Grall <julien.grall@citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: xen-devel@lists.xenproject.org
      CC: stable@vger.kernel.org # 4.2
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      f929d42c
  3. 18 9月, 2015 1 次提交
  4. 09 9月, 2015 6 次提交
    • S
      zram: unify error reporting · 70864969
      Sergey Senozhatsky 提交于
      Make zram syslog error reporting more consistent. We have random
      error levels in some places. For example, critical errors like
        "Error allocating memory for compressed page"
      and
        "Unable to allocate temp memory"
      are reported as KERN_INFO messages.
      
      a) Reassign error levels
      
      Error messages that directly affect zram
      functionality -- pr_err():
      
       Error allocating zram address table
       Error creating memory pool
       Decompression failed! err=%d, page=%u
       Unable to allocate temp memory
       Compression failed! err=%d
       Error allocating memory for compressed page: %u, size=%zu
       Cannot initialise %s compressing backend
       Error allocating disk queue for device %d
       Error allocating disk structure for device %d
       Error creating sysfs group for device %d
       Unable to register zram-control class
       Unable to get major number
      
      Messages that do not affect functionality, but user
      must be warned (because sysfs attrs will be removed in
      this particular case) -- pr_warn():
      
       %d (%s) Attribute %s (and others) will be removed. %s
      
      Messages that do not affect functionality and mostly are
      informative -- pr_info():
      
       Cannot change max compression streams
       Can't change algorithm for initialized device
       Cannot change disksize for initialized device
       Added device: %s
       Removed device: %s
      
      b) Update sysfs_create_group() error message
      
      First, it lacks a trailing new line; add it.  Second, every error message
      in zram_add() has a "for device %d" part, which makes errors more
      informative.  Add missing part to "Error creating sysfs group" message.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70864969
    • S
      zsmalloc: account the number of compacted pages · 860c707d
      Sergey Senozhatsky 提交于
      Compaction returns back to zram the number of migrated objects, which is
      quite uninformative -- we have objects of different sizes so user space
      cannot obtain any valuable data from that number.  Change compaction to
      operate in terms of pages and return back to compaction issuer the
      number of pages that were freed during compaction.  So from now on we
      will export more meaningful value in zram<id>/mm_stat -- the number of
      freed (compacted) pages.
      
      This requires:
       (a) a rename of `num_migrated' to 'pages_compacted'
       (b) a internal API change -- return first_page's fullness_group from
           putback_zspage(), so we know when putback_zspage() did
           free_zspage().  It helps us to account compaction stats correctly.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      860c707d
    • S
      zsmalloc/zram: introduce zs_pool_stats api · 7d3f3938
      Sergey Senozhatsky 提交于
      `zs_compact_control' accounts the number of migrated objects but it has
      a limited lifespan -- we lose it as soon as zs_compaction() returns back
      to zram.  It worked fine, because (a) zram had it's own counter of
      migrated objects and (b) only zram could trigger compaction.  However,
      this does not work for automatic pool compaction (not issued by zram).
      To account objects migrated during auto-compaction (issued by the
      shrinker) we need to store this number in zs_pool.
      
      Define a new `struct zs_pool_stats' structure to keep zs_pool's stats
      there.  It provides only `num_migrated', as of this writing, but it
      surely can be extended.
      
      A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to
      caller.
      
      Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Suggested-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d3f3938
    • I
      rbd: plug rbd_dev->header.object_prefix memory leak · d194cd1d
      Ilya Dryomov 提交于
      Need to free object_prefix when rbd_dev_v2_snap_context() fails, but
      only if this is the first time we are reading in the header.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      d194cd1d
    • I
      rbd: fix double free on rbd_dev->header_name · 3ebe138a
      Ilya Dryomov 提交于
      If rbd_dev_image_probe() in rbd_dev_probe_parent() fails, header_name
      is freed twice: once in rbd_dev_probe_parent() and then in its caller
      rbd_dev_image_probe() (rbd_dev_image_probe() is called recursively to
      handle parent images).
      
      rbd_dev_probe_parent() is responsible for probing the parent, so it
      shouldn't muck with clone's fields.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      3ebe138a
    • J
      xen: Use correctly the Xen memory terminologies · 0df4f266
      Julien Grall 提交于
      Based on include/xen/mm.h [1], Linux is mistakenly using MFN when GFN
      is meant, I suspect this is because the first support for Xen was for
      PV. This resulted in some misimplementation of helpers on ARM and
      confused developers about the expected behavior.
      
      For instance, with pfn_to_mfn, we expect to get an MFN based on the name.
      Although, if we look at the implementation on x86, it's returning a GFN.
      
      For clarity and avoid new confusion, replace any reference to mfn with
      gfn in any helpers used by PV drivers. The x86 code will still keep some
      reference of pfn_to_mfn which may be used by all kind of guests
      No changes as been made in the hypercall field, even
      though they may be invalid, in order to keep the same as the defintion
      in xen repo.
      
      Note that page_to_mfn has been renamed to xen_page_to_gfn to avoid a
      name to close to the KVM function gfn_to_page.
      
      Take also the opportunity to simplify simple construction such
      as pfn_to_mfn(page_to_pfn(page)) into xen_page_to_gfn. More complex clean up
      will come in follow-up patches.
      
      [1] http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=e758ed14f390342513405dd766e874934573e6cbSigned-off-by: NJulien Grall <julien.grall@citrix.com>
      Reviewed-by: NStefano Stabellini <stefano.stabellini@eu.citrix.com>
      Acked-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      Acked-by: NWei Liu <wei.liu2@citrix.com>
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      0df4f266
  5. 08 9月, 2015 2 次提交
  6. 03 9月, 2015 2 次提交
  7. 28 8月, 2015 1 次提交
  8. 26 8月, 2015 2 次提交
    • A
      NVMe: Using PRACT bit to generate and verify PI by controller · e19b127f
      Alok Pandey 提交于
      This patch enables the PRCHK and reftag support when PRACT bit is set, and
      block layer integrity is disabled.
      Signed-off-by: NAlok Pandey <pandey.alok@samsung.com>
      Reviewed-by: NKeith Busch <keith.busch@intel.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e19b127f
    • J
      mtip32x: fix regression introduced by blk-mq per-hctx flush · 74c9c913
      Jeff Moyer 提交于
      Hi,
      
      After commit f70ced09 (blk-mq: support per-distpatch_queue flush
      machinery), the mtip32xx driver may oops upon module load due to walking
      off the end of an array in mtip_init_cmd.  On initialization of the
      flush_rq, init_request is called with request_index >= the maximum queue
      depth the driver supports.  For mtip32xx, this value is used to index
      into an array.  What this means is that the driver will walk off the end
      of the array, and either oops or cause random memory corruption.
      
      The problem is easily reproduced by doing modprobe/rmmod of the mtip32xx
      driver in a loop.  I can typically reproduce the problem in about 30
      seconds.
      
      Now, in the case of mtip32xx, it actually doesn't support flush/fua, so
      I think we can simply return without doing anything.  In addition, no
      other mq-enabled driver does anything with the request_index passed into
      init_request(), so no other driver is affected.  However, I'm not really
      sure what is expected of drivers.  Ming, what did you envision drivers
      would do when initializing the flush requests?
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      74c9c913
  9. 21 8月, 2015 1 次提交
  10. 20 8月, 2015 3 次提交
  11. 19 8月, 2015 4 次提交
  12. 18 8月, 2015 1 次提交
    • K
      NVMe: Set queue max segments · e824410f
      Keith Busch 提交于
      This sets the queue's max segment size to match the device's
      capabilities. The default of 128 is usable until a device's transfer
      capability exceeds 512k, assuming a device page size of 4k. Many nvme
      devices exceed that transfer limit, so this lets the block layer know what
      kind of commands it to allow to form rather than unnecessarily split them.
      
      One additional segment is added to account for a transfer that may start
      in the middle of a page.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e824410f
  13. 17 8月, 2015 10 次提交
  14. 15 8月, 2015 1 次提交
    • S
      zram: fix pool name truncation · 4ce321f5
      Sergey Senozhatsky 提交于
      zram_meta_alloc() constructs a pool name for zs_create_pool() call as
      
          snprintf(pool_name, sizeof(pool_name), "zram%d", device_id);
      
      However, it defines pool name buffer to be only 8 bytes long (minus
      trailing zero), which means that we can have only 1000 pool names: zram0
      -- zram999.
      
      With CONFIG_ZSMALLOC_STAT enabled an attempt to create a device zram1000
      can fail if device zram100 already exists, because snprintf() will
      truncate new pool name to zram100 and pass it debugfs_create_dir(),
      causing:
      
        debugfs dir <zram100> creation failed
        zram: Error creating memory pool
      
      ... and so on.
      
      Fix it by passing zram->disk->disk_name to zram_meta_alloc() instead of
      divice_id.  We construct zram%d name earlier and keep it as a ->disk_name,
      no need to snprintf() it again.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4ce321f5
  15. 14 8月, 2015 2 次提交
    • K
      block: kill merge_bvec_fn() completely · 8ae12666
      Kent Overstreet 提交于
      As generic_make_request() is now able to handle arbitrarily sized bios,
      it's no longer necessary for each individual block driver to define its
      own ->merge_bvec_fn() callback. Remove every invocation completely.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: ceph-devel@vger.kernel.org
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
       dm-era-target, and resolve merge conflicts]
      Signed-off-by: NDongsu Park <dpark@posteo.net>
      Signed-off-by: NMing Lin <ming.l@ssi.samsung.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8ae12666
    • K
      block: make generic_make_request handle arbitrarily sized bios · 54efd50b
      Kent Overstreet 提交于
      The way the block layer is currently written, it goes to great lengths
      to avoid having to split bios; upper layer code (such as bio_add_page())
      checks what the underlying device can handle and tries to always create
      bios that don't need to be split.
      
      But this approach becomes unwieldy and eventually breaks down with
      stacked devices and devices with dynamic limits, and it adds a lot of
      complexity. If the block layer could split bios as needed, we could
      eliminate a lot of complexity elsewhere - particularly in stacked
      drivers. Code that creates bios can then create whatever size bios are
      convenient, and more importantly stacked drivers don't have to deal with
      both their own bio size limitations and the limitations of the
      (potentially multiple) devices underneath them.  In the future this will
      let us delete merge_bvec_fn and a bunch of other code.
      
      We do this by adding calls to blk_queue_split() to the various
      make_request functions that need it - a few can already handle arbitrary
      size bios. Note that we add the call _after_ any call to
      blk_queue_bounce(); this means that blk_queue_split() and
      blk_recalc_rq_segments() don't need to be concerned with bouncing
      affecting segment merging.
      
      Some make_request_fn() callbacks were simple enough to audit and verify
      they don't need blk_queue_split() calls. The skipped ones are:
      
       * nfhd_make_request (arch/m68k/emu/nfblock.c)
       * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
       * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
       * brd_make_request (ramdisk - drivers/block/brd.c)
       * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
       * loop_make_request
       * null_queue_bio
       * bcache's make_request fns
      
      Some others are almost certainly safe to remove now, but will be left
      for future patches.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Jim Paris <jim@jtan.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      [dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
      Signed-off-by: NDongsu Park <dpark@posteo.net>
      Signed-off-by: NMing Lin <ming.l@ssi.samsung.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      54efd50b
  16. 31 7月, 2015 1 次提交
    • I
      rbd: fix copyup completion race · 2761713d
      Ilya Dryomov 提交于
      For write/discard obj_requests that involved a copyup method call, the
      opcode of the first op is CEPH_OSD_OP_CALL and the ->callback is
      rbd_img_obj_copyup_callback().  The latter frees copyup pages, sets
      ->xferred and delegates to rbd_img_obj_callback(), the "normal" image
      object callback, for reporting to block layer and putting refs.
      
      rbd_osd_req_callback() however treats CEPH_OSD_OP_CALL as a trivial op,
      which means obj_request is marked done in rbd_osd_trivial_callback(),
      *before* ->callback is invoked and rbd_img_obj_copyup_callback() has
      a chance to run.  Marking obj_request done essentially means giving
      rbd_img_obj_callback() a license to end it at any moment, so if another
      obj_request from the same img_request is being completed concurrently,
      rbd_img_obj_end_request() may very well be called on such prematurally
      marked done request:
      
      <obj_request-1/2 reply>
      handle_reply()
        rbd_osd_req_callback()
          rbd_osd_trivial_callback()
          rbd_obj_request_complete()
          rbd_img_obj_copyup_callback()
          rbd_img_obj_callback()
                                          <obj_request-2/2 reply>
                                          handle_reply()
                                            rbd_osd_req_callback()
                                              rbd_osd_trivial_callback()
            for_each_obj_request(obj_request->img_request) {
              rbd_img_obj_end_request(obj_request-1/2)
              rbd_img_obj_end_request(obj_request-2/2) <--
            }
      
      Calling rbd_img_obj_end_request() on such a request leads to trouble,
      in particular because its ->xfferred is 0.  We report 0 to the block
      layer with blk_update_request(), get back 1 for "this request has more
      data in flight" and then trip on
      
          rbd_assert(more ^ (which == img_request->obj_request_count));
      
      with rhs (which == ...) being 1 because rbd_img_obj_end_request() has
      been called for both requests and lhs (more) being 1 because we haven't
      got a chance to set ->xfferred in rbd_img_obj_copyup_callback() yet.
      
      To fix this, leverage that rbd wants to call class methods in only two
      cases: one is a generic method call wrapper (obj_request is standalone)
      and the other is a copyup (obj_request is part of an img_request).  So
      make a dedicated handler for CEPH_OSD_OP_CALL and directly invoke
      rbd_img_obj_copyup_callback() from it if obj_request is part of an
      img_request, similar to how CEPH_OSD_OP_READ handler invokes
      rbd_img_obj_request_read_callback().
      
      Since rbd_img_obj_copyup_callback() is now being called from the OSD
      request callback (only), it is renamed to rbd_osd_copyup_callback().
      
      Cc: Alex Elder <elder@linaro.org>
      Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 3.18
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NAlex Elder <elder@linaro.org>
      2761713d