1. 20 December 2017 (2 commits)
    • dm: optimize bio-based NVMe IO submission · 978e51ba
      Authored by Mike Snitzer
      Upper level bio-based drivers that stack immediately on top of NVMe can
      leverage direct_make_request().  In addition, DM's NVMe bio-based support
      will initially only ever have one NVMe device that it submits IO to at a
      time.  There is no splitting needed.  Enhance DM core so that
      DM_TYPE_NVME_BIO_BASED's IO submission takes advantage of both of these
      characteristics.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
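      A minimal sketch of the submission path this enables (illustrative only; my_dev,
      my_nvme_remap-style remapping and the field names are assumptions, not DM code --
      only direct_make_request() and bio_set_dev() are real block-layer interfaces of
      that era):
      ----
      /* Sketch: bio-based make_request_fn for a driver stacked directly on NVMe. */
      static blk_qc_t my_make_request(struct request_queue *q, struct bio *bio)
      {
              struct my_dev *md = q->queuedata;

              bio_set_dev(bio, md->nvme_bdev);        /* retarget to the one NVMe device */
              bio->bi_iter.bi_sector += md->start;    /* simple linear remap, no splitting */

              return direct_make_request(bio);        /* submit without extra queuing */
      }
      ----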
    • dm: introduce DM_TYPE_NVME_BIO_BASED · 22c11858
      Authored by Mike Snitzer
      If dm_table_determine_type() establishes DM_TYPE_NVME_BIO_BASED then
      all devices in the DM table do not support partial completions.  Also,
      the table has a single immutable target that doesn't require DM core to
      split bios.
      
      This will enable adding NVMe optimizations to bio-based DM.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  2. 18 December 2017 (1 commit)
    • dm: simplify start of block stats accounting for bio-based · f3986374
      Authored by Mike Snitzer
      There is no apparent need to call generic_start_io_acct() until just before
      the IO is ready for submission.  start_io_acct() is the proper place to do this
      accounting -- it is also where DM accounts for pending IO and, if
      enabled, starts dm-stats accounting.
      
      Replace start_io_acct()'s part_round_stats() with generic_start_io_acct().
      This eliminates needing to take part_stat_lock() multiple times when
      starting an IO on bio-based devices.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
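      The accounting call in question, roughly as made from start_io_acct() (a sketch;
      my_start_io_acct() is a hypothetical wrapper and the signature is the approximate
      4.15-era one):
      ----
      static void my_start_io_acct(struct mapped_device *md, struct bio *bio)
      {
              /* one part_stat_lock round trip, taken where DM also accounts pending IO */
              generic_start_io_acct(md->queue, bio_data_dir(bio), bio_sectors(bio),
                                    &dm_disk(md)->part0);
      }
      ----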
  3. 17 December 2017 (5 commits)
  4. 14 December 2017 (8 commits)
    • dm: set QUEUE_FLAG_DAX accordingly in dm_table_set_restrictions() · ad3793fc
      Authored by Mike Snitzer
      Set QUEUE_FLAG_DAX in dm_table_set_restrictions(), rather than having DAX
      support be a special case that is set based on table type in
      dm_setup_md_queue().
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: fix __send_changing_extent_only() to send first bio and chain remainder · 3d7f4562
      Authored by Mike Snitzer
      __send_changing_extent_only() must follow the same pattern that was
      established with commit "dm: ensure bio submission follows a depth-first
      tree walk".  That is: submit the first bio up to the split boundary and
      then chain the remainder for further submission.
      Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: ensure bio-based DM's bioset and io_pool support targets' maximum IOs · 0776aa0e
      Authored by Mike Snitzer
      alloc_multiple_bios() assumes it can allocate the requested number of
      bios, but until now there was no guarantee that the mempools would be
      accommodating.
      Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: remove BIOSET_NEED_RESCUER based dm_offload infrastructure · 4a3f54d9
      Authored by Mike Snitzer
      Now that all of DM has been revised and/or verified to no longer require
      the use of BIOSET_NEED_RESCUER, the dm_offload code may be removed.
      Suggested-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: safely allocate multiple bioset bios · 318716dd
      Authored by Mike Snitzer
      DM targets can request multiple bios be sent to them by DM core (see:
      num_{flush,discard,write_same,write_zeroes}_bios).  But until now these
      bios were allocated in an unsafe manner that could potentially exhaust
      the DM device's bioset -- in the face of multiple threads each trying to
      do multiple allocations from the same DM device's bioset.
      
      Fix __send_duplicate_bios() by using the new alloc_multiple_bios().  The
      allocation strategy used by alloc_multiple_bios() models that used by
      dm-crypt.c:crypt_alloc_buffer().
      
      Neil Brown initially proposed this fix, but the implementation has been
      revised enough that it is inappropriate to attribute the entirety of it
      to him.
      Suggested-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
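      A sketch of the crypt_alloc_buffer()-style strategy referenced above (illustrative,
      not the exact DM code): first try to grab every clone without blocking; on failure,
      release what was taken and retry under a mutex with a blocking allocation, so at
      most one thread at a time can sleep on the bioset's mempool.
      alloc_one_clone() and free_clone() are stand-ins for the driver's own primitives.
      ----
      static DEFINE_MUTEX(alloc_lock);

      static int alloc_n_clones(struct bio_list *blist, unsigned int num_bios)
      {
              int try;

              for (try = 0; try < 2; try++) {
                      unsigned int n;
                      struct bio *bio;

                      if (try)
                              mutex_lock(&alloc_lock);   /* serialize the blocking pass */
                      for (n = 0; n < num_bios; n++) {
                              bio = alloc_one_clone(try ? GFP_NOIO : GFP_NOWAIT);
                              if (!bio)
                                      break;
                              bio_list_add(blist, bio);
                      }
                      if (try)
                              mutex_unlock(&alloc_lock);
                      if (n == num_bios)
                              return 0;

                      /* partial success: give everything back before the blocking retry */
                      while ((bio = bio_list_pop(blist)))
                              free_clone(bio);
              }
              return -ENOMEM;
      }
      ----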
    • dm: remove unused 'num_write_bios' target interface · f31c21e4
      Authored by NeilBrown
      No DM target provides num_write_bios and none has since dm-cache's
      brief use in 2013.
      
      Having the possibility of num_write_bios > 1 complicates bio
      allocation.  So remove the interface and assume there is only one bio
      needed.
      
      If a target ever needs more, it must provide a suitable bioset and
      allocate the bios itself based on its particular needs.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: ensure bio submission follows a depth-first tree walk · 18a25da8
      Authored by NeilBrown
      A dm device can, in general, represent a tree of targets, each of which
      handles a sub-range of the range of blocks handled by the parent.
      
      The bio sequencing managed by generic_make_request() requires that bios
      are generated and handled in a depth-first manner.  Each call to a
      make_request_fn() may submit bios to a single member device, and may
      submit bios for a reduced region of the same device as the
      make_request_fn.
      
      In particular, any bios submitted to member devices must be expected to
      be processed in order, so a later one must never wait for an earlier
      one.
      
      This ordering is usually achieved by using bio_split() to reduce a bio
      to a size that can be completely handled by one target, and resubmitting
      the remainder to the originating device. bio_queue_split() shows the
      canonical approach.
      
      dm doesn't follow this approach, largely because it has needed to split
      bios since long before bio_split() was available.  It currently can
      submit bios to separate targets within the one dm_make_request() call.
      Dependencies between these targets, as can happen with dm-snap, can
      cause deadlocks if either bio gets stuck behind the other in the queues
      managed by generic_make_request().  This requires the 'rescue'
      functionality provided by dm_offload_{start,end}.
      
      Some of this requirement can be removed by changing the order of bio
      submission to follow the canonical approach.  That is, if dm finds that
      it needs to split a bio, the remainder should be sent to
      generic_make_request() rather than being handled immediately.  This
      delays the handling until the first part is completely processed, so the
      deadlock problems do not occur.
      
      __split_and_process_bio() can be called both from dm_make_request() and
      from dm_wq_work().  When called from dm_wq_work() the current approach
      is perfectly satisfactory as each bio will be processed immediately.
      When called from dm_make_request(), current->bio_list will be non-NULL,
      and in this case it is best to create a separate "clone" bio for the
      remainder.
      
      When we use bio_clone_bioset() to split off the front part of a bio
      and chain the two together and submit the remainder to
      generic_make_request(), it is important that the newly allocated
      bio is used as the head to be processed immediately, and the original
      bio gets "bio_advance()"d and sent to generic_make_request() as the
      remainder.  Otherwise, if the newly allocated bio is used as the
      remainder, and if it then needs to be split again, then the next
      bio_clone_bioset() call will be made while holding a reference to a bio
      (result of the first clone) from the same bioset.  This can potentially
      exhaust the bioset mempool and result in a memory allocation deadlock.
      
      Note that there is no race caused by reassigning cio.io->bio after already
      calling __map_bio().  This bio will only be dereferenced again after
      dec_pending() has found io->io_count to be zero, and this cannot happen
      before the dec_pending() call at the end of __split_and_process_bio().
      
      To provide the clone bio when splitting, we use q->bio_split.  This
      was previously being freed by bio-based dm to avoid having excess
      rescuer threads.  As bio_split bio sets no longer create rescuer
      threads, there is little cost and much gain from restoring the
      q->bio_split bio set.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
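      A condensed sketch of the split-and-chain step described above (illustrative, with
      DM's clone_info bookkeeping trimmed away; the bioset would be q->bio_split or
      similar): the clone becomes the head that is processed now, the original bio is
      advanced past the head and resubmitted, and bio_chain() ensures the original's
      completion waits for the head.
      ----
      /* Sketch: depth-first handling of a bio that is too big for one target.
       * 'done_sectors' is the front portion the caller will map immediately. */
      static struct bio *split_head_and_requeue_tail(struct bio *bio,
                                                     unsigned int done_sectors,
                                                     struct bio_set *bs)
      {
              struct bio *head = bio_clone_bioset(bio, GFP_NOIO, bs);

              head->bi_iter.bi_size = done_sectors << 9;   /* clone keeps only the front */

              bio_advance(bio, done_sectors << 9);         /* original keeps the tail    */
              bio_chain(head, bio);                        /* tail completes after head  */

              generic_make_request(bio);                   /* remainder handled later,
                                                              depth-first                */
              return head;                                 /* caller maps the head now   */
      }
      ----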
    • dm: fix comment above dm_accept_partial_bio · c06b3e58
      Authored by NeilBrown
      Clarify that dm_accept_partial_bio isn't allowed for REQ_OP_ZONE_RESET
      bios.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  5. 11 November 2017 (3 commits)
    • dm: small cleanup in dm_get_md() · 49de5769
      Authored by Mike Snitzer
      Makes dm_get_md() and dm_get_from_kobject() have similar code.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: fix race between dm_get_from_kobject() and __dm_destroy() · b9a41d21
      Authored by Hou Tao
      The following BUG_ON was hit when testing repeat creation and removal of
      DM devices:
      
          kernel BUG at drivers/md/dm.c:2919!
          CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44
          Call Trace:
           [<ffffffff81649e8b>] dm_get_from_kobject+0x34/0x3a
           [<ffffffff81650ef1>] dm_attr_show+0x2b/0x5e
           [<ffffffff817b46d1>] ? mutex_lock+0x26/0x44
           [<ffffffff811df7f5>] sysfs_kf_seq_show+0x83/0xcf
           [<ffffffff811de257>] kernfs_seq_show+0x23/0x25
           [<ffffffff81199118>] seq_read+0x16f/0x325
           [<ffffffff811de994>] kernfs_fop_read+0x3a/0x13f
           [<ffffffff8117b625>] __vfs_read+0x26/0x9d
           [<ffffffff8130eb59>] ? security_file_permission+0x3c/0x44
           [<ffffffff8117bdb8>] ? rw_verify_area+0x83/0xd9
           [<ffffffff8117be9d>] vfs_read+0x8f/0xcf
           [<ffffffff81193e34>] ? __fdget_pos+0x12/0x41
           [<ffffffff8117c686>] SyS_read+0x4b/0x76
           [<ffffffff817b606e>] system_call_fastpath+0x12/0x71
      
      The bug can be easily triggered if an extra delay (e.g. 10ms) is added
      between the test of DMF_FREEING & DMF_DELETING and dm_get() in
      dm_get_from_kobject().
      
      To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and
      dm_get() are done in an atomic way, so _minor_lock is used.
      
      The other callers of dm_get() have also been checked to be OK: some
      callers invoke dm_get() under _minor_lock, some callers invoke it under
      _hash_lock, and dm_start_request() invokes it after increasing
      md->open_count.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
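      The shape of the fix (a sketch following drivers/md/dm.c naming, not the verbatim
      patch): the flag test and the reference grab happen under _minor_lock, the same
      lock __dm_destroy() takes, so the device can no longer disappear between the two.
      ----
      struct mapped_device *dm_get_from_kobject(struct kobject *kobj)
      {
              struct mapped_device *md;

              md = container_of(kobj, struct mapped_device, kobj_holder.kobj);

              spin_lock(&_minor_lock);
              /* test and reference now form one atomic step w.r.t. __dm_destroy() */
              if (test_bit(DMF_FREEING, &md->flags) || dm_deleting_md(md)) {
                      md = NULL;
                      goto out;
              }
              dm_get(md);
      out:
              spin_unlock(&_minor_lock);

              return md;
      }
      ----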
    • dm: allocate struct mapped_device with kvzalloc · 856eb091
      Authored by Mikulas Patocka
      The structure srcu_struct can be very big; its size is proportional to the
      value of CONFIG_NR_CPUS.  The Fedora kernel has CONFIG_NR_CPUS set to 8192, so
      the field io_barrier in struct mapped_device occupies 84kB in the debugging
      kernel and 50kB in the non-debugging kernel.  The large size may result in
      failure of the function kzalloc_node().
      
      In order to avoid the allocation failure, we use the function
      kvzalloc_node(), which falls back to vmalloc if a large contiguous
      chunk of memory is not available.  This patch also moves the field
      io_barrier to the last position of struct mapped_device - the reason is
      that on many processor architectures, short memory offsets result in
      smaller code than long memory offsets - on x86-64 it reduces code size by
      320 bytes.
      
      Note to stable kernel maintainers - the kernels 4.11 and older don't have
      the function kvzalloc_node, you can use the function vzalloc_node instead.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
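      The allocation change in a nutshell (a sketch with error handling and NUMA node
      selection simplified): kvzalloc_node() behaves like kzalloc_node() but falls back
      to vmalloc for large sizes, and the result must be released with kvfree().
      ----
      static struct mapped_device *alloc_md_sketch(void)
      {
              /* may fall back to vmalloc when contiguous pages are unavailable */
              struct mapped_device *md = kvzalloc_node(sizeof(*md), GFP_KERNEL,
                                                       NUMA_NO_NODE);

              return md;      /* NULL only if even the vmalloc fallback failed */
      }

      static void free_md_sketch(struct mapped_device *md)
      {
              kvfree(md);     /* correct for both kmalloc- and vmalloc-backed memory */
      }
      ----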
  6. 25 October 2017 (2 commits)
    • locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE() · 6aa7de05
      Authored by Mark Rutland
      
      Please do not apply this to mainline directly; instead, please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • dm: convert table_device.count from atomic_t to refcount_t · b0b4d7c6
      Authored by Elena Reshetova
      atomic_t variables are currently used to implement reference
      counters with the following properties:
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situations and be exploitable.
      
      The variable table_device.count is used as a pure reference counter.
      Convert it to refcount_t and fix up the operations.
      Suggested-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
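      The conversion follows the usual atomic_t-to-refcount_t mapping.  A sketch with a
      stand-in struct (table_device's real fields live in drivers/md/dm.c):
      ----
      #include <linux/refcount.h>
      #include <linux/slab.h>

      struct td_example {                                /* stand-in for table_device */
              refcount_t count;
      };

      static struct td_example *td_new(void)
      {
              struct td_example *td = kzalloc(sizeof(*td), GFP_KERNEL);

              if (td)
                      refcount_set(&td->count, 1);       /* was atomic_set(&td->count, 1) */
              return td;
      }

      static void td_get(struct td_example *td)
      {
              refcount_inc(&td->count);                  /* was atomic_inc(); warns on overflow */
      }

      static void td_put(struct td_example *td)
      {
              if (refcount_dec_and_test(&td->count))     /* was atomic_dec_and_test() */
                      kfree(td);
      }
      ----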
  7. 06 October 2017 (1 commit)
  8. 25 September 2017 (1 commit)
    • dm ioctl: fix alignment of event number in the device list · 62e08243
      Authored by Mikulas Patocka
      The size of struct dm_name_list is different on 32-bit and 64-bit
      kernels (so "(nl + 1)" differs between 32-bit and 64-bit kernels).
      
      This mismatch caused some harmless difference in padding when using a 32-bit
      or 64-bit kernel.  Commit 23d70c5e ("dm ioctl: report event number in
      DM_LIST_DEVICES") added reporting of the event number in the output of
      DM_LIST_DEVICES_CMD.  This difference in padding makes it impossible for
      userspace to determine the location of the event number (the location
      would be different when running on 32-bit and 64-bit kernels).
      
      Fix the padding by using offsetof(struct dm_name_list, name) instead of
      sizeof(struct dm_name_list) to determine the location of entries.
      
      Also, the ioctl version number is incremented to 37 so that userspace
      can use the version number to determine that the event number is present
      and correctly located.
      
      In addition, a global event is now raised when a DM device is created,
      removed, renamed, or when a table is swapped, so that the user can monitor
      for device changes.
      Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
      Fixes: 23d70c5e ("dm ioctl: report event number in DM_LIST_DEVICES")
      Cc: stable@vger.kernel.org # 4.13
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
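      The layout issue comes down to how the location after an entry's name is computed;
      a sketch of the offsetof()-based calculation (the 8-byte alignment shown here is
      illustrative, and entry_event_nr() is a hypothetical helper -- dm-ioctl.c's own
      alignment helpers are authoritative):
      ----
      /* Using offsetof(..., name) instead of sizeof(struct dm_name_list) keeps the
       * computed offset identical on 32-bit and 64-bit kernels, so userspace can
       * find the event number that follows the name reliably. */
      static uint32_t *entry_event_nr(struct dm_name_list *nl)
      {
              size_t off = offsetof(struct dm_name_list, name) + strlen(nl->name) + 1;

              return (uint32_t *)((char *)nl + ALIGN(off, 8));
      }
      ----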
  9. 11 September 2017 (1 commit)
    • dax: remove the pmem_dax_ops->flush abstraction · c3ca015f
      Authored by Mikulas Patocka
      Commit abebfbe2 ("dm: add ->flush() dax operation support") is
      buggy. A DM device may be composed of multiple underlying devices and
      all of them need to be flushed. That commit just routes the flush
      request to the first device and ignores the other devices.
      
      It could be fixed by adding more complex logic to the device mapper. But
      there is only one implementation of the method pmem_dax_ops->flush - that
      is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
      don't need the pmem_dax_ops->flush abstraction at all; we can call
      arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
      can't ever reach anything different from arch_wb_cache_pmem().
      
      It should also be pointed out that some uses of persistent memory need to
      flush only a very small amount of data (such as one cacheline), and it would
      be overkill to go through the device mapper machinery for a single flushed
      cache line.
      
      Fix this by removing the pmem_dax_ops->flush abstraction and call
      arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
      mapper code that forwards the flushes.
      
      Fixes: abebfbe2 ("dm: add ->flush() dax operation support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
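      After the change the flush path is essentially direct.  A sketch of the resulting
      shape (drivers/dax/super.c carries the real guards and the !CONFIG_ARCH_HAS_PMEM_API
      stub):
      ----
      /* dax_flush() writes back CPU caches for the range itself, with no
       * per-driver ->flush method in between. */
      void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
      {
              if (unlikely(!dax_write_cache_enabled(dax_dev)))
                      return;

              arch_wb_cache_pmem(addr, size);
      }
      ----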
  10. 28 August 2017 (2 commits)
  11. 24 August 2017 (1 commit)
    • block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Authored by Christoph Hellwig
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different lifetime rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
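      For drivers, the visible change is how a bio is pointed at a device.  A sketch
      (field names as introduced by this change; point_bio_at_disk() is a hypothetical
      helper for illustration):
      ----
      static void point_bio_at_disk(struct bio *bio, struct gendisk *disk,
                                    struct block_device *bdev)
      {
              /* before this change: bio->bi_bdev = bdev; */

              bio->bi_disk = disk;      /* struct gendisk *, one per block device          */
              bio->bi_partno = 0;       /* partition index used by generic_make_request() */

              /* or, when a block_device is at hand, the helper does the same: */
              bio_set_dev(bio, bdev);
      }
      ----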
  12. 10 August 2017 (1 commit)
  13. 04 July 2017 (1 commit)
  14. 28 June 2017 (1 commit)
  15. 19 June 2017 (6 commits)
  16. 16 June 2017 (1 commit)
  17. 10 June 2017 (1 commit)
  18. 09 June 2017 (2 commits)