1. 16 1月, 2019 1 次提交
    • J
      dm thin: fix passdown_double_checking_shared_status() · d445bd9c
      Joe Thornber 提交于
      Commit 00a0ea33 ("dm thin: do not queue freed thin mapping for next
      stage processing") changed process_prepared_discard_passdown_pt1() to
      increment all the blocks being discarded until after the passdown had
      completed to avoid them being prematurely reused.
      
      IO issued to a thin device that breaks sharing with a snapshot, followed
      by a discard issued to snapshot(s) that previously shared the block(s),
      results in passdown_double_checking_shared_status() being called to
      iterate through the blocks double checking their reference count is zero
      and issuing the passdown if so.  So a side effect of commit 00a0ea33
      is passdown_double_checking_shared_status() was broken.
      
      Fix this by checking if the block reference count is greater than 1.
      Also, rename dm_pool_block_is_used() to dm_pool_block_is_shared().
      
      Fixes: 00a0ea33 ("dm thin: do not queue freed thin mapping for next stage processing")
      Cc: stable@vger.kernel.org # 4.9+
      Reported-by: ryan.p.norwood@gmail.com
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      d445bd9c
  2. 12 12月, 2018 2 次提交
    • M
      dm thin: bump target version · 2af6c070
      Mike Snitzer 提交于
      Decoupled version bump from commit f6c36758 ("dm thin: send event
      about thin-pool state change _after_ making it") because version bumps
      just create conflicts when backporting to the stable trees.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      2af6c070
    • M
      dm thin: send event about thin-pool state change _after_ making it · f6c36758
      Mike Snitzer 提交于
      Sending a DM event before a thin-pool state change is about to happen is
      a bug.  It wasn't realized until it became clear that userspace response
      to the event raced with the actual state change that the event was
      meant to notify about.
      
      Fix this by first updating internal thin-pool state to reflect what the
      DM event is being issued about.  This fixes a long-standing racey/buggy
      userspace device-mapper-test-suite 'resize_io' test that would get an
      event but not find the state it was looking for -- so it would just go
      on to hang because no other events caused the test to reevaluate the
      thin-pool's state.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      f6c36758
  3. 17 10月, 2018 1 次提交
  4. 11 9月, 2018 1 次提交
    • J
      dm thin metadata: try to avoid ever aborting transactions · 3ab91828
      Joe Thornber 提交于
      Committing a transaction can consume some metadata of it's own, we now
      reserve a small amount of metadata to cover this.  Free metadata
      reported by the kernel will not include this reserve.
      
      If any of the reserve has been used after a commit we enter a new
      internal state PM_OUT_OF_METADATA_SPACE.  This is reported as
      PM_READ_ONLY, so no userland changes are needed.  If the metadata
      device is resized the pool will move back to PM_WRITE.
      
      These changes mean we never need to abort and rollback a transaction due
      to running out of metadata space.  This is particularly important
      because there have been a handful of reports of data corruption against
      DM thin-provisioning that can all be attributed to the thin-pool having
      ran out of metadata space.
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      3ab91828
  5. 08 8月, 2018 1 次提交
    • H
      dm thin: stop no_space_timeout worker when switching to write-mode · 75294442
      Hou Tao 提交于
      Now both check_for_space() and do_no_space_timeout() will read & write
      pool->pf.error_if_no_space.  If these functions run concurrently, as
      shown in the following case, the default setting of "queue_if_no_space"
      can get lost.
      
      precondition:
          * error_if_no_space = false (aka "queue_if_no_space")
          * pool is in Out-of-Data-Space (OODS) mode
          * no_space_timeout worker has been queued
      
      CPU 0:                          CPU 1:
      // delete a thin device
      process_delete_mesg()
      // check_for_space() invoked by commit()
      set_pool_mode(pool, PM_WRITE)
          pool->pf.error_if_no_space = \
           pt->requested_pf.error_if_no_space
      
      				// timeout, pool is still in OODS mode
      				do_no_space_timeout
      				    // "queue_if_no_space" config is lost
      				    pool->pf.error_if_no_space = true
          pool->pf.mode = new_mode
      
      Fix it by stopping no_space_timeout worker when switching to write mode.
      
      Fixes: bcc696fa ("dm thin: stay in out-of-data-space mode once no_space_timeout expires")
      Cc: stable@vger.kernel.org
      Signed-off-by: NHou Tao <houtao1@huawei.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      75294442
  6. 01 8月, 2018 1 次提交
  7. 30 7月, 2018 1 次提交
    • A
      dm thin: include metadata_low_watermark threshold in pool status · 63c8ecb6
      Andy Grover 提交于
      The metadata low watermark threshold is set by the kernel.  But the
      kernel depends on userspace to extend the thinpool metadata device when
      the threshold is crossed.
      
      Since the metadata low watermark threshold is not visible to userspace,
      upon receiving an event, userspace cannot tell that the kernel wants the
      metadata device extended, instead of some other eventing condition.
      Making it visible (but not settable) enables userspace to affirmatively
      know the kernel is asking for a metadata device extension, by comparing
      metadata_low_watermark against nr_free_blocks_metadata, also reported in
      status.
      
      Current solutions like dmeventd have their own thresholds for extending
      the data and metadata devices, and both devices are checked against
      their thresholds on each event.  This lessens the value of the kernel-set
      threshold, since userspace will either extend the metadata device sooner,
      when receiving another event; or will receive the metadata lowater event
      and do nothing, if dmeventd's threshold is less than the kernel's.
      (This second case is dangerous. The metadata lowater event will not be
      re-sent, so no further event will be generated before the metadata
      device is out if space, unless some other event causes userspace to
      recheck its thresholds.)
      Signed-off-by: NAndy Grover <agrover@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      63c8ecb6
  8. 27 6月, 2018 1 次提交
    • M
      dm thin: handle running out of data space vs concurrent discard · a685557f
      Mike Snitzer 提交于
      Discards issued to a DM thin device can complete to userspace (via
      fstrim) _before_ the metadata changes associated with the discards is
      reflected in the thinp superblock (e.g. free blocks).  As such, if a
      user constructs a test that loops repeatedly over these steps, block
      allocation can fail due to discards not having completed yet:
      1) fill thin device via filesystem file
      2) remove file
      3) fstrim
      
      From initial report, here:
      https://www.redhat.com/archives/dm-devel/2018-April/msg00022.html
      
      "The root cause of this issue is that dm-thin will first remove
      mapping and increase corresponding blocks' reference count to prevent
      them from being reused before DISCARD bios get processed by the
      underlying layers. However. increasing blocks' reference count could
      also increase the nr_allocated_this_transaction in struct sm_disk
      which makes smd->old_ll.nr_allocated +
      smd->nr_allocated_this_transaction bigger than smd->old_ll.nr_blocks.
      In this case, alloc_data_block() will never commit metadata to reset
      the begin pointer of struct sm_disk, because sm_disk_get_nr_free()
      always return an underflow value."
      
      While there is room for improvement to the space-map accounting that
      thinp is making use of: the reality is this test is inherently racey and
      will result in the previous iteration's fstrim's discard(s) completing
      vs concurrent block allocation, via dd, in the next iteration of the
      loop.
      
      No amount of space map accounting improvements will be able to allow
      user's to use a block before a discard of that block has completed.
      
      So the best we can really do is allow DM thinp to gracefully handle such
      aggressive use of all the pool's data by degrading the pool into
      out-of-data-space (OODS) mode.  We _should_ get that behaviour already
      (if space map accounting didn't falsely cause alloc_data_block() to
      believe free space was available).. but short of that we handle the
      current reality that dm_pool_alloc_data_block() can return -ENOSPC.
      Reported-by: NDennis Yang <dennisyang@qnap.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      a685557f
  9. 13 6月, 2018 1 次提交
    • K
      treewide: Use array_size() in vmalloc() · 42bc47b3
      Kees Cook 提交于
      The vmalloc() function has no 2-factor argument form, so multiplication
      factors need to be wrapped in array_size(). This patch replaces cases of:
      
              vmalloc(a * b)
      
      with:
              vmalloc(array_size(a, b))
      
      as well as handling cases of:
      
              vmalloc(a * b * c)
      
      with:
      
              vmalloc(array3_size(a, b, c))
      
      This does, however, attempt to ignore constant size factors like:
      
              vmalloc(4 * 1024)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        vmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        vmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        vmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
        vmalloc(
      -	SIZE * COUNT
      +	array_size(COUNT, SIZE)
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        vmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        vmalloc(C1 * C2 * C3, ...)
      |
        vmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants.
      @@
      expression E1, E2;
      constant C1, C2;
      @@
      
      (
        vmalloc(C1 * C2, ...)
      |
        vmalloc(
      -	E1 * E2
      +	array_size(E1, E2)
        , ...)
      )
      Signed-off-by: NKees Cook <keescook@chromium.org>
      42bc47b3
  10. 08 6月, 2018 1 次提交
  11. 05 6月, 2018 1 次提交
  12. 31 5月, 2018 1 次提交
  13. 04 4月, 2018 1 次提交
  14. 30 1月, 2018 1 次提交
  15. 17 1月, 2018 1 次提交
  16. 04 12月, 2017 1 次提交
    • M
      dm: fix various targets to dm_register_target after module __init resources created · 7e6358d2
      monty_pavel@sina.com 提交于
      A NULL pointer is seen if two concurrent "vgchange -ay -K <vg name>"
      processes race to load the dm-thin-pool module:
      
       PID: 25992 TASK: ffff883cd7d23500 CPU: 4 COMMAND: "vgchange"
        #0 [ffff883cd743d600] machine_kexec at ffffffff81038fa9
        0000001 [ffff883cd743d660] crash_kexec at ffffffff810c5992
        0000002 [ffff883cd743d730] oops_end at ffffffff81515c90
        0000003 [ffff883cd743d760] no_context at ffffffff81049f1b
        0000004 [ffff883cd743d7b0] __bad_area_nosemaphore at ffffffff8104a1a5
        0000005 [ffff883cd743d800] bad_area at ffffffff8104a2ce
        0000006 [ffff883cd743d830] __do_page_fault at ffffffff8104aa6f
        0000007 [ffff883cd743d950] do_page_fault at ffffffff81517bae
        0000008 [ffff883cd743d980] page_fault at ffffffff81514f95
           [exception RIP: kmem_cache_alloc+108]
           RIP: ffffffff8116ef3c RSP: ffff883cd743da38 RFLAGS: 00010046
           RAX: 0000000000000004 RBX: ffffffff81121b90 RCX: ffff881bf1e78cc0
           RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 0000000000000000
           RBP: ffff883cd743da68 R8: ffff881bf1a4eb00 R9: 0000000080042000
           R10: 0000000000002000 R11: 0000000000000000 R12: 00000000000000d0
           R13: 0000000000000000 R14: 00000000000000d0 R15: 0000000000000246
           ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
        0000009 [ffff883cd743da70] mempool_alloc_slab at ffffffff81121ba5
       0000010 [ffff883cd743da80] mempool_create_node at ffffffff81122083
       0000011 [ffff883cd743dad0] mempool_create at ffffffff811220f4
       0000012 [ffff883cd743dae0] pool_ctr at ffffffffa08de049 [dm_thin_pool]
       0000013 [ffff883cd743dbd0] dm_table_add_target at ffffffffa0005f2f [dm_mod]
       0000014 [ffff883cd743dc30] table_load at ffffffffa0008ba9 [dm_mod]
       0000015 [ffff883cd743dc90] ctl_ioctl at ffffffffa0009dc4 [dm_mod]
      
      The race results in a NULL pointer because:
      
      Process A (vgchange -ay -K):
       	a. send DM_LIST_VERSIONS_CMD ioctl;
       	b. pool_target not registered;
       	c. modprobe dm_thin_pool and wait until end.
      
      Process B (vgchange -ay -K):
       	a. send DM_LIST_VERSIONS_CMD ioctl;
       	b. pool_target registered;
       	c. table_load->dm_table_add_target->pool_ctr;
       	d. _new_mapping_cache is NULL and panic.
      Note:
       	1. process A and process B are two concurrent processes.
       	2. pool_target can be detected by process B but
       	_new_mapping_cache initialization has not ended.
      
      To fix dm-thin-pool, and other targets (cache, multipath, and snapshot)
      with the same problem, simply dm_register_target() after all resources
      created during module init (as labelled with __init) are finished.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nmonty <monty_pavel@sina.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      7e6358d2
  17. 25 10月, 2017 1 次提交
    • M
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
      Mark Rutland 提交于
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6aa7de05
  18. 28 8月, 2017 1 次提交
  19. 24 8月, 2017 1 次提交
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig 提交于
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74d46992
  20. 28 6月, 2017 1 次提交
    • V
      dm thin: do not queue freed thin mapping for next stage processing · 00a0ea33
      Vallish Vaidyeshwara 提交于
      process_prepared_discard_passdown_pt1() should cleanup
      dm_thin_new_mapping in cases of error.
      
      dm_pool_inc_data_range() can fail trying to get a block reference:
      
      metadata operation 'dm_pool_inc_data_range' failed: error = -61
      
      When dm_pool_inc_data_range() fails, dm thin aborts current metadata
      transaction and marks pool as PM_READ_ONLY. Memory for thin mapping
      is released as well. However, current thin mapping will be queued
      onto next stage as part of queue_passdown_pt2() or passdown_endio().
      This dangling thin mapping memory when processed and accessed in
      next stage will lead to device mapper crashing.
      
      Code flow without fix:
      -> process_prepared_discard_passdown_pt1(m)
         -> dm_thin_remove_range()
         -> discard passdown
            --> passdown_endio(m) queues m onto next stage
         -> dm_pool_inc_data_range() fails, frees memory m
                  but does not remove it from next stage queue
      
      -> process_prepared_discard_passdown_pt2(m)
         -> processes freed memory m and crashes
      
      One such stack:
      
      Call Trace:
      [<ffffffffa037a46f>] dm_cell_release_no_holder+0x2f/0x70 [dm_bio_prison]
      [<ffffffffa039b6dc>] cell_defer_no_holder+0x3c/0x80 [dm_thin_pool]
      [<ffffffffa039b88b>] process_prepared_discard_passdown_pt2+0x4b/0x90 [dm_thin_pool]
      [<ffffffffa0399611>] process_prepared+0x81/0xa0 [dm_thin_pool]
      [<ffffffffa039e735>] do_worker+0xc5/0x820 [dm_thin_pool]
      [<ffffffff8152bf54>] ? __schedule+0x244/0x680
      [<ffffffff81087e72>] ? pwq_activate_delayed_work+0x42/0xb0
      [<ffffffff81089f53>] process_one_work+0x153/0x3f0
      [<ffffffff8108a71b>] worker_thread+0x12b/0x4b0
      [<ffffffff8108a5f0>] ? rescuer_thread+0x350/0x350
      [<ffffffff8108fd6a>] kthread+0xca/0xe0
      [<ffffffff8108fca0>] ? kthread_park+0x60/0x60
      [<ffffffff81530b45>] ret_from_fork+0x25/0x30
      
      The fix is to first take the block ref count for discarded block and
      then do a passdown discard of this block. If block ref count fails,
      then bail out aborting current metadata transaction, mark pool as
      PM_READ_ONLY and also free current thin mapping memory (existing error
      handling code) without queueing this thin mapping onto next stage of
      processing. If block ref count succeeds, then passdown discard of this
      block. Discard callback of passdown_endio() will queue this thin mapping
      onto next stage of processing.
      
      Code flow with fix:
      -> process_prepared_discard_passdown_pt1(m)
         -> dm_thin_remove_range()
         -> dm_pool_inc_data_range()
            --> if fails, free memory m and bail out
         -> discard passdown
            --> passdown_endio(m) queues m onto next stage
      
      Cc: stable <stable@vger.kernel.org> # v4.9+
      Reviewed-by: NEduardo Valentin <eduval@amazon.com>
      Reviewed-by: NCristian Gafton <gafton@amazon.com>
      Reviewed-by: NAnchal Agarwal <anchalag@amazon.com>
      Signed-off-by: NVallish Vaidyeshwara <vallish@amazon.com>
      Reviewed-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      00a0ea33
  21. 09 6月, 2017 2 次提交
  22. 25 4月, 2017 1 次提交
    • D
      dm thin: fix a memory leak when passing discard bio down · 948f581a
      Dennis Yang 提交于
      dm-thin does not free the discard_parent bio after all chained sub
      bios finished. The following kmemleak report could be observed after
      pool with discard_passdown option processes discard bios in
      linux v4.11-rc7. To fix this, we drop the discard_parent bio reference
      when its endio (passdown_endio) called.
      
      unreferenced object 0xffff8803d6b29700 (size 256):
        comm "kworker/u8:0", pid 30349, jiffies 4379504020 (age 143002.776s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          01 00 00 00 00 00 00 f0 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81a5efd9>] kmemleak_alloc+0x49/0xa0
          [<ffffffff8114ec34>] kmem_cache_alloc+0xb4/0x100
          [<ffffffff8110eec0>] mempool_alloc_slab+0x10/0x20
          [<ffffffff8110efa5>] mempool_alloc+0x55/0x150
          [<ffffffff81374939>] bio_alloc_bioset+0xb9/0x260
          [<ffffffffa018fd20>] process_prepared_discard_passdown_pt1+0x40/0x1c0 [dm_thin_pool]
          [<ffffffffa018b409>] break_up_discard_bio+0x1a9/0x200 [dm_thin_pool]
          [<ffffffffa018b484>] process_discard_cell_passdown+0x24/0x40 [dm_thin_pool]
          [<ffffffffa018b24d>] process_discard_bio+0xdd/0xf0 [dm_thin_pool]
          [<ffffffffa018ecf6>] do_worker+0xa76/0xd50 [dm_thin_pool]
          [<ffffffff81086239>] process_one_work+0x139/0x370
          [<ffffffff810867b1>] worker_thread+0x61/0x450
          [<ffffffff8108b316>] kthread+0xd6/0xf0
          [<ffffffff81a6cd1f>] ret_from_fork+0x3f/0x70
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NDennis Yang <dennisyang@qnap.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      948f581a
  23. 09 4月, 2017 1 次提交
  24. 08 3月, 2017 1 次提交
  25. 02 2月, 2017 1 次提交
  26. 28 1月, 2017 1 次提交
  27. 08 8月, 2016 1 次提交
    • J
      block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe 提交于
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokeness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      1eff9d32
  28. 21 7月, 2016 1 次提交
  29. 08 6月, 2016 4 次提交
  30. 13 5月, 2016 3 次提交
  31. 06 5月, 2016 1 次提交
  32. 12 3月, 2016 1 次提交
    • M
      dm thin: consistently return -ENOSPC if pool has run out of data space · c3667cc6
      Mike Snitzer 提交于
      Commit 0a927c2f ("dm thin: return -ENOSPC when erroring retry list due
      to out of data space") was a step in the right direction but didn't go
      far enough.
      
      Add a new 'out_of_data_space' flag to 'struct pool' and set it if/when
      the pool runs of of data space.  This fixes cell_error() and
      error_retry_list() to not blindly return -EIO.
      
      We cannot rely on the 'error_if_no_space' feature flag since it is
      transient (in that it can be reset once space is added, plus it only
      controls whether errors are issued, it doesn't reflect whether the
      pool is actually out of space).
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      c3667cc6
  33. 23 2月, 2016 1 次提交