1. 15 January 2020, 1 commit
    • dm thin: fix use-after-free in metadata_pre_commit_callback · a4a8d286
      Authored by Mike Snitzer
      dm-thin uses struct pool to hold the state of the pool. There may be
      multiple pool_c's pointing to a given pool; each pool_c represents a
      loaded target. pool_c's may be created and destroyed arbitrarily, and
      the pool keeps a reference count of the pool_c's pointing to it.
      
      Since commit 694cfe7f ("dm thin: Flush data device before
      committing metadata") a pointer to pool_c is passed to
      dm_pool_register_pre_commit_callback and this function stores it in
      pmd->pre_commit_context. If this pool_c is freed while the pool is not
      (because another pool_c still references it), pmd->pre_commit_context
      is left pointing to the freed pool_c, which causes a crash in
      metadata_pre_commit_callback.
      
      Fix this by moving the dm_pool_register_pre_commit_callback() from
      pool_ctr() to pool_preresume(). This way the in-core thin-pool metadata
      is only ever armed with callback data whose lifetime matches the
      active thin-pool target.
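
      A hedged sketch of the reworked registration (function and field names
      follow the driver's conventions of the time; illustrative, not the
      verbatim patch):

          static int pool_preresume(struct dm_target *ti)
          {
                  struct pool_c *pt = ti->private;
                  struct pool *pool = pt->pool;

                  /* ... bind_control_target() and resize checks elided ... */

                  /*
                   * Arm the callback here instead of in pool_ctr(): the
                   * context registered with the metadata now always belongs
                   * to the target that is actually being resumed.
                   */
                  dm_pool_register_pre_commit_callback(pool->pmd,
                                  metadata_pre_commit_callback, pt);

                  return 0;
          }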
      
      It should be noted that this fix preserves the ability to load a
      thin-pool table that uses a different data block device (that contains
      the same data) -- though it is unclear if that capability is still
      useful and/or needed.
      
      Fixes: 694cfe7f ("dm thin: Flush data device before committing metadata")
      Cc: stable@vger.kernel.org
      Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  2. 07 December 2019, 1 commit
    • dm thin: Flush data device before committing metadata · 694cfe7f
      Authored by Nikos Tsironis
      The thin provisioning target maintains per thin device mappings that map
      virtual blocks to data blocks in the data device.
      
      When we write to a shared block, in case of internal snapshots, or
      provision a new block, in case of external snapshots, we copy the shared
      block to a new data block (COW), update the mapping for the relevant
      virtual block and then issue the write to the new data block.
      
      Suppose the data device has a volatile write-back cache and the
      following sequence of events occur:
      
      1. We write to a shared block
      2. A new data block is allocated
      3. We copy the shared block to the new data block using kcopyd (COW)
      4. We insert the new mapping for the virtual block in the btree for that
         thin device.
      5. The commit timeout expires and we commit the metadata, that now
         includes the new mapping from step (4).
      6. The system crashes and the data device's cache has not been flushed,
         meaning that the COWed data are lost.
      
      The next time we read that virtual block of the thin device we read it
      from the data block allocated in step (2), since the metadata have been
      successfully committed. The data are lost due to the crash, so we read
      garbage instead of the old, shared data.
      
      This has the following implications:
      
      1. In case of writes to shared blocks, with size smaller than the pool's
         block size (which means we first copy the whole block and then issue
         the smaller write), we corrupt data that the user never touched.
      
      2. In case of writes to shared blocks, with size equal to the device's
         logical block size, we fail to provide atomic sector writes. When the
         system recovers the user will read garbage from that sector instead
         of the old data or the new data.
      
      3. Even for writes to shared blocks, with size equal to the pool's block
         size (overwrites), after the system recovers, the written sectors
         will contain garbage instead of a random mix of sectors containing
         either old data or new data, thus we fail again to provide atomic
         sector writes.
      
      4. Even when the user flushes the thin device, because we first commit
         the metadata and then pass down the flush, the same risk for
         corruption exists (if the system crashes after the metadata have been
         committed but before the flush is passed down to the data device.)
      
      The only case which is unaffected is that of writes with size equal to
      the pool's block size and with the FUA flag set. But, because FUA writes
      trigger metadata commits, this case can trigger the corruption
      indirectly.
      
      Moreover, apart from internal and external snapshots, the same issue
      exists for newly provisioned blocks, when block zeroing is enabled.
      After the system recovers the provisioned blocks might contain garbage
      instead of zeroes.
      
      To solve this and avoid the potential data corruption we flush the
      pool's data device **before** committing its metadata.
      
      This ensures that the data blocks of any newly inserted mappings are
      properly written to non-volatile storage and won't be lost in case of a
      crash.
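
      A minimal sketch of such a pre-commit hook (assuming a pre-allocated
      flush bio embedded in the target context; names are illustrative):

          static int metadata_pre_commit_callback(void *context)
          {
                  struct pool_c *pt = context;
                  struct bio *flush_bio = &pt->flush_bio;

                  /* Empty REQ_PREFLUSH bio: flush the data device's
                   * volatile cache and wait for it to finish. */
                  bio_reset(flush_bio);
                  bio_set_dev(flush_bio, pt->data_dev->bdev);
                  flush_bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;

                  return submit_bio_wait(flush_bio);
          }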
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  3. 18 November 2019, 1 commit
    • dm thin: wakeup worker only when deferred bios exist · d256d796
      Authored by Jeffle Xu
      Single thread fio test (read, bs=4k, ioengine=libaio, iodepth=128,
      numjobs=1) over dm-thin device has poor performance versus bare nvme
      device.
      
      Further investigation with perf indicates that queue_work_on() consumes
      over 20% CPU time when doing IO over dm-thin device. The call stack is
      as follows.
      
      - 40.57% thin_map
          + 22.07% queue_work_on
          + 9.95% dm_thin_find_block
          + 2.80% cell_defer_no_holder
            1.91% inc_all_io_entry.isra.33.part.34
          + 1.78% bio_detain.isra.35
      
      In cell_defer_no_holder(), wake_worker() is always called, whether or
      not the tc->deferred_bio_list list is empty. In the single-thread IO
      model this list is most likely empty, so skip waking up the worker
      thread when tc->deferred_bio_list is empty, as in the sketch below.
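
      A sketch of the change (shape of the real fix; helper names as in the
      driver):

          static void cell_defer_no_holder(struct thin_c *tc,
                                           struct dm_bio_prison_cell *cell)
          {
                  struct pool *pool = tc->pool;
                  unsigned long flags;
                  bool has_work;

                  spin_lock_irqsave(&tc->lock, flags);
                  cell_release_no_holder(pool, cell, &tc->deferred_bio_list);
                  has_work = !bio_list_empty(&tc->deferred_bio_list);
                  spin_unlock_irqrestore(&tc->lock, flags);

                  /* Only pay for a wakeup when there is queued work. */
                  if (has_work)
                          wake_worker(pool);
          }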
      
      Single thread IO performance improves from 448 MiB/s to 646 MiB/s (+44%)
      once the needless wake_worker() calls are properly skipped.
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  4. 06 November 2019, 1 commit
  5. 06 March 2019, 1 commit
    • dm thin: add sanity checks to thin-pool and external snapshot creation · 70de2cbd
      Authored by Jason Cai (Xiang Feng)
      Invoking dm_get_device() twice on the same device path with different
      modes is dangerous: in that case, upgrade_mode() allocates a new
      'dm_dev' and frees the old one, which may still be referenced by a
      previous caller. Dereferencing the dangling pointer triggers a kernel
      NULL pointer dereference.
      
      The following two cases reproduce the issue. Both are invalid setups
      that must be disallowed, e.g.:
      
      1. Creating a thin-pool in read_only mode with the same device as
      both metadata and data.
      
      dmsetup create thinp --table \
          "0 41943040 thin-pool /dev/vdb /dev/vdb 128 0 1 read_only"
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
      ...
      Call Trace:
       new_read+0xfb/0x110 [dm_bufio]
       dm_bm_read_lock+0x43/0x190 [dm_persistent_data]
       ? kmem_cache_alloc_trace+0x15c/0x1e0
       __create_persistent_data_objects+0x65/0x3e0 [dm_thin_pool]
       dm_pool_metadata_open+0x8c/0xf0 [dm_thin_pool]
       pool_ctr.cold.79+0x213/0x913 [dm_thin_pool]
       ? realloc_argv+0x50/0x70 [dm_mod]
       dm_table_add_target+0x14e/0x330 [dm_mod]
       table_load+0x122/0x2e0 [dm_mod]
       ? dev_status+0x40/0x40 [dm_mod]
       ctl_ioctl+0x1aa/0x3e0 [dm_mod]
       dm_ctl_ioctl+0xa/0x10 [dm_mod]
       do_vfs_ioctl+0xa2/0x600
       ? handle_mm_fault+0xda/0x200
       ? __do_page_fault+0x26c/0x4f0
       ksys_ioctl+0x60/0x90
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x55/0x150
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      2. Creating an external snapshot using the same thin-pool device.
      
      dmsetup create thinp --table \
          "0 41943040 thin-pool /dev/vdc /dev/vdb 128 0 2 ignore_discard"
      dmsetup message /dev/mapper/thinp 0 "create_thin 0"
      dmsetup create snap --table \
                  "0 204800 thin /dev/mapper/thinp 0 /dev/mapper/thinp"
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      ...
      Call Trace:
      ? __alloc_pages_nodemask+0x13c/0x2e0
      retrieve_status+0xa5/0x1f0 [dm_mod]
      ? dm_get_live_or_inactive_table.isra.7+0x20/0x20 [dm_mod]
       table_status+0x61/0xa0 [dm_mod]
       ctl_ioctl+0x1aa/0x3e0 [dm_mod]
       dm_ctl_ioctl+0xa/0x10 [dm_mod]
       do_vfs_ioctl+0xa2/0x600
       ksys_ioctl+0x60/0x90
       ? ksys_write+0x4f/0xb0
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x55/0x150
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
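
      The added checks boil down to rejecting such tables in the
      constructors. A hedged sketch of the thin-pool side (argv indices and
      the error string are assumptions):

          /* In pool_ctr(): metadata and data device must differ. */
          if (!strcmp(argv[0], argv[1])) {
                  ti->error = "Metadata device must differ from data device";
                  r = -EINVAL;
                  goto out_unlock;
          }
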
      Signed-off-by: Jason Cai (Xiang Feng) <jason.cai@linux.alibaba.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  6. 21 February 2019, 1 commit
  7. 15 February 2019, 1 commit
    • dm thin: fix bug where bio that overwrites thin block ignores FUA · 4ae280b4
      Authored by Nikos Tsironis
      When provisioning a new data block for a virtual block, either because
      the block was previously unallocated or because we are breaking sharing,
      if the whole block of data is being overwritten the bio that triggered
      the provisioning is issued immediately, skipping copying or zeroing of
      the data block.
      
      When this bio completes, the new mapping is inserted into the pool's
      metadata by process_prepared_mapping(), where the bio completion is
      signaled to the upper layers.
      
      This completion is signaled without first committing the metadata.  If
      the bio in question has the REQ_FUA flag set and the system crashes
      right after its completion and before the next metadata commit, then the
      write is lost despite the REQ_FUA flag requiring that I/O completion for
      this request must only be signaled after the data has been committed to
      non-volatile storage.
      
      Fix this by deferring the completion of overwrite bios that have the
      REQ_FUA flag set until after the metadata has been committed.
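
      A sketch of the deferral (field names per the driver; illustrative):

          static void complete_overwrite_bio(struct thin_c *tc, struct bio *bio)
          {
                  struct pool *pool = tc->pool;
                  unsigned long flags;

                  /* Non-FUA overwrites may complete immediately, as before. */
                  if (!(bio->bi_opf & REQ_FUA)) {
                          bio_endio(bio);
                          return;
                  }

                  /*
                   * Defer REQ_FUA completions; the commit path ends these
                   * bios once the new mapping is on stable storage.
                   */
                  bio->bi_opf &= ~REQ_FUA;
                  spin_lock_irqsave(&pool->lock, flags);
                  bio_list_add(&pool->deferred_flush_completions, bio);
                  spin_unlock_irqrestore(&pool->lock, flags);
          }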
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Acked-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  8. 16 January 2019, 1 commit
    • dm thin: fix passdown_double_checking_shared_status() · d445bd9c
      Authored by Joe Thornber
      Commit 00a0ea33 ("dm thin: do not queue freed thin mapping for next
      stage processing") changed process_prepared_discard_passdown_pt1() to
      increment all the blocks being discarded until after the passdown had
      completed to avoid them being prematurely reused.
      
      IO issued to a thin device that breaks sharing with a snapshot,
      followed by a discard issued to snapshot(s) that previously shared the
      block(s), results in passdown_double_checking_shared_status() being
      called to iterate through the blocks, double-checking that their
      reference count is zero, and issuing the passdown if so. So a side
      effect of commit 00a0ea33 is that
      passdown_double_checking_shared_status() was broken.
      
      Fix this by checking if the block reference count is greater than 1.
      Also, rename dm_pool_block_is_used() to dm_pool_block_is_shared().
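
      A sketch of the renamed helper (locking and accessors as in the
      metadata code of the era):

          int dm_pool_block_is_shared(struct dm_pool_metadata *pmd,
                                      dm_block_t b, bool *result)
          {
                  int r;
                  uint32_t ref_count;

                  down_read(&pmd->root_lock);
                  r = dm_sm_get_count(pmd->data_sm, b, &ref_count);
                  if (!r)
                          /* > 1, not "non-zero": pt1 already holds a ref. */
                          *result = (ref_count > 1);
                  up_read(&pmd->root_lock);

                  return r;
          }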
      
      Fixes: 00a0ea33 ("dm thin: do not queue freed thin mapping for next stage processing")
      Cc: stable@vger.kernel.org # 4.9+
      Reported-by: ryan.p.norwood@gmail.com
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  9. 12 December 2018, 2 commits
    • dm thin: bump target version · 2af6c070
      Authored by Mike Snitzer
      Decoupled the version bump from commit f6c36758 ("dm thin: send event
      about thin-pool state change _after_ making it") because version bumps
      just create conflicts when backporting to the stable trees.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm thin: send event about thin-pool state change _after_ making it · f6c36758
      Authored by Mike Snitzer
      Sending a DM event before a thin-pool state change has actually
      happened is a bug. It went unnoticed until it became clear that the
      userspace response to the event raced with the actual state change the
      event was meant to announce.
      
      Fix this by first updating the internal thin-pool state to reflect
      what the DM event is being issued about. This fixes a long-standing
      racy/buggy userspace device-mapper-test-suite 'resize_io' test that
      would get an event but not find the state it was looking for -- so it
      would just go on to hang, because no other events caused the test to
      reevaluate the thin-pool's state.
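
      The ordering of the fix, sketched (helper name assumed):

          static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
          {
                  enum pool_mode old_mode = get_pool_mode(pool);

                  /* ... mode-specific setup elided ... */

                  pool->pf.mode = new_mode;

                  /*
                   * Notify userspace only once the new state is in place,
                   * so a listener that re-reads pool status sees the state
                   * the event describes.
                   */
                  if (old_mode != new_mode)
                          notify_of_pool_mode_change(pool);
          }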
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  10. 17 October 2018, 1 commit
  11. 11 September 2018, 1 commit
    • dm thin metadata: try to avoid ever aborting transactions · 3ab91828
      Authored by Joe Thornber
      Committing a transaction can consume some metadata of its own, so we
      now reserve a small amount of metadata to cover this. Free metadata
      reported by the kernel will not include this reserve.
      
      If any of the reserve has been used after a commit we enter a new
      internal state PM_OUT_OF_METADATA_SPACE.  This is reported as
      PM_READ_ONLY, so no userland changes are needed.  If the metadata
      device is resized the pool will move back to PM_WRITE.
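
      Sketch of how the new state surfaces in pool_status() (mode strings as
      reported by the target):

          switch (get_pool_mode(pool)) {
          case PM_OUT_OF_METADATA_SPACE:
          case PM_READ_ONLY:
                  DMEMIT("ro ");        /* new state reported as read-only */
                  break;
          case PM_OUT_OF_DATA_SPACE:
                  DMEMIT("out_of_data_space ");
                  break;
          default:
                  DMEMIT("rw ");
          }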
      
      These changes mean we never need to abort and roll back a transaction
      due to running out of metadata space. This is particularly important
      because there have been a handful of reports of data corruption against
      DM thin-provisioning that can all be attributed to the thin-pool having
      run out of metadata space.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  12. 08 August 2018, 1 commit
    • dm thin: stop no_space_timeout worker when switching to write-mode · 75294442
      Authored by Hou Tao
      Now both check_for_space() and do_no_space_timeout() will read & write
      pool->pf.error_if_no_space.  If these functions run concurrently, as
      shown in the following case, the default setting of "queue_if_no_space"
      can get lost.
      
      precondition:
          * error_if_no_space = false (aka "queue_if_no_space")
          * pool is in Out-of-Data-Space (OODS) mode
          * no_space_timeout worker has been queued
      
      CPU 0:                          CPU 1:
      // delete a thin device
      process_delete_mesg()
      // check_for_space() invoked by commit()
      set_pool_mode(pool, PM_WRITE)
          pool->pf.error_if_no_space = \
           pt->requested_pf.error_if_no_space
      
      				// timeout, pool is still in OODS mode
      				do_no_space_timeout
      				    // "queue_if_no_space" config is lost
      				    pool->pf.error_if_no_space = true
          pool->pf.mode = new_mode
      
      Fix it by stopping no_space_timeout worker when switching to write mode.
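
      The essence of the fix, sketched (inside set_pool_mode()):

          case PM_WRITE:
                  /*
                   * Leaving OODS mode: make sure a stale timeout can no
                   * longer flip error_if_no_space behind our back.
                   */
                  if (old_mode == PM_OUT_OF_DATA_SPACE)
                          cancel_delayed_work_sync(&pool->no_space_timeout);
                  /* ... rest of the PM_WRITE setup ... */
                  break;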
      
      Fixes: bcc696fa ("dm thin: stay in out-of-data-space mode once no_space_timeout expires")
      Cc: stable@vger.kernel.org
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  13. 01 August 2018, 1 commit
  14. 30 July 2018, 1 commit
    • dm thin: include metadata_low_watermark threshold in pool status · 63c8ecb6
      Authored by Andy Grover
      The metadata low watermark threshold is set by the kernel.  But the
      kernel depends on userspace to extend the thinpool metadata device when
      the threshold is crossed.
      
      Since the metadata low watermark threshold is not visible to userspace,
      upon receiving an event, userspace cannot tell that the kernel wants the
      metadata device extended, instead of some other eventing condition.
      Making it visible (but not settable) enables userspace to affirmatively
      know the kernel is asking for a metadata device extension, by comparing
      metadata_low_watermark against nr_free_blocks_metadata, also reported in
      status.
      
      Current solutions like dmeventd have their own thresholds for extending
      the data and metadata devices, and both devices are checked against
      their thresholds on each event.  This lessens the value of the kernel-set
      threshold, since userspace will either extend the metadata device sooner,
      when receiving another event; or will receive the metadata lowater event
      and do nothing, if dmeventd's threshold is less than the kernel's.
      (This second case is dangerous. The metadata lowater event will not be
      re-sent, so no further event will be generated before the metadata
      device is out of space, unless some other event causes userspace to
      recheck its thresholds.)
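
      A sketch of the status addition (calc_metadata_threshold() is assumed
      to be the helper the pool already uses to arm its threshold callback):

          /* In pool_status(): append the kernel-set threshold so userspace
           * can compare it against nr_free_blocks_metadata. */
          DMEMIT(" %llu", (unsigned long long)calc_metadata_threshold(pt));
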
      Signed-off-by: Andy Grover <agrover@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  15. 27 June 2018, 1 commit
    • dm thin: handle running out of data space vs concurrent discard · a685557f
      Authored by Mike Snitzer
      Discards issued to a DM thin device can complete to userspace (via
      fstrim) _before_ the metadata changes associated with the discards are
      reflected in the thinp superblock (e.g. free blocks). As such, if a
      user constructs a test that loops repeatedly over these steps, block
      allocation can fail due to discards not having completed yet:
      1) fill thin device via filesystem file
      2) remove file
      3) fstrim
      
      From initial report, here:
      https://www.redhat.com/archives/dm-devel/2018-April/msg00022.html
      
      "The root cause of this issue is that dm-thin will first remove
      mapping and increase corresponding blocks' reference count to prevent
      them from being reused before DISCARD bios get processed by the
      underlying layers. However, increasing blocks' reference count could
      also increase the nr_allocated_this_transaction in struct sm_disk
      which makes smd->old_ll.nr_allocated +
      smd->nr_allocated_this_transaction bigger than smd->old_ll.nr_blocks.
      In this case, alloc_data_block() will never commit metadata to reset
      the begin pointer of struct sm_disk, because sm_disk_get_nr_free()
      always return an underflow value."
      
      While there is room for improvement to the space-map accounting that
      thinp is making use of, the reality is this test is inherently racy and
      will result in the previous iteration's fstrim discard(s) completing
      concurrently with block allocation, via dd, in the next iteration of
      the loop.
      
      No amount of space map accounting improvements will be able to allow
      users to use a block before a discard of that block has completed.
      
      So the best we can really do is allow DM thinp to gracefully handle
      such aggressive use of all the pool's data by degrading the pool into
      out-of-data-space (OODS) mode. We _should_ get that behaviour already
      (if space map accounting didn't falsely cause alloc_data_block() to
      believe free space was available); but short of that, we handle the
      current reality that dm_pool_alloc_data_block() can return -ENOSPC.
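
      Sketch of the graceful handling in alloc_data_block() (error helper
      name as in the driver):

          r = dm_pool_alloc_data_block(pool->pmd, result);
          if (r) {
                  if (r == -ENOSPC)
                          /* Out of data space: degrade to OODS mode. */
                          set_pool_mode(pool, PM_OUT_OF_DATA_SPACE);
                  else
                          metadata_operation_failed(pool,
                                  "dm_pool_alloc_data_block", r);
                  return r;
          }
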
      Reported-by: Dennis Yang <dennisyang@qnap.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  16. 13 June 2018, 1 commit
    • treewide: Use array_size() in vmalloc() · 42bc47b3
      Authored by Kees Cook
      The vmalloc() function has no 2-factor argument form, so multiplication
      factors need to be wrapped in array_size(). This patch replaces cases of:
      
              vmalloc(a * b)
      
      with:
              vmalloc(array_size(a, b))
      
      as well as handling cases of:
      
              vmalloc(a * b * c)
      
      with:
      
              vmalloc(array3_size(a, b, c))
      
      This does, however, attempt to ignore constant size factors like:
      
              vmalloc(4 * 1024)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
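
      A before/after illustration ('struct entry' and 'nr_entries' are made
      up for the example):

          #include <linux/overflow.h>
          #include <linux/vmalloc.h>

          struct entry *buf;

          /* before: the multiplication can overflow and wrap silently */
          buf = vmalloc(nr_entries * sizeof(*buf));

          /* after: array_size() saturates to SIZE_MAX on overflow, so the
           * allocation fails instead of returning a too-small buffer */
          buf = vmalloc(array_size(nr_entries, sizeof(*buf)));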
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        vmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        vmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        vmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
        vmalloc(
      -	SIZE * COUNT
      +	array_size(COUNT, SIZE)
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        vmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        vmalloc(C1 * C2 * C3, ...)
      |
        vmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants.
      @@
      expression E1, E2;
      constant C1, C2;
      @@
      
      (
        vmalloc(C1 * C2, ...)
      |
        vmalloc(
      -	E1 * E2
      +	array_size(E1, E2)
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
  17. 08 June 2018, 1 commit
  18. 05 June 2018, 1 commit
  19. 31 May 2018, 1 commit
  20. 04 April 2018, 1 commit
  21. 30 January 2018, 1 commit
  22. 17 January 2018, 1 commit
  23. 04 December 2017, 1 commit
    • dm: fix various targets to dm_register_target after module __init resources created · 7e6358d2
      Authored by monty_pavel@sina.com
      A NULL pointer is seen if two concurrent "vgchange -ay -K <vg name>"
      processes race to load the dm-thin-pool module:
      
       PID: 25992 TASK: ffff883cd7d23500 CPU: 4 COMMAND: "vgchange"
        #0 [ffff883cd743d600] machine_kexec at ffffffff81038fa9
        #1 [ffff883cd743d660] crash_kexec at ffffffff810c5992
        #2 [ffff883cd743d730] oops_end at ffffffff81515c90
        #3 [ffff883cd743d760] no_context at ffffffff81049f1b
        #4 [ffff883cd743d7b0] __bad_area_nosemaphore at ffffffff8104a1a5
        #5 [ffff883cd743d800] bad_area at ffffffff8104a2ce
        #6 [ffff883cd743d830] __do_page_fault at ffffffff8104aa6f
        #7 [ffff883cd743d950] do_page_fault at ffffffff81517bae
        #8 [ffff883cd743d980] page_fault at ffffffff81514f95
           [exception RIP: kmem_cache_alloc+108]
           RIP: ffffffff8116ef3c RSP: ffff883cd743da38 RFLAGS: 00010046
           RAX: 0000000000000004 RBX: ffffffff81121b90 RCX: ffff881bf1e78cc0
           RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 0000000000000000
           RBP: ffff883cd743da68 R8: ffff881bf1a4eb00 R9: 0000000080042000
           R10: 0000000000002000 R11: 0000000000000000 R12: 00000000000000d0
           R13: 0000000000000000 R14: 00000000000000d0 R15: 0000000000000246
           ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
        #9 [ffff883cd743da70] mempool_alloc_slab at ffffffff81121ba5
       #10 [ffff883cd743da80] mempool_create_node at ffffffff81122083
       #11 [ffff883cd743dad0] mempool_create at ffffffff811220f4
       #12 [ffff883cd743dae0] pool_ctr at ffffffffa08de049 [dm_thin_pool]
       #13 [ffff883cd743dbd0] dm_table_add_target at ffffffffa0005f2f [dm_mod]
       #14 [ffff883cd743dc30] table_load at ffffffffa0008ba9 [dm_mod]
       #15 [ffff883cd743dc90] ctl_ioctl at ffffffffa0009dc4 [dm_mod]
      
      The race results in a NULL pointer because:
      
      Process A (vgchange -ay -K):
       	a. send DM_LIST_VERSIONS_CMD ioctl;
       	b. pool_target not registered;
       	c. modprobe dm_thin_pool and wait until end.
      
      Process B (vgchange -ay -K):
       	a. send DM_LIST_VERSIONS_CMD ioctl;
       	b. pool_target registered;
       	c. table_load->dm_table_add_target->pool_ctr;
       	d. _new_mapping_cache is NULL and panic.
      Note:
       	1. process A and process B are two concurrent processes.
       	2. pool_target can be detected by process B but
       	_new_mapping_cache initialization has not ended.
      
      To fix dm-thin-pool, and other targets (cache, multipath, and snapshot)
      with the same problem, simply call dm_register_target() after all
      resources created during module init (as labelled with __init) are
      finished.
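
      A sketch of the corrected ordering (simplified to a single target;
      names follow dm-thin):

          static int __init dm_thin_init(void)
          {
                  int r;

                  /* Create all module-init resources first... */
                  _new_mapping_cache = KMEM_CACHE(dm_thin_new_mapping, 0);
                  if (!_new_mapping_cache)
                          return -ENOMEM;

                  /* ...and only then make the target visible to ioctls. */
                  r = dm_register_target(&thin_target);
                  if (r) {
                          kmem_cache_destroy(_new_mapping_cache);
                          return r;
                  }

                  return 0;
          }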
      
      Cc: stable@vger.kernel.org
      Signed-off-by: monty <monty_pavel@sina.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  24. 25 October 2017, 1 commit
    • locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE() · 6aa7de05
      Authored by Mark Rutland
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
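
      A concrete before/after (the accessed field is arbitrary):

          /* before */
          mode = ACCESS_ONCE(pool->pf.mode);
          ACCESS_ONCE(pool->pf.mode) = new_mode;

          /* after */
          mode = READ_ONCE(pool->pf.mode);
          WRITE_ONCE(pool->pf.mode, new_mode);
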
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  25. 28 August 2017, 1 commit
  26. 24 August 2017, 1 commit
    • block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Authored by Christoph Hellwig
      This way we don't need a block_device structure to submit I/O. The
      block_device has different lifetime rules from the gendisk and
      request_queue, and is usually only available when the block device node
      is open. Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
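
      The replacement is roughly (per the helper this series introduces):

          /* before: bio->bi_bdev = bdev; */

          /* after: a gendisk plus partition number, set via the helper */
          static inline void bio_set_dev(struct bio *bio,
                                         struct block_device *bdev)
          {
                  bio->bi_disk = bdev->bd_disk;
                  bio->bi_partno = bdev->bd_partno;
          }
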
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  27. 28 June 2017, 1 commit
    • dm thin: do not queue freed thin mapping for next stage processing · 00a0ea33
      Authored by Vallish Vaidyeshwara
      process_prepared_discard_passdown_pt1() should clean up the
      dm_thin_new_mapping in error cases.
      
      dm_pool_inc_data_range() can fail trying to get a block reference:
      
      metadata operation 'dm_pool_inc_data_range' failed: error = -61
      
      When dm_pool_inc_data_range() fails, dm thin aborts the current
      metadata transaction and marks the pool as PM_READ_ONLY. The memory for
      the thin mapping is released as well. However, the current thin mapping
      is still queued onto the next stage via queue_passdown_pt2() or
      passdown_endio(). When this dangling thin mapping is processed and
      accessed in the next stage, device mapper crashes.
      
      Code flow without fix:
      -> process_prepared_discard_passdown_pt1(m)
         -> dm_thin_remove_range()
         -> discard passdown
            --> passdown_endio(m) queues m onto next stage
         -> dm_pool_inc_data_range() fails, frees memory m
                  but does not remove it from next stage queue
      
      -> process_prepared_discard_passdown_pt2(m)
         -> processes freed memory m and crashes
      
      One such stack:
      
      Call Trace:
      [<ffffffffa037a46f>] dm_cell_release_no_holder+0x2f/0x70 [dm_bio_prison]
      [<ffffffffa039b6dc>] cell_defer_no_holder+0x3c/0x80 [dm_thin_pool]
      [<ffffffffa039b88b>] process_prepared_discard_passdown_pt2+0x4b/0x90 [dm_thin_pool]
      [<ffffffffa0399611>] process_prepared+0x81/0xa0 [dm_thin_pool]
      [<ffffffffa039e735>] do_worker+0xc5/0x820 [dm_thin_pool]
      [<ffffffff8152bf54>] ? __schedule+0x244/0x680
      [<ffffffff81087e72>] ? pwq_activate_delayed_work+0x42/0xb0
      [<ffffffff81089f53>] process_one_work+0x153/0x3f0
      [<ffffffff8108a71b>] worker_thread+0x12b/0x4b0
      [<ffffffff8108a5f0>] ? rescuer_thread+0x350/0x350
      [<ffffffff8108fd6a>] kthread+0xca/0xe0
      [<ffffffff8108fca0>] ? kthread_park+0x60/0x60
      [<ffffffff81530b45>] ret_from_fork+0x25/0x30
      
      The fix is to first take the block ref count for the discarded block
      and then do a passdown discard of this block. If taking the block ref
      count fails, bail out: abort the current metadata transaction, mark the
      pool as PM_READ_ONLY, and free the current thin mapping memory (the
      existing error handling code) without queueing this thin mapping onto
      the next stage of processing. If taking the block ref count succeeds,
      passdown discard the block; the discard callback, passdown_endio(),
      will then queue this thin mapping onto the next stage of processing.
      
      Code flow with fix:
      -> process_prepared_discard_passdown_pt1(m)
         -> dm_thin_remove_range()
         -> dm_pool_inc_data_range()
            --> if fails, free memory m and bail out
         -> discard passdown
            --> passdown_endio(m) queues m onto next stage
      
      Cc: stable <stable@vger.kernel.org> # v4.9+
      Reviewed-by: Eduardo Valentin <eduval@amazon.com>
      Reviewed-by: Cristian Gafton <gafton@amazon.com>
      Reviewed-by: Anchal Agarwal <anchalag@amazon.com>
      Signed-off-by: Vallish Vaidyeshwara <vallish@amazon.com>
      Reviewed-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  28. 09 June 2017, 2 commits
  29. 25 April 2017, 1 commit
    • dm thin: fix a memory leak when passing discard bio down · 948f581a
      Authored by Dennis Yang
      dm-thin does not free the discard_parent bio after all chained
      sub-bios have finished. The following kmemleak report could be
      observed after a pool with the discard_passdown option processes
      discard bios on linux v4.11-rc7. To fix this, we drop the
      discard_parent bio reference when its endio, passdown_endio(), is
      called.
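
      The fix, sketched (matching the flow described above):

          static void passdown_endio(struct bio *bio)
          {
                  /*
                   * Queue pt2 as before, then drop the reference taken on
                   * discard_parent so the chained parent bio is freed.
                   */
                  queue_passdown_pt2(bio->bi_private);
                  bio_put(bio);
          }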
      
      unreferenced object 0xffff8803d6b29700 (size 256):
        comm "kworker/u8:0", pid 30349, jiffies 4379504020 (age 143002.776s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          01 00 00 00 00 00 00 f0 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81a5efd9>] kmemleak_alloc+0x49/0xa0
          [<ffffffff8114ec34>] kmem_cache_alloc+0xb4/0x100
          [<ffffffff8110eec0>] mempool_alloc_slab+0x10/0x20
          [<ffffffff8110efa5>] mempool_alloc+0x55/0x150
          [<ffffffff81374939>] bio_alloc_bioset+0xb9/0x260
          [<ffffffffa018fd20>] process_prepared_discard_passdown_pt1+0x40/0x1c0 [dm_thin_pool]
          [<ffffffffa018b409>] break_up_discard_bio+0x1a9/0x200 [dm_thin_pool]
          [<ffffffffa018b484>] process_discard_cell_passdown+0x24/0x40 [dm_thin_pool]
          [<ffffffffa018b24d>] process_discard_bio+0xdd/0xf0 [dm_thin_pool]
          [<ffffffffa018ecf6>] do_worker+0xa76/0xd50 [dm_thin_pool]
          [<ffffffff81086239>] process_one_work+0x139/0x370
          [<ffffffff810867b1>] worker_thread+0x61/0x450
          [<ffffffff8108b316>] kthread+0xd6/0xf0
          [<ffffffff81a6cd1f>] ret_from_fork+0x3f/0x70
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Dennis Yang <dennisyang@qnap.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  30. 09 April 2017, 1 commit
  31. 08 March 2017, 1 commit
  32. 02 February 2017, 1 commit
  33. 28 January 2017, 1 commit
  34. 08 August 2016, 1 commit
    • block: rename bio bi_rw to bi_opf · 1eff9d32
      Authored by Jens Axboe
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower portion
      and the op code in the higher portions. This means that old code that
      relies on manually setting bi_rw is most likely going to be broken.
      Instead of letting that brokenness linger, rename the member to force
      old and out-of-tree code to break at compile time instead of at
      runtime.
      
      No intended functional changes in this commit.
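
      Converting is mechanical (REQ_FUA chosen arbitrarily):

          /* before the rename (no longer compiles):
           *     if (bio->bi_rw & REQ_FUA)
           */
          if (bio->bi_opf & REQ_FUA)
                  handle_fua(bio);        /* hypothetical handler */
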
      Signed-off-by: Jens Axboe <axboe@fb.com>
  35. 21 July 2016, 1 commit
  36. 08 June 2016, 3 commits