1. 14 Jan, 2020 6 commits
  2. 12 Dec, 2019 3 commits
  3. 10 Dec, 2019 1 commit
  4. 07 Dec, 2019 1 commit
    • dm thin: Flush data device before committing metadata · 694cfe7f
      Nikos Tsironis committed
      The thin provisioning target maintains per thin device mappings that map
      virtual blocks to data blocks in the data device.
      
      When we write to a shared block, in case of internal snapshots, or
      provision a new block, in case of external snapshots, we copy the shared
      block to a new data block (COW), update the mapping for the relevant
      virtual block and then issue the write to the new data block.
      
      Suppose the data device has a volatile write-back cache and the
      following sequence of events occurs:
      
      1. We write to a shared block
      2. A new data block is allocated
      3. We copy the shared block to the new data block using kcopyd (COW)
      4. We insert the new mapping for the virtual block in the btree for that
         thin device.
      5. The commit timeout expires and we commit the metadata, that now
         includes the new mapping from step (4).
      6. The system crashes and the data device's cache has not been flushed,
         meaning that the COWed data are lost.
      
      The next time we read that virtual block of the thin device we read it
      from the data block allocated in step (2), since the metadata have been
      successfully committed. The data are lost due to the crash, so we read
      garbage instead of the old, shared data.
      
      This has the following implications:
      
      1. In case of writes to shared blocks, with size smaller than the pool's
         block size (which means we first copy the whole block and then issue
         the smaller write), we corrupt data that the user never touched.
      
      2. In case of writes to shared blocks, with size equal to the device's
         logical block size, we fail to provide atomic sector writes. When the
         system recovers the user will read garbage from that sector instead
         of the old data or the new data.
      
      3. Even for writes to shared blocks, with size equal to the pool's block
         size (overwrites), after the system recovers, the written sectors
         will contain garbage instead of a random mix of sectors containing
         either old data or new data, thus we fail again to provide atomic
         sector writes.
      
      4. Even when the user flushes the thin device, because we first commit
         the metadata and then pass down the flush, the same risk for
         corruption exists (if the system crashes after the metadata have been
         committed but before the flush is passed down to the data device.)
      
      The only case which is unaffected is that of writes with size equal to
      the pool's block size and with the FUA flag set. But, because FUA writes
      trigger metadata commits, this case can trigger the corruption
      indirectly.
      
      Moreover, apart from internal and external snapshots, the same issue
      exists for newly provisioned blocks, when block zeroing is enabled.
      After the system recovers the provisioned blocks might contain garbage
      instead of zeroes.
      
      To solve this and avoid the potential data corruption we flush the
      pool's data device **before** committing its metadata.
      
      This ensures that the data blocks of any newly inserted mappings are
      properly written to non-volatile storage and won't be lost in case of a
      crash.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
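The crash scenario in the commit message above can be modelled with a small user-space sketch. This is not the dm-thin code: `toy_dev`, `toy_write()`, `toy_flush()`, `toy_crash()` and `cow_commit_crash()` are hypothetical names, and the "device" is just two arrays standing in for a volatile cache and persistent media.

```c
#include <stdbool.h>
#include <string.h>

#define NR_BLOCKS 4

/* Hypothetical toy data device with a volatile write-back cache. */
struct toy_dev {
	char cache[NR_BLOCKS];  /* writes land here first */
	bool dirty[NR_BLOCKS];  /* cached blocks not yet on media */
	char media[NR_BLOCKS];  /* contents that survive a crash */
};

static void toy_write(struct toy_dev *d, int blk, char v)
{
	d->cache[blk] = v;
	d->dirty[blk] = true;
}

/* Flush: push every dirty cached block to persistent media. */
static void toy_flush(struct toy_dev *d)
{
	for (int i = 0; i < NR_BLOCKS; i++) {
		if (d->dirty[i]) {
			d->media[i] = d->cache[i];
			d->dirty[i] = false;
		}
	}
}

/* Crash: whatever was only in the cache is lost. */
static void toy_crash(struct toy_dev *d)
{
	memset(d->cache, 0, sizeof(d->cache));
	memset(d->dirty, 0, sizeof(d->dirty));
}

/*
 * COW shared block 0 into new block 1, commit the new mapping, crash.
 * Returns what a later read of the virtual block finds on media.
 */
static char cow_commit_crash(bool flush_before_commit)
{
	struct toy_dev d = { .media = { 'A', '?', '?', '?' } };
	int committed_mapping = 0;      /* virtual block -> data block 0 */

	toy_write(&d, 1, 'A');          /* step 3: copy the shared block */
	if (flush_before_commit)
		toy_flush(&d);          /* the ordering this commit adds */
	committed_mapping = 1;          /* steps 4-5: mapping committed */
	toy_crash(&d);                  /* step 6 */
	return d.media[committed_mapping];
}
```

In this model, `cow_commit_crash(false)` reads back the stale placeholder in block 1, while `cow_commit_crash(true)` reads the copied data, which is the ordering argument the commit makes.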
  5. 06 Dec, 2019 5 commits
    • dm thin metadata: Add support for a pre-commit callback · ecda7c02
      Nikos Tsironis committed
      Add support for one pre-commit callback which is run right before the
      metadata are committed.
      
      This allows the thin provisioning target to run a callback before the
      metadata are committed and is required by the next commit.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
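A single pre-commit hook of the kind described above might look like the following user-space sketch. All names (`toy_metadata`, `toy_register_pre_commit()`, `toy_commit()`, the two callbacks) are hypothetical, and aborting the commit when the callback fails is an assumption of this sketch, not something the commit message states.

```c
#include <stddef.h>

typedef int (*pre_commit_fn)(void *context);

/* Hypothetical metadata object that supports one pre-commit callback. */
struct toy_metadata {
	pre_commit_fn pre_commit;
	void *pre_commit_ctx;
	int commits;            /* number of successful commits */
};

static void toy_register_pre_commit(struct toy_metadata *md,
				    pre_commit_fn fn, void *ctx)
{
	md->pre_commit = fn;
	md->pre_commit_ctx = ctx;
}

static int toy_commit(struct toy_metadata *md)
{
	/* Run the callback right before the metadata are committed. */
	if (md->pre_commit) {
		int r = md->pre_commit(md->pre_commit_ctx);

		if (r)
			return r;   /* assumption: a failed callback aborts */
	}
	md->commits++;
	return 0;
}

/* Example callback standing in for "flush the data device". */
static int toy_flush_cb(void *ctx)
{
	(*(int *)ctx)++;
	return 0;
}

/* Example callback that always fails with an arbitrary error code. */
static int toy_failing_cb(void *ctx)
{
	(void)ctx;
	return -5;
}
```

Registering `toy_flush_cb` makes every `toy_commit()` call count one flush before the commit is recorded, which mirrors how the next commit in the series uses the hook.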
    • dm clone: Flush destination device before committing metadata · 8b3fd1f5
      Nikos Tsironis committed
      dm-clone maintains an on-disk bitmap which records which regions are
      valid in the destination device, i.e., which regions have already been
      hydrated, or have been written to directly, via user I/O.
      
      Setting a bit in the on-disk bitmap means the corresponding region is
      valid in the destination device and we redirect all I/O regarding it to
      the destination device.
      
      Suppose the destination device has a volatile write-back cache and the
      following sequence of events occurs:
      
      1. A region gets hydrated, either through the background hydration or
         because it was written to directly, via user I/O.
      
      2. The commit timeout expires and we commit the metadata, marking that
         region as valid in the destination device.
      
      3. The system crashes and the destination device's cache has not been
         flushed, meaning the region's data are lost.
      
      The next time we read that region we read it from the destination
      device, since the metadata have been successfully committed, but the
      data are lost due to the crash, so we read garbage instead of the old
      data.
      
      This has several implications:
      
      1. In case of background hydration or of writes with size smaller than
         the region size (which means we first copy the whole region and then
         issue the smaller write), we corrupt data that the user never
         touched.
      
      2. In case of writes with size equal to the device's logical block size,
         we fail to provide atomic sector writes. When the system recovers the
         user will read garbage from the sector instead of the old data or the
         new data.
      
      3. In case of writes without the FUA flag set, after the system
         recovers, the written sectors will contain garbage instead of a
         random mix of sectors containing either old data or new data, thus we
         fail again to provide atomic sector writes.
      
      4. Even when the user flushes the dm-clone device, because we first
         commit the metadata and then pass down the flush, the same risk for
         corruption exists (if the system crashes after the metadata have been
         committed but before the flush is passed down).
      
      The only case which is unaffected is that of writes with size equal to
      the region size and with the FUA flag set. But, because FUA writes
      trigger metadata commits, this case can trigger the corruption
      indirectly.
      
      To solve this and avoid the potential data corruption we flush the
      destination device **before** committing the metadata.
      
      This ensures that any freshly hydrated regions, for which we commit the
      metadata, are properly written to non-volatile storage and won't be lost
      in case of a crash.
      
      Fixes: 7431b783 ("dm: add clone target")
      Cc: stable@vger.kernel.org # v5.4+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm clone metadata: Use a two phase commit · 8fdbfe8d
      Nikos Tsironis committed
      Split the metadata commit in two parts:
      
      1. dm_clone_metadata_pre_commit(): Prepare the current transaction for
         committing. After this is called, all subsequent metadata updates,
         done through either dm_clone_set_region_hydrated() or
         dm_clone_cond_set_range(), will be part of the next transaction.
      
      2. dm_clone_metadata_commit(): Actually commit the current transaction
         to disk and start a new transaction.
      
      This is required by the following commit. It allows dm-clone to flush
      the destination device after step (1) to ensure that all freshly
      hydrated regions, for which we are updating the metadata, are properly
      written to non-volatile storage and won't be lost in case of a crash.
      
      Fixes: 7431b783 ("dm: add clone target")
      Cc: stable@vger.kernel.org # v5.4+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
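The two-phase split can be illustrated with a toy bitmap model. This is a sketch only: `toy_clone_md` and the `toy_*` functions are hypothetical stand-ins for the real dm-clone metadata code, and a 32-bit word replaces the on-disk bitmap (so region numbers must be below 32).

```c
#include <stdint.h>

/* Hypothetical model of dm-clone metadata with at most 32 regions. */
struct toy_clone_md {
	uint32_t on_disk;     /* bitmap as last committed */
	uint32_t current;     /* regions hydrated in the open transaction */
	uint32_t committing;  /* snapshot taken by the pre-commit step */
};

static void toy_set_region_hydrated(struct toy_clone_md *md, unsigned region)
{
	md->current |= UINT32_C(1) << region;
}

/* Phase 1: freeze the current transaction; later updates go to the next. */
static void toy_pre_commit(struct toy_clone_md *md)
{
	md->committing = md->current;
	md->current = 0;
}

/* Phase 2: persist exactly what phase 1 froze. */
static void toy_commit(struct toy_clone_md *md)
{
	md->on_disk |= md->committing;
	md->committing = 0;
}
```

A caller would flush the destination device between `toy_pre_commit()` and `toy_commit()`. A region hydrated during that flush stays in `current` and is only committed by the following transaction, after its own flush.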
    • dm clone metadata: Track exact changes per transaction · e6a505f3
      Nikos Tsironis committed
      Extend struct dirty_map with a second bitmap which tracks the exact
      regions that were hydrated during the current metadata transaction.
      
      Moreover, fix __flush_dmap() to only commit the metadata of the regions
      that were hydrated during the current transaction.
      
      This is required by the following commits to fix a data corruption bug.
      
      Fixes: 7431b783 ("dm: add clone target")
      Cc: stable@vger.kernel.org # v5.4+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm btree: increase rebalance threshold in __rebalance2() · 474e5595
      Hou Tao committed
      We got the following warnings from thin_check during thin-pool setup:
      
        $ thin_check /dev/vdb
        examining superblock
        examining devices tree
          missing devices: [1, 84]
            too few entries in btree_node: 41, expected at least 42 (block 138, max_entries = 126)
        examining mapping tree
      
      The symptom is that the number of entries in one node of the details_info
      tree is less than (max_entries / 3). It can be easily reproduced with the
      following procedure:
      
        1. Create a thin pool.
        2. Assume the max entries of the details_info tree is 126.
        3. Create 127 thin devices (e.g. 1~127) so that the root node becomes
           full and then splits.
        4. Remove the first 43 thin devices (e.g. 1~43) so that the children
           rebalance repeatedly.
        5. Stop the thin pool.
        6. Run thin_check.
      
      The root cause is that the B-tree removal procedure in __rebalance2()
      doesn't guarantee the invariant that the number of entries in a
      non-root node is >= (max_entries / 3).
      
      Simply fix the problem by increasing the rebalance threshold to
      make sure the number of entries in each child will be greater
      than or equal to (max_entries / 3 + 1), so no matter which
      child is used for removal, the number will still be valid.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
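The arithmetic behind the thin_check warning can be worked through with a simplified model. The helper names are hypothetical, the even split is a simplification of what __rebalance2() does, and the old threshold of `2 * (max_entries / 3) + 1` is an assumption made for illustration; only the new lower bound of `max_entries / 3 + 1` per child comes from the commit message.

```c
/* Minimum number of entries a non-root node must keep. */
static unsigned merge_threshold(unsigned max_entries)
{
	return max_entries / 3;
}

/*
 * Simplified two-sibling rebalance: merge when the combined count is
 * below the threshold, otherwise split the entries evenly.
 * Returns the size of the smaller child, or 0 if the nodes were merged.
 */
static unsigned min_child_after_rebalance(unsigned nr_left, unsigned nr_right,
					  unsigned threshold)
{
	unsigned total = nr_left + nr_right;

	if (total < threshold)
		return 0;       /* merged into a single node */
	return total / 2;       /* smaller half after redistribution */
}
```

With max_entries = 126 the minimum is 42. Under the assumed old threshold of 85, siblings holding 42 and 43 entries are redistributed rather than merged, leaving a child with 42 entries; removing one entry from it gives 41, the value thin_check complains about. With a threshold of 2 * (42 + 1) = 86 the same siblings are merged, and any pair that is redistributed leaves at least 43 entries per child, so one removal still leaves 42.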
  6. 03 Dec, 2019 2 commits
  7. 26 Nov, 2019 1 commit
    • dm mpath: remove harmful bio-based optimization · dbaf971c
      Mike Snitzer committed
      Remove the branching for the edge case where no SCSI device handler
      exists.  The __map_bio_fast() method was far too limited, selecting a
      new pathgroup or path only if there was a path failure. Fix this by
      eliminating it in favor of __map_bio().  __map_bio()'s extra SCSI
      device handler specific MPATHF_PG_INIT_REQUIRED test is not in the fast
      path anyway.
      
      This change restores full path selector functionality for bio-based
      configurations that don't have a SCSI device handler.  But it should be
      noted that the path selectors do have an impact on performance for
      certain networks that are extremely fast (and don't require frequent
      switching).
      
      Fixes: 8d47e659 ("dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks")
      Cc: stable@vger.kernel.org
      Reported-by: Drew Hastings <dhastings@crucialwebhost.com>
      Suggested-by: Martin Wilck <mwilck@suse.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  8. 21 Nov, 2019 1 commit
  9. 20 Nov, 2019 1 commit
  10. 18 Nov, 2019 2 commits
    • Revert "bcache: fix fifo index swapping condition in journal_pin_cmp()" · 00b89892
      Jens Axboe committed
      Coly says:
      
      "Guoju Fang talked to me today; he told me this change was unnecessary
      and I was overthinking it.
      
      Then I realized fifo_idx() uses a mask to handle the array index overflow
      condition, so the index swap in journal_pin_cmp() won't happen. And yes,
      Guoju and Kent are correct.
      
      Since you already applied this patch, can you please remove this
      patch from your for-next branch? This single patch does not break
      anything, but it is unnecessary at this moment."
      
      This reverts commit c0e0954e.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
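The point about the mask can be seen in a small sketch of a power-of-two ring buffer. `toy_fifo` and `toy_fifo_idx()` are hypothetical names that work on slot numbers; the real bcache fifo works on pointers, so this only illustrates the masking idea.

```c
#include <stddef.h>

/* Hypothetical ring buffer whose size (mask + 1) is a power of two. */
struct toy_fifo {
	size_t front;  /* slot of the oldest entry */
	size_t back;   /* slot where the next entry will go */
	size_t mask;   /* size - 1 */
};

/*
 * Logical age of the entry stored in slot 'pos', counted from the front.
 * The unsigned subtraction may wrap, but masking brings the result back
 * into [0, size), so a slot that wrapped past the end of the array still
 * gets a larger index than older slots.
 */
static size_t toy_fifo_idx(const struct toy_fifo *f, size_t pos)
{
	return (pos - f->front) & f->mask;
}
```

For an 8-slot buffer with `front == 6`, slots 6, 7, 0 and 1 map to logical indexes 0, 1, 2 and 3, so comparing the masked indexes already orders entries from oldest to newest.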
    • dm thin: wakeup worker only when deferred bios exist · d256d796
      Jeffle Xu committed
      Single thread fio test (read, bs=4k, ioengine=libaio, iodepth=128,
      numjobs=1) over dm-thin device has poor performance versus bare nvme
      device.
      
      Further investigation with perf indicates that queue_work_on() consumes
      over 20% CPU time when doing IO over dm-thin device. The call stack is
      as follows.
      
      - 40.57% thin_map
          + 22.07% queue_work_on
          + 9.95% dm_thin_find_block
          + 2.80% cell_defer_no_holder
            1.91% inc_all_io_entry.isra.33.part.34
          + 1.78% bio_detain.isra.35
      
      In cell_defer_no_holder(), wake_worker() is always called, no matter
      whether tc->deferred_bio_list is empty or not. In the single-thread
      I/O model, this list is most likely empty. So skip waking up the worker
      thread if tc->deferred_bio_list is empty.
      
      Single thread IO performance improves from 448 MiB/s to 646 MiB/s (+44%)
      once the needless wake_worker() calls are properly skipped.
      
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
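The idea of waking the worker only when there is something to do can be sketched as follows. `toy_thin`, `toy_bio`, `toy_wake_worker()` and `toy_cell_defer()` are hypothetical user-space stand-ins; the wake-up is reduced to a counter so its cost can be observed.

```c
#include <stddef.h>

struct toy_bio {
	struct toy_bio *next;
};

/* Hypothetical thin device: a deferred list plus a wake-up counter. */
struct toy_thin {
	struct toy_bio *deferred;
	int wakeups;
};

/* Stand-in for the (comparatively expensive) worker wake-up. */
static void toy_wake_worker(struct toy_thin *tc)
{
	tc->wakeups++;
}

/* Move released bios to the deferred list; wake only if it is non-empty. */
static void toy_cell_defer(struct toy_thin *tc, struct toy_bio *released)
{
	while (released) {
		struct toy_bio *next = released->next;

		released->next = tc->deferred;
		tc->deferred = released;
		released = next;
	}

	if (tc->deferred)
		toy_wake_worker(tc);
}
```

When nothing was released and the deferred list is empty, `toy_cell_defer()` returns without touching the worker, which is the common case the commit describes for single-threaded I/O.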
  11. 16 Nov, 2019 1 commit
    • dm integrity: fix excessive alignment of metadata runs · d537858a
      Mikulas Patocka committed
      Metadata runs are supposed to be aligned on a 4k boundary (so that they
      work efficiently with disks with 4k sectors). However, a programming
      bug aligned them on a 128k boundary instead. The unused space is
      wasted.
      
      Fix this bug by providing a proper 4k alignment. In order to keep
      existing volumes working, we introduce a new flag SB_FLAG_FIXED_PADDING
      - when the flag is clear, we calculate the padding the old way. In order
      to make sure that the old version cannot mount the volume created by the
      new version, we increase superblock version to 4.
      
      Also in order to not break with old integritysetup, we fix alignment
      only if the parameter "fix_padding" is present when formatting the
      device.
      
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
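To get a feel for how much space the wrong alignment can waste, here is a minimal rounding sketch. `toy_round_up()` is a hypothetical helper working in bytes, and the 5000-byte run is an invented figure; the real code works with its own units and layout.

```c
#include <stdint.h>

/* Round x up to the next multiple of align (align must be a power of two). */
static uint64_t toy_round_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}
```

A 5000-byte run padded to 4k occupies 8192 bytes, while the same run padded to 128k occupies 131072 bytes, so nearly all of the 128k slot is padding.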
  12. 15 Nov, 2019 2 commits
  13. 14 Nov, 2019 12 commits
    • bcache: don't export symbols · 15fbb231
      Christoph Hellwig committed
      None of the exported bcache symbols are actually used anywhere.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: remove the extra cflags for request.o · 651bbba5
      Christoph Hellwig committed
      There is no block directory this file needs includes from.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: at least try to shrink 1 node in bch_mca_scan() · 9fcc34b1
      Coly Li committed
      In bch_mca_scan(), the number of btree nodes to shrink is calculated
      by code like this,
      
              unsigned long nr = sc->nr_to_scan;
      
              nr /= c->btree_pages;
              nr = min_t(unsigned long, nr, mca_can_free(c));
      
      The variable sc->nr_to_scan is the number of objects (here, the number
      of bcache B+tree nodes) to shrink, and the pointer sc is passed from the
      memory management code as a parameter of a callback.
      
      If sc->nr_to_scan is smaller than c->btree_pages, after the above
      calculation, variable 'nr' will be 0 and nothing will be shrunk. It is
      frequently observed that sc->nr_to_scan is only 1 or 2, which makes
      nr zero. Then bch_mca_scan() does nothing more than acquire and
      release the mutex c->bucket_lock.
      
      This patch checks whether nr is 0 after the above calculation; if it
      is, nr is set to 1. Then bch_mca_scan() will at least try to shrink
      a single B+tree node.
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
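The rounding problem and the clamp can be shown with plain integers. `toy_nr_to_shrink()` is a hypothetical helper; its arguments stand in for sc->nr_to_scan, c->btree_pages and the result of mca_can_free(c). Applying the clamp before the minimum is an assumption of this sketch, chosen so that nothing is shrunk when nothing can be freed.

```c
/*
 * How many btree nodes to try to shrink, given a scan request expressed
 * in pages. Integer division would yield 0 for small requests, so the
 * result is raised to 1 before being capped by what can be freed.
 */
static unsigned long toy_nr_to_shrink(unsigned long nr_to_scan,
				      unsigned long btree_pages,
				      unsigned long can_free)
{
	unsigned long nr = nr_to_scan / btree_pages;

	if (nr == 0)
		nr = 1;

	return nr < can_free ? nr : can_free;
}
```

With 4 pages per node, a request to scan 1 object now yields 1 node instead of 0, a request for 8 yields 2, and a request for 100 is still capped by the number of freeable nodes.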
    • bcache: add idle_max_writeback_rate sysfs interface · c5fcdedc
      Coly Li committed
      For writeback mode, if there is no regular I/O request for a while,
      the writeback rate will be set to the maximum value (1TB/s for now).
      This is good for most storage workloads, but some people don't want
      the maximum writeback rate during I/O idle time.
      
      This patch adds a sysfs interface file idle_max_writeback_rate to
      permit people to disable the maximum writeback rate. Then the minimum
      writeback rate can be advised by writeback_rate_minimum in the
      bcache device's sysfs interface.
      
      Reported-by: Christian Balzer <chibi@gol.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: add code comments in bch_btree_leaf_dirty() · 5dccefd3
      Coly Li committed
      This patch adds code comments in bch_btree_leaf_dirty() to explain
      why w->journal should always reference the oldest journal pin of
      all the writing bkeys in the btree node. This makes the bcache journal
      code easier to understand.
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix deadlock in bcache_allocator · 84c529ae
      Andrea Righi committed
      bcache_allocator can call the following:
      
       bch_allocator_thread()
        -> bch_prio_write()
           -> bch_bucket_alloc()
              -> wait on &ca->set->bucket_wait
      
      But the wake up event on bucket_wait is supposed to come from
      bch_allocator_thread() itself => deadlock:
      
      [ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 seconds.
      [ 1158.495929]       Not tainted 5.3.0-050300rc3-generic #201908042232
      [ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1158.504413] bcache_allocato D    0 15861      2 0x80004000
      [ 1158.504419] Call Trace:
      [ 1158.504429]  __schedule+0x2a8/0x670
      [ 1158.504432]  schedule+0x2d/0x90
      [ 1158.504448]  bch_bucket_alloc+0xe5/0x370 [bcache]
      [ 1158.504453]  ? wait_woken+0x80/0x80
      [ 1158.504466]  bch_prio_write+0x1dc/0x390 [bcache]
      [ 1158.504476]  bch_allocator_thread+0x233/0x490 [bcache]
      [ 1158.504491]  kthread+0x121/0x140
      [ 1158.504503]  ? invalidate_buckets+0x890/0x890 [bcache]
      [ 1158.504506]  ? kthread_park+0xb0/0xb0
      [ 1158.504510]  ret_from_fork+0x35/0x40
      
      Fix by making the call to bch_prio_write() non-blocking, so that
      bch_allocator_thread() never waits on itself.
      
      Moreover, make sure to wake up the garbage collector thread when
      bch_prio_write() is failing to allocate buckets.
      
      BugLink: https://bugs.launchpad.net/bugs/1784665
      BugLink: https://bugs.launchpad.net/bugs/1796292
      Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: add code comment bch_keylist_pop() and bch_keylist_pop_front() · 06c1526d
      Coly Li committed
      This patch adds simple code comments for bch_keylist_pop() and
      bch_keylist_pop_front() in bset.c, to make the code easier to
      understand.
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: deleted code comments for dead code in bch_data_insert_keys() · 41fa4dee
      Coly Li committed
      In request.c:bch_data_insert_keys(), there is code comment for a piece
      of dead code. This patch deletes the dead code and its code comment
      since they are useless in practice.
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: add more accurate error messages in read_super() · aaf8dbea
      Coly Li committed
      The previous code returned only "Not a bcache superblock" for both a
      bad super block offset and a bad magic number. This patch adds more
      accurate error messages,
      - for super block unmatched offset:
        "Not a bcache superblock (bad offset)"
      - for super block unmatched magic number:
        "Not a bcache superblock (bad magic)"
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
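Splitting one generic error into two specific ones might look like this sketch. `toy_sb` and `toy_check_sb()` are hypothetical, the field layout and magic bytes are invented, and checking the offset before the magic is an assumption; only the two message strings come from the commit message.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified superblock. */
struct toy_sb {
	uint64_t offset;
	uint8_t magic[16];
};

/* Returns NULL when the superblock looks valid, else a specific message. */
static const char *toy_check_sb(const struct toy_sb *sb,
				uint64_t expected_offset,
				const uint8_t expected_magic[16])
{
	if (sb->offset != expected_offset)
		return "Not a bcache superblock (bad offset)";

	if (memcmp(sb->magic, expected_magic, 16) != 0)
		return "Not a bcache superblock (bad magic)";

	return NULL;
}
```

A caller can now tell from the message alone whether the device was read at the wrong location or simply does not carry a bcache signature.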
    • bcache: fix static checker warning in bcache_device_free() · 2d886951
      Coly Li committed
      Commit cafe5635 ("bcache: A block layer cache") leads to the
      following static checker warning:
      
          ./drivers/md/bcache/super.c:770 bcache_device_free()
          warn: variable dereferenced before check 'd->disk' (see line 766)
      
      drivers/md/bcache/super.c
         762  static void bcache_device_free(struct bcache_device *d)
         763  {
         764          lockdep_assert_held(&bch_register_lock);
         765
         766          pr_info("%s stopped", d->disk->disk_name);
                                            ^^^^^^^^^
      Unchecked dereference.
      
         767
         768          if (d->c)
         769                  bcache_device_detach(d);
         770          if (d->disk && d->disk->flags & GENHD_FL_UP)
                          ^^^^^^^
      Check too late.
      
         771                  del_gendisk(d->disk);
         772          if (d->disk && d->disk->queue)
         773                  blk_cleanup_queue(d->disk->queue);
         774          if (d->disk) {
         775                  ida_simple_remove(&bcache_device_idx,
         776                                    first_minor_to_idx(d->disk->first_minor));
         777                  put_disk(d->disk);
         778          }
         779
      
      It is not 100% certain that the gendisk struct of a bcache device will
      always be there; the warning makes sense when there is a problem in the
      block core.
      
      This patch removes the static checker warning by checking
      d->disk to avoid NULL pointer dereferences.
      
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
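The shape of the fix, checking the pointer before the first dereference, can be sketched in user space. `toy_disk`, `toy_device` and `toy_describe_stop()` are hypothetical, and the fallback message is an illustrative placeholder rather than a quote from the patch.

```c
#include <stdio.h>

struct toy_disk {
	const char *disk_name;
};

struct toy_device {
	struct toy_disk *disk;  /* may be NULL if setup failed early */
};

/*
 * Format the "stopped" log line without dereferencing a NULL disk.
 * Returns 1 if the disk name was available, 0 otherwise.
 */
static int toy_describe_stop(const struct toy_device *d, char *buf, size_t len)
{
	if (d->disk) {
		snprintf(buf, len, "%s stopped", d->disk->disk_name);
		return 1;
	}

	snprintf(buf, len, "bcache device (NULL gendisk) stopped");
	return 0;
}
```

The check now comes before the first use of `d->disk`, so the later NULL tests in the function are no longer "too late" from the static checker's point of view.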
    • bcache: fix a lost wake-up problem caused by mca_cannibalize_lock · 34cf78bf
      Guoju Fang committed
      This patch fixes a lost wake-up problem caused by the race between
      mca_cannibalize_lock and bch_cannibalize_unlock.
      
      Consider two processes, A and B. Process A is executing
      mca_cannibalize_lock, while process B holds c->btree_cache_alloc_lock
      and is executing bch_cannibalize_unlock. The problem occurs after
      process A executes cmpxchg and before it executes prepare_to_wait. In
      this window process B executes wake_up; only afterwards does process A
      execute prepare_to_wait and set its state to TASK_INTERRUPTIBLE. Then
      process A goes to sleep but no one will wake it up. This problem may
      cause the bcache device to hang.
      
      Signed-off-by: Guoju Fang <fangguoju@gmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix fifo index swapping condition in journal_pin_cmp() · c0e0954e
      Coly Li committed
      The fifo structure journal.pin is implemented as a circular buffer. If
      the back index reaches the highest location of the circular buffer, it
      wraps around (is swapped) to 0. Once this happens, a smaller fifo index
      might be associated with a newer journal entry. So the btree node with
      the oldest journal entry won't be selected in bch_btree_leaf_dirty() to
      reference the dirty B+tree leaf node. As a result, the bcache journal
      may fail to protect the oldest unflushed dirty B+tree leaf node on power
      failure, and this B+tree leaf node may be inconsistent after reboot
      from power failure.
      
      This patch fixes the fifo index comparison logic in journal_pin_cmp(),
      to avoid a potentially corrupted B+tree leaf node when the back index of
      journal pin is swapped.
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. 13 Nov, 2019 2 commits