1. 25 March 2021: 6 commits
  2. 11 March 2021: 1 commit
  3. 05 March 2021: 2 commits
    • M
      dm verity: fix FEC for RS roots unaligned to block size · df7b59ba
      Authored by Milan Broz
      Optional Forward Error Correction (FEC) code in dm-verity uses
      Reed-Solomon code and should support roots from 2 to 24.
      
      The error correction parity bytes (roots bytes per RS block) are
      stored on a separate device in sequence without any padding.
      
      Currently, to access FEC device, the dm-verity-fec code uses dm-bufio
      client with block size set to verity data block (usually 4096 or 512
      bytes).
      
      Because this block size is not divisible by some (most!) of the
      supported roots lengths, data repair cannot work for parity bytes
      that are only partially stored within a block.
      
      This fix changes FEC device dm-bufio block size to "roots << SECTOR_SHIFT"
      where we can be sure that the full parity data is always available.
      (There cannot be partial FEC blocks because parity must cover whole
      sectors.)
      
      Because the optional FEC starting offset could be unaligned to this
      new block size, we have to use dm_bufio_set_sector_offset() to
      configure it.
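
      A minimal sketch of the resulting setup, assuming the dm-verity-fec
      field names used here (f->roots, f->start, v->data_dev_block_bits)
      match the driver:

        /*
         * Size bufio blocks by the RS parity run, not by the verity data
         * block size, so a buffer always holds whole parity runs.
         */
        f->bufio = dm_bufio_client_create(f->dev->bdev,
                                          f->roots << SECTOR_SHIFT,
                                          1, 0, NULL, NULL);
        if (IS_ERR(f->bufio))
                return PTR_ERR(f->bufio);

        /* The FEC area may start at an offset unaligned to this block size. */
        dm_bufio_set_sector_offset(f->bufio,
                f->start << (v->data_dev_block_bits - SECTOR_SHIFT));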
      
      The problem is easily reproduced using veritysetup, e.g. for roots=13:
      
        # create verity device with RS FEC
        dd if=/dev/urandom of=data.img bs=4096 count=8 status=none
        veritysetup format data.img hash.img --fec-device=fec.img --fec-roots=13 | awk '/^Root hash/{ print $3 }' >roothash
      
        # create an erasure that should always be repairable with this roots setting
        dd if=/dev/zero of=data.img conv=notrunc bs=1 count=8 seek=4088 status=none
      
        # try to read it through dm-verity
        veritysetup open data.img test hash.img --fec-device=fec.img --fec-roots=13 $(cat roothash)
        dd if=/dev/mapper/test of=/dev/null bs=4096 status=noxfer
        # wait for possible recursive recovery in kernel
        udevadm settle
        veritysetup close test
      
      With this fix, errors are properly repaired.
        device-mapper: verity-fec: 7:1: FEC 0: corrected 8 errors
        ...
      
      Without it, FEC code usually ends on unrecoverable failure in RS decoder:
        device-mapper: verity-fec: 7:1: FEC 0: failed to correct: -74
        ...
      
      This problem is present in all kernels since the FEC code's
      introduction (kernel 4.5).
      
      It is thought that this problem is not visible in the Android
      ecosystem because it always uses the default RS roots=2.
      
      Depends-on: a14e5ec6 ("dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size")
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      Tested-by: Jérôme Carretero <cJ-ko@zougloub.eu>
      Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
      Cc: stable@vger.kernel.org # 4.5+
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      df7b59ba
    • M
      dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size · a14e5ec6
      Authored by Mikulas Patocka
      dm_bufio_get_device_size returns the device size in blocks. Before
      returning the value, we must subtract the number of starting
      sectors. The number of starting sectors may not be divisible by block
      size.
      
      Note that currently, no target is using dm_bufio_set_sector_offset and
      dm_bufio_get_device_size simultaneously, so this change has no effect.
      However, an upcoming dm-verity-fec fix needs this change.
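
      A minimal sketch of the corrected helper, assuming the client fields
      named here (c->start, c->sectors_per_block_bits); the real function
      may handle non-power-of-two block sizes differently:

        sector_t dm_bufio_get_device_size(struct dm_bufio_client *c)
        {
                sector_t s = i_size_read(c->bdev->bd_inode) >> SECTOR_SHIFT;

                /* Subtract the initial sectors before converting to
                 * blocks; the offset may not be block aligned. */
                s = (s >= c->start) ? s - c->start : 0;

                return s >> c->sectors_per_block_bits;
        }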
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Milan Broz <gmazyland@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      a14e5ec6
  4. 27 February 2021: 1 commit
  5. 11 February 2021: 12 commits
    • M
      dm: fix deadlock when swapping to encrypted device · a666e5c0
      Authored by Mikulas Patocka
      The system would deadlock when swapping to a dm-crypt device. The reason
      is that for each incoming write bio, dm-crypt allocates memory that holds
      encrypted data. These excessive allocations exhaust all the memory and the
      result is either deadlock or OOM trigger.
      
      This patch limits the number of in-flight swap bios, so that the memory
      consumed by dm-crypt is limited. The limit is enforced if the target
      sets the "limit_swap_bios" variable and if the bio has REQ_SWAP set.
      
      Non-swap bios are not affected because taking the semaphore would cause
      performance degradation.
      
      This is similar to request-based drivers - they will also block when the
      number of requests is over the limit.
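
      A rough sketch of the idea; the flag and semaphore names are
      illustrative, not necessarily the ones used in dm.c:

        /* A counting semaphore, initialized to the allowed number of
         * in-flight swap bios, throttles REQ_SWAP submissions. */
        bool limit = ti->limit_swap_bios && unlikely(bio->bi_opf & REQ_SWAP);

        if (limit)
                down(&md->swap_bios_semaphore); /* sleeps at the limit */

        /* ... map and submit the bio as usual ... */

        /* and in the bio completion path: */
        if (limit)
                up(&md->swap_bios_semaphore);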
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      a666e5c0
    • M
      dm: simplify target code conditional on CONFIG_BLK_DEV_ZONED · e3290b94
      Authored by Mike Snitzer
      Allow removal of CONFIG_BLK_DEV_ZONED conditionals in target_type
      definition of various targets.
      Suggested-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      e3290b94
    • S
      dm: set DM_TARGET_PASSES_CRYPTO feature for some targets · 3db564b4
      Authored by Satya Tangirala
      dm-linear and dm-flakey obviously can pass through inline crypto support.
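
      The opt-in is a feature flag in the target_type; a trimmed sketch
      (other fields elided):

        static struct target_type linear_target = {
                .name     = "linear",
                /* Bios pass through unmodified, so it is safe to expose
                 * the underlying device's inline crypto capabilities. */
                .features = DM_TARGET_PASSES_INTEGRITY | DM_TARGET_PASSES_CRYPTO,
                /* .ctr, .dtr, .map, ... */
        };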
      Co-developed-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Satya Tangirala <satyat@google.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      3db564b4
    • S
      dm: support key eviction from keyslot managers of underlying devices · 9355a9eb
      Authored by Satya Tangirala
      Now that device mapper supports inline encryption, add the ability to
      evict keys from all underlying devices. When an upper layer requests
      a key eviction, we simply iterate through all underlying devices
      and evict that key from each device.
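
      A hedged sketch of that walk, using the standard iterate_devices
      callout; the keyslot-manager call is an assumption about the block
      layer API this series builds on:

        /* Called once per underlying device of each target. */
        static int dm_keyslot_evict_callback(struct dm_target *ti,
                                             struct dm_dev *dev,
                                             sector_t start, sector_t len,
                                             void *data)
        {
                struct blk_crypto_key *key = data;

                blk_ksm_evict_key(bdev_get_queue(dev->bdev)->ksm, key);
                return 0;
        }

      Each target in the live table would then run
      ti->type->iterate_devices(ti, dm_keyslot_evict_callback, key).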
      Co-developed-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Satya Tangirala <satyat@google.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      9355a9eb
    • S
      dm: add support for passing through inline crypto support · aa6ce87a
      Authored by Satya Tangirala
      Update the device-mapper core to support exposing the inline crypto
      support of the underlying device(s) through the device-mapper device.
      
      This works by creating a "passthrough keyslot manager" for the dm
      device, which declares support for encryption settings which all
      underlying devices support.  When a supported setting is used, the bio
      cloning code handles cloning the crypto context to the bios for all the
      underlying devices.  When an unsupported setting is used, the blk-crypto
      fallback is used as usual.
      
      Crypto support on each underlying device is ignored unless the
      corresponding dm target opts into exposing it.  This is needed because
      for inline crypto to semantically operate on the original bio, the data
      must not be transformed by the dm target.  Thus, targets like dm-linear
      can expose crypto support of the underlying device, but targets like
      dm-crypt can't.  (dm-crypt could use inline crypto itself, though.)
      
      A DM device's table can only be changed if the "new" inline encryption
      capabilities are a (*not* necessarily strict) superset of the "old" inline
      encryption capabilities.  Attempts to make changes to the table that result
      in some inline encryption capability becoming no longer supported will be
      rejected.
      
      For the sake of clarity, key eviction from underlying devices will be
      handled in a future patch.
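
      A hedged sketch of the capability construction;
      blk_ksm_init_passthrough() and blk_ksm_intersect_modes() are assumed
      from the block keyslot-manager API this series builds on:

        /* Start from a passthrough keyslot manager that claims support
         * for everything, then shrink it per underlying device. */
        blk_ksm_init_passthrough(ksm);
        /* for each device 'dev' of each target that opted in: */
        blk_ksm_intersect_modes(ksm, bdev_get_queue(dev->bdev)->ksm);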
      Co-developed-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Satya Tangirala <satyat@google.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      aa6ce87a
    • N
      dm era: only resize metadata in preresume · cca2c6ae
      Authored by Nikos Tsironis
      Metadata resize shouldn't happen in the ctr. The ctr loads a temporary
      (inactive) table that will only become active upon resume. That is why
      resizing should always be done as part of resume. Otherwise a load (ctr)
      whose inactive table never becomes active will incorrectly resize the
      metadata.
      
      Also, perform the resize directly in preresume, instead of using the
      worker to do it.
      
      The worker might run other metadata operations, e.g., it could start
      digestion, before resizing the metadata. These operations will end up
      using the old size.
      
      This could lead to errors, like:
      
        device-mapper: era: metadata_digest_transcribe_writeset: dm_array_set_value failed
        device-mapper: era: process_old_eras: digest step failed, stopping digestion
      
      The reason for the above error is that the worker started the digestion
      of the archived writeset using the old, larger size.
      
      As a result, metadata_digest_transcribe_writeset tried to write beyond
      the end of the era array.
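
      A condensed sketch of the preresume-side resize; calc_nr_blocks()
      and metadata_resize() are assumed from dm-era-target.c, the rest is
      illustrative:

        static int era_preresume(struct dm_target *ti)
        {
                struct era *era = ti->private;
                dm_block_t new_size = calc_nr_blocks(era);

                if (era->nr_blocks != new_size) {
                        /* Resize synchronously, before the worker can
                         * start digestion or any other metadata op. */
                        int r = metadata_resize(era->md, &new_size);
                        if (r)
                                return r;
                        era->nr_blocks = new_size;
                }

                /* ... archive writeset, start new era, wake worker ... */
                return 0;
        }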
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      cca2c6ae
    • N
      dm era: Use correct value size in equality function of writeset tree · 64f2d15a
      Authored by Nikos Tsironis
      Fix the writeset tree equality test function to use the right value size
      when comparing two btree values.
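
      A sketch of the corrected comparator, assuming the btree values are
      struct writeset_disk:

        static int writeset_eq(void *context, const void *value1,
                               const void *value2)
        {
                /* Compare the full on-disk value instead of the smaller,
                 * incorrect size used before. */
                return !memcmp(value1, value2, sizeof(struct writeset_disk));
        }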
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      64f2d15a
    • N
      dm era: Fix bitset memory leaks · 904e6b26
      Authored by Nikos Tsironis
      Deallocate the memory allocated for the in-core bitsets when destroying
      the target and in error paths.
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      904e6b26
    • N
      dm era: Verify the data block size hasn't changed · c8e846ff
      Authored by Nikos Tsironis
      dm-era doesn't support changing the data block size of existing devices,
      so check explicitly that the requested block size for a new target
      matches the one stored in the metadata.
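
      A minimal sketch of such a check when opening existing metadata
      (names illustrative):

        /* Reject a table load whose block size disagrees with the size
         * recorded in the superblock when the metadata was formatted. */
        if (le32_to_cpu(disk_super->data_block_size) != md->block_size) {
                DMERR("changing the data block size is not supported");
                return -EINVAL;
        }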
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      c8e846ff
    • N
      dm era: Reinitialize bitset cache before digesting a new writeset · 25249333
      Authored by Nikos Tsironis
      In case of devices with at most 64 blocks, the digestion of consecutive
      eras uses the writeset of the first era as the writeset of all eras to
      digest, leading to lost writes. That is, we lose the information about
      what blocks were written during the affected eras.
      
      The digestion code uses a dm_disk_bitset object to access the archived
      writesets. This structure includes a one word (64-bit) cache to reduce
      the number of array lookups.
      
      This structure is initialized only once, in metadata_digest_start(),
      when we kick off digestion.
      
      But, when we insert a new writeset into the writeset tree, before the
      digestion of the previous writeset is done, or equivalently when there
      are multiple writesets in the writeset tree to digest, then all these
      writesets are digested using the same cache and the cache is not
      re-initialized when moving from one writeset to the next.
      
      For devices with more than 64 blocks, i.e., larger than the cache,
      the cache is indirectly invalidated when we move to the next set of
      blocks, so we avoid the bug.
      
      But for devices with at most 64 blocks we end up using the same cached
      data for digesting all archived writesets, i.e., the cache is loaded
      when digesting the first writeset and it never gets reloaded, until the
      digestion is done.
      
      As a result, the writeset of the first era to digest is used as the
      writeset of all the following archived eras, leading to lost writes.
      
      Fix this by reinitializing the dm_disk_bitset structure, and thus
      invalidating the cache, every time the digestion code starts digesting a
      new writeset.
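
      The essence of the fix is one extra initialization per writeset;
      dm_disk_bitset_init() is the persistent-data initializer, and its
      placement here is a sketch:

        /* Before digesting each archived writeset: re-initialize the
         * dm_disk_bitset, which also invalidates its one-word cache. */
        dm_disk_bitset_init(md->tm, &md->bitset_info);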
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      25249333
    • N
      dm era: Update in-core bitset after committing the metadata · 2099b145
      Authored by Nikos Tsironis
      In case of a system crash, dm-era might fail to mark blocks as written
      in its metadata, although the corresponding writes to these blocks were
      passed down to the origin device and completed successfully.
      
      Consider the following sequence of events:
      
      1. We write to a block that has not yet been written in the current era
      2. era_map() checks the in-core bitmap for the current era and sees
         that the block is not marked as written.
      3. The write is deferred for submission after the metadata have been
         updated and committed.
      4. The worker thread processes the deferred write
         (process_deferred_bios()) and marks the block as written in the
         in-core bitmap, **before** committing the metadata.
      5. The worker thread starts committing the metadata.
      6. We do more writes that map to the same block as the write of step (1)
      7. era_map() checks the in-core bitmap and sees that the block is marked
         as written, **although the metadata have not been committed yet**.
      8. These writes are passed down to the origin device immediately and the
         device reports them as completed.
      9. The system crashes, e.g., power failure, before the commit from step
         (5) finishes.
      
      When the system recovers and we query the dm-era target for the list of
      written blocks it doesn't report the aforementioned block as written,
      although the writes of step (6) completed successfully.
      
      The issue is that era_map() decides whether or not to defer a write
      based on uncommitted information. The root cause of the bug is that we
      update the in-core bitmap, **before** committing the metadata.
      
      Fix this by updating the in-core bitmap **after** successfully
      committing the metadata.
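
      Schematically, the worker's deferred-write path becomes (all names
      illustrative):

        r = metadata_commit(era->md);   /* persist the writeset first */
        if (r) {
                /* fail the deferred bios; on-disk state stays authoritative */
        } else {
                /* only now publish to the in-core bitmap and submit the
                 * deferred bios to the origin device */
                set_bit(block, era->in_core_writeset_bits);
        }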
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      2099b145
    • N
      dm era: Recover committed writeset after crash · de89afc1
      Authored by Nikos Tsironis
      Following a system crash, dm-era fails to recover the committed writeset
      for the current era, leading to lost writes. That is, we lose the
      information about what blocks were written during the affected era.
      
      dm-era assumes that the writeset of the current era is archived when the
      device is suspended. So, when resuming the device, it just moves on to
      the next era, ignoring the committed writeset.
      
      This assumption holds when the device is properly shut down. But, when
      the system crashes, the code that suspends the target never runs, so the
      writeset for the current era is not archived.
      
      There are three issues that cause the committed writeset to get lost:
      
      1. dm-era doesn't load the committed writeset when opening the metadata
      2. The code that resizes the metadata wipes the information about the
         committed writeset (assuming it was loaded at step 1)
      3. era_preresume() starts a new era, without taking into account that
         the current era might not have been archived, due to a system crash.
      
      To fix this:
      
      1. Load the committed writeset when opening the metadata
      2. Fix the code that resizes the metadata to make sure it doesn't wipe
         the loaded writeset
      3. Fix era_preresume() to check for a loaded writeset and archive it,
         before starting a new era (see the sketch below).
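
      A condensed sketch of fix (3); the helper names are assumed from
      dm-era-target.c:

        /* In era_preresume(): if a crash left the previous era's writeset
         * committed but never archived, archive it before moving on. */
        if (md->current_writeset->md.root != INVALID_WRITESET_ROOT) {
                r = metadata_era_archive(md);
                if (r)
                        return r;
        }
        r = metadata_new_era(md);       /* now start the next era */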
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      de89afc1
  6. 10 February 2021: 8 commits
    • J
      bcache: Avoid comma separated statements · 6751c1e3
      Authored by Joe Perches
      Use semicolons and braces.
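
      For illustration, a generic example of the pattern (not the actual
      diff):

        /* before: comma-separated statements */
        if (error)
                count++, total += n;

        /* after: semicolons and braces */
        if (error) {
                count++;
                total += n;
        }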
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6751c1e3
    • K
      bcache: Move journal work to new flush wq · afe78ab4
      Authored by Kai Krakow
      This is potentially long-running and not latency-sensitive; let's get
      it out of the way of other latency-sensitive events.
      
      As observed in the previous commit, the `system_wq` easily becomes
      congested by bcache, and this fixes a few more stalls I was observing
      every once in a while.
      
      Let's not make this `WQ_MEM_RECLAIM`, as that was shown to reduce the
      performance of boot and file system operations in my tests. Also,
      without `WQ_MEM_RECLAIM`, I no longer see desktop stalls. This matches
      the previous behavior, as `system_wq` also does no memory reclaim:
      
      > // workqueue.c:
      > system_wq = alloc_workqueue("events", 0, 0);
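
      The replacement is a plain dedicated workqueue; a sketch (the queue
      name is illustrative):

        /* Dedicated, non-WQ_MEM_RECLAIM queue for journal flushes,
         * keeping this long-running work off the shared system_wq. */
        bch_flush_wq = alloc_workqueue("bch_flush", 0, 0);
        if (!bch_flush_wq)
                goto err;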
      
      Cc: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org # 5.4+
      Signed-off-by: Kai Krakow <kai@kaishome.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      afe78ab4
    • K
      bcache: Give btree_io_wq correct semantics again · d797bd98
      Authored by Kai Krakow
      Before `btree_io_wq` was killed, the queue was allocated using
      `create_singlethread_workqueue()`, which implies `WQ_MEM_RECLAIM`.
      After killing it, the work lost this property, although `system_wq`
      is at least not single-threaded.
      
      Let's combine both worlds and make it multi-threaded but able to
      reclaim memory.
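
      That amounts to roughly (sketch):

        /* Multi-threaded, but still able to make forward progress under
         * memory pressure, like the old single-threaded queue. */
        btree_io_wq = alloc_workqueue("bcache_btree_io", WQ_MEM_RECLAIM, 0);
        if (!btree_io_wq)
                goto err;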
      
      Cc: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org # 5.4+
      Signed-off-by: Kai Krakow <kai@kaishome.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d797bd98
    • K
      Revert "bcache: Kill btree_io_wq" · 9f233ffe
      Authored by Kai Krakow
      This reverts commit 56b30770.
      
      With the btree using the `system_wq`, I seem to see a lot more desktop
      latency than I should.
      
      After some more investigation, it looks like the original assumption
      of 56b30770 is no longer true, and bcache has a very high potential
      for congesting the `system_wq`. In turn, this introduces laggy desktop
      performance, IO stalls (at least with btrfs), and input events may be
      delayed.
      
      So let's revert this. It's important to note that the semantics of
      using `system_wq` previously mean that `btree_io_wq` should be created
      before and destroyed after other bcache wqs to keep the same
      assumptions.
      
      Cc: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org # 5.4+
      Signed-off-by: Kai Krakow <kai@kaishome.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9f233ffe
    • K
      bcache: Fix register_device_aync typo · d7fae7b4
      Authored by Kai Krakow
      Should be `register_device_async`.
      
      Cc: Coly Li <colyli@suse.de>
      Signed-off-by: Kai Krakow <kai@kaishome.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d7fae7b4
    • D
      bcache: consider the fragmentation when update the writeback rate · 71dda2a5
      Authored by dongdong tao
      The current way of calculating the writeback rate only considers the
      dirty sectors. This usually works fine when fragmentation is not
      high, but it gives us an unreasonably small rate when very few dirty
      sectors consume a lot of dirty buckets. In some cases, the dirty
      buckets can reach CUTOFF_WRITEBACK_SYNC while the dirty data
      (sectors) has not even reached writeback_percent; the writeback rate
      will still be the minimum value (4k), causing all writes to be stuck
      in a non-writeback mode because of the slow writeback.
      
      We accelerate the rate in 3 stages with different aggressiveness:
      the first stage starts when the dirty bucket percentage goes above
      BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at
      BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), and the third at
      BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default the first
      stage tries to write back the amount of dirty data in one bucket (on
      average) in (1 / (dirty_buckets_percent - 50)) seconds, the second
      stage tries to write back the amount of dirty data in one bucket in
      (1 / (dirty_buckets_percent - 57)) * 100 milliseconds, and the third
      stage tries to write back the amount of dirty data in one bucket in
      (1 / (dirty_buckets_percent - 64)) milliseconds.
      
      The initial rate at each stage can be controlled by 3 configurable
      parameters, writeback_rate_fp_term_{low|mid|high}, which default to
      1, 10, and 1000 respectively. The IO throughput these values try to
      achieve is described by the paragraph above. The reason I chose
      those values as defaults is based on testing and production data;
      below are some details:
      
      A. When it comes to the low stage, we are still a bit far from the 70
         threshold, so we only want to give it a little push by setting the
         term to 1. This means the initial rate will be 170 if the fragment
         is 6; it is calculated as bucket_size / fragment. This rate is
         very small, but still much more reasonable than the minimum 8.
         For a production bcache with a light workload, if the cache device
         is bigger than 1 TB, it may take hours to consume 1% of the
         buckets, so it is very possible to reclaim enough dirty buckets in
         this stage and thus avoid entering the next stage.
      
      B. If the dirty bucket ratio didn't turn around during the first
         stage, it comes to the mid stage. The mid stage needs to be more
         aggressive than the low stage, so I chose an initial rate 10 times
         higher than the low stage, which means 1700 as the initial rate if
         the fragment is 6. This is a typical rate we usually see for a
         normal workload when writeback happens because of
         writeback_percent.
      
      C. If the dirty bucket ratio didn't turn around during the low and
         mid stages, it comes to the third stage, which is the last chance
         to turn around and avoid the horrible cutoff writeback sync issue.
         We then choose a rate 100 times more aggressive than the mid
         stage, which means 170000 as the initial rate if the fragment is
         6. This is also inferred from a production bcache: I've got one
         week's writeback rate data from a production bcache with quite
         heavy workloads (again, the writeback is triggered by
         writeback_percent), and the highest rate area is around 100000 to
         240000, so I believe this kind of aggressiveness is reasonable for
         production at this stage. It should also be mostly enough, because
         the hint is trying to reclaim 1000 buckets per second, while that
         heavy production env was consuming 50 buckets per second on
         average in one week's data.
      
      The option writeback_consider_fragment controls whether this feature
      is on or off; it is on by default.
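
      A simplified sketch of the staged boost; only the threshold
      constants and the writeback_rate_fp_term_* parameters come from the
      description above, the surrounding structure is illustrative:

        if (dc->writeback_consider_fragment &&
            in_use > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW) {
                int64_t fp_term, fps;

                /* Pick the term for the current stage; later stages
                 * grow much faster past their thresholds. */
                if (in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID)
                        fp_term = (int64_t)dc->writeback_rate_fp_term_low *
                                (in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
                else if (in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH)
                        fp_term = (int64_t)dc->writeback_rate_fp_term_mid *
                                (in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
                else
                        fp_term = (int64_t)dc->writeback_rate_fp_term_high *
                                (in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);

                /* Scale by average dirty data per dirty bucket; use it
                 * if it beats the plain proportional term. */
                fps = div_s64(dirty, dirty_buckets) * fp_term;
                if (fps > proportional_scaled)
                        proportional_scaled = fps;
        }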
      
      Lastly, below is the performance data for all the test results,
      including the data from the production env:
      https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
      Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      71dda2a5
    • M
      dm writecache: fix writing beyond end of underlying device when shrinking · 4134455f
      Authored by Mikulas Patocka
      Do not attempt to write any data beyond the end of the underlying data
      device while shrinking it.
      
      The DM writecache device must be suspended when the underlying data
      device is shrunk.
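
      The guard is a clamp in the writeback submission path; a sketch with
      assumed field names:

        /* Never let a writeback I/O extend past the (possibly shrunk)
         * end of the underlying data device. */
        if (unlikely(region.sector + region.count > wc->data_device_sectors))
                region.count = wc->data_device_sectors - region.sector;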
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      4134455f
  7. 09 February 2021: 5 commits
  8. 08 February 2021: 1 commit
  9. 04 February 2021: 1 commit
  10. 03 February 2021: 3 commits