1. 30 May 2015, 1 commit
    • dm cache: fix race when issuing a POLICY_REPLACE operation · fb4100ae
      Committed by Joe Thornber
      There is a race between a policy deciding to replace a cache entry,
      the core target writing back any dirty data from this block, and other
      IO threads doing IO to the same block.
      
      This sort of problem is avoided most of the time by the core target
      grabbing a bio prison cell before making the request to the policy.
      But for a demotion the core target doesn't know which block will be
      demoted, so can't do this in advance.
      
      Fix this demotion race by introducing a callback to the policy interface
      that allows the policy to grab the cell on behalf of the core target.
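      For illustration only, here is a rough sketch of the shape such a
      callback could take; the names policy_replace_ops, lock_cell_fn and the
      cell type are assumptions made for this sketch, not the actual dm-cache
      interfaces:

       /* Illustrative sketch -- not the real dm-cache policy interface. */
       struct cell;                               /* stand-in for a bio prison cell */

       /* Callback supplied by the core target so the policy can lock the
        * victim block's cell *before* committing to the demotion it chose. */
       typedef int (*lock_cell_fn)(void *context, unsigned long oblock,
                                   struct cell **out_cell);

       struct policy_replace_ops {
               /* On success the policy returns the demoted block together with
                * the cell it grabbed via lock_cell(), so the core target already
                * holds the cell when it writes back and remaps the block. */
               int (*replace)(void *policy, unsigned long new_oblock,
                              lock_cell_fn lock_cell, void *lock_context,
                              unsigned long *demoted_oblock,
                              struct cell **demoted_cell);
       };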
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  2. 22 May 2015, 1 commit
    • block: remove management of bi_remaining when restoring original bi_end_io · 326e1dbb
      Committed by Mike Snitzer
      Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
      non-chains") regressed all existing callers that followed this pattern:
       1) saving a bio's original bi_end_io
       2) wiring up an intermediate bi_end_io
       3) restoring the original bi_end_io from intermediate bi_end_io
       4) calling bio_endio() to execute the restored original bi_end_io
      
      The regression was due to BIO_CHAIN only ever getting set if
      bio_inc_remaining() is called.  For the above pattern it isn't set until
      step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
      the first bio_endio(), in step 2 above, never decremented __bi_remaining
      before calling the intermediate bi_end_io -- leaving __bi_remaining with
      the value 1 instead of 0.  When bio_inc_remaining() occurred during step
      3 it brought it to a value of 2.  When the second bio_endio() was
      called, in step 4 above, it should've called the original bi_end_io but
      it didn't because there was an extra reference that wasn't dropped (due
      to atomic operations being optimized away since BIO_CHAIN wasn't set
      upfront).
      
      Fix this issue by removing the __bi_remaining management complexity for
      all callers that use the above pattern -- bio_chain() is the only
      interface that _needs_ to be concerned with __bi_remaining.  For the
      above pattern callers just expect the bi_end_io they set to get called!
      Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
      that aren't associated with the bio_chain() interface.
      
      Also, the bio_inc_remaining() interface has been moved local to bio.c.
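      A minimal userspace model of the save/restore pattern in steps 1-4
      above; the bio struct and the bio_endio_model() helper are simplified
      stand-ins for the kernel definitions:

       /* Simplified model of the bi_end_io save/restore pattern (steps 1-4). */
       #include <stdio.h>

       struct bio;
       typedef void (*bio_end_io_t)(struct bio *);

       struct bio {
               bio_end_io_t bi_end_io;
               bio_end_io_t saved_end_io;  /* stand-in for stashing in per-bio data */
       };

       static void bio_endio_model(struct bio *bio)  /* stand-in for bio_endio() */
       {
               if (bio->bi_end_io)
                       bio->bi_end_io(bio);
       }

       static void original_endio(struct bio *bio)
       {
               printf("original bi_end_io called\n"); /* step 4 must land here */
       }

       static void intermediate_endio(struct bio *bio)
       {
               bio->bi_end_io = bio->saved_end_io;    /* step 3: restore original */
               bio_endio_model(bio);                  /* step 4: run the original */
       }

       int main(void)
       {
               struct bio b = { .bi_end_io = original_endio };

               b.saved_end_io = b.bi_end_io;          /* step 1: save original */
               b.bi_end_io = intermediate_endio;      /* step 2: wire intermediate */
               bio_endio_model(&b);                   /* completion runs steps 2-4 */
               return 0;
       }

      With the fix, callers following this pattern need no __bi_remaining
      bookkeeping at all; the restored bi_end_io simply runs.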
      
      Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  3. 06 May 2015, 1 commit
  4. 10 Feb 2015, 1 commit
  5. 24 Jan 2015, 1 commit
    • dm cache: fix problematic dual use of a single migration count variable · a59db676
      Committed by Joe Thornber
      Introduce a new variable to count the number of allocated migration
      structures.  The existing variable cache->nr_migrations became
      overloaded.  It was used to:
      
       i) track the number of migrations in flight for the purposes of
          quiescing during suspend.
      
       ii) estimate the amount of background IO occurring.
      
      Recent discard changes meant that REQ_DISCARD bios are processed with
      a migration.  Discards are not background IO, so nr_migrations was not
      incremented for them; as a result, quiescing could complete early while
      a discard migration was still in flight.
      
      (i) is now handled with a new variable cache->nr_allocated_migrations.
      cache->nr_migrations has been renamed cache->nr_io_migrations.
      cleanup_migration() is now called free_io_migration(), since it
      decrements that variable.
      
      Also, remove the unused cache->next_migration variable that got replaced
      with prealloc_structs a while ago.
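      A small sketch of the split-counter idea, using a pthread mutex and
      condition variable in place of the kernel's atomics and wait queues;
      the field and function names follow the commit text but the code itself
      is illustrative only:

       #include <pthread.h>

       struct cache {
               pthread_mutex_t lock;
               pthread_cond_t  quiesced;
               unsigned nr_allocated_migrations; /* (i)  drives quiescing on suspend */
               unsigned nr_io_migrations;        /* (ii) background-IO estimate only */
       };

       static void inc_io_migrations(struct cache *c)
       {
               pthread_mutex_lock(&c->lock);
               c->nr_io_migrations++;        /* skipped for discards: not background IO */
               pthread_mutex_unlock(&c->lock);
       }

       static void free_io_migration(struct cache *c)
       {
               pthread_mutex_lock(&c->lock);
               c->nr_io_migrations--;
               pthread_mutex_unlock(&c->lock);
       }

       /* Allocation and freeing of the migration structure itself -- whether it
        * is background IO or a discard -- is what quiescing must wait for. */
       static void inc_allocated_migrations(struct cache *c)
       {
               pthread_mutex_lock(&c->lock);
               c->nr_allocated_migrations++;
               pthread_mutex_unlock(&c->lock);
       }

       static void dec_allocated_migrations(struct cache *c)
       {
               pthread_mutex_lock(&c->lock);
               if (--c->nr_allocated_migrations == 0)
                       pthread_cond_broadcast(&c->quiesced);
               pthread_mutex_unlock(&c->lock);
       }

       static void wait_for_migrations(struct cache *c)  /* used during suspend */
       {
               pthread_mutex_lock(&c->lock);
               while (c->nr_allocated_migrations)
                       pthread_cond_wait(&c->quiesced, &c->lock);
               pthread_mutex_unlock(&c->lock);
       }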
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  6. 02 Dec 2014, 7 commits
  7. 13 Nov 2014, 1 commit
  8. 11 Nov 2014, 5 commits
  9. 10 Sep 2014, 1 commit
    • dm cache: fix race causing dirty blocks to be marked as clean · 40aa978e
      Committed by Anssi Hannula
      When a writeback or a promotion of a block is completed, the cell of
      that block is removed from the prison, the block is marked as clean, and
      the clear_dirty() callback of the cache policy is called.
      
      Unfortunately, performing those actions in this order allows an incoming
      new write bio for that block to come in before the clearing of the dirty
      status has completed, possibly causing one of these two scenarios:
      
      Scenario A:
      
      Thread 1                      Thread 2
      cell_defer()                  .
      - cell removed from prison    .
      - detained bios queued        .
      .                             incoming write bio
      .                             remapped to cache
      .                             set_dirty() called,
      .                               but block already dirty
      .                               => it does nothing
      clear_dirty()                 .
      - block marked clean          .
      - policy clear_dirty() called .
      
      Result: Block is marked clean even though it is actually dirty. No
      writeback will occur.
      
      Scenario B:
      
      Thread 1                      Thread 2
      cell_defer()                  .
      - cell removed from prison    .
      - detained bios queued        .
      clear_dirty()                 .
      - block marked clean          .
      .                             incoming write bio
      .                             remapped to cache
      .                             set_dirty() called
      .                             - block marked dirty
      .                             - policy set_dirty() called
      - policy clear_dirty() called .
      
      Result: Block is properly marked as dirty, but the policy thinks it is
      clean and therefore never asks us to write it back.
      This case is visible in the "dmsetup status" dirty block count (which
      normally decreases to 0 on a quiet device).
      
      Fix these issues by calling clear_dirty() before calling cell_defer().
      Incoming bios for that block will then be detained in the cell and
      released only after clear_dirty() has completed, so the race will not
      occur.
      
      Found by inspecting the code after noticing spurious dirty counts
      (scenario B).
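      A condensed sketch of the corrected ordering in the completion path;
      the cache, cell and dm_cblock_t types below are placeholders for
      illustration, not the real dm-cache definitions:

       struct cache;
       struct cell;
       typedef unsigned long dm_cblock_t;

       void clear_dirty(struct cache *c, dm_cblock_t cblock); /* core + policy state */
       void cell_defer(struct cache *c, struct cell *cell);   /* releases detained bios */

       static void writeback_or_promotion_complete(struct cache *c,
                                                   struct cell *cell,
                                                   dm_cblock_t cblock)
       {
               /* Before the fix these two calls were swapped, letting a new write
                * slip in between the cell release and the dirty-state update. */
               clear_dirty(c, cblock);
               cell_defer(c, cell);
       }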
      Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  10. 02 Aug 2014, 5 commits
  11. 27 May 2014, 1 commit
  12. 02 May 2014, 1 commit
  13. 05 Apr 2014, 1 commit
    • dm cache: fix a lock-inversion · 0596661f
      Committed by Joe Thornber
      When suspending a cache, the policy is walked and the individual policy
      hints are written to the metadata via sync_metadata().  This resulted in
      the following lock order:
      
            policy->lock
              cache_metadata->root_lock
      
      When loading the cache target the policy is populated while the metadata
      lock is held:
      
            cache_metadata->root_lock
               policy->lock
      
      Fix this potential lock-inversion (ABBA) deadlock in sync_metadata() by
      ensuring the cache_metadata root_lock is held whilst all the hints are
      written, rather than being repeatedly locked while policy->lock is held
      (as was the case with each callout that policy_walk_mappings() made to
      the old save_hint() method).
      
      Found by turning on the CONFIG_PROVE_LOCKING ("Lock debugging: prove
      locking correctness") build option.  However, it is not clear how the
      LOCKDEP reported paths can lead to a deadlock since the two paths,
      suspending a target and loading a target, never occur at the same time.
      But that doesn't mean the same lock-inversion couldn't have occurred
      elsewhere.
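      A sketch of one way to keep the order consistent, with pthread locks
      standing in for the kernel primitives and illustrative function names;
      the point is that root_lock is taken once, outermost, for the whole hint
      walk, matching the order used when loading the target:

       #include <pthread.h>

       struct cache_metadata { pthread_rwlock_t root_lock; };
       struct policy         { pthread_mutex_t lock; };

       void write_one_hint_locked(struct cache_metadata *md, unsigned block);
       void walk_mappings(struct policy *p,
                          void (*fn)(void *ctx, unsigned block), void *ctx);

       struct hint_ctx { struct cache_metadata *md; };

       static void save_hint(void *ctx, unsigned block)
       {
               struct hint_ctx *h = ctx;
               /* root_lock is already held by the caller -- no re-locking here. */
               write_one_hint_locked(h->md, block);
       }

       static void sync_metadata_hints(struct policy *p, struct cache_metadata *md)
       {
               struct hint_ctx h = { .md = md };

               pthread_rwlock_wrlock(&md->root_lock); /* outer lock, taken once */
               pthread_mutex_lock(&p->lock);          /* same order as target load */
               walk_mappings(p, save_hint, &h);
               pthread_mutex_unlock(&p->lock);
               pthread_rwlock_unlock(&md->root_lock);
       }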
      Reported-by: Marian Csontos <mcsontos@redhat.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  14. 28 Mar 2014, 2 commits
  15. 13 Mar 2014, 2 commits
    • dm cache: fix access beyond end of origin device · e893fba9
      Committed by Heinz Mauelshagen
      In order to avoid wasting cache space, a partial block at the end of the
      origin device is not cached.  Unfortunately, the check for such a
      partial block was flawed.
      
      Fix accesses beyond the end of the origin device that occurred due to
      attempted promotion of an undetected partial block by:
      
      - initializing the per bio data struct to allow cache_end_io to work properly
      - recognizing access to the partial block at the end of the origin device
      - avoiding out of bounds access to the discard bitset
      
      Otherwise, users can experience errors like the following:
      
       attempt to access beyond end of device
       dm-5: rw=0, want=20971520, limit=20971456
       ...
       device-mapper: cache: promotion failed; couldn't copy block
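      The boundary condition can be shown in isolation; the block size below
      is hypothetical and only the "limit=" value is taken from the log above
      (this is not the actual dm-cache code):

       #include <stdbool.h>
       #include <stdint.h>
       #include <stdio.h>

       typedef uint64_t sector_t;

       /* A block is partial if it starts before the end of the origin device
        * but does not fit entirely within it; such a block must never be
        * promoted or have its discard state tracked. */
       static bool block_is_partial(sector_t origin_sectors, sector_t block_size,
                                    sector_t block_index)
       {
               sector_t block_start = block_index * block_size;

               return block_start < origin_sectors &&
                      block_start + block_size > origin_sectors;
       }

       int main(void)
       {
               sector_t origin_sectors = 20971456; /* the "limit=" from the log above */
               sector_t block_size = 128;          /* hypothetical 64KiB cache block */
               sector_t last = origin_sectors / block_size; /* trailing block index */

               printf("block %llu partial: %d\n", (unsigned long long)last,
                      block_is_partial(origin_sectors, block_size, last));
               return 0;
       }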
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
    • dm cache: fix truncation bug when copying a block to/from >2TB fast device · 8b9d9666
      Committed by Heinz Mauelshagen
      During demotion or promotion to a cache's >2TB fast device we must not
      truncate the cache block's associated sector to 32 bits.  The 32-bit
      temporary result of from_cblock() caused a 32-bit multiplication when
      calculating the sector of the fast device in issue_copy_real().
      
      Use an intermediate 64-bit type to store the 32-bit from_cblock() result
      to allow for a proper 64-bit multiplication.
      
      Here is an example of how this bug manifests on an ext4 filesystem:
      
       EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 17136, 32768 clusters in bitmap, 30688 in gd; block bitmap corrupt.
       JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
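      The truncation itself is easy to reproduce in isolation; the block index
      and block size below are made-up values, only the cast pattern matters:

       #include <stdint.h>
       #include <stdio.h>

       int main(void)
       {
               uint32_t cblock = 70000000u; /* hypothetical cache block index */
               uint32_t block_size = 64;    /* sectors per cache block (made up) */

               /* Buggy: on platforms with 32-bit unsigned int, both operands are
                * 32-bit, so the product wraps before being widened to 64 bits. */
               uint64_t bad = (uint64_t)(cblock * block_size);

               /* Fixed: widen one operand first so the multiply is done in 64 bits. */
               uint64_t good = (uint64_t)cblock * block_size;

               printf("truncated: %llu\n", (unsigned long long)bad);
               printf("correct:   %llu\n", (unsigned long long)good);
               return 0;
       }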
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  16. 28 Feb 2014, 1 commit
  17. 18 Feb 2014, 2 commits
    • dm cache: do not add migration to completed list before unhooking bio · 80ae49aa
      Committed by Mike Snitzer
      When completing an overwrite bio, in overwrite_endio(), the associated
      migration should not be added to the 'completed_migrations' until the
      bio's fields are restored with dm_unhook_bio().
      
      Otherwise, do_worker() can race to process 'completed_migrations' before
      dm_unhook_bio() has restored the bio's fields -- so the bio's bi_end_io
      is still incorrect at that point.  This is unlikely to cause any problems
      given the current code, but it should be fixed on the basis of
      correctness.
      
      Also, the cache's spinlock only needs to be held when manipulating the
      'completed_migrations' list -- other changes don't need protection.
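      A condensed sketch of the corrected completion handler; dm_unhook_bio
      and the list helpers below are simplified stand-ins for illustration:

       #include <pthread.h>

       struct list_head { struct list_head *next, *prev; };
       struct bio;
       struct dm_cache_migration { struct list_head list; struct bio *bio; int err; };

       struct cache {
               pthread_mutex_t lock;             /* stand-in for the cache spinlock */
               struct list_head completed_migrations;
       };

       void dm_unhook_bio_stub(struct bio *bio); /* restores the saved bio fields */
       void list_add_tail_stub(struct list_head *n, struct list_head *head);
       void wake_worker_stub(struct cache *c);

       static void overwrite_endio_sketch(struct cache *c,
                                          struct dm_cache_migration *mg, int err)
       {
               dm_unhook_bio_stub(mg->bio); /* 1) restore the bio first, no lock needed */
               mg->err = err;

               pthread_mutex_lock(&c->lock); /* 2) lock only around the list update */
               list_add_tail_stub(&mg->list, &c->completed_migrations);
               pthread_mutex_unlock(&c->lock);

               wake_worker_stub(c);          /* 3) do_worker() now sees a restored bio */
       }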
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
    • dm cache: move hook_info into common portion of per_bio_data structure · c6eda5e8
      Committed by Mike Snitzer
      Commit c9d28d5d ("dm cache: promotion optimisation for writes")
      incorrectly placed the 'hook_info' member in the writethrough-only
      portion of the per_bio_data structure.
      
      Given that the overwrite optimization may be used for writeback, the
      'hook_info' member must be placed above the 'cache' member of the
      per_bio_data structure.  Any members above 'cache' are available from
      both the writeback and writethrough modes' per_bio_data structure.
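      An illustrative picture of the layout constraint; apart from hook_info
      and cache, the member names and types below are placeholders, not the
      real per_bio_data definition:

       struct dm_hook_info_stub { void *saved_end_io; void *saved_private; };
       struct dm_bio_prison_cell;
       struct cache;

       struct per_bio_data_sketch {
               int tick;
               struct dm_bio_prison_cell *all_io_entry;
               struct dm_hook_info_stub hook_info; /* moved above 'cache': valid in
                                                    * both writeback and writethrough */

               /* 'cache' and everything after it exist only when the larger
                * writethrough-sized per-bio data is allocated. */
               struct cache *cache;
               unsigned long cblock;
       };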
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org # 3.13+
  18. 17 Jan 2014, 1 commit
    • dm cache: add policy name to status output · 2e68c4e6
      Committed by Mike Snitzer
      The cache's policy may have been established using the "default" alias,
      which is currently the "mq" policy, but the default policy may change in
      the future.  It is useful to know exactly which policy is being used.
      
      Add a 'real' member to the dm_cache_policy_type structure and have the
      "default" dm_cache_policy_type point to the real "mq"
      dm_cache_policy_type.  Update dm_cache_policy_get_name() to check whether
      real is set and, if so, report the name of the real policy (not the alias).
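      A trimmed-down sketch of the aliasing scheme described above; the struct
      is reduced to the two relevant members and the lookup is illustrative:

       #include <stddef.h>

       struct dm_cache_policy_type_sketch {
               const char *name;
               struct dm_cache_policy_type_sketch *real; /* set only for aliases */
       };

       static struct dm_cache_policy_type_sketch mq_policy = {
               .name = "mq",
               .real = NULL,
       };

       static struct dm_cache_policy_type_sketch default_policy = {
               .name = "default",
               .real = &mq_policy,        /* "default" currently aliases "mq" */
       };

       static const char *policy_get_name(struct dm_cache_policy_type_sketch *t)
       {
               return t->real ? t->real->name : t->name; /* report the real policy */
       }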
      Requested-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  19. 10 Jan 2014, 1 commit
    • dm cache: add block sizes and total cache blocks to status output · 6a388618
      Committed by Mike Snitzer
      Improve cache_status to emit:
      <metadata block size> <#used metadata blocks>/<#total metadata blocks>
      <cache block size> <#used cache blocks>/<#total cache blocks>
      ...
      
      Adding the block sizes allows for easier calculation of the overall size
      of both the metadata and cache devices.  Adding <#total cache blocks>
      provides useful context for how much of the cache is used.
      
      Unfortunately these additions to the status will require updates to
      users' scripts that monitor the cache status.  But these changes help
      provide more comprehensive information about the cache device and will
      simplify tools that are being developed to manage dm-cache devices --
      because they won't need to issue 3 operations to cobble together the
      information that we can easily provide via a single status ioctl.
      
      While updating the status documentation in cache.txt, spaces were
      converted to tabs.
      Requested-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
  20. 11 Dec 2013, 1 commit
  21. 04 Dec 2013, 1 commit
  22. 24 Nov 2013, 2 commits
    • block: Generic bio chaining · 196d38bc
      Committed by Kent Overstreet
      This adds a generic mechanism for chaining bio completions.  This is
      going to be used for a bio_split() replacement, and it turns out to be
      very useful in a fair amount of driver code - a number of drivers were
      implementing this in their own roundabout ways, often painfully.
      
      Note that this means it's no longer possible to call bio_endio() more than
      once on the same bio! This can cause problems for drivers that save/restore
      bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all
      - in all but the simplest cases they'd be better off just cloning the
      bio, and immutable biovecs is making bio cloning cheaper. But for now,
      we add a bio_endio_nodec() for these cases.
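      A userspace model of the chaining idea: a parent completes only after
      every chained child has dropped its reference.  The parent pointer and
      remaining counter below are simplified stand-ins for the kernel's
      bio_chain()/__bi_remaining machinery:

       #include <stdio.h>

       struct bio;
       typedef void (*bio_end_io_t)(struct bio *);

       struct bio {
               int remaining;          /* stand-in for __bi_remaining */
               struct bio *parent;     /* set by chain() */
               bio_end_io_t bi_end_io;
       };

       static void endio(struct bio *bio)
       {
               if (--bio->remaining > 0)
                       return;                 /* still waiting on chained children */
               if (bio->bi_end_io)
                       bio->bi_end_io(bio);
               if (bio->parent)
                       endio(bio->parent);     /* propagate completion up the chain */
       }

       static void chain(struct bio *child, struct bio *parent)
       {
               child->parent = parent;
               parent->remaining++;            /* parent now also waits on the child */
       }

       static void parent_done(struct bio *bio)
       {
               printf("parent completed\n");
       }

       int main(void)
       {
               struct bio parent = { .remaining = 1, .bi_end_io = parent_done };
               struct bio child  = { .remaining = 1 };

               chain(&child, &parent);
               endio(&parent);  /* parent's own completion: not done yet (2 -> 1) */
               endio(&child);   /* child completes and finishes the parent */
               return 0;
       }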
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
    • block: Abstract out bvec iterator · 4f024f37
      Committed by Kent Overstreet
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
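      The shape of the change, roughly: fields that used to live directly in
      struct bio move into an embedded iterator, so bio->bi_sector becomes
      bio->bi_iter.bi_sector.  A trimmed-down sketch, not the full kernel
      definitions:

       #include <stdint.h>

       typedef uint64_t sector_t;

       struct bvec_iter_sketch {
               sector_t bi_sector;    /* device address in 512-byte sectors */
               unsigned bi_size;      /* residual I/O count in bytes */
               unsigned bi_idx;       /* current index into the biovec */
               unsigned bi_bvec_done; /* added by a later patch for immutable biovecs */
       };

       struct bio_sketch {
               struct bvec_iter_sketch bi_iter; /* was: bi_sector/bi_size/bi_idx inline */
               /* ... remaining members unchanged ... */
       };

       /* Access pattern after the change. */
       static sector_t bio_sector(struct bio_sketch *bio)
       {
               return bio->bi_iter.bi_sector;
       }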
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>