1. 19 Aug, 2014 (2 commits)
    • md/raid10: Fix memory leak when raid10 reshape completes. · b3968552
      Committed by NeilBrown
      When a raid10 commences a resync/recovery/reshape it allocates
      some buffer space.
      When a resync/recovery completes the buffer space is freed.  But not
      when the reshape completes.
      This can result in a small memory leak.
      
      There is a subtle side-effect of this bug.  When a RAID10 is reshaped
      to a larger array (more devices), the reshape is immediately followed
      by a "resync" of the new space.  This "resync" will use the buffer
      space which was allocated for "reshape".  This can cause problems
      including a "BUG" in the SCSI layer.  So this is suitable for -stable.
      
      Cc: stable@vger.kernel.org (v3.5+)
      Fixes: 3ea7daa5
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: fix memory leak when reshaping a RAID10. · ce0b0a46
      Committed by NeilBrown
      raid10 reshape clears unwanted bits from a bio->bi_flags using
      a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC
      was added.
      Since then it clears that bit but shouldn't.  This results in a
      memory leak.
      
      So change to use the approved method of clearing unwanted bits.
      
      As this causes a memory leak which can consume all of memory
      the fix is suitable for -stable.
      
      Fixes: a38352e0
      Cc: stable@vger.kernel.org (v3.10+)
      Reported-by: mdraid.pkoch@dfgh.net (Peter Koch)
      Signed-off-by: NeilBrown <neilb@suse.de>
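The bit-clearing problem described in ce0b0a46 can be illustrated in user space. The sketch below is not the kernel code: the flag names, bit positions and helper functions are assumptions chosen for illustration. It only shows how a mask that wipes every flag below the pool bits also wipes an ownership flag that was added later, while a mask limited to the resettable bits leaves it alone.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative layout only; these are NOT the kernel's definitions. */
enum {
    FLAG_UPTODATE  = 0,   /* per-I/O state, safe to clear */
    FLAG_SEG_VALID = 1,   /* per-I/O state, safe to clear */
    RESET_BITS     = 13,  /* flags below this bit are per-I/O state */
    FLAG_OWNS_VEC  = 13   /* ownership flag, must survive reuse */
};

#define LONG_BITS   (sizeof(unsigned long) * CHAR_BIT)
#define POOL_OFFSET (LONG_BITS - 4)        /* top 4 bits: pool index */
#define POOL_MASK   (1UL << POOL_OFFSET)

/* "Clumsy" method: keep only the pool bits, wiping every flag below them. */
static unsigned long clear_all_below_pool(unsigned long flags)
{
    return flags & ~(POOL_MASK - 1);
}

/* Clear only the per-I/O state bits and keep everything from RESET_BITS up. */
static unsigned long clear_resettable(unsigned long flags)
{
    assert((unsigned long)RESET_BITS < LONG_BITS);  /* shift must be in range */
    return flags & (~0UL << RESET_BITS);
}
```

With the ownership bit set, `clear_all_below_pool()` drops it, so whoever frees the object later no longer knows it owns the attached vector. That is the kind of leak the commit message describes.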
  2. 31 Jul, 2014 (1 commit)
    • md/raid1,raid10: always abort recover on write error. · 2446dba0
      Committed by NeilBrown
      Currently we don't abort recovery on a write error if the write error
      to the recovering device was triggered by normal IO (as opposed to
      recovery IO).
      
      This means that for one bitmap region, the recovery might write to the
      recovering device for a few sectors, then not bother for subsequent
      sectors (as it never writes to failed devices).  In this case
      the bitmap bit will be cleared, but it really shouldn't.
      
      The result is that if the recovering device fails and is then re-added
      (after fixing whatever hardware problem triggered the failure),
      the second recovery won't redo the region it was in the middle of,
      so some of the device will not be recovered properly.
      
      If we abort the recovery, the region being processed will be cancelled
      (bit not cleared) and the whole region will be retried.
      
      As the bug can result in data corruption the patch is suitable for
      -stable.  For kernels prior to 3.11 there is a conflict in raid10.c
      which will require care.
      
      Original-from: jiao hui <jiaohui@bwstor.com.cn>
      Reported-and-tested-by: jiao hui <jiaohui@bwstor.com.cn>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@vger.kernel.org
  3. 06 May, 2014 (1 commit)
  4. 14 Jan, 2014 (3 commits)
  5. 24 Nov, 2013 (4 commits)
    • block: Introduce new bio_split() · 20d0189b
      Committed by Kent Overstreet
      The new bio_split() can split arbitrary bios - it's not restricted to
      single page bios, like the old bio_split() (previously renamed to
      bio_pair_split()). It also has different semantics - it doesn't allocate
      a struct bio_pair, leaving it up to the caller to handle completions.
      
      Then convert the existing bio_pair_split() users to the new bio_split()
      - and also nvme, which was open coding bio splitting.
      
      (We have to take that BUG_ON() out of bio_integrity_trim() because this
      bio_split() needs to use it, and there's no reason it has to be used on
      bios marked as cloned; BIO_CLONED doesn't seem to have clearly
      documented semantics anyways.)
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Neil Brown <neilb@suse.de>
    • block: Rename bio_split() -> bio_pair_split() · ee67891b
      Committed by Kent Overstreet
      This is prep work for introducing a more general bio_split().
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Sage Weil <sage@inktank.com>
    • block: Kill bio_segments()/bi_vcnt usage · 458b76ed
      Committed by Kent Overstreet
      When we start sharing biovecs, keeping bi_vcnt accurate for splits is
      going to be error prone - and unnecessary, if we refactor some code.
      
      So bio_segments() has to go - but most of the existing users just needed
      to know if the bio had multiple segments, which is easier - add a
      bio_multiple_segments() for them.
      
      (Two of the current uses of bio_segments() are going to go away in a
      couple patches, but the current implementation of bio_segments() is
      unsafe as soon as we start doing driver conversions for immutable
      biovecs - so implement a dumb version for bisectability, it'll go away
      in a couple patches)
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
      Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
    • block: Abstract out bvec iterator · 4f024f37
      Committed by Kent Overstreet
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>
  6. 19 Nov, 2013 (1 commit)
    • md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread. · c91abf5a
      Committed by NeilBrown
      We currently use kthread_should_stop() in various places in the
      sync/reshape code to abort early.
      However some places set MD_RECOVERY_INTR but don't immediately call
      md_reap_sync_thread() (and we will shortly get another one).
      When this happens we are relying on md_check_recovery() to reap the
      thread and that only happens when it finishes normally.
      So MD_RECOVERY_INTR must lead to a normal finish without the
      kthread_should_stop() test.
      
      So replace all relevant tests, and be more careful when the thread is
      interrupted not to acknowledge the latest step in a reshape as it may
      not be fully committed yet.
      
      Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
      so we don't have to wait for the speed to drop before we can abort.
      Signed-off-by: NeilBrown <neilb@suse.de>
  7. 09 Nov, 2013 (1 commit)
  8. 24 Oct, 2013 (1 commit)
    • md: Fix skipping recovery for read-only arrays. · 61e4947c
      Committed by Lukasz Dorau
      Since:
              commit 7ceb17e8
              md: Allow devices to be re-added to a read-only array.
      
      spares are activated on a read-only array. For the raid1 and raid10
      personalities this causes not-in-sync devices to be marked in-sync
      without checking whether recovery has finished.
      
      If a read-only array is degraded and one of its devices is not in-sync
      (because the array has been only partially recovered), recovery will be skipped.
      
      This patch adds a check that recovery has finished before marking a device
      in-sync for the raid1 and raid10 personalities. For the raid5 personality
      such a condition is already present (at raid5.c:6029).
      
      Bug was introduced in 3.10 and causes data corruption.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
      Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  9. 25 Jul, 2013 (1 commit)
    • md/raid10: remove use-after-free bug. · 0eb25bb0
      Committed by NeilBrown
      We always need to be careful when calling generic_make_request, as it
      can start a chain of events which might free something that we are
      using.
      
      Here is one place I wasn't careful enough.  If the wbio2 is not in
      use, then it might get freed at the first generic_make_request call.
      So perform all necessary tests first.
      
      This bug was introduced in 3.3-rc3 (24afd80d) and can cause an
      oops, so fix is suitable for any -stable since then.
      
      Cc: stable@vger.kernel.org (3.3+)
      Signed-off-by: NeilBrown <neilb@suse.de>
  10. 18 Jul, 2013 (1 commit)
    • md/raid10: fix two problems with RAID10 resync. · 7bb23c49
      Committed by NeilBrown
      1/ When a difference between blocks is found, data is copied from
         one bio to the other.  However bv_len is used as the length to
         copy and this could be zero.  So use r10_bio->sectors to calculate
         length instead.
         Using bv_len was probably always a bit dubious, but the introduction
         of bio_advance made it much more likely to be a problem.
      
      2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
         except on bios that we schedule for a read.  This ensures that
         missing/failed devices don't confuse the loop at the top of
         sync_request write.
         Commit 8be185f2 "raid10: Use bio_reset()"
         removed a loop which set BIO_UPTODATE on all appropriate bios.
         So we need to re-add that flag.
      
      These bugs were introduced in 3.10, so this patch is suitable for
      3.10-stable, and can remove a potential for data corruption.
      
      Cc: stable@vger.kernel.org (3.10)
      Reported-by: Brassow Jonathan <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
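The first fix in 7bb23c49 derives the copy length from the request's sector count instead of trusting a per-segment length that may be zero. A minimal sketch of that calculation follows. The function name, the 512-byte sector size and the 4096-byte page size are assumptions for illustration, and the code is not taken from raid10.c.

```c
#include <assert.h>
#include <stddef.h>

#define SECTOR_SIZE 512u    /* assumed sector size in bytes */
#define PAGE_BYTES  4096u   /* assumed page size in bytes */

/*
 * Number of bytes to copy from page `idx` of a buffer that covers
 * `sectors` sectors in total. Every page is full except possibly the
 * last one, and pages past the end contribute nothing.
 */
static size_t page_copy_len(unsigned int sectors, unsigned int idx)
{
    size_t total = (size_t)sectors * SECTOR_SIZE;
    size_t start = (size_t)idx * PAGE_BYTES;

    assert(PAGE_BYTES % SECTOR_SIZE == 0);
    if (start >= total)
        return 0;
    return (total - start < PAGE_BYTES) ? total - start : PAGE_BYTES;
}
```

For a 20-sector request (10240 bytes) this gives two full pages followed by a 2048-byte tail, regardless of what any per-segment length field happens to contain.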
  11. 04 Jul, 2013 (1 commit)
    • md/raid10: fix bug which causes all RAID10 reshapes to move no data. · 13765120
      Committed by NeilBrown
      The recent commit:
      commit 7e83ccbe
          md/raid10: Allow skipping recovery when clean arrays are assembled
      
      Causes raid10 to skip a recovery in certain cases where it is safe to
      do so.  Unfortunately it also causes a reshape to be skipped which is
      never safe.  The result is that an attempt to reshape a RAID10 will
      appear to complete instantly, but no data will have been moved so the
      array will now contain garbage.
      (If nothing is written, you can recover by simply performing the
      reverse reshape which will also complete instantly).
      
      Bug was introduced in 3.10, so this is suitable for 3.10-stable.
      
      Cc: stable@vger.kernel.org (3.10)
      Cc: Martin Wilck <mwilck@arcor.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
  12. 03 Jul, 2013 (1 commit)
    • md/raid10: fix two bugs affecting RAID10 reshape. · 78eaa0d4
      Committed by NeilBrown
      1/ If a RAID10 is being reshaped to a fewer number of devices
       and is stopped while this is ongoing, then when the array is
       reassembled the 'mirrors' array will be allocated too small.
       This will lead to an access error or memory corruption.
      
      2/ A sanity test applied when a reshaping RAID10 array is restarted
       is slightly incorrect.
      
      Due to the first bug, this is suitable for any -stable
      kernel since 3.5 where this code was introduced.
      
      Cc: stable@vger.kernel.org (v3.5+)
      Signed-off-by: NeilBrown <neilb@suse.de>
  13. 14 Jun, 2013 (3 commits)
    • md/raid10: check In_sync flag in 'enough()'. · 725d6e57
      Committed by NeilBrown
      It isn't really enough to check that the rdev is present, we need to
      also be sure that the device is still In_sync.
      
      Doing this requires using rcu_dereference to access the rdev, and
      holding the rcu_read_lock() to ensure the rdev doesn't disappear while
      we look at it.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: locking changes for 'enough()'. · 635f6416
      Committed by NeilBrown
      As 'enough' accesses conf->prev and conf->geo, which can change
      spontaneously, it should guard against changes.
      This can be done with device_lock as start_reshape holds device_lock
      while updating 'geo' and end_reshape holds it while updating 'prev'.
      
      So 'error' needs to hold 'device_lock'.
      
      On the other hand, raid10_end_read_request knows which of the two it
      really wants to access, and as it is an active request on that one,
      the value cannot change underneath it.
      
      So change _enough to take a flag rather than a pointer, pass the
      appropriate flag from raid10_end_read_request(), and remove the locking.
      
      All other calls to 'enough' are made with reconfig_mutex held, so
      neither 'prev' nor 'geo' can change.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • DM RAID: Add ability to restore transiently failed devices on resume · 9092c02d
      Committed by Jonathan Brassow
      DM RAID: Add ability to restore transiently failed devices on resume
      
      This patch adds code to the resume function to check over the devices
      in the RAID array.  If any are found to be marked as failed and their
      superblocks can be read, an attempt is made to reintegrate them into
      the array.  This allows the user to refresh the array with a simple
      suspend and resume of the array - rather than having to load a
      completely new table, allocate and initialize all the structures and
      throw away the old instantiation.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  14. 13 Jun, 2013 (3 commits)
    • md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place · 5026d7a9
      Committed by H. Peter Anvin
      There are cases where the kernel will believe that the WRITE SAME
      command is supported by a block device which does not, in fact,
      support WRITE SAME.  This currently happens for SATA drivers behind a
      SAS controller, but there are probably a hundred other ways that can
      happen, including drive firmware bugs.
      
      After receiving an error for WRITE SAME the block layer will retry the
      request as a plain write of zeroes, but mdraid will consider the
      failure as fatal and consider the drive failed.  This has the effect
      that all the mirrors containing a specific set of data are each
      offlined in very rapid succession resulting in data loss.
      
      However, just bouncing the request back up to the block layer isn't
      ideal either, because the whole initial request-retry sequence should
      be inside the write bitmap fence, which probably means that md needs
      to do its own conversion of WRITE SAME to write zero.
      
      Until the failure scenario has been sorted out, disable WRITE SAME for
      raid1, raid5, and raid10.
      
      [neilb: added raid5]
      
      This patch is appropriate for any -stable since 3.7 when write_same
      support was added.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1,raid10: use freeze_array in place of raise_barrier in various places. · e2d59925
      Committed by NeilBrown
      Various places in raid1 and raid10 are calling raise_barrier when they
      really should call freeze_array.
      The former is only intended to be called from "make_request".
      The latter has extra checks for 'nr_queued' and makes a call to
      flush_pending_writes(), so it is safe to call it from within the
      management thread.
      
      Using raise_barrier will sometimes deadlock.  Using freeze_array
      should not.
      
      As 'freeze_array' currently expects one request to be pending (in
      handle_read_error - the only previous caller), we need to pass
      it the number of pending requests (extra) to ignore.
      
      The deadlock was made particularly noticeable by commits
      050b6615 (raid10) and 6b740b8d (raid1) which
      appeared in 3.4, so the fix is appropriate for any -stable
      kernel since then.
      
      This patch probably won't apply directly to some early kernels and
      will need to be applied by hand.
      
      Cc: stable@vger.kernel.org
      Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
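The "extra" argument described in e2d59925 can be pictured as an offset in the quiescence test: the caller says how many of the still-pending requests belong to itself and should be ignored. The sketch below is a user-space illustration under that assumption. The struct, its field meanings and the function name are hypothetical, and the exact condition used in the kernel is not quoted in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-array counters. */
struct array_counters {
    int nr_pending;  /* requests submitted and not yet completed */
    int nr_queued;   /* of those, requests parked for the management thread */
};

/*
 * The array counts as quiesced once every pending request is either
 * parked on the queue or is one of the `extra` requests that the
 * caller itself is holding.
 */
static bool array_is_quiesced(const struct array_counters *c, int extra)
{
    assert(extra >= 0);
    return c->nr_pending == c->nr_queued + extra;
}
```

A caller that holds one request of its own (as handle_read_error does, per the message) would pass `extra = 1`. A caller holding none would pass `0`.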
    • md/raid1: consider WRITE as successful only if at least one non-Faulty and... · 3056e3ae
      Committed by Alex Lyakas
      md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
      
      Without that fix, the following scenario could happen:
      
      - RAID1 with drives A and B; drive B was freshly-added and is rebuilding
      - Drive A fails
      - WRITE request arrives to the array. It is failed by drive A, so
      r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
      succeeds in writing it, so the same r1_bio is marked as
      R1BIO_Uptodate.
      - r1_bio arrives to handle_write_finished, badblocks are disabled,
      md_error()->error() does nothing because we don't fail the last drive
      of raid1
      - raid_end_bio_io()  calls call_bio_endio()
      - As a result, in call_bio_endio():
              if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
                      clear_bit(BIO_UPTODATE, &bio->bi_flags);
      this code doesn't clear the BIO_UPTODATE flag, and the whole master
      WRITE succeeds, back to the upper layer.
      
      So we returned success to the upper layer, even though we had written
      the data onto the rebuilding drive only. But when we want to read the
      data back, we would not read from the rebuilding drive, so this data
      is lost.
      
      [neilb - applied identical change to raid10 as well]
      
      This bug can result in lost data, so it is suitable for any
      -stable kernel.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
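The rule stated in the title of 3056e3ae can be written as a small predicate. This is a sketch of the decision only, not the raid1 code. The struct and function names are hypothetical stand-ins for the per-device state the message refers to (Faulty, and in-sync as opposed to rebuilding).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device state. */
struct member {
    bool faulty;   /* device has been marked Faulty */
    bool in_sync;  /* false while the device is still rebuilding */
};

/* Outcome of one leg of a mirrored write. */
struct write_leg {
    const struct member *dev;
    bool io_ok;    /* the device reported the write as completed */
};

/*
 * A mirrored WRITE counts as successful only if at least one device
 * that is neither Faulty nor rebuilding completed it.
 */
static bool write_succeeded(const struct write_leg *legs, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        const struct member *d = legs[i].dev;
        if (legs[i].io_ok && !d->faulty && d->in_sync)
            return true;
    }
    return false;
}
```

In the scenario from the message, drive A fails the write and rebuilding drive B completes it. The predicate returns false, so success is not reported to the upper layer.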
  15. 30 Apr, 2013 (2 commits)
  16. 24 Apr, 2013 (1 commit)
  17. 24 Mar, 2013 (5 commits)
    • raid10: Use bio_reset() · 8be185f2
      Committed by Kent Overstreet
      More prep work for immutable bio vecs, mainly getting rid of references
      to bi_idx.
      
      bio_reset was being open coded in a few places. The one in sync_request
      was a bit nontrivial to convert, so could use some extra eyeballs.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
      Acked-by: NeilBrown <neilb@suse.de>
    • block: Add submit_bio_wait(), remove from md · 9e882242
      Committed by Kent Overstreet
      Random cleanup - this code was duplicated and it's not really specific
      to md.
      
      Also added the ability to return the actual error code.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
      Acked-by: Tejun Heo <tj@kernel.org>
    • block: Remove bi_idx references · 4f2ac93c
      Committed by Kent Overstreet
      For immutable bvecs, all bi_idx usage needs to be audited - so here
      we're removing all the unnecessary uses.
      
      Most of these are places where it was being initialized on a bio that
      was just allocated, a few others are conversions to standard macros.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
    • block: Change bio_split() to respect the current value of bi_idx · 5b83636a
      Committed by Kent Overstreet
      In the current code bio_split() won't be seeing partially completed bios
      so this doesn't change any behaviour, but this makes the code a bit
      clearer as to what bio_split() actually requires.
      
      The immediate purpose of the patch is removing unnecessary bi_idx
      references, but the end goal is to allow partially completed bios to be
      submitted, which along with immutable biovecs enables efficient bio
      splitting.
      
      Some of the callers were (double) checking that bios could be split, so
      update their checks too.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
      CC: Neil Brown <neilb@suse.de>
      CC: Martin K. Petersen <martin.petersen@oracle.com>
    • block: Use bio_sectors() more consistently · aa8b57aa
      Committed by Kent Overstreet
      Bunch of places in the code weren't using it where they could be -
      this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx
      into a struct bvec_iter.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: "Ed L. Cashin" <ecashin@coraid.com>
      CC: Nick Piggin <npiggin@kernel.dk>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Jim Paris <jim@jtan.com>
      CC: Geoff Levand <geoff@infradead.org>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Ed Cashin <ecashin@coraid.com>
  18. 26 Feb, 2013 (5 commits)
    • md/raid1,raid10: fix deadlock with freeze_array() · ee0b0244
      Committed by NeilBrown
      When raid1/raid10 needs to fix a read error, it first drains
      all pending requests by calling freeze_array().
      This calls flush_pending_writes() if it needs to sleep,
      but some writes may be pending in a per-process plug rather
      than in the per-array request queue.
      
      When raid1{,0}_unplug() moves the request from the per-process
      plug to the per-array request queue (from which
      flush_pending_writes() can flush them), it needs to wake up
      freeze_array(), or freeze_array() will never flush them and so
      it will block forever.
      
      So add the required wake_up() calls.
      
      This bug was introduced by commit
         f54a9d0e
      for raid1 and a similar commit for RAID10, and so has been present
      since linux-3.6.  As the bug causes a deadlock I believe this fix is
      suitable for -stable.
      
      Cc: stable@vger.kernel.org (3.6.y 3.7.y 3.8.y)
      Reported-by: Tregaron Bayly <tbayly@bluehost.com>
      Tested-by: Tregaron Bayly <tbayly@bluehost.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 2) · 9a3152ab
      Committed by Jonathan Brassow
      MD RAID10:  Improve redundancy for 'far' and 'offset' algorithms (part 2)
      
      This patch addresses raid arrays that have a number of devices that cannot
      be evenly divided by 'far_copies'.  (E.g. 5 devices, far_copies = 2)  This
      case must be handled differently because it causes that last set to be of
      a different size than the rest of the sets.  We must compute a new modulo
      for this last set so that copied chunks are properly wrapped around.
      
      Example use_far_sets=1, far_copies=2, near_copies=1, devices=5:
                      "far" algorithm
              dev1 dev2 dev3 dev4 dev5
      	==== ==== ==== ==== ====
      	[ A   B ] [ C    D   E ]
              [ G   H ] [ I    J   K ]
                          ...
              [ B   A ] [ E    C   D ] --> nominal set of 2 and last set of 3
              [ H   G ] [ K    I   J ]     []'s show far/offset sets
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
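The 5-device example in 9a3152ab can be reproduced with a small mapping function. This is a sketch based only on the layout diagram in the message: sets are 2 devices wide (near_copies * far_copies), the last set absorbs the remainder, and the copy is shifted by one device within its set. The function names are hypothetical and the kernel's actual arithmetic may differ.

```c
#include <assert.h>
#include <string.h>

/*
 * Device index (0-based) that holds the far copy of the chunk whose
 * first copy is on `dev`. Sets are `set_size` devices wide and the
 * last set absorbs any remainder, so its modulo differs from the rest.
 */
static int far_copy_dev(int dev, int devices, int set_size)
{
    assert(set_size > 0 && devices >= set_size);
    assert(dev >= 0 && dev < devices);

    int last_start = (devices / set_size - 1) * set_size;
    int base = dev >= last_start ? last_start : (dev / set_size) * set_size;
    int size = dev >= last_start ? devices - last_start : set_size;

    /* shift one device to the right, wrapping inside the set */
    return base + (dev - base + 1) % size;
}

/* Build the far-copy row for one stripe, one character per chunk. */
static void far_row(const char *stripe, char *out, int devices, int set_size)
{
    assert((int)strlen(stripe) >= devices);
    for (int d = 0; d < devices; d++)
        out[far_copy_dev(d, devices, set_size)] = stripe[d];
    out[devices] = '\0';
}
```

Calling `far_row("ABCDE", out, 5, 2)` yields the row `B A E C D` shown in the diagram: a nominal set of 2 followed by a last set of 3.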
    • MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1) · 475901af
      Committed by Jonathan Brassow
      The MD RAID10 'far' and 'offset' algorithms make copies of entire stripe
      widths - copying them to a different location on the same devices after
      shifting the stripe.  An example layout of each follows below:
      
      	        "far" algorithm
      	dev1 dev2 dev3 dev4 dev5 dev6
      	==== ==== ==== ==== ==== ====
      	 A    B    C    D    E    F
      	 G    H    I    J    K    L
      	            ...
      	 F    A    B    C    D    E  --> Copy of stripe0, but shifted by 1
      	 L    G    H    I    J    K
      	            ...
      
      		"offset" algorithm
      	dev1 dev2 dev3 dev4 dev5 dev6
      	==== ==== ==== ==== ==== ====
      	 A    B    C    D    E    F
      	 F    A    B    C    D    E  --> Copy of stripe0, but shifted by 1
      	 G    H    I    J    K    L
      	 L    G    H    I    J    K
      	            ...
      
      Redundancy for these algorithms is gained by shifting the copied stripes
      one device to the right.  This patch proposes that array be divided into
      sets of adjacent devices and when the stripe copies are shifted, they wrap
      on set boundaries rather than the array size boundary.  That is, for the
      purposes of shifting, the copies are confined to their sets within the
      array.  The sets are 'near_copies * far_copies' in size.
      
      The above "far" algorithm example would change to:
      	        "far" algorithm
      	dev1 dev2 dev3 dev4 dev5 dev6
      	==== ==== ==== ==== ==== ====
      	 A    B    C    D    E    F
      	 G    H    I    J    K    L
      	            ...
      	 B    A    D    C    F    E  --> Copy of stripe0, shifted 1, 2-dev sets
      	 H    G    J    I    L    K      Dev sets are 1-2, 3-4, 5-6
      	            ...
      
      This has the effect of improving the redundancy of the array.  We can
      always sustain at least one failure, but sometimes more than one can
      be handled.  In the first examples, the pairs of devices that CANNOT fail
      together are:
      	(1,2) (2,3) (3,4) (4,5) (5,6) (1, 6) [40% of possible pairs]
      In the example where the copies are confined to sets, the pairs of
      devices that cannot fail together are:
      	(1,2) (3,4) (5,6)                    [20% of possible pairs]
      
      We cannot simply replace the old algorithms, so the 17th bit of the 'layout'
      variable is used to indicate whether we use the old or new method of computing
      the shift.  (This is similar to the way the 16th bit indicates whether the
      "far" algorithm or the "offset" algorithm is being used.)
      
      This patch only handles the cases where the number of total raid disks is
      a multiple of 'far_copies'.  A follow-on patch addresses the condition where
      this is not true.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
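The 40% and 20% figures in 475901af can be checked by counting the device pairs that hold both copies of some chunk. The sketch below assumes two copies per chunk and a device count that is a multiple of the set size, which is the case this commit covers. The helper names are hypothetical, and the old layout is modelled as a single set spanning the whole array.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEVS 16

/* Device holding the far copy: shift by one, wrapping inside the set. */
static int far_copy_dev(int dev, int devices, int set_size)
{
    assert(set_size > 0 && devices % set_size == 0);
    assert(dev >= 0 && dev < devices);

    int base = (dev / set_size) * set_size;
    return base + (dev - base + 1) % set_size;
}

/* Count unordered device pairs whose joint failure loses some chunk. */
static int fatal_pairs(int devices, int set_size)
{
    unsigned char seen[MAX_DEVS][MAX_DEVS];
    int count = 0;

    assert(devices > 0 && devices <= MAX_DEVS);
    memset(seen, 0, sizeof(seen));

    for (int d = 0; d < devices; d++) {
        int c = far_copy_dev(d, devices, set_size);
        int lo = d < c ? d : c;
        int hi = d < c ? c : d;

        if (lo != hi && !seen[lo][hi]) {
            seen[lo][hi] = 1;
            count++;
        }
    }
    return count;
}
```

With 6 devices there are 15 possible pairs. `fatal_pairs(6, 6)` models the original layout and gives 6 pairs (40%), while `fatal_pairs(6, 2)` models 2-device sets and gives 3 pairs (20%).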
    • MD RAID10: Minor non-functional code changes · 4c0ca26b
      Committed by Jonathan Brassow
      Changes include assigning 'addr' from 's' instead of 'sector' to be
      consistent with the way the code does it just a few lines later and
      using '%=' vs a conditional and subtraction.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: raid1,10: Handle REQ_WRITE_SAME flag in write bios · c8dc9c65
      Committed by Joe Lawrence
      Set mddev queue's max_write_same_sectors to its chunk_sector value (before
      disk_stack_limits merges the underlying disk limits.)  With that in place,
      be sure to handle writes coming down from the block layer that have the
      REQ_WRITE_SAME flag set.  That flag needs to be copied into any newly cloned
      write bio.
      Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
      Acked-by: "Martin K. Petersen" <martin.petersen@oracle.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  19. 30 Nov, 2012 (1 commit)
    • wait: add wait_event_lock_irq() interface · eed8c02e
      Committed by Lukas Czerner
      New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit
      moves the private wait_event_lock_irq() macro from MD to the regular wait
      includes, introduces a new macro wait_event_lock_irq_cmd() to replace the
      old, ugly method of omitting the cmd parameter, and makes MD use the new
      macros. It also introduces the _interruptible_ variant.
      
      The new interface is for cases where one has a special lock to protect data
      structures used in the condition, or where one also needs to invoke "cmd"
      before going to sleep.
      
      All new macros are expected to be called with the lock taken. The lock
      is released before sleep and is reacquired afterwards. We will leave the
      macro with the lock held.
      
      Note to DM: IMO this should also fix theoretical race on waitqueue while
      using simultaneously wait_event_lock_irq() and wait_event() because of
      lack of locking around current state setting and wait queue removal.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  20. 27 Nov, 2012 (1 commit)
    • md/raid1{,0}: fix deadlock in bitmap_unplug. · 874807a8
      Committed by NeilBrown
      If the raid1 or raid10 unplug function gets called
      from a make_request function (which is very possible) when
      there are bios on the current->bio_list list, then it will not
      be able to successfully call bitmap_unplug() and it could
      need to submit more bios and wait for them to complete.
      But they won't complete while current->bio_list is non-empty.
      
      So detect that case and hand the unplugging off to another thread
      just like we already do when called from within the scheduler.
      
      RAID1 version of bug was introduced in 3.6, so that part of fix is
      suitable for 3.6.y.  RAID10 part won't apply.
      
      Cc: stable@vger.kernel.org
      Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
      Reported-by: Peter Maloney <peter.maloney@brockmann-consult.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
  21. 22 Nov, 2012 (1 commit)
    • md/raid10: decrement correct pending counter when writing to replacement. · 884162df
      Committed by NeilBrown
      When a write to a replacement device completes, we carefully
      and correctly found the rdev that the write actually went to
      and then blithely called rdev_dec_pending on the primary rdev,
      even if this write was to the replacement.
      
      This means that any writes to an array while a replacement
      was ongoing would cause the nr_pending count for the primary
      device to go negative, so it could never be removed.
      
      This bug has been present since replacement was introduced in
      3.3, so it is suitable for any -stable kernel since then.
      Reported-by: "George Spelvin" <linux@horizon.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>