1. 09 Apr 2014, 1 commit
    • raid5: make_request does less prepare wait · 27c0f68f
      Committed by Shaohua Li
      On a NUMA machine, prepare_to_wait/finish_wait in make_request exposes a
      lot of contention for sequential workloads (or workloads with large
      request sizes). For such workloads, each bio covers several stripes, so
      we can just do prepare_to_wait/finish_wait once for the whole bio instead
      of once per stripe.  This eliminates the lock contention entirely for
      such workloads. Random workloads might show similar lock contention too,
      but I haven't seen it yet, maybe because my storage is still not fast
      enough.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      27c0f68f
  2. 13 Feb 2014, 1 commit
    • md/raid5: Fix CPU hotplug callback registration · 789b5e03
      Committed by Oleg Nesterov
      Subsystems that want to register CPU hotplug callbacks, as well as perform
      initialization for the CPUs that are already online, often do it as shown
      below:
      
      	get_online_cpus();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	register_cpu_notifier(&foobar_cpu_notifier);
      
      	put_online_cpus();
      
      This is wrong, since it is prone to ABBA deadlocks involving the
      cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
      with CPU hotplug operations).
      
      Interestingly, the raid5 code can actually prevent double initialization and
      hence can use the following simplified form of callback registration:
      
      	register_cpu_notifier(&foobar_cpu_notifier);
      
      	get_online_cpus();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	put_online_cpus();
      
      A hotplug operation that occurs between registering the notifier and
      calling get_online_cpus() won't disrupt anything, because the code takes
      care to perform the memory allocations only once.
      
      So reorganize the code in raid5 this way to fix the deadlock with callback
      registration.
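
      For illustration, a minimal sketch (with hypothetical names, not the
      exact raid5 helpers) of the guarded, run-once initialization that makes
      the simplified ordering safe:

      	static int scratch_init_cpu(int cpu)
      	{
      		struct scratch *s = per_cpu_ptr(scratch, cpu);

      		if (s->page)		/* already set up by the notifier path */
      			return 0;
      		s->page = alloc_page(GFP_KERNEL);
      		return s->page ? 0 : -ENOMEM;
      	}

      With the notifier also calling the same helper on CPU_UP_PREPARE, running
      it a second time from the for_each_online_cpu() loop is a harmless no-op.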
      
      Cc: linux-raid@vger.kernel.org
      Cc: stable@vger.kernel.org (v2.6.32+)
      Fixes: 36d1c647
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      [Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
      free_scratch_buffer() helper to condense code further and wrote the changelog.]
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      789b5e03
  3. 22 Jan 2014, 1 commit
    • md/raid5: close recently introduced race in stripe_head management. · 7da9d450
      Committed by NeilBrown
      As release_stripe and __release_stripe decrement ->count and then
      manipulate ->lru both under ->device_lock, it is important that
      get_active_stripe() increments ->count and clears ->lru also under
      ->device_lock.
      
      However we currently list_del_init ->lru under the lock, but increment
      the ->count outside the lock.  This can lead to races and list
      corruption.
      
      So move the atomic_inc(&sh->count) up inside the ->device_lock
      protected region.
      
      Note that we still increment ->count without device lock in the case
      where get_free_stripe() was called, and in fact don't take
      ->device_lock at all in that path.
      This is safe because if the stripe_head can be found by
      get_free_stripe, then the hash lock assures us that no-one else could
      possibly be calling release_stripe() at the same time.
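
      A rough sketch of the resulting ordering in get_active_stripe()
      (simplified, not the exact code):

      	spin_lock(&conf->device_lock);
      	if (!atomic_read(&sh->count))
      		list_del_init(&sh->lru);	/* take it off the inactive/lru list */
      	atomic_inc(&sh->count);			/* now inside the locked region */
      	spin_unlock(&conf->device_lock);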
      
      Fixes: 566c09c5
      Cc: stable@vger.kernel.org (3.13)
      Reported-and-tested-by: Ian Kumlien <ian.kumlien@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      7da9d450
  4. 16 Jan 2014, 1 commit
    • md/raid5: fix long-standing problem with bitmap handling on write failure. · 9f97e4b1
      Committed by NeilBrown
      Before a write starts we set a bit in the write-intent bitmap.
      When the write completes we clear that bit if the write was successful
      to all devices.  However if the write wasn't fully successful we
      should not clear the bit.  If the faulty drive is subsequently
      re-added, the fact that the bit is still set ensures that we will
      re-write the data that is missing.
      
      This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
      bitmap bit when this flag is not set.
      Currently we correctly set the flag if a write starts when some
      devices are failed or missing.  But we do *not* set the flag if some
      device failed during the write attempt.
      This is wrong and can result in clearing the bit inappropriately.
      
      So: set the flag when a write fails.
      
      This bug has been present since bitmaps were introduced, so the fix is
      suitable for any -stable kernel.
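
      A sketch of the fix in the write-completion path (simplified; the flag is
      later checked before the bitmap bit is cleared):

      	/* raid5_end_write_request(), on error: */
      	if (!uptodate) {
      		set_bit(R5_WriteError, &sh->dev[i].flags);
      		set_bit(STRIPE_DEGRADED, &sh->state);	/* keep the bitmap bit set */
      	}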
      Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      9f97e4b1
  5. 14 Jan 2014, 2 commits
    • md/raid5: fix a recently broken BUG_ON(). · 5af9bef7
      Committed by NeilBrown
      commit 6d183de4
          md/raid5: fix newly-broken locking in get_active_stripe.
      
      simplified a BUG_ON, but removed too much so now it sometimes fires
      when it shouldn't.
      
      When the STRIPE_EXPANDING flag is set, the stripe_head might be on a
      special list while multiple stripe_heads are collected, or it might
      not be on any list, even a 'free' list when the refcount is zero.  As
      long as STRIPE_EXPANDING is set, it will be found and added back to a
      list eventually.
      
      So both of the BUG_ONs which test for the ->lru being empty or not
      need to avoid the case where STRIPE_EXPANDING is set.
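
      A sketch of the adjusted test (the other BUG_ON gets the symmetric
      treatment):

      	BUG_ON(list_empty(&sh->lru) &&
      	       !test_bit(STRIPE_EXPANDING, &sh->state));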
      
      The patch which broke this was marked for -stable, so this patch needs
      to be applied to any branch that received 6d183de4
      
      Fixes: 6d183de4
      Cc: stable@vger.kernel.org (any release to which above was applied)
      Signed-off-by: NeilBrown <neilb@suse.de>
      5af9bef7
    • md/raid5: Fix possible confusion when multiple write errors occur. · 1cc03eb9
      Committed by NeilBrown
      commit 5d8c71f9
          md: raid5 crash during degradation
      
      Fixed a crash in an overly simplistic way which could leave
      R5_WriteError or R5_MadeGood set in the stripe cache for devices
      for which they are no longer relevant.
      When those devices are removed and spares added the flags are still
      set and can cause incorrect behaviour.
      
      commit 14a75d3e
          md/raid5: preferentially read from replacement device if possible.
      
      Fixed the same bug in a more effective way, so we can now revert
      the original commit.
      Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Cc: stable@vger.kernel.org (3.2+ - 3.2 will need a different fix though)
      Fixes: 5d8c71f9
      Signed-off-by: NeilBrown <neilb@suse.de>
      1cc03eb9
  6. 09 Jan 2014, 1 commit
    • bcache/md: Use raid stripe size · c78afc62
      Committed by Kent Overstreet
      Now that we've got code for raid5/6 stripe awareness, bcache just needs
      to know about the stripes and when writing partial stripes is expensive
      - we probably don't want to enable this optimization for raid1 or 10,
      even though they have stripes. So add a flag to queue_limits.
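
      A rough sketch of how raid5/6 might advertise this, assuming the flag
      added here is the raid_partial_stripes_expensive field in struct
      queue_limits:

      	int stripe = (mddev->chunk_sectors << 9) *
      		     (conf->raid_disks - conf->max_degraded);

      	mddev->queue->limits.raid_partial_stripes_expensive = 1;
      	blk_queue_io_opt(mddev->queue, stripe);	/* full-stripe size in bytes */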
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      c78afc62
  7. 28 Nov 2013, 2 commits
    • md/raid5: fix newly-broken locking in get_active_stripe. · 6d183de4
      Committed by NeilBrown
      commit 566c09c5 raid5: relieve lock contention in get_active_stripe()
      
      modified the locking in get_active_stripe() reducing the range
      protected by the (highly contended) device_lock.
      Unfortunately it reduced the range too much opening up some races.
      
      One race can occur if get_priority_stripe runs between the
      test on sh->count and device_lock being taken.
      This will mean that sh->lru is not empty while get_active_stripe
      thinks ->count is zero resulting in a 'BUG' firing.
      
      Another race happens if __release_stripe is called immediately
      after sh->count is tested and found to be non-zero.  If STRIPE_HANDLE
      is not set, get_active_stripe should increment ->active_stripes
      when it increments ->count from 0, but as it didn't think it was 0,
      it doesn't.
      
      Extending device_lock to cover the test on sh->count closes these
      races.
      
      While we are here, fix the two BUG tests:
       - If count is zero, then lru really must not be empty, or we've
         lost the stripe_head somehow - no other tests are relevant.
       - STRIPE_ON_RELEASE_LIST is completely independent of ->lru so
         testing it is pointless.
      Reported-and-tested-by: Brassow Jonathan <jbrassow@redhat.com>
      Reviewed-by: Shaohua Li <shli@kernel.org>
      Fixes: 566c09c5
      Signed-off-by: NeilBrown <neilb@suse.de>
      6d183de4
    • md/raid5: fix new memory-reference bug in alloc_thread_groups. · 0c775d52
      Committed by NeilBrown
      In alloc_thread_groups, worker_groups is a pointer to an array,
      not an array of pointers.
      So
         worker_groups[i]
      is wrong.  It should be
         &(*worker_groups)[i]
      
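      The same pitfall in miniature (illustrative types only, not the raid5
      ones):

      	static void alloc_groups(struct group **groups, int cnt)
      	{
      		int i;

      		*groups = kcalloc(cnt, sizeof(struct group), GFP_KERNEL);
      		if (!*groups)
      			return;
      		for (i = 0; i < cnt; i++) {
      			struct group *g = &(*groups)[i];	/* correct */
      			/* "groups[i]" would read a non-existent i-th pointer */
      			INIT_LIST_HEAD(&g->handle_list);
      		}
      	}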
      Found-by: coverity
      Fixes: 60aaf933
      Reported-by: Ben Hutchings <bhutchings@solarflare.com>
      Cc: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      0c775d52
  8. 24 Nov 2013, 2 commits
    • block: Convert bio_for_each_segment() to bvec_iter · 7988613b
      Committed by Kent Overstreet
      More prep work for immutable biovecs - with immutable bvecs, drivers
      won't be able to use the biovec directly; they'll need to use helpers
      that take into account bio->bi_iter.bi_bvec_done.
      
      This updates callers for the new usage without changing the
      implementation yet.
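
      Caller-side shape after the conversion (a sketch): the bio_vec is now a
      value filled in by the iterator rather than a pointer into bi_io_vec.

      	struct bio_vec bvec;
      	struct bvec_iter iter;

      	bio_for_each_segment(bvec, bio, iter) {
      		void *p = kmap_atomic(bvec.bv_page);
      		memset(p + bvec.bv_offset, 0, bvec.bv_len);
      		kunmap_atomic(p);
      	}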
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Paul Clements <Paul.Clements@steeleye.com>
      Cc: Jim Paris <jim@jtan.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
      Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
      Cc: support@lsi.com
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: drbd-user@lists.linbit.com
      Cc: nbd-general@lists.sourceforge.net
      Cc: cbe-oss-dev@lists.ozlabs.org
      Cc: xen-devel@lists.xensource.com
      Cc: virtualization@lists.linux-foundation.org
      Cc: linux-raid@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Cc: DL-MPTFusionLinux@lsi.com
      Cc: linux-scsi@vger.kernel.org
      Cc: devel@driverdev.osuosl.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: cluster-devel@redhat.com
      Cc: linux-mm@kvack.org
      Acked-by: Geoff Levand <geoff@infradead.org>
      7988613b
    • block: Abstract out bvec iterator · 4f024f37
      Committed by Kent Overstreet
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
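
      After the rename, the per-bio position fields live under bi_iter
      (sketch):

      	sector_t sector  = bio->bi_iter.bi_sector;	/* was bio->bi_sector */
      	unsigned int len = bio->bi_iter.bi_size;	/* was bio->bi_size */
      	unsigned int idx = bio->bi_iter.bi_idx;		/* was bio->bi_idx */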
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>
      4f024f37
  9. 19 Nov 2013, 6 commits
    • md/raid5: Use conf->device_lock protect changing of multi-thread resources. · 60aaf933
      Committed by majianpeng
      When we change group_thread_cnt from its sysfs entry, the kernel can oops.
      
      The kernel messages are:
      [  135.299021] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [  135.299073] IP: [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.299107] PGD 0
      [  135.299122] Oops: 0000 [#1] SMP
      [  135.299144] Modules linked in: netconsole e1000e ptp pps_core
      [  135.299188] CPU: 3 PID: 2225 Comm: md0_raid5 Not tainted 3.12.0+ #24
      [  135.299214] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
      [  135.299255] task: ffff8800b9638f80 ti: ffff8800b77a4000 task.ti: ffff8800b77a4000
      [  135.299283] RIP: 0010:[<ffffffff815188ab>]  [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.299323] RSP: 0018:ffff8800b77a5c48  EFLAGS: 00010002
      [  135.299344] RAX: ffff880037bb5c70 RBX: 0000000000000000 RCX: 0000000000000008
      [  135.299371] RDX: ffff880037bb5cb8 RSI: 0000000000000001 RDI: ffff880037bb5c00
      [  135.299398] RBP: ffff8800b77a5d08 R08: 0000000000000001 R09: 0000000000000000
      [  135.299425] R10: ffff8800b77a5c98 R11: 00000000ffffffff R12: ffff880037bb5c00
      [  135.299452] R13: 0000000000000000 R14: 0000000000000000 R15: ffff880037bb5c70
      [  135.299479] FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
      [  135.299510] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  135.299532] CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 00000000000407e0
      [  135.299559] Stack:
      [  135.299570]  ffff8800b77a5c88 ffffffff8107383e ffff8800b77a5c88 ffff880037a64300
      [  135.299611]  000000000000ec08 ffff880037bb5cb8 ffff8800b77a5c98 ffffffffffffffd8
      [  135.299654]  000000000000ec08 ffff880037bb5c60 ffff8800b77a5c98 ffff8800b77a5c98
      [  135.299696] Call Trace:
      [  135.299711]  [<ffffffff8107383e>] ? __wake_up+0x4e/0x70
      [  135.299733]  [<ffffffff81518f88>] raid5d+0x4c8/0x680
      [  135.299756]  [<ffffffff817174ed>] ? schedule_timeout+0x15d/0x1f0
      [  135.299781]  [<ffffffff81524c9f>] md_thread+0x11f/0x170
      [  135.299804]  [<ffffffff81069cd0>] ? wake_up_bit+0x40/0x40
      [  135.299826]  [<ffffffff81524b80>] ? md_rdev_init+0x110/0x110
      [  135.299850]  [<ffffffff81069656>] kthread+0xc6/0xd0
      [  135.299871]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  135.299899]  [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
      [  135.299923]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  135.299951] Code: ff ff ff 0f 84 d7 fe ff ff e9 5c fe ff ff 66 90 41 8b b4 24 d8 01 00 00 45 31 ed 85 f6 0f 8e 7b fd ff ff 49 8b 9c 24 d0 01 00 00 <48> 3b 1b 49 89 dd 0f 85 67 fd ff ff 48 8d 43 28 31 d2 eb 17 90
      [  135.300005] RIP  [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.300005]  RSP <ffff8800b77a5c48>
      [  135.300005] CR2: 0000000000000000
      [  135.300005] ---[ end trace 504854e5bb7562ed ]---
      [  135.300005] Kernel panic - not syncing: Fatal exception
      
      This is because raid5d() can be running when the multi-thread
      resources are changed via sysfs, so we need to provide locking.
      
      conf->device_lock is suitable, but we cannot simply call
      alloc_thread_groups under this lock, as we cannot allocate memory
      while holding a spinlock.
      So change alloc_thread_groups() to allocate and return the data
      structures, then raid5_store_group_thread_cnt() can take the lock
      while updating the pointers to the data structures.
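
      Approximately, the store path then becomes (a simplified sketch, not the
      literal diff):

      	err = alloc_thread_groups(conf, new_cnt, &group_cnt,
      				  &worker_cnt_per_group, &new_groups);
      	if (!err) {
      		spin_lock_irq(&conf->device_lock);
      		old_groups = conf->worker_groups;
      		conf->group_cnt = group_cnt;
      		conf->worker_cnt_per_group = worker_cnt_per_group;
      		conf->worker_groups = new_groups;	/* swap pointers under the lock */
      		spin_unlock_irq(&conf->device_lock);
      		kfree(old_groups);
      	}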
      
      This fixes a bug introduced in 3.12 and so is suitable for the 3.12.x
      stable series.
      
      Fixes: b721420e
      Cc: stable@vger.kernel.org (3.12)
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Shaohua Li <shli@kernel.org>
      60aaf933
    • md/raid5: Before freeing old multi-thread worker, it should flush them. · d206dcfa
      Committed by majianpeng
      When changing group_thread_cnt from its sysfs entry, the kernel can oops.
      
      The kernel messages are:
      [  740.961389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [  740.961444] IP: [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.961476] PGD b9013067 PUD b651e067 PMD 0
      [  740.961503] Oops: 0000 [#1] SMP
      [  740.961525] Modules linked in: netconsole e1000e ptp pps_core
      [  740.961577] CPU: 0 PID: 3683 Comm: kworker/u8:5 Not tainted 3.12.0+ #23
      [  740.961602] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
      [  740.961646] task: ffff88013abe0000 ti: ffff88013a246000 task.ti: ffff88013a246000
      [  740.961673] RIP: 0010:[<ffffffff81062570>]  [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.961708] RSP: 0018:ffff88013a247e08  EFLAGS: 00010086
      [  740.961730] RAX: ffff8800b912b400 RBX: ffff88013a61e680 RCX: ffff8800b912b400
      [  740.961757] RDX: ffff8800b912b600 RSI: ffff8800b912b600 RDI: ffff88013a61e680
      [  740.961782] RBP: ffff88013a247e48 R08: ffff88013a246000 R09: 000000000002c09d
      [  740.961808] R10: 000000000000010f R11: 0000000000000000 R12: ffff88013b00cc00
      [  740.961833] R13: 0000000000000000 R14: ffff88013b00cf80 R15: ffff88013a61e6b0
      [  740.961861] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
      [  740.961893] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  740.962001] CR2: 00000000000000b8 CR3: 00000000b24fe000 CR4: 00000000000407f0
      [  740.962001] Stack:
      [  740.962001]  0000000000000008 ffff8800b912b600 ffff88013b00cc00 ffff88013a61e680
      [  740.962001]  ffff88013b00cc00 ffff88013b00cc18 ffff88013b00cf80 ffff88013a61e6b0
      [  740.962001]  ffff88013a247eb8 ffffffff810639c6 0000000000012a80 ffff88013a247fd8
      [  740.962001] Call Trace:
      [  740.962001]  [<ffffffff810639c6>] worker_thread+0x206/0x3f0
      [  740.962001]  [<ffffffff810637c0>] ? manage_workers+0x2c0/0x2c0
      [  740.962001]  [<ffffffff81069656>] kthread+0xc6/0xd0
      [  740.962001]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  740.962001]  [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
      [  740.962001]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  740.962001] Code: 89 e5 41 57 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 18 48 8b 06 4c 8b 67 48 48 89 c1 30 c9 a8 04 4c 0f 45 e9 80 7f 58 00 <49> 8b 45 08 44 8b b0 00 01 00 00 78 0c 41 f6 44 24 10 04 0f 84
      [  740.962001] RIP  [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.962001]  RSP <ffff88013a247e08>
      [  740.962001] CR2: 0000000000000008
      [  740.962001] ---[ end trace 39181460000748de ]---
      [  740.962001] Kernel panic - not syncing: Fatal exception
      
      This can happen if there are some stripes left, fewer than
      MAX_STRIPE_BATCH, and a worker is queued to handle them.
      But before raid5_do_work() is called, raid5d handles those stripes,
      making conf->active_stripes reach 0.
      So mddev_suspend() can return, and we might then free the old worker
      resources before the queued raid5_do_work() has handled them.
      When it runs, it crashes.
      
      	raid5d()		raid5_store_group_thread_cnt()
      	queue_work		mddev_suspend()
      				handle_strips
      				active_stripe=0
      				free(old worker resources)
      	process_one_work
      	raid5_do_work
      
      To avoid this, flush the workqueue before freeing the old worker resources.
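
      That is, something along these lines in raid5_store_group_thread_cnt()
      (sketch):

      	mddev_suspend(mddev);
      	flush_workqueue(raid5_wq);	/* drain any queued raid5_do_work() */
      	/* now it is safe to free the old groups and install the new ones */
      	mddev_resume(mddev);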
      
      This fixes a bug introduced in 3.12 so is suitable for the 3.12.x
      stable series.
      
      Cc: stable@vger.kernel.org (3.12)
      Fixes: b721420e
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Shaohua Li <shli@kernel.org>
      d206dcfa
    • md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE. · e59aa23f
      Committed by majianpeng
      R5_ReadNoMerge means the bio must not be merged with other bios or
      requests.  The code used REQ_FLUSH to achieve this, but REQ_NOMERGE
      does the same job.
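
      The change amounts to roughly this in the I/O submission path (sketch):

      	if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags))
      		bi->bi_rw |= REQ_NOMERGE;	/* previously REQ_FLUSH */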
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      e59aa23f
    • md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread. · c91abf5a
      Committed by NeilBrown
      We currently use kthread_should_stop() in various places in the
      sync/reshape code to abort early.
      However some places set MD_RECOVERY_INTR but don't immediately call
      md_reap_sync_thread() (and we will shortly get another one).
      When this happens we are relying on md_check_recovery() to reap the
      thread, and that only happens when the thread finishes normally.
      So MD_RECOVERY_INTR must lead to a normal finish without the
      kthread_should_stop() test.
      
      So replace all relevant tests, and be more careful when the thread is
      interrupted not to acknowledge the latest step in a reshape, as it may
      not be fully committed yet.
      
      Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
      so we don't have to wait for the speed to drop before we can abort.
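
      The substitution looks roughly like this at each exit point of the
      resync/reshape loops (sketch):

      	if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
      		break;	/* finish normally; md_check_recovery() reaps the thread */
      	/* previously: kthread_should_stop() was tested here */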
      Signed-off-by: NeilBrown <neilb@suse.de>
      c91abf5a
    • raid5: Retry R5_ReadNoMerge flag when hit a read error. · edfa1f65
      Committed by Bian Yu
      Because of block-layer merging, one failing bio causes the other bios
      belonging to the same request to fail as well, so raid5_end_read_request
      will record all of these bios as bad blocks.
      If the request is retried with the R5_ReadNoMerge flag to avoid merging,
      the bad-block log records only the sectors that are actually bad.
      
      test:
      hdparm --yes-i-know-what-i-am-doing --make-bad-sector 300000 /dev/sdb
      mdadm -C /dev/md0 -l5 -n3 /dev/sd[bcd] --assume-clean
      mdadm /dev/md0 -f /dev/sdd
      mdadm /dev/md0 -r /dev/sdd
      mdadm --zero-superblock /dev/sdd
      mdadm /dev/md0 -a /dev/sdd
      
      1. Without this patch:
      cat /sys/block/md0/md/rd*/bad_blocks
      299776 256
      299776 256
      
      2. With this patch:
      cat /sys/block/md0/md/rd*/bad_blocks
      300000 8
      300000 8
      Signed-off-by: Bian Yu <bianyu@kedacom.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      edfa1f65
    • raid5: relieve lock contention in get_active_stripe() · 4bda556a
      Committed by Shaohua Li
      Track the number of empty inactive lists, so md_raid5_congested() can use
      it to make its decision.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      4bda556a
  10. 15 Nov 2013, 1 commit
  11. 14 Nov 2013, 3 commits
    • raid5: relieve lock contention in get_active_stripe() · 566c09c5
      Committed by Shaohua Li
      get_active_stripe() is the last place we have lock contention. It has two
      paths: one where the stripe isn't found and a new stripe is allocated,
      and one where the stripe is found.

      The first path basically calls __find_stripe and init_stripe. It accesses
      conf->generation, conf->previous_raid_disks, conf->raid_disks,
      conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
      conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list.
      Apart from stripe_hashtbl and inactive_list, the other fields change very
      rarely.

      With this patch, we split inactive_list and add new hash locks. Each free
      stripe belongs to a specific inactive list, determined by the stripe's
      lock_hash. Note that even a stripe without a sector assigned has a
      lock_hash assigned. A stripe's inactive list is protected by a hash lock,
      which is determined by its lock_hash too. The lock_hash is derived from
      the current stripe_hashtbl hash, which guarantees that any stripe_hashtbl
      list will be assigned to a specific lock_hash, so we can use the new hash
      lock to protect the stripe_hashtbl list too. The goal of the new hash
      locks is that the first path of get_active_stripe() only needs the new
      locks. Since we have several hash locks, lock contention is relieved
      significantly.

      The first path of get_active_stripe() accesses other fields as well;
      since they change rarely, changing them now requires taking
      conf->device_lock and all hash locks. For a slow path, this isn't a
      problem.

      If we need to take both device_lock and a hash lock, we always take the
      hash lock first. The tricky part is release_stripe and friends, where we
      need to take device_lock first. Neil's suggestion is to put inactive
      stripes on a temporary list and re-add them to inactive_list after
      device_lock is released. In this way, we add stripes to the temporary
      list with device_lock held and remove stripes from the list with the hash
      lock held. We don't allow concurrent access to the temporary list, which
      means we need to allocate a temporary list for every participant in
      release_stripe.

      One downside is that free stripes are maintained in their own inactive
      list and can't move between the lists. By default, we have 256 stripes in
      total and 8 lists, so each list has 32 stripes. It's possible one list
      has a free stripe while another hasn't. The chance should be rare because
      stripe allocation is evenly distributed. And we can always allocate more
      stripes for the cache; several megabytes of memory isn't a big deal.

      This completely removes the lock contention from the first path of
      get_active_stripe(). It slows down the second path a little because we
      now need to take two locks, but since the hash lock isn't contended, the
      overhead should be quite small (several atomic instructions). The second
      path of get_active_stripe() (basically sequential writes, or random
      writes with big request sizes) still has lock contention.
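
      Illustratively (the real constants and helpers live in raid5.c/raid5.h),
      each stripe maps to one of a small number of hash locks derived from its
      stripe_hashtbl hash:

      	#define NR_STRIPE_HASH_LOCKS	8

      	hash = stripe_hash_locks_hash(sector);	/* derived from stripe_hash() */
      	spin_lock_irq(conf->hash_locks + hash);	/* guards one inactive list */
      	sh = get_free_stripe(conf, hash);
      	spin_unlock_irq(conf->hash_locks + hash);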
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      566c09c5
    • md/raid5.c: add proper locking to error path of raid5_start_reshape. · ba8805b9
      Committed by NeilBrown
      If raid5_start_reshape errors out, we need to reset all the fields
      that were updated (not just some), and we need to use the seq_counter
      to ensure make_request() doesn't use an inconsistent state.
      Signed-off-by: NeilBrown <neilb@suse.de>
      ba8805b9
    • raid5: Use slow_path to release stripe when mddev->thread is null · ad4068de
      Committed by majianpeng
      When release_stripe() is called from grow_one_stripe(), mddev->thread
      is null, so the wakeup of that thread to release the stripe is skipped.
      In this case, use the slow path to release the stripe.
      
      Bug was introduced in 3.12
      
      Cc: stable@vger.kernel.org (3.12+)
      Fixes: 773ca82f
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      ad4068de
  12. 24 Oct 2013, 2 commits
  13. 02 Sep 2013, 1 commit
    • raid5: only wakeup necessary threads · bfc90cb0
      Committed by Shaohua Li
      If there are not enough stripes to handle, we'd better not always
      queue all available work_structs. If one worker can only handle a few
      or even no stripes, it will impact request merging and create lock
      contention.

      With this patch, the number of running work_structs depends on the
      number of pending stripes. Note: some of the statistics used in the
      patch are accessed without locking protection. This shouldn't matter;
      we just try our best to avoid queuing unnecessary work_structs.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      bfc90cb0
  14. 28 Aug 2013, 6 commits
    • md/raid5: flush out all pending requests before proceeding with reshape. · 4d77e3ba
      Committed by NeilBrown
      Some requests - particularly 'discard' and 'read' are handled
      differently depending on whether a reshape is active or not.
      
      It is harmless to assume reshape is active if it isn't but wrong
      to act as though reshape is not active when it is.
      
      So when we start reshape - after making clear to all requests that
      reshape has started - use mddev_suspend/mddev_resume to flush out all
      requests.  This will ensure that no requests will be assuming the
      absence of reshape once it really starts.
      Signed-off-by: NeilBrown <neilb@suse.de>
      4d77e3ba
    • md/raid5: use seqcount to protect access to shape in make_request. · c46501b2
      Committed by NeilBrown
      make_request() accesses various shape parameters (raid_disks, chunk_size
      etc.) which might be changed by raid5_start_reshape().

      If the latter is called at an awkward time, the wrong stripe_head might
      be used.
      
      So introduce a 'seqcount' and after finding a stripe_head make sure
      there is no reason to expect that we got the wrong one.
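
      The retry pattern in make_request() then looks roughly like this
      (a sketch, assuming the new seqcount is conf->gen_lock):

      	do {
      		seq = read_seqcount_begin(&conf->gen_lock);
      		/* map the bio's sector using raid_disks, chunk_sectors, ... */
      	} while (read_seqcount_retry(&conf->gen_lock, seq));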
      Signed-off-by: NeilBrown <neilb@suse.de>
      c46501b2
    • raid5: sysfs entry to control worker thread number · b721420e
      Committed by Shaohua Li
      Add a sysfs entry to control the number of running workqueue threads. If
      group_thread_cnt is set to 0, workqueue offloading of stripe handling is
      disabled.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      b721420e
    • raid5: offload stripe handle to workqueue · 851c30c9
      Committed by Shaohua Li
      This is another attempt to create multiple threads to handle raid5
      stripes. This time I use a workqueue.

      raid5 handles requests (especially writes) in stripe units. A stripe is
      one page long per disk, page aligned, and spans all disks. Writing to any
      disk sector, raid5 runs a state machine for the corresponding stripe,
      which includes reading some disks of the stripe, calculating parity, and
      writing some disks of the stripe. The state machine currently runs in the
      raid5d thread. Since there is only one thread, it doesn't scale well for
      high speed storage. An obvious solution is multi-threading.

      To get better performance, we have some requirements:
      a. locality. A stripe corresponding to a request submitted from one cpu
      is better handled by a thread on the local cpu or local node. The local
      cpu is preferred, but it can sometimes be a bottleneck, for example when
      parity calculation is too heavy. Running on the local node has wide
      adaptability.
      b. configurability. Different raid5 array setups might need different
      configurations, especially the thread number. More threads don't always
      mean better performance because of lock contention.

      My original implementation created some kernel threads. There were
      interfaces to control which cpu's stripes each thread should handle, and
      userspace could set the affinity of the threads. This provides the
      biggest flexibility and configurability, but it's hard to use, and
      apparently a new thread-pool implementation is disfavored.

      The recent workqueue improvements are quite promising. An unbound
      workqueue will be bound to a NUMA node. If WQ_SYSFS is set on the
      workqueue, there are sysfs options for affinity settings; for example, we
      can include only one HT sibling in the affinity. Work items are
      non-reentrant by default, and we can control the number of running
      threads by limiting the number of dispatched work_structs.

      In this patch, I created several stripe worker groups. A group
      corresponds to a NUMA node. Stripes from the cpus of one node are added
      to that group's list, and workqueue threads of one node only handle
      stripes from that node's worker group. In this way, stripe handling has
      NUMA node locality. And, as said above, we can control the thread number
      by limiting the number of dispatched work_structs.

      The work_struct callback function handles several stripes in one run. A
      typical workqueue usage is to run one unit in each work_struct; in the
      raid5 case, the unit would be a stripe. But we can't do that:
      a. Though handling a stripe doesn't need a lock, because of reference
      accounting and the stripe not being on any list, queuing a work_struct
      for each stripe would make the workqueue lock very heavily contended.
      b. blk_start_plug()/blk_finish_plug() should surround stripe handling, as
      we might dispatch requests. If each work_struct only handled one stripe,
      such block plugging would be meaningless.

      This implementation can't do very fine-grained configuration, but NUMA
      binding is the most popular usage model and should be enough for most
      workloads.

      Note: since we have only one stripe queue, switching to multi-threading
      might decrease the request sizes dispatched down to the lower layers. The
      impact depends on the thread number, the raid configuration and the
      workload, so multi-threaded raid5 might not be appropriate for all
      setups.
      
      Changes V1 -> V2:
      1. remove WQ_NON_REENTRANT
      2. disabling multi-threading by default
      3. Add more descriptions in changelog
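
      In outline (a sketch with approximate names), a worker group per NUMA
      node holds its own list of stripes, and releasing a stripe that needs
      handling kicks a work_struct on that node:

      	struct r5worker_group {
      		struct list_head handle_list;	/* stripes queued for this node */
      		struct r5worker *workers;	/* up to group_thread_cnt work items */
      		int stripes_cnt;
      	};

      	/* when a stripe needs handling */
      	group = &conf->worker_groups[cpu_to_group(sh->cpu)];
      	list_add_tail(&sh->lru, &group->handle_list);
      	queue_work_on(sh->cpu, raid5_wq, &group->workers[0].work);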
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      851c30c9
    • raid5: fix stripe release order · d265d9dc
      Committed by Shaohua Li
      The patch "raid5: make release_stripe lockless" changes the order in
      which stripes are released. Originally I thought the block layer could
      take care of request merging, but it appears some requests are still not
      merged. It's easy to fix the order.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      d265d9dc
    • raid5: make release_stripe lockless · 773ca82f
      Committed by Shaohua Li
      release_stripe still has big lock contention. We now just add the stripe
      to an llist without taking device_lock, and let the raid5d thread do the
      real stripe release, which must hold device_lock anyway. In this way,
      release_stripe doesn't hold any locks.

      The side effect is that the order in which stripes are released changes.
      But that sounds like no big deal: stripes are never handled in order
      anyway, and I thought the block layer could already do good request
      merging, which means the order isn't that important.

      I kept the unplug release batch, which is unnecessary with this patch
      from a lock-contention point of view; in fact, if we deleted it, the
      stripe_head release_list and lru could share storage. But the unplug
      release batch also helps request merging. We could probably delay waking
      raid5d until unplug, but I'm still worried about the case where raid5d is
      already running.
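
      The fast path then boils down to roughly this (a sketch; field names
      approximate):

      	/* release_stripe(): lock-free push; wake raid5d only for the first entry */
      	if (llist_add(&sh->release_list, &conf->released_stripes))
      		md_wakeup_thread(conf->mddev->thread);

      	/* raid5d later drains the list and does the real release under device_lock */
      	head = llist_del_all(&conf->released_stripes);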
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      773ca82f
  15. 25 Jul 2013, 1 commit
    • md/raid5: fix interaction of 'replace' and 'recovery'. · f94c0b66
      Committed by NeilBrown
      If a device in a RAID4/5/6 is being replaced while another is being
      recovered, then the writes to the replacement device currently don't
      happen, resulting in corruption when the replacement completes and the
      new drive takes over.
      
      This is because the replacement writes are only triggered when
      's.replacing' is set and not when the similar 's.sync' is set (which
      is the case during resync and recovery - it means all devices need to
      be read).
      
      So schedule those writes when s.replacing is set as well.
      
      In this case we cannot use "STRIPE_INSYNC" to record that the
      replacement has happened as that is needed for recording that any
      parity calculation is complete.  So introduce STRIPE_REPLACED to
      record if the replacement has happened.
      
      For safety we should also check that STRIPE_COMPUTE_RUN is not set.
      This has a similar effect to the "s.locked == 0" test.  The latter
      ensures that no IO has been flagged but not started.  The former
      checks whether any parity calculation has been flagged but not started.
      We must wait for both of these to complete before triggering the
      'replace'.
      
      Add a similar test to the subsequent check for "are we finished yet".
      This possibly isn't needed (it is subsumed in the STRIPE_INSYNC test),
      but it makes it more obvious that the REPLACE will happen before we
      think we are finished.
      
      Finally if a NeedReplace device is not UPTODATE then that is an
      error.  We really must trigger a warning.
      
      This bug was introduced in commit 9a3e1101
      (md/raid5:  detect and handle replacements during recovery.)
      which introduced replacement for raid5.
      That was in 3.3-rc3, so any stable kernel since then would benefit
      from this fix.
      
      Cc: stable@vger.kernel.org (3.3+)
      Reported-by: qindehua <13691222965@163.com>
      Tested-by: qindehua <qindehua@163.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      f94c0b66
  16. 04 Jul 2013, 1 commit
    • md/raid5: allow 5-device RAID6 to be reshaped to 4-device. · fdcfbbb6
      Committed by NeilBrown
      There is a bug in 'check_reshape' for raid5.c.  It checks that the new
      minimum number of devices is large enough (which is good), but it does
      so even after the reshape has started (bad).
      
      This is bad because
       - the calculation is now wrong as mddev->raid_disks has changed
         already, and
       - it is pointless because it is now too late to stop.
      
      So only perform that test when reshape has not been committed to.
      Signed-off-by: NeilBrown <neilb@suse.de>
      fdcfbbb6
  17. 14 Jun 2013, 1 commit
  18. 13 Jun 2013, 1 commit
    • md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place · 5026d7a9
      Committed by H. Peter Anvin
      There are cases where the kernel will believe that the WRITE SAME
      command is supported by a block device which does not, in fact,
      support WRITE SAME.  This currently happens for SATA drives behind a
      SAS controller, but there are probably a hundred other ways that can
      happen, including drive firmware bugs.
      
      After receiving an error for WRITE SAME the block layer will retry the
      request as a plain write of zeroes, but mdraid will consider the
      failure as fatal and consider the drive failed.  This has the effect
      that all the mirrors containing a specific set of data are each
      offlined in very rapid succession resulting in data loss.
      
      However, just bouncing the request back up to the block layer isn't
      ideal either, because the whole initial request-retry sequence should
      be inside the write bitmap fence, which probably means that md needs
      to do its own conversion of WRITE SAME to write zero.
      
      Until the failure scenario has been sorted out, disable WRITE SAME for
      raid1, raid5, and raid10.
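
      The opt-out itself is a one-liner per personality (sketch):

      	blk_queue_max_write_same_sectors(mddev->queue, 0);	/* advertise no WRITE SAME support */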
      
      [neilb: added raid5]
      
      This patch is appropriate for any -stable since 3.7 when write_same
      support was added.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      5026d7a9
  19. 30 May 2013, 1 commit
  20. 24 Apr 2013, 2 commits
  21. 19 Apr 2013, 1 commit
  22. 24 Mar 2013, 2 commits
    • raid5: use bio_reset() · 2f6db2a7
      Committed by Kent Overstreet
      Had to shuffle the code around a bit (where bi_rw and bi_end_io were
      set), but there shouldn't really be anything tricky here.
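
      That is, bio_reset() clears almost everything in the bio, so those fields
      have to be (re)assigned afterwards (sketch from the read path):

      	bio_reset(bi);
      	bi->bi_bdev    = rdev->bdev;
      	bi->bi_rw      = rw;
      	bi->bi_end_io  = raid5_end_read_request;	/* must be set after the reset */
      	bi->bi_private = sh;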
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
      2f6db2a7
    • block: Use bio_sectors() more consistently · aa8b57aa
      Committed by Kent Overstreet
      A bunch of places in the code weren't using it where they could; this
      will reduce the size of the patch that puts bi_sector/bi_size/bi_idx
      into a struct bvec_iter.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: "Ed L. Cashin" <ecashin@coraid.com>
      CC: Nick Piggin <npiggin@kernel.dk>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Jim Paris <jim@jtan.com>
      CC: Geoff Levand <geoff@infradead.org>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Ed Cashin <ecashin@coraid.com>
      aa8b57aa