1. 08 June 2016 (2 commits)
  2. 10 May 2016 (2 commits)
    • md: set MD_CHANGE_PENDING in an atomic region · 85ad1d13
      Guoqing Jiang authored
      Some code waits for a metadata update by:
      
      1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
      2. setting MD_CHANGE_PENDING and waking the management thread
      3. waiting for MD_CHANGE_PENDING to be cleared
      
      If the first two are done without locking, the code in md_update_sb()
      which checks if it needs to repeat might test if an update is needed
      before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
      in the wait returning early.
      
      So make sure all places that set MD_CHANGE_PENDING do so atomically;
      bit_clear_unless() (suggested by Neil) is introduced for the purpose.
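
      A minimal user-space model of the bit_clear_unless() idea (hedged
      sketch: the in-kernel macro differs in detail, this only illustrates
      the compare-and-swap pattern that makes the clear atomic with respect
      to the test):

          #include <stdatomic.h>
          #include <stdbool.h>

          /* Atomically clear the bits in 'clear' unless any bit in 'test'
           * is set.  Returns true if the clear was performed. */
          static bool bit_clear_unless(_Atomic unsigned long *ptr,
                                       unsigned long clear, unsigned long test)
          {
              unsigned long old = atomic_load(ptr);

              do {
                  if (old & test)
                      return false;  /* blocked: a 'test' bit is set */
              } while (!atomic_compare_exchange_weak(ptr, &old, old & ~clear));

              return true;  /* 'clear' bits removed atomically */
          }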
      
      Cc: Martin Kepplinger <martink@posteo.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: <linux-kernel@vger.kernel.org>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: raid10: add prerequisite to run underneath dm-raid · 859644f0
      Heinz Mauelshagen authored
      In case md runs underneath the dm-raid target, the mddev does not have
      a request queue or gendisk, so avoid accessing them.
      
      This patch adds two missing conditionals to the raid10 personality.
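
      The guards follow the usual md pattern (illustrative fragment, not the
      verbatim diff; 'chunk_size' here is a stand-in for whatever the call
      site computes):

          /* md owns a request queue only when it is not run
           * underneath dm-raid, so test before touching it */
          if (mddev->queue) {
              blk_queue_io_min(mddev->queue, chunk_size);
              blk_queue_max_discard_sectors(mddev->queue,
                                            mddev->chunk_sectors);
          }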
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  3. 18 March 2016 (1 commit)
  4. 21 January 2016 (1 commit)
  5. 14 January 2016 (1 commit)
    • md/raid: only permit hot-add of compatible integrity profiles · 1501efad
      Dan Williams authored
      It is not safe for an integrity profile to be changed while i/o is
      in-flight in the queue.  Prevent adding new disks or otherwise onlining
      spares to an array if the device has an incompatible integrity profile.
      
      The original change to the blk_integrity_unregister implementation in
      md, commit c7bfced9 ("md: suspend i/o during runtime
      blk_integrity_unregister"), introduced an immediate hang regression.
      
      This policy of disallowing changes to the integrity profile once one has
      been established is shared with DM.
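
      A hedged sketch of the check (the helper name and error code are
      illustrative; the real patch reworks md_integrity_add_rdev()):

          static int integrity_profile_compatible(struct mddev *mddev,
                                                  struct md_rdev *rdev)
          {
              char b[BDEVNAME_SIZE];

              /* no profile registered on the array: nothing to conflict */
              if (!blk_get_integrity(mddev->gendisk))
                  return 0;

              /* refuse the new device if its profile differs */
              if (blk_integrity_compare(mddev->gendisk,
                                        rdev->bdev->bd_disk) != 0) {
                  pr_err("%s: incompatible integrity profile for %s\n",
                         mdname(mddev), bdevname(rdev->bdev, b));
                  return -ENXIO;
              }
              return 0;
          }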
      
      Here is an abbreviated log from a test run that:
      1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [   59.076127]
      2/ Tries to add an integrity-disabled device (pmem1m) [   90.489209]
      3/ Retries with an integrity-enabled device (pmem1s) [  205.671277]
      
      [   59.076127] md/raid1:md0: active with 1 out of 2 mirrors
      [   59.078302] md: data integrity enabled on md0
      [..]
      [   90.489209] md0: incompatible integrity profile for pmem1m
      [..]
      [  205.671277] md: super_written gets error=-5
      [  205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
      [  205.677386] md/raid1:md0: Operation continuing on 1 devices.
      [  205.683037] RAID1 conf printout:
      [  205.684699]  --- wd:1 rd:2
      [  205.685972]  disk 0, wo:0, o:1, dev:pmem0s
      [  205.687562]  disk 1, wo:1, o:1, dev:pmem1s
      [  205.691717] md: recovery of RAID array md0
      
      Fixes: c7bfced9 ("md: suspend i/o during runtime blk_integrity_unregister")
      Cc: <stable@vger.kernel.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reported-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
  6. 18 December 2015 (1 commit)
    • md/raid10: fix data corruption and crash during resync · cc578588
      Artur Paszkiewicz authored
      The commit c31df25f ("md/raid10: make sync_request_write() call
      bio_copy_data()") replaced manual data copying with bio_copy_data() but
      it doesn't work as intended. The source bio (fbio) is already processed,
      so its bvec_iter has bi_size == 0 and bi_idx == bi_vcnt.  Because of
      this, bio_copy_data() either does not copy anything, or worse, copies
      data from the ->bi_next bio if it is set.  This causes wrong data to be
      written to drives during resync and sometimes lockups/crashes in
      bio_copy_data():
      
      [  517.338478] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [md126_raid10:3319]
      [  517.347324] Modules linked in: raid10 xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul cryptd shpchp pcspkr ipmi_si ipmi_msghandler tpm_crb acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom sd_mod e1000e ax88179_178a usbnet mii ahci ata_generic crc32c_intel libahci ptp pata_acpi libata pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
      [  517.440555] CPU: 0 PID: 3319 Comm: md126_raid10 Not tainted 4.3.0-rc6+ #1
      [  517.448384] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDCRB1.86B.0055.D14.1509221924 09/22/2015
      [  517.459768] task: ffff880153773980 ti: ffff880150df8000 task.ti: ffff880150df8000
      [  517.468529] RIP: 0010:[<ffffffff812e1888>]  [<ffffffff812e1888>] bio_copy_data+0xc8/0x3c0
      [  517.478164] RSP: 0018:ffff880150dfbc98  EFLAGS: 00000246
      [  517.484341] RAX: ffff880169356688 RBX: 0000000000001000 RCX: 0000000000000000
      [  517.492558] RDX: 0000000000000000 RSI: ffffea0001ac2980 RDI: ffffea0000d835c0
      [  517.500773] RBP: ffff880150dfbd08 R08: 0000000000000001 R09: ffff880153773980
      [  517.508987] R10: ffff880169356600 R11: 0000000000001000 R12: 0000000000010000
      [  517.517199] R13: 000000000000e000 R14: 0000000000000000 R15: 0000000000001000
      [  517.525412] FS:  0000000000000000(0000) GS:ffff880174a00000(0000) knlGS:0000000000000000
      [  517.534844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  517.541507] CR2: 00007f8a044d5fed CR3: 0000000169504000 CR4: 00000000001406f0
      [  517.549722] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  517.557929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  517.566144] Stack:
      [  517.568626]  ffff880174a16bc0 ffff880153773980 ffff880169356600 0000000000000000
      [  517.577659]  0000000000000001 0000000000000001 ffff880153773980 ffff88016a61a800
      [  517.586715]  ffff880150dfbcf8 0000000000000001 ffff88016dd209e0 0000000000001000
      [  517.595773] Call Trace:
      [  517.598747]  [<ffffffffa043ef95>] raid10d+0xfc5/0x1690 [raid10]
      [  517.605610]  [<ffffffff816697ae>] ? __schedule+0x29e/0x8e2
      [  517.611987]  [<ffffffff814ff206>] md_thread+0x106/0x140
      [  517.618072]  [<ffffffff810c1d80>] ? wait_woken+0x80/0x80
      [  517.624252]  [<ffffffff814ff100>] ? super_1_load+0x520/0x520
      [  517.630817]  [<ffffffff8109ef89>] kthread+0xc9/0xe0
      [  517.636506]  [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
      [  517.643653]  [<ffffffff8166d99f>] ret_from_fork+0x3f/0x70
      [  517.649929]  [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
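
      The fix essentially rewinds the source bio's iterator before the copy
      (condensed; see the commit for the exact context in
      sync_request_write()):

          /* fbio has been fully processed, so its iterator is spent;
           * reset it to cover the whole resync payload again */
          fbio->bi_iter.bi_size = r10_bio->sectors << 9;  /* sectors -> bytes */
          fbio->bi_iter.bi_idx = 0;                       /* restart at bvec 0 */
          bio_copy_data(tbio, fbio);                      /* now copies real data */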
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Reviewed-by: Shaohua Li <shli@kernel.org>
      Cc: stable@vger.kernel.org (v4.2+)
      Fixes: c31df25f ("md/raid10: make sync_request_write() call bio_copy_data()")
      Signed-off-by: NeilBrown <neilb@suse.com>
  7. 24 October 2015 (2 commits)
    • md/raid10: fix the 'new' raid10 layout to work correctly. · 8bce6d35
      NeilBrown authored
      In Linux 3.9 we introduced a new 'far' layout for RAID10 which was
      supposed to rotate the replicas differently and so provide better
      resilience.  In particular it could survive more combinations of 2
      drive failures.
      
      Unfortunately, due to a coding error, this sometimes did what was
      wanted, sometimes improved resilience less than we hoped, and
      sometimes - in very unlikely circumstances - put multiple replicas
      on the same device, so the redundancy was harmed.
      
      No public user-space tool has created arrays using this layout so it
      is very unlikely that zero-redundancy arrays actually exist.  Probably
      no arrays using any form of the new layout exist.  But we cannot be
      certain.
      
      So use another bit in the 'layout' number and introduce a bug-fixed
      version of the layout.
      Also when assembling an array, if it has a zero-redundancy layout,
      give a warning.
      Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md/raid10: don't clear bitmap bit when bad-block-list write fails. · c340702c
      NeilBrown authored
      When a write fails and a bad-block-list is present, we can
      update the bad-block-list instead of writing the data.  If
      this succeeds then it is OK to clear the relevant bitmap-bit as
      no further 'sync' of the block is needed.
      
      However if writing the bad-block-list fails then we need to
      treat the write as failed and particularly must not clear
      the bitmap bit.  Otherwise the device can be re-added (after
      any hardware connection issues are resolved) and because the
      relevant bit in the bitmap is clear, that block will not be
      resynced.  This leads to data corruption.
      
      We already delay the final bio_endio() on the write until
      the bad-block-list is written so that when the write
      returns: either that data is safe, the bad-block record is
      safe, or the fact that the device is faulty is safe.
      However we *don't* delay the clearing of the bitmap, so the
      bitmap bit can be recorded as cleared before we know if the
      bad-block-list was written safely.
      
      So: delay that until the write really is safe.
      i.e. move the call to close_write() until just before
      calling bio_endio(), and recheck the 'is array degraded'
      status before making that call.
      
      This bug goes back to v3.1 when bad-block-lists were
      introduced, though it only affects arrays created with
      mdadm-3.3 or later as only those have bad-block lists.
      
      Backports will require at least
      Commit: 95af587e ("md/raid10: ensure device failure recorded before write request returns.")
      as well.  I'll send that to 'stable' separately.
      
      Note that of the two tests of R10BIO_WriteError that this
      patch adds, the first is certain to fail and the second is
      certain to succeed.  However doing it this way makes the
      patch more obviously correct.  I will tidy the code up in a
      future merge window.
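
      Condensed, the deferred completion in raid10d ends up looking like
      this (hedged paraphrase of the patched path, not the literal diff):

          /* only reached once the bad-block-list update is safe */
          if (mddev->degraded)
              set_bit(R10BIO_Degraded, &r10_bio->state);  /* recheck */
          if (test_bit(R10BIO_WriteError, &r10_bio->state))
              close_write(r10_bio);      /* only now touch the bitmap */
          raid_end_bio_io(r10_bio);      /* which ends the original bio */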
      Reported-by: Nate Dailey <nate.dailey@stratus.com>
      Fixes: bd870a16 ("md/raid10: Handle write errors by updating badblock log.")
      Signed-off-by: NeilBrown <neilb@suse.com>
  8. 22 October 2015 (1 commit)
  9. 21 October 2015 (1 commit)
  10. 12 October 2015 (1 commit)
    • md-cluster: Use a small window for resync · c40f341f
      Goldwyn Rodrigues authored
      Suspending the entire device for resync could take too long. Resync
      in small chunks.
      
      The cluster's resync window (32M) is maintained in r1conf as
      cluster_sync_low and cluster_sync_high and processed in
      raid1's sync_request(). If the current resync is outside the cluster
      resync window:
      
      1. Set the cluster_sync_low to curr_resync_completed.
      2. Check if the sync will fit in the new window, if not issue a
         wait_barrier() and set cluster_sync_low to sector_nr.
      3. Set cluster_sync_high to cluster_sync_low + resync_window.
      4. Send a message to all nodes so they may add it in their suspension
         list.
      
      bitmap_cond_end_sync is modified to allow forcing a sync in order
      to bring curr_resync_completed up to date with the sector passed.
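
      Condensed sketch of the window handling in sync_request() (field and
      macro names follow the commit; the wait_barrier() step and other
      control flow are simplified away):

          if (mddev_is_clustered(mddev) &&
              conf->cluster_sync_high < sector_nr + nr_sectors) {
              /* 1. advance the low edge to what is already done */
              conf->cluster_sync_low = mddev->curr_resync_completed;
              /* 3. slide the high edge one window beyond the low edge */
              conf->cluster_sync_high = conf->cluster_sync_low +
                                        CLUSTER_RESYNC_WINDOW_SECTORS;
              /* 4. tell the other nodes to suspend this range */
              md_cluster_ops->resync_info_update(mddev,
                                                 conf->cluster_sync_low,
                                                 conf->cluster_sync_high);
          }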
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  11. 09 October 2015 (1 commit)
    • crash in md-raid1 and md-raid10 due to incorrect list manipulation · a452744b
      Mikulas Patocka authored
      The commit 55ce74d4 (md/raid1: ensure
      device failure recorded before write request returns) is causing a crash in
      the LVM2 testsuite test shell/lvchange-raid.sh. For me the crash is 100%
      reproducible.
      
      The reason for the crash is that the newly added code in raid1d moves the
      list from conf->bio_end_io_list to tmp, then tests if tmp is non-empty and
      then incorrectly pops the bio from conf->bio_end_io_list (which is empty
      because the list was already moved).
      
      RAID10 has a similar bug.
      
      Kernel Fault: Code=15 regs=000000006ccb8640 (Addr=0000000100000000)
      CPU: 3 PID: 1930 Comm: mdX_raid1 Not tainted 4.2.0-rc5-bisect+ #35
      task: 000000006cc1f258 ti: 000000006ccb8000 task.ti: 000000006ccb8000
      
           YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
      PSW: 00001000000001001111111000001111 Not tainted
      r00-03  000000ff0804fe0f 000000001059d000 000000001059f818 000000007f16be38
      r04-07  000000001059d000 000000007f16be08 0000000000200200 0000000000000001
      r08-11  000000006ccb8260 000000007b7934d0 0000000000000001 0000000000000000
      r12-15  000000004056f320 0000000000000000 0000000000013dd0 0000000000000000
      r16-19  00000000f0d00ae0 0000000000000000 0000000000000000 0000000000000001
      r20-23  000000000800000f 0000000042200390 0000000000000000 0000000000000000
      r24-27  0000000000000001 000000000800000f 000000007f16be08 000000001059d000
      r28-31  0000000100000000 000000006ccb8560 000000006ccb8640 0000000000000000
      sr00-03  0000000000249800 0000000000000000 0000000000000000 0000000000249800
      sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      
      IASQ: 0000000000000000 0000000000000000 IAOQ: 000000001059f61c 000000001059f620
       IIR: 0f8010c6    ISR: 0000000000000000  IOR: 0000000100000000
       CPU:        3   CR30: 000000006ccb8000 CR31: 0000000000000000
       ORIG_R28: 000000001059d000
       IAOQ[0]: call_bio_endio+0x34/0x1a8 [raid1]
       IAOQ[1]: call_bio_endio+0x38/0x1a8 [raid1]
       RP(r2): raid_end_bio_io+0x88/0x168 [raid1]
      Backtrace:
       [<000000001059f818>] raid_end_bio_io+0x88/0x168 [raid1]
       [<00000000105a4f64>] raid1d+0x144/0x1640 [raid1]
       [<000000004017fd5c>] kthread+0x144/0x160
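
      The corrected pattern pops entries from the spliced-off local list
      rather than from the now-empty original (condensed from the raid1 fix;
      the raid10 case is analogous):

          struct r1bio *r1_bio;
          unsigned long flags;
          LIST_HEAD(tmp);

          spin_lock_irqsave(&conf->device_lock, flags);
          list_splice_init(&conf->bio_end_io_list, &tmp);
          spin_unlock_irqrestore(&conf->device_lock, flags);
          while (!list_empty(&tmp)) {
              r1_bio = list_first_entry(&tmp, struct r1bio, retry_list);
              list_del(&r1_bio->retry_list);
              raid_end_bio_io(r1_bio);  /* complete the parked bio */
          }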
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Fixes: 55ce74d4 ("md/raid1: ensure device failure recorded before write request returns.")
      Fixes: 95af587e ("md/raid10: ensure device failure recorded before write request returns.")
      Signed-off-by: NeilBrown <neilb@suse.com>
  12. 02 October 2015 (1 commit)
  13. 01 September 2015 (2 commits)
    • md/raid10: ensure device failure recorded before write request returns. · 95af587e
      NeilBrown authored
      When a write to one of the legs of a RAID10 fails, the failure is
      recorded in the metadata of the other legs so that after a restart
      the data on the failed drive won't be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that experienced an error on a new queue which
         is only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
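
      Hedged sketch of the interlock (handle_deferred_bios() is an
      illustrative name, not a function from the patch):

          /* on write error: request a metadata update ... */
          set_bit(MD_CHANGE_PENDING, &mddev->flags);
          md_wakeup_thread(mddev->thread);
          /* ... and park the errored request instead of completing it */
          list_add(&r10_bio->retry_list, &conf->bio_end_io_list);

          /* later, in raid10d, once the superblock write has finished: */
          if (!test_bit(MD_CHANGE_PENDING, &mddev->flags))
              handle_deferred_bios(conf);  /* raid_end_bio_io() on each */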
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md/raid10: fix a few typos in comments · 02ec5026
      NeilBrown authored
      Signed-off-by: NeilBrown <neilb@suse.com>
  14. 14 August 2015 (1 commit)
    • block: kill merge_bvec_fn() completely · 8ae12666
      Kent Overstreet authored
      As generic_make_request() is now able to handle arbitrarily sized bios,
      it's no longer necessary for each individual block driver to define its
      own ->merge_bvec_fn() callback. Remove every invocation completely.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: ceph-devel@vger.kernel.org
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
       dm-era-target, and resolve merge conflicts]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  15. 29 July 2015 (2 commits)
    • block: manipulate bio->bi_flags through helpers · b7c44ed9
      Jens Axboe authored
      Some places use helpers now, others don't. We only have the 'is set'
      helper, add helpers for setting and clearing flags too.
      
      It was a bit of a mess of atomic vs non-atomic access. With
      BIO_UPTODATE gone, we don't have any risk of concurrent access to the
      flags. So relax the restriction and don't make any of them atomic. The
      flags that do have serialization issues (reffed and chained), we
      already handle those separately.
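
      The helpers are essentially thin wrappers (paraphrased from the
      patch):

          static inline bool bio_flagged(struct bio *bio, unsigned int bit)
          {
              return (bio->bi_flags & (1U << bit)) != 0;
          }

          static inline void bio_set_flag(struct bio *bio, unsigned int bit)
          {
              bio->bi_flags |= (1U << bit);
          }

          static inline void bio_clear_flag(struct bio *bio, unsigned int bit)
          {
              bio->bi_flags &= ~(1U << bit);
          }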
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig authored
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not being persistent
      when bios are queued up, and of not being passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
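
      After the change a completion handler reads the error straight from
      the bio (illustrative handler; the endio prototype loses its separate
      error argument):

          static void my_end_io(struct bio *bio)
          {
              if (bio->bi_error)  /* any negative errno, not just -EIO */
                  pr_err("I/O failed: %d\n", bio->bi_error);
              bio_put(bio);
          }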
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  16. 22 July 2015 (1 commit)
    • md/raid10: always set reshape_safe when initializing reshape_position. · 299b0685
      NeilBrown authored
      'reshape_position' tracks where in the reshape we have reached.
      'reshape_safe' tracks where in the reshape we have safely recorded
      in the metadata.
      
      These are compared to determine when to update the metadata.
      So it is important that reshape_safe is initialised properly.
      Currently it isn't.  When starting a reshape from the beginning
      it usually has the correct value by luck.  But when reducing the
      number of devices in a RAID10, it has the wrong value and this leads
      to the metadata not being updated correctly.
      This can lead to corruption if the reshape is not allowed to complete.
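
      In essence the fix pairs every (re)initialisation of the reshape
      position with one of reshape_safe (hedged paraphrase):

          conf->reshape_progress = mddev->reshape_position;
          conf->reshape_safe = conf->reshape_progress;  /* was missing */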
      
      This patch is suitable for any -stable kernel which supports RAID10
      reshape, which is 3.5 and later.
      
      Fixes: 3ea7daa5 ("md/raid10: add reshape support")
      Cc: stable@vger.kernel.org (v3.5+ please wait for -final to be out for 2 weeks)
      Signed-off-by: NeilBrown <neilb@suse.com>
  17. 17 June 2015 (1 commit)
  18. 12 June 2015 (1 commit)
    • md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync · ea358cd0
      NeilBrown authored
      MD_RECOVERY_DONE is normally cleared by md_check_recovery after a
      resync etc finished.  However it is possible for raid5_start_reshape
      to race and start a reshape before MD_RECOVERY_DONE is cleared.  This
      can lead to multiple reshapes running at the same time, which isn't
      good.
      
      So make sure it is cleared before starting a reshape, and also clear
      it when reaping a thread, just to be safe.
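
      The essence of the fix (hedged paraphrase of the patch):

          clear_bit(MD_RECOVERY_DONE, &mddev->recovery);   /* drop stale flag */
          set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);  /* then start */
          md_wakeup_thread(mddev->thread);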
      Signed-off-by: NeilBrown <neilb@suse.de>
  19. 02 June 2015 (1 commit)
    • writeback: move backing_dev_info->state into bdi_writeback · 4452226e
      Tejun Heo authored
      Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->state into wb.
      
      * enum bdi_state is renamed to wb_state and the prefix of all enums is
        changed from BDI_ to WB_.
      
      * Explicit zeroing of bdi->state is removed without adding zeoring of
        wb->state as the whole data structure is zeroed on init anyway.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->state are mechanically replaced with bdi->wb.state
        introducing no behavior changes.
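
      The replacement is mechanical; an illustrative hunk (not taken from
      the actual diff):

          -    if (test_bit(BDI_registered, &bdi->state))
          +    if (test_bit(WB_registered, &bdi->wb.state))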
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: drbd-dev@lists.linbit.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  20. 22 April 2015 (1 commit)
    • md: remove 'go_faster' option from ->sync_request() · 09314799
      NeilBrown authored
      This option is not well justified and testing suggests that
      it hardly ever makes any difference.
      
      The comment suggests there might be a need to wait for non-resync
      activity indicated by ->nr_waiting, however raise_barrier()
      already waits for all of that.
      
      So just remove it to simplify reasoning about speed limiting.
      
      This allows us to remove a 'FIXME' comment from raid5.c as that
      never used the flag.
      Signed-off-by: NeilBrown <neilb@suse.de>
  21. 16 February 2015 (1 commit)
  22. 12 February 2015 (1 commit)
    • md/raid10: fix conversion from RAID0 to RAID10 · 53a6ab4d
      NeilBrown authored
      A RAID0 array (like a LINEAR array) does not have a concept
      of 'size' being the amount of each device that is in use.
      Rather, as much of each device as is available is used.
      So the 'size' is set to 0 and ignored.
      
      RAID10 does have this concept and needs it to be set correctly.
      So when we convert RAID0 to RAID10 we must determine the
      'size' (that being the size of the first 'strip_zone' in the
      RAID0), and set it correctly.
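
      Hedged sketch of the takeover fix (strip_zone field names are from
      the raid0 personality; the exact computation in the patch may differ
      slightly):

          struct r0conf *raid0_conf = mddev->private;
          sector_t size = raid0_conf->strip_zone[0].zone_end;

          /* per-device size: the zone spread across its member devices */
          sector_div(size, raid0_conf->strip_zone[0].nb_dev);
          mddev->dev_sectors = size;  /* previously left at 0 */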
      Reported-and-tested-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  23. 04 February 2015 (4 commits)
    • md: rename ->stop to ->free · afa0f557
      NeilBrown authored
      Now that the ->stop function only frees the private data,
      rename it accordingly.
      
      Also pass in the private pointer as an arg rather than using
      mddev->private.  This flexibility will be useful in level_store().
      
      Finally, don't clear ->private.  It doesn't make sense to clear
      it, seeing that it isn't what we free, and it is no longer necessary
      to clear ->private (it was some time ago, before ->to_remove was
      introduced).
      
      Setting ->to_remove in ->free() is a bit of a wart, but not a
      big problem at the moment.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: split detach operation out from ->stop. · 5aa61f42
      NeilBrown authored
      Each md personality has a 'stop' operation which does two
      things:
       1/ it finalizes some aspects of the array to ensure nothing
          is accessing the ->private data
       2/ it frees the ->private data.
      
      All the steps in '1' can apply to all arrays and so can be
      performed in common code.
      
      This is useful as in the case where we change the personality which
      manages an array (in level_store()), it would be helpful to do
      step 1 early, and step 2 later.
      
      So split the 'step 1' functionality out into a new mddev_detach().
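
      Roughly, the new helper collects the personality-independent
      "step 1" work (hedged outline, not the complete function):

          static void mddev_detach(struct mddev *mddev)
          {
              /* make sure nothing is still touching ->private */
              if (mddev->pers && mddev->pers->quiesce) {
                  mddev->pers->quiesce(mddev, 1);
                  mddev->pers->quiesce(mddev, 0);
              }
              md_unregister_thread(&mddev->thread);
              if (mddev->queue)
                  blk_sync_queue(mddev->queue);  /* flush pending work */
          }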
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make merge_bvec_fn more robust in face of personality changes. · 64590f45
      NeilBrown authored
      There is no locking around calls to merge_bvec_fn(), so
      it is possible that calls which coincide with a level (or personality)
      change could go wrong.
      
      So create a central dispatch point for these functions and use
      rcu_read_lock().
      If the array is suspended, reject any merge that can be rejected.
      If not, we know it is safe to call the function.
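
      Sketch of the central dispatcher (condensed; even when suspended the
      first vec must be accepted so progress is always possible):

          static int md_mergeable_bvec(struct request_queue *q,
                                       struct bvec_merge_data *bvm,
                                       struct bio_vec *biovec)
          {
              struct mddev *mddev = q->queuedata;
              int ret;

              rcu_read_lock();
              if (mddev->suspended) {
                  /* reject every merge we are allowed to reject */
                  ret = bvm->bi_size == 0 ? biovec->bv_len : 0;
              } else {
                  struct md_personality *pers = mddev->pers;

                  if (pers && pers->mergeable_bvec)
                      ret = pers->mergeable_bvec(mddev, bvm, biovec);
                  else
                      ret = biovec->bv_len;  /* no restriction */
              }
              rcu_read_unlock();
              return ret;
          }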
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make ->congested robust against personality changes. · 5c675f83
      NeilBrown authored
      There is currently no locking around calls to the 'congested'
      bdi function.  If called at an awkward time while an array is
      being converted from one level (or personality) to another, there
      is a tiny chance of running code in an unreferenced module etc.
      
      So add a 'congested' function to the md_personality operations
      structure, and call it with appropriate locking from a central
      'mddev_congested'.
      
      When the array personality is changing the array will be 'suspended'
      so no IO is processed.
      If mddev_congested detects this, it simply reports that the
      array is congested, which is a safe guess.
      As mddev_suspend calls synchronize_rcu(), mddev_congested can
      avoid races by including the whole call inside an rcu_read_lock()
      region.
      This requires that the congested functions for all subordinate devices
      can be run under rcu_lock.  Fortunately this is the case.
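
      Condensed form of the central call (hedged; written from the
      description above rather than the literal patch):

          static int mddev_congested(void *data, int bits)
          {
              struct mddev *mddev = data;
              int ret;

              rcu_read_lock();
              if (mddev->suspended)
                  ret = 1;  /* safe guess: report congested */
              else {
                  struct md_personality *pers = mddev->pers;

                  ret = (pers && pers->congested) ?
                        pers->congested(mddev, bits) : 0;
              }
              rcu_read_unlock();
              return ret;
          }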
      Signed-off-by: NeilBrown <neilb@suse.de>
  24. 14 October 2014 (2 commits)
  25. 09 October 2014 (1 commit)
  26. 19 August 2014 (4 commits)
    • md/raid10: always initialise ->state on newly allocated r10_bio · cb8b12b5
      NeilBrown authored
      Most places which allocate an r10_bio zero the ->state, some don't.
      As the r10_bio comes from a mempool, and the allocation function uses
      kzalloc, it is often zero anyway.  But sometimes it isn't, and it is
      best to be safe.
      
      I only noticed this because of the bug fixed by an earlier patch
      where the r10_bios allocated for a reshape were left around to
      be used by a subsequent resync.  In that case the R10BIO_IsReshape
      flag caused problems.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: avoid memory leak on error path during reshape. · e337aead
      NeilBrown authored
      If raid10 reshape fails to find somewhere to read a block
      from, it returns without freeing memory...
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: Fix memory leak when raid10 reshape completes. · b3968552
      NeilBrown authored
      When a raid10 commences a resync/recovery/reshape it allocates
      some buffer space.
      When a resync/recovery completes the buffer space is freed.  But not
      when the reshape completes.
      This can result in a small memory leak.
      
      There is a subtle side-effect of this bug.  When a RAID10 is reshaped
      to a larger array (more devices), the reshape is immediately followed
      by a "resync" of the new space.  This "resync" will use the buffer
      space which was allocated for "reshape".  This can cause problems
      including a "BUG" in the SCSI layer.  So this is suitable for -stable.
      
      Cc: stable@vger.kernel.org (v3.5+)
      Fixes: 3ea7daa5 ("md/raid10: add reshape support")
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: fix memory leak when reshaping a RAID10. · ce0b0a46
      NeilBrown authored
      raid10 reshape clears unwanted bits from a bio->bi_flags using
      a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC
      was added.
      Since then it clears that bit but shouldn't.  This results in a
      memory leak.
      
      So change to use the approved method of clearing unwanted bits.
      
      As this causes a memory leak which can consume all of memory,
      the fix is suitable for -stable.
      
      Fixes: a38352e0
      Cc: stable@vger.kernel.org (v3.10+)
      Reported-by: mdraid.pkoch@dfgh.net (Peter Koch)
      Signed-off-by: NeilBrown <neilb@suse.de>
  27. 31 July 2014 (1 commit)
    • md/raid1,raid10: always abort recover on write error. · 2446dba0
      NeilBrown authored
      Currently we don't abort recovery on a write error if the write error
      to the recovering device was triggered by normal IO (as opposed to
      recovery IO).
      
      This means that for one bitmap region, the recovery might write to the
      recovering device for a few sectors, then not bother for subsequent
      sectors (as it never writes to failed devices).  In this case
      the bitmap bit will be cleared, but it really shouldn't.
      
      The result is that if the recovering device fails and is then re-added
      (after fixing whatever hardware problem triggered the failure),
      the second recovery won't redo the region it was in the middle of,
      so some of the device will not be recovered properly.
      
      If we abort the recovery, the region being processed will be cancelled
      (bit not cleared) and the whole region will be retried.
      
      As the bug can result in data corruption the patch is suitable for
      -stable.  For kernels prior to 3.11 there is a conflict in raid10.c
      which will require care.
      
      Original-from: jiao hui <jiaohui@bwstor.com.cn>
      Reported-and-tested-by: jiao hui <jiaohui@bwstor.com.cn>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@vger.kernel.org
  28. 06 May 2014 (1 commit)