1. 23 Nov 2016, 1 commit
    • md: Use REQ_FAILFAST_* on metadata writes where appropriate · 46533ff7
      Committed by NeilBrown
      This can only be supported on personalities which ensure
      that md_error() never causes an array to enter the 'failed'
      state, i.e. if marking a device Faulty would cause some
      data to be inaccessible, the device's status is left as
      non-Faulty.  This is true for RAID1 and RAID10.
      
      If we get a failure writing metadata but the device doesn't
      fail, it must be the last device, so we re-write without
      FAILFAST to improve the chance of success.  We also flag the
      device as LastDev so that future metadata updates don't
      waste time on failfast writes.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
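
      A minimal sketch of the policy described above. FailFast, LastDev and
      MD_FAILFAST are the flags named in the message; the helpers and the
      exact flow are illustrative, not the literal md.c code:

          /* Pick the flags for a superblock write to one member device. */
          static unsigned int sb_write_flags(struct md_rdev *rdev)
          {
                  if (test_bit(FailFast, &rdev->flags) &&
                      !test_bit(LastDev, &rdev->flags))
                          return MD_FAILFAST;   /* fail fast, md can retry */
                  return 0;
          }

          /* On completion of that write. */
          static bool sb_write_needs_retry(struct bio *bio, struct md_rdev *rdev)
          {
                  if (!bio->bi_error || !(bio->bi_opf & MD_FAILFAST))
                          return false;
                  /* The failfast write failed but the device was not failed,
                   * so it must be the last working device: remember that and
                   * have the caller re-issue the write without MD_FAILFAST. */
                  set_bit(LastDev, &rdev->flags);
                  return true;
          }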
  2. 19 Nov 2016, 2 commits
    • md/raid1, raid10: add blktrace records when IO is delayed · 578b54ad
      Committed by NeilBrown
      Both raid1 and raid10 will sometimes delay handling an IO request,
      such as when resync is happening or there are too many requests queued.
      
      Add some blktrace messages so we can see when that is happening
      while looking for performance artefacts.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
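
      The messages go through the block layer's trace-message hook. A sketch
      of the kind of helper this adds (macro name and message text are
      illustrative), guarded so it is a no-op when md has no request queue,
      e.g. under dm-raid:

          #define raid10_log(md, fmt, args...)                               \
                  do {                                                       \
                          if ((md)->queue)                                   \
                                  blk_add_trace_msg((md)->queue,             \
                                                    "raid10 " fmt, ##args);  \
                  } while (0)

          /* e.g. just before blocking a request behind a resync barrier: */
          raid10_log(mddev, "wait barrier");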
    • md: add block tracing for bio_remapping · 109e3765
      Committed by NeilBrown
      The block tracing infrastructure (accessed with blktrace/blkparse)
      supports the tracing of mapping bios from one device to another.
      This is currently used when a bio in a partition is mapped to the
      whole device, when bios are mapped by dm, and for mapping in md/raid5.
      Other md personalities do not include this tracing yet, so add it.
      
      When a read-error is detected we redirect the request to a different device.
      This could justifiably be seen as a new mapping for the original bio,
      or a secondary mapping for the bio that errored.  This patch uses
      the second option.
      
      When md is used under dm-raid, the mappings are not traced as we do
      not have access to the block device number of the parent.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
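
      The records are emitted with the block layer's remap tracepoint. A
      sketch of the call for a read that has been directed to a member
      device (read_bio and r10_bio are raid10 internals, shown for context
      only):

          if (mddev->gendisk)
                  trace_block_bio_remap(bdev_get_queue(read_bio->bi_bdev),
                                        read_bio,
                                        disk_devt(mddev->gendisk),
                                        r10_bio->sector);

      The mddev->gendisk test is what keeps this quiet under dm-raid, where
      md has no gendisk of its own.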
  3. 08 Nov 2016, 2 commits
  4. 25 Oct 2016, 1 commit
    • RAID10: ignore discard error · 579ed34f
      Committed by Shaohua Li
      This is the raid10 counterpart of the raid1 fix. If a write error
      occurs, raid10 will try to rewrite the bio in small chunks. If the
      rewrite fails, raid10 records the error in the bad block list.
      narrow_write_error always uses WRITE for the bio, but it could
      actually be a discard. Since a discard bio has no payload, writing
      it out causes various problems. A discard error isn't fatal, though,
      so we can safely ignore it. This is what this patch does.
      
      This issue has existed since discard support was added, but it was
      only exposed by the recent arbitrary bio size feature.
      
      Cc: Sitsofe Wheeler <sitsofe@gmail.com>
      Cc: stable@vger.kernel.org (v3.6)
      Signed-off-by: Shaohua Li <shli@fb.com>
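
      A sketch of the check this adds in the rewrite-on-error path
      (simplified; narrow_write_error's bad-block bookkeeping is omitted):

          struct bio *bio = r10_bio->master_bio;

          if (bio_op(bio) == REQ_OP_DISCARD) {
                  /* A discard has no payload to re-write in smaller chunks,
                   * and a failed discard is not fatal, so just report
                   * success and skip the rewrite. */
                  return 1;
          }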
  5. 25 Aug 2016, 1 commit
  6. 08 Aug 2016, 1 commit
    • block: rename bio bi_rw to bi_opf · 1eff9d32
      Committed by Jens Axboe
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portion. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokenness linger,
      rename the member to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: Jens Axboe <axboe@fb.com>
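
      In practice the change looks like this at call sites (a sketch, not a
      hunk from the patch):

          /* Before: op and flags were mixed together in bio->bi_rw.
           * After: the flag bits live in bio->bi_opf and the op is read
           * with bio_op(), so raw assignments to bi_rw no longer compile. */
          if ((bio->bi_opf & REQ_SYNC) && bio_op(bio) == REQ_OP_WRITE)
                  handle_sync_write(bio);   /* illustrative handler */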
  7. 31 Jul 2016, 1 commit
  8. 20 Jul 2016, 2 commits
    • raid10: improve random reads performance · 0e5313e2
      Committed by Tomasz Majchrzak
      RAID10 random read performance is lower than expected due to excessive
      spinlock utilisation, which is required mostly for rebuild/resync.
      Simplify allow_barrier, as it is in the IO path and encounters a lot
      of unnecessary contention.
      
      As lower_barrier just takes a lock in order to decrement a counter,
      convert that counter (nr_pending) into an atomic variable and remove
      the spinlock. There is also contention around wake_up (it takes a lock
      internally), so call it only when really needed. As wake_up is no
      longer called constantly, ensure that a process waiting to raise a
      barrier is notified when there are no more waiting IOs.
      Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
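
      The hot-path change, reduced to its essentials (a sketch; the real
      allow_barrier also deals with freezing and queued-request accounting):

          /* Before: every completing request took conf->resync_lock. */
          spin_lock_irqsave(&conf->resync_lock, flags);
          conf->nr_pending--;
          spin_unlock_irqrestore(&conf->resync_lock, flags);
          wake_up(&conf->wait_barrier);

          /* After: a lock-free decrement, waking waiters only when the
           * last pending request completes. */
          if (atomic_dec_and_test(&conf->nr_pending))
                  wake_up(&conf->wait_barrier);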
    • md: use seconds granularity for error logging · 0e3ef49e
      Committed by Arnd Bergmann
      The md code stores the exact time of the last error in the
      last_read_error variable using a timespec structure. It only
      ever uses the seconds portion of that though, so we can
      use a scalar for it.
      
      There won't be an overflow in 2038 here, because it already
      used monotonic time and 32-bit is enough for that, but I've
      decided to use time64_t for consistency in the conversion.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
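
      The shape of the conversion (a sketch of the before/after, assuming
      the field lives in struct md_rdev as the message implies):

          /* Before: a full timespec, although only tv_sec was ever used. */
          struct timespec last_read_error;
          ktime_get_ts(&rdev->last_read_error);

          /* After: plain monotonic seconds; no 2038 problem even on 32-bit. */
          time64_t last_read_error;
          rdev->last_read_error = ktime_get_seconds();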
  9. 14 Jun 2016, 10 commits
  10. 09 Jun 2016, 1 commit
  11. 08 Jun 2016, 3 commits
  12. 10 May 2016, 2 commits
    • md: set MD_CHANGE_PENDING in a atomic region · 85ad1d13
      Committed by Guoqing Jiang
      Some code waits for a metadata update by:
      
      1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
      2. setting MD_CHANGE_PENDING and waking the management thread
      3. waiting for MD_CHANGE_PENDING to be cleared
      
      If the first two are done without locking, the code in md_update_sb()
      which checks if it needs to repeat might test if an update is needed
      before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
      in the wait returning early.
      
      So make sure all places that set MD_CHANGE_PENDING do so atomically;
      bit_clear_unless (suggested by Neil) is introduced for this purpose.
      
      Cc: Martin Kepplinger <martink@posteo.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: <linux-kernel@vger.kernel.org>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
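
      bit_clear_unless(ptr, clear, test) atomically clears the 'clear' bits
      only if none of the 'test' bits are set, and reports whether it did.
      A simplified function with the same semantics (the kernel helper is a
      type-generic macro built on cmpxchg()):

          static bool clear_bits_unless(unsigned long *flags,
                                        unsigned long clear,
                                        unsigned long test)
          {
                  unsigned long old, new;

                  do {
                          old = READ_ONCE(*flags);
                          if (old & test)
                                  return false;   /* leave the flags alone */
                          new = old & ~clear;
                  } while (cmpxchg(flags, old, new) != old);

                  return true;
          }

      Roughly, md_update_sb() uses this to clear MD_CHANGE_PENDING only if
      no new MD_CHANGE_DEVS/MD_CHANGE_CLEAN request was flagged in the
      meantime; otherwise it loops and writes the superblock again.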
    • md: raid10: add prerequisite to run underneath dm-raid · 859644f0
      Committed by Heinz Mauelshagen
      When md runs underneath the dm-raid target, the mddev does not have
      a request queue or gendisk, so avoid accessing them.
      
      This patch adds two missing conditionals to the raid10 personality.
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
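
      The guards are of this general form (the specific call and variables
      are illustrative, not the exact hunks of the patch):

          /* Under dm-raid the mddev owns neither a request queue nor a
           * gendisk, so both must be checked before use. */
          if (mddev->queue)
                  blk_queue_io_opt(mddev->queue, chunk_size * raid_disks);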
  13. 18 Mar 2016, 1 commit
  14. 21 Jan 2016, 1 commit
  15. 14 Jan 2016, 1 commit
    • md/raid: only permit hot-add of compatible integrity profiles · 1501efad
      Committed by Dan Williams
      It is not safe for an integrity profile to be changed while i/o is
      in-flight in the queue.  Prevent adding new disks to an array, or
      otherwise bringing spares online, if the device has an incompatible
      integrity profile.
      
      The original change to the blk_integrity_unregister implementation in
      md, commit c7bfced9 ("md: suspend i/o during runtime
      blk_integrity_unregister"), introduced an immediate hang regression.
      
      This policy of disallowing changes to the integrity profile once one
      has been established is shared with DM.
      
      Here is an abbreviated log from a test run that:
      1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [   59.076127]
      2/ Tries to add an integrity-disabled device (pmem1m) [   90.489209]
      3/ Retries with an integrity-enabled device (pmem1s) [  205.671277]
      
      [   59.076127] md/raid1:md0: active with 1 out of 2 mirrors
      [   59.078302] md: data integrity enabled on md0
      [..]
      [   90.489209] md0: incompatible integrity profile for pmem1m
      [..]
      [  205.671277] md: super_written gets error=-5
      [  205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
      [  205.677386] md/raid1:md0: Operation continuing on 1 devices.
      [  205.683037] RAID1 conf printout:
      [  205.684699]  --- wd:1 rd:2
      [  205.685972]  disk 0, wo:0, o:1, dev:pmem0s
      [  205.687562]  disk 1, wo:1, o:1, dev:pmem1s
      [  205.691717] md: recovery of RAID array md0
      
      Fixes: c7bfced9 ("md: suspend i/o during runtime blk_integrity_unregister")
      Cc: <stable@vger.kernel.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reported-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
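
      A simplified sketch of the hot-add check (close to, but not exactly,
      md_integrity_add_rdev(); error handling trimmed):

          char name[BDEVNAME_SIZE];

          /* Nothing to enforce if the array has no integrity profile yet. */
          if (!mddev->gendisk || !blk_get_integrity(mddev->gendisk))
                  return 0;

          /* Refuse the new device if its profile differs from the array's. */
          if (blk_integrity_compare(mddev->gendisk, rdev->bdev->bd_disk) != 0) {
                  pr_err("%s: incompatible integrity profile for %s\n",
                         mdname(mddev), bdevname(rdev->bdev, name));
                  return -ENXIO;
          }
          return 0;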
  16. 18 Dec 2015, 1 commit
    • md/raid10: fix data corruption and crash during resync · cc578588
      Committed by Artur Paszkiewicz
      The commit c31df25f ("md/raid10: make sync_request_write() call
      bio_copy_data()") replaced manual data copying with bio_copy_data() but
      it doesn't work as intended. The source bio (fbio) is already processed,
      so its bvec_iter has bi_size == 0 and bi_idx == bi_vcnt.  Because of
      this, bio_copy_data() either does not copy anything, or worse, copies
      data from the ->bi_next bio if it is set.  This causes wrong data to be
      written to drives during resync and sometimes lockups/crashes in
      bio_copy_data():
      
      [  517.338478] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [md126_raid10:3319]
      [  517.347324] Modules linked in: raid10 xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul cryptd shpchp pcspkr ipmi_si ipmi_msghandler tpm_crb acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom sd_mod e1000e ax88179_178a usbnet mii ahci ata_generic crc32c_intel libahci ptp pata_acpi libata pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
      [  517.440555] CPU: 0 PID: 3319 Comm: md126_raid10 Not tainted 4.3.0-rc6+ #1
      [  517.448384] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDCRB1.86B.0055.D14.1509221924 09/22/2015
      [  517.459768] task: ffff880153773980 ti: ffff880150df8000 task.ti: ffff880150df8000
      [  517.468529] RIP: 0010:[<ffffffff812e1888>]  [<ffffffff812e1888>] bio_copy_data+0xc8/0x3c0
      [  517.478164] RSP: 0018:ffff880150dfbc98  EFLAGS: 00000246
      [  517.484341] RAX: ffff880169356688 RBX: 0000000000001000 RCX: 0000000000000000
      [  517.492558] RDX: 0000000000000000 RSI: ffffea0001ac2980 RDI: ffffea0000d835c0
      [  517.500773] RBP: ffff880150dfbd08 R08: 0000000000000001 R09: ffff880153773980
      [  517.508987] R10: ffff880169356600 R11: 0000000000001000 R12: 0000000000010000
      [  517.517199] R13: 000000000000e000 R14: 0000000000000000 R15: 0000000000001000
      [  517.525412] FS:  0000000000000000(0000) GS:ffff880174a00000(0000) knlGS:0000000000000000
      [  517.534844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  517.541507] CR2: 00007f8a044d5fed CR3: 0000000169504000 CR4: 00000000001406f0
      [  517.549722] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  517.557929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  517.566144] Stack:
      [  517.568626]  ffff880174a16bc0 ffff880153773980 ffff880169356600 0000000000000000
      [  517.577659]  0000000000000001 0000000000000001 ffff880153773980 ffff88016a61a800
      [  517.586715]  ffff880150dfbcf8 0000000000000001 ffff88016dd209e0 0000000000001000
      [  517.595773] Call Trace:
      [  517.598747]  [<ffffffffa043ef95>] raid10d+0xfc5/0x1690 [raid10]
      [  517.605610]  [<ffffffff816697ae>] ? __schedule+0x29e/0x8e2
      [  517.611987]  [<ffffffff814ff206>] md_thread+0x106/0x140
      [  517.618072]  [<ffffffff810c1d80>] ? wait_woken+0x80/0x80
      [  517.624252]  [<ffffffff814ff100>] ? super_1_load+0x520/0x520
      [  517.630817]  [<ffffffff8109ef89>] kthread+0xc9/0xe0
      [  517.636506]  [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
      [  517.643653]  [<ffffffff8166d99f>] ret_from_fork+0x3f/0x70
      [  517.649929]  [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Reviewed-by: Shaohua Li <shli@kernel.org>
      Cc: stable@vger.kernel.org (v4.2+)
      Fixes: c31df25f ("md/raid10: make sync_request_write() call bio_copy_data()")
      Signed-off-by: NeilBrown <neilb@suse.com>
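
      The fix is to rewind the source bio's iterator before it is used as
      the copy source, roughly:

          /* fbio has already completed, so its iterator is spent
           * (bi_size == 0, bi_idx == bi_vcnt).  Reset it to cover the
           * resynced range again before copying from it. */
          fbio->bi_iter.bi_size = r10_bio->sectors << 9;
          fbio->bi_iter.bi_idx = 0;

          bio_copy_data(tbio, fbio);    /* now copies the intended data */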
  17. 24 Oct 2015, 2 commits
    • md/raid10: fix the 'new' raid10 layout to work correctly. · 8bce6d35
      Committed by NeilBrown
      In Linux 3.9 we introduced a new 'far' layout for RAID10 which was
      supposed to rotate the replicas differently and so provide better
      resilience.  In particular it could survive more combinations of 2
      drive failures.
      
      Unfortunately, due to a coding error, this sometimes did what was
      wanted, sometimes improved things less than we hoped, and sometimes
      - in very unlikely circumstances - put multiple replicas on the same
      device, so the redundancy was harmed.
      
      No public user-space tool has created arrays using this layout so it
      is very unlikely that zero-redundancy arrays actually exist.  Probably
      no arrays using any form of the new layout exist.  But we cannot be
      certain.
      
      So use another bit in the 'layout' number and introduce a bug-fixed
      version of the layout.
      Also when assembling an array, if it has a zero-redundancy layout,
      give a warning.
      Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md/raid10: don't clear bitmap bit when bad-block-list write fails. · c340702c
      Committed by NeilBrown
      When a write fails and a bad-block-list is present, we can
      update the bad-block-list instead of writing the data.  If
      this succeeds then it is OK to clear the relevant bitmap-bit as
      no further 'sync' of the block is needed.
      
      However if writing the bad-block-list fails then we need to
      treat the write as failed and particularly must not clear
      the bitmap bit.  Otherwise the device can be re-added (after
      any hardware connection issues are resolved) and because the
      relevant bit in the bitmap is clear, that block will not be
      resynced.  This leads to data corruption.
      
      We already delay the final bio_endio() on the write until
      the bad-block-list is written so that when the write
      returns: either that data is safe, the bad-block record is
      safe, or the fact that the device is faulty is safe.
      However we *don't* delay the clearing of the bitmap, so the
      bitmap bit can be recorded as cleared before we know if the
      bad-block-list was written safely.
      
      So: delay that until the write really is safe.
      i.e. move the call to close_write() until just before
      calling bio_endio(), and recheck the 'is array degraded'
      status before making that call.
      
      This bug goes back to v3.1 when bad-block-lists were
      introduced, though it only affects arrays created with
      mdadm-3.3 or later as only those have bad-block lists.
      
      Backports will require at least
      Commit: 95af587e ("md/raid10: ensure device failure recorded before write request returns.")
      as well.  I'll send that to 'stable' separately.
      
      Note that of the two tests of R10BIO_WriteError that this
      patch adds, the first is certain to fail and the second is
      certain to succeed.  However doing it this way makes the
      patch more obviously correct.  I will tidy the code up in a
      future merge window.
      Reported-by: Nate Dailey <nate.dailey@stratus.com>
      Fixes: bd870a16 ("md/raid10:  Handle write errors by updating badblock log.")
      Signed-off-by: NeilBrown <neilb@suse.com>
  18. 22 Oct 2015, 1 commit
  19. 21 Oct 2015, 1 commit
  20. 12 Oct 2015, 1 commit
    • md-cluster: Use a small window for resync · c40f341f
      Committed by Goldwyn Rodrigues
      Suspending the entire device for resync could take too long. Resync
      in small chunks.
      
      The cluster's resync window (32M) is maintained in r1conf as
      cluster_sync_low and cluster_sync_high and processed in
      raid1's sync_request(). If the current resync is outside the cluster
      resync window:
      
      1. Set the cluster_sync_low to curr_resync_completed.
      2. Check if the sync will fit in the new window, if not issue a
         wait_barrier() and set cluster_sync_low to sector_nr.
      3. Set cluster_sync_high to cluster_sync_low + resync_window.
      4. Send a message to all nodes so they may add it in their suspension
         list.
      
      bitmap_cond_end_sync is modified to allow forcing a sync in order
      to get curr_resync_completed up to date with the sector passed.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  21. 09 Oct 2015, 1 commit
    • crash in md-raid1 and md-raid10 due to incorrect list manipulation · a452744b
      Committed by Mikulas Patocka
      Commit 55ce74d4 ("md/raid1: ensure device failure recorded before
      write request returns") is causing a crash in the LVM2 testsuite test
      shell/lvchange-raid.sh. For me the crash is 100% reproducible.
      
      The reason for the crash is that the newly added code in raid1d moves
      the list from conf->bio_end_io_list to tmp, then tests if tmp is
      non-empty and then incorrectly pops the bio from conf->bio_end_io_list
      (which is empty because the list was already moved).
      
      Raid-10 has a similar bug.
      
      Kernel Fault: Code=15 regs=000000006ccb8640 (Addr=0000000100000000)
      CPU: 3 PID: 1930 Comm: mdX_raid1 Not tainted 4.2.0-rc5-bisect+ #35
      task: 000000006cc1f258 ti: 000000006ccb8000 task.ti: 000000006ccb8000
      
           YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
      PSW: 00001000000001001111111000001111 Not tainted
      r00-03  000000ff0804fe0f 000000001059d000 000000001059f818 000000007f16be38
      r04-07  000000001059d000 000000007f16be08 0000000000200200 0000000000000001
      r08-11  000000006ccb8260 000000007b7934d0 0000000000000001 0000000000000000
      r12-15  000000004056f320 0000000000000000 0000000000013dd0 0000000000000000
      r16-19  00000000f0d00ae0 0000000000000000 0000000000000000 0000000000000001
      r20-23  000000000800000f 0000000042200390 0000000000000000 0000000000000000
      r24-27  0000000000000001 000000000800000f 000000007f16be08 000000001059d000
      r28-31  0000000100000000 000000006ccb8560 000000006ccb8640 0000000000000000
      sr00-03  0000000000249800 0000000000000000 0000000000000000 0000000000249800
      sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      
      IASQ: 0000000000000000 0000000000000000 IAOQ: 000000001059f61c 000000001059f620
       IIR: 0f8010c6    ISR: 0000000000000000  IOR: 0000000100000000
       CPU:        3   CR30: 000000006ccb8000 CR31: 0000000000000000
       ORIG_R28: 000000001059d000
       IAOQ[0]: call_bio_endio+0x34/0x1a8 [raid1]
       IAOQ[1]: call_bio_endio+0x38/0x1a8 [raid1]
       RP(r2): raid_end_bio_io+0x88/0x168 [raid1]
      Backtrace:
       [<000000001059f818>] raid_end_bio_io+0x88/0x168 [raid1]
       [<00000000105a4f64>] raid1d+0x144/0x1640 [raid1]
       [<000000004017fd5c>] kthread+0x144/0x160
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Fixes: 55ce74d4 ("md/raid1: ensure device failure recorded before write request returns.")
      Fixes: 95af587e ("md/raid10: ensure device failure recorded before write request returns.")
      Signed-off-by: NeilBrown <neilb@suse.com>
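
      The corrected pattern, reduced to its essentials (names follow the
      raid10 code; the raid1 version is analogous):

          LIST_HEAD(tmp);

          spin_lock_irqsave(&conf->device_lock, flags);
          list_splice_init(&conf->bio_end_io_list, &tmp);
          spin_unlock_irqrestore(&conf->device_lock, flags);

          while (!list_empty(&tmp)) {
                  /* The bug was taking entries from the original (now empty)
                   * conf->bio_end_io_list here instead of from &tmp. */
                  r10_bio = list_first_entry(&tmp, struct r10bio, retry_list);
                  list_del(&r10_bio->retry_list);
                  raid_end_bio_io(r10_bio);
          }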
  22. 02 Oct 2015, 1 commit
  23. 01 Sep 2015, 2 commits
    • md/raid10: ensure device failure recorded before write request returns. · 95af587e
      Committed by NeilBrown
      When a write to one of the legs of a RAID10 fails, the failure is
      recorded in the metadata of the other legs so that after a restart
      the data on the failed drive won't be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a race to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that experienced an error on a new queue which
         is only processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NeilBrown <neilb@suse.com>
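
      A sketch of the three steps in raid10 terms (simplified; the real code
      also rechecks the degraded state and handles the bitmap):

          /* 1. On a leg failure, request a metadata update and mark it
           *    pending so its completion can be detected with certainty. */
          set_bit(MD_CHANGE_DEVS, &mddev->flags);
          set_bit(MD_CHANGE_PENDING, &mddev->flags);
          md_wakeup_thread(mddev->thread);

          /* 2. Instead of ending the bio now, park it on a holding list. */
          list_add_tail(&r10_bio->retry_list, &conf->bio_end_io_list);

          /* 3. In raid10d, complete parked bios only once the superblock
           *    write has finished. */
          if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
                  /* splice bio_end_io_list and call raid_end_bio_io()
                   * on each entry */
          }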
    • md/raid10: fix a few typos in comments · 02ec5026
      Committed by NeilBrown
      Signed-off-by: NeilBrown <neilb@suse.com>