1. 16 2月, 2017 1 次提交
    • M
      md: fast clone bio in bio_clone_mddev() · d7a10308
      Ming Lei 提交于
      Firstly bio_clone_mddev() is used in raid normal I/O and isn't
      in resync I/O path.
      
      Secondly all the direct access to bvec table in raid happens on
      resync I/O except for write behind of raid1, in which we still
      use bio_clone() for allocating new bvec table.
      
      So this patch replaces bio_clone() with bio_clone_fast()
      in bio_clone_mddev().
      
      Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
      suggested by Christoph Hellwig.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d7a10308
  2. 02 2月, 2017 1 次提交
  3. 04 1月, 2017 1 次提交
  4. 09 12月, 2016 1 次提交
    • S
      md: separate flags for superblock changes · 2953079c
      Shaohua Li 提交于
      The mddev->flags are used for different purposes. There are a lot of
      places we check/change the flags without masking unrelated flags, we
      could check/change unrelated flags. These usage are most for superblock
      write, so spearate superblock related flags. This should make the code
      clearer and also fix real bugs.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2953079c
  5. 23 11月, 2016 3 次提交
    • N
      md/raid10: add failfast handling for writes. · 1919cbb2
      NeilBrown 提交于
      When writing to a fastfail device, we use MD_FASTFAIL unless
      it is the only device being written to.  For
      resync/recovery, assume there was a working device to read
      from so always use MD_FASTFAIL.
      
      If a write for resync/recovery fails, we just fail the
      device - there is not much else to do.
      
      If a normal write fails, but the device cannot be marked
      Faulty (must be only one left), we queue for write error
      handling which calls narrow_write_error() to write the block
      synchronously without any failfast flags.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      1919cbb2
    • N
      md/raid10: add failfast handling for reads. · 8d3ca83d
      NeilBrown 提交于
      If a device is marked FailFast, and it is not the only
      device we can read from, we mark the bio as MD_FAILFAST.
      
      If this does fail-fast, we don't try read repair but just
      allow failure.
      
      If it was the last device, it doesn't get marked Faulty so
      the retry happens on the same device - this time without
      FAILFAST.  A subsequent failure will not retry but will just
      pass up the error.
      
      During resync we may use FAILFAST requests, and on a failure
      we will simply use the other device(s).
      
      During recovery we will only use FAILFAST in the unusual
      case were there are multiple places to read from - i.e. if
      there are > 2 devices.  If we get a failure we will fail the
      device and complete the resync/recovery with remaining
      devices.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      8d3ca83d
    • N
      md: Use REQ_FAILFAST_* on metadata writes where appropriate · 46533ff7
      NeilBrown 提交于
      This can only be supported on personalities which ensure
      that md_error() never causes an array to enter the 'failed'
      state.  i.e. if marking a device Faulty would cause some
      data to be inaccessible, the device is status is left as
      non-Faulty.  This is true for RAID1 and RAID10.
      
      If we get a failure writing metadata but the device doesn't
      fail, it must be the last device so we re-write without
      FAILFAST to improve chance of success.  We also flag the
      device as LastDev so that future metadata updates don't
      waste time on failfast writes.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      46533ff7
  6. 19 11月, 2016 2 次提交
    • N
      md/raid1, raid10: add blktrace records when IO is delayed · 578b54ad
      NeilBrown 提交于
      Both raid1 and raid10 will sometimes delay handling an IO request,
      such as when resync is happening or there are too many requests queued.
      
      Add some blktrace messsages so we can see when that is happening when
      looking for performance artefacts.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      578b54ad
    • N
      md: add block tracing for bio_remapping · 109e3765
      NeilBrown 提交于
      The block tracing infrastructure (accessed with blktrace/blkparse)
      supports the tracing of mapping bios from one device to another.
      This is currently used when a bio in a partition is mapped to the
      whole device, when bios are mapped by dm, and for mapping in md/raid5.
      Other md personalities do not include this tracing yet, so add it.
      
      When a read-error is detected we redirect the request to a different device.
      This could justifiably be seen as a new mapping for the originial bio,
      or a secondary mapping for the bio that errors.  This patch uses
      the second option.
      
      When md is used under dm-raid, the mappings are not traced as we do
      not have access to the block device number of the parent.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      109e3765
  7. 08 11月, 2016 2 次提交
  8. 25 10月, 2016 1 次提交
    • S
      RAID10: ignore discard error · 579ed34f
      Shaohua Li 提交于
      This is the counterpart of raid10 fix. If a write error occurs, raid10
      will try to rewrite the bio in small chunk size. If the rewrite fails,
      raid10 will record the error in bad block. narrow_write_error will
      always use WRITE for the bio, but actually it could be a discard. Since
      discard bio hasn't payload, write the bio will cause different issues.
      But discard error isn't fatal, we can safely ignore it. This is what
      this patch does.
      
      This issue should exist since discard is added, but only exposed with
      recent arbitrary bio size feature.
      
      Cc: Sitsofe Wheeler <sitsofe@gmail.com>
      Cc: stable@vger.kernel.org (v3.6)
      Signed-off-by: NShaohua Li <shli@fb.com>
      579ed34f
  9. 25 8月, 2016 1 次提交
  10. 08 8月, 2016 1 次提交
    • J
      block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe 提交于
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokeness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      1eff9d32
  11. 31 7月, 2016 1 次提交
  12. 20 7月, 2016 2 次提交
    • T
      raid10: improve random reads performance · 0e5313e2
      Tomasz Majchrzak 提交于
      RAID10 random read performance is lower than expected due to excessive spinlock
      utilisation which is required mostly for rebuild/resync. Simplify allow_barrier
      as it's in IO path and encounters a lot of unnecessary congestion.
      
      As lower_barrier just takes a lock in order to decrement a counter, convert
      counter (nr_pending) into atomic variable and remove the spin lock. There is
      also a congestion for wake_up (it uses lock internally) so call it only when
      it's really needed. As wake_up is not called constantly anymore, ensure process
      waiting to raise a barrier is notified when there are no more waiting IOs.
      Signed-off-by: NTomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      0e5313e2
    • A
      md: use seconds granularity for error logging · 0e3ef49e
      Arnd Bergmann 提交于
      The md code stores the exact time of the last error in the
      last_read_error variable using a timespec structure. It only
      ever uses the seconds portion of that though, so we can
      use a scalar for it.
      
      There won't be an overflow in 2038 here, because it already
      used monotonic time and 32-bit is enough for that, but I've
      decided to use time64_t for consistency in the conversion.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NShaohua Li <shli@fb.com>
      0e3ef49e
  13. 14 6月, 2016 10 次提交
  14. 09 6月, 2016 1 次提交
  15. 08 6月, 2016 3 次提交
  16. 10 5月, 2016 2 次提交
    • G
      md: set MD_CHANGE_PENDING in a atomic region · 85ad1d13
      Guoqing Jiang 提交于
      Some code waits for a metadata update by:
      
      1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
      2. setting MD_CHANGE_PENDING and waking the management thread
      3. waiting for MD_CHANGE_PENDING to be cleared
      
      If the first two are done without locking, the code in md_update_sb()
      which checks if it needs to repeat might test if an update is needed
      before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
      in the wait returning early.
      
      So make sure all places that set MD_CHANGE_PENDING are atomicial, and
      bit_clear_unless (suggested by Neil) is introduced for the purpose.
      
      Cc: Martin Kepplinger <martink@posteo.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: <linux-kernel@vger.kernel.org>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      85ad1d13
    • H
      md: raid10: add prerequisite to run underneath dm-raid · 859644f0
      Heinz Mauelshagen 提交于
      In case md runs underneath the dm-raid target, the mddev does not have
      a request queue or gendisk, thus avoid accesses to it.
      
      This patch adds two missing conditionals to the raid10 personality.
      Signed-of-by: NHeinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      859644f0
  17. 18 3月, 2016 1 次提交
  18. 21 1月, 2016 1 次提交
  19. 14 1月, 2016 1 次提交
    • D
      md/raid: only permit hot-add of compatible integrity profiles · 1501efad
      Dan Williams 提交于
      It is not safe for an integrity profile to be changed while i/o is
      in-flight in the queue.  Prevent adding new disks or otherwise online
      spares to an array if the device has an incompatible integrity profile.
      
      The original change to the blk_integrity_unregister implementation in
      md, commmit c7bfced9 "md: suspend i/o during runtime
      blk_integrity_unregister" introduced an immediate hang regression.
      
      This policy of disallowing changes the integrity profile once one has
      been established is shared with DM.
      
      Here is an abbreviated log from a test run that:
      1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [   59.076127]
      2/ Tries to add an integrity-disabled device (pmem1m) [   90.489209]
      3/ Retries with an integrity-enabled device (pmem1s) [  205.671277]
      
      [   59.076127] md/raid1:md0: active with 1 out of 2 mirrors
      [   59.078302] md: data integrity enabled on md0
      [..]
      [   90.489209] md0: incompatible integrity profile for pmem1m
      [..]
      [  205.671277] md: super_written gets error=-5
      [  205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
      [  205.677386] md/raid1:md0: Operation continuing on 1 devices.
      [  205.683037] RAID1 conf printout:
      [  205.684699]  --- wd:1 rd:2
      [  205.685972]  disk 0, wo:0, o:1, dev:pmem0s
      [  205.687562]  disk 1, wo:1, o:1, dev:pmem1s
      [  205.691717] md: recovery of RAID array md0
      
      Fixes: c7bfced9 ("md: suspend i/o during runtime blk_integrity_unregister")
      Cc: <stable@vger.kernel.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reported-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      1501efad
  20. 18 12月, 2015 1 次提交
    • A
      md/raid10: fix data corruption and crash during resync · cc578588
      Artur Paszkiewicz 提交于
      The commit c31df25f ("md/raid10: make sync_request_write() call
      bio_copy_data()") replaced manual data copying with bio_copy_data() but
      it doesn't work as intended. The source bio (fbio) is already processed,
      so its bvec_iter has bi_size == 0 and bi_idx == bi_vcnt.  Because of
      this, bio_copy_data() either does not copy anything, or worse, copies
      data from the ->bi_next bio if it is set.  This causes wrong data to be
      written to drives during resync and sometimes lockups/crashes in
      bio_copy_data():
      
      [  517.338478] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [md126_raid10:3319]
      [  517.347324] Modules linked in: raid10 xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul cryptd shpchp pcspkr ipmi_si ipmi_msghandler tpm_crb acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom sd_mod e1000e ax88179_178a usbnet mii ahci ata_generic crc32c_intel libahci ptp pata_acpi libata pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
      [  517.440555] CPU: 0 PID: 3319 Comm: md126_raid10 Not tainted 4.3.0-rc6+ #1
      [  517.448384] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDCRB1.86B.0055.D14.1509221924 09/22/2015
      [  517.459768] task: ffff880153773980 ti: ffff880150df8000 task.ti: ffff880150df8000
      [  517.468529] RIP: 0010:[<ffffffff812e1888>]  [<ffffffff812e1888>] bio_copy_data+0xc8/0x3c0
      [  517.478164] RSP: 0018:ffff880150dfbc98  EFLAGS: 00000246
      [  517.484341] RAX: ffff880169356688 RBX: 0000000000001000 RCX: 0000000000000000
      [  517.492558] RDX: 0000000000000000 RSI: ffffea0001ac2980 RDI: ffffea0000d835c0
      [  517.500773] RBP: ffff880150dfbd08 R08: 0000000000000001 R09: ffff880153773980
      [  517.508987] R10: ffff880169356600 R11: 0000000000001000 R12: 0000000000010000
      [  517.517199] R13: 000000000000e000 R14: 0000000000000000 R15: 0000000000001000
      [  517.525412] FS:  0000000000000000(0000) GS:ffff880174a00000(0000) knlGS:0000000000000000
      [  517.534844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  517.541507] CR2: 00007f8a044d5fed CR3: 0000000169504000 CR4: 00000000001406f0
      [  517.549722] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  517.557929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  517.566144] Stack:
      [  517.568626]  ffff880174a16bc0 ffff880153773980 ffff880169356600 0000000000000000
      [  517.577659]  0000000000000001 0000000000000001 ffff880153773980 ffff88016a61a800
      [  517.586715]  ffff880150dfbcf8 0000000000000001 ffff88016dd209e0 0000000000001000
      [  517.595773] Call Trace:
      [  517.598747]  [<ffffffffa043ef95>] raid10d+0xfc5/0x1690 [raid10]
      [  517.605610]  [<ffffffff816697ae>] ? __schedule+0x29e/0x8e2
      [  517.611987]  [<ffffffff814ff206>] md_thread+0x106/0x140
      [  517.618072]  [<ffffffff810c1d80>] ? wait_woken+0x80/0x80
      [  517.624252]  [<ffffffff814ff100>] ? super_1_load+0x520/0x520
      [  517.630817]  [<ffffffff8109ef89>] kthread+0xc9/0xe0
      [  517.636506]  [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
      [  517.643653]  [<ffffffff8166d99f>] ret_from_fork+0x3f/0x70
      [  517.649929]  [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Reviewed-by: NShaohua Li <shli@kernel.org>
      Cc: stable@vger.kernel.org (v4.2+)
      Fixes: c31df25f ("md/raid10: make sync_request_write() call bio_copy_data()")
      Signed-off-by: NNeilBrown <neilb@suse.com>
      cc578588
  21. 24 10月, 2015 2 次提交
    • N
      md/raid10: fix the 'new' raid10 layout to work correctly. · 8bce6d35
      NeilBrown 提交于
      In Linux 3.9 we introduce a new 'far' layout for RAID10 which was
      supposed to rotate the replicas differently and so provide better
      resilience.  In particular it could survive more combinations of 2
      drive failures.
      
      Unfortunately. due to a coding error, this some did what was wanted,
      sometimes improved less than we hoped, and sometimes - in very
      unlikely circumstances - put multiple replicas on the same device so
      the redundancy was harmed.
      
      No public user-space tool has created arrays using this layout so it
      is very unlikely that zero-redundancy arrays actually exist.  Probably
      no arrays using any form of the new layout exist.  But we cannot be
      certain.
      
      So use another bit in the 'layout' number and introduce a bug-fixed
      version of the layout.
      Also when assembling an array, if it has a zero-redundancy layout,
      give a warning.
      Reported-by: NHeinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      8bce6d35
    • N
      md/raid10: don't clear bitmap bit when bad-block-list write fails. · c340702c
      NeilBrown 提交于
      When a write fails and a bad-block-list is present, we can
      update the bad-block-list instead of writing the data.  If
      this succeeds then it is OK clear the relevant bitmap-bit as
      no further 'sync' of the block is needed.
      
      However if writing the bad-block-list fails then we need to
      treat the write as failed and particularly must not clear
      the bitmap bit.  Otherwise the device can be re-added (after
      any hardware connection issues are resolved) and because the
      relevant bit in the bitmap is clear, that block will not be
      resynced.  This leads to data corruption.
      
      We already delay the final bio_endio() on the write until
      the bad-block-list is written so that when the write
      returns: either that data is safe, the bad-block record is
      safe, or the fact that the device is faulty is safe.
      However we *don't* delay the clearing of the bitmap, so the
      bitmap bit can be recorded as cleared before we know if the
      bad-block-list was written safely.
      
      So: delay that until the write really is safe.
      i.e. move the call to close_write() until just before
      calling bio_endio(), and recheck the 'is array degraded'
      status before making that call.
      
      This bug goes back to v3.1 when bad-block-lists were
      introduced, though it only affects arrays created with
      mdadm-3.3 or later as only those have bad-block lists.
      
      Backports will require at least
      Commit: 95af587e ("md/raid10: ensure device failure recorded before write request returns.")
      as well.  I'll send that to 'stable' separately.
      
      Note that of the two tests of R10BIO_WriteError that this
      patch adds, the first is certain to fail and the second is
      certain to succeed.  However doing it this way makes the
      patch more obviously correct.  I will tidy the code up in a
      future merge window.
      Reported-by: NNate Dailey <nate.dailey@stratus.com>
      Fixes: bd870a16 ("md/raid10:  Handle write errors by updating badblock log.")
      Signed-off-by: NNeilBrown <neilb@suse.com>
      c340702c
  22. 22 10月, 2015 1 次提交