1. 14 Jan, 2016 4 commits
    • md/raid: only permit hot-add of compatible integrity profiles · 1501efad
      Dan Williams committed
      It is not safe for an integrity profile to be changed while i/o is
      in-flight in the queue.  Prevent adding new disks, or otherwise onlining
      spares, to an array if the device has an incompatible integrity profile.
      
      The original change to the blk_integrity_unregister implementation in
      md, commit c7bfced9 "md: suspend i/o during runtime
      blk_integrity_unregister" introduced an immediate hang regression.
      
      This policy of disallowing changes to the integrity profile once one has
      been established is shared with DM.
      
      Here is an abbreviated log from a test run that:
      1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [   59.076127]
      2/ Tries to add an integrity-disabled device (pmem1m) [   90.489209]
      3/ Retries with an integrity-enabled device (pmem1s) [  205.671277]
      
      [   59.076127] md/raid1:md0: active with 1 out of 2 mirrors
      [   59.078302] md: data integrity enabled on md0
      [..]
      [   90.489209] md0: incompatible integrity profile for pmem1m
      [..]
      [  205.671277] md: super_written gets error=-5
      [  205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
      [  205.677386] md/raid1:md0: Operation continuing on 1 devices.
      [  205.683037] RAID1 conf printout:
      [  205.684699]  --- wd:1 rd:2
      [  205.685972]  disk 0, wo:0, o:1, dev:pmem0s
      [  205.687562]  disk 1, wo:1, o:1, dev:pmem1s
      [  205.691717] md: recovery of RAID array md0
      
      Fixes: c7bfced9 ("md: suspend i/o during runtime blk_integrity_unregister")
      Cc: <stable@vger.kernel.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reported-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
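As a rough illustration of the policy described in commit 1501efad, the following Python sketch models the hot-add check in user space. It is not the kernel code: `IntegrityProfile`, `Array` and `hot_add` are hypothetical names, and the rule that an array without a profile accepts any device is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class IntegrityProfile:
    # Hypothetical stand-in for a block-layer integrity profile.
    name: str
    tuple_size: int


@dataclass
class Array:
    # Profile established when the array was assembled (None = no integrity).
    profile: Optional[IntegrityProfile]
    members: List[str] = field(default_factory=list)

    def hot_add(self, dev: str, dev_profile: Optional[IntegrityProfile]) -> bool:
        """Accept the device only if it is compatible with the array's profile."""
        # Assumption of this sketch: an array without a profile accepts anything.
        if self.profile is not None and dev_profile != self.profile:
            # The established profile must not change while I/O may be in flight.
            print(f"incompatible integrity profile for {dev}")
            return False
        self.members.append(dev)
        return True


example = IntegrityProfile("example-profile", 8)
md0 = Array(profile=example, members=["pmem0s"])
rejected = md0.hot_add("pmem1m", None)      # integrity-disabled device
accepted = md0.hot_add("pmem1s", example)   # integrity-enabled device
```

The point of the design is that the decision is made once, at hot-add time, so the profile never has to be unregistered while requests are queued.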
    • raid5-cache: handle journal hotadd in quiesce · 16a43f6a
      Shaohua Li committed
      Handle journal hotadd in quiesce to avoid creating duplicate threads.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • MD: add journal with array suspended · 87d4d916
      Shaohua Li committed
      Hot-adding a journal disk in recovery thread context causes a lot of
      trouble, as IO could be running. Unlike spare disk hot-add, adding a
      journal disk with the array suspended makes more sense and the
      implementation is much easier.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md: set MD_HAS_JOURNAL in correct places · a62ab49e
      Shaohua Li committed
      Set MD_HAS_JOURNAL when an array is loaded or a journal is initialized.
      This avoids setting the flag too early in journal disk hotadd.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
  2. 07 Jan, 2016 2 commits
  3. 06 Jan, 2016 18 commits
  4. 21 Dec, 2015 1 commit
    • md: remove check for MD_RECOVERY_NEEDED in action_store. · 312045ee
      NeilBrown committed
      md currently doesn't allow a 'sync_action' such as 'reshape' to be set
      while MD_RECOVERY_NEEDED is set.
      
      This is a problem, particularly since commit 738a2738 as that can
      cause ->check_shape to call mddev_resume() which sets
      MD_RECOVERY_NEEDED.  So by the time we come to start 'reshape' it is
      very likely that MD_RECOVERY_NEEDED is still set.
      
      Testing for this flag is not really needed and is in any case very
      racy as it can be set at any moment - asynchronously.  Any race
      between setting a sync_action and setting MD_RECOVERY_NEEDED must
      already be handled properly in some locked code, probably
      md_check_recovery(), so remove the test here.
      
      The test on MD_RECOVERY_RUNNING is also racy in the 'reshape' case
      so we should test it again after getting mddev_lock().
      
      As this fixes a race and a regression which can cause 'reshape' to
      fail, it is suitable for -stable kernels since 4.1.
      Reported-by: Xiao Ni <xni@redhat.com>
      Fixes: 738a2738 ("md/raid5: fix allocation of 'scribble' array.")
      Cc: stable@vger.kernel.org (v4.1+)
      Signed-off-by: NeilBrown <neilb@suse.com>
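The remark in commit 312045ee that MD_RECOVERY_RUNNING should be tested again after taking mddev_lock() is an instance of re-checking a racy condition under the lock. A minimal Python sketch of that pattern follows; the `Mddev` class, `start_reshape` function and string return values are hypothetical and only model the idea.

```python
import threading


class Mddev:
    def __init__(self):
        self.lock = threading.Lock()      # stands in for mddev_lock()
        self.recovery_running = False     # stands in for MD_RECOVERY_RUNNING


def start_reshape(mddev: Mddev) -> str:
    # Cheap unlocked test: may be stale by the time the lock is taken.
    if mddev.recovery_running:
        return "EBUSY"
    with mddev.lock:
        # Authoritative test: nothing can change the flag while we hold the lock
        # (in this model, the flag is only written under the lock).
        if mddev.recovery_running:
            return "EBUSY"
        mddev.recovery_running = True
        return "started"


md = Mddev()
first = start_reshape(md)
second = start_reshape(md)
```

The unlocked test only serves as a fast path; correctness depends on the second test.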
  5. 18 Dec, 2015 4 commits
    • Fix remove_and_add_spares removes drive added as spare in slot_store · cb01c549
      Goldwyn Rodrigues committed
      Commit 2910ff17
      introduced a regression which would remove a recently added spare via
      slot_store. Revert the part of the patch which touches slot_store() and
      add the disk directly using pers->hot_add_disk().
      
      Fixes: 2910ff17 ("md: remove_and_add_spares() to activate specific
      rdev")
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md: fix bug due to nested suspend · 0dc10e50
      Mikulas Patocka committed
      The patch c7bfced9, committed to 4.4-rc,
      causes a crash in the LVM test shell/lvchange-raid.sh. The kernel crashes
      with the BUG shown below because we attempt to suspend a device that is
      already suspended. See also
      https://bugzilla.redhat.com/show_bug.cgi?id=1283491
      
      This patch fixes the bug by changing functions mddev_suspend and
      mddev_resume to always nest.
      The number of nested calls to mddev_nested_suspend is kept in the
      variable mddev->suspended.
      [neilb: made mddev_suspend() always nest instead of introducing mddev_nested_suspend]
      
      kernel BUG at drivers/md/md.c:317!
      CPU: 3 PID: 32754 Comm: lvm Not tainted 4.4.0-rc2 #1
      task: 0000000047076040 ti: 0000000047014000 task.ti: 0000000047014000
      
           YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
      PSW: 00001000000001000000000000001111 Not tainted
      r00-03  000000000804000f 00000000102c5280 0000000010c7522c 000000007e3d1810
      r04-07  0000000010c6f000 000000004ef37f20 000000007e3d1dd0 000000007e3d1810
      r08-11  000000007c9f1600 0000000000000000 0000000000000001 ffffffffffffffff
      r12-15  0000000010c1d000 0000000000000041 00000000f98d63c8 00000000f98e49e4
      r16-19  00000000f98e49e4 00000000c138fd06 00000000f98d63c8 0000000000000001
      r20-23  0000000000000002 000000004ef37f00 00000000000000b0 00000000000001d1
      r24-27  00000000424783a0 000000007e3d1dd0 000000007e3d1810 00000000102b2000
      r28-31  0000000000000001 0000000047014840 0000000047014930 0000000000000001
      sr00-03  0000000007040800 0000000000000000 0000000000000000 0000000007040800
      sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      
      IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000102c538c 00000000102c5390
       IIR: 03ffe01f    ISR: 0000000000000000  IOR: 00000000102b2748
       CPU:        3   CR30: 0000000047014000 CR31: 0000000000000000
       ORIG_R28: 00000000000000b0
       IAOQ[0]: mddev_suspend+0x10c/0x160 [md_mod]
       IAOQ[1]: mddev_suspend+0x110/0x160 [md_mod]
       RP(r2): raid1_add_disk+0xd4/0x2c0 [raid1]
      Backtrace:
       [<0000000010c7522c>] raid1_add_disk+0xd4/0x2c0 [raid1]
       [<0000000010c20078>] raid_resume+0x390/0x418 [dm_raid]
       [<00000000105833e8>] dm_table_resume_targets+0xc0/0x188 [dm_mod]
       [<000000001057f784>] dm_resume+0x144/0x1e0 [dm_mod]
       [<0000000010587dd4>] dev_suspend+0x1e4/0x568 [dm_mod]
       [<0000000010589278>] ctl_ioctl+0x1e8/0x428 [dm_mod]
       [<0000000010589518>] dm_compat_ctl_ioctl+0x18/0x68 [dm_mod]
       [<0000000040377b88>] compat_SyS_ioctl+0xd0/0x1558
      
      Fixes: c7bfced9 ("md: suspend i/o during runtime blk_integrity_unregister")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
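The nesting behaviour described in commit 0dc10e50 can be sketched as a simple depth counter. The Python model below is illustrative only: the `Mddev` class and the `events` list are hypothetical, and the real functions do considerably more work than recording a string.

```python
class Mddev:
    def __init__(self):
        self.suspended = 0   # nesting depth, as kept in mddev->suspended
        self.events = []     # records when I/O is actually quiesced or resumed

    def suspend(self):
        self.suspended += 1
        if self.suspended > 1:
            return           # already suspended: only the depth changes
        self.events.append("quiesce")

    def resume(self):
        if self.suspended == 0:
            raise RuntimeError("resume without matching suspend")
        self.suspended -= 1
        if self.suspended:
            return           # an outer suspend is still in effect
        self.events.append("resume")


md = Mddev()
md.suspend()                 # outer caller, e.g. a suspended DM table
md.suspend()                 # inner caller: nests instead of hitting a BUG
md.resume()
after_inner_resume = list(md.events)
md.resume()
after_outer_resume = list(md.events)
```

Only the outermost suspend and resume pair touches the I/O path, so a caller that suspends an already suspended device becomes harmless.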
    • MD: change journal disk role to disk 0 · 9b15603d
      Shaohua Li committed
      Neil pointed out that setting the journal disk role to raid_disks will
      confuse reshape if we eventually support reshape. Switch the role to 0
      (we should be fine as long as the value is >= 0) and skip sysfs file
      creation to avoid an error.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • md/raid10: fix data corruption and crash during resync · cc578588
      Artur Paszkiewicz committed
      The commit c31df25f ("md/raid10: make sync_request_write() call
      bio_copy_data()") replaced manual data copying with bio_copy_data() but
      it doesn't work as intended. The source bio (fbio) is already processed,
      so its bvec_iter has bi_size == 0 and bi_idx == bi_vcnt.  Because of
      this, bio_copy_data() either does not copy anything, or worse, copies
      data from the ->bi_next bio if it is set.  This causes wrong data to be
      written to drives during resync and sometimes lockups/crashes in
      bio_copy_data():
      
      [  517.338478] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [md126_raid10:3319]
      [  517.347324] Modules linked in: raid10 xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul cryptd shpchp pcspkr ipmi_si ipmi_msghandler tpm_crb acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom sd_mod e1000e ax88179_178a usbnet mii ahci ata_generic crc32c_intel libahci ptp pata_acpi libata pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
      [  517.440555] CPU: 0 PID: 3319 Comm: md126_raid10 Not tainted 4.3.0-rc6+ #1
      [  517.448384] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDCRB1.86B.0055.D14.1509221924 09/22/2015
      [  517.459768] task: ffff880153773980 ti: ffff880150df8000 task.ti: ffff880150df8000
      [  517.468529] RIP: 0010:[<ffffffff812e1888>]  [<ffffffff812e1888>] bio_copy_data+0xc8/0x3c0
      [  517.478164] RSP: 0018:ffff880150dfbc98  EFLAGS: 00000246
      [  517.484341] RAX: ffff880169356688 RBX: 0000000000001000 RCX: 0000000000000000
      [  517.492558] RDX: 0000000000000000 RSI: ffffea0001ac2980 RDI: ffffea0000d835c0
      [  517.500773] RBP: ffff880150dfbd08 R08: 0000000000000001 R09: ffff880153773980
      [  517.508987] R10: ffff880169356600 R11: 0000000000001000 R12: 0000000000010000
      [  517.517199] R13: 000000000000e000 R14: 0000000000000000 R15: 0000000000001000
      [  517.525412] FS:  0000000000000000(0000) GS:ffff880174a00000(0000) knlGS:0000000000000000
      [  517.534844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  517.541507] CR2: 00007f8a044d5fed CR3: 0000000169504000 CR4: 00000000001406f0
      [  517.549722] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  517.557929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  517.566144] Stack:
      [  517.568626]  ffff880174a16bc0 ffff880153773980 ffff880169356600 0000000000000000
      [  517.577659]  0000000000000001 0000000000000001 ffff880153773980 ffff88016a61a800
      [  517.586715]  ffff880150dfbcf8 0000000000000001 ffff88016dd209e0 0000000000001000
      [  517.595773] Call Trace:
      [  517.598747]  [<ffffffffa043ef95>] raid10d+0xfc5/0x1690 [raid10]
      [  517.605610]  [<ffffffff816697ae>] ? __schedule+0x29e/0x8e2
      [  517.611987]  [<ffffffff814ff206>] md_thread+0x106/0x140
      [  517.618072]  [<ffffffff810c1d80>] ? wait_woken+0x80/0x80
      [  517.624252]  [<ffffffff814ff100>] ? super_1_load+0x520/0x520
      [  517.630817]  [<ffffffff8109ef89>] kthread+0xc9/0xe0
      [  517.636506]  [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
      [  517.643653]  [<ffffffff8166d99f>] ret_from_fork+0x3f/0x70
      [  517.649929]  [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Reviewed-by: Shaohua Li <shli@kernel.org>
      Cc: stable@vger.kernel.org (v4.2+)
      Fixes: c31df25f ("md/raid10: make sync_request_write() call bio_copy_data()")
      Signed-off-by: NeilBrown <neilb@suse.com>
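The failure mode described in commit cc578588 is that of copying from an iterator that has already been advanced to the end. The Python sketch below models it with a toy `Bio` class; all names are hypothetical, and copying every segment from index 0 is shown only as one way to avoid the problem, not necessarily what the patch does.

```python
class Bio:
    """Toy model of a bio: a list of segments plus an iterator position."""

    def __init__(self, segments):
        self.segments = [bytearray(s) for s in segments]
        self.idx = 0                      # plays the role of bi_idx

    def remaining(self):
        return self.segments[self.idx:]

    def complete(self):
        self.idx = len(self.segments)     # I/O done: iterator is exhausted


def copy_from_iterator(dst, src):
    """Copy starting at each bio's current iterator position."""
    copied = 0
    for d, s in zip(dst.remaining(), src.remaining()):
        d[:] = s
        copied += len(s)
    return copied


def copy_all_segments(dst, src):
    """Copy every segment, ignoring the iterator position."""
    copied = 0
    for d, s in zip(dst.segments, src.segments):
        d[:] = s
        copied += len(s)
    return copied


src = Bio([b"aaaa", b"bbbb"])
src.complete()                            # source bio was already processed
dst = Bio([b"....", b"...."])
copied_via_iterator = copy_from_iterator(dst, src)
dst_after_iterator_copy = [bytes(s) for s in dst.segments]
copied_via_segments = copy_all_segments(dst, src)
```

In the toy model the iterator-based copy silently copies nothing, which mirrors the "does not copy anything" case in the commit message; the worse case of following ->bi_next is not modelled.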
  6. 10 Dec, 2015 3 commits
  7. 03 Dec, 2015 2 commits
  8. 24 Nov, 2015 1 commit
    • dm thin: fix regression in advertised discard limits · 0fcb04d5
      Mike Snitzer committed
      When establishing a thin device's discard limits we cannot rely on the
      underlying thin-pool device's discard capabilities (which are inherited
      from the thin-pool's underlying data device) given that DM thin devices
      must provide discard support even when the thin-pool's underlying data
      device doesn't support discards.
      
      Users were exposed to this thin device discard limits regression if
      their thin-pool's underlying data device does _not_ support discards.
      This regression caused all upper-layers that called the
      blkdev_issue_discard() interface to not be able to issue discards to
      thin devices (because discard_granularity was 0).  This regression
      wasn't caught earlier because the device-mapper-test-suite's extensive
      'thin-provisioning' discard tests are only ever performed against
      thin-pools with data devices that support discards.
      
      The fix is to have thin_io_hints() test the pool's 'discard_enabled' feature
      rather than inferring whether or not a thin device's discard support
      should be enabled by looking at the thin-pool's discard_granularity.
      
      Fixes: 21607670 ("dm thin: disable discard support for thin devices if pool's is disabled")
      Reported-by: Mike Gerber <mike@sprachgewalt.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.1+
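The difference between inferring discard support from the pool's discard_granularity and testing the pool's 'discard_enabled' feature (commit 0fcb04d5) can be shown with a small Python model. The `Pool` class and both functions are hypothetical, and deriving the advertised granularity from the pool block size with 512-byte sectors is an assumption of this sketch.

```python
from dataclasses import dataclass

SECTOR_SIZE = 512  # bytes; assumed for the sketch


@dataclass
class Pool:
    discard_enabled: bool       # pool feature requested by the user
    discard_granularity: int    # inherited from the data device, 0 if unsupported
    sectors_per_block: int


def thin_granularity_inferred(pool: Pool) -> int:
    """Regressed behaviour: infer support from the pool's granularity."""
    if pool.discard_granularity == 0:
        return 0                # thin device ends up advertising no discards
    return pool.sectors_per_block * SECTOR_SIZE


def thin_granularity_from_feature(pool: Pool) -> int:
    """Fixed behaviour: rely on the pool's discard_enabled feature."""
    if not pool.discard_enabled:
        return 0
    return pool.sectors_per_block * SECTOR_SIZE


# Data device without discard support, but discards enabled on the pool.
pool = Pool(discard_enabled=True, discard_granularity=0, sectors_per_block=128)
inferred = thin_granularity_inferred(pool)
from_feature = thin_granularity_from_feature(pool)
```

With a data device that lacks discard support, the inferred variant yields a granularity of 0, which is the condition that stopped upper layers from issuing discards.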
  9. 20 Nov, 2015 1 commit
    • dm crypt: fix a possible hang due to race condition on exit · bcbd94ff
      Mikulas Patocka committed
      A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE),
      __add_wait_queue, spin_unlock_irq and then tests kthread_should_stop().
      It is possible that the processor reorders memory accesses so that
      kthread_should_stop() is executed before __set_current_state().  If such
      reordering happens, there is a possible race on thread termination:
      
      CPU 0:
      calls kthread_should_stop()
      	it tests KTHREAD_SHOULD_STOP bit, returns false
      CPU 1:
      calls kthread_stop(cc->write_thread)
      	sets the KTHREAD_SHOULD_STOP bit
      	calls wake_up_process on the kernel thread, that sets the thread
      	state to TASK_RUNNING
      CPU 0:
      sets __set_current_state(TASK_INTERRUPTIBLE)
      spin_unlock_irq(&cc->write_thread_wait.lock)
      schedule() - and the process is stuck and never terminates, because the
      	state is TASK_INTERRUPTIBLE and wake_up_process on CPU 1 already
      	terminated
      
      Fix this race condition by using a new flag DM_CRYPT_EXIT_THREAD to
      signal that the kernel thread should exit.  The flag is set and tested
      while holding cc->write_thread_wait.lock, so there is no possibility of
      racy access to the flag.
      
      Also, remove the unnecessary set_task_state(current, TASK_RUNNING)
      following the schedule() call.  When the process was woken up, its state
      was already set to TASK_RUNNING.  Other kernel code also doesn't set the
      state to TASK_RUNNING following schedule() (for example,
      do_wait_for_common in completion.c doesn't do it).
      
      Fixes: dc267621 ("dm crypt: offload writes to thread")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org # v4.0+
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
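The fix in commit bcbd94ff sets and tests the exit flag while holding the same lock that protects the wait queue. A user-space analogue in Python is sketched below with `threading.Condition`; the `WriteThread` class and its members are hypothetical, and `Condition.wait()` stands in for the sleep that the kernel thread performs after dropping the lock.

```python
import threading
from collections import deque


class WriteThread:
    def __init__(self):
        self.cond = threading.Condition()  # models write_thread_wait.lock
        self.queue = deque()
        self.exit_flag = False             # models DM_CRYPT_EXIT_THREAD
        self.written = []
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        while True:
            with self.cond:
                # Both the queue and the exit flag are tested under the lock,
                # so a stop request cannot slip in between test and sleep.
                while not self.queue and not self.exit_flag:
                    self.cond.wait()
                if not self.queue:
                    return                 # exit requested and nothing pending
                batch = list(self.queue)
                self.queue.clear()
            self.written.extend(batch)     # "write out" outside the lock

    def submit(self, item):
        with self.cond:
            self.queue.append(item)
            self.cond.notify()

    def stop(self):
        with self.cond:
            self.exit_flag = True          # set under the same lock
            self.cond.notify()
        self.thread.join()


w = WriteThread()
w.submit(1)
w.submit(2)
w.stop()
```

Because the waiter can only observe the flag while holding the lock, there is no window in which the stop request is made after the test but before the sleep.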
  10. 18 Nov, 2015 3 commits
    • dm mpath: fix infinite recursion in ioctl when no paths and !queue_if_no_path · 43e43c9e
      Junichi Nomura committed
      In multipath_prepare_ioctl(),
        - pgpath is a path selected from available paths
        - m->queue_io is true if we cannot send a request immediately to
          paths, either because:
            * there is no available path
            * the path group needs activation (pg_init)
                - pg_init is not started
                - pg_init is still running
        - m->queue_if_no_path is true if the device is configured to queue
          I/O if there are no available paths
      
      If !pgpath && !m->queue_if_no_path, the handler should return -EIO.
      However, in the course of refactoring, the condition check was broken
      and now returns success in that case.  Since bdev points to the dm device
      itself, dm_blk_ioctl() calls __blk_dev_driver_ioctl() for itself and
      recurses until it crashes.
      
      You can reproduce the problem like this:
      
        # dmsetup create mp --table '0 1024 multipath 0 0 0 0'
        # sg_inq /dev/mapper/mp
        <crash>
        [  172.648615] BUG: unable to handle kernel paging request at fffffffc81b10268
        [  172.662843] PGD 19dd067 PUD 0
        [  172.666269] Thread overran stack, or stack corrupted
        [  172.671808] Oops: 0000 [#1] SMP
        ...
      
      Fix the condition check with some clarifications.
      
      Fixes: e56f81e0 ("dm: refactor ioctl handling")
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
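The decision table described in commit 43e43c9e can be written out as a small Python function. This is one reading of the commit message, not the kernel code: the function name and signature are hypothetical, and using -ENOTCONN as the "not ready yet, retry" result is an assumption of the sketch (the message itself only names -EIO).

```python
import errno
from typing import Optional


def prepare_ioctl(pgpath: Optional[str], queue_io: bool,
                  queue_if_no_path: bool) -> int:
    """Return 0 if the ioctl can be forwarded, or a negative errno."""
    if pgpath is not None:
        if not queue_io:
            return 0                  # a path is usable right now
        return -errno.ENOTCONN        # pg_init not started or still running
    # No path is available at all.
    if queue_if_no_path:
        return -errno.ENOTCONN        # configured to queue: caller may retry
    return -errno.EIO                 # must fail, never report success


no_path_no_queue = prepare_ioctl(None, True, False)
no_path_queue = prepare_ioctl(None, True, True)
path_ready = prepare_ioctl("sda", False, False)
path_needs_init = prepare_ioctl("sda", True, False)
```

The important row is the first one: with no path and no queueing, success must never be returned, because the caller would then forward the ioctl to the dm device itself.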
    • dm: do not reuse dm_blk_ioctl block_device input as local variable · 647a20d5
      Mike Snitzer committed
      (Ab)using the @bdev passed to dm_blk_ioctl() opens the potential for
      targets' .prepare_ioctl to fail if they go on to check the bdev for
      !NULL.
      
      Fixes: e56f81e0 ("dm: refactor ioctl handling")
      Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: fix ioctl retry termination with signal · 5bbbfdf6
      Junichi Nomura committed
      dm-mpath retries ioctl when no path is readily available and the device
      is configured to queue I/O in such a case. If you want to stop the retry
      before multipathd decides to turn off queueing mode, you could send a
      signal to make the process exit from the loop.
      
      However, the check for a fatal signal was not carried along when commit
      6c182cd8 ("dm mpath: fix ioctl deadlock when no paths") moved the
      loop from dm-mpath to dm core. As a result, we can't terminate such
      a process in the retry loop.
      
      An easy reproducer of the situation is:
      
        # dmsetup create mp --table '0 1024 multipath 0 0 0 0'
        # dmsetup message mp 0 'queue_if_no_path'
        # sg_inq /dev/mapper/mp
      
      then you should be able to terminate sg_inq by pressing Ctrl+C.
      
      Fixes: 6c182cd8 ("dm mpath: fix ioctl deadlock when no paths")
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
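The retry loop with a fatal-signal check described in commit 5bbbfdf6 can be sketched in Python as follows. The callables `prepare` and `fatal_signal_pending` are hypothetical stand-ins supplied by the caller, and the use of -ENOTCONN as the retry condition is an assumption of this sketch.

```python
import errno
import time


def ioctl_with_retry(prepare, fatal_signal_pending, delay=0.0):
    """Retry while the target reports 'not ready', unless a fatal signal is pending."""
    while True:
        r = prepare()
        if r == -errno.ENOTCONN and not fatal_signal_pending():
            time.sleep(delay)   # back off briefly, then try again
            continue
        return r                # success, hard error, or interrupted retry


attempts = []


def no_path_available():
    attempts.append(1)
    return -errno.ENOTCONN      # device keeps queueing: no path yet


def killed_after_three_attempts():
    return len(attempts) >= 3   # simulates Ctrl+C arriving during the loop


result = ioctl_with_retry(no_path_available, killed_after_three_attempts)
```

Without the signal test the loop in this example would never terminate, which corresponds to the unkillable process described in the commit message.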
  11. 16 Nov, 2015 1 commit
    • dm thin: restore requested 'error_if_no_space' setting on OODS to WRITE transition · 172c2386
      Mike Snitzer committed
      A thin-pool that is in out-of-data-space (OODS) mode may transition back
      to write mode -- without the admin adding more space to the thin-pool --
      if/when blocks are released (either by deleting thin devices or
      discarding provisioned blocks).
      
      But as part of the thin-pool's earlier transition to out-of-data-space
      mode the thin-pool may have set the 'error_if_no_space' flag to true if
      the no_space_timeout expires without more space having been made
      available.  That implementation detail, of changing the pool's
      error_if_no_space setting, needs to be reset back to the default that
      the user specified when the thin-pool's table was loaded.
      
      Otherwise we'll drop the user requested behaviour on the floor when this
      out-of-data-space to write mode transition occurs.
      Reported-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Fixes: 2c43fd26 ("dm thin: fix missing out-of-data-space to write mode transition if blocks are released")
      Cc: stable@vger.kernel.org
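The state handling described in commit 172c2386 can be modelled as a small state machine that remembers the user's requested setting separately from the effective one. The Python sketch below is a simplified model; the `Pool` class, its method names and the mode strings are hypothetical.

```python
class Pool:
    def __init__(self, requested_error_if_no_space: bool):
        # Value the user specified when the thin-pool table was loaded.
        self.requested_error_if_no_space = requested_error_if_no_space
        # Effective value, which the pool may change while out of space.
        self.error_if_no_space = requested_error_if_no_space
        self.mode = "write"

    def run_out_of_data_space(self):
        self.mode = "out_of_data_space"

    def no_space_timeout_expired(self):
        if self.mode == "out_of_data_space":
            self.error_if_no_space = True   # start erroring queued I/O

    def blocks_released(self):
        if self.mode == "out_of_data_space":
            self.mode = "write"
            # Restore what the user asked for instead of keeping the override.
            self.error_if_no_space = self.requested_error_if_no_space


pool = Pool(requested_error_if_no_space=False)
pool.run_out_of_data_space()
pool.no_space_timeout_expired()
overridden = pool.error_if_no_space
pool.blocks_released()
restored = pool.error_if_no_space
```

Keeping the requested value in its own field is what makes the restore possible; overwriting a single field would lose the user's choice after the first timeout.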