1. 18 December 2015, 2 commits
    • MD: change journal disk role to disk 0 · 9b15603d
      Authored by Shaohua Li
      Neil pointed out that setting the journal disk role to raid_disks will
      confuse reshape if we eventually support reshape. Switch the role to 0
      instead (any value >= 0 is fine) and skip sysfs file creation for the
      journal device to avoid errors (a sketch follows this entry).
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
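      A minimal sketch of the change described above, assuming the v4.4-era md
      field and flag names (rdev->raid_disk, the Journal rdev flag); it is an
      illustration of the idea, not the verbatim upstream diff:

        if (test_bit(Journal, &rdev->flags)) {
                /* Use role 0 instead of raid_disks: any value >= 0 is safe
                 * and will not confuse a future reshape. */
                rdev->raid_disk = 0;
                /* ...and skip creating the per-slot "rd%d" sysfs link that
                 * data disks normally get, so no sysfs error is reported. */
        }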
    • md/raid10: fix data corruption and crash during resync · cc578588
      Authored by Artur Paszkiewicz
      Commit c31df25f ("md/raid10: make sync_request_write() call
      bio_copy_data()") replaced manual data copying with bio_copy_data(), but
      it does not work as intended. The source bio (fbio) has already been
      processed, so its bvec_iter has bi_size == 0 and bi_idx == bi_vcnt.
      Because of this, bio_copy_data() either copies nothing or, worse, copies
      data from the ->bi_next bio if it is set.  This causes wrong data to be
      written to the drives during resync, and sometimes lockups/crashes in
      bio_copy_data() (a sketch of the remedy follows this entry):
      
      [  517.338478] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [md126_raid10:3319]
      [  517.347324] Modules linked in: raid10 xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul cryptd shpchp pcspkr ipmi_si ipmi_msghandler tpm_crb acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom sd_mod e1000e ax88179_178a usbnet mii ahci ata_generic crc32c_intel libahci ptp pata_acpi libata pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
      [  517.440555] CPU: 0 PID: 3319 Comm: md126_raid10 Not tainted 4.3.0-rc6+ #1
      [  517.448384] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDCRB1.86B.0055.D14.1509221924 09/22/2015
      [  517.459768] task: ffff880153773980 ti: ffff880150df8000 task.ti: ffff880150df8000
      [  517.468529] RIP: 0010:[<ffffffff812e1888>]  [<ffffffff812e1888>] bio_copy_data+0xc8/0x3c0
      [  517.478164] RSP: 0018:ffff880150dfbc98  EFLAGS: 00000246
      [  517.484341] RAX: ffff880169356688 RBX: 0000000000001000 RCX: 0000000000000000
      [  517.492558] RDX: 0000000000000000 RSI: ffffea0001ac2980 RDI: ffffea0000d835c0
      [  517.500773] RBP: ffff880150dfbd08 R08: 0000000000000001 R09: ffff880153773980
      [  517.508987] R10: ffff880169356600 R11: 0000000000001000 R12: 0000000000010000
      [  517.517199] R13: 000000000000e000 R14: 0000000000000000 R15: 0000000000001000
      [  517.525412] FS:  0000000000000000(0000) GS:ffff880174a00000(0000) knlGS:0000000000000000
      [  517.534844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  517.541507] CR2: 00007f8a044d5fed CR3: 0000000169504000 CR4: 00000000001406f0
      [  517.549722] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  517.557929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  517.566144] Stack:
      [  517.568626]  ffff880174a16bc0 ffff880153773980 ffff880169356600 0000000000000000
      [  517.577659]  0000000000000001 0000000000000001 ffff880153773980 ffff88016a61a800
      [  517.586715]  ffff880150dfbcf8 0000000000000001 ffff88016dd209e0 0000000000001000
      [  517.595773] Call Trace:
      [  517.598747]  [<ffffffffa043ef95>] raid10d+0xfc5/0x1690 [raid10]
      [  517.605610]  [<ffffffff816697ae>] ? __schedule+0x29e/0x8e2
      [  517.611987]  [<ffffffff814ff206>] md_thread+0x106/0x140
      [  517.618072]  [<ffffffff810c1d80>] ? wait_woken+0x80/0x80
      [  517.624252]  [<ffffffff814ff100>] ? super_1_load+0x520/0x520
      [  517.630817]  [<ffffffff8109ef89>] kthread+0xc9/0xe0
      [  517.636506]  [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
      [  517.643653]  [<ffffffff8166d99f>] ret_from_fork+0x3f/0x70
      [  517.649929]  [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Reviewed-by: Shaohua Li <shli@kernel.org>
      Cc: stable@vger.kernel.org (v4.2+)
      Fixes: c31df25f ("md/raid10: make sync_request_write() call bio_copy_data()")
      Signed-off-by: NeilBrown <neilb@suse.com>
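      A hedged sketch of the remedy, assuming the v4.3-era bio layout
      (bio->bi_iter.bi_size / bi_idx) and the raid10 r10_bio->sectors field;
      the point is "rewind the already-completed source bio before copying",
      not the literal upstream patch:

        /* fbio has completed, so its iterator is exhausted (bi_size == 0,
         * bi_idx == bi_vcnt).  Reset it so bio_copy_data() sees the whole
         * payload again instead of copying nothing or reading ->bi_next. */
        fbio->bi_iter.bi_size = r10_bio->sectors << 9;  /* bytes to copy */
        fbio->bi_iter.bi_idx = 0;                       /* restart at the first bvec */

        bio_copy_data(tbio, fbio);                      /* now copies the intended data */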
  2. 10 December 2015, 3 commits
  3. 03 December 2015, 2 commits
  4. 24 November 2015, 1 commit
    • dm thin: fix regression in advertised discard limits · 0fcb04d5
      Authored by Mike Snitzer
      When establishing a thin device's discard limits we cannot rely on the
      underlying thin-pool device's discard capabilities (which are inherited
      from the thin-pool's underlying data device) given that DM thin devices
      must provide discard support even when the thin-pool's underlying data
      device doesn't support discards.
      
      Users were exposed to this thin device discard limits regression if
      their thin-pool's underlying data device does _not_ support discards.
      The regression left all upper layers that called the
      blkdev_issue_discard() interface unable to issue discards to thin
      devices (because discard_granularity was 0).  It wasn't caught earlier
      because the device-mapper-test-suite's extensive 'thin-provisioning'
      discard tests are only ever performed against thin-pools whose data
      devices support discards.
      
      The fix is to have thin_io_hints() test the pool's 'discard_enabled'
      feature rather than inferring whether a thin device's discard support
      should be enabled from the thin-pool's discard_granularity (a sketch
      follows this entry).
      
      Fixes: 21607670 ("dm thin: disable discard support for thin devices if pool's is disabled")
      Reported-by: Mike Gerber <mike@sprachgewalt.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.1+
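      A sketch of the corrected thin_io_hints() logic, with structure and field
      names approximating the v4.4-era dm-thin code (pool->pf.discard_enabled,
      pool->sectors_per_block); treat it as illustrative rather than the exact
      patch:

        static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
        {
                struct thin_c *tc = ti->private;
                struct pool *pool = tc->pool;

                /* Decide from the pool feature the user configured, not from
                 * whatever the data device underneath happens to advertise. */
                if (!pool->pf.discard_enabled)
                        return;

                /* Non-zero granularity lets blkdev_issue_discard() callers
                 * actually send discards to the thin device. */
                limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
                limits->max_discard_sectors = 2048 * 1024 * 16;  /* illustrative cap */
        }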
  5. 20 November 2015, 1 commit
    • dm crypt: fix a possible hang due to race condition on exit · bcbd94ff
      Authored by Mikulas Patocka
      A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE),
      __add_wait_queue, spin_unlock_irq and then tests kthread_should_stop().
      It is possible that the processor reorders memory accesses so that
      kthread_should_stop() is executed before __set_current_state().  If such
      reordering happens, there is a possible race on thread termination:
      
      CPU 0:
      calls kthread_should_stop()
      	it tests KTHREAD_SHOULD_STOP bit, returns false
      CPU 1:
      calls kthread_stop(cc->write_thread)
      	sets the KTHREAD_SHOULD_STOP bit
      	calls wake_up_process on the kernel thread, that sets the thread
      	state to TASK_RUNNING
      CPU 0:
      executes __set_current_state(TASK_INTERRUPTIBLE)
      spin_unlock_irq(&cc->write_thread_wait.lock)
      schedule() - and the process is stuck and never terminates, because the
      	state is TASK_INTERRUPTIBLE and the wake_up_process() on CPU 1 has
      	already completed
      
      Fix this race condition by using a new flag, DM_CRYPT_EXIT_THREAD, to
      signal that the kernel thread should exit.  The flag is set and tested
      while holding cc->write_thread_wait.lock, so there is no possibility of
      a racy access to the flag (a sketch of the resulting wait loop follows
      this entry).
      
      Also, remove the unnecessary set_task_state(current, TASK_RUNNING)
      following the schedule() call.  When the process was woken up, its state
      was already set to TASK_RUNNING.  Other kernel code also doesn't set the
      state to TASK_RUNNING following schedule() (for example,
      do_wait_for_common in completion.c doesn't do it).
      
      Fixes: dc267621 ("dm crypt: offload writes to thread")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org # v4.0+
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
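      A sketch of the fixed wait loop in the dm-crypt write thread, with names
      approximating the patched code (DM_CRYPT_EXIT_THREAD, cc->write_thread_wait,
      cc->write_tree); the essential point is that the exit flag is tested while
      the wait-queue lock is held:

        spin_lock_irq(&cc->write_thread_wait.lock);
        while (RB_EMPTY_ROOT(&cc->write_tree)) {
                if (test_bit(DM_CRYPT_EXIT_THREAD, &cc->flags)) {
                        /* The flag is set under this same lock by the exit
                         * path, so it cannot be missed the way a reordered
                         * kthread_should_stop() check could be. */
                        spin_unlock_irq(&cc->write_thread_wait.lock);
                        return 0;
                }
                __set_current_state(TASK_INTERRUPTIBLE);
                __add_wait_queue(&cc->write_thread_wait, &wait);
                spin_unlock_irq(&cc->write_thread_wait.lock);
                schedule();     /* state is already TASK_RUNNING on return */
                spin_lock_irq(&cc->write_thread_wait.lock);
                __remove_wait_queue(&cc->write_thread_wait, &wait);
        }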
  6. 18 November 2015, 3 commits
    • dm mpath: fix infinite recursion in ioctl when no paths and !queue_if_no_path · 43e43c9e
      Authored by Junichi Nomura
      In multipath_prepare_ioctl(),
        - pgpath is a path selected from available paths
        - m->queue_io is true if we cannot send a request immediately to
          paths, either because:
            * there is no available path
            * the path group needs activation (pg_init)
                - pg_init is not started
                - pg_init is still running
        - m->queue_if_no_path is true if the device is configured to queue
          I/O if there are no available paths
      
      If !pgpath && !m->queue_if_no_path, the handler should return -EIO.
      However, in the course of refactoring, the condition check was broken
      and now returns success in that case.  Since bdev then points to the dm
      device itself, dm_blk_ioctl() calls __blkdev_driver_ioctl() on itself
      and recurses until the stack is exhausted.
      
      You could reproduce the problem like this:
      
        # dmsetup create mp --table '0 1024 multipath 0 0 0 0'
        # sg_inq /dev/mapper/mp
        <crash>
        [  172.648615] BUG: unable to handle kernel paging request at fffffffc81b10268
        [  172.662843] PGD 19dd067 PUD 0
        [  172.666269] Thread overran stack, or stack corrupted
        [  172.671808] Oops: 0000 [#1] SMP
        ...
      
      Fix the condition check and clarify the logic (a sketch of the repaired
      decision follows this entry).
      
      Fixes: e56f81e0 ("dm: refactor ioctl handling")
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
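      A sketch of the repaired decision in multipath_prepare_ioctl(), simplified
      from the v4.4-era code; the return values express the intent (-EIO to fail
      immediately, -ENOTCONN to let dm core retry), not necessarily the literal
      diff:

        if (pgpath) {
                if (!m->queue_io) {
                        /* Usable path: hand back the real underlying device,
                         * never the dm device itself. */
                        *bdev = pgpath->path.dev->bdev;
                        *mode = pgpath->path.dev->mode;
                        return 0;
                }
                return -ENOTCONN;        /* pg_init pending: retry later */
        }

        if (m->queue_if_no_path)
                return -ENOTCONN;        /* queueing configured: retry later */

        return -EIO;                     /* no path, no queueing: fail, do not recurse */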
    • dm: do not reuse dm_blk_ioctl block_device input as local variable · 647a20d5
      Authored by Mike Snitzer
      (Ab)using the @bdev passed to dm_blk_ioctl() as a local variable opens
      the potential for a target's .prepare_ioctl to misbehave if it goes on
      to check whether the bdev is still NULL (a small illustration follows
      this entry).
      
      Fixes: e56f81e0 ("dm: refactor ioctl handling")
      Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
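      A small illustration of the pattern with hypothetical names
      (example_blk_ioctl and do_prepare_and_forward are not the real dm
      functions): the block_device pointer handed in by the block layer is left
      untouched, and a separate, NULL-initialized local is passed down for
      .prepare_ioctl to fill in:

        static int example_blk_ioctl(struct block_device *bdev, fmode_t mode,
                                     unsigned int cmd, unsigned long arg)
        {
                struct block_device *tgt_bdev = NULL;   /* local, not the caller's @bdev */
                fmode_t tgt_mode = 0;

                /* .prepare_ioctl fills tgt_bdev/tgt_mode only on success, so a
                 * target that checks for a still-NULL bdev sees what it expects. */
                return do_prepare_and_forward(bdev, &tgt_bdev, &tgt_mode, cmd, arg);
        }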
    • dm: fix ioctl retry termination with signal · 5bbbfdf6
      Authored by Junichi Nomura
      dm-mpath retries an ioctl when no path is readily available and the
      device is configured to queue I/O in that case. To stop the retry before
      multipathd decides to turn off queueing mode, you can send a signal so
      the process exits the loop.
      
      However, the fatal-signal check was not carried along when commit
      6c182cd8 ("dm mpath: fix ioctl deadlock when no paths") moved the
      loop from dm-mpath to the dm core. As a result, such a process cannot be
      terminated while it is stuck in the retry loop.
      
      An easy reproducer of the situation is:
      
        # dmsetup create mp --table '0 1024 multipath 0 0 0 0'
        # dmsetup message mp 0 'queue_if_no_path'
        # sg_inq /dev/mapper/mp
      
      With this fix applied, sg_inq can then be terminated by pressing Ctrl+C
      (a sketch of the restored check follows this entry).
      
      Fixes: 6c182cd8 ("dm mpath: fix ioctl deadlock when no paths")
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
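      A sketch of the restored check in the dm core retry loop; the helper name
      lookup_target_and_bdev() is hypothetical, but fatal_signal_pending() and
      msleep() are the real kernel primitives involved:

        int r;

        for (;;) {
                r = lookup_target_and_bdev(md, &bdev, &mode);   /* hypothetical helper */
                if (r != -ENOTCONN || fatal_signal_pending(current))
                        break;
                /* No path yet and queue_if_no_path is set: sleep briefly and
                 * retry, unless the user sent a fatal signal (e.g. Ctrl+C). */
                msleep(10);
        }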
  7. 16 November 2015, 1 commit
    • dm thin: restore requested 'error_if_no_space' setting on OODS to WRITE transition · 172c2386
      Authored by Mike Snitzer
      A thin-pool that is in out-of-data-space (OODS) mode may transition back
      to write mode -- without the admin adding more space to the thin-pool --
      if/when blocks are released (either by deleting thin devices or
      discarding provisioned blocks).
      
      But as part of the thin-pool's earlier transition to out-of-data-space
      mode, the thin-pool may have set the 'error_if_no_space' flag to true if
      the no_space_timeout expired without more space having been made
      available.  That override of the pool's error_if_no_space setting needs
      to be undone, restoring the default the user specified when the
      thin-pool's table was loaded.
      
      Otherwise the user-requested behaviour is dropped on the floor when this
      out-of-data-space to write mode transition occurs (a sketch follows this
      entry).
      Reported-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Fixes: 2c43fd26 ("dm thin: fix missing out-of-data-space to write mode transition if blocks are released")
      Cc: stable@vger.kernel.org
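      A sketch of the intent, with field names approximating the v4.4-era
      set_pool_mode() in dm-thin (pt->requested_pf holds the features the table
      line asked for); this illustrates restoring the table-specified setting on
      the OODS-to-write transition, not the literal patch:

        /* Leaving out-of-data-space mode: drop any error_if_no_space override
         * made by no_space_timeout and go back to what the user configured. */
        if (old_mode == PM_OUT_OF_DATA_SPACE && new_mode == PM_WRITE)
                pool->pf.error_if_no_space = pt->requested_pf.error_if_no_space;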
  8. 09 November 2015, 1 commit
  9. 08 November 2015, 1 commit
  10. 07 November 2015, 1 commit
    • mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd · d0164adc
      Authored by Mel Gorman
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min", which can be
      referred to as the "atomic reserve".  __GFP_HIGH users get access to the
      first lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT, leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined to mean that the caller is willing to enter direct reclaim and
      wake kswapd for background reclaim.
      
      This patch then converts a number of call sites:
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible (a sketch of this
        helper follows this entry). This is because checking for __GFP_WAIT as
        was done historically can now trigger false positives. Some exceptions
        like dm-crypt.c exist where the code intent is clearer if
        __GFP_DIRECT_RECLAIM is used instead of the helper due to flag
        manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
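      A sketch of the helper and the flag redefinition described above, as they
      appear in this era's include/linux/gfp.h (the exact flag composition here
      is from memory and should be treated as an assumption):

        /* "May this allocation sleep?" is now a question about the
         * direct-reclaim bit, not about __GFP_WAIT. */
        static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
        {
                return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
        }

        /* __GFP_WAIT is kept for the transition, redefined as "willing to
         * enter direct reclaim and to wake kswapd", so GFP_KERNEL-style
         * users behave as before. */
        #define __GFP_RECLAIM   (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
        #define __GFP_WAIT      __GFP_RECLAIM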
  11. 06 November 2015, 1 commit
  12. 01 November 2015, 23 commits