1. 19 Jul 2014 (6 commits)
  2. 07 Jul 2014 (1 commit)
    • workqueue: zero cpumask of wq_numa_possible_cpumask on init · 5a6024f1
      Committed by Yasuaki Ishimatsu
      When a CPU is hot-added and onlined, a kernel panic occurs with the
      following call trace.
      
        BUG: unable to handle kernel paging request at 0000000000001d08
        IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
        PGD 0
        Oops: 0000 [#1] SMP
        ...
        Call Trace:
         [<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
         [<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
         [<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
         [<ffffffff811926f1>] new_slab+0x91/0x300
         [<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
         [<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
         [<ffffffff810a3c78>] ? load_balance+0x218/0x890
         [<ffffffff8101a679>] ? sched_clock+0x9/0x10
         [<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
         [<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
         [<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
         [<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
         [<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
         [<ffffffff8105d0ec>] do_fork+0xbc/0x360
         [<ffffffff8105d3b6>] kernel_thread+0x26/0x30
         [<ffffffff81086652>] kthreadd+0x2c2/0x300
         [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
         [<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
         [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
      
      In my investigation, I found the root cause to be wq_numa_possible_cpumask.
      All entries of wq_numa_possible_cpumask are allocated by
      alloc_cpumask_var_node() and are then used without being initialized,
      so they contain garbage values.
      
      When a CPU is hot-added and onlined, wq_update_unbound_numa() is called.
      wq_update_unbound_numa() calls alloc_unbound_pwq(), which in turn calls
      get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set as
      follows:
      
        /* if cpumask is contained inside a NUMA node, we belong to that node */
        if (wq_numa_enabled) {
                for_each_node(node) {
                        if (cpumask_subset(pool->attrs->cpumask,
                                           wq_numa_possible_cpumask[node])) {
                                pool->node = node;
                                break;
                        }
                }
        }
      
      But wq_numa_possible_cpumask[node] does not contain a correct cpumask,
      so the wrong node is selected and, as a result, the kernel panics.
      
      With this patch, all entries of wq_numa_possible_cpumask are allocated
      with zalloc_cpumask_var_node() so that they start out zeroed, and the
      panic no longer occurs.
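
      The change itself is essentially a one-line switch in wq_numa_init();
      a minimal sketch of the before/after shape (reconstructed rather than
      the exact hunk; the table name tbl follows the surrounding code):

        /* before: the freshly allocated mask may contain stale bits */
        BUG_ON(!alloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
                                       node_online(node) ? node : NUMA_NO_NODE));

        /* after: zalloc_*() hands back a zero-filled mask, so nodes whose
         * CPUs never come online keep a valid (empty) cpumask */
        BUG_ON(!zalloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
                                        node_online(node) ? node : NUMA_NO_NODE));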
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: bce90380 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
  3. 06 Jul 2014 (1 commit)
  4. 04 Jul 2014 (1 commit)
  5. 02 Jul 2014 (2 commits)
    • cpuset: break kernfs active protection in cpuset_write_resmask() · 76bb5ab8
      Committed by Tejun Heo
      Writing to either "cpuset.cpus" or "cpuset.mems" file flushes
      cpuset_hotplug_work so that cpu or memory hotunplug doesn't end up
      migrating tasks off a cpuset after new resources are added to it.
      
      As cpuset_hotplug_work calls into cgroup core via
      cgroup_transfer_tasks(), this flushing adds a dependency on cgroup
      core locking from cpuset_write_resmask().  This used to be okay because
      cgroup interface files were protected by a different mutex; however,
      8353da1f ("cgroup: remove cgroup_tree_mutex") simplified the
      cgroup core locking and this dependency became a deadlock hazard:
      cgroup file removal performed under the cgroup core lock tries to drain
      an on-going file operation, which is trying to flush cpuset_hotplug_work,
      which in turn is blocked on the same cgroup core lock.
      
      The locking simplification was done because kernfs added a much easier
      way to deal with circular dependencies involving kernfs active
      protection.  Let's use the same strategy in cpuset and break active
      protection in cpuset_write_resmask().  While it isn't the prettiest,
      this is a very rare, likely unique, situation which also goes away on
      the unified hierarchy.
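
      Concretely, the write path pins the cpuset, drops kernfs active
      protection, flushes the hotplug work and only then takes the cpuset
      lock; a rough sketch of that shape in cpuset_write_resmask()
      (reconstructed from the description above, not the exact hunk):

        css_get(&cs->css);
        kernfs_break_active_protection(of->kn);
        flush_work(&cpuset_hotplug_work);

        mutex_lock(&cpuset_mutex);
        /* ... update cpus_allowed / mems_allowed ... */
        mutex_unlock(&cpuset_mutex);

        kernfs_unbreak_active_protection(of->kn);
        css_put(&cs->css);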
      
      The commands to trigger the deadlock warning without the patch and the
      lockdep output follow.
      
       localhost:/ # mount -t cgroup -o cpuset xxx /cpuset
       localhost:/ # mkdir /cpuset/tmp
       localhost:/ # echo 1 > /cpuset/tmp/cpuset.cpus
       localhost:/ # echo 0 > cpuset/tmp/cpuset.mems
       localhost:/ # echo $$ > /cpuset/tmp/tasks
       localhost:/ # echo 0 > /sys/devices/system/cpu/cpu1/online
      
        ======================================================
        [ INFO: possible circular locking dependency detected ]
        3.16.0-rc1-0.1-default+ #7 Not tainted
        -------------------------------------------------------
        kworker/1:0/32649 is trying to acquire lock:
         (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150
      
        but task is already holding lock:
         (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (cpuset_hotplug_work){+.+...}:
        ...
        -> #1 (s_active#175){++++.+}:
        ...
        -> #0 (cgroup_mutex){+.+.+.}:
        ...
      
        other info that might help us debug this:
      
        Chain exists of:
          cgroup_mutex --> s_active#175 --> cpuset_hotplug_work
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(cpuset_hotplug_work);
      				 lock(s_active#175);
      				 lock(cpuset_hotplug_work);
          lock(cgroup_mutex);
      
         *** DEADLOCK ***
      
        2 locks held by kworker/1:0/32649:
         #0:  ("events"){.+.+.+}, at: [<ffffffff81085412>] process_one_work+0x192/0x520
         #1:  (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520
      
        stack backtrace:
        CPU: 1 PID: 32649 Comm: kworker/1:0 Not tainted 3.16.0-rc1-0.1-default+ #7
       ...
        Call Trace:
         [<ffffffff815a5f78>] dump_stack+0x72/0x8a
         [<ffffffff810c263f>] print_circular_bug+0x10f/0x120
         [<ffffffff810c481e>] check_prev_add+0x43e/0x4b0
         [<ffffffff810c4ee6>] validate_chain+0x656/0x7c0
         [<ffffffff810c53d2>] __lock_acquire+0x382/0x660
         [<ffffffff810c57a9>] lock_acquire+0xf9/0x170
         [<ffffffff815aa13f>] mutex_lock_nested+0x6f/0x380
         [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150
         [<ffffffff811129c0>] hotplug_update_tasks_insane+0x110/0x1d0
         [<ffffffff81112bbd>] cpuset_hotplug_update_tasks+0x13d/0x180
         [<ffffffff811148ec>] cpuset_hotplug_workfn+0x18c/0x630
         [<ffffffff810854d4>] process_one_work+0x254/0x520
         [<ffffffff810875dd>] worker_thread+0x13d/0x3d0
         [<ffffffff8108e0c8>] kthread+0xf8/0x100
         [<ffffffff815acaec>] ret_from_fork+0x7c/0xb0
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Li Zefan <lizefan@huawei.com>
      Tested-by: Li Zefan <lizefan@huawei.com>
    • tracing: Remove ftrace_stop/start() from reading the trace file · 099ed151
      Committed by Steven Rostedt (Red Hat)
      Disabling reading and writing to the trace file should not be able to
      disable all function tracing callbacks. There are other users today
      (like kprobes and perf), and reading a trace file should not stop those
      from happening.
      
      Cc: stable@vger.kernel.org # 3.0+
      Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  6. 01 Jul 2014 (4 commits)
  7. 30 Jun 2014 (2 commits)
    • cgroup: fix a race between cgroup_mount() and cgroup_kill_sb() · 3a32bd72
      Committed by Li Zefan
      We've converted cgroup to kernfs so cgroup won't be intertwined with
      vfs objects and locking, but there are dark areas.
      
      Run two instances of this script concurrently:
      
          for ((; ;))
          {
          	mount -t cgroup -o cpuacct xxx /cgroup
          	umount /cgroup
          }
      
      After a while, I saw two mount processes were stuck at retrying, because
      they were waiting for a subsystem to become free, but the root associated
      with this subsystem never got freed.
      
      This can happen if thread A is in the process of killing the superblock
      but hasn't called percpu_ref_kill() yet, and at this time thread B is
      mounting the same cgroup root, finds the root in the root list and
      performs percpu_ref_try_get().
      
      To fix this, we try to increase both the refcnt of the superblock and the
      percpu refcnt of cgroup root.
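
      In cgroup_mount() this takes roughly the following shape: pin the
      superblock via kernfs first, then try to take a live reference on the
      root, and restart the mount if either step fails (a sketch based on
      the description above; the helper and refcount field names are given
      from memory rather than the exact hunk):

        pinned_sb = kernfs_pin_sb(root->kf_root, NULL);
        if (IS_ERR(pinned_sb) ||
            !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
                mutex_unlock(&cgroup_mutex);
                if (!IS_ERR_OR_NULL(pinned_sb))
                        deactivate_super(pinned_sb);
                msleep(10);
                ret = restart_syscall();
                goto out_free;
        }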
      
      v2:
      - we should try to get both the superblock refcnt and the cgroup_root
        refcnt, because a cgroup_root may have no superblock associated with
        it.
      - adjust/add comments.
      
      tj: Updated comments.  Renamed @sb to @pinned_sb.
      
      Cc: <stable@vger.kernel.org> # 3.15
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: fix mount failure in a corner case · 970317aa
      Committed by Li Zefan
        # cat test.sh
        #! /bin/bash
      
        mount -t cgroup -o cpu xxx /cgroup
        umount /cgroup
      
        mount -t cgroup -o cpu,cpuacct xxx /cgroup
        umount /cgroup
        # ./test.sh
        mount: xxx already mounted or /cgroup busy
        mount: according to mtab, xxx is already mounted on /cgroup
      
      It's because the cgroupfs_root of the first mount was under destruction
      asynchronously.
      
      Fix this by delaying and then retrying mount for this case.
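
      A sketch of the retry in cgroup_mount() (reconstructed from the
      changelog notes below, not the exact hunk): if the matching root is
      still dying, back off and restart the mount syscall instead of
      failing.

        /* root is being destroyed asynchronously; wait a bit and retry */
        if (!percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
                mutex_unlock(&cgroup_mutex);
                msleep(10);
                ret = restart_syscall();
                goto out_free;
        }
        /* per the v3 note below, the ref is only an aliveness check */
        percpu_ref_put(&root->cgrp.self.refcnt);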
      
      v3:
      - put the refcnt immediately after getting it. (Tejun)
      
      v2:
      - use percpu_ref_tryget_live() rather than introducing
        percpu_ref_alive(). (Tejun)
      - adjust comment.
      
      tj: Updated the comment a bit.
      
      Cc: <stable@vger.kernel.org> # 3.15
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  8. 25 Jun 2014 (1 commit)
    • cpuset,mempolicy: fix sleeping function called from invalid context · 391acf97
      Committed by Gu Zheng
      When running with kernel 3.15-rc7+, the following bug occurs:
      [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
      [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
      [ 9969.441175] INFO: lockdep is turned off.
      [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G       A      3.15.0-rc7+ #85
      [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
      [ 9969.706052]  ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
      [ 9969.795323]  ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
      [ 9969.884710]  ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
      [ 9969.974071] Call Trace:
      [ 9970.003403]  [<ffffffff8162f523>] dump_stack+0x4d/0x66
      [ 9970.065074]  [<ffffffff8109995a>] __might_sleep+0xfa/0x130
      [ 9970.130743]  [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0
      [ 9970.200638]  [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210
      [ 9970.272610]  [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140
      [ 9970.344584]  [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
      [ 9970.409282]  [<ffffffff811b1385>] __mpol_dup+0xe5/0x150
      [ 9970.471897]  [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
      [ 9970.536585]  [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40
      [ 9970.613763]  [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10
      [ 9970.683660]  [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50
      [ 9970.759795]  [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40
      [ 9970.834885]  [<ffffffff8106a598>] do_fork+0xd8/0x380
      [ 9970.894375]  [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0
      [ 9970.969470]  [<ffffffff8106a8c6>] SyS_clone+0x16/0x20
      [ 9971.030011]  [<ffffffff81642009>] stub_clone+0x69/0x90
      [ 9971.091573]  [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b
      
      The cause is that cpuset_mems_allowed() tries to take
      mutex_lock(&callback_mutex) under rcu_read_lock (which is held in
      __mpol_dup()).  Since the access to the cpuset in cpuset_mems_allowed()
      is already under rcu_read_lock, we can shrink the rcu_read_lock
      protection region in __mpol_dup() so that it covers only the cpuset
      access in current_cpuset_is_being_rebound().  That way we avoid this
      bug.
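
      Concretely, the rcu_read_lock()/rcu_read_unlock() pair moves out of
      __mpol_dup() and into current_cpuset_is_being_rebound(), the only
      place that still needs it; a sketch of the resulting helper
      (reconstructed, not the exact hunk):

        int current_cpuset_is_being_rebound(void)
        {
                int ret;

                /* only the task_cs() dereference needs RCU protection */
                rcu_read_lock();
                ret = task_cs(current) == cpuset_being_rebound;
                rcu_read_unlock();

                return ret;
        }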
      
      This patch is a temporary solution that just addresses the bug
      mentioned above; it cannot fix the long-standing issue of cpuset.mems
      rebinding on fork():
      
      "When the forker's task_struct is duplicated (which includes
       ->mems_allowed) and it races with an update to cpuset_being_rebound
       in update_tasks_nodemask() then the task's mems_allowed doesn't get
       updated. And the child task's mems_allowed can be wrong if the
       cpuset's nodemask changes before the child has been added to the
       cgroup's tasklist."
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable <stable@vger.kernel.org>
  9. 24 Jun 2014 (6 commits)
    • kernel/watchdog.c: print traces for all cpus on lockup detection · ed235875
      Committed by Aaron Tomlin
      A 'softlockup' is defined as a bug that causes the kernel to loop in
      kernel mode for more than a predefined period of time, without giving
      other tasks a chance to run.
      
      Currently, upon detection of this condition by the per-cpu watchdog
      task, debug information (including a stack trace) is sent to the system
      log.
      
      On some occasions, we have observed that the "victim" rather than the
      actual "culprit" (i.e.  the owner/holder of the contended resource) is
      reported to the user.  Often this information has proven to be
      insufficient to assist debugging efforts.
      
      To avoid loss of useful debug information, for architectures which
      support NMI, this patch makes it possible to improve soft lockup
      reporting.  This is accomplished by issuing an NMI to each cpu to obtain
      a stack trace.
      
      If NMI is not supported, we simply fall back to the old method.  A
      sysctl and a boot-time parameter are available to toggle this feature.
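
      The core of the change in watchdog_timer_fn() is a guarded call to
      the NMI backtrace helper once the usual per-cpu report has been
      printed; roughly (a sketch, with the toggle name as I recall it):

        if (softlockup_all_cpu_backtrace) {
                /* ask all other CPUs to dump their stacks via NMI */
                trigger_allbutself_cpu_backtrace();
        }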
      
      [dzickus@redhat.com: add CONFIG_SMP in certain areas]
      [akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations]
      [mq@suse.cz: fix warning]
      Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
      Signed-off-by: Don Zickus <dzickus@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Jan Moskyto Matejka <mq@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, pcp: allow restoring percpu_pagelist_fraction default · 7cd2b0a3
      Committed by David Rientjes
      Oleg reports a division by zero error on zero-length write() to the
      percpu_pagelist_fraction sysctl:
      
          divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
          CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
          Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
          task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
          RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
          RSP: 0018:ffff8800d87a3e78  EFLAGS: 00010246
          RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
          RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
          RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
          R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
          R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
          FS:  00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
          Call Trace:
            proc_sys_call_handler+0xb3/0xc0
            proc_sys_write+0x14/0x20
            vfs_write+0xba/0x1e0
            SyS_write+0x46/0xb0
            tracesys+0xe1/0xe6
      
      However, if the percpu_pagelist_fraction sysctl is set by the user, it
      is also impossible to restore it to the kernel default since the user
      cannot write 0 to the sysctl.
      
      This patch allows the user to write 0 to restore the default behavior.
      For sanity, it still requires a fraction equal to or larger than 8, as
      stated in the documentation.  If a value in the range [1, 7] is
      written, the sysctl returns EINVAL.
      
      This successfully solves the divide by zero issue at the same time.
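
      The resulting handler logic is roughly the following (a sketch of the
      sanity check only; MIN_PERCPU_PAGELIST_FRACTION is the minimum of 8
      mentioned above, and old_percpu_pagelist_fraction is the value saved
      on entry):

        ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
        if (!write || ret < 0)
                goto out;

        /* 0 restores the default batching; anything else must be >= 8 */
        if (percpu_pagelist_fraction &&
            percpu_pagelist_fraction < MIN_PERCPU_PAGELIST_FRACTION) {
                percpu_pagelist_fraction = old_percpu_pagelist_fraction;
                ret = -EINVAL;
                goto out;
        }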
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Oleg Drokin <green@linuxhacker.ru>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/watchdog.c: remove preemption restrictions when restarting lockup detector · bde92cf4
      Committed by Don Zickus
      Peter Wu noticed the following splat on his machine when updating
      /proc/sys/kernel/watchdog_thresh:
      
        BUG: sleeping function called from invalid context at mm/slub.c:965
        in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
        3 locks held by init/1:
         #0:  (sb_writers#3){.+.+.+}, at: [<ffffffff8117b663>] vfs_write+0x143/0x180
         #1:  (watchdog_proc_mutex){+.+.+.}, at: [<ffffffff810e02d3>] proc_dowatchdog+0x33/0x110
         #2:  (cpu_hotplug.lock){.+.+.+}, at: [<ffffffff810589c2>] get_online_cpus+0x32/0x80
        Preemption disabled at:[<ffffffff810e0384>] proc_dowatchdog+0xe4/0x110
      
        CPU: 0 PID: 1 Comm: init Not tainted 3.16.0-rc1-testing #34
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x4e/0x7a
          __might_sleep+0x11d/0x190
          kmem_cache_alloc_trace+0x4e/0x1e0
          perf_event_alloc+0x55/0x440
          perf_event_create_kernel_counter+0x26/0xe0
          watchdog_nmi_enable+0x75/0x140
          update_timers_all_cpus+0x53/0xa0
          proc_dowatchdog+0xe4/0x110
          proc_sys_call_handler+0xb3/0xc0
          proc_sys_write+0x14/0x20
          vfs_write+0xad/0x180
          SyS_write+0x49/0xb0
          system_call_fastpath+0x16/0x1b
        NMI watchdog: disabled (cpu0): hardware events not enabled
      
      What happened is after updating the watchdog_thresh, the lockup detector
      is restarted to utilize the new value.  Part of this process involved
      disabling preemption.  Once preemption was disabled, perf tried to
      allocate a new event (as part of the restart).  This caused the above
      BUG_ON as you can't sleep with preemption disabled.
      
      The preemption restriction seemed aggressive, as we are not doing
      anything on that particular cpu but rather on all the online cpus
      (which are protected by the get_online_cpus lock).  Remove the
      restriction and the BUG_ON goes away.
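
      The fix essentially amounts to dropping the preempt_disable()/
      preempt_enable() pair around the per-cpu restart loop, since
      get_online_cpus() already keeps the online mask stable; a sketch
      (helper names from memory):

        static void update_timers_all_cpus(void)
        {
                int cpu;

                get_online_cpus();      /* pins the set of online CPUs */
                for_each_online_cpu(cpu)
                        update_timers(cpu);  /* may sleep: re-creates the perf event */
                put_online_cpus();
        }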
      Signed-off-by: Don Zickus <dzickus@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Peter Wu <peter@lekensteyn.nl>
      Tested-by: Peter Wu <peter@lekensteyn.nl>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>		[3.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kexec: save PG_head_mask in VMCOREINFO · b3acc56b
      Committed by Petr Tesarik
      To allow filtering of huge pages, makedumpfile must be able to identify
      them in the dump.  This can be done by checking the appropriate page
      flag, so communicate its value to makedumpfile through the VMCOREINFO
      interface.
      
      There's only one small catch.  Depending on how many page flags are
      available on a given architecture, this bit can be called PG_head or
      PG_compound.
      
      I sent a similar patch back in 2012, but Eric Biederman did not like
      using an #ifdef.  So, this time I'm adding a common symbol
      (PG_head_mask) instead.
      
      See https://lkml.org/lkml/2012/11/28/91 for the previous version.
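
      The change boils down to defining one mask name for whichever bit is
      in use and exporting it to the vmcore note; a sketch of the two pieces
      (reconstructed, the conditional variants are omitted):

        /* include/linux/page-flags.h */
        #define PG_head_mask            ((1L << PG_head))

        /* kernel/kexec.c, in crash_save_vmcoreinfo_init() */
        VMCOREINFO_NUMBER(PG_head_mask);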
      Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • CPU hotplug, smp: flush any pending IPI callbacks before CPU offline · 8d056c48
      Committed by Srivatsa S. Bhat
      There is a race between the CPU offline code (within stop-machine) and
      the smp-call-function code, which can lead to getting IPIs on the
      outgoing CPU, *after* it has gone offline.
      
      Specifically, this can happen when using
      smp_call_function_single_async() to send the IPI, since this API allows
      sending asynchronous IPIs from IRQ disabled contexts.  The exact race
      condition is described below.
      
      During CPU offline, in stop-machine, we don't enforce any rule in the
      _DISABLE_IRQ stage, regarding the order in which the outgoing CPU and
      the other CPUs disable their local interrupts.  Due to this, we can
      encounter a situation in which an IPI is sent by one of the other CPUs
      to the outgoing CPU (while it is *still* online), but the outgoing CPU
      ends up noticing it only *after* it has gone offline.
      
                    CPU 1                                         CPU 2
                (Online CPU)                               (CPU going offline)
      
             Enter _PREPARE stage                          Enter _PREPARE stage
      
                                                           Enter _DISABLE_IRQ stage
      
                                                         =
             Got a device interrupt, and                 | Didn't notice the IPI
             the interrupt handler sent an               | since interrupts were
             IPI to CPU 2 using                          | disabled on this CPU.
             smp_call_function_single_async()            |
                                                         =
      
             Enter _DISABLE_IRQ stage
      
             Enter _RUN stage                              Enter _RUN stage
      
                                        =
             Busy loop with interrupts  |                  Invoke take_cpu_down()
             disabled.                  |                  and take CPU 2 offline
                                        =
      
             Enter _EXIT stage                             Enter _EXIT stage
      
             Re-enable interrupts                          Re-enable interrupts
      
                                                           The pending IPI is noted
                                                           immediately, but alas,
                                                           the CPU is offline at
                                                           this point.
      
      This of course, makes the smp-call-function IPI handler code running on
      CPU 2 unhappy and it complains about "receiving an IPI on an offline
      CPU".
      
      One real example of the scenario on CPU 1 is the block layer's
      complete-request call-path:
      
      	__blk_complete_request() [interrupt-handler]
      	    raise_blk_irq()
      	        smp_call_function_single_async()
      
      However, if we look closely, the block layer does check that the target
      CPU is online before firing the IPI.  So in this case, it is actually
      the unfortunate ordering/timing of events in the stop-machine phase that
      leads to receiving IPIs after the target CPU has gone offline.
      
      In reality, getting a late IPI on an offline CPU is not too bad by
      itself (this can happen even due to hardware latencies in IPI
      send-receive).  It is a bug only if the target CPU really went offline
      without executing all the callbacks queued on its list.  (Note that a
      CPU is free to execute its pending smp-call-function callbacks in a
      batch, without waiting for the corresponding IPIs to arrive for each one
      of those callbacks).
      
      So, fixing this issue can be broken up into two parts:
      
      1. Ensure that a CPU goes offline only after executing all the
         callbacks queued on it.
      
      2. Modify the warning condition in the smp-call-function IPI handler
         code such that it warns only if an offline CPU got an IPI *and* that
         CPU had gone offline with callbacks still pending in its queue.
      
      Achieving part 1 is straight-forward - just flush (execute) all the
      queued callbacks on the outgoing CPU in the CPU_DYING stage[1],
      including those callbacks for which the source CPU's IPIs might not have
      been received on the outgoing CPU yet.  Once we do this, an IPI that
      arrives late on the CPU going offline (either due to the race mentioned
      above, or due to hardware latencies) will be completely harmless, since
      the outgoing CPU would have executed all the queued callbacks before
      going offline.
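
      In code terms, the CPU_DYING notifier in kernel/smp.c gains a call
      that drains the per-cpu call_single_queue without warning about the
      (already offline) CPU; a rough sketch, with the helper name as I
      recall it from the patch:

        case CPU_DYING:
        case CPU_DYING_FROZEN:
                /*
                 * The outgoing CPU may still have callbacks queued whose
                 * IPIs never arrived; run them now so nothing is lost,
                 * and skip the "got IPI on offline CPU" warning.
                 */
                flush_smp_call_function_queue(false);
                break;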
      
      Overall, this fix (parts 1 and 2 put together) additionally guarantees
      that we will see a warning only when the *IPI-sender code* is buggy -
      that is, if it queues the callback _after_ the target CPU has gone
      offline.
      
      [1].  The CPU_DYING part needs a little more explanation: by the time we
      execute the CPU_DYING notifier callbacks, the CPU would have already
      been marked offline.  But we want to flush out the pending callbacks at
      this stage, ignoring the fact that the CPU is offline.  So restructure
      the IPI handler code so that we can by-pass the "is-cpu-offline?" check
      in this particular case.  (Of course, the right solution here is to fix
      CPU hotplug to mark the CPU offline _after_ invoking the CPU_DYING
      notifiers, but this requires a lot of audit to ensure that this change
      doesn't break any existing code; hence lets go with the solution
      proposed above until that is done).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Suggested-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Galbraith <mgalbraith@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Sachin Kamat <sachin.kamat@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • workqueue: fix dev_set_uevent_suppress() imbalance · bddbceb6
      Committed by Maxime Bizon
      Uevents are suppressed during attribute registration but never
      restored, so kobject_uevent() does nothing.
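
      The fix is a single call in workqueue_sysfs_register(), right before
      the device is announced (a sketch; wq_dev is the workqueue's struct
      wq_device as in the existing code):

        /* attributes are registered; stop suppressing uevents before ADD */
        dev_set_uevent_suppress(&wq_dev->dev, false);
        kobject_uevent(&wq_dev->dev.kobj, KOBJ_ADD);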
      Signed-off-by: Maxime Bizon <mbizon@freebox.fr>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: 226223ab
  10. 21 Jun 2014 (3 commits)
  11. 18 Jun 2014 (1 commit)
  12. 17 Jun 2014 (2 commits)
  13. 16 Jun 2014 (1 commit)
    • rtmutex: Plug slow unlock race · 27e35715
      Committed by Thomas Gleixner
      When the rtmutex fast path is enabled the slow unlock function can
      create the following situation:
      
      spin_lock(foo->m->wait_lock);
      foo->m->owner = NULL;
      	    			rt_mutex_lock(foo->m); <-- fast path
      				free = atomic_dec_and_test(foo->refcnt);
      				rt_mutex_unlock(foo->m); <-- fast path
      				if (free)
      				   kfree(foo);
      
      spin_unlock(foo->m->wait_lock); <--- Use after free.
      
      Plug the race by changing the slow unlock to the following scheme:
      
           while (!rt_mutex_has_waiters(m)) {
           	    /* Clear the waiters bit in m->owner */
      	    clear_rt_mutex_waiters(m);
            	    owner = rt_mutex_owner(m);
            	    spin_unlock(m->wait_lock);
            	    if (cmpxchg(m->owner, owner, 0) == owner)
            	       return;
            	    spin_lock(m->wait_lock);
           }
      
      So if a new waiter comes in while the owner takes the slow unlock
      path, we have two situations:
      
       unlock(wait_lock);
      					lock(wait_lock);
       cmpxchg(p, owner, 0) == owner
       	    	   			mark_rt_mutex_waiters(lock);
      	 				acquire(lock);
      
      Or:
      
       unlock(wait_lock);
      					lock(wait_lock);
      	 				mark_rt_mutex_waiters(lock);
       cmpxchg(p, owner, 0) != owner
      					enqueue_waiter();
      					unlock(wait_lock);
       lock(wait_lock);
       wakeup_next waiter();
       unlock(wait_lock);
      					lock(wait_lock);
      					acquire(lock);
      
      If the fast path is disabled, then the simple
      
         m->owner = NULL;
         unlock(m->wait_lock);
      
      is sufficient, as all access to m->owner is serialized via
      m->wait_lock.
      
      Also document and clarify the wakeup_next_waiter function as suggested
      by Oleg Nesterov.
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140611183852.937945560@linutronix.de
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  14. 14 Jun 2014 (1 commit)
  15. 11 Jun 2014 (4 commits)
  16. 10 Jun 2014 (3 commits)
  17. 09 Jun 2014 (1 commit)