1. 07 Jul, 2015 1 commit
    • x86/irq: Plug irq vector hotplug race · 5a3f75e3
      Committed by Thomas Gleixner
      Jin debugged a nasty cpu hotplug race which results in leaking an irq
      vector on the newly hotplugged cpu.
      
      cpu N				cpu M
      native_cpu_up                   device_shutdown
        do_boot_cpu			  free_msi_irqs
        start_secondary                   arch_teardown_msi_irqs
          smp_callin                        default_teardown_msi_irqs
             setup_vector_irq                  arch_teardown_msi_irq
              __setup_vector_irq		   native_teardown_msi_irq
                lock(vector_lock)		     destroy_irq 
                install vectors
                unlock(vector_lock)
      					       lock(vector_lock)
      --->                                  	       __clear_irq_vector
                                          	       unlock(vector_lock)
          lock(vector_lock)
          set_cpu_online
          unlock(vector_lock)
      
      This leaves the irq vector(s) which are torn down on CPU M stale in
      the vector array of CPU N, because CPU M does not see CPU N online
      yet. There is a similar issue with concurrent newly setup interrupts.
      
      The alloc/free protection of irq descriptors does not prevent the
      above race, because it merely prevents interrupt descriptors from
      going away or changing concurrently.
      
      Prevent this by moving the call to setup_vector_irq() into the
      vector_lock held region which protects set_cpu_online():
      
      cpu N				cpu M
      native_cpu_up                   device_shutdown
        do_boot_cpu			  free_msi_irqs
        start_secondary                   arch_teardown_msi_irqs
          smp_callin                        default_teardown_msi_irqs
             lock(vector_lock)                arch_teardown_msi_irq
             setup_vector_irq()
              __setup_vector_irq		   native_teardown_msi_irq
                install vectors		     destroy_irq 
             set_cpu_online
             unlock(vector_lock)
      					       lock(vector_lock)
                                        	       __clear_irq_vector
                                          	       unlock(vector_lock)
      
      So cpu M either sees the cpu N online before clearing the vector or
      cpu N installs the vectors after cpu M has cleared it.
      Reported-by: xiao jin <jin.xiao@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
      Link: http://lkml.kernel.org/r/20150705171102.141898931@linutronix.de
  2. 06 Jul, 2015 2 commits
  3. 27 May, 2015 1 commit
  4. 19 May, 2015 2 commits
    • x86/fpu: Rename fpu-internal.h to fpu/internal.h · 78f7f1e5
      Committed by Ingo Molnar
      This unifies all the FPU related header files under a unified, hierarchical
      naming scheme:
      
       - asm/fpu/types.h:      FPU related data types, needed for 'struct task_struct',
                               widely included in almost all kernel code, and hence kept
                               as small as possible.
      
       - asm/fpu/api.h:        FPU related 'public' methods exported to other subsystems.
      
       - asm/fpu/internal.h:   FPU subsystem internal methods
      
       - asm/fpu/xsave.h:      XSAVE support internal methods
      
      (Also standardize the header guard in asm/fpu/internal.h.)
      Reviewed-by: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/fpu: Fix header file dependencies of fpu-internal.h · f89e32e0
      Committed by Ingo Molnar
      Fix a minor header file dependency bug in asm/fpu-internal.h: it
      relies on i387.h but does not include it. All users of fpu-internal.h
      included it explicitly.
      
      Also remove unnecessary includes, to reduce compilation time.
      
      This also makes it easier to use it as a standalone header file
      for FPU internals, such as an upcoming C module in arch/x86/kernel/fpu/.
      Reviewed-by: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 18 May, 2015 1 commit
    • x86/smp/boot: Fix legacy SMP bootup slow-boot bug · 7cb68598
      Committed by Ingo Molnar
      So while testing kernels using tools/kvm/ (kvmtool) I noticed that it
      booted super slow:
      
      [    0.142991] Performance Events: no PMU driver, software events only.
      [    0.149265] x86: Booting SMP configuration:
      [    0.149765] .... node  #0, CPUs:          #1
      [    0.148304] kvm-clock: cpu 1, msr 2:1bfe9041, secondary cpu clock
      [   10.158813] KVM setup async PF for cpu 1
      [   10.159000]    #2
      [   10.159000] kvm-stealtime: cpu 1, msr 211a4d400
      [   10.158829] kvm-clock: cpu 2, msr 2:1bfe9081, secondary cpu clock
      [   20.167805] KVM setup async PF for cpu 2
      [   20.168000]    #3
      [   20.168000] kvm-stealtime: cpu 2, msr 211a8d400
      [   20.167818] kvm-clock: cpu 3, msr 2:1bfe90c1, secondary cpu clock
      [   30.176902] KVM setup async PF for cpu 3
      [   30.177000]    #4
      [   30.177000] kvm-stealtime: cpu 3, msr 211acd400
      
      One CPU booted up per 10 seconds. With 120 CPUs that takes a while.
      
      Bisection pinpointed this commit:
      
        853b160a ("Revert f5d6a52f ("x86/smpboot: Skip delays during SMP initialization similar to Xen")")
      
      But that commit just restores previous behavior, so it cannot cause the
      problem. After some head scratching it turns out that these two commits:
      
        1a744cb3 ("x86/smp/boot: Remove 10ms delay from cpu_up() on modern processors")
        d68921f9 ("x86/smp/boot: Add cmdline "cpu_init_udelay=N" to specify cpu_up() delay")
      
      added the following code to smpboot.c:
      
      -               mdelay(10);
      +               mdelay(init_udelay);
      
      Note the mismatch in the units: the delay is called 'udelay' and is set
      to microseconds - while the function used here is actually 'mdelay',
      which counts in milliseconds ...
      
      So the delay for legacy systems is off by a factor of 1,000, so instead
      of 10 msecs we waited for 10 seconds ...
      
      The reason bisection pointed to 853b160a was that 853b160a removed
      a (broken) boot-time speedup patch, which masked the factor 1,000 bug.
      
      Fix it by using udelay(). This fixes my bootup problems.
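
The unit mismatch can be worked through numerically. The following is a sketch in Python rather than kernel code; mdelay_seconds and udelay_seconds are hypothetical helpers that only convert the argument of mdelay()/udelay() into seconds, and the value of init_udelay is the 10 msec legacy delay expressed in microseconds.

```python
USEC_PER_MSEC = 1000


def mdelay_seconds(n):
    """Duration of mdelay(n) in seconds: n is in milliseconds."""
    return n / 1000.0


def udelay_seconds(n):
    """Duration of udelay(n) in seconds: n is in microseconds."""
    return n / 1_000_000.0


init_udelay = 10 * USEC_PER_MSEC      # 10 msecs, stored in microseconds

buggy = mdelay_seconds(init_udelay)   # microsecond value fed to a msec delay
fixed = udelay_seconds(init_udelay)   # same value with the matching unit

print(buggy, fixed)
```

The buggy call waits 10 seconds per CPU, which matches the 10-second steps in the boot log above, while the fixed call waits the intended 10 msecs.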
      
      Cc: Len Brown <len.brown@intel.com>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan H. Schönherr <jschoenh@amazon.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  6. 13 May, 2015 1 commit
    • Revert f5d6a52f ("x86/smpboot: Skip delays during SMP initialization similar to Xen") · 853b160a
      Committed by Ingo Molnar
      Huang Ying reported x86 boot hangs due to this commit.
      
      Turns out that the change, despite its changelog, does more
      than just change timeouts: it also changes the way we
      assert/deassert INIT via the APIC_DM_INIT IPI: in the x2apic
      case it skips the deassert step.
      
      This is historically fragile code and the patch did not
      improve it, so revert these changes.
      
      This commit:
      
        1a744cb3 ("x86/smp/boot: Remove 10ms delay from cpu_up() on modern processors")
      
      independently removes the worst of the delays (the 10 msec delay).
      
      The remaining delays can be addressed one by one, combined
      with careful testing.
      Reported-by: Huang Ying <ying.huang@intel.com>
      Cc: Anthony Liguori <aliguori@amazon.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Gang Wei <gang.wei@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan H. Schönherr <jschoenh@amazon.de>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Deegan <tim@xen.org>
      Link: http://lkml.kernel.org/r/1430732554-7294-1-git-send-email-jschoenh@amazon.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 12 May, 2015 2 commits
  8. 08 May, 2015 1 commit
  9. 06 May, 2015 1 commit
    • x86/smpboot: Skip delays during SMP initialization similar to Xen · f5d6a52f
      Committed by Jan H. Schönherr
      Remove the per-CPU delays during SMP initialization, which seems
      to be possible on newer architectures with an x2APIC.
      
      Xen has done this since 2011. In fact, this commit is basically a
      combination of the following Xen commits. The first removes the
      delays, the second fixes an issue with the removal:
      
        commit 68fce206f6dba9981e8322269db49692c95ce250
        Author: Tim Deegan <Tim.Deegan@citrix.com>
        Date:   Tue Jul 19 14:13:01 2011 +0100
      
          x86: Remove timeouts from INIT-SIPI-SIPI sequence when using x2apic.
      
          Some of the timeouts are pointless since they're waiting for the ICR
          to ack the IPI delivery and that doesn't happen on x2apic.
          The others should be benign (and are suggested in the SDM) but
          removing them makes AP bringup much more reliable on some test boxes.
      Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com>
      
        commit f12ee533150761df5a7099c83f2a5fa6c07d1187
        Author: Gang Wei <gang.wei@intel.com>
        Date:   Thu Dec 29 10:07:54 2011 +0000
      
          X86: Add a delay between INIT & SIPIs for tboot AP bring-up in X2APIC case
      
          Without this delay, Xen could not bring APs up while working with
          TXT/tboot, because tboot needs some time in APs to handle INIT before
          becoming ready for receiving SIPIs (this delay was removed as part of
          c/s 23724 by Tim Deegan).
      Signed-off-by: Gang Wei <gang.wei@intel.com>
      Acked-by: Keir Fraser <keir@xen.org>
      Acked-by: Tim Deegan <tim@xen.org>
      Committed-by: Tim Deegan <tim@xen.org>
      Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
      Cc: Anthony Liguori <aliguori@amazon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Gang Wei <gang.wei@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Deegan <tim@xen.org>
      Link: http://lkml.kernel.org/r/1430732554-7294-1-git-send-email-jschoenh@amazon.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 02 Apr, 2015 1 commit
  11. 01 Apr, 2015 1 commit
  12. 25 Mar, 2015 1 commit
    • x86/asm/entry: Get rid of KERNEL_STACK_OFFSET · ef593260
      Committed by Denys Vlasenko
      PER_CPU_VAR(kernel_stack) was set up in a way where it points
      five stack slots below the top of stack.
      
      Presumably, it was done to avoid one "sub $5*8,%rsp"
      in syscall/sysenter code paths, where iret frame needs to be
      created by hand.
      
      Ironically, none of them benefits from this optimization,
      since all of them need to allocate additional data on stack
      (struct pt_regs), so they still have to perform subtraction.
      
      This patch eliminates KERNEL_STACK_OFFSET.
      
      PER_CPU_VAR(kernel_stack) now points directly to top of stack.
      pt_regs allocations are adjusted to allocate iret frame as well.
      Hopefully we can merge it later with 32-bit specific
      PER_CPU_VAR(cpu_current_top_of_stack) variable...
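
The size of the removed offset follows directly from the iret frame layout. A minimal arithmetic sketch in Python, assuming 8-byte stack slots on x86-64; the top_of_stack address is made up for the illustration.

```python
SLOT = 8                                            # bytes per x86-64 stack slot
IRET_FRAME = ["ss", "rsp", "rflags", "cs", "rip"]   # the five iret frame slots

top_of_stack = 0xffff880000004000   # made-up address, illustration only

# Before the patch: kernel_stack pointed five slots below the top of stack.
old_kernel_stack = top_of_stack - len(IRET_FRAME) * SLOT
# After the patch: kernel_stack points directly to the top of stack.
new_kernel_stack = top_of_stack

print(top_of_stack - old_kernel_stack)   # size of the old offset in bytes
```

The difference is the 5*8 bytes mentioned in the "sub $5*8,%rsp" remark above.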
      
      Net result in generated code is that constants in several insns
      are changed.
      
      This change is necessary for changing struct pt_regs creation
      in SYSCALL64 code path from MOV to PUSH instructions.
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Will Drewry <wad@chromium.org>
      Link: http://lkml.kernel.org/r/1426785469-15125-2-git-send-email-dvlasenk@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  13. 12 Mar, 2015 1 commit
    • x86: Use common outgoing-CPU-notification code · 2a442c9c
      Committed by Paul E. McKenney
      This commit replaces the open-coded CPU-offline notification with new
      common code.  Among other things, this change avoids calling scheduler
      code using RCU from an offline CPU that RCU is ignoring.  It also allows
      Xen to notice at online time that the CPU did not go offline correctly.
      Note that Xen has the surviving CPU carry out some cleanup operations,
      so if the surviving CPU times out, these cleanup operations might have
      been carried out while the outgoing CPU was still running.  It might
      therefore be unwise to bring this CPU back online, and this commit
      avoids doing so.
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: <x86@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: <xen-devel@lists.xenproject.org>
  14. 07 Mar, 2015 1 commit
  15. 22 Jan, 2015 6 commits
  16. 16 Dec, 2014 1 commit
  17. 10 Nov, 2014 1 commit
  18. 05 Nov, 2014 1 commit
  19. 19 Oct, 2014 1 commit
  20. 03 Oct, 2014 1 commit
  21. 24 Sep, 2014 3 commits
    • sched: Fix unreleased llc_shared_mask bit during CPU hotplug · 03bd4e1f
      Committed by Wanpeng Li
      The following bug can be triggered by hot adding and removing a large number of
      xen domain0's vcpus repeatedly:
      
      	BUG: unable to handle kernel NULL pointer dereference at 0000000000000004 IP: [..] find_busiest_group
      	PGD 5a9d5067 PUD 13067 PMD 0
      	Oops: 0000 [#3] SMP
      	[...]
      	Call Trace:
      	load_balance
      	? _raw_spin_unlock_irqrestore
      	idle_balance
      	__schedule
      	schedule
      	schedule_timeout
      	? lock_timer_base
      	schedule_timeout_uninterruptible
      	msleep
      	lock_device_hotplug_sysfs
      	online_store
      	dev_attr_store
      	sysfs_write_file
      	vfs_write
      	SyS_write
      	system_call_fastpath
      
      Last level cache shared mask is built during CPU up and the
      build_sched_domain() routine takes advantage of it to setup
      the sched domain CPU topology.
      
      However, llc_shared_mask is not released during CPU disable,
      which leads to an invalid sched domain CPU topology.
      
      This patch fixes it by releasing the llc_shared_mask correctly
      during CPU disable.
      
      Yasuaki also reported that this can happen on real hardware:
      
        https://lkml.org/lkml/2014/7/22/1018
      
      His case is here:
      
      	==
      	Here is an example on my system.
      	My system has 4 sockets and each socket has 15 cores and HT is
      	enabled. In this case, each core of sockets is numbered as
      	follows:
      
      		 | CPU#
      	Socket#0 | 0-14 , 60-74
      	Socket#1 | 15-29, 75-89
      	Socket#2 | 30-44, 90-104
      	Socket#3 | 45-59, 105-119
      
      	Then llc_shared_mask of CPU#30 has 0x3fff80000001fffc0000000.
      
      	It means that last level cache of Socket#2 is shared with
      	CPU#30-44 and 90-104.
      
      	When hot-removing socket#2 and #3, each core of sockets is
      	numbered as follows:
      
      		 | CPU#
      	Socket#0 | 0-14 , 60-74
      	Socket#1 | 15-29, 75-89
      
      	But llc_shared_mask is not cleared. So llc_shared_mask of CPU#30
      	remains having 0x3fff80000001fffc0000000.
      
      	After that, when hot-adding socket#2 and #3, each core of
      	sockets is numbered as follows:
      
      		 | CPU#
      	Socket#0 | 0-14 , 60-74
      	Socket#1 | 15-29, 75-89
      	Socket#2 | 30-59
      	Socket#3 | 90-119
      
      	Then llc_shared_mask of CPU#30 becomes
      	0x3fff8000fffffffc0000000. It means that last level cache of
      	Socket#2 is shared with CPU#30-59 and 90-104. So the mask has
      	the wrong value.
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      Tested-by: Linn Crosetto <linn@hp.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Toshi Kani <toshi.kani@hp.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1411547885-48165-1-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/smpboot: Speed up suspend/resume by avoiding 100ms sleep for CPU offline during S3 · 2ed53c0d
      Committed by Lan Tianyu
      With certain kernel configurations, CPU offline consumes more than
      100ms during S3.
      
      It's a timing related issue: native_cpu_die() would occasionally fall
      into a 100ms sleep when the CPU idle loop thread marked the CPU state
      to DEAD too slowly.
      
      What native_cpu_die() does is that it polls the CPU state and waits
      for 100ms if CPU state hasn't been marked to DEAD. The 100ms sleep
      doesn't make sense and is purely historic.
      
      To avoid such long sleeping, this patch adds a 'struct completion'
      to each CPU, waits for the completion in native_cpu_die() and wakes
      up the completion when the CPU state is marked to DEAD.
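
The idea of waiting on a completion instead of polling with a fixed sleep can be sketched in user space. This is a Python analogy, not the kernel implementation: threading.Event plays the role of the per-CPU 'struct completion', and dying_cpu and wait_for_cpu_die are hypothetical names.

```python
import threading
import time

cpu_dead = threading.Event()   # plays the role of the per-CPU completion


def dying_cpu():
    """Idle-loop side: mark the CPU DEAD a little late, then signal."""
    time.sleep(0.005)
    cpu_dead.set()             # roughly what complete() does


def wait_for_cpu_die(timeout=1.0):
    """Waiter side: block until signalled instead of sleeping a fixed 100ms."""
    start = time.monotonic()
    ok = cpu_dead.wait(timeout)
    return ok, time.monotonic() - start


threading.Thread(target=dying_cpu).start()
died, elapsed = wait_for_cpu_die()
print(died, f"{elapsed * 1000:.1f} ms")
```

The waiter returns as soon as the state is marked DEAD, so a slightly slow idle thread costs only its own delay rather than a full polling interval.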
      
      Tested on an Intel Xeon server with 48 cores, Ivybridge and on
      Haswell laptops. The CPU offlining cost on these machines is
      reduced from more than 100ms to less than 5ms. The system
      suspend time is reduced by 2.3s on the servers.
      
      Borislav and Prarit also helped to test the patch on an AMD
      machine and a few systems of various sizes and configurations
      (multi-socket, single-socket, no hyper threading, etc.). No
      issues were seen.
      Tested-by: Prarit Bhargava <prarit@redhat.com>
      Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: srostedt@redhat.com
      Cc: toshi.kani@hp.com
      Cc: imammedo@redhat.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1409039025-32310-1-git-send-email-tianyu.lan@intel.com
      [ Improved a few minor details in the code, cleaned up the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86, sched: Add new topology for multi-NUMA-node CPUs · cebf15eb
      Committed by Dave Hansen
      I'm getting the spew below when booting with Haswell (Xeon
      E5-2699 v3) CPUs and the "Cluster-on-Die" (CoD) feature enabled
      in the BIOS.  It seems similar to the issue that some folks from
      AMD ran into on their systems and addressed in this commit:
      
        161270fc ("x86/smp: Fix topology checks on AMD MCM CPUs")
      
      Both these Intel and AMD systems break an assumption which is
      being enforced by topology_sane(): a socket may not contain more
      than one NUMA node.
      
      AMD special-cased their system by looking for a cpuid flag.  The
      Intel mode is dependent on BIOS options and I do not know of a
      way in which it is enumerated other than the tables being parsed
      during the CPU bringup process.  In other words, we have to trust
      the ACPI tables <shudder>.
      
      This detects the situation where a NUMA node occurs at a place in
      the middle of the "CPU" sched domains.  It replaces the default
      topology with one that relies on the NUMA information from the
      firmware (SRAT table) for all levels of sched domains above the
      hyperthreads.
      
      This also fixes a sysfs bug.  We used to freak out when we saw
      the "mc" group cross a node boundary, so we stopped building the
      MC group.  MC gets exported as the 'core_siblings_list' in
      /sys/devices/system/cpu/cpu*/topology/ and this caused CPUs with
      the same 'physical_package_id' to not be listed together in
      'core_siblings_list'.  This violates a statement from
      Documentation/ABI/testing/sysfs-devices-system-cpu:
      
      	core_siblings: internal kernel map of cpu#'s hardware threads
      	within the same physical_package_id.
      
      	core_siblings_list: human-readable list of the logical CPU
      	numbers within the same physical_package_id as cpu#.
      
      The sysfs effects here cause an issue with the hwloc tool where
      it gets confused and thinks there are more sockets than are
      physically present.
      
      Before this patch, there are two packages:
      
      # cd /sys/devices/system/cpu/
      # cat cpu*/topology/physical_package_id | sort | uniq -c
           18 0
           18 1
      
      But 4 _sets_ of core siblings:
      
      # cat cpu*/topology/core_siblings_list | sort | uniq -c
            9 0-8
            9 18-26
            9 27-35
            9 9-17
      
      After this set, there are only 2 sets of core siblings, which
      is what we expect for a 2-socket system.
      
      # cat cpu*/topology/physical_package_id | sort | uniq -c
           18 0
           18 1
      # cat cpu*/topology/core_siblings_list | sort | uniq -c
           18 0-17
           18 18-35
      
      Example spew:
      ...
      	NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
      	 #2  #3  #4  #5  #6  #7  #8
      	.... node  #1, CPUs:    #9
      	------------[ cut here ]------------
      	WARNING: CPU: 9 PID: 0 at /home/ak/hle/linux-hle-2.6/arch/x86/kernel/smpboot.c:306 topology_sane.isra.2+0x74/0x90()
      	sched: CPU #9's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
      	Modules linked in:
      	CPU: 9 PID: 0 Comm: swapper/9 Not tainted 3.17.0-rc1-00293-g8e01c4d-dirty #631
      	Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0036.R05.1407140519 07/14/2014
      	0000000000000009 ffff88046ddabe00 ffffffff8172e485 ffff88046ddabe48
      	ffff88046ddabe38 ffffffff8109691d 000000000000b001 0000000000000009
      	ffff88086fc12580 000000000000b020 0000000000000009 ffff88046ddabe98
      	Call Trace:
      	[<ffffffff8172e485>] dump_stack+0x45/0x56
      	[<ffffffff8109691d>] warn_slowpath_common+0x7d/0xa0
      	[<ffffffff8109698c>] warn_slowpath_fmt+0x4c/0x50
      	[<ffffffff81074f94>] topology_sane.isra.2+0x74/0x90
      	[<ffffffff8107530e>] set_cpu_sibling_map+0x31e/0x4f0
      	[<ffffffff8107568d>] start_secondary+0x1ad/0x240
      	---[ end trace 3fe5f587a9fcde61 ]---
      	#10 #11 #12 #13 #14 #15 #16 #17
      	.... node  #2, CPUs:   #18 #19 #20 #21 #22 #23 #24 #25 #26
      	.... node  #3, CPUs:   #27 #28 #29 #30 #31 #32 #33 #34 #35
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      [ Added LLC domain and s/match_mc/match_die/ ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: brice.goglin@gmail.com
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Link: http://lkml.kernel.org/r/20140918193334.C065EBCE@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  22. 16 Sep, 2014 1 commit
  23. 31 Jul, 2014 2 commits
  24. 09 Jun, 2014 1 commit
  25. 05 Jun, 2014 3 commits
    • x86/smpboot: Initialize secondary CPU only if master CPU will wait for it · 3e1a878b
      Committed by Igor Mammedov
      A hang is observed on virtual machines during CPU hotplug,
      especially in big guests with many CPUs. (It is reproducible
      more often if the host is over-committed.)
      
      It happens because the master CPU gives up waiting on the
      secondary CPU and allows it to run wild. As a result the
      AP locks up or crashes the system. For example,
      as described here:
      
         https://lkml.org/lkml/2014/3/6/257
      
      If the master CPU has sent the STARTUP IPI successfully,
      and the AP has signalled to the master CPU that it is ready
      to start initialization, make the master CPU wait
      indefinitely until the AP is onlined.
      To ensure that the AP won't ever run wild, make it
      wait at early startup until the master CPU confirms its
      intention to wait for the AP. If the AP doesn't respond in 10
      seconds, the master CPU will time out and cancel
      AP onlining.
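
The handshake described above can be modelled with three flags. This is a user-space sketch in Python, not the kernel code: the three threading.Event objects and the functions ap and master are hypothetical stand-ins for the state the two CPUs exchange.

```python
import threading

ap_ready = threading.Event()          # AP -> master: "ready to initialize"
master_will_wait = threading.Event()  # master -> AP: "I will wait for you"
ap_online = threading.Event()         # AP -> master: "I am online"


def ap():
    """Secondary CPU: never run ahead of the master's confirmation."""
    ap_ready.set()
    master_will_wait.wait()
    ap_online.set()


def master(timeout):
    """Master CPU: bounded wait for the first response, then wait for good."""
    if not ap_ready.wait(timeout):
        return False           # AP did not respond in time: cancel onlining
    master_will_wait.set()
    ap_online.wait()           # no timeout from here on
    return True


threading.Thread(target=ap, daemon=True).start()
result = master(timeout=10.0)
print(result)
```

Because the AP blocks until the master confirms, a master that has already timed out never leaves a half-initialized AP running on its own.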
      Signed-off-by: Igor Mammedov <imammedo@redhat.com>
      Acked-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1401975765-22328-4-git-send-email-imammedo@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/smpboot: Log error on secondary CPU wakeup failure at ERR level · feef1e8e
      Committed by Igor Mammedov
      If the system is running without debug level logging,
      it will not log an error if do_boot_cpu() failed to
      wake up the AP. This may lead to silent AP bringup
      failures at boot time.
      Change the message level to KERN_ERR to make the error
      visible to the user, as is done on other architectures.
      Signed-off-by: Igor Mammedov <imammedo@redhat.com>
      Acked-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1401975765-22328-3-git-send-email-imammedo@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86: Fix list/memory corruption on CPU hotplug · 89f898c1
      Committed by Igor Mammedov
      Currently, if AP wakeup fails, the master CPU marks the AP as not
      present in do_boot_cpu() by calling set_cpu_present(cpu, false).
      That leads to the following list corruption on the next physical CPU
      hotplug:
      
      [  418.107336] WARNING: CPU: 1 PID: 45 at lib/list_debug.c:33 __list_add+0xbe/0xd0()
      [  418.115268] list_add corruption. prev->next should be next (ffff88003dc57600), but was ffff88003e20c3a0. (prev=ffff88003e20c3a0).
      [  418.123693] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT ipt_REJECT cfg80211 xt_conntrack rfkill ee
      [  418.138979] CPU: 1 PID: 45 Comm: kworker/u10:1 Not tainted 3.14.0-rc6+ #387
      [  418.149989] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      [  418.165750] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
      [  418.166433]  0000000000000021 ffff880038ca7988 ffffffff8159b22d 0000000000000021
      [  418.176460]  ffff880038ca79d8 ffff880038ca79c8 ffffffff8106942c ffff880038ca79e8
      [  418.177453]  ffff88003e20c3a0 ffff88003dc57600 ffff88003e20c3a0 00000000ffffffea
      [  418.178445] Call Trace:
      [  418.185811]  [<ffffffff8159b22d>] dump_stack+0x49/0x5c
      [  418.186440]  [<ffffffff8106942c>] warn_slowpath_common+0x8c/0xc0
      [  418.187192]  [<ffffffff81069516>] warn_slowpath_fmt+0x46/0x50
      [  418.191231]  [<ffffffff8136ef51>] ? acpi_ns_get_node+0xb7/0xc7
      [  418.193889]  [<ffffffff812f796e>] __list_add+0xbe/0xd0
      [  418.196649]  [<ffffffff812e2aa9>] kobject_add_internal+0x79/0x200
      [  418.208610]  [<ffffffff812e2e18>] kobject_add_varg+0x38/0x60
      [  418.213831]  [<ffffffff812e2ef4>] kobject_add+0x44/0x70
      [  418.229961]  [<ffffffff813e2c60>] device_add+0xd0/0x550
      [  418.234991]  [<ffffffff813f0e95>] ? pm_runtime_init+0xe5/0xf0
      [  418.250226]  [<ffffffff813e32be>] device_register+0x1e/0x30
      [  418.255296]  [<ffffffff813e82a3>] register_cpu+0xe3/0x130
      [  418.266539]  [<ffffffff81592be5>] arch_register_cpu+0x65/0x150
      [  418.285845]  [<ffffffff81355c0d>] acpi_processor_hotadd_init+0x5a/0x9b
      ...
      This is caused by the fact that generic_processor_info() allocates
      the logical CPU id by calling:
      
       cpu = cpumask_next_zero(-1, cpu_present_mask);
      
      which returns the id of the CPU that previously failed to wake up, since its
      bit was cleared by do_boot_cpu(). As a result, register_cpu()
      tries to register another CPU with the same id as the already
      present CPU that failed to be onlined.
      
      Taking into account that the AP will not do anything if the master CPU
      failed to wake it up, there is no reason to mark that AP as not
      present and break subsequent CPU hotplug attempts. As a side effect of
      not marking the AP as not present, the user is allowed to online
      it again later.
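
The id reuse can be reproduced with plain bit arithmetic. A sketch in Python, assuming a tiny 8-CPU mask; next_zero is a hypothetical stand-in for the first-clear-bit search that cpumask_next_zero(-1, mask) performs, and the mask values are invented for the example.

```python
def next_zero(mask, nbits):
    """Index of the first clear bit in mask, or nbits if none is clear."""
    for bit in range(nbits):
        if not mask & (1 << bit):
            return bit
    return nbits


NR_CPUS = 8          # small made-up limit for the example
present = 0b0111     # CPUs 0-2 are present; CPU 2 failed to wake up

# Old behaviour: do_boot_cpu() clears the present bit of the failed CPU,
# so the next hotplug is handed id 2 again while CPU 2 is still registered.
cleared = present & ~(1 << 2)
reused_id = next_zero(cleared, NR_CPUS)

# New behaviour: the bit stays set and the next hotplug gets a fresh id.
fresh_id = next_zero(present, NR_CPUS)

print(reused_id, fresh_id)
```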
      
      Also fix memory corruption in acpi_unmap_lsapic():
      
      if during CPU hotplug the master CPU failed to wake up the AP,
      it set the percpu x86_cpu_to_apicid to BAD_APICID=0xFFFF for the AP.
      
      However, a following attempt to unplug that CPU leads to an
      out-of-bounds write access to __apicid_to_node[], which is
      32768 items long on an x86_64 kernel.
      
      So, together with the above fix of cpu_present_mask, make sure that a present
      CPU has a valid APIC ID by not setting x86_cpu_to_apicid
      to BAD_APICID in do_boot_cpu() on failure, and allow
      acpi_processor_remove()->acpi_unmap_lsapic() to cleanly remove the CPU.
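
Why BAD_APICID is an out-of-bounds index follows from the two numbers given above. A small Python check; APICID_TABLE_LEN and index_in_bounds are names invented for this sketch, and only the values 0xFFFF and 32768 come from the commit text.

```python
APICID_TABLE_LEN = 32768   # items in __apicid_to_node[] on x86_64, per the text
BAD_APICID = 0xFFFF


def index_in_bounds(apicid, length=APICID_TABLE_LEN):
    """True if apicid is a valid index into a table of the given length."""
    return 0 <= apicid < length


print(BAD_APICID, index_in_bounds(BAD_APICID))
```

0xFFFF is 65535, which is past the last valid index 32767, so a write indexed by it lands outside the table.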
      Signed-off-by: Igor Mammedov <imammedo@redhat.com>
      Acked-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1401975765-22328-2-git-send-email-imammedo@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  26. 05 May, 2014 1 commit
  27. 01 May, 2014 1 commit
    • x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack · 3891a04a
      Committed by H. Peter Anvin
      The IRET instruction, when returning to a 16-bit segment, only
      restores the bottom 16 bits of the user space stack pointer.  This
      causes some 16-bit software to break, but it also leaks kernel state
      to user space.  We have a software workaround for that ("espfix") for
      the 32-bit kernel, but it relies on a nonzero stack segment base which
      is not available in 64-bit mode.
      
      In checkin:
      
          b3b42ac2 x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
      
      we "solved" this by forbidding 16-bit segments on 64-bit kernels, with
      the logic that 16-bit support is crippled on 64-bit kernels anyway (no
      V86 support), but it turns out that people are doing stuff like
      running old Win16 binaries under Wine and expect it to work.
      
      This works around this by creating percpu "ministacks", each of which
      is mapped 2^16 times 64K apart.  When we detect that the return SS is
      on the LDT, we copy the IRET frame to the ministack and use the
      relevant alias to return to userspace.  The ministacks are mapped
      readonly, so if IRET faults we promote #GP to #DF which is an IST
      vector and thus has its own stack; we then do the fixup in the #DF
      handler.
      
      (Making #GP an IST exception would make the msr_safe functions unsafe
      in NMI/MC context, and quite possibly have other effects.)
      
      Special thanks to:
      
      - Andy Lutomirski, for the suggestion of using very small stack slots
        and copy (as opposed to map) the IRET frame there, and for the
        suggestion to mark them readonly and let the fault promote to #DF.
      - Konrad Wilk for paravirt fixup and testing.
      - Borislav Petkov for testing help and useful comments.
      Reported-by: Brian Gerst <brgerst@gmail.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andrew Lutomriski <amluto@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Dirk Hohndel <dirk@hohndel.org>
      Cc: Arjan van de Ven <arjan.van.de.ven@intel.com>
      Cc: comex <comexk@gmail.com>
      Cc: Alexander van Heukelum <heukelum@fastmail.fm>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: <stable@vger.kernel.org> # consider after upstream merge