1. 25 Sep, 2013 8 commits
  2. 05 Sep, 2013 2 commits
  3. 23 Aug, 2013 2 commits
  4. 20 Aug, 2013 3 commits
    • C
      xen/smp: initialize IPI vectors before marking CPU online · fc78d343
      Chuck Anderson committed
      An older PVHVM guest (v3.0 based) crashed during vCPU hot-plug with:
      
      	kernel BUG at drivers/xen/events.c:1328!
      
      RCU has detected that a CPU has not entered a quiescent state within the
      grace period.  It needs to send the CPU a reschedule IPI if it is not
      offline.  rcu_implicit_offline_qs() does this check:
      
      	/*
      	 * If the CPU is offline, it is in a quiescent state.  We can
      	 * trust its state not to change because interrupts are disabled.
      	 */
      	if (cpu_is_offline(rdp->cpu)) {
      		rdp->offline_fqs++;
      		return 1;
      	}
      
      	Else the CPU is online.  Send it a reschedule IPI.
      
      The CPU is in the middle of being hot-plugged and has been marked online
      (!cpu_is_offline()).  See start_secondary():
      
      	set_cpu_online(smp_processor_id(), true);
      	...
      	per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;
      
      start_secondary() then waits for the CPU bringing up the hot-plugged CPU to
      mark it as active:
      
      	/*
      	 * Wait until the cpu which brought this one up marked it
      	 * online before enabling interrupts. If we don't do that then
      	 * we can end up waking up the softirq thread before this cpu
      	 * reached the active state, which makes the scheduler unhappy
      	 * and schedule the softirq thread on the wrong cpu. This is
      	 * only observable with forced threaded interrupts, but in
      	 * theory it could also happen w/o them. It's just way harder
      	 * to achieve.
      	 */
      	while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask))
      		cpu_relax();
      
      	/* enable local interrupts */
      	local_irq_enable();
      
      The CPU being hot-plugged will be marked active after it has been fully
      initialized by the CPU managing the hot-plug.  In the Xen PVHVM case
      xen_smp_intr_init() is called to set up the hot-plugged vCPU's
      XEN_RESCHEDULE_VECTOR.
      
      The hot-plugging CPU is marked online, not marked active and does not have
      its IPI vectors set up.  rcu_implicit_offline_qs() sees the hot-plugging
      CPU is !cpu_is_offline() and tries to send it a reschedule IPI.
      This leads to:
      
      	kernel BUG at drivers/xen/events.c:1328!
      
      	xen_send_IPI_one()
      	xen_smp_send_reschedule()
      	rcu_implicit_offline_qs()
      	rcu_implicit_dynticks_qs()
      	force_qs_rnp()
      	force_quiescent_state()
      	__rcu_process_callbacks()
      	rcu_process_callbacks()
      	__do_softirq()
      	call_softirq()
      	do_softirq()
      	irq_exit()
      	xen_evtchn_do_upcall()
      
      because xen_send_IPI_one() will attempt to use an uninitialized IRQ for
      the XEN_RESCHEDULE_VECTOR.
      
      There is at least one other place that has caused the same crash:
      
      	xen_smp_send_reschedule()
      	wake_up_idle_cpu()
      	add_timer_on()
      	clocksource_watchdog()
      	call_timer_fn()
      	run_timer_softirq()
      	__do_softirq()
      	call_softirq()
      	do_softirq()
      	irq_exit()
      	xen_evtchn_do_upcall()
      	xen_hvm_callback_vector()
      
      clocksource_watchdog() uses cpu_online_mask to pick the next CPU to handle
      a watchdog timer:
      
      	/*
      	 * Cycle through CPUs to check if the CPUs stay synchronized
      	 * to each other.
      	 */
      	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
      	if (next_cpu >= nr_cpu_ids)
      		next_cpu = cpumask_first(cpu_online_mask);
      	watchdog_timer.expires += WATCHDOG_INTERVAL;
      	add_timer_on(&watchdog_timer, next_cpu);
      
      This resulted in an attempt to send an IPI to a hot-plugging CPU that
      had not initialized its reschedule vector. One option would be to make
      the RCU code check for CPU active rather than CPU offline, since in
      older kernels a CPU becomes active only after it has been marked online.
      
      But Srivatsa pointed out that "the cpu_active vs cpu_online ordering has been
      completely reworked - in the online path, cpu_active is set *before* cpu_online,
      and also, in the cpu offline path, the cpu_active bit is reset in the CPU_DYING
      notification instead of CPU_DOWN_PREPARE." Drilling into the bring-up
      path: "[brought up CPU].. send out a CPU_STARTING notification, and in response
      to that, the scheduler sets the CPU in the cpu_active_mask. Again, this mask
      is better left to the scheduler alone, since it has the intelligence to use it
      judiciously."
      
      The conclusion was that:
      "
      1. At the IPI sender side:
      
         It is incorrect to send an IPI to an offline CPU (cpu not present in
         the cpu_online_mask). There are numerous places where we check this
         and warn/complain.
      
      2. At the IPI receiver side:
      
         It is incorrect to let the world know of our presence (by setting
         ourselves in global bitmasks) until our initialization steps are complete
         to such an extent that we can handle the consequences (such as
         receiving interrupts without crashing the sender etc.)
      " (from Srivatsa)
      
      As the native code enables the interrupts at some point we need to be
      able to service them. In other words a CPU must have valid IPI vectors
      if it has been marked online.
      
      It doesn't need to handle the IPI (interrupts may be disabled) but needs
      to have valid IPI vectors because another CPU may find it in cpu_online_mask
      and attempt to send it an IPI.
      
      This patch will change the order of the Xen vCPU bring-up functions so that
      Xen vectors have been set up before start_secondary() is called.
      It also will not continue to bring up a Xen vCPU if xen_smp_intr_init() fails
      to initialize it.
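
The ordering rule described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct vcpu`, `init_ipi_vectors()`, `bring_up()` and `send_reschedule_ipi()` are hypothetical stand-ins, not the kernel's functions, and the `online` flag merely models membership in cpu_online_mask.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical per-vCPU state for this sketch (not the kernel's structures). */
struct vcpu {
    bool ipi_vectors_ready;  /* models a successful xen_smp_intr_init() */
    bool online;             /* models membership in cpu_online_mask */
};

static struct vcpu vcpus[NR_CPUS];

/* Stand-in for IPI vector setup: returns 0 on success, negative on failure. */
static int init_ipi_vectors(int cpu, bool simulate_failure)
{
    if (simulate_failure)
        return -1;
    vcpus[cpu].ipi_vectors_ready = true;
    return 0;
}

/* Bring-up in the corrected order: vectors first, then mark online. */
static int bring_up(int cpu, bool simulate_failure)
{
    int rc = init_ipi_vectors(cpu, simulate_failure);

    if (rc)
        return rc;           /* do not continue bringing up this vCPU */
    vcpus[cpu].online = true;
    return 0;
}

/* A sender may target any online CPU; returns true if the IPI was "sent". */
static bool send_reschedule_ipi(int cpu)
{
    if (!vcpus[cpu].online)
        return false;        /* senders must not target offline CPUs */
    /* Invariant established by the ordering: online implies valid vectors. */
    assert(vcpus[cpu].ipi_vectors_ready);
    return true;
}
```

With this order, a sender that finds a CPU marked online can never hit the uninitialized-vector case, and a CPU whose vector setup failed is never marked online at all.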
      
      Orabug 13823853
      Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
      Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      fc78d343
    • D
      x86/xen: do not identity map UNUSABLE regions in the machine E820 · 3bc38cbc
      David Vrabel committed
      If there are UNUSABLE regions in the machine memory map, dom0 will
      attempt to map them 1:1 which is not permitted by Xen and the kernel
      will crash.
      
      There isn't anything interesting in the UNUSABLE region that the dom0
      kernel needs access to so we can avoid making the 1:1 mapping and
      treat it as RAM.
      
      We only do this for dom0, as that is where the tboot case shows up.
      A PV domU could have an UNUSABLE region in its pseudo-physical map
      and would need to be handled in another patch.
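
A minimal sketch of the idea, assuming a simplified memory-map representation: `struct region`, `retype_unusable()` and `sample_map` are hypothetical names introduced for illustration, and the type values are illustrative constants. The sample entries are taken from the Xen memory map quoted in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative region types for this sketch. */
enum region_type {
    REGION_RAM      = 1,
    REGION_RESERVED = 2,
    REGION_UNUSABLE = 5,
};

struct region {
    uint64_t start;
    uint64_t end;            /* exclusive */
    enum region_type type;
};

/*
 * For dom0 only, retype UNUSABLE regions as RAM so that no 1:1 mapping
 * is attempted for them. Returns the number of regions changed.
 */
static size_t retype_unusable(struct region *map, size_t n, bool is_dom0)
{
    size_t changed = 0;

    if (!is_dom0)
        return 0;            /* domU handling is left to another patch */

    for (size_t i = 0; i < n; i++) {
        if (map[i].type == REGION_UNUSABLE) {
            map[i].type = REGION_RAM;
            changed++;
        }
    }
    return changed;
}

/* A few entries from the machine memory map shown in this message. */
static struct region sample_map[] = {
    { 0x0000000000100000ULL, 0x0000000000800000ULL, REGION_RAM },
    { 0x0000000000800000ULL, 0x0000000000972000ULL, REGION_UNUSABLE },
    { 0x0000000000972000ULL, 0x00000000cf200000ULL, REGION_RAM },
    { 0x00000000cf200000ULL, 0x00000000cf38f000ULL, REGION_RESERVED },
};
```

Only the UNUSABLE entry changes type; reserved regions are left alone.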
      
      This fixes a boot failure on hosts with tboot.
      
      tboot marks a region in the e820 map as unusable. The dom0 kernel
      would attempt to map this region, but Xen does not permit unusable
      regions to be mapped by guests.
      
        (XEN)  0000000000000000 - 0000000000060000 (usable)
        (XEN)  0000000000060000 - 0000000000068000 (reserved)
        (XEN)  0000000000068000 - 000000000009e000 (usable)
        (XEN)  0000000000100000 - 0000000000800000 (usable)
        (XEN)  0000000000800000 - 0000000000972000 (unusable)
      
      tboot marked this region as unusable.
      
        (XEN)  0000000000972000 - 00000000cf200000 (usable)
        (XEN)  00000000cf200000 - 00000000cf38f000 (reserved)
        (XEN)  00000000cf38f000 - 00000000cf3ce000 (ACPI data)
        (XEN)  00000000cf3ce000 - 00000000d0000000 (reserved)
        (XEN)  00000000e0000000 - 00000000f0000000 (reserved)
        (XEN)  00000000fe000000 - 0000000100000000 (reserved)
        (XEN)  0000000100000000 - 0000000630000000 (usable)
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      [v1: Altered the patch and description with domU's with UNUSABLE regions]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      3bc38cbc
    • Y
      x86/mm: Fix boot crash with DEBUG_PAGE_ALLOC=y and more than 512G RAM · 527bf129
      Yinghai Lu committed
      Dave Hansen reported that systems between 500G and 600G RAM
      crash early if DEBUG_PAGEALLOC is selected.
      
       > [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
       > [    0.000000]  [mem 0x00000000-0x000fffff] page 4k
       > [    0.000000] BRK [0x02086000, 0x02086fff] PGTABLE
       > [    0.000000] BRK [0x02087000, 0x02087fff] PGTABLE
       > [    0.000000] BRK [0x02088000, 0x02088fff] PGTABLE
       > [    0.000000] init_memory_mapping: [mem 0xe80ee00000-0xe80effffff]
       > [    0.000000]  [mem 0xe80ee00000-0xe80effffff] page 4k
       > [    0.000000] BRK [0x02089000, 0x02089fff] PGTABLE
       > [    0.000000] BRK [0x0208a000, 0x0208afff] PGTABLE
       > [    0.000000] Kernel panic - not syncing: alloc_low_page: ran out of memory
      
      It turns out that we failed to increase the number of BRK pages needed
      to map the initial 2M and [0,1M) when we switched to using the #PF
      handler to set up memory mappings:
      
       > commit 8170e6be
       > Author: H. Peter Anvin <hpa@zytor.com>
       > Date:   Thu Jan 24 12:19:52 2013 -0800
       >
       >     x86, 64bit: Use a #PF handler to materialize early mappings on demand
      
      Before that, we had the mapping for [0,512M) in head_64.S, and we
      could spare two pages for [0,1M).  After that change, we can no longer
      reuse pages.
      
      When we have more than 512G of RAM, we need an extra page for the pgd
      entry covering [512G, 1024G).
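
As a worked check of the 512G boundary, the sketch below computes which top-level (pgd) slot an address falls in, assuming x86-64 4-level paging where each pgd entry covers 512G (a shift of 39 bits). `pgd_slot()` is a hypothetical helper for illustration; the addresses come from the boot log above.

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT   39     /* each top-level entry covers 2^39 bytes = 512G */
#define PTRS_PER_PGD  512

/* Index of the top-level page-table slot that maps a given address. */
static unsigned int pgd_slot(uint64_t addr)
{
    return (unsigned int)((addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1));
}
```

The low range `[mem 0x00000000-0x000fffff]` lands in slot 0, while `0xe80ee00000` from the failing `init_memory_mapping` line lands in slot 1, i.e. [512G, 1024G), so it cannot share the page-table page already used for slot 0.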
      
      Increase pages in BRK for page table to solve the boot crash.
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Bisected-by: Dave Hansen <dave.hansen@intel.com>
      Tested-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: <stable@vger.kernel.org> # v3.9 and later
      Link: http://lkml.kernel.org/r/1376351004-4015-1-git-send-email-yinghai@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      527bf129
  5. 14 Aug, 2013 3 commits
  6. 13 Aug, 2013 3 commits
    • O
      sched: fix the theoretical signal_wake_up() vs schedule() race · e0acd0a6
      Oleg Nesterov committed
      This is only theoretical, but after try_to_wake_up(p) was changed
      to check p->state under p->pi_lock, code like
      
      	__set_current_state(TASK_INTERRUPTIBLE);
      	schedule();
      
      can miss a signal. This is a special case of wait-for-condition:
      it relies on the try_to_wake_up()/schedule() interaction and thus does
      not need mb() between __set_current_state() and if (signal_pending).
      
      However, this __set_current_state() can move into the critical
      section protected by rq->lock. Now that try_to_wake_up() takes
      another lock, we need to ensure that it can't be reordered with the
      "if (signal_pending(current))" check inside that section.
      
      The patch is actually a one-liner: it simply adds smp_wmb() before
      spin_lock_irq(rq->lock). This is what try_to_wake_up() already
      does for the same reason.
      
      We turn this wmb() into the new helper, smp_mb__before_spinlock(),
      for better documentation and to allow the architectures to change
      the default implementation.
      
      While at it, kill smp_mb__after_lock(), it has no callers.
      
      Perhaps we can also add smp_mb__before/after_spinunlock() for
      prepare_to_wait().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0acd0a6
    • T
      x86, microcode, AMD: Fix early microcode loading · 84516098
      Torsten Kaiser committed
      load_microcode_amd() (and the helper it is using) should not have a
      cpu parameter. The microcode loading does not depend on the CPU wrt the
      patches loaded since they will end up in a global list for all CPUs
      anyway.
      
      The change from cpu to x86family in load_microcode_amd()
      now allows dropping the code messing with cpu_data(cpu) from
      collect_cpu_info_amd_early(), which is wrong anyway because at that
      point the per-cpu cpu_info is not yet set up (these values would later be
      overwritten by smp_store_boot_cpu_info() / smp_store_cpu_info()).
      
      Fold the rest of collect_cpu_info_amd_early() into load_ucode_amd_ap(),
      because it's only used in one place and, without the cpuinfo_x86
      accesses, not much of it was left.
      Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com>
      [ Fengguang: build fix ]
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      [ Boris: adapt it to current tree. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
      84516098
    • T
      x86, microcode, AMD: Make cpu_has_amd_erratum() use the correct struct cpuinfo_x86 · 8c6b79bb
      Torsten Kaiser committed
      cpu_has_amd_erratum() is buggy, because it uses the per-cpu cpu_info
      before it is filled by smp_store_boot_cpu_info() / smp_store_cpu_info().
      
      If early microcode loading is enabled its collect_cpu_info_amd_early()
      will fill ->x86 and so the fallback to boot_cpu_data is not used. But
      ->x86_vendor was not filled and is still X86_VENDOR_INTEL resulting in
      no errata fixes getting applied and my system hangs on boot.
      
      Using cpu_info in cpu_has_amd_erratum() is wrong anyway: its only
      caller init_amd() will have a struct cpuinfo_x86 as parameter and the
      set_cpu_bug() that is controlled by cpu_has_amd_erratum() also only uses
      that struct.
      
      So pass the struct cpuinfo_x86 from init_amd() to cpu_has_amd_erratum()
      and the broken fallback can be dropped.
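
The difference between the two call shapes can be sketched in userspace as follows. All names here (`struct cpuinfo`, `per_cpu_info`, `has_erratum_percpu()`, `has_erratum()`, `init_arg`) are hypothetical and heavily simplified; the sketch only assumes, as the message states, that the Intel vendor value is what an unfilled (zeroed) vendor field reads as.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative vendor ids; zero is assumed to read as Intel. */
enum vendor { VENDOR_INTEL = 0, VENDOR_AMD = 2 };

struct cpuinfo {
    int vendor;
    int family;
};

#define NR_CPUS 2

/* Per-CPU copies: zero-initialised until the SMP code fills them in. */
static struct cpuinfo per_cpu_info[NR_CPUS];

static bool erratum_applies(const struct cpuinfo *c, int family)
{
    return c->vendor == VENDOR_AMD && c->family == family;
}

/* Old shape: consults the per-CPU copy, which may be only partly filled. */
static bool has_erratum_percpu(int cpu, int family)
{
    return erratum_applies(&per_cpu_info[cpu], family);
}

/* New shape: the caller passes the struct it is currently initialising. */
static bool has_erratum(const struct cpuinfo *c, int family)
{
    return erratum_applies(c, family);
}

/* The fully populated struct that an init routine would hold. */
static struct cpuinfo init_arg = { VENDOR_AMD, 0x10 };
```

If early code fills only the family in the per-CPU copy, the old shape sees a non-AMD vendor and reports no erratum, while the new shape gives the expected answer.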
      
      [ Boris: Drop WARN_ON() since we're called only from init_amd() ]
      Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      8c6b79bb
  7. 12 Aug, 2013 1 commit
  8. 10 Aug, 2013 1 commit
    • D
      x86: Don't clear olpc_ofw_header when sentinel is detected · d55e37bb
      Daniel Drake committed
      OpenFirmware wasn't quite following the protocol described in boot.txt
      and the kernel has detected this through use of the sentinel value
      in boot_params. OFW does zero out almost all of the stuff that it should
      do, but not the sentinel.
      
      This causes the kernel to clear olpc_ofw_header, which breaks x86 OLPC
      support.
      
      OpenFirmware has now been fixed. However, it would be nice if we could
      maintain Linux compatibility with old firmware versions. To do that, we just
      have to avoid zeroing out olpc_ofw_header.
      
      OFW does not write to any other parts of the header that are being zapped
      by the sentinel-detection code, and all users of olpc_ofw_header are
      somewhat protected through checking for the OLPC_OFW_SIG magic value
      before using it. So this should not cause any problems for anyone.
      Signed-off-by: Daniel Drake <dsd@laptop.org>
      Link: http://lkml.kernel.org/r/20130809221420.618E6FAB03@dev.laptop.org
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: <stable@vger.kernel.org> # v3.9+
      d55e37bb
  9. 05 Aug, 2013 1 commit
    • V
      perf/x86: Fix intel QPI uncore event definitions · c9601247
      Vince Weaver committed
      John McCalpin reports that the "drs_data" and "ncb_data" QPI
      uncore events are missing the "extra bit" and always return zero
      values unless the bit is properly set.
      
      More details from him:
      
       According to the Xeon E5-2600 Product Family Uncore Performance
       Monitoring Guide, Table 2-94, about 1/2 of the QPI Link Layer events
       (including the ones that "perf" calls "drs_data" and "ncb_data") require
       that the "extra bit" be set.
      
       This was confusing for a while -- a note at the bottom of page 94 says
       that the "extra bit" is bit 16 of the control register.
       Unfortunately, Table 2-86 clearly says that bit 16 is reserved and must
       be zero.  Looking around a bit, I found that bit 21 appears to be the
       correct "extra bit", and further investigation shows that "perf" actually
       agrees with me:
      	[root@c560-003.stampede]# cat /sys/bus/event_source/devices/uncore_qpi_0/format/event
      	config:0-7,21
      
       So the command
      	# perf -e "uncore_qpi_0/event=drs_data/"
       Is the same as
      	# perf -e "uncore_qpi_0/event=0x02,umask=0x08/"
       While it should be
      	# perf -e "uncore_qpi_0/event=0x102,umask=0x08/"
      
       I confirmed that this last version gives results that agree with the
       amount of data that I expected the STREAM benchmark to move across the QPI
       link in the second (cross-chip) test of the original script.
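
The effect of the `config:0-7,21` format on the raw config value can be worked through with a short sketch. `qpi_config()` is a hypothetical helper, and placing the umask at config bits 8-15 is an assumption made for illustration; only the event field layout is taken from the sysfs output quoted above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pack an event/umask pair into a raw config value following the
 * "config:0-7,21" event format: event bits 0-7 map to config bits 0-7,
 * and event bit 8 (the "extra bit") maps to config bit 21.
 */
static uint64_t qpi_config(unsigned int event, unsigned int umask)
{
    uint64_t config = 0;

    config |= (uint64_t)(event & 0xff);              /* config bits 0-7 */
    config |= (uint64_t)((event >> 8) & 0x1) << 21;  /* config bit 21 */
    config |= (uint64_t)(umask & 0xff) << 8;         /* assumed: bits 8-15 */
    return config;
}
```

Under these assumptions, `event=0x02,umask=0x08` yields `0x802` with bit 21 clear, while `event=0x102,umask=0x08` yields `0x200802` with bit 21 set, which is the difference the report describes.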
      Reported-by: John McCalpin <mccalpin@tacc.utexas.edu>
      Signed-off-by: Vince Weaver <vincent.weaver@maine.edu>
      Cc: zheng.z.yan@intel.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1308021037280.26119@vincent-weaver-1.um.maine.edu
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c9601247
  10. 01 Aug, 2013 1 commit
  11. 31 Jul, 2013 1 commit
  12. 30 Jul, 2013 1 commit
  13. 27 Jul, 2013 1 commit
    • H
      x86, fpu: correct the asm constraints for fxsave, unbreak mxcsr.daz · eaa5a990
      H.J. Lu committed
      GCC will optimize mxcsr_feature_mask_init in arch/x86/kernel/i387.c:
      
      		memset(&fx_scratch, 0, sizeof(struct i387_fxsave_struct));
      		asm volatile("fxsave %0" : : "m" (fx_scratch));
      		mask = fx_scratch.mxcsr_mask;
      		if (mask == 0)
      			mask = 0x0000ffbf;
      
      to
      
      		memset(&fx_scratch, 0, sizeof(struct i387_fxsave_struct));
      		asm volatile("fxsave %0" : : "m" (fx_scratch));
      		mask = 0x0000ffbf;
      
      since the asm statement doesn't say it will update fx_scratch.  As a
      result, the DAZ bit will be cleared.  This patch fixes it. This bug
      dates back to at least kernel 2.6.12.
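
A userspace sketch of the constraint issue, assuming GCC or Clang: declaring the operand as `"+m"` marks it as both read and written, so the compiler may not assume the buffer is still all zeroes after the asm. `struct fxsave_area` and `read_mxcsr_mask()` are hypothetical names; the asm is compiled only on x86-64, and elsewhere the function simply returns the default mask.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 512-byte FXSAVE image; MXCSR_MASK sits at byte offset 28. */
struct fxsave_area {
    uint8_t  pad0[24];
    uint32_t mxcsr;
    uint32_t mxcsr_mask;
    uint8_t  rest[480];
} __attribute__((aligned(16)));   /* FXSAVE requires 16-byte alignment */

static struct fxsave_area fx_scratch;

static uint32_t read_mxcsr_mask(void)
{
    uint32_t mask;

    memset(&fx_scratch, 0, sizeof(fx_scratch));
#if defined(__GNUC__) && defined(__x86_64__)
    /* "+m": the asm reads and writes fx_scratch, so the load below is kept. */
    __asm__ volatile("fxsave %0" : "+m" (fx_scratch));
#endif
    mask = fx_scratch.mxcsr_mask;
    if (mask == 0)
        mask = 0x0000ffbf;        /* default when the CPU reports no mask */
    return mask;
}
```

With an input-only `"m"` constraint, as in the first snippet of this message, the compiler is free to treat `fx_scratch.mxcsr_mask` as the zero written by `memset()`.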
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      eaa5a990
  14. 26 Jul, 2013 1 commit
  15. 24 Jul, 2013 1 commit
  16. 23 Jul, 2013 1 commit
  17. 19 Jul, 2013 1 commit
  18. 18 Jul, 2013 1 commit
  19. 17 Jul, 2013 1 commit
    • K
      x86: Make sure IDT is page aligned · 4df05f36
      Kees Cook committed
      Since the IDT is referenced from a fixmap, make sure it is page aligned.
      Merge with 32-bit one, since it was already aligned to deal with F00F
      bug. Since bss is cleared before IDT setup, it can live there. This also
      moves the other *_idt_table variables into common locations.
      
      This avoids the risk of the IDT ever being moved in the bss and having
      the mapping be offset, resulting in calling incorrect handlers. In the
      current upstream kernel this is not a manifested bug, but heavily patched
      kernels (such as those using the PaX patch series) did encounter this bug.
      
      The tables other than idt_table technically do not need to be page
      aligned, at least not at the current time, but using a common
      declaration avoids mistakes.  On 64 bits the table is exactly one page
      long, anyway.
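
The size and alignment claims can be checked with a small sketch, assuming GCC or Clang, a 4096-byte page, and 16-byte gate descriptors on 64 bits. `struct gate64`, `idt_table_sketch` and `idt_is_page_aligned()` are hypothetical names, not the kernel's declarations.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE    4096
#define IDT_ENTRIES  256

/* Stand-in for a 64-bit gate descriptor: 16 bytes per entry. */
struct gate64 {
    uint8_t bytes[16];
};

/* Zero-initialised (bss) table forced onto a page boundary. */
static struct gate64 idt_table_sketch[IDT_ENTRIES]
    __attribute__((aligned(PAGE_SIZE)));

/* Returns nonzero when the table starts exactly on a page boundary. */
static int idt_is_page_aligned(void)
{
    return ((uintptr_t)idt_table_sketch % PAGE_SIZE) == 0;
}
```

256 entries of 16 bytes come to exactly 4096 bytes, so with page alignment the table occupies one whole page and a page-granular mapping of it has no offset.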
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: http://lkml.kernel.org/r/20130716183441.GA14232@www.outflux.net
      Reported-by: PaX Team <pageexec@gmail.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      4df05f36
  20. 16 Jul, 2013 1 commit
  21. 15 Jul, 2013 1 commit
    • P
      x86: delete __cpuinit usage from all x86 files · 148f9bb8
      Paul Gortmaker committed
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      Note that some harmless section mismatch warnings may result, since
      notify_cpu_starting() and cpu_up(), which are arch independent
      (kernel/cpu.c), are flagged as __cpuinit  -- so if we remove the
      __cpuinit from arch specific callers, we will also get section
      mismatch warnings.
      As an intermediate step, we intend to turn the linux/init.h cpuinit
      content into no-ops as early as possible, since that will get rid
      of these warnings.  In any case, they are temporary and harmless.
      
      This removes all the arch/x86 uses of the __cpuinit macros from
      all C files.  x86 only had the one __CPUINIT used in assembly files,
      and it wasn't paired off with a .previous or a __FINIT, so we can
      delete it directly w/o any corresponding additional change there.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: H. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      148f9bb8
  22. 12 Jul, 2013 2 commits
  23. 11 Jul, 2013 2 commits