1. 13 9月, 2013 6 次提交
  2. 12 9月, 2013 2 次提交
  3. 03 9月, 2013 1 次提交
    • L
      lockref: implement lockless reference count updates using cmpxchg() · bc08b449
      Linus Torvalds 提交于
      Instead of taking the spinlock, the lockless versions atomically check
      that the lock is not taken, and do the reference count update using a
      cmpxchg() loop.  This is semantically identical to doing the reference
      count update protected by the lock, but avoids the "wait for lock"
      contention that you get when accesses to the reference count are
      contended.
      
      Note that a "lockref" is absolutely _not_ equivalent to an atomic_t.
      Even when the lockref reference counts are updated atomically with
      cmpxchg, the fact that they also verify the state of the spinlock means
      that the lockless updates can never happen while somebody else holds the
      spinlock.
      
      So while "lockref_put_or_lock()" looks a lot like just another name for
      "atomic_dec_and_lock()", and both optimize to lockless updates, they are
      fundamentally different: the decrement done by atomic_dec_and_lock() is
      truly independent of any lock (as long as it doesn't decrement to zero),
      so a locked region can still see the count change.
      
      The lockref structure, in contrast, really is a *locked* reference
      count.  If you hold the spinlock, the reference count will be stable and
      you can modify the reference count without using atomics, because even
      the lockless updates will see and respect the state of the lock.
      
      In order to enable the cmpxchg lockless code, the architecture needs to
      do three things:
      
       (1) Make sure that the "arch_spinlock_t" and an "unsigned int" can fit
           in an aligned u64, and have a "cmpxchg()" implementation that works
           on such a u64 data type.
      
       (2) define a helper function to test for a spinlock being unlocked
           ("arch_spin_value_unlocked()")
      
       (3) select the "ARCH_USE_CMPXCHG_LOCKREF" config variable in its
           Kconfig file.
      
      This enables it for x86-64 (but not 32-bit, we'd need to make sure
      cmpxchg() turns into the proper cmpxchg8b in order to enable it for
      32-bit mode).
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc08b449
  4. 02 9月, 2013 5 次提交
  5. 30 8月, 2013 3 次提交
  6. 26 8月, 2013 1 次提交
  7. 23 8月, 2013 4 次提交
  8. 22 8月, 2013 1 次提交
  9. 20 8月, 2013 4 次提交
    • C
      xen/smp: initialize IPI vectors before marking CPU online · fc78d343
      Chuck Anderson 提交于
      An older PVHVM guest (v3.0 based) crashed during vCPU hot-plug with:
      
      	kernel BUG at drivers/xen/events.c:1328!
      
      RCU has detected that a CPU has not entered a quiescent state within the
      grace period.  It needs to send the CPU a reschedule IPI if it is not
      offline.  rcu_implicit_offline_qs() does this check:
      
      	/*
      	 * If the CPU is offline, it is in a quiescent state.  We can
      	 * trust its state not to change because interrupts are disabled.
      	 */
      	if (cpu_is_offline(rdp->cpu)) {
      		rdp->offline_fqs++;
      		return 1;
      	}
      
      	Else the CPU is online.  Send it a reschedule IPI.
      
      The CPU is in the middle of being hot-plugged and has been marked online
      (!cpu_is_offline()).  See start_secondary():
      
      	set_cpu_online(smp_processor_id(), true);
      	...
      	per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;
      
      start_secondary() then waits for the CPU bringing up the hot-plugged CPU to
      mark it as active:
      
      	/*
      	 * Wait until the cpu which brought this one up marked it
      	 * online before enabling interrupts. If we don't do that then
      	 * we can end up waking up the softirq thread before this cpu
      	 * reached the active state, which makes the scheduler unhappy
      	 * and schedule the softirq thread on the wrong cpu. This is
      	 * only observable with forced threaded interrupts, but in
      	 * theory it could also happen w/o them. It's just way harder
      	 * to achieve.
      	 */
      	while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask))
      		cpu_relax();
      
      	/* enable local interrupts */
      	local_irq_enable();
      
      The CPU being hot-plugged will be marked active after it has been fully
      initialized by the CPU managing the hot-plug.  In the Xen PVHVM case
      xen_smp_intr_init() is called to set up the hot-plugged vCPU's
      XEN_RESCHEDULE_VECTOR.
      
      The hot-plugging CPU is marked online, not marked active and does not have
      its IPI vectors set up.  rcu_implicit_offline_qs() sees the hot-plugging
      cpu is !cpu_is_offline() and tries to send it a reschedule IPI:
      This will lead to:
      
      	kernel BUG at drivers/xen/events.c:1328!
      
      	xen_send_IPI_one()
      	xen_smp_send_reschedule()
      	rcu_implicit_offline_qs()
      	rcu_implicit_dynticks_qs()
      	force_qs_rnp()
      	force_quiescent_state()
      	__rcu_process_callbacks()
      	rcu_process_callbacks()
      	__do_softirq()
      	call_softirq()
      	do_softirq()
      	irq_exit()
      	xen_evtchn_do_upcall()
      
      because xen_send_IPI_one() will attempt to use an uninitialized IRQ for
      the XEN_RESCHEDULE_VECTOR.
      
      There is at least one other place that has caused the same crash:
      
      	xen_smp_send_reschedule()
      	wake_up_idle_cpu()
      	add_timer_on()
      	clocksource_watchdog()
      	call_timer_fn()
      	run_timer_softirq()
      	__do_softirq()
      	call_softirq()
      	do_softirq()
      	irq_exit()
      	xen_evtchn_do_upcall()
      	xen_hvm_callback_vector()
      
      clocksource_watchdog() uses cpu_online_mask to pick the next CPU to handle
      a watchdog timer:
      
      	/*
      	 * Cycle through CPUs to check if the CPUs stay synchronized
      	 * to each other.
      	 */
      	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
      	if (next_cpu >= nr_cpu_ids)
      		next_cpu = cpumask_first(cpu_online_mask);
      	watchdog_timer.expires += WATCHDOG_INTERVAL;
      	add_timer_on(&watchdog_timer, next_cpu);
      
      This resulted in an attempt to send an IPI to a hot-plugging CPU that
      had not initialized its reschedule vector. One option would be to make
      the RCU code check to not check for CPU offline but for CPU active.
      As becoming active is done after a CPU is online (in older kernels).
      
      But Srivatsa pointed out that "the cpu_active vs cpu_online ordering has been
      completely reworked - in the online path, cpu_active is set *before* cpu_online,
      and also, in the cpu offline path, the cpu_active bit is reset in the CPU_DYING
      notification instead of CPU_DOWN_PREPARE." Drilling in this the bring-up
      path: "[brought up CPU].. send out a CPU_STARTING notification, and in response
      to that, the scheduler sets the CPU in the cpu_active_mask. Again, this mask
      is better left to the scheduler alone, since it has the intelligence to use it
      judiciously."
      
      The conclusion was that:
      "
      1. At the IPI sender side:
      
         It is incorrect to send an IPI to an offline CPU (cpu not present in
         the cpu_online_mask). There are numerous places where we check this
         and warn/complain.
      
      2. At the IPI receiver side:
      
         It is incorrect to let the world know of our presence (by setting
         ourselves in global bitmasks) until our initialization steps are complete
         to such an extent that we can handle the consequences (such as
         receiving interrupts without crashing the sender etc.)
      " (from Srivatsa)
      
      As the native code enables the interrupts at some point we need to be
      able to service them. In other words a CPU must have valid IPI vectors
      if it has been marked online.
      
      It doesn't need to handle the IPI (interrupts may be disabled) but needs
      to have valid IPI vectors because another CPU may find it in cpu_online_mask
      and attempt to send it an IPI.
      
      This patch will change the order of the Xen vCPU bring-up functions so that
      Xen vectors have been set up before start_secondary() is called.
      It also will not continue to bring up a Xen vCPU if xen_smp_intr_init() fails
      to initialize it.
      
      Orabug 13823853
      Signed-off-by Chuck Anderson <chuck.anderson@oracle.com>
      Acked-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      fc78d343
    • D
      x86/xen: do not identity map UNUSABLE regions in the machine E820 · 3bc38cbc
      David Vrabel 提交于
      If there are UNUSABLE regions in the machine memory map, dom0 will
      attempt to map them 1:1 which is not permitted by Xen and the kernel
      will crash.
      
      There isn't anything interesting in the UNUSABLE region that the dom0
      kernel needs access to so we can avoid making the 1:1 mapping and
      treat it as RAM.
      
      We only do this for dom0, as that is where tboot case shows up.
      A PV domU could have an UNUSABLE region in its pseudo-physical map
      and would need to be handled in another patch.
      
      This fixes a boot failure on hosts with tboot.
      
      tboot marks a region in the e820 map as unusable and the dom0 kernel
      would attempt to map this region and Xen does not permit unusable
      regions to be mapped by guests.
      
        (XEN)  0000000000000000 - 0000000000060000 (usable)
        (XEN)  0000000000060000 - 0000000000068000 (reserved)
        (XEN)  0000000000068000 - 000000000009e000 (usable)
        (XEN)  0000000000100000 - 0000000000800000 (usable)
        (XEN)  0000000000800000 - 0000000000972000 (unusable)
      
      tboot marked this region as unusable.
      
        (XEN)  0000000000972000 - 00000000cf200000 (usable)
        (XEN)  00000000cf200000 - 00000000cf38f000 (reserved)
        (XEN)  00000000cf38f000 - 00000000cf3ce000 (ACPI data)
        (XEN)  00000000cf3ce000 - 00000000d0000000 (reserved)
        (XEN)  00000000e0000000 - 00000000f0000000 (reserved)
        (XEN)  00000000fe000000 - 0000000100000000 (reserved)
        (XEN)  0000000100000000 - 0000000630000000 (usable)
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      [v1: Altered the patch and description with domU's with UNUSABLE regions]
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      3bc38cbc
    • Y
      x86/mm: Fix boot crash with DEBUG_PAGE_ALLOC=y and more than 512G RAM · 527bf129
      Yinghai Lu 提交于
      Dave Hansen reported that systems between 500G and 600G RAM
      crash early if DEBUG_PAGEALLOC is selected.
      
       > [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
       > [    0.000000]  [mem 0x00000000-0x000fffff] page 4k
       > [    0.000000] BRK [0x02086000, 0x02086fff] PGTABLE
       > [    0.000000] BRK [0x02087000, 0x02087fff] PGTABLE
       > [    0.000000] BRK [0x02088000, 0x02088fff] PGTABLE
       > [    0.000000] init_memory_mapping: [mem 0xe80ee00000-0xe80effffff]
       > [    0.000000]  [mem 0xe80ee00000-0xe80effffff] page 4k
       > [    0.000000] BRK [0x02089000, 0x02089fff] PGTABLE
       > [    0.000000] BRK [0x0208a000, 0x0208afff] PGTABLE
       > [    0.000000] Kernel panic - not syncing: alloc_low_page: ran out of memory
      
      It turns out that we missed increasing needed pages in BRK to
      mapping initial 2M and [0,1M) when we switched to use the #PF
      handler to set memory mappings:
      
       > commit 8170e6be
       > Author: H. Peter Anvin <hpa@zytor.com>
       > Date:   Thu Jan 24 12:19:52 2013 -0800
       >
       >     x86, 64bit: Use a #PF handler to materialize early mappings on demand
      
      Before that, we had the maping from [0,512M) in head_64.S, and we
      can spare two pages [0-1M).  After that change, we can not reuse
      pages anymore.
      
      When we have more than 512M ram, we need an extra page for pgd page
      with [512G, 1024g).
      
      Increase pages in BRK for page table to solve the boot crash.
      Reported-by: NDave Hansen <dave.hansen@intel.com>
      Bisected-by: NDave Hansen <dave.hansen@intel.com>
      Tested-by: NDave Hansen <dave.hansen@intel.com>
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: <stable@vger.kernel.org> # v3.9 and later
      Link: http://lkml.kernel.org/r/1376351004-4015-1-git-send-email-yinghai@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      527bf129
    • Y
      x86/ioapic/kcrash: Prevent crash_kexec() from deadlocking on ioapic_lock · 17405453
      Yoshihiro YUNOMAE 提交于
      Prevent crash_kexec() from deadlocking on ioapic_lock. When
      crash_kexec() is executed on a CPU, the CPU will take ioapic_lock
      in disable_IO_APIC(). So if the cpu gets an NMI while locking
      ioapic_lock, a deadlock will happen.
      
      In this patch, ioapic_lock is zapped/initialized before disable_IO_APIC().
      
      You can reproduce this deadlock the following way:
      
      1. Add mdelay(1000) after raw_spin_lock_irqsave() in
         native_ioapic_set_affinity()@arch/x86/kernel/apic/io_apic.c
      
         Although the deadlock can occur without this modification, it will increase
         the potential of the deadlock problem.
      
      2. Build and install the kernel
      
      3. Set up the OS which will run panic() and kexec when NMI is injected
          # echo "kernel.unknown_nmi_panic=1" >> /etc/sysctl.conf
          # vim /etc/default/grub
            add "nmi_watchdog=0 crashkernel=256M" in GRUB_CMDLINE_LINUX line
          # grub2-mkconfig
      
      4. Reboot the OS
      
      5. Run following command for each vcpu on the guest
          # while true; do echo <CPU num> > /proc/irq/<IO-APIC-edge or IO-APIC-fasteoi>/smp_affinitity; done;
         By running this command, cpus will get ioapic_lock for setting affinity.
      
      6. Inject NMI (push a dump button or execute 'virsh inject-nmi <domain>' if you
         use VM). After injecting NMI, panic() is called in an nmi-handler context.
         Then, kexec will normally run in panic(), but the operation will be stopped
         by deadlock on ioapic_lock in crash_kexec()->machine_crash_shutdown()->
         native_machine_crash_shutdown()->disable_IO_APIC()->clear_IO_APIC()->
         clear_IO_APIC_pin()->ioapic_read_entry().
      Signed-off-by: NYoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: yrl.pp-manager.tt@hitachi.com
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Link: http://lkml.kernel.org/r/20130820070107.28245.83806.stgit@yunodevelSigned-off-by: NIngo Molnar <mingo@kernel.org>
      17405453
  10. 19 8月, 2013 1 次提交
  11. 16 8月, 2013 3 次提交
  12. 15 8月, 2013 1 次提交
  13. 14 8月, 2013 8 次提交