1. 04 January 2012, 2 commits
    • x86: Fix atomic64_xxx_cx8() functions · ceb7b40b
      Committed by Eric Dumazet
      It appears that nearly all functions in arch/x86/lib/atomic64_cx8_32.S
      are wrong in the case where cmpxchg8b must be restarted, because the
      LOCK_PREFIX macro defines a label "1" that clashes with the callers'
      own local labels:
      
      1:
      	some_instructions
      	LOCK_PREFIX
      	cmpxchg8b (%ebp)
      	jne 1b	# jumps back to the "1:" inside LOCK_PREFIX instead of the retry loop!
      
      A possible fix is to use a magic label "672" in the LOCK_PREFIX asm
      definition, similar to the "671" label we already define in
      LOCK_PREFIX_HERE.
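      To make the clash concrete, here is a hypothetical stand-alone C program
      (illustration only, not kernel code; BROKEN_PREFIX, FIXED_PREFIX and
      count_passes() are made-up names) showing how a helper macro that emits
      its own "1:" label steals the target of the caller's "jne 1b", and how an
      unlikely label number such as 672 avoids that:
      
      	#include <stdio.h>
      
      	/* A helper that, like the old LOCK_PREFIX, emits its own "1:" local label. */
      	#define BROKEN_PREFIX	"1:\n\t"
      	/* The fixed helper uses a label number callers are unlikely to pick. */
      	#define FIXED_PREFIX	"672:\n\t"
      
      	static int count_passes(int n)
      	{
      		int passes = 0;
      
      		asm volatile("1:\n\t"		/* the caller's own retry label        */
      			     "incl %1\n\t"	/* work that must run on every pass    */
      			     FIXED_PREFIX	/* helper label; does not capture "1b" */
      			     "decl %0\n\t"
      			     "jne 1b"		/* with BROKEN_PREFIX this would jump to
      						   the helper's label and skip the incl,
      						   just as the cmpxchg8b loops skipped
      						   reloading their operands */
      			     : "+r" (n), "+r" (passes));
      		return passes;
      	}
      
      	int main(void)
      	{
      		printf("%d\n", count_passes(10));	/* prints 10; would print 1 with BROKEN_PREFIX */
      		return 0;
      	}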
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Jan Beulich <JBeulich@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1325608540.2320.103.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: Fix and improve cmpxchg_double{,_local}() · cdcd6298
      Committed by Jan Beulich
      Just like the per-CPU ones, they had several problems/shortcomings:
      
      Only the first memory operand was mentioned in the asm() operands, and
      the 2x64-bit version didn't have a memory clobber while the 2x32-bit
      one did. The former allowed the compiler to skip re-loading the data
      if it was already cached in some register, while the latter was overly
      destructive.
      
      The types of the local copies of the old and new values were
      incorrect (the types of the pointed-to variables should be used
      here, to make sure the respective old/new variable types are
      compatible).
      
      The __dummy/__junk variables were pointless, given that local
      copies of the inputs already existed (and can hence be used for
      discarded outputs).
      
      The 32-bit variant of cmpxchg_double_local() referenced
      cmpxchg16b_local().
      
      While at it, also:
      
       - change the return value type to what it really is: 'bool'
       - unify 32- and 64-bit variants
       - abstract out the common part of the 'normal' and 'local' variants
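      As a rough, user-space illustration of the resulting contract (this is
      not the kernel macro; struct pair and cmpxchg_double_demo() are made-up
      names, and it assumes an x86-64 CPU with cmpxchg16b), a wrapper that
      mentions the whole 16-byte object as an in/out memory operand and
      returns bool could look like:
      
      	#include <stdbool.h>
      	#include <stdint.h>
      	#include <stdio.h>
      
      	/* 16-byte, 16-byte-aligned object: the two adjacent words cmpxchg16b covers. */
      	struct pair {
      		uint64_t lo;
      		uint64_t hi;
      	} __attribute__((aligned(16)));
      
      	/* Atomically replace *p with {n_lo, n_hi} iff it still equals {o_lo, o_hi}. */
      	static inline bool cmpxchg_double_demo(struct pair *p,
      					       uint64_t o_lo, uint64_t o_hi,
      					       uint64_t n_lo, uint64_t n_hi)
      	{
      		bool ok;
      
      		asm volatile("lock cmpxchg16b %[mem]\n\t"
      			     "sete %[ok]"
      			     : [mem] "+m" (*p),	/* both words in/out: no blanket "memory" clobber needed */
      			       [ok] "=q" (ok),
      			       "+a" (o_lo), "+d" (o_hi)	/* overwritten on failure */
      			     : "b" (n_lo), "c" (n_hi)
      			     : "cc");
      		return ok;
      	}
      
      	int main(void)
      	{
      		struct pair v = { .lo = 1, .hi = 2 };
      
      		printf("%d\n", cmpxchg_double_demo(&v, 1, 2, 3, 4));	/* 1: swapped    */
      		printf("%d\n", cmpxchg_double_demo(&v, 1, 2, 5, 6));	/* 0: v is {3,4} */
      		return 0;
      	}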
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/4F01F12A020000780006A19B@nat28.tlf.novell.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 24 December 2011, 2 commits
  3. 21 December 2011, 2 commits
  4. 18 December 2011, 3 commits
  5. 17 December 2011, 1 commit
  6. 16 December 2011, 2 commits
    • x86_64, asm: Optimise fls(), ffs() and fls64() · ca3d30cc
      Committed by David Howells
      fls(N), ffs(N) and fls64(N) can be optimised on x86_64.  Currently they use a
      CMOV instruction after the BSR/BSF to set the destination register to -1 if the
      value to be scanned was 0 (in which case BSR/BSF set the Z flag).
      
      Instead, according to the AMD64 specification, we can make use of the fact that
      BSR/BSF doesn't modify its output register if its input is 0.  By preloading
      the output with -1 and incrementing the result, we achieve the desired result
      without the need for a conditional check.
      
      The Intel x86_64 specification, however, says that the result of BSR/BSF in
      such a case is undefined.  That said, when queried, one of the Intel CPU
      architects said that the behaviour on all Intel CPUs is that:
      
       (1) with BSRQ/BSFQ, the 64-bit destination register is written with its
           original value if the source is 0, thus, in essence, giving the effect we
           want.  And,
      
       (2) with BSRL/BSFL, the lower half of the 64-bit destination register is
           written with its original value if the source is 0, and the upper half is
           cleared, thus giving us the effect we want (we return a 4-byte int).
      
      Further, it was indicated that they (Intel) are unlikely to get away with
      changing the behaviour.
      
      It might be possible to optimise the 32-bit versions of these functions, but
      there's a lot more variation between CPUs, and so the effective
      non-destructive property of BSRL/BSFL cannot be relied on.
      
      [ hpa: specifically, some 486 chips are known to NOT have this property. ]
      
      I have benchmarked these functions on my Core2 Duo test machine using the
      following program:
      
      	#include <stdlib.h>
      	#include <stdio.h>
      
      	#ifndef __x86_64__
      	#error
      	#endif
      
      	#define PAGE_SHIFT 12
      
      	typedef unsigned long long __u64, u64;
      	typedef unsigned int __u32, u32;
      	#define noinline	__attribute__((noinline))
      
      	static __always_inline int fls64(__u64 x)
      	{
      		long bitpos = -1;
      
      		asm("bsrq %1,%0"
      		    : "+r" (bitpos)
      		    : "rm" (x));
      		return bitpos + 1;
      	}
      
      	static inline unsigned long __fls(unsigned long word)
      	{
      		asm("bsr %1,%0"
      		    : "=r" (word)
      		    : "rm" (word));
      		return word;
      	}
      	static __always_inline int old_fls64(__u64 x)
      	{
      		if (x == 0)
      			return 0;
      		return __fls(x) + 1;
      	}
      
      	static noinline // __attribute__((const))
      	int old_get_order(unsigned long size)
      	{
      		int order;
      
      		size = (size - 1) >> (PAGE_SHIFT - 1);
      		order = -1;
      		do {
      			size >>= 1;
      			order++;
      		} while (size);
      		return order;
      	}
      
      	static inline __attribute__((const))
      	int get_order_old_fls64(unsigned long size)
      	{
      		int order;
      		size--;
      		size >>= PAGE_SHIFT;
      		order = old_fls64(size);
      		return order;
      	}
      
      	static inline __attribute__((const))
      	int get_order(unsigned long size)
      	{
      		int order;
      		size--;
      		size >>= PAGE_SHIFT;
      		order = fls64(size);
      		return order;
      	}
      
      	unsigned long prevent_optimise_out;
      
      	static noinline unsigned long test_old_get_order(void)
      	{
      		unsigned long n, total = 0;
      		long rep, loop;
      
      		for (rep = 1000000; rep > 0; rep--) {
      			for (loop = 0; loop <= 16384; loop += 4) {
      				n = 1UL << loop;
      				total += old_get_order(n);
      			}
      		}
      		return total;
      	}
      
      	static noinline unsigned long test_get_order_old_fls64(void)
      	{
      		unsigned long n, total = 0;
      		long rep, loop;
      
      		for (rep = 1000000; rep > 0; rep--) {
      			for (loop = 0; loop <= 16384; loop += 4) {
      				n = 1UL << loop;
      				total += get_order_old_fls64(n);
      			}
      		}
      		return total;
      	}
      
      	static noinline unsigned long test_get_order(void)
      	{
      		unsigned long n, total = 0;
      		long rep, loop;
      
      		for (rep = 1000000; rep > 0; rep--) {
      			for (loop = 0; loop <= 16384; loop += 4) {
      				n = 1UL << loop;
      				total += get_order(n);
      			}
      		}
      		return total;
      	}
      
      	int main(int argc, char **argv)
      	{
      		unsigned long total;
      
      		switch (argc) {
      		case 1:  total = test_old_get_order();		break;
      		case 2:  total = test_get_order_old_fls64();	break;
      		default: total = test_get_order();		break;
      		}
      		prevent_optimise_out = total;
      		return 0;
      	}
      
      This allows me to test the use of the old fls64() implementation and the new
      fls64() implementation, and also to contrast these with the out-of-line,
      loop-based implementation of get_order().  The results were:
      
      	warthog>time ./get_order
      	real    1m37.191s
      	user    1m36.313s
      	sys     0m0.861s
      	warthog>time ./get_order x
      	real    0m16.892s
      	user    0m16.586s
      	sys     0m0.287s
      	warthog>time ./get_order x x
      	real    0m7.731s
      	user    0m7.727s
      	sys     0m0.002s
      
      Using the current upstream fls64() as a basis for an inlined get_order() [the
      second result above] is much faster than using the current out-of-line
      loop-based get_order() [the first result above].
      
      Using my optimised inline fls64()-based get_order() [the third result above]
      is even faster still.
      
      [ hpa: changed the selection of 32 vs 64 bits to use CONFIG_X86_64
        instead of comparing BITS_PER_LONG, updated comments, rebased manually
        on top of 83d99df7 x86, bitops: Move fls64.h inside __KERNEL__ ]
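      For completeness, the same preload-and-increment trick applied to ffs()
      (a hedged, user-space sketch with made-up names; the in-kernel ffs() has
      additional handling, and this relies on the BSF behaviour discussed
      above):
      
      	#include <stdio.h>
      
      	static inline int ffs_demo(int x)
      	{
      		int bitpos = -1;	/* survives BSF unchanged when x == 0 */
      
      		asm("bsfl %1,%0"
      		    : "+r" (bitpos)
      		    : "rm" (x));
      		return bitpos + 1;	/* 0 for x == 0, 1-based bit index otherwise */
      	}
      
      	int main(void)
      	{
      		printf("%d %d %d\n", ffs_demo(0), ffs_demo(1), ffs_demo(0x4000));
      		/* expected output: 0 1 15 */
      		return 0;
      	}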
      Signed-off-by: David Howells <dhowells@redhat.com>
      Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86, bitops: Move fls64.h inside __KERNEL__ · 83d99df7
      Committed by H. Peter Anvin
      We would include <asm-generic/bitops/fls64.h> even without __KERNEL__,
      but that doesn't make sense, as:
      
      1. That file provides fls64(), but the corresponding function fls() is
         not exported to user space.
      2. The implementation of fls64.h uses kernel-only symbols.
      3. fls64.h is not exported to user space.
      
      This appears to have been a bug introduced in checkin:
      
      d57594c2 bitops: use __fls for fls64 on 64-bit archs
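      A minimal sketch of the resulting layout (paraphrased, not a literal
      quote of arch/x86/include/asm/bitops.h):
      
      	#ifdef __KERNEL__
      
      	/* ... fls(), ffs(), __fls() and the other kernel-only bitops ... */
      
      	#include <asm-generic/bitops/fls64.h>	/* needs kernel-only symbols */
      
      	#endif /* __KERNEL__ */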
      
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Alexander van Heukelum <heukelum@mailshack.com>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Link: http://lkml.kernel.org/r/4EEA77E1.6050009@zytor.com
  7. 15 December 2011, 1 commit
    • x86: Fix and improve percpu_cmpxchg{8,16}b_double() · cebef5be
      Committed by Jan Beulich
      They had several problems/shortcomings:
      
      Only the first memory operand was mentioned in the 2x32-bit asm()
      operands, and the 2x64-bit version had a memory clobber. The first
      allowed the compiler to skip re-loading the data if it was already
      cached in some register, and the second was overly destructive.
      
      The memory operand in the 2x32-bit asm() was declared to only be
      an output.
      
      The types of the local copies of the old and new values were
      incorrect (as in other per-CPU ops, the types of the per-CPU
      variables accessed should be used here, to make sure the
      respective types are compatible).
      
      The __dummy variable was pointless (and needlessly initialized
      in the 2x32-bit case), given that local copies of the inputs
      already exist.
      
      The 2x64-bit variant forced the address of the first object into
      %rsi, even though this is needed only for the call to the
      emulation function. The real cmpxchg16b can operate on any memory
      operand.
      
      While at it, also change the return value type to what it really is:
      'bool'.
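      The 2x32-bit flavour can be illustrated the same way as the cmpxchg16b
      sketch shown under the Jan 04 cmpxchg_double() commit earlier in this
      log; only the wrapper differs (made-up names again; it assumes a struct
      pair32 of two uint32_t fields aligned to 8 bytes, dropped into that same
      demo program):
      
      	static inline bool cmpxchg8b_double_demo(struct pair32 *p,
      						 uint32_t o_lo, uint32_t o_hi,
      						 uint32_t n_lo, uint32_t n_hi)
      	{
      		bool ok;
      
      		asm volatile("lock cmpxchg8b %[mem]\n\t"
      			     "sete %[ok]"
      			     : [mem] "+m" (*p),	/* the memory operand is in/out, not output-only */
      			       [ok] "=q" (ok),
      			       "+a" (o_lo), "+d" (o_hi)
      			     : "b" (n_lo), "c" (n_hi)
      			     : "cc");
      		return ok;
      	}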
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/4EE86D6502000078000679FE@nat28.tlf.novell.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  8. 14 December 2011, 3 commits
  9. 13 December 2011, 1 commit
  10. 09 December 2011, 1 commit
    • x86, efi: Calling __pa() with an ioremap()ed address is invalid · e8c71062
      Committed by Matt Fleming
      If we encounter an efi_memory_desc_t without EFI_MEMORY_WB set
      in ->attribute we currently call set_memory_uc(), which in turn
      calls __pa() on a potentially ioremap'd address.
      
      On CONFIG_X86_32 this is invalid, resulting in the following
      oops on some machines:
      
        BUG: unable to handle kernel paging request at f7f22280
        IP: [<c10257b9>] reserve_ram_pages_type+0x89/0x210
        [...]
      
        Call Trace:
         [<c104f8ca>] ? page_is_ram+0x1a/0x40
         [<c1025aff>] reserve_memtype+0xdf/0x2f0
         [<c1024dc9>] set_memory_uc+0x49/0xa0
         [<c19334d0>] efi_enter_virtual_mode+0x1c2/0x3aa
         [<c19216d4>] start_kernel+0x291/0x2f2
         [<c19211c7>] ? loglevel+0x1b/0x1b
         [<c19210bf>] i386_start_kernel+0xbf/0xc8
      
      A better approach to this problem is to map the memory region
      with the correct attributes from the start, instead of modifying
      it after the fact. The uncached case can be handled by
      ioremap_nocache() and the cached by ioremap_cache().
      
      Despite first impressions, it's not possible to use
      ioremap_cache() to map all cached memory regions on
      CONFIG_X86_64 because EFI_RUNTIME_SERVICES_DATA regions really
      don't like being mapped into the vmalloc space, as detailed in
      the following bug report,
      
      	https://bugzilla.redhat.com/show_bug.cgi?id=748516
      
      Therefore, we need to ensure that any EFI_RUNTIME_SERVICES_DATA
      regions are covered by the direct kernel mapping table on
      CONFIG_X86_64. To accomplish this we now map E820_RESERVED_EFI
      regions via the direct kernel mapping with the initial call to
      init_memory_mapping() in setup_arch(), whereas previously these
      regions wouldn't be mapped if they were after the last E820_RAM
      region until efi_ioremap() was called. Doing it this way allows
      us to delete efi_ioremap() completely.
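      A hedged sketch of that mapping decision (simplified stand-ins for the
      kernel types and calls: efi_memory_desc and map_efi_region() are made up
      here, and the ioremap stubs just report the choice; EFI_MEMORY_WB = 0x8
      is the UEFI attribute bit):
      
      	#include <stdint.h>
      	#include <stdio.h>
      
      	#define EFI_MEMORY_WB	0x8ULL		/* region may be mapped write-back cacheable */
      	#define EFI_PAGE_SHIFT	12
      
      	struct efi_memory_desc {		/* subset of the kernel's efi_memory_desc_t */
      		uint64_t phys_addr;
      		uint64_t num_pages;
      		uint64_t attribute;
      	};
      
      	/* Stand-ins for ioremap_cache()/ioremap_nocache(). */
      	static void ioremap_cache_stub(uint64_t pa, uint64_t size)
      	{
      		printf("map %#llx (+%llu bytes) cached\n",
      		       (unsigned long long)pa, (unsigned long long)size);
      	}
      
      	static void ioremap_nocache_stub(uint64_t pa, uint64_t size)
      	{
      		printf("map %#llx (+%llu bytes) uncached\n",
      		       (unsigned long long)pa, (unsigned long long)size);
      	}
      
      	/* Map each region with the right attributes up front, instead of
      	 * remapping (and calling __pa() on an ioremap'd address) later. */
      	static void map_efi_region(const struct efi_memory_desc *md)
      	{
      		uint64_t size = md->num_pages << EFI_PAGE_SHIFT;
      
      		if (md->attribute & EFI_MEMORY_WB)
      			ioremap_cache_stub(md->phys_addr, size);
      		else
      			ioremap_nocache_stub(md->phys_addr, size);
      	}
      
      	int main(void)
      	{
      		struct efi_memory_desc wb = { 0x100000,   16, EFI_MEMORY_WB };
      		struct efi_memory_desc uc = { 0xfed00000,  1, 0 };
      
      		map_efi_region(&wb);
      		map_efi_region(&uc);
      		return 0;
      	}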
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg@redhat.com>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Huang Ying <huang.ying.caritas@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console-pimps.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  11. 07 December 2011, 3 commits
  12. 06 December 2011, 9 commits
  13. 05 December 2011, 4 commits
  14. 04 December 2011, 1 commit
    • xen/pm_idle: Make pm_idle be default_idle under Xen. · e5fd47bf
      Committed by Konrad Rzeszutek Wilk
      The idea behind commit d91ee586 ("cpuidle: replace xen access to x86
      pm_idle and default_idle") was to have one call - disable_cpuidle() -
      which would keep pm_idle from being molested by other code. It prevents
      cpuidle_idle_call() from being installed as pm_idle (which is excellent).
      
      But in select_idle_routine() and idle_setup(), pm_idle can still be set
      to amd_e400_idle, mwait_idle or default_idle, depending on CPU flags
      (MWAIT) and, in the AMD case, on the CPU model.
      
      In the mwait_idle case we can hit instances where the hypervisor
      (Amazon EC2, specifically) advertises MWAIT and we get:
      
        Brought up 2 CPUs
        invalid opcode: 0000 [#1] SMP
      
        Pid: 0, comm: swapper Not tainted 3.1.0-0.rc6.git0.3.fc16.x86_64 #1
        RIP: e030:[<ffffffff81015d1d>]  [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
        ...
        Call Trace:
         [<ffffffff8100e2ed>] cpu_idle+0xae/0xe8
         [<ffffffff8149ee78>] cpu_bringup_and_idle+0xe/0x10
        RIP  [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
         RSP <ffff8801d28ddf10>
      
      In the case of amd_e400_idle we don't get such spectacular crashes, but we
      do end up making an MSR access which is trapped by the hypervisor, and then
      following it up with a yield hypercall.  Meaning we end up going to the
      hypervisor twice instead of just once.
      
      The previous behavior before v3.0 was that pm_idle was set to
      default_idle regardless of select_idle_routine/idle_setup.
      
      We want to do that, but only for one specific case: Xen.  This patch
      does that.
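      A self-contained, conceptual sketch of that ordering (the function-pointer
      shapes and stubs below are invented to keep it runnable; the real change
      pins pm_idle from Xen's early setup code):
      
      	#include <stdbool.h>
      	#include <stdio.h>
      
      	static void default_idle(void)	{ puts("safe_halt"); }
      	static void mwait_idle(void)	{ puts("mwait (faults as a Xen PV guest!)"); }
      
      	static void (*pm_idle)(void);
      	static bool cpu_has_mwait = true;	/* what the hypervisor may advertise */
      
      	/* Feature-based selection; leaves pm_idle alone if it is already set,
      	 * which is what the Xen-specific assignment below relies on. */
      	static void select_idle_routine(void)
      	{
      		if (!pm_idle)
      			pm_idle = cpu_has_mwait ? mwait_idle : default_idle;
      	}
      
      	int main(void)
      	{
      		bool running_on_xen = true;	/* stands in for xen_domain() */
      
      		/* The fix: claim pm_idle before select_idle_routine() runs. */
      		if (running_on_xen)
      			pm_idle = default_idle;
      
      		select_idle_routine();
      		pm_idle();			/* prints "safe_halt" */
      		return 0;
      	}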
      
      Fixes RH BZ #739499 and Ubuntu #881076
      Reported-by: Stefan Bader <stefan.bader@canonical.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 26 November 2011, 2 commits
  16. 22 November 2011, 1 commit
  17. 17 November 2011, 1 commit
  18. 10 November 2011, 1 commit
    • x86/mrst: Avoid reporting wrong nmi status · 064a59b6
      Committed by Jacob Pan
      The Moorestown/Medfield platform does not have port 0x61 to report
      NMI status, nor does it have external NMI sources. The only NMI
      sources are the local APIC, as a result of perf counter overflow or
      IPIs, e.g. the NMI watchdog or spinlock debugging.
      
      Reading port 0x61 on Moorestown returns 0xff, which misleads the NMI
      handlers into reporting false critical errors such as memory parity
      errors. The subsequent I/O port accesses for NMI handling can also
      cause undefined behaviour on Moorestown.
      
      This patch allows the kernel to process NMIs due to the watchdog or a
      backtrace dump without unnecessary hangs.
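      A hedged illustration of the idea (the names below are invented to stay
      self-contained and are not the kernel's API): make the "NMI reason" read
      a per-platform callback, so a platform without port 0x61 reports "no
      external NMI source" instead of a bogus 0xff:
      
      	#include <stdint.h>
      	#include <stdio.h>
      
      	#define NMI_REASON_SERR		0x80	/* memory parity / SERR */
      	#define NMI_REASON_IOCHK	0x40
      
      	static uint8_t pc_get_nmi_reason(void)
      	{
      		return 0x00;	/* a legacy PC would do inb(0x61) here */
      	}
      
      	static uint8_t mrst_get_nmi_reason(void)
      	{
      		return 0;	/* no port 0x61 and no external NMI sources */
      	}
      
      	static uint8_t (*platform_get_nmi_reason)(void) = pc_get_nmi_reason;
      
      	static void handle_nmi(void)
      	{
      		uint8_t reason = platform_get_nmi_reason();
      
      		if (reason & NMI_REASON_SERR)
      			puts("memory parity error");	/* what a bogus 0xff would trigger */
      		else if (reason & NMI_REASON_IOCHK)
      			puts("I/O check error");
      		else
      			puts("no external reason; handle as perf/IPI NMI");
      	}
      
      	int main(void)
      	{
      		platform_get_nmi_reason = mrst_get_nmi_reason;	/* Moorestown override */
      		handle_nmi();
      		return 0;
      	}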
      Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      [hand applied]
      Signed-off-by: Alan Cox <alan@linux.intel.com>