1. 30 December 2016 (1 commit)
    • mm: optimize PageWaiters bit use for unlock_page() · b91e1302
      Committed by Linus Torvalds
      In commit 62906027 ("mm: add PageWaiters indicating tasks are
      waiting for a page bit") Nick Piggin made our page locking no longer
      unconditionally touch the hashed page waitqueue, which not only helps
      performance in general, but is particularly helpful on NUMA machines
      where the hashed wait queues can bounce around a lot.
      
      However, the "clear lock bit atomically and then test the waiters bit"
      sequence turns out to be much more expensive than it needs to be,
      because you get a nasty stall when trying to access the same word that
      just got updated atomically.
      
      On architectures where locking is done with LL/SC, this would be trivial
      to fix with a new primitive that clears one bit and tests another
      atomically, but that ends up not working on x86, where the only atomic
      operations that return the result end up being cmpxchg and xadd.  The
      atomic bit operations return the old value of the same bit we changed,
      not the value of an unrelated bit.
      
      On x86, we could put the lock bit in the high bit of the byte, and use
      "xadd" with that bit (where the overflow ends up not touching other
      bits), and look at the other bits of the result.  However, an even
      simpler model is to just use a regular atomic "and" to clear the lock
      bit, and then the sign bit in eflags will indicate the resulting state
      of the unrelated bit #7.
      
      So by moving the PageWaiters bit up to bit #7, we can atomically clear
      the lock bit and test the waiters bit on x86 too.  And on architectures
      with LL/SC (which is all the usual RISC suspects), the particular bit
      doesn't matter, so they are fine with this approach too.
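      
      (A minimal sketch of the x86 side of this idea, assuming GCC's flag-output
      asm constraints, the lock bit at bit #0 and the waiters bit at bit #7 of
      the same byte; this illustrates the "atomic andb, then read the sign flag"
      trick and is not necessarily the exact kernel code:)

      	/* Kernel-style sketch; bool is assumed to come from <linux/types.h>. */
      	static inline bool clear_bit_unlock_is_negative_byte(long nr,
      						volatile unsigned long *addr)
      	{
      		bool negative;

      		/* Atomically clear the bit with a byte-wide "lock and"; the
      		 * sign flag afterwards reflects bit #7 of the resulting byte. */
      		asm volatile("lock; andb %2,%1"
      			     : "=@ccs" (negative), "+m" (*(volatile char *)addr)
      			     : "ir" ((char)~(1 << nr))
      			     : "memory");
      		return negative;
      	}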
      
      This avoids the extra access to the same atomic word, and thus avoids
      the costly stall at page unlock time.
      
      The only downside is that the interface ends up being a bit odd and
      specialized: clear a bit in a byte, and test the sign bit.  Nick doesn't
      love the resulting name of the new primitive, but I'd rather make the
      name be descriptive and very clear about the limitation imposed by
      trying to work across all relevant architectures than make it be some
      generic thing that doesn't make the odd semantics explicit.
      
      So this introduces the new architecture primitive
      
          clear_bit_unlock_is_negative_byte();
      
      and adds the trivial implementation for x86.  We have a generic
      non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
      combination) which can be overridden by any architecture that can do
      better.  According to Nick, Power has the same hiccup x86 has, for
      example, but some other architectures may not even care.
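      
      (A hedged sketch of what that generic fallback can look like, built from
      the usual clear_bit_unlock()/test_bit() helpers; the real fallback may
      differ in detail:)

      	#ifndef clear_bit_unlock_is_negative_byte
      	static inline bool clear_bit_unlock_is_negative_byte(long nr,
      						volatile unsigned long *mem)
      	{
      		/* Release the lock bit first... */
      		clear_bit_unlock(nr, mem);
      		/* ...then separately test bit #7 of the same word, which is
      		 * where PageWaiters lives after this change.  Two accesses,
      		 * but always correct. */
      		return test_bit(7, mem);
      	}
      	#endif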
      
      All these optimizations mean that my page locking stress-test (which is
      just executing a lot of small short-lived shell scripts: "make test" in
      the git source tree) no longer makes our page locking look horribly bad.
      Before all these optimizations, the unlock_page() costs alone were just
      over 3% of all CPU overhead on "make test".  After this, it's down to
      0.66%, so just a quarter of the cost it used to be.
      
      (The difference on NUMA is bigger, but there this micro-optimization is
      likely less noticeable, since the big issue on NUMA was not the accesses
      to 'struct page', but the waitqueue accesses that were already removed
      by Nick's earlier commit).
      Acked-by: Nick Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2. 09 June 2016 (4 commits)
3. 09 February 2016 (1 commit)
    • x86/asm/bitops: Force inlining of test_and_set_bit and friends · 8dd5032d
      Committed by Denys Vlasenko
      Sometimes GCC mysteriously doesn't inline very small functions
      we expect to be inlined, see:
      
        https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      
      Arguably, GCC should do better, but GCC people aren't willing
      to invest time into it and are asking to use __always_inline
      instead.
      
      With this .config:
      
        http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os
      
      here's an example of functions getting deinlined many times:
      
        test_and_set_bit (166 copies, ~1260 calls)
               55                      push   %rbp
               48 89 e5                mov    %rsp,%rbp
               f0 48 0f ab 3e          lock bts %rdi,(%rsi)
               72 04                   jb     <test_and_set_bit+0xf>
               31 c0                   xor    %eax,%eax
               eb 05                   jmp    <test_and_set_bit+0x14>
               b8 01 00 00 00          mov    $0x1,%eax
               5d                      pop    %rbp
               c3                      retq
      
        test_and_clear_bit (124 copies, ~1000 calls)
               55                      push   %rbp
               48 89 e5                mov    %rsp,%rbp
               f0 48 0f b3 3e          lock btr %rdi,(%rsi)
               72 04                   jb     <test_and_clear_bit+0xf>
               31 c0                   xor    %eax,%eax
               eb 05                   jmp    <test_and_clear_bit+0x14>
               b8 01 00 00 00          mov    $0x1,%eax
               5d                      pop    %rbp
               c3                      retq
      
        change_bit (3 copies, 8 calls)
               55                      push   %rbp
               48 89 e5                mov    %rsp,%rbp
               f0 48 0f bb 3e          lock btc %rdi,(%rsi)
               5d                      pop    %rbp
               c3                      retq
      
        clear_bit_unlock (2 copies, 11 calls)
               55                      push   %rbp
               48 89 e5                mov    %rsp,%rbp
               f0 48 0f b3 3e          lock btr %rdi,(%rsi)
               5d                      pop    %rbp
               c3                      retq
      
      This patch works around it via s/inline/__always_inline/.
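      
      (The shape of the change, with simplified prototypes; the kernel's actual
      parameter and return types may differ slightly:)

      	/* Before: GCC may or may not honour the inline hint. */
      	static inline int test_and_set_bit(long nr, volatile unsigned long *addr);

      	/* After: force the compiler's hand. */
      	static __always_inline int test_and_set_bit(long nr, volatile unsigned long *addr);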
      
      Code size decreases by ~13.5k after the patch:
      
            text     data      bss       dec    filename
        92110727 20826144 36417536 149354407    vmlinux.before
        92097234 20826176 36417536 149340946    vmlinux.after
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Graf <tgraf@suug.ch>
      Link: http://lkml.kernel.org/r/1454881887-1367-1-git-send-email-dvlasenk@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
4. 14 September 2014 (1 commit)
    • Make ARCH_HAS_FAST_MULTIPLIER a real config variable · 72d93104
      Committed by Linus Torvalds
      It used to be an ad-hoc hack defined by the x86 version of
      <asm/bitops.h> that enabled a couple of library routines to know whether
      an integer multiply is faster than repeated shifts and additions.
      
      This just makes it use the real Kconfig system instead, and makes x86
      (which was the only architecture that did this) select the option.
      
      NOTE! Even for x86, this really is kind of wrong.  If we cared, we would
      probably not enable this for builds optimized for netburst (P4), where
      shifts-and-adds are generally faster than multiplies.  This patch does
      *not* change that kind of logic, though, it is purely a syntactic change
      with no code changes.
      
      This was triggered by the fact that we have other places that really
      want to know "do I want to expand multiplications by constants by hand
      or not", particularly the hash generation code.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5. 18 April 2014 (1 commit)
6. 05 December 2013 (1 commit)
7. 25 September 2013 (1 commit)
8. 17 July 2013 (1 commit)
    • x86, bitops: Change bitops to be native operand size · 9b710506
      Committed by H. Peter Anvin
      Change the bitops operations to take a naturally "long" index, i.e. 63
      usable bits on the 64-bit kernel.  Without this, additional bugs are
      likely to crop up in the future.
      
      We already have bugs on machines with > 16 TiB of memory in a single
      node, as can happen if memory is interleaved.  The x86 bitop operations
      take a signed index, so using an unsigned type is not an option.
      
      Jim Kukunas measured the effect of this patch on kernel size: it adds
      2779 bytes to the allyesconfig kernel.  Some of that probably could be
      elided by replacing the inline functions with macros which select the
      32-bit type if the index is a 32-bit value, something like:
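      
      (A purely hypothetical sketch of such a macro, with invented
      constant_test_bit_32()/constant_test_bit_64() helper names:)

      	#define constant_test_bit(nr, addr)				\
      		(sizeof(nr) == sizeof(u32)				\
      		 ? constant_test_bit_32((nr), (addr))			\
      		 : constant_test_bit_64((nr), (addr)))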
      
      In that case we could also use "Jr" constraints for the 64-bit
      version.
      
      However, this would more than double the amount of code for a
      relatively small gain.
      
      Note that we can't use ilog2() for _BITOPS_LONG_SHIFT, as that causes
      a recursive header inclusion problem.
      
      The change to constant_test_bit() should both generate better code and
      give the correct result for negative bit indices.  As previously written,
      the compiler had to generate extra code to create the proper wrong
      result for negative values.
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Jim Kukunas <james.t.kukunas@intel.com>
      Link: http://lkml.kernel.org/n/tip-z61ofiwe90xeyb461o72h8ya@git.kernel.org
9. 19 September 2012 (1 commit)
10. 13 September 2012 (2 commits)
11. 25 June 2012 (1 commit)
12. 23 May 2012 (1 commit)
13. 16 December 2011 (2 commits)
    • x86_64, asm: Optimise fls(), ffs() and fls64() · ca3d30cc
      Committed by David Howells
      fls(N), ffs(N) and fls64(N) can be optimised on x86_64.  Currently they use a
      CMOV instruction after the BSR/BSF to set the destination register to -1 if the
      value to be scanned was 0 (in which case BSR/BSF set the Z flag).
      
      Instead, according to the AMD64 specification, we can make use of the fact that
      BSR/BSF doesn't modify its output register if its input is 0.  By preloading
      the output with -1 and incrementing the result, we achieve the desired result
      without the need for a conditional check.
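      
      (For contrast, a hedged sketch of the CMOV-based pattern being replaced;
      simplified, not the verbatim kernel code.  The new preload-with-minus-one
      form can be seen in the fls64() used by the benchmark further down:)

      	static inline int fls_cmov(int x)
      	{
      		int r;

      		/* BSR leaves ZF set for x == 0; CMOVZ then forces the result
      		 * to -1 so that the final "+ 1" returns 0. */
      		asm("bsrl %1,%0\n\t"
      		    "cmovzl %2,%0"
      		    : "=&r" (r)
      		    : "rm" (x), "rm" (-1));
      		return r + 1;
      	}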
      
      The Intel x86_64 specification, however, says that the result of BSR/BSF in
      such a case is undefined.  That said, when queried, one of the Intel CPU
      architects said that the behaviour on all Intel CPUs is that:
      
       (1) with BSRQ/BSFQ, the 64-bit destination register is written with its
           original value if the source is 0, thus, in essence, giving the effect we
           want.  And,
      
       (2) with BSRL/BSFL, the lower half of the 64-bit destination register is
           written with its original value if the source is 0, and the upper half is
           cleared, thus giving us the effect we want (we return a 4-byte int).
      
      Further, it was indicated that they (Intel) are unlikely to get away with
      changing the behaviour.
      
      It might be possible to optimise the 32-bit versions of these functions, but
      there's a lot more variation, and so the effective non-destructive property of
      BSRL/BSFL cannot be relied on.
      
      [ hpa: specifically, some 486 chips are known to NOT have this property. ]
      
      I have benchmarked these functions on my Core2 Duo test machine using the
      following program:
      
      	#include <stdlib.h>
      	#include <stdio.h>
      
      	#ifndef __x86_64__
      	#error
      	#endif
      
      	#define PAGE_SHIFT 12
      
      	typedef unsigned long long __u64, u64;
      	typedef unsigned int __u32, u32;
      	#define noinline	__attribute__((noinline))
      
      	static __always_inline int fls64(__u64 x)
      	{
      		long bitpos = -1;
      
      		asm("bsrq %1,%0"
      		    : "+r" (bitpos)
      		    : "rm" (x));
      		return bitpos + 1;
      	}
      
      	static inline unsigned long __fls(unsigned long word)
      	{
      		asm("bsr %1,%0"
      		    : "=r" (word)
      		    : "rm" (word));
      		return word;
      	}
      	static __always_inline int old_fls64(__u64 x)
      	{
      		if (x == 0)
      			return 0;
      		return __fls(x) + 1;
      	}
      
      	static noinline // __attribute__((const))
      	int old_get_order(unsigned long size)
      	{
      		int order;
      
      		size = (size - 1) >> (PAGE_SHIFT - 1);
      		order = -1;
      		do {
      			size >>= 1;
      			order++;
      		} while (size);
      		return order;
      	}
      
      	static inline __attribute__((const))
      	int get_order_old_fls64(unsigned long size)
      	{
      		int order;
      		size--;
      		size >>= PAGE_SHIFT;
      		order = old_fls64(size);
      		return order;
      	}
      
      	static inline __attribute__((const))
      	int get_order(unsigned long size)
      	{
      		int order;
      		size--;
      		size >>= PAGE_SHIFT;
      		order = fls64(size);
      		return order;
      	}
      
      	unsigned long prevent_optimise_out;
      
      	static noinline unsigned long test_old_get_order(void)
      	{
      		unsigned long n, total = 0;
      		long rep, loop;
      
      		for (rep = 1000000; rep > 0; rep--) {
      			for (loop = 0; loop <= 16384; loop += 4) {
      				n = 1UL << loop;
      				total += old_get_order(n);
      			}
      		}
      		return total;
      	}
      
      	static noinline unsigned long test_get_order_old_fls64(void)
      	{
      		unsigned long n, total = 0;
      		long rep, loop;
      
      		for (rep = 1000000; rep > 0; rep--) {
      			for (loop = 0; loop <= 16384; loop += 4) {
      				n = 1UL << loop;
      				total += get_order_old_fls64(n);
      			}
      		}
      		return total;
      	}
      
      	static noinline unsigned long test_get_order(void)
      	{
      		unsigned long n, total = 0;
      		long rep, loop;
      
      		for (rep = 1000000; rep > 0; rep--) {
      			for (loop = 0; loop <= 16384; loop += 4) {
      				n = 1UL << loop;
      				total += get_order(n);
      			}
      		}
      		return total;
      	}
      
      	int main(int argc, char **argv)
      	{
      		unsigned long total;
      
      		switch (argc) {
      		case 1:  total = test_old_get_order();		break;
      		case 2:  total = test_get_order_old_fls64();	break;
      		default: total = test_get_order();		break;
      		}
      		prevent_optimise_out = total;
      		return 0;
      	}
      
      This allows me to test the use of the old fls64() implementation and the new
      fls64() implementation and also to contrast these to the out-of-line loop-based
      implementation of get_order().  The results were:
      
      	warthog>time ./get_order
      	real    1m37.191s
      	user    1m36.313s
      	sys     0m0.861s
      	warthog>time ./get_order x
      	real    0m16.892s
      	user    0m16.586s
      	sys     0m0.287s
      	warthog>time ./get_order x x
      	real    0m7.731s
      	user    0m7.727s
      	sys     0m0.002s
      
      Using the current upstream fls64() as a basis for an inlined get_order() [the
      second result above] is much faster than using the current out-of-line
      loop-based get_order() [the first result above].
      
      Using my optimised inline fls64()-based get_order() [the third result above]
      is even faster still.
      
      [ hpa: changed the selection of 32 vs 64 bits to use CONFIG_X86_64
        instead of comparing BITS_PER_LONG, updated comments, rebased manually
        on top of 83d99df7 x86, bitops: Move fls64.h inside __KERNEL__ ]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86, bitops: Move fls64.h inside __KERNEL__ · 83d99df7
      Committed by H. Peter Anvin
      We would include <asm-generic/bitops/fls64.h> even without __KERNEL__,
      but that doesn't make sense, as:
      
      1. That file provides fls64(), but the corresponding function fls() is
         not exported to user space.
      2. The implementation of fls64.h uses kernel-only symbols.
      3. fls64.h is not exported to user space.
      
      This appears to have been a bug introduced in checkin:
      
      d57594c2 bitops: use __fls for fls64 on 64-bit archs
      
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Alexander van Heukelum <heukelum@mailshack.com>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Link: http://lkml.kernel.org/r/4EEA77E1.6050009@zytor.com
14. 27 July 2011 (1 commit)
15. 24 March 2011 (3 commits)
    • bitops: remove minix bitops from asm/bitops.h · 61f2e7b0
      Committed by Akinobu Mita
      The minix bit operations are only used by the minix filesystem and are
      useless to other modules, because the byte order of the inode and block
      bitmaps differs between architectures, as shown below:
      
      m68k:
      	big-endian 16bit indexed bitmaps
      
      h8300, microblaze, s390, sparc, m68knommu:
      	big-endian 32 or 64bit indexed bitmaps
      
      m32r, mips, sh, xtensa:
      	big-endian 32 or 64bit indexed bitmaps for big-endian mode
      	little-endian bitmaps for little-endian mode
      
      Others:
      	little-endian bitmaps
      
      In order to move minix bit operations from asm/bitops.h to architecture
      independent code in minix filesystem, this provides two config options.
      
      CONFIG_MINIX_FS_BIG_ENDIAN_16BIT_INDEXED is only selected by m68k.
      CONFIG_MINIX_FS_NATIVE_ENDIAN is selected by the architectures which use
      native byte order bitmaps (h8300, microblaze, s390, sparc, m68knommu,
      m32r, mips, sh, xtensa).  The architectures which always use little-endian
      bitmaps do not select these options.
      
      Finally, we can remove minix bit operations from asm/bitops.h for all
      architectures.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Greg Ungerer <gerg@uclinux.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Andreas Schwab <schwab@linux-m68k.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: Paul Mundt <lethal@linux-sh.org>
      Cc: Chris Zankel <chris@zankel.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bitops: remove ext2 non-atomic bitops from asm/bitops.h · f312eff8
      Committed by Akinobu Mita
      As a result of the conversions, there are no users of the ext2 non-atomic
      bit operations except for the ext2 filesystem itself.  Now we can move
      them into architecture-independent code in the ext2 filesystem and remove
      them from asm/bitops.h for all architectures.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bitops: introduce little-endian bitops for most architectures · 861b5ae7
      Committed by Akinobu Mita
      Introduce little-endian bit operations on the big-endian architectures
      that do not have native little-endian bit operations, and on the
      little-endian architectures.  (alpha, avr32, blackfin, cris, frv, h8300,
      ia64, m32r, mips, mn10300, parisc, sh, sparc, tile, x86, xtensa)
      
      These architectures can just include the generic implementation
      (asm-generic/bitops/le.h).
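      
      (A hedged sketch of the kind of helper asm-generic/bitops/le.h provides,
      reduced to a single operation; details may differ from the real header:)

      	/* On big-endian machines the "little-endian" bit number is swizzled
      	 * within each long; on little-endian machines this is a no-op. */
      	#if defined(__BIG_ENDIAN)
      	#define BITOP_LE_SWIZZLE	((BITS_PER_LONG - 1) & ~0x7)
      	#else
      	#define BITOP_LE_SWIZZLE	0
      	#endif

      	static inline int test_bit_le(int nr, const void *addr)
      	{
      		return test_bit(nr ^ BITOP_LE_SWIZZLE, addr);
      	}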
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Grant Grundler <grundler@parisc-linux.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
      Acked-by: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16. 10 October 2010 (1 commit)
    • bitops: make asm-generic/bitops/find.h more generic · 708ff2a0
      Committed by Akinobu Mita
      asm-generic/bitops/find.h has the extern declarations of find_next_bit()
      and find_next_zero_bit() and the macro definitions of find_first_bit()
      and find_first_zero_bit(). It is only usable by the architectures which
      enable CONFIG_GENERIC_FIND_NEXT_BIT and disable
      CONFIG_GENERIC_FIND_FIRST_BIT.
      
      x86 and tile enable both CONFIG_GENERIC_FIND_NEXT_BIT and
      CONFIG_GENERIC_FIND_FIRST_BIT. These architectures cannot include
      asm-generic/bitops/find.h in their asm/bitops.h. So ifdef'ed extern
      declarations of find_first_bit() and find_first_zero_bit() are put in
      linux/bitops.h.
      
      This makes asm-generic/bitops/find.h usable by these architectures,
      which now include it. This change is also needed for the forthcoming
      cleanup of duplicated extern declarations.
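      
      (A hedged sketch of the resulting arrangement; the prototypes are
      simplified, and the fallback macros in the #else branch are an assumption
      about how the non-GENERIC_FIND_FIRST_BIT case is expressed in terms of
      find_next_*():)

      	#ifdef CONFIG_GENERIC_FIND_FIRST_BIT
      	extern unsigned long find_first_bit(const unsigned long *addr,
      					    unsigned long size);
      	extern unsigned long find_first_zero_bit(const unsigned long *addr,
      						 unsigned long size);
      	#else
      	#define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
      	#define find_first_zero_bit(addr, size) find_next_zero_bit((addr), (size), 0)
      	#endif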
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: Chris Metcalf <cmetcalf@tilera.com>
17. 27 September 2010 (1 commit)
18. 07 April 2010 (1 commit)
    • x86: Add optimized popcnt variants · d61931d8
      Committed by Borislav Petkov
      Add support for the hardware version of the Hamming weight function,
      popcnt, present in CPUs which advertise it under CPUID, Function
      0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
      default lib/hweight.c software versions.
      
      A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
      a 3x speedup on an F10h machine.
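      
      (A hedged userspace illustration of the same idea; the helper names here
      are invented, and the kernel itself wires the lib/hweight.c fallback in
      differently, but the effect is the same:)

      	#include <stdio.h>

      	static unsigned int sw_hweight32(unsigned int w)
      	{
      		unsigned int count = 0;

      		for (; w; w &= w - 1)	/* clear the lowest set bit */
      			count++;
      		return count;
      	}

      	static unsigned int hw_hweight32(unsigned int w)
      	{
      		unsigned int res;

      		asm("popcnt %1,%0" : "=r" (res) : "rm" (w));
      		return res;
      	}

      	int main(void)
      	{
      		unsigned int x = 0xdeadbeef;
      		unsigned int c = __builtin_cpu_supports("popcnt")
      				 ? hw_hweight32(x) : sw_hweight32(x);

      		printf("%u\n", c);	/* prints 24 */
      		return 0;
      	}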
      Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
      LKML-Reference: <20100318112015.GC11152@aftab>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
19. 14 January 2009 (1 commit)
    • x86, generic: mark complex bitops.h inlines as __always_inline · c8399943
      Committed by Andi Kleen
      Impact: reduce kernel image size
      
      Hugh Dickins noticed that older gcc versions didn't inline some of
      the bitops when the kernel is built for code size.
      
      Mark all complex x86 bitops that have more than a single
      asm statement or two as always inline to avoid this problem.
      
      Probably should be done for other architectures too.
      
      Ingo then found a better fix that only requires
      a single line change, but it unfortunately only
      works on gcc 4.3.
      
      On older gccs the original patch still makes a ~0.3% defconfig
      difference with CONFIG_OPTIMIZE_INLINING=y.
      
      With gcc 4.1 and a defconfig-like build:
      
             text    data     bss      dec     hex  filename
          6116998 1138540  883788  8139326  7c323e  vmlinux-oi-with-patch
          6137043 1138540  883788  8159371  7c808b  vmlinux-optimize-inlining
      
      ~20k / 0.3% difference.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
20. 10 January 2009 (1 commit)
21. 06 November 2008 (1 commit)
22. 23 October 2008 (2 commits)
23. 23 September 2008 (1 commit)
24. 23 July 2008 (1 commit)
    • x86: consolidate header guards · 77ef50a5
      Committed by Vegard Nossum
      This patch is the result of an automatic script that consolidates the
      format of all the headers in include/asm-x86/.
      
      The format (see the sketch after this list):
      
      1. No leading underscore. Names with leading underscores are reserved.
      2. Pathname components are separated by two underscores. So we can
         distinguish between mm_types.h and mm/types.h.
      3. Everything except letters and numbers is turned into single
         underscores.
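      
      (Applying those rules to this file gives a guard along these lines; the
      exact macro name is inferred from the rules above rather than quoted from
      the patch:)

      	/* include/asm-x86/bitops.h */
      	#ifndef ASM_X86__BITOPS_H
      	#define ASM_X86__BITOPS_H

      	/* ... declarations ... */

      	#endif /* ASM_X86__BITOPS_H */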
      Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
25. 18 July 2008 (1 commit)
26. 21 June 2008 (1 commit)
    • x86, bitops: make constant-bit set/clear_bit ops faster, gcc workaround · 437a0a54
      Committed by Ingo Molnar
      Jeremy Fitzhardinge reported this compiler bug:
      
      Suggestion from Linus: add "r" to the input constraint of
      set_bit()/clear_bit()'s constant 'nr' branch:
      
      Blows up on "gcc version 3.4.4 20050314 (prerelease) (Debian 3.4.3-13)":
      
       CC      init/main.o
      include2/asm/bitops.h: In function `start_kernel':
      include2/asm/bitops.h:59: warning: asm operand 1 probably doesn't match constraints
      include2/asm/bitops.h:59: warning: asm operand 1 probably doesn't match constraints
      include2/asm/bitops.h:59: warning: asm operand 1 probably doesn't match constraints
      include2/asm/bitops.h:59: error: impossible constraint in `asm'
      include2/asm/bitops.h:59: error: impossible constraint in `asm'
      include2/asm/bitops.h:59: error: impossible constraint in `asm'
      Reported-by: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
27. 20 June 2008 (1 commit)
28. 19 June 2008 (1 commit)
    • x86, bitops: make constant-bit set/clear_bit ops faster · 1a750e0c
      Committed by Linus Torvalds
      On Wed, 18 Jun 2008, Linus Torvalds wrote:
      >
      > And yes, the "lock andl" should be noticeably faster than the xchgl.
      
      I dunno. Here's an untested (!!) patch that turns constant-bit
      set/clear_bit ops into byte mask ops (lock orb/andb).
      
      It's not exactly pretty. The reason for using the byte versions is that a
      locked op is serialized in the memory pipeline anyway, so there are no
      forwarding issues (that could slow down things when we access things with
      different sizes), and the byte ops are a lot smaller than 32-bit and
      particularly 64-bit ops (big constants, and the 64-bit ops need the REX
      prefix byte too).
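      
      (A hedged sketch of the shape this gives set_bit(); simplified, and the asm
      constraints of the real kernel version may differ:)

      	#define CONST_MASK(nr)	(1 << ((nr) & 7))

      	static __always_inline void set_bit(unsigned int nr, volatile unsigned long *addr)
      	{
      		if (__builtin_constant_p(nr)) {
      			/* Constant bit: byte-wide locked OR on the byte that
      			 * holds the bit (small encoding, no REX or big
      			 * immediate needed). */
      			asm volatile("lock; orb %1,%0"
      				     : "+m" (((volatile char *)addr)[nr >> 3])
      				     : "iq" ((char)CONST_MASK(nr))
      				     : "memory");
      		} else {
      			/* Variable bit: fall back to the usual "lock bts". */
      			asm volatile("lock; bts %1,%0"
      				     : "+m" (*addr)
      				     : "Ir" (nr)
      				     : "memory");
      		}
      	}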
      
      [ Side note: I wonder if we should turn the "test_bit()" C version into a
        "char *" version too.. It could actually help with alias analysis, since
        char pointers can alias anything. So it might be the RightThing(tm) to
        do for multiple reasons. I dunno. It's a separate issue. ]
      
      It does actually shrink the kernel image a bit (a couple of hundred bytes
      on the text segment for my everything-compiled-in image), and while it's
      totally untested the (admittedly few) code generation points I looked at
      seemed sane. And "lock orb" should be noticeably faster than "lock bts".
      
      If somebody wants to play with it, go wild. I didn't do "change_bit()",
      because nobody sane uses that thing anyway. I guarantee nothing. And if it
      breaks, nobody saw me do anything.  You can't prove this email wasn't sent
      by somebody who is good at forging smtp.
      
      This does require a gcc that is recent enough for "__builtin_constant_p()"
      to work in an inline function, but I suspect our kernel requirements are
      already higher than that. And if you do have an old gcc that is supported,
      the worst that would happen is that the optimization doesn't trigger.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
29. 25 May 2008 (1 commit)
30. 11 May 2008 (1 commit)
31. 27 April 2008 (2 commits)