1. 30 December 2016 (1 commit)
    • mm: optimize PageWaiters bit use for unlock_page() · b91e1302
      Committed by Linus Torvalds
      In commit 62906027 ("mm: add PageWaiters indicating tasks are
      waiting for a page bit") Nick Piggin made our page locking no longer
      unconditionally touch the hashed page waitqueue, which not only helps
      performance in general, but is particularly helpful on NUMA machines
      where the hashed wait queues can bounce around a lot.
      
      However, the "clear lock bit atomically and then test the waiters bit"
      sequence turns out to be much more expensive than it needs to be,
      because you get a nasty stall when trying to access the same word that
      just got updated atomically.
      
      On architectures where locking is done with LL/SC, this would be trivial
      to fix with a new primitive that clears one bit and tests another
      atomically, but that ends up not working on x86, where the only atomic
      operations that return the result end up being cmpxchg and xadd.  The
      atomic bit operations return the old value of the same bit we changed,
      not the value of an unrelated bit.
      
      On x86, we could put the lock bit in the high bit of the byte, and use
      "xadd" with that bit (where the overflow ends up not touching other
      bits), and look at the other bits of the result.  However, an even
      simpler model is to just use a regular atomic "and" to clear the lock
      bit, and then the sign bit in eflags will indicate the resulting state
      of the unrelated bit #7.
      
      So by moving the PageWaiters bit up to bit #7, we can atomically clear
      the lock bit and test the waiters bit on x86 too.  And on architectures
      with LL/SC (which is all the usual RISC suspects), the particular bit
      doesn't matter, so they are fine with this approach too.
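      
      (A minimal sketch of the x86 side of this idea, assuming GCC's flag-output
      asm constraints, the lock bit at bit #0 and the waiters bit at bit #7 of
      the same byte; this illustrates the "atomic andb, then read the sign flag"
      trick and is not necessarily the exact kernel code:)

      	/* Kernel-style sketch; bool is assumed to come from <linux/types.h>. */
      	static inline bool clear_bit_unlock_is_negative_byte(long nr,
      						volatile unsigned long *addr)
      	{
      		bool negative;

      		/* Atomically clear the bit with a byte-wide "lock and"; the
      		 * sign flag afterwards reflects bit #7 of the resulting byte. */
      		asm volatile("lock; andb %2,%1"
      			     : "=@ccs" (negative), "+m" (*(volatile char *)addr)
      			     : "ir" ((char)~(1 << nr))
      			     : "memory");
      		return negative;
      	}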
      
      This avoids the extra access to the same atomic word, and thus avoids
      the costly stall at page unlock time.
      
      The only downside is that the interface ends up being a bit odd and
      specialized: clear a bit in a byte, and test the sign bit.  Nick doesn't
      love the resulting name of the new primitive, but I'd rather make the
      name be descriptive and very clear about the limitation imposed by
      trying to work across all relevant architectures than make it be some
      generic thing that doesn't make the odd semantics explicit.
      
      So this introduces the new architecture primitive
      
          clear_bit_unlock_is_negative_byte();
      
      and adds the trivial implementation for x86.  We have a generic
      non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
      combination) which can be overridden by any architecture that can do
      better.  According to Nick, Power has the same hiccup x86 has, for
      example, but some other architectures may not even care.
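      
      (A hedged sketch of what that generic fallback can look like, built from
      the usual clear_bit_unlock()/test_bit() helpers; the real fallback may
      differ in detail:)

      	#ifndef clear_bit_unlock_is_negative_byte
      	static inline bool clear_bit_unlock_is_negative_byte(long nr,
      						volatile unsigned long *mem)
      	{
      		/* Release the lock bit first... */
      		clear_bit_unlock(nr, mem);
      		/* ...then separately test bit #7 of the same word, which is
      		 * where PageWaiters lives after this change.  Two accesses,
      		 * but always correct. */
      		return test_bit(7, mem);
      	}
      	#endif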
      
      All these optimizations mean that my page locking stress-test (which is
      just executing a lot of small short-lived shell scripts: "make test" in
      the git source tree) no longer makes our page locking look horribly bad.
      Before all these optimizations, the unlock_page() costs alone were just
      over 3% of all CPU overhead on "make test".  After this, it's down to
      0.66%, so just a quarter of the cost it used to be.
      
      (The difference on NUMA is bigger, but there this micro-optimization is
      likely less noticeable, since the big issue on NUMA was not the accesses
      to 'struct page', but the waitqueue accesses that were already removed
      by Nick's earlier commit).
      Acked-by: Nick Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2. 09 June 2016 (4 commits)
3. 09 February 2016 (1 commit)
    • x86/asm/bitops: Force inlining of test_and_set_bit and friends · 8dd5032d
      Committed by Denys Vlasenko
      Sometimes GCC mysteriously doesn't inline very small functions
      we expect to be inlined, see:
      
        https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      
      Arguably, GCC should do better, but GCC people aren't willing
      to invest time into it and are asking to use __always_inline
      instead.
      
      With this .config:
      
        http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os
      
      here's an example of functions getting deinlined many times:
      
        test_and_set_bit (166 copies, ~1260 calls)
               55                      push   %rbp
               48 89 e5                mov    %rsp,%rbp
               f0 48 0f ab 3e          lock bts %rdi,(%rsi)
               72 04                   jb     <test_and_set_bit+0xf>
               31 c0                   xor    %eax,%eax
               eb 05                   jmp    <test_and_set_bit+0x14>
               b8 01 00 00 00          mov    $0x1,%eax
               5d                      pop    %rbp
               c3                      retq
      
        test_and_clear_bit (124 copies, ~1000 calls)
               55                      push   %rbp
               48 89 e5                mov    %rsp,%rbp
               f0 48 0f b3 3e          lock btr %rdi,(%rsi)
               72 04                   jb     <test_and_clear_bit+0xf>
               31 c0                   xor    %eax,%eax
               eb 05                   jmp    <test_and_clear_bit+0x14>
               b8 01 00 00 00          mov    $0x1,%eax
               5d                      pop    %rbp
               c3                      retq
      
        change_bit (3 copies, 8 calls)
               55                      push   %rbp
               48 89 e5                mov    %rsp,%rbp
               f0 48 0f bb 3e          lock btc %rdi,(%rsi)
               5d                      pop    %rbp
               c3                      retq
      
        clear_bit_unlock (2 copies, 11 calls)
               55                      push   %rbp
               48 89 e5                mov    %rsp,%rbp
               f0 48 0f b3 3e          lock btr %rdi,(%rsi)
               5d                      pop    %rbp
               c3                      retq
      
      This patch works around it via s/inline/__always_inline/.
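      
      (The shape of the change, with simplified prototypes; the kernel's actual
      parameter and return types may differ slightly:)

      	/* Before: GCC may or may not honour the inline hint. */
      	static inline int test_and_set_bit(long nr, volatile unsigned long *addr);

      	/* After: force the compiler's hand. */
      	static __always_inline int test_and_set_bit(long nr, volatile unsigned long *addr);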
      
      Code size decreases by ~13.5k after the patch:
      
            text     data      bss       dec    filename
        92110727 20826144 36417536 149354407    vmlinux.before
        92097234 20826176 36417536 149340946    vmlinux.after
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Graf <tgraf@suug.ch>
      Link: http://lkml.kernel.org/r/1454881887-1367-1-git-send-email-dvlasenk@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
4. 14 September 2014 (1 commit)
    • Make ARCH_HAS_FAST_MULTIPLIER a real config variable · 72d93104
      Committed by Linus Torvalds
      It used to be an ad-hoc hack defined by the x86 version of
      <asm/bitops.h> that enabled a couple of library routines to know whether
      an integer multiply is faster than repeated shifts and additions.
      
      This just makes it use the real Kconfig system instead, and makes x86
      (which was the only architecture that did this) select the option.
      
      NOTE! Even for x86, this really is kind of wrong.  If we cared, we would
      probably not enable this for builds optimized for netburst (P4), where
      shifts-and-adds are generally faster than multiplies.  This patch does
      *not* change that kind of logic, though, it is purely a syntactic change
      with no code changes.
      
      This was triggered by the fact that we have other places that really
      want to know "do I want to expand multiplications by constants by hand
      or not", particularly the hash generation code.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5. 18 April 2014 (1 commit)
6. 05 December 2013 (1 commit)
7. 25 September 2013 (1 commit)
8. 17 July 2013 (1 commit)
    • x86, bitops: Change bitops to be native operand size · 9b710506
      Committed by H. Peter Anvin
      Change the bitops operations to take a naturally "long" index, i.e. 63
      usable bits on the 64-bit kernel.  Without this, additional bugs are
      likely to crop up in the future.
      
      We already have bugs on machines with > 16 TiB of memory in a single
      node, as can happen if memory is interleaved.  The x86 bitop operations
      take a signed index, so using an unsigned type is not an option.
      
      Jim Kukunas measured the effect of this patch on kernel size: it adds
      2779 bytes to the allyesconfig kernel.  Some of that probably could be
      elided by replacing the inline functions with macros which select the
      32-bit type if the index is a 32-bit value, something like:
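      
      (A purely hypothetical sketch of such a macro, with invented
      constant_test_bit_32()/constant_test_bit_64() helper names:)

      	#define constant_test_bit(nr, addr)				\
      		(sizeof(nr) == sizeof(u32)				\
      		 ? constant_test_bit_32((nr), (addr))			\
      		 : constant_test_bit_64((nr), (addr)))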
      
      In that case we could also use "Jr" constraints for the 64-bit
      version.
      
      However, this would more than double the amount of code for a
      relatively small gain.
      
      Note that we can't use ilog2() for _BITOPS_LONG_SHIFT, as that causes
      a recursive header inclusion problem.
      
      The change to constant_test_bit() should both generate better code and
      give the correct result for negative bit indices.  As previously written,
      the compiler had to generate extra code to create the proper wrong
      result for negative values.
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Jim Kukunas <james.t.kukunas@intel.com>
      Link: http://lkml.kernel.org/n/tip-z61ofiwe90xeyb461o72h8ya@git.kernel.org
9. 19 September 2012 (1 commit)
10. 13 September 2012 (2 commits)
11. 25 June 2012 (1 commit)
12. 23 May 2012 (1 commit)
13. 16 December 2011 (2 commits)
    • x86_64, asm: Optimise fls(), ffs() and fls64() · ca3d30cc
      Committed by David Howells
      fls(N), ffs(N) and fls64(N) can be optimised on x86_64.  Currently they use a
      CMOV instruction after the BSR/BSF to set the destination register to -1 if the
      value to be scanned was 0 (in which case BSR/BSF set the Z flag).
      
      Instead, according to the AMD64 specification, we can make use of the fact that
      BSR/BSF doesn't modify its output register if its input is 0.  By preloading
      the output with -1 and incrementing the result, we achieve the desired result
      without the need for a conditional check.
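      
      (For contrast, a hedged sketch of the CMOV-based pattern being replaced;
      simplified, not the verbatim kernel code.  The new preload-with-minus-one
      form can be seen in the fls64() used by the benchmark further down:)

      	static inline int fls_cmov(int x)
      	{
      		int r;

      		/* BSR leaves ZF set for x == 0; CMOVZ then forces the result
      		 * to -1 so that the final "+ 1" returns 0. */
      		asm("bsrl %1,%0\n\t"
      		    "cmovzl %2,%0"
      		    : "=&r" (r)
      		    : "rm" (x), "rm" (-1));
      		return r + 1;
      	}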
      
      The Intel x86_64 specification, however, says that the result of BSR/BSF in
      such a case is undefined.  That said, when queried, one of the Intel CPU
      architects said that the behaviour on all Intel CPUs is that:
      
       (1) with BSRQ/BSFQ, the 64-bit destination register is written with its
           original value if the source is 0, thus, in essence, giving the effect we
           want.  And,
      
       (2) with BSRL/BSFL, the lower half of the 64-bit destination register is
           written with its original value if the source is 0, and the upper half is
           cleared, thus giving us the effect we want (we return a 4-byte int).
      
      Further, it was indicated that they (Intel) are unlikely to get away with
      changing the behaviour.
      
      It might be possible to optimise the 32-bit versions of these functions, but
      there's a lot more variation, and so the effective non-destructive property of
      BSRL/BSFL cannot be relied on.
      
      [ hpa: specifically, some 486 chips are known to NOT have this property. ]
      
      I have benchmarked these functions on my Core2 Duo test machine using the
      following program:
      
      	#include <stdlib.h>
      	#include <stdio.h>
      
      	#ifndef __x86_64__
      	#error
      	#endif
      
      	#define PAGE_SHIFT 12
      
      	typedef unsigned long long __u64, u64;
      	typedef unsigned int __u32, u32;
      	#define noinline	__attribute__((noinline))
      
      	static __always_inline int fls64(__u64 x)
      	{
      		long bitpos = -1;
      
      		asm("bsrq %1,%0"
      		    : "+r" (bitpos)
      		    : "rm" (x));
      		return bitpos + 1;
      	}
      
      	static inline unsigned long __fls(unsigned long word)
      	{
      		asm("bsr %1,%0"
      		    : "=r" (word)
      		    : "rm" (word));
      		return word;
      	}
      	static __always_inline int old_fls64(__u64 x)
      	{
      		if (x == 0)
      			return 0;
      		return __fls(x) + 1;
      	}
      
      	static noinline // __attribute__((const))
      	int old_get_order(unsigned long size)
      	{
      		int order;
      
      		size = (size - 1) >> (PAGE_SHIFT - 1);
      		order = -1;
      		do {
      			size >>= 1;
      			order++;
      		} while (size);
      		return order;
      	}
      
      	static inline __attribute__((const))
      	int get_order_old_fls64(unsigned long size)
      	{
      		int order;
      		size--;
      		size >>= PAGE_SHIFT;
      		order = old_fls64(size);
      		return order;
      	}
      
      	static inline __attribute__((const))
      	int get_order(unsigned long size)
      	{
      		int order;
      		size--;
      		size >>= PAGE_SHIFT;
      		order = fls64(size);
      		return order;
      	}
      
      	unsigned long prevent_optimise_out;
      
      	static noinline unsigned long test_old_get_order(void)
      	{
      		unsigned long n, total = 0;
      		long rep, loop;
      
      		for (rep = 1000000; rep > 0; rep--) {
      			for (loop = 0; loop <= 16384; loop += 4) {
      				n = 1UL << loop;
      				total += old_get_order(n);
      			}
      		}
      		return total;
      	}
      
      	static noinline unsigned long test_get_order_old_fls64(void)
      	{
      		unsigned long n, total = 0;
      		long rep, loop;
      
      		for (rep = 1000000; rep > 0; rep--) {
      			for (loop = 0; loop <= 16384; loop += 4) {
      				n = 1UL << loop;
      				total += get_order_old_fls64(n);
      			}
      		}
      		return total;
      	}
      
      	static noinline unsigned long test_get_order(void)
      	{
      		unsigned long n, total = 0;
      		long rep, loop;
      
      		for (rep = 1000000; rep > 0; rep--) {
      			for (loop = 0; loop <= 16384; loop += 4) {
      				n = 1UL << loop;
      				total += get_order(n);
      			}
      		}
      		return total;
      	}
      
      	int main(int argc, char **argv)
      	{
      		unsigned long total;
      
      		switch (argc) {
      		case 1:  total = test_old_get_order();		break;
      		case 2:  total = test_get_order_old_fls64();	break;
      		default: total = test_get_order();		break;
      		}
      		prevent_optimise_out = total;
      		return 0;
      	}
      
      This allows me to test the use of the old fls64() implementation and the new
      fls64() implementation and also to contrast these to the out-of-line loop-based
      implementation of get_order().  The results were:
      
      	warthog>time ./get_order
      	real    1m37.191s
      	user    1m36.313s
      	sys     0m0.861s
      	warthog>time ./get_order x
      	real    0m16.892s
      	user    0m16.586s
      	sys     0m0.287s
      	warthog>time ./get_order x x
      	real    0m7.731s
      	user    0m7.727s
      	sys     0m0.002s
      
      Using the current upstream fls64() as a basis for an inlined get_order() [the
      second result above] is much faster than using the current out-of-line
      loop-based get_order() [the first result above].
      
      Using my optimised inline fls64()-based get_order() [the third result above]
      is even faster still.
      
      [ hpa: changed the selection of 32 vs 64 bits to use CONFIG_X86_64
        instead of comparing BITS_PER_LONG, updated comments, rebased manually
        on top of 83d99df7 x86, bitops: Move fls64.h inside __KERNEL__ ]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86, bitops: Move fls64.h inside __KERNEL__ · 83d99df7
      Committed by H. Peter Anvin
      We would include <asm-generic/bitops/fls64.h> even without __KERNEL__,
      but that doesn't make sense, as:
      
      1. That file provides fls64(), but the corresponding function fls() is
         not exported to user space.
      2. The implementation of fls64.h uses kernel-only symbols.
      3. fls64.h is not exported to user space.
      
      This appears to have been a bug introduced in checkin:
      
      d57594c2 bitops: use __fls for fls64 on 64-bit archs
      
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Alexander van Heukelum <heukelum@mailshack.com>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Link: http://lkml.kernel.org/r/4EEA77E1.6050009@zytor.com
14. 27 July 2011 (1 commit)
15. 24 March 2011 (3 commits)
    • bitops: remove minix bitops from asm/bitops.h · 61f2e7b0
      Committed by Akinobu Mita
      The minix bit operations are only used by the minix filesystem and are
      useless to other modules, because the byte order of the inode and block
      bitmaps differs between architectures, as shown below:
      
      m68k:
      	big-endian 16bit indexed bitmaps
      
      h8300, microblaze, s390, sparc, m68knommu:
      	big-endian 32 or 64bit indexed bitmaps
      
      m32r, mips, sh, xtensa:
      	big-endian 32 or 64bit indexed bitmaps for big-endian mode
      	little-endian bitmaps for little-endian mode
      
      Others:
      	little-endian bitmaps
      
      In order to move minix bit operations from asm/bitops.h to architecture
      independent code in minix filesystem, this provides two config options.
      
      CONFIG_MINIX_FS_BIG_ENDIAN_16BIT_INDEXED is only selected by m68k.
      CONFIG_MINIX_FS_NATIVE_ENDIAN is selected by the architectures which use
      native byte order bitmaps (h8300, microblaze, s390, sparc, m68knommu,
      m32r, mips, sh, xtensa).  The architectures which always use little-endian
      bitmaps do not select these options.
      
      Finally, we can remove minix bit operations from asm/bitops.h for all
      architectures.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Greg Ungerer <gerg@uclinux.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Andreas Schwab <schwab@linux-m68k.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: Paul Mundt <lethal@linux-sh.org>
      Cc: Chris Zankel <chris@zankel.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bitops: remove ext2 non-atomic bitops from asm/bitops.h · f312eff8
      Committed by Akinobu Mita
      As a result of the conversions, there are no users of the ext2 non-atomic
      bit operations except for the ext2 filesystem itself.  Now we can move
      them into architecture-independent code in the ext2 filesystem and remove
      them from asm/bitops.h for all architectures.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bitops: introduce little-endian bitops for most architectures · 861b5ae7
      Committed by Akinobu Mita
      Introduce little-endian bit operations on the big-endian architectures
      that do not have native little-endian bit operations, and on the
      little-endian architectures.  (alpha, avr32, blackfin, cris, frv, h8300,
      ia64, m32r, mips, mn10300, parisc, sh, sparc, tile, x86, xtensa)
      
      These architectures can just include the generic implementation
      (asm-generic/bitops/le.h).
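      
      (A hedged sketch of the kind of helper asm-generic/bitops/le.h provides,
      reduced to a single operation; details may differ from the real header:)

      	/* On big-endian machines the "little-endian" bit number is swizzled
      	 * within each long; on little-endian machines this is a no-op. */
      	#if defined(__BIG_ENDIAN)
      	#define BITOP_LE_SWIZZLE	((BITS_PER_LONG - 1) & ~0x7)
      	#else
      	#define BITOP_LE_SWIZZLE	0
      	#endif

      	static inline int test_bit_le(int nr, const void *addr)
      	{
      		return test_bit(nr ^ BITOP_LE_SWIZZLE, addr);
      	}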
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Grant Grundler <grundler@parisc-linux.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
      Acked-by: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16. 10 October 2010 (1 commit)
    • bitops: make asm-generic/bitops/find.h more generic · 708ff2a0
      Committed by Akinobu Mita
      asm-generic/bitops/find.h has the extern declarations of find_next_bit()
      and find_next_zero_bit() and the macro definitions of find_first_bit()
      and find_first_zero_bit(). It is only usable by the architectures which
      enable CONFIG_GENERIC_FIND_NEXT_BIT and disable
      CONFIG_GENERIC_FIND_FIRST_BIT.
      
      x86 and tile enable both CONFIG_GENERIC_FIND_NEXT_BIT and
      CONFIG_GENERIC_FIND_FIRST_BIT. These architectures cannot include
      asm-generic/bitops/find.h in their asm/bitops.h. So ifdef'ed extern
      declarations of find_first_bit() and find_first_zero_bit() are put in
      linux/bitops.h.
      
      This makes asm-generic/bitops/find.h usable by these architectures,
      which now include it. This change is also needed for the forthcoming
      cleanup of duplicated extern declarations.
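      
      (A hedged sketch of the resulting arrangement; the prototypes are
      simplified, and the fallback macros in the #else branch are an assumption
      about how the non-GENERIC_FIND_FIRST_BIT case is expressed in terms of
      find_next_*():)

      	#ifdef CONFIG_GENERIC_FIND_FIRST_BIT
      	extern unsigned long find_first_bit(const unsigned long *addr,
      					    unsigned long size);
      	extern unsigned long find_first_zero_bit(const unsigned long *addr,
      						 unsigned long size);
      	#else
      	#define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
      	#define find_first_zero_bit(addr, size) find_next_zero_bit((addr), (size), 0)
      	#endif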
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: Chris Metcalf <cmetcalf@tilera.com>
17. 27 September 2010 (1 commit)
18. 07 April 2010 (1 commit)
    • x86: Add optimized popcnt variants · d61931d8
      Committed by Borislav Petkov
      Add support for the hardware version of the Hamming weight function,
      popcnt, present in CPUs which advertise it under CPUID, Function
      0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
      default lib/hweight.c software versions.
      
      A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
      a 3x speedup on an F10h machine.
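      
      (A hedged userspace illustration of the same idea; the helper names here
      are invented, and the kernel itself wires the lib/hweight.c fallback in
      differently, but the effect is the same:)

      	#include <stdio.h>

      	static unsigned int sw_hweight32(unsigned int w)
      	{
      		unsigned int count = 0;

      		for (; w; w &= w - 1)	/* clear the lowest set bit */
      			count++;
      		return count;
      	}

      	static unsigned int hw_hweight32(unsigned int w)
      	{
      		unsigned int res;

      		asm("popcnt %1,%0" : "=r" (res) : "rm" (w));
      		return res;
      	}

      	int main(void)
      	{
      		unsigned int x = 0xdeadbeef;
      		unsigned int c = __builtin_cpu_supports("popcnt")
      				 ? hw_hweight32(x) : sw_hweight32(x);

      		printf("%u\n", c);	/* prints 24 */
      		return 0;
      	}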
      Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
      LKML-Reference: <20100318112015.GC11152@aftab>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
19. 14 January 2009 (1 commit)
    • x86, generic: mark complex bitops.h inlines as __always_inline · c8399943
      Committed by Andi Kleen
      Impact: reduce kernel image size
      
      Hugh Dickins noticed that older gcc versions didn't inline some of
      the bitops when the kernel is built for code size.
      
      Mark all complex x86 bitops that have more than a single
      asm statement or two as always inline to avoid this problem.
      
      Probably should be done for other architectures too.
      
      Ingo then found a better fix that only requires
      a single line change, but it unfortunately only
      works on gcc 4.3.
      
      On older gccs the original patch still makes a ~0.3% defconfig
      difference with CONFIG_OPTIMIZE_INLINING=y.
      
      With gcc 4.1 and a defconfig-like build:
      
             text    data     bss      dec     hex  filename
          6116998 1138540  883788  8139326  7c323e  vmlinux-oi-with-patch
          6137043 1138540  883788  8159371  7c808b  vmlinux-optimize-inlining
      
      ~20k / 0.3% difference.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
20. 10 January 2009 (1 commit)
21. 06 November 2008 (1 commit)
22. 23 October 2008 (2 commits)
23. 23 September 2008 (1 commit)
24. 23 July 2008 (1 commit)
    • x86: consolidate header guards · 77ef50a5
      Committed by Vegard Nossum
      This patch is the result of an automatic script that consolidates the
      format of all the headers in include/asm-x86/.
      
      The format (see the sketch after this list):
      
      1. No leading underscore. Names with leading underscores are reserved.
      2. Pathname components are separated by two underscores. So we can
         distinguish between mm_types.h and mm/types.h.
      3. Everything except letters and numbers is turned into single
         underscores.
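      
      (Applying those rules to this file gives a guard along these lines; the
      exact macro name is inferred from the rules above rather than quoted from
      the patch:)

      	/* include/asm-x86/bitops.h */
      	#ifndef ASM_X86__BITOPS_H
      	#define ASM_X86__BITOPS_H

      	/* ... declarations ... */

      	#endif /* ASM_X86__BITOPS_H */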
      Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
25. 18 July 2008 (1 commit)
26. 21 June 2008 (1 commit)
    • x86, bitops: make constant-bit set/clear_bit ops faster, gcc workaround · 437a0a54
      Committed by Ingo Molnar
      Jeremy Fitzhardinge reported this compiler bug:
      
      Suggestion from Linus: add "r" to the input constraint of
      set_bit()/clear_bit()'s constant 'nr' branch:
      
      Blows up on "gcc version 3.4.4 20050314 (prerelease) (Debian 3.4.3-13)":
      
       CC      init/main.o
      include2/asm/bitops.h: In function `start_kernel':
      include2/asm/bitops.h:59: warning: asm operand 1 probably doesn't match constraints
      include2/asm/bitops.h:59: warning: asm operand 1 probably doesn't match constraints
      include2/asm/bitops.h:59: warning: asm operand 1 probably doesn't match constraints
      include2/asm/bitops.h:59: error: impossible constraint in `asm'
      include2/asm/bitops.h:59: error: impossible constraint in `asm'
      include2/asm/bitops.h:59: error: impossible constraint in `asm'
      Reported-by: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
27. 20 June 2008 (1 commit)
28. 19 June 2008 (1 commit)
    • x86, bitops: make constant-bit set/clear_bit ops faster · 1a750e0c
      Committed by Linus Torvalds
      On Wed, 18 Jun 2008, Linus Torvalds wrote:
      >
      > And yes, the "lock andl" should be noticeably faster than the xchgl.
      
      I dunno. Here's an untested (!!) patch that turns constant-bit
      set/clear_bit ops into byte mask ops (lock orb/andb).
      
      It's not exactly pretty. The reason for using the byte versions is that a
      locked op is serialized in the memory pipeline anyway, so there are no
      forwarding issues (that could slow down things when we access things with
      different sizes), and the byte ops are a lot smaller than 32-bit and
      particularly 64-bit ops (big constants, and the 64-bit ops need the REX
      prefix byte too).
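      
      (A hedged sketch of the shape this gives set_bit(); simplified, and the asm
      constraints of the real kernel version may differ:)

      	#define CONST_MASK(nr)	(1 << ((nr) & 7))

      	static __always_inline void set_bit(unsigned int nr, volatile unsigned long *addr)
      	{
      		if (__builtin_constant_p(nr)) {
      			/* Constant bit: byte-wide locked OR on the byte that
      			 * holds the bit (small encoding, no REX or big
      			 * immediate needed). */
      			asm volatile("lock; orb %1,%0"
      				     : "+m" (((volatile char *)addr)[nr >> 3])
      				     : "iq" ((char)CONST_MASK(nr))
      				     : "memory");
      		} else {
      			/* Variable bit: fall back to the usual "lock bts". */
      			asm volatile("lock; bts %1,%0"
      				     : "+m" (*addr)
      				     : "Ir" (nr)
      				     : "memory");
      		}
      	}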
      
      [ Side note: I wonder if we should turn the "test_bit()" C version into a
        "char *" version too.. It could actually help with alias analysis, since
        char pointers can alias anything. So it might be the RightThing(tm) to
        do for multiple reasons. I dunno. It's a separate issue. ]
      
      It does actually shrink the kernel image a bit (a couple of hundred bytes
      on the text segment for my everything-compiled-in image), and while it's
      totally untested the (admittedly few) code generation points I looked at
      seemed sane. And "lock orb" should be noticeably faster than "lock bts".
      
      If somebody wants to play with it, go wild. I didn't do "change_bit()",
      because nobody sane uses that thing anyway. I guarantee nothing. And if it
      breaks, nobody saw me do anything.  You can't prove this email wasn't sent
      by somebody who is good at forging smtp.
      
      This does require a gcc that is recent enough for "__builtin_constant_p()"
      to work in an inline function, but I suspect our kernel requirements are
      already higher than that. And if you do have an old gcc that is supported,
      the worst that would happen is that the optimization doesn't trigger.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
29. 25 May 2008 (1 commit)
30. 11 May 2008 (1 commit)
31. 27 April 2008 (2 commits)