提交 · 56142536868a2be34f261ed8fdca1610f8a73fbd · OpenHarmony / kernel_linux

29 4月, 2006 1 次提交

Remove unneeded _syscallX macros from user view in asm-*/unistd.h · 56142536

由 David Woodhouse 提交于 4月 29, 2006

These aren't needed by glibc or klibc, and they're broken in some cases
anyway. The uClibc folks are apparently switching over to stop using
them too (now that we agreed that they should be dropped, at least).
Signed-off-by: NDavid Woodhouse <dwmw2@infradead.org>

56142536

27 4月, 2006 1 次提交
- D
  Exclude asm-generic/{page,memory_model}.h from user bits of i386/x86_64 page.h · cd469e0c
  由 David Woodhouse 提交于 4月 27, 2006
```
Signed-off-by: NDavid Woodhouse <dwmw2@infradead.org>
```
  cd469e0c
26 4月, 2006 2 次提交

D
Don't include linux/config.h from anywhere else in include/ · 62c4f0a2
由 David Woodhouse 提交于 4月 26, 2006
```
Signed-off-by: NDavid Woodhouse <dwmw2@infradead.org>
```
62c4f0a2

[PATCH] Add support for the sys_vmsplice syscall · 912d35f8

由 Jens Axboe 提交于 4月 26, 2006

sys_splice() moves data to/from pipes with a file input/output. sys_vmsplice()
moves data to a pipe, with the input being a user address range instead.

This uses an approach suggested by Linus, where we can hold partial ranges
inside the pages[] map. Hopefully this will be useful for network
receive support as well.
Signed-off-by: NJens Axboe <axboe@suse.de>

912d35f8

20 4月, 2006 3 次提交

[PATCH] x86_64: bring back __read_mostly support to linux-2.6.17-rc2 · 0b699e36

由 Eric Dumazet 提交于 4月 20, 2006

It seems latest kernel has a wrong/missing __read_mostly implementation
for x86_64

__read_mostly macro should be declared outside of #if CONFIG_X86_VSMP block
Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

0b699e36

[PATCH] i386/x86-64: Fix x87 information leak between processes · 18bd057b

由 Andi Kleen 提交于 4月 20, 2006

AMD K7/K8 CPUs only save/restore the FOP/FIP/FDP x87 registers in FXSAVE
when an exception is pending.  This means the value leak through
context switches and allow processes to observe some x87 instruction
state of other processes.

This was actually documented by AMD, but nobody recognized it as
being different from Intel before.

The fix first adds an optimization: instead of unconditionally
calling FNCLEX after each FXSAVE test if ES is pending and skip
it when not needed. Then do a x87 load from a kernel variable to
clear FOP/FIP/FDP.

This means other processes always will only see a constant value
defined by the kernel in their FP state.

I took some pain to make sure to chose a variable that's already
in L1 during context switch to make the overhead of this low.

Also alternative() is used to patch away the new code on CPUs
who don't need it.

Patch for both i386/x86-64.

The problem was discovered originally by Jan Beulich. Richard
Brunner provided the basic code for the workarounds, with contribution
from Jan.

This is CVE-2006-1056

Cc: richard.brunner@amd.com
Cc: jbeulich@novell.com
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

18bd057b

[PATCH] for_each_possible_cpu: x86_64 · 676ff453

由 KAMEZAWA Hiroyuki 提交于 4月 18, 2006

for_each_cpu() actually iterates across all possible CPUs.  We've had
mistakes in the past where people were using for_each_cpu() where they
should have been iterating across only online or present CPUs.  This is
inefficient and possibly buggy.

We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this
in the future.

This patch replaces for_each_cpu with for_each_possible_cpu.
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: NAndi Kleen <ak@muc.de>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

676ff453

19 4月, 2006 2 次提交

[PATCH] x86_64: Add tee and sync_file_range · f1233ab2

由 Andi Kleen 提交于 4月 18, 2006

tee was already there for some reason for native 64bit, but
sys_sync_file_range was missing. Also add it to the compat layer.
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

f1233ab2

[PATCH] x86_64: Increase NUMA hash function nodemap · 6fa679fd

由 Andi Kleen 提交于 4月 18, 2006

Needed for some big Opteron systems to compute a numa hash function
They have more than 12 bits significant address.

TBD switch this over to dynamic allocation or use better hash
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

6fa679fd

11 4月, 2006 4 次提交

[PATCH] splice: add support for sys_tee() · 70524490

由 Jens Axboe 提交于 4月 11, 2006

Basically an in-kernel implementation of tee, which uses splice and the
pipe buffers as an intelligent way to pass data around by reference.

Where the user space tee consumes the input and produces a stdout and
file output, this syscall merely duplicates the data inside a pipe to
another pipe. No data is copied, the output just grabs a reference to the
input pipe data.
Signed-off-by: NJens Axboe <axboe@suse.de>

70524490

[PATCH] x86_64: inline function prefix with __always_inline in vsyscall · cde227af

由 mao, bibo 提交于 4月 11, 2006

In vsyscall function do_vgettimeofday(), some functions are declared as
inlined, which is a hint for gcc to compile the function inlined but it
not forced. Sometimes compiler does not compile the function as
inlined, so here inline is replaced by __always_inline prefix.

It does not happen in gcc compiler actually, but it possibly happens.
Signed-off-by: Nbibo mao <bibo.mao@intel.com>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

cde227af

[PATCH] x86_64: fix sync before RDTSC on Intel cpus · e4cff6ac

由 Siddha, Suresh B 提交于 4月 11, 2006

Commit c818a181 didn't do the expected
thing.  This fix will remove the additional sync(cpuid) before RDTSC on
Intel platforms..
Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

e4cff6ac

[PATCH] Configurable NODES_SHIFT · c80d79d7

由 Yasunori Goto 提交于 4月 10, 2006

Current implementations define NODES_SHIFT in include/asm-xxx/numnodes.h for
each arch.  Its definition is sometimes configurable.  Indeed, ia64 defines 5
NODES_SHIFT values in the current git tree.  But it looks a bit messy.

SGI-SN2(ia64) system requires 1024 nodes, and the number of nodes already has
been changeable by config.  Suitable node's number may be changed in the
future even if it is other architecture.  So, I wrote configurable node's
number.

This patch set defines just default value for each arch which needs multi
nodes except ia64.  But, it is easy to change to configurable if necessary.

On ia64 the number of nodes can be already configured in generic ia64 and SN2
config.  But, NODES_SHIFT is defined for DIG64 and HP'S machine too.  So, I
changed it so that all platforms can be configured via CONFIG_NODES_SHIFT.  It
would be simpler.

See also: http://marc.theaimsgroup.com/?l=linux-kernel&m=114358010523896&w=2Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jack Steiner <steiner@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

c80d79d7

10 4月, 2006 8 次提交

[PATCH] x86_64: Eliminate IA32_NR_syscalls define · 67d53ea5

由 Andi Kleen 提交于 4月 07, 2006

Or rather compute it based on the table length automatically.

This also has the intended side effect of not warning for new system calls
anymore.
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

67d53ea5

[PATCH] x86_64: fix CONFIG_REORDER · bbd3aff8

由 Sam Ravnborg 提交于 4月 07, 2006

Fix CONFIG_REORDER.

The value of cflags-y was assined to CFLAGS before cflags-y was assigned
the value used for CONFIG_REORDER.

Use cflags-y for all CFLAGS options in the Makefile to avoid this
happening again.
Signed-off-by: NSam Ravnborg <sam@ravnborg.org>
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

bbd3aff8

[PATCH] x86_64: Fix drift with HPET timer enabled · b20367a6

由 Jordan Hargrave 提交于 4月 07, 2006

If the HPET timer is enabled, the clock can drift by ~3 seconds a day.
This is due to the HPET timer not being initialized with the correct
setting (still using PIT count).

If HZ changes, this drift can become even more pronounced.

HPET patch initializes tick_nsec with correct tick_nsec settings for
HPET timer.

Vojtech comments:

  "It's not entirely correct (it assumes the HPET ticks totally
   exactly), but it's significantly better than assuming the PIT error
   there."
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

b20367a6

[PATCH] x86_64: Don't run NMI watchdog during machine checks · 553f265f

由 Andi Kleen 提交于 4月 07, 2006

Machine checks can stall the machine for a long time and
it's not good to trigger the nmi watchdog during that.
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

553f265f

[PATCH] x86_64: extra NODES_SHIFT definition · be56db61

由 Dave Hansen 提交于 4月 07, 2006

The generic linux/numa.h file defines NODES_SHIFT to 0 in case
the architecture did not.

Every architecture which has a NUMA config option defines
NODES_SHIFT in its asm-$ARCH headers, but only if NUMA is
enabled, except for x86_64.

This should make it like all the rest.
Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

be56db61

[PATCH] x86_64: Introduce e820_all_mapped · 95222368

由 Arjan van de Ven 提交于 4月 07, 2006

Introduce a e820_all_mapped() function which checks if the entire range
<start,end> is mapped with type.

This is done by moving the local start variable to the end of each
known-good region; if at the end of the function the start address is
still before end, there must be a part that's not of the correct type;
otherwise it's a good region.
Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

95222368

[PATCH] x86_64: Rename e820_mapped to e820_any_mapped · eee5a9fa

由 Arjan van de Ven 提交于 4月 07, 2006

Rename e820_mapped to e820_any_mapped since it tests if any part of the
range is mapped according to the type.

Later steps will introduce e820_all_mapped which will check if the
entire range is mapped with the type.  Both have their merit.
Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

eee5a9fa

[PATCH] x86_64: Reserve SRAT hotadd memory on x86-64 · 68a3a7fe

由 Andi Kleen 提交于 4月 07, 2006

From: Keith Mannthey, Andi Kleen

Implement memory hotadd without sparsemem. The memory in the SRAT
hotadd area is just preserved instead and can be activated later.

There are a few restrictions:
- Only one continuous hotadd area allowed per node

The main problem is dealing with the many buggy SRAT tables
that are out there. The strategy here is to reject anything
suspicious.

Originally from Keith Mannthey, with several hacks and changes by AK
and also contributions from Andrew Morton

[ TBD: Problems pointed out by KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>:

 1) Goto's rebuild_zonelist patch will not work if CONFIG_MEMORY_HOTPLUG=n.

    Rebuilding zonelist is necessary when the system has just memory <
    4G at boot, and hot add memory > 4G.  because x86_64 has DMA32,
    ZONE_NORAML is not included into zonelist at boot time if system
    doesn't have memory >4G at boot.

    [AK: should just force the higher zones at boot time when SRAT tells us]

 2) zone and node's spanned_pages and present_pages are not incremented.
    They should be.

    For example, our server (ia64/Fujitsu PrimeQuest) can equip memory
    from 4G to 1T(maybe 2T in future), and SRAT will *always* say we have
    possible 1T +memory.  (Microsoft requires "write all possible memory
    in SRAT") When we reserve memmap for possible 1T memory, Linux will
    not work well in +minimum 4G configuraion ;)

    [AK: needs limiting to 5-10% of max memory]
 ]
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

68a3a7fe

01 4月, 2006 1 次提交

[PATCH] make local_t signed · 2cf8d82d

由 Andrew Morton 提交于 3月 31, 2006

local_t's were defined to be unsigned.  This increases confusion because
atomic_t's are signed.  The patch goes through and changes all implementations
to use signed longs throughout.

Also, x86-64 was using 32-bit quantities for the value passed into local_add()
and local_sub().  Fixed.

All (actually, both) existing users have been audited.

(Also s/__inline__/inline/ in x86_64/local.h)

Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

2cf8d82d

31 3月, 2006 1 次提交

[PATCH] Introduce sys_splice() system call · 5274f052

由 Jens Axboe 提交于 3月 30, 2006

This adds support for the sys_splice system call. Using a pipe as a
transport, it can connect to files or sockets (latter as output only).

From the splice.c comments:

   "splice": joining two ropes together by interweaving their strands.

   This is the "extended pipe" functionality, where a pipe is used as
   an arbitrary in-memory buffer. Think of a pipe as a small kernel
   buffer that you can use to transfer data from one end to the other.

   The traditional unix read/write is extended with a "splice()" operation
   that transfers data buffers to or from a pipe buffer.

   Named by Larry McVoy, original implementation from Linus, extended by
   Jens to support splicing to files and fixing the initial implementation
   bugs.
Signed-off-by: NJens Axboe <axboe@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

5274f052

29 3月, 2006 1 次提交

[PATCH] arch/i386/kernel/microcode.c: remove the obsolete microcode_ioctl · f45e4656

由 Adrian Bunk 提交于 3月 28, 2006

Nowadays, even Debian stable ships a microcode_ctl utility recent enough to no
longer use this ioctl.
Signed-off-by: NAdrian Bunk <bunk@stusta.de>
Acked-by: NTigran Aivazian <tigran_aivazian@symantec.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

f45e4656

28 3月, 2006 6 次提交

[PATCH] Notifier chain update: API changes · e041c683

由 Alan Stern 提交于 3月 27, 2006

The kernel's implementation of notifier chains is unsafe.  There is no
protection against entries being added to or removed from a chain while the
chain is in use.  The issues were discussed in this thread:

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2

We noticed that notifier chains in the kernel fall into two basic usage
classes:

	"Blocking" chains are always called from a process context
	and the callout routines are allowed to sleep;

	"Atomic" chains can be called from an atomic context and
	the callout routines are not allowed to sleep.

We decided to codify this distinction and make it part of the API.  Therefore
this set of patches introduces three new, parallel APIs: one for blocking
notifiers, one for atomic notifiers, and one for "raw" notifiers (which is
really just the old API under a new name).  New kinds of data structures are
used for the heads of the chains, and new routines are defined for
registration, unregistration, and calling a chain.  The three APIs are
explained in include/linux/notifier.h and their implementation is in
kernel/sys.c.

With atomic and blocking chains, the implementation guarantees that the chain
links will not be corrupted and that chain callers will not get messed up by
entries being added or removed.  For raw chains the implementation provides no
guarantees at all; users of this API must provide their own protections.  (The
idea was that situations may come up where the assumptions of the atomic and
blocking APIs are not appropriate, so it should be possible for users to
handle these things in their own way.)

There are some limitations, which should not be too hard to live with.  For
atomic/blocking chains, registration and unregistration must always be done in
a process context since the chain is protected by a mutex/rwsem.  Also, a
callout routine for a non-raw chain must not try to register or unregister
entries on its own chain.  (This did happen in a couple of places and the code
had to be changed to avoid it.)

Since atomic chains may be called from within an NMI handler, they cannot use
spinlocks for synchronization.  Instead we use RCU.  The overhead falls almost
entirely in the unregister routine, which is okay since unregistration is much
less frequent that calling a chain.

Here is the list of chains that we adjusted and their classifications.  None
of them use the raw API, so for the moment it is only a placeholder.

  ATOMIC CHAINS
  -------------
arch/i386/kernel/traps.c:		i386die_chain
arch/ia64/kernel/traps.c:		ia64die_chain
arch/powerpc/kernel/traps.c:		powerpc_die_chain
arch/sparc64/kernel/traps.c:		sparc64die_chain
arch/x86_64/kernel/traps.c:		die_chain
drivers/char/ipmi/ipmi_si_intf.c:	xaction_notifier_list
kernel/panic.c:				panic_notifier_list
kernel/profile.c:			task_free_notifier
net/bluetooth/hci_core.c:		hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c:	ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c:	ip_conntrack_expect_chain
net/ipv6/addrconf.c:			inet6addr_chain
net/netfilter/nf_conntrack_core.c:	nf_conntrack_chain
net/netfilter/nf_conntrack_core.c:	nf_conntrack_expect_chain
net/netlink/af_netlink.c:		netlink_chain

  BLOCKING CHAINS
  ---------------
arch/powerpc/platforms/pseries/reconfig.c:	pSeries_reconfig_chain
arch/s390/kernel/process.c:		idle_chain
arch/x86_64/kernel/process.c		idle_notifier
drivers/base/memory.c:			memory_chain
drivers/cpufreq/cpufreq.c		cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c		cpufreq_transition_notifier_list
drivers/macintosh/adb.c:		adb_client_list
drivers/macintosh/via-pmu.c		sleep_notifier_list
drivers/macintosh/via-pmu68k.c		sleep_notifier_list
drivers/macintosh/windfarm_core.c	wf_client_list
drivers/usb/core/notify.c		usb_notifier_list
drivers/video/fbmem.c			fb_notifier_list
kernel/cpu.c				cpu_chain
kernel/module.c				module_notify_list
kernel/profile.c			munmap_notifier
kernel/profile.c			task_exit_notifier
kernel/sys.c				reboot_notifier_list
net/core/dev.c				netdev_chain
net/decnet/dn_dev.c:			dnaddr_chain
net/ipv4/devinet.c:			inetaddr_chain

It's possible that some of these classifications are wrong.  If they are,
please let us know or submit a patch to fix them.  Note that any chain that
gets called very frequently should be atomic, because the rwsem read-locking
used for blocking chains is very likely to incur cache misses on SMP systems.
(However, if the chain's callout routines may sleep then the chain cannot be
atomic.)

The patch set was written by Alan Stern and Chandra Seetharaman, incorporating
material written by Keith Owens and suggestions from Paul McKenney and Andrew
Morton.

[jes@sgi.com: restructure the notifier chain initialization macros]
Signed-off-by: NAlan Stern <stern@rowland.harvard.edu>
Signed-off-by: NChandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: NJes Sorensen <jes@sgi.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

e041c683

[PATCH] lightweight robust futexes updates · 8f17d3a5

由 Ingo Molnar 提交于 3月 27, 2006

- fix: initialize the robust list(s) to NULL in copy_process.

- doc update

- cleanup: rename _inuser to _inatomic

- __user cleanups and other small cleanups
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

8f17d3a5

[PATCH] lightweight robust futexes: x86_64 · 8fdd6c6d

由 Ingo Molnar 提交于 3月 27, 2006

x86_64: add the futex_atomic_cmpxchg_inuser() assembly implementation, and
wire up the new syscalls.
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NArjan van de Ven <arjan@infradead.org>
Acked-by: NUlrich Drepper <drepper@redhat.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

8fdd6c6d

[PATCH] lightweight robust futexes: arch defaults · e9056f13

由 Ingo Molnar 提交于 3月 27, 2006

This patchset provides a new (written from scratch) implementation of robust
futexes, called "lightweight robust futexes".  We believe this new
implementation is faster and simpler than the vma-based robust futex solutions
presented before, and we'd like this patchset to be adopted in the upstream
kernel.  This is version 1 of the patchset.

  Background
  ----------

What are robust futexes?  To answer that, we first need to understand what
futexes are: normal futexes are special types of locks that in the
noncontended case can be acquired/released from userspace without having to
enter the kernel.

A futex is in essence a user-space address, e.g.  a 32-bit lock variable
field.  If userspace notices contention (the lock is already owned and someone
else wants to grab it too) then the lock is marked with a value that says
"there's a waiter pending", and the sys_futex(FUTEX_WAIT) syscall is used to
wait for the other guy to release it.  The kernel creates a 'futex queue'
internally, so that it can later on match up the waiter with the waker -
without them having to know about each other.  When the owner thread releases
the futex, it notices (via the variable value) that there were waiter(s)
pending, and does the sys_futex(FUTEX_WAKE) syscall to wake them up.  Once all
waiters have taken and released the lock, the futex is again back to
'uncontended' state, and there's no in-kernel state associated with it.  The
kernel completely forgets that there ever was a futex at that address.  This
method makes futexes very lightweight and scalable.

"Robustness" is about dealing with crashes while holding a lock: if a process
exits prematurely while holding a pthread_mutex_t lock that is also shared
with some other process (e.g.  yum segfaults while holding a pthread_mutex_t,
or yum is kill -9-ed), then waiters for that lock need to be notified that the
last owner of the lock exited in some irregular way.

To solve such types of problems, "robust mutex" userspace APIs were created:
pthread_mutex_lock() returns an error value if the owner exits prematurely -
and the new owner can decide whether the data protected by the lock can be
recovered safely.

There is a big conceptual problem with futex based mutexes though: it is the
kernel that destroys the owner task (e.g.  due to a SEGFAULT), but the kernel
cannot help with the cleanup: if there is no 'futex queue' (and in most cases
there is none, futexes being fast lightweight locks) then the kernel has no
information to clean up after the held lock!  Userspace has no chance to clean
up after the lock either - userspace is the one that crashes, so it has no
opportunity to clean up.  Catch-22.

In practice, when e.g.  yum is kill -9-ed (or segfaults), a system reboot is
needed to release that futex based lock.  This is one of the leading
bugreports against yum.

To solve this problem, 'Robust Futex' patches were created and presented on
lkml: the one written by Todd Kneisel and David Singleton is the most advanced
at the moment.  These patches all tried to extend the futex abstraction by
registering futex-based locks in the kernel - and thus give the kernel a
chance to clean up.

E.g.  in David Singleton's robust-futex-6.patch, there are 3 new syscall
variants to sys_futex(): FUTEX_REGISTER, FUTEX_DEREGISTER and FUTEX_RECOVER.
The kernel attaches such robust futexes to vmas (via
vma->vm_file->f_mapping->robust_head), and at do_exit() time, all vmas are
searched to see whether they have a robust_head set.

Lots of work went into the vma-based robust-futex patch, and recently it has
improved significantly, but unfortunately it still has two fundamental
problems left:

 - they have quite complex locking and race scenarios.  The vma-based
   patches had been pending for years, but they are still not completely
   reliable.

 - they have to scan _every_ vma at sys_exit() time, per thread!

The second disadvantage is a real killer: pthread_exit() takes around 1
microsecond on Linux, but with thousands (or tens of thousands) of vmas every
pthread_exit() takes a millisecond or more, also totally destroying the CPU's
L1 and L2 caches!

This is very much noticeable even for normal process sys_exit_group() calls:
the kernel has to do the vma scanning unconditionally!  (this is because the
kernel has no knowledge about how many robust futexes there are to be cleaned
up, because a robust futex might have been registered in another task, and the
futex variable might have been simply mmap()-ed into this process's address
space).

This huge overhead forced the creation of CONFIG_FUTEX_ROBUST, but worse than
that: the overhead makes robust futexes impractical for any type of generic
Linux distribution.

So it became clear to us, something had to be done.  Last week, when Thomas
Gleixner tried to fix up the vma-based robust futex patch in the -rt tree, he
found a handful of new races and we were talking about it and were analyzing
the situation.  At that point a fundamentally different solution occured to
me.  This patchset (written in the past couple of days) implements that new
solution.  Be warned though - the patchset does things we normally dont do in
Linux, so some might find the approach disturbing.  Parental advice
recommended ;-)

  New approach to robust futexes
  ------------------------------

At the heart of this new approach there is a per-thread private list of robust
locks that userspace is holding (maintained by glibc) - which userspace list
is registered with the kernel via a new syscall [this registration happens at
most once per thread lifetime].  At do_exit() time, the kernel checks this
user-space list: are there any robust futex locks to be cleaned up?

In the common case, at do_exit() time, there is no list registered, so the
cost of robust futexes is just a simple current->robust_list != NULL
comparison.  If the thread has registered a list, then normally the list is
empty.  If the thread/process crashed or terminated in some incorrect way then
the list might be non-empty: in this case the kernel carefully walks the list
[not trusting it], and marks all locks that are owned by this thread with the
FUTEX_OWNER_DEAD bit, and wakes up one waiter (if any).

The list is guaranteed to be private and per-thread, so it's lockless.  There
is one race possible though: since adding to and removing from the list is
done after the futex is acquired by glibc, there is a few instructions window
for the thread (or process) to die there, leaving the futex hung.  To protect
against this possibility, userspace (glibc) also maintains a simple per-thread
'list_op_pending' field, to allow the kernel to clean up if the thread dies
after acquiring the lock, but just before it could have added itself to the
list.  Glibc sets this list_op_pending field before it tries to acquire the
futex, and clears it after the list-add (or list-remove) has finished.

That's all that is needed - all the rest of robust-futex cleanup is done in
userspace [just like with the previous patches].

Ulrich Drepper has implemented the necessary glibc support for this new
mechanism, which fully enables robust mutexes.  (Ulrich plans to commit these
changes to glibc-HEAD later today.)

Key differences of this userspace-list based approach, compared to the vma
based method:

 - it's much, much faster: at thread exit time, there's no need to loop
   over every vma (!), which the VM-based method has to do.  Only a very
   simple 'is the list empty' op is done.

 - no VM changes are needed - 'struct address_space' is left alone.

 - no registration of individual locks is needed: robust mutexes dont need
   any extra per-lock syscalls.  Robust mutexes thus become a very lightweight
   primitive - so they dont force the application designer to do a hard choice
   between performance and robustness - robust mutexes are just as fast.

 - no per-lock kernel allocation happens.

 - no resource limits are needed.

 - no kernel-space recovery call (FUTEX_RECOVER) is needed.

 - the implementation and the locking is "obvious", and there are no
   interactions with the VM.

  Performance
  -----------

I have benchmarked the time needed for the kernel to process a list of 1
million (!) held locks, using the new method [on a 2GHz CPU]:

 - with FUTEX_WAIT set [contended mutex]: 130 msecs
 - without FUTEX_WAIT set [uncontended mutex]: 30 msecs

I have also measured an approach where glibc does the lock notification [which
it currently does for !pshared robust mutexes], and that took 256 msecs -
clearly slower, due to the 1 million FUTEX_WAKE syscalls userspace had to do.

(1 million held locks are unheard of - we expect at most a handful of locks to
be held at a time.  Nevertheless it's nice to know that this approach scales
nicely.)

  Implementation details
  ----------------------

The patch adds two new syscalls: one to register the userspace list, and one
to query the registered list pointer:

 asmlinkage long
 sys_set_robust_list(struct robust_list_head __user *head,
                     size_t len);

 asmlinkage long
 sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
                     size_t __user *len_ptr);

List registration is very fast: the pointer is simply stored in
current->robust_list.  [Note that in the future, if robust futexes become
widespread, we could extend sys_clone() to register a robust-list head for new
threads, without the need of another syscall.]

So there is virtually zero overhead for tasks not using robust futexes, and
even for robust futex users, there is only one extra syscall per thread
lifetime, and the cleanup operation, if it happens, is fast and
straightforward.  The kernel doesnt have any internal distinction between
robust and normal futexes.

If a futex is found to be held at exit time, the kernel sets the highest bit
of the futex word:

	#define FUTEX_OWNER_DIED        0x40000000

and wakes up the next futex waiter (if any). User-space does the rest of
the cleanup.

Otherwise, robust futexes are acquired by glibc by putting the TID into the
futex field atomically.  Waiters set the FUTEX_WAITERS bit:

	#define FUTEX_WAITERS           0x80000000

and the remaining bits are for the TID.

  Testing, architecture support
  -----------------------------

I've tested the new syscalls on x86 and x86_64, and have made sure the parsing
of the userspace list is robust [ ;-) ] even if the list is deliberately
corrupted.

i386 and x86_64 syscalls are wired up at the moment, and Ulrich has tested the
new glibc code (on x86_64 and i386), and it works for his robust-mutex
testcases.

All other architectures should build just fine too - but they wont have the
new syscalls yet.

Architectures need to implement the new futex_atomic_cmpxchg_inuser() inline
function before writing up the syscalls (that function returns -ENOSYS right
now).

This patch:

Add placeholder futex_atomic_cmpxchg_inuser() implementations to every
architecture that supports futexes.  It returns -ENOSYS.
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NArjan van de Ven <arjan@infradead.org>
Acked-by: NUlrich Drepper <drepper@redhat.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

e9056f13

[PATCH] unify pfn_to_page: x86_64 pfn_to_page · dc8ecb43

由 KAMEZAWA Hiroyuki 提交于 3月 27, 2006

x86_64 can use generic funcs.
For DISCONTIGMEM, CONFIG_OUT_OF_LINE_PFN_TO_PAGE is selected.
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

dc8ecb43

[PATCH] sched: new sched domain for representing multi-core · 1e9f28fa

由 Siddha, Suresh B 提交于 3月 27, 2006

Add a new sched domain for representing multi-core with shared caches
between cores. Consider a dual package system, each package containing two
cores and with last level cache shared between cores with in a package. If
there are two runnable processes, with this appended patch those two
processes will be scheduled on different packages.

On such systems, with this patch we have observed 8% perf improvement with
specJBB(2 warehouse) benchmark and 35% improvement with CFP2000 rate(with 2
users).

This new domain will come into play only on multi-core systems with shared
caches. On other systems, this sched domain will be removed by domain
degeneration code. This new domain can be also used for implementing power
savings policy (see OLS 2005 CMP kernel scheduler paper for more details..
I will post another patch for power savings policy soon)

Most of the arch/* file changes are for cpu_coregroup_map() implementation.
Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

1e9f28fa

27 3月, 2006 1 次提交

[PATCH] bitops: x86_64: use generic bitops · f33e2fba

由 Akinobu Mita 提交于 3月 26, 2006

- remove sched_find_first_bit()
- remove generic_hweight{64,32,16,8}()
- remove ext2_{set,clear,test,find_first_zero,find_next_zero}_bit()
- remove minix_{test,set,test_and_clear,test,find_first_zero}_bit()
Signed-off-by: NAkinobu Mita <mita@miraclelinux.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

f33e2fba

26 3月, 2006 9 次提交

[PATCH] x86_64: Removed duplicated declaration of force_iommu · f271a6f5

由 Andi Kleen 提交于 3月 25, 2006

Noticed by Andrew Morton.
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

f271a6f5

[PATCH] x86_64: group memnodemap and memnodeshift in a memnode structure · dcf36bfa

由 Eric Dumazet 提交于 3月 25, 2006

pfn_to_page() and others need to access both memnode_shift and the very
first bytes of memnodemap[]. If we force memnode_shift to be just before the
memnodemap array, we can reduce the memory footprint to one cache line
instead of two for most setups. This patch introduce a 'memnode' structure
where shift and map[] are carefully placed.
Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

dcf36bfa

[PATCH] x86_64: Make local_t 64bit instead of 32bit · 94949436

由 Andi Kleen 提交于 3月 25, 2006

For consistency with other architectures
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

94949436

[PATCH] x86_64: Remove CONFIG_UNORDERED_IO · ba22f135

由 Andi Kleen 提交于 3月 25, 2006

It was a failed experiment - all benchmarks done with it on both AMD
and Intel showed it was a loss. That was probably because the store
buffers of the CPUs for write combining traffic weren't large enough.
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

ba22f135

[PATCH] x86_64: timer interrupt lockup due to pending interrupt · da7ed9f9

由 Vivek Goyal 提交于 3月 25, 2006

o check_timer() routine fails while second kernel is booting after a crash
on an opetron box. Problem happens because timer vector (0x31) seems to be
locked.

o After a system crash, it is not safe to service interrupts any more, hence
interrupts are disabled. This leads to pending interrupts at LAPIC. LAPIC
sends these interrupts to the CPU during early boot of second kernel. Other
pending interrupts are discarded saying unexpected trap but timer interrupt
is serviced and CPU does not issue an LAPIC EOI because it think this
interrupt came from i8259 and sends ack to 8259. This leads to vector 0x31
locking as LAPIC does not clear respective ISR and keeps on waiting for
EOI.

o This patch issues extra EOI for the pending interrupts who have ISR set.

o Though today only timer seems to be the special case because in early
boot it thinks interrupts are coming from i8259 and uses
mask_and_ack_8259A() as ack handler and does not issue LAPIC EOI. But
probably doing it in generic manner for all vectors makes sense.
Signed-off-by: NVivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

da7ed9f9

[PATCH] x86_64: Reorder one field of the PDA to reduce padding · df92004c

由 Arjan van de Ven 提交于 3月 25, 2006

This reorders the mmu_state int in the pda, such that there is no more
padding (there currently is 4 bytes of padding).  Boot tested.
Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

df92004c

[PATCH] x86_64: Don't invoke OOM killer while allocating floppy DMA buffers · 554d284b

由 Andi Kleen 提交于 3月 25, 2006

Floppy can fall back to smaller buffers, so don't do OOM killing.
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

554d284b

[PATCH] x86_64: Implement early DMI scanning · f2d3efed

由 Andi Kleen 提交于 3月 25, 2006

There are more and more cases where we need to know DMI information
early to work around bugs.  i386 already had early DMI scanning, but
x86-64 didn't.  Implement this now.

This required some cleanup in the i386 code.
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

f2d3efed

[PATCH] x86_64: Don't define string functions to builtin · 6edfba1b

由 Andi Kleen 提交于 3月 25, 2006

gcc should handle this anyways, and it causes problems when
sprintf is turned into strcpy by gcc behind our backs and
the C fallback version of strcpy is actually defining __builtin_strcpy

Then drop -ffreestanding from the main Makefile because it isn't
needed anymore and implies -fno-builtin, which is wrong now.
(it was only added for x86-64, so dropping it should be safe)

Noticed by Roman Zippel

Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: NAndi Kleen <ak@suse.de>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

6edfba1b

OpenHarmony / kernel_linux 上一次同步 接近 4 年

OpenHarmony / kernel_linux
上一次同步接近 4 年