- 20 12月, 2012 1 次提交
-
-
由 Al Viro 提交于
Again, conditional on CONFIG_GENERIC_SIGALTSTACK Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 29 11月, 2012 2 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 01 10月, 2012 1 次提交
-
-
由 Al Viro 提交于
32bit wrapper is lost on that; 64bit one is *not*, since we need to arrange for full pt_regs on stack when we call sys_execve() and we need to load callee-saved ones from there afterwards. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 22 9月, 2012 2 次提交
-
-
由 H. Peter Anvin 提交于
Signal handling contains a bunch of accesses to individual user space items, which causes an excessive number of STAC and CLAC instructions. Instead, let get/put_user_try ... get/put_user_catch() contain the STAC and CLAC instructions. This means that get/put_user_try no longer nests, and furthermore that it is no longer legal to use user space access functions other than __get/put_user_ex() inside those blocks. However, these macros are x86-specific anyway and are only used in the signal-handling paths; a simple reordering of moving the larger subroutine calls out of the try...catch blocks resolves that problem. Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/r/1348256595-29119-12-git-send-email-hpa@linux.intel.com
-
由 H. Peter Anvin 提交于
When Supervisor Mode Access Prevention (SMAP) is enabled, access to userspace from the kernel is controlled by the AC flag. To make the performance of manipulating that flag acceptable, there are two new instructions, STAC and CLAC, to set and clear it. This patch adds those instructions, via alternative(), when the SMAP feature is enabled. It also adds X86_EFLAGS_AC unconditionally to the SYSCALL entry mask; there is simply no reason to make that one conditional. Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/r/1348256595-29119-9-git-send-email-hpa@linux.intel.com
-
- 19 9月, 2012 1 次提交
-
-
由 Suresh Siddha 提交于
Currently for x86 and x86_32 binaries, fpstate in the user sigframe is copied to/from the fpstate in the task struct. And in the case of signal delivery for x86_64 binaries, if the fpstate is live in the CPU registers, then the live state is copied directly to the user sigframe. Otherwise fpstate in the task struct is copied to the user sigframe. During restore, fpstate in the user sigframe is restored directly to the live CPU registers. Historically, different code paths led to different bugs. For example, x86_64 code path was not preemption safe till recently. Also there is lot of code duplication for support of new features like xsave etc. Unify signal handling code paths for x86 and x86_64 kernels. New strategy is as follows: Signal delivery: Both for 32/64-bit frames, align the core math frame area to 64bytes as needed by xsave (this where the main fpu/extended state gets copied to and excludes the legacy compatibility fsave header for the 32-bit [f]xsave frames). If the state is live, copy the register state directly to the user frame. If not live, copy the state in the thread struct to the user frame. And for 32-bit [f]xsave frames, construct the fsave header separately before the actual [f]xsave area. Signal return: As the 32-bit frames with [f]xstate has an additional 'fsave' header, copy everything back from the user sigframe to the fpstate in the task structure and reconstruct the fxstate from the 'fsave' header (Also user passed pointers may not be correctly aligned for any attempt to directly restore any partial state). At the next fpstate usage, everything will be restored to the live CPU registers. For all the 64-bit frames and the 32-bit fsave frame, restore the state from the user sigframe directly to the live CPU registers. 64-bit signals always restored the math frame directly, so we can expect the math frame pointer to be correctly aligned. For 32-bit fsave frames, there are no alignment requirements, so we can restore the state directly. "lat_sig catch" microbenchmark numbers (for x86, x86_64, x86_32 binaries) are with in the noise range with this change. Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/1343171129-2747-4-git-send-email-suresh.b.siddha@intel.com [ Merged in compilation fix ] Link: http://lkml.kernel.org/r/1344544736.8326.17.camel@sbsiddha-desk.sc.intel.comSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 05 9月, 2012 2 次提交
-
-
由 Mathias Krause 提交于
Fix the following sparse warnings by adding appropriate __user casts and annotations: ia32_signal.c:165:38: warning: incorrect type in argument 1 (different address spaces) ia32_signal.c:165:38: expected struct sigaltstack const [noderef] [usertype] <asn:1>*<noident> ia32_signal.c:165:38: got struct sigaltstack * [...] Signed-off-by: NMathias Krause <minipli@googlemail.com> Cc: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/1346621506-30857-4-git-send-email-minipli@googlemail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Mathias Krause 提交于
Fix the following sparse warnings: sys_ia32.c:293:38: warning: incorrect type in argument 2 (different address spaces) sys_ia32.c:293:38: expected unsigned int [noderef] [usertype] <asn:1>*stat_addr sys_ia32.c:293:38: got unsigned int *stat_addr Ironically, sys_ia32.h was introduced to fix sparse warnings but missed that one. Signed-off-by: NMathias Krause <minipli@googlemail.com> Link: http://lkml.kernel.org/r/1346621506-30857-2-git-send-email-minipli@googlemail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 15 6月, 2012 1 次提交
-
-
由 Suresh Siddha 提交于
Signal delivery compat path may not have the 'TS_COMPAT' flag (that flag indicates how we entered the kernel). So use test_thread_flag(TIF_IA32) instead of is_ia32_task(): one of the functions of TIF_IA32 is just what kind of signal frame we want. Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/1339722435.3475.57.camel@sbsiddha-desk.sc.intel.com Cc: stable@kernel.org # v3.4 Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
-
- 02 6月, 2012 1 次提交
-
-
由 Al Viro 提交于
Only 3 out of 63 do not. Renamed the current variant to __set_current_blocked(), added set_current_blocked() that will exclude unblockable signals, switched open-coded instances to it. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 22 5月, 2012 1 次提交
-
-
由 Al Viro 提交于
guts of saved_sigmask-based sigsuspend/rt_sigsuspend. Takes kernel sigset_t *. Open-coded instances replaced with calling it. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 16 5月, 2012 1 次提交
-
-
由 Eric W. Biederman 提交于
- Store uids and gids with kuid_t and kgid_t in struct kstat - Convert uid and gids to userspace usable values with from_kuid and from_kgid Acked-by: NSerge Hallyn <serge.hallyn@canonical.com> Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
-
- 08 5月, 2012 1 次提交
-
-
由 Jan Beulich 提交于
None of the three routines being removed here was actually hooked up anywhere, so they all represented dead code. Signed-off-by: NJan Beulich <jbeulich@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4FA947FE020000780008247F@nat28.tlf.novell.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 07 5月, 2012 1 次提交
-
-
由 Al Viro 提交于
Setting TIF_IA32 in load_aout_binary() used to be enough; these days TASK_SIZE is controlled by TIF_ADDR32 and that one doesn't get set there. Switch to use of set_personality_ia32()... Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 4月, 2012 4 次提交
-
-
由 Linus Torvalds 提交于
This continues the theme started with vm_brk() and vm_munmap(): vm_mmap() does the same thing as do_mmap(), but additionally does the required VM locking. This uninlines (and rewrites it to be clearer) do_mmap(), which sadly duplicates it in mm/mmap.c and mm/nommu.c. But that way we don't have to export our internal do_mmap_pgoff() function. Some day we hopefully don't have to export do_mmap() either, if all modular users can become the simpler vm_mmap() instead. We're actually very close to that already, with the notable exception of the (broken) use in i810, and a couple of stragglers in binfmt_elf. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
It does the same thing as "do_brk()", except it handles the VM locking too. It turns out that all external callers want that anyway, so we can make do_brk() static to just mm/mmap.c while at it. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 H. Peter Anvin 提交于
Remove open-coded exception table entries in arch/x86/ia32/ia32entry.S, and replace them with _ASM_EXTABLE() macros; this will allow us to change the format and type of the exception table entries. This one was missed from the previous patch to this file. Signed-off-by: NH. Peter Anvin <hpa@zytor.com> Cc: David Daney <david.daney@cavium.com> Link: http://lkml.kernel.org/r/CA%2B55aFyijf43qSu3N9nWHEBwaGbb7T2Oq9A=9EyR=Jtyqfq_cQ@mail.gmail.com
-
由 H. Peter Anvin 提交于
Remove open-coded exception table entries in arch/x86/ia32/ia32entry.S, and replace them with _ASM_EXTABLE() macros; this will allow us to change the format and type of the exception table entries. Signed-off-by: NH. Peter Anvin <hpa@zytor.com> Cc: David Daney <david.daney@cavium.com> Link: http://lkml.kernel.org/r/CA%2B55aFyijf43qSu3N9nWHEBwaGbb7T2Oq9A=9EyR=Jtyqfq_cQ@mail.gmail.com
-
- 14 4月, 2012 1 次提交
-
-
由 Will Drewry 提交于
This change enables SIGSYS, defines _sigfields._sigsys, and adds x86 (compat) arch support. _sigsys defines fields which allow a signal handler to receive the triggering system call number, the relevant AUDIT_ARCH_* value for that number, and the address of the callsite. SIGSYS is added to the SYNCHRONOUS_MASK because it is desirable for it to have setup_frame() called for it. The goal is to ensure that ucontext_t reflects the machine state from the time-of-syscall and not from another signal handler. The first consumer of SIGSYS would be seccomp filter. In particular, a filter program could specify a new return value, SECCOMP_RET_TRAP, which would result in the system call being denied and the calling thread signaled. This also means that implementing arch-specific support can be dependent upon HAVE_ARCH_SECCOMP_FILTER. Suggested-by: NH. Peter Anvin <hpa@zytor.com> Signed-off-by: NWill Drewry <wad@chromium.org> Acked-by: NSerge Hallyn <serge.hallyn@canonical.com> Reviewed-by: NH. Peter Anvin <hpa@zytor.com> Acked-by: NEric Paris <eparis@redhat.com> v18: - added acked by, rebase v17: - rebase and reviewed-by addition v14: - rebase/nochanges v13: - rebase on to 88ebdda6 v12: - reworded changelog (oleg@redhat.com) v11: - fix dropped words in the change description - added fallback copy_siginfo support. - added __ARCH_SIGSYS define to allow stepped arch support. v10: - first version based on suggestion Signed-off-by: NJames Morris <james.l.morris@oracle.com>
-
- 29 3月, 2012 1 次提交
-
-
由 David Howells 提交于
Disintegrate asm/system.h for X86. Signed-off-by: NDavid Howells <dhowells@redhat.com> Acked-by: NH. Peter Anvin <hpa@zytor.com> cc: x86@kernel.org
-
- 21 3月, 2012 2 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Just don't pass NULL to it - nobody does, anyway. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 14 3月, 2012 1 次提交
-
-
由 H. Peter Anvin 提交于
Fix a stray ! which flipped the sense if we were generating a signal frame for ia32 vs. x32. Introduced in: e7084fd5 x32: Switch to a 64-bit clock_t Reported-by: NH. J. Lu <hjl.tools@gmail.com> Signed-off-by: NH. Peter Anvin <hpa@zytor.com> Cc: Gregory M. Lueck <gregory.m.lueck@intel.com> Link: http://lkml.kernel.org/r/1329696488-16970-1-git-send-email-hpa@zytor.com
-
- 13 3月, 2012 1 次提交
-
-
由 Srikar Dronamraju 提交于
There are precedences of trap number being referred to as trap_nr. However thread struct refers trap number as trap_no. Change it to trap_nr. Also use enum instead of left-over literals for trap values. This is pure cleanup, no functional change intended. Suggested-by: NIngo Molnar <mingo@eltu.hu> Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com> Cc: Linux-mm <linux-mm@kvack.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20120312092555.5379.942.sendpatchset@srdronam.in.ibm.com [ Fixed the math-emu build ] Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 06 3月, 2012 2 次提交
-
-
由 H. Peter Anvin 提交于
clock_t is used mainly to give the number of jiffies a certain process has burned. It is entirely feasible for a long-running process to consume more than 2^32 jiffies especially in a multiprocess system. As such, switch to a 64-bit clock_t for x32, just as we already switched to a 64-bit time_t. clock_t is only used in a handful of places, and as such it is really not a very significant change. The one that has the biggest impact is in struct siginfo, but since the *size* of struct siginfo doesn't change (it is padded to the hilt) it is fairly easy to make this a localized change. This also gets rid of sys_x32_times, however since this is a pretty late change don't compactify the system call numbers; we can reuse system call slot 521 next time we need an x32 system call. Reported-by: NGregory M. Lueck <gregory.m.lueck@intel.com> Signed-off-by: NH. Peter Anvin <hpa@zytor.com> Cc: H. J. Lu <hjl.tools@gmail.com> Link: http://lkml.kernel.org/r/1329696488-16970-1-git-send-email-hpa@zytor.com
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 2月, 2012 1 次提交
-
-
由 Linus Torvalds 提交于
While various modules include <asm/i387.h> to get access to things we actually *intend* for them to use, most of that header file was really pretty low-level internal stuff that we really don't want to expose to others. So split the header file into two: the small exported interfaces remain in <asm/i387.h>, while the internal definitions that are only used by core architecture code are now in <asm/fpu-internal.h>. The guiding principle for this was to expose functions that we export to modules, and leave them in <asm/i387.h>, while stuff that is used by task switching or was marked GPL-only is in <asm/fpu-internal.h>. The fpu-internal.h file could be further split up too, especially since arch/x86/kvm/ uses some of the remaining stuff for its module. But that kvm usage should probably be abstracted out a bit, and at least now the internal FPU accessor functions are much more contained. Even if it isn't perhaps as contained as it _could_ be. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.orgSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 21 2月, 2012 2 次提交
-
-
由 H. Peter Anvin 提交于
There are some definitions which are duplicated between kernel/signal.c and ia32/ia32_signal.c; move them to a common header file. Rather than adding stuff to existing header files which contain data structures, create a new header file; hence the slightly odd name ("all the good ones were taken.") Note: nothing relied on signal_fault() being defined in <asm/ptrace.h>. Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
-
由 H. Peter Anvin 提交于
On x86, the only difference between sys_rt_sigprocmask and sys32_rt_sigprocmask is the alignment of the data structures. However, x86 allows data accesses with arbitrary alignment, and therefore there is no reason for this code to be different. Reported-by: NGregory M. Lueck <gregory.m.lueck@intel.com> Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
-
- 18 1月, 2012 3 次提交
-
-
由 Eric Paris 提交于
Every arch calls: if (unlikely(current->audit_context)) audit_syscall_entry() which requires knowledge about audit (the existance of audit_context) in the arch code. Just do it all in static inline in audit.h so that arch's can remain blissfully ignorant. Signed-off-by: NEric Paris <eparis@redhat.com>
-
由 Eric Paris 提交于
In the ia32entry syscall exit audit fastpath we have assembly code which calls __audit_syscall_exit directly. This code was, however, zeroes the upper 32 bits of the return code. It then proceeded to call code which expects longs to be 64bits long. In order to handle code which expects longs to be 64bit we sign extend the return code if that code is an error. Thus the __audit_syscall_exit function can correctly handle using the values in snprintf("%ld"). This fixes the regression introduced in 5cbf1565. Old record: type=SYSCALL msg=audit(1306197182.256:281): arch=40000003 syscall=192 success=no exit=4294967283 New record: type=SYSCALL msg=audit(1306197182.256:281): arch=40000003 syscall=192 success=no exit=-13 Signed-off-by: NEric Paris <eparis@redhat.com> Acked-by: NH. Peter Anvin <hpa@zytor.com>
-
由 Eric Paris 提交于
The audit system previously expected arches calling to audit_syscall_exit to supply as arguments if the syscall was a success and what the return code was. Audit also provides a helper AUDITSC_RESULT which was supposed to simplify things by converting from negative retcodes to an audit internal magic value stating success or failure. This helper was wrong and could indicate that a valid pointer returned to userspace was a failed syscall. The fix is to fix the layering foolishness. We now pass audit_syscall_exit a struct pt_reg and it in turns calls back into arch code to collect the return value and to determine if the syscall was a success or failure. We also define a generic is_syscall_success() macro which determines success/failure based on if the value is < -MAX_ERRNO. This works for arches like x86 which do not use a separate mechanism to indicate syscall failure. We make both the is_syscall_success() and regs_return_value() static inlines instead of macros. The reason is because the audit function must take a void* for the regs. (uml calls theirs struct uml_pt_regs instead of just struct pt_regs so audit_syscall_exit can't take a struct pt_regs). Since the audit function takes a void* we need to use static inlines to cast it back to the arch correct structure to dereference it. The other major change is that on some arches, like ia64, MIPS and ppc, we change regs_return_value() to give us the negative value on syscall failure. THE only other user of this macro, kretprobe_example.c, won't notice and it makes the value signed consistently for the audit functions across all archs. In arch/sh/kernel/ptrace_64.c I see that we were using regs[9] in the old audit code as the return value. But the ptrace_64.h code defined the macro regs_return_value() as regs[3]. I have no idea which one is correct, but this patch now uses the regs_return_value() function, so it now uses regs[3]. For powerpc we previously used regs->result but now use the regs_return_value() function which uses regs->gprs[3]. regs->gprs[3] is always positive so the regs_return_value(), much like ia64 makes it negative before calling the audit code when appropriate. Signed-off-by: NEric Paris <eparis@redhat.com> Acked-by: H. Peter Anvin <hpa@zytor.com> [for x86 portion] Acked-by: Tony Luck <tony.luck@intel.com> [for ia64] Acked-by: Richard Weinberger <richard@nod.at> [for uml] Acked-by: David S. Miller <davem@davemloft.net> [for sparc] Acked-by: Ralf Baechle <ralf@linux-mips.org> [for mips] Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [for ppc]
-
- 06 12月, 2011 2 次提交
-
-
由 Jan Beulich 提交于
system_call_after_swapgs doesn't really benefit from forcing alignment from it - quite the opposite, native code needlessly so far got a big NOP instruction inserted in front of it. Xen being the only user of the separate entry point can well live with the branch going to three bytes into a cache line. The compatibility mode ptregs entry points for one can make use of the GLOBAL() macro, and should be suitably aligned. Their shared continuation point (ia32_ptregs_common) otoh doesn't need to be global at all, but should continue to be properly aligned. Signed-off-by: NJan Beulich <jbeulich@suse.com> Reviewed-by: NAndi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/4ED4CEEA020000780006407D@nat28.tlf.novell.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Jan Beulich 提交于
GET_THREAD_INFO() involves a memory read immediately followed by an "sub" on the value read, in turn (in several cases) immediately followed by a use of the calculated value as the base address of a memory access. This combination of instructions has a non-negligible potential for stalls. In the system call entry point code, however, the (fixed) offset of the stack pointer from the end of the stack is generally known, and hence we can instead avoid the memory load and subtract, and instead do the memory reference using %rsp as the base register. To do so in a legible fashion, introduce a THREAD_INFO() macro which, provided a register (generally %rsp) and the known offset from the end of the stack, produces a suitable memory access operand. The patch attempts to only touch the fast paths (no auditing and alike), but manages to do so only in the 64-bit entry point case; the compatibility mode entry points have so many interdependencies between their various branch targets that it was necessary to also adjust the slow paths to eliminate the risk of having missed some register dependency during code analysis. Signed-off-by: NJan Beulich <jbeulich@suse.com> Reviewed-by: NAndi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/4ED4CD690200007800064075@nat28.tlf.novell.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 19 11月, 2011 1 次提交
-
-
由 H. Peter Anvin 提交于
Fix the same typo as was fixed in: b7641d2c x86-64, syscall: Adjust comment spacing and remove typo ... for the new versions of this file (32-bit and IA32 compat). Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/r/1321569446-20433-4-git-send-email-hpa@linux.intel.com
-
- 18 11月, 2011 2 次提交
-
-
由 H. Peter Anvin 提交于
Generate system call tables and unistd_*.h automatically from the tables in arch/x86/syscalls. All other information, like NR_syscalls, is auto-generated, some of which is in asm-offsets_*.c. This allows us to keep all the system call information in one place, and allows for kernel space and user space to see different information; this is currently used for the ia32 system call numbers when building the 64-bit kernel, but will be used by the x32 ABI in the near future. This also removes some gratuitious differences between i386, x86-64 and ia32; in particular, now all system call tables are generated with the same mechanism. Cc: H. J. Lu <hjl.tools@gmail.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Michal Marek <mmarek@suse.cz> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 H. Peter Anvin 提交于
Move compat_ni_syscall out of ia32entry.S and into its own .c file. Although this is a trivial function, it is not performance-critical, and this will simplify further cleanups. Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 01 11月, 2011 1 次提交
-
-
由 Christopher Yeoh 提交于
The basic idea behind cross memory attach is to allow MPI programs doing intra-node communication to do a single copy of the message rather than a double copy of the message via shared memory. The following patch attempts to achieve this by allowing a destination process, given an address and size from a source process, to copy memory directly from the source process into its own address space via a system call. There is also a symmetrical ability to copy from the current process's address space into a destination process's address space. - Use of /proc/pid/mem has been considered, but there are issues with using it: - Does not allow for specifying iovecs for both src and dest, assuming preadv or pwritev was implemented either the area read from or written to would need to be contiguous. - Currently mem_read allows only processes who are currently ptrace'ing the target and are still able to ptrace the target to read from the target. This check could possibly be moved to the open call, but its not clear exactly what race this restriction is stopping (reason appears to have been lost) - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix domain socket is a bit ugly from a userspace point of view, especially when you may have hundreds if not (eventually) thousands of processes that all need to do this with each other - Doesn't allow for some future use of the interface we would like to consider adding in the future (see below) - Interestingly reading from /proc/pid/mem currently actually involves two copies! (But this could be fixed pretty easily) As mentioned previously use of vmsplice instead was considered, but has problems. Since you need the reader and writer working co-operatively if the pipe is not drained then you block. Which requires some wrapping to do non blocking on the send side or polling on the receive. In all to all communication it requires ordering otherwise you can deadlock. And in the example of many MPI tasks writing to one MPI task vmsplice serialises the copying. There are some cases of MPI collectives where even a single copy interface does not get us the performance gain we could. For example in an MPI_Reduce rather than copy the data from the source we would like to instead use it directly in a mathops (say the reduce is doing a sum) as this would save us doing a copy. We don't need to keep a copy of the data from the source. I haven't implemented this, but I think this interface could in the future do all this through the use of the flags - eg could specify the math operation and type and the kernel rather than just copying the data would apply the specified operation between the source and destination and store it in the destination. Although we don't have a "second user" of the interface (though I've had some nibbles from people who may be interested in using it for intra process messaging which is not MPI). This interface is something which hardware vendors are already doing for their custom drivers to implement fast local communication. And so in addition to this being useful for OpenMPI it would mean the driver maintainers don't have to fix things up when the mm changes. There was some discussion about how much faster a true zero copy would go. Here's a link back to the email with some testing I did on that: http://marc.info/?l=linux-mm&m=130105930902915&w=2 There is a basic man page for the proposed interface here: http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt This has been implemented for x86 and powerpc, other architecture should mainly (I think) just need to add syscall numbers for the process_vm_readv and process_vm_writev. There are 32 bit compatibility versions for 64-bit kernels. For arch maintainers there are some simple tests to be able to quickly verify that the syscalls are working correctly here: http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgzSigned-off-by: NChris Yeoh <yeohc@au1.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: <linux-man@vger.kernel.org> Cc: <linux-arch@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 8月, 2011 1 次提交
-
-
由 NeilBrown 提交于
The nfsservctl system call is now gone, so we should remove all linkage for it. Signed-off-by: NNeilBrown <neilb@suse.de> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-