- 01 5月, 2013 12 次提交
-
-
由 Davidlohr Bueso 提交于
Various forms of ipc use ipcctl_pre_down() to retrieve an ipc object and check permissions, mostly for IPC_RMID and IPC_SET commands. Introduce ipcctl_pre_down_nolock(), a lockless version of this function. The locking version is retained, yet modified to call the nolock version without affecting its semantics, thus transparent to all ipc callers. Signed-off-by: NDavidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: NRik van Riel <riel@redhat.com> Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Cc: Emmanuel Benisty <benisty.e@gmail.com> Cc: Jason Low <jason.low2@hp.com> Cc: Michel Lespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Tested-by: NSedat Dilek <sedat.dilek@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Davidlohr Bueso 提交于
Through ipc_lock() and therefore ipc_lock_check() we currently return the locked ipc object. This is not necessary for all situations and can, therefore, cause unnecessary ipc lock contention. Introduce analogous ipc_obtain_object() and ipc_obtain_object_check() functions that only lookup and return the ipc object. Both these functions must be called within the RCU read critical section. [akpm@linux-foundation.org: propagate the ipc_obtain_object() errno from ipc_lock()] Signed-off-by: NDavidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: NRik van Riel <riel@redhat.com> Reviewed-by: NChegu Vinod <chegu_vinod@hp.com> Acked-by: NMichel Lespinasse <walken@google.com> Cc: Emmanuel Benisty <benisty.e@gmail.com> Cc: Jason Low <jason.low2@hp.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Tested-by: NSedat Dilek <sedat.dilek@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Davidlohr Bueso 提交于
This series makes the sysv semaphore code more scalable, by reducing the time the semaphore lock is held, and making the locking more scalable for semaphore arrays with multiple semaphores. The first four patches were written by Davidlohr Buesso, and reduce the hold time of the semaphore lock. The last three patches change the sysv semaphore code locking to be more fine grained, providing a performance boost when multiple semaphores in a semaphore array are being manipulated simultaneously. On a 24 CPU system, performance numbers with the semop-multi test with N threads and N semaphores, look like this: vanilla Davidlohr's Davidlohr's + Davidlohr's + threads patches rwlock patches v3 patches 10 610652 726325 1783589 2142206 20 341570 365699 1520453 1977878 30 288102 307037 1498167 2037995 40 290714 305955 1612665 2256484 50 288620 312890 1733453 2650292 60 289987 306043 1649360 2388008 70 291298 306347 1723167 2717486 80 290948 305662 1729545 2763582 90 290996 306680 1736021 2757524 100 292243 306700 1773700 3059159 This patch: There is no reason to be holding the ipc lock while reading ipcp->seq, hence remove misleading comment. Also simplify the return value for the function. Signed-off-by: NDavidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: NRik van Riel <riel@redhat.com> Cc: Chegu Vinod <chegu_vinod@hp.com> Cc: Emmanuel Benisty <benisty.e@gmail.com> Cc: Jason Low <jason.low2@hp.com> Cc: Michel Lespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Tested-by: NSedat Dilek <sedat.dilek@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 HoSung Jung 提交于
Signed-off-by: NHoSung Jung <rain6557@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Hurley 提交于
[fengguang.wu@intel.com: find_msg can be static] Signed-off-by: NPeter Hurley <peter@hurleysoftware.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Hurley 提交于
Signed-off-by: NPeter Hurley <peter@hurleysoftware.com> Acked-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Hurley 提交于
Teach the helper routines about MSG_COPY so that msgtyp is preserved as the message number to copy. The security functions affected by this change were audited and no additional changes are necessary. Signed-off-by: NPeter Hurley <peter@hurleysoftware.com> Acked-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Hurley 提交于
In preparation for refactoring the queue scan into a separate function, relocate msg copying. Signed-off-by: NPeter Hurley <peter@hurleysoftware.com> Acked-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Hurley 提交于
Signed-off-by: NPeter Hurley <peter@hurleysoftware.com> Acked-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Hurley 提交于
Signed-off-by: NPeter Hurley <peter@hurleysoftware.com> Acked-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Hurley 提交于
Separating msg allocation enables single-block vmalloc allocation instead. Signed-off-by: NPeter Hurley <peter@hurleysoftware.com> Acked-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Hurley 提交于
Signed-off-by: NPeter Hurley <peter@hurleysoftware.com> Acked-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 4月, 2013 1 次提交
-
-
由 Andrew Morton 提交于
Squishes a statement-with-no-effect warning, removes some ifdefs and shrinks .text by one byte! Note that this code fails to check for blocking_notifier_chain_register() failures. Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 4月, 2013 1 次提交
-
-
由 Stanislav Kinsbursky 提交于
Make sure that msg pointer is set back to error value in case of MSG_COPY flag is set and desired message to copy wasn't found. This garantees that msg is either a error pointer or a copy address. Otherwise the last message in queue will be freed without unlinking from the queue (which leads to memory corruption) and the dummy allocated copy won't be released. Signed-off-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 3月, 2013 1 次提交
-
-
由 Eric W. Biederman 提交于
Only allow mounting the mqueue filesystem if the caller has CAP_SYS_ADMIN rights over the ipc namespace. The principle here is if you create or have capabilities over it you can mount it, otherwise you get to live with what other people have mounted. This information is not particularly sensitive and mqueue essentially only reports which posix messages queues exist. Still when creating a restricted environment for an application to live any extra information may be of use to someone with sufficient creativity. The historical if imperfect way this information has been restricted has been not to allow mounts and restricting this to ipc namespace creators maintains the spirit of the historical restriction. Cc: stable@vger.kernel.org Acked-by: NSerge Hallyn <serge.hallyn@canonical.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
- 23 3月, 2013 1 次提交
-
-
由 Vladimir Davydov 提交于
mnt_drop_write() must be called only if mnt_want_write() succeeded, otherwise the mnt_writers counter will diverge. mnt_writers counters are used to check if remounting FS as read-only is OK, so after an extra mnt_drop_write() call, it would be impossible to remount mqueue FS as read-only. Besides, on umount a warning would be printed like this one: ===================================== [ BUG: bad unlock balance detected! ] 3.9.0-rc3 #5 Not tainted ------------------------------------- a.out/12486 is trying to release lock (sb_writers) at: mnt_drop_write+0x1f/0x30 but there are no more locks to release! Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Cc: Doug Ledford <dledford@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 3月, 2013 2 次提交
-
-
由 Peter Hurley 提交于
When MSG_COPY is set, a duplicate message must be allocated for the copy before locking the queue. However, the copy could not be larger than was sent which is limited to msg_ctlmax. Signed-off-by: NPeter Hurley <peter@hurleysoftware.com> Acked-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Hurley 提交于
If the src msg is > 4k, then dest->next points to the next allocated segment; resetting it just prior to dereferencing is bad. Signed-off-by: NPeter Hurley <peter@hurleysoftware.com> Acked-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 3月, 2013 1 次提交
-
-
由 Al Viro 提交于
just have the bugger take unsigned long and deal with SETVAL case (when we use an int member in the union) explicitly. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 04 3月, 2013 3 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 28 2月, 2013 1 次提交
-
-
由 Tejun Heo 提交于
Convert to the much saner new idr interface. The new interface doesn't directly translate to the way idr_pre_get() was used around ipc_addid() as preloading disables preemption. From my cursory reading, it seems like we should be able to do all allocation from ipc_addid(), so I moved it there. Can you please check whether this would be okay? If this is wrong and ipc_addid() should be allowed to be called from non-sleepable context, I'd suggest allocating id itself in the outer functions and later install the pointer using idr_replace(). Signed-off-by: NTejun Heo <tj@kernel.org> Reported-by: NSedat Dilek <sedat.dilek@gmail.com> Tested-by: NSedat Dilek <sedat.dilek@gmail.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 2月, 2013 2 次提交
-
-
由 Michel Lespinasse 提交于
do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE multiple, however there was no equivalent code in mm_populate(), which caused issues. This could be fixed by introduced the same rounding in mm_populate(), however I think it's preferable to make do_mmap_pgoff() return populate as a size rather than as a boolean, so we don't have to duplicate the size rounding logic in mm_populate(). Signed-off-by: NMichel Lespinasse <walken@google.com> Acked-by: NRik van Riel <riel@redhat.com> Tested-by: NAndy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michel Lespinasse 提交于
When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or with MCL_FUTURE in effect), we want to populate the pages within the newly created vmas. This may take a while as we may have to read pages from disk, so ideally we want to do this outside of the write-locked mmap_sem region. This change introduces mm_populate(), which is used to defer populating such mappings until after the mmap_sem write lock has been released. This is implemented as a generalization of the former do_mlock_pages(), which accomplished the same task but was using during mlock() / mlockall(). Signed-off-by: NMichel Lespinasse <walken@google.com> Reported-by: NAndy Lutomirski <luto@amacapital.net> Acked-by: NRik van Riel <riel@redhat.com> Tested-by: NAndy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 2月, 2013 2 次提交
-
-
由 Anatol Pomozov 提交于
Allocating a file structure in function get_empty_filp() might fail because of several reasons: - not enough memory for file structures - operation is not allowed - user is over its limit Currently the function returns NULL in all cases and we loose the exact reason of the error. All callers of get_empty_filp() assume that the function can fail with ENFILE only. Return error through pointer. Change all callers to preserve this error code. [AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit (things remaining here deal with alloc_file()), removed pipe(2) behaviour change] Signed-off-by: NAnatol Pomozov <anatol.pomozov@gmail.com> Reviewed-by: N"Theodore Ts'o" <tytso@mit.edu> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 28 1月, 2013 1 次提交
-
-
由 Gao feng 提交于
This patch allow the unprivileged user to mount mqueuefs in user ns. If two userns share the same ipcns,the files in mqueue fs should be seen in both these two userns. If the userns has its own ipcns,it has its own mqueue fs too. ipcns has already done this job well. Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com> Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
-
- 05 1月, 2013 9 次提交
-
-
由 Stanislav Kinsbursky 提交于
Signed-off-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Stanislav Kinsbursky 提交于
Remove the redundant and confusing fill_copy(). Also add copy_msg() check for error. In this case exit from the function have to be done instead of break, because further code interprets any error as EAGAIN. Also define copy_msg() for the case when CONFIG_CHECKPOINT_RESTORE is disabled. Signed-off-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Stanislav Kinsbursky 提交于
This code works if CONFIG_CHECKPOINT_RESTORE is disabled. [akpm@linux-foundation.org: remove __maybe_unused] Signed-off-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Stanislav Kinsbursky 提交于
Passing and checking of msgflg to free_copy() is redundant. This patch sets copy to NULL on declaration instead and checks for non-NULL in free_copy(). Note: in case of copy allocation failure, error is returned immediately. So no need to check for IS_ERR() in free_copy(). Signed-off-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Stanislav Kinsbursky 提交于
This test can be used to check wheither kernel supports IPC message queue copy and restore features (required by CRIU project). Signed-off-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Stanislav Kinsbursky 提交于
This patch is required for checkpoint/restore in userspace. c/r requires some way to get all pending IPC messages without deleting them from the queue (checkpoint can fail and in this case tasks will be resumed, so queue have to be valid). To achive this, new operation flag MSG_COPY for sys_msgrcv() system call was introduced. If this flag was specified, then mtype is interpreted as number of the message to copy. If MSG_COPY is set, then kernel will allocate dummy message with passed size, and then use new copy_msg() helper function to copy desired message (instead of unlinking it from the queue). Notes: 1) Return -ENOSYS if MSG_COPY is specified, but CONFIG_CHECKPOINT_RESTORE is not set. Signed-off-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Stanislav Kinsbursky 提交于
Move all message related manipulation into one function msg_fill(). Actually, two functions because of the compat one. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Stanislav Kinsbursky 提交于
Add 3 new variables and sysctls to tune them (by one "next_id" variable for messages, semaphores and shared memory respectively). This variable can be used to set desired id for next allocated IPC object. By default it's equal to -1 and old behaviour is preserved. If this variable is non-negative, then desired idr will be extracted from it and used as a start value to search for free IDR slot. Notes: 1) this patch doesn't guarantee that the new object will have desired id. So it's up to user space how to handle new object with wrong id. 2) After a sucessful id allocation attempt, "next_id" will be set back to -1 (if it was non-negative). [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Stanislav Kinsbursky 提交于
This is a cleanup patch. The assignment is redundant. Signed-off-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 12月, 2012 1 次提交
-
-
由 Eric W. Biederman 提交于
Andy Lutomirski <luto@amacapital.net> found a nasty little bug in the permissions of setns. With unprivileged user namespaces it became possible to create new namespaces without privilege. However the setns calls were relaxed to only require CAP_SYS_ADMIN in the user nameapce of the targed namespace. Which made the following nasty sequence possible. pid = clone(CLONE_NEWUSER | CLONE_NEWNS); if (pid == 0) { /* child */ system("mount --bind /home/me/passwd /etc/passwd"); } else if (pid != 0) { /* parent */ char path[PATH_MAX]; snprintf(path, sizeof(path), "/proc/%u/ns/mnt"); fd = open(path, O_RDONLY); setns(fd, 0); system("su -"); } Prevent this possibility by requiring CAP_SYS_ADMIN in the current user namespace when joing all but the user namespace. Acked-by: NSerge Hallyn <serge.hallyn@canonical.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
- 12 12月, 2012 1 次提交
-
-
由 Andi Kleen 提交于
There was some desire in large applications using MAP_HUGETLB or SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on others. This is useful together with NUMA policy: use 2MB interleaving on some mappings, but 1GB on local mappings. This patch extends the IPC/SHM syscall interfaces slightly to allow specifying the page size. It borrows some upper bits in the existing flag arguments and allows encoding the log of the desired page size in addition to the *_HUGETLB flag. When 0 is specified the default size is used, this makes the change fully compatible. Extending the internal hugetlb code to handle this is straight forward. Instead of a single mount it just keeps an array of them and selects the right mount based on the specified page size. When no page size is specified it uses the mount of the default page size. The change is not visible in /proc/mounts because internal mounts don't appear there. It also has very little overhead: the additional mounts just consume a super block, but not more memory when not used. I also exported the new flags to the user headers (they were previously under __KERNEL__). Right now only symbols for x86 and some other architecture for 1GB and 2MB are defined. The interface should already work for all other architectures though. Only architectures that define multiple hugetlb sizes actually need it (that is currently x86, tile, powerpc). However tile and powerpc have user configurable hugetlb sizes, so it's not easy to add defines. A program on those architectures would need to query sysfs and use the appropiate log2. [akpm@linux-foundation.org: cleanups] [rientjes@google.com: fix build] [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: NAndi Kleen <ak@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 11月, 2012 1 次提交
-
-
由 Eric W. Biederman 提交于
Assign a unique proc inode to each namespace, and use that inode number to ensure we only allocate at most one proc inode for every namespace in proc. A single proc inode per namespace allows userspace to test to see if two processes are in the same namespace. This has been a long requested feature and only blocked because a naive implementation would put the id in a global space and would ultimately require having a namespace for the names of namespaces, making migration and certain virtualization tricks impossible. We still don't have per superblock inode numbers for proc, which appears necessary for application unaware checkpoint/restart and migrations (if the application is using namespace file descriptors) but that is now allowd by the design if it becomes important. I have preallocated the ipc and uts initial proc inode numbers so their structures can be statically initialized. Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
-