- 13 11月, 2013 2 次提交
-
-
由 Ryan Mallon 提交于
Some setuid binaries will allow reading of files which have read permission by the real user id. This is problematic with files which use %pK because the file access permission is checked at open() time, but the kptr_restrict setting is checked at read() time. If a setuid binary opens a %pK file as an unprivileged user, and then elevates permissions before reading the file, then kernel pointer values may be leaked. This happens for example with the setuid pppd application on Ubuntu 12.04: $ head -1 /proc/kallsyms 00000000 T startup_32 $ pppd file /proc/kallsyms pppd: In file /proc/kallsyms: unrecognized option 'c1000000' This will only leak the pointer value from the first line, but other setuid binaries may leak more information. Fix this by adding a check that in addition to the current process having CAP_SYSLOG, that effective user and group ids are equal to the real ids. If a setuid binary reads the contents of a file which uses %pK then the pointer values will be printed as NULL if the real user is unprivileged. Update the sysctl documentation to reflect the changes, and also correct the documentation to state the kptr_restrict=0 is the default. This is a only temporary solution to the issue. The correct solution is to do the permission check at open() time on files, and to replace %pK with a function which checks the open() time permission. %pK uses in printk should be removed since no sane permission check can be done, and instead protected by using dmesg_restrict. Signed-off-by: NRyan Mallon <rmallon@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Joe Perches <joe@perches.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Zheng Liu 提交于
Now dirty_background_ratio/dirty_ratio contains a percentage of total avaiable memory, which contains free pages and reclaimable pages. The number of these pages is not equal to the number of total system memory. But they are described as a percentage of total system memory in Documentation/sysctl/vm.txt. So we need to fix them to avoid misunderstanding. Signed-off-by: NZheng Liu <wenqing.lz@taobao.com> Cc: Rob Landley <rob@landley.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 10月, 2013 5 次提交
-
-
由 Rik van Riel 提交于
Shared faults can lead to lots of unnecessary page migrations, slowing down the system, and causing private faults to hit the per-pgdat migration ratelimit. This patch adds sysctl numa_balancing_migrate_deferred, which specifies how many shared page migrations to skip unconditionally, after each page migration that is skipped because it is a shared fault. This reduces the number of page migrations back and forth in shared fault situations. It also gives a strong preference to the tasks that are already running where most of the memory is, and to moving the other tasks to near the memory. Testing this with a much higher scan rate than the default still seems to result in fewer page migrations than before. Memory seems to be somewhat better consolidated than previously, with multi-instance specjbb runs on a 4 node system. Signed-off-by: NRik van Riel <riel@redhat.com> Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-62-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Mel Gorman 提交于
With scan rate adaptions based on whether the workload has properly converged or not there should be no need for the scan period reset hammer. Get rid of it. Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-60-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Mel Gorman 提交于
This patch favours moving tasks towards NUMA node that recorded a higher number of NUMA faults during active load balancing. Ideally this is self-reinforcing as the longer the task runs on that node, the more faults it should incur causing task_numa_placement to keep the task running on that node. In reality a big weakness is that the nodes CPUs can be overloaded and it would be more efficient to queue tasks on an idle node and migrate to the new node. This would require additional smarts in the balancer so for now the balancer will simply prefer to place the task on the preferred node for a PTE scans which is controlled by the numa_balancing_settle_count sysctl. Once the settle_count number of scans has complete the schedule is free to place the task on an alternative node if the load is imbalanced. [srikar@linux.vnet.ibm.com: Fixed statistics] Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [ Tunable and use higher faults instead of preferred. ] Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Mel Gorman 提交于
The NUMA PTE scan rate is controlled with a combination of the numa_balancing_scan_period_min, numa_balancing_scan_period_max and numa_balancing_scan_size. This scan rate is independent of the size of the task and as an aside it is further complicated by the fact that numa_balancing_scan_size controls how many pages are marked pte_numa and not how much virtual memory is scanned. In combination, it is almost impossible to meaningfully tune the min and max scan periods and reasoning about performance is complex when the time to complete a full scan is is partially a function of the tasks memory size. This patch alters the semantic of the min and max tunables to be about tuning the length time it takes to complete a scan of a tasks occupied virtual address space. Conceptually this is a lot easier to understand. There is a "sanity" check to ensure the scan rate is never extremely fast based on the amount of virtual memory that should be scanned in a second. The default of 2.5G seems arbitrary but it is to have the maximum scan rate after the patch roughly match the maximum scan rate before the patch was applied. On a similar note, numa_scan_period is in milliseconds and not jiffies. Properly placed pages slow the scanning rate but adding 10 jiffies to numa_scan_period means that the rate scanning slows depends on HZ which is confusing. Get rid of the jiffies_to_msec conversion and treat it as ms. Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Mel Gorman 提交于
Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-3-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 12 9月, 2013 2 次提交
-
-
由 Stéphane Graber 提交于
Add a new %P variable to be used in core_pattern. This variable contains the global PID (PID in the init namespace) as %p contains the PID in the current namespace which isn't always what we want. The main use for this is to make it easier to handle crashes that happened within a container. With that new variables it's possible to have the crashes dumped into the container or forwarded to the host with the right PID (from the host's point of view). Signed-off-by: NStéphane Graber <stgraber@ubuntu.com> Reported-by: NHans Feldt <hans.feldt@ericsson.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Andy Whitcroft <apw@canonical.com> Acked-by: NSerge E. Hallyn <serge.hallyn@ubuntu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Now hugepage migration is enabled, although restricted on pmd-based hugepages for now (due to lack of testing.) So we should allocate migratable hugepages from ZONE_MOVABLE if possible. This patch makes GFP flags in hugepage allocation dependent on migration support, not only the value of hugepages_treat_as_movable. It provides no change on the behavior for architectures which do not support hugepage migration, Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NAndi Kleen <ak@linux.intel.com> Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 31 8月, 2013 1 次提交
-
-
由 stephen hemminger 提交于
By default, the pfifo_fast queue discipline has been used by default for all devices. But we have better choices now. This patch allow setting the default queueing discipline with sysctl. This allows easy use of better queueing disciplines on all devices without having to use tc qdisc scripts. It is intended to allow an easy path for distributions to make fq_codel or sfq the default qdisc. This patch also makes pfifo_fast more of a first class qdisc, since it is now possible to manually override the default and explicitly use pfifo_fast. The behavior for systems who do not use the sysctl is unchanged, they still get pfifo_fast Also removes leftover random # in sysctl net core. Signed-off-by: NStephen Hemminger <stephen@networkplumber.org> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 8月, 2013 1 次提交
-
-
由 Cong Wang 提交于
Eliezer renames several *ll_poll to *busy_poll, but forgets CONFIG_NET_LL_RX_POLL, so in case of confusion, rename it too. Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: NCong Wang <amwang@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 7月, 2013 1 次提交
-
-
由 Eliezer Tamir 提交于
Rename LL_SO to BUSY_POLL_SO Rename sysctl_net_ll_{read,poll} to sysctl_busy_{read,poll} Fix up users of these variables. Fix documentation for sysctl. a patch for the socket.7 man page will follow separately, because of limitations of my mail setup. Signed-off-by: NEliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 7月, 2013 1 次提交
-
-
由 Wanpeng Li 提交于
The default zonelist order selecter will select "node" order if any nodes DMA zone comprises greater than 70% of its local memory instead of 60%, according to default_zonelist_order::low_kmem_size > total * 70/100. Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 7月, 2013 1 次提交
-
-
由 Eliezer Tamir 提交于
Rename functions in include/net/ll_poll.h to busy wait. Clarify documentation about expected power use increase. Rename POLL_LL to POLL_BUSY_LOOP. Add need_resched() testing to poll/select busy loops. Note, that in select and poll can_busy_poll is dynamic and is updated continuously to reflect the existence of supported sockets with valid queue information. Signed-off-by: NEliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 6月, 2013 1 次提交
-
-
由 Eliezer Tamir 提交于
select/poll busy-poll support. Split sysctl value into two separate ones, one for read and one for poll. updated Documentation/sysctl/net.txt Add a new poll flag POLL_LL. When this flag is set, sock_poll will call sk_poll_ll if possible. sock_poll sets this flag in its return value to indicate to select/poll when a socket that can busy poll is found. When poll/select have nothing to report, call the low-level sock_poll again until we are out of time or we find something. Once the system call finds something, it stops setting POLL_LL, so it can return the result to the user ASAP. Signed-off-by: NEliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 6月, 2013 1 次提交
-
-
由 Paul Gortmaker 提交于
This random old kernel version has no bearing on the actual file contents, and has not had for a long time. Delete it. Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: NJiri Kosina <jkosina@suse.cz>
-
- 23 6月, 2013 1 次提交
-
-
由 Dave Hansen 提交于
This patch keeps track of how long perf's NMI handler is taking, and also calculates how many samples perf can take a second. If the sample length times the expected max number of samples exceeds a configurable threshold, it drops the sample rate. This way, we don't have a runaway sampling process eating up the CPU. This patch can tend to drop the sample rate down to level where perf doesn't work very well. *BUT* the alternative is that my system hangs because it spends all of its time handling NMIs. I'll take a busted performance tool over an entire system that's busted and undebuggable any day. BTW, my suspicion is that there's still an underlying bug here. Using the HPET instead of the TSC is definitely a contributing factor, but I suspect there are some other things going on. But, I can't go dig down on a bug like that with my machine hanging all the time. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus@samba.org Cc: acme@ghostprotocols.net Cc: Dave Hansen <dave@sr71.net> [ Prettified it a bit. ] Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 18 6月, 2013 1 次提交
-
-
由 Ying Xue 提交于
As per feedback from the netdev community, we change the buffer overflow protection algorithm in receiving sockets so that it always respects the nominal upper limit set in sk_rcvbuf. Instead of scaling up from a small sk_rcvbuf value, which leads to violation of the configured sk_rcvbuf limit, we now calculate the weighted per-message limit by scaling down from a much bigger value, still in the same field, according to the importance priority of the received message. To allow for administrative tunability of the socket receive buffer size, we create a tipc_rmem sysctl variable to allow the user to configure an even bigger value via sysctl command. It is a size of three (min/default/max) to be consistent with things like tcp_rmem. By default, the value initialized in tipc_rmem[1] is equal to the receive socket size needed by a TIPC_CRITICAL_IMPORTANCE message. This value is also set as the default value of sk_rcvbuf. Originally-by: NJon Maloy <jon.maloy@ericsson.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Jon Maloy <jon.maloy@ericsson.com> [Ying: added sysctl variation to Jon's original patch] Signed-off-by: NYing Xue <ying.xue@windriver.com> [PG: don't compile sysctl.c if not config'd; add Documentation] Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 6月, 2013 1 次提交
-
-
由 Eliezer Tamir 提交于
Adds an ndo_ll_poll method and the code that supports it. This method can be used by low latency applications to busy-poll Ethernet device queues directly from the socket code. sysctl_net_ll_poll controls how many microseconds to poll. Default is zero (disabled). Individual protocol support will be added by subsequent patches. Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: NEliezer Tamir <eliezer.tamir@linux.intel.com> Acked-by: NEric Dumazet <edumazet@google.com> Tested-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 5月, 2013 2 次提交
-
-
由 Li Zefan 提交于
The old softlockup detector has been replaced with new lockup detector long ago. Signed-off-by: NLi Zefan <lizefan@huawei.com> Acked-by: NDon Zickus <dzickus@redhat.com> Link: http://lkml.kernel.org/r/51959687.9090305@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Li Zefan 提交于
Signed-off-by: NLi Zefan <lizefan@huawei.com> Acked-by: NDon Zickus <dzickus@redhat.com> Link: http://lkml.kernel.org/r/51959678.6000802@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 20 5月, 2013 1 次提交
-
-
由 Rami Rosen 提交于
This patch removes mentioning the sysfsf net_device weight attribute (class/net/<device>/weight) in Documentation/sysctl/net.txt, since the net sysfs weight attribute was removed by the following patch: [NET]: Make NAPI polling independent of struct net_device objects bea3348eSigned-off-by: NRami Rosen <ramirose@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 4月, 2013 2 次提交
-
-
由 Andrew Shewmaker 提交于
Add an admin_reserve_kbytes knob to allow admins to change the hardcoded memory reserve to something other than 3%, which may be multiple gigabytes on large memory systems. Only about 8MB is necessary to enable recovery in the default mode, and only a few hundred MB are required even when overcommit is disabled. This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER. admin_reserve_kbytes is initialized to min(3% free pages, 8MB) I arrived at 8MB by summing the RSS of sshd or login, bash, and top. Please see first patch in this series for full background, motivation, testing, and full changelog. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: make init_admin_reserve() static] Signed-off-by: NAndrew Shewmaker <agshew@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Shewmaker 提交于
Add user_reserve_kbytes knob. Limit the growth of the memory reserved for other user processes to min(3% current process size, user_reserve_pages). Only about 8MB is necessary to enable recovery in the default mode, and only a few hundred MB are required even when overcommit is disabled. user_reserve_pages defaults to min(3% free pages, 128MB) I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ... then adding the RSS of each. This only affects OVERCOMMIT_NEVER mode. Background 1. user reserve __vm_enough_memory reserves a hardcoded 3% of the current process size for other applications when overcommit is disabled. This was done so that a user could recover if they launched a memory hogging process. Without the reserve, a user would easily run into a message such as: bash: fork: Cannot allocate memory 2. admin reserve Additionally, a hardcoded 3% of free memory is reserved for root in both overcommit 'guess' and 'never' modes. This was intended to prevent a scenario where root-cant-log-in and perform recovery operations. Note that this reserve shrinks, and doesn't guarantee a useful reserve. Motivation The two hardcoded memory reserves should be updated to account for current memory sizes. Also, the admin reserve would be more useful if it didn't shrink too much. When the current code was originally written, 1GB was considered "enterprise". Now the 3% reserve can grow to multiple GB on large memory systems, and it only needs to be a few hundred MB at most to enable a user or admin to recover a system with an unwanted memory hogging process. I've found that reducing these reserves is especially beneficial for a specific type of application load: * single application system * one or few processes (e.g. one per core) * allocating all available memory * not initializing every page immediately * long running I've run scientific clusters with this sort of load. A long running job sometimes failed many hours (weeks of CPU time) into a calculation. They weren't initializing all of their memory immediately, and they weren't using calloc, so I put systems into overcommit 'never' mode. These clusters run diskless and have no swap. However, with the current reserves, a user wishing to allocate as much memory as possible to one process may be prevented from using, for example, almost 2GB out of 32GB. The effect is less, but still significant when a user starts a job with one process per core. I have repeatedly seen a set of processes requesting the same amount of memory fail because one of them could not allocate the amount of memory a user would expect to be able to allocate. For example, Message Passing Interfce (MPI) processes, one per core. And it is similar for other parallel programming frameworks. Changing this reserve code will make the overcommit never mode more useful by allowing applications to allocate nearly all of the available memory. Also, the new admin_reserve_kbytes will be safer than the current behavior since the hardcoded 3% of available memory reserve can shrink to something useless in the case where applications have grabbed all available memory. Risks * "bash: fork: Cannot allocate memory" The downside of the first patch-- which creates a tunable user reserve that is only used in overcommit 'never' mode--is that an admin can set it so low that a user may not be able to kill their process, even if they already have a shell prompt. Of course, a user can get in the same predicament with the current 3% reserve--they just have to launch processes until 3% becomes negligible. * root-cant-log-in problem The second patch, adding the tunable rootuser_reserve_pages, allows the admin to shoot themselves in the foot by setting it too small. They can easily get the system into a state where root-can't-log-in. However, the new admin_reserve_kbytes will be safer than the current behavior since the hardcoded 3% of available memory reserve can shrink to something useless in the case where applications have grabbed all available memory. Alternatives * Memory cgroups provide a more flexible way to limit application memory. Not everyone wants to set up cgroups or deal with their overhead. * We could create a fourth overcommit mode which provides smaller reserves. The size of useful reserves may be drastically different depending on the whether the system is embedded or enterprise. * Force users to initialize all of their memory or use calloc. Some users don't want/expect the system to overcommit when they malloc. Overcommit 'never' mode is for this scenario, and it should work well. The new user and admin reserve tunables are simple to use, with low overhead compared to cgroups. The patches preserve current behavior where 3% of memory is less than 128MB, except that the admin reserve doesn't shrink to an unusable size under pressure. The code allows admins to tune for embedded and enterprise usage. FAQ * How is the root-cant-login problem addressed? What happens if admin_reserve_pages is set to 0? Root is free to shoot themselves in the foot by setting admin_reserve_kbytes too low. On x86_64, the minimum useful reserve is: 8MB for overcommit 'guess' 128MB for overcommit 'never' admin_reserve_pages defaults to min(3% free memory, 8MB) So, anyone switching to 'never' mode needs to adjust admin_reserve_pages. * How do you calculate a minimum useful reserve? A user or the admin needs enough memory to login and perform recovery operations, which includes, at a minimum: sshd or login + bash (or some other shell) + top (or ps, kill, etc.) For overcommit 'guess', we can sum resident set sizes (RSS) because we only need enough memory to handle what the recovery programs will typically use. On x86_64 this is about 8MB. For overcommit 'never', we can take the max of their virtual sizes (VSZ) and add the sum of their RSS. We use VSZ instead of RSS because mode forces us to ensure we can fulfill all of the requested memory allocations-- even if the programs only use a fraction of what they ask for. On x86_64 this is about 128MB. When swap is enabled, reserves are useful even when they are as small as 10MB, regardless of overcommit mode. When both swap and overcommit are disabled, then the admin should tune the reserves higher to be absolutley safe. Over 230MB each was safest in my testing. * What happens if user_reserve_pages is set to 0? Note, this only affects overcomitt 'never' mode. Then a user will be able to allocate all available memory minus admin_reserve_kbytes. However, they will easily see a message such as: "bash: fork: Cannot allocate memory" And they won't be able to recover/kill their application. The admin should be able to recover the system if admin_reserve_kbytes is set appropriately. * What's the difference between overcommit 'guess' and 'never'? "Guess" allows an allocation if there are enough free + reclaimable pages. It has a hardcoded 3% of free pages reserved for root. "Never" allows an allocation if there is enough swap + a configurable percentage (default is 50) of physical RAM. It has a hardcoded 3% of free pages reserved for root, like "Guess" mode. It also has a hardcoded 3% of the current process size reserved for additional applications. * Why is overcommit 'guess' not suitable even when an app eventually writes to every page? It takes free pages, file pages, available swap pages, reclaimable slab pages into consideration. In other words, these are all pages available, then why isn't overcommit suitable? Because it only looks at the present state of the system. It does not take into account the memory that other applications have malloced, but haven't initialized yet. It overcommits the system. Test Summary There was little change in behavior in the default overcommit 'guess' mode with swap enabled before and after the patch. This was expected. Systems run most predictably (i.e. no oom kills) in overcommit 'never' mode with swap enabled. This also allowed the most memory to be allocated to a user application. Overcommit 'guess' mode without swap is a bad idea. It is easy to crash the system. None of the other tested combinations crashed. This matches my experience on the Roadrunner supercomputer. Without the tunable user reserve, a system in overcommit 'never' mode and without swap does not allow the admin to recover, although the admin can. With the new tunable reserves, a system in overcommit 'never' mode and without swap can be configured to: 1. maximize user-allocatable memory, running close to the edge of recoverability 2. maximize recoverability, sacrificing allocatable memory to ensure that a user cannot take down a system Test Description Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap System is booted into multiuser console mode, with unnecessary services turned off. Caches were dropped before each test. Hogs are user memtester processes that attempt to allocate all free memory as reported by /proc/meminfo In overcommit 'never' mode, memory_ratio=100 Test Results 3.9.0-rc1-mm1 Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery ---------- ---- ---- ------------- ---- ------------- -------------- guess yes 1 5432/5432 no yes yes guess yes 4 5444/5444 1 yes yes guess no 1 5302/5449 no yes yes guess no 4 - crash no no never yes 1 5460/5460 1 yes yes never yes 4 5460/5460 1 yes yes never no 1 5218/5432 no no yes never no 4 5203/5448 no no yes 3.9.0-rc1-mm1-tunablereserves User and Admin Recovery show their respective reserves, if applicable. Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery ---------- ---- ---- ------------- ---- ------------- -------------- guess yes 1 5419/5419 no - yes 8MB yes guess yes 4 5436/5436 1 - yes 8MB yes guess no 1 5440/5440 * - yes 8MB yes guess no 4 - crash - no 8MB no * process would successfully mlock, then the oom killer would pick it never yes 1 5446/5446 no 10MB yes 20MB yes never yes 4 5456/5456 no 10MB yes 20MB yes never no 1 5387/5429 no 128MB no 8MB barely never no 1 5323/5428 no 226MB barely 8MB barely never no 1 5323/5428 no 226MB barely 8MB barely never no 1 5359/5448 no 10MB no 10MB barely never no 1 5323/5428 no 0MB no 10MB barely never no 1 5332/5428 no 0MB no 50MB yes never no 1 5293/5429 no 0MB no 90MB yes never no 1 5001/5427 no 230MB yes 338MB yes never no 4* 4998/5424 no 230MB yes 338MB yes * more memtesters were launched, able to allocate approximately another 100MB Future Work - Test larger memory systems. - Test an embedded image. - Test other architectures. - Time malloc microbenchmarks. - Would it be useful to be able to set overcommit policy for each memory cgroup? - Some lines are slightly above 80 chars. Perhaps define a macro to convert between pages and kb? Other places in the kernel do this. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: make init_user_reserve() static] Signed-off-by: NAndrew Shewmaker <agshew@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 1月, 2013 2 次提交
-
-
Signed-off-by: NCarlos Alberto Lopez Perez <clopez@igalia.com> Cc: Rob Landley <rob@landley.net> Cc: Larry Finger <Larry.Finger@lwfinger.net> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Stanislav Kinsbursky 提交于
Add 3 new variables and sysctls to tune them (by one "next_id" variable for messages, semaphores and shared memory respectively). This variable can be used to set desired id for next allocated IPC object. By default it's equal to -1 and old behaviour is preserved. If this variable is non-negative, then desired idr will be extracted from it and used as a start value to search for free IDR slot. Notes: 1) this patch doesn't guarantee that the new object will have desired id. So it's up to user space how to handle new object with wrong id. 2) After a sucessful id allocation attempt, "next_id" will be set back to -1 (if it was non-negative). [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: NStanislav Kinsbursky <skinsbursky@parallels.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 10月, 2012 1 次提交
-
-
由 Oleg Nesterov 提交于
Some coredump handlers want to create a core file in a way compatible with standard behavior. Standard behavior with fs.suid_dumpable = 2 is to create core file with uid=gid=0. However, there was no way for coredump handler to know that the process being dumped was suid'ed. This patch adds the new %d specifier for format_corename() which simply reports __get_dumpable(mm->flags), this is compatible with /proc/sys/fs/suid_dumpable we already have. Addresses https://bugzilla.redhat.com/show_bug.cgi?id=787135 Developed during a discussion with Denys Vlasenko. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Denys Vlasenko <vda.linux@googlemail.com> Cc: Alex Kelly <alex.page.kelly@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Cong Wang <amwang@redhat.com> Cc: Jiri Moskovcak <jmoskovc@redhat.com> Acked-by: NNeil Horman <nhorman@tuxdriver.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 8月, 2012 1 次提交
-
-
由 Artem Bityutskiy 提交于
The pdflush thread is long gone, however we still mention it incorrectly in the kernel documentation. This patch fixes the situation. Cc: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: NArtem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 01 8月, 2012 3 次提交
-
-
由 David Rientjes 提交于
The number of ptes and swap entries are used in the oom killer's badness heuristic, so they should be shown in the tasklist dump. This patch adds those fields and replaces cpu and oom_adj values that are currently emitted. Cpu isn't interesting and oom_adj is deprecated and will be removed later this year, the same information is already displayed as oom_score_adj which is used internally. At the same time, make the documentation a little more clear to state this information is helpful to determine why the oom killer chose the task it did to kill. Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wanpeng Li 提交于
Since per-BDI flusher threads were introduced in 2.6, the pdflush mechanism is not used any more. But the old interface exported through /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless. For back-compatibility, printk warning information and return 2 to notify the users that the interface is removed. Signed-off-by: NWanpeng Li <liwp@linux.vnet.ibm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christian Ehrhardt 提交于
Fix of the documentation of /proc/sys/vm/page-cluster to match the behavior of the code and add some comments about what the tunable will change in that behavior. Signed-off-by: NChristian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Acked-by: NJens Axboe <axboe@kernel.dk> Reviewed-by: NMinchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 31 7月, 2012 1 次提交
-
-
由 Kees Cook 提交于
When the suid_dumpable sysctl is set to "2", and there is no core dump pipe defined in the core_pattern sysctl, a local user can cause core files to be written to root-writable directories, potentially with user-controlled content. This means an admin can unknowningly reintroduce a variation of CVE-2006-2451, allowing local users to gain root privileges. $ cat /proc/sys/fs/suid_dumpable 2 $ cat /proc/sys/kernel/core_pattern core $ ulimit -c unlimited $ cd / $ ls -l core ls: cannot access core: No such file or directory $ touch core touch: cannot touch `core': Permission denied $ OHAI="evil-string-here" ping localhost >/dev/null 2>&1 & $ pid=$! $ sleep 1 $ kill -SEGV $pid $ ls -l core -rw------- 1 root kees 458752 Jun 21 11:35 core $ sudo strings core | grep evil OHAI=evil-string-here While cron has been fixed to abort reading a file when there is any parse error, there are still other sensitive directories that will read any file present and skip unparsable lines. Instead of introducing a suid_dumpable=3 mode and breaking all users of mode 2, this only disables the unsafe portion of mode 2 (writing to disk via relative path). Most users of mode 2 (e.g. Chrome OS) already use a core dump pipe handler, so this change will not break them. For the situations where a pipe handler is not defined but mode 2 is still active, crash dumps will only be written to fully qualified paths. If a relative path is defined (e.g. the default "core" pattern), dump attempts will trigger a printk yelling about the lack of a fully qualified path. Signed-off-by: NKees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alan Cox <alan@linux.intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: James Morris <james.l.morris@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 7月, 2012 1 次提交
-
-
由 Kees Cook 提交于
This adds symlink and hardlink restrictions to the Linux VFS. Symlinks: A long-standing class of security issues is the symlink-based time-of-check-time-of-use race, most commonly seen in world-writable directories like /tmp. The common method of exploitation of this flaw is to cross privilege boundaries when following a given symlink (i.e. a root process follows a symlink belonging to another user). For a likely incomplete list of hundreds of examples across the years, please see: http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp The solution is to permit symlinks to only be followed when outside a sticky world-writable directory, or when the uid of the symlink and follower match, or when the directory owner matches the symlink's owner. Some pointers to the history of earlier discussion that I could find: 1996 Aug, Zygo Blaxell http://marc.info/?l=bugtraq&m=87602167419830&w=2 1996 Oct, Andrew Tridgell http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html 1997 Dec, Albert D Cahalan http://lkml.org/lkml/1997/12/16/4 2005 Feb, Lorenzo Hernández García-Hierro http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html 2010 May, Kees Cook https://lkml.org/lkml/2010/5/30/144 Past objections and rebuttals could be summarized as: - Violates POSIX. - POSIX didn't consider this situation and it's not useful to follow a broken specification at the cost of security. - Might break unknown applications that use this feature. - Applications that break because of the change are easy to spot and fix. Applications that are vulnerable to symlink ToCToU by not having the change aren't. Additionally, no applications have yet been found that rely on this behavior. - Applications should just use mkstemp() or O_CREATE|O_EXCL. - True, but applications are not perfect, and new software is written all the time that makes these mistakes; blocking this flaw at the kernel is a single solution to the entire class of vulnerability. - This should live in the core VFS. - This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135) - This should live in an LSM. - This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188) Hardlinks: On systems that have user-writable directories on the same partition as system files, a long-standing class of security issues is the hardlink-based time-of-check-time-of-use race, most commonly seen in world-writable directories like /tmp. The common method of exploitation of this flaw is to cross privilege boundaries when following a given hardlink (i.e. a root process follows a hardlink created by another user). Additionally, an issue exists where users can "pin" a potentially vulnerable setuid/setgid file so that an administrator will not actually upgrade a system fully. The solution is to permit hardlinks to only be created when the user is already the existing file's owner, or if they already have read/write access to the existing file. Many Linux users are surprised when they learn they can link to files they have no access to, so this change appears to follow the doctrine of "least surprise". Additionally, this change does not violate POSIX, which states "the implementation may require that the calling process has permission to access the existing file"[1]. This change is known to break some implementations of the "at" daemon, though the version used by Fedora and Ubuntu has been fixed[2] for a while. Otherwise, the change has been undisruptive while in use in Ubuntu for the last 1.5 years. [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html [2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279 This patch is based on the patches in Openwall and grsecurity, along with suggestions from Al Viro. I have added a sysctl to enable the protected behavior, and documentation. Signed-off-by: NKees Cook <keescook@chromium.org> Acked-by: NIngo Molnar <mingo@elte.hu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 01 6月, 2012 1 次提交
-
-
由 KOSAKI Motohiro 提交于
Commit b231cca4 ("message queues: increase range limits") changed mqueue default value when attr parameter is specified NULL from hard coded value to fs.mqueue.{msg,msgsize}_max sysctl value. This made large side effect. When user need to use two mqueue applications 1) using !NULL attr parameter and it require big message size and 2) using NULL attr parameter and only need small size message, app (1) require to raise fs.mqueue.msgsize_max and app (2) consume large memory size even though it doesn't need. Doug Ledford propsed to switch back it to static hard coded value. However it also has a compatibility problem. Some applications might started depend on the default value is tunable. The solution is to separate default value from maximum value. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NDoug Ledford <dledford@redhat.com> Acked-by: NDoug Ledford <dledford@redhat.com> Acked-by: NJoe Korty <joe.korty@ccur.com> Cc: Amerigo Wang <amwang@redhat.com> Acked-by: NSerge E. Hallyn <serue@us.ibm.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 4月, 2012 1 次提交
-
-
由 Shan Wei 提交于
All parameter descriptions in /proc/sys/net/core/* now is separated two places. So, merge them into Documentation/sysctl/net.txt. Signed-off-by: NShan Wei <davidshan@tencent.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 2月, 2012 1 次提交
-
-
由 Larry Finger 提交于
Two of the bits in the tainted flag are not documented. Signed-off-by: NLarry Finger <Larry.Finger@lwfinger.net> Signed-off-by: NRandy Dunlap <rdunlap@xenotime.net> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 1月, 2012 1 次提交
-
-
由 Pavel Emelyanov 提交于
The sysctl works on the current task's pid namespace, getting and setting its last_pid field. Writing is allowed for CAP_SYS_ADMIN-capable tasks thus making it possible to create a task with desired pid value. This ability is required badly for the checkpoint/restore in userspace. This approach suits all the parties for now. Signed-off-by: NPavel Emelyanov <xemul@parallels.com> Acked-by: NTejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 12月, 2011 1 次提交
-
-
由 Mitsuo Hayasaka 提交于
Currently, messages are just output on the detection of stack overflow, which is not sufficient for systems that need a high reliability. This is because in general the overflow may corrupt data, and the additional corruption may occur due to reading them unless systems stop. This patch adds the sysctl parameter kernel.panic_on_stackoverflow and causes a panic when detecting the overflows of kernel, IRQ and exception stacks except user stack according to the parameter. It is disabled by default. Signed-off-by: NMitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: yrl.pp-manager.tt@hitachi.com Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: http://lkml.kernel.org/r/20111129060836.11076.12323.stgit@ltc219.sdl.hitachi.co.jpSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 01 11月, 2011 1 次提交
-
-
由 Dan Ballard 提交于
Userspace needs to know the highest valid capability of the running kernel, which right now cannot reliably be retrieved from the header files only. The fact that this value cannot be determined properly right now creates various problems for libraries compiled on newer header files which are run on older kernels. They assume capabilities are available which actually aren't. libcap-ng is one example. And we ran into the same problem with systemd too. Now the capability is exported in /proc/sys/kernel/cap_last_cap. [akpm@linux-foundation.org: make cap_last_cap const, per Ulrich] Signed-off-by: NDan Ballard <dan@mindstab.net> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Lennart Poettering <lennart@poettering.net> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Ulrich Drepper <drepper@akkadia.org> Cc: James Morris <jmorris@namei.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 7月, 2011 1 次提交
-
-
由 Vasiliy Kulikov 提交于
Add support for the shm_rmid_forced sysctl. If set to 1, all shared memory objects in current ipc namespace will be automatically forced to use IPC_RMID. The POSIX way of handling shmem allows one to create shm objects and call shmdt(), leaving shm object associated with no process, thus consuming memory not counted via rlimits. With shm_rmid_forced=1 the shared memory object is counted at least for one process, so OOM killer may effectively kill the fat process holding the shared memory. It obviously breaks POSIX - some programs relying on the feature would stop working. So set shm_rmid_forced=1 only if you're sure nobody uses "orphaned" memory. Use shm_rmid_forced=0 by default for compatability reasons. The feature was previously impemented in -ow as a configure option. [akpm@linux-foundation.org: fix documentation, per Randy] [akpm@linux-foundation.org: fix warning] [akpm@linux-foundation.org: readability/conventionality tweaks] [akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout] Signed-off-by: NVasiliy Kulikov <segoon@openwall.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Solar Designer <solar@openwall.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-