  1. 17 Apr, 2015 1 commit
  2. 16 Apr, 2015 1 commit
    • mm: allow compaction of unevictable pages · 5bbe3547
      Committed by Eric B Munson
      Currently, pages which are marked as unevictable are protected from
      compaction, but not from other types of migration.  The POSIX real time
      extension explicitly states that mlock() will prevent a major page
      fault, but the spirit of this is that mlock() should give a process the
      ability to control sources of latency, including minor page faults.
      However, the mlock manpage only explicitly says that a locked page will
      not be written to swap and this can cause some confusion.  The
      compaction code today does not give a developer who wants to avoid swap
      but wants to have large contiguous areas available any method to achieve
      this state.  This patch introduces a sysctl for controlling compaction
      behavior with respect to the unevictable lru.  Users who demand no page
      faults after a page is present can set compact_unevictable_allowed to 0
      and users who need the large contiguous areas can enable compaction on
      locked memory by leaving the default value of 1.
      
      To illustrate this problem I wrote a quick test program that mmaps a
      large number of 1MB files filled with random data.  These maps are
      created locked and read only.  Then every other mmap is unmapped and I
      attempt to allocate huge pages to the static huge page pool.  When the
      compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
      after fragmenting memory.  When the value is set to 1, allocations
      succeed.
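
Before running a test like the one described above, it helps to confirm which mode is active. The helper below is a minimal sketch, not part of the patch: it reads one integer from a sysctl-style file. The function name read_int_sysctl is hypothetical, and the path /proc/sys/vm/compact_unevictable_allowed is assumed to be where the sysctl is exposed.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read a single integer from a sysctl-style file (hypothetical helper).
 * Returns 0 and stores the value in *out on success, -1 on any failure.
 * Assumed path for the sysctl in this commit:
 *   /proc/sys/vm/compact_unevictable_allowed
 */
int read_int_sysctl(const char *path, long *out)
{
	char buf[64];
	char *end;
	FILE *f;

	assert(path != NULL && out != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	errno = 0;
	*out = strtol(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -1;	/* overflow or no digits found */
	return 0;
}
```

A caller would pass the sysctl path and treat a stored value of 0 as "compaction skips unevictable pages" and 1 as the default that allows it.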
      Signed-off-by: Eric B Munson <emunson@akamai.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 15 Apr, 2015 1 commit
    • watchdog: enable the new user interface of the watchdog mechanism · 195daf66
      Committed by Ulrich Obergfell
      With the current user interface of the watchdog mechanism it is only
      possible to disable or enable both lockup detectors at the same time.
      This series introduces new kernel parameters and changes the semantics of
      some existing kernel parameters, so that the hard lockup detector and the
      soft lockup detector can be disabled or enabled individually.  With this
      series applied, the user interface is as follows.
      
      - parameters in /proc/sys/kernel
      
        . soft_watchdog
          This is a new parameter to control and examine the run state of
          the soft lockup detector.
      
        . nmi_watchdog
          The semantics of this parameter have changed. It can now be used
          to control and examine the run state of the hard lockup detector.
      
        . watchdog
          This parameter is still available to control the run state of both
          lockup detectors at the same time. If this parameter is examined,
          it shows the logical OR of soft_watchdog and nmi_watchdog.
      
        . watchdog_thresh
          The semantics of this parameter are not affected by the patch.
      
      - kernel command line parameters
      
        . nosoftlockup
          The semantics of this parameter have changed. It can now be used
          to disable the soft lockup detector at boot time.
      
        . nmi_watchdog=0 or nmi_watchdog=1
          Disable or enable the hard lockup detector at boot time. The patch
          introduces '=1' as a new option.
      
        . nowatchdog
          The semantics of this parameter are not affected by the patch. It
          is still available to disable both lockup detectors at boot time.
      
      Also, remove the proc_dowatchdog() function which is no longer needed.
      
      [dzickus@redhat.com: wrote changelog]
      [dzickus@redhat.com: update documentation for kernel params and sysctl]
      Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: Don Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 12 Feb, 2015 1 commit
    • mm: account pmd page tables to the process · dc6c9a35
      Committed by Kirill A. Shutemov
      Dave noticed that an unprivileged process can allocate a significant
      amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
      oom-killer and memory cgroup.  The trick is to allocate a lot of PMD
      page tables.  The Linux kernel doesn't account PMD tables to the
      process, only PTE tables.
      
      The use case below uses a few tricks to allocate a lot of PMD page tables
      while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.
      
      	#include <errno.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/prctl.h>
      
      	#define PUD_SIZE (1UL << 30)
      	#define PMD_SIZE (1UL << 21)
      
      	#define NR_PUD 130000
      
      	int main(void)
      	{
      		char *addr = NULL;
      		unsigned long i;
      
      		prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);	/* flag=1 disables THP; unused args must be 0 */
      		for (i = 0; i < NR_PUD ; i++) {
      			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      			if (addr == MAP_FAILED) {
      				perror("mmap");
      				break;
      			}
      			*addr = 'x';
      			munmap(addr, PMD_SIZE);
      			/* Check the re-mmap result itself; addr was already validated above. */
      			if (mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0) == MAP_FAILED) {
      				perror("re-mmap");
      				exit(1);
      			}
      		}
      		printf("PID %d consumed %lu KiB in PMD page tables\n",
      				getpid(), i * 4096 >> 10);
      		return pause();
      	}
      
      The patch addresses the issue by accounting PMD tables to the process
      the same way we account PTE tables.
      
      The main places where PMD tables are accounted are __pmd_alloc() and
      free_pmd_range(). But there are a few corner cases:
      
       - HugeTLB can share PMD page tables. The patch handles this by
         accounting the table to all processes that share it.
      
       - x86 PAE pre-allocates a few PMD tables on fork.
      
       - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the
         sanity check on exit(2).
      
      Accounting only happens on configurations where the PMD page table
      level is present (PMD is not folded).  As with nr_ptes, we use a per-mm
      counter.  The counter value is used by the oom-killer to calculate the
      baseline for the badness score.
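
The ">500 MiB" figure can be checked against the constants in the test program: each of the NR_PUD iterations leaves one PMD table behind, and the program's printf counts 4096 bytes per table. The sketch below only redoes that arithmetic; the function names are hypothetical.

```c
#include <assert.h>

#define PMD_TABLE_BYTES 4096UL	/* bytes per PMD table, as used in the printf of the test program */

/* KiB consumed by nr_tables PMD page tables. */
unsigned long pmd_tables_kib(unsigned long nr_tables)
{
	/* Guard against overflow of the multiplication below. */
	assert(nr_tables <= ~0UL / PMD_TABLE_BYTES);
	return nr_tables * PMD_TABLE_BYTES >> 10;
}

/* Whole MiB consumed, rounded down. */
unsigned long pmd_tables_mib(unsigned long nr_tables)
{
	return pmd_tables_kib(nr_tables) >> 10;
}
```

With NR_PUD = 130000 this gives 520000 KiB, which is 507 whole MiB, consistent with the ">500 MiB" quoted in the message.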
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 03 Feb, 2015 1 commit
    • net-timestamp: no-payload only sysctl · b245be1f
      Committed by Willem de Bruijn
      Tx timestamps are looped onto the error queue on top of an skb. This
      mechanism leaks packet headers to processes unless the no-payload
      option SOF_TIMESTAMPING_OPT_TSONLY is set.
      
      Add a sysctl that optionally drops looped timestamps with data. This
      only affects processes without CAP_NET_RAW.
      
      The policy is checked when timestamps are generated in the stack.
      It is possible for timestamps with data to be reported after the
      sysctl is set, if these were queued internally earlier.
      
      No vulnerability is immediately known that exploits knowledge
      gleaned from packet headers, but it may still be preferable to allow
      administrators to lock down this path at the cost of possible
      breakage of legacy applications.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      
      ----
      
      Changes
        (v1 -> v2)
        - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
        (rfc -> v1)
        - document the sysctl in Documentation/sysctl/net.txt
        - fix access control race: read .._OPT_TSONLY only once,
              use same value for permission check and skb generation.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 29 Jan, 2015 1 commit
  7. 22 Dec, 2014 1 commit
  8. 14 Dec, 2014 1 commit
    • ipc/msg: increase MSGMNI, remove scaling · 0050ee05
      Committed by Manfred Spraul
      SysV can be abused to allocate locked kernel memory.  For most systems, a
      small limit doesn't make sense; see the discussion regarding SHMMAX.
      
      Therefore: increase MSGMNI to the maximum supported.
      
      And: If we ignore the risk of locking too much memory, then an automatic
      scaling of MSGMNI doesn't make sense.  Therefore the logic can be removed.
      
      The code preserves auto_msgmni to avoid breaking any user space applications
      that expect that the value exists.
      
      Notes:
      1) If an administrator must limit the memory allocations, then he can set
      MSGMNI as necessary.
      
      Or he can disable sysv entirely (as e.g. done by Android).
      
      2) MSGMAX and MSGMNB are intentionally not increased, as these values are used
      to control latency vs. throughput:
      If MSGMNB is large, then msgsnd() just returns and more messages can be queued
      before a task switch to a task that calls msgrcv() is forced.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 11 Dec, 2014 1 commit
    • kernel: add panic_on_warn · 9e3961a0
      Committed by Prarit Bhargava
      There have been several times where I have had to rebuild a kernel to
      cause a panic when hitting a WARN() in the code in order to get a crash
      dump from a system.  Sometimes this is easy to do, other times (such as
      in the case of a remote admin) it is not trivial to send new images to
      the user.
      
      A much easier method would be a switch to change the WARN() over to a
      panic.  This makes debugging easier in that I can now test the actual
      image the WARN() was seen on and I do not have to engage in remote
      debugging.
      
      This patch adds a panic_on_warn kernel parameter and a
      /proc/sys/kernel/panic_on_warn sysctl.  When either is set, panic() is
      called in the warn_slowpath_common() path.  The function will still
      print out the location of the warning.
      
      An example of the panic_on_warn output:
      
      The first line below is from the WARN_ON() to output the WARN_ON()'s
      location.  After that the panic() output is displayed.
      
          WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
          Kernel panic - not syncing: panic_on_warn set ...
      
          CPU: 30 PID: 11698 Comm: insmod Tainted: G        W  OE  3.17.0+ #57
          Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
           0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190
           0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec
           ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df
          Call Trace:
           [<ffffffff81665190>] dump_stack+0x46/0x58
           [<ffffffff8165e2ec>] panic+0xd0/0x204
           [<ffffffffa038e05f>] ? init_dummy+0x1f/0x30 [dummy_module]
           [<ffffffff81076b90>] warn_slowpath_common+0xd0/0xd0
           [<ffffffffa038e040>] ? dummy_greetings+0x40/0x40 [dummy_module]
           [<ffffffff81076c8a>] warn_slowpath_null+0x1a/0x20
           [<ffffffffa038e05f>] init_dummy+0x1f/0x30 [dummy_module]
           [<ffffffff81002144>] do_one_initcall+0xd4/0x210
           [<ffffffff811b52c2>] ? __vunmap+0xc2/0x110
           [<ffffffff810f8889>] load_module+0x16a9/0x1b30
           [<ffffffff810f3d30>] ? store_uevent+0x70/0x70
           [<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180
           [<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0
           [<ffffffff8166cf29>] system_call_fastpath+0x12/0x17
      
      Successfully tested by me.
      
      hpa said: There is another very valid use for this: many operators would
      rather a machine shuts down than being potentially compromised either
      functionally or security-wise.
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Acked-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 17 Nov, 2014 1 commit
    • net: provide a per host RSS key generic infrastructure · 960fb622
      Committed by Eric Dumazet
      RSS (Receive Side Scaling) typically uses a Toeplitz hash and a 40- or
      52-byte RSS key.
      
      Some drivers use a constant (and well-known) key, some drivers use a
      random key per port, making bonding setups hard to tune. Well-known keys
      increase the attack surface, considering that the number of queues is
      usually a power of two.
      
      This patch provides infrastructure to help drivers do the right thing.
      
      netdev_rss_key_fill() should be used by drivers to initialize their RSS
      key, even if they provide ethtool -X support to let the user redefine the
      key later.
      
      A new /proc/sys/net/core/netdev_rss_key file can be used to get the host
      RSS key even for drivers not providing ethtool -x support, in case some
      applications want to precisely set up flows to match some RX queues.
      
      Tested:
      
      myhost:~# cat /proc/sys/net/core/netdev_rss_key
      11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41:36:40:74:b6:15:ca:27:44:aa:b3:4d:72
      
      myhost:~# ethtool -x eth0
      RX flow hash indirection table for eth0 with 8 RX ring(s):
          0:      0     1     2     3     4     5     6     7
      RSS hash key:
      11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41
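
An application that wants to use the key shown above first has to turn the colon-separated hex text into bytes. The parser below is a minimal sketch under that assumption; parse_rss_key and hex_val are hypothetical names and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Parse text such as "11:63:99:bb" (optionally ending in a newline) into
 * bytes.  Returns the number of bytes stored in out, or -1 if the text is
 * malformed or holds more than cap bytes.
 */
int parse_rss_key(const char *text, unsigned char *out, size_t cap)
{
	size_t n = 0;

	assert(text != NULL && out != NULL);
	for (;;) {
		int hi = hex_val(text[0]);
		int lo = hi < 0 ? -1 : hex_val(text[1]);

		if (hi < 0 || lo < 0 || n == cap)
			return -1;
		out[n++] = (unsigned char)(hi << 4 | lo);
		text += 2;
		if (*text == ':') {
			text++;
			continue;
		}
		/* Only end of string or a trailing newline may follow a byte. */
		return (*text == '\0' || *text == '\n') ? (int)n : -1;
	}
}
```

A 52-byte buffer is enough for the /proc output above; the ethtool output in the same test shows only the first 40 bytes of that key.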
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 12 Nov, 2014 1 commit
    • net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited · ba7a46f1
      Committed by Joe Perches
      Use the more common dynamic_debug capable net_dbg_ratelimited
      and remove the LIMIT_NETDEBUG macro.
      
      All messages are still ratelimited.
      
      Some KERN_<LEVEL> uses are changed to KERN_DEBUG.
      
      This may have some negative impact on messages that were
      emitted at KERN_INFO, which are not enabled at all unless
      DEBUG is defined or dynamic_debug is enabled.  Even so,
      these messages are now _not_ emitted by default.
      
      This also eliminates the use of the net_msg_warn sysctl
      "/proc/sys/net/core/warnings".  For backward compatibility,
      the sysctl is not removed, but it has no function.  The extern
      declaration of net_msg_warn is removed from sock.h and made
      static in net/core/sysctl_net_core.c
      
      Miscellanea:
      
      o Update the sysctl documentation
      o Remove the embedded uses of pr_fmt
      o Coalesce format fragments
      o Realign arguments
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 14 Oct, 2014 1 commit
    • coredump: add %i/%I in core_pattern to report the tid of the crashed thread · b03023ec
      Committed by Oleg Nesterov
      format_corename() can only pass the leader's pid to the core handler,
      but there is no simple way to figure out which thread originated the
      coredump.
      
      As Jan explains, this also means that there is no simple way to create
      the backtrace of the crashed process:
      
      As programs are mostly compiled with implicit gcc -fomit-frame-pointer
      one needs program's .eh_frame section (equivalently PT_GNU_EH_FRAME
      segment) or .debug_frame section.  .debug_frame usually is present only
      in separate debug info files usually not even installed on the system.
      While .eh_frame is a part of the executable/library (and it is even
      always mapped for C++ exceptions unwinding) it no longer has to be
      present anywhere on the disk as the program could be upgraded in the
      meantime and the running instance has its executable file already
      unlinked from disk.
      
      One possibility is to echo 0x3f >/proc/*/coredump_filter and dump all
      the file-backed memory including the executable's .eh_frame section.
      But that can create huge core files, for example even due to mmapped
      data files.
      
      Another possibility would be to read .eh_frame from /proc/PID/mem when
      the core_pattern handler runs.  For the backtrace one needs to read the
      register state first, which can be done from the core_pattern handler:
      
          ptrace(PTRACE_SEIZE, tid, 0, PTRACE_O_TRACEEXIT)
          close(0);    // close pipe fd to resume the sleeping dumper
          waitpid();   // should report EXIT
          PTRACE_GETREGS or other requests
      
      The remaining problem is how to get the 'tid' value of the crashed
      thread.  It could be read from the first NT_PRSTATUS note of the core
      file but that makes the core_pattern handler complicated.
      
      Unfortunately %t is already used so this patch uses %i/%I.
      
      Automatic Bug Reporting Tool (https://github.com/abrt/abrt/wiki/overview)
      is experimenting with this.  It is using the elfutils
      (https://fedorahosted.org/elfutils/) unwinder for generating the
      backtraces.  Apart from not needing matching executables as mentioned
      above, another advantage is that we can get the backtrace without saving
      the core (which might be quite large) to disk.
      
      [mmilata@redhat.com: final paragraph of changelog]
      Signed-off-by: Jan Kratochvil <jan.kratochvil@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Mark Wielaard <mjw@redhat.com>
      Cc: Martin Milata <mmilata@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 02 Sep, 2014 1 commit
    • tipc: add name distributor resiliency queue · a5325ae5
      Committed by Erik Hugne
      TIPC name table updates are distributed asynchronously in a cluster,
      entailing a risk of certain race conditions. E.g., if two nodes
      simultaneously issue conflicting (overlapping) publications, this may
      not be detected until both publications have reached a third node, in
      which case one of the publications will be silently dropped on that
      node. Hence, we end up with an inconsistent name table.
      
      In most cases this conflict is just a temporary race, e.g., one
      node is issuing a publication under the assumption that a previous,
      conflicting, publication has already been withdrawn by the other node.
      However, because of the (rtt related) distributed update delay, this
      may not yet hold true on all nodes. The symptom of this failure is a
      syslog message: "tipc: Cannot publish {%u,%u,%u}, overlap error".
      
      In this commit we add a resiliency queue at the receiving end of
      the name table distributor. When insertion of an arriving publication
      fails, we retain it in this queue for a short amount of time, assuming
      that another update will arrive very soon and clear the conflict. If
      that happens, we insert the publication; otherwise we drop it.
      
      The (configurable) retention value defaults to 2000 ms. Knowing from
      experience that the situation described above is extremely rare, there
      is no risk that the queue will accumulate any large number of items.
      Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 09 Aug, 2014 1 commit
  15. 24 Jun, 2014 2 commits
    • kernel/watchdog.c: print traces for all cpus on lockup detection · ed235875
      Committed by Aaron Tomlin
      A 'softlockup' is defined as a bug that causes the kernel to loop in
      kernel mode for more than a predefined period of time, without giving
      other tasks a chance to run.
      
      Currently, upon detection of this condition by the per-cpu watchdog
      task, debug information (including a stack trace) is sent to the system
      log.
      
      On some occasions, we have observed that the "victim" rather than the
      actual "culprit" (i.e.  the owner/holder of the contended resource) is
      reported to the user.  Often this information has proven to be
      insufficient to assist debugging efforts.
      
      To avoid loss of useful debug information, for architectures which
      support NMI, this patch makes it possible to improve soft lockup
      reporting.  This is accomplished by issuing an NMI to each cpu to obtain
      a stack trace.
      
      If NMI is not supported we just fall back to the old method.  A sysctl
      and a boot-time parameter are available to toggle this feature.
      
      [dzickus@redhat.com: add CONFIG_SMP in certain areas]
      [akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations]
      [mq@suse.cz: fix warning]
      Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
      Signed-off-by: Don Zickus <dzickus@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Jan Moskyto Matejka <mq@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, pcp: allow restoring percpu_pagelist_fraction default · 7cd2b0a3
      Committed by David Rientjes
      Oleg reports a division by zero error on zero-length write() to the
      percpu_pagelist_fraction sysctl:
      
          divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
          CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
          Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
          task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
          RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
          RSP: 0018:ffff8800d87a3e78  EFLAGS: 00010246
          RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
          RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
          RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
          R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
          R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
          FS:  00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
          Call Trace:
            proc_sys_call_handler+0xb3/0xc0
            proc_sys_write+0x14/0x20
            vfs_write+0xba/0x1e0
            SyS_write+0x46/0xb0
            tracesys+0xe1/0xe6
      
      However, if the percpu_pagelist_fraction sysctl is set by the user, it
      is also impossible to restore it to the kernel default since the user
      cannot write 0 to the sysctl.
      
      This patch allows the user to write 0 to restore the default behavior.
      It still requires a fraction equal to or larger than 8, however, as
      stated by the documentation for sanity.  If a value in the range [1, 7]
      is written, the sysctl will return EINVAL.
      
      This successfully solves the divide by zero issue at the same time.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Oleg Drokin <green@linuxhacker.ru>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 07 Jun, 2014 1 commit
    • sysctl: allow for strict write position handling · f4aacea2
      Committed by Kees Cook
      When writing to a sysctl string, each write, regardless of VFS position,
      begins writing the string from the start.  This means the contents of
      the last write to the sysctl controls the string contents instead of the
      first:
      
        open("/proc/sys/kernel/modprobe", O_WRONLY)   = 1
        write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
        write(1, "/bin/true", 9)                = 9
        close(1)                                = 0
      
        $ cat /proc/sys/kernel/modprobe
        /bin/true
      
      Expected behaviour would be to have the sysctl be "AAAA..." capped at
      maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
      contents of the second write.  Similarly, multiple short writes would
      not append to the sysctl.
      
      The old behavior is unlike regular POSIX files enough that doing audits
      of software that interact with sysctls can end up in unexpected or
      dangerous situations.  For example, "as long as the input starts with a
      trusted path" turns out to be an insufficient filter, as what must also
      happen is for the input to be entirely contained in a single write
      syscall -- not a common consideration, especially for high level tools.
      
      This provides kernel.sysctl_writes_strict as a way to make this behavior
      act in a less surprising manner for strings, and disallows non-zero file
      position when writing numeric sysctls (similar to what is already done
      when reading from non-zero file positions).  For now, the default (0) is
      to warn about non-zero file position use, but retain the legacy
      behavior.  Setting this to -1 disables the warning, and setting this to
      1 enables the file position respecting behavior.
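
The difference between the legacy and the position-respecting behaviour for string sysctls can be reproduced with a user-space model. This is a sketch under stated assumptions, not the kernel code: sysctl_string_write is a hypothetical name, and ignoring a write that starts beyond the current end of the string is how the sketch models strict mode.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of a write to a string sysctl backed by buf, which holds a
 * NUL-terminated string in maxlen bytes (terminator included).
 * strict == 0: legacy behaviour, every write starts at offset 0.
 * strict != 0: the write lands at file position pos; a write that starts
 *              beyond the current end of the string is ignored.
 * Returns the number of bytes consumed from the caller, like write(2).
 */
size_t sysctl_string_write(char *buf, size_t maxlen, size_t pos,
			   const char *data, size_t len, int strict)
{
	size_t start = 0;
	size_t room, n;

	assert(buf != NULL && data != NULL && maxlen > 0);

	if (strict) {
		if (pos > strlen(buf))
			return len;	/* past the end: accepted but ignored */
		start = pos;
	}
	room = maxlen - 1 - start;	/* keep one byte for the NUL */
	n = len < room ? len : room;
	memcpy(buf + start, data, n);
	buf[start + n] = '\0';
	return len;
}
```

Replaying the two writes from the strace above against a 256-byte buffer leaves "/bin/true" in legacy mode and 255 'A' characters in strict mode, which matches the expected behaviour described in the message.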
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: move misplaced hunk, per Randy]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 05 Jun, 2014 2 commits
    • Documentation/sysctl/vm.txt: clarify vfs_cache_pressure description · 4a0da71b
      Committed by Denys Vlasenko
      Existing description is worded in a way which almost encourages setting of
      vfs_cache_pressure above 100, possibly way above it.
      
      Users are left in the dark about what this numeric value is - an int?  a
      percentage?  what is the scale?
      
      As a result, we are getting reports about noticeable performance
      degradation from users who have set vfs_cache_pressure to ridiculously
      high values - because they thought there is no downside to it.
      
      Via code inspection it's obvious that this value is treated as a
      percentage.  This patch changes text to reflect this fact, and adds a
      cautionary paragraph advising against setting vfs_cache_pressure sky high.
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: disable zone_reclaim_mode by default · 4f9b16a6
      Committed by Mel Gorman
      When it was introduced, zone_reclaim_mode made sense because NUMA
      distances were punishing and workloads were generally partitioned to fit
      into a NUMA node.  NUMA machines are now common, but few workloads are
      NUMA-aware, and it's routine to see major performance degradation due to
      zone_reclaim_mode being enabled, yet relatively few users can identify
      the problem.
      
      Those that require zone_reclaim_mode are likely to be able to detect
      when it needs to be enabled and tune appropriately, so let's have a
      sensible default for the bulk of users.
      
      This patch (of 2):
      
      zone_reclaim_mode causes processes to prefer reclaiming memory from
      local node instead of spilling over to other nodes.  This made sense
      initially when NUMA machines were almost exclusively HPC and the
      workload was partitioned into nodes.  The NUMA penalties were
      sufficiently high to justify reclaiming the memory.  On current machines
      and workloads it is often the case that zone_reclaim_mode destroys
      performance but not all users know how to detect this.  Favour the
      common case and disable it by default.  Users that are sophisticated
      enough to know they need zone_reclaim_mode will detect it.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 08 Apr, 2014 1 commit
  19. 04 Apr, 2014 1 commit
  20. 13 Mar, 2014 1 commit
    • Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULE · 66cc69e3
      Committed by Mathieu Desnoyers
      Users have reported being unable to trace non-signed modules loaded
      within a kernel supporting module signature.
      
      This is caused by tracepoint.c:tracepoint_module_coming() refusing to
      take into account tracepoints sitting within force-loaded modules
      (TAINT_FORCED_MODULE). The reason for this check, in the first place, is
      that a force-loaded module may have a struct module incompatible with
      the layout expected by the kernel, and can thus cause a kernel crash
      upon forced load of that module on a kernel with CONFIG_TRACEPOINTS=y.
      
      Tracepoints, however, specifically accept TAINT_OOT_MODULE and
      TAINT_CRAP, since those modules do not lead to the "very likely system
      crash" issue cited above for force-loaded modules.
      
      With kernels having CONFIG_MODULE_SIG=y (signed modules), a non-signed
      module is tainted re-using the TAINT_FORCED_MODULE taint flag.
      Unfortunately, this means that Tracepoints treat that module as a
      force-loaded module, and thus silently refuse to consider any tracepoint
      within this module.
      
      Since an unsigned module does not fit within the "very likely system
      crash" category of tainting, add a new TAINT_UNSIGNED_MODULE taint flag
      to specifically address this taint behavior, and accept those modules
      within Tracepoints. We use the letter 'X' as a taint flag character for
      a module being loaded that doesn't know how to sign its name (proposed
      by Steven Rostedt).
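
A minimal sketch of the taint check described above, written as a userspace model rather than the kernel's actual code. The bit positions and the helper name `module_has_bad_taint` are illustrative assumptions; the only point taken from the commit message is that TAINT_OOT_MODULE, TAINT_CRAP and the new TAINT_UNSIGNED_MODULE are accepted, while any other taint (such as TAINT_FORCED_MODULE) causes the module's tracepoints to be ignored.

```c
#include <stdbool.h>

/* Illustrative bit positions (assumed); the real values live in kernel headers. */
enum {
    TAINT_FORCED_MODULE   = 1,
    TAINT_CRAP            = 10,
    TAINT_OOT_MODULE      = 12,
    TAINT_UNSIGNED_MODULE = 13,
};

/* Taints that do not imply an incompatible struct module layout. */
#define TRACEPOINT_OK_TAINTS            \
    ((1UL << TAINT_CRAP) |              \
     (1UL << TAINT_OOT_MODULE) |        \
     (1UL << TAINT_UNSIGNED_MODULE))

/*
 * Hypothetical helper: returns true when the module carries any taint
 * outside the accepted set, meaning its tracepoints should be skipped.
 */
static bool module_has_bad_taint(unsigned long taints)
{
    return (taints & ~TRACEPOINT_OK_TAINTS) != 0;
}
```

With this mask, a module tainted only with TAINT_UNSIGNED_MODULE passes the check, whereas one that also carries TAINT_FORCED_MODULE is still rejected.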
      
      Also add the missing 'O' entry to trace event show_module_flags() list
      for the sake of completeness.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      NAKed-by: Ingo Molnar <mingo@redhat.com>
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: David Howells <dhowells@redhat.com>
      CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
  21. 31 Jan, 2014 1 commit
  22. 30 Jan, 2014 1 commit
  23. 28 Jan, 2014 1 commit
  24. 25 Jan, 2014 1 commit
  25. 24 Jan, 2014 1 commit
    • K
      kexec: add sysctl to disable kexec_load · 7984754b
      Committed by Kees Cook
      For general-purpose (i.e.  distro) kernel builds it makes sense to build
      with CONFIG_KEXEC to allow end users to choose what kind of things they
      want to do with kexec.  However, in the face of trying to lock down a
      system with such a kernel, there needs to be a way to disable kexec_load
      (much like module loading can be disabled).  Without this, it is too easy
      for the root user to modify kernel memory even when CONFIG_STRICT_DEVMEM
      and modules_disabled are set.  With this change, it is still possible to
      load an image for use later, then disable kexec_load so the image (or lack
      of image) can't be altered.
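
As a rough sketch of the "load an image, then lock it" flow, the snippet below writes a value to a sysctl file from C. The path `/proc/sys/kernel/kexec_load_disabled` is not spelled out in the commit message and is an assumption here, as are the helper names `write_sysctl` and `lock_kexec_image`; writing to the real file requires root.

```c
#include <stdio.h>

/* Assumed location of the sysctl; not named in the commit message. */
#define KEXEC_LOAD_DISABLED_PATH "/proc/sys/kernel/kexec_load_disabled"

/* Write a value string to a sysctl file; returns 0 on success, -1 on failure. */
static int write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)     /* buffered write errors surface at close time */
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper: call after the desired image (if any) has been loaded. */
static int lock_kexec_image(void)
{
    return write_sysctl(KEXEC_LOAD_DISABLED_PATH, "1\n");
}
```

An init script would typically do the equivalent early in boot, right after loading the crash kernel image.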
      
      The intention is for using this in environments where "perfect"
      enforcement is hard.  Without a verified boot, along with verified
      modules, and along with verified kexec, this is trying to give a system a
      better chance to defend itself (or at least grow the window of
      discoverability) against attack in the face of a privilege escalation.
      
      In my mind, I consider several boot scenarios:
      
      1) Verified boot of read-only verified root fs loading fd-based
         verification of kexec images.
      2) Secure boot of writable root fs loading signed kexec images.
      3) Regular boot loading kexec (e.g. kcrash) image early and locking it.
      4) Regular boot with no control of kexec image at all.
      
      1 and 2 don't exist yet, but will soon once the verified kexec series has
      landed.  4 is the state of things now.  The gap between 2 and 4 is too
      large, so this change creates scenario 3, a middle-ground above 4 when 2
      and 1 are not possible for a system.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  26. 22 Jan, 2014 1 commit
    • J
      mm: add overcommit_kbytes sysctl variable · 49f0ce5f
      Committed by Jerome Marchand
      Some applications that run on HPC clusters are designed around the
      availability of RAM and the overcommit ratio is fine tuned to get the
      maximum usage of memory without swapping.  With growing memory, the
      1%-of-all-RAM grain provided by overcommit_ratio has become too coarse
      for these workloads (on a 2TB machine it represents no less than 20GB).
      
      This patch adds the new overcommit_kbytes sysctl variable that allows a
      much finer grain.
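
The granularity argument can be checked with a little arithmetic. The sketch below is a simplified userspace model, not the kernel's implementation: the function names are hypothetical, hugetlb pages are ignored, and it assumes that a non-zero overcommit_kbytes takes precedence over overcommit_ratio and that swap is added on top.

```c
#include <stdint.h>

/* Smallest adjustment possible with overcommit_ratio: 1% of RAM, in kB. */
static uint64_t ratio_step_kb(uint64_t ram_kb)
{
    return ram_kb / 100;
}

/*
 * Simplified commit limit in kB (assumed model): a non-zero
 * overcommit_kb overrides the percentage-based value.
 */
static uint64_t commit_limit_kb(uint64_t ram_kb, uint64_t swap_kb,
                                unsigned ratio_percent, uint64_t overcommit_kb)
{
    uint64_t allowed = overcommit_kb ? overcommit_kb
                                     : ram_kb * ratio_percent / 100;
    return allowed + swap_kb;
}
```

For a 2TB machine (2147483648 kB), `ratio_step_kb` yields 21474836 kB, which is the roughly 20GB step the message refers to, while overcommit_kbytes can be set to any kB value.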
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix nommu build]
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  27. 17 Dec, 2013 1 commit
  28. 13 Nov, 2013 2 commits
    • R
      vsprintf: check real user/group id for %pK · 312b4e22
      Committed by Ryan Mallon
      Some setuid binaries will allow reading of files which have read
      permission by the real user id.  This is problematic with files which
      use %pK because the file access permission is checked at open() time,
      but the kptr_restrict setting is checked at read() time.  If a setuid
      binary opens a %pK file as an unprivileged user, and then elevates
      permissions before reading the file, then kernel pointer values may be
      leaked.
      
      This happens for example with the setuid pppd application on Ubuntu 12.04:
      
        $ head -1 /proc/kallsyms
        00000000 T startup_32
      
        $ pppd file /proc/kallsyms
        pppd: In file /proc/kallsyms: unrecognized option 'c1000000'
      
      This will only leak the pointer value from the first line, but other
      setuid binaries may leak more information.
      
      Fix this by adding a check that, in addition to the current process having
      CAP_SYSLOG, the effective user and group ids are equal to the real ids.
      If a setuid binary reads the contents of a file which uses %pK then the
      pointer values will be printed as NULL if the real user is unprivileged.
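
The check described above can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the vsprintf code itself: `struct cred_model` and `restricted_ptr` are hypothetical names, and the meaning of the kptr_restrict values (0 prints the pointer, 1 applies the credential check, 2 always hides it) is taken from the sysctl documentation rather than from this commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the credentials of the reading process. */
struct cred_model {
    unsigned uid, euid;   /* real and effective user id */
    unsigned gid, egid;   /* real and effective group id */
    bool cap_syslog;      /* whether the process holds CAP_SYSLOG */
};

/* Returns the pointer that %pK would print, or NULL when it must be hidden. */
static const void *restricted_ptr(const void *ptr, int kptr_restrict,
                                  const struct cred_model *c)
{
    switch (kptr_restrict) {
    case 0:
        return ptr;
    case 1:
        /* A setuid binary has euid != uid, so it fails this check. */
        if (!c->cap_syslog || c->euid != c->uid || c->egid != c->gid)
            return NULL;
        return ptr;
    default:
        return NULL;
    }
}
```

In this model a setuid-root program started by an unprivileged user gets NULL even though it holds CAP_SYSLOG, which is the pppd case from the example.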
      
      Update the sysctl documentation to reflect the changes, and also correct
      the documentation to state that kptr_restrict=0 is the default.
      
      This is only a temporary solution to the issue.  The correct solution is
      to do the permission check at open() time on files, and to replace %pK
      with a function which checks the open() time permission.  %pK uses in
      printk should be removed since no sane permission check can be done, and
      instead protected by using dmesg_restrict.
      Signed-off-by: Ryan Mallon <rmallon@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Joe Perches <joe@perches.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Z
      mm: improve the description for dirty_background_ratio/dirty_ratio sysctl · 715ea41e
      Committed by Zheng Liu
      Currently dirty_background_ratio/dirty_ratio hold a percentage of total
      available memory, which consists of free pages and reclaimable pages.
      The number of these pages is not equal to the total system memory.
      But they are described as a percentage of total system memory in
      Documentation/sysctl/vm.txt.  So we need to fix the description to avoid
      misunderstanding.
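
To make the difference concrete, here is a small worked example. It is a simplified model with a hypothetical function name and made-up page counts; it only illustrates that the percentage applies to free plus reclaimable pages rather than to all of system memory.

```c
#include <stdint.h>

/*
 * Simplified model: the dirty threshold is a percentage of available
 * memory (free + reclaimable pages), not of total system memory.
 */
static uint64_t dirty_threshold_pages(uint64_t free_pages,
                                      uint64_t reclaimable_pages,
                                      unsigned ratio_percent)
{
    return (free_pages + reclaimable_pages) * ratio_percent / 100;
}
```

On a hypothetical system with 10000 pages in total, of which 1000 are free and 3000 reclaimable, a ratio of 20 gives a threshold of 800 pages, not the 2000 pages that "20% of total system memory" would suggest.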
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Cc: Rob Landley <rob@landley.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  29. 09 Oct, 2013 5 commits
  30. 12 Sep, 2013 2 commits
  31. 31 Aug, 2013 1 commit
    • S
      qdisc: allow setting default queuing discipline · 6da7c8fc
      Committed by stephen hemminger
      The pfifo_fast queueing discipline has so far been the default
      for all devices. But we have better choices now.
      
      This patch allows setting the default queueing discipline with sysctl.
      This allows easy use of better queueing disciplines on all devices
      without having to use tc qdisc scripts. It is intended to allow
      an easy path for distributions to make fq_codel or sfq the default
      qdisc.
      
      This patch also makes pfifo_fast more of a first class qdisc, since
      it is now possible to manually override the default and explicitly
      use pfifo_fast. The behavior for systems that do not use the sysctl
      is unchanged; they still get pfifo_fast.
      
      Also removes leftover random # in sysctl net core.
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  32. 02 Aug, 2013 1 commit