1. 05 9月, 2015 2 次提交
    • M
      mm: send one IPI per CPU to TLB flush all entries after unmapping pages · 72b252ae
      Mel Gorman 提交于
      An IPI is sent to flush remote TLBs when a page is unmapped that was
      potentially accesssed by other CPUs.  There are many circumstances where
      this happens but the obvious one is kswapd reclaiming pages belonging to a
      running process as kswapd and the task are likely running on separate
      CPUs.
      
      On small machines, this is not a significant problem but as machine gets
      larger with more cores and more memory, the cost of these IPIs can be
      high.  This patch uses a simple structure that tracks CPUs that
      potentially have TLB entries for pages being unmapped.  When the unmapping
      is complete, the full TLB is flushed on the assumption that a refill cost
      is lower than flushing individual entries.
      
      Architectures wishing to do this must give the following guarantee.
      
              If a clean page is unmapped and not immediately flushed, the
              architecture must guarantee that a write to that linear address
              from a CPU with a cached TLB entry will trap a page fault.
      
      This is essentially what the kernel already depends on but the window is
      much larger with this patch applied and is worth highlighting.  The
      architecture should consider whether the cost of the full TLB flush is
      higher than sending an IPI to flush each individual entry.  An additional
      architecture helper called flush_tlb_local is required.  It's a trivial
      wrapper with some accounting in the x86 case.
      
      The impact of this patch depends on the workload as measuring any benefit
      requires both mapped pages co-located on the LRU and memory pressure.  The
      case with the biggest impact is multiple processes reading mapped pages
      taken from the vm-scalability test suite.  The test case uses NR_CPU
      readers of mapped files that consume 10*RAM.
      
      Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed      159.62 (  0.00%)   120.68 ( 24.40%)
      Ops lru-file-mmap-read-time_range    30.59 (  0.00%)     2.80 ( 90.85%)
      Ops lru-file-mmap-read-time_stddv     6.70 (  0.00%)     0.64 ( 90.38%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User          581.00       611.43
      System       5804.93      4111.76
      Elapsed       161.03       122.12
      
      This is showing that the readers completed 24.40% faster with 29% less
      system CPU time.  From vmstats, it is known that the vanilla kernel was
      interrupted roughly 900K times per second during the steady phase of the
      test and the patched kernel was interrupts 180K times per second.
      
      The impact is lower on a single socket machine.
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed       25.33 (  0.00%)    20.38 ( 19.54%)
      Ops lru-file-mmap-read-time_range     0.91 (  0.00%)     1.44 (-58.24%)
      Ops lru-file-mmap-read-time_stddv     0.28 (  0.00%)     0.47 (-65.34%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User           58.09        57.64
      System        111.82        76.56
      Elapsed        27.29        22.55
      
      It's still a noticeable improvement with vmstat showing interrupts went
      from roughly 500K per second to 45K per second.
      
      The patch will have no impact on workloads with no memory pressure or have
      relatively few mapped pages.  It will have an unpredictable impact on the
      workload running on the CPU being flushed as it'll depend on how many TLB
      entries need to be refilled and how long that takes.  Worst case, the TLB
      will be completely cleared of active entries when the target PFNs were not
      resident at all.
      
      [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72b252ae
    • A
      userfaultfd: buildsystem activation · a14c151e
      Andrea Arcangeli 提交于
      This allows to select the userfaultfd during configuration to build it.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a14c151e
  2. 15 8月, 2015 1 次提交
  3. 14 8月, 2015 1 次提交
  4. 13 8月, 2015 1 次提交
  5. 07 8月, 2015 6 次提交
  6. 23 7月, 2015 1 次提交
  7. 15 7月, 2015 1 次提交
    • A
      cgroup: implement the PIDs subsystem · 49b786ea
      Aleksa Sarai 提交于
      Adds a new single-purpose PIDs subsystem to limit the number of
      tasks that can be forked inside a cgroup. Essentially this is an
      implementation of RLIMIT_NPROC that applies to a cgroup rather than a
      process tree.
      
      However, it should be noted that organisational operations (adding and
      removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
      the number of tasks in the hierarchy cannot exceed the limit through
      forking. This is due to the fact that, in the unified hierarchy, attach
      cannot fail (and it is not possible for a task to overcome its PIDs
      cgroup policy limit by attaching to a child cgroup -- even if migrating
      mid-fork it must be able to fork in the parent first).
      
      PIDs are fundamentally a global resource, and it is possible to reach
      PID exhaustion inside a cgroup without hitting any reasonable kmemcg
      policy. Once you've hit PID exhaustion, you're only in a marginally
      better state than OOM. This subsystem allows PID exhaustion inside a
      cgroup to be prevented.
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      49b786ea
  8. 07 7月, 2015 1 次提交
  9. 04 7月, 2015 1 次提交
  10. 01 7月, 2015 1 次提交
    • I
      printk: Increase maximum CONFIG_LOG_BUF_SHIFT from 21 to 25 · fb39f98d
      Ingo Molnar 提交于
      So I tried to some kernel debugging that produced a ton of kernel messages
      on a big box, and wanted to save them all: but CONFIG_LOG_BUF_SHIFT maxes
      out at 21 (2 MB).
      
      Increase it to 25 (32 MB).
      
      This does not affect any existing config or defaults.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      fb39f98d
  11. 26 6月, 2015 1 次提交
    • I
      fs, proc: introduce CONFIG_PROC_CHILDREN · 2e13ba54
      Iago López Galeiras 提交于
      Commit 81841161 ("fs, proc: introduce /proc/<pid>/task/<tid>/children
      entry") introduced the children entry for checkpoint restore and the
      file is only available on kernels configured with CONFIG_EXPERT and
      CONFIG_CHECKPOINT_RESTORE.
      
      This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS)
      because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE.
      But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE.
      
      However, the children proc file is useful outside of checkpoint restore.
      I would like to use it in rkt.  The rkt process exec() another program
      it does not control, and that other program will fork()+exec() a child
      process.  I would like to find the pid of the child process from an
      external tool without iterating in /proc over all processes to find
      which one has a parent pid equal to rkt.
      
      This commit introduces CONFIG_PROC_CHILDREN and makes
      CONFIG_CHECKPOINT_RESTORE select it.  This allows enabling
      /proc/<pid>/task/<tid>/children without needing to enable
      CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT.
      
      Alban tested that /proc/<pid>/task/<tid>/children is present when the
      kernel is configured with CONFIG_PROC_CHILDREN=y but without
      CONFIG_CHECKPOINT_RESTORE
      Signed-off-by: NIago López Galeiras <iago@endocode.com>
      Tested-by: NAlban Crequy <alban@endocode.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Djalal Harouni <djalal@endocode.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e13ba54
  12. 23 6月, 2015 1 次提交
  13. 02 6月, 2015 1 次提交
    • T
      writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK · 89e9b9e0
      Tejun Heo 提交于
      cgroup writeback requires support from both bdi and filesystem sides.
      Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
      support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
      default.  Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
      both MEMCG and BLK_CGROUP are enabled.
      
      inode_cgwb_enabled() which determines whether a given inode's both bdi
      and fs support cgroup writeback is added.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      89e9b9e0
  14. 28 5月, 2015 10 次提交
  15. 27 5月, 2015 1 次提交
    • T
      sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem · d59cfc09
      Tejun Heo 提交于
      The cgroup side of threadgroup locking uses signal_struct->group_rwsem
      to synchronize against threadgroup changes.  This per-process rwsem
      adds small overhead to thread creation, exit and exec paths, forces
      cgroup code paths to do lock-verify-unlock-retry dance in a couple
      places and makes it impossible to atomically perform operations across
      multiple processes.
      
      This patch replaces signal_struct->group_rwsem with a global
      percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
      side and contained in cgroups proper.  This patch converts one-to-one.
      
      This does make writer side heavier and lower the granularity; however,
      cgroup process migration is a fairly cold path, we do want to optimize
      thread operations over it and cgroup migration operations don't take
      enough time for the lower granularity to matter.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      d59cfc09
  16. 08 5月, 2015 1 次提交
  17. 16 4月, 2015 1 次提交
  18. 11 4月, 2015 1 次提交
  19. 02 4月, 2015 1 次提交
    • I
      bpf: Fix the build on BPF_SYSCALL=y && !CONFIG_TRACING kernels, make it more configurable · e1abf2cc
      Ingo Molnar 提交于
      So bpf_tracing.o depends on CONFIG_BPF_SYSCALL - but that's not its only
      dependency, it also depends on the tracing infrastructure and on kprobes,
      without which it will fail to build with:
      
        In file included from kernel/trace/bpf_trace.c:14:0:
        kernel/trace/trace.h: In function ‘trace_test_and_set_recursion’:
        kernel/trace/trace.h:491:28: error: ‘struct task_struct’ has no member named ‘trace_recursion’
          unsigned int val = current->trace_recursion;
        [...]
      
      It took quite some time to trigger this build failure, because right now
      BPF_SYSCALL is very obscure, depends on CONFIG_EXPERT. So also make BPF_SYSCALL
      more configurable, not just under CONFIG_EXPERT.
      
      If BPF_SYSCALL, tracing and kprobes are enabled then enable the bpf_tracing
      gateway as well.
      
      We might want to make this an interactive option later on, although
      I'd not complicate it unnecessarily: enabling BPF_SYSCALL is enough of
      an indicator that the user wants BPF support.
      
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      e1abf2cc
  20. 27 2月, 2015 1 次提交
    • P
      rcu: Add Kconfig option to expedite grace periods during boot · ee42571f
      Paul E. McKenney 提交于
      This commit adds a CONFIG_RCU_EXPEDITE_BOOT Kconfig parameter
      that emulates a very early boot rcu_expedite_gp().  A late-boot
      call to rcu_end_inkernel_boot() will provide the corresponding
      rcu_unexpedite_gp().  The late-boot call to rcu_end_inkernel_boot()
      should be made just before init is spawned.
      
      According to Arjan:
      
      > To show the boot time, I'm using the timestamp of the "Write protecting"
      > line, that's pretty much the last thing we print prior to ring 3 execution.
      >
      > A kernel with default RCU behavior (inside KVM, only virtual devices)
      > looks like this:
      >
      > [    0.038724] Write protecting the kernel read-only data: 10240k
      >
      > a kernel with expedited RCU (using the command line option, so that I
      > don't have to recompile between measurements and thus am completely
      > oranges-to-oranges)
      >
      > [    0.031768] Write protecting the kernel read-only data: 10240k
      >
      > which, in percentage, is an 18% improvement.
      Reported-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: NArjan van de Ven <arjan@linux.intel.com>
      ee42571f
  21. 14 2月, 2015 1 次提交
  22. 16 1月, 2015 1 次提交
    • P
      rcu: Optionally run grace-period kthreads at real-time priority · a94844b2
      Paul E. McKenney 提交于
      Recent testing has shown that under heavy load, running RCU's grace-period
      kthreads at real-time priority can improve performance (according to 0day
      test robot) and reduce the incidence of RCU CPU stall warnings.  However,
      most systems do just fine with the default non-realtime priorities for
      these kthreads, and it does not make sense to expose the entire user
      base to any risk stemming from this change, given that this change is
      of use only to a few users running extremely heavy workloads.
      
      Therefore, this commit allows users to specify realtime priorities
      for the grace-period kthreads, but leaves them running SCHED_OTHER
      by default.  The realtime priority may be specified at build time
      via the RCU_KTHREAD_PRIO Kconfig parameter, or at boot time via the
      rcutree.kthread_prio parameter.  Either way, 0 says to continue the
      default SCHED_OTHER behavior and values from 1-99 specify that priority
      of SCHED_FIFO behavior.  Note that a value of 0 is not permitted when
      the RCU_BOOST Kconfig parameter is specified.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      a94844b2
  23. 08 1月, 2015 1 次提交
  24. 07 1月, 2015 2 次提交