1. 03 11月, 2015 1 次提交
  2. 26 10月, 2015 2 次提交
    • S
      tracing: Implement event pid filtering · 3fdaf80f
      Steven Rostedt (Red Hat) 提交于
      Add the necessary hooks to use the pids loaded in set_event_pid to filter
      all the events enabled in the tracing instance that match the pids listed.
      
      Two probes are added to both sched_switch and sched_wakeup tracepoints to be
      called before other probes are called and after the other probes are called.
      The first is used to set the necessary flags to let the probes know to test
      if they should be traced or not.
      
      The sched_switch pre probe will set the "ignore_pid" flag if neither the
      previous or next task has a matching pid.
      
      The sched_switch probe will set the "ignore_pid" flag if the next task
      does not match the matching pid.
      
      The pre probe allows for probes tracing sched_switch to be traced if
      necessary.
      
      The sched_wakeup pre probe will set the "ignore_pid" flag if neither the
      current task nor the wakee task has a matching pid.
      
      The sched_wakeup post probe will set the "ignore_pid" flag if the current
      task does not have a matching pid.
      
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      3fdaf80f
    • S
      tracepoint: Give priority to probes of tracepoints · 7904b5c4
      Steven Rostedt (Red Hat) 提交于
      In order to guarantee that a probe will be called before other probes that
      are attached to a tracepoint, there needs to be a mechanism to provide
      priority of one probe over the others.
      
      Adding a prio field to the struct tracepoint_func, which lets the probes be
      sorted by the priority set in the structure. If no priority is specified,
      then a priority of 10 is given (this is a macro, and perhaps may be changed
      in the future).
      
      Now probes may be added to affect other probes that are attached to a
      tracepoint with a guaranteed order.
      
      One use case would be to allow tracing of tracepoints be able to filter by
      pid. A special (higher priority probe) may be added to the sched_switch
      tracepoint and set the necessary flags of the other tracepoints to notify
      them if they should be traced or not. In case a tracepoint is enabled at the
      sched_switch tracepoint too, the order of the two are not random.
      
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      7904b5c4
  3. 21 10月, 2015 1 次提交
  4. 26 9月, 2015 2 次提交
  5. 18 9月, 2015 2 次提交
  6. 16 9月, 2015 10 次提交
  7. 15 9月, 2015 3 次提交
  8. 14 9月, 2015 1 次提交
  9. 13 9月, 2015 1 次提交
    • L
      blk: rq_data_dir() should not return a boolean · 10fbd36e
      Linus Torvalds 提交于
      rq_data_dir() returns either READ or WRITE (0 == READ, 1 == WRITE), not
      a boolean value.
      
      Now, admittedly the "!= 0" doesn't really change the value (0 stays as
      zero, 1 stays as one), but it's not only redundant, it confuses gcc, and
      causes gcc to warn about the construct
      
          switch (rq_data_dir(req)) {
              case READ:
                  ...
              case WRITE:
                  ...
      
      that we have in a few drivers.
      
      Now, the gcc warning is silly and stupid (it seems to warn not about the
      switch value having a different type from the case statements, but about
      _any_ boolean switch value), but in this case the code itself is silly
      and stupid too, so let's just change it, and get rid of warnings like
      this:
      
        drivers/block/hd.c: In function ‘hd_request’:
        drivers/block/hd.c:630:11: warning: switch condition has boolean value [-Wswitch-bool]
           switch (rq_data_dir(req)) {
      
      The odd '!= 0' came in when "cmd_flags" got turned into a "u64" in
      commit 5953316d ("block: make rq->cmd_flags be 64-bit") and is
      presumably because the old code (that just did a logical 'and' with 1)
      would then end up making the type of rq_data_dir() be u64 too.
      
      But if we want to retain the old regular integer type, let's just cast
      the result to 'int' rather than use that rather odd '!= 0'.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10fbd36e
  10. 12 9月, 2015 2 次提交
    • J
      fs/seq_file: convert int seq_vprint/seq_printf/etc... returns to void · 6798a8ca
      Joe Perches 提交于
      The seq_<foo> function return values were frequently misused.
      
      See: commit 1f33c41c ("seq_file: Rename seq_overflow() to
           seq_has_overflowed() and make public")
      
      All uses of these return values have been removed, so convert the
      return types to void.
      
      Miscellanea:
      
      o Move seq_put_decimal_<type> and seq_escape prototypes closer the
        other seq_vprintf prototypes
      o Reorder seq_putc and seq_puts to return early on overflow
      o Add argument names to seq_vprintf and seq_printf
      o Update the seq_escape kernel-doc
      o Convert a couple of leading spaces to tabs in seq_escape
      Signed-off-by: NJoe Perches <joe@perches.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6798a8ca
    • M
      sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
      Mathieu Desnoyers 提交于
      Here is an implementation of a new system call, sys_membarrier(), which
      executes a memory barrier on all threads running on the system.  It is
      implemented by calling synchronize_sched().  It can be used to
      distribute the cost of user-space memory barriers asymmetrically by
      transforming pairs of memory barriers into pairs consisting of
      sys_membarrier() and a compiler barrier.  For synchronization primitives
      that distinguish between read-side and write-side (e.g.  userspace RCU
      [1], rwlocks), the read-side can be accelerated significantly by moving
      the bulk of the memory barrier overhead to the write-side.
      
      The existing applications of which I am aware that would be improved by
      this system call are as follows:
      
      * Through Userspace RCU library (http://urcu.so)
        - DNS server (Knot DNS) https://www.knot-dns.cz/
        - Network sniffer (http://netsniff-ng.org/)
        - Distributed object storage (https://sheepdog.github.io/sheepdog/)
        - User-space tracing (http://lttng.org)
        - Network storage system (https://www.gluster.org/)
        - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
        - Financial software (https://lkml.org/lkml/2015/3/23/189)
      
      Those projects use RCU in userspace to increase read-side speed and
      scalability compared to locking.  Especially in the case of RCU used by
      libraries, sys_membarrier can speed up the read-side by moving the bulk of
      the memory barrier cost to synchronize_rcu().
      
      * Direct users of sys_membarrier
        - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
      
      Microsoft core dotnet GC developers are planning to use the mprotect()
      side-effect of issuing memory barriers through IPIs as a way to implement
      Windows FlushProcessWriteBuffers() on Linux.  They are referring to
      sys_membarrier in their github thread, specifically stating that
      sys_membarrier() is what they are looking for.
      
      To explain the benefit of this scheme, let's introduce two example threads:
      
      Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
      Thread B (frequent, e.g. executing liburcu
      rcu_read_lock()/rcu_read_unlock())
      
      In a scheme where all smp_mb() in thread A are ordering memory accesses
      with respect to smp_mb() present in Thread B, we can change each
      smp_mb() within Thread A into calls to sys_membarrier() and each
      smp_mb() within Thread B into compiler barriers "barrier()".
      
      Before the change, we had, for each smp_mb() pairs:
      
      Thread A                    Thread B
      previous mem accesses       previous mem accesses
      smp_mb()                    smp_mb()
      following mem accesses      following mem accesses
      
      After the change, these pairs become:
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      As we can see, there are two possible scenarios: either Thread B memory
      accesses do not happen concurrently with Thread A accesses (1), or they
      do (2).
      
      1) Non-concurrent Thread A vs Thread B accesses:
      
      Thread A                    Thread B
      prev mem accesses
      sys_membarrier()
      follow mem accesses
                                  prev mem accesses
                                  barrier()
                                  follow mem accesses
      
      In this case, thread B accesses will be weakly ordered. This is OK,
      because at that point, thread A is not particularly interested in
      ordering them with respect to its own accesses.
      
      2) Concurrent Thread A vs Thread B accesses
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      In this case, thread B accesses, which are ensured to be in program
      order thanks to the compiler barrier, will be "upgraded" to full
      smp_mb() by synchronize_sched().
      
      * Benchmarks
      
      On Intel Xeon E5405 (8 cores)
      (one thread is calling sys_membarrier, the other 7 threads are busy
      looping)
      
      1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.
      
      * User-space user of this system call: Userspace RCU library
      
      Both the signal-based and the sys_membarrier userspace RCU schemes
      permit us to remove the memory barrier from the userspace RCU
      rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
      accelerating them. These memory barriers are replaced by compiler
      barriers on the read-side, and all matching memory barriers on the
      write-side are turned into an invocation of a memory barrier on all
      active threads in the process. By letting the kernel perform this
      synchronization rather than dumbly sending a signal to every process
      threads (as we currently do), we diminish the number of unnecessary wake
      ups and only issue the memory barriers on active threads. Non-running
      threads do not need to execute such barrier anyway, because these are
      implied by the scheduler context switches.
      
      Results in liburcu:
      
      Operations in 10s, 6 readers, 2 writers:
      
      memory barriers in reader:    1701557485 reads, 2202847 writes
      signal-based scheme:          9830061167 reads,    6700 writes
      sys_membarrier:               9952759104 reads,     425 writes
      sys_membarrier (dyn. check):  7970328887 reads,     425 writes
      
      The dynamic sys_membarrier availability check adds some overhead to
      the read-side compared to the signal-based scheme, but besides that,
      sys_membarrier slightly outperforms the signal-based scheme. However,
      this non-expedited sys_membarrier implementation has a much slower grace
      period than signal and memory barrier schemes.
      
      Besides diminishing the number of wake-ups, one major advantage of the
      membarrier system call over the signal-based scheme is that it does not
      need to reserve a signal. This plays much more nicely with libraries,
      and with processes injected into for tracing purposes, for which we
      cannot expect that signals will be unused by the application.
      
      An expedited version of this system call can be added later on to speed
      up the grace period. Its implementation will likely depend on reading
      the cpu_curr()->mm without holding each CPU's rq lock.
      
      This patch adds the system call to x86 and to asm-generic.
      
      [1] http://urcu.so
      
      membarrier(2) man page:
      
      MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
      
      NAME
             membarrier - issue memory barriers on a set of threads
      
      SYNOPSIS
             #include <linux/membarrier.h>
      
             int membarrier(int cmd, int flags);
      
      DESCRIPTION
             The cmd argument is one of the following:
      
             MEMBARRIER_CMD_QUERY
                    Query  the  set  of  supported commands. It returns a bitmask of
                    supported commands.
      
             MEMBARRIER_CMD_SHARED
                    Execute a memory barrier on all threads running on  the  system.
                    Upon  return from system call, the caller thread is ensured that
                    all running threads have passed through a state where all memory
                    accesses  to  user-space  addresses  match program order between
                    entry to and return from the system  call  (non-running  threads
                    are de facto in such a state). This covers threads from all pro=E2=80=90
                    cesses running on the system.  This command returns 0.
      
             The flags argument needs to be 0. For future extensions.
      
             All memory accesses performed  in  program  order  from  each  targeted
             thread is guaranteed to be ordered with respect to sys_membarrier(). If
             we use the semantic "barrier()" to represent a compiler barrier forcing
             memory  accesses  to  be performed in program order across the barrier,
             and smp_mb() to represent explicit memory barriers forcing full  memory
             ordering  across  the barrier, we have the following ordering table for
             each pair of barrier(), sys_membarrier() and smp_mb():
      
             The pair ordering is detailed as (O: ordered, X: not ordered):
      
                                    barrier()   smp_mb() sys_membarrier()
                    barrier()          X           X            O
                    smp_mb()           X           O            O
                    sys_membarrier()   O           O            O
      
      RETURN VALUE
             On success, these system calls return zero.  On error, -1 is  returned,
             and errno is set appropriately. For a given command, with flags
             argument set to 0, this system call is guaranteed to always return the
             same value until reboot.
      
      ERRORS
             ENOSYS System call is not implemented.
      
             EINVAL Invalid arguments.
      
      Linux                             2015-04-15                     MEMBARRIER(2)
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Nicholas Miell <nmiell@comcast.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b25b13a
  11. 11 9月, 2015 15 次提交
    • S
      block: Refuse request/bio merges with gaps in the integrity payload · 7f39add3
      Sagi Grimberg 提交于
      If a driver sets the block queue virtual boundary mask, it means that
      it cannot handle gaps so we must not allow those in the integrity
      payload as well.
      Signed-off-by: NSagi Grimberg <sagig@mellanox.com>
      
      Fixed up by me to have duplicate integrity merge functions, depending
      on whether block integrity is enabled or not. Fixes a compilations
      issue with CONFIG_BLK_DEV_INTEGRITY unset.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7f39add3
    • M
      PM / devfreq: comments for get_dev_status usage updated · d54cdf3f
      MyungJoo Ham 提交于
      With the introduction of devfreq_update_stats(), governors
      are not recommended to use get_dev_status() directly.
      Signed-off-by: NMyungJoo Ham <myungjoo.ham@samsung.com>
      d54cdf3f
    • J
      PM / devfreq: cache the last call to get_dev_status() · 08e75e75
      Javi Merino 提交于
      The return value of get_dev_status() can be reused.  Cache it so that
      other parts of the kernel can reuse it instead of having to call the
      same function again.
      
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: NJavi Merino <javi.merino@arm.com>
      Signed-off-by: NMyungJoo Ham <myungjoo.ham@samsung.com>
      08e75e75
    • O
      mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff() · 1fcfd8db
      Oleg Nesterov 提交于
      Add the additional "vm_flags_t vm_flags" argument to do_mmap_pgoff(),
      rename it to do_mmap(), and re-introduce do_mmap_pgoff() as a simple
      wrapper on top of do_mmap().  Perhaps we should update the callers of
      do_mmap_pgoff() and kill it later.
      
      This way mpx_mmap() can simply call do_mmap(vm_flags => VM_MPX) and do not
      play with vm internals.
      
      After this change mmap_region() has a single user outside of mmap.c,
      arch/tile/mm/elf.c:arch_setup_additional_pages().  It would be nice to
      change arch/tile/ and unexport mmap_region().
      
      [kirill@shutemov.name: fix build]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NDave Hansen <dave.hansen@linux.intel.com>
      Tested-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1fcfd8db
    • D
      kexec: split kexec_load syscall from kexec core code · 2965faa5
      Dave Young 提交于
      There are two kexec load syscalls, kexec_load another and kexec_file_load.
       kexec_file_load has been splited as kernel/kexec_file.c.  In this patch I
      split kexec_load syscall code to kernel/kexec.c.
      
      And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
      use kexec_file_load only, or vice verse.
      
      The original requirement is from Ted Ts'o, he want kexec kernel signature
      being checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But kexec-tools use
      kexec_load syscall can bypass the checking.
      
      Vivek Goyal proposed to create a common kconfig option so user can compile
      in only one syscall for loading kexec kernel.  KEXEC/KEXEC_FILE selects
      KEXEC_CORE so that old config files still work.
      
      Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
      architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
      KEXEC_CORE in arch Kconfig.  Also updated general kernel code with to
      kexec_load syscall.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2965faa5
    • D
      kexec: split kexec_file syscall code to kexec_file.c · a43cac0d
      Dave Young 提交于
      Split kexec_file syscall related code to another file kernel/kexec_file.c
      so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped.
      
      Sharing variables and functions are moved to kernel/kexec_internal.h per
      suggestion from Vivek and Petr.
      
      [akpm@linux-foundation.org: fix bisectability]
      [akpm@linux-foundation.org: declare the various arch_kexec functions]
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a43cac0d
    • A
      seq_file: provide an analogue of print_hex_dump() · 37607102
      Andy Shevchenko 提交于
      This introduces a new helper and switches current users to use it.  All
      patches are compiled tested. kmemleak is tested via its own test suite.
      
      This patch (of 6):
      
      The new seq_hex_dump() is a complete analogue of print_hex_dump().
      
      We have few users of this functionality already. It allows to reduce their
      codebase.
      Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Joe Perches <joe@perches.com>
      Cc: Tadeusz Struk <tadeusz.struk@intel.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Tuchscherer <ingo.tuchscherer@de.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Vladimir Kondratiev <qca_vkondrat@qca.qualcomm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      37607102
    • F
      kmod: use system_unbound_wq instead of khelper · 90f02303
      Frederic Weisbecker 提交于
      We need to launch the usermodehelper kernel threads with the widest
      affinity and this is partly why we use khelper.  This workqueue has
      unbound properties and thus a wide affinity inherited by all its children.
      
      Now khelper also has special properties that we aren't much interested in:
      ordered and singlethread.  There is really no need about ordering as all
      we do is creating kernel threads.  This can be done concurrently.  And
      singlethread is a useless limitation as well.
      
      The workqueue engine already proposes generic unbound workqueues that
      don't share these useless properties and handle well parallel jobs.
      
      The only worrysome specific is their affinity to the node of the current
      CPU.  It's fine for creating the usermodehelper kernel threads but those
      inherit this affinity for longer jobs such as requesting modules.
      
      This patch proposes to use these node affine unbound workqueues assuming
      that a node is sufficient to handle several parallel usermodehelper
      requests.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90f02303
    • K
      lib/string_helpers: rename "esc" arg to "only" · b40bdb7f
      Kees Cook 提交于
      To further clarify the purpose of the "esc" argument, rename it to "only"
      to reflect that it is a limit, not a list of additional characters to
      escape.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Suggested-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b40bdb7f
    • L
      hexdump: do not print debug dumps for !CONFIG_DEBUG · cdf17449
      Linus Walleij 提交于
      print_hex_dump_debug() is likely supposed to be analogous to pr_debug() or
      dev_dbg() & friends.  Currently it will adhere to dynamic debug, but will
      not stub out prints if CONFIG_DEBUG is not set.  Let's make it do the
      right thing, because I am tired of having my dmesg buffer full of hex
      dumps on production systems.
      Signed-off-by: NLinus Walleij <linus.walleij@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cdf17449
    • J
      include/linux/printk.h: include pr_fmt in pr_debug_ratelimited · 515a9adc
      Jason A. Donenfeld 提交于
      The other two implementations of pr_debug_ratelimited include pr_fmt,
      along with every other pr_* function.  But pr_debug_ratelimited forgot to
      add it with the CONFIG_DYNAMIC_DEBUG implementation.
      
      This patch unifies the behavior.
      Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      515a9adc
    • V
      include/linux/poison.h: remove not-used poison pointer macros · 8b839635
      Vasily Kulikov 提交于
      Signed-off-by: NVasily Kulikov <segoon@openwall.com>
      Cc: Solar Designer <solar@openwall.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8b839635
    • V
      include/linux/poison.h: fix LIST_POISON{1,2} offset · 8a5e5e02
      Vasily Kulikov 提交于
      Poison pointer values should be small enough to find a room in
      non-mmap'able/hardly-mmap'able space.  E.g.  on x86 "poison pointer space"
      is located starting from 0x0.  Given unprivileged users cannot mmap
      anything below mmap_min_addr, it should be safe to use poison pointers
      lower than mmap_min_addr.
      
      The current poison pointer values of LIST_POISON{1,2} might be too big for
      mmap_min_addr values equal or less than 1 MB (common case, e.g.  Ubuntu
      uses only 0x10000).  There is little point to use such a big value given
      the "poison pointer space" below 1 MB is not yet exhausted.  Changing it
      to a smaller value solves the problem for small mmap_min_addr setups.
      
      The values are suggested by Solar Designer:
      http://www.openwall.com/lists/oss-security/2015/05/02/6Signed-off-by: NVasily Kulikov <segoon@openwall.com>
      Cc: Solar Designer <solar@openwall.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a5e5e02
    • V
      mm: introduce idle page tracking · 33c3fc71
      Vladimir Davydov 提交于
      Knowing the portion of memory that is not used by a certain application or
      memory cgroup (idle memory) can be useful for partitioning the system
      efficiently, e.g.  by setting memory cgroup limits appropriately.
      Currently, the only means to estimate the amount of idle memory provided
      by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
      access bit for all pages mapped to a particular process by writing 1 to
      clear_refs, wait for some time, and then count smaps:Referenced.  However,
      this method has two serious shortcomings:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      To overcome these drawbacks, this patch introduces two new page flags,
      Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
      A page's Idle flag can only be set from userspace by setting bit in
      /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
      and it is cleared whenever the page is accessed either through page tables
      (it is cleared in page_referenced() in this case) or using the read(2)
      system call (mark_page_accessed()). Thus by setting the Idle flag for
      pages of a particular workload, which can be found e.g.  by reading
      /proc/PID/pagemap, waiting for some time to let the workload access its
      working set, and then reading the bitmap file, one can estimate the amount
      of pages that are not used by the workload.
      
      The Young page flag is used to avoid interference with the memory
      reclaimer.  A page's Young flag is set whenever the Access bit of a page
      table entry pointing to the page is cleared by writing to the bitmap file.
      If page_referenced() is called on a Young page, it will add 1 to its
      return value, therefore concealing the fact that the Access bit was
      cleared.
      
      Note, since there is no room for extra page flags on 32 bit, this feature
      uses extended page flags when compiled on 32 bit.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: kpageidle requires an MMU]
      [akpm@linux-foundation.org: decouple from page-flags rework]
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33c3fc71
    • V
      mmu-notifier: add clear_young callback · 1d7715c6
      Vladimir Davydov 提交于
      In the scope of the idle memory tracking feature, which is introduced by
      the following patch, we need to clear the referenced/accessed bit not only
      in primary, but also in secondary ptes.  The latter is required in order
      to estimate wss of KVM VMs.  At the same time we want to avoid flushing
      tlb, because it is quite expensive and it won't really affect the final
      result.
      
      Currently, there is no function for clearing pte young bit that would meet
      our requirements, so this patch introduces one.  To achieve that we have
      to add a new mmu-notifier callback, clear_young, since there is no method
      for testing-and-clearing a secondary pte w/o flushing tlb.  The new method
      is not mandatory and currently only implemented by KVM.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d7715c6