1. 21 May 2016, 40 commits
    • printk/nmi: warn when some message has been lost in NMI context · b522deab
      Petr Mladek authored
      We cannot resize the temporary buffer in NMI context.  Let's warn if
      a message is lost.
      
      This is rather theoretical.  printk() should not be used in NMI.  The
      only sensible use is when we want to print backtrace from all CPUs.  The
      current buffer should be enough for this purpose.
      
      [akpm@linux-foundation.org: whitespace fixlet]
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Jiri Kosina <jkosina@suse.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b522deab
    • printk/nmi: generic solution for safe printk in NMI · 42a0bb3f
      Petr Mladek authored
      printk() takes some locks and cannot be used safely in NMI
      context.
      
      The chance of a deadlock is real especially when printing stacks from
      all CPUs.  This particular problem has been addressed on x86 by the
      commit a9edc880 ("x86/nmi: Perform a safe NMI stack trace on all
      CPUs").
      
      The patchset brings two big advantages.  First, it makes the NMI
      backtraces safe on all architectures for free.  Second, it makes all
      NMI messages almost safe on all architectures (the temporary buffer is
      limited, so we should still keep the number of messages in NMI context
      to a minimum).
      
      Note that there already are several messages printed in NMI context:
      WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
      handlers.  These are not easy to avoid.
      
      This patch reuses most of the code and makes it generic.  It is useful
      for all messages and architectures that support NMI.
      
      The alternative printk_func is set when entering NMI context and is
      reset when leaving it.  It queues IRQ work to copy the messages into
      the main ring buffer in a safe context.
      
      __printk_nmi_flush() copies all available messages and resets the
      buffer.  We can then use simple cmpxchg operations to synchronize with
      writers.  A spinlock is also used to synchronize with other flushers.
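
      A minimal sketch of the scheme (simplified; not the exact kernel code):

      	struct nmi_seq_buf {
      		atomic_t	len;		/* bytes used so far */
      		struct irq_work	work;		/* flusher; runs later in a
      						 * safe (IRQ) context */
      		unsigned char	buffer[8192];
      	};

      	static int vprintk_nmi(struct nmi_seq_buf *s, const char *fmt,
      			       va_list args)
      	{
      		int add, len;

      		do {	/* lockless reservation; retry on a racing writer */
      			len = atomic_read(&s->len);
      			if (len >= sizeof(s->buffer))
      				return 0;	/* buffer full: message lost */
      			add = vsnprintf(s->buffer + len,
      					sizeof(s->buffer) - len, fmt, args);
      		} while (atomic_cmpxchg(&s->len, len, len + add) != len);

      		irq_work_queue(&s->work);	/* flush from a safe context */
      		return add;
      	}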
      
      We no longer use seq_buf because it depends on an external lock.  It
      would be hard to make all of its supported operations safe for lockless
      use, and it would be confusing and error-prone to make only some
      operations safe.
      
      The code is put into separate printk/nmi.c as suggested by Steven
      Rostedt.  It needs a per-CPU buffer and is compiled only on
      architectures that call nmi_enter().  This is achieved by the new
      HAVE_NMI Kconfig flag.
      
      The exceptions are the MN10300 and Xtensa architectures.  We need to
      clean up NMI handling there first; let's do it separately.
      
      The patch is heavily based on the draft from Peter Zijlstra, see
      
        https://lkml.org/lkml/2015/6/10/327
      
      [arnd@arndb.de: printk-nmi: use %zu format string for size_t]
      [akpm@linux-foundation.org: min_t->min - all types are size_t here]
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>	[arm part]
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Jiri Kosina <jkosina@suse.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      42a0bb3f
    • include/linux/syscalls.h: use pid_t instead of int · 2eeed7e9
      René Nyffenegger authored
      In include/linux/syscalls.h, the four functions sys_kill, sys_tgkill,
      sys_tkill and sys_rt_sigqueueinfo are declared with "int pid" and "int
      tgid".
      
      However, in kernel/signal.c, the corresponding definitions use the more
      appropriate "pid_t" (which is a typedef'd int).
      
      This patch changes "int" to "pid_t" in the declarations of sys_kill,
      sys_tgkill, sys_tkill and sys_rt_sigqueueinfo in <linux/syscalls.h> in
      order to harmonize the function declarations with their respective
      definitions.
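
      After this change the declarations read as follows (the definitions in
      kernel/signal.c already used pid_t):

      	asmlinkage long sys_kill(pid_t pid, int sig);
      	asmlinkage long sys_tgkill(pid_t tgid, pid_t pid, int sig);
      	asmlinkage long sys_tkill(pid_t pid, int sig);
      	asmlinkage long sys_rt_sigqueueinfo(pid_t pid, int sig,
      					    siginfo_t __user *uinfo);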
      
      Link: http://lkml.kernel.org/r/57302FDA.7020205@renenyffenegger.ch
      Signed-off-by: René Nyffenegger <mail@renenyffenegger.ch>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Milosz Tanski <milosz@adfin.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2eeed7e9
    • fork: free thread in copy_process on failure · 0740aa5f
      Jiri Slaby authored
      When using this program (as root):
      
      	#include <err.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      
      	#include <sys/io.h>
      	#include <sys/types.h>
      	#include <sys/wait.h>
      
      	#define ITER 1000
      	#define FORKERS 15
      	#define THREADS (6000/FORKERS) // 1850 is proc max
      
      	static void fork_100_wait()
      	{
      		unsigned a, to_wait = 0;
      
      		printf("\t%d forking %d\n", THREADS, getpid());
      
      		for (a = 0; a < THREADS; a++) {
      			switch (fork()) {
      			case 0:
      				usleep(1000);
      				exit(0);
      				break;
      			case -1:
      				break;
      			default:
      				to_wait++;
      				break;
      			}
      		}
      
      		printf("\t%d forked from %d, waiting for %d\n", THREADS, getpid(),
      				to_wait);
      
      		for (a = 0; a < to_wait; a++)
      			wait(NULL);
      
      		printf("\t%d waited from %d\n", THREADS, getpid());
      	}
      
      	static void run_forkers()
      	{
      		pid_t forkers[FORKERS];
      		unsigned a;
      
      		for (a = 0; a < FORKERS; a++) {
      			switch ((forkers[a] = fork())) {
      			case 0:
      				fork_100_wait();
      				exit(0);
      				break;
      			case -1:
      				err(1, "DIE fork of %d'th forker", a);
      				break;
      			default:
      				break;
      			}
      		}
      
      		for (a = 0; a < FORKERS; a++)
      			waitpid(forkers[a], NULL, 0);
      	}
      
      	int main()
      	{
      		unsigned a;
      		int ret;
      
      		ret = ioperm(10, 20, 0);
      		if (ret < 0)
      			err(1, "ioperm");
      
      		for (a = 0; a < ITER; a++)
      			run_forkers();
      
      		return 0;
      	}
      
      kmemleak reports many occurrences of this leak:
      unreferenced object 0xffff8805917c8000 (size 8192):
        comm "fork-leak", pid 2932, jiffies 4295354292 (age 1871.028s)
        hex dump (first 32 bytes):
          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
        backtrace:
          [<ffffffff814cfbf5>] kmemdup+0x25/0x50
          [<ffffffff8103ab43>] copy_thread_tls+0x6c3/0x9a0
          [<ffffffff81150174>] copy_process+0x1a84/0x5790
          [<ffffffff811dc375>] wake_up_new_task+0x2d5/0x6f0
          [<ffffffff8115411d>] _do_fork+0x12d/0x820
      ...
      
      This is due to the leakage of memory items which should have been
      freed in arch/x86/kernel/process.c:exit_thread().
      
      Make sure the memory is freed when fork fails later in copy_process.
      This is done by calling exit_thread with the thread to kill.
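
      A minimal sketch of where the fix lands (the exact label and ordering
      live in kernel/fork.c):

      	/* a late failure in copy_process() unwinds through here */
      	bad_fork_cleanup_thread:
      		exit_thread(p);	/* frees arch-specific state, e.g. the x86
      				 * I/O bitmap duplicated by copy_thread_tls() */
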
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0740aa5f
    • exit_thread: accept a task parameter to be exited · e6464694
      Jiri Slaby authored
      We need to call exit_thread from copy_process in a fail path.  So make it
      accept task_struct as a parameter.
      
      [v2]
      * s390: it doesn't make sense to call exit_thread_runtime_instr for
        non-current tasks.
      * arm: fix the comment in vfp_thread_copy
      * change 'me' to 'tsk' for task_struct
      * now we can change only archs that actually have exit_thread
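
      The resulting prototype change, roughly (a sketch based on the
      description above):

      	/* before */
      	void exit_thread(void);

      	/* after: the task being cleaned up is passed explicitly */
      	void exit_thread(struct task_struct *tsk);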
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6464694
    • exit_thread: remove empty bodies · 5f56a5df
      Jiri Slaby authored
      Define HAVE_EXIT_THREAD for archs which want to do something in
      exit_thread. For others, let's define exit_thread as an empty inline.
      
      This is a cleanup before we change the prototype of exit_thread to
      accept a task parameter.
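
      Roughly, the resulting arrangement (a sketch; assuming the stub lives
      in a common header):

      	#ifdef CONFIG_HAVE_EXIT_THREAD
      	extern void exit_thread(void);
      	#else
      	static inline void exit_thread(void)
      	{
      	}
      	#endif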
      
      [akpm@linux-foundation.org: fix mips]
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f56a5df
    • mn10300: let exit_fpu accept a task · 2ec656eb
      Jiri Slaby authored
      We need to call exit_thread from copy_process in a fail path.  Since
      exit_thread on mn10300 calls exit_fpu, make exit_fpu accept task_struct
      as a parameter now.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ec656eb
    • procfs: fix pthread cross-thread naming if !PR_DUMPABLE · 1b3044e3
      Janis Danisevskis authored
      The PR_DUMPABLE flag causes the pid-related paths of the proc file
      system to be owned by root.
      
      The implementation of pthread_set/getname_np, however, needs access to
      /proc/<pid>/task/<tid>/comm.  If PR_DUMPABLE is false, this
      implementation is locked out.
      
      This patch installs a special permission function for the file "comm"
      that grants read and write access to all threads of the same group
      regardless of the ownership of the inode.  For all other threads the
      function falls back to the generic inode permission check.
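
      A sketch of the idea (simplified; helper names as in fs/proc/base.c):

      	static int proc_tid_comm_permission(struct inode *inode, int mask)
      	{
      		bool is_same_tgroup;
      		struct task_struct *task = get_proc_task(inode);

      		if (!task)
      			return -ESRCH;
      		is_same_tgroup = same_thread_group(current, task);
      		put_task_struct(task);

      		if (is_same_tgroup && !(mask & MAY_EXEC)) {
      			/* "comm" can always be read from or written to by
      			 * members of the same thread group, regardless of
      			 * the inode's owner */
      			return 0;
      		}

      		return generic_permission(inode, mask);
      	}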
      
      [akpm@linux-foundation.org: fix spello in comment]
      Signed-off-by: Janis Danisevskis <jdanis@google.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minfei Huang <mnfhuang@gmail.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Calvin Owens <calvinowens@fb.com>
      Cc: Jann Horn <jann@thejh.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b3044e3
    • procfs: expose umask in /proc/<PID>/status · 3e42979e
      Richard W.M. Jones authored
      It's not possible to read the process umask without also modifying it,
      which is what umask(2) does.  A library cannot read umask safely,
      especially if the main program might be multithreaded.
      
      Add a new status line ("Umask") in /proc/<PID>/status.  It contains the
      file mode creation mask (umask) in octal.  It is only shown for tasks
      which have task->fs.
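
      For example, a userspace reader of the new field might look like this
      (hypothetical; requires a kernel with this patch):

      	#include <stdio.h>

      	int main(void)
      	{
      		char line[256];
      		unsigned int mask;
      		FILE *f = fopen("/proc/self/status", "r");

      		if (!f)
      			return 1;
      		while (fgets(line, sizeof(line), f)) {
      			/* the new line looks like "Umask:\t0022" */
      			if (sscanf(line, "Umask: %o", &mask) == 1)
      				printf("umask = %04o\n", mask);
      		}
      		fclose(f);
      		return 0;
      	}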
      
      This patch is adapted from one originally written by Pierre Carrier.
      
      The use case is that we have endless trouble with people setting weird
      umask() values (usually on the grounds of "security"), and then
      everything breaking.  I'm on the hook to fix these.  We'd like to add
      debugging to our program so we can dump out the umask in debug reports.
      
      Previous versions of the patch used a syscall so you could only read
      your own umask.  That's all I need.  However there was quite a lot of
      push-back from those, so this new version exports it in /proc.
      
      See:
        https://lkml.org/lkml/2016/4/13/704 [umask2]
        https://lkml.org/lkml/2016/4/13/487 [getumask]
      Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
      Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pierre Carrier <pierre@spotify.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e42979e
    • zram: introduce per-device debug_stat sysfs node · 623e47fc
      Sergey Senozhatsky authored
      debug_stat sysfs is read-only and represents various debugging data that
      zram developers may need.  This file is not meant to be used by anyone
      else: its content is not documented and will change any time w/o any
      notice.  Therefore, the output of debug_stat file contains a version
      string.  To avoid any confusion, we will increase the version number
      every time we modify the output.
      
      At the moment this file exports only one value -- the number of
      re-compressions, IOW, the number of times the compression fast path has
      failed.  This stat is temporary and will be useful if any per-cpu
      compression stream regressions are reported.
      
      Link: http://lkml.kernel.org/r/20160513230834.GB26763@bbox
      Link: http://lkml.kernel.org/r/20160511134553.12655-1-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      623e47fc
    • zram: remove max_comp_streams internals · 43209ea2
      Sergey Senozhatsky authored
      Remove the internal part of max_comp_streams interface, since we
      switched to per-cpu streams.  We will keep RW max_comp_streams attr
      around, because:
      
      a) we may (silently) switch back to idle compression streams list and
         don't want to disturb user space
      
      b) max_comp_streams attr must wait for the next 'lay off cycle'; we
         give user space 2 years to adjust before we remove/downgrade the attr,
         and there are already several attrs scheduled for removal in 4.11, so
         it's too late for max_comp_streams.
      
      This slightly changes user-visible behaviour:
      
      - First, reading from the max_comp_streams file now always returns
        the number of online CPUs.

      - Second, writing to max_comp_streams no longer has any effect.
      
      Link: http://lkml.kernel.org/r/20160503165546.25201-1-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43209ea2
    • mm/zsmalloc: don't fail if can't create debugfs info · d34f6157
      Dan Streetman authored
      Change the return type of zs_pool_stat_create() to void, and remove the
      logic to abort pool creation if the stat debugfs dir/file could not be
      created.
      
      The debugfs stat file is for debugging/information only, and doesn't
      affect operation of zsmalloc; there is no reason to abort creating the
      pool if the stat file can't be created.  This was seen with zswap, which
      used the same name for all pool creations, which caused zsmalloc to fail
      to create a second pool for zswap if CONFIG_ZSMALLOC_STAT was enabled.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Dan Streetman <dan.streetman@canonical.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d34f6157
    • mm/zswap: use workqueue to destroy pool · 200867af
      Dan Streetman authored
      Add a work_struct to struct zswap_pool, and change __zswap_pool_empty to
      use the workqueue instead of using call_rcu().
      
      When zswap destroys a pool no longer in use, it uses call_rcu() to
      perform the destruction/freeing.  Since that executes in softirq
      context, it must not sleep.  However, actually destroying the pool
      involves freeing the per-cpu compressors (which requires locking the
      cpu_add_remove_lock mutex) and freeing the zpool, for which the
      implementation may sleep (e.g.  zsmalloc calls kmem_cache_destroy, which
      locks the slab_mutex).  So if either mutex is currently taken, or any
      other part of the compressor or zpool implementation sleeps, it will
      result in a BUG().
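
      A simplified sketch of the scheme (function names as in mm/zswap.c):

      	static void __zswap_pool_release(struct work_struct *work)
      	{
      		struct zswap_pool *pool =
      			container_of(work, typeof(*pool), work);

      		/* process context: freeing the compressor and the zpool
      		 * is allowed to sleep here */
      		zswap_pool_destroy(pool);
      	}

      	static void __zswap_pool_empty(struct kref *kref)
      	{
      		struct zswap_pool *pool =
      			container_of(kref, typeof(*pool), kref);

      		/* may run in a context that cannot sleep, so only
      		 * schedule the real destruction */
      		INIT_WORK(&pool->work, __zswap_pool_release);
      		schedule_work(&pool->work);
      	}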
      
      It's not easy to reproduce this when changing zswap's params normally.
      In testing with a loaded system, this does not fail:
      
        $ cd /sys/module/zswap/parameters
        $ echo lz4 > compressor ; echo zsmalloc > zpool
      
      nor does this:
      
        $ while true ; do
        > echo lzo > compressor ; echo zbud > zpool
        > sleep 1
        > echo lz4 > compressor ; echo zsmalloc > zpool
        > sleep 1
        > done
      
      although it's still possible either of those might fail, depending on
      whether anything else besides zswap has locked the mutexes.
      
      However, changing a parameter with no delay immediately causes the
      schedule while atomic BUG:
      
        $ while true ; do
        > echo lzo > compressor ; echo lz4 > compressor
        > done
      
      This is essentially the same as Yu Zhao's proposed patch to zsmalloc,
      but moved to zswap, to cover compressor and zpool freeing.
      
      Fixes: f1c54846 ("zswap: dynamic pool creation")
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Reported-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Streetman <dan.streetman@canonical.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      200867af
    • zram: use per-cpu compression streams · da9556a2
      Sergey Senozhatsky authored
      Remove the idle streams list and keep compression streams in per-cpu
      data.  This removes two contended spin_lock()/spin_unlock() calls from
      the write path and also prevents a write OP from being preempted while
      holding the compression stream, which can cause slowdowns.
      
      For instance, let's assume that we have N cpus and N-2
      max_comp_streams.  TASK1 owns the last idle stream, and TASK2 and TASK3
      come in with write requests:
      
        TASK1            TASK2              TASK3
       zram_bvec_write()
        spin_lock
        find stream
        spin_unlock
      
        compress
      
        <<preempted>>   zram_bvec_write()
                         spin_lock
                         find stream
                         spin_unlock
                           no_stream
                             schedule
                                           zram_bvec_write()
                                            spin_lock
                                            find_stream
                                            spin_unlock
                                              no_stream
                                                schedule
         spin_lock
         release stream
         spin_unlock
           wake up TASK2
      
      Not only do TASK2 and TASK3 fail to get a stream, TASK1 is also
      preempted in the middle of its operation, while we would prefer it to
      finish compression and release the stream.
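
      With per-cpu streams the write path simply pins itself to a CPU and
      uses that CPU's stream -- a sketch of the accessors (assuming the
      zcomp helpers introduced by this series):

      	struct zcomp {
      		struct zcomp_strm * __percpu *stream;	/* one per CPU */
      	};

      	static struct zcomp_strm *zcomp_stream_get(struct zcomp *comp)
      	{
      		/* disables preemption: the stream is ours until put */
      		return *get_cpu_ptr(comp->stream);
      	}

      	static void zcomp_stream_put(struct zcomp *comp)
      	{
      		put_cpu_ptr(comp->stream);
      	}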
      
      Test environment: x86_64, 4 CPU box, 3G zram, lzo
      
      The following fio tests were executed:
            read, randread, write, randwrite, rw, randrw
      with the increasing number of jobs from 1 to 10.
      
                        4 streams        8 streams       per-cpu
        ===========================================================
        jobs1
        READ:           2520.1MB/s       2566.5MB/s      2491.5MB/s
        READ:           2102.7MB/s       2104.2MB/s      2091.3MB/s
        WRITE:          1355.1MB/s       1320.2MB/s      1378.9MB/s
        WRITE:          1103.5MB/s       1097.2MB/s      1122.5MB/s
        READ:           434013KB/s       435153KB/s      439961KB/s
        WRITE:          433969KB/s       435109KB/s      439917KB/s
        READ:           403166KB/s       405139KB/s      403373KB/s
        WRITE:          403223KB/s       405197KB/s      403430KB/s
        jobs2
        READ:           7958.6MB/s       8105.6MB/s      8073.7MB/s
        READ:           6864.9MB/s       6989.8MB/s      7021.8MB/s
        WRITE:          2438.1MB/s       2346.9MB/s      3400.2MB/s
        WRITE:          1994.2MB/s       1990.3MB/s      2941.2MB/s
        READ:           981504KB/s       973906KB/s      1018.8MB/s
        WRITE:          981659KB/s       974060KB/s      1018.1MB/s
        READ:           937021KB/s       938976KB/s      987250KB/s
        WRITE:          934878KB/s       936830KB/s      984993KB/s
        jobs3
        READ:           13280MB/s        13553MB/s       13553MB/s
        READ:           11534MB/s        11785MB/s       11755MB/s
        WRITE:          3456.9MB/s       3469.9MB/s      4810.3MB/s
        WRITE:          3029.6MB/s       3031.6MB/s      4264.8MB/s
        READ:           1363.8MB/s       1362.6MB/s      1448.9MB/s
        WRITE:          1361.9MB/s       1360.7MB/s      1446.9MB/s
        READ:           1309.4MB/s       1310.6MB/s      1397.5MB/s
        WRITE:          1307.4MB/s       1308.5MB/s      1395.3MB/s
        jobs4
        READ:           20244MB/s        20177MB/s       20344MB/s
        READ:           17886MB/s        17913MB/s       17835MB/s
        WRITE:          4071.6MB/s       4046.1MB/s      6370.2MB/s
        WRITE:          3608.9MB/s       3576.3MB/s      5785.4MB/s
        READ:           1824.3MB/s       1821.6MB/s      1997.5MB/s
        WRITE:          1819.8MB/s       1817.4MB/s      1992.5MB/s
        READ:           1765.7MB/s       1768.3MB/s      1937.3MB/s
        WRITE:          1767.5MB/s       1769.1MB/s      1939.2MB/s
        jobs5
        READ:           18663MB/s        18986MB/s       18823MB/s
        READ:           16659MB/s        16605MB/s       16954MB/s
        WRITE:          3912.4MB/s       3888.7MB/s      6126.9MB/s
        WRITE:          3506.4MB/s       3442.5MB/s      5519.3MB/s
        READ:           1798.2MB/s       1746.5MB/s      1935.8MB/s
        WRITE:          1792.7MB/s       1740.7MB/s      1929.1MB/s
        READ:           1727.6MB/s       1658.2MB/s      1917.3MB/s
        WRITE:          1726.5MB/s       1657.2MB/s      1916.6MB/s
        jobs6
        READ:           21017MB/s        20922MB/s       21162MB/s
        READ:           19022MB/s        19140MB/s       18770MB/s
        WRITE:          3968.2MB/s       4037.7MB/s      6620.8MB/s
        WRITE:          3643.5MB/s       3590.2MB/s      6027.5MB/s
        READ:           1871.8MB/s       1880.5MB/s      2049.9MB/s
        WRITE:          1867.8MB/s       1877.2MB/s      2046.2MB/s
        READ:           1755.8MB/s       1710.3MB/s      1964.7MB/s
        WRITE:          1750.5MB/s       1705.9MB/s      1958.8MB/s
        jobs7
        READ:           21103MB/s        20677MB/s       21482MB/s
        READ:           18522MB/s        18379MB/s       19443MB/s
        WRITE:          4022.5MB/s       4067.4MB/s      6755.9MB/s
        WRITE:          3691.7MB/s       3695.5MB/s      5925.6MB/s
        READ:           1841.5MB/s       1933.9MB/s      2090.5MB/s
        WRITE:          1842.7MB/s       1935.3MB/s      2091.9MB/s
        READ:           1832.4MB/s       1856.4MB/s      1971.5MB/s
        WRITE:          1822.3MB/s       1846.2MB/s      1960.6MB/s
        jobs8
        READ:           20463MB/s        20194MB/s       20862MB/s
        READ:           18178MB/s        17978MB/s       18299MB/s
        WRITE:          4085.9MB/s       4060.2MB/s      7023.8MB/s
        WRITE:          3776.3MB/s       3737.9MB/s      6278.2MB/s
        READ:           1957.6MB/s       1944.4MB/s      2109.5MB/s
        WRITE:          1959.2MB/s       1946.2MB/s      2111.4MB/s
        READ:           1900.6MB/s       1885.7MB/s      2082.1MB/s
        WRITE:          1896.2MB/s       1881.4MB/s      2078.3MB/s
        jobs9
        READ:           19692MB/s        19734MB/s       19334MB/s
        READ:           17678MB/s        18249MB/s       17666MB/s
        WRITE:          4004.7MB/s       4064.8MB/s      6990.7MB/s
        WRITE:          3724.7MB/s       3772.1MB/s      6193.6MB/s
        READ:           1953.7MB/s       1967.3MB/s      2105.6MB/s
        WRITE:          1953.4MB/s       1966.7MB/s      2104.1MB/s
        READ:           1860.4MB/s       1897.4MB/s      2068.5MB/s
        WRITE:          1858.9MB/s       1895.9MB/s      2066.8MB/s
        jobs10
        READ:           19730MB/s        19579MB/s       19492MB/s
        READ:           18028MB/s        18018MB/s       18221MB/s
        WRITE:          4027.3MB/s       4090.6MB/s      7020.1MB/s
        WRITE:          3810.5MB/s       3846.8MB/s      6426.8MB/s
        READ:           1956.1MB/s       1994.6MB/s      2145.2MB/s
        WRITE:          1955.9MB/s       1993.5MB/s      2144.8MB/s
        READ:           1852.8MB/s       1911.6MB/s      2075.8MB/s
        WRITE:          1855.7MB/s       1914.6MB/s      2078.1MB/s
      
      perf stat
      
                                        4 streams                       8 streams                       per-cpu
        ====================================================================================================================
        jobs1
        stalled-cycles-frontend      23,174,811,209 (  38.21%)     23,220,254,188 (  38.25%)       23,061,406,918 (  38.34%)
        stalled-cycles-backend       11,514,174,638 (  18.98%)     11,696,722,657 (  19.27%)       11,370,852,810 (  18.90%)
        instructions                 73,925,005,782 (    1.22)     73,903,177,632 (    1.22)       73,507,201,037 (    1.22)
        branches                     14,455,124,835 ( 756.063)     14,455,184,779 ( 755.281)       14,378,599,509 ( 758.546)
        branch-misses                    69,801,336 (   0.48%)         80,225,529 (   0.55%)           72,044,726 (   0.50%)
        jobs2
        stalled-cycles-frontend      49,912,741,782 (  46.11%)     50,101,189,290 (  45.95%)       32,874,195,633 (  35.11%)
        stalled-cycles-backend       27,080,366,230 (  25.02%)     27,949,970,232 (  25.63%)       16,461,222,706 (  17.58%)
        instructions                122,831,629,690 (    1.13)    122,919,846,419 (    1.13)      121,924,786,775 (    1.30)
        branches                     23,725,889,239 ( 692.663)     23,733,547,140 ( 688.062)       23,553,950,311 ( 794.794)
        branch-misses                    90,733,041 (   0.38%)         96,320,895 (   0.41%)           84,561,092 (   0.36%)
        jobs3
        stalled-cycles-frontend      66,437,834,608 (  45.58%)     63,534,923,344 (  43.69%)       42,101,478,505 (  33.19%)
        stalled-cycles-backend       34,940,799,661 (  23.97%)     34,774,043,148 (  23.91%)       21,163,324,388 (  16.68%)
        instructions                171,692,121,862 (    1.18)    171,775,373,044 (    1.18)      170,353,542,261 (    1.34)
        branches                     32,968,962,622 ( 628.723)     32,987,739,894 ( 630.512)       32,729,463,918 ( 717.027)
        branch-misses                   111,522,732 (   0.34%)        110,472,894 (   0.33%)           99,791,291 (   0.30%)
        jobs4
        stalled-cycles-frontend      98,741,701,675 (  49.72%)     94,797,349,965 (  47.59%)       54,535,655,381 (  33.53%)
        stalled-cycles-backend       54,642,609,615 (  27.51%)     55,233,554,408 (  27.73%)       27,882,323,541 (  17.14%)
        instructions                220,884,807,851 (    1.11)    220,930,887,273 (    1.11)      218,926,845,851 (    1.35)
        branches                     42,354,518,180 ( 592.105)     42,362,770,587 ( 590.452)       41,955,552,870 ( 716.154)
        branch-misses                   138,093,449 (   0.33%)        131,295,286 (   0.31%)          121,794,771 (   0.29%)
        jobs5
        stalled-cycles-frontend     116,219,747,212 (  48.14%)    110,310,397,012 (  46.29%)       66,373,082,723 (  33.70%)
        stalled-cycles-backend       66,325,434,776 (  27.48%)     64,157,087,914 (  26.92%)       32,999,097,299 (  16.76%)
        instructions                270,615,008,466 (    1.12)    270,546,409,525 (    1.14)      268,439,910,948 (    1.36)
        branches                     51,834,046,557 ( 599.108)     51,811,867,722 ( 608.883)       51,412,576,077 ( 729.213)
        branch-misses                   158,197,086 (   0.31%)        142,639,805 (   0.28%)          133,425,455 (   0.26%)
        jobs6
        stalled-cycles-frontend     138,009,414,492 (  48.23%)    139,063,571,254 (  48.80%)       75,278,568,278 (  32.80%)
        stalled-cycles-backend       79,211,949,650 (  27.68%)     79,077,241,028 (  27.75%)       37,735,797,899 (  16.44%)
        instructions                319,763,993,731 (    1.12)    319,937,782,834 (    1.12)      316,663,600,784 (    1.38)
        branches                     61,219,433,294 ( 595.056)     61,250,355,540 ( 598.215)       60,523,446,617 ( 733.706)
        branch-misses                   169,257,123 (   0.28%)        154,898,028 (   0.25%)          141,180,587 (   0.23%)
        jobs7
        stalled-cycles-frontend     162,974,812,119 (  49.20%)    159,290,061,987 (  48.43%)       88,046,641,169 (  33.21%)
        stalled-cycles-backend       92,223,151,661 (  27.84%)     91,667,904,406 (  27.87%)       44,068,454,971 (  16.62%)
        instructions                369,516,432,430 (    1.12)    369,361,799,063 (    1.12)      365,290,380,661 (    1.38)
        branches                     70,795,673,950 ( 594.220)     70,743,136,124 ( 597.876)       69,803,996,038 ( 732.822)
        branch-misses                   181,708,327 (   0.26%)        165,767,821 (   0.23%)          150,109,797 (   0.22%)
        jobs8
        stalled-cycles-frontend     185,000,017,027 (  49.30%)    182,334,345,473 (  48.37%)       99,980,147,041 (  33.26%)
        stalled-cycles-backend      105,753,516,186 (  28.18%)    107,937,830,322 (  28.63%)       51,404,177,181 (  17.10%)
        instructions                418,153,161,055 (    1.11)    418,308,565,828 (    1.11)      413,653,475,581 (    1.38)
        branches                     80,035,882,398 ( 592.296)     80,063,204,510 ( 589.843)       79,024,105,589 ( 730.530)
        branch-misses                   199,764,528 (   0.25%)        177,936,926 (   0.22%)          160,525,449 (   0.20%)
        jobs9
        stalled-cycles-frontend     210,941,799,094 (  49.63%)    204,714,679,254 (  48.55%)      114,251,113,756 (  33.96%)
        stalled-cycles-backend      122,640,849,067 (  28.85%)    122,188,553,256 (  28.98%)       58,360,041,127 (  17.35%)
        instructions                468,151,025,415 (    1.10)    467,354,869,323 (    1.11)      462,665,165,216 (    1.38)
        branches                     89,657,067,510 ( 585.628)     89,411,550,407 ( 588.990)       88,360,523,943 ( 730.151)
        branch-misses                   218,292,301 (   0.24%)        191,701,247 (   0.21%)          178,535,678 (   0.20%)
        jobs10
        stalled-cycles-frontend     233,595,958,008 (  49.81%)    227,540,615,689 (  49.11%)      160,341,979,938 (  43.07%)
        stalled-cycles-backend      136,153,676,021 (  29.03%)    133,635,240,742 (  28.84%)       65,909,135,465 (  17.70%)
        instructions                517,001,168,497 (    1.10)    516,210,976,158 (    1.11)      511,374,038,613 (    1.37)
        branches                     98,911,641,329 ( 585.796)     98,700,069,712 ( 591.583)       97,646,761,028 ( 728.712)
        branch-misses                   232,341,823 (   0.23%)        199,256,308 (   0.20%)          183,135,268 (   0.19%)
      
      Per-cpu streams tend to cause significantly fewer stalled cycles,
      execute fewer branches, and hit fewer branch misses.
      
      perf stat reported execution time
      
                                4 streams        8 streams       per-cpu
        ====================================================================
        jobs1
        seconds elapsed        20.909073870     20.875670495    20.817838540
        jobs2
        seconds elapsed        18.529488399     18.720566469    16.356103108
        jobs3
        seconds elapsed        18.991159531     18.991340812    16.766216066
        jobs4
        seconds elapsed        19.560643828     19.551323547    16.246621715
        jobs5
        seconds elapsed        24.746498464     25.221646740    20.696112444
        jobs6
        seconds elapsed        28.258181828     28.289765505    22.885688857
        jobs7
        seconds elapsed        32.632490241     31.909125381    26.272753738
        jobs8
        seconds elapsed        35.651403851     36.027596308    29.108024711
        jobs9
        seconds elapsed        40.569362365     40.024227989    32.898204012
        jobs10
        seconds elapsed        44.673112304     43.874898137    35.632952191
      
      Please see
        Link: http://marc.info/?l=linux-kernel&m=146166970727530
        Link: http://marc.info/?l=linux-kernel&m=146174716719650
      for more test results (under low memory conditions).
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da9556a2
    • zsmalloc: require GFP in zs_malloc() · d0d8da2d
      Sergey Senozhatsky authored
      Pass GFP flags to zs_malloc() instead of using a fixed mask supplied to
      zs_create_pool(), so we can be more flexible; more importantly, we need
      this to switch zram to per-cpu compression streams -- zram will try to
      allocate a handle with preemption disabled in a fast path and switch to
      a slow path (using a different gfp mask) if the fast one fails.

      Apart from that, this also aligns the zs_malloc() interface with
      zpool/zbud.
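
      A sketch of the two-phase allocation this enables in zram (hypothetical
      masks; structure names simplified):

      	unsigned long handle;

      	/* fast path: preemption is disabled, the mask must not sleep */
      	handle = zs_malloc(meta->mem_pool, clen,
      			   GFP_NOWAIT | __GFP_NOWARN);
      	if (!handle) {
      		/* slow path: preemption re-enabled by the caller; this
      		 * mask may sleep and perform direct reclaim */
      		handle = zs_malloc(meta->mem_pool, clen, GFP_NOIO);
      	}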
      
      [sergey.senozhatsky@gmail.com: pass GFP flags to zs_malloc() instead of using a fixed mask]
        Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish
      Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0d8da2d
    • zsmalloc: remove unused pool param in obj_free · 1ee47165
      Minchan Kim authored
      Let's remove the unused pool parameter in obj_free().
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ee47165
    • zsmalloc: reorder function parameters · 251cbb95
      Minchan Kim authored
      Clean up function parameter ordering so that the higher-level data
      structure comes first.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      251cbb95
    • zsmalloc: clean up many BUG_ON · 830e4bc5
      Minchan Kim authored
      There are many BUG_ONs in zsmalloc.c, which is not recommended, so
      replace them with alternatives.

      The normal rules are as follows (see the example after this list):
      
      1. avoid BUG_ON if possible.  Instead, use VM_BUG_ON or VM_BUG_ON_PAGE

      2. use VM_BUG_ON_PAGE if we need to see struct page's fields

      3. use those assertions in primitive functions so that higher-level
         functions can rely on the assertion in the primitive function

      4. don't use an assertion if the following instruction can trigger an
         Oops anyway
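
      For instance, following rules 1-2, a conversion looks roughly like
      this (illustrative, not a specific hunk from the patch):

      	/* before: always compiled in; kills the kernel on failure */
      	BUG_ON(!is_first_page(page));

      	/* after: compiled in only with CONFIG_DEBUG_VM, and dumps the
      	 * offending struct page in the report */
      	VM_BUG_ON_PAGE(!is_first_page(page), page);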
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      830e4bc5
    • zsmalloc: use first_page rather than page · a4209467
      Minchan Kim authored
      Clean up the "struct page" function parameter.  Many zsmalloc
      functions expect the page parameter to be a first page, so use
      "first_page" rather than "page" for code readability.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4209467
    • kasan/tests: add tests for user memory access functions · eae08dca
      Andrey Ryabinin authored
      Add some tests for the newly-added user memory access API.
      
      Link: http://lkml.kernel.org/r/1462538722-1574-1-git-send-email-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eae08dca
    • x86/kasan: instrument user memory access API · 1771c6e1
      Andrey Ryabinin authored
      Exchanges between user and kernel memory are coded in assembly
      language, which means that such accesses won't be spotted by KASAN, as
      the compiler instruments only C code.

      Add explicit KASAN checks to the user memory access API to ensure that
      userspace writes to (or reads from) valid kernel memory.
      
      Note: unlike the others, strncpy_from_user() is written mostly in C,
      and KASAN sees memory accesses in it.  However, it makes sense to add
      an explicit check for all @count bytes that could *potentially* be
      written to the kernel.
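
      A sketch of the pattern (simplified; the real checks live in the x86
      uaccess headers):

      	static inline unsigned long
      	copy_from_user(void *to, const void __user *from, unsigned long n)
      	{
      		kasan_check_write(to, n);	/* kernel buffer is written */
      		return _copy_from_user(to, from, n);
      	}

      	static inline unsigned long
      	copy_to_user(void __user *to, const void *from, unsigned long n)
      	{
      		kasan_check_read(from, n);	/* kernel buffer is read */
      		return _copy_to_user(to, from, n);
      	}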
      
      [aryabinin@virtuozzo.com: move kasan check under the condition]
        Link: http://lkml.kernel.org/r/1462869209-21096-1-git-send-email-aryabinin@virtuozzo.com
      Link: http://lkml.kernel.org/r/1462538722-1574-4-git-send-email-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1771c6e1
    • mm/kasan: add API to check memory regions · 64f8ebaf
      Andrey Ryabinin authored
      Memory accesses coded in assembly won't be seen by KASAN, as the
      compiler can instrument only C code.  Add a kasan_check_[read,write]()
      API which is going to be used to check a certain memory range.
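
      The new entry points, roughly (assuming a dedicated kasan-checks
      header):

      	void kasan_check_read(const void *p, unsigned int size);
      	void kasan_check_write(const void *p, unsigned int size);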
      
      Link: http://lkml.kernel.org/r/1462538722-1574-3-git-send-email-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64f8ebaf
    • mm/kasan: print name of mem[set,cpy,move]() caller in report · 936bb4bb
      Andrey Ryabinin authored
      When a bogus memory access happens in mem[set,cpy,move](), it's usually
      the caller's fault.  So don't blame mem[set,cpy,move]() in the bug
      report; blame the caller instead.
      
      Before:
        BUG: KASAN: out-of-bounds access in memset+0x23/0x40 at <address>
      After:
        BUG: KASAN: out-of-bounds access in <memset_caller> at <address>
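
      A sketch of how the caller's address gets into the report (simplified
      from KASAN's interceptors):

      	void *memset(void *addr, int c, size_t len)
      	{
      		/* pass the return address so the report names the caller */
      		check_memory_region((unsigned long)addr, len, true, _RET_IP_);
      		return __memset(addr, c, len);
      	}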
      
      Link: http://lkml.kernel.org/r/1462538722-1574-2-git-send-email-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      936bb4bb
    • mm, kasan: add a ksize() test · 96fe805f
      Alexander Potapenko authored
      Add a test that makes sure ksize() unpoisons the whole chunk.
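
      The test might look roughly like this (a sketch in the style of
      lib/test_kasan.c):

      	static noinline void __init ksize_unpoisons_memory(void)
      	{
      		char *ptr;
      		size_t size = 123, real_size;

      		ptr = kmalloc(size, GFP_KERNEL);
      		real_size = ksize(ptr);
      		/* must NOT be reported: ksize() unpoisoned the whole
      		 * chunk up to real_size */
      		ptr[size] = 'x';
      		/* must be reported: it's past the real size */
      		ptr[real_size] = 'y';
      		kfree(ptr);
      	}
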
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96fe805f
    • mm, kasan: don't call kasan_krealloc() from ksize() · 4ebb31a4
      Alexander Potapenko authored
      Instead of calling kasan_krealloc(), which replaces the memory
      allocation stack ID (if stack depot is used), just unpoison the whole
      memory chunk.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ebb31a4
    • mm: kasan: initial memory quarantine implementation · 55834c59
      Alexander Potapenko authored
      Quarantine isolates freed objects in a separate queue.  The objects are
      returned to the allocator later, which helps to detect use-after-free
      errors.
      
      When the object is freed, its state changes from KASAN_STATE_ALLOC to
      KASAN_STATE_QUARANTINE.  The object is poisoned and put into quarantine
      instead of being returned to the allocator, therefore every subsequent
      access to that object triggers a KASAN error, and the error handler is
      able to say where the object has been allocated and deallocated.
      
      When it's time for the object to leave quarantine, its state becomes
      KASAN_STATE_FREE and it's returned to the allocator.  From now on the
      allocator may reuse it for another allocation.  Before that happens,
      it's still possible to detect a use-after free on that object (it
      retains the allocation/deallocation stacks).
      
      When the allocator reuses this object, the shadow is unpoisoned and old
      allocation/deallocation stacks are wiped.  Therefore a use of this
      object, even an incorrect one, won't trigger ASan warning.
      
      Without the quarantine, it's not guaranteed that the objects aren't
      reused immediately, that's why the probability of catching a
      use-after-free is lower than with quarantine in place.
      
      Freed objects are first added to per-cpu quarantine queues.  When a
      cache is destroyed or memory shrinking is requested, the objects are
      moved into the global quarantine queue.  Whenever a kmalloc call allows
      memory reclaiming, the oldest objects are popped out of the global queue
      until the total size of objects in quarantine is less than 3/4 of the
      maximum quarantine size (which is a fraction of installed physical
      memory).
      
      As long as an object remains in the quarantine, KASAN is able to report
      accesses to it, so the chance of reporting a use-after-free is
      increased.  Once the object leaves quarantine, the allocator may reuse
      it, in which case the object is unpoisoned and KASAN can't detect
      incorrect accesses to it.
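
      A conceptual sketch of the free path with quarantine (heavily
      simplified; helper names assumed from mm/kasan/):

      	bool kasan_slab_free(struct kmem_cache *cache, void *object)
      	{
      		/* poison the object and queue it instead of freeing it */
      		kasan_poison_shadow(object, cache->object_size,
      				    KASAN_KMALLOC_FREE);
      		quarantine_put(get_free_info(cache, object), cache);
      		return true;	/* tell the allocator not to free it yet */
      	}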
      
      Right now quarantine support is only enabled in SLAB allocator.
      Unification of KASAN features in SLAB and SLUB will be done later.
      
      This patch is based on the "mm: kasan: quarantine" patch originally
      prepared by Dmitry Chernenkov.  A number of improvements have been
      suggested by Andrey Ryabinin.
      
      [glider@google.com: v9]
        Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55834c59
    • mm: call page_ext_init() after all struct pages are initialized · b8f1a75d
      Yang Shi authored
      When DEFERRED_STRUCT_PAGE_INIT is enabled, only a subset of the memmap
      is initialized at boot; the rest is initialized in parallel by starting
      a one-off "pgdatinitX" kernel thread for each node X.
      
      If page_ext_init() is called before that, some pages will not have a
      valid extension; this may lead to the kernel oops below when booting:
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
        PGD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in:
        CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
        Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
        task: ffff88017c080040 ti: ffff88017c084000 task.ti: ffff88017c084000
        RIP: 0010:[<ffffffff8118d982>]  [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
        RSP: 0000:ffff88017c087c48  EFLAGS: 00010046
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
        RDX: 0000000000000980 RSI: 0000000000000080 RDI: 0000000000660401
        RBP: ffff88017c087cd0 R08: 0000000000000401 R09: 0000000000000009
        R10: ffff88017c080040 R11: 000000000000000a R12: 0000000000000400
        R13: ffffea0019810000 R14: ffffea0019810040 R15: ffff88066cfe6080
        FS:  0000000000000000(0000) GS:ffff88066cd40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000002406000 CR4: 00000000000006e0
        Call Trace:
          free_hot_cold_page+0x192/0x1d0
          __free_pages+0x5c/0x90
          __free_pages_boot_core+0x11a/0x14e
          deferred_free_range+0x50/0x62
          deferred_init_memmap+0x220/0x3c3
          kthread+0xf8/0x110
          ret_from_fork+0x22/0x40
        Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
        RIP  [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
         RSP <ffff88017c087c48>
        CR2: 0000000000000000
      
      Move page_ext_init() after page_alloc_init_late() to make sure the page
      extension is set up for all pages.
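
      A simplified sketch of the resulting ordering in kernel_init_freeable()
      (surrounding calls are elided):

        static noinline void __init kernel_init_freeable(void)
        {
        	/* ... */
        	page_alloc_init_late();	/* waits for the "pgdatinitX" threads,
        				 * so every struct page is initialized */
        	page_ext_init();	/* safe now: no deferred pages remain */
        	/* ... */
        }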
      
      Link: http://lkml.kernel.org/r/1463696006-31360-1-git-send-email-yang.shi@linaro.org
      Signed-off-by: NYang Shi <yang.shi@linaro.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8f1a75d
    • D
      mm, migrate: increment fail count on ENOMEM · dfef2ef4
      David Rientjes 提交于
      If page migration fails due to -ENOMEM, nr_failed should still be
      incremented for proper statistics.
      
      This was encountered recently when all page migration vmstats showed 0,
      and it was inferred that migrate_pages() was never called, although in reality
      the first page migration failed because compaction_alloc() failed to
      find a migration target.
      
      This patch increments nr_failed so the vmstat is properly accounted on
      ENOMEM.
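
      A sketch of the result handling in migrate_pages() (the call itself
      and the other cases are simplified):

        switch (rc) {
        case -ENOMEM:
        	nr_failed++;	/* the fix: account the failure before bailing */
        	goto out;
        case -EAGAIN:
        	retry++;
        	break;
        case MIGRATEPAGE_SUCCESS:
        	nr_succeeded++;
        	break;
        default:
        	nr_failed++;
        	break;
        }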
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605191510230.32658@chino.kir.corp.google.com
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dfef2ef4
    • C
      mm/compaction.c: fix zoneindex in kcompactd() · 6cd9dc3e
      Chen Feng 提交于
      While testing kcompactd on my platform (3G of memory, DMA zone only), I
      found that kcompactd never woke up.  The zone index has already been
      decremented by 1 earlier, so the traversal here should use <=.
      
      It fixes a regression where kswapd could previously compact, but
      kcompactd not.  Not a crash fix though.
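
      A sketch of the corrected traversal: classzone_idx is already the
      highest valid index (the "minus 1" mentioned above), so '<' used to
      skip the last zone:

        for (zoneid = 0; zoneid <= classzone_idx; zoneid++) {
        	struct zone *zone = &pgdat->node_zones[zoneid];

        	if (!populated_zone(zone))
        		continue;
        	/* ... decide whether this zone needs compaction ... */
        }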
      
      [akpm@linux-foundation.org: fix kcompactd_do_work() as well, per Hugh]
      Link: http://lkml.kernel.org/r/1463659121-84124-1-git-send-email-puck.chen@hisilicon.com
      Fixes: accf6242 ("mm, kswapd: replace kswapd compaction with waking up kcompactd")
      Signed-off-by: NChen Feng <puck.chen@hisilicon.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zhuangluan Su <suzhuangluan@hisilicon.com>
      Cc: Yiping Xu <xuyiping@hisilicon.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6cd9dc3e
    • Y
      mm: page_is_guard(): return false when page_ext arrays are not allocated yet · 0bb2fd13
      Yang Shi 提交于
      When enabling the below kernel configs:
      
      CONFIG_DEFERRED_STRUCT_PAGE_INIT
      CONFIG_DEBUG_PAGEALLOC
      CONFIG_PAGE_EXTENSION
      CONFIG_DEBUG_VM
      
      kernel bootup may fail due to the following oops:
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
        PGD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in:
        CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
        Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
        task: ffff88017c080040 ti: ffff88017c084000 task.ti: ffff88017c084000
        RIP: 0010:[<ffffffff8118d982>]  [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
        RSP: 0000:ffff88017c087c48  EFLAGS: 00010046
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
        RDX: 0000000000000980 RSI: 0000000000000080 RDI: 0000000000660401
        RBP: ffff88017c087cd0 R08: 0000000000000401 R09: 0000000000000009
        R10: ffff88017c080040 R11: 000000000000000a R12: 0000000000000400
        R13: ffffea0019810000 R14: ffffea0019810040 R15: ffff88066cfe6080
        FS:  0000000000000000(0000) GS:ffff88066cd40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000002406000 CR4: 00000000000006e0
        Call Trace:
          free_hot_cold_page+0x192/0x1d0
          __free_pages+0x5c/0x90
          __free_pages_boot_core+0x11a/0x14e
          deferred_free_range+0x50/0x62
          deferred_init_memmap+0x220/0x3c3
          kthread+0xf8/0x110
          ret_from_fork+0x22/0x40
        Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
        RIP  [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
         RSP <ffff88017c087c48>
        CR2: 0000000000000000
      
      The problem is that lookup_page_ext() returns NULL and page_is_guard()
      then tries to dereference it during page freeing.
      
      page_is_guard() depends on the PAGE_EXT_DEBUG_GUARD bit of the page
      extension flags, but page freeing might reach here before the page_ext
      arrays are allocated, when feeding a range of pages to the allocator
      for the first time during bootup or memory hotplug.
      
      When it returns NULL, page_is_guard() should just return false instead
      of checking PAGE_EXT_DEBUG_GUARD unconditionally.
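
      A sketch of the fixed helper (this closely follows the shape of the
      include/linux/mm.h implementation of the time):

        static inline bool page_is_guard(struct page *page)
        {
        	struct page_ext *page_ext;

        	if (!debug_guardpage_enabled())
        		return false;

        	page_ext = lookup_page_ext(page);
        	if (unlikely(!page_ext))
        		return false;	/* page_ext arrays not allocated yet */

        	return test_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags);
        }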
      
      Link: http://lkml.kernel.org/r/1463610225-29060-1-git-send-email-yang.shi@linaro.org
      Signed-off-by: NYang Shi <yang.shi@linaro.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0bb2fd13
    • D
      mm, thp: khugepaged should scan when sleep value is written · f0508977
      David Rientjes 提交于
      If a large value is written to scan_sleep_millisecs, for example, that
      period must elapse before khugepaged will wake up for periodic
      collapsing.
      
      If this value is tuned to 1 day, for example, and then re-tuned to its
      default 10s, khugepaged will still wait for a day before scanning again.
      
      This patch causes khugepaged to wake up immediately when the value is
      changed, and then sleep until that value is rewritten or the new period
      elapses.
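
      A sketch of the sysfs store handler after the change (the expiry
      variable reflects the idea; its name is illustrative):

        static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
        					  struct kobj_attribute *attr,
        					  const char *buf, size_t count)
        {
        	unsigned long msecs;

        	if (kstrtoul(buf, 10, &msecs) || msecs > UINT_MAX)
        		return -EINVAL;

        	khugepaged_scan_sleep_millisecs = msecs;
        	khugepaged_sleep_expire = 0;	/* forget the old deadline */
        	wake_up_interruptible(&khugepaged_wait);

        	return count;
        }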
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605181453200.4786@chino.kir.corp.google.com
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0508977
    • N
      MM: increase safety margin provided by PF_LESS_THROTTLE · a53eaff8
      NeilBrown 提交于
      When nfsd is exporting a filesystem over NFS which is then NFS-mounted
      on the local machine there is a risk of deadlock.  This happens when
      there are lots of dirty pages in the NFS filesystem and they cause NFSD
      to be throttled, either in throttle_vm_writeout() or in
      balance_dirty_pages().
      
      To avoid this problem the PF_LESS_THROTTLE flag is set for NFSD threads
      and it provides a 25% increase to the limits that affect NFSD.  Any
      process writing to an NFS filesystem will be throttled well before the
      number of dirty NFS pages reaches the limit imposed on NFSD, so NFSD
      will not deadlock on pages that it needs to write out.  At least it
      shouldn't.
      
      All processes are allowed a small excess margin to avoid performing too
      many calculations: ratelimit_pages.
      
      ratelimit_pages is set so that if a thread on every CPU uses the entire
      margin, the total will only go 3% over the limit, and this is much less
      than the 25% bonus that PF_LESS_THROTTLE provides, so this margin
      shouldn't be a problem.  But it is.
      
      The "total memory" that these 3% and 25% are calculated against is not
      really total memory but "global_dirtyable_memory()", which doesn't
      include anonymous memory, just free memory and page-cache memory.
      
      The "ratelimit_pages" number is based on whatever the
      global_dirtyable_memory was on the last CPU hot-plug, which might not be
      what you expect, but is probably close to the total freeable memory.
      
      The throttle threshold uses global_dirtyable_memory() at the moment
      when the throttling happens, which could be much less than at the last
      CPU hotplug.  So if lots of anonymous memory has been allocated, thus
      pushing out lots of page-cache pages, then NFSD might end up being
      throttled due to dirty NFS pages because the "25%" bonus it gets is
      calculated against a rather small amount of dirtyable memory, while the
      "3%" margin that other processes are allowed to dirty without penalty is
      calculated against a much larger number.
      
      To remove this possibility of deadlock we need to make sure that the
      margin granted to PF_LESS_THROTTLE exceeds that rate-limit margin.
      Simply adding ratelimit_pages isn't enough as that should be multiplied
      by the number of cpus.
      
      So add "global_wb_domain.dirty_limit / 32" as that more accurately
      reflects the current total over-shoot margin.  This ensures that the
      number of dirty NFS pages never gets so high that nfsd will be throttled
      waiting for them to be written.
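
      A sketch of the change in the dirty-limit calculation (function and
      field names follow the writeback code of the time; treat this as the
      shape of the fix, not the exact diff):

        if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
        	bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
        	thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
        }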
      
      Link: http://lkml.kernel.org/r/87futgowwv.fsf@notabene.neil.brown.name
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a53eaff8
    • N
      mm: check_new_page_bad() directly returns in __PG_HWPOISON case · e570f56c
      Naoya Horiguchi 提交于
      Currently we check page->flags twice for "HWPoisoned" case of
      check_new_page_bad(), which can cause a race with unpoisoning.
      
      This race unnecessarily taints kernel with "BUG: Bad page state".
      check_new_page_bad() is the only caller of bad_page() which is
      interested in __PG_HWPOISON, so let's move the hwpoison related code in
      bad_page() to it.
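
      A sketch of the hwpoison handling moved into check_new_page_bad()
      (the remaining bad-flag checks are elided):

        static void check_new_page_bad(struct page *page)
        {
        	if (unlikely(page->flags & __PG_HWPOISON)) {
        		/* don't complain about hwpoisoned pages */
        		page_mapcount_reset(page);	/* remove PageBuddy */
        		return;
        	}
        	/* ... gather bad_reason/bad_flags and call bad_page() ... */
        }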
      
      Link: http://lkml.kernel.org/r/20160518100949.GA17299@hori1.linux.bs1.fc.nec.co.jp
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e570f56c
    • S
      mm, kasan: fix to call kasan_free_pages() after poisoning page · 29b52de1
      seokhoon.yoon 提交于
      When CONFIG_PAGE_POISONING and CONFIG_KASAN are enabled,
      free_pages_prepare()'s code flow is as follows:
      
        1) kmemcheck_free_shadow()
        2) kasan_free_pages()
           - marks the page's shadow bytes as freed
        3) kernel_poison_pages()
           3.1) checks, via KASAN, whether the access to the page is valid
             ---> an error occurs: KASAN considers this an invalid access
           3.2) poisons the page
        4) kernel_map_pages()
      
      So kasan_free_pages() should be called after poisoning the page.
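
      A simplified sketch of the corrected order in free_pages_prepare():
      poison first, and only then tell KASAN the pages are gone:

        kmemcheck_free_shadow(page, order);
        kernel_poison_pages(page, 1 << order, 0);	/* may legitimately
        						 * touch the page */
        kasan_free_pages(page, order);	/* now mark it freed in the shadow */
        kernel_map_pages(page, 1 << order, 0);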
      
      Link: http://lkml.kernel.org/r/1463220405-7455-1-git-send-email-iamyooon@gmail.com
      Signed-off-by: Nseokhoon.yoon <iamyooon@gmail.com>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29b52de1
    • M
      mm: disable fault around on emulated access bit architecture · d0834a6c
      Minchan Kim 提交于
      fault_around aims to reduce minor faults of file-backed pages via
      speculative ahead-of-time pte mapping and by relying on readahead
      logic.  However, on architectures without a hardware access bit the
      benefit is highly limited, because they must emulate the young bit
      with minor faults for reclaim's page aging algorithm.  IOW, we cannot
      reduce minor faults on those architectures.
      
      I did a quick test on my ARM machine.
      
      A 512M file was mmapped and every word read sequentially from an eSATA
      drive, 4 times.  The stddev is stable.
      
        = fault_around 4096 =
        elapsed time(usec): 6747645
      
        = fault_around 65536 =
        elapsed time(usec): 6709263
      
        0.5% gain.
      
      Even when I tested with eMMC there was no gain, presumably because with
      slow storage the major fault is the dominant factor.
      
      Also, fault_around has the side effect of shrinking slab more
      aggressively and causing higher vmpressure, so if such speculation
      fails, it can evict more slab objects, which can result in page I/O
      (e.g., for the inode cache).  In the end, that would nullify any
      benefit of fault_around.
      
      So let's make the default "disabled" on those architectures.
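
      A sketch of the idea (the feature-test macro here is illustrative,
      not the one the patch actually keys off):

        #ifdef ARCH_HAS_HW_ACCESS_BIT
        static unsigned long fault_around_bytes __read_mostly =
        	rounddown_pow_of_two(65536);
        #else
        /* emulated access bit: fault-around buys nothing, map one page */
        static unsigned long fault_around_bytes __read_mostly = PAGE_SIZE;
        #endif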
      
      Link: http://lkml.kernel.org/r/20160518014229.GB21538@bbox
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0834a6c
    • K
      mm: make faultaround produce old ptes · 5c0a85fa
      Kirill A. Shutemov 提交于
      Currently, the faultaround code produces young ptes.  This can screw up
      vmscan behaviour[1], as it makes vmscan think that these pages are hot
      and not push them out on the first round.
      
      During sparse file access faultaround gets more pages mapped and all of
      them are young.  Under memory pressure, this makes vmscan swap out anon
      pages instead, or drop other page cache pages which otherwise stay
      resident.
      
      Modify faultaround to produce old ptes, so they can easily be reclaimed
      under memory pressure.
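
      A sketch of the mechanism: give do_set_pte() a way to install an old
      (not-accessed) entry for the speculatively mapped neighbours, while
      the actually faulting address still gets a young pte (the extra
      parameter reflects the approach; the exact signature is from memory):

        void do_set_pte(struct vm_area_struct *vma, unsigned long address,
        		struct page *page, pte_t *pte, bool write, bool anon,
        		bool old)
        {
        	pte_t entry = mk_pte(page, vma->vm_page_prot);

        	if (write)
        		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
        	if (old)
        		entry = pte_mkold(entry); /* reclaimable on first scan */
        	/* ... rmap/MMU accounting elided ... */
        	set_pte_at(vma->vm_mm, address, pte, entry);
        }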
      
      This can to some extent defeat the purpose of faultaround on machines
      without a hardware accessed bit, as it will not help us with reducing
      the number of minor page faults.
      
      We may want to disable faultaround on such machines altogether, but
      that's a subject for a separate patchset.
      
      Minchan:
       "I tested 512M mmap sequential word read test on non-HW access bit
        system (i.e., ARM) and confirmed it doesn't increase minor fault any
        more.
      
        old: 4096 fault_around
        minor fault: 131291
        elapsed time: 6747645 usec
      
        new: 65536 fault_around
        minor fault: 131291
        elapsed time: 6709263 usec
      
        0.56% benefit"
      
      [1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org
      
      Link: http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Tested-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c0a85fa
    • S
      mm: use phys_addr_t for reserve_bootmem_region() arguments · 4b50bcc7
      Stefan Bader 提交于
      Since commit 92923ca3 ("mm: meminit: only set page reserved in the
      memblock region") the reserved bit is set on reserved memblock regions.
      However, the start and end addresses are passed as unsigned long, which
      is only 32 bits wide on i386, so the code can end up marking the wrong
      pages reserved for ranges at 4GB and above.
      
      This was observed on a 32bit Xen dom0 which was booted with initial
      memory set to a value below 4G but allowed to balloon in more memory
      (dom0_mem=1024M for example).  This would define a reserved bootmem
      region for the additional memory (for example, on an 8GB system there
      was a reserved region covering the 4GB-8GB range).  But since the
      addresses were passed on as unsigned long, this was actually marking
      all pages from 0 to 4GB as reserved.
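
      The shape of the fix is just a signature change, so the full physical
      address survives on 32-bit kernels with more than 4GB of RAM:

        -void reserve_bootmem_region(unsigned long start, unsigned long end)
        +void reserve_bootmem_region(phys_addr_t start, phys_addr_t end)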
      
      Fixes: 92923ca3 ("mm: meminit: only set page reserved in the memblock region")
      Link: http://lkml.kernel.org/r/1463491221-10573-1-git-send-email-stefan.bader@canonical.com
      Signed-off-by: NStefan Bader <stefan.bader@canonical.com>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b50bcc7
    • O
      userfaultfd: don't pin the user memory in userfaultfd_file_create() · d2005e3f
      Oleg Nesterov 提交于
      userfaultfd_file_create() increments mm->mm_users; this means that the
      memory won't be unmapped/freed if the mm owner exits/execs, and
      UFFDIO_COPY after that can populate the orphaned mm further.
      
      Change userfaultfd_file_create() and userfaultfd_ctx_put() to use
      mm->mm_count to pin the mm_struct.  This means that
      atomic_inc_not_zero(mm->mm_users) is needed when we are going to
      actually play with this memory.  The handle_userfault() path is the
      exception: it doesn't need this, because its caller must already hold
      a reference.
      
      The patch adds a new trivial helper, mmget_not_zero(); it can gain
      more users.
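
      The helper is small enough to quote in full; it wraps the mm_users
      increment so callers can detect an already-dead mm:

        static inline bool mmget_not_zero(struct mm_struct *mm)
        {
        	return atomic_inc_not_zero(&mm->mm_users);
        }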
      
      Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.com
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2005e3f
    • R
      mm/memblock.c: remove unnecessary always-true comparison · cd33a76b
      Richard Leitner 提交于
      Comparing a u64 variable with >= 0 is always true, so the comparison
      can be removed.  This issue was detected using the -Wtype-limits gcc
      flag.
      
      This patch fixes following type-limits warning:
      
        mm/memblock.c: In function `__next_reserved_mem_region':
        mm/memblock.c:843:11: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
          if (*idx >= 0 && *idx < type->cnt) {
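
      Since *idx is a u64, the fix simply drops the vacuous half of the
      condition:

        -	if (*idx >= 0 && *idx < type->cnt) {
        +	if (*idx < type->cnt) {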
      
      Link: http://lkml.kernel.org/r/20160510103625.3a7f8f32@g0hl1n.net
      Signed-off-by: NRichard Leitner <dev@g0hl1n.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd33a76b
    • V
      z3fold: the 3-fold allocator for compressed pages · 9a001fc1
      Vitaly Wool 提交于
      This patch introduces z3fold, a special purpose allocator for storing
      compressed pages.  It is designed to store up to three compressed pages
      per physical page.  It is a zbud derivative which allows for a higher
      compression ratio while keeping the simplicity and determinism of its
      predecessor.
      
      This patch comes as a follow-up to the discussions at the Embedded
      Linux Conference in San Diego related to the talk [1].  The outcome of
      these discussions was that it would be good to have a compressed page
      allocator as stable and deterministic as zbud, but with a higher
      compression ratio.
      
      To keep the determinism and simplicity, z3fold, just like zbud, always
      stores an integral number of compressed pages per page, but it can store
      up to 3 pages, unlike zbud, which can store at most 2.  Therefore the
      compression ratio goes to around 2.6x while zbud's is around 1.7x.
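
      Conceptually, each z3fold page carries a small header and up to three
      object slots; a sketch of the per-page bookkeeping (field names follow
      the driver's idea, not necessarily its exact layout):

        struct z3fold_header {
        	struct list_head buddy;		/* freelist linkage */
        	unsigned short first_chunks;	/* size of the first object */
        	unsigned short middle_chunks;	/* size of the middle object */
        	unsigned short last_chunks;	/* size of the last object */
        	unsigned short start_middle;	/* where the middle one begins */
        };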
      
      The patch is based on the latest linux.git tree.
      
      This version has been updated after testing on various simulators (e.g.
      ARM Versatile Express, MIPS Malta, x86_64/Haswell) and is based on
      comments from Dan Streetman [3].
      
      [1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
      [2] https://lkml.org/lkml/2016/4/21/799
      [3] https://lkml.org/lkml/2016/5/4/852
      
      Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com
      Signed-off-by: NVitaly Wool <vitalywool@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a001fc1