1. 27 May 2016, 1 commit
  2. 25 May 2016, 1 commit
    • sched/core: Fix remote wakeups · b7e7ade3
      Committed by Peter Zijlstra
      Commit:
      
        b5179ac7 ("sched/fair: Prepare to fix fairness problems on migration")
      
      ... introduced a bug: Mike Galbraith found a performance
      regression it caused, while Paul E. McKenney reported lost
      wakeups and bisected the problem to this commit.
      
      The reason is that I misread ttwu_queue() such that I assumed any
      wakeup that got a remote queue must have had the task migrated.
      
      Since this is not so, we need to transfer this information between
      queueing the wakeup and actually doing the wakeup. Use a new
      task_struct::sched_flag for this; we already write to
      sched_contributes_to_load in the wakeup path, so this is a hot and
      already-modified cacheline.
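      A minimal sketch of the idea, assuming the field and flag names of
      the upstream fix (treat them as illustrative): the queueing side
      records whether the task migrated, and the wakeup side reconstructs
      the wake flag instead of assuming it.
      
      	/* task_struct: a flag bit next to sched_contributes_to_load */
      	unsigned sched_contributes_to_load:1;
      	unsigned sched_remote_wakeup:1;
      
      	/* queueing side (ttwu_queue_remote): remember the migration */
      	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
      
      	/* wakeup side (sched_ttwu_pending): rebuild the wake flag */
      	ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0);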
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Pavan Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: byungchul.park@lge.com
      Fixes: b5179ac7 ("sched/fair: Prepare to fix fairness problems on migration")
      Link: http://lkml.kernel.org/r/20160523091907.GD15728@worktop.ger.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 24 May 2016, 12 commits
    • uprobes: wait for mmap_sem for write killable · 598fdc1d
      Committed by Michal Hocko
      xol_add_vma needs mmap_sem for write.  If the waiting task gets killed
      by the oom killer it would block oom_reaper from asynchronous address
      space reclaim and reduce the chances of timely OOM resolving.  Wait for
      the lock in the killable mode and return with EINTR if the task got
      killed while waiting.
      
      Do not warn in dup_xol_work if __create_xol_area failed due to a
      pending fatal signal: such a warning is usually read as a kernel
      issue, and this failure is not one.
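      A minimal sketch of the killable-wait pattern, assuming the 4.6-era
      mmap_sem API; the same pattern is applied by the two commits below:
      
      	static int xol_add_vma_sketch(struct mm_struct *mm)
      	{
      		if (down_write_killable(&mm->mmap_sem))
      			return -EINTR;	/* killed while waiting: let oom_reaper proceed */
      
      		/* ... install the XOL vma here ... */
      
      		up_write(&mm->mmap_sem);
      		return 0;
      	}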
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable · 17b0573d
      Committed by Michal Hocko
      PR_SET_THP_DISABLE requires mmap_sem for write.  If the waiting task
      gets killed by the oom killer it would block oom_reaper from
      asynchronous address space reclaim and reduce the chances of timely OOM
      resolving.  Wait for the lock in the killable mode and return with EINTR
      if the task got killed while waiting.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Alex Thorlton <athorlton@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, fork: make dup_mmap wait for mmap_sem for write killable · 7c051267
      Committed by Michal Hocko
      dup_mmap needs to lock current's mm mmap_sem for write.  If the waiting
      task gets killed by the oom killer it would block oom_reaper from
      asynchronous address space reclaim and reduce the chances of timely OOM
      resolving.  Wait for the lock in the killable mode and return with EINTR
      if the task got killed while waiting.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • s390/kexec: consolidate crash_map/unmap_reserved_pages() and arch_kexec_protect(unprotect)_crashkres() · 7a0058ec
      Committed by Xunlei Pang
      
      Commit 3f625002581b ("kexec: introduce a protection mechanism for the
      crashkernel reserved memory") added a mechanism for protecting the
      crash kernel reserved memory that is similar to the previous
      crash_map/unmap_reserved_pages() implementation; the new one is more
      generic in name and cleaner in code (besides, some architectures may
      not be allowed to unmap the pgtable).
      
      Therefore, this patch consolidates them and uses the new
      arch_kexec_protect(unprotect)_crashkres() to replace the former
      crash_map/unmap_reserved_pages(), which by now is only used by s390.
      
      The consolidation work needs the crash memory to be mapped initially;
      this is done in machine_kdump_pm_init(), which runs after
      reserve_crashkernel().  Once the kdump kernel is loaded, the new
      arch_kexec_protect_crashkres() implemented for s390 will actually
      unmap the pgtable like before.
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Minfei Huang <mhuang@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kexec: do a cleanup for function kexec_load · 0eea0867
      Committed by Minfei Huang
      There is a lot of work to be done in the function kexec_load, not
      only allocating structs and loading the initramfs, but also various
      miscellaneous tasks.
      
      To make this clearer, wrap a new function do_kexec_load that
      allocates the structs and loads the initramfs, and keep the
      preparatory work in kexec_load.
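      A rough sketch of the split (checks abbreviated): the syscall keeps
      the preparatory checks, and do_kexec_load() does the allocation and
      loading.
      
      	SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
      			struct kexec_segment __user *, segments, unsigned long, flags)
      	{
      		/* pre-work: permission and flag sanity checks */
      		if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
      			return -EPERM;
      		if (flags != (flags & KEXEC_FLAGS))
      			return -EINVAL;
      		if (nr_segments > KEXEC_SEGMENT_MAX)
      			return -EINVAL;
      
      		/* allocation and image loading move into the helper */
      		return do_kexec_load(entry, nr_segments, segments, flags);
      	}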
      Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Xunlei Pang <xlpang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kexec: make a pair of map/unmap reserved pages in error path · 917a3560
      Committed by Minfei Huang
      On some architectures, kexec needs to map the reserved pages and then
      use them when starting the kdump service.
      
      kexec may return directly, without unmapping the reserved pages, if
      it fails while starting the service.  To fix this, make a matched
      pair of map/unmap calls on the reserved pages in both the normal path
      and the error path.
      
      This patch only affects s390.  Other architectures don't implement
      the crash_unmap_reserved_pages and crash_map_reserved_pages
      interface.
      
      It isn't an urgent patch.  The kernel works without any risk even
      though the reserved pages are not unmapped before returning in the
      error path.
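      A hedged sketch of the pairing (do_load() is a placeholder name):
      the unmap now runs whether or not loading succeeded.
      
      	crash_map_reserved_pages();
      	result = do_load(image);
      	crash_unmap_reserved_pages();	/* now also runs on the failure path */
      	if (result)
      		return result;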
      Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Xunlei Pang <xlpang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kexec: introduce a protection mechanism for the crashkernel reserved memory · 9b492cf5
      Committed by Xunlei Pang
      When some kernel (module) code path stamps on the crash reserved
      memory (which is already mapped by the kernel) into which the second
      kernel's data has been loaded, the kdump kernel will probably fail to
      boot when a panic happens (or even when it doesn't), leaving the
      culprit at large.  This is unacceptable.
      
      The patch introduces a mechanism for detecting such cases:
      
      1) After each crash kexec loading, it simply marks the reserved memory
         regions readonly, since we no longer access them after that.  When
         someone stamps on the region, the first kernel will panic and
         trigger the kdump.  The weak arch_kexec_protect_crashkres() is
         introduced to do the actual protection.
      
      2) To allow multiple loading, once 1) was done we also need to remark
         the reserved memory to readwrite each time a system call related to
         kdump is made.  The weak arch_kexec_unprotect_crashkres() is
         introduced to do the actual unprotection.
      
      The architecture can make its specific implementation by overriding
      arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres().
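      A minimal sketch of the weak hooks: the generic kernel provides
      empty defaults, and an architecture overrides them.
      
      	void __weak arch_kexec_protect_crashkres(void) {}
      	void __weak arch_kexec_unprotect_crashkres(void) {}
      
      	/* callers, roughly: */
      	arch_kexec_unprotect_crashkres();	/* before (re)loading the crash image */
      	/* ... load segments into the reserved region ... */
      	arch_kexec_protect_crashkres();		/* mark it read-only again */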
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Minfei Huang <mhuang@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/fork.c: allocate idle task for a CPU always on its local node · 725fc629
      Committed by Andi Kleen
      Linux preallocates the task structs of the idle tasks for all possible
      CPUs.  This currently means they all end up on node 0.  This also
      implies that the cache line used by MWAIT, which is around the flags
      field in the task struct, is located on node 0.
      
      We see a noticeable performance improvement on Knights Landing CPUs when
      the cache lines used for MWAIT are located in the local nodes of the
      CPUs using them.  I would expect this to give a (likely slight)
      improvement on other systems too.
      
      The patch implements placing the idle task on the node of its CPU, by
      passing the right target node to copy_process().
      
      [akpm@linux-foundation.org: use NUMA_NO_NODE, not a bare -1]
      Link: http://lkml.kernel.org/r/1463492694-15833-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/signal.c: convert printk(KERN_<LEVEL> ...) to pr_<level>(...) · 747800ef
      Committed by Wang Xiaoqiang
      Use pr_<level>(...) instead of printk(KERN_<LEVEL> ...).
      Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait: allow sys_waitid() to accept __WNOTHREAD/__WCLONE/__WALL · 91c4e8ea
      Committed by Oleg Nesterov
      I see no reason why waitid() can't support the other Linux-specific
      flags allowed in sys_wait4().
      
      In particular, this change can help if we reconsider the previous
      change ("wait/ptrace: assume __WALL if the child is traced"), which
      adds the "automagical" __WALL for the debugger.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Cc: Pedro Alves <palves@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: <syzkaller@googlegroups.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait/ptrace: assume __WALL if the child is traced · bf959931
      Committed by Oleg Nesterov
      The following program (a simplified version of one generated by
      syzkaller)
      
      	#include <pthread.h>
      	#include <unistd.h>
      	#include <sys/ptrace.h>
      	#include <stdio.h>
      	#include <signal.h>
      
      	void *thread_func(void *arg)
      	{
      		ptrace(PTRACE_TRACEME, 0,0,0);
      		return 0;
      	}
      
      	int main(void)
      	{
      		pthread_t thread;
      
      		if (fork())
      			return 0;
      
      		while (getppid() != 1)
      			;
      
      		pthread_create(&thread, NULL, thread_func, NULL);
      		pthread_join(thread, NULL);
      		return 0;
      	}
      
      creates an unreapable zombie if /sbin/init doesn't use __WALL.
      
      This is not a kernel bug, at least in a sense that everything works as
      expected: debugger should reap a traced sub-thread before it can reap the
      leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.
      
      Unfortunately, it seems that /sbin/init in most (all?) distributions
      doesn't use it, and we have to change the kernel to avoid the problem.
      Note also that most inits use sys_waitid(), which doesn't allow
      __WALL, so the necessary user-space fix is not that trivial.
      
      This patch just adds a "ptrace" check to eligible_child().  To some
      degree this matches the "tsk->ptrace" check in exit_notify():
      ->exit_signal is mostly ignored when the tracee reports to the
      debugger.  Or compare WSTOPPED: the tracer doesn't need to set that
      flag to wait for a stopped tracee.
      
      This obviously is a user-visible change: __WCLONE and __WALL no
      longer have any meaning for a debugger.  And I can only hope that
      this won't break something, but at least strace/gdb won't suffer.
      
      We could make a more conservative change.  Say, we could take __WCLONE
      into account, or !thread_group_leader().  But it would be nice not to
      complicate these historical/confusing checks.
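      A minimal sketch of the added check, mirroring the upstream fix:
      
      	static int eligible_child(struct wait_opts *wo, bool ptrace,
      				  struct task_struct *p)
      	{
      		if (!eligible_pid(wo, p))
      			return 0;
      
      		/* a traced child is always eligible, as if __WALL were set */
      		if (ptrace || (wo->wo_flags & __WALL))
      			return 1;
      
      		/* otherwise __WCLONE selects children not reporting SIGCHLD */
      		if ((p->exit_signal != SIGCHLD) ^ !!(wo->wo_flags & __WCLONE))
      			return 0;
      
      		return 1;
      	}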
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Cc: Pedro Alves <palves@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: <syzkaller@googlegroups.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ELF/MIPS build fix · f43edca7
      Committed by Ralf Baechle
      CONFIG_MIPS32_N32=y but CONFIG_BINFMT_ELF disabled results in the
      following linker errors:
      
        arch/mips/built-in.o: In function `elf_core_dump':
        binfmt_elfn32.c:(.text+0x23dbc): undefined reference to `elf_core_extra_phdrs'
        binfmt_elfn32.c:(.text+0x246e4): undefined reference to `elf_core_extra_data_size'
        binfmt_elfn32.c:(.text+0x248d0): undefined reference to `elf_core_write_extra_phdrs'
        binfmt_elfn32.c:(.text+0x24ac4): undefined reference to `elf_core_write_extra_data'
      
      CONFIG_MIPS32_O32=y but CONFIG_BINFMT_ELF disabled results in the following
      linker errors:
      
        arch/mips/built-in.o: In function `elf_core_dump':
        binfmt_elfo32.c:(.text+0x28a04): undefined reference to `elf_core_extra_phdrs'
        binfmt_elfo32.c:(.text+0x29330): undefined reference to `elf_core_extra_data_size'
        binfmt_elfo32.c:(.text+0x2951c): undefined reference to `elf_core_write_extra_phdrs'
        binfmt_elfo32.c:(.text+0x29710): undefined reference to `elf_core_write_extra_data'
      
      This is because binfmt_elfn32 and binfmt_elfo32 use symbols from
      elfcore, but for these configurations elfcore is not built.
      
      Fix this by making elfcore selectable through a separate config
      symbol which, unlike the current mechanism, can also be used from
      directories other than kernel/; then have each flavor of ELF that
      relies on elfcore.o select it in Kconfig, including CONFIG_MIPS32_N32
      and CONFIG_MIPS32_O32, which fixes this issue.
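      A hedged sketch of the mechanism in Kconfig/Makefile form (the
      ELFCORE symbol name follows the fix; exact placement is
      illustrative):
      
      	config ELFCORE
      		bool
      
      	config BINFMT_ELF
      		select ELFCORE
      
      	config MIPS32_N32
      		select ELFCORE
      
      	# kernel/Makefile then builds the object from the symbol:
      	# obj-$(CONFIG_ELFCORE) += elfcore.o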
      
      Link: http://lkml.kernel.org/r/20160520141705.GA1913@linux-mips.org
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Reviewed-by: James Hogan <james.hogan@imgtec.com>
      Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 23 May 2016, 1 commit
    • x86: remove more uaccess_32.h complexity · bd28b145
      Committed by Linus Torvalds
      I'm looking at trying to possibly merge the 32-bit and 64-bit versions
      of the x86 uaccess.h implementation, but first this needs to be cleaned
      up.
      
      For example, the 32-bit version of "__copy_from_user_inatomic()" is
      mostly the special cases for the constant size, and it's actually almost
      never relevant.  Most users aren't actually using a constant size
      anyway, and the few cases that do small constant copies are better off
      just using __get_user() instead.
      
      So get rid of the unnecessary complexity.
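      Illustrative only (uptr is a hypothetical user pointer): a small
      constant-size copy reads more clearly with __get_user() than with
      the inatomic copy helper.
      
      	u32 val;
      
      	if (__get_user(val, (u32 __user *)uptr))
      		return -EFAULT;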
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 21 May 2016, 14 commits
    • radix-tree: introduce radix_tree_empty · e9256efc
      Committed by Matthew Wilcox
      Commit e6145236 ("radix_tree: add support for multi-order entries")
      left the impression that the support for multiorder radix tree entries
      was functional.  As soon as Ross tried to use it, it became apparent
      that my testing was completely inadequate, and it didn't even work a
      little bit for orders that were not a multiple of shift.
      
      This series of patches is the result of about 6 weeks of redesign,
      reimplementation, testing, arguing and hair-pulling.  The great news is
      that the test-suite is now far better than it was.  That's reflected in
      the diffstat for the test-suite alone:
      
       12 files changed, 436 insertions(+), 28 deletions(-)
      
      The highlight for users of the tree is that the restriction on the order
      of inserted entries being >= RADIX_TREE_MAP_SHIFT is now gone; the radix
      tree now supports any order between 0 and 64.
      
      For those who are interested in how the tree works, patch 9 is probably
      the most interesting one as it introduces the new machinery for handling
      sibling entries.
      
      I've tried to be fair in attributing authorship to the person who
      contributed the majority of the code in each patch; Ross has been an
      invaluable partner in the development of this support and it's fair to
      say that each of us has code in every commit.
      
      I should also express my appreciation of the 0day testing.  It prompted
      me that I was bloating the tinyconfig in an unacceptable way, and it
      bisected to a commit which contained a rather nasty memory-corruption
      bug.
      
      This patch (of 29):
      
      The irqdomain code was checking for 0 or 1 entries, not 0 entries as
      the comment said.  Introduce a new helper that actually checks for an
      empty tree.
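      The helper itself is tiny; a sketch matching the upstream definition:
      
      	static inline bool radix_tree_empty(struct radix_tree_root *root)
      	{
      		return root->rnode == NULL;
      	}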
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/sysctl_binary.c: use generic UUID library · ede9c277
      Committed by Andy Shevchenko
      The UUID library provides the uuid_be type and the uuid_be_to_bin()
      function.  This substitutes generic library calls for the open-coded
      variant.
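      A hedged sketch of the replacement call (uuid_be_to_bin() returns 0
      on success; the error code here is illustrative):
      
      	uuid_be uuid;
      
      	if (uuid_be_to_bin(str, &uuid))
      		return -EIO;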
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
      Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk/nmi: flush NMI messages on the system panic · cf9b1106
      Committed by Petr Mladek
      In NMI context, printk() messages are stored into per-CPU buffers to
      avoid a possible deadlock.  They are normally flushed to the main ring
      buffer via an IRQ work.  But the work is never called when the system
      calls panic() in the very same NMI handler.
      
      This patch tries to flush the NMI buffers before the crash dump is
      generated.  In this case it does not risk a double release and bails
      out when logbuf_lock is already taken.  The aim is to get the messages
      into the main ring buffer when possible, which makes them better
      accessible in the vmcore.
      
      The patch then tries to flush the buffers a second time, once the
      other CPUs are down.  Here it can be more aggressive and reset
      logbuf_lock.  The aim is to make the messages available for the
      subsequent kmsg_dump() and console_flush_on_panic() calls.
      
      The patch causes vprintk_emit() to be called even in NMI context
      again.  But it is done via printk_deferred() so that the console
      handling is skipped.  Consoles use internal locks and we could not
      easily prevent a deadlock there.  They are explicitly called later,
      when the crash dump is not generated; see console_flush_on_panic().
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jiri Kosina <jkosina@suse.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk/nmi: increase the size of NMI buffer and make it configurable · 427934b8
      Committed by Petr Mladek
      Testing has shown that the backtrace sometimes does not fit into the 4kB
      temporary buffer that is used in NMI context.  The warnings are gone
      when I double the temporary buffer size.
      
      This patch doubles the buffer size and makes it configurable.
      
      Note that this problem existed even in the x86-specific implementation
      that was added by the commit a9edc880 ("x86/nmi: Perform a safe NMI
      stack trace on all CPUs").  Nobody noticed it because it did not print
      any warnings.
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Jiri Kosina <jkosina@suse.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk/nmi: warn when some message has been lost in NMI context · b522deab
      Committed by Petr Mladek
      We cannot resize the temporary buffer in NMI context, so let's warn
      if a message is lost.
      
      This is rather theoretical.  printk() should not be used in NMI; the
      only sensible use is when we want to print a backtrace from all CPUs,
      and the current buffer should be enough for that purpose.
      
      [akpm@linux-foundation.org: whitespace fixlet]
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Jiri Kosina <jkosina@suse.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk/nmi: generic solution for safe printk in NMI · 42a0bb3f
      Committed by Petr Mladek
      printk() takes some locks and cannot be used safely in NMI context.
      
      The chance of a deadlock is real, especially when printing stacks
      from all CPUs.  This particular problem has been addressed on x86 by
      the commit a9edc880 ("x86/nmi: Perform a safe NMI stack trace on all
      CPUs").
      
      The patchset brings two big advantages.  First, it makes the NMI
      backtraces safe on all architectures for free.  Second, it makes all
      NMI messages almost safe on all architectures (the temporary buffer
      is limited, so we should still keep the number of messages in NMI
      context to a minimum).
      
      Note that there already are several messages printed in NMI context:
      WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
      handlers.  These are not easy to avoid.
      
      This patch reuses most of the code and makes it generic.  It is useful
      for all messages and architectures that support NMI.
      
      The alternative printk_func is set when entering and reset when
      leaving NMI context.  It queues IRQ work to copy the messages into
      the main ring buffer in a safe context.
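      A minimal sketch of the switch-over (function names follow the
      patch; the per-CPU buffer machinery is omitted):
      
      	void printk_nmi_enter(void)
      	{
      		this_cpu_write(printk_func, vprintk_nmi);	/* store into per-CPU buffer */
      	}
      
      	void printk_nmi_exit(void)
      	{
      		this_cpu_write(printk_func, vprintk_default);	/* back to normal printk */
      	}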
      
      __printk_nmi_flush() copies all available messages and resets the
      buffer.  We can then use simple cmpxchg operations to synchronize
      with writers; a spinlock is used to synchronize with other flushers.
      
      We no longer use seq_buf because it depends on an external lock.  It
      would be hard to make all its supported operations safe for lockless
      use, and it would be confusing and error-prone to make only some of
      them safe.
      
      The code is put into a separate printk/nmi.c, as suggested by Steven
      Rostedt.  It needs a per-CPU buffer and is compiled only on
      architectures that call nmi_enter().  This is achieved by the new
      HAVE_NMI Kconfig flag.
      
      The exceptions are the MN10300 and Xtensa architectures; we need to
      clean up NMI handling there first.  Let's do that separately.
      
      The patch is heavily based on the draft from Peter Zijlstra, see
      
        https://lkml.org/lkml/2015/6/10/327
      
      [arnd@arndb.de: printk-nmi: use %zu format string for size_t]
      [akpm@linux-foundation.org: min_t->min - all types are size_t here]
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>	[arm part]
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Jiri Kosina <jkosina@suse.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fork: free thread in copy_process on failure · 0740aa5f
      Committed by Jiri Slaby
      When using this program (as root):
      
      	#include <err.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      
      	#include <sys/io.h>
      	#include <sys/types.h>
      	#include <sys/wait.h>
      
      	#define ITER 1000
      	#define FORKERS 15
      	#define THREADS (6000/FORKERS) // 1850 is proc max
      
      	static void fork_100_wait()
      	{
      		unsigned a, to_wait = 0;
      
      		printf("\t%d forking %d\n", THREADS, getpid());
      
      		for (a = 0; a < THREADS; a++) {
      			switch (fork()) {
      			case 0:
      				usleep(1000);
      				exit(0);
      				break;
      			case -1:
      				break;
      			default:
      				to_wait++;
      				break;
      			}
      		}
      
      		printf("\t%d forked from %d, waiting for %d\n", THREADS, getpid(),
      				to_wait);
      
      		for (a = 0; a < to_wait; a++)
      			wait(NULL);
      
      		printf("\t%d waited from %d\n", THREADS, getpid());
      	}
      
      	static void run_forkers()
      	{
      		pid_t forkers[FORKERS];
      		unsigned a;
      
      		for (a = 0; a < FORKERS; a++) {
      			switch ((forkers[a] = fork())) {
      			case 0:
      				fork_100_wait();
      				exit(0);
      				break;
      			case -1:
      				err(1, "DIE fork of %d'th forker", a);
      				break;
      			default:
      				break;
      			}
      		}
      
      		for (a = 0; a < FORKERS; a++)
      			waitpid(forkers[a], NULL, 0);
      	}
      
      	int main()
      	{
      		unsigned a;
      		int ret;
      
      		ret = ioperm(10, 20, 0);
      		if (ret < 0)
      			err(1, "ioperm");
      
      		for (a = 0; a < ITER; a++)
      			run_forkers();
      
      		return 0;
      	}
      
      kmemleak reports many occurrences of this leak:
      unreferenced object 0xffff8805917c8000 (size 8192):
        comm "fork-leak", pid 2932, jiffies 4295354292 (age 1871.028s)
        hex dump (first 32 bytes):
          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
        backtrace:
          [<ffffffff814cfbf5>] kmemdup+0x25/0x50
          [<ffffffff8103ab43>] copy_thread_tls+0x6c3/0x9a0
          [<ffffffff81150174>] copy_process+0x1a84/0x5790
          [<ffffffff811dc375>] wake_up_new_task+0x2d5/0x6f0
          [<ffffffff8115411d>] _do_fork+0x12d/0x820
      ...
      
      This is due to the leak of memory items which should have been freed
      in arch/x86/kernel/process.c:exit_thread().
      
      Make sure the memory is freed when fork fails later in copy_process.
      This is done by calling exit_thread with the thread to kill.
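      A sketch of the fix: the error path in copy_process() now tears down
      the per-architecture state that copy_thread_tls() allocated.
      
      	bad_fork_cleanup_thread:
      		exit_thread(p);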
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exit_thread: accept a task parameter to be exited · e6464694
      Committed by Jiri Slaby
      We need to call exit_thread from copy_process in a failure path, so
      make it accept a task_struct as a parameter.
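      The signature change, in sketch form:
      
      	void exit_thread(struct task_struct *tsk);	/* was: void exit_thread(void) */
      
      	exit_thread(current);	/* the existing caller in do_exit() */
      	exit_thread(p);		/* the new caller in copy_process()'s error path */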
      
      [v2]
      * s390: exit_thread_runtime_instr doesn't make sense to be called for
        non-current tasks.
      * arm: fix the comment in vfp_thread_copy
      * change 'me' to 'tsk' for task_struct
      * now we can change only archs that actually have exit_thread
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom_reaper: do not mmput synchronously from the oom reaper context · ec8d7c14
      Committed by Michal Hocko
      Tetsuo has properly noted that the mmput slow path might get blocked
      waiting for another party (e.g.  exit_aio waits for an IO).  If that
      happens, the oom_reaper would be put out of the way and would not be
      able to process the next oom victim.  We should strive to make this
      context as reliable and as independent of other subsystems as
      possible.
      
      Introduce mmput_async, which will perform the slow path from an async
      (WQ) context.  This will delay the operation, but that shouldn't be a
      problem, because in most cases the oom_reaper has already reclaimed
      the victim's address space as much as possible and the remaining
      context shouldn't pin too much memory anymore.  The only exception is
      when the mmap_sem trylock has failed, which shouldn't happen too
      often.
      
      The issue is only theoretical, but not impossible.
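      A sketch of mmput_async(), assuming the upstream helper layout: the
      last reference defers __mmput() to a workqueue instead of running it
      in the oom_reaper's context.
      
      	void mmput_async(struct mm_struct *mm)
      	{
      		if (atomic_dec_and_test(&mm->mm_users)) {
      			INIT_WORK(&mm->async_put_work, mmput_async_fn);
      			schedule_work(&mm->async_put_work);
      		}
      	}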
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bpf: teach verifier to recognize imm += ptr pattern · 1b9b69ec
      Committed by Alexei Starovoitov
      Humans don't write C code like:
        u8 *ptr = skb->data;
        int imm = 4;
        imm += ptr;
      but from the llvm backend's point of view 'imm' and 'ptr' are
      registers, and imm += ptr may be preferred over ptr += imm depending
      on which register value will be used further in the code, while the
      verifier could only recognize ptr += imm.  That caused small,
      unrelated changes in the C code of a bpf program to trigger rejection
      by the verifier.  Therefore teach the verifier to recognize both
      ptr += imm and imm += ptr.
      For example, when R6=pkt(id=0,off=0,r=62) and R7=imm22, after the
      r7 += r6 instruction the state will be
      R6=pkt(id=0,off=0,r=62) R7=pkt(id=0,off=22,r=62).
      
      Fixes: 969bf05e ("bpf: direct packet access")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: support decreasing order in direct packet access · d91b28ed
      Committed by Alexei Starovoitov
      When packet headers are accessed in 'decreasing' order (e.g. the TCP
      port may be fetched before the program reads the IP src), llvm may
      generate the following code:
      [...]                // R7=pkt(id=0,off=22,r=70)
      r2 = *(u32 *)(r7 +0) // good access
      [...]
      r7 += 40             // R7=pkt(id=0,off=62,r=70)
      r8 = *(u32 *)(r7 +0) // good access
      [...]
      r1 = *(u32 *)(r7 -20) // this one will fail though it's within a safe range
                            // it's doing *(u32*)(skb->data + 42)
      Fix the verifier to recognize this code pattern.
      
      It also turned out that the 'off > range' condition is not a verifier
      bug.  It indicates a buggy program that may do something like:
      if (ptr + 50 > data_end)
        return 0;
      ptr += 60;
      *(u32*)ptr;
      In such a case, emit the
      "invalid access to packet, off=0 size=4, R1(id=0,off=60,r=50)" error
      message, so all the information is available for the program author
      to fix the program.
      
      Fixes: 969bf05e ("bpf: direct packet access")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Use mount_nodev not mount_ns to mount the bpf filesystem · e27f4a94
      Committed by Eric W. Biederman
      While reviewing the filesystems that set FS_USERNS_MOUNT I spotted
      the bpf filesystem.  Looking at the code I saw a broken usage of
      mount_ns with current->nsproxy->mnt_ns.  As the code does not acquire
      a reference to the mount namespace, it cannot possibly be correct to
      store the mount namespace on the superblock as it does.
      
      Replace mount_ns with mount_nodev so that each mount of the bpf
      filesystem returns a distinct instance, and the code is not buggy.
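      A sketch of the change in the filesystem's mount callback (helper
      names follow the bpf filesystem code):
      
      	static struct dentry *bpf_mount(struct file_system_type *type, int flags,
      					const char *dev_name, void *data)
      	{
      		/* was: mount_ns(type, flags, current->nsproxy->mnt_ns, bpf_fill_super) */
      		return mount_nodev(type, flags, data, bpf_fill_super);
      	}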
      
      In discussion with Hannes Frederic Sowa it was reported that the use
      of mount_ns was an attempt to have one bpf instance per mount
      namespace, in an attempt to keep resources that pin resources from
      hiding.  That intent simply does not work; the vfs is not built to
      allow that kind of behavior.  Which means that the bpf filesystem
      really is buggy both semantically and in its implementation, as it
      does not, and cannot, implement the original intent.
      
      This change is userspace visible, but my experience with similar
      filesystems leads me to believe nothing will break with a model where
      each mount of the bpf filesystem is distinct from all others.
      
      Fixes: b2197755 ("bpf: add support for persistent maps/progs")
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: rather use get_random_int for randomizations · b7552e1b
      Committed by Daniel Borkmann
      Start address randomization and blinding in BPF currently use
      prandom_u32().  prandom_u32() values are not exposed to unprivileged
      user space to my knowledge, but given that other kernel facilities
      such as ASLR, stack canaries, etc. make use of the stronger
      get_random_int(), we had better make use of it here as well, since
      blinding successively requests new random values.  get_random_int()
      has minimal entropy pool depletion; it is not cryptographically
      secure, but it doesn't need to be for our use cases here.
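      Illustrative only (not the exact upstream code; proglen is an
      assumed program-size variable): how such a random start offset is
      typically picked inside an allocated page range.
      
      	unsigned int hole, start;
      
      	hole  = PAGE_SIZE - (proglen & ~PAGE_MASK);
      	start = get_random_int() % hole;	/* was: prandom_u32() % hole */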
      Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ftrace: Don't disable irqs when taking the tasklist_lock read_lock · 6112a300
      Committed by Soumya PN
      In ftrace.c, the function alloc_retstack_tasklist() (which is invoked
      when function_graph tracing is on) holds the tasklist_lock as a
      reader, with irqs disabled, while iterating through a list of
      threads.  The tasklist_lock is never write-locked in interrupt
      context, so it is safe not to disable interrupts for the duration of
      the read_lock in this block, which can be significant given that the
      code iterates through all threads.  Hence change the code to call
      read_lock() and read_unlock() instead of read_lock_irqsave() and
      read_unlock_irqrestore().
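      The gist of the change, in sketch form:
      
      	read_lock(&tasklist_lock);	/* was: read_lock_irqsave(&tasklist_lock, flags) */
      	do_each_thread(g, t) {
      		/* ... allocate a return stack for each thread ... */
      	} while_each_thread(g, t);
      	read_unlock(&tasklist_lock);	/* was: read_unlock_irqrestore(...) */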
      
      A similar change was made in commits 8063e41d ("tracing: Change
      syscall_*regfunc() to check PF_KTHREAD and use
      for_each_process_thread()") and 3472eaa1 ("sched:
      normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use
      task_rq_lock()").
      
      Link: http://lkml.kernel.org/r/1463500874-77480-1-git-send-email-soumya.p.n@hpe.com
      Signed-off-by: Soumya PN <soumya.p.n@hpe.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  6. 20 May 2016, 11 commits
    • cpuset: use static key better and convert to new API · 002f2906
      Committed by Vlastimil Babka
      An important function for cpusets is cpuset_node_allowed(), which
      relies on the fact that if there is only the single root CPU set, the
      allocation must be trivially allowed.  But the check
      "nr_cpusets() <= 1" doesn't use the cpusets_enabled_key static key
      the way static keys are meant to be used, where jump labels eliminate
      the branching overhead.
      
      This patch converts it so that the static key is used properly.  It
      is also switched to the new static key API, and the checking
      functions are converted to return bool instead of int.  We also
      provide a new variant, __cpuset_zone_allowed(), which expects that
      the static key check was already done and the key was enabled.  This
      is needed for get_page_from_freelist(), where we also want to avoid
      the relatively slower check when ALLOC_CPUSET is not set in
      alloc_flags.
      
      The impact on the page allocator microbenchmark is less than
      expected, but the cleanup in itself is worthwhile; a sketch of the
      conversion follows the table below.
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                             multcheck-v1r20               cpuset-v1r20
        Min      alloc-odr0-1               348.00 (  0.00%)           348.00 (  0.00%)
        Min      alloc-odr0-2               254.00 (  0.00%)           254.00 (  0.00%)
        Min      alloc-odr0-4               213.00 (  0.00%)           213.00 (  0.00%)
        Min      alloc-odr0-8               186.00 (  0.00%)           183.00 (  1.61%)
        Min      alloc-odr0-16              173.00 (  0.00%)           171.00 (  1.16%)
        Min      alloc-odr0-32              166.00 (  0.00%)           163.00 (  1.81%)
        Min      alloc-odr0-64              162.00 (  0.00%)           159.00 (  1.85%)
        Min      alloc-odr0-128             160.00 (  0.00%)           157.00 (  1.88%)
        Min      alloc-odr0-256             169.00 (  0.00%)           166.00 (  1.78%)
        Min      alloc-odr0-512             180.00 (  0.00%)           180.00 (  0.00%)
        Min      alloc-odr0-1024            188.00 (  0.00%)           187.00 (  0.53%)
        Min      alloc-odr0-2048            194.00 (  0.00%)           193.00 (  0.52%)
        Min      alloc-odr0-4096            199.00 (  0.00%)           198.00 (  0.50%)
        Min      alloc-odr0-8192            202.00 (  0.00%)           201.00 (  0.50%)
        Min      alloc-odr0-16384           203.00 (  0.00%)           202.00 (  0.49%)
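      A minimal sketch of the conversion mentioned above, using the new
      static key API:
      
      	DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);	/* was a plain struct static_key */
      
      	static inline bool cpusets_enabled(void)
      	{
      		/* a jump label: a no-op branch until a second cpuset exists */
      		return static_branch_unlikely(&cpusets_enabled_key);
      	}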
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: /proc/sys/vm/stat_refresh to force vmstat update · 52b6f46b
      Committed by Hugh Dickins
      Provide /proc/sys/vm/stat_refresh to force an immediate update of
      per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
      before checking counts when testing.  Originally added to work around a
      bug which left counts stranded indefinitely on a cpu going idle (an
      inaccuracy magnified when small below-batch numbers represent "huge"
      amounts of memory), but I believe that bug is now fixed: nonetheless,
      this is still a useful knob.
      
      Its schedule_on_each_cpu() is probably too expensive just to fold into
      reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
      Allow a write or a read to do the same: nothing to read, but "grep -h
      Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient.  Oh, and
      since global_page_state() itself is careful to disguise any underflow as
      0, hack in an "Invalid argument" and pr_warn() if a counter is negative
      after the refresh - this helped to fix a misaccounting of
      NR_ISOLATED_FILE in my migration code.
      
      But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
      often go negative some of the time.  I have not yet worked out why, but
      have no evidence that it's actually harmful.  Punt for the moment by
      just ignoring the anomaly on those.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/nodemask.h: create next_node_in() helper · 0edaf86c
      Committed by Andrew Morton
      Lots of code does
      
      	node = next_node(node, XXX);
      	if (node == MAX_NUMNODES)
      		node = first_node(XXX);
      
      so create next_node_in() to do this and use it in various places.
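      A sketch of the helper matching the pattern above (the upstream
      version is split into a macro plus an out-of-line __next_node_in):
      
      	static inline int next_node_in(int node, const nodemask_t *srcp)
      	{
      		int ret = next_node(node, *srcp);
      
      		if (ret == MAX_NUMNODES)
      			ret = first_node(*srcp);
      		return ret;
      	}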
      
      [mhocko@suse.com: use next_node_in() helper]
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Hui Zhu <zhuhui@xiaomi.com>
      Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename _count, field of the struct page, to _refcount · 0139aa7b
      Committed by Joonsoo Kim
      Many developers already know that the reference count field of
      struct page is _count and that it is an atomic type.  They may try
      to handle it directly, and this could break the purpose of the page
      reference count tracepoint.  To prevent direct _count modification,
      this patch renames it to _refcount and adds a warning message in the
      code.  After that, a developer who needs to handle the reference
      count will find that the field should not be accessed directly.
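      Illustrative only: after the rename, an accessor rather than the raw
      field is the supported way to read the count.
      
      	static inline int page_count_sketch(struct page *page)
      	{
      		return atomic_read(&page->_refcount);	/* field renamed from _count */
      	}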
      
      [akpm@linux-foundation.org: fix comments, per Vlastimil]
      [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
      [sfr@canb.auug.org.au: sync ethernet driver changes]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Sunil Goutham <sgoutham@cavium.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Manish Chopra <manish.chopra@qlogic.com>
      Cc: Yuval Mintz <yuval.mintz@qlogic.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/padata.c: hide unused functions · 19d795b6
      Committed by Arnd Bergmann
      A recent cleanup removed some exported functions that were not used
      anywhere, which in turn exposed the fact that some other functions in
      the same file are only used in some configurations.
      
      We now get a warning about them when CONFIG_HOTPLUG_CPU is disabled:
      
        kernel/padata.c:670:12: error: '__padata_remove_cpu' defined but not used [-Werror=unused-function]
         static int __padata_remove_cpu(struct padata_instance *pinst, int cpu)
                    ^~~~~~~~~~~~~~~~~~~
        kernel/padata.c:650:12: error: '__padata_add_cpu' defined but not used [-Werror=unused-function]
         static int __padata_add_cpu(struct padata_instance *pinst, int cpu)
      
      This rearranges the code so that the __padata_remove_cpu() and
      __padata_add_cpu() functions are within the #ifdef that protects the
      code that calls them.
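      The shape of the fix, sketched (function bodies elided):
      
      	#ifdef CONFIG_HOTPLUG_CPU
      	static int __padata_add_cpu(struct padata_instance *pinst, int cpu)
      	{
      		/* ... */
      		return 0;
      	}
      
      	static int __padata_remove_cpu(struct padata_instance *pinst, int cpu)
      	{
      		/* ... */
      		return 0;
      	}
      	/* the CPU-hotplug notifier that calls them lives here too */
      	#endif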
      
      [akpm@linux-foundation.org: coding-style fixes]
      Fixes: 4ba6d78c671e ("kernel/padata.c: removed unused code")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Richard Cochran <rcochran@linutronix.de>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/padata.c: removed unused code · 815613da
      Committed by Richard Cochran
      By accident I stumbled across code that has never been used.  This
      driver has EXPORT_SYMBOL functions, and the only user of the code is
      pcrypt.c, but it only uses a subset of the exported symbols.
      
      According to 'git log -G', the functions, padata_set_cpumasks,
      padata_add_cpu, and padata_remove_cpu have never been used since they
      were first introduced.  This patch removes the unused code.
      
      On one 64 bit build, with CRYPTO_PCRYPT built in, the text is more than
      4k smaller.
      
        kbuild_hp> size $KBUILD_OUTPUT/vmlinux
            text    data     bss      dec hex    filename
        10566658 4678360 1122304 16367322 f9beda vmlinux
        10561984 4678360 1122304 16362648 f9ac98 vmlinux
      
      On another config, 32 bit, the saving is about 0.5k bytes.
      
        kbuild_hp-x86> size $KBUILD_OUTPUT/vmlinux
        6012005 2409513 2785280 11206798 ab008e vmlinux
        6011491 2409513 2785280 11206284 aafe8c vmlinux
      Signed-off-by: Richard Cochran <rcochran@linutronix.de>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • debugobjects: insulate non-fixup logic related to static obj from fixup callbacks · b9fdac7f
      Committed by Du, Changbin
      When activating a static object we need to make sure that the object
      is tracked in the object tracker.  If it is a non-static object, then
      the activation is illegal.
      
      In the previous implementation, each subsystem had to take care of
      this in its fixup callbacks.  Actually we can put it into the
      debugobjects core.  Thus we save duplicated code and have *pure*
      fixup callbacks.
      
      To achieve this, a new callback "is_static_object" is introduced to
      let the type-specific code decide whether an object is static or not.
      If yes, we take it into the object tracker; otherwise we give a
      warning and invoke the fixup callback.
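      A sketch of the new descriptor hook (the signature follows the
      patch; the other fields are elided):
      
      	struct debug_obj_descr {
      		const char *name;
      		/* ... existing hooks ... */
      		bool (*is_static_object)(void *addr);	/* new: classify the object */
      	};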
      
      This change has passed the debugobjects selftest, and I also did some
      testing with all debugobjects support enabled.
      
      Finally, I have a concern about the fixups: can a fixup change an
      object which is in an incorrect state?  Because 'addr' may not point
      to any valid object if a non-static object is not tracked, changing
      such an object can overwrite someone else's memory and cause
      unexpected behaviour.  For example, timer_fixup_activate binds the
      timer to the function stub_timer.
      
      Link: http://lkml.kernel.org/r/1462576157-14539-1-git-send-email-changbin.du@intel.com
      [changbin.du@intel.com: improve code comments where invoke the new is_static_object callback]
        Link: http://lkml.kernel.org/r/1462777431-8171-1-git-send-email-changbin.du@intel.com
      Signed-off-by: Du, Changbin <changbin.du@intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Triplett <josh@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rcu: update debugobjects fixup callbacks return type · 3263d28e
      Committed by Du, Changbin
      Update the return type to use bool instead of int, corresponding to
      the change "debugobjects: make fixup functions return bool instead of
      int".
      Signed-off-by: Du, Changbin <changbin.du@intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Triplett <josh@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • timer: update debugobjects fixup callbacks return type · e3252464
      Committed by Du, Changbin
      Update the return type to use bool instead of int, corresponding to
      the change "debugobjects: make fixup functions return bool instead of
      int".
      Signed-off-by: Du, Changbin <changbin.du@intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Triplett <josh@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • workqueue: update debugobjects fixup callbacks return type · 02a982a6
      Committed by Du, Changbin
      Update the return type to use bool instead of int, corresponding to
      the change "debugobjects: make fixup functions return bool instead of
      int".
      Signed-off-by: Du, Changbin <changbin.du@intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Triplett <josh@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • time: remove timespec_add_safe() · 8e4f70e2
      Committed by Deepa Dinamani
      All references to timespec_add_safe() now use timespec64_add_safe().
      
      The plan is to replace struct timespec references with struct timespec64
      throughout the kernel as timespec is not y2038 safe.
      
      Drop timespec_add_safe() and use timespec64_add_safe() for all
      architectures.
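      Illustrative usage of the surviving helper (values chosen for the
      example): the sum is clamped instead of overflowing.
      
      	struct timespec64 a = { .tv_sec = 1, .tv_nsec = 500000000 };
      	struct timespec64 b = { .tv_sec = 2, .tv_nsec = 700000000 };
      	struct timespec64 sum = timespec64_add_safe(a, b);	/* 4.2s; clamped on overflow */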
      
      Link: http://lkml.kernel.org/r/1461947989-21926-4-git-send-email-deepa.kernel@gmail.com
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Acked-by: John Stultz <john.stultz@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>