1. 01 Jul, 2016: 1 commit
  2. 21 Jun, 2016: 6 commits
    • timer: Avoid using timespec · 7c71feb0
      Committed by Arnd Bergmann
      The tstats_show() function prints a ktime_t variable by converting
      it to struct timespec first. The algorithm is ok, but we want to
      stop using timespec in general because of the 32-bit time_t
      overflow problem.
      
      This changes the code to use struct timespec64, without any
      functional change.
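
      A rough sketch of the conversion described above (the variable and
      seq_file names here are assumptions, not the literal patch):

      	struct timespec64 period;

      	/* ktime_to_timespec64() splits a ktime_t into 64-bit seconds plus
      	 * nanoseconds, so no 32-bit time_t is involved. */
      	period = ktime_to_timespec64(time);
      	seq_printf(m, "Sample period: %lld.%09ld s\n",
      		   (long long)period.tv_sec, period.tv_nsec);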
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      7c71feb0
    • time: Avoid timespec in udelay_test · 4a19bd3d
      Committed by Arnd Bergmann
      udelay_test_single() uses ktime_get_ts() to get two timespec values
      and calculate the difference between them, while udelay_test_show()
      uses the same to printk() the current monotonic time.
      
      Both of these are y2038 safe on all machines, but we want to
      get rid of struct timespec anyway, so this converts the code to
      use ktime_get_ns() and ktime_get_ts64() respectively.
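
      A minimal sketch of the ktime_get_ns() half of that conversion
      (variable names are illustrative):

      	u64 t0, delta_ns;

      	t0 = ktime_get_ns();
      	udelay(usecs);
      	delta_ns = ktime_get_ns() - t0;	/* elapsed monotonic time in ns */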
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      4a19bd3d
    • time: Add time64_to_tm() · e6c2682a
      Committed by Deepa Dinamani
      time_to_tm() takes time_t as an argument.
      time_t is not y2038 safe.
      Add time64_to_tm() that takes time64_t as an argument
      which is y2038 safe.
      The plan is to eventually replace all calls to time_to_tm()
      by time64_to_tm().
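
      A sketch of the intended usage, assuming a caller that currently uses
      time_to_tm() on a time_t value:

      	struct tm tm;

      	/* time64_to_tm() takes 64-bit seconds plus a UTC offset in
      	 * seconds, so it stays correct past year 2038 on 32-bit. */
      	time64_to_tm(ktime_get_real_seconds(), 0, &tm);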
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      e6c2682a
    • alarmtimer: Fix comments describing structure fields · af4afb40
      Committed by Pratyush Patel
      Updated struct alarm and struct alarm_timer descriptions.
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Pratyush Patel <pratyushpatel.1995@gmail.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      af4afb40
    • timekeeping: Fix 1ns/tick drift with GENERIC_TIME_VSYSCALL_OLD · 0209b937
      Committed by Thomas Graziadei
      The user notices the problem as a drift in both the raw and real time
      clocks when calling clock_gettime() with CLOCK_REALTIME /
      CLOCK_MONOTONIC_RAW on a system with no NTP correction taking place
      (no ntpd or ptp stuff running).
      
      The problem is that old_vsyscall_fixup() adds an extra 1ns even though
      xtime_nsec is already held in full nsecs and the remainder in this
      case is 0. Do the rounding-up business only if needed.
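
      A hedged sketch of the shape of the fix inside old_vsyscall_fixup()
      (the NTP error bookkeeping is omitted here):

      	u64 remainder, shift = tk->tkr_mono.shift;

      	remainder = tk->tkr_mono.xtime_nsec & ((1ULL << shift) - 1);
      	if (remainder != 0) {
      		/* only round up when a fractional nanosecond is left over */
      		tk->tkr_mono.xtime_nsec -= remainder;
      		tk->tkr_mono.xtime_nsec += 1ULL << shift;
      	}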
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Thomas Graziadei <thomas.graziadei@omicronenergy.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      0209b937
    • clocksource: Make clocksource insert entry more efficient · 0fb71d34
      Committed by Minfei Huang
      In clocksource_enqueue(), it is unnecessary to keep looping over the
      list once we find an entry whose rating is smaller than the new
      one's. It is safe to break out of the loop at that point, because all
      entries are inserted in descending order of rating.
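
      A sketch of the resulting insertion loop (simplified; the list is
      kept sorted by descending rating):

      	static void clocksource_enqueue(struct clocksource *cs)
      	{
      		struct list_head *entry = &clocksource_list;
      		struct clocksource *tmp;

      		list_for_each_entry(tmp, &clocksource_list, list) {
      			/* the first lower-rated entry marks the insertion
      			 * point, so there is no need to scan further */
      			if (tmp->rating < cs->rating)
      				break;
      			entry = &tmp->list;
      		}
      		list_add(&cs->list, entry);
      	}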
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Minfei Huang <mnghuan@gmail.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      0fb71d34
  3. 10 Jun, 2016: 2 commits
  4. 01 Jun, 2016: 1 commit
  5. 28 May, 2016: 1 commit
    • remove lots of IS_ERR_VALUE abuses · 287980e4
      Committed by Arnd Bergmann
      Most users of IS_ERR_VALUE() in the kernel are wrong, as they
      pass an 'int' into a function that takes an 'unsigned long'
      argument. This happens to work because the type is sign-extended
      on 64-bit architectures before it gets converted into an
      unsigned type.
      
      However, anything that passes an 'unsigned short' or 'unsigned int'
      argument into IS_ERR_VALUE() is guaranteed to be broken, as are
      8-bit integers and types that are wider than 'unsigned long'.
      
      Andrzej Hajda has already fixed a lot of the worst abusers that
      were causing actual bugs, but it would be nice to prevent any
      users that are not passing 'unsigned long' arguments.
      
      This patch changes all users of IS_ERR_VALUE() that I could find
      on 32-bit ARM randconfig builds and x86 allmodconfig. For the
      moment, this doesn't change the definition of IS_ERR_VALUE()
      because there are probably still architecture specific users
      elsewhere.
      
      Almost all the warnings I got are for files that are better off
      using 'if (err)' or 'if (err < 0)'.
      The only legitimate user I could find that we get a warning for
      is the (32-bit only) freescale fman driver, so I did not remove
      the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
      For 9pfs, I just worked around one user whose calling conventions
      are so obscure that I did not dare change the behavior.
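
      To illustrate the point with hypothetical callers (not code from the
      patch), IS_ERR_VALUE() is only well-defined on an 'unsigned long',
      while plain error codes are better tested directly:

      	unsigned long addr = claim_some_region();	/* ok: unsigned long */
      	if (IS_ERR_VALUE(addr))
      		return (int)addr;

      	int err = setup_something();	/* simpler without IS_ERR_VALUE() */
      	if (err < 0)
      		return err;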
      
      I was using this definition for testing:
      
       #define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
             unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))
      
      which ends up making all 16-bit or wider types work correctly with
      the most plausible interpretation of what IS_ERR_VALUE() was supposed
      to return according to its users, but also causes a compile-time
      warning for any users that do not pass an 'unsigned long' argument.
      
      I suggested this approach earlier this year, but back then we ended
      up deciding to just fix the users that are obviously broken. After
      the initial warning that caused me to get involved in the discussion
      (fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
      asked me to send the whole thing again.
      
      [ Updated the 9p parts as per Al Viro  - Linus ]
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Andrzej Hajda <a.hajda@samsung.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: https://lkml.org/lkml/2016/1/7/363
      Link: https://lkml.org/lkml/2016/5/27/486
      Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> # For nvmem part
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      287980e4
  6. 27 May, 2016: 1 commit
  7. 26 May, 2016: 1 commit
  8. 25 May, 2016: 1 commit
    • sched/core: Fix remote wakeups · b7e7ade3
      Committed by Peter Zijlstra
      Commit:
      
        b5179ac7 ("sched/fair: Prepare to fix fairness problems on migration")
      
      ... introduced a bug: Mike Galbraith found that it introduced a
      performance regression, while Paul E. McKenney reported lost
      wakeups and bisected it to this commit.
      
      The reason is that I mis-read ttwu_queue() such that I assumed any
      wakeup that got a remote queue must have had the task migrated.
      
      Since this is not so, we need to transfer this information between
      queueing the wakeup and actually doing the wakeup. Use a new
      task_struct::sched_flag for this; we already write to
      sched_contributes_to_load in the wakeup path, so this is a hot and
      already-modified cacheline.
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Pavan Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: byungchul.park@lge.com
      Fixes: b5179ac7 ("sched/fair: Prepare to fix fairness problems on migration")
      Link: http://lkml.kernel.org/r/20160523091907.GD15728@worktop.ger.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b7e7ade3
  9. 24 May, 2016: 14 commits
    • genirq: Fix missing return value in irq_destroy_ipi() · 59fa5860
      Committed by Matt Redfearn
      Commit 7cec18a3 changed the return type of irq_destroy_ipi to int, but
      missed adding a value to one return statement. Fix this to silence the
      resulting compiler warning:
      
      kernel/irq/ipi.c In function ‘irq_destroy_ipi’:
      kernel/irq/ipi.c:128:3: warning: ‘return’ with no value, in function returning non-void [-Wreturn-type]
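
      The class of bug, reduced to a self-contained illustration (this is
      not the kernel function itself):

      	#include <errno.h>

      	/* A function whose return type changed from void to int: every
      	 * early "return;" now has to supply a value, or the compiler
      	 * emits exactly this -Wreturn-type warning. */
      	static int destroy_thing(void *thing)
      	{
      		if (!thing)
      			return -EINVAL;	/* was a bare "return;" */

      		/* ... tear the object down ... */
      		return 0;
      	}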
      
      Fixes: 7cec18a3 "genirq: Add error code reporting to irq_{reserve,destroy}_ipi"
      Signed-off-by: Matt Redfearn <matt.redfearn@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Link: http://lkml.kernel.org/r/1464086550-24734-1-git-send-email-matt.redfearn@imgtec.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      59fa5860
    • uprobes: wait for mmap_sem for write killable · 598fdc1d
      Committed by Michal Hocko
      xol_add_vma needs mmap_sem for write.  If the waiting task gets killed
      by the oom killer it would block oom_reaper from asynchronous address
      space reclaim and reduce the chances of timely OOM resolving.  Wait for
      the lock in the killable mode and return with EINTR if the task got
      killed while waiting.
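
      The killable-lock pattern being described, as a sketch (surrounding
      function context assumed):

      	if (down_write_killable(&mm->mmap_sem))
      		return -EINTR;	/* woken by a fatal signal, e.g. OOM kill */

      	/* ... set up the XOL area under mmap_sem held for write ... */

      	up_write(&mm->mmap_sem);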
      
      Do not warn in dup_xol_work if __create_xol_area failed due to a
      pending fatal signal, because such a warning is usually read as a
      kernel issue.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      598fdc1d
    • prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable · 17b0573d
      Committed by Michal Hocko
      PR_SET_THP_DISABLE requires mmap_sem for write.  If the waiting task
      gets killed by the oom killer it would block oom_reaper from
      asynchronous address space reclaim and reduce the chances of timely OOM
      resolving.  Wait for the lock in the killable mode and return with EINTR
      if the task got killed while waiting.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Alex Thorlton <athorlton@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17b0573d
    • mm, fork: make dup_mmap wait for mmap_sem for write killable · 7c051267
      Committed by Michal Hocko
      dup_mmap needs to lock current's mm mmap_sem for write.  If the waiting
      task gets killed by the oom killer it would block oom_reaper from
      asynchronous address space reclaim and reduce the chances of timely OOM
      resolving.  Wait for the lock in the killable mode and return with EINTR
      if the task got killed while waiting.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c051267
    • s390/kexec: consolidate crash_map/unmap_reserved_pages() and... · 7a0058ec
      Committed by Xunlei Pang
      s390/kexec: consolidate crash_map/unmap_reserved_pages() and arch_kexec_protect(unprotect)_crashkres()
      
      Commit 3f625002581b ("kexec: introduce a protection mechanism for the
      crashkernel reserved memory") added a mechanism for protecting the
      crash kernel reserved memory that is similar to the previous
      crash_map/unmap_reserved_pages() implementation; the new one is more
      generic in name and cleaner in code (besides, some architectures may
      not be allowed to unmap the pgtable).
      
      Therefore, this patch consolidates them, and uses the new
      arch_kexec_protect(unprotect)_crashkres() to replace former
      crash_map/unmap_reserved_pages() which by now has been only used by
      S390.
      
      The consolidation work needs the crash memory to be mapped initially,
      this is done in machine_kdump_pm_init() which is after
      reserve_crashkernel().  Once kdump kernel is loaded, the new
      arch_kexec_protect_crashkres() implemented for S390 will actually
      unmap the pgtable like before.
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Minfei Huang <mhuang@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a0058ec
    • kexec: do a cleanup for function kexec_load · 0eea0867
      Committed by Minfei Huang
      There is a lot of work to be done in the kexec_load function, not
      only allocating structs and loading the initramfs, but also some
      miscellaneous preparation.
      
      To make it clearer, introduce a new function do_kexec_load which is
      used to allocate the structs and load the initramfs, while the
      preparatory work stays in kexec_load.
      Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Xunlei Pang <xlpang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0eea0867
    • kexec: make a pair of map/unmap reserved pages in error path · 917a3560
      Committed by Minfei Huang
      On some architectures, kexec has to map the reserved pages before
      using them when we try to start the kdump service.
      
      kexec may return directly, without unmapping the reserved pages, if
      it fails while starting the service.  To fix this, make a matched
      pair of map/unmap calls for the reserved pages in both the generic
      path and the error path.
      
      This patch only affects s390.  Other architectures don't implement
      the crash_unmap_reserved_pages and crash_map_reserved_pages interface.
      
      It isn't an urgent patch.  The kernel can work well without any risk,
      although the reserved pages are not unmapped before returning in the
      error path.
      Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Xunlei Pang <xlpang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      917a3560
    • kexec: introduce a protection mechanism for the crashkernel reserved memory · 9b492cf5
      Committed by Xunlei Pang
      When some kernel (module) code path stamps on the crash reserved
      memory (already mapped by the kernel) into which the second kernel's
      data has been loaded, the kdump kernel will probably fail to boot
      when a panic happens (or even when it doesn't), leaving the culprit
      at large.  This is unacceptable.
      
      The patch introduces a mechanism for detecting such cases:
      
      1) After each crash kexec loading, it simply marks the reserved memory
         regions readonly since we no longer access it after that.  When someone
         stamps the region, the first kernel will panic and trigger the kdump.
         The weak arch_kexec_protect_crashkres() is introduced to do the actual
         protection.
      
      2) To allow multiple loading, once 1) was done we also need to remark
         the reserved memory to readwrite each time a system call related to
         kdump is made.  The weak arch_kexec_unprotect_crashkres() is introduced
         to do the actual unprotection.
      
      The architecture can make its specific implementation by overriding
      arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres().
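
      A minimal sketch of the weak defaults implied by the description (an
      architecture that can write-protect the region overrides both hooks):

      	void __weak arch_kexec_protect_crashkres(void)
      	{
      	}

      	void __weak arch_kexec_unprotect_crashkres(void)
      	{
      	}
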
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Minfei Huang <mhuang@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b492cf5
    • kernek/fork.c: allocate idle task for a CPU always on its local node · 725fc629
      Committed by Andi Kleen
      Linux preallocates the task structs of the idle tasks for all possible
      CPUs.  This currently means they all end up on node 0.  This also
      implies that the cache lines used for MWAIT, which are around the
      flags field in the task struct, are all located on node 0.
      
      We see a noticeable performance improvement on Knights Landing CPUs when
      the cache lines used for MWAIT are located in the local nodes of the
      CPUs using them.  I would expect this to give a (likely slight)
      improvement on other systems too.
      
      The patch implements placing each idle task on the node of its CPU,
      by passing the right target node to copy_process().
      
      [akpm@linux-foundation.org: use NUMA_NO_NODE, not a bare -1]
      Link: http://lkml.kernel.org/r/1463492694-15833-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      725fc629
    • kernel/signal.c: convert printk(KERN_<LEVEL> ...) to pr_<level>(...) · 747800ef
      Committed by Wang Xiaoqiang
      Use pr_<level> instead of printk(KERN_<LEVEL> ).
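
      For example (an illustrative line, not a specific hunk from the
      patch):

      	printk(KERN_INFO "%s: unhandled signal %d\n", tsk->comm, sig);	/* before */
      	pr_info("%s: unhandled signal %d\n", tsk->comm, sig);		/* after  */
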
      Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      747800ef
    • wait: allow sys_waitid() to accept __WNOTHREAD/__WCLONE/__WALL · 91c4e8ea
      Committed by Oleg Nesterov
      I see no reason why waitid() can't support other linux-specific flags
      allowed in sys_wait4().
      
      In particular this change can help if we reconsider the previous change
      ("wait/ptrace: assume __WALL if the child is traced") which adds the
      "automagical" __WALL for debugger.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Cc: Pedro Alves <palves@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: <syzkaller@googlegroups.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91c4e8ea
    • wait/ptrace: assume __WALL if the child is traced · bf959931
      Committed by Oleg Nesterov
      The following program (a simplified version of one generated by syzkaller)
      
      	#include <pthread.h>
      	#include <unistd.h>
      	#include <sys/ptrace.h>
      	#include <stdio.h>
      	#include <signal.h>
      
      	void *thread_func(void *arg)
      	{
      		ptrace(PTRACE_TRACEME, 0,0,0);
      		return 0;
      	}
      
      	int main(void)
      	{
      		pthread_t thread;
      
      		if (fork())
      			return 0;
      
      		while (getppid() != 1)
      			;
      
      		pthread_create(&thread, NULL, thread_func, NULL);
      		pthread_join(thread, NULL);
      		return 0;
      	}
      
      creates an unreapable zombie if /sbin/init doesn't use __WALL.
      
      This is not a kernel bug, at least in a sense that everything works as
      expected: debugger should reap a traced sub-thread before it can reap the
      leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.
      
      Unfortunately, it seems that /sbin/init in most (all?) distributions
      doesn't use it and we have to change the kernel to avoid the problem.
      Note also that most init's use sys_waitid() which doesn't allow __WALL, so
      the necessary user-space fix is not that trivial.
      
      This patch just adds the "ptrace" check into eligible_child().  To
      some degree this matches the "tsk->ptrace" check in exit_notify():
      ->exit_signal is mostly ignored when the tracee reports to the
      debugger.  The same goes for WSTOPPED: the tracer doesn't need to set
      this flag to wait for the stopped tracee.
      
      This obviously means the user-visible change: __WCLONE and __WALL no
      longer have any meaning for debugger.  And I can only hope that this won't
      break something, but at least strace/gdb won't suffer.
      
      We could make a more conservative change.  Say, we can take __WCLONE into
      account, or !thread_group_leader().  But it would be nice to not
      complicate these historical/confusing checks.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Cc: Pedro Alves <palves@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: <syzkaller@googlegroups.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf959931
    • ELF/MIPS build fix · f43edca7
      Committed by Ralf Baechle
      CONFIG_MIPS32_N32=y but CONFIG_BINFMT_ELF disabled results in the
      following linker errors:
      
        arch/mips/built-in.o: In function `elf_core_dump':
        binfmt_elfn32.c:(.text+0x23dbc): undefined reference to `elf_core_extra_phdrs'
        binfmt_elfn32.c:(.text+0x246e4): undefined reference to `elf_core_extra_data_size'
        binfmt_elfn32.c:(.text+0x248d0): undefined reference to `elf_core_write_extra_phdrs'
        binfmt_elfn32.c:(.text+0x24ac4): undefined reference to `elf_core_write_extra_data'
      
      CONFIG_MIPS32_O32=y but CONFIG_BINFMT_ELF disabled results in the following
      linker errors:
      
        arch/mips/built-in.o: In function `elf_core_dump':
        binfmt_elfo32.c:(.text+0x28a04): undefined reference to `elf_core_extra_phdrs'
        binfmt_elfo32.c:(.text+0x29330): undefined reference to `elf_core_extra_data_size'
        binfmt_elfo32.c:(.text+0x2951c): undefined reference to `elf_core_write_extra_phdrs'
        binfmt_elfo32.c:(.text+0x29710): undefined reference to `elf_core_write_extra_data'
      
      This is because binfmt_elfn32 and binfmt_elfo32 are using symbols from
      elfcore but for these configurations elfcore will not be built.
      
      Fix this by making elfcore selectable via a separate config symbol
      which, unlike the current mechanism, can also be used from
      directories other than kernel/, and then having each flavor of ELF
      that relies on elfcore.o select it in Kconfig.  This includes
      CONFIG_MIPS32_N32 and CONFIG_MIPS32_O32, which fixes the issue.
      
      Link: http://lkml.kernel.org/r/20160520141705.GA1913@linux-mips.org
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Reviewed-by: James Hogan <james.hogan@imgtec.com>
      Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f43edca7
    • bpf, inode: disallow userns mounts · 612bacad
      Committed by Daniel Borkmann
      Follow-up to commit e27f4a94 ("bpf: Use mount_nodev not mount_ns
      to mount the bpf filesystem"), which removes the FS_USERNS_MOUNT flag.
      
      The original idea was to have a per mountns instance instead of a
      single global fs instance, but that didn't work out and we had to
      switch to mount_nodev() model. The intent of that middle ground was
      that we avoid users who don't play nice to create endless instances
      of bpf fs which are difficult to control and discover from an admin
      point of view, but at the same time it would have allowed us to be
      more flexible with regard to namespaces.
      
      Therefore, since we now did the switch to mount_nodev() as a fix
      where individual instances are created, we also need to remove userns
      mount flag along with it to avoid running into mentioned situation.
      I don't expect any breakage at this early point in time with removing
      the flag and we can revisit this later should the requirement for
      this come up with future users. This and commit e27f4a94 have
      been split to facilitate tracking should any of them run into the
      unlikely case of causing a regression.
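
      A sketch of the end state described above (not the verbatim source):
      the bpf filesystem type mounts via mount_nodev() and carries no
      FS_USERNS_MOUNT flag:

      	static struct file_system_type bpf_fs_type = {
      		.owner		= THIS_MODULE,
      		.name		= "bpf",
      		.mount		= bpf_mount,		/* backed by mount_nodev() */
      		.kill_sb	= kill_litter_super,
      		/* note: no .fs_flags = FS_USERNS_MOUNT */
      	};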
      
      Fixes: b2197755 ("bpf: add support for persistent maps/progs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      612bacad
  10. 23 May, 2016: 1 commit
    • x86: remove more uaccess_32.h complexity · bd28b145
      Committed by Linus Torvalds
      I'm looking at trying to possibly merge the 32-bit and 64-bit versions
      of the x86 uaccess.h implementation, but first this needs to be cleaned
      up.
      
      For example, the 32-bit version of "__copy_from_user_inatomic()" is
      mostly the special cases for the constant size, and it's actually almost
      never relevant.  Most users aren't actually using a constant size
      anyway, and the few cases that do small constant copies are better off
      just using __get_user() instead.
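
      The preferred pattern for those rare small constant-size copies,
      sketched with an assumed user pointer name:

      	u32 val;

      	/* a 4-byte read from user space is simpler via __get_user() than
      	 * via a constant-size __copy_from_user_inatomic() */
      	if (__get_user(val, (u32 __user *)uaddr))
      		return -EFAULT;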
      
      So get rid of the unnecessary complexity.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd28b145
  11. 21 May, 2016: 11 commits
    • radix-tree: introduce radix_tree_empty · e9256efc
      Committed by Matthew Wilcox
      Commit e6145236 ("radix_tree: add support for multi-order entries")
      left the impression that the support for multiorder radix tree entries
      was functional.  As soon as Ross tried to use it, it became apparent
      that my testing was completely inadequate, and it didn't even work a
      little bit for orders that were not a multiple of shift.
      
      This series of patches is the result of about 6 weeks of redesign,
      reimplementation, testing, arguing and hair-pulling.  The great news is
      that the test-suite is now far better than it was.  That's reflected in
      the diffstat for the test-suite alone:
      
       12 files changed, 436 insertions(+), 28 deletions(-)
      
      The highlight for users of the tree is that the restriction on the order
      of inserted entries being >= RADIX_TREE_MAP_SHIFT is now gone; the radix
      tree now supports any order between 0 and 64.
      
      For those who are interested in how the tree works, patch 9 is probably
      the most interesting one as it introduces the new machinery for handling
      sibling entries.
      
      I've tried to be fair in attributing authorship to the person who
      contributed the majority of the code in each patch; Ross has been an
      invaluable partner in the development of this support and it's fair to
      say that each of us has code in every commit.
      
      I should also express my appreciation of the 0day testing.  It prompted
      me that I was bloating the tinyconfig in an unacceptable way, and it
      bisected to a commit which contained a rather nasty memory-corruption
      bug.
      
      This patch (of 29):
      
      The irqdomain code was checking for 0 or 1 entries, not 0 entries like
      the comment said they were.  Introduce a new helper that will actually
      check for an empty tree.
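
      The helper is essentially a readability wrapper; a sketch consistent
      with the description:

      	static inline bool radix_tree_empty(struct radix_tree_root *root)
      	{
      		return root->rnode == NULL;
      	}
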
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9256efc
    • kernel/sysctl_binary.c: use generic UUID library · ede9c277
      Committed by Andy Shevchenko
      The UUID library provides the uuid_be type and the uuid_be_to_bin()
      function.  This substitutes the open-coded variant with generic
      library calls.
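
      A sketch of the substitution (the local variable name is an
      assumption):

      	uuid_be uuid;

      	/* parse the textual UUID with the generic helper instead of an
      	 * open-coded hex loop */
      	if (uuid_be_to_bin(str, &uuid) < 0)
      		return -EINVAL;
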
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
      Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ede9c277
    • printk/nmi: flush NMI messages on the system panic · cf9b1106
      Committed by Petr Mladek
      In NMI context, printk() messages are stored into per-CPU buffers to
      avoid a possible deadlock.  They are normally flushed to the main ring
      buffer via an IRQ work.  But the work is never called when the system
      calls panic() in the very same NMI handler.
      
      This patch tries to flush NMI buffers before the crash dump is
      generated.  In this case it does not risk a double release and bails out
      when the logbuf_lock is already taken.  The aim is to get the messages
      into the main ring buffer when possible.  It makes them better
      accessible in the vmcore.
      
      Then the patch tries to flush the buffers second time when other CPUs
      are down.  It might be more aggressive and reset logbuf_lock.  The aim
      is to get the messages available for the consequent kmsg_dump() and
      console_flush_on_panic() calls.
      
      The patch causes vprintk_emit() to be called even in NMI context again.
      But it is done via printk_deferred() so that the console handling is
      skipped.  Consoles use internal locks and we could not prevent a
      deadlock easily.  They are explicitly called later when the crash dump
      is not generated, see console_flush_on_panic().
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jiri Kosina <jkosina@suse.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf9b1106
    • printk/nmi: increase the size of NMI buffer and make it configurable · 427934b8
      Committed by Petr Mladek
      Testing has shown that the backtrace sometimes does not fit into the 4kB
      temporary buffer that is used in NMI context.  The warnings are gone
      when I double the temporary buffer size.
      
      This patch doubles the buffer size and makes it configurable.
      
      Note that this problem existed even in the x86-specific implementation
      that was added by the commit a9edc880 ("x86/nmi: Perform a safe NMI
      stack trace on all CPUs").  Nobody noticed it because it did not print
      any warnings.
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Jiri Kosina <jkosina@suse.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      427934b8
    • printk/nmi: warn when some message has been lost in NMI context · b522deab
      Committed by Petr Mladek
      We could not resize the temporary buffer in NMI context.  Let's warn if
      a message is lost.
      
      This is rather theoretical.  printk() should not be used in NMI.  The
      only sensible use is when we want to print backtrace from all CPUs.  The
      current buffer should be enough for this purpose.
      
      [akpm@linux-foundation.org: whitespace fixlet]
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Jiri Kosina <jkosina@suse.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b522deab
    • printk/nmi: generic solution for safe printk in NMI · 42a0bb3f
      Committed by Petr Mladek
      printk() takes some locks and therefore cannot be used in a safe way
      in NMI context.
      
      The chance of a deadlock is real especially when printing stacks from
      all CPUs.  This particular problem has been addressed on x86 by the
      commit a9edc880 ("x86/nmi: Perform a safe NMI stack trace on all
      CPUs").
      
      The patchset brings two big advantages.  First, it makes the NMI
      backtraces safe on all architectures for free.  Second, it makes all NMI
      messages almost safe on all architectures (the temporary buffer is
      limited.  We still should keep the number of messages in NMI context at
      minimum).
      
      Note that there already are several messages printed in NMI context:
      WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
      handlers.  These are not easy to avoid.
      
      This patch reuses most of the code and makes it generic.  It is useful
      for all messages and architectures that support NMI.
      
      The alternative printk_func is set when entering NMI context and is
      reset when leaving it.  It queues IRQ work to copy the messages into
      the main ring buffer in a safe context.
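
      A condensed sketch of that switching; the enter/exit helper names
      here approximate the real ones:

      	void printk_nmi_enter(void)
      	{
      		/* writes now land in the per-CPU NMI buffer */
      		this_cpu_write(printk_func, vprintk_nmi);
      	}

      	void printk_nmi_exit(void)
      	{
      		/* back to the normal printk path */
      		this_cpu_write(printk_func, vprintk_default);
      	}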
      
      __printk_nmi_flush() copies all available messages and resets the
      buffer.  Then we can use a simple cmpxchg operation to get
      synchronized with writers.  A spinlock is also used to get
      synchronized with other flushers.
      
      We no longer use seq_buf because it depends on an external lock.  It
      would be hard to make all supported operations safe for lockless use.
      It would be confusing and error prone to make only some operations
      safe.
      
      The code is put into separate printk/nmi.c as suggested by Steven
      Rostedt.  It needs a per-CPU buffer and is compiled only on
      architectures that call nmi_enter().  This is achieved by the new
      HAVE_NMI Kconfig flag.
      
      The exceptions are the MN10300 and Xtensa architectures; we need to
      clean up NMI handling there first.  Let's do that separately.
      
      The patch is heavily based on the draft from Peter Zijlstra, see
      
        https://lkml.org/lkml/2015/6/10/327
      
      [arnd@arndb.de: printk-nmi: use %zu format string for size_t]
      [akpm@linux-foundation.org: min_t->min - all types are size_t here]
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>	[arm part]
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Jiri Kosina <jkosina@suse.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      42a0bb3f
    • fork: free thread in copy_process on failure · 0740aa5f
      Committed by Jiri Slaby
      When using this program (as root):
      
      	#include <err.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      
      	#include <sys/io.h>
      	#include <sys/types.h>
      	#include <sys/wait.h>
      
      	#define ITER 1000
      	#define FORKERS 15
      	#define THREADS (6000/FORKERS) // 1850 is proc max
      
      	static void fork_100_wait()
      	{
      		unsigned a, to_wait = 0;
      
      		printf("\t%d forking %d\n", THREADS, getpid());
      
      		for (a = 0; a < THREADS; a++) {
      			switch (fork()) {
      			case 0:
      				usleep(1000);
      				exit(0);
      				break;
      			case -1:
      				break;
      			default:
      				to_wait++;
      				break;
      			}
      		}
      
      		printf("\t%d forked from %d, waiting for %d\n", THREADS, getpid(),
      				to_wait);
      
      		for (a = 0; a < to_wait; a++)
      			wait(NULL);
      
      		printf("\t%d waited from %d\n", THREADS, getpid());
      	}
      
      	static void run_forkers()
      	{
      		pid_t forkers[FORKERS];
      		unsigned a;
      
      		for (a = 0; a < FORKERS; a++) {
      			switch ((forkers[a] = fork())) {
      			case 0:
      				fork_100_wait();
      				exit(0);
      				break;
      			case -1:
      				err(1, "DIE fork of %d'th forker", a);
      				break;
      			default:
      				break;
      			}
      		}
      
      		for (a = 0; a < FORKERS; a++)
      			waitpid(forkers[a], NULL, 0);
      	}
      
      	int main()
      	{
      		unsigned a;
      		int ret;
      
      		ret = ioperm(10, 20, 0);
      		if (ret < 0)
      			err(1, "ioperm");
      
      		for (a = 0; a < ITER; a++)
      			run_forkers();
      
      		return 0;
      	}
      
      kmemleak reports many occurrences of this leak:
      unreferenced object 0xffff8805917c8000 (size 8192):
        comm "fork-leak", pid 2932, jiffies 4295354292 (age 1871.028s)
        hex dump (first 32 bytes):
          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
        backtrace:
          [<ffffffff814cfbf5>] kmemdup+0x25/0x50
          [<ffffffff8103ab43>] copy_thread_tls+0x6c3/0x9a0
          [<ffffffff81150174>] copy_process+0x1a84/0x5790
          [<ffffffff811dc375>] wake_up_new_task+0x2d5/0x6f0
          [<ffffffff8115411d>] _do_fork+0x12d/0x820
      ...
      
      This is due to the leakage of memory items which should have been
      freed in arch/x86/kernel/process.c:exit_thread().
      
      Make sure the memory is freed when fork fails later in copy_process.
      This is done by calling exit_thread with the thread to kill.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0740aa5f
    • exit_thread: accept a task parameter to be exited · e6464694
      Committed by Jiri Slaby
      We need to call exit_thread from copy_process in a fail path.  So make it
      accept task_struct as a parameter.
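
      A sketch of the new calling convention (the wrapper below is
      hypothetical, only to show a non-current caller):

      	void exit_thread(struct task_struct *tsk);

      	/* copy_process() can now unwind a half-constructed child: */
      	static void cleanup_failed_fork(struct task_struct *p)
      	{
      		exit_thread(p);	/* previously only possible for current */
      	}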
      
      [v2]
      * s390: exit_thread_runtime_instr doesn't make sense to be called for
        non-current tasks.
      * arm: fix the comment in vfp_thread_copy
      * change 'me' to 'tsk' for task_struct
      * now we can change only archs that actually have exit_thread
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6464694
    • mm, oom_reaper: do not mmput synchronously from the oom reaper context · ec8d7c14
      Committed by Michal Hocko
      Tetsuo has properly noted that mmput slow path might get blocked waiting
      for another party (e.g.  exit_aio waits for an IO).  If that happens the
      oom_reaper would be put out of the way and will not be able to process
      the next oom victim.  We should strive to make this context as
      reliable and as independent of other subsystems as possible.
      
      Introduce mmput_async which will perform the slow path from an async
      (WQ) context.  This will delay the operation but that shouldn't be a
      problem because the oom_reaper has reclaimed the victim's address space
      for most cases as much as possible and the remaining context shouldn't
      bind too much memory anymore.  The only exception is when mmap_sem
      trylock has failed which shouldn't happen too often.
      
      The issue is only theoretical but not impossible.
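
      A sketch of the async variant as described (the work item field and
      helper names are assumptions):

      	void mmput_async(struct mm_struct *mm)
      	{
      		if (atomic_dec_and_test(&mm->mm_users)) {
      			/* defer the potentially blocking teardown to a
      			 * workqueue instead of doing it in this context */
      			INIT_WORK(&mm->async_put_work, mmput_async_fn);
      			schedule_work(&mm->async_put_work);
      		}
      	}
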
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec8d7c14
    • bpf: teach verifier to recognize imm += ptr pattern · 1b9b69ec
      Committed by Alexei Starovoitov
      Humans don't write C code like:
        u8 *ptr = skb->data;
        int imm = 4;
        imm += ptr;
      but from the llvm backend's point of view 'imm' and 'ptr' are just
      registers, and imm += ptr may be preferred over ptr += imm depending on
      which register value will be used further in the code, while the
      verifier could only recognize ptr += imm.
      That caused small unrelated changes in the C code of the bpf program to
      trigger rejection by the verifier. Therefore teach the verifier to recognize
      both ptr += imm and imm += ptr.
      For example:
      when R6=pkt(id=0,off=0,r=62) R7=imm22
      after r7 += r6 instruction
      will be R6=pkt(id=0,off=0,r=62) R7=pkt(id=0,off=22,r=62)
      
      Fixes: 969bf05e ("bpf: direct packet access")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1b9b69ec
    • bpf: support decreasing order in direct packet access · d91b28ed
      Committed by Alexei Starovoitov
      when packet headers are accessed in 'decreasing' order (like TCP port
      may be fetched before the program reads IP src) the llvm may generate
      the following code:
      [...]                // R7=pkt(id=0,off=22,r=70)
      r2 = *(u32 *)(r7 +0) // good access
      [...]
      r7 += 40             // R7=pkt(id=0,off=62,r=70)
      r8 = *(u32 *)(r7 +0) // good access
      [...]
      r1 = *(u32 *)(r7 -20) // this one will fail though it's within a safe range
                            // it's doing *(u32*)(skb->data + 42)
      Fix verifier to recognize such code pattern
      
      Also, it turned out that the 'off > range' condition is not a verifier
      bug.  It's a buggy program that may do something like:
      if (ptr + 50 > data_end)
        return 0;
      ptr += 60;
      *(u32*)ptr;
      in such case emit
      "invalid access to packet, off=0 size=4, R1(id=0,off=60,r=50)" error message,
      so all information is available for the program author to fix the program.
      
      Fixes: 969bf05e ("bpf: direct packet access")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d91b28ed