1. 23 March 2016 (9 commits)
    • kernel: add kcov code coverage · 5c9a8750
      Committed by Dmitry Vyukov
      kcov provides code coverage collection for coverage-guided fuzzing
      (randomized testing).  Coverage-guided fuzzing is a testing technique
      that uses coverage feedback to determine new interesting inputs to a
      system.  A notable user-space example is AFL
      (http://lcamtuf.coredump.cx/afl/).  However, this technique is not
      widely used for kernel testing due to missing compiler and kernel
      support.
      
      kcov does not aim to collect as much coverage as possible.  It aims to
      collect more or less stable coverage that is a function of syscall
      inputs.  To achieve this goal it does not collect coverage in soft/hard
      interrupts, and instrumentation of some inherently non-deterministic or
      non-interesting parts of the kernel (e.g.  scheduler, locking) is
      disabled.
      
      Currently there is a single coverage collection mode (tracing), but the
      API anticipates additional collection modes.  Initially I also
      implemented a second mode which exposes coverage in a fixed-size hash
      table of counters (what Quentin used in his original patch).  I've
      dropped the second mode for simplicity.
      
      This patch adds the necessary support on the kernel side.  The complementary
      compiler support was added in gcc revision 231296.
      
      We've used this support to build the syzkaller system call fuzzer, which has
      found 90 kernel bugs in just 2 months:
      
        https://github.com/google/syzkaller/wiki/Found-Bugs
      
      We've also found 30+ bugs in our internal systems with syzkaller.
      Another (yet unexplored) direction where kcov coverage would greatly
      help is more traditional "blob mutation".  For example, mounting a
      random blob as a filesystem, or receiving a random blob over the wire.
      
      Why not gcov?  A typical fuzzing loop looks as follows: (1) reset
      coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
      typical run's coverage can be just a dozen basic blocks (e.g.  an
      invalid input).  In such a context gcov becomes prohibitively expensive,
      as the reset/collect coverage steps depend on the total number of basic
      blocks/edges in the program (in the case of the kernel it is about 2M).
      The cost of kcov depends only on the number of executed basic
      blocks/edges.  On top of that, the kernel requires per-thread coverage
      because there are always background threads and unrelated processes
      that also produce coverage.  With inlined gcov instrumentation,
      per-thread coverage is not possible.
      
      kcov exposes kernel PCs and control flow to user-space, which is
      insecure.  But debugfs should not be mapped as user accessible.
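      
      As a rough illustration, the tracing mode is driven from user-space as
      follows (condensed from the patch's Documentation/kcov.txt; COVER_SIZE
      is an arbitrary buffer size):
      
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/ioctl.h>
      	#include <sys/mman.h>
      
      	#define KCOV_INIT_TRACE	_IOR('c', 1, unsigned long)
      	#define KCOV_ENABLE	_IO('c', 100)
      	#define KCOV_DISABLE	_IO('c', 101)
      	#define COVER_SIZE	(64 << 10)	/* in sizeof(unsigned long) units */
      
      	int main(void)
      	{
      		unsigned long *cover, n, i;
      		int fd = open("/sys/kernel/debug/kcov", O_RDWR);
      
      		/* Size the per-task buffer; cover[0] holds the number of PCs. */
      		ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
      		cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
      			     PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      		cover[0] = 0;
      		ioctl(fd, KCOV_ENABLE, 0);	/* trace this task only */
      
      		read(-1, NULL, 0);		/* the syscall under test */
      
      		n = cover[0];
      		for (i = 0; i < n; i++)
      			printf("0x%lx\n", cover[i + 1]);
      		ioctl(fd, KCOV_DISABLE, 0);
      		close(fd);
      		return 0;
      	}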
      
      Based on a patch by Quentin Casasnovas.
      
      [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
      [akpm@linux-foundation.org: unbreak allmodconfig]
      [akpm@linux-foundation.org: follow x86 Makefile layout standards]
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tavis Ormandy <taviso@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Drysdale <drysdale@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c9a8750
    • profile: hide unused functions when !CONFIG_PROC_FS · ade356b9
      Committed by Arnd Bergmann
      A couple of functions and variables in the profile implementation are
      used only on SMP systems by the procfs code, but are unused if either
      procfs is disabled or the kernel is uniprocessor.  gcc prints a harmless
      warning about the unused symbols:
      
        kernel/profile.c:243:13: error: 'profile_flip_buffers' defined but not used [-Werror=unused-function]
         static void profile_flip_buffers(void)
                     ^
        kernel/profile.c:266:13: error: 'profile_discard_flip_buffers' defined but not used [-Werror=unused-function]
         static void profile_discard_flip_buffers(void)
                     ^
        kernel/profile.c:330:12: error: 'profile_cpu_callback' defined but not used [-Werror=unused-function]
         static int profile_cpu_callback(struct notifier_block *info,
                    ^
      
      This adds further #ifdefs to the file, to annotate exactly in which
      cases they are used.  I have done several thousand ARM randconfig
      kernels with this patch applied and no longer get any warnings in this
      file.
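      
      The shape of the annotation is, roughly (a sketch; the exact guards
      follow the call sites in the file):
      
      	#if defined(CONFIG_SMP) && defined(CONFIG_PROC_FS)
      	static void profile_flip_buffers(void)
      	{
      		/* only referenced from the procfs read/write paths on SMP */
      	}
      	#endif /* CONFIG_SMP && CONFIG_PROC_FS */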
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ade356b9
    • panic: change nmi_panic from macro to function · ebc41f20
      Committed by Hidehiro Kawai
      Commit 1717f209 ("panic, x86: Fix re-entrance problem due to panic
      on NMI") and commit 58c5661f ("panic, x86: Allow CPUs to save
      registers even if looping in NMI context") introduced nmi_panic() which
      prevents concurrent/recursive execution of panic().  It also saves
      registers for the crash dump on x86.
      
      However, there are some cases where NMI handlers still use panic().
      This patch set partially replaces them with nmi_panic() in those cases.
      
      Even after this patchset is applied, some NMI or similar handlers (e.g.
      the MCE handler) continue to use panic().  This is because I can't test
      them well, and actual problems are unlikely to happen.  For example, the
      possibility that a normal panic and a panic on MCE happen simultaneously
      is very low.
      
      This patch (of 3):
      
      Convert nmi_panic() to a proper function and export it instead of
      exporting internal implementation details to modules, for obvious
      reasons.
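      
      A sketch of the resulting function (per the patch, lightly trimmed):
      
      	atomic_t panic_cpu = ATOMIC_INIT(PANIC_CPU_INVALID);
      
      	/*
      	 * A variant of panic() called from NMI context.  Return if this
      	 * CPU already panicked; if another CPU did, spin in an arch hook
      	 * that can save register state for the crash dump.
      	 */
      	void nmi_panic(struct pt_regs *regs, const char *msg)
      	{
      		int old_cpu, cpu;
      
      		cpu = raw_smp_processor_id();
      		old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, cpu);
      
      		if (old_cpu == PANIC_CPU_INVALID)
      			panic("%s", msg);
      		else if (old_cpu != cpu)
      			nmi_panic_self_stop(regs);
      	}
      	EXPORT_SYMBOL(nmi_panic);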
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Cc: Javi Merino <javi.merino@arm.com>
      Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com>
      Cc: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebc41f20
    • fs/coredump: prevent fsuid=0 dumps into user-controlled directories · 378c6520
      Committed by Jann Horn
      This commit fixes the following security hole affecting systems where
      all of the following conditions are fulfilled:
      
       - The fs.suid_dumpable sysctl is set to 2.
       - The kernel.core_pattern sysctl's value starts with "/". (Systems
         where kernel.core_pattern starts with "|/" are not affected.)
       - Unprivileged user namespace creation is permitted. (This is
         true on Linux >=3.8, but some distributions disallow it by
         default using a distro patch.)
      
      Under these conditions, if a program executes under secure exec rules,
      causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
      namespace, changes its root directory and crashes, the coredump will be
      written using fsuid=0 and a path derived from kernel.core_pattern - but
      this path is interpreted relative to the root directory of the process,
      allowing the attacker to control where a coredump will be written with
      root privileges.
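      
      The crashing sequence can be sketched as follows (illustrative only;
      assumes the process already runs with SUID_DUMP_ROOT after a secure
      exec, and that "/tmp/attacker" is an attacker-controlled directory):
      
      	#define _GNU_SOURCE
      	#include <sched.h>
      	#include <signal.h>
      	#include <unistd.h>
      
      	int main(void)
      	{
      		/* Full capabilities inside a fresh user namespace ... */
      		unshare(CLONE_NEWUSER);
      		/* ... so the root directory can be moved ... */
      		chroot("/tmp/attacker");
      		chdir("/");
      		/* ... and pre-fix, the fsuid=0 dump followed core_pattern
      		 * relative to this new root. */
      		raise(SIGSEGV);
      		return 0;
      	}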
      
      To fix the security issue, always interpret core_pattern for dumps that
      are written under SUID_DUMP_ROOT relative to the root directory of init.
      Signed-off-by: Jann Horn <jann@thejh.net>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      378c6520
    • ptrace: change __ptrace_unlink() to clear ->ptrace under ->siglock · 1333ab03
      Committed by Oleg Nesterov
      This test-case (a simplified version of one generated by syzkaller)
      
      	#include <unistd.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>
      
      	void test(void)
      	{
      		for (;;) {
      			if (fork()) {
      				wait(NULL);
      				continue;
      			}
      
      			ptrace(PTRACE_SEIZE, getppid(), 0, 0);
      			ptrace(PTRACE_INTERRUPT, getppid(), 0, 0);
      			_exit(0);
      		}
      	}
      
      	int main(void)
      	{
      		int np;
      
      		for (np = 0; np < 8; ++np)
      			if (!fork())
      				test();
      
      		while (wait(NULL) > 0)
      			;
      		return 0;
      	}
      
      triggers the 2nd WARN_ON_ONCE(!signr) warning in do_jobctl_trap().  The
      problem is that __ptrace_unlink() clears task->jobctl under siglock but
      task->ptrace is cleared without this lock held; this fools the "else"
      branch which assumes that !PT_SEIZED means PT_PTRACED.
      
      Note also that most of the other PTRACE_SEIZE checks can race with
      detach from the exiting tracer too.  Say, the callers of
      ptrace_trap_notify() assume that SEIZED can't go away after it was
      checked.
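      
      The fix moves the ->ptrace clearing into the ->siglock-protected
      section of __ptrace_unlink(); a simplified sketch:
      
      	void __ptrace_unlink(struct task_struct *child)
      	{
      		BUG_ON(!child->ptrace);
      
      		child->parent = child->real_parent;
      		list_del_init(&child->ptrace_entry);
      
      		spin_lock(&child->sighand->siglock);
      		child->ptrace = 0;	/* now cleared under the same lock as ->jobctl */
      
      		/* Clear all pending traps and TRAPPING. */
      		task_clear_jobctl_pending(child, JOBCTL_TRAP_MASK);
      		task_clear_jobctl_trapping(child);
      
      		/* remaining ->jobctl/signal fixups elided */
      		spin_unlock(&child->sighand->siglock);
      	}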
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1333ab03
    • auditsc: for seccomp events, log syscall compat state using in_compat_syscall · efbc0fbf
      Committed by Andy Lutomirski
      Except on SPARC, this is what the code always did.  SPARC compat seccomp
      was buggy, although the impact of the bug was limited because SPARC
      32-bit and 64-bit syscall numbers are the same.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      efbc0fbf
    • ptrace: in PEEK_SIGINFO, check syscall bitness, not task bitness · 5c465217
      Committed by Andy Lutomirski
      Users of the 32-bit ptrace() ABI expect the full 32-bit ABI.  siginfo
      translation should check ptrace() ABI, not caller task ABI.
      
      This is an ABI change on SPARC.  Let's hope that no one relied on the
      old buggy ABI.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c465217
    • seccomp: check in_compat_syscall, not is_compat_task, in strict mode · 5c38065e
      Committed by Andy Lutomirski
      Seccomp wants to know the syscall bitness, not the caller task bitness,
      when it selects the syscall whitelist.
      
      As far as I know, this makes no difference on any architecture, so it's
      not a security problem.  (It generates identical code everywhere except
      sparc, and, on sparc, the syscall numbering is the same for both ABIs.)
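      
      The change itself is small; in __secure_computing_strict() the
      whitelist selection becomes (sketch):
      
      	static void __secure_computing_strict(int this_syscall)
      	{
      		const int *syscall_whitelist = mode1_syscalls;
      	#ifdef CONFIG_COMPAT
      		if (in_compat_syscall())	/* was: is_compat_task() */
      			syscall_whitelist = mode1_syscalls_32;
      	#endif
      		do {
      			if (*syscall_whitelist == this_syscall)
      				return;
      		} while (*++syscall_whitelist);
      
      		/* not whitelisted: kill the task */
      		audit_seccomp(this_syscall, SIGKILL, SECCOMP_RET_KILL);
      		do_exit(SIGKILL);
      	}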
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c38065e
    • kernel/hung_task.c: use timeout diff when timeout is updated · b4aa14a6
      Committed by Tetsuo Handa
      When a new timeout is written to /proc/sys/kernel/hung_task_timeout_secs,
      khungtaskd is interrupted and again sleeps for the full timeout duration.
      
      This means that hung tasks will not be checked if a new timeout is
      written periodically within the old timeout duration, and/or checking
      of hung tasks will be delayed for up to the previous timeout duration.
      Fix this by remembering the last time khungtaskd checked for hung tasks.
      
      This change will allow other watchdog tasks (if any) to share khungtaskd
      by sleeping for the minimal timeout diff of all watchdog tasks.  Running
      more watchdog tasks from khungtaskd will reduce the possibility of
      printk() collisions by multiple watchdog threads.
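      
      A sketch of the resulting khungtaskd loop (helper name as in the
      patch; simplified):
      
      	/* Sleep only for what is left of the interval since the last check. */
      	static long hung_timeout_jiffies(unsigned long last_checked,
      					 unsigned long timeout)
      	{
      		/* timeout of 0 means infinite wait */
      		return timeout ? last_checked - jiffies + timeout * HZ :
      			MAX_SCHEDULE_TIMEOUT;
      	}
      
      	static int watchdog(void *dummy)
      	{
      		unsigned long hung_last_checked = jiffies;
      
      		set_user_nice(current, 0);
      
      		for ( ; ; ) {
      			unsigned long timeout = sysctl_hung_task_timeout_secs;
      			long t = hung_timeout_jiffies(hung_last_checked, timeout);
      
      			if (t <= 0) {
      				check_hung_uninterruptible_tasks(timeout);
      				hung_last_checked = jiffies;
      				continue;
      			}
      			schedule_timeout_interruptible(t);
      		}
      
      		return 0;
      	}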
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4aa14a6
  2. 18 March 2016 (10 commits)
    • lib/bug.c: use common WARN helper · 2553b67a
      Committed by Josh Poimboeuf
      The traceoff_on_warning option doesn't have any effect on s390, powerpc,
      arm64, parisc, and sh because there are two different types of WARN
      implementations:
      
      1) The above mentioned architectures treat WARN() as a special case of a
         BUG() exception.  They handle warnings in report_bug() in lib/bug.c.
      
      2) All other architectures just call warn_slowpath_*() directly.  Their
         warnings are handled in warn_slowpath_common() in kernel/panic.c.
      
      Support traceoff_on_warning on all architectures and prevent any future
      divergence by using a single common function to emit the warning.
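      
      Concretely, both paths now funnel into one helper, roughly of this
      shape (simplified sketch; the tail that prints the arguments and dumps
      the stack is elided):
      
      	void __warn(const char *file, int line, void *caller, unsigned taint,
      		    struct pt_regs *regs, struct warn_args *args)
      	{
      		disable_trace_on_warning();	/* honors traceoff_on_warning */
      
      		pr_warn("------------[ cut here ]------------\n");
      		if (file)
      			pr_warn("WARNING: CPU: %d PID: %d at %s:%d %pS\n",
      				raw_smp_processor_id(), current->pid,
      				file, line, caller);
      		else
      			pr_warn("WARNING: CPU: %d PID: %d at %pS\n",
      				raw_smp_processor_id(), current->pid, caller);
      
      		/* ... print args, dump stack, add taint ... */
      	}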
      
      Also remove the '()' from '%pS()', because the parentheses look funky:
      
        [   45.607629] WARNING: at /root/warn_mod/warn_mod.c:17 .init_dummy+0x20/0x40 [warn_mod]()
      Reported-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Tested-by: Prarit Bhargava <prarit@redhat.com>
      Acked-by: Prarit Bhargava <prarit@redhat.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2553b67a
    • param: convert some "on"/"off" users to strtobool · 4cc7ecb7
      Committed by Kees Cook
      This changes several users of manual "on"/"off" parsing to use
      strtobool.
      
      Some side-effects:
      - these uses will now parse y/n/1/0 meaningfully too
      - the early_param uses will now bubble up parse errors
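      
      As an illustration of the pattern (the parameter name here is
      hypothetical):
      
      	#include <linux/init.h>
      	#include <linux/string.h>	/* strtobool() */
      
      	static bool myfeature_enabled = true;	/* hypothetical knob */
      
      	static int __init myfeature_setup(char *str)
      	{
      		/*
      		 * strtobool() accepts "on"/"off" as before, plus y/n/1/0;
      		 * returning its error code lets bad input bubble up.
      		 */
      		return strtobool(str, &myfeature_enabled);
      	}
      	early_param("myfeature", myfeature_setup);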
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Amitkumar Karwar <akarwar@marvell.com>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kalle Valo <kvalo@codeaurora.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Nishant Sarmukadam <nishants@marvell.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Steve French <sfrench@samba.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4cc7ecb7
    • printk: add clear_idx symbol to vmcoreinfo · f468908b
      Committed by Ivan Delalande
      This allows us to extract from the vmcore only the messages emitted
      since the last time the ring buffer was cleared.  We just have to make
      sure its value is always up-to-date, for example when old messages are
      discarded to free space in log_make_free_space().
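      
      The export itself is one more symbol next to the existing log-buffer
      ones (sketch; surrounding lines abridged):
      
      	void log_buf_kexec_setup(void)
      	{
      		VMCOREINFO_SYMBOL(log_buf);
      		VMCOREINFO_SYMBOL(log_buf_len);
      		VMCOREINFO_SYMBOL(log_first_idx);
      		VMCOREINFO_SYMBOL(clear_idx);	/* new: first not-yet-cleared message */
      		VMCOREINFO_SYMBOL(log_next_idx);
      		/* struct printk_log size and field offsets follow */
      	}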
      Signed-off-by: Zeyu Zhao <zzy8200@gmail.com>
      Signed-off-by: Ivan Delalande <colona@arista.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f468908b
    • printk: check CON_ENABLED in have_callable_console() · adaf6590
      Committed by Sergey Senozhatsky
      have_callable_console() must also test the CON_ENABLED bit, not just
      CON_ANYTIME.  We may have a disabled CON_ANYTIME console, so printk can
      wrongly assume that it's safe to call_console_drivers().
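      
      The fixed check then looks roughly like:
      
      	/*
      	 * A console is callable at this time on this cpu only if it is
      	 * both enabled and able to cope with a not-yet-online CPU.
      	 */
      	static int have_callable_console(void)
      	{
      		struct console *con;
      
      		for_each_console(con)
      			if ((con->flags & CON_ENABLED) &&
      			    (con->flags & CON_ANYTIME))
      				return 1;
      
      		return 0;
      	}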
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kyle McMartin <kyle@kernel.org>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Cc: Calvin Owens <calvinowens@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      adaf6590
    • printk: set may_schedule for some of console_trylock() callers · 6b97a20d
      Committed by Sergey Senozhatsky
      console_unlock() is allowed to cond_resched() if its caller has set
      `console_may_schedule' to 1, since 8d91f8b1 ("printk: do
      cond_resched() between lines while outputting to consoles").
      
      The rules are:
      -- console_lock() always sets `console_may_schedule' to 1
      -- console_trylock() always sets `console_may_schedule' to 0
      
      However, console_trylock() callers (among them printk()) do not always
      run in atomic contexts, and some of them can cond_resched() in
      console_unlock(), so console_trylock() can set `console_may_schedule'
      to 1 for such processes.
      
      For !CONFIG_PREEMPT_COUNT kernels, however, console_trylock() always
      sets `console_may_schedule' to 0.
      
      It's possible to drop explicit preempt_disable()/preempt_enable() in
      vprintk_emit(), because console_unlock() and console_trylock() are now
      smart enough:
       a) console_unlock() does not cond_resched() when it's unsafe
          (console_trylock() takes care of that)
       b) console_unlock() does the can_use_console() check.
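      
      The console_trylock() side of this can be sketched as:
      
      	int console_trylock(void)
      	{
      		if (down_trylock_console_sem())
      			return 0;
      		if (console_suspended) {
      			up_console_sem();
      			return 0;
      		}
      		console_locked = 1;
      		/*
      		 * Allow cond_resched() in console_unlock() only when it is
      		 * safe: never in an oops, and only in a preemptible context.
      		 * With !CONFIG_PREEMPT_COUNT, preemptible() is always false,
      		 * so this stays 0 there.
      		 */
      		console_may_schedule = !oops_in_progress &&
      				preemptible() &&
      				!rcu_preempt_depth();
      		return 1;
      	}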
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kyle McMartin <kyle@kernel.org>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Cc: Calvin Owens <calvinowens@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b97a20d
    • printk: move can_use_console() out of console_trylock_for_printk() · a8199371
      Committed by Sergey Senozhatsky
      console_unlock() is allowed to cond_resched() if its caller has set
      `console_may_schedule' to 1 (this functionality has been present since
      8d91f8b1 ("printk: do cond_resched() between lines while outputting
      to consoles")).
      
      The rules are:
      -- console_lock() always sets `console_may_schedule' to 1
      -- console_trylock() always sets `console_may_schedule' to 0
      
      printk() calls console_unlock() with preemption disabled, which
      basically can lead to RCU stalls, watchdog soft lockups, etc., if
      something is simultaneously calling printk() frequently enough (IOW,
      the console_sem owner always has new data to send to the console
      drivers and can't leave console_unlock() for a long time).
      
      printk()->console_trylock() callers do not necessarily execute in
      atomic contexts, and some of them can cond_resched() in
      console_unlock().  console_trylock() can set `console_may_schedule'
      to 1 (allow cond_resched() later in console_unlock()) when it's safe.
      
      This patch (of 3):
      
      vprintk_emit() disables preemption around console_trylock_for_printk()
      and console_unlock() calls for a strong reason -- can_use_console()
      check.  The thing is that vprintk_emit() can be called on a CPU that is
      not fully brought up yet (!cpu_online()), which potentially can cause
      problems if a console driver wants to access per-cpu data.  A console
      driver can explicitly state that it's safe to call it from a !online
      cpu by setting the CON_ANYTIME bit in console ->flags.  That's why for
      !cpu_online() can_use_console() iterates over all the consoles to find
      out if there is a CON_ANYTIME console; otherwise console_unlock() must
      be avoided.
      
      can_use_console() ensures that the console_unlock() call is safe in
      vprintk_emit() only; console_lock() and console_trylock() are not
      covered by this check.  Even though call_console_drivers(), invoked from
      console_cont_flush() and console_unlock(), tests `!cpu_online() &&
      CON_ANYTIME' for_each_console(), it may be too late, which can result
      in message loss.
      
      Assume that we have 2 cpus -- CPU0 is online, CPU1 is !online, and no
      CON_ANYTIME consoles are available.
      
      CPU0 online                        CPU1 !online
                                       console_trylock()
                                       ...
                                       console_unlock()
                                         console_cont_flush
                                           spin_lock logbuf_lock
                                           if (!cont.len) {
                                              spin_unlock logbuf_lock
                                              return
                                           }
                                         for (;;) {
      vprintk_emit
        spin_lock logbuf_lock
        log_store
        spin_unlock logbuf_lock
                                           spin_lock logbuf_lock
        !console_trylock_for_printk        msg_print_text
       return                              console_idx = log_next()
                                           console_seq++
                                           console_prev = msg->flags
                                           spin_unlock logbuf_lock
      
                                           call_console_drivers()
                                             for_each_console(con) {
                                               if (!cpu_online() &&
                                                   !(con->flags & CON_ANYTIME))
                                                       continue;
                                               }
                                         /*
                                          * no message printed, we lost it
                                          */
      vprintk_emit
        spin_lock logbuf_lock
        log_store
        spin_unlock logbuf_lock
        !console_trylock_for_printk
       return
                                         /*
                                          * go to the beginning of the loop,
                                          * find out there are new messages,
                                          * lose it
                                          */
                                         }
      
      A console_trylock()/console_lock() call on CPU1 may come from cpu
      notifiers registered on that CPU.  Since notifiers are not getting
      unregistered when a CPU goes DOWN, all of the notifiers receive
      notifications during CPU UP.  For example, on my x86_64, I see around
      50 notifications sent from an offline CPU to itself
      
       [swapper/2] from cpu:2 to:2 action:CPU_STARTING hotplug_hrtick
       [swapper/2] from cpu:2 to:2 action:CPU_STARTING blk_mq_main_cpu_notify
       [swapper/2] from cpu:2 to:2 action:CPU_STARTING blk_mq_queue_reinit_notify
       [swapper/2] from cpu:2 to:2 action:CPU_STARTING console_cpu_notify
      
      while doing
        echo 0 > /sys/devices/system/cpu/cpu2/online
        echo 1 > /sys/devices/system/cpu/cpu2/online
      
      So grabbing the console_sem lock while a CPU is !online is possible,
      in theory.
      
      This patch moves the can_use_console() check out of
      console_trylock_for_printk().  Instead it calls it in console_unlock(),
      so now console_lock()/console_unlock() are also 'protected' by
      can_use_console().  This also means that console_trylock_for_printk() is
      not really needed anymore and can be removed.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kyle McMartin <kyle@kernel.org>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Cc: Calvin Owens <calvinowens@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8199371
    • timer: convert timer_slack_ns from unsigned long to u64 · da8b44d5
      Committed by John Stultz
      This patchset introduces a /proc/<pid>/timerslack_ns interface which
      would allow controlling processes to set the timerslack value on other
      processes in order to save power by avoiding wakeups (something
      Android currently does via out-of-tree patches).
      
      The first patch tries to fix the internal timer_slack_ns usage, which
      was defined as a long, limiting the slack range to ~4 seconds on 32bit
      systems.  It converts it to a u64, which provides the same basically
      unlimited slack (500 years) on both 32bit and 64bit machines.
      
      The second patch introduces the /proc/<pid>/timerslack_ns interface
      which allows the full 64bit slack range for a task to be read or set on
      both 32bit and 64bit machines.
      
      With these two patches, on a 32bit machine, after setting the slack on
      bash to 10 seconds:
      
      $ time sleep 1
      
      real    0m10.747s
      user    0m0.001s
      sys     0m0.005s
      
      The first patch is a little ugly, since I had to chase the slack delta
      arguments through a number of functions converting them to u64s.  Let me
      know if it makes sense to break that up more or not.
      
      Other than that things are fairly straightforward.
      
      This patch (of 2):
      
      The timer_slack_ns value in the task struct is currently an unsigned
      long.  This means that for 32bit applications, the maximum slack is
      just over 4 seconds.  However, on 64bit machines, it's much, much
      larger (~500 years).
      
      This disparity could make application development a little confusing,
      so convert timer_slack_ns (as well as the default_slack) to a u64.
      This means both 32bit and 64bit systems have the same effective
      internal slack range.
      
      Now, the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK
      specifies the interface as an unsigned long, so we preserve that
      limitation on 32bit systems, where SET_TIMERSLACK can only set the
      slack to an unsigned long value, and GET_TIMERSLACK will return
      ULONG_MAX if the slack is actually larger than what can be stored by
      an unsigned long.
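      
      The clamp on the 32bit read side can be sketched as (in the prctl()
      handler):
      
      	case PR_GET_TIMERSLACK:
      		if (current->timer_slack_ns > ULONG_MAX)
      			error = ULONG_MAX;	/* u64 slack exceeds unsigned long */
      		else
      			error = current->timer_slack_ns;
      		break;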
      
      This patch also modifies hrtimer functions which specified the slack
      delta as an unsigned long.
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oren Laadan <orenl@cellrox.com>
      Cc: Ruchi Kandoi <kandoiruchi@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Android Kernel Team <kernel-team@android.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da8b44d5
    • mm: scale kswapd watermarks in proportion to memory · 795ae7a0
      Committed by Johannes Weiner
      On machines with 140G of memory and enterprise flash storage, we have
      seen read and write bursts routinely exceed the kswapd watermarks and
      cause thundering herds in direct reclaim.  Unfortunately, the only way
      to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
      system's emergency reserves - which is entirely unrelated to the
      system's latency requirements.  In order to get kswapd to maintain a
      250M buffer of free memory, the emergency reserves need to be set to 1G.
      That is a lot of memory wasted for no good reason.
      
      On the other hand, it's reasonable to assume that allocation bursts and
      overall allocation concurrency scale with memory capacity, so it makes
      sense to make kswapd aggressiveness a function of that as well.
      
      Change the kswapd watermark scale factor from the currently fixed 25% of
      the tunable emergency reserve to a tunable 0.1% of memory.
      
      Beyond 1G of memory, this will produce bigger watermark steps than the
      current formula in default settings.  Ensure that the new formula never
      chooses steps smaller than that, i.e.  25% of the emergency reserve.
      
      On a 140G machine, this raises the default watermark steps - the
      distance between min and low, and low and high - from 16M to 143M.
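      
      In __setup_per_zone_wmarks() terms, the step becomes roughly (sketch;
      watermark_scale_factor defaults to 10, i.e.  0.1% of the zone):
      
      	/*
      	 * Step = max(25% of the zone's min reserve,
      	 *            managed_pages * watermark_scale_factor / 10000)
      	 */
      	tmp = max_t(u64, tmp >> 2,
      		    mult_frac(zone->managed_pages,
      			      watermark_scale_factor, 10000));
      
      	zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
      	zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;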
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      795ae7a0
    • mm: memcontrol: report kernel stack usage in cgroup2 memory.stat · 12580e4b
      Committed by Vladimir Davydov
      Show how much memory is allocated to kernel stacks.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      12580e4b
    • watchdog: don't run proc_watchdog_update if new value is same as old · a1ee1932
      Committed by Joshua Hunt
      While working on a script to restore all sysctl params before a series
      of tests, I found that writing any value into the
      /proc/sys/kernel/{nmi_watchdog,soft_watchdog,watchdog,watchdog_thresh}
      files causes them to call proc_watchdog_update().
      
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
      
      There doesn't appear to be a reason for doing this work every time a write
      occurs, so only do it when the values change.
      Signed-off-by: Josh Hunt <johunt@akamai.com>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.1.x+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1ee1932
  3. 17 March 2016 (4 commits)
    • livepatch/module: remove livepatch module notifier · 7e545d6e
      Committed by Jessica Yu
      Remove the livepatch module notifier in favor of directly enabling and
      disabling patches to modules in the module loader. Hard-coding the
      function calls ensures that ftrace_module_enable() is run before
      klp_module_coming() during module load, and that klp_module_going() is
      run before ftrace_release_mod() during module unload. This way, ftrace
      and livepatch code is run in the correct order during the module
      load/unload sequence without dependence on the module notifier call chain.
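      
      The resulting load-path ordering can be sketched as (end state of the
      series, simplified):
      
      	static int prepare_coming_module(struct module *mod)
      	{
      		int err;
      
      		ftrace_module_enable(mod);	/* must precede klp_module_coming() */
      		err = klp_module_coming(mod);
      		if (err)
      			return err;
      
      		blocking_notifier_call_chain(&module_notify_list,
      					     MODULE_STATE_COMING, mod);
      		return 0;
      	}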
      Signed-off-by: Jessica Yu <jeyu@redhat.com>
      Reviewed-by: Petr Mladek <pmladek@suse.cz>
      Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      7e545d6e
    • modules: split part of complete_formation() into prepare_coming_module() · 4c973d16
      Committed by Jessica Yu
      Put all actions in complete_formation() that are performed after
      module->state is set to MODULE_STATE_COMING into a separate function
      prepare_coming_module(). This split prepares for the removal of the
      livepatch module notifiers in favor of hard-coding function calls to
      klp_module_{coming,going} in the module loader.
      
      The complete_formation -> prepare_coming_module split will also make error
      handling easier since we can jump to the appropriate error label to do any
      module GOING cleanup after all the COMING-actions have completed.
      Signed-off-by: Jessica Yu <jeyu@redhat.com>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Reviewed-by: Petr Mladek <pmladek@suse.cz>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      4c973d16
    • cgroup: avoid false positive gcc-6 warning · cfe02a8a
      Committed by Arnd Bergmann
      When all subsystems are disabled, gcc notices that cgroup_subsys_enabled_key
      is a zero-length array and that any access to it must be out of bounds:
      
      In file included from ../include/linux/cgroup.h:19:0,
                       from ../kernel/cgroup.c:31:
      ../kernel/cgroup.c: In function 'cgroup_add_cftypes':
      ../kernel/cgroup.c:261:53: error: array subscript is above array bounds [-Werror=array-bounds]
        return static_key_enabled(cgroup_subsys_enabled_key[ssid]);
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
      ../include/linux/jump_label.h:271:40: note: in definition of macro 'static_key_enabled'
        static_key_count((struct static_key *)x) > 0;    \
                                              ^
      
      We should never call the function in this particular case, so this is
      not a bug. In order to silence the warning, this adds an explicit check
      for the CGROUP_SUBSYS_COUNT==0 case.
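      
      The added check (sketch, matching the shape described above):
      
      	static bool cgroup_ssid_enabled(int ssid)
      	{
      		/* Constant-folds away when no subsystems are configured. */
      		if (CGROUP_SUBSYS_COUNT == 0)
      			return false;
      
      		return static_key_enabled(cgroup_subsys_enabled_key[ssid]);
      	}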
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      cfe02a8a
    • cgroup: ignore css_sets associated with dead cgroups during migration · 2b021cbf
      Committed by Tejun Heo
      Before 2e91fa7f ("cgroup: keep zombies associated with their
      original cgroups"), all dead tasks were associated with init_css_set.
      If a zombie task is requested for migration, while migration prep
      operations would still be performed on init_css_set, the actual
      migration would ignore zombie tasks.  As init_css_set is always valid,
      this worked fine.
      
      However, after 2e91fa7f, zombie tasks stay with the css_set they were
      associated with at the time of death.  Let's say a task T is associated
      with cgroup A on hierarchy H-1 and cgroup B on hierarchy H-2.  After T
      becomes a zombie, it would still remain associated with A and B.  If A
      only contains zombie tasks, it can be removed.  On removal, A gets
      marked offline but stays pinned until all zombies are drained.  At
      this point, if migration is initiated on T to a cgroup C on hierarchy
      H-2, migration path would try to prepare T's css_set for migration and
      trigger the following.
      
       WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
       CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
       ...
       Call Trace:
        [<ffffffff8127e63c>] dump_stack+0x4e/0x82
        [<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
        [<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
        [<ffffffff810c33e1>] cgroup_get+0x121/0x160
        [<ffffffff810c349b>] link_css_set+0x7b/0x90
        [<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
        [<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
        [<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
        [<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
        [<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
        [<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
        [<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
        [<ffffffff81151673>] __vfs_write+0x23/0xe0
        [<ffffffff81152494>] vfs_write+0xa4/0x1a0
        [<ffffffff811532d4>] SyS_write+0x44/0xa0
        [<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      It doesn't make sense to prepare migration for css_sets pointing to
      dead cgroups as they are guaranteed to contain only zombies which are
      ignored later during migration.  This patch makes cgroup destruction
      path mark all affected css_sets as dead and updates the migration path
      to ignore them during preparation.
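      
      The mechanics are a flag on the css_set plus an early-out in the
      migration prep path; a sketch:
      
      	/* in cgroup_destroy_locked(): mark every css_set linked to the cgroup */
      	list_for_each_entry(link, &cgrp->cset_links, cset_link)
      		link->cset->dead = true;
      
      	/* in cgroup_migrate_add_src(): such css_sets contain only zombies */
      	if (src_cset->dead)
      		return;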
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 2e91fa7f ("cgroup: keep zombies associated with their original cgroups")
      Cc: stable@vger.kernel.org # v4.4+
      2b021cbf
  4. 16 March 2016 (4 commits)
    • kallsyms: add support for relative offsets in kallsyms address table · 2213e9a6
      Committed by Ard Biesheuvel
      Similar to how relative extables are implemented, it is possible to emit
      the kallsyms table in such a way that it contains offsets relative to
      some anchor point in the kernel image rather than absolute addresses.
      
      On 64-bit architectures, it cuts the size of the kallsyms address table
      in half, since offsets between kernel symbols can typically be expressed
      in 32 bits.  This saves several hundreds of kilobytes of permanent
      .rodata on average.  In addition, the kallsyms address table is no
      longer subject to dynamic relocation when CONFIG_RELOCATABLE is in
      effect, so the relocation work done after decompression now doesn't have
      to do relocation updates for all these values.  This saves up to 24
      bytes (i.e., the size of an ELF64 RELA relocation table entry) per value,
      which easily adds up to a couple of megabytes of uncompressed __init
      data on ppc64 or arm64.  Even if these relocation entries typically
      compress well, the combined size reduction of 2.8 MB uncompressed for a
      ppc64_defconfig build (of which 2.4 MB is __init data) results in a ~500
      KB space saving in the compressed image.
      
      Since it is useful for some architectures (like x86) to retain the
      ability to emit absolute values as well, this patch also adds support
      for capturing both absolute and relative values when
      KALLSYMS_ABSOLUTE_PERCPU is in effect, by emitting absolute per-cpu
      addresses as positive 32-bit values, and addresses relative to the
      lowest encountered relative symbol as negative values, which are
      subtracted from the runtime address of this base symbol to produce the
      actual address.
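      
      Runtime decoding can then be sketched as (names follow the patch;
      simplified):
      
      	static unsigned long kallsyms_sym_address(int idx)
      	{
      		if (!IS_ENABLED(CONFIG_KALLSYMS_BASE_RELATIVE))
      			return kallsyms_addresses[idx];
      
      		/* values are unsigned offsets without --absolute-percpu */
      		if (!IS_ENABLED(CONFIG_KALLSYMS_ABSOLUTE_PERCPU))
      			return kallsyms_relative_base + (u32)kallsyms_offsets[idx];
      
      		/* ...otherwise, positive offsets are absolute values */
      		if (kallsyms_offsets[idx] >= 0)
      			return kallsyms_offsets[idx];
      
      		/* ...and negative offsets are relative to the base symbol */
      		return kallsyms_relative_base - 1 - kallsyms_offsets[idx];
      	}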
      
      Support for the above is enabled by default for all architectures except
      IA-64 and Tile-GX, whose symbols are too far apart to capture in this
      manner.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2213e9a6
    • mm/page_poisoning.c: allow for zero poisoning · 1414c7f4
      Committed by Laura Abbott
      By default, page poisoning uses a poison value (0xaa) on free.  If this
      is changed to 0, the page is not only sanitized but zeroing on alloc
      with __GFP_ZERO can be skipped as well.  The tradeoff is that
      corruption from the poisoning is harder to detect.  This feature also
      cannot be used with hibernation since pages are not guaranteed to be
      zeroed after hibernation.
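      
      The alloc-side shortcut can be sketched as (helper name per the patch;
      simplified):
      
      	static inline bool free_pages_prezeroed(bool poisoned)
      	{
      		/* Pages are already zero if they were poisoned with 0. */
      		return IS_ENABLED(CONFIG_PAGE_POISONING_ZERO) &&
      		       page_poisoning_enabled() && poisoned;
      	}
      
      	/* in prep_new_page(): */
      	if (!free_pages_prezeroed(poisoned) && (gfp_flags & __GFP_ZERO))
      		for (i = 0; i < (1 << order); i++)
      			clear_highpage(page + i);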
      
      Credit to the Grsecurity/PaX team for inspiring this work.
      Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
      Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1414c7f4
    • mm: fix two typos in comments for to_vmem_altmap() · 07061aab
      Committed by Andreas Ziegler
      Commit 4b94ffdc ("x86, mm: introduce vmem_altmap to augment
      vmemmap_populate()"), introduced the to_vmem_altmap() function.
      
      The comments in this function contain two typos (one misspelling of the
      Kconfig option CONFIG_SPARSEMEM_VMEMMAP, and one missing letter 'n');
      let's fix them up.
      Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      07061aab
    • tags: Fix DEFINE_PER_CPU expansions · 25528213
      Committed by Peter Zijlstra
      $ make tags
        GEN     tags
      ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
      ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
      ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
      ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
      ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
      ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
      ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"
      
      Which are all the result of the DEFINE_PER_CPU pattern:
      
        scripts/tags.sh:200:	'/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/'
        scripts/tags.sh:201:	'/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/'
      
      The below cures them.  All except the workqueue one are within a
      reasonable distance of the 80 char limit.  TJ, do you have any
      preference on how to fix the wq one, or shall we just not care that
      it's too long?
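      
      The fix is mechanical: keep the variable name on the same line as the
      macro so the single-line pattern can see it.  Illustratively (the
      declaration here is hypothetical):
      
      	/* Before: ctags' one-line regex never sees the name. */
      	static DEFINE_PER_CPU(struct some_state,
      			      my_percpu_state);
      
      	/* After: the name is on the same line, and a tag is generated. */
      	static DEFINE_PER_CPU(struct some_state, my_percpu_state);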
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25528213
  5. 13 March 2016 (1 commit)
  6. 12 March 2016 (1 commit)
  7. 11 March 2016 (2 commits)
    • cpu/hotplug: Fix smpboot thread ordering · 2a58c527
      Committed by Thomas Gleixner
      Commit 931ef163 moved the smpboot thread park/unpark invocation to the
      state machine. The move of the unpark invocation was premature as it
      depends on work-in-progress patches.
      
      As a result cpu down can fail, because rcu synchronization in takedown_cpu()
      eventually requires a functional softirq thread. I never encountered the
      problem in testing, but 0day testing managed to provide a reliable reproducer.
      
      Remove the smpboot_threads_park() call from the state machine for now and put
      it back into the original place after the rcu synchronization.
      
      I'm embarrassed as I knew about the dependency and still managed to get it
      wrong. Hotplug induced brain melt seems to be the only sensible explanation
      for that.
      
      Fixes: 931ef163 "cpu/hotplug: Unpark smpboot threads from the state machine"
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      2a58c527
    • cpufreq: Move scheduler-related code to the sched directory · adaf9fcd
      Committed by Rafael J. Wysocki
      Create cpufreq.c under kernel/sched/ and move the cpufreq code
      related to the scheduler to that file and to sched.h.
      
      Redefine cpufreq_update_util() as a static inline function to avoid
      function calls at its call sites in the scheduler code (as suggested
      by Peter Zijlstra).
      
      Also move the definition of struct update_util_data and declaration
      of cpufreq_set_update_util_data() from include/linux/cpufreq.h to
      include/linux/sched.h.
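      
      The inline wrapper in sched.h then looks roughly like (sketch
      following the description above):
      
      	#ifdef CONFIG_CPU_FREQ
      	DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
      
      	/* Inline to avoid a function call from scheduler hot paths. */
      	static inline void cpufreq_update_util(u64 time, unsigned long util,
      					       unsigned long max)
      	{
      		struct update_util_data *data;
      
      		data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
      		if (data)
      			data->func(data, time, util, max);
      	}
      	#endif /* CONFIG_CPU_FREQ */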
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      adaf9fcd
  8. 10 March 2016 (9 commits)