1. 19 12月, 2015 3 次提交
    • H
      kexec: Fix race between panic() and crash_kexec() · 7bbee5ca
      Hidehiro Kawai 提交于
      Currently, panic() and crash_kexec() can be called at the same time.
      For example (x86 case):
      
      CPU 0:
        oops_end()
          crash_kexec()
            mutex_trylock() // acquired
              nmi_shootdown_cpus() // stop other CPUs
      
      CPU 1:
        panic()
          crash_kexec()
            mutex_trylock() // failed to acquire
          smp_send_stop() // stop other CPUs
          infinite loop
      
      If CPU 1 calls smp_send_stop() before nmi_shootdown_cpus(), kdump
      fails.
      
      In another case:
      
      CPU 0:
        oops_end()
          crash_kexec()
            mutex_trylock() // acquired
              <NMI>
              io_check_error()
                panic()
                  crash_kexec()
                    mutex_trylock() // failed to acquire
                  infinite loop
      
      Clearly, this is an undesirable result.
      
      To fix this problem, this patch changes crash_kexec() to exclude others
      by using the panic_cpu atomic.
      Signed-off-by: NHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: kexec@lists.infradead.org
      Cc: linux-doc@vger.kernel.org
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Minfei Huang <mnfhuang@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: x86-ml <x86@kernel.org>
      Link: http://lkml.kernel.org/r/20151210014630.25437.94161.stgit@softrsSigned-off-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      7bbee5ca
    • H
      panic, x86: Allow CPUs to save registers even if looping in NMI context · 58c5661f
      Hidehiro Kawai 提交于
      Currently, kdump_nmi_shootdown_cpus(), a subroutine of crash_kexec(),
      sends an NMI IPI to CPUs which haven't called panic() to stop them,
      save their register information and do some cleanups for crash dumping.
      However, if such a CPU is infinitely looping in NMI context, we fail to
      save its register information into the crash dump.
      
      For example, this can happen when unknown NMIs are broadcast to all
      CPUs as follows:
      
        CPU 0                             CPU 1
        ===========================       ==========================
        receive an unknown NMI
        unknown_nmi_error()
          panic()                         receive an unknown NMI
            spin_trylock(&panic_lock)     unknown_nmi_error()
            crash_kexec()                   panic()
                                              spin_trylock(&panic_lock)
                                              panic_smp_self_stop()
                                                infinite loop
              kdump_nmi_shootdown_cpus()
                issue NMI IPI -----------> blocked until IRET
                                                infinite loop...
      
      Here, since CPU 1 is in NMI context, the second NMI from CPU 0 is
      blocked until CPU 1 executes IRET. However, CPU 1 never executes IRET,
      so the NMI is not handled and the callback function to save registers is
      never called.
      
      In practice, this can happen on some servers which broadcast NMIs to all
      CPUs when the NMI button is pushed.
      
      To save registers in this case, we need to:
      
        a) Return from NMI handler instead of looping infinitely
        or
        b) Call the callback function directly from the infinite loop
      
      Inherently, a) is risky because NMI is also used to prevent corrupted
      data from being propagated to devices.  So, we chose b).
      
      This patch does the following:
      
      1. Move the infinite looping of CPUs which haven't called panic() in NMI
         context (actually done by panic_smp_self_stop()) outside of panic() to
         enable us to refer pt_regs. Please note that panic_smp_self_stop() is
         still used for normal context.
      
      2. Call a callback of kdump_nmi_shootdown_cpus() directly to save
         registers and do some cleanups after setting waiting_for_crash_ipi which
         is used for counting down the number of CPUs which handled the callback
      Signed-off-by: NHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Javi Merino <javi.merino@arm.com>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: kexec@lists.infradead.org
      Cc: linux-doc@vger.kernel.org
      Cc: lkml <linux-kernel@vger.kernel.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Stefan Lippers-Hollmann <s.l-h@gmx.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Link: http://lkml.kernel.org/r/20151210014628.25437.75256.stgit@softrs
      [ Cleanup comments, fixup formatting. ]
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      58c5661f
    • H
      panic, x86: Fix re-entrance problem due to panic on NMI · 1717f209
      Hidehiro Kawai 提交于
      If panic on NMI happens just after panic() on the same CPU, panic() is
      recursively called. Kernel stalls, as a result, after failing to acquire
      panic_lock.
      
      To avoid this problem, don't call panic() in NMI context if we've
      already entered panic().
      
      For that, introduce nmi_panic() macro to reduce code duplication. In
      the case of panic on NMI, don't return from NMI handlers if another CPU
      already panicked.
      Signed-off-by: NHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Javi Merino <javi.merino@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: kexec@lists.infradead.org
      Cc: linux-doc@vger.kernel.org
      Cc: lkml <linux-kernel@vger.kernel.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Link: http://lkml.kernel.org/r/20151210014626.25437.13302.stgit@softrs
      [ Cleanup comments, fixup formatting. ]
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      1717f209
  2. 18 12月, 2015 1 次提交
  3. 14 12月, 2015 1 次提交
  4. 13 12月, 2015 1 次提交
    • C
      kernel: remove stop_machine() Kconfig dependency · 86fffe4a
      Chris Wilson 提交于
      Currently the full stop_machine() routine is only enabled on SMP if
      module unloading is enabled, or if the CPUs are hotpluggable.  This
      leads to configurations where stop_machine() is broken as it will then
      only run the callback on the local CPU with irqs disabled, and not stop
      the other CPUs or run the callback on them.
      
      For example, this breaks MTRR setup on x86 in certain configs since
      ea8596bb ("kprobes/x86: Remove unused text_poke_smp() and
      text_poke_smp_batch() functions") as the MTRR is only established on the
      boot CPU.
      
      This patch removes the Kconfig option for STOP_MACHINE and uses the SMP
      and HOTPLUG_CPU config options to compile the correct stop_machine() for
      the architecture, removing the false dependency on MODULE_UNLOAD in the
      process.
      
      Link: https://lkml.org/lkml/2014/10/8/124
      References: https://bugs.freedesktop.org/show_bug.cgi?id=84794Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Iulia Manda <iulia.manda21@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86fffe4a
  5. 06 12月, 2015 1 次提交
    • J
      perf: Do not send exit event twice · 4e93ad60
      Jiri Olsa 提交于
      In case we monitor events system wide, we get EXIT event
      (when configured) twice for each task that exited.
      
      Note doubled lines with same pid/tid in following example:
      
        $ sudo ./perf record -a
        ^C[ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.480 MB perf.data (2518 samples) ]
        $ sudo ./perf report -D | grep EXIT
      
        0 60290687567581 0x59910 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
        0 60290687568354 0x59948 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
        0 60290687988744 0x59ad8 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
        0 60290687989198 0x59b10 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
        1 60290692567895 0x62af0 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253)
        1 60290692568322 0x62b28 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253)
        2 60290692739276 0x69a18 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252)
        2 60290692739910 0x69a50 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252)
      
      The reason is that the cpu contexts are processes each time
      we call perf_event_task. I'm changing the perf_event_aux logic
      to serve task_ctx and cpu contexts separately, which ensure we
      don't get EXIT event generated twice on same cpu context.
      
      This does not affect other auxiliary events, as they don't
      use task_ctx at all.
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1446649205-5822-1-git-send-email-jolsa@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      4e93ad60
  6. 04 12月, 2015 7 次提交
  7. 03 12月, 2015 4 次提交
    • T
      cgroup_pids: don't account for the root cgroup · 67cde9c4
      Tejun Heo 提交于
      Because accounting resources for the root cgroup sometimes incurs
      measureable overhead for workloads which don't care about cgroup and
      often ends up calculating a number which is available elsewhere in a
      slightly different form, cgroup is not in the business of providing
      system-wide statistics.  The pids controller which was introduced
      recently was exposing "pids.current" at the root.  This patch disable
      accounting for root cgroup and removes the file from the root
      directory.
      
      While this is a userland visible behavior change, pids has been
      available only in one version and was badly broken there, so I don't
      think this will be noticeable.  If it turns out to be a problem, we
      can reinstate it for v1 hierarchies.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      67cde9c4
    • T
      cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Tejun Heo 提交于
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B and A's processes should be moved to
      the former and B's processes the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated which is the parent of the
      target csses.  pids controller depends on the migration methods to
      move charges and this made the controller attribute charges to the
      wrong csses often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing @css parameter from the three
      migration methods, ->can_attach, ->cancel_attach() and ->attach() and
      updating cgroup_taskset iteration helpers also return the destination
      css in addition to the task being migrated.  All controllers are
      updated accordingly.
      
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there's single source
        and destination and thus doesn't support v2 hierarchy already.  The
        only change made by this patchset is how that single destination css
        is obtained.
      
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accomodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      1f7dd3e5
    • T
      cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in freezer_attach() · 599c963a
      Tejun Heo 提交于
      If one or more tasks get moved into a frozen css, the frozen state is
      cleared up from the destination css so that it can be reasserted once
      the migrated tasks are frozen.  freezer_attach() implements this in
      two separate steps - clearing CGROUP_FROZEN on the target css while
      processing each task and propagating the clearing upwards after the
      task loop is done if necessary.
      
      This patch merges the two steps.  Propagation now takes place inside
      the task loop.  This simplifies the code and prepares it for the fix
      of multi-destination migration.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      599c963a
    • A
      bpf: fix allocation warnings in bpf maps and integer overflow · 01b3f521
      Alexei Starovoitov 提交于
      For large map->value_size the user space can trigger memory allocation warnings like:
      WARNING: CPU: 2 PID: 11122 at mm/page_alloc.c:2989
      __alloc_pages_nodemask+0x695/0x14e0()
      Call Trace:
       [<     inline     >] __dump_stack lib/dump_stack.c:15
       [<ffffffff82743b56>] dump_stack+0x68/0x92 lib/dump_stack.c:50
       [<ffffffff81244ec9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460
       [<ffffffff812450f9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:493
       [<     inline     >] __alloc_pages_slowpath mm/page_alloc.c:2989
       [<ffffffff81554e95>] __alloc_pages_nodemask+0x695/0x14e0 mm/page_alloc.c:3235
       [<ffffffff816188fe>] alloc_pages_current+0xee/0x340 mm/mempolicy.c:2055
       [<     inline     >] alloc_pages include/linux/gfp.h:451
       [<ffffffff81550706>] alloc_kmem_pages+0x16/0xf0 mm/page_alloc.c:3414
       [<ffffffff815a1c89>] kmalloc_order+0x19/0x60 mm/slab_common.c:1007
       [<ffffffff815a1cef>] kmalloc_order_trace+0x1f/0xa0 mm/slab_common.c:1018
       [<     inline     >] kmalloc_large include/linux/slab.h:390
       [<ffffffff81627784>] __kmalloc+0x234/0x250 mm/slub.c:3525
       [<     inline     >] kmalloc include/linux/slab.h:463
       [<     inline     >] map_update_elem kernel/bpf/syscall.c:288
       [<     inline     >] SYSC_bpf kernel/bpf/syscall.c:744
      
      To avoid never succeeding kmalloc with order >= MAX_ORDER check that
      elem->value_size and computed elem_size are within limits for both hash and
      array type maps.
      Also add __GFP_NOWARN to kmalloc(value_size | elem_size) to avoid OOM warnings.
      Note kmalloc(key_size) is highly unlikely to trigger OOM, since key_size <= 512,
      so keep those kmalloc-s as-is.
      
      Large value_size can cause integer overflows in elem_size and map.pages
      formulas, so check for that as well.
      
      Fixes: aaac3ba9 ("bpf: charge user for creation of BPF maps and programs")
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      01b3f521
  8. 02 12月, 2015 2 次提交
    • D
      bpf, array: fix heap out-of-bounds access when updating elements · fbca9d2d
      Daniel Borkmann 提交于
      During own review but also reported by Dmitry's syzkaller [1] it has been
      noticed that we trigger a heap out-of-bounds access on eBPF array maps
      when updating elements. This happens with each map whose map->value_size
      (specified during map creation time) is not multiple of 8 bytes.
      
      In array_map_alloc(), elem_size is round_up(attr->value_size, 8) and
      used to align array map slots for faster access. However, in function
      array_map_update_elem(), we update the element as ...
      
      memcpy(array->value + array->elem_size * index, value, array->elem_size);
      
      ... where we access 'value' out-of-bounds, since it was allocated from
      map_update_elem() from syscall side as kmalloc(map->value_size, GFP_USER)
      and later on copied through copy_from_user(value, uvalue, map->value_size).
      Thus, up to 7 bytes, we can access out-of-bounds.
      
      Same could happen from within an eBPF program, where in worst case we
      access beyond an eBPF program's designated stack.
      
      Since 1be7f75d ("bpf: enable non-root eBPF programs") didn't hit an
      official release yet, it only affects priviledged users.
      
      In case of array_map_lookup_elem(), the verifier prevents eBPF programs
      from accessing beyond map->value_size through check_map_access(). Also
      from syscall side map_lookup_elem() only copies map->value_size back to
      user, so nothing could leak.
      
        [1] http://github.com/google/syzkaller
      
      Fixes: 28fbcfa0 ("bpf: add array type of eBPF maps")
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fbca9d2d
    • S
      tracing: Add sched_wakeup_new and sched_waking tracepoints for pid filter · 0f72e37e
      Steven Rostedt (Red Hat) 提交于
      The set_event_pid filter relies on attaching to the sched_switch and
      sched_wakeup tracepoints to see if it should filter the tracing on schedule
      tracepoints. By adding the callbacks to sched_wakeup, pids in the
      set_event_pid file will trace the wakeups of those tasks with those pids.
      
      But sched_wakeup_new and sched_waking were missed. These two should also be
      traced. Luckily, these tracepoints share the same class as sched_wakeup
      which means they can use the same pre and post callbacks as sched_wakeup
      does.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      0f72e37e
  9. 30 11月, 2015 3 次提交
    • O
      cgroup: pids: kill pids_fork(), simplify pids_can_fork() and pids_cancel_fork() · afbcb364
      Oleg Nesterov 提交于
      Now that we know that the forking task can't migrate amd the child is always
      moved to the same cgroup by cgroup_post_fork()->css_set_move_task() we can
      change pids_can_fork() and pids_cancel_fork() to just use task_css(current).
      And since we no longer need to pin this css, we can remove pid_fork().
      
      Note: the patch uses task_css_check(true), perhaps it makes sense to add a
      helper or change task_css_set_check() to take cgroup_threadgroup_rwsem into
      account.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NZefan Li <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      afbcb364
    • O
      cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate() · c9e75f04
      Oleg Nesterov 提交于
      If the new child migrates to another cgroup before cgroup_post_fork() calls
      subsys->fork(), then both pids_can_attach() and pids_fork() will do the same
      pids_uncharge(old_pids) + pids_charge(pids) sequence twice.
      
      Change copy_process() to call threadgroup_change_begin/threadgroup_change_end
      unconditionally. percpu_down_read() is cheap and this allows other cleanups,
      see the next changes.
      
      Also, this way we can unify cgroup_threadgroup_rwsem and dup_mmap_sem.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NZefan Li <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      c9e75f04
    • T
      cgroup: make css_set pin its css's to avoid use-afer-free · 53254f90
      Tejun Heo 提交于
      A css_set represents the relationship between a set of tasks and
      css's.  css_set never pinned the associated css's.  This was okay
      because tasks used to always disassociate immediately (in RCU sense) -
      either a task is moved to a different css_set or exits and never
      accesses css_set again.
      
      Unfortunately, afcf6c8b ("cgroup: add cgroup_subsys->free() method
      and use it to fix pids controller") and patches leading up to it made
      a zombie hold onto its css_set and deref the associated css's on its
      release.  Nothing pins the css's after exit and it might have already
      been freed leading to use-after-free.
      
       general protection fault: 0000 [#1] PREEMPT SMP
       task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000
       RIP: 0010:[<ffffffff810fa205>]  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
       ...
       Call Trace:
        <IRQ>
        [<ffffffff810fb02d>] ? pids_free+0x3d/0xa0
        [<ffffffff810f8893>] cgroup_free+0x53/0xe0
        [<ffffffff8104ed62>] __put_task_struct+0x42/0x130
        [<ffffffff81053557>] delayed_put_task_struct+0x77/0x130
        [<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820
        [<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820
        [<ffffffff81056e54>] __do_softirq+0xd4/0x460
        [<ffffffff81057369>] irq_exit+0x89/0xa0
        [<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50
        [<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90
        <EOI>
       ...
       Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02
       RIP  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
        RSP <ffff88001fc03e20>
       ---[ end trace 89a4a4b916b90c49 ]---
       Kernel panic - not syncing: Fatal exception in interrupt
       Kernel Offset: disabled
       ---[ end Kernel panic - not syncing: Fatal exception in interrupt
      
      Fix it by making css_set pin the associate css's until its release.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Reported-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Link: http://lkml.kernel.org/g/20151120041836.GA18390@codemonkey.org.uk
      Link: http://lkml.kernel.org/g/5652D448.3080002@bmw-carit.de
      Fixes: afcf6c8b ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")
      53254f90
  10. 26 11月, 2015 1 次提交
    • D
      bpf: fix clearing on persistent program array maps · c9da161c
      Daniel Borkmann 提交于
      Currently, when having map file descriptors pointing to program arrays,
      there's still the issue that we unconditionally flush program array
      contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
      when such a file descriptor is released and is independent of the map's
      refcount.
      
      Having this flush independent of the refcount is for a reason: there
      can be arbitrary complex dependency chains among tail calls, also circular
      ones (direct or indirect, nesting limit determined during runtime), and
      we need to make sure that the map drops all references to eBPF programs
      it holds, so that the map's refcount can eventually drop to zero and
      initiate its freeing. Btw, a walk of the whole dependency graph would
      not be possible for various reasons, one being complexity and another
      one inconsistency, i.e. new programs can be added to parts of the graph
      at any time, so there's no guaranteed consistent state for the time of
      such a walk.
      
      Now, the program array pinning itself works, but the issue is that each
      derived file descriptor on close would nevertheless call unconditionally
      into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
      this flush until the last reference to a user is dropped. As this only
      concerns a subset of references (f.e. a prog array could hold a program
      that itself has reference on the prog array holding it, etc), we need to
      track them separately.
      
      Short analysis on the refcounting: on map creation time usercnt will be
      one, so there's no change in behaviour for bpf_map_release(), if unpinned.
      If we already fail in map_create(), we are immediately freed, and no
      file descriptor has been made public yet. In bpf_obj_pin_user(), we need
      to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
      reference, so before we drop the reference on the fd with fdput().
      Therefore, if actual pinning fails, we need to drop that reference again
      in bpf_any_put(), otherwise we keep holding it. When last reference
      drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
      care of dropping the usercnt again. In the bpf_obj_get_user() case, the
      bpf_any_get() will grab a reference on the usercnt, still at a time when
      we have the reference on the path. Should we later on fail to grab a new
      file descriptor, bpf_any_put() will drop it, otherwise we hold it until
      bpf_map_release() time.
      
      Joint work with Alexei.
      
      Fixes: b2197755 ("bpf: add support for persistent maps/progs")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9da161c
  11. 25 11月, 2015 1 次提交
  12. 24 11月, 2015 2 次提交
    • S
      ring-buffer: Put back the length if crossed page with add_timestamp · bd1b7cd3
      Steven Rostedt (Red Hat) 提交于
      Commit fcc742ea "ring-buffer: Add event descriptor to simplify passing
      data" added a descriptor that holds various data instead of passing around
      several variables through parameters. The problem was that one of the
      parameters was modified in a function and the code was designed not to have
      an effect on that modified  parameter. Now that the parameter is a
      descriptor and any modifications to it are non-volatile, the size of the
      data could be unnecessarily expanded.
      
      Remove the extra space added if a timestamp was added and the event went
      across the page.
      
      Cc: stable@vger.kernel.org # 4.3+
      Fixes: fcc742ea "ring-buffer: Add event descriptor to simplify passing data"
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      bd1b7cd3
    • S
      ring-buffer: Update read stamp with first real commit on page · b81f472a
      Steven Rostedt (Red Hat) 提交于
      Do not update the read stamp after swapping out the reader page from the
      write buffer. If the reader page is swapped out of the buffer before an
      event is written to it, then the read_stamp may get an out of date
      timestamp, as the page timestamp is updated on the first commit to that
      page.
      
      rb_get_reader_page() only returns a page if it has an event on it, otherwise
      it will return NULL. At that point, check if the page being returned has
      events and has not been read yet. Then at that point update the read_stamp
      to match the time stamp of the reader page.
      
      Cc: stable@vger.kernel.org # 2.6.30+
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      b81f472a
  13. 23 11月, 2015 4 次提交
  14. 21 11月, 2015 2 次提交
    • V
      kernel/panic.c: turn off locks debug before releasing console lock · 7625b3a0
      Vitaly Kuznetsov 提交于
      Commit 08d78658 ("panic: release stale console lock to always get the
      logbuf printed out") introduced an unwanted bad unlock balance report when
      panic() is called directly and not from OOPS (e.g.  from out_of_memory()).
      The difference is that in case of OOPS we disable locks debug in
      oops_enter() and on direct panic call nobody does that.
      
      Fixes: 08d78658 ("panic: release stale console lock to always get the logbuf printed out")
      Reported-by: Nkernel test robot <ying.huang@linux.intel.com>
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7625b3a0
    • R
      kernel/signal.c: unexport sigsuspend() · 9d8a7652
      Richard Weinberger 提交于
      sigsuspend() is nowhere used except in signal.c itself, so we can mark it
      static do not pollute the global namespace.
      
      But this patch is more than a boring cleanup patch, it fixes a real issue
      on UserModeLinux.  UML has a special console driver to display ttys using
      xterm, or other terminal emulators, on the host side.  Vegard reported
      that sometimes UML is unable to spawn a xterm and he's facing the
      following warning:
      
        WARNING: CPU: 0 PID: 908 at include/linux/thread_info.h:128 sigsuspend+0xab/0xc0()
      
      It turned out that this warning makes absolutely no sense as the UML
      xterm code calls sigsuspend() on the host side, at least it tries.  But
      as the kernel itself offers a sigsuspend() symbol the linker choose this
      one instead of the glibc wrapper.  Interestingly this code used to work
      since ever but always blocked signals on the wrong side.  Some recent
      kernel change made the WARN_ON() trigger and uncovered the bug.
      
      It is a wonderful example of how much works by chance on computers. :-)
      
      Fixes: 68f3f16d ("new helper: sigsuspend()")
      Signed-off-by: NRichard Weinberger <richard@nod.at>
      Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
      Tested-by: NVegard Nossum <vegard.nossum@oracle.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.5+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d8a7652
  15. 16 11月, 2015 1 次提交
    • T
      cgroup: fix cftype->file_offset handling · 34c06254
      Tejun Heo 提交于
      6f60eade ("cgroup: generalize obtaining the handles of and
      notifying cgroup files") introduced cftype->file_offset so that the
      handles for per-css file instances can be recorded.  These handles
      then can be used, for example, to generate file modified
      notifications.
      
      Unfortunately, it made the wrong assumption that files are created
      once for a given css and removed on its destruction.  Due to the
      dependencies among subsystems, a css may be hidden from userland and
      then later shown again.  This is implemented by removing and
      re-creating the affected files, so the associated kernfs_node for a
      given cgroup file may change over time.  This incorrect assumption led
      to the corruption of css->files lists.
      
      Reimplement cftype->file_offset handling so that cgroup_file->kn is
      protected by a lock and updated as files are created and destroyed.
      This also makes keeping them on per-cgroup list unnecessary.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJames Sedgwick <jsedgwick@fb.com>
      Fixes: 6f60eade ("cgroup: generalize obtaining the handles of and notifying cgroup files")
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      34c06254
  16. 12 11月, 2015 1 次提交
  17. 11 11月, 2015 1 次提交
    • S
      bpf_trace: Make dependent on PERF_EVENTS · a31d82d8
      Steven Rostedt 提交于
      Arnd Bergmann reported:
      
        In my ARM randconfig tests, I'm getting a build error for
        newly added code in bpf_perf_event_read and bpf_perf_event_output
        whenever CONFIG_PERF_EVENTS is disabled:
      
        kernel/trace/bpf_trace.c: In function 'bpf_perf_event_read':
        kernel/trace/bpf_trace.c:203:11: error: 'struct perf_event' has no member named 'oncpu'
        if (event->oncpu != smp_processor_id() ||
                 ^
        kernel/trace/bpf_trace.c:204:11: error: 'struct perf_event' has no member named 'pmu'
              event->pmu->count)
      
        This can happen when UPROBE_EVENT is enabled but KPROBE_EVENT
        is disabled. I'm not sure if that is a configuration we care
        about, otherwise we could prevent this case from occuring by
        adding Kconfig dependencies.
      
      Looking at this further, it's really that UPROBE_EVENT enables PERF_EVENTS.
      By just having BPF_EVENTS depend on PERF_EVENTS, then all is fine.
      
      Link: http://lkml.kernel.org/r/4525348.Aq9YoXkChv@wuerfelReported-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a31d82d8
  18. 10 11月, 2015 4 次提交