1. 10 May 2019, 2 commits
    • genirq: Prevent use-after-free and work list corruption · 33f2aa87
      Committed by Prasad Sodagudi
      [ Upstream commit 59c39840f5abf4a71e1810a8da71aaccd6c17d26 ]
      
      When irq_set_affinity_notifier() replaces the notifier, the
      reference count on the old notifier is dropped, which causes it to
      be freed. But nothing ensures that the old notifier is no longer
      queued on the work list. If it is still queued, this results in a
      use-after-free and possibly in work list corruption.
      
      Ensure that the work is canceled before the reference is dropped.
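
      A rough sketch of that ordering in irq_set_affinity_notifier()
      (hedged; the exact shape of the upstream patch may differ):

        if (old_notify) {
                /* Make sure the old notifier's work is neither queued
                 * nor running before the last reference is dropped. */
                cancel_work_sync(&old_notify->work);
                kref_put(&old_notify->kref, old_notify->release);
        }
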
      Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: marc.zyngier@arm.com
      Link: https://lkml.kernel.org/r/1553439424-6529-1-git-send-email-psodagud@codeaurora.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf/core: Fix perf_event_disable_inatomic() race · 42638d6a
      Committed by Peter Zijlstra
      [ Upstream commit 1d54ad944074010609562da5c89e4f5df2f4e5db ]
      
      Thomas-Mich Richter reported he triggered a WARN()ing from event_function_local()
      on his s390. The problem boils down to:
      
      	CPU-A				CPU-B
      
      	perf_event_overflow()
      	  perf_event_disable_inatomic()
      	    @pending_disable = 1
      	    irq_work_queue();
      
      	sched-out
      	  event_sched_out()
      	    @pending_disable = 0
      
      					sched-in
      					perf_event_overflow()
      					  perf_event_disable_inatomic()
      					    @pending_disable = 1;
      					    irq_work_queue(); // FAILS
      
      	irq_work_run()
      	  perf_pending_event()
      	    if (@pending_disable)
      	      perf_event_disable_local(); // WHOOPS
      
      The problem exists in generic code, but s390 is particularly
      sensitive because it doesn't implement arch_irq_work_raise(), nor
      does it call irq_work_run() from its PMU interrupt handler (nor
      would that be sufficient in this case, because s390 also generates
      perf_event_overflow() from pmu::stop). Add to that the fact that
      s390 is a virtual architecture and (virtual) CPU-A can stall long
      enough for the above race to happen, even if it would self-IPI.

      Adding an irq_work_sync() to event_sched_in() would work for all
      hardware PMUs that properly use irq_work_run(), but fails for
      software PMUs.
      
      Instead encode the CPU number in @pending_disable, such that we can
      tell which CPU requested the disable. This then allows us to detect
      the above scenario and even redirect the IPI to make up for the failed
      queue.
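
      A hedged sketch of that encoding (the sentinel and helper names
      follow the description above, not necessarily the final patch;
      @pending_disable becomes an int, initialized to -1):

        static void perf_event_disable_inatomic(struct perf_event *event)
        {
                /* record which CPU asked for the disable */
                WRITE_ONCE(event->pending_disable, smp_processor_id());
                irq_work_queue(&event->pending);
        }

        static void perf_pending_event_disable(struct perf_event *event)
        {
                int cpu = READ_ONCE(event->pending_disable);

                if (cpu < 0)
                        return;

                if (cpu == smp_processor_id()) {
                        WRITE_ONCE(event->pending_disable, -1);
                        perf_event_disable_local(event);
                        return;
                }

                /* Wrong CPU: make up for the failed queue by
                 * redirecting the IPI to the requesting CPU. */
                irq_work_queue_on(&event->pending, cpu);
        }
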
      Reported-by: Thomas-Mich Richter <tmricht@linux.ibm.com>
      Tested-by: Thomas Richter <tmricht@linux.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hendrik Brueckner <brueckner@linux.ibm.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. 04 May 2019, 2 commits
  3. 02 May 2019, 6 commits
  4. 27 Apr 2019, 6 commits
    • kernel/sysctl.c: fix out-of-bounds access when setting file-max · cdd369fe
      Committed by Will Deacon
      commit 9002b21465fa4d829edfc94a5a441005cffaa972 upstream.
      
      Commit 32a5ad9c2285 ("sysctl: handle overflow for file-max") hooked up
      min/max values for the file-max sysctl parameter via the .extra1 and
      .extra2 fields in the corresponding struct ctl_table entry.
      
      Unfortunately, the minimum value points at the global 'zero' variable,
      which is an int.  This results in a KASAN splat when accessed as a long
      by proc_doulongvec_minmax on 64-bit architectures:
      
        | BUG: KASAN: global-out-of-bounds in __do_proc_doulongvec_minmax+0x5d8/0x6a0
        | Read of size 8 at addr ffff2000133d1c20 by task systemd/1
        |
        | CPU: 0 PID: 1 Comm: systemd Not tainted 5.1.0-rc3-00012-g40b114779944 #2
        | Hardware name: linux,dummy-virt (DT)
        | Call trace:
        |  dump_backtrace+0x0/0x228
        |  show_stack+0x14/0x20
        |  dump_stack+0xe8/0x124
        |  print_address_description+0x60/0x258
        |  kasan_report+0x140/0x1a0
        |  __asan_report_load8_noabort+0x18/0x20
        |  __do_proc_doulongvec_minmax+0x5d8/0x6a0
        |  proc_doulongvec_minmax+0x4c/0x78
        |  proc_sys_call_handler.isra.19+0x144/0x1d8
        |  proc_sys_write+0x34/0x58
        |  __vfs_write+0x54/0xe8
        |  vfs_write+0x124/0x3c0
        |  ksys_write+0xbc/0x168
        |  __arm64_sys_write+0x68/0x98
        |  el0_svc_common+0x100/0x258
        |  el0_svc_handler+0x48/0xc0
        |  el0_svc+0x8/0xc
        |
        | The buggy address belongs to the variable:
        |  zero+0x0/0x40
        |
        | Memory state around the buggy address:
        |  ffff2000133d1b00: 00 00 00 00 00 00 00 00 fa fa fa fa 04 fa fa fa
        |  ffff2000133d1b80: fa fa fa fa 04 fa fa fa fa fa fa fa 04 fa fa fa
        | >ffff2000133d1c00: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 00 00
        |                                ^
        |  ffff2000133d1c80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00
        |  ffff2000133d1d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      
      Fix the splat by introducing an unsigned long 'zero_ul' and using
      that instead.
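
      A hedged sketch of the resulting ctl_table entry (the layout is
      illustrative; 'long_max' is the matching upper bound):

        static unsigned long zero_ul;
        static unsigned long long_max = LONG_MAX;

        /* in the fs sysctl table: */
        {
                .procname       = "file-max",
                .data           = &files_stat.max_files,
                .maxlen         = sizeof(files_stat.max_files),
                .mode           = 0644,
                .proc_handler   = proc_doulongvec_minmax,
                .extra1         = &zero_ul,     /* was &zero, an int */
                .extra2         = &long_max,
        },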
      
      Link: http://lkml.kernel.org/r/20190403153409.17307-1-will.deacon@arm.com
      Fixes: 32a5ad9c2285 ("sysctl: handle overflow for file-max")
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Christian Brauner <christian@brauner.io>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Revert "locking/lockdep: Add debug_locks check in __lock_downgrade()" · ac54bc12
      Committed by Greg Kroah-Hartman
      This reverts commit 0e0f7b30 which was
      commit 71492580571467fb7177aade19c18ce7486267f5 upstream.
      
      Tetsuo rightly points out that the backport here is incorrect, as
      it touches the __lock_set_class() function instead of the intended
      __lock_downgrade() function.
      Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup · c3edd427
      Committed by Phil Auld
      [ Upstream commit 2e8e19226398db8265a8e675fcc0118b9e80c9e8 ]
      
      With an extremely short cfs_period_us setting on a parent task
      group with a large number of children, the for loop in
      sched_cfs_period_timer() can run until the watchdog fires. There is
      no guarantee that the call to hrtimer_forward_now() will ever
      return 0. The large number of children can make
      do_sched_cfs_period_timer() take longer than the period.
      
       NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
       RIP: 0010:tg_nop+0x0/0x10
        <IRQ>
        walk_tg_tree_from+0x29/0xb0
        unthrottle_cfs_rq+0xe0/0x1a0
        distribute_cfs_runtime+0xd3/0xf0
        sched_cfs_period_timer+0xcb/0x160
        ? sched_cfs_slack_timer+0xd0/0xd0
        __hrtimer_run_queues+0xfb/0x270
        hrtimer_interrupt+0x122/0x270
        smp_apic_timer_interrupt+0x6a/0x140
        apic_timer_interrupt+0xf/0x20
        </IRQ>
      
      To prevent this, add protection to the loop that detects when the
      loop has run too many times, and scale the period and quota up,
      proportionally, so that the timer can complete before the next
      period expires. This preserves the relative runtime quota while
      preventing the hard lockup.

      A warning is issued reporting this state and the new values.
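
      A hedged sketch of the loop guard (the iteration threshold and the
      ~15% scale factor are reproduced from memory of the upstream patch,
      so treat them as illustrative):

        if (++count > 3) {
                u64 new, old = ktime_to_ns(cfs_b->period);

                new = (old * 147) / 128;        /* scale period up ~15% */
                cfs_b->period = ns_to_ktime(new);
                /* scale quota by the same factor to keep the ratio */
                cfs_b->quota = div64_u64(cfs_b->quota * new, old);

                pr_warn_ratelimited(
                        "cfs_period_timer: period too short, scaling up (period %lld, quota %lld)\n",
                        new, cfs_b->quota);

                count = 0;      /* reset to avoid rescaling immediately */
        }
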
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • timers/sched_clock: Prevent generic sched_clock wrap caused by tick_freeze() · cd37fd46
      Committed by Chang-An Chen
      commit 3f2552f7e9c5abef2775c53f7af66532f8bf65bc upstream.
      
      tick_freeze(), introduced for suspend-to-idle by commit 124cf911
      ("PM / sleep: Make it possible to quiesce timers during
      suspend-to-idle"), uses timekeeping_suspend() instead of
      syscore_suspend() during suspend-to-idle. As a consequence, the
      generic sched_clock keeps running because sched_clock_suspend() and
      sched_clock_resume() are not invoked during suspend-to-idle, which
      can result in a generic sched_clock wrap.

      On an ARM system with suspend-to-idle enabled, sched_clock is
      registered as "56 bits at 13MHz, resolution 76ns, wraps every
      4398046511101ns", which means the real wrapping duration is
      8796093022202ns.
      
      [  134.551779] suspend-to-idle suspend (timekeeping_suspend())
      [ 1204.912239] suspend-to-idle resume (timekeeping_resume())
      ......
      [ 1206.912239] suspend-to-idle suspend (timekeeping_suspend())
      [ 5880.502807] suspend-to-idle resume (timekeeping_resume())
      ......
      [ 6000.403724] suspend-to-idle suspend (timekeeping_suspend())
      [ 8035.753167] suspend-to-idle resume  (timekeeping_resume())
      ......
      [ 8795.786684] (2)[321:charger_thread]......
      [ 8795.788387] (2)[321:charger_thread]......
      [    0.057226] (0)[0:swapper/0]......
      [    0.061447] (2)[0:swapper/2]......
      
      sched_clock was not stopped during suspend-to-idle, and the
      sched_clock_poll hrtimer did not expire, because
      timekeeping_suspend() was invoked during suspend-to-idle. This
      makes sched_clock wrap at kernel time 8796s.
      
      To prevent this, invoke sched_clock_suspend() and sched_clock_resume() in
      tick_freeze() together with timekeeping_suspend() and timekeeping_resume().
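
      A hedged sketch of the change in kernel/time/tick-common.c (the
      locking details are reproduced from memory and may differ):

        void tick_freeze(void)
        {
                raw_spin_lock(&tick_freeze_lock);

                tick_freeze_depth++;
                if (tick_freeze_depth == num_online_cpus()) {
                        /* last CPU: also stop the generic sched_clock */
                        sched_clock_suspend();
                        timekeeping_suspend();
                } else {
                        tick_suspend_local();
                }

                raw_spin_unlock(&tick_freeze_lock);
        }

      tick_unfreeze() gains the matching sched_clock_resume() right
      after timekeeping_resume() on the first CPU to thaw.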
      
      Fixes: 124cf911 ("PM / sleep: Make it possible to quiesce timers during suspend-to-idle")
      Signed-off-by: Chang-An Chen <chang-an.chen@mediatek.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Corey Minyard <cminyard@mvista.com>
      Cc: <linux-mediatek@lists.infradead.org>
      Cc: <linux-arm-kernel@lists.infradead.org>
      Cc: Stanley Chu <stanley.chu@mediatek.com>
      Cc: <kuohong.wang@mediatek.com>
      Cc: <freddy.hsin@mediatek.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1553828349-8914-1-git-send-email-chang-an.chen@mediatek.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kprobes: Fix error check when reusing optimized probes · 23a926e5
      Committed by Masami Hiramatsu
      commit 5f843ed415581cfad4ef8fefe31c138a8346ca8a upstream.
      
      The following commit introduced a bug in one of our error paths:
      
        819319fc9346 ("kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()")
      
      It failed to treat the return value of kprobe_optready() correctly:
      kprobe_optready() returns a bool, so "true" means the probe is
      ready (success) and must not be propagated as an error value.
      
      This causes some errors on kprobe boot-time selftests on ARM:
      
       [   ] Beginning kprobe tests...
       [   ] Probe ARM code
       [   ]     kprobe
       [   ]     kretprobe
       [   ] ARM instruction simulation
       [   ]     Check decoding tables
       [   ]     Run test cases
       [   ] FAIL: test_case_handler not run
       [   ] FAIL: Test andge	r10, r11, r14, asr r7
       [   ] FAIL: Scenario 11
       ...
       [   ] FAIL: Scenario 7
       [   ] Total instruction simulation tests=1631, pass=1433 fail=198
       [   ] kprobe tests failed
      
      This can happen if an optimized probe is unregistered and the next
      kprobe is registered on the same address before the previous probe
      has been reclaimed.

      If this happens, a hidden aggregated probe may be kept in memory,
      and no new kprobe can probe the same address. Also, in that case
      register_kprobe() will return "1" instead of a negative error
      value, which can mislead caller logic.
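
      A hedged sketch of the corrected check in reuse_unused_kprobe()
      (the rest of the body is elided; only the predicate matters here):

        static int reuse_unused_kprobe(struct kprobe *ap)
        {
                /* Buggy version: "ret = kprobe_optready(ap);
                 * if (ret) return ret;" returned true (= ready,
                 * i.e. success) as the error value 1. */
                if (!kprobe_optready(ap))
                        return -EINVAL;

                /* ... proceed to reuse the optimized probe ... */
                return 0;
        }
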
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S . Miller <davem@davemloft.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org # v5.0+
      Fixes: 819319fc9346 ("kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()")
      Link: http://lkml.kernel.org/r/155530808559.32517.539898325433642204.stgit@devnote2
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kprobes: Mark ftrace mcount handler functions nokprobe · 426e2a80
      Committed by Masami Hiramatsu
      commit fabe38ab6b2bd9418350284c63825f13b8a6abba upstream.
      
      Mark the ftrace mcount handler functions nokprobe, since probing
      these functions with a kretprobe pushes the return address
      incorrectly onto the kretprobe shadow stack.
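
      A hedged sketch of the annotation (the handler name below is
      hypothetical; the real patch applies NOKPROBE_SYMBOL() to the
      ftrace mcount handlers):

        #include <linux/kprobes.h>

        void notrace my_mcount_handler(unsigned long ip,
                                       unsigned long parent_ip)
        {
                /* handler body */
        }
        /* Refuse kprobes/kretprobes on this function. */
        NOKPROBE_SYMBOL(my_mcount_handler);
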
      Reported-by: Francis Deslauriers <francis.deslauriers@efficios.com>
      Tested-by: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/155094062044.6137.6419622920568680640.stgit@devbox
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 20 Apr 2019, 5 commits
    • bpf: fix use after free in bpf_evict_inode · e8eef7ad
      Committed by Daniel Borkmann
      [ Upstream commit 1da6c4d9140cb7c13e87667dc4e1488d6c8fc10f ]
      
      syzkaller was able to generate the following UAF in bpf:
      
        BUG: KASAN: use-after-free in lookup_last fs/namei.c:2269 [inline]
        BUG: KASAN: use-after-free in path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
        Read of size 1 at addr ffff8801c4865c47 by task syz-executor2/9423
      
        CPU: 0 PID: 9423 Comm: syz-executor2 Not tainted 4.20.0-rc1-next-20181109+
        #110
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
        Google 01/01/2011
        Call Trace:
          __dump_stack lib/dump_stack.c:77 [inline]
          dump_stack+0x244/0x39d lib/dump_stack.c:113
          print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
          kasan_report_error mm/kasan/report.c:354 [inline]
          kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
          __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
          lookup_last fs/namei.c:2269 [inline]
          path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
          filename_lookup+0x26a/0x520 fs/namei.c:2348
          user_path_at_empty+0x40/0x50 fs/namei.c:2608
          user_path include/linux/namei.h:62 [inline]
          do_mount+0x180/0x1ff0 fs/namespace.c:2980
          ksys_mount+0x12d/0x140 fs/namespace.c:3258
          __do_sys_mount fs/namespace.c:3272 [inline]
          __se_sys_mount fs/namespace.c:3269 [inline]
          __x64_sys_mount+0xbe/0x150 fs/namespace.c:3269
          do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x457569
        Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
        48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
        ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
        RSP: 002b:00007fde6ed96c78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
        RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000457569
        RDX: 0000000020000040 RSI: 0000000020000000 RDI: 0000000000000000
        RBP: 000000000072bf00 R08: 0000000020000340 R09: 0000000000000000
        R10: 0000000000200000 R11: 0000000000000246 R12: 00007fde6ed976d4
        R13: 00000000004c2c24 R14: 00000000004d4990 R15: 00000000ffffffff
      
        Allocated by task 9424:
          save_stack+0x43/0xd0 mm/kasan/kasan.c:448
          set_track mm/kasan/kasan.c:460 [inline]
          kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
          __do_kmalloc mm/slab.c:3722 [inline]
          __kmalloc_track_caller+0x157/0x760 mm/slab.c:3737
          kstrdup+0x39/0x70 mm/util.c:49
          bpf_symlink+0x26/0x140 kernel/bpf/inode.c:356
          vfs_symlink+0x37a/0x5d0 fs/namei.c:4127
          do_symlinkat+0x242/0x2d0 fs/namei.c:4154
          __do_sys_symlink fs/namei.c:4173 [inline]
          __se_sys_symlink fs/namei.c:4171 [inline]
          __x64_sys_symlink+0x59/0x80 fs/namei.c:4171
          do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        Freed by task 9425:
          save_stack+0x43/0xd0 mm/kasan/kasan.c:448
          set_track mm/kasan/kasan.c:460 [inline]
          __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
          kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
          __cache_free mm/slab.c:3498 [inline]
          kfree+0xcf/0x230 mm/slab.c:3817
          bpf_evict_inode+0x11f/0x150 kernel/bpf/inode.c:565
          evict+0x4b9/0x980 fs/inode.c:558
          iput_final fs/inode.c:1550 [inline]
          iput+0x674/0xa90 fs/inode.c:1576
          do_unlinkat+0x733/0xa30 fs/namei.c:4069
          __do_sys_unlink fs/namei.c:4110 [inline]
          __se_sys_unlink fs/namei.c:4108 [inline]
          __x64_sys_unlink+0x42/0x50 fs/namei.c:4108
          do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      In this scenario, path lookup under RCU is racing with the final
      unlink in the case of symlinks. As Linus puts it in his analysis:
      
        [...] We actually RCU-delay the inode freeing itself, but
        when we do the final iput(), the "evict()" function is called
        synchronously. Now, the simple fix would seem to just RCU-delay
        the kfree() of the symlink data in bpf_evict_inode(). Maybe
        that's the right thing to do. [...]
      
      Al suggested piggy-backing on the ->destroy_inode() callback in
      order to implement RCU deferral there, which can then kfree() the
      inode->i_link eventually, right before putting the inode back into
      the inode cache. By reusing free_inode_nonrcu() from there we can
      avoid the need for our own inode cache and just reuse the generic
      one as we currently do.

      In fact, on top of all this we should get rid of bpf_evict_inode()
      entirely. This means truncate_inode_pages_final() and clear_inode()
      will then simply be called by the fs core via evict(). Dropping the
      reference should really only be done when the inode is unhashed and
      nothing is reachable anymore, so it is better moved into the final
      ->destroy_inode() callback as well.
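
      A hedged sketch of the RCU-deferred teardown (the function names
      are my guess at the patch's shape):

        static void bpf_free_inode_rcu(struct rcu_head *head)
        {
                struct inode *inode = container_of(head, struct inode, i_rcu);

                if (S_ISLNK(inode->i_mode))
                        kfree(inode->i_link);
                free_inode_nonrcu(inode);
        }

        static void bpf_destroy_inode(struct inode *inode)
        {
                /* defer the kfree() past in-flight RCU path walks */
                call_rcu(&inode->i_rcu, bpf_free_inode_rcu);
        }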
      
      Fixes: 0f98621b ("bpf, inode: add support for symlinks and fix mtime/ctime")
      Reported-by: syzbot+fb731ca573367b7f6564@syzkaller.appspotmail.com
      Reported-by: syzbot+a13e5ead792d6df37818@syzkaller.appspotmail.com
      Reported-by: syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/lkml/0000000000006946d2057bbd0eef@google.com/T/
      Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
    • kernel: hung_task.c: disable on suspend · 491dee74
      Committed by Vitaly Kuznetsov
      [ Upstream commit a1c6ca3c6de763459a6e93b644ec6518c890ba1c ]
      
      It is possible to observe hung_task complaints when the system goes
      into the suspend-to-idle state:
      
       # echo freeze > /sys/power/state
      
       PM: Syncing filesystems ... done.
       Freezing user space processes ... (elapsed 0.001 seconds) done.
       OOM killer disabled.
       Freezing remaining freezable tasks ... (elapsed 0.002 seconds) done.
       sd 0:0:0:0: [sda] Synchronizing SCSI cache
       INFO: task bash:1569 blocked for more than 120 seconds.
             Not tainted 4.19.0-rc3_+ #687
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       bash            D    0  1569    604 0x00000000
       Call Trace:
        ? __schedule+0x1fe/0x7e0
        schedule+0x28/0x80
        suspend_devices_and_enter+0x4ac/0x750
        pm_suspend+0x2c0/0x310
      
      Register a PM notifier to disable the detector on suspend and
      re-enable it on wakeup.
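
      A hedged sketch of such a notifier in kernel/hung_task.c:

        static bool hung_detector_suspended;

        static int hungtask_pm_notify(struct notifier_block *self,
                                      unsigned long action, void *data)
        {
                switch (action) {
                case PM_SUSPEND_PREPARE:
                case PM_HIBERNATION_PREPARE:
                case PM_RESTORE_PREPARE:
                        hung_detector_suspended = true;
                        break;
                case PM_POST_SUSPEND:
                case PM_POST_HIBERNATION:
                case PM_POST_RESTORE:
                        hung_detector_suspended = false;
                        break;
                }
                return NOTIFY_OK;
        }

      The watchdog loop then skips its checks while
      hung_detector_suspended is set.
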
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/core: Fix buffer overflow in cgroup2 property cpu.max · 52466ab2
      Committed by Konstantin Khlebnikov
      [ Upstream commit 4c47acd824aaaa8fc6dc519fb4e08d1522105b7a ]
      
      Add a width limit to the sscanf() format string for the on-stack
      buffer.
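
      A hedged sketch of the cpu.max parse helper (name, defaults, and
      unit handling are approximations):

        static int cpu_period_quota_parse(char *buf, u64 *periodp, u64 *quotap)
        {
                char tok[21];   /* U64_MAX is 20 digits plus NUL */

                *periodp = 100000;      /* default period in us */

                /* "%20s" bounds the copy so sscanf() cannot overflow tok */
                if (sscanf(buf, "%20s %llu", tok, periodp) < 1)
                        return -EINVAL;

                if (!strcmp(tok, "max"))
                        *quotap = RUNTIME_INF;
                else if (kstrtou64(tok, 0, quotap))
                        return -EINVAL;

                return 0;
        }
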
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 0d593634 ("sched: Implement interface for cgroup unified hierarchy")
      Link: https://lkml.kernel.org/r/155189230232.2620.13120481613524200065.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/cpufreq: Fix 32-bit math overflow · a8c1de3a
      Committed by Peter Zijlstra
      [ Upstream commit a23314e9d88d89d49e69db08f60b7caa470f04e1 ]
      
      Vincent Wang reported that get_next_freq() has a mult overflow bug on
      32-bit platforms in the IOWAIT boost case, since in that case {util,max}
      are in freq units instead of capacity units.
      
      Solve this by moving the IOWAIT boost to capacity units. And since
      this means @max is constant, simplify the code.
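
      A hedged illustration of the overflow (the formula is the 4.x-era
      get_next_freq(); the numbers are made up for the example):

        /* next_freq ~= 1.25 * max_freq * util / max */
        unsigned long freq = 2000000;                   /* 2 GHz in kHz */
        unsigned long util = 1500000, max = 2000000;    /* freq units! */

        /* (freq + (freq >> 2)) * util = 2.5e6 * 1.5e6 = 3.75e12, which
         * wraps a 32-bit unsigned long before the divide happens. */
        unsigned long next = (freq + (freq >> 2)) * util / max;

        /* With the boost in capacity units, util <= max <= 1024 and the
         * intermediate product stays well below 2^32. */
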
      Reported-by: Vincent Wang <vincent.wang@unisoc.com>
      Tested-by: Vincent Wang <vincent.wang@unisoc.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chunyan Zhang <zhang.lyra@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190305083202.GU32494@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf/core: Restore mmap record type correctly · 673e23ce
      Committed by Stephane Eranian
      [ Upstream commit d9c1bb2f6a2157b38e8eb63af437cb22701d31ee ]
      
      On mmap(), perf_events generates a RECORD_MMAP record and then checks
      which events are interested in this record. There are currently 2
      versions of mmap records: RECORD_MMAP and RECORD_MMAP2. MMAP2 is larger.
      The event configuration controls which version the user level tool
      accepts.
      
      If the event's attr.mmap2 field is set, an MMAP2 record is
      returned. perf_event_mmap_output() takes care of this. It checks
      attr->mmap2 and corrects the record fields before putting the
      record in the sampling buffer of the event. At the end, the
      function restores the modified MMAP record fields.
      
      The problem is that the function restores the size but not the
      type. Thus, if a subsequent event only accepts the MMAP type, it
      would instead receive an MMAP2 record with the size of an MMAP
      record.
      
      This patch fixes the problem by restoring the record type on exit.
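
      A hedged sketch of the save/restore in perf_event_mmap_output():

        int type = mmap_event->event_id.header.type;
        int size = mmap_event->event_id.header.size;

        if (event->attr.mmap2) {
                mmap_event->event_id.header.type = PERF_RECORD_MMAP2;
                /* ... grow header.size and emit the extra fields ... */
        }

        /* ... write the record to this event's buffer ... */
        out:
        mmap_event->event_id.header.size = size;
        mmap_event->event_id.header.type = type;  /* the missing restore */
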
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Fixes: 13d7a241 ("perf: Add attr->mmap2 attribute to an event")
      Link: http://lkml.kernel.org/r/20190307185233.225521-1-eranian@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  6. 17 Apr 2019, 4 commits
  7. 06 Apr 2019, 11 commits
    • cpu/hotplug: Mute hotplug lockdep during init · fba4c61e
      Committed by Valentin Schneider
      [ Upstream commit ce48c457b95316b9a01b5aa9d4456ce820df94b4 ]
      
      Since we've had:
      
        commit cb538267ea1e ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")
      
      we've been getting some lockdep warnings during init, such as on HiKey960:
      
      [    0.820495] WARNING: CPU: 4 PID: 0 at kernel/cpu.c:316 lockdep_assert_cpus_held+0x3c/0x48
      [    0.820498] Modules linked in:
      [    0.820509] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G S                4.20.0-rc5-00051-g4cae42a #34
      [    0.820511] Hardware name: HiKey960 (DT)
      [    0.820516] pstate: 600001c5 (nZCv dAIF -PAN -UAO)
      [    0.820520] pc : lockdep_assert_cpus_held+0x3c/0x48
      [    0.820523] lr : lockdep_assert_cpus_held+0x38/0x48
      [    0.820526] sp : ffff00000a9cbe50
      [    0.820528] x29: ffff00000a9cbe50 x28: 0000000000000000
      [    0.820533] x27: 00008000b69e5000 x26: ffff8000bff4cfe0
      [    0.820537] x25: ffff000008ba69e0 x24: 0000000000000001
      [    0.820541] x23: ffff000008fce000 x22: ffff000008ba70c8
      [    0.820545] x21: 0000000000000001 x20: 0000000000000003
      [    0.820548] x19: ffff00000a35d628 x18: ffffffffffffffff
      [    0.820552] x17: 0000000000000000 x16: 0000000000000000
      [    0.820556] x15: ffff00000958f848 x14: 455f3052464d4d34
      [    0.820559] x13: 00000000769dde98 x12: ffff8000bf3f65a8
      [    0.820564] x11: 0000000000000000 x10: ffff00000958f848
      [    0.820567] x9 : ffff000009592000 x8 : ffff00000958f848
      [    0.820571] x7 : ffff00000818ffa0 x6 : 0000000000000000
      [    0.820574] x5 : 0000000000000000 x4 : 0000000000000001
      [    0.820578] x3 : 0000000000000000 x2 : 0000000000000001
      [    0.820582] x1 : 00000000ffffffff x0 : 0000000000000000
      [    0.820587] Call trace:
      [    0.820591]  lockdep_assert_cpus_held+0x3c/0x48
      [    0.820598]  static_key_enable_cpuslocked+0x28/0xd0
      [    0.820606]  arch_timer_check_ool_workaround+0xe8/0x228
      [    0.820610]  arch_timer_starting_cpu+0xe4/0x2d8
      [    0.820615]  cpuhp_invoke_callback+0xe8/0xd08
      [    0.820619]  notify_cpu_starting+0x80/0xb8
      [    0.820625]  secondary_start_kernel+0x118/0x1d0
      
      We've also had a similar warning in sched_init_smp() for every
      asymmetric system that would enable the sched_asym_cpucapacity static
      key, although that was singled out in:
      
        commit 40fa3780bac2 ("sched/core: Take the hotplug lock in sched_init_smp()")
      
      Those warnings are actually harmless, since we cannot have hotplug
      operations at the time they appear. Instead of starting to sprinkle
      useless hotplug lock operations into the init codepaths, mute the
      warnings until they can point at real problems.
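
      A hedged sketch of the mute in kernel/cpu.c (this matches the
      described intent; the exact condition may differ):

        void lockdep_assert_cpus_held(void)
        {
                /*
                 * Hotplug can't happen before the system is fully up,
                 * so stay quiet during init instead of demanding
                 * useless lock operations in early codepaths.
                 */
                if (system_state < SYSTEM_RUNNING)
                        return;

                percpu_rwsem_assert_held(&cpu_hotplug_lock);
        }
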
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: cai@gmx.us
      Cc: daniel.lezcano@linaro.org
      Cc: dietmar.eggemann@arm.com
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: longman@redhat.com
      Cc: marc.zyngier@arm.com
      Cc: mark.rutland@arm.com
      Link: https://lkml.kernel.org/r/1545243796-23224-2-git-send-email-valentin.schneider@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting · d0bc74c5
      Committed by Oleg Nesterov
      [ Upstream commit 51bee5abeab2058ea5813c5615d6197a23dbf041 ]
      
      The only user of the cgroup_subsys->free() callback is
      pids_cgrp_subsys, which needs pids_free() to uncharge the pid.

      However, ->free() is called from __put_task_struct()->cgroup_free(),
      and this is too late. Even the trivial program which does
      
      	for (;;) {
      		int pid = fork();
      		assert(pid >= 0);
      		if (pid)
      			wait(NULL);
      		else
      			exit(0);
      	}
      
      can run out of limits, because release_task()->call_rcu(delayed_put_task_struct)
      implies an RCU grace period after the task/pid goes away and before
      the final put().
      
      Test-case:
      
      	mkdir -p /tmp/CG
      	mount -t cgroup2 none /tmp/CG
      	echo '+pids' > /tmp/CG/cgroup.subtree_control
      
      	mkdir /tmp/CG/PID
      	echo 2 > /tmp/CG/PID/pids.max
      
      	perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
      	echo $! > /tmp/CG/PID/cgroup.procs
      
      Without this patch the forking process fails soon after migration.
      
      Rename cgroup_subsys->free() to cgroup_subsys->release() and move
      the callsite into a new helper, cgroup_release(), called by
      release_task(), which actually frees the pid(s).
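
      A hedged sketch of the new helper (do_each_subsys_mask() is the
      existing cgroup-internal iterator; the callback mask name is my
      assumption):

        void cgroup_release(struct task_struct *task)
        {
                struct cgroup_subsys *ss;
                int ssid;

                /* invoked from release_task(), where the pid is freed,
                 * instead of from the RCU-delayed __put_task_struct() */
                do_each_subsys_mask(ss, ssid, have_release_callback) {
                        ss->release(task);
                } while_each_subsys_mask();
        }
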
      Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock() · e8e0bd49
      Committed by Andrea Parri
      [ Upstream commit c546951d9c9300065bad253ecdf1ac59ce9d06c8 ]
      
      move_queued_task() synchronizes with task_rq_lock() as follows:
      
      	move_queued_task()		task_rq_lock()
      
      	[S] ->on_rq = MIGRATING		[L] rq = task_rq()
      	WMB (__set_task_cpu())		ACQUIRE (rq->lock);
      	[S] ->cpu = new_cpu		[L] ->on_rq
      
      where "[L] rq = task_rq()" is ordered before "ACQUIRE (rq->lock)" by an
      address dependency and, in turn, "ACQUIRE (rq->lock)" is ordered before
      "[L] ->on_rq" by the ACQUIRE itself.
      
      Use READ_ONCE() to load ->cpu in task_rq() (cf. task_cpu()) to honor
      this address dependency. Also, mark the accesses to ->cpu and
      ->on_rq with READ_ONCE()/WRITE_ONCE() to comply with the LKMM.
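
      A hedged sketch of the marked accesses from the diagram above (the
      exact field layout depends on the kernel configuration):

        /* task_cpu(), used by task_rq(): the load side of the
         * address dependency */
        return READ_ONCE(task_thread_info(p)->cpu);

        /* move_queued_task(): the store side */
        WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
        __set_task_cpu(p, new_cpu);     /* WMB + store of ->cpu */
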
      Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: https://lkml.kernel.org/r/20190121155240.27173-1-andrea.parri@amarulasolutions.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK · f056c90f
      Committed by Hidetoshi Seto
      [ Upstream commit 1ca4fa3ab604734e38e2a3000c9abf788512ffa7 ]
      
      register_sched_domain_sysctl() copies the cpu_possible_mask into
      sd_sysctl_cpus, but only if sd_sysctl_cpus hasn't already been
      allocated (i.e., CONFIG_CPUMASK_OFFSTACK is set). However, when
      CONFIG_CPUMASK_OFFSTACK is not set, sd_sysctl_cpus is left
      uninitialized (all zeroes) and the kernel may fail to initialize
      sched_domain sysctl entries for all possible CPUs.
      
      This is visible to the user if the kernel is booted with maxcpus=n, or
      if ACPI tables have been modified to leave CPUs offline, and then
      checking for missing /proc/sys/kernel/sched_domain/cpu* entries.
      
      Fix this by separating the allocation and the initialization, and
      by adding a flag so that the possible-CPU entries are initialized
      only once, while the system is booting.
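
      A hedged sketch of the decoupled steps in
      register_sched_domain_sysctl():

        static bool init_done;

        /* allocation only matters for CONFIG_CPUMASK_OFFSTACK */
        if (!cpumask_available(sd_sysctl_cpus)) {
                if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
                        return;
        }

        /* one-time init so entries exist for every possible CPU */
        if (!init_done) {
                init_done = true;
                cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
        }
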
      Tested-by: Syuuichirou Ishii <ishii.shuuichir@jp.fujitsu.com>
      Tested-by: Tarumizu, Kohei <tarumizu.kohei@jp.fujitsu.com>
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190129151245.5073-1-msys.mizuma@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf/aux: Make perf_event accessible to setup_aux() · efd85d83
      Committed by Mathieu Poirier
      [ Upstream commit 840018668ce2d96783356204ff282d6c9b0e5f66 ]
      
      When pmu::setup_aux() is called, the coresight PMU needs to know
      which sink to use for the session by looking up the information in
      the event's attr::config2 field.

      As such, simply replace the cpu argument with the complete
      perf_event structure and change all affected callers.
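
      A hedged sketch of the signature change:

        /* before: only the CPU was available */
        void *(*setup_aux)(int cpu, void **pages,
                           int nr_pages, bool overwrite);

        /* after: the event carries both the CPU and attr::config2 */
        void *(*setup_aux)(struct perf_event *event, void **pages,
                           int nr_pages, bool overwrite);

        /* a driver can then recover what it needs: */
        int cpu  = event->cpu;
        u64 sink = event->attr.config2;
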
      Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
      Reviewed-by: Suzuki Poulouse <suzuki.poulose@arm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-s390@vger.kernel.org
      Link: http://lkml.kernel.org/r/20190131184714.20388-2-mathieu.poirier@linaro.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • genirq: Avoid summation loops for /proc/stat · 1f369486
      Committed by Thomas Gleixner
      [ Upstream commit 1136b0728969901a091f0471968b2b76ed14d9ad ]
      
      Waiman reported that on large systems with a large number of
      interrupts, the readout of /proc/stat takes a long time to sum up
      the interrupt statistics. In principle this is not a problem, but
      for unknown reasons some enterprise-quality software reads
      /proc/stat with a high frequency.
      
      The reason for this is that interrupt statistics are accounted per cpu. So
      the /proc/stat logic has to sum up the interrupt stats for each interrupt.
      
      This can be largely avoided for interrupts which are not marked as
      'PER_CPU' interrupts by simply adding a per interrupt summation counter
      which is incremented along with the per interrupt per cpu counter.
      
      The PER_CPU interrupts need to avoid that and use only per cpu accounting
      because they share the interrupt number and the interrupt descriptor and
      concurrent updates would conflict or require unwanted synchronization.
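
      A hedged sketch of the scheme (field and helper names approximate
      the patch):

        /* interrupt accounting path: bump both counters */
        __this_cpu_inc(*desc->kstat_irqs);
        if (!irq_settings_is_per_cpu_devid(desc))
                desc->tot_count++;

        unsigned int kstat_irqs(unsigned int irq)
        {
                struct irq_desc *desc = irq_to_desc(irq);
                unsigned int sum = 0;
                int cpu;

                if (!desc || !desc->kstat_irqs)
                        return 0;
                if (!irq_settings_is_per_cpu_devid(desc))
                        return desc->tot_count; /* no summation loop */

                for_each_possible_cpu(cpu)
                        sum += *per_cpu_ptr(desc->kstat_irqs, cpu);
                return sum;
        }
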
      Reported-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Link: https://lkml.kernel.org/r/20190208135020.925487496@linutronix.de
      
      8<-------------
      
      v2: Undo the unintentional layout change of struct irq_desc.
      
       include/linux/irqdesc.h |    1 +
       kernel/irq/chip.c       |   12 ++++++++++--
       kernel/irq/internals.h  |    8 +++++++-
       kernel/irq/irqdesc.c    |    7 ++++++-
       4 files changed, 24 insertions(+), 4 deletions(-)
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/topology: Fix percpu data types in struct sd_data & struct s_data · 845d4849
      Committed by Luc Van Oostenryck
      [ Upstream commit 99687cdbb3f6c8e32bcc7f37496e811f30460e48 ]
      
      The percpu members of struct sd_data and s_data are declared as:
      
      	struct ... ** __percpu member;
      
      So their type is:
      
      	__percpu pointer to pointer to struct ...
      
      But looking at how they're used, their type should be:
      
      	pointer to __percpu pointer to struct ...
      
      and they should thus be declared as:
      
      	struct ... * __percpu *member;
      
      So fix the placement of '__percpu' in the definition of these
      structures.
      
      This addresses a bunch of Sparse's warnings like:
      
      	warning: incorrect type in initializer (different address spaces)
      	  expected void const [noderef] <asn:3> *__vpp_verify
      	  got struct sched_domain **
      Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190118144936.79158-1-luc.vanoostenryck@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • kprobes: Prohibit probing on RCU debug routine · d53b295f
      Committed by Masami Hiramatsu
      [ Upstream commit a39f15b9644fac3f950f522c39e667c3af25c588 ]
      
      Since kprobes itself depends on RCU, probing an RCU debug routine
      can cause recursive breakpoint bugs.

      Prohibit probing on the RCU debug routines.
      
      int3
       ->do_int3()
         ->ist_enter()
           ->RCU_LOCKDEP_WARN()
             ->debug_lockdep_rcu_enabled() -> int3
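
      A hedged sketch of the annotation that breaks the recursion above:

        #include <linux/kprobes.h>

        /* debug_lockdep_rcu_enabled() sits on the int3 path, so it
         * must never itself be probed. */
        NOKPROBE_SYMBOL(debug_lockdep_rcu_enabled);
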
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/154998807741.31052.11229157537816341591.stgit@devbox
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • cgroup, rstat: Don't flush subtree root unless necessary · a74ebf04
      Committed by Tejun Heo
      [ Upstream commit b4ff1b44bcd384d22fcbac6ebaf9cc0d33debe50 ]
      
      cgroup_rstat_cpu_pop_updated() is used to traverse the updated cgroups
      on flush.  While it was only visiting updated ones in the subtree, it
      was visiting @root unconditionally.  We can easily check whether @root
      is updated or not by looking at its ->updated_next just as with the
      cgroups in the subtree.
      
      * Remove the unnecessary cgroup_parent() test.  The system root cgroup
        is never updated and thus its ->updated_next is always NULL.  No
        need to test whether cgroup_parent() exists in addition to
        ->updated_next.
      
      * Terminate traverse if ->updated_next is NULL.  This can only happen
        for subtree @root and there's no reason to visit it if it's not
        marked updated.
      
      This reduces cpu consumption when reading a lot of rstat backed files.
      In a micro benchmark reading stat from ~1600 cgroups, the sys time was
      lowered by >40%.
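
      A hedged sketch of the check at the start of
      cgroup_rstat_cpu_pop_updated():

        if (!pos) {
                pos = root;
                /* bail out if this subtree is not on the updated list */
                if (!cgroup_rstat_cpu(pos, cpu)->updated_next)
                        return NULL;
        } else {
                pos = cgroup_parent(pos);
        }
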
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sysctl: handle overflow for file-max · b227f157
      Committed by Christian Brauner
      [ Upstream commit 32a5ad9c22852e6bd9e74bdec5934ef9d1480bc5 ]
      
      Currently, when writing
      
        echo 18446744073709551616 > /proc/sys/fs/file-max
      
      /proc/sys/fs/file-max will overflow and be set to 0.  That quickly
      crashes the system.
      
      This commit sets the max and min values for file-max. The max value
      is capped at what a long int can hold. Any higher value cannot
      currently be used, as the percpu counters are long ints and not
      unsigned integers.
      
      Note that the file-max value is ultimately parsed via
      __do_proc_doulongvec_minmax(). This function does not report an
      error when min or max are exceeded, which means that if a value
      larger than a long int is written, userspace will not receive an
      error; instead the old value will be kept. There is an argument to
      be made that this should be changed and
      __do_proc_doulongvec_minmax() should return an error when a
      dedicated min or max value is exceeded. However, this has the
      potential to break userspace, so let's defer that to an RFC patch.
      
      Link: http://lkml.kernel.org/r/20190107222700.15954-3-christian@brauner.io
      Signed-off-by: Christian Brauner <christian@brauner.io>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      [christian@brauner.io: v4]
        Link: http://lkml.kernel.org/r/20190210203943.8227-3-christian@brauner.io
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • tracing: kdb: Fix ftdump to not sleep · b73c7d02
      Committed by Douglas Anderson
      [ Upstream commit 31b265b3baaf55f209229888b7ffea523ddab366 ]
      
      As reported back in 2016-11 [1], the "ftdump" kdb command triggers a
      BUG for "sleeping function called from invalid context".
      
      kdb's "ftdump" command wants to call ring_buffer_read_prepare() in
      atomic context.  A very simple solution for this is to add allocation
      flags to ring_buffer_read_prepare() so kdb can call it without
      triggering the allocation error.  This patch does that.
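
      A hedged sketch of the interface change (the kdb caller passes
      GFP_ATOMIC from its atomic context):

        struct ring_buffer_iter *
        ring_buffer_read_prepare(struct ring_buffer *buffer, int cpu,
                                 gfp_t flags)
        {
                struct ring_buffer_iter *iter;

                /* the caller now decides whether sleeping is allowed */
                iter = kmalloc(sizeof(*iter), flags);
                if (!iter)
                        return NULL;
                /* ... */
                return iter;
        }

        /* kernel/trace/trace_kdb.c: */
        iter.buffer_iter[cpu] =
                ring_buffer_read_prepare(buffer, cpu, GFP_ATOMIC);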
      
      Note that in the original email thread about this, it was suggested
      that perhaps the solution for kdb was to either preallocate the buffer
      ahead of time or create our own iterator.  I'm hoping that this
      alternative of adding allocation flags to ring_buffer_read_prepare()
      can be considered since it means I don't need to duplicate more of the
      core trace code into "trace_kdb.c" (for either creating my own
      iterator or re-preparing a ring allocator whose memory was already
      allocated).
      
      NOTE: another option for kdb is to actually figure out how to make it
      reuse the existing ftrace_dump() function and totally eliminate the
      duplication.  This sounds very appealing and actually works (the "sr
      z" command can be seen to properly dump the ftrace buffer).  The
      downside here is that ftrace_dump() fully consumes the trace buffer.
      Unless that is changed I'd rather not use it because it means "ftdump
      | grep xyz" won't be very useful to search the ftrace buffer since it
      will throw away the whole trace on the first grep.  A future patch to
      dump only the last few lines of the buffer will also be hard to
      implement.
      
      [1] https://lkml.kernel.org/r/20161117191605.GA21459@google.com
      
      Link: http://lkml.kernel.org/r/20190308193205.213659-1-dianders@chromium.org
      Reported-by: Brian Norris <briannorris@chromium.org>
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  8. 03 Apr 2019, 3 commits
    • bpf: do not restore dst_reg when cur_state is freed · f5959dec
      Committed by Xu Yu
      commit 0803278b0b4d8eeb2b461fb698785df65a725d9e upstream.
      
      Syzkaller hit a 'KASAN: use-after-free Write in sanitize_ptr_alu' bug.
      
      Call trace:
      
        dump_stack+0xbf/0x12e
        print_address_description+0x6a/0x280
        kasan_report+0x237/0x360
        sanitize_ptr_alu+0x85a/0x8d0
        adjust_ptr_min_max_vals+0x8f2/0x1ca0
        adjust_reg_min_max_vals+0x8ed/0x22e0
        do_check+0x1ca6/0x5d00
        bpf_check+0x9ca/0x2570
        bpf_prog_load+0xc91/0x1030
        __se_sys_bpf+0x61e/0x1f00
        do_syscall_64+0xc8/0x550
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fault injection trace:
      
        kfree+0xea/0x290
        free_func_state+0x4a/0x60
        free_verifier_state+0x61/0xe0
        push_stack+0x216/0x2f0	          <- inject failslab
        sanitize_ptr_alu+0x2b1/0x8d0
        adjust_ptr_min_max_vals+0x8f2/0x1ca0
        adjust_reg_min_max_vals+0x8ed/0x22e0
        do_check+0x1ca6/0x5d00
        bpf_check+0x9ca/0x2570
        bpf_prog_load+0xc91/0x1030
        __se_sys_bpf+0x61e/0x1f00
        do_syscall_64+0xc8/0x550
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      When kzalloc() fails in push_stack(), free_verifier_state() frees
      the current verifier state. After push_stack() returns, dst_reg was
      restored if ptr_is_dst_reg is false. However, as a member of
      cur_state, dst_reg has also been freed, so an error occurs when
      dereferencing it. Simply fix this by testing the return value of
      push_stack() before restoring dst_reg.
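
      A hedged sketch of the fix at the end of sanitize_ptr_alu():

        ret = push_stack(env, env->insn_idx + 1, env->insn_idx, true);
        /* only touch *dst_reg if push_stack() did not free cur_state */
        if (!ptr_is_dst_reg && ret)
                *dst_reg = tmp;
        return !ret ? REASON_STACK : 0;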
      
      Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n · a56aa02e
      Committed by Thomas Gleixner
      commit 206b92353c839c0b27a0b9bec24195f93fd6cf7a upstream.
      
      Tianyu reported a crash in a CPU hotplug teardown callback when booting a
      kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot
      parameter.
      
      It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been
      broken forever for the case in which a bringup callback fails.
      Unfortunately this issue was not recognized when the CPU hotplug
      code was reworked, so the shortcoming just stayed in place.
      
      When a bringup callback fails, the CPU hotplug code rolls back the
      operation and takes the CPU offline.
      
      The 'nosmt' command line argument uses a bringup failure to abort the
      bringup of SMT sibling CPUs. This partial bringup is required due to the
      MCE misdesign on Intel CPUs.
      
      With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but
      CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level
      teardown of a CPU including the synchronizations in various facilities like
      RCU, NOHZ and others.
      
      As a consequence, the teardown callbacks which must be executed on
      the outgoing CPU within stop machine with interrupts disabled are
      executed on the control CPU in interrupt-enabled and preemptible
      context, causing the kernel to crash and burn. The pre-state-machine
      code has a different failure mode which is more subtle, resulting in
      a less obvious use-after-free crash, because the control side frees
      resources which are still in use by the undead CPU.
      
      But this is not an x86-only problem. Any architecture which supports
      the SMP=y HOTPLUG_CPU=n combination suffers from the same issue.
      It's just less likely to be triggered because in 99.99999% of the
      cases all bringup callbacks succeed.
      
      The easy solution of making HOTPLUG_CPU mandatory for SMP is not working on
      all architectures as the following architectures have either no hotplug
      support at all or not all subarchitectures support it:
      
       alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial).
      
      Crashing the kernel in such a situation is not an acceptable state
      either.
      
      Implement a minimal rollback variant by limiting the teardown to the point
      where all regular teardown callbacks have been invoked and leave the CPU in
      the 'dead' idle state. This has the following consequences:
      
       - the CPU is brought down to the point where the stop_machine takedown
         would happen.
      
       - the CPU stays there forever and is idle
      
       - The CPU is cleared in the CPU active mask, but not in the CPU online
         mask which is a legit state.
      
       - Interrupts are not forced away from the CPU
      
       - All facilities which only look at online mask would still see it, but
         that is the case during normal hotplug/unplug operations as well. It's
         just a (way) longer time frame.
      
      This will expose issues which haven't been exposed before, or only
      seldom, because now the normally transient state of being non-active
      but online is a permanent state. In testing, this already exposed an
      issue vs. workqueues, where the vmstat code schedules work on the
      almost-dead CPU, which ends up in an unbound workqueue and triggers
      'preemptible context' warnings. This is not a problem of this
      change; it merely exposes an already existing issue. Still, this is
      better than crashing fully without a chance to debug it.
      
      This is mainly intended as a workaround for those architectures
      which do not support HOTPLUG_CPU. All others should enforce
      HOTPLUG_CPU for SMP.
      
      Fixes: 2e1a3483 ("cpu/hotplug: Split out the state walk into functions")
      Reported-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Konrad Wilk <konrad.wilk@oracle.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Micheal Kelley <michael.h.kelley@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190326163811.503390616@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • watchdog: Respect watchdog cpumask on CPU hotplug · 336f6b23
      Committed by Thomas Gleixner
      commit 7dd47617114921fdd8c095509e5e7b4373cc44a1 upstream.
      
      The rework of the watchdog core to use cpu_stop_work broke the watchdog
      cpumask on CPU hotplug.
      
      The watchdog_enable/disable() functions are now called
      unconditionally from the hotplug callback, i.e. even on CPUs which
      are not in the watchdog cpumask. As a consequence, the watchdog can
      become unstoppable.
      
      Only invoke them when the plugged CPU is in the watchdog cpumask.
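
      A hedged sketch of the guarded hotplug callbacks in
      kernel/watchdog.c:

        void lockup_detector_online_cpu(unsigned int cpu)
        {
                if (cpumask_test_cpu(cpu, &watchdog_allowed_mask))
                        watchdog_enable(cpu);
        }

        void lockup_detector_offline_cpu(unsigned int cpu)
        {
                if (cpumask_test_cpu(cpu, &watchdog_allowed_mask))
                        watchdog_disable(cpu);
        }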
      
      Fixes: 9cf57731 ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
      Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903262245490.1789@nanos.tec.linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  9. 27 Mar 2019, 1 commit