  1. 16 September 2019, 2 commits
  2. 10 September 2019, 1 commit
    • kprobes: Fix potential deadlock in kprobe_optimizer() · 5e1d50a3
      Committed by Andrea Righi
      [ Upstream commit f1c6ece23729257fb46562ff9224cf5f61b818da ]
      
      lockdep reports the following deadlock scenario:
      
       WARNING: possible circular locking dependency detected
      
       kworker/1:1/48 is trying to acquire lock:
       000000008d7a62b2 (text_mutex){+.+.}, at: kprobe_optimizer+0x163/0x290
      
       but task is already holding lock:
       00000000850b5e2d (module_mutex){+.+.}, at: kprobe_optimizer+0x31/0x290
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (module_mutex){+.+.}:
              __mutex_lock+0xac/0x9f0
              mutex_lock_nested+0x1b/0x20
              set_all_modules_text_rw+0x22/0x90
              ftrace_arch_code_modify_prepare+0x1c/0x20
              ftrace_run_update_code+0xe/0x30
              ftrace_startup_enable+0x2e/0x50
              ftrace_startup+0xa7/0x100
              register_ftrace_function+0x27/0x70
              arm_kprobe+0xb3/0x130
              enable_kprobe+0x83/0xa0
              enable_trace_kprobe.part.0+0x2e/0x80
              kprobe_register+0x6f/0xc0
              perf_trace_event_init+0x16b/0x270
              perf_kprobe_init+0xa7/0xe0
              perf_kprobe_event_init+0x3e/0x70
              perf_try_init_event+0x4a/0x140
              perf_event_alloc+0x93a/0xde0
              __do_sys_perf_event_open+0x19f/0xf30
              __x64_sys_perf_event_open+0x20/0x30
              do_syscall_64+0x65/0x1d0
              entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
       -> #0 (text_mutex){+.+.}:
              __lock_acquire+0xfcb/0x1b60
              lock_acquire+0xca/0x1d0
              __mutex_lock+0xac/0x9f0
              mutex_lock_nested+0x1b/0x20
              kprobe_optimizer+0x163/0x290
              process_one_work+0x22b/0x560
              worker_thread+0x50/0x3c0
              kthread+0x112/0x150
              ret_from_fork+0x3a/0x50
      
       other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(module_mutex);
                                      lock(text_mutex);
                                      lock(module_mutex);
         lock(text_mutex);
      
        *** DEADLOCK ***
      
      As a reproducer I've been using bcc's funccount.py
      (https://github.com/iovisor/bcc/blob/master/tools/funccount.py),
      for example:
      
       # ./funccount.py '*interrupt*'
      
      That immediately triggers the lockdep splat.
      
      Fix by acquiring text_mutex before module_mutex in kprobe_optimizer().
      Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: d5b844a2cf50 ("ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code()")
      Link: http://lkml.kernel.org/r/20190812184302.GA7010@xps-13
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5e1d50a3
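
      A minimal sketch of the corrected ordering (simplified; the real function
      lives in kernel/kprobes.c and the "_sketch" name is ours): the optimizer
      now takes text_mutex before module_mutex, the same order as the ftrace
      path in the #1 chain above, so the ABBA cycle can no longer form.

      static void kprobe_optimizer_sketch(struct work_struct *work)
      {
              mutex_lock(&text_mutex);        /* first: matches the ftrace path */
              mutex_lock(&module_mutex);      /* second: no ABBA on text_mutex  */

              /* ... optimize/unoptimize queued kprobes, patch kernel text ... */

              mutex_unlock(&module_mutex);
              mutex_unlock(&text_mutex);
      }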
  3. 06 September 2019, 3 commits
  4. 29 August 2019, 1 commit
  5. 25 August 2019, 1 commit
  6. 16 August 2019, 1 commit
  7. 09 August 2019, 5 commits
    • cgroup: Fix css_task_iter_advance_css_set() cset skip condition · ebda41dd
      Committed by Tejun Heo
      commit c596687a008b579c503afb7a64fcacc7270fae9e upstream.
      
      While adding handling for dying task group leaders c03cd7738a83
      ("cgroup: Include dying leaders with live threads in PROCS
      iterations") added an inverted cset skip condition to
      css_task_iter_advance_css_set().  It should skip cset if it's
      completely empty but was incorrectly testing for the inverse condition
      for the dying_tasks list.  Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations")
      Reported-by: syzbot+d4bba5ccd4f9a2a68681@syzkaller.appspotmail.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ebda41dd
    • cgroup: css_task_iter_skip()'d iterators must be advanced before accessed · 0a9abd27
      Committed by Tejun Heo
      commit cee0c33c546a93957a52ae9ab6bebadbee765ec5 upstream.
      
      b636fd38dc40 ("cgroup: Implement css_task_iter_skip()") introduced
      css_task_iter_skip() which is used to fix task iterations skipping
      dying threadgroup leaders with live threads.  Skipping is implemented
      as a subportion of full advancing but css_task_iter_next() forgot to
      fully advance a skipped iterator before determining the next task to
      visit causing it to return invalid task pointers.
      
      Fix it by making css_task_iter_next() fully advance the iterator if it
      has been skipped since the previous iteration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: syzbot
      Link: http://lkml.kernel.org/r/00000000000097025d058a7fd785@google.com
      Fixes: b636fd38dc40 ("cgroup: Implement css_task_iter_skip()")
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0a9abd27
    • cgroup: Include dying leaders with live threads in PROCS iterations · 4340d175
      Committed by Tejun Heo
      commit c03cd7738a83b13739f00546166969342c8ff014 upstream.
      
      CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
      this means that a process with dying leader and live threads will be
      skipped.  IOW, cgroup.procs might be empty while cgroup.threads isn't,
      which is confusing to say the least.
      
      Fix it by making cset track dying tasks and include dying leaders with
      live threads in PROCS iteration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4340d175
    • cgroup: Implement css_task_iter_skip() · 370b9e63
      Committed by Tejun Heo
      commit b636fd38dc40113f853337a7d2a6885ad23b8811 upstream.
      
      When a task is moved out of a cset, task iterators pointing to the
      task are advanced using the normal css_task_iter_advance() call.  This
      is fine but we'll be tracking dying tasks on csets and thus moving
      tasks from cset->tasks to (to be added) cset->dying_tasks.  When we
      remove a task from cset->tasks, if we advance the iterators, they may
      move over to the next cset before we had the chance to add the task
      back on the dying list, which can allow the task to escape iteration.
      
      This patch separates out skipping from advancing.  Skipping only moves
      the affected iterators to the next pointer rather than fully advancing
      it and the following advancing will recognize that the cursor has
      already been moved forward and do the rest of advancing.  This ensures
      that when a task moves from one list to another in its cset, as long
      as it moves in the right direction, it's always visible to iteration.
      
      This doesn't cause any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      370b9e63
    • cgroup: Call cgroup_release() before __exit_signal() · 7528e95b
      Committed by Tejun Heo
      commit 6b115bf58e6f013ca75e7115aabcbd56c20ff31d upstream.
      
      cgroup_release() calls cgroup_subsys->release() which is used by the
      pids controller to uncharge its pid.  We want to use it to manage
      iteration of dying tasks which requires putting it before
      __unhash_process().  Move cgroup_release() above __exit_signal().
      While this makes it uncharge before the pid is freed, pid is RCU freed
      anyway and the window is very narrow.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7528e95b
  8. 07 August 2019, 2 commits
    • kernel/module.c: Only return -EEXIST for modules that have finished loading · 09ec6c67
      Committed by Prarit Bhargava
      [ Upstream commit 6e6de3dee51a439f76eb73c22ae2ffd2c9384712 ]
      
      Microsoft HyperV disables the X86_FEATURE_SMCA bit on AMD systems, and
      linux guests boot with repeated errors:
      
      amd64_edac_mod: Unknown symbol amd_unregister_ecc_decoder (err -2)
      amd64_edac_mod: Unknown symbol amd_register_ecc_decoder (err -2)
      amd64_edac_mod: Unknown symbol amd_report_gart_errors (err -2)
      amd64_edac_mod: Unknown symbol amd_unregister_ecc_decoder (err -2)
      amd64_edac_mod: Unknown symbol amd_register_ecc_decoder (err -2)
      amd64_edac_mod: Unknown symbol amd_report_gart_errors (err -2)
      
      The warnings occur because the module code erroneously returns -EEXIST
      for modules that have failed to load and are in the process of being
      removed from the module list.
      
      module amd64_edac_mod has a dependency on module edac_mce_amd.  Using
      modules.dep, systemd will load edac_mce_amd for every request of
      amd64_edac_mod.  When the edac_mce_amd module loads, the module has
      state MODULE_STATE_UNFORMED; once the module load fails, the state
      becomes MODULE_STATE_GOING.  Another request for edac_mce_amd module
      executes and add_unformed_module() will erroneously return -EEXIST even
      though the previous instance of edac_mce_amd has MODULE_STATE_GOING.
      Upon receiving -EEXIST, systemd attempts to load amd64_edac_mod, which
      fails because of unknown symbols from edac_mce_amd.
      
      add_unformed_module() must wait to return for any case other than
      MODULE_STATE_LIVE to prevent a race between multiple loads of
      dependent modules.
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Signed-off-by: Barret Rhoden <brho@google.com>
      Cc: David Arcari <darcari@redhat.com>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Jessica Yu <jeyu@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      09ec6c67
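
      A simplified sketch of the behaviour described above (helper names such as
      find_module_all()/finished_loading() follow kernel/module.c, but the body
      is abbreviated): only a module that has reached MODULE_STATE_LIVE makes a
      second load fail with -EEXIST; anything still coming up or going away makes
      the caller wait and retry instead.

      static int add_unformed_module_sketch(struct module *mod)
      {
              struct module *old;

      again:
              mutex_lock(&module_mutex);
              old = find_module_all(mod->name, strlen(mod->name), true);
              if (old) {
                      if (old->state == MODULE_STATE_LIVE) {
                              mutex_unlock(&module_mutex);
                              return -EEXIST;         /* genuinely loaded already */
                      }
                      /* UNFORMED/COMING/GOING: wait for it to settle, then retry */
                      mutex_unlock(&module_mutex);
                      wait_event_interruptible(module_wq,
                                               finished_loading(mod->name));
                      goto again;
              }
              list_add_rcu(&mod->list, &modules);
              mutex_unlock(&module_mutex);
              return 0;
      }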
    • ftrace: Enable trampoline when rec count returns back to one · f486088d
      Committed by Cheng Jian
      [ Upstream commit a124692b698b00026a58d89831ceda2331b2e1d0 ]
      
      Custom trampolines can only be enabled if there is only a single ops
      attached to it. If there's only a single callback registered to a function,
      and the ops has a trampoline registered for it, then we can call the
      trampoline directly. This is very useful for improving the performance of
      ftrace and livepatch.
      
      If more than one callback is registered to a function, the general
      trampoline is used, and the custom trampoline is not restored back to the
      direct call even if all the other callbacks were unregistered and we are
      back to one callback for the function.
      
      To fix this, set FTRACE_FL_TRAMP flag if rec count is decremented
      to one, and the ops that left has a trampoline.
      
      Testing after this patch:
      
      insmod livepatch_unshare_files.ko
      cat /sys/kernel/debug/tracing/enabled_functions
      
      	unshare_files (1) R I	tramp: 0xffffffffc0000000(klp_ftrace_handler+0x0/0xa0) ->ftrace_ops_assist_func+0x0/0xf0
      
      echo unshare_files > /sys/kernel/debug/tracing/set_ftrace_filter
      echo function > /sys/kernel/debug/tracing/current_tracer
      cat /sys/kernel/debug/tracing/enabled_functions
      
      	unshare_files (2) R I ->ftrace_ops_list_func+0x0/0x150
      
      echo nop > /sys/kernel/debug/tracing/current_tracer
      cat /sys/kernel/debug/tracing/enabled_functions
      
      	unshare_files (1) R I	tramp: 0xffffffffc0000000(klp_ftrace_handler+0x0/0xa0) ->ftrace_ops_assist_func+0x0/0xf0
      
      Link: http://lkml.kernel.org/r/1556969979-111047-1-git-send-email-cj.chengjian@huawei.com
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f486088d
  9. 04 August 2019, 2 commits
  10. 31 July 2019, 3 commits
    • access: avoid the RCU grace period for the temporary subjective credentials · 408af823
      Committed by Linus Torvalds
      commit d7852fbd0f0423937fa287a598bfde188bb68c22 upstream.
      
      It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
      work because it installs a temporary credential that gets allocated and
      freed for each system call.
      
      The allocation and freeing overhead is mostly benign, but because
      credentials can be accessed under the RCU read lock, the freeing
      involves a RCU grace period.
      
      Which is not a huge deal normally, but if you have a lot of access()
      calls, this causes a fair amount of secondary damage: instead of having a
      nice alloc/free pattern that hits in hot per-CPU slab caches, you have
      all those delayed free's, and on big machines with hundreds of cores,
      the RCU overhead can end up being enormous.
      
      But it turns out that all of this is entirely unnecessary.  Exactly
      because access() only installs the credential as the thread-local
      subjective credential, the temporary cred pointer doesn't actually need
      to be RCU free'd at all.  Once we're done using it, we can just free it
      synchronously and avoid all the RCU overhead.
      
      So add a 'non_rcu' flag to 'struct cred', which can be set by users that
      know they only use it in non-RCU context (there are other potential
      users for this).  We can make it a union with the rcu freeing list head
      that we need for the RCU case, so this doesn't need any extra storage.
      
      Note that this also makes 'get_current_cred()' clear the new non_rcu
      flag, in case we have filesystems that take a long-term reference to the
      cred and then expect the RCU delayed freeing afterwards.  It's not
      entirely clear that this is required, but it makes for clear semantics:
      the subjective cred remains non-RCU as long as you only access it
      synchronously using the thread-local accessors, but you _can_ use it as
      a generic cred if you want to.
      
      It is possible that we should just remove the whole RCU markings for
      ->cred entirely.  Only ->real_cred is really supposed to be accessed
      through RCU, and the long-term cred copies that nfs uses might want to
      explicitly re-enable RCU freeing if required, rather than have
      get_current_cred() do it implicitly.
      
      But this is a "minimal semantic changes" change for the immediate
      problem.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Glauber <jglauber@marvell.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      408af823
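
      A sketch of the scheme described above (the field layout follows the
      upstream patch; the free path is abbreviated): the temporary subjective
      cred is flagged non_rcu, so dropping the last reference frees it
      synchronously instead of waiting out a grace period.

      struct cred {
              atomic_t                usage;
              /* ... keyrings, uids, capabilities, ... */
              union {
                      int             non_rcu;        /* can we skip RCU deletion? */
                      struct rcu_head rcu;            /* RCU deletion hook         */
              };
      };

      static void put_cred_rcu(struct rcu_head *rcu)
      {
              struct cred *cred = container_of(rcu, struct cred, rcu);

              /* ... drop keyrings, user refs, ... */
              kmem_cache_free(cred_jar, cred);
      }

      void __put_cred(struct cred *cred)
      {
              if (cred->non_rcu)
                      put_cred_rcu(&cred->rcu);               /* free right away     */
              else
                      call_rcu(&cred->rcu, put_cred_rcu);     /* wait a grace period */
      }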
    • locking/lockdep: Hide unused 'class' variable · 148959cc
      Committed by Arnd Bergmann
      [ Upstream commit 68037aa78208f34bda4e5cd76c357f718b838cbb ]
      
      The usage is now hidden in an #ifdef, so we need to move
      the variable itself in there as well to avoid this warning:
      
        kernel/locking/lockdep_proc.c:203:21: error: unused variable 'class' [-Werror,-Wunused-variable]
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yuyang Du <duyuyang@gmail.com>
      Cc: frederic@kernel.org
      Fixes: 68d41d8c94a3 ("locking/lockdep: Fix lock used or unused stats error")
      Link: https://lkml.kernel.org/r/20190715092809.736834-1-arnd@arndb.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      148959cc
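
      A generic illustration of the pattern behind the fix (the option name and
      the function are made up; only the lockdep types are real): once a
      variable's only use sits inside an #ifdef, its declaration has to move
      inside the same #ifdef, otherwise -Wunused-variable fires whenever the
      option is off.

      static int report_stats_sketch(struct seq_file *m)
      {
              int shown = 0;
      #ifdef CONFIG_EXAMPLE_DEBUG             /* hypothetical option */
              struct lock_class *class;       /* declared only where it is used */

              list_for_each_entry(class, &all_lock_classes, lock_entry)
                      shown++;
      #endif
              return shown;
      }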
    • locking/lockdep: Fix lock used or unused stats error · 4acb04ef
      Committed by Yuyang Du
      [ Upstream commit 68d41d8c94a31dfb8233ab90b9baf41a2ed2da68 ]
      
      The stats variable nr_unused_locks is incremented every time a new lock
      class is registered and decremented when the lock is first used in
      __lock_acquire(). Finally, it is shown and checked in lockdep_stats.
      
      However, under configurations that either CONFIG_TRACE_IRQFLAGS or
      CONFIG_PROVE_LOCKING is not defined:
      
      The commit:
      
        091806515124b20 ("locking/lockdep: Consolidate lock usage bit initialization")
      
      missed marking the LOCK_USED flag at IRQ usage initialization because
      mark_usage() is not called. And the commit:
      
        886532aee3cd42d ("locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING")
      
      further made mark_lock() not defined such that the LOCK_USED cannot be
      marked at all when the lock is first acquired.
      
      As a result, fix this by not showing or checking these stats in
      lockdep_stats under such configurations.
      Reported-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Yuyang Du <duyuyang@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: arnd@arndb.de
      Cc: frederic@kernel.org
      Link: https://lkml.kernel.org/r/20190709101522.9117-1-duyuyang@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4acb04ef
  11. 28 July 2019, 2 commits
    • perf/core: Fix race between close() and fork() · 4a5cc64d
      Committed by Peter Zijlstra
      commit 1cf8dfe8a661f0462925df943140e9f6d1ea5233 upstream.
      
      Syzcaller reported the following Use-after-Free bug:
      
      	close()						clone()
      
      							  copy_process()
      							    perf_event_init_task()
      							      perf_event_init_context()
      							        mutex_lock(parent_ctx->mutex)
      								inherit_task_group()
      								  inherit_group()
      								    inherit_event()
      								      mutex_lock(event->child_mutex)
      								      // expose event on child list
      								      list_add_tail()
      								      mutex_unlock(event->child_mutex)
      							        mutex_unlock(parent_ctx->mutex)
      
      							    ...
      							    goto bad_fork_*
      
      							  bad_fork_cleanup_perf:
      							    perf_event_free_task()
      
      	  perf_release()
      	    perf_event_release_kernel()
      	      list_for_each_entry()
      		mutex_lock(ctx->mutex)
      		mutex_lock(event->child_mutex)
      		// event is from the failing inherit
      		// on the other CPU
      		perf_remove_from_context()
      		list_move()
      		mutex_unlock(event->child_mutex)
      		mutex_unlock(ctx->mutex)
      
      							      mutex_lock(ctx->mutex)
      							      list_for_each_entry_safe()
      							        // event already stolen
      							      mutex_unlock(ctx->mutex)
      
      							    delayed_free_task()
      							      free_task()
      
      	     list_for_each_entry_safe()
      	       list_del()
      	       free_event()
      	         _free_event()
      		   // and so event->hw.target
      		   // is the already freed failed clone()
      		   if (event->hw.target)
      		     put_task_struct(event->hw.target)
      		       // WHOOPSIE, already quite dead
      
      Which puts the lie to the comment on perf_event_free_task():
      'unexposed, unused context' not so much.
      
      Which is a 'fun' confluence of fail; copy_process() doing an
      unconditional free_task() and not respecting refcounts, and perf having
      creative locking. In particular:
      
        82d94856 ("perf/core: Fix lock inversion between perf,trace,cpuhp")
      
      seems to have overlooked this 'fun' parade.
      
      Solve it by using the fact that detached events still have a reference
      count on their (previous) context. With this perf_event_free_task()
      can detect when events have escaped and wait for their destruction.
      Debugged-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Reported-by: syzbot+a24c397a29ad22d86c98@syzkaller.appspotmail.com
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 82d94856 ("perf/core: Fix lock inversion between perf,trace,cpuhp")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4a5cc64d
    • perf/core: Fix exclusive events' grouping · 75100ec5
      Committed by Alexander Shishkin
      commit 8a58ddae23796c733c5dfbd717538d89d036c5bd upstream.
      
      So far, we tried to disallow grouping exclusive events for the fear of
      complications they would cause with moving between contexts. Specifically,
      moving a software group to a hardware context would violate the exclusivity
      rules if both groups contain matching exclusive events.
      
      This attempt was, however, unsuccessful: the check that we have in the
      perf_event_open() syscall is both wrong (looks at wrong PMU) and
      insufficient (group leader may still be exclusive), as can be illustrated
      by running:
      
        $ perf record -e '{intel_pt//,cycles}' uname
        $ perf record -e '{cycles,intel_pt//}' uname
      
      ultimately successfully.
      
      Furthermore, we are completely free to trigger the exclusivity violation
      by:
      
         perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'
      
      even though the helpful perf record will not allow that, the ABI will.
      
      The warning later in the perf_event_open() path will also not trigger, because
      it's also wrong.
      
      Fix all this by validating the original group before moving, getting rid
      of broken safeguards and placing a useful one to perf_install_in_context().
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: mathieu.poirier@linaro.org
      Cc: will.deacon@arm.com
      Fixes: bed5b25a ("perf: Add a pmu capability for "exclusive" events")
      Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      75100ec5
  12. 26 July 2019, 8 commits
    • padata: use smp_mb in padata_reorder to avoid orphaned padata jobs · 1e4247d7
      Committed by Daniel Jordan
      commit cf144f81a99d1a3928f90b0936accfd3f45c9a0a upstream.
      
      Testing padata with the tcrypt module on a 5.2 kernel...
      
          # modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
          # modprobe tcrypt mode=211 sec=1
      
      ...produces this splat:
      
          INFO: task modprobe:10075 blocked for more than 120 seconds.
                Not tainted 5.2.0-base+ #16
          modprobe        D    0 10075  10064 0x80004080
          Call Trace:
           ? __schedule+0x4dd/0x610
           ? ring_buffer_unlock_commit+0x23/0x100
           schedule+0x6c/0x90
           schedule_timeout+0x3b/0x320
           ? trace_buffer_unlock_commit_regs+0x4f/0x1f0
           wait_for_common+0x160/0x1a0
           ? wake_up_q+0x80/0x80
           { crypto_wait_req }             # entries in braces added by hand
           { do_one_aead_op }
           { test_aead_jiffies }
           test_aead_speed.constprop.17+0x681/0xf30 [tcrypt]
           do_test+0x4053/0x6a2b [tcrypt]
           ? 0xffffffffa00f4000
           tcrypt_mod_init+0x50/0x1000 [tcrypt]
           ...
      
      The second modprobe command never finishes because in padata_reorder,
      CPU0's load of reorder_objects is executed before the unlocking store in
      spin_unlock_bh(pd->lock), causing CPU0 to miss CPU1's increment:
      
      CPU0                                 CPU1
      
      padata_reorder                       padata_do_serial
        LOAD reorder_objects  // 0
                                             INC reorder_objects  // 1
                                             padata_reorder
                                               TRYLOCK pd->lock   // failed
        UNLOCK pd->lock
      
      CPU0 deletes the timer before returning from padata_reorder and since no
      other job is submitted to padata, modprobe waits indefinitely.
      
      Add a pair of full barriers to guarantee proper ordering:
      
      CPU0                                 CPU1
      
      padata_reorder                       padata_do_serial
        UNLOCK pd->lock
        smp_mb()
        LOAD reorder_objects
                                             INC reorder_objects
                                             smp_mb__after_atomic()
                                             padata_reorder
                                               TRYLOCK pd->lock
      
      smp_mb__after_atomic is needed so the read part of the trylock operation
      comes after the INC, as Andrea points out.   Thanks also to Andrea for
      help with writing a litmus test.
      
      Fixes: 16295bec ("padata: Generic parallelization/serialization interface")
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: <stable@vger.kernel.org>
      Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-crypto@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1e4247d7
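
      A hypothetical user-space rendering of the pairing in the diagram above,
      using C11 atomics (not the padata code itself): a full fence between the
      "unlock" and the re-check of the counter, paired with a fence after the
      increment, guarantees that at least one side observes the other's update,
      which is exactly what the store-buffering pattern above was missing.

      #include <stdatomic.h>

      static atomic_int lock_owned;           /* stands in for pd->lock            */
      static atomic_int reorder_objects;      /* stands in for pd->reorder_objects */

      static void reorder_side(void)          /* ~ padata_reorder() */
      {
              atomic_store_explicit(&lock_owned, 0, memory_order_release);      /* UNLOCK   */
              atomic_thread_fence(memory_order_seq_cst);                         /* smp_mb() */
              if (atomic_load_explicit(&reorder_objects, memory_order_relaxed))
                      ;       /* would re-run the reorder pass */
      }

      static void serial_side(void)           /* ~ padata_do_serial() */
      {
              atomic_fetch_add_explicit(&reorder_objects, 1, memory_order_relaxed);  /* INC */
              atomic_thread_fence(memory_order_seq_cst);            /* smp_mb__after_atomic() */
              if (!atomic_exchange_explicit(&lock_owned, 1, memory_order_acquire))
                      ;       /* TRYLOCK succeeded: this side runs the reorder */
      }

      int main(void)
      {
              serial_side();
              reorder_side();
              return 0;
      }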
    • timer_list: Guard procfs specific code · b9f547b7
      Committed by Nathan Huckleberry
      [ Upstream commit a9314773a91a1d3b36270085246a6715a326ff00 ]
      
      With CONFIG_PROC_FS=n the following warning is emitted:
      
      kernel/time/timer_list.c:361:36: warning: unused variable
      'timer_list_sops' [-Wunused-const-variable]
         static const struct seq_operations timer_list_sops = {
      
      Add #ifdef guard around procfs specific code.
      Signed-off-by: Nathan Huckleberry <nhuck@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Cc: john.stultz@linaro.org
      Cc: sboyd@kernel.org
      Cc: clang-built-linux@googlegroups.com
      Link: https://github.com/ClangBuiltLinux/linux/issues/534
      Link: https://lkml.kernel.org/r/20190614181604.112297-1-nhuck@google.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b9f547b7
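
      A simplified sketch of the fix, assuming the usual seq_file callbacks in
      kernel/time/timer_list.c (only timer_list_sops is named by the warning;
      the other identifiers here are assumptions): the proc-only pieces are
      wrapped in an #ifdef so they compile away together with their only user.

      #ifdef CONFIG_PROC_FS
      static const struct seq_operations timer_list_sops = {
              .start = timer_list_start,
              .next  = timer_list_next,
              .stop  = timer_list_stop,
              .show  = timer_list_show,
      };

      static int __init init_timer_list_procfs(void)
      {
              if (!proc_create_seq_private("timer_list", 0400, NULL, &timer_list_sops,
                                           sizeof(struct timer_list_iter), NULL))
                      return -ENOMEM;
              return 0;
      }
      __initcall(init_timer_list_procfs);
      #endif /* CONFIG_PROC_FS */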
    • ntp: Limit TAI-UTC offset · d86c0b73
      Committed by Miroslav Lichvar
      [ Upstream commit d897a4ab11dc8a9fda50d2eccc081a96a6385998 ]
      
      Don't allow the TAI-UTC offset of the system clock to be set by adjtimex()
      to a value larger than 100000 seconds.
      
      This prevents an overflow in the conversion to int, prevents the CLOCK_TAI
      clock from getting too far ahead of the CLOCK_REALTIME clock, and it is
      still large enough to allow leap seconds to be inserted at the maximum rate
      currently supported by the kernel (once per day) for the next ~270 years,
      however unlikely it is that someone can survive a catastrophic event which
      slowed down the rotation of the Earth so much.
      Reported-by: Weikang shi <swkhack@gmail.com>
      Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Stephen Boyd <sboyd@kernel.org>
      Link: https://lkml.kernel.org/r/20190618154713.20929-1-mlichvar@redhat.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d86c0b73
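
      A sketch of the added check, assuming the 4.19-era adjtimex() plumbing in
      kernel/time/ntp.c (the function name here is ours): the TAI-UTC offset is
      only accepted in the range [0, 100000] seconds.

      #define MAX_TAI_OFFSET          100000

      static void process_adj_tai_sketch(const struct timex *txc, s32 *time_tai)
      {
              if ((txc->modes & ADJ_TAI) &&
                  txc->constant >= 0 && txc->constant <= MAX_TAI_OFFSET)
                      *time_tai = txc->constant;
              /* out-of-range requests are ignored, as before */
      }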
    • sched/fair: Fix "runnable_avg_yN_inv" not used warnings · 7fc96cd2
      Committed by Qian Cai
      [ Upstream commit 509466b7d480bc5d22e90b9fbe6122ae0e2fbe39 ]
      
      runnable_avg_yN_inv[] is only used in kernel/sched/pelt.c but was
      included in several other places because they need other macros that
      all come from kernel/sched/sched-pelt.h, which was generated by
      Documentation/scheduler/sched-pelt. As a result, it causes a lot of
      compilation warnings,
      
        kernel/sched/sched-pelt.h:4:18: warning: 'runnable_avg_yN_inv' defined but not used [-Wunused-const-variable=]
        kernel/sched/sched-pelt.h:4:18: warning: 'runnable_avg_yN_inv' defined but not used [-Wunused-const-variable=]
        kernel/sched/sched-pelt.h:4:18: warning: 'runnable_avg_yN_inv' defined but not used [-Wunused-const-variable=]
        ...
      
      Silence it by appending the __maybe_unused attribute to it, so all
      generated variables and macros can still be kept in the same file.
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1559596304-31581-1-git-send-email-cai@lca.pw
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7fc96cd2
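
      The shape of the fix (values abbreviated from the generated header):
      tagging the table with __maybe_unused silences -Wunused-const-variable in
      translation units that include sched-pelt.h only for its macros.

      static const u32 runnable_avg_yN_inv[] __maybe_unused = {
              0xffffffff, 0xfa83b2da, 0xf5257d14,
              /* ... remaining generated constants ... */
      };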
    • sched/core: Add __sched tag for io_schedule() · d8b7db6c
      Committed by Gao Xiang
      [ Upstream commit e3b929b0a184edb35531153c5afcaebb09014f9d ]
      
      Non-inline io_schedule() was introduced in:
      
        commit 10ab5643 ("sched/core: Separate out io_schedule_prepare() and io_schedule_finish()")
      
      Keep in line with io_schedule_timeout(), otherwise "/proc/<pid>/wchan" will
      report io_schedule() rather than its callers when waiting for IO.
      Reported-by: Jilong Kou <koujilong@huawei.com>
      Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miao Xie <miaoxie@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 10ab5643 ("sched/core: Separate out io_schedule_prepare() and io_schedule_finish()")
      Link: https://lkml.kernel.org/r/20190603091338.2695-1-gaoxiang25@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d8b7db6c
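
      The resulting function (close to kernel/sched/core.c, shown here as a
      sketch): with the __sched annotation the wchan walker skips this frame, so
      /proc/<pid>/wchan names the real sleeper again.

      void __sched io_schedule(void)
      {
              int token;

              token = io_schedule_prepare();
              schedule();
              io_schedule_finish(token);
      }
      EXPORT_SYMBOL(io_schedule);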
    • bpf: silence warning messages in core · 7c10f894
      Committed by Valdis Klētnieks
      [ Upstream commit aee450cbe482a8c2f6fa5b05b178ef8b8ff107ca ]
      
      Compiling kernel/bpf/core.c with W=1 causes a flood of warnings:
      
      kernel/bpf/core.c:1198:65: warning: initialized field overwritten [-Woverride-init]
       1198 | #define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true
            |                                                                 ^~~~
      kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL'
       1087 |  INSN_3(ALU, ADD,  X),   \
            |  ^~~~~~
      kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP'
       1202 |   BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL),
            |   ^~~~~~~~~~~~
      kernel/bpf/core.c:1198:65: note: (near initialization for 'public_insntable[12]')
       1198 | #define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true
            |                                                                 ^~~~
      kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL'
       1087 |  INSN_3(ALU, ADD,  X),   \
            |  ^~~~~~
      kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP'
       1202 |   BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL),
            |   ^~~~~~~~~~~~
      
      98 copies of the above.
      
      The attached patch silences the warnings, because we *know* we're overwriting
      the default initializer. That leaves bpf/core.c with only 6 other warnings,
      which become more visible in comparison.
      Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7c10f894
    • locking/lockdep: Fix merging of hlocks with non-zero references · 2fbde274
      Committed by Imre Deak
      [ Upstream commit d9349850e188b8b59e5322fda17ff389a1c0cd7d ]
      
      The sequence
      
      	static DEFINE_WW_CLASS(test_ww_class);
      
      	struct ww_acquire_ctx ww_ctx;
      	struct ww_mutex ww_lock_a;
      	struct ww_mutex ww_lock_b;
      	struct ww_mutex ww_lock_c;
      	struct mutex lock_c;
      
      	ww_acquire_init(&ww_ctx, &test_ww_class);
      
      	ww_mutex_init(&ww_lock_a, &test_ww_class);
      	ww_mutex_init(&ww_lock_b, &test_ww_class);
      	ww_mutex_init(&ww_lock_c, &test_ww_class);
      
      	mutex_init(&lock_c);
      
      	ww_mutex_lock(&ww_lock_a, &ww_ctx);
      
      	mutex_lock(&lock_c);
      
      	ww_mutex_lock(&ww_lock_b, &ww_ctx);
      	ww_mutex_lock(&ww_lock_c, &ww_ctx);
      
      	mutex_unlock(&lock_c);	(*)
      
      	ww_mutex_unlock(&ww_lock_c);
      	ww_mutex_unlock(&ww_lock_b);
      	ww_mutex_unlock(&ww_lock_a);
      
      	ww_acquire_fini(&ww_ctx); (**)
      
      will trigger the following error in __lock_release() when calling
      mutex_release() at **:
      
      	DEBUG_LOCKS_WARN_ON(depth <= 0)
      
      The problem is that the hlock merging happening at * updates the
      references for test_ww_class incorrectly to 3 whereas it should've
      updated it to 4 (representing all the instances for ww_ctx and
      ww_lock_[abc]).
      
      Fix this by updating the references during merging correctly taking into
      account that we can have non-zero references (both for the hlock that we
      merge into another hlock or for the hlock we are merging into).
      Signed-off-by: Imre Deak <imre.deak@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: https://lkml.kernel.org/r/20190524201509.9199-2-imre.deak@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      2fbde274
    • signal/pid_namespace: Fix reboot_pid_ns to use send_sig not force_sig · b397462a
      Committed by Eric W. Biederman
      [ Upstream commit f9070dc94542093fd516ae4ccea17ef46a4362c5 ]
      
      The locking in force_sig_info is not prepared to deal with a task that
      exits or execs (as sighand may change).  This is not a locking problem
      in force_sig as force_sig is only built to handle synchronous
      exceptions.
      
      Further the function force_sig_info changes the signal state if the
      signal is ignored, or blocked or if SIGNAL_UNKILLABLE will prevent the
      delivery of the signal.  The signal SIGKILL can not be ignored and can
      not be blocked and SIGNAL_UNKILLABLE won't prevent it from being
      delivered.
      
      So using force_sig rather than send_sig for SIGKILL is confusing
      and pointless.
      
      Because it won't impact the sending of the signal and because
      using force_sig is wrong, replace force_sig with send_sig.
      
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Fixes: cf3f8921 ("pidns: add reboot_pid_ns() to handle the reboot syscall")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b397462a
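
      A simplified sketch of the change, assuming the reboot_pid_ns() call site
      (the wrapper name is ours): SIGKILL cannot be ignored or blocked, so the
      plain send_sig() path is sufficient and avoids force_sig_info()'s
      assumptions about the target task.

      static void kill_child_reaper_sketch(struct pid_namespace *pid_ns)
      {
              /* before: force_sig(SIGKILL, pid_ns->child_reaper); */
              send_sig(SIGKILL, pid_ns->child_reaper, 1);
      }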
  13. 21 July 2019, 5 commits
    • genirq: Add optional hardware synchronization for shutdown · 6074f604
      Committed by Thomas Gleixner
      commit 62e0468650c30f0298822c580f382b16328119f6 upstream
      
      free_irq() ensures that no hardware interrupt handler is executing on a
      different CPU before actually releasing resources and deactivating the
      interrupt completely in a domain hierarchy.
      
      But that does not catch the case where the interrupt is in flight at the
      hardware level but not yet serviced by the target CPU. That creates an
      interesting race condition:
      
         CPU 0                  CPU 1               IRQ CHIP
      
                                                    interrupt is raised
                                                    sent to CPU1
      			  Unable to handle
      			  immediately
      			  (interrupts off,
      			   deep idle delay)
         mask()
         ...
         free()
           shutdown()
           synchronize_irq()
           release_resources()
                                do_IRQ()
                                  -> resources are not available
      
      That might be harmless and just trigger a spurious interrupt warning, but
      some interrupt chips might get into a wedged state.
      
      Utilize the existing irq_get_irqchip_state() callback for the
      synchronization in free_irq().
      
      synchronize_hardirq() is not using this mechanism as it might actually
      deadlock under certain conditions, e.g. when called with interrupts
      disabled and the target CPU is the one on which the synchronization is
      invoked. synchronize_irq() uses it because that function cannot be called
      from non-preemptible contexts as it might sleep.
      
      No functional change intended and according to Marc the existing GIC
      implementations where the driver supports the callback should be able
      to cope with that core change. Famous last words.
      
      Fixes: 464d1230 ("x86/vector: Switch IOAPIC to global reservation mode")
      Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
      Tested-by: Marc Zyngier <marc.zyngier@arm.com>
      Link: https://lkml.kernel.org/r/20190628111440.279463375@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      6074f604
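
      A hedged sketch of the idea, using the exported irq_get_irqchip_state()
      helper for illustration (the in-kernel code drives the underlying irqchip
      callback directly): after masking and shutting down, poll the chip until
      the line is no longer flagged active, and only then release resources.

      static void wait_until_not_in_flight(unsigned int irq)
      {
              bool in_flight;

              do {
                      if (irq_get_irqchip_state(irq, IRQCHIP_STATE_ACTIVE, &in_flight))
                              return;         /* chip cannot report it: nothing to wait on */
                      cpu_relax();
              } while (in_flight);
      }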
    • genirq: Fix misleading synchronize_irq() documentation · 3f10ccc2
      Committed by Thomas Gleixner
      commit 1d21f2af8571c6a6a44e7c1911780614847b0253 upstream
      
      The function might sleep, so it cannot be called from interrupt
      context. Not even with care.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Link: https://lkml.kernel.org/r/20190628111440.189241552@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      3f10ccc2
    • genirq: Delay deactivation in free_irq() · 578db1aa
      Committed by Thomas Gleixner
      commit 4001d8e8762f57d418b66e4e668601791900a1dd upstream
      
      When interrupts are shutdown, they are immediately deactivated in the
      irqdomain hierarchy. While this looks obviously correct there is a subtle
      issue:
      
      There might be an interrupt in flight when free_irq() is invoking the
      shutdown. This is properly handled at the irq descriptor / primary handler
      level, but the deactivation might completely disable resources which are
      required to acknowledge the interrupt.
      
      Split the shutdown code and deactivate the interrupt after synchronization
      in free_irq(). Fixup all other usage sites where this is not an issue to
      invoke the combined shutdown_and_deactivate() function instead.
      
      This still might be an issue if the interrupt in flight servicing is
      delayed on a remote CPU beyond the invocation of synchronize_irq(), but
      that cannot be handled at that level and needs to be handled in the
      synchronize_irq() context.
      
      Fixes: f8264e34 ("irqdomain: Introduce new interfaces to support hierarchy irqdomains")
      Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
      Link: https://lkml.kernel.org/r/20190628111440.098196390@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      578db1aa
    • cpu/hotplug: Fix out-of-bounds read when setting fail state · f6e01328
      Committed by Eiichi Tsukata
      [ Upstream commit 33d4a5a7a5b4d02915d765064b2319e90a11cbde ]
      
      Setting an invalid value via /sys/devices/system/cpu/cpuX/hotplug/fail
      can control the `struct cpuhp_step *sp` address, resulting in the following
      global-out-of-bounds read.
      
      Reproducer:
      
        # echo -2 > /sys/devices/system/cpu/cpu0/hotplug/fail
      
      KASAN report:
      
        BUG: KASAN: global-out-of-bounds in write_cpuhp_fail+0x2cd/0x2e0
        Read of size 8 at addr ffffffff89734438 by task bash/1941
      
        CPU: 0 PID: 1941 Comm: bash Not tainted 5.2.0-rc6+ #31
        Call Trace:
         write_cpuhp_fail+0x2cd/0x2e0
         dev_attr_store+0x58/0x80
         sysfs_kf_write+0x13d/0x1a0
         kernfs_fop_write+0x2bc/0x460
         vfs_write+0x1e1/0x560
         ksys_write+0x126/0x250
         do_syscall_64+0xc1/0x390
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7f05e4f4c970
      
        The buggy address belongs to the variable:
         cpu_hotplug_lock+0x98/0xa0
      
        Memory state around the buggy address:
         ffffffff89734300: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
         ffffffff89734380: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
        >ffffffff89734400: 00 00 00 00 fa fa fa fa 00 00 00 00 fa fa fa fa
                                                ^
         ffffffff89734480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         ffffffff89734500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      
      Add a sanity check for the value written from user space.
      
      Fixes: 1db49484 ("smp/hotplug: Hotplug state fail injection")
      Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: peterz@infradead.org
      Link: https://lkml.kernel.org/r/20190627024732.31672-1-devel@etsukata.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f6e01328
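
      A simplified sketch of the added sanity check (the surrounding sysfs store
      is abbreviated): reject values outside the valid cpuhp state range before
      they are ever used to index the cpuhp_step table.

      static ssize_t write_cpuhp_fail_sketch(struct cpuhp_cpu_state *st,
                                             const char *buf, size_t count)
      {
              int fail, ret;

              ret = kstrtoint(buf, 10, &fail);
              if (ret)
                      return ret;

              if (fail < CPUHP_OFFLINE || fail > CPUHP_ONLINE)
                      return -EINVAL;         /* e.g. the "-2" from the reproducer */

              st->fail = fail;
              return count;
      }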
    • perf/core: Fix perf_sample_regs_user() mm check · afda29dc
      Committed by Peter Zijlstra
      [ Upstream commit 085ebfe937d7a7a5df1729f35a12d6d655fea68c ]
      
      perf_sample_regs_user() uses 'current->mm' to test for the presence of
      userspace, but this is insufficient, consider use_mm().
      
      A better test is: '!(current->flags & PF_KTHREAD)', exec() clears
      PF_KTHREAD after it sets the new ->mm but before it drops to userspace
      for the first time.
      
      Possibly obsoletes: bf05fc25 ("powerpc/perf: Fix oops when kthread execs user process")
      Reported-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
      Reported-by: Young Xiao <92siuyang@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 4018994f ("perf: Add ability to attach user level registers dump to sample")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      afda29dc
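
      A simplified sketch of the corrected check (close to the function in
      kernel/events/core.c, shown as a sketch): testing PF_KTHREAD distinguishes
      "no userspace to sample" from a kernel thread that has temporarily adopted
      an mm via use_mm().

      static void perf_sample_regs_user_sketch(struct perf_regs *regs_user,
                                               struct pt_regs *regs,
                                               struct pt_regs *regs_user_copy)
      {
              if (user_mode(regs)) {
                      regs_user->abi  = perf_reg_abi(current);
                      regs_user->regs = regs;
              } else if (!(current->flags & PF_KTHREAD)) {
                      /* before the fix this tested current->mm, which use_mm()
                       * can leave non-NULL even for a kernel thread */
                      perf_get_regs_user(regs_user, regs, regs_user_copy);
              } else {
                      regs_user->abi  = PERF_SAMPLE_REGS_ABI_NONE;
                      regs_user->regs = NULL;
              }
      }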
  14. 14 July 2019, 3 commits
  15. 10 July 2019, 1 commit
    • bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K · 54e8cf41
      Committed by Daniel Borkmann
      [ Upstream commit fdadd04931c2d7cd294dc5b2b342863f94be53a3 ]
      
      Michael and Sandipan report:
      
        Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
        JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
        and is adjusted again at init time if MODULES_VADDR is defined.
      
        For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
        the compile-time default at boot-time, which is 0x9c400000 when
        using 64K page size. This overflows the signed 32-bit bpf_jit_limit
        value:
      
        root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
        -1673527296
      
        and can cause various unexpected failures throughout the network
        stack. In one case `strace dhclient eth0` reported:
      
        setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8},
                   16) = -1 ENOTSUPP (Unknown error 524)
      
        and similar failures can be seen with tools like tcpdump. This doesn't
        always reproduce however, and I'm not sure why. The more consistent
        failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9
        host would time out on systemd/netplan configuring a virtio-net NIC
        with no noticeable errors in the logs.
      
      Given this, and also given that in the near future some architectures like
      arm64 will have a custom area for BPF JIT image allocations we should
      get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For
      4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec()
      so therefore add another overridable bpf_jit_alloc_exec_limit() helper
      function which returns the possible size of the memory area for deriving
      the default heuristic in bpf_jit_charge_init().
      
      Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new
      bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default
      JIT memory provider, and therefore in case archs implement their custom
      module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for
      vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.
      
      Additionally, for archs supporting large page sizes, we should change
      the sysctl to be handled as long to not run into sysctl restrictions
      in future.
      
      Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
      Reported-by: Sandipan Das <sandipan@linux.ibm.com>
      Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      54e8cf41
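
      Roughly the shape of the upstream change (abridged, shown as a sketch):
      derive the default JIT limit from the size of the actual executable
      allocation area instead of a fixed PAGE_SIZE multiple, and clamp it so it
      also fits the now long-typed sysctl.

      u64 __weak bpf_jit_alloc_exec_limit(void)
      {
      #if defined(MODULES_VADDR)
              return MODULES_END - MODULES_VADDR;     /* module_alloc()-backed JITs */
      #else
              return VMALLOC_END - VMALLOC_START;     /* e.g. ppc64, vmalloc_exec() */
      #endif
      }

      static int __init bpf_jit_charge_init(void)
      {
              /* heuristic: allow up to a quarter of the executable area */
              bpf_jit_limit = min_t(u64,
                                    round_up(bpf_jit_alloc_exec_limit() >> 2, PAGE_SIZE),
                                    LONG_MAX);
              return 0;
      }
      pure_initcall(bpf_jit_charge_init);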