1. 20 May, 2020 (1 commit)
    • signal: Extend exec_id to 64bits · 1f4a4074
      Eric W. Biederman authored
      commit d1e7fd6462ca9fc76650fbe6ca800e35b24267da upstream.
      
      Replace the 32bit exec_id with a 64bit exec_id to make it impossible
      to wrap the exec_id counter.  With care an attacker can cause the
      exec_id to wrap and send arbitrary signals to a newly exec'd parent.
      This bypasses the signal sending checks if the parent changes its
      credentials during exec.
      
      The severity of this problem can be seen in that, in my limited
      testing of a 32bit exec_id, it can take as little as 19s to exec
      65536 times.  Which means that it can take as little as 14 days to
      wrap a 32bit exec_id.  Adam Zabrocki has succeeded in wrapping the
      self_exec_id in 7 days.  Even my slower timing is within the uptime
      of a typical server.  Which means self_exec_id is simply a speed bump
      today, and if exec gets noticeably faster self_exec_id won't even be
      a speed bump.
      
      Extending self_exec_id to 64bits introduces a problem on 32bit
      architectures where reading self_exec_id is no longer atomic and can
      take two read instructions.  Which means that it is possible to hit
      a window where the read value of exec_id does not match the written
      value.  So with very lucky timing this still remains exploitable
      after this change.
      
      I have updated the update of exec_id on exec to use WRITE_ONCE
      and the read of exec_id in do_notify_parent to use READ_ONCE
      to make it clear that there is no locking between these two
      locations.
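
      Roughly, the shape of the annotated update and read described above
      (a sketch, not the literal diff; the surrounding code in fs/exec.c
      and kernel/signal.c differs):

          /* exec path: bump the 64bit counter; annotate the store. */
          WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);

          /* do_notify_parent(): annotate the unlocked read of the parent's
           * counter; fall back to SIGCHLD if the parent has exec'd since
           * this child was forked. */
          if (tsk->parent_exec_id != READ_ONCE(tsk->parent->self_exec_id))
                  sig = SIGCHLD;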
      
      Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
      Fixes: 2.3.23pre2
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      1f4a4074
  2. 27 December, 2019 (6 commits)
    • signal: Always ignore SIGKILL and SIGSTOP sent to the global init · f5521fef
      Eric W. Biederman authored
      [ Upstream commit 86989c41b5ea08776c450cb759592532314a4ed6 ]
      
      If the first process started (aka /sbin/init) receives a SIGKILL it
      will panic the system when the signal is delivered, making the system
      unusable and undebuggable.  It isn't much better if the first process
      started receives SIGSTOP.
      
      So always ignore SIGSTOP and SIGKILL sent to init.
      
      This is done in a separate clause in sig_task_ignored as force_sig_info
      can clear SIGNAL_UNKILLABLE and this protection should work even then.
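
      A sketch of the separate clause this adds to sig_task_ignored()
      (sig_kernel_only() is true exactly for SIGKILL and SIGSTOP; details
      may differ slightly in a backport):

          /* In sig_task_ignored(): SIGKILL and SIGSTOP may not be sent to
           * the global init, even when force_sig_info() is involved. */
          if (unlikely(is_global_init(t) && sig_kernel_only(sig)))
                  return true;
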
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      f5521fef
    • kernel/signal.c: trace_signal_deliver when signal_group_exit · f1bfcb97
      Zhenliang Wei authored
      commit 98af37d624ed8c83f1953b1b6b2f6866011fc064 upstream.
      
      In the commit named in the Fixes tag, removing SIGKILL from each
      thread's signal mask and executing "goto fatal" directly skips the
      call to "trace_signal_deliver".  At this point, the delivery tracking
      of the SIGKILL signal becomes inaccurate.

      Therefore, we need to add trace_signal_deliver before "goto fatal",
      after executing sigdelset.
      
      Note: SEND_SIG_NOINFO matches the fact that SIGKILL doesn't have any info.
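
      A sketch of where the call lands in get_signal() after this change
      (exact context may differ in a backport):

          if (signal_group_exit(signal)) {
                  ksig->info.si_signo = signr = SIGKILL;
                  sigdelset(&current->pending.signal, SIGKILL);
                  /* New: record the delivery for the tracepoint, with no
                   * siginfo, before taking the fatal path. */
                  trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
                                       &sighand->action[SIGKILL - 1]);
                  recalc_sigpending();
                  goto fatal;
          }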
      
      Link: http://lkml.kernel.org/r/20190425025812.91424-1-weizhenliang@huawei.com
      Fixes: cf43a757fd4944 ("signal: Restore the stop PTRACE_EVENT_EXIT")
      Signed-off-by: Zhenliang Wei <weizhenliang@huawei.com>
      Reviewed-by: Christian Brauner <christian@brauner.io>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Ivan Delalande <colona@arista.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Deepa Dinamani <deepa.kernel@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      f1bfcb97
    • signal: Restore the stop PTRACE_EVENT_EXIT · fd839bdb
      Eric W. Biederman authored
      commit cf43a757fd49442bc38f76088b70c2299eed2c2f upstream.
      
      In the middle of do_exit() there is a call to
      "ptrace_event(PTRACE_EVENT_EXIT, code);".  That call places the
      process in TASK_TRACED aka "(TASK_WAKEKILL | __TASK_TRACED)" and
      waits for the debugger to release the task or for SIGKILL to be
      delivered.

      Skipping past dequeue_signal when we know a fatal signal has already
      been delivered resulted in SIGKILL remaining pending and
      TIF_SIGPENDING remaining set.  This in turn caused the
      scheduler to not sleep in PTRACE_EVENT_EXIT as it figured
      a fatal signal was pending.  This also caused ptrace_freeze_traced
      in ptrace_check_attach to fail because it left a per thread
      SIGKILL pending which is what fatal_signal_pending tests for.
      
      This difference in signal state caused strace to report
      strace: Exit of unknown pid NNNNN ignored
      
      Therefore update the signal handling state like dequeue_signal
      would when removing a per thread SIGKILL, by removing SIGKILL
      from the per thread signal mask and clearing TIF_SIGPENDING.
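
      For reference, fatal_signal_pending() keys off the per thread pending
      set, which is why a leftover per thread SIGKILL broke
      ptrace_freeze_traced(); roughly (include/linux/sched/signal.h):

          static inline int __fatal_signal_pending(struct task_struct *p)
          {
                  return unlikely(sigismember(&p->pending.signal, SIGKILL));
          }

          static inline int fatal_signal_pending(struct task_struct *p)
          {
                  return signal_pending(p) && __fatal_signal_pending(p);
          }
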
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Ivan Delalande <colona@arista.com>
      Cc: stable@vger.kernel.org
      Fixes: 35634ffa1751 ("signal: Always notice exiting tasks")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      fd839bdb
    • signal: Better detection of synchronous signals · 8237146a
      Eric W. Biederman authored
      commit 7146db3317c67b517258cb5e1b08af387da0618b upstream.
      
      Recently syzkaller was able to create unkillable processes by
      creating a timer that is delivered as a thread local signal on SIGHUP,
      and receiving SIGHUP with SA_NODEFER set.  Ultimately this caused a
      loop that kept trying, and failing, to deliver SIGHUP.
      
      When the stack overflows delivery of SIGHUP fails and force_sigsegv is
      called.  Unfortunately because SIGSEGV is numerically higher than
      SIGHUP next_signal tries again to deliver a SIGHUP.
      
      From a quality of implementation standpoint attempting to deliver the
      timer SIGHUP signal is wrong.  We should attempt to deliver the
      synchronous SIGSEGV signal we just forced.
      
      We can make that happen in a fairly straightforward manner: instead
      of just looking at the signal number, we also look at the si_code.
      In particular for exceptions (aka synchronous signals) the si_code
      is always greater than 0.
      
      That still has the potential to pick up a number of asynchronous
      signals as in a few cases the same si_codes that are used
      for synchronous signals are also used for asynchronous signals,
      and SI_KERNEL is also included in the list of possible si_codes.
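
      A rough sketch of the heuristic (SYNCHRONOUS_MASK already exists in
      kernel/signal.c; the walk over the pending queue is the new part, and
      the exact code may differ):

          #define SYNCHRONOUS_MASK \
                  (sigmask(SIGSEGV) | sigmask(SIGBUS) | sigmask(SIGILL) | \
                   sigmask(SIGTRAP) | sigmask(SIGFPE) | sigmask(SIGSYS))

          /* Treat a queued signal as synchronous only if it is one of the
           * exception signals *and* its si_code is positive: kernel fault
           * codes are > 0, while SI_USER == 0 and SI_TIMER etc. are < 0. */
          list_for_each_entry(q, &pending->list, list) {
                  if ((q->info.si_code > SI_USER) &&
                      (sigmask(q->info.si_signo) & SYNCHRONOUS_MASK)) {
                          sync = q;
                          break;
                  }
          }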
      
      Still the heuristic is much better, and timer signals are definitely
      excluded.  Which is enough to prevent all known ways for someone to
      send a process signals fast enough to cause unexpected and arguably
      incorrect behavior.
      
      Cc: stable@vger.kernel.org
      Fixes: a27341cd ("Prioritize synchronous signals over 'normal' signals")
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8237146a
    • signal: Always notice exiting tasks · c1b7da77
      Eric W. Biederman authored
      commit 35634ffa1751b6efd8cf75010b509dcb0263e29b upstream.
      
      Recently syzkaller was able to create unkillable processes by
      creating a timer that is delivered as a thread local signal on SIGHUP,
      and receiving SIGHUP with SA_NODEFER set.  Ultimately this caused a
      loop that kept trying, and failing, to deliver SIGHUP.
      
      Upon examination it turns out part of the problem is actually most of
      the solution.  Since 2.5 signal delivery has found all fatal signals,
      marked the signal group for death, and queued SIGKILL in every
      thread's thread queue, relying on signal->group_exit_code to preserve
      the information of which was the actual fatal signal.
      
      The conversion of all fatal signals to SIGKILL results in the
      synchronous signal heuristic in next_signal kicking in and preferring
      SIGHUP to SIGKILL.  Which is especially problematic as all
      fatal signals have already been transformed into SIGKILL.
      
      Instead of dequeueing signals and depending upon SIGKILL to
      be the first signal dequeued, first test if the signal group
      has already been marked for death.  This guarantees that
      nothing in the signal queue can prevent a process that needs
      to exit from exiting.
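
      The test that now runs before any dequeueing is signal_group_exit();
      roughly (include/linux/sched/signal.h):

          static inline bool signal_group_exit(const struct signal_struct *sig)
          {
                  return (sig->flags & SIGNAL_GROUP_EXIT) ||
                         (sig->group_exit_task != NULL);
          }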
      
      Cc: stable@vger.kernel.org
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Ref: ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4")
      History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c1b7da77
    • Remove 'type' argument from access_ok() function · 4983cb67
      Linus Torvalds authored
      mainline inclusion
      from mainline-5.0-rc1
      commit 96d4f267
      category: cleanup
      bugzilla: 9284
      CVE: NA
      
      It's a cleanup patch that prepares for applying the CVE-2018-20669
      patch 594cc251 ("make 'user_access_begin()' do 'access_ok()'")
      
      -------------------------------------------------
      
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going
      to move the old VERIFY_xyz interface to that model.  And it's best
      done at the end of the merge window when I've done most of my merges,
      so let's just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
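
      An illustrative caller (not taken from any particular file), before
      and after the conversion:

          /* Before: */
          if (!access_ok(VERIFY_WRITE, buf, len))
                  return -EFAULT;

          /* After 96d4f267 the type argument is gone: */
          if (!access_ok(buf, len))
                  return -EFAULT;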
      
      Conflicts:
      	drivers/media/v4l2-core/v4l2-compat-ioctl32.c
      	drivers/infiniband/core/uverbs_main.c
      	drivers/platform/goldfish/goldfish_pipe.c
      	fs/namespace.c
      	fs/select.c
      	kernel/compat.c
      	arch/powerpc/include/asm/uaccess.h
      	arch/arm64/kernel/perf_callchain.c
      	arch/arm64/include/asm/uaccess.h
      	arch/ia64/kernel/signal.c
      	arch/x86/entry/vsyscall/vsyscall_64.c
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4983cb67
  3. 14 November, 2018 (3 commits)
  4. 23 August, 2018 (17 commits)
  5. 10 August, 2018 (1 commit)
    • signal: Don't restart fork when signals come in. · c3ad2c3b
      Eric W. Biederman authored
      Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
      report that a periodic signal received during fork can cause fork to
      continually restart preventing an application from making progress.
      
      The code was being overly pessimistic.  Fork needs to guarantee that a
      signal sent to multiple processes is logically delivered before the
      fork and just to the forking process, or logically delivered after the
      fork to both the forking process and its newly spawned child.  For
      signals like periodic timers that are always delivered to a single
      process, fork can safely complete and let them appear to be logically
      delivered after the fork().
      
      While examining this issue I also discovered that fork today will miss
      signals delivered to multiple processes during the fork and handled by
      another thread.  Similarly the current code will also miss blocked
      signals that are delivered to multiple processes, as those signals
      will not appear pending during fork.
      
      Add a list of each thread that is currently forking, and keep on that
      list a signal set that records all of the signals sent to multiple
      processes.  When fork completes, initialize the new process's
      shared_pending signal set with it.  The calculate_sigpending function
      will see those signals and set TIF_SIGPENDING causing the new task to
      take the slow path to userspace to handle those signals.  Making it
      appear as if those signals were received immediately after the fork.
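
      A rough sketch of that bookkeeping (names follow the mainline change;
      details may differ in a backport):

          struct multiprocess_signals {
                  sigset_t signal;        /* group signals seen during this fork */
                  struct hlist_node node; /* entry in signal->multiprocess list */
          };

          /* copy_process(), under sighand->siglock: publish the record so
           * senders of multi-process signals can add to it, then seed the
           * child's shared_pending from it once the fork succeeds. */
          sigemptyset(&delayed.signal);
          hlist_add_head(&delayed.node, &current->signal->multiprocess);
          ...
          p->signal->shared_pending.signal = delayed.signal;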
      
      It is not possible to send real time signals to multiple processes and
      exceptions don't go to multiple processes, which means that there are
      no signals sent to multiple processes that require siginfo.  This
      means it is safe to not bother collecting siginfo on signals sent
      during fork.
      
      The sigaction of a child of fork is initially the same as the
      sigaction of the parent process.  So a signal the parent ignores will
      also initially be ignored by the child.  Therefore it is safe to
      ignore signals sent to multiple processes and ignored by the forking
      process.
      
      Signals sent to only a single process or only a single thread and delivered
      during fork are treated as if they are received after the fork, and generally
      not dealt with.  They won't cause any problems.
      
      V2: Added removal from the multiprocess list on failure.
      V3: Use -ERESTARTNOINTR directly
      V4: - Don't queue both SIGCONT and SIGSTOP
          - Initialize signal_struct.multiprocess in init_task
          - Move setting of shared_pending to before the new task
            is visible to signals.  This prevents signals from coming
            in before shared_pending.signal is set to delayed.signal
            and being lost.
      V5: - rework list add and delete to account for idle threads
      v6: - Use sigdelsetmask when removing stop signals
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
      Reported-by: Wen Yang <wen.yang99@zte.com.cn> and
      Reported-by: majiang <ma.jiang@zte.com.cn>
      Fixes: 4a2c7a78 ("[PATCH] make fork() atomic wrt pgrp/session signals")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      c3ad2c3b
  6. 04 August, 2018 (2 commits)
    • fork: Have new threads join on-going signal group stops · 924de3b8
      Eric W. Biederman authored
      There are only two signals that are delivered to every member of a
      signal group: SIGSTOP and SIGKILL.  Signal delivery requires every
      signal appear to be delivered either before or after a clone syscall.
      SIGKILL terminates the clone so does not need to be considered.  Which
      leaves only SIGSTOP that needs to be considered when creating new
      threads.
      
      Today in the event of a group stop TIF_SIGPENDING will get set and the
      fork will restart ensuring the fork syscall participates in the group
      stop.
      
      A fork (especially of a process with a lot of memory) is one of the
      most expensive system calls, so we really only want to restart a fork
      when necessary.
      
      It is easy to check whether a SIGSTOP is ongoing and have the new
      thread join it immediately after the clone completes, making it
      appear that the clone completed just before the SIGSTOP.
      
      The calculate_sigpending function will see the bits set in jobctl and
      set TIF_SIGPENDING to ensure the new task takes the slow path to userspace.
      
      V2: The call to task_join_group_stop was moved before the new task is
          added to the thread group list.  This should not matter as
          sighand->siglock is held over both the addition of the threads,
          the call to task_join_group_stop and do_signal_stop.  But the change
          is trivial and it is one less thing to worry about when reading
          the code.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      924de3b8
    • signal: Add calculate_sigpending() · 088fe47c
      Eric W. Biederman authored
      Add a function calculate_sigpending to test to see if any signals are
      pending for a new task immediately following fork.  Signals have to
      happen either before or after fork.  Today our practice is to push
      all of the signals to before the fork, but that has the downside that
      frequent or periodic signals can make fork take much much longer than
      normal or prevent fork from completing entirely.
      
      So we need to move the signals that we can to after the fork to
      prevent that.
      
      This updates the code to set TIF_SIGPENDING on a new task if there
      are signals or other activities that have moved so that they appear
      to happen after the fork.
      
      As the code today restarts if it sees any such activity this won't
      immediately have an effect, as there will be no reason for it
      to set TIF_SIGPENDING immediately after the fork.
      
      Adding calculate_sigpending means the code in fork can safely be
      changed to not always restart if a signal is pending.
      
      The new calculate_sigpending function sets sigpending if there
      are pending bits in jobctl, pending signals, the freezer needs
      to freeze the new task or the live kernel patching framework
      need the new thread to take the slow path to userspace.
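
      A sketch of the helper this adds (called from schedule_tail(), per
      the V2 note below):

          void calculate_sigpending(void)
          {
                  /* Did anything get delayed until after fork?  Assume yes,
                   * and let recalc_sigpending() clear the flag if not. */
                  spin_lock_irq(&current->sighand->siglock);
                  set_tsk_thread_flag(current, TIF_SIGPENDING);
                  recalc_sigpending();
                  spin_unlock_irq(&current->sighand->siglock);
          }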
      
      I have verified that setting TIF_SIGPENDING does make a new process
      take the slow path to userspace before it executes its first
      userspace instruction.
      
      I have looked at the callers of signal_wake_up and the code paths
      setting TIF_SIGPENDING and I don't see anything else that needs to be
      handled.  The code probably doesn't need to set TIF_SIGPENDING for the
      kernel live patching as it uses a separate thread flag as well.  But
      at this point it seems safer to reuse the recalc_sigpending logic and
      get the kernel live patching folks to sort out their story later.
      
      V2: I have moved the test into schedule_tail where siglock can
          be grabbed and recalc_sigpending can be reused directly.
          Further as the last action of setting up a new task this
          guarantees that TIF_SIGPENDING will be properly set in the
          new process.
      
          The helper calculate_sigpending takes the siglock and
          unconditionally sets TIF_SIGPENDING and lets recalc_sigpending
          clear TIF_SIGPENDING if it is unnecessary.  This allows reusing
          the existing code and keeps maintenance of the conditions simple.
      
          Oleg Nesterov <oleg@redhat.com>  suggested the movement
          and pointed out the need to take siglock if this code
          was going to be called while the new task is discoverable.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      088fe47c
  7. 22 July, 2018 (5 commits)
  8. 21 July, 2018 (1 commit)
    • signal: Pass pid and pid type into send_sigqueue · 24122c7f
      Eric W. Biederman authored
      Make the code more maintainable by performing more of the signal
      related work in send_sigqueue.
      
      A quick inspection of do_timer_create will show that this code path
      does not look up a thread group by a thread's pid.  Making it safe
      to find the task pointed to by it_pid with "pid_task(it_pid, type)".
      
      This supports the changes needed in fork to tell if a signal was sent
      to a single process or a group of processes.
      
      Having the pid to task transition in signal.c will also make it easier
      to sort out races with de_thread and the thread group leader
      exiting when it comes time to address that.
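
      The shape of the interface change, roughly (the old prototype is
      shown for contrast):

          /* Before: the caller resolved the task and the group decision. */
          int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group);

          /* After: pass the pid and pid type; send_sigqueue() does the
           * pid_task() lookup and picks per-thread vs. whole-group itself. */
          int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type);
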
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      24122c7f
  9. 10 June, 2018 (1 commit)
  10. 04 May, 2018 (1 commit)
    • sched/core: Introduce set_special_state() · b5bf9a90
      Peter Zijlstra authored
      Gaurav reported a perceived problem with TASK_PARKED, which turned out
      to be a broken wait-loop pattern in __kthread_parkme(), but the
      reported issue can (and does) in fact happen for states that do not do
      condition based sleeps.
      
      When the 'current->state = TASK_RUNNING' store of a previous
      (concurrent) try_to_wake_up() collides with the setting of a 'special'
      sleep state, we can lose the sleep state.

      Normal condition based wait-loops are immune to this problem, but
      sleep states that are not condition based are subject to it.

      There already is a fix for TASK_DEAD.  Abstract that and also apply it
      to TASK_STOPPED and TASK_TRACED, both of which are also without a
      condition based wait-loop.
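
      Roughly what the new helper looks like (include/linux/sched.h); the
      store is serialized against try_to_wake_up() by ->pi_lock:

          #define set_special_state(state_value)                               \
                  do {                                                         \
                          unsigned long flags; /* may shadow */                \
                          raw_spin_lock_irqsave(&current->pi_lock, flags);     \
                          current->state = (state_value);                      \
                          raw_spin_unlock_irqrestore(&current->pi_lock, flags);\
                  } while (0)
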
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b5bf9a90
  11. 27 April, 2018 (2 commits)
    • signal: Extend siginfo_layout with SIL_FAULT_{MCEERR|BNDERR|PKUERR} · 31931c93
      Eric W. Biederman authored
      Update the siginfo_layout function and enum siginfo_layout to represent
      all of the possible field layouts of struct siginfo.
      
      This allows the uses of siginfo_layout in um and arm64, where they
      test for SIL_FAULT, to be more accurate as this rules out the other
      cases.
      
      Further this allows the switch statements on siginfo_layout to be simpler
      if perhaps a little more wordy.  Making it easier to understand what is
      actually going on.
      
      As SIL_FAULT_BNDERR and SIL_FAULT_PKUERR are never expected to appear
      in signalfd just treat them as SIL_FAULT.  To include them would take
      20 extra bytes and pretty much fill up what is left of
      signalfd_siginfo.
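
      A sketch of the extended enum (include/linux/signal.h); the new
      SIL_FAULT_* entries say which extra siginfo fields are valid:

          enum siginfo_layout {
                  SIL_KILL,
                  SIL_TIMER,
                  SIL_POLL,
                  SIL_FAULT,
                  SIL_FAULT_MCEERR,   /* si_addr + si_addr_lsb */
                  SIL_FAULT_BNDERR,   /* si_addr + si_lower/si_upper */
                  SIL_FAULT_PKUERR,   /* si_addr + si_pkey */
                  SIL_CHLD,
                  SIL_RT,
                  SIL_SYS,
          };
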
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      31931c93
    • signal: Remove unncessary #ifdef SEGV_PKUERR in 32bit compat code · 36a4ca3d
      Eric W. Biederman authored
      The only architecture that does not support SEGV_PKUERR is ia64 and
      ia64 has not had 32bit support since some time in 2008.  Therefore
      copy_siginfo_to_user32 and copy_siginfo_from_user32 do not need to
      include support for a missing SEGV_PKUERR.
      
      Compile tested on ia64.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      36a4ca3d