1. 13 Dec, 2020 1 commit
  2. 11 Dec, 2020 1 commit
    • exec: Transform exec_update_mutex into a rw_semaphore · f7cfd871
      Eric W. Biederman committed
      Recently syzbot reported[0] that there is a deadlock amongst the users
      of exec_update_mutex.  The problematic lock ordering found by lockdep
      was:
      
         perf_event_open  (exec_update_mutex -> ovl_i_mutex)
         chown            (ovl_i_mutex       -> sb_writes)
         sendfile         (sb_writes         -> p->lock)
           by reading from a proc file and writing to overlayfs
         proc_pid_syscall (p->lock           -> exec_update_mutex)
      
      While looking at possible solutions it occurred to me that all of the
      users and possible users involved only wanted the state of the given
      process to remain the same.  They are all readers.  The only writer is
      exec.
      
      There is no reason for readers to block on each other.  So fix
      this deadlock by transforming exec_update_mutex into a rw_semaphore
      named exec_update_lock that only exec takes for writing.
      
      Cc: Jann Horn <jannh@google.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christopher Yeoh <cyeoh@au1.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Fixes: eea96732 ("exec: Add exec_update_mutex to replace cred_guard_mutex")
      [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
      Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
      Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  3. 29 Oct, 2020 2 commits
  4. 08 Jul, 2020 1 commit
  5. 29 Apr, 2020 1 commit
  6. 03 Apr, 2020 2 commits
    • mm: return faster for non-fatal signals in user mode faults · 8b9a65fd
      Peter Xu committed
      The idea comes from the upstream discussion between Linus and Andrea:
      
        https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
      
      A summary of the issue: handle_userfault() used to have a special path
      that returned VM_FAULT_NOPAGE when it detected a non-fatal signal while
      waiting for userfault handling.  It did that by reacquiring the mmap_sem
      before returning.  However, that carries a risk: the vmas might have
      changed by the time we retake the mmap_sem, and we could even be holding
      an invalid vma structure.
      
      This patch prepares for removing that special path by allowing the page
      fault to return even faster if we were interrupted by a non-fatal signal
      during a user-mode page fault handling routine.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160230.9598-1-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce fault_signal_pending() · 4ef87322
      Peter Xu committed
      For most architectures, we've got a quick path to detect fatal signal
      after a handle_mm_fault().  Introduce a helper for that quick path.
      
      It cleans up the current code a bit so we don't need to duplicate the same
      check across archs.  More importantly, this will be a unified place where
      we handle the signal immediately after an interrupted page fault, so
      it'll be much easier for us if we want to change the behavior of handling
      signals later on for all the archs.
      
      Note that currently only some of the archs use this new helper,
      because some archs have their own way to handle signals.  In the follow-up
      patches, we'll try to apply this helper to the rest of the archs.
      
      Another note is that the "regs" parameter in the new helper is not used
      yet.  It'll be used very soon.  It is kept in this patch only to avoid
      touching all the archs again in the follow-up patches.
      
      [peterx@redhat.com: fix sparse warnings]
        Link: http://lkml.kernel.org/r/20200311145921.GD479302@xz-x1
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220155353.8676-4-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
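The two mm commits above describe one helper at two stages. The sketch below is a userspace model of that check under stated assumptions, not the kernel's definition. `MODEL_VM_FAULT_RETRY`, `struct model_task`, `struct model_regs`, and the `_v1`/`_v2` suffixes are hypothetical names. The task is passed explicitly here, whereas kernel code would consult the current task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag value; only its role as "fault must be retried" matters. */
#define MODEL_VM_FAULT_RETRY 0x1u

struct model_task {
	bool fatal_signal;   /* a fatal signal (e.g. SIGKILL) is pending */
	bool any_signal;     /* some signal, fatal or not, is pending */
};

struct model_regs {
	bool user_mode;      /* the fault was taken from user mode */
};

/* Stage 1 (4ef87322): only a fatal signal ends an interrupted fault early. */
static bool fault_signal_pending_v1(unsigned int fault_flags,
				    const struct model_regs *regs,
				    const struct model_task *t)
{
	(void)regs;          /* accepted but unused at this stage */
	assert(t != NULL);
	return (fault_flags & MODEL_VM_FAULT_RETRY) && t->fatal_signal;
}

/* Stage 2 (8b9a65fd): user-mode faults also return on non-fatal signals. */
static bool fault_signal_pending_v2(unsigned int fault_flags,
				    const struct model_regs *regs,
				    const struct model_task *t)
{
	assert(regs != NULL && t != NULL);
	return (fault_flags & MODEL_VM_FAULT_RETRY) &&
	       (t->fatal_signal || (regs->user_mode && t->any_signal));
}
```

In this model a kernel-mode fault still ignores non-fatal signals, so only user-mode faults gain the faster return.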
  7. 25 Mar, 2020 1 commit
  8. 28 Aug, 2019 3 commits
  9. 17 Jul, 2019 2 commits
    • signal: simplify set_user_sigmask/restore_user_sigmask · b772434b
      Oleg Nesterov committed
      task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
      syscall paths.  This means that set_user_sigmask() can save ->blocked in
      ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
      was modified.
      
      This way the callers do not need 2 sigset_t's passed to set/restore, and
      restore_user_sigmask(), renamed to restore_saved_sigmask_unless(), turns
      into a trivial helper which just calls restore_saved_sigmask().
      
      Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Deepa Dinamani <deepa.kernel@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Eric Wong <e@80x24.org>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: David Laight <David.Laight@aculab.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
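The save-and-flag scheme in the sigmask commit can be modeled in userspace. This is a sketch under assumptions, not the kernel code. `struct model_task` and the `model_` function names are hypothetical, and the per-task fields are ordinary struct members here rather than task_struct state and thread flags.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-task fields named in the commit message. */
struct model_task {
	sigset_t blocked;        /* currently blocked signals */
	sigset_t saved_sigmask;  /* mask to put back later */
	bool restore_sigmask;    /* set when ->blocked was temporarily changed */
};

/* Save ->blocked, mark it as modified, then install the temporary mask. */
static void model_set_user_sigmask(struct model_task *t, const sigset_t *newmask)
{
	assert(t != NULL && newmask != NULL);
	t->saved_sigmask = t->blocked;
	t->restore_sigmask = true;
	t->blocked = *newmask;
}

/* Put the saved mask back if, and only if, one was saved. */
static void model_restore_saved_sigmask(struct model_task *t)
{
	if (t->restore_sigmask) {
		t->restore_sigmask = false;
		t->blocked = t->saved_sigmask;
	}
}

/*
 * If the syscall was interrupted by a signal, leave the flag set so the
 * return-to-user path can restore the mask after the handler is set up.
 */
static void model_restore_saved_sigmask_unless(struct model_task *t, bool interrupted)
{
	if (!interrupted)
		model_restore_saved_sigmask(t);
}
```

Callers pass only the new mask. The old one lives in the task, which is why the two-sigset_t interface is no longer needed.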
    • signal: reorder struct sighand_struct · e2d9018e
      Alexey Dobriyan committed
      struct sighand_struct::siglock field is the most used field by far, put
      it first so that it can be accessed without IMM8 or IMM32 encoding on
      x86_64.
      
      Space savings (on trimmed down VM test config):
      
      add/remove: 0/0 grow/shrink: 8/68 up/down: 49/-1147 (-1098)
      Function                                     old     new   delta
      complete_signal                              512     533     +21
      do_signalfd4                                 335     346     +11
      __cleanup_sighand                             39      43      +4
      unhandled_signal                              49      52      +3
      prepare_signal                               692     695      +3
      ignore_signals                                37      40      +3
      __tty_check_change.part                      248     251      +3
      ksys_unshare                                 780     781      +1
      sighand_ctor                                  33      29      -4
      ptrace_trap_notify                            60      56      -4
      sigqueue_free                                 98      91      -7
      run_posix_cpu_timers                        1389    1382      -7
      proc_pid_status                             2448    2441      -7
      proc_pid_limits                              344     337      -7
      posix_cpu_timer_rearm                        222     215      -7
      posix_cpu_timer_get                          249     242      -7
      kill_pid_info_as_cred                        243     236      -7
      freeze_task                                  197     190      -7
      flush_old_exec                              1873    1866      -7
      do_task_stat                                3363    3356      -7
      do_send_sig_info                              98      91      -7
      do_group_exit                                147     140      -7
      init_sighand                                2088    2080      -8
      do_notify_parent_cldstop                     399     391      -8
      signalfd_cleanup                              50      41      -9
      do_notify_parent                             557     545     -12
      __send_signal                               1029    1017     -12
      ptrace_stop                                  590     577     -13
      get_signal                                  1576    1563     -13
      __lock_task_sighand                          112      99     -13
      zap_pid_ns_processes                         391     377     -14
      update_rlimit_cpu                             78      64     -14
      tty_signal_session_leader                    413     399     -14
      tty_open_proc_set_tty                        149     135     -14
      tty_jobctrl_ioctl                            936     922     -14
      set_cpu_itimer                               339     325     -14
      ptrace_resume                                226     212     -14
      ptrace_notify                                110      96     -14
      proc_clear_tty                                81      67     -14
      posix_cpu_timer_del                          229     215     -14
      kernel_sigaction                             156     142     -14
      getrusage                                    977     963     -14
      get_current_tty                               98      84     -14
      force_sigsegv                                 89      75     -14
      force_sig_info                               205     191     -14
      flush_signals                                 83      69     -14
      flush_itimer_signals                          85      71     -14
      do_timer_create                             1120    1106     -14
      do_sigpending                                 88      74     -14
      do_signal_stop                               537     523     -14
      cgroup_init_fs_context                       644     630     -14
      call_usermodehelper_exec_async               402     388     -14
      calculate_sigpending                          58      44     -14
      __x64_sys_timer_delete                       248     234     -14
      __set_current_blocked                         80      66     -14
      __ptrace_unlink                              310     296     -14
      __ptrace_detach.part                         187     173     -14
      send_sigqueue                                362     347     -15
      get_cpu_itimer                               214     199     -15
      signalfd_poll                                175     159     -16
      dequeue_signal                               340     323     -17
      do_getitimer                                 192     174     -18
      release_task.part                           1060    1040     -20
      ptrace_peek_siginfo                          408     387     -21
      posix_cpu_timer_set                          827     806     -21
      exit_signals                                 437     416     -21
      do_sigaction                                 541     520     -21
      do_setitimer                                 485     464     -21
      disassociate_ctty.part                       545     517     -28
      __x64_sys_rt_sigtimedwait                    721     679     -42
      __x64_sys_ptrace                            1319    1277     -42
      ptrace_request                              1828    1782     -46
      signalfd_read                                507     459     -48
      wait_consider_task                          2027    1971     -56
      do_coredump                                 3672    3616     -56
      copy_process.part                           6936    6871     -65
      
      Link: http://lkml.kernel.org/r/20190503192800.GA18004@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
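The layout argument in the sighand_struct commit can be checked with `offsetof`. The structures below are hypothetical stand-ins with illustrative member sizes, not the kernel's definitions. They show only that moving the hot field to the front gives it offset 0, so an access needs no displacement, whereas behind a large array its offset no longer fits in a signed 8-bit displacement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; sizes are illustrative only. */
struct model_lock { int raw; };
struct model_action { void *handler; unsigned long flags; };

/* Before: the frequently used lock sits behind a large array. */
struct sighand_old {
	int count;
	struct model_action action[64];
	struct model_lock siglock;
};

/* After: the lock comes first, at offset 0. */
struct sighand_new {
	struct model_lock siglock;
	int count;
	struct model_action action[64];
};

static_assert(offsetof(struct sighand_new, siglock) == 0,
	      "hot field should need no displacement");
static_assert(offsetof(struct sighand_old, siglock) > 127,
	      "old offset does not fit a signed 8-bit displacement");
```

Both assertions hold at compile time, so a successful build is the check.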
  10. 29 May, 2019 3 commits
    • signal: Remove the signal number and task parameters from force_sig_info · a89e9b8a
      Eric W. Biederman committed
      force_sig_info always delivers to the current task and the signal
      parameter always matches info.si_signo.  So remove those parameters to
      make it a simpler less error prone interface, and to make it clear
      that none of the callers are doing anything clever.
      
      This guarantees that force_sig_info will not grow any new buggy
      callers that attempt to call force_sig on a non-current task, or that
      pass a signal number that does not match info.si_signo.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • signal: Remove the task parameter from force_sig_fault · 2e1661d2
      Eric W. Biederman committed
      As synchronous exceptions really only make sense against the current
      task (otherwise how are you synchronous) remove the task parameter
      from force_sig_fault to make it explicit that is what is going
      on.
      
      The two known exceptions that deliver a synchronous exception to a
      stopped ptraced task have already been changed to
      force_sig_fault_to_task.
      
      The callers have been changed with the following emacs regular expression
      (with obvious variations on the architectures that take more arguments)
      to avoid typos:
      
      force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)]
      ->
      force_sig_fault(\1,\2,\3)
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • signal: Use force_sig_fault_to_task for the two calls that don't deliver to current · 91ca180d
      Eric W. Biederman committed
      In preparation for removing the task parameter from force_sig_fault
      introduce force_sig_fault_to_task and use it for the two cases where
      it matters.
      
      On mips force_fcr31_sig calls force_sig_fault and is called on either
      the current task, or a task that is suspended and is being switched to
      by the scheduler.  This is safe because the task being switched to by
      the scheduler is guaranteed to be suspended.  This ensures that
      task->sighand is stable while the signal is delivered to it.
      
      On parisc user_enable_single_step calls force_sig_fault and is in turn
      called by ptrace_request.  The function ptrace_request always calls
      user_enable_single_step on a child that is stopped for tracing.  The
      child being traced and not reaped ensures that child->sighand is not
      NULL, and that the child will not change child->sighand.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  11. 27 May, 2019 3 commits
  12. 23 May, 2019 1 commit
    • signal/usb: Replace kill_pid_info_as_cred with kill_pid_usb_asyncio · 70f1b0d3
      Eric W. Biederman committed
      The usb support for asyncio encoded one of its values in the wrong
      field.  It should have used si_value but instead used si_addr which is
      not present in the _rt union member of struct siginfo.
      
      The practical result of this is that on a 64bit big endian kernel
      when delivering a signal to a 32bit process the si_addr field
      is set to NULL, instead of the expected pointer value.
      
      This issue can not be fixed in copy_siginfo_to_user32 as the usb
      usage of the _sigfault (aka si_addr) member of the siginfo
      union when SI_ASYNCIO is set is incompatible with the POSIX and
      glibc usage of the _rt member of the siginfo union.
      
      Therefore replace kill_pid_info_as_cred with kill_pid_usb_asyncio a
      dedicated function for this one specific case.  There are no other
      users of kill_pid_info_as_cred so this specialization should have no
      impact on the amount of code in the kernel.  Have kill_pid_usb_asyncio
      take, instead of a siginfo_t (which is difficult and error prone), 3
      arguments: a signal number, an errno value, and an address encoded as
      a sigval_t.  The encoding of the address as a sigval_t allows the
      code that reads the userspace request for a signal to handle this
      compat issue along with all of the other compat issues.
      
      Add BUILD_BUG_ONs in kernel/signal.c to ensure that we can now place
      the pointer value in si_pid (instead of si_addr).  That is, the
      code now verifies that si_pid and si_addr always occur at the same
      location.  Further the code verifies that for native structures a value
      placed in si_pid and spilling into si_uid will appear in userspace in
      si_addr (on a byte by byte copy of siginfo or a field by field copy of
      siginfo).  The code also verifies that for a 64bit kernel and a 32bit
      userspace the 32bit pointer will fit in si_pid.
      
      I have used the usbsig.c program below written by Alan Stern and
      slightly tweaked by me to run on a big endian machine to verify the
      issue exists (on sparc64) and to confirm the patch below fixes the issue.
      
       /* usbsig.c -- test USB async signal delivery */
      
       #define _GNU_SOURCE
       #include <stdio.h>
       #include <fcntl.h>
       #include <signal.h>
       #include <string.h>
       #include <sys/ioctl.h>
       #include <unistd.h>
       #include <endian.h>
       #include <linux/usb/ch9.h>
       #include <linux/usbdevice_fs.h>
      
       static struct usbdevfs_urb urb;
       static struct usbdevfs_disconnectsignal ds;
       static volatile sig_atomic_t done = 0;
      
       void urb_handler(int sig, siginfo_t *info , void *ucontext)
       {
       	printf("Got signal %d, signo %d errno %d code %d addr: %p urb: %p\n",
       	       sig, info->si_signo, info->si_errno, info->si_code,
       	       info->si_addr, &urb);
      
       	printf("%s\n", (info->si_addr == &urb) ? "Good" : "Bad");
       }
      
       void ds_handler(int sig, siginfo_t *info , void *ucontext)
       {
       	printf("Got signal %d, signo %d errno %d code %d addr: %p ds: %p\n",
       	       sig, info->si_signo, info->si_errno, info->si_code,
       	       info->si_addr, &ds);
      
       	printf("%s\n", (info->si_addr == &ds) ? "Good" : "Bad");
       	done = 1;
       }
      
       int main(int argc, char **argv)
       {
       	char *devfilename;
       	int fd;
       	int rc;
       	struct sigaction act;
       	struct usb_ctrlrequest *req;
       	void *ptr;
       	char buf[80];
      
       	if (argc != 2) {
       		fprintf(stderr, "Usage: usbsig device-file-name\n");
       		return 1;
       	}
      
       	devfilename = argv[1];
       	fd = open(devfilename, O_RDWR);
       	if (fd == -1) {
       		perror("Error opening device file");
       		return 1;
       	}
      
       	act.sa_sigaction = urb_handler;
       	sigemptyset(&act.sa_mask);
       	act.sa_flags = SA_SIGINFO;
      
       	rc = sigaction(SIGUSR1, &act, NULL);
       	if (rc == -1) {
       		perror("Error in sigaction");
       		return 1;
       	}
      
       	act.sa_sigaction = ds_handler;
       	sigemptyset(&act.sa_mask);
       	act.sa_flags = SA_SIGINFO;
      
       	rc = sigaction(SIGUSR2, &act, NULL);
       	if (rc == -1) {
       		perror("Error in sigaction");
       		return 1;
       	}
      
       	memset(&urb, 0, sizeof(urb));
       	urb.type = USBDEVFS_URB_TYPE_CONTROL;
       	urb.endpoint = USB_DIR_IN | 0;
       	urb.buffer = buf;
       	urb.buffer_length = sizeof(buf);
       	urb.signr = SIGUSR1;
      
       	req = (struct usb_ctrlrequest *) buf;
       	req->bRequestType = USB_DIR_IN | USB_TYPE_STANDARD | USB_RECIP_DEVICE;
       	req->bRequest = USB_REQ_GET_DESCRIPTOR;
       	req->wValue = htole16(USB_DT_DEVICE << 8);
       	req->wIndex = htole16(0);
       	req->wLength = htole16(sizeof(buf) - sizeof(*req));
      
       	rc = ioctl(fd, USBDEVFS_SUBMITURB, &urb);
       	if (rc == -1) {
       		perror("Error in SUBMITURB ioctl");
       		return 1;
       	}
      
       	rc = ioctl(fd, USBDEVFS_REAPURB, &ptr);
       	if (rc == -1) {
       		perror("Error in REAPURB ioctl");
       		return 1;
       	}
      
       	memset(&ds, 0, sizeof(ds));
       	ds.signr = SIGUSR2;
       	ds.context = &ds;
       	rc = ioctl(fd, USBDEVFS_DISCSIGNAL, &ds);
       	if (rc == -1) {
       		perror("Error in DISCSIGNAL ioctl");
       		return 1;
       	}
      
       	printf("Waiting for usb disconnect\n");
       	while (!done) {
       		sleep(1);
       	}
      
       	close(fd);
       	return 0;
       }
      
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: linux-usb@vger.kernel.org
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.com>
      Fixes: v2.3.39
      Cc: stable@vger.kernel.org
      Acked-by: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
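The layout property that the usb commit's BUILD_BUG_ONs guard can be checked from userspace against the C library's `siginfo_t`. This is a sketch, not the kernel's checks. `roundtrip_through_si_pid` is a hypothetical helper, and it assumes a native (same word size) process on a libc where `si_pid` and `si_addr` are members of the same union.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* The pointer is stored where si_pid lives and read back through si_addr. */
static_assert(offsetof(siginfo_t, si_pid) == offsetof(siginfo_t, si_addr),
	      "si_pid and si_addr must start at the same offset");
static_assert(sizeof(union sigval) == sizeof(void *),
	      "a sigval must be able to carry a native pointer");

/*
 * Hypothetical helper: encode a pointer as a sigval, place it at the si_pid
 * offset (spilling into the following bytes on 64-bit), read it via si_addr.
 */
static void *roundtrip_through_si_pid(void *p)
{
	siginfo_t info;
	union sigval v;

	memset(&info, 0, sizeof(info));
	v.sival_ptr = p;
	memcpy((char *)&info + offsetof(siginfo_t, si_pid), &v, sizeof(v));
	return info.si_addr;
}
```

For a native process the pointer comes back unchanged through `si_addr`, which is what the usbsig.c handlers above compare against.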
  13. 15 May, 2019 1 commit
  14. 30 Mar, 2019 1 commit
  15. 04 Feb, 2019 2 commits
    • sched/core: Convert signal_struct.sigcnt to refcount_t · 60d4de3f
      Elena Reshetova committed
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to a use-after-free situation and be exploitable.
      
      The variable signal_struct.sigcnt is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
      in state to be merged to the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the signal_struct.sigcnt it might make a difference
      in the following places:
      
       - put_signal_struct(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-3-git-send-email-elena.reshetova@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Convert sighand_struct.count to refcount_t · d036bda7
      Elena Reshetova committed
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to a use-after-free situation and be exploitable.
      
      The variable sighand_struct.count is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
      in state to be merged to the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the sighand_struct.count it might make a difference
      in the following places:
      
       - __cleanup_sighand: decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-2-git-send-email-elena.reshetova@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
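The counter properties listed in the two refcount_t commits can be sketched with C11 atomics. This is a userspace model, not the kernel's refcount_t implementation. `struct model_refcount`, `MODEL_REFCOUNT_SATURATED`, and the `model_refcount_*` functions are hypothetical names. Because C11 has no notion of control-dependency ordering, the sketch uses an acquire fence on the final decrement instead.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model type; a saturated counter is never freed. */
struct model_refcount { atomic_uint refs; };
#define MODEL_REFCOUNT_SATURATED UINT_MAX

static void model_refcount_set(struct model_refcount *r, unsigned int n)
{
	atomic_store(&r->refs, n);
}

/* Take a reference unless the object is already dead; saturate, never wrap. */
static bool model_refcount_inc_not_zero(struct model_refcount *r)
{
	unsigned int old = atomic_load_explicit(&r->refs, memory_order_relaxed);

	do {
		if (old == 0)
			return false;   /* no increments after reaching zero */
		if (old == MODEL_REFCOUNT_SATURATED)
			return true;    /* stay pinned at the saturation value */
	} while (!atomic_compare_exchange_weak_explicit(&r->refs, &old, old + 1,
			memory_order_relaxed, memory_order_relaxed));
	return true;
}

/* Drop a reference; returns true when the caller should free the object. */
static bool model_refcount_dec_and_test(struct model_refcount *r)
{
	unsigned int old = atomic_load_explicit(&r->refs, memory_order_relaxed);

	do {
		if (old == MODEL_REFCOUNT_SATURATED)
			return false;   /* leak rather than risk a use-after-free */
		assert(old != 0);       /* an underflow would be a caller bug */
	} while (!atomic_compare_exchange_weak_explicit(&r->refs, &old, old - 1,
			memory_order_release, memory_order_relaxed));

	if (old == 1) {
		atomic_thread_fence(memory_order_acquire);
		return true;
	}
	return false;
}
```

The release ordering on the decrement corresponds to the RELEASE guarantee the commit messages call out for put_signal_struct() and __cleanup_sighand.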
  16. 03 Oct, 2018 1 commit
    • signal: Distinguish between kernel_siginfo and siginfo · ae7795bc
      Eric W. Biederman committed
      Linus recently observed that if we did not worry about the padding
      member in struct siginfo it is only about 48 bytes, and 48 bytes is
      much nicer than 128 bytes for allocating on the stack and copying
      around in the kernel.
      
      The obvious thing of only adding the padding when userspace is
      including siginfo.h won't work as there are sigframe definitions in
      the kernel that embed struct siginfo.
      
      So split siginfo in two; kernel_siginfo and siginfo.  Keeping the
      traditional name for the userspace definition.  While the version that
      is used internally to the kernel and ultimately will not be padded to
      128 bytes is called kernel_siginfo.
      
      The definition of struct kernel_siginfo I have put in include/signal_types.h
      
      A set of buildtime checks has been added to verify the two structures have
      the same field offsets.
      
      To make it easy to verify the change kernel_siginfo retains the same
      size as siginfo.  The reduction in size comes in a following change.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
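The siginfo split can be sketched as two structures that share one field list, with only the userspace-facing one padded. This is a simplified model under assumptions. `MODEL_SIGINFO_FIELDS`, the `model_` type names, and the reduced field set are hypothetical. It also shows the unpadded end state, which the commit message says arrives in a following change.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_SI_MAX_SIZE 128

/* Hypothetical, reduced field list shared by both layouts. */
#define MODEL_SIGINFO_FIELDS \
	int si_signo;        \
	int si_errno;        \
	int si_code;         \
	void *si_ptr_payload;

/* In-kernel form: fields only, no padding to 128 bytes. */
struct model_kernel_siginfo { MODEL_SIGINFO_FIELDS };

/* Userspace form: same fields, padded out to the ABI size. */
struct model_siginfo {
	union {
		struct { MODEL_SIGINFO_FIELDS };
		char si_pad[MODEL_SI_MAX_SIZE];
	};
};

/* Build-time checks that the two layouts agree on field offsets. */
static_assert(offsetof(struct model_siginfo, si_signo) ==
	      offsetof(struct model_kernel_siginfo, si_signo), "si_signo offset");
static_assert(offsetof(struct model_siginfo, si_code) ==
	      offsetof(struct model_kernel_siginfo, si_code), "si_code offset");
static_assert(offsetof(struct model_siginfo, si_ptr_payload) ==
	      offsetof(struct model_kernel_siginfo, si_ptr_payload), "payload offset");
static_assert(sizeof(struct model_siginfo) == MODEL_SI_MAX_SIZE, "ABI size");
static_assert(sizeof(struct model_kernel_siginfo) < MODEL_SI_MAX_SIZE, "smaller in kernel");

/* Copy the compact form into the padded form, zeroing the padding first. */
static void model_copy_to_user_layout(struct model_siginfo *to,
				      const struct model_kernel_siginfo *from)
{
	memset(to, 0, sizeof(*to));
	memcpy(to, from, sizeof(*from));
}
```

Zeroing before the copy keeps stale stack bytes out of the padding that userspace would see.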
  17. 12 Sep, 2018 2 commits
  18. 23 Aug, 2018 1 commit
  19. 10 Aug, 2018 1 commit
    • signal: Don't restart fork when signals come in. · c3ad2c3b
      Eric W. Biederman committed
      Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
      report that a periodic signal received during fork can cause fork to
      continually restart preventing an application from making progress.
      
      The code was being overly pessimistic.  Fork needs to guarantee that a
      signal sent to multiple processes is logically delivered before the
      fork and just to the forking process or logically delivered after the
      fork to both the forking process and its newly spawned child.  For
      signals like periodic timers that are always delivered to a single
      process fork can safely complete and let them appear to be logically
      delivered after the fork().
      
      While examining this issue I also discovered that fork today will miss
      signals delivered to multiple processes during the fork and handled by
      another thread.  Similarly the current code will also miss blocked
      signals that are delivered to multiple processes, as those signals will
      not appear pending during fork.
      
      Add a list of each thread that is currently forking, and keep on that
      list a signal set that records all of the signals sent to multiple
      processes.  When fork completes initialize the new process's
      shared_pending signal set with it.  The calculate_sigpending function
      will see those signals and set TIF_SIGPENDING causing the new task to
      take the slow path to userspace to handle those signals.  Making it
      appear as if those signals were received immediately after the fork.
      
      It is not possible to send real time signals to multiple processes and
      exceptions don't go to multiple processes, which means that there are
      no signals sent to multiple processes that require siginfo.  This
      means it is safe to not bother collecting siginfo on signals sent
      during fork.
      
      The sigaction of a child of fork is initially the same as the
      sigaction of the parent process.  So a signal the parent ignores the
      child will also initially ignore.  Therefore it is safe to ignore
      signals sent to multiple processes and ignored by the forking process.
      
      Signals sent to only a single process or only a single thread and delivered
      during fork are treated as if they are received after the fork, and generally
      not dealt with.  They won't cause any problems.
      
      V2: Added removal from the multiprocess list on failure.
      V3: Use -ERESTARTNOINTR directly
      V4: - Don't queue both SIGCONT and SIGSTOP
          - Initialize signal_struct.multiprocess in init_task
          - Move setting of shared_pending to before the new task
            is visible to signals.  This prevents signals from coming
            in before shared_pending.signal is set to delayed.signal
            and being lost.
      V5: - rework list add and delete to account for idle threads
      V6: - Use sigdelsetmask when removing stop signals
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
      Reported-by: Wen Yang <wen.yang99@zte.com.cn>
      Reported-by: majiang <ma.jiang@zte.com.cn>
      Fixes: 4a2c7a78 ("[PATCH] make fork() atomic wrt pgrp/session signals")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
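The bookkeeping described in this commit message can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct forker`, `note_signal` and `finish_fork` are hypothetical names, not kernel symbols, and the model ignores locking and the SIGCONT/SIGSTOP handling mentioned in the V4 notes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are the kernel's. */
typedef uint64_t sigmask_t;

struct forker {
    bool forking;       /* true while the modelled fork is in progress */
    sigmask_t ignored;  /* signals the forking process ignores */
    sigmask_t delayed;  /* multiprocess signals recorded during the fork */
};

/* Map a signal number (1..64) to its bit in a signal set. */
sigmask_t sig_bit(int sig)
{
    assert(sig >= 1 && sig <= 64);
    return (sigmask_t)1 << (sig - 1);
}

/* Called for each arriving signal; records only what the child must see. */
void note_signal(struct forker *f, int sig, bool multiprocess)
{
    if (!f->forking || !multiprocess)
        return; /* single-target signals simply land "after the fork" */
    if (f->ignored & sig_bit(sig))
        return; /* the child would initially ignore it as well */
    f->delayed |= sig_bit(sig);
}

/* At fork completion, hand the recorded set to the child as its
 * initial shared pending set and reset the record. */
sigmask_t finish_fork(struct forker *f)
{
    sigmask_t child_shared_pending = f->delayed;

    f->delayed = 0;
    f->forking = false;
    return child_shared_pending;
}
```

In this model a periodic timer signal (delivered to one process) never touches `delayed`, so it cannot force the fork to restart, while a signal sent to a whole process group is recorded and shows up in the child's initial pending set.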
  20. 04 Aug, 2018 2 commits
    • E
      fork: Have new threads join on-going signal group stops · 924de3b8
      Eric W. Biederman committed
      There are only two signals that are delivered to every member of a
      signal group: SIGSTOP and SIGKILL.  Signal delivery requires every
      signal appear to be delivered either before or after a clone syscall.
      SIGKILL terminates the clone so does not need to be considered.  Which
      leaves only SIGSTOP that needs to be considered when creating new
      threads.
      
      Today in the event of a group stop TIF_SIGPENDING will get set and the
      fork will restart ensuring the fork syscall participates in the group
      stop.
      
      A fork (especially of a process with a lot of memory) is one of the
      most expensive system calls, so we really only want to restart a fork
      when necessary.
      
      It is easy to check whether a SIGSTOP is ongoing and have the new
      thread join it immediately after the clone completes, making it appear
      that the clone completed just before the SIGSTOP.
      
      The calculate_sigpending function will see the bits set in jobctl and
      set TIF_SIGPENDING to ensure the new task takes the slow path to userspace.
      
      V2: The call to task_join_group_stop was moved before the new task is
          added to the thread group list.  This should not matter as
          sighand->siglock is held over both the addition of the threads,
          the call to task_join_group_stop and do_signal_stop.  But the change
          is trivial and it is one less thing to worry about when reading
          the code.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      signal: Add calculate_sigpending() · 088fe47c
      Eric W. Biederman committed
      Add a function calculate_sigpending to test to see if any signals are
      pending for a new task immediately following fork.  Signals have to
      happen either before or after fork.  Today our practice is to push
      all of the signals to before the fork, but that has the downside that
      frequent or periodic signals can make fork take much much longer than
      normal or prevent fork from completing entirely.
      
      So we need to move the signals that we can to after the fork to prevent that.
      
      This updates the code to set TIF_SIGPENDING on a new task if there
      are signals or other activities that have moved so that they appear
      to happen after the fork.
      
      As the code today restarts if it sees any such activity this won't
      immediately have an effect, as there will be no reason for it
      to set TIF_SIGPENDING immediately after the fork.
      
      Adding calculate_sigpending means the code in fork can safely be
      changed to not always restart if a signal is pending.
      
      The new calculate_sigpending function sets sigpending if there
      are pending bits in jobctl, pending signals, the freezer needs
      to freeze the new task or the live kernel patching framework
      needs the new thread to take the slow path to userspace.
      
      I have verified that setting TIF_SIGPENDING does make a new process
      take the slow path to userspace before it executes its first userspace
      instruction.
      
      I have looked at the callers of signal_wake_up and the code paths
      setting TIF_SIGPENDING and I don't see anything else that needs to be
      handled.  The code probably doesn't need to set TIF_SIGPENDING for the
      kernel live patching as it uses a separate thread flag as well.  But
      at this point it seems safer to reuse the recalc_sigpending logic and get
      the kernel live patching folks to sort out their story later.
      
      V2: I have moved the test into schedule_tail where siglock can
          be grabbed and recalc_sigpending can be reused directly.
          Further as the last action of setting up a new task this
          guarantees that TIF_SIGPENDING will be properly set in the
          new process.
      
          The helper calculate_sigpending takes the siglock and
          unconditionally sets TIF_SIGPENDING and lets recalc_sigpending
          clear TIF_SIGPENDING if it is unnecessary.  This allows reusing
          the existing code and keeps maintenance of the conditions simple.
      
          Oleg Nesterov <oleg@redhat.com>  suggested the movement
          and pointed out the need to take siglock if this code
          was going to be called while the new task is discoverable.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
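The "set unconditionally, then let the recalculation clear it" pattern from the V2 notes can be sketched in userspace as follows. This is a simplified model, not the kernel code: `struct task_model` and both functions are hypothetical, the siglock is omitted, and the four conditions are taken from the list in the commit message above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the state the commit message lists. */
struct task_model {
    bool tif_sigpending;     /* stands in for TIF_SIGPENDING */
    uint64_t pending;        /* pending signal bits */
    uint64_t blocked;        /* blocked signal bits */
    bool jobctl_pending;     /* pending bits in jobctl */
    bool freezing;           /* freezer wants this task */
    bool livepatch_pending;  /* live patching wants the slow path */
};

/* Model of the recalculation: keep the flag only if something
 * actually needs the slow path to userspace, otherwise clear it. */
void recalc_sigpending_model(struct task_model *t)
{
    bool needed = t->jobctl_pending ||
                  (t->pending & ~t->blocked) != 0 ||
                  t->freezing ||
                  t->livepatch_pending;

    t->tif_sigpending = needed;
}

/* Model of the helper: set the flag unconditionally, then let the
 * recalculation clear it when it turns out to be unnecessary. */
void calculate_sigpending_model(struct task_model *t)
{
    t->tif_sigpending = true;
    recalc_sigpending_model(t);
}
```

Keeping every condition inside the recalculation means the helper never has to duplicate them, which is the maintenance benefit the commit message points to.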
  21. 21 Jul, 2018 5 commits
    • E
      signal: Pass pid and pid type into send_sigqueue · 24122c7f
      Eric W. Biederman committed
      Make the code more maintainable by performing more of the signal
      related work in send_sigqueue.
      
      A quick inspection of do_timer_create will show that this code path
      does not look up a thread group by a thread's pid, making it safe
      to find the task pointed to by it_pid with "pid_task(it_pid, type)".
      
      This supports the changes needed in fork to tell if a signal was sent
      to a single process or a group of processes.
      
      Having the pid to task transition in signal.c will also make it easier
      to sort out races with de_thread and the thread group leader
      exiting when it comes time to address that.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      pid: Implement PIDTYPE_TGID · 6883f81a
      Eric W. Biederman committed
      Everywhere except in the pid array we distinguish between a task's pid and
      a task's tgid (thread group id).  Even in the enumeration we want that
      distinction sometimes so we have added __PIDTYPE_TGID.  With leader_pid
      we almost have an implementation of PIDTYPE_TGID in struct signal_struct.
      
      Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
      into the pids array.  Then remove the __PIDTYPE_TGID special case and the
      leader_pid in signal_struct.
      
      The net size increase is just an extra pointer added to struct pid and
      an extra pair of pointers of an hlist_node added to task_struct.
      
      The effect on code maintenance is the removal of a number of special
      cases today and the potential to remove many more special cases as
      PIDTYPE_TGID gets used to its fullest.  The long term potential
      is allowing zombie thread group leaders to exit, which will remove
      a lot more special cases in the code.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
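Once the thread group id is a pid type in its own right, a sender can state whether a signal targets a thread, a process, a process group or a session. The sketch below is a minimal illustration of that idea. The enum is a local copy whose member order is an assumption made for this example, and `targets_multiple_processes` is a hypothetical helper, not a kernel function.

```c
#include <stdbool.h>

/* Local illustrative enum; the member order is an assumption. */
enum pid_type_model {
    MODEL_PIDTYPE_PID,   /* a single thread */
    MODEL_PIDTYPE_TGID,  /* a thread group, i.e. one process */
    MODEL_PIDTYPE_PGID,  /* a process group */
    MODEL_PIDTYPE_SID,   /* a session */
    MODEL_PIDTYPE_MAX
};

/* Hypothetical helper: a signal addressed by process group or
 * session may reach several processes; the other types reach one. */
bool targets_multiple_processes(enum pid_type_model type)
{
    return type > MODEL_PIDTYPE_TGID && type < MODEL_PIDTYPE_MAX;
}
```

A classification like this is what fork would need in order to tell a signal sent to a single process from one sent to a group of processes.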
    • E
      pids: Move the pgrp and session pid pointers from task_struct to signal_struct · 2c470475
      Eric W. Biederman committed
      To access these fields the code always has to go to group leader so
      going to signal struct is no loss and is actually a fundamental simplification.
      
      This saves a little bit of memory by only allocating the pid pointer array
      once instead of once for every thread, and even better this removes a
      few potential races caused by the fact that group_leader can be changed
      by de_thread, while signal_struct can not.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      pids: Compute task_tgid using signal->leader_pid · 7a36094d
      Eric W. Biederman committed
      The cost is the same and this removes the need
      to worry about complications that come from de_thread
      and group_leader changing.
      
      __task_pid_nr_ns has been updated to take advantage of this change.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      pids: Move task_pid_type into sched/signal.h · 1fb53567
      Eric W. Biederman committed
      The function is general and inline so there is no need
      to hide it inside of exit.c
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  22. 04 May, 2018 1 commit
    • P
      sched/core: Introduce set_special_state() · b5bf9a90
      Peter Zijlstra committed
      Gaurav reported a perceived problem with TASK_PARKED, which turned out
      to be a broken wait-loop pattern in __kthread_parkme(), but the
      reported issue can (and does) in fact happen for states that do not do
      condition based sleeps.
      
      When the 'current->state = TASK_RUNNING' store of a previous
      (concurrent) try_to_wake_up() collides with the setting of a 'special'
      sleep state, we can lose the sleep state.
      
      Normal condition based wait-loops are immune to this problem, but
      sleep states that are not condition based are subject to it.
      
      There already is a fix for TASK_DEAD. Abstract that and also apply it
      to TASK_STOPPED and TASK_TRACED, both of which are also without
      condition based wait-loop.
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  23. 07 Mar, 2018 1 commit
  24. 23 Jan, 2018 1 commit
    • E
      signal/ptrace: Add force_sig_ptrace_errno_trap and use it where needed · f71dd7dc
      Eric W. Biederman committed
      There are so many places that build struct siginfo by hand that at
      least one of them is bound to get it wrong.  A handful of cases in the
      kernel arguably did just that when using the errno field of siginfo to
      pass non-errno values to userspace.  The usage is limited to a single
      si_code so at least does not mess up anything else.
      
      Encapsulate this questionable pattern in a helper function so
      that the userspace ABI is preserved.
      
      Update all of the places that use this pattern to use the new helper
      function.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>