1. 24 1月, 2015 1 次提交
    • N
      ktime: Optimize ktime_divns for constant divisors · 8b618628
      Nicolas Pitre 提交于
      At least on ARM, do_div() is optimized to turn constant divisors into
      an inline multiplication by the reciprocal value at compile time.
      However this optimization is missed entirely whenever ktime_divns() is
      used and the slow out-of-line division code is used all the time.
      
      Let ktime_divns() use do_div() inline whenever the divisor is constant
      and small enough.  This will make things like ktime_to_us() and
      ktime_to_ms() much faster.
      
      Cc: Arnd Bergmann <arnd.bergmann@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Nicolas Pitre <nico@linaro.org>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NNicolas Pitre <nico@linaro.org>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      8b618628
  2. 09 1月, 2015 7 次提交
  3. 30 12月, 2014 1 次提交
    • P
      audit: create private file name copies when auditing inodes · fcf22d82
      Paul Moore 提交于
      Unfortunately, while commit 4a928436 ("audit: correctly record file
      names with different path name types") fixed a problem where we were
      not recording filenames, it created a new problem by attempting to use
      these file names after they had been freed.  This patch resolves the
      issue by creating a copy of the filename which the audit subsystem
      frees after it is done with the string.
      
      At some point it would be nice to resolve this issue with refcounts,
      or something similar, instead of having to allocate/copy strings, but
      that is almost surely beyond the scope of a -rcX patch so we'll defer
      that for later.  On the plus side, only audit users should be impacted
      by the string copying.
      Reported-by: NToralf Foerster <toralf.foerster@gmx.de>
      Signed-off-by: NPaul Moore <pmoore@redhat.com>
      fcf22d82
  4. 27 12月, 2014 1 次提交
    • J
      netlink/genetlink: pass network namespace to bind/unbind · 023e2cfa
      Johannes Berg 提交于
      Netlink families can exist in multiple namespaces, and for the most
      part multicast subscriptions are per network namespace. Thus it only
      makes sense to have bind/unbind notifications per network namespace.
      
      To achieve this, pass the network namespace of a given client socket
      to the bind/unbind functions.
      
      Also do this in generic netlink, and there also make sure that any
      bind for multicast groups that only exist in init_net is rejected.
      This isn't really a problem if it is accepted since a client in a
      different namespace will never receive any notifications from such
      a group, but it can confuse the family if not rejected (it's also
      possible to silently (without telling the family) accept it, but it
      would also have to be ignored on unbind so families that take any
      kind of action on bind/unbind won't do unnecessary work for invalid
      clients like that.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      023e2cfa
  5. 24 12月, 2014 1 次提交
    • R
      audit: restore AUDIT_LOGINUID unset ABI · 041d7b98
      Richard Guy Briggs 提交于
      A regression was caused by commit 780a7654:
      	 audit: Make testing for a valid loginuid explicit.
      (which in turn attempted to fix a regression caused by e1760bd5)
      
      When audit_krule_to_data() fills in the rules to get a listing, there was a
      missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID.
      
      This broke userspace by not returning the same information that was sent and
      expected.
      
      The rule:
      	auditctl -a exit,never -F auid=-1
      gives:
      	auditctl -l
      		LIST_RULES: exit,never f24=0 syscall=all
      when it should give:
      		LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=all
      
      Tag it so that it is reported the same way it was set.  Create a new
      private flags audit_krule field (pflags) to store it that won't interact with
      the public one from the API.
      
      Cc: stable@vger.kernel.org # v3.10-rc1+
      Signed-off-by: NRichard Guy Briggs <rgb@redhat.com>
      Signed-off-by: NPaul Moore <pmoore@redhat.com>
      041d7b98
  6. 23 12月, 2014 2 次提交
    • A
      sched: Fix KMALLOC_MAX_SIZE overflow during cpumask allocation · b74e6278
      Alex Thorlton 提交于
      When allocating space for load_balance_mask, in sched_init, when
      CPUMASK_OFFSTACK is set, we've managed to spill over
      KMALLOC_MAX_SIZE on our 6144 core machine.  The patch below
      breaks up the allocations so that they don't overflow the max
      alloc size.  It also allocates the masks on the the node from
      which they'll most commonly be accessed, to minimize remote
      accesses on NUMA machines.
      Suggested-by: NGeorge Beshers <gbeshers@sgi.com>
      Signed-off-by: NAlex Thorlton <athorlton@sgi.com>
      Cc: George Beshers <gbeshers@sgi.com>
      Cc: Russ Anderson <rja@sgi.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1418928270-148543-1-git-send-email-athorlton@sgi.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b74e6278
    • P
      audit: correctly record file names with different path name types · 4a928436
      Paul Moore 提交于
      There is a problem with the audit system when multiple audit records
      are created for the same path, each with a different path name type.
      The root cause of the problem is in __audit_inode() when an exact
      match (both the path name and path name type) is not found for a
      path name record; the existing code creates a new path name record,
      but it never sets the path name in this record, leaving it NULL.
      This patch corrects this problem by assigning the path name to these
      newly created records.
      
      There are many ways to reproduce this problem, but one of the
      easiest is the following (assuming auditd is running):
      
        # mkdir /root/tmp/test
        # touch /root/tmp/test/567
        # auditctl -a always,exit -F dir=/root/tmp/test
        # touch /root/tmp/test/567
      
      Afterwards, or while the commands above are running, check the audit
      log and pay special attention to the PATH records.  A faulty kernel
      will display something like the following for the file creation:
      
        type=SYSCALL msg=audit(1416957442.025:93): arch=c000003e syscall=2
          success=yes exit=3 ... comm="touch" exe="/usr/bin/touch"
        type=CWD msg=audit(1416957442.025:93):  cwd="/root/tmp"
        type=PATH msg=audit(1416957442.025:93): item=0 name="test/"
          inode=401409 ... nametype=PARENT
        type=PATH msg=audit(1416957442.025:93): item=1 name=(null)
          inode=393804 ... nametype=NORMAL
        type=PATH msg=audit(1416957442.025:93): item=2 name=(null)
          inode=393804 ... nametype=NORMAL
      
      While a patched kernel will show the following:
      
        type=SYSCALL msg=audit(1416955786.566:89): arch=c000003e syscall=2
          success=yes exit=3 ... comm="touch" exe="/usr/bin/touch"
        type=CWD msg=audit(1416955786.566:89):  cwd="/root/tmp"
        type=PATH msg=audit(1416955786.566:89): item=0 name="test/"
          inode=401409 ... nametype=PARENT
        type=PATH msg=audit(1416955786.566:89): item=1 name="test/567"
          inode=393804 ... nametype=NORMAL
      
      This issue was brought up by a number of people, but special credit
      should go to hujianyang@huawei.com for reporting the problem along
      with an explanation of the problem and a patch.  While the original
      patch did have some problems (see the archive link below), it did
      demonstrate the problem and helped kickstart the fix presented here.
      
        * https://lkml.org/lkml/2014/9/5/66Reported-by: Nhujianyang <hujianyang@huawei.com>
      Signed-off-by: NPaul Moore <pmoore@redhat.com>
      Acked-by: NRichard Guy Briggs <rgb@redhat.com>
      4a928436
  7. 20 12月, 2014 3 次提交
    • R
      audit: use supplied gfp_mask from audit_buffer in kauditd_send_multicast_skb · 54dc77d9
      Richard Guy Briggs 提交于
      Eric Paris explains: Since kauditd_send_multicast_skb() gets called in
      audit_log_end(), which can come from any context (aka even a sleeping context)
      GFP_KERNEL can't be used.  Since the audit_buffer knows what context it should
      use, pass that down and use that.
      
      See: https://lkml.org/lkml/2014/12/16/542
      
      BUG: sleeping function called from invalid context at mm/slab.c:2849
      in_atomic(): 1, irqs_disabled(): 0, pid: 885, name: sulogin
      2 locks held by sulogin/885:
        #0:  (&sig->cred_guard_mutex){+.+.+.}, at: [<ffffffff91152e30>] prepare_bprm_creds+0x28/0x8b
        #1:  (tty_files_lock){+.+.+.}, at: [<ffffffff9123e787>] selinux_bprm_committing_creds+0x55/0x22b
      CPU: 1 PID: 885 Comm: sulogin Not tainted 3.18.0-next-20141216 #30
      Hardware name: Dell Inc. Latitude E6530/07Y85M, BIOS A15 06/20/2014
        ffff880223744f10 ffff88022410f9b8 ffffffff916ba529 0000000000000375
        ffff880223744f10 ffff88022410f9e8 ffffffff91063185 0000000000000006
        0000000000000000 0000000000000000 0000000000000000 ffff88022410fa38
      Call Trace:
        [<ffffffff916ba529>] dump_stack+0x50/0xa8
        [<ffffffff91063185>] ___might_sleep+0x1b6/0x1be
        [<ffffffff910632a6>] __might_sleep+0x119/0x128
        [<ffffffff91140720>] cache_alloc_debugcheck_before.isra.45+0x1d/0x1f
        [<ffffffff91141d81>] kmem_cache_alloc+0x43/0x1c9
        [<ffffffff914e148d>] __alloc_skb+0x42/0x1a3
        [<ffffffff914e2b62>] skb_copy+0x3e/0xa3
        [<ffffffff910c263e>] audit_log_end+0x83/0x100
        [<ffffffff9123b8d3>] ? avc_audit_pre_callback+0x103/0x103
        [<ffffffff91252a73>] common_lsm_audit+0x441/0x450
        [<ffffffff9123c163>] slow_avc_audit+0x63/0x67
        [<ffffffff9123c42c>] avc_has_perm+0xca/0xe3
        [<ffffffff9123dc2d>] inode_has_perm+0x5a/0x65
        [<ffffffff9123e7ca>] selinux_bprm_committing_creds+0x98/0x22b
        [<ffffffff91239e64>] security_bprm_committing_creds+0xe/0x10
        [<ffffffff911515e6>] install_exec_creds+0xe/0x79
        [<ffffffff911974cf>] load_elf_binary+0xe36/0x10d7
        [<ffffffff9115198e>] search_binary_handler+0x81/0x18c
        [<ffffffff91153376>] do_execveat_common.isra.31+0x4e3/0x7b7
        [<ffffffff91153669>] do_execve+0x1f/0x21
        [<ffffffff91153967>] SyS_execve+0x25/0x29
        [<ffffffff916c61a9>] stub_execve+0x69/0xa0
      
      Cc: stable@vger.kernel.org #v3.16-rc1
      Reported-by: NValdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: NRichard Guy Briggs <rgb@redhat.com>
      Tested-by: NValdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: NPaul Moore <pmoore@redhat.com>
      54dc77d9
    • P
      audit: don't attempt to lookup PIDs when changing PID filtering audit rules · 3640dcfa
      Paul Moore 提交于
      Commit f1dc4867 ("audit: anchor all pid references in the initial pid
      namespace") introduced a find_vpid() call when adding/removing audit
      rules with PID/PPID filters; unfortunately this is problematic as
      find_vpid() only works if there is a task with the associated PID
      alive on the system.  The following commands demonstrate a simple
      reproducer.
      
      	# auditctl -D
      	# auditctl -l
      	# autrace /bin/true
      	# auditctl -l
      
      This patch resolves the problem by simply using the PID provided by
      the user without any additional validation, e.g. no calls to check to
      see if the task/PID exists.
      
      Cc: stable@vger.kernel.org # 3.15
      Cc: Richard Guy Briggs <rgb@redhat.com>
      Signed-off-by: NPaul Moore <pmoore@redhat.com>
      Acked-by: NEric Paris <eparis@redhat.com>
      Reviewed-by: NRichard Guy Briggs <rgb@redhat.com>
      3640dcfa
    • R
      PM: Eliminate CONFIG_PM_RUNTIME · 464ed18e
      Rafael J. Wysocki 提交于
      Having switched over all of the users of CONFIG_PM_RUNTIME to use
      CONFIG_PM directly, turn the latter into a user-selectable option
      and drop the former entirely from the tree.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: NUlf Hansson <ulf.hansson@linaro.org>
      Acked-by: NKevin Hilman <khilman@linaro.org>
      464ed18e
  8. 19 12月, 2014 1 次提交
    • T
      tick/powerclamp: Remove tick_nohz_idle abuse · a5fd9733
      Thomas Gleixner 提交于
      commit 4dbd2771 "tick: export nohz tick idle symbols for module
      use" was merged via the thermal tree without an explicit ack from the
      relevant maintainers.
      
      The exports are abused by the intel powerclamp driver which implements
      a fake idle state from a sched FIFO task. This causes all kinds of
      wreckage in the NOHZ core code which rightfully assumes that
      tick_nohz_idle_enter/exit() are only called from the idle task itself.
      
      Recent changes in the NOHZ core lead to a failure of the powerclamp
      driver and now people try to hack completely broken and backwards
      workarounds into the NOHZ core code. This is completely unacceptable
      and just papers over the real problem. There are way more subtle
      issues lurking around the corner.
      
      The real solution is to fix the powerclamp driver by rewriting it with
      a sane concept, but that's beyond the scope of this.
      
      So the only solution for now is to remove the calls into the core NOHZ
      code from the powerclamp trainwreck along with the exports. 
      
      Fixes: d6d71ee4 "PM: Introduce Intel PowerClamp Driver"
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Pan Jacob jun <jacob.jun.pan@intel.com>
      Cc: LKP <lkp@01.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412181110110.17382@nanosSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      a5fd9733
  9. 18 12月, 2014 1 次提交
    • K
      param: do not set store func without write perm · b0a65b0c
      Kees Cook 提交于
      When a module_param is defined without DAC write permissions, it can
      still be changed at runtime and updated. Drivers using a 0444 permission
      may be surprised that these values can still be changed.
      
      For drivers that want to allow updates, any S_IW* flag will set the
      "store" function as before. Drivers without S_IW* flags will have the
      "store" function unset, unforcing a read-only value. Drivers that wish
      neither "store" nor "get" can continue to use "0" for perms to stay out
      of sysfs entirely.
      
      Old behavior:
        # cd /sys/module/snd/parameters
        # ls -l
        total 0
        -r--r--r-- 1 root root 4096 Dec 11 13:55 cards_limit
        -r--r--r-- 1 root root 4096 Dec 11 13:55 major
        -r--r--r-- 1 root root 4096 Dec 11 13:55 slots
        # cat major
        116
        # echo -1 > major
        -bash: major: Permission denied
        # chmod u+w major
        # echo -1 > major
        # cat major
        -1
      
      New behavior:
        ...
        # chmod u+w major
        # echo -1 > major
        -bash: echo: write error: Input/output error
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      b0a65b0c
  10. 15 12月, 2014 2 次提交
  11. 14 12月, 2014 8 次提交
  12. 13 12月, 2014 2 次提交
    • T
      genirq: Prevent proc race against freeing of irq descriptors · c291ee62
      Thomas Gleixner 提交于
      Since the rework of the sparse interrupt code to actually free the
      unused interrupt descriptors there exists a race between the /proc
      interfaces to the irq subsystem and the code which frees the interrupt
      descriptor.
      
      CPU0				CPU1
      				show_interrupts()
      				  desc = irq_to_desc(X);
      free_desc(desc)
        remove_from_radix_tree();
        kfree(desc);
      				  raw_spinlock_irq(&desc->lock);
      
      /proc/interrupts is the only interface which can actively corrupt
      kernel memory via the lock access. /proc/stat can only read from freed
      memory. Extremly hard to trigger, but possible.
      
      The interfaces in /proc/irq/N/ are not affected by this because the
      removal of the proc file is serialized in procfs against concurrent
      readers/writers. The removal happens before the descriptor is freed.
      
      For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue
      as the descriptor is never freed. It's merely cleared out with the irq
      descriptor lock held. So any concurrent proc access will either see
      the old correct value or the cleared out ones.
      
      Protect the lookup and access to the irq descriptor in
      show_interrupts() with the sparse_irq_lock.
      
      Provide kstat_irqs_usr() which is protecting the lookup and access
      with sparse_irq_lock and switch /proc/stat to use it.
      
      Document the existing kstat_irqs interfaces so it's clear that the
      caller needs to take care about protection. The users of these
      interfaces are either not affected due to SPARSE_IRQ=n or already
      protected against removal.
      
      Fixes: 1f5a5b87 "genirq: Implement a sane sparse_irq allocator"
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      c291ee62
    • R
      tracing / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM · 798bc6d8
      Rafael J. Wysocki 提交于
      After commit b2b49ccb (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is
      selected) PM_RUNTIME is always set if PM is set, so files that are
      build conditionally if CONFIG_PM_RUNTIME is set may now be build
      if CONFIG_PM is set.
      
      Replace CONFIG_PM_RUNTIME with CONFIG_PM in kernel/trace/Makefile
      for this reason.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org.
      798bc6d8
  13. 12 12月, 2014 3 次提交
    • E
      userns; Correct the comment in map_write · 36476bea
      Eric W. Biederman 提交于
      It is important that all maps are less than PAGE_SIZE
      or else setting the last byte of the buffer to '0'
      could write off the end of the allocated storage.
      
      Correct the misleading comment.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      36476bea
    • E
      userns: Allow setting gid_maps without privilege when setgroups is disabled · 66d2f338
      Eric W. Biederman 提交于
      Now that setgroups can be disabled and not reenabled, setting gid_map
      without privielge can now be enabled when setgroups is disabled.
      
      This restores most of the functionality that was lost when unprivileged
      setting of gid_map was removed.  Applications that use this functionality
      will need to check to see if they use setgroups or init_groups, and if they
      don't they can be fixed by simply disabling setgroups before writing to
      gid_map.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      66d2f338
    • E
      userns: Add a knob to disable setgroups on a per user namespace basis · 9cc46516
      Eric W. Biederman 提交于
      - Expose the knob to user space through a proc file /proc/<pid>/setgroups
      
        A value of "deny" means the setgroups system call is disabled in the
        current processes user namespace and can not be enabled in the
        future in this user namespace.
      
        A value of "allow" means the segtoups system call is enabled.
      
      - Descendant user namespaces inherit the value of setgroups from
        their parents.
      
      - A proc file is used (instead of a sysctl) as sysctls currently do
        not allow checking the permissions at open time.
      
      - Writing to the proc file is restricted to before the gid_map
        for the user namespace is set.
      
        This ensures that disabling setgroups at a user namespace
        level will never remove the ability to call setgroups
        from a process that already has that ability.
      
        A process may opt in to the setgroups disable for itself by
        creating, entering and configuring a user namespace or by calling
        setns on an existing user namespace with setgroups disabled.
        Processes without privileges already can not call setgroups so this
        is a noop.  Prodcess with privilege become processes without
        privilege when entering a user namespace and as with any other path
        to dropping privilege they would not have the ability to call
        setgroups.  So this remains within the bounds of what is possible
        without a knob to disable setgroups permanently in a user namespace.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      9cc46516
  14. 11 12月, 2014 7 次提交
    • S
      printk: Do not disable preemption for accessing printk_func · 1fb8915b
      Steven Rostedt (Red Hat) 提交于
      As printk_func will either be the default function, or a per_cpu function
      for the current CPU, there's no reason to disable preemption to access
      it from printk. That's because if the printk_func is not the default
      then the caller had better disabled preemption as they were the one to
      change it.
      
      Link: http://lkml.kernel.org/r/CA+55aFz5-_LKW4JHEBoWinN9_ouNcGRWAF2FUA35u46FRN-Kxw@mail.gmail.comSuggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      1fb8915b
    • J
      perf: Fix events installation during moving group · 9fc81d87
      Jiri Olsa 提交于
      We allow PMU driver to change the cpu on which the event
      should be installed to. This happened in patch:
      
        e2d37cd2 ("perf: Allow the PMU driver to choose the CPU on which to install events")
      
      This patch also forces all the group members to follow
      the currently opened events cpu if the group happened
      to be moved.
      
      This and the change of event->cpu in perf_install_in_context()
      function introduced in:
      
        0cda4c02 ("perf: Introduce perf_pmu_migrate_context()")
      
      forces group members to change their event->cpu,
      if the currently-opened-event's PMU changed the cpu
      and there is a group move.
      
      Above behaviour causes problem for breakpoint events,
      which uses event->cpu to touch cpu specific data for
      breakpoints accounting. By changing event->cpu, some
      breakpoints slots were wrongly accounted for given
      cpu.
      
      Vinces's perf fuzzer hit this issue and caused following
      WARN on my setup:
      
         WARNING: CPU: 0 PID: 20214 at arch/x86/kernel/hw_breakpoint.c:119 arch_install_hw_breakpoint+0x142/0x150()
         Can't find any breakpoint slot
         [...]
      
      This patch changes the group moving code to keep the event's
      original cpu.
      Reported-by: NVince Weaver <vince@deater.net>
      Signed-off-by: NJiri Olsa <jolsa@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vince@deater.net>
      Cc: Yan, Zheng <zheng.z.yan@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1418243031-20367-3-git-send-email-jolsa@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9fc81d87
    • O
      exit: pidns: fix/update the comments in zap_pid_ns_processes() · a53b8315
      Oleg Nesterov 提交于
      The comments in zap_pid_ns_processes() are not clear, we need to explain
      how this code actually works.
      
      1. "Ignore SIGCHLD" looks like optimization but it is not, we also
         need this for correctness.
      
      2. The comment above sys_wait4() could tell more.
      
         EXIT_ZOMBIE child is only possible if it has exited before we
         ignored SIGCHLD. Or if it is traced from the parent namespace,
         but in this case it will be reaped by debugger after detach,
         sys_wait4() acts as a synchronization point.
      
      3. The comment about TASK_DEAD (EXIT_DEAD in fact) children is
         outdated. Contrary to what it says we do not need to make sure
         they all go away after 0a01f2cc "pidns: Make the pidns proc
         mount/umount logic obvious".
      
         At the same time, we do need to wait for nr_hashed==init_pids,
         but the reasons are quite different and not obvious: setns().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a53b8315
    • O
      exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting · 24c037eb
      Oleg Nesterov 提交于
      alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it
      fails because disable_pid_allocation() was called by the exiting
      child_reaper.
      
      We could simply move get_pid_ns() down to successful return, but this fix
      tries to be as trivial as possible.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      24c037eb
    • O
      exit: exit_notify: re-use "dead" list to autoreap current · 6c66e7db
      Oleg Nesterov 提交于
      After the previous change we can add just the exiting EXIT_DEAD task to
      the "dead" list and remove another release_task(tsk).
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6c66e7db
    • O
      exit: reparent: call forget_original_parent() under tasklist_lock · 482a3767
      Oleg Nesterov 提交于
      Shift "release dead children" loop from forget_original_parent() to its
      caller, exit_notify().  It is safe to reap them even if our parent reaps
      us right after we drop tasklist_lock, those children no longer have any
      connection to the exiting task.
      
      And this allows us to avoid write_lock_irq(tasklist_lock) right after it
      was released by forget_original_parent(), we can simply call it with
      tasklist_lock held.
      
      While at it, move the comment about forget_original_parent() up to
      this function.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      482a3767
    • O
      exit: reparent: avoid find_new_reaper() if no children · ad9e206a
      Oleg Nesterov 提交于
      Now that pid_ns logic was isolated we can change forget_original_parent()
      to return right after find_child_reaper() when father->children is empty,
      there is nothing to reparent in this case.
      
      In particular this avoids find_alive_thread() and this can help if the
      whole process exits and it has a lot of PF_EXITING threads at the start of
      the thread list, this can easily lead to O(nr_threads ** 2) iterations.
      
      Trivial test case (tested under KVM, 2 CPUs):
      
          static void *tfunc(void *arg)
          {
              pause();
              return NULL;
          }
      
          static int child(unsigned int nt)
          {
              pthread_t pt;
      
              while (nt--)
                  assert(pthread_create(&pt, NULL, tfunc, NULL) == 0);
      
              pthread_kill(pt, SIGTRAP);
              pause();
              return 0;
          }
      
          int main(int argc, const char *argv[])
          {
              int stat;
              unsigned int nf = atoi(argv[1]);
              unsigned int nt = atoi(argv[2]);
      
              while (nf--) {
                  if (!fork())
                      return child(nt);
      
                  wait(&stat);
                  assert(stat == SIGTRAP);
              }
      
              return 0;
          }
      
      $ time ./test 16 16536 shows:
      
                    real        user         sys
          -    5m37.628s    0m4.437s    8m5.560s
          +    0m50.032s    0m7.130s    1m4.927s
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad9e206a