1. 10 Aug 2017, 1 commit
  2. 03 Aug 2017, 1 commit
    • cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() · 89affbf5
      By Dima Zavin
      In codepaths that use the begin/retry interface for reading
      mems_allowed_seq with irqs disabled, there exists a race condition that
      stalls the patch process after only modifying a subset of the
      static_branch call sites.
      
      This problem manifested itself as a deadlock in the slub allocator,
      inside get_any_partial.  The loop reads mems_allowed_seq value (via
      read_mems_allowed_begin), performs the defrag operation, and then
      verifies the consistency of mems_allowed via read_mems_allowed_retry
      and the cookie returned by xxx_begin.
      
      The issue here is that both begin and retry first check if cpusets are
      enabled via the cpusets_enabled() static branch.  This branch can be
      rewritten dynamically (via cpuset_inc) if a new cpuset is created.  The
      x86 jump label code fully synchronizes across all CPUs for every entry
      it rewrites.  If it rewrites only one of the callsites (specifically the
      one in read_mems_allowed_retry) and then waits for the
      smp_call_function(do_sync_core) to complete while a CPU is inside the
      begin/retry section with IRQs off and the mems_allowed value is changed,
      we can hang.
      
      This is because begin() will always return 0 (since it wasn't patched
      yet) while retry() will test the 0 against the actual value of the seq
      counter.
      
      The fix is to use two different static keys: one for begin
      (pre_enable_key) and one for retry (enable_key).  In cpuset_inc(), we
      first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
      always returns a valid seqcount if we are enabling cpusets.  Similarly, when
      disabling cpusets via cpuset_dec(), we first ensure that callers of
      cpuset_mems_allowed_retry() will start ignoring the seqcount value
      before we let cpuset_mems_allowed_begin() return 0.
      
      The relevant stack traces of the two stuck threads:
      
        CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
        RIP: smp_call_function_many+0x1f9/0x260
        Call Trace:
          smp_call_function+0x3b/0x70
          on_each_cpu+0x2f/0x90
          text_poke_bp+0x87/0xd0
          arch_jump_label_transform+0x93/0x100
          __jump_label_update+0x77/0x90
          jump_label_update+0xaa/0xc0
          static_key_slow_inc+0x9e/0xb0
          cpuset_css_online+0x70/0x2e0
          online_css+0x2c/0xa0
          cgroup_apply_control_enable+0x27f/0x3d0
          cgroup_mkdir+0x2b7/0x420
          kernfs_iop_mkdir+0x5a/0x80
          vfs_mkdir+0xf6/0x1a0
          SyS_mkdir+0xb7/0xe0
          entry_SYSCALL_64_fastpath+0x18/0xad
      
        ...
      
        CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8818087c0000 task.stack: ffffc90000030000
        RIP: int3+0x39/0x70
        Call Trace:
          <#DB> ? ___slab_alloc+0x28b/0x5a0
          <EOE> ? copy_process.part.40+0xf7/0x1de0
          __slab_alloc.isra.80+0x54/0x90
          copy_process.part.40+0xf7/0x1de0
          copy_process.part.40+0xf7/0x1de0
          kmem_cache_alloc_node+0x8a/0x280
          copy_process.part.40+0xf7/0x1de0
          _do_fork+0xe7/0x6c0
          _raw_spin_unlock_irq+0x2d/0x60
          trace_hardirqs_on_caller+0x136/0x1d0
          entry_SYSCALL_64_fastpath+0x5/0xad
          do_syscall_64+0x27/0x350
          SyS_clone+0x19/0x20
          do_syscall_64+0x60/0x350
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
      Fixes: 46e700ab ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
      Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
      Reported-by: Cliff Spradlin <cspradlin@waymo.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 23 Jul 2017, 1 commit
  4. 19 Jul 2017, 1 commit
    • cgroup: create dfl_root files on subsys registration · 7af608e4
      By Tejun Heo
      On subsystem registration, css_populate_dir() is not called on the new
      root css, so the interface files for the subsystem on cgrp_dfl_root
      aren't created on registration.  This is a residue from the days when
      cgrp_dfl_root was used only as the parking spot for unused subsystems,
      which is no longer true as it's used as the root for cgroup2.
      
      This is often fine as later operations tend to create them as a part
      of mount (cgroup1) or subtree_control operations (cgroup2); however,
      it's not difficult to mount cgroup2 with the controller interface
      files missing as Waiman found out.
      
      Fix it by invoking css_populate_dir() on the root css on subsys
      registration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Waiman Long <longman@redhat.com>
      Cc: stable@vger.kernel.org # v4.5+
      Signed-off-by: Tejun Heo <tj@kernel.org>
  5. 08 Jul 2017, 1 commit
    • cgroup: don't call migration methods if there are no tasks to migrate · 61046727
      By Tejun Heo
      Subsystem migration methods shouldn't be called for empty migrations.
      cgroup_migrate_execute() implements this guarantee by bailing early if
      there are no source css_sets.  This used to be correct before
      a79a908f ("cgroup: introduce cgroup namespaces"), but no longer
      since the commit because css_sets can stay pinned without tasks in
      them.
      
      This caused cgroup_migrate_execute() to call into cpuset migration
      methods with an empty cgroup_taskset.  cpuset migration methods
      correctly assume that cgroup_taskset_first() never returns NULL;
      however, due to the bug, it can, leading to the following oops.
      
        Unable to handle kernel paging request for data at address 0x00000960
        Faulting instruction address: 0xc0000000001d6868
        Oops: Kernel access of bad area, sig: 11 [#1]
        ...
        CPU: 14 PID: 16947 Comm: kworker/14:0 Tainted: G        W
        4.12.0-rc4-next-20170609 #2
        Workqueue: events cpuset_hotplug_workfn
        task: c00000000ca60580 task.stack: c00000000c728000
        NIP: c0000000001d6868 LR: c0000000001d6858 CTR: c0000000001d6810
        REGS: c00000000c72b720 TRAP: 0300   Tainted: GW (4.12.0-rc4-next-20170609)
        MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 44722422  XER: 20000000
        CFAR: c000000000008710 DAR: 0000000000000960 DSISR: 40000000 SOFTE: 1
        GPR00: c0000000001d6858 c00000000c72b9a0 c000000001536e00 0000000000000000
        GPR04: c00000000c72b9c0 0000000000000000 c00000000c72bad0 c000000766367678
        GPR08: c000000766366d10 c00000000c72b958 c000000001736e00 0000000000000000
        GPR12: c0000000001d6810 c00000000e749300 c000000000123ef8 c000000775af4180
        GPR16: 0000000000000000 0000000000000000 c00000075480e9c0 c00000075480e9e0
        GPR20: c00000075480e8c0 0000000000000001 0000000000000000 c00000000c72ba20
        GPR24: c00000000c72baa0 c00000000c72bac0 c000000001407248 c00000000c72ba20
        GPR28: c00000000141fc80 c00000000c72bac0 c00000000c6bc790 0000000000000000
        NIP [c0000000001d6868] cpuset_can_attach+0x58/0x1b0
        LR [c0000000001d6858] cpuset_can_attach+0x48/0x1b0
        Call Trace:
        [c00000000c72b9a0] [c0000000001d6858] cpuset_can_attach+0x48/0x1b0 (unreliable)
        [c00000000c72ba00] [c0000000001cbe80] cgroup_migrate_execute+0xb0/0x450
        [c00000000c72ba80] [c0000000001d3754] cgroup_transfer_tasks+0x1c4/0x360
        [c00000000c72bba0] [c0000000001d923c] cpuset_hotplug_workfn+0x86c/0xa20
        [c00000000c72bca0] [c00000000011aa44] process_one_work+0x1e4/0x580
        [c00000000c72bd30] [c00000000011ae78] worker_thread+0x98/0x5c0
        [c00000000c72bdc0] [c000000000124058] kthread+0x168/0x1b0
        [c00000000c72be30] [c00000000000b2e8] ret_from_kernel_thread+0x5c/0x74
        Instruction dump:
        f821ffa1 7c7d1b78 60000000 60000000 38810020 7fa3eb78 3f42ffed 4bff4c25
        60000000 3b5a0448 3d420020 eb610020 <e9230960> 7f43d378 e9290000 f92af200
        ---[ end trace dcaaf98fb36d9e64 ]---
      
      This patch fixes the bug by adding an explicit nr_tasks counter to
      cgroup_taskset and skipping calling the migration methods if the
      counter is zero.  While at it, remove the now spurious check on no
      source css_sets.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: stable@vger.kernel.org # v4.6+
      Fixes: a79a908f ("cgroup: introduce cgroup namespaces")
      Link: http://lkml.kernel.org/r/1497266622.15415.39.camel@abdul.in.ibm.com
  6. 07 Jul 2017, 2 commits
    • mm, cpuset: always use seqlock when changing task's nodemask · 5f155f27
      By Vlastimil Babka
      When updating task's mems_allowed and rebinding its mempolicy due to
      cpuset's mems being changed, we currently only take the seqlock for
      writing when either the task has a mempolicy, or the new mems has no
      intersection with the old mems.
      
      This should be enough to prevent a parallel allocation seeing no
      available nodes, but the optimization is IMHO unnecessary (cpuset
      updates should not be frequent), and we still potentially risk issues if
      the intersection of new and old nodes has limited amount of
      free/reclaimable memory.
      
      Let's just use the seqlock for all tasks.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-6-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, mempolicy: simplify rebinding mempolicies when updating cpusets · 213980c0
      By Vlastimil Babka
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") has introduced a two-step protocol when
      rebinding task's mempolicy due to cpuset update, in order to avoid a
      parallel allocation seeing an empty effective nodemask and failing.
      
      Later, commit cc9a6c87 ("cpuset: mm: reduce large amounts of memory
      barrier related damage v3") introduced a seqlock protection and removed
      the synchronization point between the two update steps.  At that point
      (or perhaps later), the two-step rebinding became unnecessary.
      
      Currently it only makes sure that the update first adds new nodes in
      step 1 and then removes nodes in step 2.  Without memory barriers the
      effects are questionable, and even then this cannot prevent a parallel
      zonelist iteration checking the nodemask at each step to observe all
      nodes as unusable for allocation.  We now fully rely on the seqlock to
      prevent premature OOMs and allocation failures.
      
      We can thus remove the two-step update parts and simplify the code.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-5-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 29 Jun 2017, 2 commits
    • cgroup: implement "nsdelegate" mount option · 5136f636
      By Tejun Heo
      Currently, cgroup only supports delegation to !root users and cgroup
      namespaces don't get any special treatments.  This limits the
      usefulness of cgroup namespaces as they by themselves can't be safe
      delegation boundaries.  A process inside a cgroup can change the
      resource control knobs of the parent in the namespace root and may
      move processes in and out of the namespace if cgroups outside its
      namespace are visible somehow.
      
      This patch adds a new mount option "nsdelegate" which makes cgroup
      namespaces delegation boundaries.  If set, cgroup behaves as if write
      permission based delegation took place at namespace boundaries -
      writes to the resource control knobs from the namespace root are
      denied and migrations crossing the namespace boundary aren't allowed
      from inside the namespace.
      
      This allows cgroup namespace to function as a delegation boundary by
      itself.
      
      v2: Silently ignore nsdelegate specified on !init mounts.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Aravind Anbudurai <aru7@fb.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
    • cgroup: restructure cgroup_procs_write_permission() · 824ecbe0
      By Tejun Heo
      Restructure cgroup_procs_write_permission() to make extending
      permission logic easier.
      
      This patch doesn't cause any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
  8. 15 Jun 2017, 7 commits
    • cgroup: fix lockdep warning in debug controller · b6053d40
      By Tejun Heo
      The debug controller grabs cgroup_mutex from interface file show
      functions, which can deadlock and trigger lockdep warnings.  Fix it by
      using cgroup_kn_lock_live()/cgroup_kn_unlock() instead.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
    • cgroup: refactor cgroup_masks_read() in the debug controller · 2866c0b4
      By Tejun Heo
      Factor cgroup_masks_read_one() out of cgroup_masks_read() for
      simplicity.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
    • cgroup: make debug an implicit controller on cgroup2 · 8cc38fa7
      By Tejun Heo
      Make debug an implicit controller on cgroup2 which is enabled by
      "cgroup_debug" boot param.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
    • cgroup: Make debug cgroup support v2 and thread mode · 575313f4
      By Waiman Long
      Besides supporting cgroup v2 and thread mode, the following changes
      are also made:
       1) current_* cgroup files now reside only at the root as we don't
          need duplicated files of the same function all over the cgroup
          hierarchy.
       2) The cgroup_css_links_read() function is modified to report
          the number of tasks that are skipped because of overflow.
       3) The number of extra unaccounted references are displayed.
       4) The current_css_set_read() function now prints out the addresses of
          the css'es associated with the current css_set.
       5) A new cgroup_subsys_states file is added to display the css objects
          associated with a cgroup.
       6) A new cgroup_masks file is added to display the various controller
          bit masks in the cgroup.
      
      tj: Dropped thread mode related information for now so that debug
          controller changes aren't blocked on the thread mode.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: Make Kconfig prompt of debug cgroup more accurate · 23b0be48
      By Waiman Long
      Make the Kconfig prompt and description of the debug cgroup controller
      more accurate by saying that it is for debug purposes only and its
      interfaces are unstable.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: Move debug cgroup to its own file · a28f8f5e
      By Waiman Long
      The debug cgroup currently resides within cgroup-v1.c and is enabled
      only for v1 cgroup. To enable the debug cgroup also for v2, it makes
      sense to put the code into its own file as it will no longer be v1
      specific. There is no change to the debug cgroup specific code.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: Keep accurate count of tasks in each css_set · 73a7242a
      By Waiman Long
      The reference count in the css_set data structure was used as a
      proxy for the number of tasks attached to that css_set. However, that
      count is actually not an accurate measure especially with thread mode
      support. So a new variable nr_tasks is added to the css_set to keep
      track of the actual task count. This new variable is protected by
      the css_set_lock. Functions that require the actual task count are
      updated to use the new variable.
      
      tj: s/task_count/nr_tasks/ for consistency with cgroup_root->nr_cgrps.
          Refreshed on top of cgroup/for-v4.13 which dropped on
          css_set_populated() -> nr_tasks conversion.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  9. 25 May 2017, 1 commit
    • cpuset: consider dying css as offline · 41c25707
      By Tejun Heo
      In most cases, a cgroup controller doesn't care about the lifetimes of
      cgroups.  For the controller, a css becomes online when ->css_online()
      is called on it and offline when ->css_offline() is called.
      
      However, cpuset is special in that the user interface it exposes cares
      whether certain cgroups exist or not.  Combined with the RCU delay
      between cgroup removal and css offlining, this can lead to user
      visible behavior oddities where operations which should succeed after
      cgroup removals fail for some time period.  The effects of cgroup
      removals are delayed when seen from userland.
      
      This patch adds css_is_dying() which tests whether offline is pending
      and updates is_cpuset_online() so that the function returns false also
      while offline is pending.  This gets rid of the userland visible
      delays.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
  10. 18 May 2017, 1 commit
  11. 02 May 2017, 1 commit
  12. 29 Apr 2017, 2 commits
    • cgroup: avoid attaching a cgroup root to two different superblocks, take 2 · 9732adc5
      By Zefan Li
      Commit bfb0b80d ("cgroup: avoid attaching a cgroup root to two
      different superblocks") is broken.  Now we try to fix the race by
      delaying the initialization of cgroup root refcnt until a superblock
      has been allocated.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Reported-by: Andrei Vagin <avagin@virtuozzo.com>
      Tested-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc() · a590b90d
      By Tejun Heo
      cgroup_get() is expected to be called only on live cgroups and triggers
      a warning on a dead cgroup; however, cgroup_sk_alloc() may be called
      while cloning a socket which is left in an empty and removed cgroup
      and thus may legitimately duplicate its reference on a dead cgroup.
      This currently triggers the following warning spuriously.
      
       WARNING: CPU: 14 PID: 0 at kernel/cgroup.c:490 cgroup_get+0x55/0x60
       ...
        [<ffffffff8107e123>] __warn+0xd3/0xf0
        [<ffffffff8107e20e>] warn_slowpath_null+0x1e/0x20
        [<ffffffff810ff465>] cgroup_get+0x55/0x60
        [<ffffffff81106061>] cgroup_sk_alloc+0x51/0xe0
        [<ffffffff81761beb>] sk_clone_lock+0x2db/0x390
        [<ffffffff817cce06>] inet_csk_clone_lock+0x16/0xc0
        [<ffffffff817e8173>] tcp_create_openreq_child+0x23/0x4b0
        [<ffffffff818601a1>] tcp_v6_syn_recv_sock+0x91/0x670
        [<ffffffff817e8b16>] tcp_check_req+0x3a6/0x4e0
        [<ffffffff81861ba3>] tcp_v6_rcv+0x693/0xa00
        [<ffffffff81837429>] ip6_input_finish+0x59/0x3e0
        [<ffffffff81837cb2>] ip6_input+0x32/0xb0
        [<ffffffff81837387>] ip6_rcv_finish+0x57/0xa0
        [<ffffffff81837ac8>] ipv6_rcv+0x318/0x4d0
        [<ffffffff817778c7>] __netif_receive_skb_core+0x2d7/0x9a0
        [<ffffffff81777fa6>] __netif_receive_skb+0x16/0x70
        [<ffffffff81778023>] netif_receive_skb_internal+0x23/0x80
        [<ffffffff817787d8>] napi_gro_frags+0x208/0x270
        [<ffffffff8168a9ec>] mlx4_en_process_rx_cq+0x74c/0xf40
        [<ffffffff8168b270>] mlx4_en_poll_rx_cq+0x30/0x90
        [<ffffffff81778b30>] net_rx_action+0x210/0x350
        [<ffffffff8188c426>] __do_softirq+0x106/0x2c7
        [<ffffffff81082bad>] irq_exit+0x9d/0xa0 [<ffffffff8188c0e4>] do_IRQ+0x54/0xd0
        [<ffffffff8188a63f>] common_interrupt+0x7f/0x7f <EOI>
        [<ffffffff8173d7e7>] cpuidle_enter+0x17/0x20
        [<ffffffff810bdfd9>] cpu_startup_entry+0x2a9/0x2f0
        [<ffffffff8103edd1>] start_secondary+0xf1/0x100
      
      This patch renames the existing cgroup_get() with the dead cgroup
      warning to cgroup_get_live() after cgroup_kn_lock_live() and
      introduces the new cgroup_get() which doesn't check whether the cgroup
      is live or dead.
      
      All existing cgroup_get() users except for cgroup_sk_alloc() are
      converted to use cgroup_get_live().
      
      Fixes: d979a39d ("cgroup: duplicate cgroup reference when cloning sockets")
      Cc: stable@vger.kernel.org # v4.5+
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Chris Mason <clm@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  13. 16 Apr 2017, 1 commit
  14. 11 Apr 2017, 2 commits
    • cgroup: avoid attaching a cgroup root to two different superblocks · bfb0b80d
      By Zefan Li
      Run this:
      
          touch file0
          for ((; ;))
          {
              mount -t cpuset xxx file0
          }
      
      And this concurrently:
      
          touch file1
          for ((; ;))
          {
              mount -t cpuset xxx file1
          }
      
      We'll trigger a warning like this:
      
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 4675 at lib/percpu-refcount.c:317 percpu_ref_kill_and_confirm+0x92/0xb0
       percpu_ref_kill_and_confirm called more than once on css_release!
       CPU: 1 PID: 4675 Comm: mount Not tainted 4.11.0-rc5+ #5
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
       Call Trace:
        dump_stack+0x63/0x84
        __warn+0xd1/0xf0
        warn_slowpath_fmt+0x5f/0x80
        percpu_ref_kill_and_confirm+0x92/0xb0
        cgroup_kill_sb+0x95/0xb0
        deactivate_locked_super+0x43/0x70
        deactivate_super+0x46/0x60
       ...
       ---[ end trace a79f61c2a2633700 ]---
      
      Here's a race:
      
        Thread A				Thread B
      
        cgroup1_mount()
          # alloc a new cgroup root
          cgroup_setup_root()
      					cgroup1_mount()
      					  # no sb yet, returns NULL
      					  kernfs_pin_sb()
      
      					  # but succeeds in getting the refcnt,
      					  # so re-use cgroup root
      					  percpu_ref_tryget_live()
          # alloc sb with cgroup root
          cgroup_do_mount()
      
        cgroup_kill_sb()
      					  # alloc another sb with same root
      					  cgroup_do_mount()
      
      					cgroup_kill_sb()
      
      We end up using the same cgroup root for two different superblocks,
      so percpu_ref_kill() will be called twice on the same root when the
      two superblocks are destroyed.
      
      The fix is to make sure the superblock pinning really succeeded.
      
      Cc: stable@vger.kernel.org # 3.16+
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cpuset: Remove cpuset_update_active_cpus()'s parameter. · 30e03acd
      By Rakib Mullick
      In cpuset_update_active_cpus(), cpu_online isn't used anymore. Remove
      it.
      
      Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  15. 28 Mar 2017, 1 commit
  16. 17 Mar 2017, 1 commit
    • cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups · 77f88796
      By Tejun Heo
      Creation of a kthread goes through a couple of interlocked stages between
      the kthread itself and its creator.  Once the new kthread starts
      running, it initializes itself and wakes up the creator.  The creator
      then can further configure the kthread and then let it start doing its
      job by waking it up.
      
      In this configuration-by-creator stage, the creator is the only one
      that can wake it up but the kthread is visible to userland.  When
      altering the kthread's attributes from userland is allowed, this is
      fine; however, for cases where CPU affinity is critical,
      kthread_bind() is used to first disable affinity changes from userland
      and then set the affinity.  This also prevents the kthread from being
      migrated into non-root cgroups as that can affect the CPU affinity and
      many other things.
      
      Unfortunately, the cgroup side of protection is racy.  While the
      PF_NO_SETAFFINITY flag prevents further migrations, userland can win
      the race before the creator sets the flag with kthread_bind() and put
      the kthread in a non-root cgroup, which can lead to all sorts of
      problems including incorrect CPU affinity and starvation.
      
      This bug got triggered by userland which periodically tries to migrate
      all processes in the root cpuset cgroup to a non-root one.  Per-cpu
      workqueue workers got caught while being created and ended up with
      incorrect CPU affinity, breaking concurrency management and sometimes
      stalling workqueue execution.
      
      This patch adds task->no_cgroup_migration which disallows the task to
      be migrated by userland.  kthreadd starts with the flag set making
      every child kthread start in the root cgroup with migration
      disallowed.  The flag is cleared after the kthread finishes
      initialization by which time PF_NO_SETAFFINITY is set if the kthread
      should stay in the root cgroup.
      
      It'd be better to wait for the initialization instead of failing but I
      couldn't think of a way of implementing that without adding either a
      new PF flag, or sleeping and retrying from waiting side.  Even if
      userland depends on changing cgroup membership of a kthread, it either
      has to be synchronized with kthread_create() or periodically repeat,
      so it's unlikely that this would break anything.
      
      v2: Switch to a simpler implementation using a new task_struct bit
          field suggested by Oleg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reported-and-debugged-by: Chris Mason <clm@fb.com>
      Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
      Signed-off-by: Tejun Heo <tj@kernel.org>
  17. 10 Mar 2017, 1 commit
  18. 09 Mar 2017, 1 commit
  19. 07 Mar 2017, 3 commits
    • cgroups: censor kernel pointer in debug files · b6a6759d
      By Kees Cook
      As found in grsecurity, this avoids exposing a kernel pointer through
      the cgroup debug entries.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup/pids: remove spurious suspicious RCU usage warning · 1d18c274
      By Tejun Heo
      pids_can_fork() is special in that the css association is guaranteed
      to be stable throughout the function and thus doesn't need RCU
      protection around task_css access.  When determining the css to charge
      the pid, task_css_check() is used to override the RCU sanity check.
      
      While adding a warning message on fork rejection from pids limit,
      135b8b37 ("cgroup: Add pids controller event when fork fails
      because of pid limit") incorrectly added a task_css access which is
      neither RCU protected nor explicitly annotated.  This triggers the
      following suspicious RCU usage warning when RCU debugging is enabled.
      
        cgroup: fork rejected by pids controller in
      
        ===============================
        [ ERR: suspicious RCU usage.  ]
        4.10.0-work+ #1 Not tainted
        -------------------------------
        ./include/linux/cgroup.h:435 suspicious rcu_dereference_check() usage!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 2, debug_locks = 0
        1 lock held by bash/1748:
         #0:  (&cgroup_threadgroup_rwsem){+++++.}, at: [<ffffffff81052c96>] _do_fork+0xe6/0x6e0
      
        stack backtrace:
        CPU: 3 PID: 1748 Comm: bash Not tainted 4.10.0-work+ #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
        Call Trace:
         dump_stack+0x68/0x93
         lockdep_rcu_suspicious+0xd7/0x110
         pids_can_fork+0x1c7/0x1d0
         cgroup_can_fork+0x67/0xc0
         copy_process.part.58+0x1709/0x1e90
         _do_fork+0xe6/0x6e0
         SyS_clone+0x19/0x20
         do_syscall_64+0x5c/0x140
         entry_SYSCALL64_slow_path+0x25/0x25
        RIP: 0033:0x7f7853fab93a
        RSP: 002b:00007ffc12d05c90 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
        RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7853fab93a
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
        RBP: 00007ffc12d05cc0 R08: 0000000000000000 R09: 00007f78548db700
        R10: 00007f78548db9d0 R11: 0000000000000246 R12: 00000000000006d4
        R13: 0000000000000001 R14: 0000000000000000 R15: 000055e3ebe2c04d
      
      There's no reason to dereference task_css again here when the
      associated css is already available.  Fix it by replacing the
      task_cgroup() call with css->cgroup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Mike Galbraith <efault@gmx.de>
      Fixes: 135b8b37 ("cgroup: Add pids controller event when fork fails because of pid limit")
      Cc: Kenny Yu <kennyyu@fb.com>
      Cc: stable@vger.kernel.org # v4.8+
      Signed-off-by: Tejun Heo <tj@kernel.org>
      1d18c274
    • E
      kernel: convert cgroup_namespace.count from atomic_t to refcount_t · 387ad967
      Committed by Elena Reshetova
      refcount_t type and corresponding API should be
      used instead of atomic_t when the variable is used as
      a reference counter. This allows to avoid accidental
      refcounter overflows that might lead to use-after-free
      situations.
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David Windsor <dwindsor@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      387ad967
  20. 03 Mar, 2017 3 commits
    • I
      sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h> · 50ff9d13
      Committed by Ingo Molnar
      It's not used by any of the scheduler methods, but <linux/sched/task_stack.h>
      needs it to pick up STACK_END_MAGIC.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      50ff9d13
    • I
      sched/headers: Move the task_lock()/unlock() APIs to <linux/sched/task.h> · 56cd6973
      Committed by Ingo Molnar
      The task_lock()/task_unlock() APIs are not related to core scheduling;
      they are task lifetime APIs, i.e. they belong in <linux/sched/task.h>.
      
      Move them.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      56cd6973
    • I
      sched/headers: Move task_struct::signal and task_struct::sighand types and accessors into <linux/sched/signal.h> · c3edc401
      Committed by Ingo Molnar
      
      task_struct::signal and task_struct::sighand are pointers, which would normally make it
      straightforward to not define those types in sched.h.
      
      That is not so, because the types are accompanied by a myriad of APIs (macros and inline
      functions) that dereference them.
      
      Split the types and the APIs out of sched.h and move them into a new header, <linux/sched/signal.h>.
      
      With this change sched.h does not know about 'struct signal' and 'struct sighand' anymore,
      trying to put accessors into sched.h as a test fails the following way:
      
        ./include/linux/sched.h: In function ‘test_signal_types’:
        ./include/linux/sched.h:2461:18: error: dereferencing pointer to incomplete type ‘struct signal_struct’
                          ^
      
      This reduces the size and complexity of sched.h significantly.
      
      Update all headers and .c code that relied on getting the signal handling
      functionality from <linux/sched.h> to include <linux/sched/signal.h>.
      
      The list of affected files in the preparatory patch was partly generated by
      grepping for the APIs, and partly by doing coverage build testing, both
      all[yes|mod|def|no]config builds on 64-bit and 32-bit x86, and an array of
      cross-architecture builds.
      
      Nevertheless some (trivial) build breakage is still expected related to rare
      Kconfig combinations and in-flight patches to various kernel code, but most
      of it should be handled by this patch.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c3edc401
  21. 02 Mar, 2017 4 commits
  22. 03 Feb, 2017 1 commit
    • T
      cgroup: drop the matching uid requirement on migration for cgroup v2 · 576dd464
      Committed by Tejun Heo
      Along with the write access to the cgroup.procs or tasks file, cgroup
      has required the writer's euid, unless root, to match [s]uid of the
      target process or task.  On cgroup v1, this is necessary because
      there's nothing preventing a delegatee from pulling in tasks or
      processes from all over the system.
      
      If a user has a cgroup subdirectory delegated to it, the user would
      have write access to the cgroup.procs or tasks file.  If there are no
      further checks than file write access check, the user would be able to
      pull processes from all over the system into its subhierarchy which is
      clearly not the intended behavior.  The matching [s]uid requirement
      partially prevents this problem by allowing a delegatee to pull in the
      processes that belong to it.  This isn't sufficient protection,
      however, because a user would still be able to jump processes across
      two disjoint sub-hierarchies that have been delegated to them.
      
      cgroup v2 resolves the issue by requiring the writer to have access to
      the common ancestor of the cgroup.procs file of the source and target
      cgroups.  This confines each delegatee to their own sub-hierarchy
      proper and bases all permission decisions on the cgroup filesystem
      rather than having to pull in explicit uid matching.
      
      cgroup v2 has still been applying the matching [s]uid requirement just
      for historical reasons.  On cgroup v2, the requirement doesn't serve any
      purpose while unnecessarily complicating the permission model.  Let's
      drop it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      576dd464
  23. 31 Jan, 2017 1 commit
    • T
      cgroup: misc cleanups · b807421a
      Committed by Tejun Heo
      * cgrp_dfl_implicit_ss_mask is ulong instead of u16 unlike other
        ss_masks.  Make it a u16.
      
      * Move have_canfork_callback together with other callback ss_masks.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b807421a