1. 27 Oct 2017, 1 commit
  2. 07 Sep 2017, 2 commits
    • sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs · 50e76632
      Committed by Peter Zijlstra
      Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
      because it now resulted in non-cpuset usage breaking too.
      
      On suspend cpuset_cpu_inactive() doesn't call into
      cpuset_update_active_cpus() because it doesn't want to move tasks about,
      there is no need, all tasks are frozen and won't run again until after
      we've resumed everything.
      
      But this means that when we finally do call into
      cpuset_update_active_cpus() after resuming the last frozen cpu in
      cpuset_cpu_active(), the top_cpuset will not have any difference with
      the cpu_active_mask and thus it will not in fact do _anything_.
      
      So the cpuset configuration will not be restored. This was largely
      hidden because we would unconditionally create identity domains and
      mobile users would not in fact use cpusets much. And servers that do use
      cpusets tend not to suspend-resume much.
      
      An additional problem is that we'd not in fact wait for the cpuset work to
      finish before resuming the tasks, allowing spurious migrations outside
      of the specified domains.
      
      Fix the rebuild by introducing cpuset_force_rebuild() and fix the
      ordering with cpuset_wait_for_hotplug().
      Reported-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: deb7aa30 ("cpuset: reorganize CPU / memory hotplug handling")
      Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      50e76632
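
      A minimal sketch of the resulting shape, using the function names the
      message introduces (cpuset_force_rebuild(), cpuset_wait_for_hotplug());
      the bodies are condensed assumptions, not the verbatim patch:

        static bool force_rebuild;

        void cpuset_force_rebuild(void)
        {
                force_rebuild = true;   /* set when the last frozen CPU resumes */
        }

        static void cpuset_hotplug_workfn(struct work_struct *work)
        {
                bool cpus_updated = !cpumask_equal(top_cpuset.effective_cpus,
                                                   cpu_active_mask);
                /* ... update cpusets ... */
                if (cpus_updated || force_rebuild) {
                        force_rebuild = false;
                        rebuild_sched_domains();  /* restores cpuset domains */
                }
        }

        void cpuset_wait_for_hotplug(void)
        {
                /* thaw tasks only after the rebuild has finished */
                flush_work(&cpuset_hotplug_work);
        }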
    • mm: replace TIF_MEMDIE checks by tsk_is_oom_victim · da99ecf1
      Committed by Michal Hocko
      TIF_MEMDIE is set only on tasks which were either directly selected
      by the OOM killer or passed through mark_oom_victim from the allocator
      path.  tsk_is_oom_victim is more generic and allows identifying all
      tasks (threads) which share the mm with the oom victim.
      
      Please note that the freezer still needs to check TIF_MEMDIE because we
      cannot thaw tasks which do not participate in oom_victims counting,
      otherwise a !TIF_MEMDIE task could interfere after oom_disable returns.
      
      Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da99ecf1
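
      For reference, the helper is roughly this (as in include/linux/oom.h of
      that era):

        static inline bool tsk_is_oom_victim(struct task_struct *tsk)
        {
                /* oom_mm lives on the signal struct shared by all threads,
                 * so this covers every thread sharing the victim's mm */
                return tsk->signal->oom_mm;
        }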
  3. 25 Aug 2017, 2 commits
    • sched/topology, cpuset: Avoid spurious/wrong domain rebuilds · 77d1dfda
      Committed by Peter Zijlstra
      When disabling cpuset.sched_load_balance we expect to be able to online
      CPUs without generating sched_domains. However this is currently
      completely broken.
      
      What happens is that we generate the sched_domains and then destroy
      them. This is because of the spurious 'default' domain build in
      cpuset_update_active_cpus(). That builds a single machine wide domain
      and then schedules a work to build the 'real' domains. The work then
      finds there are _no_ domains and destroys the lot again.
      
      Furthermore, if there actually were cpusets, building the machine-wide
      domain is actively wrong, because it would allow tasks to 'escape' their
      cpuset. Also, I don't think it's needed; the scheduler really should
      respect the active mask.
      Reported-by: Ofer Levi(SW) <oferle@mellanox.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet.Gupta1@synopsys.com <Vineet.Gupta1@synopsys.com>
      Cc: rusty@rustcorp.com.au <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      77d1dfda
    • cpuset: Fix incorrect memory_pressure control file mapping · 1c08c22c
      Committed by Waiman Long
      The memory_pressure control file was incorrectly set up without
      a private value (0, by default). As a result, this control
      file was treated like memory_migrate on read. By adding back the
      FILE_MEMORY_PRESSURE private value, the correct memory pressure value
      will be returned.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 7dbdb199 ("cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE")
      Cc: stable@vger.kernel.org # v4.4+
      1c08c22c
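
      A sketch of what the fixed cftype entry looks like; FILE_MEMORY_PRESSURE
      comes from the message, the surrounding fields are assumptions:

        static struct cftype legacy_files[] = {
                {
                        .name = "memory_pressure",
                        .read_u64 = cpuset_read_u64,
                        /* without this, reads were routed by private == 0,
                         * i.e. as if the file were memory_migrate */
                        .private = FILE_MEMORY_PRESSURE,
                },
                /* ... */
        };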
  4. 18 Aug 2017, 1 commit
    • cpuset: Allow v2 behavior in v1 cgroup · b8d1b8ee
      Committed by Waiman Long
      Cpuset v2 has some useful behaviors that are not present in v1 because
      of backward compatibility concerns. One of them is the restoration of
      the original cpu and memory node masks after a hot removal and addition
      event sequence.
      
      This patch makes the cpuset controller check the
      CGRP_ROOT_CPUSET_V2_MODE flag and use the v2 behavior if it is set.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b8d1b8ee
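
      The check can be captured in one helper, along these lines (a sketch;
      only CGRP_ROOT_CPUSET_V2_MODE is taken from the message):

        static inline bool is_in_v2_mode(void)
        {
                /* true on the v2 hierarchy, or on a v1 hierarchy mounted
                 * with the cpuset_v2_mode option */
                return cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
                       (cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
        }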
  5. 10 Aug 2017, 1 commit
  6. 03 Aug 2017, 1 commit
    • cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() · 89affbf5
      Committed by Dima Zavin
      In codepaths that use the begin/retry interface for reading
      mems_allowed_seq with irqs disabled, there exists a race condition that
      stalls the patch process after only modifying a subset of the
      static_branch call sites.
      
      This problem manifested itself as a deadlock in the slub allocator,
      inside get_any_partial.  The loop reads the mems_allowed_seq value (via
      read_mems_allowed_begin), performs the defrag operation, and then
      verifies the consistency of mems_allowed via read_mems_allowed_retry
      and the cookie returned by xxx_begin.
      
      The issue here is that both begin and retry first check if cpusets are
      enabled via the cpusets_enabled() static branch.  This branch can be
      rewritten dynamically (via cpuset_inc) if a new cpuset is created.  The
      x86 jump label code fully synchronizes across all CPUs for every entry
      it rewrites.  If it rewrites only one of the callsites (specifically the
      one in read_mems_allowed_retry) and then waits for the
      smp_call_function(do_sync_core) to complete while a CPU is inside the
      begin/retry section with IRQs off and the mems_allowed value is changed,
      we can hang.
      
      This is because begin() will always return 0 (since it wasn't patched
      yet) while retry() will test the 0 against the actual value of the seq
      counter.
      
      The fix is to use two different static keys: one for begin
      (pre_enable_key) and one for retry (enable_key).  In cpuset_inc(), we
      first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
      always returns a valid seqcount if we are enabling cpusets.  Similarly, when
      disabling cpusets via cpuset_dec(), we first ensure that callers of
      cpuset_mems_allowed_retry() will start ignoring the seqcount value
      before we let cpuset_mems_allowed_begin() return 0.
      
      The relevant stack traces of the two stuck threads:
      
        CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
        RIP: smp_call_function_many+0x1f9/0x260
        Call Trace:
          smp_call_function+0x3b/0x70
          on_each_cpu+0x2f/0x90
          text_poke_bp+0x87/0xd0
          arch_jump_label_transform+0x93/0x100
          __jump_label_update+0x77/0x90
          jump_label_update+0xaa/0xc0
          static_key_slow_inc+0x9e/0xb0
          cpuset_css_online+0x70/0x2e0
          online_css+0x2c/0xa0
          cgroup_apply_control_enable+0x27f/0x3d0
          cgroup_mkdir+0x2b7/0x420
          kernfs_iop_mkdir+0x5a/0x80
          vfs_mkdir+0xf6/0x1a0
          SyS_mkdir+0xb7/0xe0
          entry_SYSCALL_64_fastpath+0x18/0xad
      
        ...
      
        CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8818087c0000 task.stack: ffffc90000030000
        RIP: int3+0x39/0x70
        Call Trace:
          <#DB> ? ___slab_alloc+0x28b/0x5a0
          <EOE> ? copy_process.part.40+0xf7/0x1de0
          __slab_alloc.isra.80+0x54/0x90
          copy_process.part.40+0xf7/0x1de0
          copy_process.part.40+0xf7/0x1de0
          kmem_cache_alloc_node+0x8a/0x280
          copy_process.part.40+0xf7/0x1de0
          _do_fork+0xe7/0x6c0
          _raw_spin_unlock_irq+0x2d/0x60
          trace_hardirqs_on_caller+0x136/0x1d0
          entry_SYSCALL_64_fastpath+0x5/0xad
          do_syscall_64+0x27/0x350
          SyS_clone+0x19/0x20
          do_syscall_64+0x60/0x350
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
      Fixes: 46e700ab ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
      Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
      Reported-by: Cliff Spradlin <cspradlin@waymo.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89affbf5
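
      Condensed, the two-key scheme looks like this (a sketch assembled from
      the description above; the ordering of the increments and decrements is
      the point):

        extern struct static_key_false cpusets_pre_enable_key;
        extern struct static_key_false cpusets_enabled_key;

        static inline void cpuset_inc(void)
        {
                /* begin() must go live before retry() starts checking */
                static_branch_inc(&cpusets_pre_enable_key);
                static_branch_inc(&cpusets_enabled_key);
        }

        static inline void cpuset_dec(void)
        {
                /* retry() must bail out before begin() can return 0 again */
                static_branch_dec(&cpusets_enabled_key);
                static_branch_dec(&cpusets_pre_enable_key);
        }

        static inline unsigned int read_mems_allowed_begin(void)
        {
                if (!static_branch_unlikely(&cpusets_pre_enable_key))
                        return 0;
                return read_seqcount_begin(&current->mems_allowed_seq);
        }

        static inline bool read_mems_allowed_retry(unsigned int seq)
        {
                if (!static_branch_unlikely(&cpusets_enabled_key))
                        return false;
                return read_seqcount_retry(&current->mems_allowed_seq, seq);
        }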
  7. 21 Jul 2017, 1 commit
    • cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS · bc2fb7ed
      Committed by Tejun Heo
      css_task_iter currently always walks all tasks.  With the scheduled
      cgroup v2 thread support, the iterator would need to handle multiple
      types of iteration.  As a preparation, add @flags to
      css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
      is not specified, it walks all tasks as before.  When asserted, the
      iterator only walks the group leaders.
      
      For now, the only user of the flag is cgroup v2 "cgroup.procs" file
      which no longer needs to skip non-leader tasks in cgroup_procs_next().
      Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
      v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
      cgroup" but "list all thread group id's with any threads in the
      cgroup".
      
      While at it, update cgroup_procs_show() to use task_pid_vnr() instead
      of task_tgid_vnr().  As the iteration guarantees that the function
      only sees group leaders, this doesn't change the output and will allow
      sharing the function for thread iteration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      bc2fb7ed
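
      Typical use of the extended iterator, as a sketch:

        struct css_task_iter it;
        struct task_struct *task;

        css_task_iter_start(css, CSS_TASK_ITER_PROCS, &it);
        while ((task = css_task_iter_next(&it))) {
                /* with CSS_TASK_ITER_PROCS, task is always a group leader;
                 * passing 0 keeps the old walk-all-tasks behavior */
        }
        css_task_iter_end(&it);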
  8. 07 Jul 2017, 2 commits
    • mm, cpuset: always use seqlock when changing task's nodemask · 5f155f27
      Committed by Vlastimil Babka
      When updating task's mems_allowed and rebinding its mempolicy due to
      cpuset's mems being changed, we currently only take the seqlock for
      writing when either the task has a mempolicy, or the new mems has no
      intersection with the old mems.
      
      This should be enough to prevent a parallel allocation from seeing no
      available nodes, but the optimization is IMHO unnecessary (cpuset
      updates should not be frequent), and we still potentially risk issues if
      the intersection of the new and old nodes has a limited amount of
      free/reclaimable memory.
      
      Let's just use the seqlock for all tasks.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-6-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f155f27
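
      The writer side then takes the seqcount unconditionally, roughly
      (simplified from cpuset_change_task_nodemask()):

        task_lock(tsk);
        local_irq_disable();
        write_seqcount_begin(&tsk->mems_allowed_seq);

        nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
        mpol_rebind_task(tsk, newmems);         /* no-op without a mempolicy */
        tsk->mems_allowed = *newmems;

        write_seqcount_end(&tsk->mems_allowed_seq);
        local_irq_enable();
        task_unlock(tsk);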
    • mm, mempolicy: simplify rebinding mempolicies when updating cpusets · 213980c0
      Committed by Vlastimil Babka
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") has introduced a two-step protocol when
      rebinding task's mempolicy due to cpuset update, in order to avoid a
      parallel allocation seeing an empty effective nodemask and failing.
      
      Later, commit cc9a6c87 ("cpuset: mm: reduce large amounts of memory
      barrier related damage v3") introduced a seqlock protection and removed
      the synchronization point between the two update steps.  At that point
      (or perhaps later), the two-step rebinding became unnecessary.
      
      Currently it only makes sure that the update first adds new nodes in
      step 1 and then removes nodes in step 2.  Without memory barriers the
      effects are questionable, and even then this cannot prevent a parallel
      zonelist iteration checking the nodemask at each step to observe all
      nodes as unusable for allocation.  We now fully rely on the seqlock to
      prevent premature OOMs and allocation failures.
      
      We can thus remove the two-step update parts and simplify the code.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-5-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      213980c0
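
      The reader-side pattern the seqlock now fully protects is the usual
      begin/retry loop; a sketch (alloc_from_nodemask() is a hypothetical
      stand-in for the allocation attempt):

        unsigned int cookie;
        struct page *page;

        do {
                cookie = read_mems_allowed_begin();
                page = alloc_from_nodemask();   /* hypothetical helper */
        } while (!page && read_mems_allowed_retry(cookie));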
  9. 25 May 2017, 1 commit
    • cpuset: consider dying css as offline · 41c25707
      Committed by Tejun Heo
      In most cases, a cgroup controller doesn't care about the lifetimes of
      cgroups.  For the controller, a css becomes online when ->css_online()
      is called on it and offline when ->css_offline() is called.
      
      However, cpuset is special in that the user interface it exposes cares
      whether certain cgroups exist or not.  Combined with the RCU delay
      between cgroup removal and css offlining, this can lead to user
      visible behavior oddities where operations which should succeed after
      cgroup removals fail for some time period.  The effects of cgroup
      removals are delayed when seen from userland.
      
      This patch adds css_is_dying() which tests whether offline is pending
      and updates is_cpuset_online() so that the function returns false also
      while offline is pending.  This gets rid of the userland visible
      delays.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      41c25707
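
      The two helpers fit together roughly as follows (condensed):

        static inline bool css_is_dying(struct cgroup_subsys_state *css)
        {
                /* offline is pending once the percpu refcnt starts dying */
                return !(css->flags & CSS_NO_REF) &&
                       percpu_ref_is_dying(&css->refcnt);
        }

        static inline bool is_cpuset_online(struct cpuset *cs)
        {
                return test_bit(CS_ONLINE, &cs->flags) &&
                       !css_is_dying(&cs->css);
        }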
  10. 11 Apr 2017, 1 commit
  11. 28 Mar 2017, 1 commit
  12. 02 Mar 2017, 2 commits
  13. 28 Dec 2016, 1 commit
  14. 25 Dec 2016, 1 commit
  15. 29 Sep 2016, 1 commit
  16. 16 Sep 2016, 1 commit
  17. 13 Sep 2016, 1 commit
    • cpuset: handle race between CPU hotplug and cpuset_hotplug_work · 28b89b9e
      Committed by Joonwoo Park
      A discrepancy between cpu_online_mask and cpuset's effective_cpus
      mask is inevitable during hotplug since cpuset defers updating of
      effective_cpus mask using a workqueue, during which time nothing
      prevents the system from more hotplug operations.  For that reason
      guarantee_online_cpus() walks up the cpuset hierarchy until it finds
      an intersection under the assumption that top cpuset's effective_cpus
      mask intersects with cpu_online_mask even with such a race occurring.
      
      However a sequence of CPU hotplugs can open a time window, during which
      none of the effective CPUs in the top cpuset intersect with
      cpu_online_mask.
      
      For example when there are 4 possible CPUs 0-3 and only CPU0 is online:
      
        ========================  ===========================
         cpu_online_mask           top_cpuset.effective_cpus
        ========================  ===========================
         echo 1 > cpu2/online.
         CPU hotplug notifier woke up hotplug work but not yet scheduled.
            [0,2]                     [0]
      
         echo 0 > cpu0/online.
         The workqueue is still runnable.
            [2]                       [0]
        ========================  ===========================
      
        Now there is no intersection between cpu_online_mask and
        top_cpuset.effective_cpus.  Thus invoking sys_sched_setaffinity() at
        this moment can cause following:
      
         Unable to handle kernel NULL pointer dereference at virtual address 000000d0
         ------------[ cut here ]------------
         Kernel BUG at ffffffc0001389b0 [verbose debug info unavailable]
         Internal error: Oops - BUG: 96000005 [#1] PREEMPT SMP
         Modules linked in:
         CPU: 2 PID: 1420 Comm: taskset Tainted: G        W       4.4.8+ #98
         task: ffffffc06a5c4880 ti: ffffffc06e124000 task.ti: ffffffc06e124000
         PC is at guarantee_online_cpus+0x2c/0x58
         LR is at cpuset_cpus_allowed+0x4c/0x6c
         <snip>
         Process taskset (pid: 1420, stack limit = 0xffffffc06e124020)
         Call trace:
         [<ffffffc0001389b0>] guarantee_online_cpus+0x2c/0x58
         [<ffffffc00013b208>] cpuset_cpus_allowed+0x4c/0x6c
         [<ffffffc0000d61f0>] sched_setaffinity+0xc0/0x1ac
         [<ffffffc0000d6374>] SyS_sched_setaffinity+0x98/0xac
         [<ffffffc000085cb0>] el0_svc_naked+0x24/0x28
      
      The top cpuset's effective_cpus are guaranteed to be identical to
      cpu_online_mask eventually.  Hence fall back to cpu_online_mask when
      there is no intersection between top cpuset's effective_cpus and
      cpu_online_mask.
      Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: cgroups@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 3.17+
      Signed-off-by: Tejun Heo <tj@kernel.org>
      28b89b9e
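
      The fallback amounts to the following shape (paraphrased from the
      description; not the verbatim patch):

        static void guarantee_online_cpus(struct cpuset *cs,
                                          struct cpumask *pmask)
        {
                while (!cpumask_intersects(cs->effective_cpus,
                                           cpu_online_mask)) {
                        cs = parent_cs(cs);
                        if (unlikely(!cs)) {
                                /* walked off the top: a hotplug race emptied
                                 * every level, so fall back */
                                cpumask_copy(pmask, cpu_online_mask);
                                return;
                        }
                }
                cpumask_and(pmask, cs->effective_cpus, cpu_online_mask);
        }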
  18. 10 Aug 2016, 2 commits
    • cgroup: make cgroup_path() and friends behave in the style of strlcpy() · 4c737b41
      Committed by Tejun Heo
      cgroup_path() and friends used to format the path from the end and
      thus the resulting path usually didn't start at the start of the
      passed in buffer.  Also, when the buffer was too small, the partial
      result was truncated from the head rather than tail and there was no
      way to tell how long the full path would be.  These make the functions
      less robust and more awkward to use.
      
      With recent updates to kernfs_path(), cgroup_path() and friends can be
      made to behave in strlcpy() style.
      
      * cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now
        always return the length of the full path.  If buffer is too small,
        it contains nul terminated truncated output.
      
      * All users updated accordingly.
      
      v2: cgroup_path() usage in kernel/sched/debug.c converted.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      4c737b41
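
      Callers can now detect truncation the strlcpy() way; a sketch:

        char buf[PATH_MAX];
        int len;

        len = cgroup_path(cgrp, buf, sizeof(buf));
        if (len >= (int)sizeof(buf)) {
                /* buf holds a nul-terminated truncated path;
                 * the full path would need len + 1 bytes */
        }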
    • cpuset: make sure new tasks conform to the current config of the cpuset · 06f4e948
      Committed by Zefan Li
      A new task inherits the cpus_allowed and mems_allowed masks from its parent,
      but if someone changes the cpuset's config by writing to cpuset.cpus/cpuset.mems
      before this new task is inserted into the cgroup's task list, the new task
      won't be updated accordingly.
      Signed-off-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      06f4e948
  19. 29 Jul 2016, 1 commit
  20. 20 May 2016, 2 commits
    • cpuset: use static key better and convert to new API · 002f2906
      Committed by Vlastimil Babka
      An important function for cpusets is cpuset_node_allowed(), which
      optimizes on the fact that if there's a single root cpuset, allocation
      must be trivially allowed.  But the check "nr_cpusets() <= 1" doesn't
      use the cpusets_enabled_key static key the right way, where static keys
      eliminate branching overhead with jump labels.
      
      This patch converts it so that static key is used properly.  It's also
      switched to the new static key API and the checking functions are
      converted to return bool instead of int.  We also provide a new variant
      __cpuset_zone_allowed() which expects that the static key check was
      already done and they key was enabled.  This is needed for
      get_page_from_freelist() where we want to also avoid the relatively
      slower check when ALLOC_CPUSET is not set in alloc_flags.
      
      The impact on the page allocator microbenchmark is less than expected
      but the cleanup in itself is worthwhile.
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                             multcheck-v1r20               cpuset-v1r20
        Min      alloc-odr0-1               348.00 (  0.00%)           348.00 (  0.00%)
        Min      alloc-odr0-2               254.00 (  0.00%)           254.00 (  0.00%)
        Min      alloc-odr0-4               213.00 (  0.00%)           213.00 (  0.00%)
        Min      alloc-odr0-8               186.00 (  0.00%)           183.00 (  1.61%)
        Min      alloc-odr0-16              173.00 (  0.00%)           171.00 (  1.16%)
        Min      alloc-odr0-32              166.00 (  0.00%)           163.00 (  1.81%)
        Min      alloc-odr0-64              162.00 (  0.00%)           159.00 (  1.85%)
        Min      alloc-odr0-128             160.00 (  0.00%)           157.00 (  1.88%)
        Min      alloc-odr0-256             169.00 (  0.00%)           166.00 (  1.78%)
        Min      alloc-odr0-512             180.00 (  0.00%)           180.00 (  0.00%)
        Min      alloc-odr0-1024            188.00 (  0.00%)           187.00 (  0.53%)
        Min      alloc-odr0-2048            194.00 (  0.00%)           193.00 (  0.52%)
        Min      alloc-odr0-4096            199.00 (  0.00%)           198.00 (  0.50%)
        Min      alloc-odr0-8192            202.00 (  0.00%)           201.00 (  0.50%)
        Min      alloc-odr0-16384           203.00 (  0.00%)           202.00 (  0.49%)
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      002f2906
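
      The converted checks come out roughly as (a sketch following the new
      static key API described above):

        DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);

        static inline bool cpusets_enabled(void)
        {
                return static_branch_unlikely(&cpusets_enabled_key);
        }

        static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
        {
                if (cpusets_enabled())
                        return __cpuset_zone_allowed(z, gfp_mask);
                return true;    /* single root cpuset: trivially allowed */
        }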
    • include/linux/nodemask.h: create next_node_in() helper · 0edaf86c
      Committed by Andrew Morton
      Lots of code does
      
      	node = next_node(node, XXX);
      	if (node == MAX_NUMNODES)
      		node = first_node(XXX);
      
      so create next_node_in() to do this and use it in various places.
      
      [mhocko@suse.com: use next_node_in() helper]
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Hui Zhu <zhuhui@xiaomi.com>
      Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0edaf86c
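
      The helper wraps around, along these lines (a condensed sketch; the
      in-tree version routes through an out-of-line __next_node_in()):

        static inline int next_node_in(int node, const nodemask_t *srcp)
        {
                int ret = next_node(node, *srcp);

                if (ret == MAX_NUMNODES)
                        ret = first_node(*srcp);        /* wrap around */
                return ret;
        }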
  21. 26 Apr 2016, 1 commit
    • cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback · 5cf1cacb
      Committed by Tejun Heo
      Since e93ad19d ("cpuset: make mm migration asynchronous"), cpuset
      kicks off asynchronous NUMA node migration if necessary during task
      migration and flushes it from cpuset_post_attach_flush() which is
      called at the end of __cgroup_procs_write().  This is to avoid
      performing migration with cgroup_threadgroup_rwsem write-locked which
      can lead to deadlock through dependency on kworker creation.
      
      memcg has a similar issue with charge moving, so let's convert it to
      an official callback rather than the current one-off cpuset specific
      function.  This patch adds cgroup_subsys->post_attach callback and
      makes cpuset register cpuset_post_attach_flush() as its ->post_attach.
      
      The conversion is mostly one-to-one except that the new callback is
      called under cgroup_mutex.  This is to guarantee that no other
      migration operations are started before ->post_attach callbacks are
      finished.  cgroup_mutex is one of the outermost mutexes in the system
      and has never been and shouldn't be a problem.  We can add specialized
      synchronization around __cgroup_procs_write() but I don't think
      there's any noticeable benefit.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
      5cf1cacb
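
      Registration then becomes declarative; a sketch of the subsys struct
      (fields other than .post_attach are illustrative):

        struct cgroup_subsys cpuset_cgrp_subsys = {
                .css_alloc      = cpuset_css_alloc,
                .attach         = cpuset_attach,
                .post_attach    = cpuset_post_attach,   /* ex cpuset_post_attach_flush() */
                .legacy_cftypes = files,
                .early_init     = true,
        };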
  22. 23 Feb 2016, 1 commit
  23. 17 Feb 2016, 1 commit
    • cgroup: introduce cgroup namespaces · a79a908f
      Committed by Aditya Kali
      Introduce the ability to create a new cgroup namespace. The newly created
      cgroup namespace remembers the cgroup of the process at the point
      of creation of the cgroup namespace (referred to as the cgroupns-root).
      The main purpose of a cgroup namespace is to virtualize the contents
      of the /proc/self/cgroup file. Processes inside a cgroup namespace
      are only able to see paths relative to their namespace root
      (unless they are moved outside of their cgroupns-root, at which point
       they will see a relative path from their cgroupns-root).
      For a correctly set up container, this enables container tools
      (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
      containers without leaking system level cgroup hierarchy to the task.
      This patch only implements the 'unshare' part of the cgroupns.
      Signed-off-by: Aditya Kali <adityakali@google.com>
      Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      a79a908f
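
      From userland, the unshare part looks like this (a minimal sketch;
      requires CAP_SYS_ADMIN):

        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>

        int main(void)
        {
                if (unshare(CLONE_NEWCGROUP)) {
                        perror("unshare");
                        return 1;
                }
                /* /proc/self/cgroup now shows paths relative to the
                 * cgroupns-root, e.g. "0::/" */
                return 0;
        }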
  24. 22 Jan 2016, 1 commit
    • cpuset: make mm migration asynchronous · e93ad19d
      Committed by Tejun Heo
      If "cpuset.memory_migrate" is set, when a process is moved from one
      cpuset to another with a different memory node mask, pages in use by
      the process are migrated to the new set of nodes.  This was performed
      synchronously in the ->attach() callback, which is synchronized
      against process management.  Recently, the synchronization was changed
      from per-process rwsem to global percpu rwsem for simplicity and
      optimization.
      
      Combined with the synchronous mm migration, this led to deadlocks
      because mm migration could schedule a work item which may in turn try
      to create a new worker blocking on the process management lock held
      from cgroup process migration path.
      
      An operation this heavy shouldn't be performed synchronously from that
      deep inside cgroup migration in the first place.  This patch punts the
      actual migration to an ordered workqueue and updates cgroup process
      migration and cpuset config update paths to flush the workqueue after
      all locks are released.  This way, the operations still seem
      synchronous to userland without entangling mm migration with process
      management synchronization.  CPU hotplug can also invoke mm migration
      but there's no reason for it to wait for mm migrations and thus
      doesn't synchronize against their completions.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: stable@vger.kernel.org # v4.4+
      e93ad19d
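
      The punt-to-workqueue shape, condensed (struct layout and bodies are
      assumptions consistent with the description):

        struct cpuset_migrate_mm_work {
                struct work_struct      work;
                struct mm_struct        *mm;
                nodemask_t              from;
                nodemask_t              to;
        };

        static struct workqueue_struct *cpuset_migrate_mm_wq;

        static void cpuset_migrate_mm_workfn(struct work_struct *work)
        {
                struct cpuset_migrate_mm_work *mwork =
                        container_of(work, struct cpuset_migrate_mm_work, work);

                do_migrate_pages(mwork->mm, &mwork->from, &mwork->to,
                                 MPOL_MF_MOVE_ALL);
                mmput(mwork->mm);
                kfree(mwork);
        }

        static void cpuset_migrate_mm(struct mm_struct *mm,
                                      const nodemask_t *from,
                                      const nodemask_t *to)
        {
                struct cpuset_migrate_mm_work *mwork =
                        kzalloc(sizeof(*mwork), GFP_KERNEL);

                if (mwork) {
                        mwork->mm = mm;
                        mwork->from = *from;
                        mwork->to = *to;
                        INIT_WORK(&mwork->work, cpuset_migrate_mm_workfn);
                        queue_work(cpuset_migrate_mm_wq, &mwork->work);
                } else {
                        mmput(mm);
                }
        }

        void cpuset_post_attach_flush(void)
        {
                /* called after cgroup locks are dropped, so migration still
                 * looks synchronous to userland */
                flush_workqueue(cpuset_migrate_mm_wq);
        }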
  25. 03 Dec 2015, 1 commit
    • cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Committed by Tejun Heo
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B and A's processes should be moved to
      the former and B's processes the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated which is the parent of the
      target csses.  pids controller depends on the migration methods to
      move charges and this made the controller attribute charges to the
      wrong csses often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing the @css parameter from the three
      migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
      by updating the cgroup_taskset iteration helpers to also return the
      destination css in addition to the task being migrated.  All controllers
      are updated accordingly.
      
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there's single source
        and destination and thus doesn't support v2 hierarchy already.  The
        only change made by this patchset is how that single destination css
        is obtained.
      
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accommodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      1f7dd3e5
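
      After the change, a migration method takes only the taskset and reads
      each task's destination css from the iterator; a sketch (my_can_attach()
      is a hypothetical controller method):

        static int my_can_attach(struct cgroup_taskset *tset)
        {
                struct task_struct *task;
                struct cgroup_subsys_state *dst_css;

                cgroup_taskset_for_each(task, dst_css, tset) {
                        /* dst_css is the destination css of this particular
                         * task; tasks in one tset may target different csses */
                }
                return 0;
        }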
  26. 26 Nov 2015, 1 commit
    • cpuset: Replace all instances of time_t with time64_t · d2b43658
      Committed by Arnd Bergmann
      The following patch replaces all instances of time_t with time64_t i.e.
      change the type used for representing time from 32-bit to 64-bit. All
      32-bit kernels to date use a signed 32-bit time_t type, which can only
      represent time until January 2038. Since embedded systems running 32-bit
      Linux are going to survive beyond that date, we have to change all
      current uses, in a backwards compatible way.
      
      The patch also changes the function get_seconds() that returns a 32-bit
      integer to ktime_get_seconds() that returns seconds as 64-bit integer.
      
      The patch changes the type of ticks from time_t to u32. We keep ticks as
      32 bits, as the function uses 32-bit arithmetic, which would prove less
      expensive than 64-bit arithmetic, and the function is expected to be
      called at least once every 32 seconds.
      Signed-off-by: Heena Sirwani <heenasirwani@gmail.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      d2b43658
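
      In cpuset this lands in the frequency meter; a sketch of the converted
      types (the fmeter fields follow kernel/cpuset.c, the update body is
      condensed):

        struct fmeter {
                int             cnt;    /* unprocessed events count */
                int             val;    /* most recent output value */
                time64_t        time;   /* clock (secs) when val computed */
                spinlock_t      lock;   /* guards read or write of above */
        };

        static void fmeter_update(struct fmeter *fmp)
        {
                time64_t now = ktime_get_seconds();     /* was get_seconds() */
                u32 ticks = now - fmp->time;            /* was time_t */

                /* ... decay fmp->val over 'ticks' with 32-bit arithmetic ... */
        }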
  27. 06 Nov 2015, 1 commit
  28. 16 Oct 2015, 1 commit
    • cgroup: replace cgroup_has_tasks() with cgroup_is_populated() · 27bd4dbb
      Committed by Tejun Heo
      Currently, cgroup_has_tasks() tests whether the target cgroup has any
      css_set linked to it.  This works because a css_set's refcnt converges
      with the number of tasks linked to it and thus there's no css_set
      linked to a cgroup if it doesn't have any live tasks.
      
      To help tracking resource usage of zombie tasks, putting the ref of
      css_set will be separated from disassociating the task from the
      css_set which means that a cgroup may have css_sets linked to it even
      when it doesn't have any live tasks.
      
      This patch replaces cgroup_has_tasks() with cgroup_is_populated()
      which tests cgroup->nr_populated instead which locally counts the
      number of populated css_sets.  Unlike cgroup_has_tasks(),
      cgroup_is_populated() is recursive - if any of the descendants is
      populated, the cgroup is populated too.  While this changes the
      meaning of the test, all the existing users are okay with the change.
      
      While at it, replace the open-coded ->populated_cnt test in
      cgroup_events_show() with cgroup_is_populated().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      27bd4dbb
  29. 23 Sep 2015, 2 commits
    • cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader() · 4530eddb
      Committed by Tejun Heo
      It wasn't explicitly documented but, when a process is being migrated,
      cpuset and memcg depend on cgroup_taskset_first() returning the
      threadgroup leader; however, this approach is somewhat ghetto and
      would no longer work for the planned multi-process migration.
      
      This patch introduces explicit cgroup_taskset_for_each_leader() which
      iterates over only the threadgroup leaders and replaces
      cgroup_taskset_first() usages for accessing the leader with it.
      
      This prepares both memcg and cpuset for multi-process migration.  This
      patch also updates the documentation for cgroup_taskset_for_each() to
      clarify the iteration rules and removes comments mentioning task
      ordering in tasksets.
      
      v2: A previous patch which added threadgroup leader test was dropped.
          Patch updated accordingly.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      4530eddb
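
      Usage is symmetrical with the existing iterator; a sketch (signature as
      of this commit, before the later dst_css addition):

        struct task_struct *leader;

        cgroup_taskset_for_each_leader(leader, tset) {
                /* only threadgroup leaders are visited, so per-process
                 * state like the mm is handled exactly once per process */
        }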
    • cpuset: migrate memory only for threadgroup leaders · 3df9ca0a
      Committed by Tejun Heo
      If memory_migrate flag is set, cpuset migrates memory according to the
      destination css's nodemask.  The current implementation migrates memory
      whenever any thread of a process is migrated making the behavior
      somewhat arbitrary.  Let's tie memory operations to the threadgroup
      leader so that memory is migrated only when the leader is migrated.
      
      While this is a behavior change, given the inherent fuzziness, this
      change is not too likely to be noticed and allows us to clearly define
      who owns the memory (always the leader) and helps the planned atomic
      multi-process migration.
      
      Note that we're currently migrating memory in migration path proper
      while holding all the locks.  In the long term, this should be moved
      out to an async work item.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      3df9ca0a
  30. 19 Sep 2015, 1 commit
    • cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE · 7dbdb199
      Committed by Tejun Heo
      cftype->mode allows controllers to give arbitrary permissions to
      interface knobs.  Except for "cgroup.event_control", the existing uses
      are spurious.
      
      * Some explicitly specify S_IRUGO | S_IWUSR even though that's the
        default.
      
      * "cpuset.memory_pressure" specifies S_IRUGO while also setting a
        write callback which returns -EACCES.  All it needs to do is simply
        not setting a write callback.
      
      "cgroup.event_control" uses cftype->mode to make the file
      world-writable.  It's a misdesigned interface and we don't want
      controllers to be tweaking interface file permissions in general.
      This patch removes cftype->mode and all its spurious uses and
      implements CFTYPE_WORLD_WRITABLE for "cgroup.event_control" which is
      marked as compatibility-only.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      7dbdb199
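
      The one legitimate user keeps world-writability via the new flag; a
      sketch of the memcg entry (fields condensed):

        static struct cftype mem_cgroup_legacy_files[] = {
                {
                        .name = "cgroup.event_control", /* compat only */
                        .write = memcg_write_event_control,
                        .flags = CFTYPE_WORLD_WRITABLE, /* replaces ->mode */
                },
                /* ... */
        };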
  31. 18 Sep 2015, 1 commit
    • cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl() · 9e10a130
      Committed by Tejun Heo
      cgroup_on_dfl() tests whether the cgroup's root is the default
      hierarchy; however, an individual controller is only interested in
      whether the controller is attached to the default hierarchy and never
      tests a cgroup which doesn't belong to the hierarchy that the
      controller is attached to.
      
      This patch replaces cgroup_on_dfl() tests in controllers with faster
      static_key based cgroup_subsys_on_dfl().  This leaves cgroup core as
      the only user of cgroup_on_dfl() and the function is moved from the
      header file to cgroup.c.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      9e10a130
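
      The conversion pattern inside a controller, sketched
      (update_v2_behavior() is a hypothetical stand-in):

        /* before: walks to the css's root cgroup on every call */
        if (cgroup_on_dfl(cs->css.cgroup))
                update_v2_behavior(cs);

        /* after: static_key-backed test against the subsystem itself */
        if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys))
                update_v2_behavior(cs);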
  32. 10 Aug 2015, 1 commit
    • cpuset: use trialcs->mems_allowed as a temp variable · 24ee3cf8
      Committed by Alban Crequy
      The comment says it's using trialcs->mems_allowed as a temp variable, but
      the code didn't match the comment. Change the code to match the comment.
      
      This fixes an issue when writing in cpuset.mems when a sub-directory
      exists: we need to write several times for the information to persist:
      
      | root@alban:/sys/fs/cgroup/cpuset# mkdir footest9
      | root@alban:/sys/fs/cgroup/cpuset# cd footest9
      | root@alban:/sys/fs/cgroup/cpuset/footest9# mkdir aa
      | root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
      |
      | root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
      | root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
      |
      | root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
      | root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
      | 0
      | root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
      |
      | root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > aa/cpuset.mems
      | root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
      | 0
      | root@alban:/sys/fs/cgroup/cpuset/footest9#
      
      This should help to fix the following issue in Docker:
      https://github.com/opencontainers/runc/issues/133
      In some conditions, a Docker container needs to be started twice in
      order to work.
      Signed-off-by: Alban Crequy <alban@endocode.com>
      Tested-by: Iago López Galeiras <iago@endocode.com>
      Cc: <stable@vger.kernel.org> # 3.17+
      Acked-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      24ee3cf8
  33. 15 4月, 2015 1 次提交