1. 18 8月, 2017 1 次提交
    • W
      cpuset: Allow v2 behavior in v1 cgroup · b8d1b8ee
      Waiman Long 提交于
      Cpuset v2 has some useful behaviors that are not present in v1 because
      of backward compatibility concern. One of that is the restoration of
      the original cpu and memory node mask after a hot removal and addition
      event sequence.
      
      This patch makes the cpuset controller to check the
      CGRP_ROOT_CPUSET_V2_MODE flag and use the v2 behavior if it is set.
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b8d1b8ee
  2. 21 7月, 2017 1 次提交
    • T
      cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS · bc2fb7ed
      Tejun Heo 提交于
      css_task_iter currently always walks all tasks.  With the scheduled
      cgroup v2 thread support, the iterator would need to handle multiple
      types of iteration.  As a preparation, add @flags to
      css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
      is not specified, it walks all tasks as before.  When asserted, the
      iterator only walks the group leaders.
      
      For now, the only user of the flag is cgroup v2 "cgroup.procs" file
      which no longer needs to skip non-leader tasks in cgroup_procs_next().
      Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
      v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
      cgroup" but "list all thread group id's with any threads in the
      cgroup".
      
      While at it, update cgroup_procs_show() to use task_pid_vnr() instead
      of task_tgid_vnr().  As the iteration guarantees that the function
      only sees group leaders, this doesn't change the output and will allow
      sharing the function for thread iteration.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      bc2fb7ed
  3. 07 7月, 2017 2 次提交
    • V
      mm, cpuset: always use seqlock when changing task's nodemask · 5f155f27
      Vlastimil Babka 提交于
      When updating task's mems_allowed and rebinding its mempolicy due to
      cpuset's mems being changed, we currently only take the seqlock for
      writing when either the task has a mempolicy, or the new mems has no
      intersection with the old mems.
      
      This should be enough to prevent a parallel allocation seeing no
      available nodes, but the optimization is IMHO unnecessary (cpuset
      updates should not be frequent), and we still potentially risk issues if
      the intersection of new and old nodes has limited amount of
      free/reclaimable memory.
      
      Let's just use the seqlock for all tasks.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-6-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f155f27
    • V
      mm, mempolicy: simplify rebinding mempolicies when updating cpusets · 213980c0
      Vlastimil Babka 提交于
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") has introduced a two-step protocol when
      rebinding task's mempolicy due to cpuset update, in order to avoid a
      parallel allocation seeing an empty effective nodemask and failing.
      
      Later, commit cc9a6c87 ("cpuset: mm: reduce large amounts of memory
      barrier related damage v3") introduced a seqlock protection and removed
      the synchronization point between the two update steps.  At that point
      (or perhaps later), the two-step rebinding became unnecessary.
      
      Currently it only makes sure that the update first adds new nodes in
      step 1 and then removes nodes in step 2.  Without memory barriers the
      effects are questionable, and even then this cannot prevent a parallel
      zonelist iteration checking the nodemask at each step to observe all
      nodes as unusable for allocation.  We now fully rely on the seqlock to
      prevent premature OOMs and allocation failures.
      
      We can thus remove the two-step update parts and simplify the code.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-5-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      213980c0
  4. 25 5月, 2017 1 次提交
    • T
      cpuset: consider dying css as offline · 41c25707
      Tejun Heo 提交于
      In most cases, a cgroup controller don't care about the liftimes of
      cgroups.  For the controller, a css becomes online when ->css_online()
      is called on it and offline when ->css_offline() is called.
      
      However, cpuset is special in that the user interface it exposes cares
      whether certain cgroups exist or not.  Combined with the RCU delay
      between cgroup removal and css offlining, this can lead to user
      visible behavior oddities where operations which should succeed after
      cgroup removals fail for some time period.  The effects of cgroup
      removals are delayed when seen from userland.
      
      This patch adds css_is_dying() which tests whether offline is pending
      and updates is_cpuset_online() so that the function returns false also
      while offline is pending.  This gets rid of the userland visible
      delays.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
      Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
      Cc: stable@vger.kernel.org
      Signed-off-by: NTejun Heo <tj@kernel.org>
      41c25707
  5. 11 4月, 2017 1 次提交
  6. 28 3月, 2017 1 次提交
  7. 02 3月, 2017 2 次提交
  8. 28 12月, 2016 1 次提交
  9. 25 12月, 2016 1 次提交
  10. 29 9月, 2016 1 次提交
  11. 16 9月, 2016 1 次提交
  12. 13 9月, 2016 1 次提交
    • J
      cpuset: handle race between CPU hotplug and cpuset_hotplug_work · 28b89b9e
      Joonwoo Park 提交于
      A discrepancy between cpu_online_mask and cpuset's effective_cpus
      mask is inevitable during hotplug since cpuset defers updating of
      effective_cpus mask using a workqueue, during which time nothing
      prevents the system from more hotplug operations.  For that reason
      guarantee_online_cpus() walks up the cpuset hierarchy until it finds
      an intersection under the assumption that top cpuset's effective_cpus
      mask intersects with cpu_online_mask even with such a race occurring.
      
      However a sequence of CPU hotplugs can open a time window, during which
      none of the effective CPUs in the top cpuset intersect with
      cpu_online_mask.
      
      For example when there are 4 possible CPUs 0-3 and only CPU0 is online:
      
        ========================  ===========================
         cpu_online_mask           top_cpuset.effective_cpus
        ========================  ===========================
         echo 1 > cpu2/online.
         CPU hotplug notifier woke up hotplug work but not yet scheduled.
            [0,2]                     [0]
      
         echo 0 > cpu0/online.
         The workqueue is still runnable.
            [2]                       [0]
        ========================  ===========================
      
        Now there is no intersection between cpu_online_mask and
        top_cpuset.effective_cpus.  Thus invoking sys_sched_setaffinity() at
        this moment can cause following:
      
         Unable to handle kernel NULL pointer dereference at virtual address 000000d0
         ------------[ cut here ]------------
         Kernel BUG at ffffffc0001389b0 [verbose debug info unavailable]
         Internal error: Oops - BUG: 96000005 [#1] PREEMPT SMP
         Modules linked in:
         CPU: 2 PID: 1420 Comm: taskset Tainted: G        W       4.4.8+ #98
         task: ffffffc06a5c4880 ti: ffffffc06e124000 task.ti: ffffffc06e124000
         PC is at guarantee_online_cpus+0x2c/0x58
         LR is at cpuset_cpus_allowed+0x4c/0x6c
         <snip>
         Process taskset (pid: 1420, stack limit = 0xffffffc06e124020)
         Call trace:
         [<ffffffc0001389b0>] guarantee_online_cpus+0x2c/0x58
         [<ffffffc00013b208>] cpuset_cpus_allowed+0x4c/0x6c
         [<ffffffc0000d61f0>] sched_setaffinity+0xc0/0x1ac
         [<ffffffc0000d6374>] SyS_sched_setaffinity+0x98/0xac
         [<ffffffc000085cb0>] el0_svc_naked+0x24/0x28
      
      The top cpuset's effective_cpus are guaranteed to be identical to
      cpu_online_mask eventually.  Hence fall back to cpu_online_mask when
      there is no intersection between top cpuset's effective_cpus and
      cpu_online_mask.
      Signed-off-by: NJoonwoo Park <joonwoop@codeaurora.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: cgroups@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 3.17+
      Signed-off-by: NTejun Heo <tj@kernel.org>
      28b89b9e
  13. 10 8月, 2016 2 次提交
    • T
      cgroup: make cgroup_path() and friends behave in the style of strlcpy() · 4c737b41
      Tejun Heo 提交于
      cgroup_path() and friends used to format the path from the end and
      thus the resulting path usually didn't start at the start of the
      passed in buffer.  Also, when the buffer was too small, the partial
      result was truncated from the head rather than tail and there was no
      way to tell how long the full path would be.  These make the functions
      less robust and more awkward to use.
      
      With recent updates to kernfs_path(), cgroup_path() and friends can be
      made to behave in strlcpy() style.
      
      * cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now
        always return the length of the full path.  If buffer is too small,
        it contains nul terminated truncated output.
      
      * All users updated accordingly.
      
      v2: cgroup_path() usage in kernel/sched/debug.c converted.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      4c737b41
    • Z
      cpuset: make sure new tasks conform to the current config of the cpuset · 06f4e948
      Zefan Li 提交于
      A new task inherits cpus_allowed and mems_allowed masks from its parent,
      but if someone changes cpuset's config by writing to cpuset.cpus/cpuset.mems
      before this new task is inserted into the cgroup's task list, the new task
      won't be updated accordingly.
      Signed-off-by: NZefan Li <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      06f4e948
  14. 29 7月, 2016 1 次提交
  15. 20 5月, 2016 2 次提交
    • V
      cpuset: use static key better and convert to new API · 002f2906
      Vlastimil Babka 提交于
      An important function for cpusets is cpuset_node_allowed(), which
      optimizes on the fact if there's a single root CPU set, it must be
      trivially allowed.  But the check "nr_cpusets() <= 1" doesn't use the
      cpusets_enabled_key static key the right way where static keys eliminate
      branching overhead with jump labels.
      
      This patch converts it so that static key is used properly.  It's also
      switched to the new static key API and the checking functions are
      converted to return bool instead of int.  We also provide a new variant
      __cpuset_zone_allowed() which expects that the static key check was
      already done and they key was enabled.  This is needed for
      get_page_from_freelist() where we want to also avoid the relatively
      slower check when ALLOC_CPUSET is not set in alloc_flags.
      
      The impact on the page allocator microbenchmark is less than expected
      but the cleanup in itself is worthwhile.
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                             multcheck-v1r20               cpuset-v1r20
        Min      alloc-odr0-1               348.00 (  0.00%)           348.00 (  0.00%)
        Min      alloc-odr0-2               254.00 (  0.00%)           254.00 (  0.00%)
        Min      alloc-odr0-4               213.00 (  0.00%)           213.00 (  0.00%)
        Min      alloc-odr0-8               186.00 (  0.00%)           183.00 (  1.61%)
        Min      alloc-odr0-16              173.00 (  0.00%)           171.00 (  1.16%)
        Min      alloc-odr0-32              166.00 (  0.00%)           163.00 (  1.81%)
        Min      alloc-odr0-64              162.00 (  0.00%)           159.00 (  1.85%)
        Min      alloc-odr0-128             160.00 (  0.00%)           157.00 (  1.88%)
        Min      alloc-odr0-256             169.00 (  0.00%)           166.00 (  1.78%)
        Min      alloc-odr0-512             180.00 (  0.00%)           180.00 (  0.00%)
        Min      alloc-odr0-1024            188.00 (  0.00%)           187.00 (  0.53%)
        Min      alloc-odr0-2048            194.00 (  0.00%)           193.00 (  0.52%)
        Min      alloc-odr0-4096            199.00 (  0.00%)           198.00 (  0.50%)
        Min      alloc-odr0-8192            202.00 (  0.00%)           201.00 (  0.50%)
        Min      alloc-odr0-16384           203.00 (  0.00%)           202.00 (  0.49%)
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NZefan Li <lizefan@huawei.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      002f2906
    • A
      include/linux/nodemask.h: create next_node_in() helper · 0edaf86c
      Andrew Morton 提交于
      Lots of code does
      
      	node = next_node(node, XXX);
      	if (node == MAX_NUMNODES)
      		node = first_node(XXX);
      
      so create next_node_in() to do this and use it in various places.
      
      [mhocko@suse.com: use next_node_in() helper]
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@kernel.org>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Hui Zhu <zhuhui@xiaomi.com>
      Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0edaf86c
  16. 26 4月, 2016 1 次提交
    • T
      cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback · 5cf1cacb
      Tejun Heo 提交于
      Since e93ad19d ("cpuset: make mm migration asynchronous"), cpuset
      kicks off asynchronous NUMA node migration if necessary during task
      migration and flushes it from cpuset_post_attach_flush() which is
      called at the end of __cgroup_procs_write().  This is to avoid
      performing migration with cgroup_threadgroup_rwsem write-locked which
      can lead to deadlock through dependency on kworker creation.
      
      memcg has a similar issue with charge moving, so let's convert it to
      an official callback rather than the current one-off cpuset specific
      function.  This patch adds cgroup_subsys->post_attach callback and
      makes cpuset register cpuset_post_attach_flush() as its ->post_attach.
      
      The conversion is mostly one-to-one except that the new callback is
      called under cgroup_mutex.  This is to guarantee that no other
      migration operations are started before ->post_attach callbacks are
      finished.  cgroup_mutex is one of the outermost mutex in the system
      and has never been and shouldn't be a problem.  We can add specialized
      synchronization around __cgroup_procs_write() but I don't think
      there's any noticeable benefit.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
      5cf1cacb
  17. 23 2月, 2016 1 次提交
  18. 17 2月, 2016 1 次提交
    • A
      cgroup: introduce cgroup namespaces · a79a908f
      Aditya Kali 提交于
      Introduce the ability to create new cgroup namespace. The newly created
      cgroup namespace remembers the cgroup of the process at the point
      of creation of the cgroup namespace (referred as cgroupns-root).
      The main purpose of cgroup namespace is to virtualize the contents
      of /proc/self/cgroup file. Processes inside a cgroup namespace
      are only able to see paths relative to their namespace root
      (unless they are moved outside of their cgroupns-root, at which point
       they will see a relative path from their cgroupns-root).
      For a correctly setup container this enables container-tools
      (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
      containers without leaking system level cgroup hierarchy to the task.
      This patch only implements the 'unshare' part of the cgroupns.
      Signed-off-by: NAditya Kali <adityakali@google.com>
      Signed-off-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      a79a908f
  19. 22 1月, 2016 1 次提交
    • T
      cpuset: make mm migration asynchronous · e93ad19d
      Tejun Heo 提交于
      If "cpuset.memory_migrate" is set, when a process is moved from one
      cpuset to another with a different memory node mask, pages in used by
      the process are migrated to the new set of nodes.  This was performed
      synchronously in the ->attach() callback, which is synchronized
      against process management.  Recently, the synchronization was changed
      from per-process rwsem to global percpu rwsem for simplicity and
      optimization.
      
      Combined with the synchronous mm migration, this led to deadlocks
      because mm migration could schedule a work item which may in turn try
      to create a new worker blocking on the process management lock held
      from cgroup process migration path.
      
      This heavy an operation shouldn't be performed synchronously from that
      deep inside cgroup migration in the first place.  This patch punts the
      actual migration to an ordered workqueue and updates cgroup process
      migration and cpuset config update paths to flush the workqueue after
      all locks are released.  This way, the operations still seem
      synchronous to userland without entangling mm migration with process
      management synchronization.  CPU hotplug can also invoke mm migration
      but there's no reason for it to wait for mm migrations and thus
      doesn't synchronize against their completions.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: stable@vger.kernel.org # v4.4+
      e93ad19d
  20. 03 12月, 2015 1 次提交
    • T
      cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Tejun Heo 提交于
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B and A's processes should be moved to
      the former and B's processes the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated which is the parent of the
      target csses.  pids controller depends on the migration methods to
      move charges and this made the controller attribute charges to the
      wrong csses often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing @css parameter from the three
      migration methods, ->can_attach, ->cancel_attach() and ->attach() and
      updating cgroup_taskset iteration helpers also return the destination
      css in addition to the task being migrated.  All controllers are
      updated accordingly.
      
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there's single source
        and destination and thus doesn't support v2 hierarchy already.  The
        only change made by this patchset is how that single destination css
        is obtained.
      
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accomodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      1f7dd3e5
  21. 26 11月, 2015 1 次提交
    • A
      cpuset: Replace all instances of time_t with time64_t · d2b43658
      Arnd Bergmann 提交于
      The following patch replaces all instances of time_t with time64_t i.e.
      change the type used for representing time from 32-bit to 64-bit. All
      32-bit kernels to date use a signed 32-bit time_t type, which can only
      represent time until January 2038. Since embedded systems running 32-bit
      Linux are going to survive beyond that date, we have to change all
      current uses, in a backwards compatible way.
      
      The patch also changes the function get_seconds() that returns a 32-bit
      integer to ktime_get_seconds() that returns seconds as 64-bit integer.
      
      The patch changes the type of ticks from time_t to u32. We keep ticks as
      32-bits as the function uses 32-bit arithmetic which would prove less
      expensive than 64-bit arithmetic and the function is expected to be
      called atleast once every 32 seconds.
      Signed-off-by: NHeena Sirwani <heenasirwani@gmail.com>
      Reviewed-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      d2b43658
  22. 06 11月, 2015 1 次提交
  23. 16 10月, 2015 1 次提交
    • T
      cgroup: replace cgroup_has_tasks() with cgroup_is_populated() · 27bd4dbb
      Tejun Heo 提交于
      Currently, cgroup_has_tasks() tests whether the target cgroup has any
      css_set linked to it.  This works because a css_set's refcnt converges
      with the number of tasks linked to it and thus there's no css_set
      linked to a cgroup if it doesn't have any live tasks.
      
      To help tracking resource usage of zombie tasks, putting the ref of
      css_set will be separated from disassociating the task from the
      css_set which means that a cgroup may have css_sets linked to it even
      when it doesn't have any live tasks.
      
      This patch replaces cgroup_has_tasks() with cgroup_is_populated()
      which tests cgroup->nr_populated instead which locally counts the
      number of populated css_sets.  Unlike cgroup_has_tasks(),
      cgroup_is_populated() is recursive - if any of the descendants is
      populated, the cgroup is populated too.  While this changes the
      meaning of the test, all the existing users are okay with the change.
      
      While at it, replace the open-coded ->populated_cnt test in
      cgroup_events_show() with cgroup_is_populated().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      27bd4dbb
  24. 23 9月, 2015 2 次提交
    • T
      cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader() · 4530eddb
      Tejun Heo 提交于
      It wasn't explicitly documented but, when a process is being migrated,
      cpuset and memcg depend on cgroup_taskset_first() returning the
      threadgroup leader; however, this approach is somewhat ghetto and
      would no longer work for the planned multi-process migration.
      
      This patch introduces explicit cgroup_taskset_for_each_leader() which
      iterates over only the threadgroup leaders and replaces
      cgroup_taskset_first() usages for accessing the leader with it.
      
      This prepares both memcg and cpuset for multi-process migration.  This
      patch also updates the documentation for cgroup_taskset_for_each() to
      clarify the iteration rules and removes comments mentioning task
      ordering in tasksets.
      
      v2: A previous patch which added threadgroup leader test was dropped.
          Patch updated accordingly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      4530eddb
    • T
      cpuset: migrate memory only for threadgroup leaders · 3df9ca0a
      Tejun Heo 提交于
      If memory_migrate flag is set, cpuset migrates memory according to the
      destnation css's nodemask.  The current implementation migrates memory
      whenever any thread of a process is migrated making the behavior
      somewhat arbitrary.  Let's tie memory operations to the threadgroup
      leader so that memory is migrated only when the leader is migrated.
      
      While this is a behavior change, given the inherent fuziness, this
      change is not too likely to be noticed and allows us to clearly define
      who owns the memory (always the leader) and helps the planned atomic
      multi-process migration.
      
      Note that we're currently migrating memory in migration path proper
      while holding all the locks.  In the long term, this should be moved
      out to an async work item.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      3df9ca0a
  25. 19 9月, 2015 1 次提交
    • T
      cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE · 7dbdb199
      Tejun Heo 提交于
      cftype->mode allows controllers to give arbitrary permissions to
      interface knobs.  Except for "cgroup.event_control", the existing uses
      are spurious.
      
      * Some explicitly specify S_IRUGO | S_IWUSR even though that's the
        default.
      
      * "cpuset.memory_pressure" specifies S_IRUGO while also setting a
        write callback which returns -EACCES.  All it needs to do is simply
        not setting a write callback.
      
      "cgroup.event_control" uses cftype->mode to make the file
      world-writable.  It's a misdesigned interface and we don't want
      controllers to be tweaking interface file permissions in general.
      This patch removes cftype->mode and all its spurious uses and
      implements CFTYPE_WORLD_WRITABLE for "cgroup.event_control" which is
      marked as compatibility-only.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      7dbdb199
  26. 18 9月, 2015 1 次提交
    • T
      cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl() · 9e10a130
      Tejun Heo 提交于
      cgroup_on_dfl() tests whether the cgroup's root is the default
      hierarchy; however, an individual controller is only interested in
      whether the controller is attached to the default hierarchy and never
      tests a cgroup which doesn't belong to the hierarchy that the
      controller is attached to.
      
      This patch replaces cgroup_on_dfl() tests in controllers with faster
      static_key based cgroup_subsys_on_dfl().  This leaves cgroup core as
      the only user of cgroup_on_dfl() and the function is moved from the
      header file to cgroup.c.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      9e10a130
  27. 10 8月, 2015 1 次提交
    • A
      cpuset: use trialcs->mems_allowed as a temp variable · 24ee3cf8
      Alban Crequy 提交于
      The comment says it's using trialcs->mems_allowed as a temp variable but
      it didn't match the code. Change the code to match the comment.
      
      This fixes an issue when writing in cpuset.mems when a sub-directory
      exists: we need to write several times for the information to persist:
      
      | root@alban:/sys/fs/cgroup/cpuset# mkdir footest9
      | root@alban:/sys/fs/cgroup/cpuset# cd footest9
      | root@alban:/sys/fs/cgroup/cpuset/footest9# mkdir aa
      | root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
      |
      | root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
      | root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
      |
      | root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
      | root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
      | 0
      | root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
      |
      | root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > aa/cpuset.mems
      | root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
      | 0
      | root@alban:/sys/fs/cgroup/cpuset/footest9#
      
      This should help to fix the following issue in Docker:
      https://github.com/opencontainers/runc/issues/133
      In some conditions, a Docker container needs to be started twice in
      order to work.
      Signed-off-by: NAlban Crequy <alban@endocode.com>
      Tested-by: NIago López Galeiras <iago@endocode.com>
      Cc: <stable@vger.kernel.org> # 3.17+
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      24ee3cf8
  28. 15 4月, 2015 1 次提交
  29. 20 3月, 2015 1 次提交
    • R
      cpusets, isolcpus: exclude isolcpus from load balancing in cpusets · 47b8ea71
      Rik van Riel 提交于
      Ensure that cpus specified with the isolcpus= boot commandline
      option stay outside of the load balancing in the kernel scheduler.
      
      Operations like load balancing can introduce unwanted latencies,
      which is exactly what the isolcpus= commandline is there to prevent.
      
      Previously, simply creating a new cpuset, without even touching the
      cpuset.cpus field inside the new cpuset, would undo the effects of
      isolcpus=, by creating a scheduler domain spanning the whole system,
      and setting up load balancing inside that domain. The cpuset root
      cpuset.cpus file is read-only, so there was not even a way to undo
      that effect.
      
      This does not impact the majority of cpusets users, since isolcpus=
      is a fairly specialized feature used for realtime purposes.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: cgroups@vger.kernel.org
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Tested-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NZefan Li <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      47b8ea71
  30. 03 3月, 2015 3 次提交
  31. 14 2月, 2015 1 次提交
  32. 13 2月, 2015 1 次提交
  33. 28 10月, 2014 1 次提交