1. 24 April 2020, 1 commit
    • alinux: sched: Introduce per-cgroup idle accounting · 61e58859
      Authored by Yihao Wu
      to #26424323
      
      Since we are concerned with idle time, let's take idle as the central
      state and omit transitions between the other states. Below is the state
      transition graph:
      
                                      sleep->deque
      +-----------+ cpumask +-------+ exit->deque +-------+
      |ineffective|---------| idle  | <-----------|running|
      +-----------+         +-------+             +-------+
                              ^ |
       unthrtl child -> deque | |
                wake -> deque | |thrtl child -> enque
             migrate -> deque | |migrate -> enque
                              | v
                            +-------+
                            | steal |
                            +-------+
      
      We conclude the idle state condition as:
      
      !se->on_rq && !my_q->throttled && cpu allowed.
      
      From this graph and condition, we can hook (de|en)queue_task_fair,
      update_cpumasks_hier, and (un|)throttle_cfs_rq to account the idle state.
      
      In the hooked functions, we also check these conditions to avoid
      accounting unwanted cpu clocks.
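      
      For illustration only, a minimal sketch of such a check; the helper name
      and the allowed-cpus argument are hypothetical, only se, my_q and the
      three conditions come from the description above:
      
      	/* Hypothetical helper; only the three tests mirror the commit text. */
      	static bool se_is_idle(struct sched_entity *se, struct cfs_rq *my_q,
      			       const struct cpumask *allowed, int cpu)
      	{
      		return !se->on_rq &&			/* not queued */
      		       !my_q->throttled &&		/* not bandwidth-throttled */
      		       cpumask_test_cpu(cpu, allowed);	/* cpu allowed */
      	}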
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
      61e58859
  2. 10 July 2019, 1 commit
    • cpuset: restore sanity to cpuset_cpus_allowed_fallback() · b7747ecb
      Authored by Joel Savitz
      [ Upstream commit d477f8c202d1f0d4791ab1263ca7657bbe5cf79e ]
      
      In the case that a process is constrained by taskset(1) (i.e.
      sched_setaffinity(2)) to a subset of available cpus, and all of those are
      subsequently offlined, the scheduler will set tsk->cpus_allowed to
      the current value of task_cs(tsk)->effective_cpus.
      
      This is done via a call to do_set_cpus_allowed() in the context of
      cpuset_cpus_allowed_fallback() made by the scheduler when this case is
      detected. This is the only call made to cpuset_cpus_allowed_fallback()
      in the latest mainline kernel.
      
      However, this is not sane behavior.
      
      I will demonstrate this on a system running the latest upstream kernel
      with the following initial configuration:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	ffffffff,ffffffff
      	Cpus_allowed_list:	0-63
      
      (Where cpus 32-63 are provided via smt.)
      
      If we limit our current shell process to cpu2 only and then offline it
      and reonline it:
      
      	# taskset -p 4 $$
      	pid 2272's current affinity mask: ffffffffffffffff
      	pid 2272's new affinity mask: 4
      
      	# echo off > /sys/devices/system/cpu/cpu2/online
      	# dmesg | tail -3
      	[ 2195.866089] process 2272 (bash) no longer affine to cpu2
      	[ 2195.872700] IRQ 114: no longer affine to CPU2
      	[ 2195.879128] smpboot: CPU 2 is now offline
      
      	# echo on > /sys/devices/system/cpu/cpu2/online
      	# dmesg | tail -1
      	[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4
      
      We see that our current process now has an affinity mask containing
      every cpu available on the system _except_ the one we originally
      constrained it to:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:   ffffffff,fffffffb
      	Cpus_allowed_list:      0-1,3-63
      
      This is not sane behavior, as the scheduler can now not only place the
      process on previously forbidden cpus, it can't even schedule it on
      the cpu it was originally constrained to!
      
      Other cases result in even more exotic affinity masks. Take for instance
      a process with an affinity mask containing only cpus provided by smt at
      the moment that smt is toggled, in a configuration such as the following:
      
      	# taskset -p f000000000 $$
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	000000f0,00000000
      	Cpus_allowed_list:	36-39
      
      A double toggle of smt results in the following behavior:
      
      	# echo off > /sys/devices/system/cpu/smt/control
      	# echo on > /sys/devices/system/cpu/smt/control
      	# grep -i cpus /proc/$$/status
      	Cpus_allowed:	ffffff00,ffffffff
      	Cpus_allowed_list:	0-31,40-63
      
      This is even less sane than the previous case, as the new affinity mask
      excludes all smt-provided cpus with ids less than those that were
      previously in the affinity mask, as well as those that were actually in
      the mask.
      
      With this patch applied, both of these cases end in the following state:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	ffffffff,ffffffff
      	Cpus_allowed_list:	0-63
      
      The original policy is discarded. Though not ideal, it is the simplest way
      to restore sanity to this fallback case without reinventing the cpuset
      wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
      for the previous affinity mask to be restored in this fallback case can use
      that mechanism instead.
      
      This patch modifies scheduler behavior by instead resetting the mask to
      task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
      mode. I tested the cases above on both modes.
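      
      For reference, the core of that change as a simplified sketch (upstream
      d477f8c202d1; error paths and surrounding comments omitted):
      
      	void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
      	{
      		rcu_read_lock();
      		/* default (v2 mode): keep the cpuset's mask; legacy mode: go wide */
      		do_set_cpus_allowed(tsk, is_in_v2_mode() ?
      			task_cs(tsk)->cpus_allowed : cpu_possible_mask);
      		rcu_read_unlock();
      	}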
      
      Note that the scheduler uses this fallback mechanism if and only if
      _every_ other valid avenue has been traveled, and it is the last resort
      before calling BUG().
      Suggested-by: Waiman Long <longman@redhat.com>
      Suggested-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Joel Savitz <jsavitz@redhat.com>
      Acked-by: Phil Auld <pauld@redhat.com>
      Acked-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b7747ecb
  3. 16 June 2018, 1 commit
  4. 13 June 2018, 1 commit
    • treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Authored by Kees Cook
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
              kmalloc_array(a, b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
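      
      As a reminder of why the 2-factor form matters, a small illustrative
      before/after (variable names are made up):
      
      	/* Before: the multiplication may silently overflow. */
      	buf = kmalloc(count * sizeof(*buf), GFP_KERNEL);
      
      	/* After: kmalloc_array() returns NULL if count * sizeof(*buf) overflows. */
      	buf = kmalloc_array(count, sizeof(*buf), GFP_KERNEL);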
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
      6da2ec56
  5. 07 February 2018, 1 commit
  6. 05 December 2017, 2 commits
  7. 28 November 2017, 2 commits
    • cpuset: Make cpuset hotplug synchronous · 1599a185
      Authored by Prateek Sood
      Convert cpuset_hotplug_workfn() into synchronous call for cpu hotplug
      path. For memory hotplug path it still gets queued as a work item.
      
      Since cpuset_hotplug_workfn() can be made synchronous for cpu hotplug
      path, it is not required to wait for cpuset hotplug while thawing
      processes.
      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      1599a185
    • cgroup/cpuset: remove circular dependency deadlock · aa24163b
      Authored by Prateek Sood
      Remove a circular dependency deadlock in a scenario where a CPU hotplug
      operation is in progress while cgroup and cpuset updates are triggered
      from userspace.
      
      Process A => kthreadd => Process B => Process C => Process A
      
      Process A
      cpu_subsys_offline();
        cpu_down();
          _cpu_down();
            percpu_down_write(&cpu_hotplug_lock); //held
            cpuhp_invoke_callback();
      	     workqueue_offline_cpu();
                  queue_work_on(); // unbind_work on system_highpri_wq
                     __queue_work();
                       insert_work();
                          wake_up_worker();
                  flush_work();
                     wait_for_completion();
      
      worker_thread();
         manage_workers();
            create_worker();
      	     kthread_create_on_node();
      		    wake_up_process(kthreadd_task);
      
      kthreadd
      kthreadd();
        kernel_thread();
          do_fork();
            copy_process();
              percpu_down_read(&cgroup_threadgroup_rwsem);
                __rwsem_down_read_failed_common(); //waiting
      
      Process B
      kernfs_fop_write();
        cgroup_file_write();
          cgroup_procs_write();
            percpu_down_write(&cgroup_threadgroup_rwsem); //held
            cgroup_attach_task();
              cgroup_migrate();
                cgroup_migrate_execute();
                  cpuset_can_attach();
                    mutex_lock(&cpuset_mutex); //waiting
      
      Process C
      kernfs_fop_write();
        cgroup_file_write();
          cpuset_write_resmask();
            mutex_lock(&cpuset_mutex); //held
            update_cpumask();
              update_cpumasks_hier();
                rebuild_sched_domains_locked();
                  get_online_cpus();
                    percpu_down_read(&cpu_hotplug_lock); //waiting
      
      Eliminate the deadlock by reversing the locking order of cpuset_mutex and
      cpu_hotplug_lock.
      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      aa24163b
  8. 27 October 2017, 1 commit
  9. 07 September 2017, 2 commits
    • sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs · 50e76632
      Authored by Peter Zijlstra
      Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
      because it now resulted in non-cpuset usage breaking too.
      
      On suspend cpuset_cpu_inactive() doesn't call into
      cpuset_update_active_cpus() because it doesn't want to move tasks about,
      there is no need, all tasks are frozen and won't run again until after
      we've resumed everything.
      
      But this means that when we finally do call into
      cpuset_update_active_cpus() after resuming the last frozen cpu in
      cpuset_cpu_active(), the top_cpuset will not have any difference with
      the cpu_active_mask and thus it will not in fact do _anything_.
      
      So the cpuset configuration will not be restored. This was largely
      hidden because we would unconditionally create identity domains and
      mobile users would not in fact use cpusets much. And servers that do use
      cpusets tend not to suspend-resume much.
      
      An additional problem is that we'd not in fact wait for the cpuset work to
      finish before resuming the tasks, allowing spurious migrations outside
      of the specified domains.
      
      Fix the rebuild by introducing cpuset_force_rebuild() and fix the
      ordering with cpuset_wait_for_hotplug().
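      
      A simplified sketch of the force-rebuild half of the fix (the wait half
      is a flush of the cpuset hotplug work from the resume path):
      
      	static bool force_rebuild;
      
      	void cpuset_force_rebuild(void)
      	{
      		force_rebuild = true;	/* set once the last frozen CPU is active again */
      	}
      
      	/* in cpuset_hotplug_workfn(): */
      	if (cpus_updated || force_rebuild) {
      		force_rebuild = false;
      		rebuild_sched_domains();
      	}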
      Reported-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: deb7aa30 ("cpuset: reorganize CPU / memory hotplug handling")
      Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      50e76632
    • mm: replace TIF_MEMDIE checks by tsk_is_oom_victim · da99ecf1
      Authored by Michal Hocko
      TIF_MEMDIE is set only on tasks which were either directly selected
      by the OOM killer or passed through mark_oom_victim from the allocator
      path.  tsk_is_oom_victim is more generic and allows to identify all
      tasks (threads) which share the mm with the oom victim.
      
      Please note that the freezer still needs to check TIF_MEMDIE because we
      cannot thaw tasks which do not participate in oom_victims counting;
      otherwise a !TIF_MEMDIE task could interfere after oom_disable returns.
      
      Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da99ecf1
  10. 25 August 2017, 2 commits
    • sched/topology, cpuset: Avoid spurious/wrong domain rebuilds · 77d1dfda
      Authored by Peter Zijlstra
      When disabling cpuset.sched_load_balance we expect to be able to online
      CPUs without generating sched_domains. However this is currently
      completely broken.
      
      What happens is that we generate the sched_domains and then destroy
      them. This is because of the spurious 'default' domain build in
      cpuset_update_active_cpus(). That builds a single machine wide domain
      and then schedules a work to build the 'real' domains. The work then
      finds there are _no_ domains and destroys the lot again.
      
      Furthermore, if there actually were cpusets, building the machine wide
      domain is actively wrong, because it would allow tasks to 'escape' their
      cpuset. Also I don't think it's needed; the scheduler really should
      respect the active mask.
      Reported-by: Ofer Levi(SW) <oferle@mellanox.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet.Gupta1@synopsys.com <Vineet.Gupta1@synopsys.com>
      Cc: rusty@rustcorp.com.au <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      77d1dfda
    • cpuset: Fix incorrect memory_pressure control file mapping · 1c08c22c
      Authored by Waiman Long
      The memory_pressure control file was incorrectly set up without
      a private value (0, by default). As a result, this control
      file was treated like memory_migrate on read. By adding back the
      FILE_MEMORY_PRESSURE private value, the correct memory pressure value
      will be returned.
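      
      The fix amounts to restoring the private value on the cftype entry,
      along these lines (sketch):
      
      	{
      		.name = "memory_pressure",
      		.read_u64 = cpuset_read_u64,
      		.private = FILE_MEMORY_PRESSURE,	/* 0 would be read as memory_migrate */
      	},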
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 7dbdb199 ("cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE")
      Cc: stable@vger.kernel.org # v4.4+
      1c08c22c
  11. 18 August 2017, 1 commit
    • cpuset: Allow v2 behavior in v1 cgroup · b8d1b8ee
      Authored by Waiman Long
      Cpuset v2 has some useful behaviors that are not present in v1 because
      of backward compatibility concerns. One of them is the restoration of
      the original cpu and memory node masks after a hot removal and addition
      event sequence.
      
      This patch makes the cpuset controller check the
      CGRP_ROOT_CPUSET_V2_MODE flag and use the v2 behavior if it is set.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b8d1b8ee
  12. 10 August 2017, 1 commit
  13. 03 August 2017, 1 commit
    • cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() · 89affbf5
      Authored by Dima Zavin
      In codepaths that use the begin/retry interface for reading
      mems_allowed_seq with irqs disabled, there exists a race condition that
      stalls the patch process after only modifying a subset of the
      static_branch call sites.
      
      This problem manifested itself as a deadlock in the slub allocator,
      inside get_any_partial.  The loop reads mems_allowed_seq value (via
      read_mems_allowed_begin), performs the defrag operation, and then
      verifies the consistency of mems_allowed via the read_mems_allowed_retry
      and the cookie returned by xxx_begin.
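      
      The read-side pattern in question looks roughly like this (sketch, not
      the exact slub code):
      
      	unsigned int cookie;
      
      	do {
      		cookie = read_mems_allowed_begin();
      		/* ... walk nodes / defragment using current->mems_allowed ... */
      	} while (read_mems_allowed_retry(cookie));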
      
      The issue here is that both begin and retry first check if cpusets are
      enabled via cpusets_enabled() static branch.  This branch can be
      rewritten dynamically (via cpuset_inc) if a new cpuset is created.  The
      x86 jump label code fully synchronizes across all CPUs for every entry
      it rewrites.  If it rewrites only one of the callsites (specifically the
      one in read_mems_allowed_retry) and then waits for the
      smp_call_function(do_sync_core) to complete while a CPU is inside the
      begin/retry section with IRQs off and the mems_allowed value is changed,
      we can hang.
      
      This is because begin() will always return 0 (since it wasn't patched
      yet) while retry() will test the 0 against the actual value of the seq
      counter.
      
      The fix is to use two different static keys: one for begin
      (pre_enable_key) and one for retry (enable_key).  In cpuset_inc(), we
      first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
      always returns a valid seqcount if we are enabling cpusets.  Similarly, when
      disabling cpusets via cpuset_dec(), we first ensure that callers of
      cpuset_mems_allowed_retry() will start ignoring the seqcount value
      before we let cpuset_mems_allowed_begin() return 0.
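      
      The resulting inc/dec ordering, as a sketch of the two-key scheme:
      
      	static inline void cpuset_inc(void)
      	{
      		static_branch_inc(&cpusets_pre_enable_key);	/* begin() goes live first */
      		static_branch_inc(&cpusets_enabled_key);	/* then retry() starts checking */
      	}
      
      	static inline void cpuset_dec(void)
      	{
      		static_branch_dec(&cpusets_enabled_key);	/* retry() stops checking first */
      		static_branch_dec(&cpusets_pre_enable_key);	/* then begin() may return 0 */
      	}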
      
      The relevant stack traces of the two stuck threads:
      
        CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
        RIP: smp_call_function_many+0x1f9/0x260
        Call Trace:
          smp_call_function+0x3b/0x70
          on_each_cpu+0x2f/0x90
          text_poke_bp+0x87/0xd0
          arch_jump_label_transform+0x93/0x100
          __jump_label_update+0x77/0x90
          jump_label_update+0xaa/0xc0
          static_key_slow_inc+0x9e/0xb0
          cpuset_css_online+0x70/0x2e0
          online_css+0x2c/0xa0
          cgroup_apply_control_enable+0x27f/0x3d0
          cgroup_mkdir+0x2b7/0x420
          kernfs_iop_mkdir+0x5a/0x80
          vfs_mkdir+0xf6/0x1a0
          SyS_mkdir+0xb7/0xe0
          entry_SYSCALL_64_fastpath+0x18/0xad
      
        ...
      
        CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8818087c0000 task.stack: ffffc90000030000
        RIP: int3+0x39/0x70
        Call Trace:
          <#DB> ? ___slab_alloc+0x28b/0x5a0
          <EOE> ? copy_process.part.40+0xf7/0x1de0
          __slab_alloc.isra.80+0x54/0x90
          copy_process.part.40+0xf7/0x1de0
          copy_process.part.40+0xf7/0x1de0
          kmem_cache_alloc_node+0x8a/0x280
          copy_process.part.40+0xf7/0x1de0
          _do_fork+0xe7/0x6c0
          _raw_spin_unlock_irq+0x2d/0x60
          trace_hardirqs_on_caller+0x136/0x1d0
          entry_SYSCALL_64_fastpath+0x5/0xad
          do_syscall_64+0x27/0x350
          SyS_clone+0x19/0x20
          do_syscall_64+0x60/0x350
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
      Fixes: 46e700ab ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
      Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
      Reported-by: Cliff Spradlin <cspradlin@waymo.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89affbf5
  14. 21 July 2017, 1 commit
    • cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS · bc2fb7ed
      Authored by Tejun Heo
      css_task_iter currently always walks all tasks.  With the scheduled
      cgroup v2 thread support, the iterator would need to handle multiple
      types of iteration.  As a preparation, add @flags to
      css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
      is not specified, it walks all tasks as before.  When asserted, the
      iterator only walks the group leaders.
      
      For now, the only user of the flag is cgroup v2 "cgroup.procs" file
      which no longer needs to skip non-leader tasks in cgroup_procs_next().
      Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
      v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
      cgroup" but "list all thread group id's with any threads in the
      cgroup".
      
      While at it, update cgroup_procs_show() to use task_pid_vnr() instead
      of task_tgid_vnr().  As the iteration guarantees that the function
      only sees group leaders, this doesn't change the output and will allow
      sharing the function for thread iteration.
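      
      Usage after this change looks like the following sketch (walking only
      thread-group leaders):
      
      	struct css_task_iter it;
      	struct task_struct *task;
      
      	css_task_iter_start(css, CSS_TASK_ITER_PROCS, &it);
      	while ((task = css_task_iter_next(&it)))
      		;	/* task is a thread-group leader */
      	css_task_iter_end(&it);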
      Signed-off-by: Tejun Heo <tj@kernel.org>
      bc2fb7ed
  15. 07 July 2017, 2 commits
    • mm, cpuset: always use seqlock when changing task's nodemask · 5f155f27
      Authored by Vlastimil Babka
      When updating task's mems_allowed and rebinding its mempolicy due to
      cpuset's mems being changed, we currently only take the seqlock for
      writing when either the task has a mempolicy, or the new mems has no
      intersection with the old mems.
      
      This should be enough to prevent a parallel allocation seeing no
      available nodes, but the optimization is IMHO unnecessary (cpuset
      updates should not be frequent), and we still potentially risk issues if
      the intersection of new and old nodes has a limited amount of
      free/reclaimable memory.
      
      Let's just use the seqlock for all tasks.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-6-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f155f27
    • mm, mempolicy: simplify rebinding mempolicies when updating cpusets · 213980c0
      Authored by Vlastimil Babka
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") has introduced a two-step protocol when
      rebinding task's mempolicy due to cpuset update, in order to avoid a
      parallel allocation seeing an empty effective nodemask and failing.
      
      Later, commit cc9a6c87 ("cpuset: mm: reduce large amounts of memory
      barrier related damage v3") introduced a seqlock protection and removed
      the synchronization point between the two update steps.  At that point
      (or perhaps later), the two-step rebinding became unnecessary.
      
      Currently it only makes sure that the update first adds new nodes in
      step 1 and then removes nodes in step 2.  Without memory barriers the
      effects are questionable, and even then this cannot prevent a parallel
      zonelist iteration checking the nodemask at each step to observe all
      nodes as unusable for allocation.  We now fully rely on the seqlock to
      prevent premature OOMs and allocation failures.
      
      We can thus remove the two-step update parts and simplify the code.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-5-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      213980c0
  16. 25 May 2017, 1 commit
    • cpuset: consider dying css as offline · 41c25707
      Authored by Tejun Heo
      In most cases, a cgroup controller doesn't care about the lifetimes of
      cgroups.  For the controller, a css becomes online when ->css_online()
      is called on it and offline when ->css_offline() is called.
      
      However, cpuset is special in that the user interface it exposes cares
      whether certain cgroups exist or not.  Combined with the RCU delay
      between cgroup removal and css offlining, this can lead to user
      visible behavior oddities where operations which should succeed after
      cgroup removals fail for some time period.  The effects of cgroup
      removals are delayed when seen from userland.
      
      This patch adds css_is_dying() which tests whether offline is pending
      and updates is_cpuset_online() so that the function returns false also
      while offline is pending.  This gets rid of the userland visible
      delays.
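      
      Roughly, the updated check looks like this sketch:
      
      	static inline bool is_cpuset_online(struct cpuset *cs)
      	{
      		return test_bit(CS_ONLINE, &cs->flags) && !css_is_dying(&cs->css);
      	}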
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      41c25707
  17. 11 April 2017, 1 commit
  18. 28 March 2017, 1 commit
  19. 02 March 2017, 2 commits
  20. 28 December 2016, 1 commit
  21. 25 December 2016, 1 commit
  22. 29 September 2016, 1 commit
  23. 16 September 2016, 1 commit
  24. 13 September 2016, 1 commit
    • cpuset: handle race between CPU hotplug and cpuset_hotplug_work · 28b89b9e
      Authored by Joonwoo Park
      A discrepancy between cpu_online_mask and cpuset's effective_cpus
      mask is inevitable during hotplug since cpuset defers updating of
      effective_cpus mask using a workqueue, during which time nothing
      prevents the system from more hotplug operations.  For that reason
      guarantee_online_cpus() walks up the cpuset hierarchy until it finds
      an intersection under the assumption that top cpuset's effective_cpus
      mask intersects with cpu_online_mask even with such a race occurring.
      
      However a sequence of CPU hotplugs can open a time window, during which
      none of the effective CPUs in the top cpuset intersect with
      cpu_online_mask.
      
      For example when there are 4 possible CPUs 0-3 and only CPU0 is online:
      
        ========================  ===========================
         cpu_online_mask           top_cpuset.effective_cpus
        ========================  ===========================
         echo 1 > cpu2/online.
         CPU hotplug notifier woke up hotplug work but not yet scheduled.
            [0,2]                     [0]
      
         echo 0 > cpu0/online.
         The workqueue is still runnable.
            [2]                       [0]
        ========================  ===========================
      
        Now there is no intersection between cpu_online_mask and
        top_cpuset.effective_cpus.  Thus invoking sys_sched_setaffinity() at
        this moment can cause following:
      
         Unable to handle kernel NULL pointer dereference at virtual address 000000d0
         ------------[ cut here ]------------
         Kernel BUG at ffffffc0001389b0 [verbose debug info unavailable]
         Internal error: Oops - BUG: 96000005 [#1] PREEMPT SMP
         Modules linked in:
         CPU: 2 PID: 1420 Comm: taskset Tainted: G        W       4.4.8+ #98
         task: ffffffc06a5c4880 ti: ffffffc06e124000 task.ti: ffffffc06e124000
         PC is at guarantee_online_cpus+0x2c/0x58
         LR is at cpuset_cpus_allowed+0x4c/0x6c
         <snip>
         Process taskset (pid: 1420, stack limit = 0xffffffc06e124020)
         Call trace:
         [<ffffffc0001389b0>] guarantee_online_cpus+0x2c/0x58
         [<ffffffc00013b208>] cpuset_cpus_allowed+0x4c/0x6c
         [<ffffffc0000d61f0>] sched_setaffinity+0xc0/0x1ac
         [<ffffffc0000d6374>] SyS_sched_setaffinity+0x98/0xac
         [<ffffffc000085cb0>] el0_svc_naked+0x24/0x28
      
      The top cpuset's effective_cpus are guaranteed to be identical to
      cpu_online_mask eventually.  Hence fall back to cpu_online_mask when
      there is no intersection between top cpuset's effective_cpus and
      cpu_online_mask.
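      
      A simplified sketch of the resulting walk-up with the new fallback:
      
      	static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
      	{
      		while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask)) {
      			cs = parent_cs(cs);
      			if (unlikely(!cs)) {
      				/* raced with the hotplug work: fall back to the online mask */
      				cpumask_copy(pmask, cpu_online_mask);
      				return;
      			}
      		}
      		cpumask_and(pmask, cs->effective_cpus, cpu_online_mask);
      	}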
      Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: cgroups@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 3.17+
      Signed-off-by: Tejun Heo <tj@kernel.org>
      28b89b9e
  25. 10 August 2016, 2 commits
    • cgroup: make cgroup_path() and friends behave in the style of strlcpy() · 4c737b41
      Authored by Tejun Heo
      cgroup_path() and friends used to format the path from the end and
      thus the resulting path usually didn't start at the start of the
      passed in buffer.  Also, when the buffer was too small, the partial
      result was truncated from the head rather than tail and there was no
      way to tell how long the full path would be.  These make the functions
      less robust and more awkward to use.
      
      With recent updates to kernfs_path(), cgroup_path() and friends can be
      made to behave in strlcpy() style.
      
      * cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now
        always return the length of the full path.  If buffer is too small,
        it contains nul terminated truncated output.
      
      * All users updated accordingly.
      
      v2: cgroup_path() usage in kernel/sched/debug.c converted.
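      
      Callers can now detect truncation the same way they would with
      strlcpy(), e.g. (illustrative sketch):
      
      	char buf[PATH_MAX];
      	int len = cgroup_path(cgrp, buf, sizeof(buf));
      
      	if (len >= (int)sizeof(buf))
      		pr_warn("path truncated, full length is %d\n", len);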
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      4c737b41
    • cpuset: make sure new tasks conform to the current config of the cpuset · 06f4e948
      Authored by Zefan Li
      A new task inherits cpus_allowed and mems_allowed masks from its parent,
      but if someone changes cpuset's config by writing to cpuset.cpus/cpuset.mems
      before this new task is inserted into the cgroup's task list, the new task
      won't be updated accordingly.
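      
      The fix resynchronizes the masks at fork time, roughly like this sketch
      of a cpuset ->fork() callback:
      
      	static void cpuset_fork(struct task_struct *task)
      	{
      		if (task_css_is_root(task, cpuset_cgrp_id))
      			return;
      
      		set_cpus_allowed_ptr(task, &current->cpus_allowed);
      		task->mems_allowed = current->mems_allowed;
      	}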
      Signed-off-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      06f4e948
  26. 29 July 2016, 1 commit
  27. 20 May 2016, 2 commits
    • cpuset: use static key better and convert to new API · 002f2906
      Authored by Vlastimil Babka
      An important function for cpusets is cpuset_node_allowed(), which
      optimizes on the fact that if there's a single root CPU set, it must be
      trivially allowed.  But the check "nr_cpusets() <= 1" doesn't use the
      cpusets_enabled_key static key the right way where static keys eliminate
      branching overhead with jump labels.
      
      This patch converts it so that static key is used properly.  It's also
      switched to the new static key API and the checking functions are
      converted to return bool instead of int.  We also provide a new variant
      __cpuset_zone_allowed() which expects that the static key check was
      already done and the key was enabled.  This is needed for
      get_page_from_freelist() where we want to also avoid the relatively
      slower check when ALLOC_CPUSET is not set in alloc_flags.
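      
      With the new API, the hot-path check compiles down to a jump label,
      along these lines (sketch):
      
      	DECLARE_STATIC_KEY_FALSE(cpusets_enabled_key);
      
      	static inline bool cpusets_enabled(void)
      	{
      		return static_branch_unlikely(&cpusets_enabled_key);
      	}
      
      	static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
      	{
      		if (cpusets_enabled())
      			return __cpuset_zone_allowed(z, gfp_mask);
      		return true;	/* no non-root cpusets: trivially allowed */
      	}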
      
      The impact on the page allocator microbenchmark is less than expected
      but the cleanup in itself is worthwhile.
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                             multcheck-v1r20               cpuset-v1r20
        Min      alloc-odr0-1               348.00 (  0.00%)           348.00 (  0.00%)
        Min      alloc-odr0-2               254.00 (  0.00%)           254.00 (  0.00%)
        Min      alloc-odr0-4               213.00 (  0.00%)           213.00 (  0.00%)
        Min      alloc-odr0-8               186.00 (  0.00%)           183.00 (  1.61%)
        Min      alloc-odr0-16              173.00 (  0.00%)           171.00 (  1.16%)
        Min      alloc-odr0-32              166.00 (  0.00%)           163.00 (  1.81%)
        Min      alloc-odr0-64              162.00 (  0.00%)           159.00 (  1.85%)
        Min      alloc-odr0-128             160.00 (  0.00%)           157.00 (  1.88%)
        Min      alloc-odr0-256             169.00 (  0.00%)           166.00 (  1.78%)
        Min      alloc-odr0-512             180.00 (  0.00%)           180.00 (  0.00%)
        Min      alloc-odr0-1024            188.00 (  0.00%)           187.00 (  0.53%)
        Min      alloc-odr0-2048            194.00 (  0.00%)           193.00 (  0.52%)
        Min      alloc-odr0-4096            199.00 (  0.00%)           198.00 (  0.50%)
        Min      alloc-odr0-8192            202.00 (  0.00%)           201.00 (  0.50%)
        Min      alloc-odr0-16384           203.00 (  0.00%)           202.00 (  0.49%)
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      002f2906
    • include/linux/nodemask.h: create next_node_in() helper · 0edaf86c
      Authored by Andrew Morton
      Lots of code does
      
      	node = next_node(node, XXX);
      	if (node == MAX_NUMNODES)
      		node = first_node(XXX);
      
      so create next_node_in() to do this and use it in various places.
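      
      The helper is essentially the open-coded pattern folded into one place
      (sketch):
      
      	#define next_node_in(n, src) __next_node_in((n), &(src))
      
      	int __next_node_in(int node, const nodemask_t *srcp)
      	{
      		int ret = __next_node(node, srcp);
      
      		if (ret == MAX_NUMNODES)
      			ret = __first_node(srcp);
      		return ret;
      	}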
      
      [mhocko@suse.com: use next_node_in() helper]
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Hui Zhu <zhuhui@xiaomi.com>
      Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0edaf86c
  28. 26 April 2016, 1 commit
    • cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback · 5cf1cacb
      Authored by Tejun Heo
      Since e93ad19d ("cpuset: make mm migration asynchronous"), cpuset
      kicks off asynchronous NUMA node migration if necessary during task
      migration and flushes it from cpuset_post_attach_flush() which is
      called at the end of __cgroup_procs_write().  This is to avoid
      performing migration with cgroup_threadgroup_rwsem write-locked which
      can lead to deadlock through dependency on kworker creation.
      
      memcg has a similar issue with charge moving, so let's convert it to
      an official callback rather than the current one-off cpuset specific
      function.  This patch adds cgroup_subsys->post_attach callback and
      makes cpuset register cpuset_post_attach_flush() as its ->post_attach.
      
      The conversion is mostly one-to-one except that the new callback is
      called under cgroup_mutex.  This is to guarantee that no other
      migration operations are started before ->post_attach callbacks are
      finished.  cgroup_mutex is one of the outermost mutexes in the system
      and has never been and shouldn't be a problem.  We can add specialized
      synchronization around __cgroup_procs_write() but I don't think
      there's any noticeable benefit.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
      5cf1cacb
  29. 23 February 2016, 1 commit
  30. 17 February 2016, 1 commit
    • cgroup: introduce cgroup namespaces · a79a908f
      Authored by Aditya Kali
      Introduce the ability to create a new cgroup namespace. The newly created
      cgroup namespace remembers the cgroup of the process at the point
      of creation of the cgroup namespace (referred to as the cgroupns-root).
      The main purpose of cgroup namespace is to virtualize the contents
      of /proc/self/cgroup file. Processes inside a cgroup namespace
      are only able to see paths relative to their namespace root
      (unless they are moved outside of their cgroupns-root, at which point
       they will see a relative path from their cgroupns-root).
      For a correctly setup container this enables container-tools
      (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
      containers without leaking system level cgroup hierarchy to the task.
      This patch only implements the 'unshare' part of the cgroupns.
      Signed-off-by: Aditya Kali <adityakali@google.com>
      Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      a79a908f
  31. 22 January 2016, 1 commit
    • cpuset: make mm migration asynchronous · e93ad19d
      Authored by Tejun Heo
      If "cpuset.memory_migrate" is set, when a process is moved from one
      cpuset to another with a different memory node mask, pages in use by
      the process are migrated to the new set of nodes.  This was performed
      synchronously in the ->attach() callback, which is synchronized
      against process management.  Recently, the synchronization was changed
      from per-process rwsem to global percpu rwsem for simplicity and
      optimization.
      
      Combined with the synchronous mm migration, this led to deadlocks
      because mm migration could schedule a work item which may in turn try
      to create a new worker blocking on the process management lock held
      from cgroup process migration path.
      
      This heavy an operation shouldn't be performed synchronously from that
      deep inside cgroup migration in the first place.  This patch punts the
      actual migration to an ordered workqueue and updates cgroup process
      migration and cpuset config update paths to flush the workqueue after
      all locks are released.  This way, the operations still seem
      synchronous to userland without entangling mm migration with process
      management synchronization.  CPU hotplug can also invoke mm migration
      but there's no reason for it to wait for mm migrations and thus
      doesn't synchronize against their completions.
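      
      The punt-and-flush shape of the change, as a simplified sketch (mwork
      stands in for the per-migration work item; the workqueue name is
      illustrative):
      
      	/* ->attach() path: queue instead of migrating inline */
      	queue_work(cpuset_migrate_mm_wq, &mwork->work);
      
      	/* called from the cgroup/cpuset write paths after all locks are dropped */
      	void cpuset_post_attach_flush(void)
      	{
      		flush_workqueue(cpuset_migrate_mm_wq);
      	}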
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: stable@vger.kernel.org # v4.4+
      e93ad19d
  32. 03 December 2015, 1 commit
    • cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Authored by Tejun Heo
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B and A's processes should be moved to
      the former and B's processes the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated which is the parent of the
      target csses.  pids controller depends on the migration methods to
      move charges and this made the controller attribute charges to the
      wrong csses often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing the @css parameter from the three
      migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
      updating the cgroup_taskset iteration helpers to also return the
      destination css in addition to the task being migrated.  All controllers
      are updated accordingly (see the sketch after the list below).
      
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there's single source
        and destination and thus doesn't support v2 hierarchy already.  The
        only change made by this patchset is how that single destination css
        is obtained.
      
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accommodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
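      
      After the change, the per-task destination css is available directly in
      the iteration helpers, e.g. (sketch):
      
      	struct task_struct *task;
      	struct cgroup_subsys_state *dst_css;
      
      	cgroup_taskset_for_each(task, dst_css, tset) {
      		/* charge/uncharge against dst_css, not the parent's css */
      	}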
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      1f7dd3e5