1. 29 Oct 2019 (1 commit)
    • sched/topology: Don't try to build empty sched domains · cd1cb335
      Authored by Valentin Schneider
      Turns out hotplugging CPUs that are in exclusive cpusets can lead to the
      cpuset code feeding empty cpumasks to the sched domain rebuild machinery.
      
      This leads to the following splat:
      
          Internal error: Oops: 96000004 [#1] PREEMPT SMP
          Modules linked in:
          CPU: 0 PID: 235 Comm: kworker/5:2 Not tainted 5.4.0-rc1-00005-g8d495477 #23
          Hardware name: ARM Juno development board (r0) (DT)
          Workqueue: events cpuset_hotplug_workfn
          pstate: 60000005 (nZCv daif -PAN -UAO)
          pc : build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
          lr : build_sched_domains (kernel/sched/topology.c:1966)
          Call trace:
          build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
          partition_sched_domains_locked (kernel/sched/topology.c:2250)
          rebuild_sched_domains_locked (./include/linux/bitmap.h:370 ./include/linux/cpumask.h:538 kernel/cgroup/cpuset.c:955 kernel/cgroup/cpuset.c:978 kernel/cgroup/cpuset.c:1019)
          rebuild_sched_domains (kernel/cgroup/cpuset.c:1032)
          cpuset_hotplug_workfn (kernel/cgroup/cpuset.c:3205 (discriminator 2))
          process_one_work (./arch/arm64/include/asm/jump_label.h:21 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:114 kernel/workqueue.c:2274)
          worker_thread (./include/linux/compiler.h:199 ./include/linux/list.h:268 kernel/workqueue.c:2416)
          kthread (kernel/kthread.c:255)
          ret_from_fork (arch/arm64/kernel/entry.S:1167)
          Code: f860dae2 912802d6 aa1603e1 12800000 (f8616853)
      
      The faulty line in question is:
      
        cap = arch_scale_cpu_capacity(cpumask_first(cpu_map));
      
      and we're not checking the return value against nr_cpu_ids (we shouldn't
      have to!), which leads to the above.
      
      Prevent generate_sched_domains() from returning empty cpumasks, and
      add an assertion to build_sched_domains() to scream bloody murder if
      it happens again.
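      
      Note that cpumask_first() of an empty mask returns nr_cpu_ids, so
      using it to index per-CPU capacity data reads past the end of the
      array. A minimal sketch of the kind of guard described (not the
      verbatim upstream diff):
      
        /* In build_sched_domains(), before cpumask_first(cpu_map) is used: */
        if (WARN_ON(cpumask_empty(cpu_map)))
        	goto error;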
      
      The above splat was obtained on my Juno r0 with the following reproducer:
      
        $ cgcreate -g cpuset:asym
        $ cgset -r cpuset.cpus=0-3 asym
        $ cgset -r cpuset.mems=0 asym
        $ cgset -r cpuset.cpu_exclusive=1 asym
      
        $ cgcreate -g cpuset:smp
        $ cgset -r cpuset.cpus=4-5 smp
        $ cgset -r cpuset.mems=0 smp
        $ cgset -r cpuset.cpu_exclusive=1 smp
      
        $ cgset -r cpuset.sched_load_balance=0 .
      
        $ echo 0 > /sys/devices/system/cpu/cpu4/online
        $ echo 0 > /sys/devices/system/cpu/cpu5/online
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar.Eggemann@arm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: hannes@cmpxchg.org
      Cc: lizefan@huawei.com
      Cc: morten.rasmussen@arm.com
      Cc: qperret@google.com
      Cc: tj@kernel.org
      Cc: vincent.guittot@linaro.org
      Fixes: 05484e09 ("sched/topology: Add SD_ASYM_CPUCAPACITY flag detection")
      Link: https://lkml.kernel.org/r/20191023153745.19515-2-valentin.schneider@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 25 Oct 2019 (1 commit)
    • cgroup: remove cgroup_enable_task_cg_lists() optimization · 5153faac
      Authored by Tejun Heo
      cgroup_enable_task_cg_lists() is used to lazily initialize task
      cgroup associations on the first use to reduce fork / exit overheads
      on systems which don't use cgroup.  Unfortunately, locking around it
      has never been actually correct and its value is dubious given how the
      vast majority of systems use cgroup right away from boot.
      
      This patch removes the optimization.  For now, replace the cg_list
      based branches with WARN_ON_ONCE()'s to be on the safe side.  We can
      simplify the logic further in the future.
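      
      A rough sketch of the described replacement, assuming a branch of
      this shape (task_struct::cg_list is the real list head; the
      surrounding code is illustrative, not the actual diff):
      
        /* Before: lazily populate cg_list on first use.
         * After: the list is always initialized at fork, so just assert. */
        WARN_ON_ONCE(list_empty(&task->cg_list));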
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  3. 25 Jul 2019 (4 commits)
  4. 15 Jul 2019 (1 commit)
  5. 15 Jun 2019 (1 commit)
  6. 13 Jun 2019 (1 commit)
    • cpuset: restore sanity to cpuset_cpus_allowed_fallback() · d477f8c2
      Authored by Joel Savitz
      In the case that a process is constrained by taskset(1) (i.e.
      sched_setaffinity(2)) to a subset of available cpus, and all of those are
      subsequently offlined, the scheduler will set tsk->cpus_allowed to
      the current value of task_cs(tsk)->effective_cpus.
      
      This is done via a call to do_set_cpus_allowed() in the context of
      cpuset_cpus_allowed_fallback() made by the scheduler when this case is
      detected. This is the only call made to cpuset_cpus_allowed_fallback()
      in the latest mainline kernel.
      
      However, this is not sane behavior.
      
      I will demonstrate this on a system running the latest upstream kernel
      with the following initial configuration:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	ffffffff,ffffffff
      	Cpus_allowed_list:	0-63
      
      (Where cpus 32-63 are provided via smt.)
      
      If we limit our current shell process to cpu2 only and then offline
      and re-online it:
      
      	# taskset -p 4 $$
      	pid 2272's current affinity mask: ffffffffffffffff
      	pid 2272's new affinity mask: 4
      
      	# echo off > /sys/devices/system/cpu/cpu2/online
      	# dmesg | tail -3
      	[ 2195.866089] process 2272 (bash) no longer affine to cpu2
      	[ 2195.872700] IRQ 114: no longer affine to CPU2
      	[ 2195.879128] smpboot: CPU 2 is now offline
      
      	# echo on > /sys/devices/system/cpu/cpu2/online
      	# dmesg | tail -1
      	[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4
      
      We see that our current process now has an affinity mask containing
      every cpu available on the system _except_ the one we originally
      constrained it to:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:   ffffffff,fffffffb
      	Cpus_allowed_list:      0-1,3-63
      
      This is not sane behavior, as the scheduler can now not only place the
      process on previously forbidden cpus, it can't even schedule it on
      the cpu it was originally constrained to!
      
      Other cases result in even more exotic affinity masks. Take for instance
      a process with an affinity mask containing only cpus provided by smt at
      the moment that smt is toggled, in a configuration such as the following:
      
      	# taskset -p f000000000 $$
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	000000f0,00000000
      	Cpus_allowed_list:	36-39
      
      A double toggle of smt results in the following behavior:
      
      	# echo off > /sys/devices/system/cpu/smt/control
      	# echo on > /sys/devices/system/cpu/smt/control
      	# grep -i cpus /proc/$$/status
      	Cpus_allowed:	ffffff00,ffffffff
      	Cpus_allowed_list:	0-31,40-63
      
      This is even less sane than the previous case, as the new affinity mask
      excludes all smt-provided cpus with ids less than those that were
      previously in the affinity mask, as well as those that were actually in
      the mask.
      
      With this patch applied, both of these cases end in the following state:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	ffffffff,ffffffff
      	Cpus_allowed_list:	0-63
      
      The original policy is discarded. Though not ideal, it is the simplest way
      to restore sanity to this fallback case without reinventing the cpuset
      wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
      for the previous affinity mask to be restored in this fallback case can use
      that mechanism instead.
      
      This patch modifies the scheduler behavior by instead resetting the
      mask to task_cs(tsk)->cpus_allowed by default, and to the cpu_possible
      mask in legacy mode. I tested the cases above in both modes.
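      
      A sketch of the described fallback, assuming it keeps the existing
      do_set_cpus_allowed() call and the cpuset.c helpers task_cs() and
      is_in_v2_mode():
      
        void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
        {
        	rcu_read_lock();
        	do_set_cpus_allowed(tsk, is_in_v2_mode() ?
        		task_cs(tsk)->cpus_allowed : cpu_possible_mask);
        	rcu_read_unlock();
        }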
      
      Note that the scheduler uses this fallback mechanism if and only if
      _every_ other valid avenue has been traveled, and it is the last resort
      before calling BUG().
      Suggested-by: Waiman Long <longman@redhat.com>
      Suggested-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Joel Savitz <jsavitz@redhat.com>
      Acked-by: Phil Auld <pauld@redhat.com>
      Acked-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  7. 03 Jun 2019 (1 commit)
  8. 26 May 2019 (1 commit)
  9. 20 Apr 2019 (1 commit)
  10. 28 Feb 2019 (1 commit)
    • cpuset: Use fs_context · a1875374
      Authored by David Howells
      Make the cpuset filesystem use the filesystem context.  This is potentially
      tricky as the cpuset fs is almost an alias for the cgroup filesystem, but
      with some special parameters.
      
      This can, however, be handled by setting up an appropriate cgroup
      filesystem and returning the root directory of that as the root dir of this
      one.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  11. 19 Feb 2019 (1 commit)
  12. 29 Dec 2018 (1 commit)
    • mm, oom: reorganize the oom report in dump_header · ef8444ea
      Authored by yuzhoujian
      The OOM report contains several sections. The first one is the
      allocation context that has triggered the OOM. Then we have the cpuset
      context, followed by the stack trace of the OOM path. The third one is
      the OOM memory information, followed by the current memory state of
      all system tasks. At last, we show the oom-eligible tasks and the
      information about the chosen oom victim.
      
      One thing that makes parsing more awkward than necessary is that we do
      not have a single and easily parsable line about the oom context. This
      patch reorganizes the oom report into:
      
      1) who invoked oom and what was the allocation request
      
      [  515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
      
      2) OOM stack trace
      
      [  515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
      [  515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
      [  515.906821] Call Trace:
      [  515.908062]  dump_stack+0x5a/0x73
      [  515.909311]  dump_header+0x55/0x28c
      [  515.914260]  oom_kill_process+0x2d8/0x300
      [  515.916708]  out_of_memory+0x145/0x4a0
      [  515.917932]  __alloc_pages_slowpath+0x7d2/0xa16
      [  515.919157]  __alloc_pages_nodemask+0x277/0x290
      [  515.920367]  filemap_fault+0x3d0/0x6c0
      [  515.921529]  ? filemap_map_pages+0x2b8/0x420
      [  515.922709]  ext4_filemap_fault+0x2c/0x40 [ext4]
      [  515.923884]  __do_fault+0x20/0x80
      [  515.925032]  __handle_mm_fault+0xbc0/0xe80
      [  515.926195]  handle_mm_fault+0xfa/0x210
      [  515.927357]  __do_page_fault+0x233/0x4c0
      [  515.928506]  do_page_fault+0x32/0x140
      [  515.929646]  ? page_fault+0x8/0x30
      [  515.930770]  page_fault+0x1e/0x30
      
      3) OOM memory information
      
      [  515.958093] Mem-Info:
      [  515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
       active_file:4402672 inactive_file:483963 isolated_file:1344
       unevictable:0 dirty:4886753 writeback:0 unstable:0
       slab_reclaimable:148442 slab_unreclaimable:18741
       mapped:1347 shmem:1347 pagetables:58669 bounce:0
       free:88663 free_pcp:0 free_cma:0
      ...
      
      4) current memory state of all system tasks
      
      [  516.079544] [    744]     0   744     9211     1345   114688       82             0 systemd-journal
      [  516.082034] [    787]     0   787    31764        0   143360       92             0 lvmetad
      [  516.084465] [    792]     0   792    10930        1   110592      208         -1000 systemd-udevd
      [  516.086865] [   1199]     0  1199    13866        0   131072      112         -1000 auditd
      [  516.089190] [   1222]     0  1222    31990        1   110592      157             0 smartd
      [  516.091477] [   1225]     0  1225     4864       85    81920       43             0 irqbalance
      [  516.093712] [   1226]     0  1226    52612        0   258048      426             0 abrtd
      [  516.112128] [   1280]     0  1280   109774       55   299008      400             0 NetworkManager
      [  516.113998] [   1295]     0  1295    28817       37    69632       24             0 ksmtuned
      [  516.144596] [  10718]     0 10718  2622484  1721372 15998976   267219             0 panic
      [  516.145792] [  10719]     0 10719  2622484  1164767  9818112    53576             0 panic
      [  516.146977] [  10720]     0 10720  2622484  1174361  9904128    53709             0 panic
      [  516.148163] [  10721]     0 10721  2622484  1209070 10194944    54824             0 panic
      [  516.149329] [  10722]     0 10722  2622484  1745799 14774272    91138             0 panic
      
      5) oom context (constraints and the chosen victim).
      
      oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0
      
      An admin can easily get the full oom context on a single line, which
      makes parsing much easier.
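      
      A hedged sketch of emitting that single line (the format string
      follows the sample above; the helper and its argument list are
      assumptions, not the exact upstream code):
      
        static void dump_oom_context(const char *constraint, nodemask_t *nodemask,
        			     const char *cpuset_name, nodemask_t *mems,
        			     struct task_struct *victim)
        {
        	pr_info("oom-kill:constraint=%s,nodemask=%*pbl,cpuset=%s,mems_allowed=%*pbl,task=%s,pid=%d,uid=%d\n",
        		constraint, nodemask_pr_args(nodemask), cpuset_name,
        		nodemask_pr_args(mems), victim->comm, victim->pid,
        		from_kuid(&init_user_ns, task_uid(victim)));
        }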
      
      Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com
      Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <yang.s@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 04 Dec 2018 (1 commit)
  14. 14 Nov 2018 (1 commit)
    • cpuset: Minor cgroup2 interface updates · b1e3aeb1
      Authored by Tejun Heo
      * Rename the partition file from "cpuset.sched.partition" to
        "cpuset.cpus.partition".
      
      * When writing to the partition file, drop "0" and "1" and only accept
        "member" and "root".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Waiman Long <longman@redhat.com>
  15. 09 Nov 2018 (11 commits)
    • cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug · 5cf8114d
      Authored by Waiman Long
      For debugging purposes, it will be useful to expose the content of the
      subparts_cpus mask as a read-only file to see if the code works correctly.
      However, subparts_cpus will not be used at all in most use cases. So
      adding a new cpuset file that clutters the cgroup directory may not be
      desirable.  This is now being done by using the hidden "cgroup_debug"
      kernel command line option to expose a new "cpuset.cpus.subpartitions"
      file.
      
      That option was originally used by the debug controller to expose
      itself when configured into the kernel. This is now extended to set an
      internal flag used by cgroup_addrm_files(). A new CFTYPE_DEBUG flag
      can now be used to specify that a cgroup file should only be created
      when the "cgroup_debug" option is specified.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cpuset: Use descriptive text when reading/writing cpuset.sched.partition · bb5b553c
      Authored by Waiman Long
      Currently, cpuset.sched.partition returns the values 0, 1 or -1 on
      read. A person who is not familiar with the partition code may not
      understand what they mean.
      
      In order to make cpuset.sched.partition more user-friendly, it will
      now display the following descriptive text on read:
      
        "root" - A partition root (top cpuset of a partition)
        "member" - A non-root member of a partition
        "root invalid" - An invalid partition root
      
      Note that there is at least one partition in the whole cgroup
      hierarchy. The top cpuset is the root of that partition. The rest are
      either roots, if they start new partitions, or members of a partition.
      
      The cpuset.sched.partition file will now also accept "root" and
      "member" besides 1 and 0 as valid input values. The "root invalid"
      value is internal only and cannot be written to the file.
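      
      A hedged sketch of the read side (the PRS_* state names are
      assumptions based on the description above):
      
        switch (cs->partition_root_state) {
        case PRS_ENABLED:
        	seq_puts(seq, "root\n");
        	break;
        case PRS_DISABLED:
        	seq_puts(seq, "member\n");
        	break;
        case PRS_ERROR:
        	seq_puts(seq, "root invalid\n");
        	break;
        }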
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cpuset: Expose cpus.effective and mems.effective on cgroup v2 root · 5776cecc
      Authored by Waiman Long
      Because setting "cpuset.sched.partition" in a direct child of root can
      remove CPUs from the root's effective CPU list, it makes sense to know
      what CPUs are left in the root cgroup for scheduling purposes. So the
      "cpuset.cpus.effective" control file is now exposed in the v2 cgroup
      root.
      
      For consistency, the "cpuset.mems.effective" control file is exposed
      as well.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cpuset: Make generate_sched_domains() work with partition · 0ccea8fe
      Authored by Waiman Long
      The generate_sched_domains() function is modified to make it work
      correctly with the newly introduced subparts_cpus mask for scheduling
      domains generation.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cpuset: Make CPU hotplug work with partition · 4b842da2
      Authored by Waiman Long
      When there is a cpu hotplug event (CPU online or offline), the
      partitions may need to be reconfigured and regenerated. So code is
      added to the hotplug functions to make them work with the new
      subparts_cpus mask to compute the right effective_cpus for each of
      the affected cpusets. It may also change the state of a partition
      root from a real one to an erroneous one or vice versa.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cpuset: Track cpusets that use parent's effective_cpus · 4716909c
      Authored by Waiman Long
      In the default hierarchy, a cpuset will use the parent's effective_cpus
      if none of the requested CPUs can be granted from the parent. That can
      be a problem if a parent is a partition root with children partition
      roots. Changes to a parent's effective_cpus list due to changes in a
      child partition root may not be properly reflected in a child cpuset
      that uses the parent's effective_cpus, because the cpu_exclusive rule of a
      partition root will not guard against that.
      
      In order to avoid the mismatch, two new tracking variables are added
      to the cpuset structure to track whether a cpuset uses the parent's
      effective_cpus and how many child cpusets use its effective_cpus. So
      whenever cpumask changes are made to a parent, it will also check to
      see if it has other child cpusets that use its effective_cpus and
      call update_cpumasks_hier() if that is the case.
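      
      A sketch of the two tracking variables described (field names are
      assumptions):
      
        struct cpuset {
        	/* ... existing fields ... */
        	int use_parent_ecpus;	/* using parent's effective_cpus */
        	int child_ecpus_count;	/* # of children using our effective_cpus */
        };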
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cpuset: Add an error state to cpuset.sched.partition · 3881b861
      Authored by Waiman Long
      When external events like CPU offlining or user events like changing
      the cpu list of an ancestor cpuset happen, update_cpumasks_hier()
      will be called to update the effective cpus of each of the affected
      cpusets. That will then call update_parent_subparts_cpumask() if
      partitions are impacted.
      
      Currently, these events may cause update_parent_subparts_cpumask()
      to return an error if none of the requested cpus are available or if
      they would consume all the cpus in the parent partition root. Handling
      these errors is problematic as the states may become inconsistent.
      
      Instead of letting update_parent_subparts_cpumask() return an error, a
      new error state (-1) is added to the partition_root_state flag to
      designate that the partition is no longer valid. IOW, it is no longer
      a real partition root, but the CS_CPU_EXCLUSIVE flag will still be
      set, as it can be changed back to a real one if a favorable change
      happens later on.
      
      This new error state is set internally and the user cannot write this
      new value to "cpuset.sched.partition".
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cpuset: Add new v2 cpuset.sched.partition flag · ee8dde0c
      Authored by Waiman Long
      A new cpuset.sched.partition boolean flag is added to cpuset v2.
      This new flag, if set, indicates that the cgroup is the root of a
      new scheduling domain or partition that includes itself and all its
      descendants except those that are scheduling domain roots themselves
      and their descendants.
      
      With this new flag, one can directly create as many partitions as
      necessary without ever using the v1 trick of turning off load balancing
      in specific cpusets to create partitions as a side effect.
      
      This new flag is owned by the parent and will cause the CPUs in the
      cpuset to be removed from the effective CPUs of its parent.
      
      This is implemented internally by adding a new subparts_cpus mask that
      holds the CPUs belonging to child partitions so that:
      
              subparts_cpus | effective_cpus = cpus_allowed
              subparts_cpus & effective_cpus = 0
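      
      Expressed as runtime checks, a hedged sketch of these two invariants
      (struct cpuset field names per this series):
      
        cpumask_var_t tmp;
        
        if (zalloc_cpumask_var(&tmp, GFP_KERNEL)) {
        	cpumask_or(tmp, cs->subparts_cpus, cs->effective_cpus);
        	WARN_ON(!cpumask_equal(tmp, cs->cpus_allowed));
        	WARN_ON(cpumask_intersects(cs->subparts_cpus, cs->effective_cpus));
        	free_cpumask_var(tmp);
        }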
      
      This new flag can only be turned on in a cpuset if its parent is a
      partition root itself. The state of this flag cannot be changed if the
      cpuset has children.
      
      Once turned on, further changes to "cpuset.cpus" are allowed as long
      as there is at least one CPU left that can be granted from the parent,
      and a child partition root cannot use up all the CPUs in the parent's
      effective_cpus.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cpuset: Simplify allocation and freeing of cpumasks · bf92370c
      Authored by Waiman Long
      The previous commit introduces a new subparts_cpus mask into the cpuset
      data structure and a new tmpmasks structure.  Managing the allocation
      and freeing of those cpumasks is becoming more complex.
      
      So a number of helper functions are added to simplify and streamline
      the management of those cpumasks. To make it simple, all the cpumasks
      are now pre-cleared on allocation.
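      
      A hedged sketch of such helpers for the tmpmasks structure from the
      previous commit (names assumed; zalloc_* gives the pre-cleared masks
      mentioned above):
      
        static int alloc_tmpmasks(struct tmpmasks *tmp)
        {
        	if (!zalloc_cpumask_var(&tmp->addmask, GFP_KERNEL))
        		return -ENOMEM;
        	if (!zalloc_cpumask_var(&tmp->delmask, GFP_KERNEL))
        		goto free_add;
        	if (!zalloc_cpumask_var(&tmp->new_cpus, GFP_KERNEL))
        		goto free_del;
        	return 0;
        free_del:
        	free_cpumask_var(tmp->delmask);
        free_add:
        	free_cpumask_var(tmp->addmask);
        	return -ENOMEM;
        }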
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cpuset: Define data structures to support scheduling partition · 58b74842
      Authored by Waiman Long
      From a cpuset point of view, a scheduling partition is a group of
      cpusets with their own set of exclusive CPUs that are not shared by
      other tasks outside the scheduling partition.
      
      In the legacy hierarchy, scheduling partitions are supported indirectly
      via the right use of the load balancing and the exclusive CPUs flag
      which is not intuitive and can be hard to use.
      
      To fully support the concept of scheduling partitions in the default
      hierarchy, we need to add some new fields into the cpuset structure,
      as well as a new tmpmasks structure that is used to pre-allocate
      cpumasks at the top-level cpuset functions. This avoids memory
      allocation in inner functions, where an allocation failure could
      leave a cpuset in an inconsistent state.
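      
      A hedged sketch of the additions described (field names are
      assumptions):
      
        struct tmpmasks {
        	cpumask_var_t addmask;	/* CPUs to be added */
        	cpumask_var_t delmask;	/* CPUs to be removed */
        	cpumask_var_t new_cpus;	/* new effective_cpus */
        };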
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cpuset: Enable cpuset controller in default hierarchy · 4ec22e9c
      Authored by Waiman Long
      Given the fact that thread mode had been merged into 4.14, it is now
      time to enable cpuset to be used in the default hierarchy (cgroup v2)
      as it is clearly threaded.
      
      The cpuset controller had experienced feature creep since its
      introduction more than a decade ago. Besides the core cpus and mems
      control files to limit cpus and memory nodes, there are a bunch of
      additional features that can be controlled from the userspace. Some of
      the features are of doubtful usefulness and may not be actively used.
      
      This patch enables cpuset controller in the default hierarchy with
      a minimal set of features, namely just the cpus and mems and their
      effective_* counterparts.  We can certainly add more features to the
      default hierarchy in the future if there is a real user need for them
      later on.
      
      Alternatively, with the unified hierarchy, it may make more sense to
      move some of those additional cpuset features, if desired, to the
      memory controller, or maybe to the cpu controller, instead of keeping
      them in cpuset.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  16. 16 Jun 2018 (1 commit)
  17. 13 Jun 2018 (1 commit)
    • treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Authored by Kees Cook
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
              kmalloc_array(a, b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
  18. 07 Feb 2018 (1 commit)
  19. 05 Dec 2017 (2 commits)
  20. 28 Nov 2017 (2 commits)
    • cpuset: Make cpuset hotplug synchronous · 1599a185
      Authored by Prateek Sood
      Convert cpuset_hotplug_workfn() into a synchronous call for the cpu
      hotplug path. For the memory hotplug path it still gets queued as a
      work item.
      
      Since cpuset_hotplug_workfn() can be made synchronous for the cpu
      hotplug path, it is no longer required to wait for cpuset hotplug
      while thawing processes.
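      
      A hedged sketch of the shape described (the actual plumbing in the
      patch may differ):
      
        void cpuset_update_active_cpus(void)
        {
        	/* CPU hotplug path: run the handler directly instead of queueing. */
        	cpuset_hotplug_workfn(NULL);
        }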
      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup/cpuset: remove circular dependency deadlock · aa24163b
      Authored by Prateek Sood
      Remove a circular dependency deadlock in a scenario where CPU hotplug
      is being done while cgroup and cpuset updates are triggered from
      userspace.
      
      Process A => kthreadd => Process B => Process C => Process A
      
      Process A
      cpu_subsys_offline();
        cpu_down();
          _cpu_down();
            percpu_down_write(&cpu_hotplug_lock); //held
            cpuhp_invoke_callback();
      	     workqueue_offline_cpu();
                  queue_work_on(); // unbind_work on system_highpri_wq
                     __queue_work();
                       insert_work();
                          wake_up_worker();
                  flush_work();
                     wait_for_completion();
      
      worker_thread();
         manage_workers();
            create_worker();
      	     kthread_create_on_node();
      		    wake_up_process(kthreadd_task);
      
      kthreadd
      kthreadd();
        kernel_thread();
          do_fork();
            copy_process();
              percpu_down_read(&cgroup_threadgroup_rwsem);
                __rwsem_down_read_failed_common(); //waiting
      
      Process B
      kernfs_fop_write();
        cgroup_file_write();
          cgroup_procs_write();
            percpu_down_write(&cgroup_threadgroup_rwsem); //held
            cgroup_attach_task();
              cgroup_migrate();
                cgroup_migrate_execute();
                  cpuset_can_attach();
                    mutex_lock(&cpuset_mutex); //waiting
      
      Process C
      kernfs_fop_write();
        cgroup_file_write();
          cpuset_write_resmask();
            mutex_lock(&cpuset_mutex); //held
            update_cpumask();
              update_cpumasks_hier();
                rebuild_sched_domains_locked();
                  get_online_cpus();
                    percpu_down_read(&cpu_hotplug_lock); //waiting
      
      Eliminate the deadlock by reversing the locking order of cpuset_mutex
      and cpu_hotplug_lock.
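      
      A hedged sketch of the resulting locking order (era-appropriate
      primitives; the exact call sites differ):
      
        /* cpu_hotplug_lock is now taken before cpuset_mutex, never after. */
        get_online_cpus();
        mutex_lock(&cpuset_mutex);
        /* ... update cpumasks, rebuild sched domains ... */
        mutex_unlock(&cpuset_mutex);
        put_online_cpus();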
      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  21. 27 Oct 2017 (1 commit)
  22. 07 Sep 2017 (2 commits)
    • sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs · 50e76632
      Authored by Peter Zijlstra
      Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
      because it now resulted in non-cpuset usage breaking too.
      
      On suspend cpuset_cpu_inactive() doesn't call into
      cpuset_update_active_cpus() because it doesn't want to move tasks about,
      there is no need, all tasks are frozen and won't run again until after
      we've resumed everything.
      
      But this means that when we finally do call into
      cpuset_update_active_cpus() after resuming the last frozen cpu in
      cpuset_cpu_active(), the top_cpuset will not have any difference from
      the cpu_active_mask and thus it will not in fact do _anything_.
      
      So the cpuset configuration will not be restored. This was largely
      hidden because we would unconditionally create identity domains and
      mobile users would not in fact use cpusets much. And servers that do
      use cpusets tend not to suspend-resume much.
      
      An additional problem is that we'd not in fact wait for the cpuset
      work to finish before resuming the tasks, allowing spurious
      migrations outside of the specified domains.
      
      Fix the rebuild by introducing cpuset_force_rebuild() and fix the
      ordering with cpuset_wait_for_hotplug().
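      
      A minimal sketch of the force-rebuild latch described (a plausible
      shape, still a sketch):
      
        static bool force_rebuild;
        
        void cpuset_force_rebuild(void)
        {
        	force_rebuild = true;
        }
        
        /* In the hotplug work function, consume the latch: */
        if (force_rebuild) {
        	force_rebuild = false;
        	rebuild_sched_domains();
        }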
      Reported-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: deb7aa30 ("cpuset: reorganize CPU / memory hotplug handling")
      Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • mm: replace TIF_MEMDIE checks by tsk_is_oom_victim · da99ecf1
      Authored by Michal Hocko
      TIF_MEMDIE is set only on the tasks which were either directly
      selected by the OOM killer or passed through mark_oom_victim from the
      allocator path.  tsk_is_oom_victim is more generic and allows
      identifying all tasks (threads) which share the mm with the oom victim.
      
      Please note that the freezer still needs to check TIF_MEMDIE because
      we cannot thaw tasks which do not participate in oom_victims counting,
      otherwise a !TIF_MEMDIE task could interfere after oom_disable returns.
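      
      A sketch of the substitution at a cpuset call site (the surrounding
      condition is assumed):
      
        /* was: if (unlikely(test_thread_flag(TIF_MEMDIE))) */
        if (unlikely(tsk_is_oom_victim(current)))
        	return;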
      
      Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 25 Aug 2017 (2 commits)
    • sched/topology, cpuset: Avoid spurious/wrong domain rebuilds · 77d1dfda
      Authored by Peter Zijlstra
      When disabling cpuset.sched_load_balance we expect to be able to online
      CPUs without generating sched_domains. However this is currently
      completely broken.
      
      What happens is that we generate the sched_domains and then destroy
      them. This is because of the spurious 'default' domain build in
      cpuset_update_active_cpus(). That builds a single machine wide domain
      and then schedules a work to build the 'real' domains. The work then
      finds there are _no_ domains and destroys the lot again.
      
      Furthermore, if there actually were cpusets, building the machine-wide
      domain is actively wrong, because it would allow tasks to 'escape'
      their cpuset. Also I don't think it's needed; the scheduler really
      should respect the active mask.
      Reported-by: Ofer Levi(SW) <oferle@mellanox.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet.Gupta1@synopsys.com <Vineet.Gupta1@synopsys.com>
      Cc: rusty@rustcorp.com.au <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • cpuset: Fix incorrect memory_pressure control file mapping · 1c08c22c
      Authored by Waiman Long
      The memory_pressure control file was incorrectly set up without
      a private value (0, by default). As a result, this control
      file was treated like memory_migrate on read. By adding back the
      FILE_MEMORY_PRESSURE private value, the correct memory pressure value
      will be returned.
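      
      A hedged sketch of the restored table entry (the handler name is an
      assumption; FILE_MEMORY_PRESSURE is the private value named above):
      
        {
        	.name = "memory_pressure",
        	.read_u64 = cpuset_read_u64,
        	.private = FILE_MEMORY_PRESSURE,
        },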
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 7dbdb199 ("cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE")
      Cc: stable@vger.kernel.org # v4.4+