1. 11 Jan, 2018 1 commit
  2. 20 Dec, 2017 1 commit
    • cgroup: fix css_task_iter crash on CSS_TASK_ITER_PROC · 74d0833c
      Tejun Heo committed
      While teaching css_task_iter to handle skipping over tasks which
      aren't group leaders, bc2fb7ed ("cgroup: add @flags to
      css_task_iter_start() and implement CSS_TASK_ITER_PROCS") introduced a
      silly bug.
      
      CSS_TASK_ITER_PROCS is implemented by repeating
      css_task_iter_advance() while the advanced cursor is pointing to a
      non-leader thread.  However, the cursor variable, @l, wasn't updated
      when the iteration had to advance to the next css_set, so the
      following repetition operated on the terminal @l from the previous
      iteration.  That @l doesn't point to a valid task, which leads to
      oopses like the following or to infinite looping.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000254
        IP: __task_pid_nr_ns+0xc7/0xf0
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP
        ...
        CPU: 2 PID: 1 Comm: systemd Not tainted 4.14.4-200.fc26.x86_64 #1
        Hardware name: System manufacturer System Product Name/PRIME B350M-A, BIOS 3203 11/09/2017
        task: ffff88c4baee8000 task.stack: ffff96d5c3158000
        RIP: 0010:__task_pid_nr_ns+0xc7/0xf0
        RSP: 0018:ffff96d5c315bd50 EFLAGS: 00010206
        RAX: 0000000000000000 RBX: ffff88c4b68c6000 RCX: 0000000000000250
        RDX: ffffffffa5e47960 RSI: 0000000000000000 RDI: ffff88c490f6ab00
        RBP: ffff96d5c315bd50 R08: 0000000000001000 R09: 0000000000000005
        R10: ffff88c4be006b80 R11: ffff88c42f1b8004 R12: ffff96d5c315bf18
        R13: ffff88c42d7dd200 R14: ffff88c490f6a510 R15: ffff88c4b68c6000
        FS:  00007f9446f8ea00(0000) GS:ffff88c4be680000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000254 CR3: 00000007f956f000 CR4: 00000000003406e0
        Call Trace:
         cgroup_procs_show+0x19/0x30
         cgroup_seqfile_show+0x4c/0xb0
         kernfs_seq_show+0x21/0x30
         seq_read+0x2ec/0x3f0
         kernfs_fop_read+0x134/0x180
         __vfs_read+0x37/0x160
         ? security_file_permission+0x9b/0xc0
         vfs_read+0x8e/0x130
         SyS_read+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xa5
        RIP: 0033:0x7f94455f942d
        RSP: 002b:00007ffe81ba2d00 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
        RAX: ffffffffffffffda RBX: 00005574e2233f00 RCX: 00007f94455f942d
        RDX: 0000000000001000 RSI: 00005574e2321a90 RDI: 000000000000002b
        RBP: 0000000000000000 R08: 00005574e2321a90 R09: 00005574e231de60
        R10: 00007f94458c8b38 R11: 0000000000000293 R12: 00007f94458c8ae0
        R13: 00007ffe81ba3800 R14: 0000000000000000 R15: 00005574e2116560
        Code: 04 74 0e 89 f6 48 8d 04 76 48 8d 04 c5 f0 05 00 00 48 8b bf b8 05 00 00 48 01 c7 31 c0 48 8b 0f 48 85 c9 74 18 8b b2 30 08 00 00 <3b> 71 04 77 0d 48 c1 e6 05 48 01 f1 48 3b 51 38 74 09 5d c3 8b
        RIP: __task_pid_nr_ns+0xc7/0xf0 RSP: ffff96d5c315bd50
      
      Fix it by moving the initialization of the cursor below the repeat
      label.  While at it, rename it to @next for readability.

      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: bc2fb7ed ("cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS")
      Cc: stable@vger.kernel.org # v4.14+
      Reported-by: Laura Abbott <labbott@redhat.com>
      Reported-by: Bronek Kozicki <brok@incorrekt.com>
      Reported-by: George Amanakis <gamanakis@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
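The fix described in this commit can be illustrated with a small user-space model. This is only a sketch, not the kernel code: `TaskIter`, `list_pids` and the `(pid, is_leader)` tuples are hypothetical names invented for the illustration. The point it shows is that the cursor (`nxt`) is recomputed from the iterator state at the top of every repetition, so a repetition that follows a move to the next css_set never works from a stale cursor.

```python
from typing import List, Optional, Tuple

Task = Tuple[int, bool]  # (pid, is_thread_group_leader); hypothetical model type


class TaskIter:
    """Toy stand-in for css_task_iter: walks tasks across several css_sets."""

    def __init__(self, css_sets: List[List[Task]], procs_only: bool) -> None:
        self.css_sets = [s for s in css_sets if s]  # ignore empty css_sets
        self.procs_only = procs_only
        self.cset = 0
        self.pos: Optional[int] = 0 if self.css_sets else None
        if self.pos is not None and self._skip_current():
            self._advance()

    def _skip_current(self) -> bool:
        _pid, is_leader = self.css_sets[self.cset][self.pos]
        return self.procs_only and not is_leader

    def _advance(self) -> None:
        while True:  # plays the role of the "repeat:" label
            # The cursor is derived from the iterator state on every pass,
            # which mirrors moving its initialization below the label.
            nxt = self.pos + 1
            if nxt >= len(self.css_sets[self.cset]):
                self.cset += 1  # advance to the next css_set
                if self.cset >= len(self.css_sets):
                    self.pos = None  # iteration finished
                    return
                self.pos = 0
            else:
                self.pos = nxt
            if not self._skip_current():
                return

    def next(self) -> Optional[Task]:
        if self.pos is None:
            return None
        task = self.css_sets[self.cset][self.pos]
        self._advance()
        return task


def list_pids(css_sets: List[List[Task]], procs_only: bool) -> List[int]:
    """Collect the pids the iterator yields, in order."""
    it = TaskIter(css_sets, procs_only)
    pids = []
    task = it.next()
    while task is not None:
        pids.append(task[0])
        task = it.next()
    return pids
```

With two css_sets where a non-leader ends the first set and another non-leader starts the second, `list_pids(..., procs_only=True)` has to repeat across the css_set boundary, which is the situation the commit message describes.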
  3. 19 Dec, 2017 1 commit
    • cgroup: Fix deadlock in cpu hotplug path · 116d2f74
      Prateek Sood committed
      A deadlock can occur during cgroup migration from the cpu hotplug path
      when a task T is being moved from the source to the destination cgroup.
      
      kworker/0:0
      cpuset_hotplug_workfn()
         cpuset_hotplug_update_tasks()
            hotplug_update_tasks_legacy()
              remove_tasks_in_empty_cpuset()
                cgroup_transfer_tasks() // stuck in iterator loop
                  cgroup_migrate()
                    cgroup_migrate_add_task()
      
      cgroup_migrate_add_task() checks the PF_EXITING flag of task T and,
      because it is set, does not migrate T to the destination cgroup. The
      css_task_iter_start() loop then keeps returning task T, waiting for
      T's cg_list node to be removed.
      
      Task T
      do_exit()
        exit_signals() // sets PF_EXITING
        exit_task_namespaces()
          switch_task_namespaces()
            free_nsproxy()
              put_mnt_ns()
                drop_collected_mounts()
                  namespace_unlock()
                    synchronize_rcu()
                      _synchronize_rcu_expedited()
                        schedule_work() // on cpu0 low priority worker pool
                        wait_event() // waiting for work item to execute
      
      Task T inserted a work item into the worklist of the cpu0 low priority
      worker pool and is waiting for that expedited grace period work item
      to execute. The work item will only be executed once kworker/0:0
      completes execution of cpuset_hotplug_workfn().
      
      kworker/0:0 ==> Task T ==>kworker/0:0
      
      Fix this by skipping a PF_EXITING task during migration and migrating
      the next available task in the source cgroup instead.

      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
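The behaviour after this fix can be sketched as a transfer loop that never selects an exiting task. This is a user-space model under stated assumptions: `Task`, `next_migratable` and `transfer_tasks` are hypothetical names, and the `exiting` field merely stands in for the PF_EXITING flag.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    exiting: bool = False  # stands in for PF_EXITING (model assumption)


def next_migratable(tasks: List[Task]) -> Optional[Task]:
    """Return the first task that is not exiting, or None if there is none."""
    for task in tasks:
        if not task.exiting:
            return task
    return None


def transfer_tasks(src: List[Task], dst: List[Task]) -> None:
    """Move every non-exiting task from src to dst, one task per pass."""
    while True:
        task = next_migratable(src)
        if task is None:
            break  # only exiting tasks (or nothing) left: stop instead of spinning
        src.remove(task)
        dst.append(task)
```

In this model an exiting task simply stays in the source list, so the loop terminates even though that task is never migrated.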
  4. 15 Dec, 2017 1 commit
  5. 12 Dec, 2017 1 commit
  6. 05 Dec, 2017 2 commits
  7. 28 Nov, 2017 4 commits
    • cgroup: properly init u64_stats · 52cf373c
      Lucas Stach committed
      Lockdep complains that the stats update is trying to register a non-static
      key. This is because u64_stats are using a seqlock on 32bit arches, which
      needs to be initialized before usage.
      
      Fixes: 041cd640 (cgroup: Implement cgroup2 basic CPU usage accounting)
      Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • debug cgroup: use task_css_set instead of rcu_dereference · ddf7005f
      Wang Long committed
      The task_css_set() macro verifies that the caller is inside the proper
      critical section when the kernel is built with CONFIG_PROVE_RCU=y.

      Signed-off-by: Wang Long <wanglong19@meituan.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cpuset: Make cpuset hotplug synchronous · 1599a185
      Prateek Sood committed
      Convert cpuset_hotplug_workfn() into synchronous call for cpu hotplug
      path. For memory hotplug path it still gets queued as a work item.
      
      Since cpuset_hotplug_workfn() can be made synchronous for cpu hotplug
      path, it is not required to wait for cpuset hotplug while thawing
      processes.

      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup/cpuset: remove circular dependency deadlock · aa24163b
      Prateek Sood committed
      Remove a circular dependency deadlock that occurs when a CPU is
      hotplugged while cgroup and cpuset updates are triggered from
      userspace.
      
      Process A => kthreadd => Process B => Process C => Process A
      
      Process A
      cpu_subsys_offline();
        cpu_down();
          _cpu_down();
            percpu_down_write(&cpu_hotplug_lock); //held
            cpuhp_invoke_callback();
              workqueue_offline_cpu();
                queue_work_on(); // unbind_work on system_highpri_wq
                  __queue_work();
                    insert_work();
                      wake_up_worker();
                flush_work();
                  wait_for_completion();
      
      worker_thread();
        manage_workers();
          create_worker();
            kthread_create_on_node();
              wake_up_process(kthreadd_task);
      
      kthreadd
      kthreadd();
        kernel_thread();
          do_fork();
            copy_process();
              percpu_down_read(&cgroup_threadgroup_rwsem);
                __rwsem_down_read_failed_common(); //waiting
      
      Process B
      kernfs_fop_write();
        cgroup_file_write();
          cgroup_procs_write();
            percpu_down_write(&cgroup_threadgroup_rwsem); //held
            cgroup_attach_task();
              cgroup_migrate();
                cgroup_migrate_execute();
                  cpuset_can_attach();
                    mutex_lock(&cpuset_mutex); //waiting
      
      Process C
      kernfs_fop_write();
        cgroup_file_write();
          cpuset_write_resmask();
            mutex_lock(&cpuset_mutex); //held
            update_cpumask();
              update_cpumasks_hier();
                rebuild_sched_domains_locked();
                  get_online_cpus();
                    percpu_down_read(&cpu_hotplug_lock); //waiting
      
      Eliminate the deadlock by reversing the locking order of cpuset_mutex
      and cpu_hotplug_lock.

      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
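The circular dependency in this commit message can be checked mechanically by treating it as a wait-for graph. The sketch below is illustrative only: `find_cycle` is a hypothetical helper, and the graph simply encodes "X waits for Y" as read from the traces above. The "after" graph is one reading of the fix (once Process C no longer holds cpuset_mutex while it waits for cpu_hotplug_lock, Process B is not blocked by C), not something the commit states in this form.

```python
from typing import Dict, List, Optional


def find_cycle(waits_for: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow 'X waits for Y' edges from start; return the cycle if one exists."""
    seen: List[str] = []
    node = start
    while node in waits_for and node not in seen:
        seen.append(node)
        node = waits_for[node]
    if node in seen:
        # Close the loop so the returned path starts and ends at the same node.
        return seen[seen.index(node):] + [node]
    return None


# Wait-for edges taken from the call traces in the commit message.
before_fix = {
    "Process A": "kthreadd",   # flush_work() needs a new worker thread
    "kthreadd": "Process B",   # blocked on cgroup_threadgroup_rwsem
    "Process B": "Process C",  # blocked on cpuset_mutex
    "Process C": "Process A",  # blocked on cpu_hotplug_lock
}

# Assumed picture after reversing the lock order: B no longer waits for C.
after_fix = {
    "Process A": "kthreadd",
    "kthreadd": "Process B",
    "Process C": "Process A",
}
```

Calling `find_cycle(before_fix, "Process A")` walks the same chain as the "Process A => kthreadd => Process B => Process C => Process A" line above, while the assumed post-fix graph has no cycle.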
  8. 07 Nov, 2017 2 commits
    • cgroup: export list of cgroups v2 features using sysfs · 5f2e6734
      Roman Gushchin committed
      The active development of cgroups v2 sometimes leads to a creation
      of interfaces, which are not turned on by default (to provide
      backward compatibility). It's handy to know from userspace, which
      cgroup v2 features are supported without calculating it based
      on the kernel version. So, let's export the list of such features
      using /sys/kernel/cgroup/features pseudo-file.
      
      The list is hardcoded and has to be extended when new functionality
      is added. Each feature is printed on a new line.
      
      Example:
        $ cat /sys/kernel/cgroup/features
        nsdelegate

      Signed-off-by: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
      Signed-off-by: Tejun Heo <tj@kernel.org>
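From userspace, the pseudo-file described above can be read as a set of feature names. A minimal sketch, assuming only the one-feature-per-line format given in the commit message; `parse_features` and `read_cgroup_features` are hypothetical helper names, and treating a missing file as "no features" is a design choice of this sketch for kernels that predate the file.

```python
from pathlib import Path
from typing import Set

FEATURES_PATH = "/sys/kernel/cgroup/features"  # path named in the commit message


def parse_features(text: str) -> Set[str]:
    """Each feature is printed on its own line; blank lines are ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def read_cgroup_features(path: str = FEATURES_PATH) -> Set[str]:
    """Return the advertised features, or an empty set if the file is unreadable."""
    try:
        return parse_features(Path(path).read_text())
    except OSError:
        return set()
```

A caller could then test `"nsdelegate" in read_cgroup_features()` instead of comparing kernel version numbers.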
    • cgroup: export list of delegatable control files using sysfs · 01ee6cfb
      Roman Gushchin committed
      Delegatable cgroup v2 control files may require special handling
      (e.g. chowning), and the exact list of such files varies between
      kernel versions (and is likely to be extended in the future).
      
      To guarantee correctness of this list and simplify the life
      of userspace (systemd, first of all), let's export the list
      via /sys/kernel/cgroup/delegate pseudo-file.
      
      Format is simple: each control file name is printed on a new line.
      Example:
        $ cat /sys/kernel/cgroup/delegate
        cgroup.procs
        cgroup.subtree_control

      Signed-off-by: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
      Signed-off-by: Tejun Heo <tj@kernel.org>
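A delegating manager could turn the exported list into the set of paths that need special handling. The sketch below only computes the paths and changes nothing on disk; `delegation_targets` is a hypothetical helper and the cgroup directory in the usage note is an example value, not something taken from the commit.

```python
import posixpath
from typing import List


def delegation_targets(cgroup_dir: str, delegate_text: str) -> List[str]:
    """Map each control file name (one per line) to its path inside cgroup_dir."""
    names = [line.strip() for line in delegate_text.splitlines() if line.strip()]
    return [posixpath.join(cgroup_dir, name) for name in names]
```

For example, feeding it the contents of /sys/kernel/cgroup/delegate and a directory such as /sys/fs/cgroup/user.slice yields the files a manager would then pass to `os.chown()` for the delegatee.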
  9. 02 Nov, 2017 1 commit
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman committed
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information.
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.

      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
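The default-license rule and the "different comment types" remark above can be sketched as follows. This is not the script Thomas and Greg wrote; `spdx_identifier` and `spdx_line` are hypothetical helpers. The comment styles (`//` for .c files, `/* */` for headers, `#` for anything else such as Makefiles) follow the convention in the kernel's license-rules documentation, which is an assumption relative to this commit message.

```python
def spdx_identifier(path: str) -> str:
    """Default identifier for a file that carries no license information."""
    if "/uapi/" in "/" + path:
        return "GPL-2.0 WITH Linux-syscall-note"
    return "GPL-2.0"


def spdx_line(path: str) -> str:
    """Format the SPDX tag in the comment style assumed for the file type."""
    ident = spdx_identifier(path)
    if path.endswith(".h"):
        return f"/* SPDX-License-Identifier: {ident} */"
    if path.endswith(".c"):
        return f"// SPDX-License-Identifier: {ident}"
    return f"# SPDX-License-Identifier: {ident}"
```

The sketch covers only the "no license information" case from the first two bullet points; files with existing license text needed the scanner comparison and manual review described in the message.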
  10. 30 Oct, 2017 1 commit
  11. 27 Oct, 2017 2 commits
  12. 11 Oct, 2017 1 commit
  13. 10 Oct, 2017 1 commit
  14. 05 Oct, 2017 2 commits
    • bpf: introduce BPF_PROG_QUERY command · 468e2f64
      Alexei Starovoitov committed
      Introduce the BPF_PROG_QUERY command to retrieve either the set of
      programs attached to a given cgroup or the set of effective programs
      that will execute for events within a cgroup.

      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: multi program support for cgroup+bpf · 324bda9e
      Alexei Starovoitov committed
      Introduce the BPF_F_ALLOW_MULTI flag, which can be used to attach
      multiple bpf programs to a cgroup.
      
      The difference between three possible flags for BPF_PROG_ATTACH command:
      - NONE(default): No further bpf programs allowed in the subtree.
      - BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
        the program in this cgroup yields to sub-cgroup program.
      - BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
        that cgroup program gets run in addition to the program in this cgroup.
      
      NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
      change their behavior. It only clarifies the semantics in relation
      to new flag.
      
      Only one program is allowed to be attached to a cgroup with
      NONE or BPF_F_ALLOW_OVERRIDE flag.
      Multiple programs are allowed to be attached to a cgroup with
      BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
      (those that were attached first run first).
      The programs of sub-cgroup are executed first, then programs of
      this cgroup and then programs of parent cgroup.
      All eligible programs are executed regardless of return code from
      earlier programs.
      
      To allow efficient execution of multiple programs attached to a cgroup
      and to avoid penalizing cgroups without any programs attached,
      introduce 'struct bpf_prog_array', which is an RCU protected array
      of pointers to bpf programs.

      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
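The execution order described for BPF_F_ALLOW_MULTI can be modelled in a few lines. This sketch assumes every cgroup in the chain uses BPF_F_ALLOW_MULTI and ignores the NONE and BPF_F_ALLOW_OVERRIDE cases; `effective_programs` and `run_all` are hypothetical names, and plain Python callables stand in for bpf programs.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def effective_programs(chain: Sequence[Sequence[T]]) -> List[T]:
    """chain lists each cgroup's programs in attach order, from the topmost
    ancestor down to the cgroup where the event happens."""
    order: List[T] = []
    for progs in reversed(chain):  # sub-cgroup first, then each parent in turn
        order.extend(progs)        # FIFO within a single cgroup
    return order


def run_all(programs: Sequence[Callable[[object], int]], event: object) -> List[int]:
    """Run every program; an earlier return code never stops later programs."""
    return [prog(event) for prog in programs]
```

How the individual return codes are combined into a final verdict is not stated in the commit message, so the model just returns them all.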
  15. 26 Sep, 2017 1 commit
    • cgroup: statically initialize init_css_set->dfl_cgrp · 38683148
      Tejun Heo committed
      Like other csets, init_css_set's dfl_cgrp is initialized when the cset
      gets linked.  init_css_set gets linked in cgroup_init().  This has
      been fine till now but the recently added basic CPU usage accounting
      may end up accessing dfl_cgrp of init before cgroup_init() leading to
      the following oops.
      
        SELinux:  Initializing.
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
        IP: account_system_index_time+0x60/0x90
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc2-00003-g041cd640 #10
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
        +1.9.3-20161025_171302-gandalf 04/01/2014
        task: ffffffff81e10480 task.stack: ffffffff81e00000
        RIP: 0010:account_system_index_time+0x60/0x90
        RSP: 0000:ffff880011e03cb8 EFLAGS: 00010002
        RAX: ffffffff81ef8800 RBX: ffffffff81e10480 RCX: 0000000000000003
        RDX: 0000000000000000 RSI: 00000000000f4240 RDI: 0000000000000000
        RBP: ffff880011e03cc0 R08: 0000000000010000 R09: 0000000000000000
        R10: 0000000000000020 R11: 0000003b9aca0000 R12: 000000000001c100
        R13: 0000000000000000 R14: ffffffff81e10480 R15: ffffffff81e03cd8
        FS:  0000000000000000(0000) GS:ffff880011e00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000b0 CR3: 0000000001e09000 CR4: 00000000000006b0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         account_system_time+0x45/0x60
         account_process_tick+0x5a/0x140
         update_process_times+0x22/0x60
         tick_periodic+0x2b/0x90
         tick_handle_periodic+0x25/0x70
         timer_interrupt+0x15/0x20
         __handle_irq_event_percpu+0x7e/0x1b0
         handle_irq_event_percpu+0x23/0x60
         handle_irq_event+0x42/0x70
         handle_level_irq+0x83/0x100
         handle_irq+0x6f/0x110
         do_IRQ+0x46/0xd0
         common_interrupt+0x9d/0x9d
      
      Fix it by statically initializing init_css_set.dfl_cgrp so that init's
      default cgroup is accessible from the get-go.
      
      Fixes: 041cd640 ("cgroup: Implement cgroup2 basic CPU usage accounting")
      Reported-by: “kbuild-all@01.org” <kbuild-all@01.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  16. 25 Sep, 2017 1 commit
    • cgroup: Implement cgroup2 basic CPU usage accounting · 041cd640
      Tejun Heo committed
      In cgroup1, while cpuacct isn't actually controlling any resources, it
      is a separate controller due to combination of two factors -
      1. enabling cpu controller has significant side effects, and 2. we
      have to pick one of the hierarchies to account CPU usages on.  cpuacct
      controller is effectively used to designate a hierarchy to track CPU
      usages on.
      
      cgroup2's unified hierarchy removes the second reason and we can
      account basic CPU usages by default.  While we can use cpuacct for
      this purpose, both its interface and implementation leave a lot to be
      desired - it collects and exposes two sources of truth which don't
      agree with each other and some of the exposed statistics don't make
      much sense.  Also, it propagates all the way up the hierarchy on each
      accounting event which is unnecessary.
      
      This patch adds basic resource accounting mechanism to cgroup2's
      unified hierarchy and accounts CPU usages using it.
      
      * All accountings are done per-cpu and don't propagate immediately.
        It just bumps the per-cgroup per-cpu counters and links to the
        parent's updated list if not already on it.
      
      * On a read, the per-cpu counters are collected into the global ones
        and then propagated upwards.  Only the per-cpu counters which have
        changed since the last read are propagated.
      
      * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
        prefix.  Total usage is collected from scheduling events.  User/sys
        breakdown is sourced from tick sampling and adjusted to the usage
        using cputime_adjust().
      
      This keeps the accounting side hot path O(1) and per-cpu and the read
      side O(nr_updated_since_last_read).
      
      v2: Minor changes and documentation updates as suggested by Waiman and
          Roman.

      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
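The accounting scheme in the bullet points above (bump a per-cpu counter, link into the parent's updated list, fold everything together only on read) can be sketched as a user-space model. All names here (`Cgroup`, `account`, `read_usage`, the `unreported` field) are hypothetical and the data structures are simplified; this is not the kernel implementation.

```python
from typing import List, Optional, Set


class Cgroup:
    def __init__(self, parent: Optional["Cgroup"] = None, nr_cpus: int = 2) -> None:
        self.parent = parent
        self.percpu: List[int] = [0] * nr_cpus  # unflushed per-cpu deltas
        # Per cpu: children whose subtree has unflushed data on that cpu.
        self.updated: List[Set["Cgroup"]] = [set() for _ in range(nr_cpus)]
        self.usage = 0       # flushed total, including descendants
        self.unreported = 0  # flushed here but not yet added to the parent


def account(cgrp: Cgroup, cpu: int, delta: int) -> None:
    """Hot path: bump the per-cpu counter and link upwards if not yet linked."""
    cgrp.percpu[cpu] += delta
    node = cgrp
    while node.parent is not None and node not in node.parent.updated[cpu]:
        node.parent.updated[cpu].add(node)
        node = node.parent


def _flush(cgrp: Cgroup) -> None:
    delta = 0
    for cpu in range(len(cgrp.percpu)):
        delta += cgrp.percpu[cpu]
        cgrp.percpu[cpu] = 0
        for child in cgrp.updated[cpu]:  # only children that changed
            _flush(child)
            delta += child.unreported
            child.unreported = 0
        cgrp.updated[cpu].clear()
    cgrp.usage += delta
    cgrp.unreported += delta


def read_usage(cgrp: Cgroup) -> int:
    """Read path: fold pending deltas of the subtree, then report the total."""
    _flush(cgrp)
    return cgrp.usage
```

Reading a cgroup in the middle of the hierarchy leaves its contribution in `unreported`, so a later read of an ancestor still picks it up; only cgroups on an updated list are visited during a flush.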
  17. 22 Sep, 2017 1 commit
    • cgroup: Reinit cgroup_taskset structure before cgroup_migrate_execute() returns · c4fa6c43
      Waiman Long committed
      The cgroup_taskset structure within the larger cgroup_mgctx structure
      is supposed to be used once and then discarded. That is not really the
      case in the hotplug code path:
      
      cpuset_hotplug_workfn()
       - cgroup_transfer_tasks()
         - cgroup_migrate()
           - cgroup_migrate_add_task()
           - cgroup_migrate_execute()
      
      In this case, the cgroup_migrate() function is called multiple times
      with the same cgroup_mgctx structure to transfer the tasks from
      one cgroup to another one-by-one. The second time cgroup_migrate()
      is called, the cgroup_taskset will be in an incorrect state and so
      may cause the system to panic. For example,
      
        [  150.888410] Faulting instruction address: 0xc0000000001db648
        [  150.888414] Oops: Kernel access of bad area, sig: 11 [#1]
        [  150.888417] SMP NR_CPUS=2048
        [  150.888417] NUMA
        [  150.888419] pSeries
          :
        [  150.888545] NIP [c0000000001db648] cpuset_can_attach+0x58/0x1b0
        [  150.888548] LR [c0000000001db638] cpuset_can_attach+0x48/0x1b0
        [  150.888551] Call Trace:
        [  150.888554] [c0000005f65cb940] [c0000000001db638] cpuset_can_attach+0x48/0x1b0 (unreliable)
        [  150.888559] [c0000005f65cb9a0] [c0000000001cff04] cgroup_migrate_execute+0xc4/0x4b0
        [  150.888563] [c0000005f65cba20] [c0000000001d7d14] cgroup_transfer_tasks+0x1d4/0x370
        [  150.888568] [c0000005f65cbb70] [c0000000001ddcb0] cpuset_hotplug_workfn+0x710/0x8f0
        [  150.888572] [c0000005f65cbc80] [c00000000012032c] process_one_work+0x1ac/0x4d0
        [  150.888576] [c0000005f65cbd20] [c0000000001206f8] worker_thread+0xa8/0x5b0
        [  150.888580] [c0000005f65cbdc0] [c0000000001293f8] kthread+0x168/0x1b0
        [  150.888584] [c0000005f65cbe30] [c00000000000b368] ret_from_kernel_thread+0x5c/0x74
      
      To allow reuse of the cgroup_mgctx structure, some fields in that
      structure are now re-initialized at the end of cgroup_migrate_execute()
      function call so that the structure can be reused again in a later
      iteration without causing problems.
      
      This bug was introduced in commit e595cd70 ("cgroup: track
      migration context in cgroup_mgctx") in 4.11. This commit moves the
      cgroup_taskset initialization out of cgroup_migrate(). The commit
      10467270fb3 ("cgroup: don't call migration methods if there are no
      tasks to migrate") helped, but did not completely resolve the problem.
      
      Fixes: e595cd70 ("cgroup: track migration context in cgroup_mgctx")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v4.11+
  18. 07 Sep, 2017 3 commits
    • sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs · 50e76632
      Peter Zijlstra committed
      Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
      because it now resulted in non-cpuset usage breaking too.
      
      On suspend cpuset_cpu_inactive() doesn't call into
      cpuset_update_active_cpus() because it doesn't want to move tasks about,
      there is no need, all tasks are frozen and won't run again until after
      we've resumed everything.
      
      But this means that when we finally do call into
      cpuset_update_active_cpus() after resuming the last frozen cpu in
      cpuset_cpu_active(), the top_cpuset will not have any difference with
      the cpu_active_mask and thus it will not in fact do _anything_.
      
      So the cpuset configuration will not be restored. This was largely
      hidden because we would unconditionally create identity domains and
      mobile users would not in fact use cpusets much. And servers that do use
      cpusets tend to not suspend-resume much.
      
      An additional problem is that we'd not in fact wait for the cpuset work to
      finish before resuming the tasks, allowing spurious migrations outside
      of the specified domains.
      
      Fix the rebuild by introducing cpuset_force_rebuild() and fix the
      ordering with cpuset_wait_for_hotplug().
      Reported-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: deb7aa30 ("cpuset: reorganize CPU / memory hotplug handling")
      Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
    • M
      mm: replace TIF_MEMDIE checks by tsk_is_oom_victim · da99ecf1
      Committed by Michal Hocko
      TIF_MEMDIE is set only on the tasks which were either directly selected
      by the OOM killer or passed through mark_oom_victim from the allocator
      path.  tsk_is_oom_victim is more generic and allows identifying all
      tasks (threads) which share the mm with the oom victim.
      
      Please note that the freezer still needs to check TIF_MEMDIE because we
      cannot thaw tasks which do not participate in oom_victims counting;
      otherwise a !TIF_MEMDIE task could interfere after oom_disable returns.
      
      Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      cgroup: revert fa06235b ("cgroup: reset css on destruction") · 65f3975f
      Committed by Roman Gushchin
      Commit fa06235b ("cgroup: reset css on destruction") caused
      css_reset callback to be called from the offlining path.  Although it
      solves the problem mentioned in the commit description ("For instance,
      memory cgroup needs to reset memory.low, otherwise pages charged to a
      dead cgroup might never get reclaimed."), generally speaking, it's not
      correct.
      
      An offline cgroup can still be a resource domain, and we shouldn't grant
      it more resources than it had before deletion.
      
      For instance, if an offline memory cgroup has dirty pages, we should
      still apply I/O limits during writeback.
      
      The css_reset callback is designed to return the cgroup to its
      original state, which means resetting all limits and counters.  That is
      something different from offlining, and we shouldn't use it from
      the offlining path.  Instead, we should adjust the necessary settings from
      the per-controller css_offline callbacks (e.g. reset memory.low).
      
      Link: http://lkml.kernel.org/r/20170727130428.28856-2-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 25 Aug, 2017 2 commits
    • P
      sched/topology, cpuset: Avoid spurious/wrong domain rebuilds · 77d1dfda
      Committed by Peter Zijlstra
      When disabling cpuset.sched_load_balance we expect to be able to online
      CPUs without generating sched_domains. However this is currently
      completely broken.
      
      What happens is that we generate the sched_domains and then destroy
      them. This is because of the spurious 'default' domain build in
      cpuset_update_active_cpus(). That builds a single machine wide domain
      and then schedules a work to build the 'real' domains. The work then
      finds there are _no_ domains and destroys the lot again.
      
      Furthermore, if there actually were cpusets, building the machine wide
      domain is actively wrong, because it would allow tasks to 'escape' their
      cpuset. Also I don't think it's needed, the scheduler really should
      respect the active mask.
      Reported-by: Ofer Levi(SW) <oferle@mellanox.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet.Gupta1@synopsys.com <Vineet.Gupta1@synopsys.com>
      Cc: rusty@rustcorp.com.au <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • W
      cpuset: Fix incorrect memory_pressure control file mapping · 1c08c22c
      Committed by Waiman Long
      The memory_pressure control file was incorrectly set up without
      a private value (0, by default). As a result, this control
      file was treated like memory_migrate on read. By adding back the
      FILE_MEMORY_PRESSURE private value, the correct memory pressure value
      will be returned.
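
      The mix-up can be illustrated with a small Python model. This is only a
      sketch, not the kernel code: the dispatcher is hypothetical, and the only
      facts taken from the message above are that the default private value is 0
      and that 0 selects memory_migrate. The numeric value given to
      FILE_MEMORY_PRESSURE and the sample state are assumptions.

```python
from enum import IntEnum


class CpusetFile(IntEnum):
    # Hypothetical subset of the file types; only "memory_migrate is 0"
    # comes from the commit message.
    FILE_MEMORY_MIGRATE = 0
    FILE_MEMORY_PRESSURE = 1  # assumed value, for illustration only


def read_control_file(private, state):
    """Return the value for the control file selected by its private value."""
    kind = CpusetFile(private)
    if kind is CpusetFile.FILE_MEMORY_MIGRATE:
        return state["memory_migrate"]
    if kind is CpusetFile.FILE_MEMORY_PRESSURE:
        return state["memory_pressure"]
    raise ValueError(f"unhandled file type: {kind!r}")


# Made-up cpuset state.
state = {"memory_migrate": 1, "memory_pressure": 37}

# Before the fix: no private value was set, so it defaulted to 0.
buggy = read_control_file(0, state)

# After the fix: the file carries its own private value again.
fixed = read_control_file(CpusetFile.FILE_MEMORY_PRESSURE, state)
```

      In this model a read of memory_pressure with the default private value
      yields the memory_migrate setting, which is the symptom described above.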
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 7dbdb199 ("cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE")
      Cc: stable@vger.kernel.org # v4.4+
  20. 18 Aug, 2017 2 commits
  21. 12 Aug, 2017 1 commit
  22. 11 Aug, 2017 1 commit
    • T
      cgroup: misc changes · 3e48930c
      Committed by Tejun Heo
      Misc trivial changes to prepare for future changes.  No functional
      difference.
      
      * Expose cgroup_get(), cgroup_tryget() and cgroup_parent().
      
      * Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp.
      
      * Rename cgroup_stats_show() to cgroup_stat_show() for consistency
        with the file name.
      Signed-off-by: Tejun Heo <tj@kernel.org>
  23. 10 Aug, 2017 1 commit
  24. 03 Aug, 2017 6 commits
    • D
      cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() · 89affbf5
      Committed by Dima Zavin
      In codepaths that use the begin/retry interface for reading
      mems_allowed_seq with irqs disabled, there exists a race condition that
      stalls the patch process after only modifying a subset of the
      static_branch call sites.
      
      This problem manifested itself as a deadlock in the slub allocator,
      inside get_any_partial.  The loop reads mems_allowed_seq value (via
      read_mems_allowed_begin), performs the defrag operation, and then
      verifies the consistency of mems_allowed via the read_mems_allowed_retry
      and the cookie returned by xxx_begin.
      
      The issue here is that both begin and retry first check if cpusets are
      enabled via the cpusets_enabled() static branch.  This branch can be
      rewritten dynamically (via cpuset_inc) if a new cpuset is created.  The
      x86 jump label code fully synchronizes across all CPUs for every entry
      it rewrites.  If it rewrites only one of the callsites (specifically the
      one in read_mems_allowed_retry) and then waits for the
      smp_call_function(do_sync_core) to complete while a CPU is inside the
      begin/retry section with IRQs off and the mems_allowed value is changed,
      we can hang.
      
      This is because begin() will always return 0 (since it wasn't patched
      yet) while retry() will test the 0 against the actual value of the seq
      counter.
      
      The fix is to use two different static keys: one for begin
      (pre_enable_key) and one for retry (enable_key).  In cpuset_inc(), we
      first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
      always returns a valid seqcount if we are enabling cpusets.  Similarly, when
      disabling cpusets via cpuset_dec(), we first ensure that callers of
      cpuset_mems_allowed_retry() will start ignoring the seqcount value
      before we let cpuset_mems_allowed_begin() return 0.
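
      A toy Python model of the begin/retry loop may help to show why a single
      key can hang and why the ordering of two keys avoids it. The class, its
      flags and the spin limit are hypothetical stand-ins for patched call
      sites, not kernel interfaces. The model keeps the patch state frozen while
      the loop runs, mimicking a CPU that sits in the loop with IRQs off.

```python
class MemsAllowedModel:
    """Toy stand-in for the seqcount plus the two static-branch call sites."""

    def __init__(self, seq):
        self.seq = seq              # current mems_allowed_seq value
        self.begin_enabled = False  # has the branch in begin() been patched?
        self.retry_enabled = False  # has the branch in retry() been patched?

    def begin(self):
        # Unpatched call site: cpusets look disabled, so the cookie is 0.
        return self.seq if self.begin_enabled else 0

    def retry(self, cookie):
        # Unpatched call site: never ask for a retry.
        return self.retry_enabled and self.seq != cookie


def loop_gets_stuck(model, max_spins=5):
    """Return True if the begin/retry loop never exits within max_spins."""
    for _ in range(max_spins):
        cookie = model.begin()
        if not model.retry(cookie):
            return False
    return True


# Single key, patching stalled after only the retry() site was rewritten:
# begin() keeps handing out 0 while retry() compares against the real seq.
half_patched = MemsAllowedModel(seq=2)
half_patched.retry_enabled = True
stuck = loop_gets_stuck(half_patched)

# Two keys: the begin() side (pre_enable) is switched on first, so the
# cookie is valid even though retry() still ignores it.
pre_enabled = MemsAllowedModel(seq=2)
pre_enabled.begin_enabled = True
ok_midway = loop_gets_stuck(pre_enabled)

# Later the retry() side (enable) follows; cookies are already valid.
pre_enabled.retry_enabled = True
ok_after = loop_gets_stuck(pre_enabled)
```

      In this model only the half-patched single-key state spins; both
      intermediate states of the two-key ordering leave the loop on the first
      pass.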
      
      The relevant stack traces of the two stuck threads:
      
        CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
        RIP: smp_call_function_many+0x1f9/0x260
        Call Trace:
          smp_call_function+0x3b/0x70
          on_each_cpu+0x2f/0x90
          text_poke_bp+0x87/0xd0
          arch_jump_label_transform+0x93/0x100
          __jump_label_update+0x77/0x90
          jump_label_update+0xaa/0xc0
          static_key_slow_inc+0x9e/0xb0
          cpuset_css_online+0x70/0x2e0
          online_css+0x2c/0xa0
          cgroup_apply_control_enable+0x27f/0x3d0
          cgroup_mkdir+0x2b7/0x420
          kernfs_iop_mkdir+0x5a/0x80
          vfs_mkdir+0xf6/0x1a0
          SyS_mkdir+0xb7/0xe0
          entry_SYSCALL_64_fastpath+0x18/0xad
      
        ...
      
        CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8818087c0000 task.stack: ffffc90000030000
        RIP: int3+0x39/0x70
        Call Trace:
          <#DB> ? ___slab_alloc+0x28b/0x5a0
          <EOE> ? copy_process.part.40+0xf7/0x1de0
          __slab_alloc.isra.80+0x54/0x90
          copy_process.part.40+0xf7/0x1de0
          copy_process.part.40+0xf7/0x1de0
          kmem_cache_alloc_node+0x8a/0x280
          copy_process.part.40+0xf7/0x1de0
          _do_fork+0xe7/0x6c0
          _raw_spin_unlock_irq+0x2d/0x60
          trace_hardirqs_on_caller+0x136/0x1d0
          entry_SYSCALL_64_fastpath+0x5/0xad
          do_syscall_64+0x27/0x350
          SyS_clone+0x19/0x20
          do_syscall_64+0x60/0x350
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
      Fixes: 46e700ab ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
      Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
      Reported-by: Cliff Spradlin <cspradlin@waymo.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • T
      cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy · 13d82fb7
      Committed by Tejun Heo
      Each css_set directly points to the default cgroup it belongs to, so
      there's no reason to walk the cgrp_links list on the default
      hierarchy.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • R
      cgroup: re-use the parent pointer in cgroup_destroy_locked() · 5a621e6c
      Committed by Roman Gushchin
      As we already have a pointer to the parent cgroup in
      cgroup_destroy_locked(), we don't need to calculate it again
      to pass as an argument for cgroup1_check_for_release().
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: linux-kernel@vger.kernel.org
    • R
      cgroup: add cgroup.stat interface with basic hierarchy stats · ec39225c
      Committed by Roman Gushchin
      A cgroup can consume resources even after being deleted by a user.
      For example, writing back dirty pages should be accounted and
      limited, even though the corresponding cgroup might contain no processes
      and may already have been deleted by a user.
      
      In the current implementation a cgroup can remain in such a "dying" state
      for an undefined amount of time, for instance if a memory cgroup
      contains a page mlocked by a process belonging to another cgroup.
      
      Although the lifecycle of a dying cgroup is out of the user's control,
      it's important to have some insight into what's going on under the hood.
      
      In particular, it's handy to have a counter which will allow
      to detect css leaks.
      
      To solve this problem, add a cgroup.stat interface to
      the base cgroup control files with the following metrics:
      
      nr_descendants		total number of visible descendant cgroups
      nr_dying_descendants	total number of dying descendant cgroups
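
      As a rough illustration of how a userspace tool might consume these
      counters, the Python sketch below parses text in the form of one
      "key value" pair per line. That layout, the helper name and the sample
      numbers are assumptions made for the example; nothing here reads a real
      cgroup.stat file.

```python
def parse_cgroup_stat(text):
    """Parse 'key value' lines into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:          # skip blank or malformed lines
            stats[parts[0]] = int(parts[1])
    return stats


# Made-up sample contents, not captured from a real system.
sample = "nr_descendants 12\nnr_dying_descendants 3\n"
stats = parse_cgroup_stat(sample)

# A dying count that keeps growing over time may hint at a css leak.
dying = stats["nr_dying_descendants"]
```

      Sampling the dying counter periodically and watching for steady growth is
      one possible way to use it for the leak detection mentioned above.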
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
    • R
      cgroup: implement hierarchy limits · 1a926e0b
      Committed by Roman Gushchin
      Creating cgroup hierarchies of unreasonable size can affect
      overall system performance. A user might want to limit the
      size of a cgroup hierarchy. This is especially important if a user
      is delegating some cgroup sub-tree.
      
      To address this issue, introduce an ability to control
      the size of cgroup hierarchy.
      
      The cgroup.max.descendants control file sets the maximum
      allowed number of descendant cgroups.
      The cgroup.max.depth file controls the maximum depth of the cgroup
      tree. Both are single-value r/w files, with "max" as the default value.
      
      The control files exist on each hierarchy level (including root).
      When a new cgroup is created, we check the total descendants
      and depth limits on each level, and if none of them are exceeded,
      a new cgroup is created.
      
      Only live cgroups are counted; removed (dying) cgroups are
      ignored.
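
      The check described above can be sketched as a small Python model. It is
      not the kernel implementation: the class and helper names are
      hypothetical, infinity stands in for the "max" default, and the way depth
      is counted (a max_depth of 0 forbids any children) is an assumption.

```python
MAX = float("inf")  # stands in for the "max" default value


class Cgroup:
    """Minimal model of a cgroup with per-level hierarchy limits."""

    def __init__(self, parent=None):
        self.parent = parent
        self.max_descendants = MAX
        self.max_depth = MAX
        self.nr_descendants = 0  # live descendants only


def can_create_child(parent):
    """Check the descendant and depth limits on every level up to the root."""
    level = 0  # depth of `parent` below the ancestor being checked
    node = parent
    while node is not None:
        if node.nr_descendants >= node.max_descendants:
            return False
        if level >= node.max_depth:
            return False
        level += 1
        node = node.parent
    return True


def create_child(parent):
    """Create a child if no limit is exceeded, else return None."""
    if not can_create_child(parent):
        return None
    child = Cgroup(parent)
    node = parent
    while node is not None:  # account the new cgroup on every ancestor
        node.nr_descendants += 1
        node = node.parent
    return child


# Descendant limit: the third cgroup under `root` is refused.
root = Cgroup()
root.max_descendants = 2
a = create_child(root)
b = create_child(a)
c = create_child(root)

# Depth limit: only one level is allowed below `shallow`.
shallow = Cgroup()
shallow.max_depth = 1
x = create_child(shallow)
y = create_child(x)
```

      In the model, a limit set on any ancestor refuses creation further down
      the tree, which is what makes such limits useful when a sub-tree is
      delegated.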
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
    • R
      cgroup: keep track of number of descent cgroups · 0679dee0
      Committed by Roman Gushchin
      Keep track of the number of online and dying descendant cgroups.
      
      This data will be used later to add the ability to control the cgroup
      hierarchy (limit the depth and the number of descendant cgroups)
      and to display hierarchy stats.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org