Commit 34a9304a, author: Linus Torvalds

Merge branch 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - cgroup v2 interface is now official.  It's no longer hidden behind a
   devel flag and can be mounted using the new cgroup2 fs type.

   Unfortunately, cpu v2 interface hasn't made it yet due to the
   discussion around in-process hierarchical resource distribution and
   only memory and io controllers can be used on the v2 interface at the
   moment.

 - The existing documentation which has always been a bit of mess is
   relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
   is added as the authoritative documentation for the v2 interface.

 - Some features are added through for-4.5-ancestor-test branch to
   enable netfilter xt_cgroup match to use cgroup v2 paths.  The actual
   netfilter changes will be merged through the net tree which pulled in
   the said branch.

 - Various cleanups

* 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: rename cgroup documentations
  cgroup: fix a typo.
  cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
  cgroup: demote subsystem init messages to KERN_DEBUG
  cgroup: Fix uninitialized variable warning
  cgroup: put controller Kconfig options in meaningful order
  cgroup: clean up the kernel configuration menu nomenclature
  cgroup_pids: fix a typo.
  Subject: cgroup: Fix incomplete dd command in blkio documentation
  cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
  cpuset: Replace all instances of time_t with time64_t
  cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
  cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
  cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type
...@@ -24,7 +24,5 @@ net_prio.txt ...@@ -24,7 +24,5 @@ net_prio.txt
- Network priority cgroups details and usages. - Network priority cgroups details and usages.
pids.txt pids.txt
- Process number cgroups details and usages. - Process number cgroups details and usages.
resource_counter.txt
- Resource Counter API.
unified-hierarchy.txt unified-hierarchy.txt
- Description the new/next cgroup interface. - Description the new/next cgroup interface.
...@@ -84,8 +84,7 @@ Throttling/Upper Limit policy ...@@ -84,8 +84,7 @@ Throttling/Upper Limit policy
- Run dd to read a file and see if rate is throttled to 1MB/s or not. - Run dd to read a file and see if rate is throttled to 1MB/s or not.
# dd if=/mnt/common/zerofile of=/dev/null bs=4K count=1024 # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
# iflag=direct
1024+0 records in 1024+0 records in
1024+0 records out 1024+0 records out
4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s 4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s
...@@ -374,82 +373,3 @@ One can experience an overall throughput drop if you have created multiple ...@@ -374,82 +373,3 @@ One can experience an overall throughput drop if you have created multiple
groups and put applications in that group which are not driving enough groups and put applications in that group which are not driving enough
IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
on individual groups and throughput should improve. on individual groups and throughput should improve.
Writeback
=========
Page cache is dirtied through buffered writes and shared mmaps and
written asynchronously to the backing filesystem by the writeback
mechanism. Writeback sits between the memory and IO domains and
regulates the proportion of dirty memory by balancing dirtying and
write IOs.
On traditional cgroup hierarchies, relationships between different
controllers cannot be established making it impossible for writeback
to operate accounting for cgroup resource restrictions and all
writeback IOs are attributed to the root cgroup.
If both the blkio and memory controllers are used on the v2 hierarchy
and the filesystem supports cgroup writeback, writeback operations
correctly follow the resource restrictions imposed by both memory and
blkio controllers.
Writeback examines both system-wide and per-cgroup dirty memory status
and enforces the more restrictive of the two. Also, writeback control
parameters which are absolute values - vm.dirty_bytes and
vm.dirty_background_bytes - are distributed across cgroups according
to their current writeback bandwidth.
There's a peculiarity stemming from the discrepancy in ownership
granularity between memory controller and writeback. While memory
controller tracks ownership per page, writeback operates on inode
basis. cgroup writeback bridges the gap by tracking ownership by
inode but migrating ownership if too many foreign pages, pages which
don't match the current inode ownership, have been encountered while
writing back the inode.
This is a conscious design choice as writeback operations are
inherently tied to inodes making strictly following page ownership
complicated and inefficient. The only use case which suffers from
this compromise is multiple cgroups concurrently dirtying disjoint
regions of the same inode, which is an unlikely use case and decided
to be unsupported. Note that as memory controller assigns page
ownership on the first use and doesn't update it until the page is
released, even if cgroup writeback strictly follows page ownership,
multiple cgroups dirtying overlapping areas wouldn't work as expected.
In general, write-sharing an inode across multiple cgroups is not well
supported.
Filesystem support for cgroup writeback
---------------------------------------
A filesystem can make writeback IOs cgroup-aware by updating
address_space_operations->writepage[s]() to annotate bio's using the
following two functions.
* wbc_init_bio(@wbc, @bio)
Should be called for each bio carrying writeback data and associates
the bio with the inode's owner cgroup. Can be called anytime
between bio allocation and submission.
* wbc_account_io(@wbc, @page, @bytes)
Should be called for each data segment being written out. While
this function doesn't care exactly when it's called during the
writeback session, it's the easiest and most natural to call it as
data segments are added to a bio.
With writeback bio's annotated, cgroup support can be enabled per
super_block by setting MS_CGROUPWB in ->s_flags. This allows for
selective disabling of cgroup writeback support which is helpful when
certain filesystem features, e.g. journaled data mode, are
incompatible.
wbc_init_bio() binds the specified bio to its cgroup. Depending on
the configuration, the bio may be executed at a lower priority and if
the writeback session is holding shared resources, e.g. a journal
entry, may lead to priority inversion. There is no one easy solution
for the problem. Filesystems can try to work around specific problem
cases by skipping wbc_init_bio() or using bio_associate_blkcg()
directly.
(Two file diffs are collapsed here.)
...@@ -34,17 +34,12 @@ struct seq_file; ...@@ -34,17 +34,12 @@ struct seq_file;
/* define the enumeration of all cgroup subsystems */ /* define the enumeration of all cgroup subsystems */
#define SUBSYS(_x) _x ## _cgrp_id, #define SUBSYS(_x) _x ## _cgrp_id,
#define SUBSYS_TAG(_t) CGROUP_ ## _t, \
__unused_tag_ ## _t = CGROUP_ ## _t - 1,
enum cgroup_subsys_id { enum cgroup_subsys_id {
#include <linux/cgroup_subsys.h> #include <linux/cgroup_subsys.h>
CGROUP_SUBSYS_COUNT, CGROUP_SUBSYS_COUNT,
}; };
#undef SUBSYS_TAG
#undef SUBSYS #undef SUBSYS
#define CGROUP_CANFORK_COUNT (CGROUP_CANFORK_END - CGROUP_CANFORK_START)
/* bits in struct cgroup_subsys_state flags field */ /* bits in struct cgroup_subsys_state flags field */
enum { enum {
CSS_NO_REF = (1 << 0), /* no reference counting for this css */ CSS_NO_REF = (1 << 0), /* no reference counting for this css */
...@@ -66,7 +61,6 @@ enum { ...@@ -66,7 +61,6 @@ enum {
/* cgroup_root->flags */ /* cgroup_root->flags */
enum { enum {
CGRP_ROOT_SANE_BEHAVIOR = (1 << 0), /* __DEVEL__sane_behavior specified */
CGRP_ROOT_NOPREFIX = (1 << 1), /* mounted subsystems have no named prefix */ CGRP_ROOT_NOPREFIX = (1 << 1), /* mounted subsystems have no named prefix */
CGRP_ROOT_XATTR = (1 << 2), /* supports extended attributes */ CGRP_ROOT_XATTR = (1 << 2), /* supports extended attributes */
}; };
...@@ -439,9 +433,9 @@ struct cgroup_subsys { ...@@ -439,9 +433,9 @@ struct cgroup_subsys {
int (*can_attach)(struct cgroup_taskset *tset); int (*can_attach)(struct cgroup_taskset *tset);
void (*cancel_attach)(struct cgroup_taskset *tset); void (*cancel_attach)(struct cgroup_taskset *tset);
void (*attach)(struct cgroup_taskset *tset); void (*attach)(struct cgroup_taskset *tset);
int (*can_fork)(struct task_struct *task, void **priv_p); int (*can_fork)(struct task_struct *task);
void (*cancel_fork)(struct task_struct *task, void *priv); void (*cancel_fork)(struct task_struct *task);
void (*fork)(struct task_struct *task, void *priv); void (*fork)(struct task_struct *task);
void (*exit)(struct task_struct *task); void (*exit)(struct task_struct *task);
void (*free)(struct task_struct *task); void (*free)(struct task_struct *task);
void (*bind)(struct cgroup_subsys_state *root_css); void (*bind)(struct cgroup_subsys_state *root_css);
...@@ -527,7 +521,6 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) ...@@ -527,7 +521,6 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
#else /* CONFIG_CGROUPS */ #else /* CONFIG_CGROUPS */
#define CGROUP_CANFORK_COUNT 0
#define CGROUP_SUBSYS_COUNT 0 #define CGROUP_SUBSYS_COUNT 0
static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) {} static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) {}
......
...@@ -97,12 +97,9 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns, ...@@ -97,12 +97,9 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *tsk); struct pid *pid, struct task_struct *tsk);
void cgroup_fork(struct task_struct *p); void cgroup_fork(struct task_struct *p);
extern int cgroup_can_fork(struct task_struct *p, extern int cgroup_can_fork(struct task_struct *p);
void *ss_priv[CGROUP_CANFORK_COUNT]); extern void cgroup_cancel_fork(struct task_struct *p);
extern void cgroup_cancel_fork(struct task_struct *p, extern void cgroup_post_fork(struct task_struct *p);
void *ss_priv[CGROUP_CANFORK_COUNT]);
extern void cgroup_post_fork(struct task_struct *p,
void *old_ss_priv[CGROUP_CANFORK_COUNT]);
void cgroup_exit(struct task_struct *p); void cgroup_exit(struct task_struct *p);
void cgroup_free(struct task_struct *p); void cgroup_free(struct task_struct *p);
...@@ -562,13 +559,9 @@ static inline int cgroupstats_build(struct cgroupstats *stats, ...@@ -562,13 +559,9 @@ static inline int cgroupstats_build(struct cgroupstats *stats,
struct dentry *dentry) { return -EINVAL; } struct dentry *dentry) { return -EINVAL; }
static inline void cgroup_fork(struct task_struct *p) {} static inline void cgroup_fork(struct task_struct *p) {}
static inline int cgroup_can_fork(struct task_struct *p, static inline int cgroup_can_fork(struct task_struct *p) { return 0; }
void *ss_priv[CGROUP_CANFORK_COUNT]) static inline void cgroup_cancel_fork(struct task_struct *p) {}
{ return 0; } static inline void cgroup_post_fork(struct task_struct *p) {}
static inline void cgroup_cancel_fork(struct task_struct *p,
void *ss_priv[CGROUP_CANFORK_COUNT]) {}
static inline void cgroup_post_fork(struct task_struct *p,
void *ss_priv[CGROUP_CANFORK_COUNT]) {}
static inline void cgroup_exit(struct task_struct *p) {} static inline void cgroup_exit(struct task_struct *p) {}
static inline void cgroup_free(struct task_struct *p) {} static inline void cgroup_free(struct task_struct *p) {}
......
...@@ -6,14 +6,8 @@ ...@@ -6,14 +6,8 @@
/* /*
* This file *must* be included with SUBSYS() defined. * This file *must* be included with SUBSYS() defined.
* SUBSYS_TAG() is a noop if undefined.
*/ */
#ifndef SUBSYS_TAG
#define __TMP_SUBSYS_TAG
#define SUBSYS_TAG(_x)
#endif
#if IS_ENABLED(CONFIG_CPUSETS) #if IS_ENABLED(CONFIG_CPUSETS)
SUBSYS(cpuset) SUBSYS(cpuset)
#endif #endif
...@@ -58,17 +52,10 @@ SUBSYS(net_prio) ...@@ -58,17 +52,10 @@ SUBSYS(net_prio)
SUBSYS(hugetlb) SUBSYS(hugetlb)
#endif #endif
/*
* Subsystems that implement the can_fork() family of callbacks.
*/
SUBSYS_TAG(CANFORK_START)
#if IS_ENABLED(CONFIG_CGROUP_PIDS) #if IS_ENABLED(CONFIG_CGROUP_PIDS)
SUBSYS(pids) SUBSYS(pids)
#endif #endif
SUBSYS_TAG(CANFORK_END)
/* /*
* The following subsystems are not supported on the default hierarchy. * The following subsystems are not supported on the default hierarchy.
*/ */
...@@ -76,11 +63,6 @@ SUBSYS_TAG(CANFORK_END) ...@@ -76,11 +63,6 @@ SUBSYS_TAG(CANFORK_END)
SUBSYS(debug) SUBSYS(debug)
#endif #endif
#ifdef __TMP_SUBSYS_TAG
#undef __TMP_SUBSYS_TAG
#undef SUBSYS_TAG
#endif
/* /*
* DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS. * DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS.
*/ */
...@@ -54,6 +54,7 @@ ...@@ -54,6 +54,7 @@
#define SMB_SUPER_MAGIC 0x517B #define SMB_SUPER_MAGIC 0x517B
#define CGROUP_SUPER_MAGIC 0x27e0eb #define CGROUP_SUPER_MAGIC 0x27e0eb
#define CGROUP2_SUPER_MAGIC 0x63677270
#define STACK_END_MAGIC 0x57AC6E9D #define STACK_END_MAGIC 0x57AC6E9D
......
...@@ -940,95 +940,24 @@ menuconfig CGROUPS ...@@ -940,95 +940,24 @@ menuconfig CGROUPS
if CGROUPS if CGROUPS
config CGROUP_DEBUG
bool "Example debug cgroup subsystem"
default n
help
This option enables a simple cgroup subsystem that
exports useful debugging information about the cgroups
framework.
Say N if unsure.
config CGROUP_FREEZER
bool "Freezer cgroup subsystem"
help
Provides a way to freeze and unfreeze all tasks in a
cgroup.
config CGROUP_PIDS
bool "PIDs cgroup subsystem"
help
Provides enforcement of process number limits in the scope of a
cgroup. Any attempt to fork more processes than is allowed in the
cgroup will fail. PIDs are fundamentally a global resource because it
is fairly trivial to reach PID exhaustion before you reach even a
conservative kmemcg limit. As a result, it is possible to grind a
system to halt without being limited by other cgroup policies. The
PIDs cgroup subsystem is designed to stop this from happening.
It should be noted that organisational operations (such as attaching
to a cgroup hierarchy will *not* be blocked by the PIDs subsystem),
since the PIDs limit only affects a process's ability to fork, not to
attach to a cgroup.
config CGROUP_DEVICE
bool "Device controller for cgroups"
help
Provides a cgroup implementing whitelists for devices which
a process in the cgroup can mknod or open.
config CPUSETS
bool "Cpuset support"
help
This option will let you create and manage CPUSETs which
allow dynamically partitioning a system into sets of CPUs and
Memory Nodes and assigning tasks to run only within those sets.
This is primarily useful on large SMP or NUMA systems.
Say N if unsure.
config PROC_PID_CPUSET
bool "Include legacy /proc/<pid>/cpuset file"
depends on CPUSETS
default y
config CGROUP_CPUACCT
bool "Simple CPU accounting cgroup subsystem"
help
Provides a simple Resource Controller for monitoring the
total CPU consumed by the tasks in a cgroup.
config PAGE_COUNTER config PAGE_COUNTER
bool bool
config MEMCG config MEMCG
bool "Memory Resource Controller for Control Groups" bool "Memory controller"
select PAGE_COUNTER select PAGE_COUNTER
select EVENTFD select EVENTFD
help help
Provides a memory resource controller that manages both anonymous Provides control over the memory footprint of tasks in a cgroup.
memory and page cache. (See Documentation/cgroups/memory.txt)
config MEMCG_SWAP config MEMCG_SWAP
bool "Memory Resource Controller Swap Extension" bool "Swap controller"
depends on MEMCG && SWAP depends on MEMCG && SWAP
help help
Add swap management feature to memory resource controller. When you Provides control over the swap space consumed by tasks in a cgroup.
enable this, you can limit mem+swap usage per cgroup. In other words,
when you disable this, memory resource controller has no cares to
usage of swap...a process can exhaust all of the swap. This extension
is useful when you want to avoid exhaustion swap but this itself
adds more overheads and consumes memory for remembering information.
Especially if you use 32bit system or small memory system, please
be careful about enabling this. When memory resource controller
is disabled by boot option, this will be automatically disabled and
there will be no overhead from this. Even when you set this config=y,
if boot option "swapaccount=0" is set, swap will not be accounted.
Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
size is 4096bytes, 512k per 1Gbytes of swap.
config MEMCG_SWAP_ENABLED config MEMCG_SWAP_ENABLED
bool "Memory Resource Controller Swap Extension enabled by default" bool "Swap controller enabled by default"
depends on MEMCG_SWAP depends on MEMCG_SWAP
default y default y
help help
...@@ -1052,34 +981,43 @@ config MEMCG_KMEM ...@@ -1052,34 +981,43 @@ config MEMCG_KMEM
the kmem extension can use it to guarantee that no group of processes the kmem extension can use it to guarantee that no group of processes
will ever exhaust kernel resources alone. will ever exhaust kernel resources alone.
config CGROUP_HUGETLB config BLK_CGROUP
bool "HugeTLB Resource Controller for Control Groups" bool "IO controller"
depends on HUGETLB_PAGE depends on BLOCK
select PAGE_COUNTER
default n default n
help ---help---
Provides a cgroup Resource Controller for HugeTLB pages. Generic block IO controller cgroup interface. This is the common
When you enable this, you can put a per cgroup limit on HugeTLB usage. cgroup interface which should be used by various IO controlling
The limit is enforced during page fault. Since HugeTLB doesn't policies.
support page reclaim, enforcing the limit at page fault time implies
that, the application will get SIGBUS signal if it tries to access
HugeTLB pages beyond its limit. This requires the application to know
beforehand how much HugeTLB pages it would require for its use. The
control group is tracked in the third page lru pointer. This means
that we cannot use the controller with huge page less than 3 pages.
config CGROUP_PERF Currently, CFQ IO scheduler uses it to recognize task groups and
bool "Enable perf_event per-cpu per-container group (cgroup) monitoring" control disk bandwidth allocation (proportional time slice allocation)
depends on PERF_EVENTS && CGROUPS to such task groups. It is also used by bio throttling logic in
help block layer to implement upper limit in IO rates on a device.
This option extends the per-cpu mode to restrict monitoring to
threads which belong to the cgroup specified and run on the
designated cpu.
Say N if unsure. This option only enables generic Block IO controller infrastructure.
One needs to also enable actual IO controlling logic/policy. For
enabling proportional weight division of disk bandwidth in CFQ, set
CONFIG_CFQ_GROUP_IOSCHED=y; for enabling throttling policy, set
CONFIG_BLK_DEV_THROTTLING=y.
See Documentation/cgroups/blkio-controller.txt for more information.
config DEBUG_BLK_CGROUP
bool "IO controller debugging"
depends on BLK_CGROUP
default n
---help---
Enable some debugging help. Currently it exports additional stat
files in a cgroup which can be useful for debugging.
config CGROUP_WRITEBACK
bool
depends on MEMCG && BLK_CGROUP
default y
menuconfig CGROUP_SCHED menuconfig CGROUP_SCHED
bool "Group CPU scheduler" bool "CPU controller"
default n default n
help help
This feature lets CPU scheduler recognize task groups and control CPU This feature lets CPU scheduler recognize task groups and control CPU
...@@ -1116,40 +1054,89 @@ config RT_GROUP_SCHED ...@@ -1116,40 +1054,89 @@ config RT_GROUP_SCHED
endif #CGROUP_SCHED endif #CGROUP_SCHED
config BLK_CGROUP config CGROUP_PIDS
bool "Block IO controller" bool "PIDs controller"
depends on BLOCK help
Provides enforcement of process number limits in the scope of a
cgroup. Any attempt to fork more processes than is allowed in the
cgroup will fail. PIDs are fundamentally a global resource because it
is fairly trivial to reach PID exhaustion before you reach even a
conservative kmemcg limit. As a result, it is possible to grind a
system to halt without being limited by other cgroup policies. The
PIDs cgroup subsystem is designed to stop this from happening.
It should be noted that organisational operations (such as attaching
to a cgroup hierarchy will *not* be blocked by the PIDs subsystem),
since the PIDs limit only affects a process's ability to fork, not to
attach to a cgroup.
config CGROUP_FREEZER
bool "Freezer controller"
help
Provides a way to freeze and unfreeze all tasks in a
cgroup.
config CGROUP_HUGETLB
bool "HugeTLB controller"
depends on HUGETLB_PAGE
select PAGE_COUNTER
default n default n
---help--- help
Generic block IO controller cgroup interface. This is the common Provides a cgroup controller for HugeTLB pages.
cgroup interface which should be used by various IO controlling When you enable this, you can put a per cgroup limit on HugeTLB usage.
policies. The limit is enforced during page fault. Since HugeTLB doesn't
support page reclaim, enforcing the limit at page fault time implies
that, the application will get SIGBUS signal if it tries to access
HugeTLB pages beyond its limit. This requires the application to know
beforehand how much HugeTLB pages it would require for its use. The
control group is tracked in the third page lru pointer. This means
that we cannot use the controller with huge page less than 3 pages.
Currently, CFQ IO scheduler uses it to recognize task groups and config CPUSETS
control disk bandwidth allocation (proportional time slice allocation) bool "Cpuset controller"
to such task groups. It is also used by bio throttling logic in help
block layer to implement upper limit in IO rates on a device. This option will let you create and manage CPUSETs which
allow dynamically partitioning a system into sets of CPUs and
Memory Nodes and assigning tasks to run only within those sets.
This is primarily useful on large SMP or NUMA systems.
This option only enables generic Block IO controller infrastructure. Say N if unsure.
One needs to also enable actual IO controlling logic/policy. For
enabling proportional weight division of disk bandwidth in CFQ, set
CONFIG_CFQ_GROUP_IOSCHED=y; for enabling throttling policy, set
CONFIG_BLK_DEV_THROTTLING=y.
See Documentation/cgroups/blkio-controller.txt for more information. config PROC_PID_CPUSET
bool "Include legacy /proc/<pid>/cpuset file"
depends on CPUSETS
default y
config DEBUG_BLK_CGROUP config CGROUP_DEVICE
bool "Enable Block IO controller debugging" bool "Device controller"
depends on BLK_CGROUP help
Provides a cgroup controller implementing whitelists for
devices which a process in the cgroup can mknod or open.
config CGROUP_CPUACCT
bool "Simple CPU accounting controller"
help
Provides a simple controller for monitoring the
total CPU consumed by the tasks in a cgroup.
config CGROUP_PERF
bool "Perf controller"
depends on PERF_EVENTS
help
This option extends the perf per-cpu mode to restrict monitoring
to threads which belong to the cgroup specified and run on the
designated cpu.
Say N if unsure.
config CGROUP_DEBUG
bool "Example controller"
default n default n
---help--- help
Enable some debugging help. Currently it exports additional stat This option enables a simple controller that exports
files in a cgroup which can be useful for debugging. debugging information about the cgroups framework.
config CGROUP_WRITEBACK Say N.
bool
depends on MEMCG && BLK_CGROUP
default y
endif # CGROUPS endif # CGROUPS
......
...@@ -211,6 +211,7 @@ static unsigned long have_free_callback __read_mostly; ...@@ -211,6 +211,7 @@ static unsigned long have_free_callback __read_mostly;
/* Ditto for the can_fork callback. */ /* Ditto for the can_fork callback. */
static unsigned long have_canfork_callback __read_mostly; static unsigned long have_canfork_callback __read_mostly;
static struct file_system_type cgroup2_fs_type;
static struct cftype cgroup_dfl_base_files[]; static struct cftype cgroup_dfl_base_files[];
static struct cftype cgroup_legacy_base_files[]; static struct cftype cgroup_legacy_base_files[];
...@@ -1623,10 +1624,6 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts) ...@@ -1623,10 +1624,6 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
all_ss = true; all_ss = true;
continue; continue;
} }
if (!strcmp(token, "__DEVEL__sane_behavior")) {
opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
continue;
}
if (!strcmp(token, "noprefix")) { if (!strcmp(token, "noprefix")) {
opts->flags |= CGRP_ROOT_NOPREFIX; opts->flags |= CGRP_ROOT_NOPREFIX;
continue; continue;
...@@ -1693,15 +1690,6 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts) ...@@ -1693,15 +1690,6 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
return -ENOENT; return -ENOENT;
} }
if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
if (nr_opts != 1) {
pr_err("sane_behavior: no other mount options allowed\n");
return -EINVAL;
}
return 0;
}
/* /*
* If the 'all' option was specified select all the subsystems, * If the 'all' option was specified select all the subsystems,
* otherwise if 'none', 'name=' and a subsystem name options were * otherwise if 'none', 'name=' and a subsystem name options were
...@@ -1981,6 +1969,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, ...@@ -1981,6 +1969,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int flags, const char *unused_dev_name, int flags, const char *unused_dev_name,
void *data) void *data)
{ {
bool is_v2 = fs_type == &cgroup2_fs_type;
struct super_block *pinned_sb = NULL; struct super_block *pinned_sb = NULL;
struct cgroup_subsys *ss; struct cgroup_subsys *ss;
struct cgroup_root *root; struct cgroup_root *root;
...@@ -1997,6 +1986,17 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, ...@@ -1997,6 +1986,17 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
if (!use_task_css_set_links) if (!use_task_css_set_links)
cgroup_enable_task_cg_lists(); cgroup_enable_task_cg_lists();
if (is_v2) {
if (data) {
pr_err("cgroup2: unknown option \"%s\"\n", (char *)data);
return ERR_PTR(-EINVAL);
}
cgrp_dfl_root_visible = true;
root = &cgrp_dfl_root;
cgroup_get(&root->cgrp);
goto out_mount;
}
mutex_lock(&cgroup_mutex); mutex_lock(&cgroup_mutex);
/* First find the desired set of subsystems */ /* First find the desired set of subsystems */
...@@ -2004,15 +2004,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, ...@@ -2004,15 +2004,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
if (ret) if (ret)
goto out_unlock; goto out_unlock;
/* look for a matching existing root */
if (opts.flags & CGRP_ROOT_SANE_BEHAVIOR) {
cgrp_dfl_root_visible = true;
root = &cgrp_dfl_root;
cgroup_get(&root->cgrp);
ret = 0;
goto out_unlock;
}
/* /*
* Destruction of cgroup root is asynchronous, so subsystems may * Destruction of cgroup root is asynchronous, so subsystems may
* still be dying after the previous unmount. Let's drain the * still be dying after the previous unmount. Let's drain the
...@@ -2123,9 +2114,10 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, ...@@ -2123,9 +2114,10 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
if (ret) if (ret)
return ERR_PTR(ret); return ERR_PTR(ret);
out_mount:
dentry = kernfs_mount(fs_type, flags, root->kf_root, dentry = kernfs_mount(fs_type, flags, root->kf_root,
CGROUP_SUPER_MAGIC, &new_sb); is_v2 ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC,
&new_sb);
if (IS_ERR(dentry) || !new_sb) if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp); cgroup_put(&root->cgrp);
...@@ -2168,6 +2160,12 @@ static struct file_system_type cgroup_fs_type = { ...@@ -2168,6 +2160,12 @@ static struct file_system_type cgroup_fs_type = {
.kill_sb = cgroup_kill_sb, .kill_sb = cgroup_kill_sb,
}; };
static struct file_system_type cgroup2_fs_type = {
.name = "cgroup2",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
};
/** /**
* task_cgroup_path - cgroup path of a task in the first cgroup hierarchy * task_cgroup_path - cgroup path of a task in the first cgroup hierarchy
* @task: target task * @task: target task
...@@ -4039,7 +4037,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) ...@@ -4039,7 +4037,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
goto out_err; goto out_err;
/* /*
* Migrate tasks one-by-one until @form is empty. This fails iff * Migrate tasks one-by-one until @from is empty. This fails iff
* ->can_attach() fails. * ->can_attach() fails.
*/ */
do { do {
...@@ -5171,7 +5169,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early) ...@@ -5171,7 +5169,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
{ {
struct cgroup_subsys_state *css; struct cgroup_subsys_state *css;
printk(KERN_INFO "Initializing cgroup subsys %s\n", ss->name); pr_debug("Initializing cgroup subsys %s\n", ss->name);
mutex_lock(&cgroup_mutex); mutex_lock(&cgroup_mutex);
...@@ -5329,6 +5327,7 @@ int __init cgroup_init(void) ...@@ -5329,6 +5327,7 @@ int __init cgroup_init(void)
WARN_ON(sysfs_create_mount_point(fs_kobj, "cgroup")); WARN_ON(sysfs_create_mount_point(fs_kobj, "cgroup"));
WARN_ON(register_filesystem(&cgroup_fs_type)); WARN_ON(register_filesystem(&cgroup_fs_type));
WARN_ON(register_filesystem(&cgroup2_fs_type));
 	WARN_ON(!proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations));
 	return 0;
@@ -5472,19 +5471,6 @@ static const struct file_operations proc_cgroupstats_operations = {
 	.release = single_release,
 };
-static void **subsys_canfork_priv_p(void *ss_priv[CGROUP_CANFORK_COUNT], int i)
-{
-	if (CGROUP_CANFORK_START <= i && i < CGROUP_CANFORK_END)
-		return &ss_priv[i - CGROUP_CANFORK_START];
-	return NULL;
-}
-static void *subsys_canfork_priv(void *ss_priv[CGROUP_CANFORK_COUNT], int i)
-{
-	void **private = subsys_canfork_priv_p(ss_priv, i);
-	return private ? *private : NULL;
-}
 /**
  * cgroup_fork - initialize cgroup related fields during copy_process()
  * @child: pointer to task_struct of forking parent process.
@@ -5507,14 +5493,13 @@ void cgroup_fork(struct task_struct *child)
  * returns an error, the fork aborts with that error code. This allows for
  * a cgroup subsystem to conditionally allow or deny new forks.
  */
-int cgroup_can_fork(struct task_struct *child,
-		    void *ss_priv[CGROUP_CANFORK_COUNT])
+int cgroup_can_fork(struct task_struct *child)
 {
 	struct cgroup_subsys *ss;
 	int i, j, ret;
 	for_each_subsys_which(ss, i, &have_canfork_callback) {
-		ret = ss->can_fork(child, subsys_canfork_priv_p(ss_priv, i));
+		ret = ss->can_fork(child);
 		if (ret)
 			goto out_revert;
 	}
@@ -5526,7 +5511,7 @@ int cgroup_can_fork(struct task_struct *child,
 		if (j >= i)
 			break;
 		if (ss->cancel_fork)
-			ss->cancel_fork(child, subsys_canfork_priv(ss_priv, j));
+			ss->cancel_fork(child);
 	}
 	return ret;
@@ -5539,15 +5524,14 @@ int cgroup_can_fork(struct task_struct *child,
  * This calls the cancel_fork() callbacks if a fork failed *after*
  * cgroup_can_fork() succeded.
  */
-void cgroup_cancel_fork(struct task_struct *child,
-			void *ss_priv[CGROUP_CANFORK_COUNT])
+void cgroup_cancel_fork(struct task_struct *child)
 {
 	struct cgroup_subsys *ss;
 	int i;
 	for_each_subsys(ss, i)
 		if (ss->cancel_fork)
-			ss->cancel_fork(child, subsys_canfork_priv(ss_priv, i));
+			ss->cancel_fork(child);
 }
 /**
@@ -5560,8 +5544,7 @@ void cgroup_cancel_fork(struct task_struct *child,
  * cgroup_task_iter_start() - to guarantee that the new task ends up on its
  * list.
  */
-void cgroup_post_fork(struct task_struct *child,
-		      void *old_ss_priv[CGROUP_CANFORK_COUNT])
+void cgroup_post_fork(struct task_struct *child)
 {
 	struct cgroup_subsys *ss;
 	int i;
@@ -5605,7 +5588,7 @@ void cgroup_post_fork(struct task_struct *child,
 	 * and addition to css_set.
 	 */
 	for_each_subsys_which(ss, i, &have_fork_callback)
-		ss->fork(child, subsys_canfork_priv(old_ss_priv, i));
+		ss->fork(child);
 }
 /**
......
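Review note: the revert path that cgroup_can_fork() keeps after this change can be illustrated outside the kernel. The following is a minimal user-space sketch, not kernel code. `struct subsys`, `charge`, `uncharge`, `deny`, and `can_fork_all` are hypothetical names invented for illustration, and a single shared `int` stands in for per-controller state. It shows that callbacks now take no private pointer and that only the entries which already succeeded are cancelled.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cgroup_subsys; not the kernel type. */
struct subsys {
	int  (*can_fork)(int *state);    /* 0 allows the fork, negative errno denies it */
	void (*cancel_fork)(int *state); /* undoes a successful can_fork(); may be NULL */
};

/* Models a controller that charges one unit and allows the fork. */
static int charge(int *state)
{
	(*state)++;
	return 0;
}

/* Models the matching cancel_fork() that returns the unit. */
static void uncharge(int *state)
{
	(*state)--;
}

/* Models a controller that refuses the fork, as a limit check would. */
static int deny(int *state)
{
	(void)state;
	return -EAGAIN;
}

/* Ask every entry in order; on the first failure, undo the earlier ones. */
static int can_fork_all(const struct subsys *ss, size_t n, int *state)
{
	size_t i, j;
	int ret = 0;

	for (i = 0; i < n; i++) {
		if (!ss[i].can_fork)
			continue;
		ret = ss[i].can_fork(state);
		if (ret)
			goto out_revert;
	}
	return 0;

out_revert:
	/* Entries at index i and later never succeeded, so they are skipped. */
	for (j = 0; j < i; j++)
		if (ss[j].cancel_fork)
			ss[j].cancel_fork(state);
	return ret;
}
```

Because no per-call private array is threaded through, any state a callback needs has to be reachable from its argument alone, which is what the removed `ss_priv` plumbing used to work around.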
@@ -200,7 +200,7 @@ static void freezer_attach(struct cgroup_taskset *tset)
  * to do anything as freezer_attach() will put @task into the appropriate
  * state.
  */
-static void freezer_fork(struct task_struct *task, void *private)
+static void freezer_fork(struct task_struct *task)
 {
 	struct freezer *freezer;
......
@@ -134,7 +134,7 @@ static void pids_charge(struct pids_cgroup *pids, int num)
  *
  * This function follows the set limit. It will fail if the charge would cause
  * the new value to exceed the hierarchical limit. Returns 0 if the charge
- * succeded, otherwise -EAGAIN.
+ * succeeded, otherwise -EAGAIN.
  */
 static int pids_try_charge(struct pids_cgroup *pids, int num)
 {
@@ -209,7 +209,7 @@ static void pids_cancel_attach(struct cgroup_taskset *tset)
  * task_css_check(true) in pids_can_fork() and pids_cancel_fork() relies
  * on threadgroup_change_begin() held by the copy_process().
  */
-static int pids_can_fork(struct task_struct *task, void **priv_p)
+static int pids_can_fork(struct task_struct *task)
 {
 	struct cgroup_subsys_state *css;
 	struct pids_cgroup *pids;
@@ -219,7 +219,7 @@ static int pids_can_fork(struct task_struct *task, void **priv_p)
 	return pids_try_charge(pids, 1);
 }
-static void pids_cancel_fork(struct task_struct *task, void *priv)
+static void pids_cancel_fork(struct task_struct *task)
 {
 	struct cgroup_subsys_state *css;
 	struct pids_cgroup *pids;
......
@@ -51,6 +51,7 @@
 #include <linux/stat.h>
 #include <linux/string.h>
 #include <linux/time.h>
+#include <linux/time64.h>
 #include <linux/backing-dev.h>
 #include <linux/sort.h>
@@ -68,7 +69,7 @@ struct static_key cpusets_enabled_key __read_mostly = STATIC_KEY_INIT_FALSE;
 struct fmeter {
 	int cnt; /* unprocessed events count */
 	int val; /* most recent output value */
-	time_t time; /* clock (secs) when val computed */
+	time64_t time; /* clock (secs) when val computed */
 	spinlock_t lock; /* guards read or write of above */
 };
@@ -1374,7 +1375,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
  */
 #define FM_COEF 933 /* coefficient for half-life of 10 secs */
-#define FM_MAXTICKS ((time_t)99) /* useless computing more ticks than this */
+#define FM_MAXTICKS ((u32)99) /* useless computing more ticks than this */
 #define FM_MAXCNT 1000000 /* limit cnt to avoid overflow */
 #define FM_SCALE 1000 /* faux fixed point scale */
@@ -1390,8 +1391,11 @@ static void fmeter_init(struct fmeter *fmp)
 /* Internal meter update - process cnt events and update value */
 static void fmeter_update(struct fmeter *fmp)
 {
-	time_t now = get_seconds();
-	time_t ticks = now - fmp->time;
+	time64_t now;
+	u32 ticks;
+	now = ktime_get_seconds();
+	ticks = now - fmp->time;
 	if (ticks == 0)
 		return;
......
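Review note: the type split in fmeter_update() (a 64-bit timestamp, a 32-bit tick delta) can be tried in user space. This is a rough sketch, not the kernel function. `struct fmeter_sketch` and `fmeter_sketch_update` are hypothetical names, `int64_t` and `uint32_t` stand in for `time64_t` and `u32`, the lock is omitted, and `now` is passed in rather than read from a clock. The clamp to FM_MAXTICKS and the per-tick decay loop are not visible in the hunk above, so they are assumptions based on the constants' comments.

```c
#include <stdint.h>

#define FM_COEF     933                 /* coefficient for half-life of 10 secs */
#define FM_MAXTICKS ((uint32_t)99)      /* useless computing more ticks than this */
#define FM_SCALE    1000                /* faux fixed point scale */

/* Hypothetical user-space mirror of struct fmeter, without the spinlock. */
struct fmeter_sketch {
	int     cnt;   /* unprocessed events count */
	int     val;   /* most recent output value */
	int64_t time;  /* clock (secs) when val computed; 64-bit like time64_t */
};

static void fmeter_sketch_update(struct fmeter_sketch *fmp, int64_t now)
{
	/* The delta is small in practice, so a 32-bit tick count is enough. */
	uint32_t ticks = (uint32_t)(now - fmp->time);

	if (ticks == 0)
		return;

	/* Assumed: cap the number of decay steps (not shown in the hunk). */
	if (ticks > FM_MAXTICKS)
		ticks = FM_MAXTICKS;

	/* Assumed: decay the old value once per elapsed second. */
	while (ticks-- > 0)
		fmp->val = (FM_COEF * fmp->val) / FM_SCALE;

	fmp->time = now;

	/* Fold in the events counted since the last update. */
	fmp->val += ((FM_SCALE - FM_COEF) * fmp->cnt) / FM_SCALE;
	fmp->cnt = 0;
}
```

Keeping only the stored timestamp at 64 bits means values beyond the 32-bit `time_t` range still produce a correct small delta, while the arithmetic on `ticks` stays cheap.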
@@ -1250,7 +1250,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 {
 	int retval;
 	struct task_struct *p;
-	void *cgrp_ss_priv[CGROUP_CANFORK_COUNT] = {};
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1527,7 +1526,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 * between here and cgroup_post_fork() if an organisation operation is in
 	 * progress.
 	 */
-	retval = cgroup_can_fork(p, cgrp_ss_priv);
+	retval = cgroup_can_fork(p);
 	if (retval)
 		goto bad_fork_free_pid;
@@ -1609,7 +1608,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
-	cgroup_post_fork(p, cgrp_ss_priv);
+	cgroup_post_fork(p);
 	threadgroup_change_end(current);
 	perf_event_fork(p);
@@ -1619,7 +1618,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	return p;
 bad_fork_cancel_cgroup:
-	cgroup_cancel_fork(p, cgrp_ss_priv);
+	cgroup_cancel_fork(p);
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
......
@@ -8342,7 +8342,7 @@ static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
 	sched_offline_group(tg);
 }
-static void cpu_cgroup_fork(struct task_struct *task, void *private)
+static void cpu_cgroup_fork(struct task_struct *task)
 {
 	sched_move_task(task);
 }
......
@@ -4813,7 +4813,7 @@ static void mem_cgroup_clear_mc(void)
 static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
 {
 	struct cgroup_subsys_state *css;
-	struct mem_cgroup *memcg;
+	struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */
 	struct mem_cgroup *from;
 	struct task_struct *leader, *p;
 	struct mm_struct *mm;
......