diff --git a/Documentation/accounting/taskstats.txt b/Documentation/accounting/taskstats.txt new file mode 100644 index 0000000000000000000000000000000000000000..ad9b6997e162de243183c83e50c3cbc4562f71f2 --- /dev/null +++ b/Documentation/accounting/taskstats.txt @@ -0,0 +1,146 @@ +Per-task statistics interface +----------------------------- + + +Taskstats is a netlink-based interface for sending per-task and +per-process statistics from the kernel to userspace. + +Taskstats was designed for the following benefits: + +- efficiently provide statistics during the lifetime of a task and on its exit +- unified interface for multiple accounting subsystems +- extensibility for use by future accounting patches + +Terminology +----------- + +"pid", "tid" and "task" are used interchangeably and refer to the standard +Linux task defined by struct task_struct. per-pid stats are the same as +per-task stats. + +"tgid", "process" and "thread group" are used interchangeably and refer to the +tasks that share an mm_struct i.e. the traditional Unix process. Despite the +use of tgid, there is no special treatment for the task that is thread group +leader - a process is deemed alive as long as it has any task belonging to it. + +Usage +----- + +To get statistics during a task's lifetime, userspace opens a unicast netlink +socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid. +The response contains statistics for a task (if pid is specified) or the sum of +statistics for all tasks of the process (if tgid is specified). + +To obtain statistics for tasks which are exiting, userspace opens a multicast +netlink socket. Each time a task exits, two records are sent by the kernel to +each listener on the multicast socket. The first is the per-pid task's statistics +and the second is the sum for all tasks of the process to which the task +belongs (the task does not need to be the thread group leader). 
The need for +per-tgid stats to be sent for each exiting task is explained in the per-tgid +stats section below. + + +Interface +--------- + +The user-kernel interface is encapsulated in include/linux/taskstats.h + +To avoid this documentation becoming obsolete as the interface evolves, only +an outline of the current version is given. taskstats.h always overrides the +description here. + +struct taskstats is the common accounting structure for both per-pid and +per-tgid data. It is versioned and can be extended by each accounting subsystem +that is added to the kernel. The fields and their semantics are defined in the +taskstats.h file. + +The data exchanged between user and kernel space is a netlink message belonging +to the NETLINK_GENERIC family and using the netlink attributes interface. +The messages are in the format + + +----------+- - -+-------------+-------------------+ + | nlmsghdr | Pad | genlmsghdr | taskstats payload | + +----------+- - -+-------------+-------------------+ + + +The taskstats payload is one of the following three kinds: + +1. Commands: Sent from user to kernel. The payload is one attribute, of type +TASKSTATS_CMD_ATTR_PID/TGID, containing a u32 pid or tgid in the attribute +payload. The pid/tgid denotes the task/process for which userspace wants +statistics. + +2. Response for a command: sent from the kernel in response to a userspace +command. The payload is a series of three attributes of type: + +a) TASKSTATS_TYPE_AGGR_PID/TGID : attribute containing no payload but indicates +a pid/tgid will be followed by some stats. + +b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats +are being returned. + +c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The +same structure is used for both per-pid and per-tgid stats. + +3. New message sent by kernel whenever a task exits. 
The payload consists of a + series of attributes of the following type: + +a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats +b) TASKSTATS_TYPE_PID: contains exiting task's pid +c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats +d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats +e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs +f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process + + +per-tgid stats +-------------- + +Taskstats provides per-process stats, in addition to per-task stats, since +resource management is often done at a process granularity and aggregating task +stats in userspace alone is inefficient and potentially inaccurate (due to lack +of atomicity). + +However, maintaining per-process, in addition to per-task stats, within the +kernel has space and time overheads. Hence the taskstats implementation +dynamically sums up the per-task stats for each task belonging to a process +whenever per-process stats are needed. + +Not maintaining per-tgid stats creates a problem when userspace is interested +in getting these stats when the process dies i.e. the last thread of +a process exits. It isn't possible to simply return some aggregated per-process +statistic from the kernel. + +The approach taken by taskstats is to return the per-tgid stats *each* time +a task exits, in addition to the per-pid stats for that task. Userspace can +maintain task<->process mappings and use them to maintain the per-process stats +in userspace, updating the aggregate appropriately as the tasks of a process +exit. + +Extending taskstats +------------------- + +There are two ways to extend the taskstats interface to export more +per-task/process stats as patches to collect them get added to the kernel +in future: + +1. Adding more fields to the end of the existing struct taskstats. Backward + compatibility is ensured by the version number within the + structure. 
Userspace will use only the fields of the struct that correspond + to the version it is using. + +2. Defining separate statistic structs and using the netlink attributes + interface to return them. Since userspace processes each netlink attribute + independently, it can always ignore attributes whose type it does not + understand (because it is using an older version of the interface). + + +Choosing between 1. and 2. is a matter of trading off flexibility and +overhead. If only a few fields need to be added, then 1. is the preferable +path since the kernel and userspace don't need to incur the overhead of +processing new netlink attributes. But if the new fields expand the existing +struct too much, requiring disparate userspace accounting utilities to +unnecessarily receive large structures whose fields are of no interest, then +extending the attributes structure would be worthwhile. + +---- diff --git a/include/linux/taskstats.h b/include/linux/taskstats.h new file mode 100644 index 0000000000000000000000000000000000000000..51f62759bea9fd23a5158e19f42b5480fde60e29 --- /dev/null +++ b/include/linux/taskstats.h @@ -0,0 +1,84 @@ +/* taskstats.h - exporting per-task statistics + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + */ + +#ifndef _LINUX_TASKSTATS_H +#define _LINUX_TASKSTATS_H + +/* Format for per-task data returned to userland when + * - a task exits + * - listener requests stats for a task + * + * The struct is versioned. 
Newer versions should only add fields to + * the bottom of the struct to maintain backward compatibility. + * + * + * To add new fields + * a) bump up TASKSTATS_VERSION + * b) add comment indicating new version number at end of struct + * c) add new fields after version comment; maintain 64-bit alignment + */ + +#define TASKSTATS_VERSION 1 + +struct taskstats { + + /* Version 1 */ + __u64 version; +}; + + +#define TASKSTATS_LISTEN_GROUP 0x1 + +/* + * Commands sent from userspace + * Not versioned. New commands should only be inserted at the enum's end + * prior to __TASKSTATS_CMD_MAX + */ + +enum { + TASKSTATS_CMD_UNSPEC = 0, /* Reserved */ + TASKSTATS_CMD_GET, /* user->kernel request/get-response */ + TASKSTATS_CMD_NEW, /* kernel->user event */ + __TASKSTATS_CMD_MAX, +}; + +#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1) + +enum { + TASKSTATS_TYPE_UNSPEC = 0, /* Reserved */ + TASKSTATS_TYPE_PID, /* Process id */ + TASKSTATS_TYPE_TGID, /* Thread group id */ + TASKSTATS_TYPE_STATS, /* taskstats structure */ + TASKSTATS_TYPE_AGGR_PID, /* contains pid + stats */ + TASKSTATS_TYPE_AGGR_TGID, /* contains tgid + stats */ + __TASKSTATS_TYPE_MAX, +}; + +#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1) + +enum { + TASKSTATS_CMD_ATTR_UNSPEC = 0, + TASKSTATS_CMD_ATTR_PID, + TASKSTATS_CMD_ATTR_TGID, + __TASKSTATS_CMD_ATTR_MAX, +}; + +#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1) + +/* NETLINK_GENERIC related info */ + +#define TASKSTATS_GENL_NAME "TASKSTATS" +#define TASKSTATS_GENL_VERSION 0x1 + +#endif /* _LINUX_TASKSTATS_H */ diff --git a/include/linux/taskstats_kern.h b/include/linux/taskstats_kern.h new file mode 100644 index 0000000000000000000000000000000000000000..bd0ecb969c263a1541193725b6ffab55f1740bf0 --- /dev/null +++ b/include/linux/taskstats_kern.h @@ -0,0 +1,57 @@ +/* taskstats_kern.h - kernel header for per-task statistics interface + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 
2006 + */ + +#ifndef _LINUX_TASKSTATS_KERN_H +#define _LINUX_TASKSTATS_KERN_H + +#include +#include + +enum { + TASKSTATS_MSG_UNICAST, /* send data only to requester */ + TASKSTATS_MSG_MULTICAST, /* send data to a group */ +}; + +#ifdef CONFIG_TASKSTATS +extern kmem_cache_t *taskstats_cache; + +static inline void taskstats_exit_alloc(struct taskstats **ptidstats, + struct taskstats **ptgidstats) +{ + *ptidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL); + *ptgidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL); +} + +static inline void taskstats_exit_free(struct taskstats *tidstats, + struct taskstats *tgidstats) +{ + if (tidstats) + kmem_cache_free(taskstats_cache, tidstats); + if (tgidstats) + kmem_cache_free(taskstats_cache, tgidstats); +} + +extern void taskstats_exit_send(struct task_struct *, struct taskstats *, + struct taskstats *); +extern void taskstats_init_early(void); + +#else +static inline void taskstats_exit_alloc(struct taskstats **ptidstats, + struct taskstats **ptgidstats) +{} +static inline void taskstats_exit_free(struct taskstats *ptidstats, + struct taskstats *ptgidstats) +{} +static inline void taskstats_exit_send(struct task_struct *tsk, + struct taskstats *tidstats, + struct taskstats *tgidstats) +{} +static inline void taskstats_init_early(void) +{} +#endif /* CONFIG_TASKSTATS */ + +#endif + diff --git a/init/Kconfig b/init/Kconfig index 90498a3e53da674315ee19a1a5e01cf289a17f83..56a7093b4e4ce367b14a5ff4d60a454493e7405b 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -158,6 +158,19 @@ config BSD_PROCESS_ACCT_V3 for processing it. A preliminary version of these tools is available at . +config TASKSTATS + bool "Export task/process statistics through netlink (EXPERIMENTAL)" + depends on NET + default n + help + Export selected statistics for tasks/processes through the + generic netlink interface. Unlike BSD process accounting, the + statistics are available during the lifetime of tasks/processes as + responses to commands. 
Like BSD accounting, they are sent to user + space on task exit. + + Say N if unsure. + config TASK_DELAY_ACCT bool "Enable per-task delay accounting (EXPERIMENTAL)" help diff --git a/init/main.c b/init/main.c index 9e8e8c152142f3c1273e7cc7f2963971f717ab2c..8651a720a092c4ade7e530c01b57b040c3e52539 100644 --- a/init/main.c +++ b/init/main.c @@ -41,6 +41,7 @@ #include #include #include +#include #include #include #include @@ -575,6 +576,7 @@ asmlinkage void __init start_kernel(void) proc_root_init(); #endif cpuset_init(); + taskstats_init_early(); delayacct_init(); check_bugs(); diff --git a/kernel/Makefile b/kernel/Makefile index 87bb34cc893821264e7a71d3047a004463358cc2..d62ec66c1af28246946d9a21a52f5fc6df706701 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -49,6 +49,7 @@ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o obj-$(CONFIG_RELAY) += relay.o obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o +obj-$(CONFIG_TASKSTATS) += taskstats.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra , the -fno-omit-frame-pointer is diff --git a/kernel/exit.c b/kernel/exit.c index 3c2cf91defa764842939405aac8e1cfae29c191b..9852ed8c29885f5b5c808accc92cb2a73e5c85e0 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include #include @@ -844,6 +845,7 @@ static void exit_notify(struct task_struct *tsk) fastcall NORET_TYPE void do_exit(long code) { struct task_struct *tsk = current; + struct taskstats *tidstats, *tgidstats; int group_dead; profile_task_exit(tsk); @@ -882,6 +884,8 @@ fastcall NORET_TYPE void do_exit(long code) current->comm, current->pid, preempt_count()); + taskstats_exit_alloc(&tidstats, &tgidstats); + acct_update_integrals(tsk); if (tsk->mm) { update_hiwater_rss(tsk->mm); @@ -901,7 +905,10 @@ fastcall NORET_TYPE void do_exit(long code) #endif if (unlikely(tsk->audit_context)) audit_free(tsk); + taskstats_exit_send(tsk, tidstats, tgidstats); + 
taskstats_exit_free(tidstats, tgidstats); delayacct_tsk_exit(tsk); + exit_mm(tsk); if (group_dead) diff --git a/kernel/taskstats.c b/kernel/taskstats.c new file mode 100644 index 0000000000000000000000000000000000000000..82ec9137d908cd35c321be2bf832ad515d30945a --- /dev/null +++ b/kernel/taskstats.c @@ -0,0 +1,336 @@ +/* + * taskstats.c - Export per-task statistics to userland + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + */ + +#include +#include +#include +#include + +static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 }; +static int family_registered; +kmem_cache_t *taskstats_cache; +static DEFINE_MUTEX(taskstats_exit_mutex); + +static struct genl_family family = { + .id = GENL_ID_GENERATE, + .name = TASKSTATS_GENL_NAME, + .version = TASKSTATS_GENL_VERSION, + .maxattr = TASKSTATS_CMD_ATTR_MAX, +}; + +static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] +__read_mostly = { + [TASKSTATS_CMD_ATTR_PID] = { .type = NLA_U32 }, + [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 }, +}; + + +static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp, + void **replyp, size_t size) +{ + struct sk_buff *skb; + void *reply; + + /* + * If new attributes are added, please revisit this allocation + */ + skb = nlmsg_new(size); + if (!skb) + return -ENOMEM; + + if (!info) { + int seq = get_cpu_var(taskstats_seqnum)++; + put_cpu_var(taskstats_seqnum); + + reply = genlmsg_put(skb, 0, seq, + 
family.id, 0, 0, + cmd, family.version); + } else + reply = genlmsg_put(skb, info->snd_pid, info->snd_seq, + family.id, 0, 0, + cmd, family.version); + if (reply == NULL) { + nlmsg_free(skb); + return -EINVAL; + } + + *skbp = skb; + *replyp = reply; + return 0; +} + +static int send_reply(struct sk_buff *skb, pid_t pid, int event) +{ + struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data); + void *reply; + int rc; + + reply = genlmsg_data(genlhdr); + + rc = genlmsg_end(skb, reply); + if (rc < 0) { + nlmsg_free(skb); + return rc; + } + + if (event == TASKSTATS_MSG_MULTICAST) + return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP); + return genlmsg_unicast(skb, pid); +} + +static int fill_pid(pid_t pid, struct task_struct *pidtsk, + struct taskstats *stats) +{ + int rc = 0; + struct task_struct *tsk = pidtsk; + + if (!pidtsk) { + read_lock(&tasklist_lock); + tsk = find_task_by_pid(pid); + if (!tsk) { + read_unlock(&tasklist_lock); + return -ESRCH; + } + get_task_struct(tsk); + read_unlock(&tasklist_lock); + } else + get_task_struct(tsk); + + /* + * Each accounting subsystem adds calls to its functions to + * fill in relevant parts of struct taskstats as follows + * + * rc = per-task-foo(stats, tsk); + * if (rc) + * goto err; + */ + +err: + put_task_struct(tsk); + return rc; + +} + +static int fill_tgid(pid_t tgid, struct task_struct *tgidtsk, + struct taskstats *stats) +{ + int rc = 0; + struct task_struct *tsk, *first; + + first = tgidtsk; + read_lock(&tasklist_lock); + if (!first) { + first = find_task_by_pid(tgid); + if (!first) { + read_unlock(&tasklist_lock); + return -ESRCH; + } + } + tsk = first; + do { + /* + * Each accounting subsystem adds calls to its functions to + * fill in relevant parts of struct taskstats as follows + * + * rc = per-task-foo(stats, tsk); + * if (rc) + * break; + */ + + } while_each_thread(first, tsk); + read_unlock(&tasklist_lock); + + /* + * Accounting subsystems can also add calls here if they don't + * wish to aggregate 
statistics for per-tgid stats + */ + + return rc; +} + +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info) +{ + int rc = 0; + struct sk_buff *rep_skb; + struct taskstats stats; + void *reply; + size_t size; + struct nlattr *na; + + /* + * Size includes space for nested attributes + */ + size = nla_total_size(sizeof(u32)) + + nla_total_size(sizeof(struct taskstats)) + nla_total_size(0); + + memset(&stats, 0, sizeof(stats)); + rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply, size); + if (rc < 0) + return rc; + + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) { + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]); + rc = fill_pid(pid, NULL, &stats); + if (rc < 0) + goto err; + + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID); + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid); + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, + stats); + } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) { + u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]); + rc = fill_tgid(tgid, NULL, &stats); + if (rc < 0) + goto err; + + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID); + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid); + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, + stats); + } else { + rc = -EINVAL; + goto err; + } + + nla_nest_end(rep_skb, na); + + return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST); + +nla_put_failure: + return genlmsg_cancel(rep_skb, reply); +err: + nlmsg_free(rep_skb); + return rc; +} + +/* Send pid data out on exit */ +void taskstats_exit_send(struct task_struct *tsk, struct taskstats *tidstats, + struct taskstats *tgidstats) +{ + int rc; + struct sk_buff *rep_skb; + void *reply; + size_t size; + int is_thread_group; + struct nlattr *na; + + if (!family_registered || !tidstats) + return; + + mutex_lock(&taskstats_exit_mutex); + + is_thread_group = !thread_group_empty(tsk); + rc = 0; + + /* + * Size includes space for nested attributes + */ + size = 
nla_total_size(sizeof(u32)) + + nla_total_size(sizeof(struct taskstats)) + nla_total_size(0); + + if (is_thread_group) + size = 2 * size; /* PID + STATS + TGID + STATS */ + + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply, size); + if (rc < 0) + goto ret; + + rc = fill_pid(tsk->pid, tsk, tidstats); + if (rc < 0) + goto err_skb; + + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID); + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid); + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, + *tidstats); + nla_nest_end(rep_skb, na); + + if (!is_thread_group || !tgidstats) { + send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); + goto ret; + } + + rc = fill_tgid(tsk->pid, tsk, tgidstats); + /* + * If fill_tgid() failed then one probable reason could be that the + * thread group leader has exited. fill_tgid() will fail, send out + * the pid statistics collected earlier. + */ + if (rc < 0) { + send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); + goto ret; + } + + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID); + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid); + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, + *tgidstats); + nla_nest_end(rep_skb, na); + + send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); + goto ret; + +nla_put_failure: + genlmsg_cancel(rep_skb, reply); + goto ret; +err_skb: + nlmsg_free(rep_skb); +ret: + mutex_unlock(&taskstats_exit_mutex); + return; +} + +static struct genl_ops taskstats_ops = { + .cmd = TASKSTATS_CMD_GET, + .doit = taskstats_send_stats, + .policy = taskstats_cmd_get_policy, +}; + +/* Needed early in initialization */ +void __init taskstats_init_early(void) +{ + taskstats_cache = kmem_cache_create("taskstats_cache", + sizeof(struct taskstats), + 0, SLAB_PANIC, NULL, NULL); +} + +static int __init taskstats_init(void) +{ + int rc; + + rc = genl_register_family(&family); + if (rc) + return rc; + + rc = genl_register_ops(&family, &taskstats_ops); + if (rc < 0) + goto err; + + 
family_registered = 1; + return 0; +err: + genl_unregister_family(&family); + return rc; +} + +/* + * late initcall ensures initialization of statistics collection + * mechanisms precedes initialization of the taskstats interface + */ +late_initcall(taskstats_init);