Commit f55ac551 authored by Gavin Shan, committed by Joseph Qi

alinux: mm: Support kidled

This enables scanning pages at a fixed interval to determine their access
frequency (hot/cold). The result is exported to user land on a per memory
cgroup basis through "memory.idle_page_stats". The design is highlighted
below:

   * A kernel thread is spawned when this feature is enabled by writing
     a non-zero value to "/sys/kernel/mm/kidled/scan_period_in_seconds".
     The thread sequentially scans the nodes and the pages chained up
     in their LRU lists.

   * For each page, the corresponding age information is stored in the
     page flags or in a per-node array. The age represents the number of
     scanning intervals in which the page wasn't accessed. The page flag
     (PG_idle) is leveraged: the page's age is increased by one if the
     idle flag isn't cleared between two consecutive scans; otherwise,
     the page's age is cleared out. The page's age information is also
     cleared when the page is freed, so that stale age information won't
     be fetched when the page is allocated again.

   * Initially, the flag is set, while the access bit in the page's PTE
     is cleared by the thread. In the next scanning period, the PTE
     access bit is synchronized with the page flag: the flag is cleared
     if the access bit is set, and kept otherwise. For unmapped pages,
     the flag is cleared when the page is accessed.

   * Eventually, the page's aging information is added to the unstable
     bucket of its corresponding memory cgroup, taken as statistics. The
     unstable bucket (statistics) is copied to the stable bucket when
     all pages in all nodes have been scanned once. The stable bucket
     (statistics) is exported to user land through
     "memory.idle_page_stats".

TESTING
=======

   * cgroup1, unmapped pagecache

     # dd if=/dev/zero of=/ext4/test.data oflag=direct bs=1M count=128
     #
     # echo 1 > /sys/kernel/mm/kidled/use_hierarchy
     # echo 15 > /sys/kernel/mm/kidled/scan_period_in_seconds
     # mkdir -p /cgroup/memory
     # mount -tcgroup -o memory /cgroup/memory
     # echo 1 > /cgroup/memory/memory.use_hierarchy
     # mkdir -p /cgroup/memory/test
     # echo 1 > /cgroup/memory/test/memory.use_hierarchy
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # dd if=/ext4/test.data of=/dev/null bs=1M count=128
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
       cfei   0   0   0   134217728   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep cfei
       cfei   0   0   0   134217728   0   0   0   0

   * cgroup1, mapped pagecache

     # < create same file and memory cgroups as above >
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # < run program to mmap the whole created file and access the area >
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
       cfei   0   134217728   0   0   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep cfei
       cfei   0   134217728   0   0   0   0   0   0

   * cgroup1, mapped and locked pagecache

     # < create same file and memory cgroups as above >
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # < run program to mmap the whole created file and mlock the area >
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfui
       cfui   0   134217728   0   0   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep cfui
       cfui   0   134217728   0   0   0   0   0   0

   * cgroup1, anonymous and locked area

     # < create memory cgroups as above >
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # < run program to mmap anonymous area and mlock it >
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep csui
       csui   0   0   134217728   0   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep csui
       csui   0   0   134217728   0   0   0   0   0

   * The above test cases were rerun on cgroup2 with the same results.
     However, the cgroups are populated in a different way, as below:

     # mkdir -p /cgroup
     # mount -tcgroup2 none /cgroup
     # echo "+memory" > /cgroup/cgroup.subtree_control
     # mkdir -p /cgroup/test
Signed-off-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Parent 666beb72
.. SPDX-License-Identifier: GPL-2.0+

======
kidled
======

Introduction
============

kidled uses a kernel thread to scan the pages on the LRU lists, and supports
outputting statistics for each memory cgroup (per-process statistics are not
supported yet). kidled scans pages round by round, indexed by pfn, and tries
to finish each round in a fixed duration named the scan period. Users can set
the scan period, whose unit is seconds. Each page has an attribute named
'idle age', which represents how long the page has been kept in the idle
state; the age's unit is one scan period. The idle age field consumes one
byte, stored either in a dynamically allocated array tied to the NUMA node or
in the flags field of the page descriptor (struct page), so the maximal age
is 255. kidled eventually shows histogram statistics through a memory cgroup
file (``memory.idle_page_stats``). The statistics can be used to evaluate the
working-set size of that memory cgroup or hierarchy.

Usage
=====

There are two sysfs files and one memory cgroup file exported by kidled.
Here are their functions:

* ``/sys/kernel/mm/kidled/scan_period_in_seconds``

  It controls the scan period of the kernel thread doing the scanning.
  A smaller value gives higher resolution but consumes more CPU cycles.
  Scanning is disabled when 0 is written to the parameter, which is the
  default setting. Writing to the file clears all previously collected
  statistics, even if the scan period isn't changed.

  .. note::
     A rare race exists! ``scan_period_in_seconds`` is the only thing
     visible to users; the duration and sequence number are internal
     representations for developers, which are better kept hidden from
     users to avoid confusion. When a user updates the
     ``scan_period_in_seconds`` file, the sequence number is increased
     and the duration is updated synchronously, as the figure below
     shows:

     ::

       OP               |  VALUE OF SCAN_PERIOD
       -----------------+----------------------------------------------
       initial value    |  seq = 0, duration = 0
       user update 120s |  seq = 1, duration = 120   <---- last value kidled sees
       user update 120s |  seq = 2, duration = 120   ---+
       ....             |                               | kidled may miss these
       ....             |                               | updates because busy
       user update 300s |  seq = 65535, duration = 300  |
       user update 300s |  seq = 0, duration = 300   ---+
       user update 120s |  seq = 1, duration = 120   <---- next value kidled sees

     The race happens when ``scan_period_in_seconds`` is updated very
     quickly in a very short period of time and kidled misses exactly
     65536 * N (N = 1,2,3...) updates while the duration stays the
     same. kidled then won't clear the previous statistics, but that is
     not too odd, because the duration is at least unchanged.

* ``/sys/kernel/mm/kidled/use_hierarchy``

  It controls whether ``memory.idle_page_stats`` reports accumulated
  (hierarchical) statistics. When it's set to zero, each memory cgroup
  shows only its own statistics; the root memory cgroup is the
  exception and always shows accumulated statistics. When it's set to
  one, accumulated statistics are always shown.

* ``memory.idle_page_stats`` (memory cgroup v1/v2)

  It shows the histogram of idle statistics for the corresponding
  memory cgroup. Whether the statistics are accumulated or not depends
  on the setting of ``use_hierarchy``.
----------------------------- snapshot start -----------------------------
# version: 1.0
# scans: 1380
# scan_period_in_seconds: 120
# use_hierarchy: 0
# buckets: 1,2,5,15,30,60,120,240
#
# _-----=> clean/dirty
# / _----=> swap/file
# | / _---=> evict/unevict
# || / _--=> inactive/active
# ||| /
# |||| [1,2) [2,5) [5,15) [15,30) [30,60) [60,120) [120,240) [240,+inf)
csei 0 0 0 0 0 0 0 0
dsei 0 0 442368 49152 0 49152 212992 7741440
cfei 4096 233472 1171456 1032192 28672 65536 122880 147550208
dfei 0 0 4096 20480 4096 0 12288 12288
csui 0 0 0 0 0 0 0 0
dsui 0 0 0 0 0 0 0 0
cfui 0 0 0 0 0 0 0 0
dfui 0 0 0 0 0 0 0 0
csea 77824 331776 1216512 1069056 217088 372736 327680 33284096
dsea 0 0 0 0 0 0 0 139264
cfea 4096 57344 606208 13144064 53248 135168 1683456 48357376
dfea 0 0 0 0 0 0 0 0
csua 0 0 0 0 0 0 0 0
dsua 0 0 0 0 0 0 0 0
cfua 0 0 0 0 0 0 0 0
dfua 0 0 0 0 0 0 0 0
----------------------------- snapshot end -----------------------------
``scans`` shows how many rounds the current cgroup has been scanned.
``scan_period_in_seconds`` shows how long kidled takes to finish one
round. ``use_hierarchy`` shows whether the current statistics are
accounted hierarchically, see above. ``buckets`` exists to make parsing
by scripts easy. The table shows how many bytes are in the idle state;
rows are indexed by idle type and columns by idle age. E.g. the
snapshot shows 331776 bytes have been idle at column ``[2,5)`` and row
``csea``. ``csea`` means the pages are clean && swappable && evictable
&& active, and ``[2,5)`` means the pages have been idle for at least
240 seconds and less than 600 seconds (obtained from [2, 5) *
scan_period_in_seconds). The last column ``[240,+inf)`` covers pages
that have been idle for a very long time, more than 28800 seconds.

Each memory cgroup can have its own histogram sampling, different from
the others, by echoing a monotonically increasing array into this file.
Each number must be less than 256, and the write operation clears the
previous statistics even if the buckets have not been changed. The
number of bucket values must be less than or equal to 8. The default
setting is "1,2,5,15,30,60,120,240". Empty bucket values (i.e. an empty
string) mean no accounting for the current memcg (NOTE: pages will
still be accounted to the parent memcg if it exists and has non-empty
buckets); a non-accounting snapshot looks like the one below:
----------------------------- snapshot start -----------------------------
$ sudo bash -c "echo '' > /sys/fs/cgroup/memory/test/memory.idle_page_stats"
$ cat /sys/fs/cgroup/memory/test/memory.idle_page_stats
# version: 1.0
# scans: 0
# scan_period_in_seconds: 1
# use_hierarchy: 1
# buckets: no valid bucket available
----------------------------- snapshot end -----------------------------
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_MM_KIDLED_H
#define _LINUX_MM_KIDLED_H
#ifdef CONFIG_KIDLED
#include <linux/types.h>
#define KIDLED_VERSION "1.0"
/*
 * We want to get more info about a specified idle page: whether it's
 * a page cache page, whether it's on the active LRU list, and so on.
 * We use KIDLE_<flag> to mark these different page attributes; 4 flags
 * are supported:
 *
 * KIDLE_DIRTY   : page is dirty or not;
 * KIDLE_FILE    : page is a page cache page or not;
 * KIDLE_UNEVICT : page is unevictable or evictable;
 * KIDLE_ACTIVE  : page is on the active LRU list or not.
 *
 * Each KIDLE_<flag> occupies one bit position in a specified idle type.
 * There are 2^4 = 16 idle types in total.
 */
#define KIDLE_BASE 0
#define KIDLE_DIRTY (1 << 0)
#define KIDLE_FILE (1 << 1)
#define KIDLE_UNEVICT (1 << 2)
#define KIDLE_ACTIVE (1 << 3)
#define KIDLE_NR_TYPE 16
/*
 * Each page has an idle age which means how long the page has been
 * kept in the idle state; the age's unit is one scan period. Each
 * page's idle age consumes one byte, so the max age is 255.
 * Buckets are used for histogram sampling depending on the idle age,
 * e.g. the bucket [5,15) covers pages whose idle age is >= 5 scan
 * periods and < 15 scan periods. A specific bucket value is a split
 * line of the idle age. We support a maximum of NUM_KIDLED_BUCKETS
 * sampling regions.
 */
#define KIDLED_MAX_IDLE_AGE U8_MAX
#define NUM_KIDLED_BUCKETS 8
/*
 * Since it's not convenient to get immediate statistics for a memory
 * cgroup, we use a ping-pong buffer. One buffer stores the stable
 * statistics, called the 'stable buffer', which is used for showing.
 * The other stores the statistics being updated by the scanning
 * thread, called the 'unstable buffer'. They are switched when one
 * scanning round finishes.
 */
#define KIDLED_STATS_NR_TYPE 2
/*
 * When a user wants not to account a specified instance (e.g. a
 * memory cgroup), the corresponding buckets are marked invalid.
 * kidled skips accounting when it encounters invalid buckets. Note
 * that scanning still goes on.
 *
 * When users write new buckets, the current statistics become invalid.
 * But we can't reset them immediately, for the reasons above; we reset
 * at a safe point (i.e. when one round finishes). The new buckets are
 * stored in the stable stats's buckets, while the unstable stats's
 * buckets are marked invalid.
 *
 * This value must be greater than KIDLED_MAX_IDLE_AGE, and can only be
 * used for the first bucket value, so kidled_get_bucket() can return
 * quickly. Users shouldn't use KIDLED_INVALID_BUCKET directly.
 */
#define KIDLED_INVALID_BUCKET (KIDLED_MAX_IDLE_AGE + 1)
#define KIDLED_MARK_BUCKET_INVALID(buckets) \
(buckets[0] = KIDLED_INVALID_BUCKET)
#define KIDLED_IS_BUCKET_INVALID(buckets) \
(buckets[0] == KIDLED_INVALID_BUCKET)
/*
* We account number of idle pages depending on idle type and buckets
* for a specified instance (e.g. one memory cgroup or one process...)
*/
struct idle_page_stats {
int buckets[NUM_KIDLED_BUCKETS];
unsigned long count[KIDLE_NR_TYPE][NUM_KIDLED_BUCKETS];
};
/*
 * The duration is in seconds and means how long kidled takes to finish
 * one round (best effort, no promise). The sequence number is increased
 * each time the user updates the sysfs file; it protects readers from
 * getting stale statistics, by comparing sequence numbers even when the
 * duration stays the same. However, there exists a rare race where the
 * sequence number may wrap and equal a previous one, so we also check
 * the duration to keep readers from getting strange statistics. The
 * result may still be stale when both seq and duration equal previous
 * values, but that's acceptable because the duration is at least the
 * same.
 */
#define KIDLED_MAX_SCAN_DURATION	U16_MAX	/* max 65535 seconds */
struct kidled_scan_period {
union {
atomic_t val;
struct {
u16 seq; /* inc when update */
u16 duration; /* in seconds */
};
};
};
extern struct kidled_scan_period kidled_scan_period;
#define KIDLED_OP_SET_DURATION (1 << 0)
#define KIDLED_OP_INC_SEQ (1 << 1)
static inline struct kidled_scan_period kidled_get_current_scan_period(void)
{
struct kidled_scan_period scan_period;
atomic_set(&scan_period.val, atomic_read(&kidled_scan_period.val));
return scan_period;
}
static inline unsigned int kidled_get_current_scan_duration(void)
{
struct kidled_scan_period scan_period =
kidled_get_current_scan_period();
return scan_period.duration;
}
static inline void kidled_reset_scan_period(struct kidled_scan_period *p)
{
atomic_set(&p->val, 0);
}
/*
* Compare with global kidled_scan_period, return true if equals.
*/
static inline bool kidled_is_scan_period_equal(struct kidled_scan_period *p)
{
return atomic_read(&p->val) == atomic_read(&kidled_scan_period.val);
}
static inline bool kidled_set_scan_period(int op, u16 duration,
struct kidled_scan_period *orig)
{
bool retry = false;
	/*
	 * atomic_cmpxchg() tries to update kidled_scan_period; we
	 * shouldn't retry, to avoid an endless loop when the caller
	 * specifies an original period.
	 */
if (!orig) {
orig = &kidled_scan_period;
retry = true;
}
while (true) {
int new_period_val, old_period_val;
struct kidled_scan_period new_period;
old_period_val = atomic_read(&orig->val);
atomic_set(&new_period.val, old_period_val);
if (op & KIDLED_OP_INC_SEQ)
new_period.seq++;
if (op & KIDLED_OP_SET_DURATION)
new_period.duration = duration;
new_period_val = atomic_read(&new_period.val);
if (atomic_cmpxchg(&kidled_scan_period.val,
old_period_val,
new_period_val) == old_period_val)
return true;
if (!retry)
return false;
}
}
static inline void kidled_set_scan_duration(u16 duration)
{
kidled_set_scan_period(KIDLED_OP_INC_SEQ |
KIDLED_OP_SET_DURATION,
duration, NULL);
}
/*
 * The caller must specify the original scan period, to avoid a race
 * between the doubling operation and the user's updates through the
 * sysfs interface.
 */
static inline bool kidled_try_double_scan_period(struct kidled_scan_period orig)
{
u16 duration = orig.duration;
if (unlikely(duration == KIDLED_MAX_SCAN_DURATION))
return false;
duration <<= 1;
if (duration < orig.duration)
duration = KIDLED_MAX_SCAN_DURATION;
return kidled_set_scan_period(KIDLED_OP_INC_SEQ |
KIDLED_OP_SET_DURATION,
duration,
&orig);
}
/*
 * Increase the sequence number while keeping the duration the same;
 * it's used to start a new period immediately.
 */
static inline void kidled_inc_scan_seq(void)
{
kidled_set_scan_period(KIDLED_OP_INC_SEQ, 0, NULL);
}
extern const int kidled_default_buckets[NUM_KIDLED_BUCKETS];
bool kidled_use_hierarchy(void);
#ifdef CONFIG_MEMCG
void kidled_mem_cgroup_move_stats(struct mem_cgroup *from,
struct mem_cgroup *to,
struct page *page,
unsigned int nr_pages);
#endif /* CONFIG_MEMCG */
#else /* !CONFIG_KIDLED */
#ifdef CONFIG_MEMCG
static inline void kidled_mem_cgroup_move_stats(struct mem_cgroup *from,
struct mem_cgroup *to,
struct page *page,
unsigned int nr_pages)
{
}
#endif /* CONFIG_MEMCG */
#endif /* CONFIG_KIDLED */
#endif /* _LINUX_MM_KIDLED_H */
@@ -30,6 +30,7 @@
#include <linux/vmstat.h>
#include <linux/writeback.h>
#include <linux/page-flags.h>
#include <linux/kidled.h>
struct mem_cgroup;
struct page;
@@ -317,6 +318,14 @@ struct mem_cgroup {
struct list_head event_list;
spinlock_t event_list_lock;
#ifdef CONFIG_KIDLED
struct rw_semaphore idle_stats_rwsem;
unsigned long idle_scans;
struct kidled_scan_period scan_period;
int idle_stable_idx;
struct idle_page_stats idle_stats[KIDLED_STATS_NR_TYPE];
#endif
struct mem_cgroup_per_node *nodeinfo[0];
/* WARNING: nodeinfo must be the last member here */
};
@@ -799,6 +808,28 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
void mem_cgroup_split_huge_fixup(struct page *head);
#endif
#ifdef CONFIG_KIDLED
static inline struct idle_page_stats *
mem_cgroup_get_stable_idle_stats(struct mem_cgroup *memcg)
{
return &memcg->idle_stats[memcg->idle_stable_idx];
}
static inline struct idle_page_stats *
mem_cgroup_get_unstable_idle_stats(struct mem_cgroup *memcg)
{
return &memcg->idle_stats[KIDLED_STATS_NR_TYPE - 1 -
memcg->idle_stable_idx];
}
static inline void
mem_cgroup_idle_page_stats_switch(struct mem_cgroup *memcg)
{
memcg->idle_stable_idx = KIDLED_STATS_NR_TYPE - 1 -
memcg->idle_stable_idx;
}
#endif /* CONFIG_KIDLED */
static inline bool is_wmark_ok(struct mem_cgroup *memcg, bool high)
{
if (high)
@@ -794,11 +794,12 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
* sets it, so none of the operations on it need to be atomic.
*/
/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPUPID] | ... | FLAGS | */
/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPUPID] | [KIDLED_AGE] | ... | FLAGS | */
#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
#define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH)
#define KIDLED_AGE_PGOFF (LAST_CPUPID_PGOFF - KIDLED_AGE_WIDTH)
/*
* Define the bit shifts to access each section. For non-existent
@@ -809,6 +810,7 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
#define LAST_CPUPID_PGSHIFT (LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0))
#define KIDLED_AGE_PGSHIFT (KIDLED_AGE_PGOFF * (KIDLED_AGE_WIDTH != 0))
/* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
#ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -1089,6 +1091,71 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid)
}
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_KIDLED
#ifdef KIDLED_AGE_NOT_IN_PAGE_FLAGS
static inline int kidled_get_page_age(pg_data_t *pgdat, unsigned long pfn)
{
u8 *age = pgdat->node_page_age;
if (unlikely(!age))
return -EINVAL;
age += (pfn - pgdat->node_start_pfn);
return *age;
}
static inline int kidled_inc_page_age(pg_data_t *pgdat, unsigned long pfn)
{
u8 *age = pgdat->node_page_age;
if (unlikely(!age))
return -EINVAL;
age += (pfn - pgdat->node_start_pfn);
*age += 1;
return *age;
}
static inline void kidled_set_page_age(pg_data_t *pgdat,
unsigned long pfn, int val)
{
u8 *age = pgdat->node_page_age;
if (unlikely(!age))
return;
age += (pfn - pgdat->node_start_pfn);
*age = val;
}
#else
static inline int kidled_get_page_age(pg_data_t *pgdat, unsigned long pfn)
{
struct page *page = pfn_to_page(pfn);
return (page->flags >> KIDLED_AGE_PGSHIFT) & KIDLED_AGE_MASK;
}
extern int kidled_inc_page_age(pg_data_t *pgdat, unsigned long pfn);
extern void kidled_set_page_age(pg_data_t *pgdat, unsigned long pfn, int val);
#endif /* KIDLED_AGE_NOT_IN_PAGE_FLAGS */
#else /* !CONFIG_KIDLED */
static inline int kidled_get_page_age(pg_data_t *pgdat, unsigned long pfn)
{
return -EINVAL;
}
static inline int kidled_inc_page_age(pg_data_t *pgdat, unsigned long pfn)
{
return -EINVAL;
}
static inline void kidled_set_page_age(pg_data_t *pgdat,
unsigned long pfn, int val)
{
}
#endif /* CONFIG_KIDLED */
static inline struct zone *page_zone(const struct page *page)
{
return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
@@ -653,6 +653,11 @@ typedef struct pglist_data {
unsigned long node_present_pages; /* total number of physical pages */
unsigned long node_spanned_pages; /* total size of physical page
range, including holes */
#ifdef CONFIG_KIDLED
unsigned long node_idle_scan_pfn;
u8 *node_page_age;
#endif
int node_id;
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
@@ -82,6 +82,19 @@
#define LAST_CPUPID_WIDTH 0
#endif
#ifdef CONFIG_KIDLED
#define KIDLED_AGE_SHIFT 8
#define KIDLED_AGE_MASK ((1UL << KIDLED_AGE_SHIFT)-1)
#else
#define KIDLED_AGE_SHIFT 0
#endif
#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT+KIDLED_AGE_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
#define KIDLED_AGE_WIDTH KIDLED_AGE_SHIFT
#else
#define KIDLED_AGE_WIDTH 0
#endif
/*
* We are going to use the flags for the page to node mapping if its in
* there. This includes the case where there is no node, so it is implicit.
@@ -94,4 +107,8 @@
#define LAST_CPUPID_NOT_IN_PAGE_FLAGS
#endif
#if defined(CONFIG_KIDLED) && KIDLED_AGE_WIDTH == 0
#define KIDLED_AGE_NOT_IN_PAGE_FLAGS
#endif
#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
@@ -38,6 +38,9 @@ static inline void clear_page_idle(struct page *page)
{
ClearPageIdle(page);
}
void page_idle_clear_pte_refs(struct page *page);
#else /* !CONFIG_64BIT */
/*
* If there is not enough space to store Idle and Young bits in page flags, use
@@ -135,6 +138,10 @@ static inline void clear_page_idle(struct page *page)
{
}
static inline void page_idle_clear_pte_refs(struct page *page)
{
}
#endif /* CONFIG_IDLE_PAGE_TRACKING */
#endif /* _LINUX_MM_PAGE_IDLE_H */
@@ -764,4 +764,16 @@ config GUP_BENCHMARK
config ARCH_HAS_PTE_SPECIAL
bool
config KIDLED
	bool "Enable kernel thread to scan idle pages"
	depends on IDLE_PAGE_TRACKING
	help
	  This introduces a kernel thread (kidled) that scans pages at a
	  configurable interval to determine whether they were accessed
	  within that interval, i.e. their access frequency. Hot and cold
	  pages are identified this way, and the statistics are exported
	  to user space on a per memory cgroup basis through
	  "memory.idle_page_stats".

	  See Documentation/vm/kidled.rst for more details.
endmenu
@@ -105,3 +105,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
obj-$(CONFIG_HMM) += hmm.o
obj-$(CONFIG_MEMFD_CREATE) += memfd.o
obj-$(CONFIG_KIDLED) += kidled.o
// SPDX-License-Identifier: GPL-2.0
#include <linux/kthread.h>
#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/mm_inline.h>
#include <linux/module.h>
#include <linux/pagemap.h>
#include <linux/page-flags.h>
#include <linux/page_idle.h>
#include <linux/vmalloc.h>
#include <linux/wait.h>
#include <linux/kidled.h>
#include <uapi/linux/sched/types.h>
/*
 * Should the accounting be hierarchical? Hierarchical accounting only
 * works when memcg is in hierarchy mode. It's OK if kidled enables
 * hierarchical accounting while memcg is in non-hierarchy mode; kidled
 * will then account to the memory cgroup the page is charged to. There
 * is no dependency between these two settings.
 */
static bool use_hierarchy __read_mostly;
struct kidled_scan_period kidled_scan_period;
const int kidled_default_buckets[NUM_KIDLED_BUCKETS] = {
1, 2, 5, 15, 30, 60, 120, 240 };
static DECLARE_WAIT_QUEUE_HEAD(kidled_wait);
static unsigned long kidled_scan_rounds __read_mostly;
static inline int kidled_get_bucket(int *idle_buckets, int age)
{
int bucket;
if (age < idle_buckets[0])
return -EINVAL;
for (bucket = 1; bucket <= (NUM_KIDLED_BUCKETS - 1); bucket++) {
if (age < idle_buckets[bucket])
return bucket - 1;
}
return NUM_KIDLED_BUCKETS - 1;
}
static inline int kidled_get_idle_type(struct page *page)
{
int idle_type = KIDLE_BASE;
if (PageDirty(page) || PageWriteback(page))
idle_type |= KIDLE_DIRTY;
if (page_is_file_cache(page))
idle_type |= KIDLE_FILE;
	/*
	 * We can't call page_evictable() here because we don't hold
	 * the page lock, so use the page flag instead. Note this is
	 * different from PageMlocked().
	 */
if (PageUnevictable(page))
idle_type |= KIDLE_UNEVICT;
if (PageActive(page))
idle_type |= KIDLE_ACTIVE;
return idle_type;
}
#ifndef KIDLED_AGE_NOT_IN_PAGE_FLAGS
int kidled_inc_page_age(pg_data_t *pgdat, unsigned long pfn)
{
struct page *page = pfn_to_page(pfn);
unsigned long old, new;
int age;
do {
age = ((page->flags >> KIDLED_AGE_PGSHIFT) & KIDLED_AGE_MASK);
if (age >= KIDLED_AGE_MASK)
break;
new = old = page->flags;
new &= ~(KIDLED_AGE_MASK << KIDLED_AGE_PGSHIFT);
new |= (((age + 1) & KIDLED_AGE_MASK) << KIDLED_AGE_PGSHIFT);
} while (unlikely(cmpxchg(&page->flags, old, new) != old));
return age;
}
EXPORT_SYMBOL_GPL(kidled_inc_page_age);
void kidled_set_page_age(pg_data_t *pgdat, unsigned long pfn, int val)
{
struct page *page = pfn_to_page(pfn);
unsigned long old, new;
do {
new = old = page->flags;
new &= ~(KIDLED_AGE_MASK << KIDLED_AGE_PGSHIFT);
new |= ((val & KIDLED_AGE_MASK) << KIDLED_AGE_PGSHIFT);
} while (unlikely(cmpxchg(&page->flags, old, new) != old));
}
EXPORT_SYMBOL_GPL(kidled_set_page_age);
#endif /* !KIDLED_AGE_NOT_IN_PAGE_FLAGS */
#ifdef CONFIG_MEMCG
static inline void kidled_mem_cgroup_account(struct page *page,
int age,
int nr_pages)
{
struct mem_cgroup *memcg;
struct idle_page_stats *stats;
int type, bucket;
if (mem_cgroup_disabled())
return;
type = kidled_get_idle_type(page);
memcg = lock_page_memcg(page);
if (unlikely(!memcg)) {
unlock_page_memcg(page);
return;
}
stats = mem_cgroup_get_unstable_idle_stats(memcg);
bucket = kidled_get_bucket(stats->buckets, age);
if (bucket >= 0)
stats->count[type][bucket] += nr_pages;
unlock_page_memcg(page);
}
void kidled_mem_cgroup_move_stats(struct mem_cgroup *from,
struct mem_cgroup *to,
struct page *page,
unsigned int nr_pages)
{
pg_data_t *pgdat = page_pgdat(page);
unsigned long pfn = page_to_pfn(page);
struct idle_page_stats *stats[4] = { NULL, };
int type, bucket, age;
if (mem_cgroup_disabled())
return;
type = kidled_get_idle_type(page);
stats[0] = mem_cgroup_get_stable_idle_stats(from);
stats[1] = mem_cgroup_get_unstable_idle_stats(from);
if (to) {
stats[2] = mem_cgroup_get_stable_idle_stats(to);
stats[3] = mem_cgroup_get_unstable_idle_stats(to);
}
	/*
	 * We assume all page ages are the same if this is a compound
	 * page. Also, we use the node's cursor (@node_idle_scan_pfn) to
	 * check whether the current page should be removed from the
	 * source memory cgroup or charged to the target memory cgroup,
	 * without introducing a locking mechanism. This may lead to
	 * slightly inconsistent statistics, but that's fine as they will
	 * be reshuffled in the next round of scanning.
	 */
age = kidled_get_page_age(pgdat, pfn);
if (age < 0)
return;
bucket = kidled_get_bucket(stats[1]->buckets, age);
if (bucket < 0)
return;
/* Remove from the source memory cgroup */
if (stats[0]->count[type][bucket] > nr_pages)
stats[0]->count[type][bucket] -= nr_pages;
else
stats[0]->count[type][bucket] = 0;
if (pgdat->node_idle_scan_pfn >= pfn) {
if (stats[1]->count[type][bucket] > nr_pages)
stats[1]->count[type][bucket] -= nr_pages;
else
stats[1]->count[type][bucket] = 0;
}
/* Charge to the target memory cgroup */
if (!to)
return;
bucket = kidled_get_bucket(stats[3]->buckets, age);
if (bucket < 0)
return;
stats[2]->count[type][bucket] += nr_pages;
if (pgdat->node_idle_scan_pfn >= pfn)
stats[3]->count[type][bucket] += nr_pages;
}
EXPORT_SYMBOL_GPL(kidled_mem_cgroup_move_stats);
static inline void kidled_mem_cgroup_scan_done(struct kidled_scan_period period)
{
struct mem_cgroup *memcg;
struct idle_page_stats *stable_stats, *unstable_stats;
for (memcg = mem_cgroup_iter(NULL, NULL, NULL);
memcg != NULL;
memcg = mem_cgroup_iter(NULL, memcg, NULL)) {
down_write(&memcg->idle_stats_rwsem);
stable_stats = mem_cgroup_get_stable_idle_stats(memcg);
unstable_stats = mem_cgroup_get_unstable_idle_stats(memcg);
		/*
		 * Switch the buffers when the scanning buckets are
		 * valid; otherwise copy the buckets from the stable
		 * stats's buckets, which may contain the user's new
		 * buckets (valid or not).
		 */
if (!KIDLED_IS_BUCKET_INVALID(unstable_stats->buckets)) {
mem_cgroup_idle_page_stats_switch(memcg);
memcg->idle_scans++;
} else {
memcpy(unstable_stats->buckets, stable_stats->buckets,
sizeof(unstable_stats->buckets));
}
memcg->scan_period = period;
up_write(&memcg->idle_stats_rwsem);
unstable_stats = mem_cgroup_get_unstable_idle_stats(memcg);
memset(&unstable_stats->count, 0,
sizeof(unstable_stats->count));
}
}
static inline void kidled_mem_cgroup_reset(void)
{
struct mem_cgroup *memcg;
struct idle_page_stats *stable_stats, *unstable_stats;
for (memcg = mem_cgroup_iter(NULL, NULL, NULL);
memcg != NULL;
memcg = mem_cgroup_iter(NULL, memcg, NULL)) {
down_write(&memcg->idle_stats_rwsem);
stable_stats = mem_cgroup_get_stable_idle_stats(memcg);
unstable_stats = mem_cgroup_get_unstable_idle_stats(memcg);
memset(&stable_stats->count, 0, sizeof(stable_stats->count));
memcg->idle_scans = 0;
kidled_reset_scan_period(&memcg->scan_period);
up_write(&memcg->idle_stats_rwsem);
memset(&unstable_stats->count, 0,
sizeof(unstable_stats->count));
}
}
#else /* !CONFIG_MEMCG */
static inline void kidled_mem_cgroup_account(struct page *page,
int age,
int nr_pages)
{
}
static inline void kidled_mem_cgroup_scan_done(struct kidled_scan_period
scan_period)
{
}
static inline void kidled_mem_cgroup_reset(void)
{
}
#endif /* CONFIG_MEMCG */
/*
 * An idle page with an older age is more likely to stay idle, while a
 * busy page is more likely to stay busy, so we can reduce the sampling
 * frequency for such pages to save CPU resources. We keep sampling
 * every round while an idle page is still young. See the table below:
*
* idle age | down ratio
* ----------+-------------
* [0, 1) | 1/2 # busy
* [1, 4) | 1 # young idle
* [4, 8) | 1/2 # idle
* [8, 16) | 1/4 # old idle
* [16, +inf)| 1/8 # older idle
*/
static inline bool kidled_need_check_idle(pg_data_t *pgdat, unsigned long pfn)
{
struct page *page = pfn_to_page(pfn);
int age = kidled_get_page_age(pgdat, pfn);
unsigned long pseudo_random;
if (age < 0)
return false;
	/*
	 * kidled checks different pages in each round when it needs to
	 * reduce the sampling frequency; this depends on the current pfn
	 * and the global scanning round. Some pfns are special: for a
	 * huge page, we can only check the head page, while tail pages
	 * are handled at lower levels and skipped. Shifting by the
	 * compound order achieves good load balance across rounds when
	 * the system has many huge pages; 1GB pages are not considered
	 * here.
	 */
if (PageTransHuge(page))
pfn >>= compound_order(page);
pseudo_random = pfn + kidled_scan_rounds;
if (age == 0)
return pseudo_random & 0x1UL;
else if (age < 4)
return true;
else if (age < 8)
return pseudo_random & 0x1UL;
else if (age < 16)
return (pseudo_random & 0x3UL) == 0x3UL;
else
return (pseudo_random & 0x7UL) == 0x7UL;
}
static inline int kidled_scan_page(pg_data_t *pgdat, unsigned long pfn)
{
struct page *page;
int age, nr_pages = 1, idx;
bool idle = false;
if (!pfn_valid(pfn))
goto out;
page = pfn_to_page(pfn);
if (!page || !PageLRU(page)) {
kidled_set_page_age(pgdat, pfn, 0);
goto out;
}
/*
 * Try to skip clearing the PTE references, which is an expensive
 * operation. PG_idle is cleared when a page is freed and we have
 * checked the PG_lru flag above, so the race is acceptable.
 */
if (page_is_idle(page)) {
if (kidled_need_check_idle(pgdat, pfn)) {
if (!get_page_unless_zero(page)) {
kidled_set_page_age(pgdat, pfn, 0);
goto out;
}
/*
 * Check PG_lru again after taking a reference.
 * page_idle_get_page() takes zone_lru_lock first, but that
 * seems unnecessary here.
 *
 * We also can't hold the LRU lock here because the time
 * budget for each scan slice is fixed. Otherwise the
 * accumulated statistics would be cleared out and the scan
 * interval (@scan_period_in_seconds) would be doubled.
 * However, this opens a race between kidled and page
 * reclaim: reclaim may dry-run due to the bumped refcount,
 * but that's acceptable.
 */
if (unlikely(!PageLRU(page))) {
put_page(page);
kidled_set_page_age(pgdat, pfn, 0);
goto out;
}
page_idle_clear_pte_refs(page);
if (page_is_idle(page))
idle = true;
put_page(page);
} else if (kidled_get_page_age(pgdat, pfn) > 0) {
idle = true;
}
}
if (PageTransHuge(page))
nr_pages = 1 << compound_order(page);
if (idle) {
age = kidled_inc_page_age(pgdat, pfn);
if (age > 0)
kidled_mem_cgroup_account(page, age, nr_pages);
else
age = 0;
} else {
age = 0;
kidled_set_page_age(pgdat, pfn, 0);
if (get_page_unless_zero(page)) {
if (likely(PageLRU(page)))
set_page_idle(page);
put_page(page);
}
}
for (idx = 1; idx < nr_pages; idx++)
kidled_set_page_age(pgdat, pfn + idx, age);
out:
return nr_pages;
}
static bool kidled_scan_node(pg_data_t *pgdat,
struct kidled_scan_period scan_period,
bool restart)
{
unsigned long pfn, end, node_end;
#ifdef KIDLED_AGE_NOT_IN_PAGE_FLAGS
if (unlikely(!pgdat->node_page_age)) {
pgdat->node_page_age = vzalloc(pgdat->node_spanned_pages);
if (unlikely(!pgdat->node_page_age))
return false;
}
#endif /* KIDLED_AGE_NOT_IN_PAGE_FLAGS */
node_end = pgdat_end_pfn(pgdat);
pfn = pgdat->node_start_pfn;
if (!restart && pfn < pgdat->node_idle_scan_pfn)
pfn = pgdat->node_idle_scan_pfn;
end = min(pfn + DIV_ROUND_UP(pgdat->node_spanned_pages,
scan_period.duration), node_end);
while (pfn < end) {
/* Restart new scanning when user updates the period */
if (unlikely(!kidled_is_scan_period_equal(&scan_period)))
break;
cond_resched();
pfn += kidled_scan_page(pgdat, pfn);
}
pgdat->node_idle_scan_pfn = pfn;
return pfn >= node_end;
}
static inline void kidled_scan_done(struct kidled_scan_period scan_period)
{
kidled_mem_cgroup_scan_done(scan_period);
kidled_scan_rounds++;
}
static inline void kidled_reset(bool free)
{
pg_data_t *pgdat;
kidled_mem_cgroup_reset();
get_online_mems();
#ifdef KIDLED_AGE_NOT_IN_PAGE_FLAGS
for_each_online_pgdat(pgdat) {
if (!pgdat->node_page_age)
continue;
if (free) {
vfree(pgdat->node_page_age);
pgdat->node_page_age = NULL;
} else {
memset(pgdat->node_page_age, 0,
pgdat->node_spanned_pages);
}
cond_resched();
}
#else
for_each_online_pgdat(pgdat) {
unsigned long pfn, end_pfn = pgdat->node_start_pfn +
pgdat->node_spanned_pages;
for (pfn = pgdat->node_start_pfn; pfn < end_pfn; pfn++) {
if (!pfn_valid(pfn))
continue;
kidled_set_page_age(pgdat, pfn, 0);
if (pfn % HPAGE_PMD_NR == 0)
cond_resched();
}
}
#endif /* KIDLED_AGE_NOT_IN_PAGE_FLAGS */
put_online_mems();
}
static inline bool kidled_should_run(struct kidled_scan_period *p, bool *new)
{
if (unlikely(!kidled_is_scan_period_equal(p))) {
struct kidled_scan_period scan_period;
scan_period = kidled_get_current_scan_period();
if (p->duration)
kidled_reset(!scan_period.duration);
*p = scan_period;
*new = true;
} else {
*new = false;
}
if (p->duration > 0)
return true;
return false;
}
static int kidled(void *dummy)
{
int busy_loop = 0;
bool restart = true;
struct kidled_scan_period scan_period;
kidled_reset_scan_period(&scan_period);
while (!kthread_should_stop()) {
pg_data_t *pgdat;
u64 start_jiffies, elapsed;
bool new, scan_done = true;
wait_event_interruptible(kidled_wait,
kidled_should_run(&scan_period, &new));
if (unlikely(new)) {
restart = true;
busy_loop = 0;
}
if (unlikely(scan_period.duration == 0))
continue;
start_jiffies = jiffies_64;
get_online_mems();
for_each_online_pgdat(pgdat) {
scan_done &= kidled_scan_node(pgdat,
scan_period,
restart);
}
put_online_mems();
if (scan_done) {
kidled_scan_done(scan_period);
restart = true;
} else {
restart = false;
}
/*
 * Kidled scans a fixed share of pages, derived from scan_period,
 * in each slice, and each slice is supposed to finish within one
 * second:
 *
 *     pages_to_scan = total_pages / scan_duration
 *     for_each_slice() {
 *         start_jiffies = jiffies_64;
 *         scan_pages(pages_to_scan);
 *         elapsed = jiffies_64 - start_jiffies;
 *         sleep(HZ - elapsed);
 *     }
 *
 * A slice is considered busy when elapsed >= (HZ / 2); after
 * several consecutive busy slices, the scan duration is scaled up.
 *
 * NOTE this is a simple guard, not a promise.
 */
#define KIDLED_BUSY_RUNNING (HZ / 2)
#define KIDLED_BUSY_LOOP_THRESHOLD 10
elapsed = jiffies_64 - start_jiffies;
if (elapsed < KIDLED_BUSY_RUNNING) {
busy_loop = 0;
schedule_timeout_interruptible(HZ - elapsed);
} else if (++busy_loop == KIDLED_BUSY_LOOP_THRESHOLD) {
busy_loop = 0;
if (kidled_try_double_scan_period(scan_period)) {
pr_warn_ratelimited("%s: period -> %u\n",
__func__,
kidled_get_current_scan_duration());
}
/* sleep for a while to relax cpu */
schedule_timeout_interruptible(elapsed);
}
}
return 0;
}
bool kidled_use_hierarchy(void)
{
return use_hierarchy;
}
static ssize_t kidled_scan_period_show(struct kobject *kobj,
struct kobj_attribute *attr,
char *buf)
{
return sprintf(buf, "%u\n", kidled_get_current_scan_duration());
}
/*
 * The real scan period update and the reset are performed
 * asynchronously, to avoid stalling here while kidled is busy
 * waiting for other resources.
 */
static ssize_t kidled_scan_period_store(struct kobject *kobj,
struct kobj_attribute *attr,
const char *buf, size_t count)
{
unsigned long secs;
int ret;
ret = kstrtoul(buf, 10, &secs);
if (ret || secs > KIDLED_MAX_SCAN_DURATION)
return -EINVAL;
kidled_set_scan_duration(secs);
wake_up_interruptible(&kidled_wait);
return count;
}
static ssize_t kidled_use_hierarchy_show(struct kobject *kobj,
struct kobj_attribute *attr,
char *buf)
{
return sprintf(buf, "%u\n", use_hierarchy);
}
static ssize_t kidled_use_hierarchy_store(struct kobject *kobj,
struct kobj_attribute *attr,
const char *buf, size_t count)
{
unsigned long val;
int ret;
ret = kstrtoul(buf, 10, &val);
if (ret || val > 1)
return -EINVAL;
WRITE_ONCE(use_hierarchy, val);
/*
 * Always start a new period when the user sets use_hierarchy.
 * kidled_inc_scan_seq() uses atomic_cmpxchg(), which implies a
 * memory barrier, so readers are guaranteed to see new
 * statistics after the store returns. A narrow race remains
 * during the store:
 *
 *     writer                 |  readers
 *                            |
 *     update_use_hierarchy   |
 *     .....                  |  read_statistics  <-- race
 *     increase_scan_sequence |
 *
 * Readers may observe the new use_hierarchy value together with
 * old statistics; we ignore this.
 */
kidled_inc_scan_seq();
return count;
}
static struct kobj_attribute kidled_scan_period_attr =
__ATTR(scan_period_in_seconds, 0644,
kidled_scan_period_show, kidled_scan_period_store);
static struct kobj_attribute kidled_use_hierarchy_attr =
__ATTR(use_hierarchy, 0644,
kidled_use_hierarchy_show, kidled_use_hierarchy_store);
static struct attribute *kidled_attrs[] = {
&kidled_scan_period_attr.attr,
&kidled_use_hierarchy_attr.attr,
NULL
};
static struct attribute_group kidled_attr_group = {
.name = "kidled",
.attrs = kidled_attrs,
};
static int __init kidled_init(void)
{
struct task_struct *thread;
struct sched_param param = { .sched_priority = 0 };
int ret;
ret = sysfs_create_group(mm_kobj, &kidled_attr_group);
if (ret) {
pr_warn("%s: Error %d on creating sysfs files\n",
__func__, ret);
return ret;
}
thread = kthread_run(kidled, NULL, "kidled");
if (IS_ERR(thread)) {
sysfs_remove_group(mm_kobj, &kidled_attr_group);
pr_warn("%s: Failed to start kthread\n", __func__);
return PTR_ERR(thread);
}
/* Make kidled as nice as possible. */
sched_setscheduler(thread, SCHED_IDLE, &param);
return 0;
}
module_init(kidled_init);
@@ -3562,6 +3562,246 @@ static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
return nbytes;
}
#ifdef CONFIG_KIDLED
static int mem_cgroup_idle_page_stats_show(struct seq_file *m, void *v)
{
struct mem_cgroup *iter, *memcg = mem_cgroup_from_css(seq_css(m));
struct kidled_scan_period scan_period, period;
struct idle_page_stats stats, cache;
unsigned long scans;
bool has_hierarchy = kidled_use_hierarchy();
bool no_buckets = false;
int i, j, t;
down_read(&memcg->idle_stats_rwsem);
stats = memcg->idle_stats[memcg->idle_stable_idx];
scans = memcg->idle_scans;
scan_period = memcg->scan_period;
up_read(&memcg->idle_stats_rwsem);
/* Nothing will be output with invalid buckets */
if (KIDLED_IS_BUCKET_INVALID(stats.buckets)) {
no_buckets = true;
scans = 0;
goto output;
}
/* Zeroes will be output when the scan period is mismatched */
if (!kidled_is_scan_period_equal(&scan_period)) {
memset(&stats.count, 0, sizeof(stats.count));
scan_period = kidled_get_current_scan_period();
scans = 0;
goto output;
}
if (mem_cgroup_is_root(memcg) || has_hierarchy) {
for_each_mem_cgroup_tree(iter, memcg) {
/* The root memcg was just accounted */
if (iter == memcg)
continue;
down_read(&iter->idle_stats_rwsem);
cache = iter->idle_stats[iter->idle_stable_idx];
period = iter->scan_period;
up_read(&iter->idle_stats_rwsem);
/*
 * Skip accounting if the scan period is mismatched or the
 * buckets are invalid.
 */
if (!kidled_is_scan_period_equal(&period) ||
KIDLED_IS_BUCKET_INVALID(cache.buckets))
continue;
/*
 * The buckets of the current memory cgroup might not match
 * those of the root memory cgroup. We charge the current
 * statistics to the largest bucket that fits. Users should
 * apply consistent buckets across the memory cgroups in the
 * hierarchy.
 */
for (i = 0; i < NUM_KIDLED_BUCKETS; i++) {
for (j = 0; j < NUM_KIDLED_BUCKETS - 1; j++) {
if (cache.buckets[i] <=
stats.buckets[j])
break;
}
for (t = 0; t < KIDLE_NR_TYPE; t++)
stats.count[t][j] += cache.count[t][i];
}
}
}
output:
seq_printf(m, "# version: %s\n", KIDLED_VERSION);
seq_printf(m, "# scans: %lu\n", scans);
seq_printf(m, "# scan_period_in_seconds: %u\n", scan_period.duration);
seq_printf(m, "# use_hierarchy: %u\n", kidled_use_hierarchy());
seq_puts(m, "# buckets: ");
if (no_buckets) {
seq_puts(m, "no valid bucket available\n");
return 0;
}
for (i = 0; i < NUM_KIDLED_BUCKETS; i++) {
seq_printf(m, "%d", stats.buckets[i]);
if ((i == NUM_KIDLED_BUCKETS - 1) ||
!stats.buckets[i + 1]) {
seq_puts(m, "\n");
j = i + 1;
break;
}
seq_puts(m, ",");
}
seq_puts(m, "#\n");
seq_puts(m, "# _-----=> clean/dirty\n");
seq_puts(m, "# / _----=> swap/file\n");
seq_puts(m, "# | / _---=> evict/unevict\n");
seq_puts(m, "# || / _--=> inactive/active\n");
seq_puts(m, "# ||| /\n");
seq_printf(m, "# %-8s", "||||");
for (i = 0; i < j; i++) {
char region[20];
if (i == j - 1) {
snprintf(region, sizeof(region), "[%d,+inf)",
stats.buckets[i]);
} else {
snprintf(region, sizeof(region), "[%d,%d)",
stats.buckets[i],
stats.buckets[i + 1]);
}
seq_printf(m, " %14s", region);
}
seq_puts(m, "\n");
for (t = 0; t < KIDLE_NR_TYPE; t++) {
char kidled_type_str[5];
kidled_type_str[0] = t & KIDLE_DIRTY ? 'd' : 'c';
kidled_type_str[1] = t & KIDLE_FILE ? 'f' : 's';
kidled_type_str[2] = t & KIDLE_UNEVICT ? 'u' : 'e';
kidled_type_str[3] = t & KIDLE_ACTIVE ? 'a' : 'i';
kidled_type_str[4] = '\0';
seq_printf(m, " %-8s", kidled_type_str);
for (i = 0; i < j; i++) {
seq_printf(m, " %14lu",
stats.count[t][i] << PAGE_SHIFT);
}
seq_puts(m, "\n");
}
return 0;
}
static ssize_t mem_cgroup_idle_page_stats_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
struct idle_page_stats *stable_stats, *unstable_stats;
int buckets[NUM_KIDLED_BUCKETS] = { 0 }, i = 0, err;
unsigned long prev = 0, curr;
char *next;
buf = strstrip(buf);
while (*buf) {
if (i >= NUM_KIDLED_BUCKETS)
return -E2BIG;
/* Get next entry */
next = buf + 1;
while (*next && *next >= '0' && *next <= '9')
next++;
while (*next && (*next == ' ' || *next == ','))
*next++ = '\0';
/* Should be monotonically increasing */
err = kstrtoul(buf, 10, &curr);
if (err || curr > KIDLED_MAX_IDLE_AGE || curr <= prev)
return -EINVAL;
buckets[i++] = curr;
prev = curr;
buf = next;
}
/* No buckets set, mark it invalid */
if (i == 0)
KIDLED_MARK_BUCKET_INVALID(buckets);
if (down_write_killable(&memcg->idle_stats_rwsem))
return -EINTR;
stable_stats = mem_cgroup_get_stable_idle_stats(memcg);
unstable_stats = mem_cgroup_get_unstable_idle_stats(memcg);
memcpy(stable_stats->buckets, buckets, sizeof(buckets));
/*
 * Clear the stats without checking whether the buckets have
 * changed; this way a user can reset the stats without also
 * resetting the buckets.
 */
memset(stable_stats->count, 0, sizeof(stable_stats->count));
/*
 * It's safe for kidled to read the unstable buckets without
 * holding any read-side locks.
 */
KIDLED_MARK_BUCKET_INVALID(unstable_stats->buckets);
memcg->idle_scans = 0;
up_write(&memcg->idle_stats_rwsem);
return nbytes;
}
static void kidled_memcg_init(struct mem_cgroup *memcg)
{
int type;
init_rwsem(&memcg->idle_stats_rwsem);
for (type = 0; type < KIDLED_STATS_NR_TYPE; type++) {
memcpy(memcg->idle_stats[type].buckets,
kidled_default_buckets,
sizeof(kidled_default_buckets));
}
}
static void kidled_memcg_inherit_parent_buckets(struct mem_cgroup *parent,
struct mem_cgroup *memcg)
{
int idle_buckets[NUM_KIDLED_BUCKETS], type;
down_read(&parent->idle_stats_rwsem);
memcpy(idle_buckets,
parent->idle_stats[parent->idle_stable_idx].buckets,
sizeof(idle_buckets));
up_read(&parent->idle_stats_rwsem);
for (type = 0; type < KIDLED_STATS_NR_TYPE; type++) {
memcpy(memcg->idle_stats[type].buckets,
idle_buckets,
sizeof(idle_buckets));
}
}
#else
static void kidled_memcg_init(struct mem_cgroup *memcg)
{
}
static void kidled_memcg_inherit_parent_buckets(struct mem_cgroup *parent,
struct mem_cgroup *memcg)
{
}
#endif /* CONFIG_KIDLED */
static u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
struct cftype *cft)
{
@@ -4670,6 +4910,13 @@ static struct cftype mem_cgroup_legacy_files[] = {
.write = mem_cgroup_reset,
.read_u64 = mem_cgroup_read_u64,
},
#ifdef CONFIG_KIDLED
{
.name = "idle_page_stats",
.seq_show = mem_cgroup_idle_page_stats_show,
.write = mem_cgroup_idle_page_stats_write,
},
#endif
{ }, /* terminate */
};
@@ -4852,6 +5099,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
#ifdef CONFIG_CGROUP_WRITEBACK
INIT_LIST_HEAD(&memcg->cgwb_list);
#endif
kidled_memcg_init(memcg);
idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
return memcg;
fail:
@@ -4880,6 +5128,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
/* Default gap is 0.5% max limit */
memcg->wmark_scale_factor = parent->wmark_scale_factor ?
: 50;
kidled_memcg_inherit_parent_buckets(parent, memcg);
}
if (parent && parent->use_hierarchy) {
memcg->use_hierarchy = true;
@@ -5244,6 +5493,8 @@ static int mem_cgroup_move_account(struct page *page,
ret = 0;
kidled_mem_cgroup_move_stats(from, to, page, nr_pages);
local_irq_disable();
mem_cgroup_charge_statistics(to, page, compound, nr_pages);
memcg_check_events(to, page);
@@ -6152,6 +6403,13 @@ static struct cftype memory_files[] = {
.seq_show = memory_oom_group_show,
.write = memory_oom_group_write,
},
#ifdef CONFIG_KIDLED
{
.name = "idle_page_stats",
.seq_show = mem_cgroup_idle_page_stats_show,
.write = mem_cgroup_idle_page_stats_write,
},
#endif
{ } /* terminate */
};
@@ -738,6 +738,12 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
pgdat->node_start_pfn = start_pfn;
pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
#ifdef KIDLED_AGE_NOT_IN_PAGE_FLAGS
if (pgdat->node_page_age) {
vfree(pgdat->node_page_age);
pgdat->node_page_age = NULL;
}
#endif
}
void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
@@ -1871,6 +1877,13 @@ void try_offline_node(int nid)
if (check_and_unmap_cpu_on_node(pgdat))
return;
#ifdef KIDLED_AGE_NOT_IN_PAGE_FLAGS
if (pgdat->node_page_age) {
vfree(pgdat->node_page_age);
pgdat->node_page_age = NULL;
}
#endif
/*
* all memory/cpu of this node are removed, we can offline this
* node now.
@@ -1034,6 +1034,17 @@ static __always_inline bool free_pages_prepare(struct page *page,
bad++;
continue;
}
/*
 * The page age information is stored in the page flags
 * or in the node's page array, and we need to explicitly
 * clear it in both cases. Otherwise, stale age data would
 * be reported when the page is allocated again. Age
 * information is kept for every page of a compound page,
 * so we have to clear them one by one.
 */
kidled_set_page_age(page_pgdat(page + i),
page_to_pfn(page + i), 0);
(page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
}
}
@@ -1047,6 +1058,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
return false;
page_cpupid_reset_last(page);
kidled_set_page_age(page_pgdat(page), page_to_pfn(page), 0);
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
reset_page_owner(page, order);
@@ -92,7 +92,7 @@ static bool page_idle_clear_pte_refs_one(struct page *page,
return true;
}
-static void page_idle_clear_pte_refs(struct page *page)
+void page_idle_clear_pte_refs(struct page *page)
{
/*
* Since rwc.arg is unused, rwc is effectively immutable, so we