From fd952d8ce5a0b5bba80c233ef8beefbca4e41d6b Mon Sep 17 00:00:00 2001
From: Gavin Shan
Date: Fri, 30 Aug 2019 13:47:42 +0800
Subject: [PATCH] alios: mm: Support kidled

This enables scanning pages at a fixed interval to determine their access
frequency (hot/cold). The result is exported to user land per memory
cgroup through "memory.idle_page_stats". The design is highlighted as
below:

* A kernel thread is spawned when this feature is enabled by writing a
  non-zero value to "/sys/kernel/mm/kidled/scan_period_in_seconds". The
  thread sequentially scans the nodes and the pages chained up in their
  LRU lists.

* For each page, its age information is stored in the page flags or in a
  per-node array. The age represents the number of scan intervals in
  which the page hasn't been accessed. The page flag (PG_idle) is also
  leveraged: the page's age is increased by one if the idle flag isn't
  cleared in two consecutive scans; otherwise, the page's age is reset.
  The page's age information is also cleared when the page is freed, so
  that stale age information won't be fetched when it's reallocated.

* Initially, the idle flag is set while the access bit in the page's PTE
  is cleared by the thread. In the next scan period, the PTE access bit
  is synchronized with the page flag: the flag is cleared if the access
  bit is set, and kept otherwise. For unmapped pages, the flag is
  cleared when the page is accessed.

* Eventually, the page's aging information is accumulated into the
  unstable bucket of its memory cgroup as statistics. The unstable
  bucket (statistics) is copied to the stable bucket once all pages in
  all nodes have been scanned. The stable bucket (statistics) is
  exported to user land through "memory.idle_page_stats".
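The per-page aging rule above (PG_idle plus a one-byte age) can be modelled
in a few lines of standalone C. This is illustrative only: `struct
page_state` and `kidled_scan_step()` are hypothetical names for this
sketch, not symbols introduced by the patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Standalone model of kidled's aging rule (illustrative, not kernel
 * code): each scan, a page whose idle flag survived a whole interval
 * gets older; an access clears the flag and resets the age.
 */
struct page_state {
	bool pg_idle;       /* models PG_idle */
	unsigned char age;  /* models the one-byte idle age, max 255 */
};

/*
 * One scan step. @accessed models the PTE access bit (or the access
 * path for unmapped pages) having cleared PG_idle since the last scan.
 */
static void kidled_scan_step(struct page_state *p, bool accessed)
{
	if (accessed)
		p->pg_idle = false;

	if (p->pg_idle) {
		if (p->age < 255)
			p->age++;  /* idle for another full interval */
	} else {
		p->age = 0;        /* page was touched: restart aging */
	}

	p->pg_idle = true;         /* re-arm the flag for the next interval */
}
```

Note a page needs two consecutive scans without an access before its age
starts growing, matching the "two consecutive scans" rule above.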
TESTING
=======

* cgroup1, unmapped pagecache

  # dd if=/dev/zero of=/ext4/test.data oflag=direct bs=1M count=128
  #
  # echo 1 > /sys/kernel/mm/kidled/use_hierarchy
  # echo 15 > /sys/kernel/mm/kidled/scan_period_in_seconds
  # mkdir -p /cgroup/memory
  # mount -tcgroup -o memory /cgroup/memory
  # echo 1 > /cgroup/memory/memory.use_hierarchy
  # mkdir -p /cgroup/memory/test
  # echo 1 > /cgroup/memory/test/memory.use_hierarchy
  #
  # echo $$ > /cgroup/memory/test/cgroup.procs
  # dd if=/ext4/test.data of=/dev/null bs=1M count=128
  # < wait a few minutes >
  # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
  cfei  0 0 0 134217728 0 0 0 0
  # cat /cgroup/memory/memory.idle_page_stats | grep cfei
  cfei  0 0 0 134217728 0 0 0 0

* cgroup1, mapped pagecache

  # < create same file and memory cgroups as above >
  #
  # echo $$ > /cgroup/memory/test/cgroup.procs
  # < run program to mmap the whole created file and access the area >
  # < wait a few minutes >
  # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
  cfei  0 134217728 0 0 0 0 0 0
  # cat /cgroup/memory/memory.idle_page_stats | grep cfei
  cfei  0 134217728 0 0 0 0 0 0

* cgroup1, mapped and locked pagecache

  # < create same file and memory cgroups as above >
  #
  # echo $$ > /cgroup/memory/test/cgroup.procs
  # < run program to mmap the whole created file and mlock the area >
  # < wait a few minutes >
  # cat /cgroup/memory/test/memory.idle_page_stats | grep cfui
  cfui  0 134217728 0 0 0 0 0 0
  # cat /cgroup/memory/memory.idle_page_stats | grep cfui
  cfui  0 134217728 0 0 0 0 0 0

* cgroup1, anonymous and locked area

  # < create memory cgroups as above >
  #
  # echo $$ > /cgroup/memory/test/cgroup.procs
  # < run program to mmap anonymous area and mlock it >
  # < wait a few minutes >
  # cat /cgroup/memory/test/memory.idle_page_stats | grep csui
  csui  0 0 134217728 0 0 0 0 0
  # cat /cgroup/memory/memory.idle_page_stats | grep csui
  csui  0 0 134217728 0 0 0 0 0

* Rerun the above test cases in cgroup2 and the
  results are unchanged. However, the cgroups are populated in a
  different way, as below:

  # mkdir -p /cgroup
  # mount -tcgroup2 none /cgroup
  # echo "+memory" > /cgroup/cgroup.subtree_control
  # mkdir -p /cgroup/test

Signed-off-by: Gavin Shan
Reviewed-by: Yang Shi
Reviewed-by: Xunlei Pang
---
 Documentation/vm/kidled.rst       | 139 ++++++
 include/linux/kidled.h            | 237 ++++++++++
 include/linux/memcontrol.h        |  31 ++
 include/linux/mm.h                |  69 ++-
 include/linux/mmzone.h            |   5 +
 include/linux/page-flags-layout.h |  17 +
 include/linux/page_idle.h         |   7 +
 mm/Kconfig                        |  12 +
 mm/Makefile                       |   1 +
 mm/kidled.c                       | 691 ++++++++++++++++++++++++++++++
 mm/memcontrol.c                   | 258 +++++++++++
 mm/memory_hotplug.c               |  13 +
 mm/page_alloc.c                   |  12 +
 mm/page_idle.c                    |   2 +-
 14 files changed, 1492 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/vm/kidled.rst
 create mode 100644 include/linux/kidled.h
 create mode 100644 mm/kidled.c

diff --git a/Documentation/vm/kidled.rst b/Documentation/vm/kidled.rst
new file mode 100644
index 000000000000..016274a06715
--- /dev/null
+++ b/Documentation/vm/kidled.rst
@@ -0,0 +1,139 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+======
+kidled
+======
+
+Introduction
+============
+
+kidled uses a kernel thread to scan the pages on the LRU lists, and can
+output statistics for each memory cgroup (processes are not supported
+yet). kidled scans pages round by round, indexed by pfn, and tries to
+finish each round in a fixed duration called the scan period. Users can
+set the scan period in seconds. Each page has an attribute called 'idle
+age', which represents how long the page has stayed idle, measured in
+scan periods. The idle age consumes one byte, stored either in a
+dynamically allocated per-node array or in the flags field of the page
+descriptor (struct page), so the maximal age is 255. kidled eventually
+shows the histogram statistics through memory cgroup files
+(``memory.idle_page_stats``).
+The statistics can be used to evaluate the working-set size of that
+memory cgroup or its hierarchy.
+
+
+Usage
+=====
+
+There are two sysfs files and one memory cgroup file exported by kidled.
+Here are their functions:
+
+* ``/sys/kernel/mm/kidled/scan_period_in_seconds``
+
+  It controls the scan period for the kernel thread to do the scanning.
+  A smaller value gives higher resolution, but more CPU cycles are
+  consumed for the scanning. No scanning is issued when it's set to 0,
+  which is the default. Writing to the file clears all previously
+  collected statistics, even if the scan period isn't changed.
+
+.. note::
+   A rare race exists! ``scan_period_in_seconds`` is the only thing
+   visible to users. The duration and sequence number are internal
+   representations for developers and are not exposed to users, to
+   avoid confusion. When a user updates the ``scan_period_in_seconds``
+   file, the sequence number is increased and the duration is updated
+   synchronously, as the figure below shows:
+
+   OP               | VALUE OF SCAN_PERIOD
+   Initial value    | seq = 0, duration = 0
+   user update 120s | seq = 1, duration = 120      <---- last value kidled sees
+   user update 120s | seq = 2, duration = 120      ---+
+   ....             |                                 | kidled may miss these
+   ....             |                                 | updates because busy
+   user update 300s | seq = 65535, duration = 300     |
+   user update 300s | seq = 0, duration = 300      ---+
+   user update 120s | seq = 1, duration = 120      <---- next value kidled sees
+
+   The race happens when ``scan_period_in_seconds`` is updated very fast
+   within a very short period of time and kidled misses exactly
+   65536 * N (N = 1,2,3...) updates while the duration stays the same.
+   kidled then won't clear the previous statistics, but the result won't
+   be too odd since the duration is at least unchanged.
+
+* ``/sys/kernel/mm/kidled/use_hierarchy``
+
+  It controls whether ``memory.idle_page_stats`` reports accumulated
+  (hierarchical) statistics.
+  When it's set to zero, the statistics of the memory cgroup itself
+  will be shown, except for the root memory cgroup, for which the
+  accumulated statistics are always given. When it's set to one, the
+  accumulated statistics are always shown.
+
+* ``memory.idle_page_stats`` (memory cgroup v1/v2)
+
+  It shows the histogram of idle statistics for the corresponding memory
+  cgroup. Whether the statistics are accumulated or not depends on the
+  ``use_hierarchy`` setting.
+
+  ----------------------------- snapshot start -----------------------------
+  # version: 1.0
+  # scans: 1380
+  # scan_period_in_seconds: 120
+  # use_hierarchy: 0
+  # buckets: 1,2,5,15,30,60,120,240
+  #
+  #    _-----=> clean/dirty
+  #   / _----=> swap/file
+  #  | / _---=> evict/unevict
+  #  || / _--=> inactive/active
+  #  ||| /
+  #  ||||  [1,2)  [2,5)  [5,15)  [15,30)  [30,60)  [60,120)  [120,240)  [240,+inf)
+     csei  0      0      0       0        0        0         0          0
+     dsei  0      0      442368  49152    0        49152     212992     7741440
+     cfei  4096   233472 1171456 1032192  28672    65536     122880     147550208
+     dfei  0      0      4096    20480    4096     0         12288      12288
+     csui  0      0      0       0        0        0         0          0
+     dsui  0      0      0       0        0        0         0          0
+     cfui  0      0      0       0        0        0         0          0
+     dfui  0      0      0       0        0        0         0          0
+     csea  77824  331776 1216512 1069056  217088   372736    327680     33284096
+     dsea  0      0      0       0        0        0         0          139264
+     cfea  4096   57344  606208  13144064 53248    135168    1683456    48357376
+     dfea  0      0      0       0        0        0         0          0
+     csua  0      0      0       0        0        0         0          0
+     dsua  0      0      0       0        0        0         0          0
+     cfua  0      0      0       0        0        0         0          0
+     dfua  0      0      0       0        0        0         0          0
+  ----------------------------- snapshot end -----------------------------
+
+  ``scans`` is the number of scan rounds the current cgroup has gone
+  through. ``scan_period_in_seconds`` is how long kidled takes to finish
+  one round. ``use_hierarchy`` shows whether the current statistics use
+  hierarchical accounting, see above. ``buckets`` is listed to let
+  scripts parse the table easily. The table shows how many bytes are in
+  the idle state; rows are indexed by idle type and columns by idle age.
+
+  e.g.
+  it shows 331776 bytes are idle at column ``[2,5)`` and row ``csea``.
+  ``csea`` means the pages are clean && swappable && evictable && active;
+  ``[2,5)`` means the pages have kept idle for at least 240 seconds and
+  less than 600 seconds (derived from [2, 5) * scan_period_in_seconds).
+  The last column ``[240,+inf)`` means the pages have kept idle for a
+  very long time, at least 28800 seconds.
+
+  Each memory cgroup can have its own histogram sampling, different from
+  others, by echoing a monotonically increasing array to this file. Each
+  number must be less than 256, and the write operation clears previous
+  statistics even if the buckets haven't changed. At most 8 bucket
+  values are allowed. The default setting is "1,2,5,15,30,60,120,240".
+  An empty bucket list (i.e. an empty string) means no accounting for
+  the current memcg (NOTE: pages will still be accounted to the parent
+  memcg if it exists and has non-empty buckets). A non-accounting
+  snapshot looks like below:
+
+  ----------------------------- snapshot start -----------------------------
+  $ sudo bash -c "echo '' > /sys/fs/cgroup/memory/test/memory.idle_page_stats"
+  $ cat /sys/fs/cgroup/memory/test/memory.idle_page_stats
+  # version: 1.0
+  # scans: 0
+  # scan_period_in_seconds: 1
+  # use_hierarchy: 1
+  # buckets: no valid bucket available
+  ----------------------------- snapshot end -----------------------------
diff --git a/include/linux/kidled.h b/include/linux/kidled.h
new file mode 100644
index 000000000000..a212b9b6adf4
--- /dev/null
+++ b/include/linux/kidled.h
@@ -0,0 +1,237 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MM_KIDLED_H
+#define _LINUX_MM_KIDLED_H
+
+#ifdef CONFIG_KIDLED
+
+#include
+
+#define KIDLED_VERSION "1.0"
+
+/*
+ * We want to get more info about a specified idle page, whether it's
+ * a page cache page or in the active LRU list and so on. We use KIDLE_
We use KIDLE_ + * to mark these different page attributes, we support 4 flags: + * + * KIDLE_DIRTY : page is dirty or not; + * KIDLE_FILE : page is a page cache or not; + * KIDLE_UNEVIT : page is unevictable or evictable; + * KIDLE_ACTIVE : page is in active LRU list or not. + * + * Each KIDLE_ occupies one bit position in a specified idle type. + * There exist total 2^4=16 idle types. + */ +#define KIDLE_BASE 0 +#define KIDLE_DIRTY (1 << 0) +#define KIDLE_FILE (1 << 1) +#define KIDLE_UNEVICT (1 << 2) +#define KIDLE_ACTIVE (1 << 3) + +#define KIDLE_NR_TYPE 16 + +/* + * Each page has an idle age which means how long the page is keeping + * in idle state, the age's unit is in one scan period. Each page's + * idle age will consume one byte, so the max age must be 255. + * Buckets are used for histogram sampling depends on the idle age, + * e.g. the bucket [5,15) means page's idle age ge than 5 scan periods + * and lt 15 scan periods. A specified bucket value is a split line of + * the idle age. We support a maximum of NUM_KIDLED_BUCKETS sampling + * regions. + */ +#define KIDLED_MAX_IDLE_AGE U8_MAX +#define NUM_KIDLED_BUCKETS 8 + +/* + * Since it's not convenient to get an immediate statistics for a memory + * cgroup, we use a ping-pong buffer. One is used to store the stable + * statistics which call it 'stable buffer', it's used for showing. + * Another is used to store the statistics being updated by scanning + * threads which call it 'unstable buffer'. Switch them when one scanning + * round is finished. + */ +#define KIDLED_STATS_NR_TYPE 2 + +/* + * When user wants not to account for a specified instance (e.g. may + * be a memory cgoup), then mark the corresponding buckets to be invalid. + * kidled will skip accounting when encounter invalid buckets. Note the + * scanning is still on. + * + * When users update new buckets, it means current statistics should be + * invalid. But we can't reset immediately, reasons as above. We'll reset + * at a safe point(i.e. 
one round finished). Store new buckets in stable + * stats's buckets, while mark unstable stats's buckets to be invalid. + * + * This value must be greater than KIDLED_MAX_IDLE_AGE, and can be only + * used for the first bucket value, so it can return quickly when call + * kidled_get_bucket(). User shouldn't use KIDLED_INVALID_BUCKET directly. + */ +#define KIDLED_INVALID_BUCKET (KIDLED_MAX_IDLE_AGE + 1) + +#define KIDLED_MARK_BUCKET_INVALID(buckets) \ + (buckets[0] = KIDLED_INVALID_BUCKET) +#define KIDLED_IS_BUCKET_INVALID(buckets) \ + (buckets[0] == KIDLED_INVALID_BUCKET) + +/* + * We account number of idle pages depending on idle type and buckets + * for a specified instance (e.g. one memory cgroup or one process...) + */ +struct idle_page_stats { + int buckets[NUM_KIDLED_BUCKETS]; + unsigned long count[KIDLE_NR_TYPE][NUM_KIDLED_BUCKETS]; +}; + +/* + * Duration is in seconds, it means kidled will take how long to finish + * one round (just try, no promise). Sequence number will be increased + * when user updates the sysfs file each time, it can protect readers + * won't get stale statistics by comparing the sequence number even + * duration keep the same. However, there exists a rare race that seq + * num may wrap and be the same as previous seq num. So we also check + * the duration to make readers won't get strange statistics. But it may + * be still stale when seq and duration are both the same as previous + * value, but I think it's acceptable because duration is the same at + * least. 
+ */ +#define KIDLED_MAX_SCAN_DURATION U16_MAX /* max 65536 seconds */ +struct kidled_scan_period { + union { + atomic_t val; + struct { + u16 seq; /* inc when update */ + u16 duration; /* in seconds */ + }; + }; +}; +extern struct kidled_scan_period kidled_scan_period; + +#define KIDLED_OP_SET_DURATION (1 << 0) +#define KIDLED_OP_INC_SEQ (1 << 1) + +static inline struct kidled_scan_period kidled_get_current_scan_period(void) +{ + struct kidled_scan_period scan_period; + + atomic_set(&scan_period.val, atomic_read(&kidled_scan_period.val)); + return scan_period; +} + +static inline unsigned int kidled_get_current_scan_duration(void) +{ + struct kidled_scan_period scan_period = + kidled_get_current_scan_period(); + + return scan_period.duration; +} + +static inline void kidled_reset_scan_period(struct kidled_scan_period *p) +{ + atomic_set(&p->val, 0); +} + +/* + * Compare with global kidled_scan_period, return true if equals. + */ +static inline bool kidled_is_scan_period_equal(struct kidled_scan_period *p) +{ + return atomic_read(&p->val) == atomic_read(&kidled_scan_period.val); +} + +static inline bool kidled_set_scan_period(int op, u16 duration, + struct kidled_scan_period *orig) +{ + bool retry = false; + + /* + * atomic_cmpxchg() tries to update kidled_scan_period, shouldn't + * retry to avoid endless loop when caller specify a period. 
+ */ + if (!orig) { + orig = &kidled_scan_period; + retry = true; + } + + while (true) { + int new_period_val, old_period_val; + struct kidled_scan_period new_period; + + old_period_val = atomic_read(&orig->val); + atomic_set(&new_period.val, old_period_val); + if (op & KIDLED_OP_INC_SEQ) + new_period.seq++; + if (op & KIDLED_OP_SET_DURATION) + new_period.duration = duration; + new_period_val = atomic_read(&new_period.val); + + if (atomic_cmpxchg(&kidled_scan_period.val, + old_period_val, + new_period_val) == old_period_val) + return true; + + if (!retry) + return false; + } +} + +static inline void kidled_set_scan_duration(u16 duration) +{ + kidled_set_scan_period(KIDLED_OP_INC_SEQ | + KIDLED_OP_SET_DURATION, + duration, NULL); +} + +/* + * Caller must specify the original scan period, avoid the race between + * the double operation and user's updates through sysfs interface. + */ +static inline bool kidled_try_double_scan_period(struct kidled_scan_period orig) +{ + u16 duration = orig.duration; + + if (unlikely(duration == KIDLED_MAX_SCAN_DURATION)) + return false; + + duration <<= 1; + if (duration < orig.duration) + duration = KIDLED_MAX_SCAN_DURATION; + return kidled_set_scan_period(KIDLED_OP_INC_SEQ | + KIDLED_OP_SET_DURATION, + duration, + &orig); +} + +/* + * Increase the sequence number while keep duration the same, it's used + * to start a new period immediately. 
+ */ +static inline void kidled_inc_scan_seq(void) +{ + kidled_set_scan_period(KIDLED_OP_INC_SEQ, 0, NULL); +} + +extern const int kidled_default_buckets[NUM_KIDLED_BUCKETS]; + +bool kidled_use_hierarchy(void); +#ifdef CONFIG_MEMCG +void kidled_mem_cgroup_move_stats(struct mem_cgroup *from, + struct mem_cgroup *to, + struct page *page, + unsigned int nr_pages); +#endif /* CONFIG_MEMCG */ + +#else /* !CONFIG_KIDLED */ + +#ifdef CONFIG_MEMCG +static inline void kidled_mem_cgroup_move_stats(struct mem_cgroup *from, + struct mem_cgroup *to, + struct page *page, + unsigned int nr_pages) +{ +} +#endif /* CONFIG_MEMCG */ + +#endif /* CONFIG_KIDLED */ + +#endif /* _LINUX_MM_KIDLED_H */ diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 8feaa0abf1a4..dfa3a89a1440 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -30,6 +30,7 @@ #include #include #include +#include struct mem_cgroup; struct page; @@ -317,6 +318,14 @@ struct mem_cgroup { struct list_head event_list; spinlock_t event_list_lock; +#ifdef CONFIG_KIDLED + struct rw_semaphore idle_stats_rwsem; + unsigned long idle_scans; + struct kidled_scan_period scan_period; + int idle_stable_idx; + struct idle_page_stats idle_stats[KIDLED_STATS_NR_TYPE]; +#endif + struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ }; @@ -799,6 +808,28 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, void mem_cgroup_split_huge_fixup(struct page *head); #endif +#ifdef CONFIG_KIDLED +static inline struct idle_page_stats * +mem_cgroup_get_stable_idle_stats(struct mem_cgroup *memcg) +{ + return &memcg->idle_stats[memcg->idle_stable_idx]; +} + +static inline struct idle_page_stats * +mem_cgroup_get_unstable_idle_stats(struct mem_cgroup *memcg) +{ + return &memcg->idle_stats[KIDLED_STATS_NR_TYPE - 1 - + memcg->idle_stable_idx]; +} + +static inline void +mem_cgroup_idle_page_stats_switch(struct mem_cgroup *memcg) +{ + memcg->idle_stable_idx = 
KIDLED_STATS_NR_TYPE - 1 - + memcg->idle_stable_idx; +} +#endif /* CONFIG_KIDLED */ + static inline bool is_wmark_ok(struct mem_cgroup *memcg, bool high) { if (high) diff --git a/include/linux/mm.h b/include/linux/mm.h index 45f10f5896b7..3adb081f83d5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -794,11 +794,12 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); * sets it, so none of the operations on it need to be atomic. */ -/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPUPID] | ... | FLAGS | */ +/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPUPID] | [KIDLED_AGE] | ... | FLAGS | */ #define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH) #define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH) #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH) #define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH) +#define KIDLED_AGE_PGOFF (LAST_CPUPID_PGOFF - KIDLED_AGE_WIDTH) /* * Define the bit shifts to access each section. For non-existent @@ -809,6 +810,7 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0)) #define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0)) #define LAST_CPUPID_PGSHIFT (LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0)) +#define KIDLED_AGE_PGSHIFT (KIDLED_AGE_PGOFF * (KIDLED_AGE_WIDTH != 0)) /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */ #ifdef NODE_NOT_IN_PAGE_FLAGS @@ -1089,6 +1091,71 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid) } #endif /* CONFIG_NUMA_BALANCING */ +#ifdef CONFIG_KIDLED +#ifdef KIDLED_AGE_NOT_IN_PAGE_FLAGS +static inline int kidled_get_page_age(pg_data_t *pgdat, unsigned long pfn) +{ + u8 *age = pgdat->node_page_age; + + if (unlikely(!age)) + return -EINVAL; + + age += (pfn - pgdat->node_start_pfn); + return *age; +} + +static inline int kidled_inc_page_age(pg_data_t *pgdat, unsigned long pfn) +{ + u8 *age = pgdat->node_page_age; + + if (unlikely(!age)) + return -EINVAL; + + 
age += (pfn - pgdat->node_start_pfn); + *age += 1; + + return *age; +} + +static inline void kidled_set_page_age(pg_data_t *pgdat, + unsigned long pfn, int val) +{ + u8 *age = pgdat->node_page_age; + + if (unlikely(!age)) + return; + + age += (pfn - pgdat->node_start_pfn); + *age = val; +} +#else +static inline int kidled_get_page_age(pg_data_t *pgdat, unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + + return (page->flags >> KIDLED_AGE_PGSHIFT) & KIDLED_AGE_MASK; +} + +extern int kidled_inc_page_age(pg_data_t *pgdat, unsigned long pfn); +extern void kidled_set_page_age(pg_data_t *pgdat, unsigned long pfn, int val); +#endif /* KIDLED_AGE_NOT_IN_PAGE_FLAGS */ +#else /* !CONFIG_KIDLED */ +static inline int kidled_get_page_age(pg_data_t *pgdat, unsigned long pfn) +{ + return -EINVAL; +} + +static inline int kidled_inc_page_age(pg_data_t *pgdat, unsigned long pfn) +{ + return -EINVAL; +} + +static inline void kidled_set_page_age(pg_data_t *pgdat, + unsigned long pfn, int val) +{ +} +#endif /* CONFIG_KIDLED */ + static inline struct zone *page_zone(const struct page *page) { return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]; diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4e46ff268cb1..00f681746bb3 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -653,6 +653,11 @@ typedef struct pglist_data { unsigned long node_present_pages; /* total number of physical pages */ unsigned long node_spanned_pages; /* total size of physical page range, including holes */ +#ifdef CONFIG_KIDLED + unsigned long node_idle_scan_pfn; + u8 *node_page_age; +#endif + int node_id; wait_queue_head_t kswapd_wait; wait_queue_head_t pfmemalloc_wait; diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h index 7ec86bf31ce4..92766b1b04d8 100644 --- a/include/linux/page-flags-layout.h +++ b/include/linux/page-flags-layout.h @@ -82,6 +82,19 @@ #define LAST_CPUPID_WIDTH 0 #endif +#ifdef CONFIG_KIDLED +#define 
KIDLED_AGE_SHIFT 8 +#define KIDLED_AGE_MASK ((1UL << KIDLED_AGE_SHIFT)-1) +#else +#define KIDLED_AGE_SHIFT 0 +#endif + +#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT+KIDLED_AGE_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS +#define KIDLED_AGE_WIDTH KIDLED_AGE_SHIFT +#else +#define KIDLED_AGE_WIDTH 0 +#endif + /* * We are going to use the flags for the page to node mapping if its in * there. This includes the case where there is no node, so it is implicit. @@ -94,4 +107,8 @@ #define LAST_CPUPID_NOT_IN_PAGE_FLAGS #endif +#if defined(CONFIG_KIDLED) && KIDLED_AGE_WIDTH == 0 +#define KIDLED_AGE_NOT_IN_PAGE_FLAGS +#endif + #endif /* _LINUX_PAGE_FLAGS_LAYOUT */ diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h index 1e894d34bdce..f87648fd0825 100644 --- a/include/linux/page_idle.h +++ b/include/linux/page_idle.h @@ -38,6 +38,9 @@ static inline void clear_page_idle(struct page *page) { ClearPageIdle(page); } + +void page_idle_clear_pte_refs(struct page *page); + #else /* !CONFIG_64BIT */ /* * If there is not enough space to store Idle and Young bits in page flags, use @@ -135,6 +138,10 @@ static inline void clear_page_idle(struct page *page) { } +static inline void page_idle_clear_pte_refs(struct page *page) +{ +} + #endif /* CONFIG_IDLE_PAGE_TRACKING */ #endif /* _LINUX_MM_PAGE_IDLE_H */ diff --git a/mm/Kconfig b/mm/Kconfig index b457e94ae618..453a8446e951 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -764,4 +764,16 @@ config GUP_BENCHMARK config ARCH_HAS_PTE_SPECIAL bool +config KIDLED + bool "Enable kernel thread to scan idle pages" + depends on IDLE_PAGE_TRACKING + help + This introduces kernel thread (kidled) to scan pages in configurable + interval to determine if they are accessed in that interval, to + determine their access frequency. The hot/cold pages are identified + with it and the statistics are exported to user space on basis of + memory cgroup by "memory.idle_page_stats". + + See Documentation/vm/kidled.rst for more details. 
+ endmenu diff --git a/mm/Makefile b/mm/Makefile index 26ef77a3883b..0ca4b8cd21f3 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -105,3 +105,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o obj-$(CONFIG_HMM) += hmm.o obj-$(CONFIG_MEMFD_CREATE) += memfd.o +obj-$(CONFIG_KIDLED) += kidled.o diff --git a/mm/kidled.c b/mm/kidled.c new file mode 100644 index 000000000000..db63de493ece --- /dev/null +++ b/mm/kidled.c @@ -0,0 +1,691 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * Should the accounting be hierarchical? Hierarchical accounting only + * works when memcg is in hierarchy mode. It's OK when kilded enables + * hierarchical accounting while memcg is in non-hierarchy mode, kidled + * will account to the memory cgroup page is charged to. No dependency + * between these two settings. + */ +static bool use_hierarchy __read_mostly; + +struct kidled_scan_period kidled_scan_period; +const int kidled_default_buckets[NUM_KIDLED_BUCKETS] = { + 1, 2, 5, 15, 30, 60, 120, 240 }; +static DECLARE_WAIT_QUEUE_HEAD(kidled_wait); +static unsigned long kidled_scan_rounds __read_mostly; + +static inline int kidled_get_bucket(int *idle_buckets, int age) +{ + int bucket; + + if (age < idle_buckets[0]) + return -EINVAL; + + for (bucket = 1; bucket <= (NUM_KIDLED_BUCKETS - 1); bucket++) { + if (age < idle_buckets[bucket]) + return bucket - 1; + } + + return NUM_KIDLED_BUCKETS - 1; +} + +static inline int kidled_get_idle_type(struct page *page) +{ + int idle_type = KIDLE_BASE; + + if (PageDirty(page) || PageWriteback(page)) + idle_type |= KIDLE_DIRTY; + if (page_is_file_cache(page)) + idle_type |= KIDLE_FILE; + /* + * Couldn't call page_evictable() here, because we have not held + * the page lock, so use page flags instead. Different from + * PageMlocked(). 
+ */ + if (PageUnevictable(page)) + idle_type |= KIDLE_UNEVICT; + if (PageActive(page)) + idle_type |= KIDLE_ACTIVE; + return idle_type; +} + +#ifndef KIDLED_AGE_NOT_IN_PAGE_FLAGS +int kidled_inc_page_age(pg_data_t *pgdat, unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + unsigned long old, new; + int age; + + do { + age = ((page->flags >> KIDLED_AGE_PGSHIFT) & KIDLED_AGE_MASK); + if (age >= KIDLED_AGE_MASK) + break; + + new = old = page->flags; + new &= ~(KIDLED_AGE_MASK << KIDLED_AGE_PGSHIFT); + new |= (((age + 1) & KIDLED_AGE_MASK) << KIDLED_AGE_PGSHIFT); + } while (unlikely(cmpxchg(&page->flags, old, new) != old)); + + return age; +} +EXPORT_SYMBOL_GPL(kidled_inc_page_age); + +void kidled_set_page_age(pg_data_t *pgdat, unsigned long pfn, int val) +{ + struct page *page = pfn_to_page(pfn); + unsigned long old, new; + + do { + new = old = page->flags; + new &= ~(KIDLED_AGE_MASK << KIDLED_AGE_PGSHIFT); + new |= ((val & KIDLED_AGE_MASK) << KIDLED_AGE_PGSHIFT); + } while (unlikely(cmpxchg(&page->flags, old, new) != old)); + +} +EXPORT_SYMBOL_GPL(kidled_set_page_age); +#endif /* !KIDLED_AGE_NOT_IN_PAGE_FLAGS */ + +#ifdef CONFIG_MEMCG +static inline void kidled_mem_cgroup_account(struct page *page, + int age, + int nr_pages) +{ + struct mem_cgroup *memcg; + struct idle_page_stats *stats; + int type, bucket; + + if (mem_cgroup_disabled()) + return; + + type = kidled_get_idle_type(page); + + memcg = lock_page_memcg(page); + if (unlikely(!memcg)) { + unlock_page_memcg(page); + return; + } + + stats = mem_cgroup_get_unstable_idle_stats(memcg); + bucket = kidled_get_bucket(stats->buckets, age); + if (bucket >= 0) + stats->count[type][bucket] += nr_pages; + + unlock_page_memcg(page); +} + +void kidled_mem_cgroup_move_stats(struct mem_cgroup *from, + struct mem_cgroup *to, + struct page *page, + unsigned int nr_pages) +{ + pg_data_t *pgdat = page_pgdat(page); + unsigned long pfn = page_to_pfn(page); + struct idle_page_stats *stats[4] = { NULL, }; + int type, 
bucket, age; + + if (mem_cgroup_disabled()) + return; + + type = kidled_get_idle_type(page); + stats[0] = mem_cgroup_get_stable_idle_stats(from); + stats[1] = mem_cgroup_get_unstable_idle_stats(from); + if (to) { + stats[2] = mem_cgroup_get_stable_idle_stats(to); + stats[3] = mem_cgroup_get_unstable_idle_stats(to); + } + + /* + * We assume the all page ages are same if this is a compound page. + * Also we uses node's cursor (@node_idle_scan_pfn) to check if current + * page should be removed from the source memory cgroup or charged + * to target memory cgroup, without introducing locking mechanism. + * This may lead to slightly inconsistent statistics, but it's fine + * as it will be reshuffled in next round of scanning. + */ + age = kidled_get_page_age(pgdat, pfn); + if (age < 0) + return; + + bucket = kidled_get_bucket(stats[1]->buckets, age); + if (bucket < 0) + return; + + /* Remove from the source memory cgroup */ + if (stats[0]->count[type][bucket] > nr_pages) + stats[0]->count[type][bucket] -= nr_pages; + else + stats[0]->count[type][bucket] = 0; + if (pgdat->node_idle_scan_pfn >= pfn) { + if (stats[1]->count[type][bucket] > nr_pages) + stats[1]->count[type][bucket] -= nr_pages; + else + stats[1]->count[type][bucket] = 0; + } + + /* Charge to the target memory cgroup */ + if (!to) + return; + + bucket = kidled_get_bucket(stats[3]->buckets, age); + if (bucket < 0) + return; + + stats[2]->count[type][bucket] += nr_pages; + if (pgdat->node_idle_scan_pfn >= pfn) + stats[3]->count[type][bucket] += nr_pages; +} +EXPORT_SYMBOL_GPL(kidled_mem_cgroup_move_stats); + +static inline void kidled_mem_cgroup_scan_done(struct kidled_scan_period period) +{ + struct mem_cgroup *memcg; + struct idle_page_stats *stable_stats, *unstable_stats; + + for (memcg = mem_cgroup_iter(NULL, NULL, NULL); + memcg != NULL; + memcg = mem_cgroup_iter(NULL, memcg, NULL)) { + + down_write(&memcg->idle_stats_rwsem); + stable_stats = mem_cgroup_get_stable_idle_stats(memcg); + unstable_stats = 
mem_cgroup_get_unstable_idle_stats(memcg); + + /* + * Switch when scanning buckets is valid, or copy buckets + * from stable_stats's buckets which may have user's new + * buckets(maybe valid or not). + */ + if (!KIDLED_IS_BUCKET_INVALID(unstable_stats->buckets)) { + mem_cgroup_idle_page_stats_switch(memcg); + memcg->idle_scans++; + } else { + memcpy(unstable_stats->buckets, stable_stats->buckets, + sizeof(unstable_stats->buckets)); + } + + memcg->scan_period = period; + up_write(&memcg->idle_stats_rwsem); + + unstable_stats = mem_cgroup_get_unstable_idle_stats(memcg); + memset(&unstable_stats->count, 0, + sizeof(unstable_stats->count)); + } +} + +static inline void kidled_mem_cgroup_reset(void) +{ + struct mem_cgroup *memcg; + struct idle_page_stats *stable_stats, *unstable_stats; + + for (memcg = mem_cgroup_iter(NULL, NULL, NULL); + memcg != NULL; + memcg = mem_cgroup_iter(NULL, memcg, NULL)) { + down_write(&memcg->idle_stats_rwsem); + stable_stats = mem_cgroup_get_stable_idle_stats(memcg); + unstable_stats = mem_cgroup_get_unstable_idle_stats(memcg); + memset(&stable_stats->count, 0, sizeof(stable_stats->count)); + + memcg->idle_scans = 0; + kidled_reset_scan_period(&memcg->scan_period); + up_write(&memcg->idle_stats_rwsem); + + memset(&unstable_stats->count, 0, + sizeof(unstable_stats->count)); + } +} +#else /* !CONFIG_MEMCG */ +static inline void kidled_mem_cgroup_account(struct page *page, + int age, + int nr_pages) +{ +} +static inline void kidled_mem_cgroup_scan_done(struct kidled_scan_period + scan_period) +{ +} +static inline void kidled_mem_cgroup_reset(void) +{ +} +#endif /* CONFIG_MEMCG */ + +/* + * An idle page with an older age is more likely idle, while a busy page is + * more likely busy, so we can reduce the sampling frequency to save cpu + * resource when meet these pages. And we will keep sampling each time when + * an idle page is young. 
See the table below:
+ *
+ *  idle age  | down ratio
+ * -----------+---------------------
+ *  [0, 1)    | 1/2   # busy
+ *  [1, 4)    | 1     # young idle
+ *  [4, 8)    | 1/2   # idle
+ *  [8, 16)   | 1/4   # old idle
+ *  [16, +inf)| 1/8   # older idle
+ */
+static inline bool kidled_need_check_idle(pg_data_t *pgdat, unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	int age = kidled_get_page_age(pgdat, pfn);
+	unsigned long pseudo_random;
+
+	if (age < 0)
+		return false;
+
+	/*
+	 * When the sampling frequency needs to be reduced, kidled checks
+	 * different pages in each round, depending on the current pfn and
+	 * the global scanning round. There are some special pfns: for a
+	 * huge page, only the head page can be checked here; the tail
+	 * pages are covered at lower levels and will be skipped. Shifting
+	 * by HPAGE_PMD_ORDER bits achieves a good load balance across
+	 * rounds when the system has many huge pages; 1GB huge pages are
+	 * not considered here.
+	 */
+	if (PageTransHuge(page))
+		pfn >>= compound_order(page);
+
+	pseudo_random = pfn + kidled_scan_rounds;
+	if (age == 0)
+		return pseudo_random & 0x1UL;
+	else if (age < 4)
+		return true;
+	else if (age < 8)
+		return pseudo_random & 0x1UL;
+	else if (age < 16)
+		return (pseudo_random & 0x3UL) == 0x3UL;
+	else
+		return (pseudo_random & 0x7UL) == 0x7UL;
+}
+
+static inline int kidled_scan_page(pg_data_t *pgdat, unsigned long pfn)
+{
+	struct page *page;
+	int age, nr_pages = 1, idx;
+	bool idle = false;
+
+	if (!pfn_valid(pfn))
+		goto out;
+
+	page = pfn_to_page(pfn);
+	if (!page || !PageLRU(page)) {
+		kidled_set_page_age(pgdat, pfn, 0);
+		goto out;
+	}
+
+	/*
+	 * Try to avoid clearing the PTE references, which is an expensive
+	 * call. PG_idle should be cleared when a page is freed, and the
+	 * PG_lru flag has been checked above, so the race is acceptable
+	 * to us.
+	 */
+	if (page_is_idle(page)) {
+		if (kidled_need_check_idle(pgdat, pfn)) {
+			if (!get_page_unless_zero(page)) {
+				kidled_set_page_age(pgdat, pfn, 0);
+				goto out;
+			}
+
+			/*
+			 * Check again after getting a reference count.
+			 * page_idle_get_page() takes zone_lru_lock first
+			 * for this, but that seems unnecessary here.
+			 *
+			 * Also, we can't hold the LRU lock here, as the
+			 * time budget to finish the scanning is fixed;
+			 * otherwise the accumulated statistics would be
+			 * cleared out and the scan interval
+			 * (@scan_period_in_seconds) would be doubled.
+			 * However, this may incur a race between kidled
+			 * and page reclaim: page reclaim may dry-run due
+			 * to the bumped refcount, but that's acceptable.
+			 */
+			if (unlikely(!PageLRU(page))) {
+				put_page(page);
+				kidled_set_page_age(pgdat, pfn, 0);
+				goto out;
+			}
+
+			page_idle_clear_pte_refs(page);
+			if (page_is_idle(page))
+				idle = true;
+			put_page(page);
+		} else if (kidled_get_page_age(pgdat, pfn) > 0) {
+			idle = true;
+		}
+	}
+
+	if (PageTransHuge(page))
+		nr_pages = 1 << compound_order(page);
+
+	if (idle) {
+		age = kidled_inc_page_age(pgdat, pfn);
+		if (age > 0)
+			kidled_mem_cgroup_account(page, age, nr_pages);
+		else
+			age = 0;
+	} else {
+		age = 0;
+		kidled_set_page_age(pgdat, pfn, 0);
+		if (get_page_unless_zero(page)) {
+			if (likely(PageLRU(page)))
+				set_page_idle(page);
+			put_page(page);
+		}
+	}
+
+	for (idx = 1; idx < nr_pages; idx++)
+		kidled_set_page_age(pgdat, pfn + idx, age);
+
+out:
+	return nr_pages;
+}
+
+static bool kidled_scan_node(pg_data_t *pgdat,
+			     struct kidled_scan_period scan_period,
+			     bool restart)
+{
+	unsigned long pfn, end, node_end;
+
+#ifdef KIDLED_AGE_NOT_IN_PAGE_FLAGS
+	if (unlikely(!pgdat->node_page_age)) {
+		pgdat->node_page_age = vzalloc(pgdat->node_spanned_pages);
+		if (unlikely(!pgdat->node_page_age))
+			return false;
+	}
+#endif /* KIDLED_AGE_NOT_IN_PAGE_FLAGS */
+
+	node_end = pgdat_end_pfn(pgdat);
+	pfn = pgdat->node_start_pfn;
+	if (!restart && pfn < pgdat->node_idle_scan_pfn)
+		pfn =
pgdat->node_idle_scan_pfn; + end = min(pfn + DIV_ROUND_UP(pgdat->node_spanned_pages, + scan_period.duration), node_end); + while (pfn < end) { + /* Restart new scanning when user updates the period */ + if (unlikely(!kidled_is_scan_period_equal(&scan_period))) + break; + + cond_resched(); + pfn += kidled_scan_page(pgdat, pfn); + } + + pgdat->node_idle_scan_pfn = pfn; + return pfn >= node_end; +} + +static inline void kidled_scan_done(struct kidled_scan_period scan_period) +{ + kidled_mem_cgroup_scan_done(scan_period); + kidled_scan_rounds++; +} + +static inline void kidled_reset(bool free) +{ + pg_data_t *pgdat; + + kidled_mem_cgroup_reset(); + + get_online_mems(); + +#ifdef KIDLED_AGE_NOT_IN_PAGE_FLAGS + for_each_online_pgdat(pgdat) { + if (!pgdat->node_page_age) + continue; + + if (free) { + vfree(pgdat->node_page_age); + pgdat->node_page_age = NULL; + } else { + memset(pgdat->node_page_age, 0, + pgdat->node_spanned_pages); + } + + cond_resched(); + } +#else + for_each_online_pgdat(pgdat) { + unsigned long pfn, end_pfn = pgdat->node_start_pfn + + pgdat->node_spanned_pages; + + for (pfn = pgdat->node_start_pfn; pfn < end_pfn; pfn++) { + if (!pfn_valid(pfn)) + continue; + + kidled_set_page_age(pgdat, pfn, 0); + + if (pfn % HPAGE_PMD_NR == 0) + cond_resched(); + } + } +#endif /* KIDLED_AGE_NOT_IN_PAGE_FLAGS */ + + put_online_mems(); +} + +static inline bool kidled_should_run(struct kidled_scan_period *p, bool *new) +{ + if (unlikely(!kidled_is_scan_period_equal(p))) { + struct kidled_scan_period scan_period; + + scan_period = kidled_get_current_scan_period(); + if (p->duration) + kidled_reset(!scan_period.duration); + *p = scan_period; + *new = true; + } else { + *new = false; + } + + if (p->duration > 0) + return true; + + return false; +} + +static int kidled(void *dummy) +{ + int busy_loop = 0; + bool restart = true; + struct kidled_scan_period scan_period; + + kidled_reset_scan_period(&scan_period); + + while (!kthread_should_stop()) { + pg_data_t *pgdat; + u64 
start_jiffies, elapsed;
+		bool new, scan_done = true;
+
+		wait_event_interruptible(kidled_wait,
+			kidled_should_run(&scan_period, &new));
+		if (unlikely(new)) {
+			restart = true;
+			busy_loop = 0;
+		}
+
+		if (unlikely(scan_period.duration == 0))
+			continue;
+
+		start_jiffies = jiffies_64;
+		get_online_mems();
+		for_each_online_pgdat(pgdat) {
+			scan_done &= kidled_scan_node(pgdat,
+						      scan_period,
+						      restart);
+		}
+		put_online_mems();
+
+		if (scan_done) {
+			kidled_scan_done(scan_period);
+			restart = true;
+		} else {
+			restart = false;
+		}
+
+		/*
+		 * We want kidled to scan a given number of pages in each
+		 * slice, depending on scan_period, and each slice is
+		 * supposed to finish within one second:
+		 *
+		 *	pages_to_scan = total_pages / scan_duration
+		 *	for_each_slice() {
+		 *		start_jiffies = jiffies_64;
+		 *		scan_pages(pages_to_scan);
+		 *		elapsed = jiffies_64 - start_jiffies;
+		 *		sleep(HZ - elapsed);
+		 *	}
+		 *
+		 * We treat a slice as busy when elapsed >= (HZ / 2); if it
+		 * stays busy for several consecutive slices, we scale up
+		 * the scan duration.
+		 *
+		 * NOTE it's a simple guard, not a promise.
+ */ +#define KIDLED_BUSY_RUNNING (HZ / 2) +#define KIDLED_BUSY_LOOP_THRESHOLD 10 + elapsed = jiffies_64 - start_jiffies; + if (elapsed < KIDLED_BUSY_RUNNING) { + busy_loop = 0; + schedule_timeout_interruptible(HZ - elapsed); + } else if (++busy_loop == KIDLED_BUSY_LOOP_THRESHOLD) { + busy_loop = 0; + if (kidled_try_double_scan_period(scan_period)) { + pr_warn_ratelimited("%s: period -> %u\n", + __func__, + kidled_get_current_scan_duration()); + } + + /* sleep for a while to relax cpu */ + schedule_timeout_interruptible(elapsed); + } + } + + return 0; +} + +bool kidled_use_hierarchy(void) +{ + return use_hierarchy; +} + +static ssize_t kidled_scan_period_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sprintf(buf, "%u\n", kidled_get_current_scan_duration()); +} + +/* + * We will update the real scan period and do reset asynchronously, + * avoid stall when kidled is busy waiting for other resources. + */ +static ssize_t kidled_scan_period_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + unsigned long secs; + int ret; + + ret = kstrtoul(buf, 10, &secs); + if (ret || secs > KIDLED_MAX_SCAN_DURATION) + return -EINVAL; + + kidled_set_scan_duration(secs); + wake_up_interruptible(&kidled_wait); + return count; +} + +static ssize_t kidled_use_hierarchy_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sprintf(buf, "%u\n", use_hierarchy); +} + +static ssize_t kidled_use_hierarchy_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + unsigned long val; + int ret; + + ret = kstrtoul(buf, 10, &val); + if (ret || val > 1) + return -EINVAL; + + WRITE_ONCE(use_hierarchy, val); + + /* + * Always start a new period when user sets use_hierarchy, + * kidled_inc_scan_seq() uses atomic_cmpxchg() which implies a + * memory barrier. This will make sure readers will get new + * statistics after the store returned. 
But there still exists + * a rare race when storing: + * + * writer | readers + * | + * update_use_hierarchy | + * ..... | read_statistics <-- race + * increase_scan_sequence | + * + * readers may get new use_hierarchy value and old statistics, + * ignore this.. + */ + kidled_inc_scan_seq(); + return count; +} + +static struct kobj_attribute kidled_scan_period_attr = + __ATTR(scan_period_in_seconds, 0644, + kidled_scan_period_show, kidled_scan_period_store); +static struct kobj_attribute kidled_use_hierarchy_attr = + __ATTR(use_hierarchy, 0644, + kidled_use_hierarchy_show, kidled_use_hierarchy_store); + +static struct attribute *kidled_attrs[] = { + &kidled_scan_period_attr.attr, + &kidled_use_hierarchy_attr.attr, + NULL +}; +static struct attribute_group kidled_attr_group = { + .name = "kidled", + .attrs = kidled_attrs, +}; + +static int __init kidled_init(void) +{ + struct task_struct *thread; + struct sched_param param = { .sched_priority = 0 }; + int ret; + + ret = sysfs_create_group(mm_kobj, &kidled_attr_group); + if (ret) { + pr_warn("%s: Error %d on creating sysfs files\n", + __func__, ret); + return ret; + } + + thread = kthread_run(kidled, NULL, "kidled"); + if (IS_ERR(thread)) { + sysfs_remove_group(mm_kobj, &kidled_attr_group); + pr_warn("%s: Failed to start kthread\n", __func__); + return PTR_ERR(thread); + } + + /* Make kidled as nice as possible. 
 */
+	sched_setscheduler(thread, SCHED_IDLE, &param);
+
+	return 0;
+}
+
+module_init(kidled_init);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8a728e07018f..31abc4dc1c54 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3553,6 +3553,246 @@ static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
 	return nbytes;
 }
 
+#ifdef CONFIG_KIDLED
+static int mem_cgroup_idle_page_stats_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *iter, *memcg = mem_cgroup_from_css(seq_css(m));
+	struct kidled_scan_period scan_period, period;
+	struct idle_page_stats stats, cache;
+	unsigned long scans;
+	bool has_hierarchy = kidled_use_hierarchy();
+	bool no_buckets = false;
+	int i, j, t;
+
+	down_read(&memcg->idle_stats_rwsem);
+	stats = memcg->idle_stats[memcg->idle_stable_idx];
+	scans = memcg->idle_scans;
+	scan_period = memcg->scan_period;
+	up_read(&memcg->idle_stats_rwsem);
+
+	/* Nothing will be output with invalid buckets */
+	if (KIDLED_IS_BUCKET_INVALID(stats.buckets)) {
+		no_buckets = true;
+		scans = 0;
+		goto output;
+	}
+
+	/* Zeroes will be output with a mismatched scan period */
+	if (!kidled_is_scan_period_equal(&scan_period)) {
+		memset(&stats.count, 0, sizeof(stats.count));
+		scan_period = kidled_get_current_scan_period();
+		scans = 0;
+		goto output;
+	}
+
+	if (mem_cgroup_is_root(memcg) || has_hierarchy) {
+		for_each_mem_cgroup_tree(iter, memcg) {
+			/* The root memcg was just accounted */
+			if (iter == memcg)
+				continue;
+
+			down_read(&iter->idle_stats_rwsem);
+			cache = iter->idle_stats[iter->idle_stable_idx];
+			period = iter->scan_period;
+			up_read(&iter->idle_stats_rwsem);
+
+			/*
+			 * Skip the accounting if the scan period is
+			 * mismatched or the buckets are invalid.
+			 */
+			if (!kidled_is_scan_period_equal(&period) ||
+			    KIDLED_IS_BUCKET_INVALID(cache.buckets))
+				continue;
+
+			/*
+			 * The buckets of the current memory cgroup might be
+			 * mismatched with those of the root memory cgroup.
We + * charge the current statistics to the possibly + * largest bucket. The users need to apply the + * consistent buckets into the memory cgroups in + * the hierarchy tree. + */ + for (i = 0; i < NUM_KIDLED_BUCKETS; i++) { + for (j = 0; j < NUM_KIDLED_BUCKETS - 1; j++) { + if (cache.buckets[i] <= + stats.buckets[j]) + break; + } + + for (t = 0; t < KIDLE_NR_TYPE; t++) + stats.count[t][j] += cache.count[t][i]; + } + } + } + + +output: + seq_printf(m, "# version: %s\n", KIDLED_VERSION); + seq_printf(m, "# scans: %lu\n", scans); + seq_printf(m, "# scan_period_in_seconds: %u\n", scan_period.duration); + seq_printf(m, "# use_hierarchy: %u\n", kidled_use_hierarchy()); + seq_puts(m, "# buckets: "); + if (no_buckets) { + seq_puts(m, "no valid bucket available\n"); + return 0; + } + + for (i = 0; i < NUM_KIDLED_BUCKETS; i++) { + seq_printf(m, "%d", stats.buckets[i]); + + if ((i == NUM_KIDLED_BUCKETS - 1) || + !stats.buckets[i + 1]) { + seq_puts(m, "\n"); + j = i + 1; + break; + } + seq_puts(m, ","); + } + seq_puts(m, "#\n"); + + seq_puts(m, "# _-----=> clean/dirty\n"); + seq_puts(m, "# / _----=> swap/file\n"); + seq_puts(m, "# | / _---=> evict/unevict\n"); + seq_puts(m, "# || / _--=> inactive/active\n"); + seq_puts(m, "# ||| /\n"); + + seq_printf(m, "# %-8s", "||||"); + for (i = 0; i < j; i++) { + char region[20]; + + if (i == j - 1) { + snprintf(region, sizeof(region), "[%d,+inf)", + stats.buckets[i]); + } else { + snprintf(region, sizeof(region), "[%d,%d)", + stats.buckets[i], + stats.buckets[i + 1]); + } + + seq_printf(m, " %14s", region); + } + seq_puts(m, "\n"); + + for (t = 0; t < KIDLE_NR_TYPE; t++) { + char kidled_type_str[5]; + + kidled_type_str[0] = t & KIDLE_DIRTY ? 'd' : 'c'; + kidled_type_str[1] = t & KIDLE_FILE ? 'f' : 's'; + kidled_type_str[2] = t & KIDLE_UNEVICT ? 'u' : 'e'; + kidled_type_str[3] = t & KIDLE_ACTIVE ? 
'a' : 'i';
+		kidled_type_str[4] = '\0';
+		seq_printf(m, " %-8s", kidled_type_str);
+
+		for (i = 0; i < j; i++) {
+			seq_printf(m, " %14lu",
+				   stats.count[t][i] << PAGE_SHIFT);
+		}
+
+		seq_puts(m, "\n");
+	}
+
+	return 0;
+}
+
+static ssize_t mem_cgroup_idle_page_stats_write(struct kernfs_open_file *of,
+						char *buf, size_t nbytes,
+						loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	struct idle_page_stats *stable_stats, *unstable_stats;
+	int buckets[NUM_KIDLED_BUCKETS] = { 0 }, i = 0, err;
+	unsigned long prev = 0, curr;
+	char *next;
+
+	buf = strstrip(buf);
+	while (*buf) {
+		if (i >= NUM_KIDLED_BUCKETS)
+			return -E2BIG;
+
+		/* Get next entry */
+		next = buf + 1;
+		while (*next && *next >= '0' && *next <= '9')
+			next++;
+		while (*next && (*next == ' ' || *next == ','))
+			*next++ = '\0';
+
+		/* Should be monotonically increasing */
+		err = kstrtoul(buf, 10, &curr);
+		if (err || curr > KIDLED_MAX_IDLE_AGE || curr <= prev)
+			return -EINVAL;
+
+		buckets[i++] = curr;
+		prev = curr;
+		buf = next;
+	}
+
+	/* No buckets set, mark it invalid */
+	if (i == 0)
+		KIDLED_MARK_BUCKET_INVALID(buckets);
+	if (down_write_killable(&memcg->idle_stats_rwsem))
+		return -EINTR;
+	stable_stats = mem_cgroup_get_stable_idle_stats(memcg);
+	unstable_stats = mem_cgroup_get_unstable_idle_stats(memcg);
+	memcpy(stable_stats->buckets, buckets, sizeof(buckets));
+
+	/*
+	 * We clear the stats without checking whether the buckets have
+	 * been changed; this works when the user only wants to reset
+	 * the stats but not the buckets.
+	 */
+	memset(stable_stats->count, 0, sizeof(stable_stats->count));
+
+	/*
+	 * It's safe for kidled to read the unstable buckets without
+	 * holding any read side locks.
+ */ + KIDLED_MARK_BUCKET_INVALID(unstable_stats->buckets); + memcg->idle_scans = 0; + up_write(&memcg->idle_stats_rwsem); + + return nbytes; +} + +static void kidled_memcg_init(struct mem_cgroup *memcg) +{ + int type; + + init_rwsem(&memcg->idle_stats_rwsem); + for (type = 0; type < KIDLED_STATS_NR_TYPE; type++) { + memcpy(memcg->idle_stats[type].buckets, + kidled_default_buckets, + sizeof(kidled_default_buckets)); + } +} + +static void kidled_memcg_inherit_parent_buckets(struct mem_cgroup *parent, + struct mem_cgroup *memcg) +{ + int idle_buckets[NUM_KIDLED_BUCKETS], type; + + down_read(&parent->idle_stats_rwsem); + memcpy(idle_buckets, + parent->idle_stats[parent->idle_stable_idx].buckets, + sizeof(idle_buckets)); + up_read(&parent->idle_stats_rwsem); + + for (type = 0; type < KIDLED_STATS_NR_TYPE; type++) { + memcpy(memcg->idle_stats[type].buckets, + idle_buckets, + sizeof(idle_buckets)); + } +} +#else +static void kidled_memcg_init(struct mem_cgroup *memcg) +{ +} + +static void kidled_memcg_inherit_parent_buckets(struct mem_cgroup *parent, + struct mem_cgroup *memcg) +{ +} +#endif /* CONFIG_KIDLED */ + static u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css, struct cftype *cft) { @@ -4661,6 +4901,13 @@ static struct cftype mem_cgroup_legacy_files[] = { .write = mem_cgroup_reset, .read_u64 = mem_cgroup_read_u64, }, +#ifdef CONFIG_KIDLED + { + .name = "idle_page_stats", + .seq_show = mem_cgroup_idle_page_stats_show, + .write = mem_cgroup_idle_page_stats_write, + }, +#endif { }, /* terminate */ }; @@ -4843,6 +5090,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void) #ifdef CONFIG_CGROUP_WRITEBACK INIT_LIST_HEAD(&memcg->cgwb_list); #endif + kidled_memcg_init(memcg); idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); return memcg; fail: @@ -4871,6 +5119,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) /* Default gap is 0.5% max limit */ memcg->wmark_scale_factor = parent->wmark_scale_factor ? 
: 50; + kidled_memcg_inherit_parent_buckets(parent, memcg); } if (parent && parent->use_hierarchy) { memcg->use_hierarchy = true; @@ -5235,6 +5484,8 @@ static int mem_cgroup_move_account(struct page *page, ret = 0; + kidled_mem_cgroup_move_stats(from, to, page, nr_pages); + local_irq_disable(); mem_cgroup_charge_statistics(to, page, compound, nr_pages); memcg_check_events(to, page); @@ -6143,6 +6394,13 @@ static struct cftype memory_files[] = { .seq_show = memory_oom_group_show, .write = memory_oom_group_write, }, +#ifdef CONFIG_KIDLED + { + .name = "idle_page_stats", + .seq_show = mem_cgroup_idle_page_stats_show, + .write = mem_cgroup_idle_page_stats_write, + }, +#endif { } /* terminate */ }; diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 190aed2a906f..883c80e7a339 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -775,6 +775,12 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon pgdat->node_start_pfn = start_pfn; pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn; +#ifdef KIDLED_AGE_NOT_IN_PAGE_FLAGS + if (pgdat->node_page_age) { + vfree(pgdat->node_page_age); + pgdat->node_page_age = NULL; + } +#endif } void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, @@ -1880,6 +1886,13 @@ void try_offline_node(int nid) if (check_and_unmap_cpu_on_node(pgdat)) return; +#ifdef KIDLED_AGE_NOT_IN_PAGE_FLAGS + if (pgdat->node_page_age) { + vfree(pgdat->node_page_age); + pgdat->node_page_age = NULL; + } +#endif + /* * all memory/cpu of this node are removed, we can offline this * node now. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6e6a03c667ea..750c4d6d59ca 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1034,6 +1034,17 @@ static __always_inline bool free_pages_prepare(struct page *page, bad++; continue; } + + /* + * The page age information is stored in page flags + * or node's page array. We need to explicitly clear + * it in both cases. 
Otherwise, the stale age will + * be provided when it's allocated again. Also, we + * maintain age information for each page in the + * compound page, So we have to clear them one by one. + */ + kidled_set_page_age(page_pgdat(page + i), + page_to_pfn(page + i), 0); (page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; } } @@ -1047,6 +1058,7 @@ static __always_inline bool free_pages_prepare(struct page *page, return false; page_cpupid_reset_last(page); + kidled_set_page_age(page_pgdat(page), page_to_pfn(page), 0); page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; reset_page_owner(page, order); diff --git a/mm/page_idle.c b/mm/page_idle.c index 52ed59bbc275..e21293799c4f 100644 --- a/mm/page_idle.c +++ b/mm/page_idle.c @@ -92,7 +92,7 @@ static bool page_idle_clear_pte_refs_one(struct page *page, return true; } -static void page_idle_clear_pte_refs(struct page *page) +void page_idle_clear_pte_refs(struct page *page) { /* * Since rwc.arg is unused, rwc is effectively immutable, so we -- GitLab