diff --git a/Documentation/vm/kidled.rst b/Documentation/vm/kidled.rst
new file mode 100644
index 0000000000000000000000000000000000000000..016274a06715739cb5a20d27330ce96287e15017
--- /dev/null
+++ b/Documentation/vm/kidled.rst
@@ -0,0 +1,139 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+======
+kidled
+======
+
+Introduction
+============
+
+kidled uses a kernel thread to scan the pages on the LRU lists and can
+output statistics for each memory cgroup (per-process statistics are not
+supported yet). kidled scans pages round by round, indexed by pfn, and
+tries to finish each round within a fixed duration called the scan
+period. Users can set the scan period in seconds. Each page has an
+attribute called 'idle age', which represents how long the page has been
+kept in the idle state, measured in scan periods. The age field consumes
+one byte, stored either in a dynamically allocated per-node array or in
+the flags field of the page descriptor (struct page), so the maximum age
+is 255. kidled shows histogram statistics through the memory cgroup file
+``memory.idle_page_stats``. The statistics can be used to estimate the
+working-set size of a memory cgroup or hierarchy.
+
+
+Usage
+=====
+
+kidled exports two sysfs files and one memory cgroup file. Their
+functions are:
+
+* ``/sys/kernel/mm/kidled/scan_period_in_seconds``
+
+  It controls the scan period of the kernel thread. A smaller value
+  gives higher resolution but consumes more CPU cycles. Scanning is
+  disabled when the parameter is set to 0, which is the default.
+  Writing to the file clears all previously collected statistics, even
+  if the scan period is unchanged.
+
+.. note::
+    A rare race exists! ``scan_period_in_seconds`` is the only thing
+    visible to users.
The duration and sequence number are internal
+    representations for developers, and are hidden from users to avoid
+    confusion. When a user updates the ``scan_period_in_seconds`` file,
+    the sequence number is increased and the duration is updated
+    synchronously, as the figure below shows:
+
+      OP               | VALUE OF SCAN_PERIOD
+      Initial value    | seq = 0, duration = 0
+      user update 120s | seq = 1, duration = 120   <---- last value kidled sees
+      user update 120s | seq = 2, duration = 120  ---+
+      ....             |                             | kidled may miss these
+      ....             |                             | updates because busy
+      user update 300s | seq = 65535, duration = 300 |
+      user update 300s | seq = 0, duration = 300  ---+
+      user update 120s | seq = 1, duration = 120   <---- next value kidled sees
+
+    The race happens when ``scan_period_in_seconds`` is updated so fast
+    that kidled misses exactly 65536 * N (N = 1,2,3...) updates while
+    the duration stays the same. kidled then won't clear the previous
+    statistics, but the result is not too odd since at least the
+    duration is unchanged.
+
+* ``/sys/kernel/mm/kidled/use_hierarchy``
+
+  It controls whether ``memory.idle_page_stats`` reports hierarchically
+  accumulated statistics. When it is set to zero, each memory cgroup
+  shows its own statistics, except the root memory cgroup, which always
+  shows the accumulated statistics. When it is set to one, accumulated
+  statistics are always shown.
+
+* ``memory.idle_page_stats`` (memory cgroup v1/v2)
+
+  It shows a histogram of idle statistics for the corresponding memory
+  cgroup. Whether the statistics are accumulated or not depends on the
+  ``use_hierarchy`` setting.
+
+  ----------------------------- snapshot start -----------------------------
+  # version: 1.0
+  # scans: 1380
+  # scan_period_in_seconds: 120
+  # use_hierarchy: 0
+  # buckets: 1,2,5,15,30,60,120,240
+  #
+  #   _-----=> clean/dirty
+  #  / _----=> swap/file
+  # | / _---=> evict/unevict
+  # || / _--=> inactive/active
+  # ||| /
+  # ||||    [1,2)    [2,5)   [5,15)  [15,30)  [30,60) [60,120) [120,240) [240,+inf)
+    csei        0        0        0        0        0        0        0          0
+    dsei        0        0   442368    49152        0    49152   212992    7741440
+    cfei     4096   233472  1171456  1032192    28672    65536   122880  147550208
+    dfei        0        0     4096    20480     4096        0    12288      12288
+    csui        0        0        0        0        0        0        0          0
+    dsui        0        0        0        0        0        0        0          0
+    cfui        0        0        0        0        0        0        0          0
+    dfui        0        0        0        0        0        0        0          0
+    csea    77824   331776  1216512  1069056   217088   372736   327680   33284096
+    dsea        0        0        0        0        0        0        0     139264
+    cfea     4096    57344   606208 13144064    53248   135168  1683456   48357376
+    dfea        0        0        0        0        0        0        0          0
+    csua        0        0        0        0        0        0        0          0
+    dsua        0        0        0        0        0        0        0          0
+    cfua        0        0        0        0        0        0        0          0
+    dfua        0        0        0        0        0        0        0          0
+  ----------------------------- snapshot end -----------------------------
+
+  ``scans`` shows how many rounds the current cgroup has been scanned.
+  ``scan_period_in_seconds`` shows how long kidled takes to finish one
+  round. ``use_hierarchy`` shows whether the current statistics are
+  accounted hierarchically, see above. ``buckets`` is echoed back so
+  that scripts can parse the table easily. The table shows how many
+  bytes are in the idle state; rows are indexed by idle type, columns
+  by idle age.
+
+  E.g. 331776 bytes are idle at column ``[2,5)`` and row ``csea``:
+  ``csea`` means the pages are clean && swappable && evictable &&
+  active, and ``[2,5)`` means the pages have been idle for at least 240
+  seconds and less than 600 seconds (obtained as
+  [2, 5) * scan_period_in_seconds). The last column ``[240,+inf)``
+  means the pages have been idle for a long time, more than 28800
+  seconds.
+
+  Each memory cgroup can have its own histogram sampling, different
+  from others, by echoing a monotonically increasing array to this
+  file. Each number must be less than 256, and at most 8 bucket values
+  are allowed. The write operation clears the previous statistics even
+  if the buckets have not changed. The default setting is
+  "1,2,5,15,30,60,120,240". A null bucket string (i.e. an empty
+  string) disables accounting for the current memcg (NOTE: pages are
+  still accounted to the parent memcg if the parent exists and has
+  non-null buckets). A non-accounting snapshot looks like this:
+
+  ----------------------------- snapshot start -----------------------------
+  $ sudo bash -c "echo '' > /sys/fs/cgroup/memory/test/memory.idle_page_stats"
+  $ cat /sys/fs/cgroup/memory/test/memory.idle_page_stats
+  # version: 1.0
+  # scans: 0
+  # scan_period_in_seconds: 1
+  # use_hierarchy: 1
+  # buckets: no valid bucket available
+  ----------------------------- snapshot end -----------------------------
diff --git a/include/linux/kidled.h b/include/linux/kidled.h
new file mode 100644
index 0000000000000000000000000000000000000000..a212b9b6adf45830076d045d3017164fd03afe23
--- /dev/null
+++ b/include/linux/kidled.h
@@ -0,0 +1,237 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MM_KIDLED_H
+#define _LINUX_MM_KIDLED_H
+
+#ifdef CONFIG_KIDLED
+
+#include
+
+#define KIDLED_VERSION "1.0"
+
+/*
+ * We want more information about a specific idle page: whether it is a
+ * page cache page, whether it is on the active LRU list, and so on. We
+ * use KIDLE_ flags to mark these page attributes; 4 flags are
+ * supported:
+ *
+ * KIDLE_DIRTY   : page is dirty or not;
+ * KIDLE_FILE    : page is a page cache or not;
+ * KIDLE_UNEVICT : page is unevictable or evictable;
+ * KIDLE_ACTIVE  : page is in the active LRU list or not.
+ *
+ * Each KIDLE_ flag occupies one bit of an idle type, so there are
+ * 2^4 = 16 idle types in total.
+ */
+#define KIDLE_BASE 0
+#define KIDLE_DIRTY (1 << 0)
+#define KIDLE_FILE (1 << 1)
+#define KIDLE_UNEVICT (1 << 2)
+#define KIDLE_ACTIVE (1 << 3)
+
+#define KIDLE_NR_TYPE 16
+
+/*
+ * Each page has an idle age, which means how long the page has been in
+ * the idle state, measured in scan periods. Each page's idle age
+ * consumes one byte, so the maximum age is 255. Buckets are used for
+ * histogram sampling depending on the idle age, e.g. the bucket [5,15)
+ * covers pages whose idle age is >= 5 scan periods and < 15 scan
+ * periods. Each bucket value is a split point of the idle age. We
+ * support a maximum of NUM_KIDLED_BUCKETS sampling regions.
+ */
+#define KIDLED_MAX_IDLE_AGE U8_MAX
+#define NUM_KIDLED_BUCKETS 8
+
+/*
+ * Since it's not convenient to get immediate statistics for a memory
+ * cgroup, we use a ping-pong buffer. One buffer stores the stable
+ * statistics and is used for showing; call it the 'stable buffer'.
+ * The other stores the statistics being updated by the scanning
+ * thread; call it the 'unstable buffer'. They are switched when one
+ * scanning round finishes.
+ */
+#define KIDLED_STATS_NR_TYPE 2
+
+/*
+ * When a user does not want to account a specific instance (e.g. a
+ * memory cgroup), the corresponding buckets are marked invalid. kidled
+ * skips accounting when it encounters invalid buckets; note that the
+ * scanning still goes on.
+ *
+ * When a user writes new buckets, the current statistics become
+ * invalid. But we can't reset them immediately, for the reasons above;
+ * we reset at a safe point (i.e. when one round has finished). The new
+ * buckets are stored in the stable stats' buckets, while the unstable
+ * stats' buckets are marked invalid.
+ *
+ * This value must be greater than KIDLED_MAX_IDLE_AGE, and is only
+ * used for the first bucket value, so kidled_get_bucket() can return
+ * quickly. Users shouldn't use KIDLED_INVALID_BUCKET directly.
+ */
+#define KIDLED_INVALID_BUCKET (KIDLED_MAX_IDLE_AGE + 1)
+
+#define KIDLED_MARK_BUCKET_INVALID(buckets) \
+	(buckets[0] = KIDLED_INVALID_BUCKET)
+#define KIDLED_IS_BUCKET_INVALID(buckets) \
+	(buckets[0] == KIDLED_INVALID_BUCKET)
+
+/*
+ * We account the number of idle pages depending on idle type and
+ * buckets for a specific instance (e.g. one memory cgroup or one
+ * process...)
+ */
+struct idle_page_stats {
+	int buckets[NUM_KIDLED_BUCKETS];
+	unsigned long count[KIDLE_NR_TYPE][NUM_KIDLED_BUCKETS];
+};
+
+/*
+ * Duration is in seconds and means how long kidled takes to finish one
+ * round (best effort, no promise). The sequence number is increased
+ * each time the user updates the sysfs file; by comparing sequence
+ * numbers, readers can avoid stale statistics even when the duration
+ * stays the same. However, there is a rare race: the seq num may wrap
+ * and equal the previous seq num. So we also check the duration so
+ * that readers won't get strange statistics. The statistics may still
+ * be stale when both seq and duration equal their previous values, but
+ * that's acceptable because at least the duration is the same.
+ */
+#define KIDLED_MAX_SCAN_DURATION U16_MAX	/* max 65535 seconds */
+struct kidled_scan_period {
+	union {
+		atomic_t val;
+		struct {
+			u16 seq;	/* inc when update */
+			u16 duration;	/* in seconds */
+		};
+	};
+};
+extern struct kidled_scan_period kidled_scan_period;
+
+#define KIDLED_OP_SET_DURATION (1 << 0)
+#define KIDLED_OP_INC_SEQ (1 << 1)
+
+static inline struct kidled_scan_period kidled_get_current_scan_period(void)
+{
+	struct kidled_scan_period scan_period;
+
+	atomic_set(&scan_period.val, atomic_read(&kidled_scan_period.val));
+	return scan_period;
+}
+
+static inline unsigned int kidled_get_current_scan_duration(void)
+{
+	struct kidled_scan_period scan_period =
+				kidled_get_current_scan_period();
+
+	return scan_period.duration;
+}
+
+static inline void kidled_reset_scan_period(struct kidled_scan_period *p)
+{
+	atomic_set(&p->val, 0);
+}
+
+/*
+ * Compare with the global kidled_scan_period, return true if they are
+ * equal.
+ */
+static inline bool kidled_is_scan_period_equal(struct kidled_scan_period *p)
+{
+	return atomic_read(&p->val) == atomic_read(&kidled_scan_period.val);
+}
+
+static inline bool kidled_set_scan_period(int op, u16 duration,
+					  struct kidled_scan_period *orig)
+{
+	bool retry = false;
+
+	/*
+	 * atomic_cmpxchg() tries to update kidled_scan_period. We must
+	 * not retry when the caller specifies an original period, to
+	 * avoid an endless loop.
+ */ + if (!orig) { + orig = &kidled_scan_period; + retry = true; + } + + while (true) { + int new_period_val, old_period_val; + struct kidled_scan_period new_period; + + old_period_val = atomic_read(&orig->val); + atomic_set(&new_period.val, old_period_val); + if (op & KIDLED_OP_INC_SEQ) + new_period.seq++; + if (op & KIDLED_OP_SET_DURATION) + new_period.duration = duration; + new_period_val = atomic_read(&new_period.val); + + if (atomic_cmpxchg(&kidled_scan_period.val, + old_period_val, + new_period_val) == old_period_val) + return true; + + if (!retry) + return false; + } +} + +static inline void kidled_set_scan_duration(u16 duration) +{ + kidled_set_scan_period(KIDLED_OP_INC_SEQ | + KIDLED_OP_SET_DURATION, + duration, NULL); +} + +/* + * Caller must specify the original scan period, avoid the race between + * the double operation and user's updates through sysfs interface. + */ +static inline bool kidled_try_double_scan_period(struct kidled_scan_period orig) +{ + u16 duration = orig.duration; + + if (unlikely(duration == KIDLED_MAX_SCAN_DURATION)) + return false; + + duration <<= 1; + if (duration < orig.duration) + duration = KIDLED_MAX_SCAN_DURATION; + return kidled_set_scan_period(KIDLED_OP_INC_SEQ | + KIDLED_OP_SET_DURATION, + duration, + &orig); +} + +/* + * Increase the sequence number while keep duration the same, it's used + * to start a new period immediately. 
+ */ +static inline void kidled_inc_scan_seq(void) +{ + kidled_set_scan_period(KIDLED_OP_INC_SEQ, 0, NULL); +} + +extern const int kidled_default_buckets[NUM_KIDLED_BUCKETS]; + +bool kidled_use_hierarchy(void); +#ifdef CONFIG_MEMCG +void kidled_mem_cgroup_move_stats(struct mem_cgroup *from, + struct mem_cgroup *to, + struct page *page, + unsigned int nr_pages); +#endif /* CONFIG_MEMCG */ + +#else /* !CONFIG_KIDLED */ + +#ifdef CONFIG_MEMCG +static inline void kidled_mem_cgroup_move_stats(struct mem_cgroup *from, + struct mem_cgroup *to, + struct page *page, + unsigned int nr_pages) +{ +} +#endif /* CONFIG_MEMCG */ + +#endif /* CONFIG_KIDLED */ + +#endif /* _LINUX_MM_KIDLED_H */ diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 8feaa0abf1a4de09ead0f0ff56c9159a33a87228..dfa3a89a1440b4e72e2195e8be406ac53b66c911 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -30,6 +30,7 @@ #include #include #include +#include struct mem_cgroup; struct page; @@ -317,6 +318,14 @@ struct mem_cgroup { struct list_head event_list; spinlock_t event_list_lock; +#ifdef CONFIG_KIDLED + struct rw_semaphore idle_stats_rwsem; + unsigned long idle_scans; + struct kidled_scan_period scan_period; + int idle_stable_idx; + struct idle_page_stats idle_stats[KIDLED_STATS_NR_TYPE]; +#endif + struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ }; @@ -799,6 +808,28 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, void mem_cgroup_split_huge_fixup(struct page *head); #endif +#ifdef CONFIG_KIDLED +static inline struct idle_page_stats * +mem_cgroup_get_stable_idle_stats(struct mem_cgroup *memcg) +{ + return &memcg->idle_stats[memcg->idle_stable_idx]; +} + +static inline struct idle_page_stats * +mem_cgroup_get_unstable_idle_stats(struct mem_cgroup *memcg) +{ + return &memcg->idle_stats[KIDLED_STATS_NR_TYPE - 1 - + memcg->idle_stable_idx]; +} + +static inline void 
+mem_cgroup_idle_page_stats_switch(struct mem_cgroup *memcg) +{ + memcg->idle_stable_idx = KIDLED_STATS_NR_TYPE - 1 - + memcg->idle_stable_idx; +} +#endif /* CONFIG_KIDLED */ + static inline bool is_wmark_ok(struct mem_cgroup *memcg, bool high) { if (high) diff --git a/include/linux/mm.h b/include/linux/mm.h index 45f10f5896b7bb1ed9e453c12689d122b5552e13..3adb081f83d5ca1825c989bc1d17156c0f3d9338 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -794,11 +794,12 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); * sets it, so none of the operations on it need to be atomic. */ -/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPUPID] | ... | FLAGS | */ +/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPUPID] | [KIDLED_AGE] | ... | FLAGS | */ #define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH) #define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH) #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH) #define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH) +#define KIDLED_AGE_PGOFF (LAST_CPUPID_PGOFF - KIDLED_AGE_WIDTH) /* * Define the bit shifts to access each section. 
For non-existent @@ -809,6 +810,7 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0)) #define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0)) #define LAST_CPUPID_PGSHIFT (LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0)) +#define KIDLED_AGE_PGSHIFT (KIDLED_AGE_PGOFF * (KIDLED_AGE_WIDTH != 0)) /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */ #ifdef NODE_NOT_IN_PAGE_FLAGS @@ -1089,6 +1091,71 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid) } #endif /* CONFIG_NUMA_BALANCING */ +#ifdef CONFIG_KIDLED +#ifdef KIDLED_AGE_NOT_IN_PAGE_FLAGS +static inline int kidled_get_page_age(pg_data_t *pgdat, unsigned long pfn) +{ + u8 *age = pgdat->node_page_age; + + if (unlikely(!age)) + return -EINVAL; + + age += (pfn - pgdat->node_start_pfn); + return *age; +} + +static inline int kidled_inc_page_age(pg_data_t *pgdat, unsigned long pfn) +{ + u8 *age = pgdat->node_page_age; + + if (unlikely(!age)) + return -EINVAL; + + age += (pfn - pgdat->node_start_pfn); + *age += 1; + + return *age; +} + +static inline void kidled_set_page_age(pg_data_t *pgdat, + unsigned long pfn, int val) +{ + u8 *age = pgdat->node_page_age; + + if (unlikely(!age)) + return; + + age += (pfn - pgdat->node_start_pfn); + *age = val; +} +#else +static inline int kidled_get_page_age(pg_data_t *pgdat, unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + + return (page->flags >> KIDLED_AGE_PGSHIFT) & KIDLED_AGE_MASK; +} + +extern int kidled_inc_page_age(pg_data_t *pgdat, unsigned long pfn); +extern void kidled_set_page_age(pg_data_t *pgdat, unsigned long pfn, int val); +#endif /* KIDLED_AGE_NOT_IN_PAGE_FLAGS */ +#else /* !CONFIG_KIDLED */ +static inline int kidled_get_page_age(pg_data_t *pgdat, unsigned long pfn) +{ + return -EINVAL; +} + +static inline int kidled_inc_page_age(pg_data_t *pgdat, unsigned long pfn) +{ + return -EINVAL; +} + +static inline void kidled_set_page_age(pg_data_t 
*pgdat, + unsigned long pfn, int val) +{ +} +#endif /* CONFIG_KIDLED */ + static inline struct zone *page_zone(const struct page *page) { return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]; diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4e46ff268cb1b7c3b71489ceab9e9af43582bcca..00f681746bb3936a710ee5815f9d4224fbd7ccb3 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -653,6 +653,11 @@ typedef struct pglist_data { unsigned long node_present_pages; /* total number of physical pages */ unsigned long node_spanned_pages; /* total size of physical page range, including holes */ +#ifdef CONFIG_KIDLED + unsigned long node_idle_scan_pfn; + u8 *node_page_age; +#endif + int node_id; wait_queue_head_t kswapd_wait; wait_queue_head_t pfmemalloc_wait; diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h index 7ec86bf31ce48602b2fdbbdadca43a9436c5d62a..92766b1b04d829756fb0a897bb750f4f558db84e 100644 --- a/include/linux/page-flags-layout.h +++ b/include/linux/page-flags-layout.h @@ -82,6 +82,19 @@ #define LAST_CPUPID_WIDTH 0 #endif +#ifdef CONFIG_KIDLED +#define KIDLED_AGE_SHIFT 8 +#define KIDLED_AGE_MASK ((1UL << KIDLED_AGE_SHIFT)-1) +#else +#define KIDLED_AGE_SHIFT 0 +#endif + +#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT+KIDLED_AGE_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS +#define KIDLED_AGE_WIDTH KIDLED_AGE_SHIFT +#else +#define KIDLED_AGE_WIDTH 0 +#endif + /* * We are going to use the flags for the page to node mapping if its in * there. This includes the case where there is no node, so it is implicit. 
@@ -94,4 +107,8 @@ #define LAST_CPUPID_NOT_IN_PAGE_FLAGS #endif +#if defined(CONFIG_KIDLED) && KIDLED_AGE_WIDTH == 0 +#define KIDLED_AGE_NOT_IN_PAGE_FLAGS +#endif + #endif /* _LINUX_PAGE_FLAGS_LAYOUT */ diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h index 1e894d34bdceb2a91318bd1ea7796a17deeddf8b..f87648fd082599ac1d3589ed3cebf2fbf313daed 100644 --- a/include/linux/page_idle.h +++ b/include/linux/page_idle.h @@ -38,6 +38,9 @@ static inline void clear_page_idle(struct page *page) { ClearPageIdle(page); } + +void page_idle_clear_pte_refs(struct page *page); + #else /* !CONFIG_64BIT */ /* * If there is not enough space to store Idle and Young bits in page flags, use @@ -135,6 +138,10 @@ static inline void clear_page_idle(struct page *page) { } +static inline void page_idle_clear_pte_refs(struct page *page) +{ +} + #endif /* CONFIG_IDLE_PAGE_TRACKING */ #endif /* _LINUX_MM_PAGE_IDLE_H */ diff --git a/mm/Kconfig b/mm/Kconfig index b457e94ae6182a491df70157a607af812b6f1b37..453a8446e951a85331065504cbf1b9b157b2f576 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -764,4 +764,16 @@ config GUP_BENCHMARK config ARCH_HAS_PTE_SPECIAL bool +config KIDLED + bool "Enable kernel thread to scan idle pages" + depends on IDLE_PAGE_TRACKING + help + This introduces kernel thread (kidled) to scan pages in configurable + interval to determine if they are accessed in that interval, to + determine their access frequency. The hot/cold pages are identified + with it and the statistics are exported to user space on basis of + memory cgroup by "memory.idle_page_stats". + + See Documentation/vm/kidled.rst for more details. 
+
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 26ef77a3883b5c708659425229e975ac74069875..0ca4b8cd21f36fda1e34752c3a9fe4630d8d7592 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -105,3 +105,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_HMM) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_KIDLED) += kidled.o
diff --git a/mm/kidled.c b/mm/kidled.c
new file mode 100644
index 0000000000000000000000000000000000000000..db63de493ece7e212661bead0f62e7fe78694ffb
--- /dev/null
+++ b/mm/kidled.c
@@ -0,0 +1,691 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+/*
+ * Should the accounting be hierarchical? Hierarchical accounting only
+ * works when memcg is in hierarchy mode. It's OK for kidled to enable
+ * hierarchical accounting while memcg is in non-hierarchy mode; kidled
+ * then accounts to the memory cgroup the page is charged to. There is
+ * no dependency between these two settings.
+ */ +static bool use_hierarchy __read_mostly; + +struct kidled_scan_period kidled_scan_period; +const int kidled_default_buckets[NUM_KIDLED_BUCKETS] = { + 1, 2, 5, 15, 30, 60, 120, 240 }; +static DECLARE_WAIT_QUEUE_HEAD(kidled_wait); +static unsigned long kidled_scan_rounds __read_mostly; + +static inline int kidled_get_bucket(int *idle_buckets, int age) +{ + int bucket; + + if (age < idle_buckets[0]) + return -EINVAL; + + for (bucket = 1; bucket <= (NUM_KIDLED_BUCKETS - 1); bucket++) { + if (age < idle_buckets[bucket]) + return bucket - 1; + } + + return NUM_KIDLED_BUCKETS - 1; +} + +static inline int kidled_get_idle_type(struct page *page) +{ + int idle_type = KIDLE_BASE; + + if (PageDirty(page) || PageWriteback(page)) + idle_type |= KIDLE_DIRTY; + if (page_is_file_cache(page)) + idle_type |= KIDLE_FILE; + /* + * Couldn't call page_evictable() here, because we have not held + * the page lock, so use page flags instead. Different from + * PageMlocked(). + */ + if (PageUnevictable(page)) + idle_type |= KIDLE_UNEVICT; + if (PageActive(page)) + idle_type |= KIDLE_ACTIVE; + return idle_type; +} + +#ifndef KIDLED_AGE_NOT_IN_PAGE_FLAGS +int kidled_inc_page_age(pg_data_t *pgdat, unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + unsigned long old, new; + int age; + + do { + age = ((page->flags >> KIDLED_AGE_PGSHIFT) & KIDLED_AGE_MASK); + if (age >= KIDLED_AGE_MASK) + break; + + new = old = page->flags; + new &= ~(KIDLED_AGE_MASK << KIDLED_AGE_PGSHIFT); + new |= (((age + 1) & KIDLED_AGE_MASK) << KIDLED_AGE_PGSHIFT); + } while (unlikely(cmpxchg(&page->flags, old, new) != old)); + + return age; +} +EXPORT_SYMBOL_GPL(kidled_inc_page_age); + +void kidled_set_page_age(pg_data_t *pgdat, unsigned long pfn, int val) +{ + struct page *page = pfn_to_page(pfn); + unsigned long old, new; + + do { + new = old = page->flags; + new &= ~(KIDLED_AGE_MASK << KIDLED_AGE_PGSHIFT); + new |= ((val & KIDLED_AGE_MASK) << KIDLED_AGE_PGSHIFT); + } while 
(unlikely(cmpxchg(&page->flags, old, new) != old)); + +} +EXPORT_SYMBOL_GPL(kidled_set_page_age); +#endif /* !KIDLED_AGE_NOT_IN_PAGE_FLAGS */ + +#ifdef CONFIG_MEMCG +static inline void kidled_mem_cgroup_account(struct page *page, + int age, + int nr_pages) +{ + struct mem_cgroup *memcg; + struct idle_page_stats *stats; + int type, bucket; + + if (mem_cgroup_disabled()) + return; + + type = kidled_get_idle_type(page); + + memcg = lock_page_memcg(page); + if (unlikely(!memcg)) { + unlock_page_memcg(page); + return; + } + + stats = mem_cgroup_get_unstable_idle_stats(memcg); + bucket = kidled_get_bucket(stats->buckets, age); + if (bucket >= 0) + stats->count[type][bucket] += nr_pages; + + unlock_page_memcg(page); +} + +void kidled_mem_cgroup_move_stats(struct mem_cgroup *from, + struct mem_cgroup *to, + struct page *page, + unsigned int nr_pages) +{ + pg_data_t *pgdat = page_pgdat(page); + unsigned long pfn = page_to_pfn(page); + struct idle_page_stats *stats[4] = { NULL, }; + int type, bucket, age; + + if (mem_cgroup_disabled()) + return; + + type = kidled_get_idle_type(page); + stats[0] = mem_cgroup_get_stable_idle_stats(from); + stats[1] = mem_cgroup_get_unstable_idle_stats(from); + if (to) { + stats[2] = mem_cgroup_get_stable_idle_stats(to); + stats[3] = mem_cgroup_get_unstable_idle_stats(to); + } + + /* + * We assume the all page ages are same if this is a compound page. + * Also we uses node's cursor (@node_idle_scan_pfn) to check if current + * page should be removed from the source memory cgroup or charged + * to target memory cgroup, without introducing locking mechanism. + * This may lead to slightly inconsistent statistics, but it's fine + * as it will be reshuffled in next round of scanning. 
+ */ + age = kidled_get_page_age(pgdat, pfn); + if (age < 0) + return; + + bucket = kidled_get_bucket(stats[1]->buckets, age); + if (bucket < 0) + return; + + /* Remove from the source memory cgroup */ + if (stats[0]->count[type][bucket] > nr_pages) + stats[0]->count[type][bucket] -= nr_pages; + else + stats[0]->count[type][bucket] = 0; + if (pgdat->node_idle_scan_pfn >= pfn) { + if (stats[1]->count[type][bucket] > nr_pages) + stats[1]->count[type][bucket] -= nr_pages; + else + stats[1]->count[type][bucket] = 0; + } + + /* Charge to the target memory cgroup */ + if (!to) + return; + + bucket = kidled_get_bucket(stats[3]->buckets, age); + if (bucket < 0) + return; + + stats[2]->count[type][bucket] += nr_pages; + if (pgdat->node_idle_scan_pfn >= pfn) + stats[3]->count[type][bucket] += nr_pages; +} +EXPORT_SYMBOL_GPL(kidled_mem_cgroup_move_stats); + +static inline void kidled_mem_cgroup_scan_done(struct kidled_scan_period period) +{ + struct mem_cgroup *memcg; + struct idle_page_stats *stable_stats, *unstable_stats; + + for (memcg = mem_cgroup_iter(NULL, NULL, NULL); + memcg != NULL; + memcg = mem_cgroup_iter(NULL, memcg, NULL)) { + + down_write(&memcg->idle_stats_rwsem); + stable_stats = mem_cgroup_get_stable_idle_stats(memcg); + unstable_stats = mem_cgroup_get_unstable_idle_stats(memcg); + + /* + * Switch when scanning buckets is valid, or copy buckets + * from stable_stats's buckets which may have user's new + * buckets(maybe valid or not). 
+ */ + if (!KIDLED_IS_BUCKET_INVALID(unstable_stats->buckets)) { + mem_cgroup_idle_page_stats_switch(memcg); + memcg->idle_scans++; + } else { + memcpy(unstable_stats->buckets, stable_stats->buckets, + sizeof(unstable_stats->buckets)); + } + + memcg->scan_period = period; + up_write(&memcg->idle_stats_rwsem); + + unstable_stats = mem_cgroup_get_unstable_idle_stats(memcg); + memset(&unstable_stats->count, 0, + sizeof(unstable_stats->count)); + } +} + +static inline void kidled_mem_cgroup_reset(void) +{ + struct mem_cgroup *memcg; + struct idle_page_stats *stable_stats, *unstable_stats; + + for (memcg = mem_cgroup_iter(NULL, NULL, NULL); + memcg != NULL; + memcg = mem_cgroup_iter(NULL, memcg, NULL)) { + down_write(&memcg->idle_stats_rwsem); + stable_stats = mem_cgroup_get_stable_idle_stats(memcg); + unstable_stats = mem_cgroup_get_unstable_idle_stats(memcg); + memset(&stable_stats->count, 0, sizeof(stable_stats->count)); + + memcg->idle_scans = 0; + kidled_reset_scan_period(&memcg->scan_period); + up_write(&memcg->idle_stats_rwsem); + + memset(&unstable_stats->count, 0, + sizeof(unstable_stats->count)); + } +} +#else /* !CONFIG_MEMCG */ +static inline void kidled_mem_cgroup_account(struct page *page, + int age, + int nr_pages) +{ +} +static inline void kidled_mem_cgroup_scan_done(struct kidled_scan_period + scan_period) +{ +} +static inline void kidled_mem_cgroup_reset(void) +{ +} +#endif /* CONFIG_MEMCG */ + +/* + * An idle page with an older age is more likely idle, while a busy page is + * more likely busy, so we can reduce the sampling frequency to save cpu + * resource when meet these pages. And we will keep sampling each time when + * an idle page is young. 
See tables below: + * + * idle age | down ratio + * ----------+------------- + * [0, 1) | 1/2 # busy + * [1, 4) | 1 # young idle + * [4, 8) | 1/2 # idle + * [8, 16) | 1/4 # old idle + * [16, +inf)| 1/8 # older idle + */ +static inline bool kidled_need_check_idle(pg_data_t *pgdat, unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + int age = kidled_get_page_age(pgdat, pfn); + unsigned long pseudo_random; + + if (age < 0) + return false; + + /* + * kidled will check different pages at each round when need + * reduce sampling frequency, this depends on current pfn and + * global scanning rounds. There exist some special pfns, for + * one huge page, we can only check the head page, while tail + * pages would be checked in low levels and will be skipped. + * Shifting HPAGE_PMD_ORDER bits is to achieve good load balance + * for each round when system has many huge pages, 1GB is not + * considered here. + */ + if (PageTransHuge(page)) + pfn >>= compound_order(page); + + pseudo_random = pfn + kidled_scan_rounds; + if (age == 0) + return pseudo_random & 0x1UL; + else if (age < 4) + return true; + else if (age < 8) + return pseudo_random & 0x1UL; + else if (age < 16) + return (pseudo_random & 0x3UL) == 0x3UL; + else + return (pseudo_random & 0x7UL) == 0x7UL; +} + +static inline int kidled_scan_page(pg_data_t *pgdat, unsigned long pfn) +{ + struct page *page; + int age, nr_pages = 1, idx; + bool idle = false; + + if (!pfn_valid(pfn)) + goto out; + + page = pfn_to_page(pfn); + if (!page || !PageLRU(page)) { + kidled_set_page_age(pgdat, pfn, 0); + goto out; + } + + /* + * Try to skip clear PTE references which is an expensive call. + * PG_idle should be cleared when free a page and we have checked + * PG_lru flag above, so the race is acceptable to us. 
+ */ + if (page_is_idle(page)) { + if (kidled_need_check_idle(pgdat, pfn)) { + if (!get_page_unless_zero(page)) { + kidled_set_page_age(pgdat, pfn, 0); + goto out; + } + + /* + * Check again after get a reference count, while in + * page_idle_get_page() it gets zone_lru_lock at first, + * it seems useless. + * + * Also we can't hold LRU lock here as the consumed + * time to finish the scanning is fixed. Otherwise, + * the accumulated statistics will be cleared out + * and scan interval (@scan_period_in_seconds) will + * be doubled. However, this may incur race between + * kidled and page reclaim. The page reclaim may dry + * run due to dumped refcount, but it's acceptable. + */ + if (unlikely(!PageLRU(page))) { + put_page(page); + kidled_set_page_age(pgdat, pfn, 0); + goto out; + } + + page_idle_clear_pte_refs(page); + if (page_is_idle(page)) + idle = true; + put_page(page); + } else if (kidled_get_page_age(pgdat, pfn) > 0) { + idle = true; + } + } + + if (PageTransHuge(page)) + nr_pages = 1 << compound_order(page); + + if (idle) { + age = kidled_inc_page_age(pgdat, pfn); + if (age > 0) + kidled_mem_cgroup_account(page, age, nr_pages); + else + age = 0; + } else { + age = 0; + kidled_set_page_age(pgdat, pfn, 0); + if (get_page_unless_zero(page)) { + if (likely(PageLRU(page))) + set_page_idle(page); + put_page(page); + } + } + + for (idx = 1; idx < nr_pages; idx++) + kidled_set_page_age(pgdat, pfn + idx, age); + +out: + return nr_pages; +} + +static bool kidled_scan_node(pg_data_t *pgdat, + struct kidled_scan_period scan_period, + bool restart) +{ + unsigned long pfn, end, node_end; + +#ifdef KIDLED_AGE_NOT_IN_PAGE_FLAGS + if (unlikely(!pgdat->node_page_age)) { + pgdat->node_page_age = vzalloc(pgdat->node_spanned_pages); + if (unlikely(!pgdat->node_page_age)) + return false; + } +#endif /* KIDLED_AGE_NOT_IN_PAGE_FLAGS */ + + node_end = pgdat_end_pfn(pgdat); + pfn = pgdat->node_start_pfn; + if (!restart && pfn < pgdat->node_idle_scan_pfn) + pfn = 
pgdat->node_idle_scan_pfn; + end = min(pfn + DIV_ROUND_UP(pgdat->node_spanned_pages, + scan_period.duration), node_end); + while (pfn < end) { + /* Restart new scanning when user updates the period */ + if (unlikely(!kidled_is_scan_period_equal(&scan_period))) + break; + + cond_resched(); + pfn += kidled_scan_page(pgdat, pfn); + } + + pgdat->node_idle_scan_pfn = pfn; + return pfn >= node_end; +} + +static inline void kidled_scan_done(struct kidled_scan_period scan_period) +{ + kidled_mem_cgroup_scan_done(scan_period); + kidled_scan_rounds++; +} + +static inline void kidled_reset(bool free) +{ + pg_data_t *pgdat; + + kidled_mem_cgroup_reset(); + + get_online_mems(); + +#ifdef KIDLED_AGE_NOT_IN_PAGE_FLAGS + for_each_online_pgdat(pgdat) { + if (!pgdat->node_page_age) + continue; + + if (free) { + vfree(pgdat->node_page_age); + pgdat->node_page_age = NULL; + } else { + memset(pgdat->node_page_age, 0, + pgdat->node_spanned_pages); + } + + cond_resched(); + } +#else + for_each_online_pgdat(pgdat) { + unsigned long pfn, end_pfn = pgdat->node_start_pfn + + pgdat->node_spanned_pages; + + for (pfn = pgdat->node_start_pfn; pfn < end_pfn; pfn++) { + if (!pfn_valid(pfn)) + continue; + + kidled_set_page_age(pgdat, pfn, 0); + + if (pfn % HPAGE_PMD_NR == 0) + cond_resched(); + } + } +#endif /* KIDLED_AGE_NOT_IN_PAGE_FLAGS */ + + put_online_mems(); +} + +static inline bool kidled_should_run(struct kidled_scan_period *p, bool *new) +{ + if (unlikely(!kidled_is_scan_period_equal(p))) { + struct kidled_scan_period scan_period; + + scan_period = kidled_get_current_scan_period(); + if (p->duration) + kidled_reset(!scan_period.duration); + *p = scan_period; + *new = true; + } else { + *new = false; + } + + if (p->duration > 0) + return true; + + return false; +} + +static int kidled(void *dummy) +{ + int busy_loop = 0; + bool restart = true; + struct kidled_scan_period scan_period; + + kidled_reset_scan_period(&scan_period); + + while (!kthread_should_stop()) { + pg_data_t *pgdat; + u64 
start_jiffies, elapsed;
+		bool new, scan_done = true;
+
+		wait_event_interruptible(kidled_wait,
+			kidled_should_run(&scan_period, &new));
+		if (unlikely(new)) {
+			restart = true;
+			busy_loop = 0;
+		}
+
+		if (unlikely(scan_period.duration == 0))
+			continue;
+
+		start_jiffies = jiffies_64;
+		get_online_mems();
+		for_each_online_pgdat(pgdat) {
+			scan_done &= kidled_scan_node(pgdat,
+						      scan_period,
+						      restart);
+		}
+		put_online_mems();
+
+		if (scan_done) {
+			kidled_scan_done(scan_period);
+			restart = true;
+		} else {
+			restart = false;
+		}
+
+		/*
+		 * kidled is expected to scan the pages determined by
+		 * scan_period in slices, and to finish each slice in
+		 * about one second:
+		 *
+		 *   pages_to_scan = total_pages / scan_duration
+		 *   for_each_slice() {
+		 *       start_jiffies = jiffies_64;
+		 *       scan_pages(pages_to_scan);
+		 *       elapsed = jiffies_64 - start_jiffies;
+		 *       sleep(HZ - elapsed);
+		 *   }
+		 *
+		 * We consider a slice busy when elapsed >= (HZ / 2); if
+		 * it stays busy for several consecutive slices, we scale
+		 * up the scan duration.
+		 *
+		 * NOTE it's a simple guard, not a promise.
+		 */
+#define KIDLED_BUSY_RUNNING		(HZ / 2)
+#define KIDLED_BUSY_LOOP_THRESHOLD	10
+		elapsed = jiffies_64 - start_jiffies;
+		if (elapsed < KIDLED_BUSY_RUNNING) {
+			busy_loop = 0;
+			schedule_timeout_interruptible(HZ - elapsed);
+		} else if (++busy_loop == KIDLED_BUSY_LOOP_THRESHOLD) {
+			busy_loop = 0;
+			if (kidled_try_double_scan_period(scan_period)) {
+				pr_warn_ratelimited("%s: period -> %u\n",
+						__func__,
+						kidled_get_current_scan_duration());
+			}
+
+			/* Sleep for a while to relax the CPU */
+			schedule_timeout_interruptible(elapsed);
+		}
+	}
+
+	return 0;
+}
+
+bool kidled_use_hierarchy(void)
+{
+	return use_hierarchy;
+}
+
+static ssize_t kidled_scan_period_show(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       char *buf)
+{
+	return sprintf(buf, "%u\n", kidled_get_current_scan_duration());
+}
+
+/*
+ * Update the real scan period and do the reset asynchronously, to
+ * avoid stalling when kidled is busy waiting on other resources.
+ */
+static ssize_t kidled_scan_period_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *buf, size_t count)
+{
+	unsigned long secs;
+	int ret;
+
+	ret = kstrtoul(buf, 10, &secs);
+	if (ret || secs > KIDLED_MAX_SCAN_DURATION)
+		return -EINVAL;
+
+	kidled_set_scan_duration(secs);
+	wake_up_interruptible(&kidled_wait);
+	return count;
+}
+
+static ssize_t kidled_use_hierarchy_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sprintf(buf, "%u\n", use_hierarchy);
+}
+
+static ssize_t kidled_use_hierarchy_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned long val;
+	int ret;
+
+	ret = kstrtoul(buf, 10, &val);
+	if (ret || val > 1)
+		return -EINVAL;
+
+	WRITE_ONCE(use_hierarchy, val);
+
+	/*
+	 * Always start a new period when the user sets use_hierarchy.
+	 * kidled_inc_scan_seq() uses atomic_cmpxchg(), which implies a
+	 * memory barrier. This makes sure readers get the new
+	 * statistics after the store returns.
But there still exists + * a rare race when storing: + * + * writer | readers + * | + * update_use_hierarchy | + * ..... | read_statistics <-- race + * increase_scan_sequence | + * + * readers may get new use_hierarchy value and old statistics, + * ignore this.. + */ + kidled_inc_scan_seq(); + return count; +} + +static struct kobj_attribute kidled_scan_period_attr = + __ATTR(scan_period_in_seconds, 0644, + kidled_scan_period_show, kidled_scan_period_store); +static struct kobj_attribute kidled_use_hierarchy_attr = + __ATTR(use_hierarchy, 0644, + kidled_use_hierarchy_show, kidled_use_hierarchy_store); + +static struct attribute *kidled_attrs[] = { + &kidled_scan_period_attr.attr, + &kidled_use_hierarchy_attr.attr, + NULL +}; +static struct attribute_group kidled_attr_group = { + .name = "kidled", + .attrs = kidled_attrs, +}; + +static int __init kidled_init(void) +{ + struct task_struct *thread; + struct sched_param param = { .sched_priority = 0 }; + int ret; + + ret = sysfs_create_group(mm_kobj, &kidled_attr_group); + if (ret) { + pr_warn("%s: Error %d on creating sysfs files\n", + __func__, ret); + return ret; + } + + thread = kthread_run(kidled, NULL, "kidled"); + if (IS_ERR(thread)) { + sysfs_remove_group(mm_kobj, &kidled_attr_group); + pr_warn("%s: Failed to start kthread\n", __func__); + return PTR_ERR(thread); + } + + /* Make kidled as nice as possible. 
*/ + sched_setscheduler(thread, SCHED_IDLE, ¶m); + + return 0; +} + +module_init(kidled_init); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8a728e07018f9bcb30f97d7c62fa0b7b3c300356..31abc4dc1c54946a2e77641c2d88c108442ef338 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3553,6 +3553,246 @@ static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf, return nbytes; } +#ifdef CONFIG_KIDLED +static int mem_cgroup_idle_page_stats_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *iter, *memcg = mem_cgroup_from_css(seq_css(m)); + struct kidled_scan_period scan_period, period; + struct idle_page_stats stats, cache; + unsigned long scans; + bool has_hierarchy = kidled_use_hierarchy(); + bool no_buckets = false; + int i, j, t; + + down_read(&memcg->idle_stats_rwsem); + stats = memcg->idle_stats[memcg->idle_stable_idx]; + scans = memcg->idle_scans; + scan_period = memcg->scan_period; + up_read(&memcg->idle_stats_rwsem); + + /* Nothing will be outputed with invalid buckets */ + if (KIDLED_IS_BUCKET_INVALID(stats.buckets)) { + no_buckets = true; + scans = 0; + goto output; + } + + /* Zeroes will be output with mismatched scan period */ + if (!kidled_is_scan_period_equal(&scan_period)) { + memset(&stats.count, 0, sizeof(stats.count)); + scan_period = kidled_get_current_scan_period(); + scans = 0; + goto output; + } + + if (mem_cgroup_is_root(memcg) || has_hierarchy) { + for_each_mem_cgroup_tree(iter, memcg) { + /* The root memcg was just accounted */ + if (iter == memcg) + continue; + + down_read(&iter->idle_stats_rwsem); + cache = iter->idle_stats[iter->idle_stable_idx]; + period = memcg->scan_period; + up_read(&iter->idle_stats_rwsem); + + /* + * Skip to account if the scan period is mismatched + * or buckets are invalid. + */ + if (!kidled_is_scan_period_equal(&period) || + KIDLED_IS_BUCKET_INVALID(cache.buckets)) + continue; + + /* + * The buckets of current memory cgroup might be + * mismatched with that of root memory cgroup. 
We + * charge the current statistics to the possibly + * largest bucket. The users need to apply the + * consistent buckets into the memory cgroups in + * the hierarchy tree. + */ + for (i = 0; i < NUM_KIDLED_BUCKETS; i++) { + for (j = 0; j < NUM_KIDLED_BUCKETS - 1; j++) { + if (cache.buckets[i] <= + stats.buckets[j]) + break; + } + + for (t = 0; t < KIDLE_NR_TYPE; t++) + stats.count[t][j] += cache.count[t][i]; + } + } + } + + +output: + seq_printf(m, "# version: %s\n", KIDLED_VERSION); + seq_printf(m, "# scans: %lu\n", scans); + seq_printf(m, "# scan_period_in_seconds: %u\n", scan_period.duration); + seq_printf(m, "# use_hierarchy: %u\n", kidled_use_hierarchy()); + seq_puts(m, "# buckets: "); + if (no_buckets) { + seq_puts(m, "no valid bucket available\n"); + return 0; + } + + for (i = 0; i < NUM_KIDLED_BUCKETS; i++) { + seq_printf(m, "%d", stats.buckets[i]); + + if ((i == NUM_KIDLED_BUCKETS - 1) || + !stats.buckets[i + 1]) { + seq_puts(m, "\n"); + j = i + 1; + break; + } + seq_puts(m, ","); + } + seq_puts(m, "#\n"); + + seq_puts(m, "# _-----=> clean/dirty\n"); + seq_puts(m, "# / _----=> swap/file\n"); + seq_puts(m, "# | / _---=> evict/unevict\n"); + seq_puts(m, "# || / _--=> inactive/active\n"); + seq_puts(m, "# ||| /\n"); + + seq_printf(m, "# %-8s", "||||"); + for (i = 0; i < j; i++) { + char region[20]; + + if (i == j - 1) { + snprintf(region, sizeof(region), "[%d,+inf)", + stats.buckets[i]); + } else { + snprintf(region, sizeof(region), "[%d,%d)", + stats.buckets[i], + stats.buckets[i + 1]); + } + + seq_printf(m, " %14s", region); + } + seq_puts(m, "\n"); + + for (t = 0; t < KIDLE_NR_TYPE; t++) { + char kidled_type_str[5]; + + kidled_type_str[0] = t & KIDLE_DIRTY ? 'd' : 'c'; + kidled_type_str[1] = t & KIDLE_FILE ? 'f' : 's'; + kidled_type_str[2] = t & KIDLE_UNEVICT ? 'u' : 'e'; + kidled_type_str[3] = t & KIDLE_ACTIVE ? 
'a' : 'i';
+		kidled_type_str[4] = '\0';
+		seq_printf(m, " %-8s", kidled_type_str);
+
+		for (i = 0; i < j; i++) {
+			seq_printf(m, " %14lu",
+				   stats.count[t][i] << PAGE_SHIFT);
+		}
+
+		seq_puts(m, "\n");
+	}
+
+	return 0;
+}
+
+static ssize_t mem_cgroup_idle_page_stats_write(struct kernfs_open_file *of,
+						char *buf, size_t nbytes,
+						loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	struct idle_page_stats *stable_stats, *unstable_stats;
+	int buckets[NUM_KIDLED_BUCKETS] = { 0 }, i = 0, err;
+	unsigned long prev = 0, curr;
+	char *next;
+
+	buf = strstrip(buf);
+	while (*buf) {
+		if (i >= NUM_KIDLED_BUCKETS)
+			return -E2BIG;
+
+		/* Get the next entry */
+		next = buf + 1;
+		while (*next && *next >= '0' && *next <= '9')
+			next++;
+		while (*next && (*next == ' ' || *next == ','))
+			*next++ = '\0';
+
+		/* Entries should be monotonically increasing */
+		err = kstrtoul(buf, 10, &curr);
+		if (err || curr > KIDLED_MAX_IDLE_AGE || curr <= prev)
+			return -EINVAL;
+
+		buckets[i++] = curr;
+		prev = curr;
+		buf = next;
+	}
+
+	/* No buckets set, mark it invalid */
+	if (i == 0)
+		KIDLED_MARK_BUCKET_INVALID(buckets);
+	if (down_write_killable(&memcg->idle_stats_rwsem))
+		return -EINTR;
+	stable_stats = mem_cgroup_get_stable_idle_stats(memcg);
+	unstable_stats = mem_cgroup_get_unstable_idle_stats(memcg);
+	memcpy(stable_stats->buckets, buckets, sizeof(buckets));
+
+	/*
+	 * Clear the stats without checking whether the buckets have
+	 * changed; this works when the user only wants to reset the
+	 * stats but not the buckets.
+	 */
+	memset(stable_stats->count, 0, sizeof(stable_stats->count));
+
+	/*
+	 * It's safe for kidled to read the unstable buckets without
+	 * holding any read-side locks.
+ */ + KIDLED_MARK_BUCKET_INVALID(unstable_stats->buckets); + memcg->idle_scans = 0; + up_write(&memcg->idle_stats_rwsem); + + return nbytes; +} + +static void kidled_memcg_init(struct mem_cgroup *memcg) +{ + int type; + + init_rwsem(&memcg->idle_stats_rwsem); + for (type = 0; type < KIDLED_STATS_NR_TYPE; type++) { + memcpy(memcg->idle_stats[type].buckets, + kidled_default_buckets, + sizeof(kidled_default_buckets)); + } +} + +static void kidled_memcg_inherit_parent_buckets(struct mem_cgroup *parent, + struct mem_cgroup *memcg) +{ + int idle_buckets[NUM_KIDLED_BUCKETS], type; + + down_read(&parent->idle_stats_rwsem); + memcpy(idle_buckets, + parent->idle_stats[parent->idle_stable_idx].buckets, + sizeof(idle_buckets)); + up_read(&parent->idle_stats_rwsem); + + for (type = 0; type < KIDLED_STATS_NR_TYPE; type++) { + memcpy(memcg->idle_stats[type].buckets, + idle_buckets, + sizeof(idle_buckets)); + } +} +#else +static void kidled_memcg_init(struct mem_cgroup *memcg) +{ +} + +static void kidled_memcg_inherit_parent_buckets(struct mem_cgroup *parent, + struct mem_cgroup *memcg) +{ +} +#endif /* CONFIG_KIDLED */ + static u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css, struct cftype *cft) { @@ -4661,6 +4901,13 @@ static struct cftype mem_cgroup_legacy_files[] = { .write = mem_cgroup_reset, .read_u64 = mem_cgroup_read_u64, }, +#ifdef CONFIG_KIDLED + { + .name = "idle_page_stats", + .seq_show = mem_cgroup_idle_page_stats_show, + .write = mem_cgroup_idle_page_stats_write, + }, +#endif { }, /* terminate */ }; @@ -4843,6 +5090,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void) #ifdef CONFIG_CGROUP_WRITEBACK INIT_LIST_HEAD(&memcg->cgwb_list); #endif + kidled_memcg_init(memcg); idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); return memcg; fail: @@ -4871,6 +5119,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) /* Default gap is 0.5% max limit */ memcg->wmark_scale_factor = parent->wmark_scale_factor ? 
: 50; + kidled_memcg_inherit_parent_buckets(parent, memcg); } if (parent && parent->use_hierarchy) { memcg->use_hierarchy = true; @@ -5235,6 +5484,8 @@ static int mem_cgroup_move_account(struct page *page, ret = 0; + kidled_mem_cgroup_move_stats(from, to, page, nr_pages); + local_irq_disable(); mem_cgroup_charge_statistics(to, page, compound, nr_pages); memcg_check_events(to, page); @@ -6143,6 +6394,13 @@ static struct cftype memory_files[] = { .seq_show = memory_oom_group_show, .write = memory_oom_group_write, }, +#ifdef CONFIG_KIDLED + { + .name = "idle_page_stats", + .seq_show = mem_cgroup_idle_page_stats_show, + .write = mem_cgroup_idle_page_stats_write, + }, +#endif { } /* terminate */ }; diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 190aed2a906fcb11e27a3cb6e31ba50c62d13a46..883c80e7a339c2887c03dbd3c065f91e270aeea0 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -775,6 +775,12 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon pgdat->node_start_pfn = start_pfn; pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn; +#ifdef KIDLED_AGE_NOT_IN_PAGE_FLAGS + if (pgdat->node_page_age) { + vfree(pgdat->node_page_age); + pgdat->node_page_age = NULL; + } +#endif } void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, @@ -1880,6 +1886,13 @@ void try_offline_node(int nid) if (check_and_unmap_cpu_on_node(pgdat)) return; +#ifdef KIDLED_AGE_NOT_IN_PAGE_FLAGS + if (pgdat->node_page_age) { + vfree(pgdat->node_page_age); + pgdat->node_page_age = NULL; + } +#endif + /* * all memory/cpu of this node are removed, we can offline this * node now. 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e6a03c667eacdcc23dbb58cf4621de86431f14e..750c4d6d59cafa5aafa32ce07c8d31312760aa1a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1034,6 +1034,17 @@ static __always_inline bool free_pages_prepare(struct page *page,
 				bad++;
 				continue;
 			}
+
+			/*
+			 * The page age information is stored in the page
+			 * flags or in the node's page age array. We need
+			 * to clear it explicitly in both cases; otherwise,
+			 * a stale age would be reported when the page is
+			 * allocated again. Also, age information is
+			 * maintained for each page of a compound page, so
+			 * we have to clear them one by one.
+			 */
+			kidled_set_page_age(page_pgdat(page + i),
+					    page_to_pfn(page + i), 0);
 			(page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 		}
 	}
@@ -1047,6 +1058,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 		return false;
 
 	page_cpupid_reset_last(page);
+	kidled_set_page_age(page_pgdat(page), page_to_pfn(page), 0);
 	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 
 	reset_page_owner(page, order);
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 52ed59bbc275950715e411e92fce05984fb3857f..e21293799c4f9bc1c799b1ae1b71dd7d98fdad57 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -92,7 +92,7 @@ static bool page_idle_clear_pte_refs_one(struct page *page,
 	return true;
 }
 
-static void page_idle_clear_pte_refs(struct page *page)
+void page_idle_clear_pte_refs(struct page *page)
 {
 	/*
 	 * Since rwc.arg is unused, rwc is effectively immutable, so we