1. 18 8月, 2018 8 次提交
    • K
      mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance · fae91d6d
      Kirill Tkhai 提交于
      Introduce set_shrinker_bit() function to set shrinker-related bit in
      memcg shrinker bitmap, and set the bit after the first item is added and
      in case of reparenting destroyed memcg's items.
      
      This will allow next patch to make shrinkers be called only, in case of
      they have charged objects at the moment, and to improve shrink_slab()
      performance.
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112557572.4097.17315791419810749985.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063065671.1818.15914674956134687268.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fae91d6d
    • K
      mm/list_lru.c: pass lru argument to memcg_drain_list_lru_node() · 3b82c4dc
      Kirill Tkhai 提交于
      This is just refactoring to allow next patches to have lru pointer in
      memcg_drain_list_lru_node().
      
      Link: http://lkml.kernel.org/r/153063063164.1818.55009531386089350.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b82c4dc
    • K
      mm/list_lru: pass dst_memcg argument to memcg_drain_list_lru_node() · 9bec5c35
      Kirill Tkhai 提交于
      This is just refactoring to allow the next patches to have dst_memcg
      pointer in memcg_drain_list_lru_node().
      
      Link: http://lkml.kernel.org/r/153063062118.1818.2761273817739499749.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9bec5c35
    • K
      mm/list_lru.c: add memcg argument to list_lru_from_kmem() · 44bd4a47
      Kirill Tkhai 提交于
      This is just refactoring to allow the next patches to have memcg pointer
      in list_lru_from_kmem().
      
      Link: http://lkml.kernel.org/r/153063060664.1818.9541345386733498582.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      44bd4a47
    • K
      fs: propagate shrinker::id to list_lru · c92e8e10
      Kirill Tkhai 提交于
      Add list_lru::shrinker_id field and populate it by registered shrinker
      id.
      
      This will be used to set correct bit in memcg shrinkers map by lru code
      in next patches, after there appeared the first related to memcg element
      in list_lru.
      
      Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c92e8e10
    • K
      mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CONFIG_SLOB · 84c07d11
      Kirill Tkhai 提交于
      Introduce new config option, which is used to replace repeating
      CONFIG_MEMCG && !CONFIG_SLOB pattern.  Next patches add a little more
      memcg+kmem related code, so let's keep the defines more clearly.
      
      Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84c07d11
    • K
      mm/list_lru.c: combine code under the same define · e0295238
      Kirill Tkhai 提交于
      Patch series "Improve shrink_slab() scalability (old complexity was O(n^2), new is O(n))", v8.
      
      This patcheset solves the problem with slow shrink_slab() occuring on
      the machines having many shrinkers and memory cgroups (i.e., with many
      containers).  The problem is complexity of shrink_slab() is O(n^2) and
      it grows too fast with the growth of containers numbers.
      
      Let us have 200 containers, and every container has 10 mounts and 10
      cgroups.  All container tasks are isolated, and they don't touch foreign
      containers mounts.
      
      In case of global reclaim, a task has to iterate all over the memcgs and
      to call all the memcg-aware shrinkers for all of them.  This means, the
      task has to visit 200 * 10 = 2000 shrinkers for every memcg, and since
      there are 2000 memcgs, the total calls of do_shrink_slab() are 2000 *
      2000 = 4000000.
      
      4 million calls are not a number operations, which can takes 1 cpu
      cycle.  E.g., super_cache_count() accesses at least two lists, and makes
      arifmetical calculations.  Even, if there are no charged objects, we do
      these calculations, and replaces cpu caches by read memory.  I observed
      nodes spending almost 100% time in kernel, in case of intensive writing
      and global reclaim.  The writer consumes pages fast, but it's need to
      shrink_slab() before the reclaimer reached shrink pages function (and
      frees SWAP_CLUSTER_MAX pages).  Even if there is no writing, the
      iterations just waste the time, and slows reclaim down.
      
      Let's see the small test below:
      
        $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
        $mkdir /sys/fs/cgroup/memory/ct
        $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
        $for i in `seq 0 4000`;
                do mkdir /sys/fs/cgroup/memory/ct/$i;
                echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
                mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file;
        done
      
      Then, let's see drop caches time (5 sequential calls):
      
        $time echo 3 > /proc/sys/vm/drop_caches
      
        0.00user 13.78system 0:13.78elapsed 99%CPU
        0.00user 5.59system 0:05.60elapsed 99%CPU
        0.00user 5.48system 0:05.48elapsed 99%CPU
        0.00user 8.35system 0:08.35elapsed 99%CPU
        0.00user 8.34system 0:08.35elapsed 99%CPU
      
      The last four calls don't actually shrink anything.  So, the iterations
      over slab shrinkers take 5.48 seconds.  Not so good for scalability.
      
      The patchset solves the problem by making shrink_slab() of O(n)
      complexity.  There are following functional actions:
      
      1) Assign id to every registered memcg-aware shrinker.
      
      2) Maintain per-memcgroup bitmap of memcg-aware shrinkers, and set a
         shrinker-related bit after the first element is added to lru list
         (also, when removed child memcg elements are reparanted).
      
      3) Split memcg-aware shrinkers and !memcg-aware shrinkers, and call a
         shrinker if its bit is set in memcg's shrinker bitmap.  (Also, there is
         a functionality to clear the bit, after last element is shrinked).
      
      This gives significant performance increase.  The result after patchset
      is applied:
      
        $time echo 3 > /proc/sys/vm/drop_caches
      
        0.00user 1.10system 0:01.10elapsed 99%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
      
      The results show the performance increases at least in 548 times.
      
      So, the patchset makes shrink_slab() of less complexity and improves the
      performance in such types of load I pointed.  This will give a profit in
      case of !global reclaim case, since there also will be less
      do_shrink_slab() calls.
      
      This patch (of 17):
      
      These two pairs of blocks of code are under the same #ifdef #else
      #endif.
      
      Link: http://lkml.kernel.org/r/153063052519.1818.9393587113056959488.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e0295238
    • A
      mm/list_lru.c: fold __list_lru_count_one() into its caller · 930eaac5
      Andrew Morton 提交于
      __list_lru_count_one() has a single callsite.
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      930eaac5
  2. 06 4月, 2018 1 次提交
    • K
      mm: make counting of list_lru_one::nr_items lockless · 0c7c1bed
      Kirill Tkhai 提交于
      During the reclaiming slab of a memcg, shrink_slab iterates over all
      registered shrinkers in the system, and tries to count and consume
      objects related to the cgroup.  In case of memory pressure, this behaves
      bad: I observe high system time and time spent in list_lru_count_one()
      for many processes on RHEL7 kernel.
      
      This patch makes list_lru_node::memcg_lrus rcu protected, that allows to
      skip taking spinlock in list_lru_count_one().
      
      Shakeel Butt with the patch observes significant perf graph change.  He
      says:
      
      ========================================================================
      Setup: running a fork-bomb in a memcg of 200MiB on a 8GiB and 4 vcpu
      VM and recording the trace with 'perf record -g -a'.
      
      The trace without the patch:
      
      +  34.19%     fb.sh  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
      +  30.77%     fb.sh  [kernel.kallsyms]  [k] _raw_spin_lock
      +   3.53%     fb.sh  [kernel.kallsyms]  [k] list_lru_count_one
      +   2.26%     fb.sh  [kernel.kallsyms]  [k] super_cache_count
      +   1.68%     fb.sh  [kernel.kallsyms]  [k] shrink_slab
      +   0.59%     fb.sh  [kernel.kallsyms]  [k] down_read_trylock
      +   0.48%     fb.sh  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
      +   0.38%     fb.sh  [kernel.kallsyms]  [k] shrink_node_memcg
      +   0.32%     fb.sh  [kernel.kallsyms]  [k] queue_work_on
      +   0.26%     fb.sh  [kernel.kallsyms]  [k] count_shadow_nodes
      
      With the patch:
      
      +   0.16%     swapper  [kernel.kallsyms]    [k] default_idle
      +   0.13%     oom_reaper  [kernel.kallsyms]    [k] mutex_spin_on_owner
      +   0.05%     perf  [kernel.kallsyms]    [k] copy_user_generic_string
      +   0.05%     init.real  [kernel.kallsyms]    [k] wait_consider_task
      +   0.05%     kworker/0:0  [kernel.kallsyms]    [k] finish_task_switch
      +   0.04%     kworker/2:1  [kernel.kallsyms]    [k] finish_task_switch
      +   0.04%     kworker/3:1  [kernel.kallsyms]    [k] finish_task_switch
      +   0.04%     kworker/1:0  [kernel.kallsyms]    [k] finish_task_switch
      +   0.03%     binary  [kernel.kallsyms]    [k] copy_page
      ========================================================================
      
      Thanks Shakeel for the testing.
      
      [ktkhai@virtuozzo.com: v2]
        Link: http://lkml.kernel.org/r/151203869520.3915.2587549826865799173.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/150583358557.26700.8490036563698102569.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0c7c1bed
  3. 16 11月, 2017 1 次提交
  4. 04 10月, 2017 1 次提交
    • J
      mm: memcontrol: use vmalloc fallback for large kmem memcg arrays · f80c7dab
      Johannes Weiner 提交于
      For quick per-memcg indexing, slab caches and list_lru structures
      maintain linear arrays of descriptors.  As the number of concurrent
      memory cgroups in the system goes up, this requires large contiguous
      allocations (8k cgroups = order-5, 16k cgroups = order-6 etc.) for every
      existing slab cache and list_lru, which can easily fail on loaded
      systems.  E.g.:
      
        mkdir: page allocation failure: order:5, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
        CPU: 1 PID: 6399 Comm: mkdir Not tainted 4.13.0-mm1-00065-g720bbe532b7c-dirty #481
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
        Call Trace:
         ? __alloc_pages_direct_compact+0x4c/0x110
         __alloc_pages_nodemask+0xf50/0x1430
         alloc_pages_current+0x60/0xc0
         kmalloc_order_trace+0x29/0x1b0
         __kmalloc+0x1f4/0x320
         memcg_update_all_list_lrus+0xca/0x2e0
         mem_cgroup_css_alloc+0x612/0x670
         cgroup_apply_control_enable+0x19e/0x360
         cgroup_mkdir+0x322/0x490
         kernfs_iop_mkdir+0x55/0x80
         vfs_mkdir+0xd0/0x120
         SyS_mkdirat+0x6c/0xe0
         SyS_mkdir+0x14/0x20
         entry_SYSCALL_64_fastpath+0x18/0xad
        Mem-Info:
        active_anon:2965 inactive_anon:19 isolated_anon:0
         active_file:100270 inactive_file:98846 isolated_file:0
         unevictable:0 dirty:0 writeback:0 unstable:0
         slab_reclaimable:7328 slab_unreclaimable:16402
         mapped:771 shmem:52 pagetables:278 bounce:0
         free:13718 free_pcp:0 free_cma:0
      
      This output is from an artificial reproducer, but we have repeatedly
      observed order-7 failures in production in the Facebook fleet.  These
      systems become useless as they cannot run more jobs, even though there
      is plenty of memory to allocate 128 individual pages.
      
      Use kvmalloc and kvzalloc to fall back to vmalloc space if these arrays
      prove too large for allocating them physically contiguous.
      
      Link: http://lkml.kernel.org/r/20170918184919.20644-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f80c7dab
  5. 11 7月, 2017 1 次提交
  6. 28 10月, 2016 1 次提交
  7. 21 1月, 2016 1 次提交
  8. 06 11月, 2015 2 次提交
  9. 09 9月, 2015 1 次提交
  10. 13 2月, 2015 5 次提交
    • V
      memcg: reparent list_lrus and free kmemcg_id on css offline · 2788cf0c
      Vladimir Davydov 提交于
      Now, the only reason to keep kmemcg_id till css free is list_lru, which
      uses it to distribute elements between per-memcg lists.  However, it can
      be easily sorted out - we only need to change kmemcg_id of an offline
      cgroup to its parent's id, making further list_lru_add()'s add elements to
      the parent's list, and then move all elements from the offline cgroup's
      list to the one of its parent.  It will work, because a racing
      list_lru_del() does not need to know the list it is deleting the element
      from.  It can decrement the wrong nr_items counter though, but the ongoing
      reparenting will fix it.  After list_lru reparenting is done we are free
      to release kmemcg_id saving a valuable slot in a per-memcg array for new
      cgroups.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2788cf0c
    • V
      list_lru: add helpers to isolate items · 3f97b163
      Vladimir Davydov 提交于
      Currently, the isolate callback passed to the list_lru_walk family of
      functions is supposed to just delete an item from the list upon returning
      LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by
      __list_lru_walk_one after the callback returns.  Since the callback is
      allowed to drop the lock after removing an item (it has to return
      LRU_REMOVED_RETRY then), the nr_items can be less than the actual number
      of elements on the list even if we check them under the lock.  This makes
      it difficult to move items from one list_lru_one to another, which is
      required for per-memcg list_lru reparenting - we can't just splice the
      lists, we have to move entries one by one.
      
      This patch therefore introduces helpers that must be used by callback
      functions to isolate items instead of raw list_del/list_move.  These are
      list_lru_isolate and list_lru_isolate_move.  They not only remove the
      entry from the list, but also fix the nr_items counter, making sure
      nr_items always reflects the actual number of elements on the list if
      checked under the appropriate lock.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f97b163
    • V
      list_lru: introduce per-memcg lists · 60d3fd32
      Vladimir Davydov 提交于
      There are several FS shrinkers, including super_block::s_shrink, that
      keep reclaimable objects in the list_lru structure.  Hence to turn them
      to memcg-aware shrinkers, it is enough to make list_lru per-memcg.
      
      This patch does the trick.  It adds an array of lru lists to the
      list_lru_node structure (per-node part of the list_lru), one for each
      kmem-active memcg, and dispatches every item addition or removal to the
      list corresponding to the memcg which the item is accounted to.  So now
      the list_lru structure is not just per node, but per node and per memcg.
      
      Not all list_lrus need this feature, so this patch also adds a new
      method, list_lru_init_memcg, which initializes a list_lru as memcg
      aware.  Otherwise (i.e.  if initialized with old list_lru_init), the
      list_lru won't have per memcg lists.
      
      Just like per memcg caches arrays, the arrays of per-memcg lists are
      indexed by memcg_cache_id, so we must grow them whenever
      memcg_nr_cache_ids is increased.  So we introduce a callback,
      memcg_update_all_list_lrus, invoked by memcg_alloc_cache_id if the id
      space is full.
      
      The locking is implemented in a manner similar to lruvecs, i.e.  we have
      one lock per node that protects all lists (both global and per cgroup) on
      the node.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60d3fd32
    • V
      list_lru: organize all list_lrus to list · c0a5b560
      Vladimir Davydov 提交于
      To make list_lru memcg aware, we need all list_lrus to be kept on a list
      protected by a mutex, so that we could sleep while walking over the
      list.
      
      Therefore after this change list_lru_destroy may sleep.  Fortunately,
      there is only one user that calls it from an atomic context - it's
      put_super - and we can easily fix it by calling list_lru_destroy before
      put_super in destroy_locked_super - anyway we don't longer need lrus by
      that time.
      
      Another point that should be noted is that list_lru_destroy is allowed
      to be called on an uninitialized zeroed-out object, in which case it is
      a no-op.  Before this patch this was guaranteed by kfree, but now we
      need an explicit check there.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0a5b560
    • V
      list_lru: get rid of ->active_nodes · ff0b67ef
      Vladimir Davydov 提交于
      The active_nodes mask allows us to skip empty nodes when walking over
      list_lru items from all nodes in list_lru_count/walk.  However, these
      functions are never called from hot paths, so it doesn't seem we need
      such kind of optimization there.  OTOH, removing the mask will make it
      easier to make list_lru per-memcg.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff0b67ef
  11. 04 4月, 2014 1 次提交
    • J
      mm: keep page cache radix tree nodes in check · 449dd698
      Johannes Weiner 提交于
      Previously, page cache radix tree nodes were freed after reclaim emptied
      out their page pointers.  But now reclaim stores shadow entries in their
      place, which are only reclaimed when the inodes themselves are
      reclaimed.  This is problematic for bigger files that are still in use
      after they have a significant amount of their cache reclaimed, without
      any of those pages actually refaulting.  The shadow entries will just
      sit there and waste memory.  In the worst case, the shadow entries will
      accumulate until the machine runs out of memory.
      
      To get this under control, the VM will track radix tree nodes
      exclusively containing shadow entries on a per-NUMA node list.  Per-NUMA
      rather than global because we expect the radix tree nodes themselves to
      be allocated node-locally and we want to reduce cross-node references of
      otherwise independent cache workloads.  A simple shrinker will then
      reclaim these nodes on memory pressure.
      
      A few things need to be stored in the radix tree node to implement the
      shadow node LRU and allow tree deletions coming from the list:
      
      1. There is no index available that would describe the reverse path
         from the node up to the tree root, which is needed to perform a
         deletion.  To solve this, encode in each node its offset inside the
         parent.  This can be stored in the unused upper bits of the same
         member that stores the node's height at no extra space cost.
      
      2. The number of shadow entries needs to be counted in addition to the
         regular entries, to quickly detect when the node is ready to go to
         the shadow node LRU list.  The current entry count is an unsigned
         int but the maximum number of entries is 64, so a shadow counter
         can easily be stored in the unused upper bits.
      
      3. Tree modification needs tree lock and tree root, which are located
         in the address space, so store an address_space backpointer in the
         node.  The parent pointer of the node is in a union with the 2-word
         rcu_head, so the backpointer comes at no extra cost as well.
      
      4. The node needs to be linked to an LRU list, which requires a list
         head inside the node.  This does increase the size of the node, but
         it does not change the number of objects that fit into a slab page.
      
      [akpm@linux-foundation.org: export the right function]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      449dd698
  12. 31 10月, 2013 1 次提交
    • R
      mm: list_lru: fix almost infinite loop causing effective livelock · c56b097a
      Russell King 提交于
      I've seen a fair number of issues with kswapd and other processes
      appearing to get stuck in v3.12-rc.  Using sysrq-p many times seems to
      indicate that it gets stuck somewhere in list_lru_walk_node(), called
      from prune_icache_sb() and super_cache_scan().
      
      I never seem to be able to trigger a calltrace for functions above that
      point.
      
      So I decided to add the following to super_cache_scan():
      
          @@ -81,10 +81,14 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
                  inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
                  dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
                  total_objects = dentries + inodes + fs_objects + 1;
          +printk("%s:%u: %s: dentries %lu inodes %lu total %lu\n", current->comm, current->pid, __func__, dentries, inodes, total_objects);
      
                  /* proportion the scan between the caches */
                  dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
                  inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
          +printk("%s:%u: %s: dentries %lu inodes %lu\n", current->comm, current->pid, __func__, dentries, inodes);
          +BUG_ON(dentries == 0);
          +BUG_ON(inodes == 0);
      
                  /*
                   * prune the dcache first as the icache is pinned by it, then
          @@ -99,7 +103,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
                          freed += sb->s_op->free_cached_objects(sb, fs_objects,
                                                                 sc->nid);
                  }
          -
          +printk("%s:%u: %s: dentries %lu inodes %lu freed %lu\n", current->comm, current->pid, __func__, dentries, inodes, freed);
                  drop_super(sb);
                  return freed;
           }
      
      and shortly thereafter, having applied some pressure, I got this:
      
          update-apt-xapi:1616: super_cache_scan: dentries 25632 inodes 2 total 25635
          update-apt-xapi:1616: super_cache_scan: dentries 1023 inodes 0
          ------------[ cut here ]------------
          Kernel BUG at c0101994 [verbose debug info unavailable]
          Internal error: Oops - BUG: 0 [#3] SMP ARM
          Modules linked in: fuse rfcomm bnep bluetooth hid_cypress
          CPU: 0 PID: 1616 Comm: update-apt-xapi Tainted: G      D      3.12.0-rc7+ #154
          task: daea1200 ti: c3bf8000 task.ti: c3bf8000
          PC is at super_cache_scan+0x1c0/0x278
          LR is at trace_hardirqs_on+0x14/0x18
          Process update-apt-xapi (pid: 1616, stack limit = 0xc3bf8240)
          ...
          Backtrace:
            (super_cache_scan) from [<c00cd69c>] (shrink_slab+0x254/0x4c8)
            (shrink_slab) from [<c00d09a0>] (try_to_free_pages+0x3a0/0x5e0)
            (try_to_free_pages) from [<c00c59cc>] (__alloc_pages_nodemask+0x5)
            (__alloc_pages_nodemask) from [<c00e07c0>] (__pte_alloc+0x2c/0x13)
            (__pte_alloc) from [<c00e3a70>] (handle_mm_fault+0x84c/0x914)
            (handle_mm_fault) from [<c001a4cc>] (do_page_fault+0x1f0/0x3bc)
            (do_page_fault) from [<c001a7b0>] (do_translation_fault+0xac/0xb8)
            (do_translation_fault) from [<c000840c>] (do_DataAbort+0x38/0xa0)
            (do_DataAbort) from [<c00133f8>] (__dabt_usr+0x38/0x40)
      
      Notice that we had a very low number of inodes, which were reduced to
      zero my mult_frac().
      
      Now, prune_icache_sb() calls list_lru_walk_node() passing that number of
      inodes (0) into that as the number of objects to scan:
      
          long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
                               int nid)
          {
                  LIST_HEAD(freeable);
                  long freed;
      
                  freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate,
                                                 &freeable, &nr_to_scan);
      
      which does:
      
          unsigned long
          list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
                             void *cb_arg, unsigned long *nr_to_walk)
          {
      
                  struct list_lru_node    *nlru = &lru->node[nid];
                  struct list_head *item, *n;
                  unsigned long isolated = 0;
      
                  spin_lock(&nlru->lock);
          restart:
                  list_for_each_safe(item, n, &nlru->list) {
                          enum lru_status ret;
      
                          /*
                           * decrement nr_to_walk first so that we don't livelock if we
                           * get stuck on large numbesr of LRU_RETRY items
                           */
                          if (--(*nr_to_walk) == 0)
                                  break;
      
      So, if *nr_to_walk was zero when this function was entered, that means
      we're wanting to operate on (~0UL)+1 objects - which might as well be
      infinite.
      
      Clearly this is not correct behaviour.  If we think about the behaviour
      of this function when *nr_to_walk is 1, then clearly it's wrong - we
      decrement first and then test for zero - which results in us doing
      nothing at all.  A post-decrement would give the desired behaviour -
      we'd try to walk one object and one object only if *nr_to_walk were one.
      
      It also gives the correct behaviour for zero - we exit at this point.
      
      Fixes: 5cedf721 ("list_lru: fix broken LRU_RETRY behaviour")
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      [ Modified to make sure we never underflow the count: this function gets
        called in a loop, so the 0 -> ~0ul transition is dangerous  - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c56b097a
  13. 11 9月, 2013 6 次提交
    • G
      list_lru: dynamically adjust node arrays · 5ca302c8
      Glauber Costa 提交于
      We currently use a compile-time constant to size the node array for the
      list_lru structure.  Due to this, we don't need to allocate any memory at
      initialization time.  But as a consequence, the structures that contain
      embedded list_lru lists can become way too big (the superblock for
      instance contains two of them).
      
      This patch aims at ameliorating this situation by dynamically allocating
      the node arrays with the firmware provided nr_node_ids.
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5ca302c8
    • G
      list_lru: remove special case function list_lru_dispose_all. · 4e717f5c
      Glauber Costa 提交于
      The list_lru implementation has one function, list_lru_dispose_all, with
      only one user (the dentry code).  At first, such function appears to make
      sense because we are really not interested in the result of isolating each
      dentry separately - all of them are going away anyway.  However, it's
      implementation is buggy in the following way:
      
      When we call list_lru_dispose_all in fs/dcache.c, we scan all dentries
      marking them with DCACHE_SHRINK_LIST.  However, this is done without the
      nlru->lock taken.  The imediate result of that is that someone else may
      add or remove the dentry from the LRU at the same time.  When list_lru_del
      happens in that scenario we will see an element that is not yet marked
      with DCACHE_SHRINK_LIST (even though it will be in the future) and
      obviously remove it from an lru where the element no longer is.  Since
      list_lru_dispose_all will in effect count down nlru's nr_items and
      list_lru_del will do the same, this will lead to an imbalance.
      
      The solution for this would not be so simple: we can obviously just keep
      the lru_lock taken, but then we have no guarantees that we will be able to
      acquire the dentry lock (dentry->d_lock).  To properly solve this, we need
      a communication mechanism between the lru and dentry code, so they can
      coordinate this with each other.
      
      Such mechanism already exists in the form of the list_lru_walk_cb
      callback.  So it is possible to construct a dcache-side prune function
      that does the right thing only by calling list_lru_walk in a loop until no
      more dentries are available.
      
      With only one user, plus the fact that a sane solution for the problem
      would involve boucing between dcache and list_lru anyway, I see little
      justification to keep the special case list_lru_dispose_all in tree.
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4e717f5c
    • G
      list_lru: per-node API · 6a4f496f
      Glauber Costa 提交于
      This patch adapts the list_lru API to accept an optional node argument, to
      be used by NUMA aware shrinking functions.  Code that does not care about
      the NUMA placement of objects can still call into the very same functions
      as before.  They will simply iterate over all nodes.
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6a4f496f
    • D
      list_lru: fix broken LRU_RETRY behaviour · 5cedf721
      Dave Chinner 提交于
      The LRU_RETRY code assumes that the list traversal status after we have
      dropped and regained the list lock.  Unfortunately, this is not a valid
      assumption, and that can lead to racing traversals isolating objects that
      the other traversal expects to be the next item on the list.
      
      This is causing problems with the inode cache shrinker isolation, with
      races resulting in an inode on a dispose list being "isolated" because a
      racing traversal still thinks it is on the LRU.  The inode is then never
      reclaimed and that causes hangs if a subsequent lookup on that inode
      occurs.
      
      Fix it by always restarting the list walk on a LRU_RETRY return from the
      isolate callback.  Avoid the possibility of livelocks the current code was
      trying to avoid by always decrementing the nr_to_walk counter on retries
      so that even if we keep hitting the same item on the list we'll eventually
      stop trying to walk and exit out of the situation causing the problem.
      Reported-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5cedf721
    • D
      list_lru: per-node list infrastructure · 3b1d58a4
      Dave Chinner 提交于
      Now that we have an LRU list API, we can start to enhance the
      implementation.  This splits the single LRU list into per-node lists and
      locks to enhance scalability.  Items are placed on lists according to the
      node the memory belongs to.  To make scanning the lists efficient, also
      track whether the per-node lists have entries in them in a active
      nodemask.
      
      Note: We use a fixed-size array for the node LRU, this struct can be very
      big if MAX_NUMNODES is big.  If this becomes a problem this is fixable by
      turning this into a pointer and dynamically allocating this to
      nr_node_ids.  This quantity is firwmare-provided, and still would provide
      room for all nodes at the cost of a pointer lookup and an extra
      allocation.  Because that allocation will most likely come from a may very
      well fail.
      
      [glommer@openvz.org: fix warnings, added note about node lru]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Reviewed-by: NGreg Thelen <gthelen@google.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3b1d58a4
    • D
      list: add a new LRU list type · a38e4082
      Dave Chinner 提交于
      Several subsystems use the same construct for LRU lists - a list head, a
      spin lock and and item count.  They also use exactly the same code for
      adding and removing items from the LRU.  Create a generic type for these
      LRU lists.
      
      This is the beginning of generic, node aware LRUs for shrinkers to work
      with.
      
      [glommer@openvz.org: enum defined constants for lru. Suggested by gthelen, don't relock over retry]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Reviewed-by: NGreg Thelen <gthelen@google.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a38e4082