1. 13 Jul, 2019 10 commits
    • W
      mm, memcg: add a memcg_slabinfo debugfs file · fcf8a1e4
      Waiman Long committed
      There are concerns about memory leaks from extensive use of memory cgroups
      as each memory cgroup creates its own set of kmem caches.  There is a
      possibility that the memcg kmem caches may remain even after the memory
      cgroups have been offlined.  Therefore, it will be useful to show the
      status of each of memcg kmem caches.
      
      This patch introduces a new <debugfs>/memcg_slabinfo file which is
      somewhat similar to /proc/slabinfo in format, but lists only information
      about kmem caches that have child memcg kmem caches.  Information
      available in /proc/slabinfo is not repeated in memcg_slabinfo.
      
      A portion of a sample output of the file was:
      
        # <name> <css_id[:dead]> <active_objs> <num_objs> <active_slabs> <num_slabs>
        rpc_inode_cache   root          13     51      1      1
        rpc_inode_cache     48           0      0      0      0
        fat_inode_cache   root           1     45      1      1
        fat_inode_cache     41           2     45      1      1
        xfs_inode         root         770    816     24     24
        xfs_inode           92          22     34      1      1
        xfs_inode           88:dead      1     34      1      1
        xfs_inode           89:dead     23     34      1      1
        xfs_inode           85           4     34      1      1
        xfs_inode           84           9     34      1      1
      
      The css id of the memcg is also listed. If a memcg is not online,
      the tag ":dead" will be attached as shown above.
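
      As a rough illustration of how the format above could be consumed, here
      is a minimal Python sketch that parses lines in the sample layout and
      sums the active objects still held by dead memcgs.  The names
      MemcgSlabEntry, parse_memcg_slabinfo and dead_objects_per_cache are
      hypothetical helpers, not part of the patch, and the sketch assumes the
      columns appear exactly as in the sample.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class MemcgSlabEntry(NamedTuple):
    """One data line of memcg_slabinfo (hypothetical helper type)."""
    name: str
    css_id: str        # "root" or a numeric css id, as printed in the file
    tag: str           # "" when no tag is attached, otherwise e.g. "dead"
    active_objs: int
    num_objs: int
    active_slabs: int
    num_slabs: int

    @property
    def dead(self) -> bool:
        return self.tag == "dead"


def parse_memcg_slabinfo(text: str) -> List[MemcgSlabEntry]:
    """Parse text laid out like the sample output in the commit message."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        name, css, *counts = line.split()
        css_id, _, tag = css.partition(":")
        entries.append(MemcgSlabEntry(name, css_id, tag, *map(int, counts)))
    return entries


def dead_objects_per_cache(entries: List[MemcgSlabEntry]) -> Dict[str, int]:
    """Sum active objects that belong to dead memcgs, per cache name."""
    totals: Dict[str, int] = defaultdict(int)
    for entry in entries:
        if entry.dead:
            totals[entry.name] += entry.active_objs
    return dict(totals)


SAMPLE = """\
# <name> <css_id[:dead]> <active_objs> <num_objs> <active_slabs> <num_slabs>
xfs_inode         root         770    816     24     24
xfs_inode           92          22     34      1      1
xfs_inode           88:dead      1     34      1      1
xfs_inode           89:dead     23     34      1      1
"""

if __name__ == "__main__":
    print(dead_objects_per_cache(parse_memcg_slabinfo(SAMPLE)))
```

      Only the ":dead" tag is treated specially here; any other tag is kept
      verbatim in the tag field.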
      
      [longman@redhat.com: memcg: add ":deact" tag for reparented kmem caches in memcg_slabinfo]
        Link: http://lkml.kernel.org/r/20190621173005.31514-1-longman@redhat.com
      [longman@redhat.com: set the flag in the common code as suggested by Roman]
        Link: http://lkml.kernel.org/r/20190627184324.5875-1-longman@redhat.com
      Link: http://lkml.kernel.org/r/20190619171621.26209-1-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Suggested-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fcf8a1e4
    • R
      mm: memcg/slab: reparent memcg kmem_caches on cgroup removal · fb2f2b0a
      Roman Gushchin committed
      Let's reparent non-root kmem_caches on memcg offlining.  This allows us to
      release the memory cgroup without waiting for the last outstanding kernel
      object (e.g.  dentry used by another application).
      
      Since the parent cgroup is already charged, all we need to do is
      splice the list of kmem_caches to the parent's kmem_caches list, swap the
      memcg pointer, drop the css refcounter for each kmem_cache and adjust the
      parent's css refcounter.
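
      The four steps above can be sketched with a toy model.  This is only an
      illustration in Python under simplifying assumptions (no locking, no
      RCU, one reference per cache); Memcg, KmemCache and
      reparent_kmem_caches are hypothetical names, not the kernel structures
      or functions.

```python
class Memcg:
    """Toy stand-in for a memory cgroup; not the kernel structure."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.css_refs = 0       # references held on this cgroup by kmem caches
        self.kmem_caches = []   # caches currently linked to this cgroup


class KmemCache:
    """Toy stand-in for a non-root kmem_cache."""

    def __init__(self, name, memcg):
        self.name = name
        self.memcg = memcg
        memcg.kmem_caches.append(self)
        memcg.css_refs += 1     # assumption: each cache holds one reference


def reparent_kmem_caches(child):
    """Move every cache of an offlined child cgroup to its parent."""
    parent = child.parent
    moved = len(child.kmem_caches)
    for cache in child.kmem_caches:
        cache.memcg = parent                      # swap the memcg pointer
    parent.kmem_caches.extend(child.kmem_caches)  # splice the list
    child.kmem_caches = []
    child.css_refs -= moved                       # drop the child's references
    parent.css_refs += moved                      # adjust the parent's refcount
    return moved
```

      After the call the child holds no cache references, so in this model
      nothing kmem-related keeps it alive any longer.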
      
      Please note that kmem_cache->memcg_params.memcg isn't a stable pointer
      anymore.  It's safe to read it under rcu_read_lock(), cgroup_mutex held,
      or any other way that protects the memory cgroup from being released.
      
      We can race with the slab allocation and deallocation paths.  It's not a
      big problem: parent's charge and slab global stats are always correct, and
      we don't care anymore about the child usage and global stats.  The child
      cgroup is already offline, so we don't use or show it anywhere.
      
      Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't
      used anywhere except count_shadow_nodes().  But even there it won't break
      anything: after reparenting "nodes" will be 0 on child level (because
      we're already reparenting shrinker lists), and on parent level page stats
      always were 0, and this patch won't change anything.
      
      [guro@fb.com: properly handle kmem_caches reparented to root_mem_cgroup]
        Link: http://lkml.kernel.org/r/20190620213427.1691847-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20190611231813.3148843-11-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fb2f2b0a
    • R
      mm: memcg/slab: rework non-root kmem_cache lifecycle management · f0a3a24b
      Roman Gushchin committed
      Currently each charged slab page holds a reference to the cgroup to which
      it's charged.  Kmem_caches are held by the memcg and are released all
      together with the memory cgroup.  It means that none of the kmem_caches
      are released as long as at least one reference to the memcg exists, which
      is very far from optimal.
      
      Let's rework it in a way that allows releasing individual kmem_caches as
      soon as the cgroup is offline, the kmem_cache is empty and there are no
      pending allocations.
      
      To make it possible, let's introduce a new percpu refcounter for non-root
      kmem caches.  The counter is initialized to the percpu mode, and is
      switched to the atomic mode during kmem_cache deactivation.  The counter
      is bumped for every charged page and also for every running allocation.
      So the kmem_cache can't be released unless all allocations complete.
      
      To shut down non-active empty kmem_caches, let's reuse the work queue,
      previously used for the kmem_cache deactivation.  Once the reference
      counter reaches 0, let's schedule an asynchronous kmem_cache release.
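
      The counter lifecycle described above can be modelled roughly as
      follows.  This is a single-threaded Python sketch, not the kernel's
      percpu refcounter: CacheRef is a hypothetical class, and the initial
      reference dropped on deactivation is an assumption of the model.

```python
class CacheRef:
    """Toy model of the per-cache reference counter described above."""

    def __init__(self, release):
        self.count = 1         # assumption: an initial reference, dropped on deactivation
        self.atomic = False    # False models percpu mode, True models atomic mode
        self.released = False
        self._release = release

    def get(self):
        """Taken for every charged page and every running allocation."""
        self.count += 1

    def put(self):
        self.count -= 1
        # In this model, reaching zero is only observed in atomic mode.
        if self.atomic and self.count == 0 and not self.released:
            self.released = True
            self._release()    # stands in for scheduling the asynchronous release

    def deactivate(self):
        """Switch to atomic mode and drop the initial reference."""
        self.atomic = True
        self.put()
```

      A cache with one charged page survives deactivate() and is released only
      when that last page is uncharged.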
      
      * I used the following simple approach to test the performance
      (stolen from another patchset by T. Harding):
      
          time find / -name fname-no-exist
          echo 2 > /proc/sys/vm/drop_caches
          repeat 10 times
      
      Results:
      
              orig		patched
      
      real	0m1.455s	real	0m1.355s
      user	0m0.206s	user	0m0.219s
      sys	0m0.855s	sys	0m0.807s
      
      real	0m1.487s	real	0m1.699s
      user	0m0.221s	user	0m0.256s
      sys	0m0.806s	sys	0m0.948s
      
      real	0m1.515s	real	0m1.505s
      user	0m0.183s	user	0m0.215s
      sys	0m0.876s	sys	0m0.858s
      
      real	0m1.291s	real	0m1.380s
      user	0m0.193s	user	0m0.198s
      sys	0m0.843s	sys	0m0.786s
      
      real	0m1.364s	real	0m1.374s
      user	0m0.180s	user	0m0.182s
      sys	0m0.868s	sys	0m0.806s
      
      real	0m1.352s	real	0m1.312s
      user	0m0.201s	user	0m0.212s
      sys	0m0.820s	sys	0m0.761s
      
      real	0m1.302s	real	0m1.349s
      user	0m0.205s	user	0m0.203s
      sys	0m0.803s	sys	0m0.792s
      
      real	0m1.334s	real	0m1.301s
      user	0m0.194s	user	0m0.201s
      sys	0m0.806s	sys	0m0.779s
      
      real	0m1.426s	real	0m1.434s
      user	0m0.216s	user	0m0.181s
      sys	0m0.824s	sys	0m0.864s
      
      real	0m1.350s	real	0m1.295s
      user	0m0.200s	user	0m0.190s
      sys	0m0.842s	sys	0m0.811s
      
      So it looks like the difference is not noticeable in this test.
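
      For a quick check of that conclusion, the ten "real" times above can be
      averaged.  The Python sketch below only restates the numbers from the
      table; relative_change is a hypothetical helper.

```python
from statistics import mean

# "real" times in seconds, copied from the ten runs listed above.
ORIG = [1.455, 1.487, 1.515, 1.291, 1.364, 1.352, 1.302, 1.334, 1.426, 1.350]
PATCHED = [1.355, 1.699, 1.505, 1.380, 1.374, 1.312, 1.349, 1.301, 1.434, 1.295]


def relative_change(before, after):
    """Relative change of the mean, positive when 'after' is slower."""
    return (mean(after) - mean(before)) / mean(before)


if __name__ == "__main__":
    print(f"orig mean:    {mean(ORIG):.4f}s")
    print(f"patched mean: {mean(PATCHED):.4f}s")
    print(f"change:       {relative_change(ORIG, PATCHED):+.2%}")
```

      The means are about 1.388s and 1.400s, a difference of under 1%, which
      is well within the run-to-run spread of either column.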
      
      [cai@lca.pw: fix an use-after-free in kmemcg_workfn()]
        Link: http://lkml.kernel.org/r/1560977573-10715-1-git-send-email-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190611231813.3148843-9-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0a3a24b
    • R
      mm: memcg/slab: synchronize access to kmem_cache dying flag using a spinlock · 63b02ef7
      Roman Gushchin committed
      Currently the memcg_params.dying flag and the corresponding workqueue used
      for the asynchronous deactivation of kmem_caches are synchronized using the
      slab_mutex.

      This makes it impossible to check this flag from irq context, which will be
      required in order to implement asynchronous release of kmem_caches.
      
      So let's switch over to the irq-save flavor of the spinlock-based
      synchronization.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-8-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63b02ef7
    • R
      mm: memcg/slab: don't check the dying flag on kmem_cache creation · 57033297
      Roman Gushchin committed
      There is no point in checking the root_cache->memcg_params.dying flag on
      the kmem_cache creation path.  New allocations shouldn't be performed using a
      dead root kmem_cache, so no new memcg kmem_cache creation can be scheduled
      after the flag is set.  And if it was scheduled before,
      flush_memcg_workqueue() will wait for it anyway.
      
      So let's drop this check to simplify the code.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-7-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57033297
    • R
      mm: memcg/slab: generalize postponed non-root kmem_cache deactivation · 43486694
      Roman Gushchin committed
      Currently SLUB uses a work scheduled after an RCU grace period to
      deactivate a non-root kmem_cache.  This mechanism can be reused for
      kmem_cache release, but requires generalization for the SLAB case.

      Introduce the kmemcg_cache_deactivate() function, which calls the
      allocator-specific __kmemcg_cache_deactivate() and schedules execution of
      __kmemcg_cache_deactivate_after_rcu() with all necessary locks in a worker
      context after an RCU grace period.
      
      Here is the new calling scheme:
        kmemcg_cache_deactivate()
          __kmemcg_cache_deactivate()                  SLAB/SLUB-specific
          kmemcg_rcufn()                               rcu
            kmemcg_workfn()                            work
              __kmemcg_cache_deactivate_after_rcu()    SLAB/SLUB-specific
      
      instead of:
        __kmemcg_cache_deactivate()                    SLAB/SLUB-specific
          slab_deactivate_memcg_cache_rcu_sched()      SLUB-only
            kmemcg_rcufn()                             rcu
              kmemcg_workfn()                          work
                kmemcg_cache_deact_after_rcu()         SLUB-only
      
      For consistency, all allocator-specific functions start with "__".
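
      The ordering in the new calling scheme can be illustrated with a small
      single-threaded simulation.  This is a Python sketch only:
      DeferredRunner is a hypothetical stand-in for the RCU callback queue and
      the workqueue, and the functions merely record their names in the order
      they would run.

```python
from collections import deque


class DeferredRunner:
    """Toy stand-in for an RCU callback queue plus a workqueue."""

    def __init__(self):
        self.rcu_callbacks = deque()
        self.work_items = deque()

    def run_all(self):
        # Model the end of a grace period: RCU callbacks run first and
        # queue work items, which then run in "worker context".
        while self.rcu_callbacks:
            self.rcu_callbacks.popleft()()
        while self.work_items:
            self.work_items.popleft()()


def kmemcg_cache_deactivate(runner, trace):
    """Record the call order of the new scheme shown above."""
    trace.append("__kmemcg_cache_deactivate")  # allocator-specific, runs immediately

    def after_rcu():
        trace.append("__kmemcg_cache_deactivate_after_rcu")  # allocator-specific

    def workfn():
        trace.append("kmemcg_workfn")
        after_rcu()

    def rcufn():
        trace.append("kmemcg_rcufn")
        runner.work_items.append(workfn)  # hand over to the workqueue

    runner.rcu_callbacks.append(rcufn)


if __name__ == "__main__":
    runner, trace = DeferredRunner(), []
    kmemcg_cache_deactivate(runner, trace)
    runner.run_all()
    print("\n".join(trace))
```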
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43486694
    • R
      mm: memcg/slab: rename slab delayed deactivation functions and fields · 0b14e8aa
      Roman Gushchin committed
      The delayed work/rcu deactivation infrastructure of non-root kmem_caches
      can also be used for asynchronous release of these objects.  Let's get rid
      of the word "deactivation" in corresponding names to make the code look
      better after generalization.
      
      It's easier to make the renaming first, so that the generalized code will
      look consistent from scratch.
      
      Let's rename struct memcg_cache_params fields:
        deact_fn -> work_fn
        deact_rcu_head -> rcu_head
        deact_work -> work
      
      And RCU/delayed work callbacks in slab common code:
        kmemcg_deactivate_rcufn -> kmemcg_rcufn
        kmemcg_deactivate_workfn -> kmemcg_workfn
      
      This patch contains no functional changes, only renamings.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-3-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b14e8aa
    • R
      mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache() · c03914b7
      Roman Gushchin committed
      Patch series "mm: reparent slab memory on cgroup removal", v7.
      
      # Why do we need this?
      
      We've noticed that the number of dying cgroups is steadily growing on most
      of our hosts in production.  The following investigation revealed an issue
      in the userspace memory reclaim code [1], accounting of kernel stacks [2],
      and also the main reason: slab objects.
      
      The underlying problem is quite simple: any page charged to a cgroup holds
      a reference to it, so the cgroup can't be reclaimed unless all charged
      pages are gone.  If a slab object is actively used by other cgroups, it
      won't be reclaimed, and will prevent the origin cgroup from being
      reclaimed.
      
      Slab objects, and first of all the vfs cache, are shared between cgroups
      which use the same underlying fs and, even more importantly, they are
      shared between multiple generations of the same workload.  So if something
      runs periodically, each time in a new cgroup (like how systemd works), we
      accumulate multiple dying cgroups.
      
      Strictly speaking pagecache isn't different here, but there is a key
      difference: we disable protection and apply some extra pressure on LRUs of
      dying cgroups, and these LRUs contain all charged pages.  My experiments
      show that with kernel memory accounting disabled the number of dying
      cgroups stabilizes at a relatively small number (~100, depending on memory
      pressure and cgroup creation rate), and with kernel memory accounting
      enabled it grows pretty steadily up to several thousand.
      
      Memory cgroups are quite complex and big objects (mostly due to percpu
      stats), so it leads to noticeable memory losses.  Memory occupied by dying
      cgroups is measured in hundreds of megabytes.  I've even seen a host with
      more than 100Gb of memory wasted for dying cgroups.  It leads to a
      degradation of performance with the uptime, and generally limits the usage
      of cgroups.
      
      My previous attempt [3] to fix the problem by applying extra pressure on
      slab shrinker lists caused regressions with xfs and ext4, and has been
      reverted [4].  Subsequent attempts to find the right balance [5, 6]
      were not successful.
      
      So instead of trying to find a balance that may not exist, let's
      reparent accounted slab caches to the parent cgroup on cgroup removal.
      
      # Implementation approach
      
      There is however a significant problem with reparenting of slab memory:
      there is no list of charged pages.  Some of them are in shrinker lists,
      but not all.  Introducing a new list is really not an option.
      
      But fortunately there is a way forward: every slab page has a stable
      pointer to the corresponding kmem_cache.  So the idea is to reparent
      kmem_caches instead of slab pages.
      
      It's actually simpler and cheaper, but requires some underlying changes:
      1) Make kmem_caches hold a single reference to the memory cgroup,
         instead of a separate reference for every slab page.
      2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
         page->kmem_cache->memcg indirection instead. It's used only on
         slab page release, so performance overhead shouldn't be a big issue.
      3) Introduce a refcounter for non-root slab caches. It's required to
         be able to destroy kmem_caches when they become empty and release
         the associated memory cgroup.
      
      There is a bonus: currently we release all memcg kmem_caches together
      with the memory cgroup itself.  This patchset allows individual
      kmem_caches to be released as soon as they become inactive and free.
      
      Some additional implementation details are provided in corresponding
      commit messages.
      
      # Results
      
      Below is the average number of dying cgroups on two groups of our
      production hosts.  They run some sort of web frontend workload, and the
      memory pressure is moderate.  As we can see, with the kernel memory
      reparenting the number stabilizes in the 60s range; however with the
      original version it grows almost linearly and doesn't show any signs of
      plateauing.  The difference in slab and percpu usage between patched and
      unpatched versions also grows linearly.  In 7 days it exceeded 200Mb.
      
      day           0    1    2    3    4    5    6    7
      original     56  362  628  752 1070 1250 1490 1560
      patched      23   46   51   55   60   57   67   69
      mem diff(Mb) 22   74  123  152  164  182  214  241
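
      To put numbers on the growth described above, the average daily change
      can be computed from the table.  The Python sketch below uses only the
      values listed; average_daily_growth is a hypothetical helper.

```python
DAYS = [0, 1, 2, 3, 4, 5, 6, 7]
ORIGINAL = [56, 362, 628, 752, 1070, 1250, 1490, 1560]
PATCHED = [23, 46, 51, 55, 60, 57, 67, 69]


def average_daily_growth(days, series):
    """Average change per day between the first and the last sample."""
    return (series[-1] - series[0]) / (days[-1] - days[0])


if __name__ == "__main__":
    print(f"original: {average_daily_growth(DAYS, ORIGINAL):.1f} dying cgroups/day")
    print(f"patched:  {average_daily_growth(DAYS, PATCHED):.1f} dying cgroups/day")
```

      Over the 7 days this gives roughly 215 additional dying cgroups per day
      for the original kernel against fewer than 7 per day for the patched
      one.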
      
      # Links
      
      [1]: commit 68600f62 ("mm: don't miss the last page because of round-off error")
      [2]: commit 9b6f7e16 ("mm: rework memcg kernel stack accounting")
      [3]: commit 172b06c3 ("mm: slowly shrink slabs with a relatively small number of objects")
      [4]: commit a9a238e8 ("Revert "mm: slowly shrink slabs with a relatively small number of objects"")
      [5]: https://lkml.org/lkml/2019/1/28/1865
      [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2
      
      This patch (of 10):
      
      Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
      rather than in init_memcg_params().
      
      Once the kmem_cache holds a reference to the memory cgroup, this will
      simplify the refcounting.
      
      For non-root kmem_caches memcg_link_cache() is always called before the
      kmem_cache becomes visible to a user, so it's safe.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c03914b7
    • M
      mm/kasan: add object validation in ksize() · 0d4ca4c9
      Marco Elver committed
      ksize() has been unconditionally unpoisoning the whole shadow memory
      region associated with an allocation.  This can lead to various undetected
      bugs, for example, double-kzfree().
      
      Specifically, kzfree() uses ksize() to determine the actual allocation
      size, and subsequently zeroes the memory.  Since ksize() used to just
      unpoison the whole shadow memory region, no invalid free was detected.
      
      This patch addresses this as follows:
      
      1. Add a check in ksize(), and only then unpoison the memory region.
      
      2. Preserve kasan_unpoison_slab() semantics by explicitly unpoisoning
         the shadow memory region using the size obtained from __ksize().
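
      The effect of the added check can be shown with a toy model.  This is a
      Python sketch under heavy simplification: ToyKasanHeap is hypothetical,
      it keeps one "poisoned" flag per object instead of real shadow memory,
      and returning 0 from ksize() on a failed check is an assumption of the
      model.

```python
class ToyKasanHeap:
    """Very small model: one poison flag per object, no real shadow memory."""

    def __init__(self, validate):
        self.validate = validate   # True models ksize() with the added check
        self._size = {}            # address -> allocation size
        self._poisoned = set()     # addresses whose shadow marks them as freed
        self._next = 0x1000
        self.reports = []          # addresses reported as invalid accesses

    def kmalloc(self, size):
        addr = self._next
        self._next += 0x100
        self._size[addr] = size
        return addr

    def ksize(self, addr):
        if self.validate and addr in self._poisoned:
            self.reports.append(addr)   # check first: object is already freed
            return 0
        self._poisoned.discard(addr)    # unpoison the whole object
        return self._size[addr]

    def kzfree(self, addr):
        size = self.ksize(addr)    # kzfree() asks ksize() how much to zero
        if size == 0:
            return                 # the invalid free was reported by ksize()
        self._poisoned.add(addr)   # freeing poisons the object again
```

      With validate=False the second kzfree() silently unpoisons the freed
      object and goes unnoticed; with validate=True it is reported.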
      
      Tested:
      1. With SLAB allocator: a) normal boot without warnings; b) verified the
         added double-kzfree() is detected.
      2. With SLUB allocator: a) normal boot without warnings; b) verified the
         added double-kzfree() is detected.
      
      [elver@google.com: s/BUG_ON/WARN_ON_ONCE/, per Kees]
        Link: http://lkml.kernel.org/r/20190627094445.216365-6-elver@google.com
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199359
      Link: http://lkml.kernel.org/r/20190626142014.141844-6-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d4ca4c9
    • M
      mm/slab: refactor common ksize KASAN logic into slab_common.c · 10d1f8cb
      Marco Elver committed
      This refactors common code of ksize() between the various allocators into
      slab_common.c: __ksize() is the allocator-specific implementation without
      instrumentation, whereas ksize() includes the required KASAN logic.
      
      Link: http://lkml.kernel.org/r/20190626142014.141844-5-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10d1f8cb
  2. 30 Mar, 2019 1 commit
    • N
      mm: add support for kmem caches in DMA32 zone · 6d6ea1e9
      Nicolas Boichat committed
      Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
      v6.
      
      This is a followup to the discussion in [1], [2].
      
      IOMMUs using ARMv7 short-descriptor format require page tables (level 1
      and 2) to be allocated within the first 4GB of RAM, even on 64-bit
      systems.
      
      For L1 tables that are bigger than a page, we can just use
      __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
      use GFP_DMA).
      
      For L2 tables that only take 1KB, it would be a waste to allocate a full
      page, so we considered 3 approaches:
       1. This series, adding support for GFP_DMA32 slab caches.
       2. genalloc, which requires pre-allocating the maximum number of L2 page
          tables (4096, so 4MB of memory).
       3. page_frag, which is not very memory-efficient as it is unable to reuse
          freed fragments until the whole page is freed. [3]
      
      This series is the most memory-efficient approach.
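
      The memory figures behind that comparison are simple arithmetic, shown
      here as a small Python sketch.  The 1KB table size and the 4096-table
      maximum come from the text above; the 4KB page size is an assumption.

```python
KIB = 1024
L2_TABLE_SIZE = 1 * KIB    # from the text: an L2 table takes 1KB
MAX_L2_TABLES = 4096       # from the text: maximum number of L2 page tables
PAGE_SIZE = 4 * KIB        # assumption: 4KB pages


def genalloc_preallocation_bytes():
    """Memory genalloc would have to reserve up front for all L2 tables."""
    return MAX_L2_TABLES * L2_TABLE_SIZE


def full_page_waste_fraction():
    """Share of a page left unused when one page backs a single L2 table."""
    return (PAGE_SIZE - L2_TABLE_SIZE) / PAGE_SIZE


if __name__ == "__main__":
    print(genalloc_preallocation_bytes() // (KIB * KIB), "MB pre-allocated by genalloc")
    print(f"{full_page_waste_fraction():.0%} of each page wasted with full-page allocation")
```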
      
      stable@ note:
        We confirmed that this is a regression, and IOMMU errors happen on 4.19
        and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
        most likely starts from commit ad67f5a6 ("arm64: replace ZONE_DMA
        with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
        platforms (and maybe others?).
      
      [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
      [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
      [3] https://patchwork.codeaurora.org/patch/671639/
      
      This patch (of 3):
      
      IOMMUs using ARMv7 short-descriptor format require page tables to be
      allocated within the first 4GB of RAM, even on 64-bit systems.  On arm64,
      this is done by passing GFP_DMA32 flag to memory allocation functions.
      
      For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
      a full page using get_free_pages, so we considered 3 approaches:
       1. This patch, adding support for GFP_DMA32 slab caches.
       2. genalloc, which requires pre-allocating the maximum number of L2
          page tables (4096, so 4MB of memory).
       3. page_frag, which is not very memory-efficient as it is unable
          to reuse freed fragments until the whole page is freed.
      
      This change makes it possible to create a custom cache in DMA32 zone using
      kmem_cache_create, then allocate memory using kmem_cache_alloc.
      
      We do not create a DMA32 kmalloc cache array, as there are currently no
      users of kmalloc(..., GFP_DMA32).  These calls will continue to trigger a
      warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.
      
      This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
      kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
      unnecessary).
      
      Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
      Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Sasha Levin <Alexander.Levin@microsoft.com>
      Cc: Huaisheng Ye <yehs1@lenovo.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yong Wu <yong.wu@mediatek.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Tomasz Figa <tfiga@google.com>
      Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d6ea1e9
  3. 06 Mar, 2019 2 commits
  4. 22 Feb, 2019 2 commits
  5. 29 Dec, 2018 3 commits
    • Y
      mm, slab: remove unnecessary unlikely() · 221d7da6
      Yangtao Li committed
      WARN_ON() already contains an unlikely(), so it's not necessary to use
      unlikely.
      
      Also change WARN_ON() back to WARN_ON_ONCE() to avoid potentially
      spamming dmesg with user-triggerable large allocations.
      
      [akpm@linux-foundation.org: s/WARN_ON/WARN_ON_ONCE/, per Vlastimil]
      Link: http://lkml.kernel.org/r/20181104125028.3572-1-tiny.windzz@gmail.com
      Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      221d7da6
    • A
      kasan, mm: perform untagged pointers comparison in krealloc · 772a2fa5
      Andrey Konovalov committed
      The krealloc function checks whether the same buffer was reused or a new
      one allocated by comparing kernel pointers.  Tag-based KASAN changes the
      memory tag on the krealloc'ed chunk of memory and therefore also changes
      the pointer tag of the returned pointer.  Therefore we need to perform the
      comparison on untagged (with tags reset) pointers to check whether it's
      the same memory region or not.
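
      The comparison can be pictured with plain integers standing in for
      64-bit pointers.  This is a Python sketch only: set_tag, reset_tag and
      same_buffer are hypothetical helpers, and using the all-ones byte as the
      "reset" tag value is an assumption of the sketch.

```python
TAG_SHIFT = 56
TAG_MASK = 0xFF << TAG_SHIFT
RESET_TAG = 0xFF   # assumption: resetting a tag sets the top byte to all ones


def set_tag(ptr, tag):
    """Return ptr with its top byte replaced by tag."""
    return (ptr & ~TAG_MASK) | ((tag & 0xFF) << TAG_SHIFT)


def reset_tag(ptr):
    """Return ptr with the tag reset, so only the address bits matter."""
    return set_tag(ptr, RESET_TAG)


def same_buffer(old_ptr, new_ptr):
    """Compare two pointers while ignoring their tags."""
    return reset_tag(old_ptr) == reset_tag(new_ptr)


if __name__ == "__main__":
    base = 0xFFFF000012345680
    old, new = set_tag(base, 0x3A), set_tag(base, 0x7C)
    print(old == new, same_buffer(old, new))
```

      Two pointers to the same address with different tags compare unequal as
      raw values but equal once the tags are reset.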
      
      Link: http://lkml.kernel.org/r/14f6190d7846186a3506cd66d82446646fe65090.1544099024.git.andreyknvl@google.com
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      772a2fa5
    • A
      kasan, mm: change hooks signatures · 0116523c
      Andrey Konovalov committed
      Patch series "kasan: add software tag-based mode for arm64", v13.
      
      This patchset adds a new software tag-based mode to KASAN [1].  (Initially
      this mode was called KHWASAN, but it got renamed, see the naming rationale
      at the end of this section).
      
      The plan is to implement HWASan [2] for the kernel, with the incentive
      that it is going to have performance comparable to KASAN while
      consuming much less memory, trading that off for somewhat imprecise bug
      detection and being supported only for arm64.
      
      The underlying ideas of the approach used by software tag-based KASAN are:
      
      1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
         pointer tags in the top byte of each kernel pointer.
      
      2. Using shadow memory, we can store memory tags for each chunk of kernel
         memory.
      
      3. On each memory allocation, we can generate a random tag, embed it into
         the returned pointer and set the memory tags that correspond to this
         chunk of memory to the same value.
      
      4. By using compiler instrumentation, before each memory access we can add
         a check that the pointer tag matches the tag of the memory that is being
         accessed.
      
      5. On a tag mismatch we report an error.
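      The five ideas above can be modelled with a toy allocator.  This is
      only a sketch under stated assumptions: the "heap" is a range of 256
      offsets, a counter stands in for the random tag generator, the 16-byte
      granule is taken from the technical details of this series, and
      toy_alloc() and toy_check() are hypothetical names, not kernel
      functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE   16u   /* bytes of memory covered by one shadow byte */
#define HEAP_SIZE 256u  /* size of the toy heap in bytes */
#define TAG_SHIFT 56    /* the tag is kept in the top byte */

static uint8_t shadow[HEAP_SIZE / GRANULE]; /* memory tags, 0 = never allocated */
static size_t next_free;                    /* bump allocator offset */
static uint8_t next_tag = 1;                /* stand-in for a random tag */

/* Allocate size bytes; return the heap offset with a tag in the top byte. */
static uint64_t toy_alloc(size_t size)
{
    size_t rounded = (size + GRANULE - 1) / GRANULE * GRANULE;

    assert(size > 0 && next_free + rounded <= HEAP_SIZE);
    uint8_t tag = next_tag++;
    /* Set the memory tags of every granule of the chunk to the same value. */
    for (size_t off = next_free; off < next_free + rounded; off += GRANULE)
        shadow[off / GRANULE] = tag;
    uint64_t ptr = ((uint64_t)tag << TAG_SHIFT) | next_free;
    next_free += rounded;
    return ptr;
}

/* The check instrumentation would add before an access: 1 = ok, 0 = mismatch. */
static int toy_check(uint64_t ptr)
{
    uint8_t ptr_tag = (uint8_t)(ptr >> TAG_SHIFT);
    size_t off = (size_t)(ptr & ~((uint64_t)0xff << TAG_SHIFT));

    if (off >= HEAP_SIZE)
        return 0;
    return shadow[off / GRANULE] == ptr_tag;
}
```

      An access through a pointer into its own chunk passes toy_check(),
      while stepping past the rounded-up end of the chunk lands on a granule
      with a different tag and fails, which is where a report would be
      printed.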
      
      With this patchset the existing KASAN mode gets renamed to generic KASAN,
      with the word "generic" meaning that the implementation can be supported
      by any architecture as it is purely software.
      
      The new mode this patchset adds is called software tag-based KASAN.  The
      word "tag-based" refers to the fact that this mode uses tags embedded into
      the top byte of kernel pointers and the TBI arm64 CPU feature that allows
      such pointers to be dereferenced.  The word "software" here means that shadow
      memory manipulation and tag checking on pointer dereference are done in
      software.  As it is the only tag-based implementation right now, "software
      tag-based" KASAN is sometimes referred to as simply "tag-based" in this
      patchset.
      
      A potential expansion of this mode is a hardware tag-based mode, which
      would use hardware memory tagging support (announced by Arm [3]) instead
      of compiler instrumentation and manual shadow memory manipulation.
      
      Same as generic KASAN, software tag-based KASAN is strictly a debugging
      feature.
      
      [1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html
      
      [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
      
      [3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
      
      ====== Rationale
      
      On mobile devices generic KASAN's memory usage is a significant problem.
      One of the main reasons to have tag-based KASAN is to be able to perform a
      similar set of checks as the generic one does, but with lower memory
      requirements.
      
      Comment from Vishwath Mohan <vishwath@google.com>:
      
      I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
      problematic to enable for environments that don't tolerate the increased
      memory pressure well.  This includes
      
      (a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
      (c) Connected components like Pixel's visual core [1].
      
      These are both places I'd love to have a low(er) memory footprint option at
      my disposal.
      
      Comment from Evgenii Stepanov <eugenis@google.com>:
      
      Looking at a live Android device under load, slab (according to
      /proc/meminfo) + kernel stack take 8-10% available RAM (~350MB).  KASAN's
      overhead of 2x - 3x on top of it is not insignificant.
      
      Not having this overhead enables near-production use - ex.  running
      KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
      not reproduce in test configuration.  These are the ones that often cost
      the most engineering time to track down.
      
      CPU overhead is bad, but generally tolerable.  RAM is critical, in our
      experience.  Once it gets low enough, OOM-killer makes your life
      miserable.
      
      [1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/
      
      ====== Technical details
      
      Software tag-based KASAN mode is implemented in a very similar way to the
      generic one. This patchset essentially does the following:
      
      1. TCR_TBI1 is set to enable Top Byte Ignore.
      
      2. Shadow memory is used (with a different scale, 1:16, so each shadow
         byte corresponds to 16 bytes of kernel memory) to store memory tags.
      
      3. All slab objects are aligned to shadow scale, which is 16 bytes.
      
      4. All pointers returned from the slab allocator are tagged with a random
         tag and the corresponding shadow memory is poisoned with the same value.
      
      5. Compiler instrumentation is used to insert tag checks. Either by
         calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
         CONFIG_KASAN_INLINE flags are reused).
      
      6. When a tag mismatch is detected in callback instrumentation mode,
         KASAN simply prints a bug report. In case of inline instrumentation,
         clang inserts a brk instruction, and KASAN has its own brk handler,
         which reports the bug.
      
      7. The memory in between slab objects is marked with a reserved tag, and
         acts as a redzone.
      
      8. When a slab object is freed it's marked with a reserved tag.
      
      Bug detection is imprecise for two reasons:
      
      1. We won't catch some small out-of-bounds accesses that fall into the
         same shadow cell as the last byte of a slab object.
      
      2. We only have 1 byte to store tags, which means we have a 1/256
         probability of a tag match for an incorrect access (actually even
         slightly less due to reserved tag values).
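      The first source of imprecision is plain arithmetic on the 16-byte
      shadow scale.  The sketch below computes how many bytes past the end
      of an object still share its last shadow cell; undetected_slack() is a
      hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define GRANULE 16u /* bytes of kernel memory per shadow byte (1:16 scale) */

/*
 * Number of bytes directly after an object that fall into the same shadow
 * cell as its last byte, and so carry the same memory tag as the object.
 */
static size_t undetected_slack(size_t object_size)
{
    assert(object_size > 0);
    size_t rem = object_size % GRANULE;
    return rem ? GRANULE - rem : 0;
}
```

      For a 24-byte object this gives 8 bytes, so an overflow of up to 8
      bytes would go unnoticed, while an object whose size is a multiple of
      16 has no such slack.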
      
      Despite that, there is a particular type of bug that tag-based KASAN can
      detect and generic KASAN cannot: use-after-free after the object has been
      allocated by someone else.
      
      ====== Testing
      
      Some kernel developers voiced a concern that changing the top byte of
      kernel pointers may lead to subtle bugs that are difficult to discover.
      To address this concern deliberate testing has been performed.
      
      It doesn't seem feasible to do some kind of static checking to find
      potential issues with pointer tagging, so a dynamic approach was taken.
      All pointer comparisons/subtractions have been instrumented in an LLVM
      compiler pass and a kernel module that would print a bug report whenever
      two pointers with different tags are being compared/subtracted (ignoring
      comparisons with NULL pointers and with pointers obtained by casting an
      error code to a pointer type) has been used.  Then the kernel has been
      booted in QEMU and on an Odroid C2 board and syzkaller has been run.
      
      This yielded the following results.
      
      The two places that look interesting are:
      
      is_vmalloc_addr in include/linux/mm.h
      is_kernel_rodata in mm/util.c
      
      Here we compare a pointer with some fixed untagged values to make sure
      that the pointer lies in a particular part of the kernel address space.
      Since tag-based KASAN doesn't add tags to pointers that belong to rodata
      or vmalloc regions, this should work as is.  To make sure, debug checks
      have been added to those two functions that verify the result doesn't
      change whether we operate on pointers with or without untagging.
      
      A few other cases that don't look that interesting:
      
      Comparing pointers to achieve unique sorting order of pointee objects
      (e.g. sorting lock addresses before performing a double lock):
      
      tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
      pipe_double_lock in fs/pipe.c
      unix_state_double_lock in net/unix/af_unix.c
      lock_two_nondirectories in fs/inode.c
      mutex_lock_double in kernel/events/core.c
      
      ep_cmp_ffd in fs/eventpoll.c
      fsnotify_compare_groups in fs/notify/mark.c
      
      Nothing needs to be done here, since the tags embedded into pointers
      don't change, so the sorting order would still be unique.
      
      Checks that a pointer belongs to some particular allocation:
      
      is_sibling_entry in lib/radix-tree.c
      object_is_on_stack in include/linux/sched/task_stack.h
      
      Nothing needs to be done here either, since two pointers can only belong
      to the same allocation if they have the same tag.
      
      Overall, since the kernel boots and works, there are no critical bugs.
      As for the rest, the traditional kernel testing way (use until fails) is
      the only one that looks feasible.
      
      Another point here is that tag-based KASAN is available under a separate
      config option that needs to be deliberately enabled. Even though it might
      be used in a "near-production" environment to find bugs that are not found
      during fuzzing or running tests, it is still a debug tool.
      
      ====== Benchmarks
      
      The following numbers were collected on an Odroid C2 board. Both generic and
      tag-based KASAN were used in inline instrumentation mode.
      
      Boot time [1]:
      * ~1.7 sec for clean kernel
      * ~5.0 sec for generic KASAN
      * ~5.0 sec for tag-based KASAN
      
      Network performance [2]:
      * 8.33 Gbits/sec for clean kernel
      * 3.17 Gbits/sec for generic KASAN
      * 2.85 Gbits/sec for tag-based KASAN
      
      Slab memory usage after boot [3]:
      * ~40 kb for clean kernel
      * ~105 kb (~260% overhead) for generic KASAN
      * ~47 kb (~20% overhead) for tag-based KASAN
      
      KASAN memory overhead consists of three main parts:
      1. Increased slab memory usage due to redzones.
      2. Shadow memory (reserved in full once during boot).
      3. Quarantine (grows gradually until some preset limit; the higher the
         limit, the better the chance of detecting a use-after-free).
      
      Comparing tag-based vs generic KASAN for each of these points:
      1. 20% vs 260% overhead.
      2. 1/16th vs 1/8th of physical memory.
      3. Tag-based KASAN doesn't require quarantine.
      
      [1] Time before the ext4 driver is initialized.
      [2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
      [3] Measured as `cat /proc/meminfo | grep Slab`.
      
      ====== Some notes
      
      A few notes:
      
      1. The patchset can be found here:
         https://github.com/xairy/kasan-prototype/tree/khwasan
      
      2. Building requires a recent Clang version (7.0.0 or later).
      
      3. Stack instrumentation is not supported yet and will be added later.
      
      This patch (of 25):
      
      Tag-based KASAN changes the value of the top byte of pointers returned
      from the kernel allocation functions (such as kmalloc).  This patch
      updates KASAN hooks signatures and their usage in SLAB and SLUB code to
      reflect that.
      
      Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 20 Dec, 2018 1 commit
  7. 28 Nov, 2018 1 commit
    • P
      slab: Replace synchronize_sched() with synchronize_rcu() · 6564a25e
      Authored by Paul E. McKenney
      Now that synchronize_rcu() waits for preempt-disable regions of code
      as well as RCU read-side critical sections, synchronize_sched() can be
      replaced by synchronize_rcu().  This commit therefore makes this change.
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
  8. 27 Oct, 2018 4 commits
    • V
      mm, slab: shorten kmalloc cache names for large sizes · f0d77874
      Authored by Vlastimil Babka
      Kmalloc cache names can get quite long for large object sizes, when the
      sizes are expressed in bytes.  Use 'k' and 'M' prefixes to make the names
      as short as possible e.g.  in /proc/slabinfo.  This works, as we mostly
      use power-of-two sizes, with exceptions only below 1k.
      
      Example: 'kmalloc-4194304' becomes 'kmalloc-4M'
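      The naming rule can be sketched as a small formatter.  This is an
      illustration only: kmalloc_cache_name() is a hypothetical helper, and
      the sketch makes no claim about how the kernel actually generates or
      stores these names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>  /* only needed by callers that compare the result */

/*
 * Write a kmalloc cache name for size bytes into buf, using 'k' or 'M'
 * when the size is an exact multiple of 1024 or 1024 * 1024.
 */
static void kmalloc_cache_name(char *buf, size_t len, unsigned long size)
{
    const unsigned long kib = 1024ul;
    const unsigned long mib = 1024ul * 1024ul;

    assert(buf != NULL && len > 0);
    if (size >= mib && size % mib == 0)
        snprintf(buf, len, "kmalloc-%luM", size / mib);
    else if (size >= kib && size % kib == 0)
        snprintf(buf, len, "kmalloc-%luk", size / kib);
    else
        snprintf(buf, len, "kmalloc-%lu", size);
}
```

      Sizes below 1k, including the non-power-of-two ones such as 96 and
      192, keep their byte count, which matches the remark that exceptions
      only exist below 1k.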
      
      Link: http://lkml.kernel.org/r/20180731090649.16028-7-vbabka@suse.cz
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      mm, slab/slub: introduce kmalloc-reclaimable caches · 1291523f
      Authored by Vlastimil Babka
      Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
      indicates they contain objects which can be reclaimed under memory
      pressure (typically through a shrinker).  This makes the slab pages
      accounted as NR_SLAB_RECLAIMABLE in vmstat, which is also reflected in the
      MemAvailable meminfo counter and in overcommit decisions.  The slab pages
      are also allocated with __GFP_RECLAIMABLE, which is good for
      anti-fragmentation through grouping pages by mobility.
      
      The generic kmalloc-X caches are created without this flag, but sometimes
      are used also for objects that can be reclaimed, which due to varying size
      cannot have a dedicated kmem cache with SLAB_RECLAIM_ACCOUNT flag.  A
      prominent example is dcache external names, which prompted the creation
      of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
      in commit f1782c9b ("dcache: account external names as indirectly
      reclaimable memory").
      
      To better handle this and any other similar cases, this patch introduces
      SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X.
      They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among
      gfp flags.  They are added to the kmalloc_caches array as a new type.
      Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type
      cache.
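      The type selection described above can be sketched as a lookup in a
      two-dimensional table indexed by cache type and size class.  This is a
      toy model: the TOY_GFP_* bits, the four size classes and the
      "dma-kmalloc-*" names are assumptions made for the illustration, and
      toy_kmalloc_type() and toy_pick_cache() are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; the real __GFP_* values are different. */
#define TOY_GFP_DMA         0x1u
#define TOY_GFP_RECLAIMABLE 0x2u

enum toy_cache_type { TOY_NORMAL, TOY_RECLAIM, TOY_DMA, TOY_NR_TYPES };

#define TOY_NR_SIZES 4 /* size classes 32, 64, 128 and 256 bytes */

/* One array for all types: the outer dimension is the cache type. */
static const char *const toy_caches[TOY_NR_TYPES][TOY_NR_SIZES] = {
    [TOY_NORMAL]  = { "kmalloc-32", "kmalloc-64", "kmalloc-128", "kmalloc-256" },
    [TOY_RECLAIM] = { "kmalloc-rcl-32", "kmalloc-rcl-64",
                      "kmalloc-rcl-128", "kmalloc-rcl-256" },
    [TOY_DMA]     = { "dma-kmalloc-32", "dma-kmalloc-64",
                      "dma-kmalloc-128", "dma-kmalloc-256" },
};

/* Map allocation flags to a cache type; DMA wins when both bits are set. */
static enum toy_cache_type toy_kmalloc_type(unsigned int flags)
{
    if (flags & TOY_GFP_DMA)
        return TOY_DMA;
    if (flags & TOY_GFP_RECLAIMABLE)
        return TOY_RECLAIM;
    return TOY_NORMAL;
}

/* Return the cache name serving the request, or NULL if it is too large. */
static const char *toy_pick_cache(unsigned int flags, size_t size)
{
    size_t idx = 0, cap = 32;

    assert(size > 0);
    while (cap < size) {
        cap *= 2;
        idx++;
    }
    if (idx >= TOY_NR_SIZES)
        return NULL;
    return toy_caches[toy_kmalloc_type(flags)][idx];
}
```

      A request with only the reclaimable bit lands in a kmalloc-rcl-* row,
      and a request with both bits lands in the DMA row, as the paragraph
      above states.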
      
      This change only applies to SLAB and SLUB, not SLOB.  This is fine, since
      SLOB targets tiny systems and this patch does add some overhead of
      kmem management objects.
      
      Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      mm, slab: combine kmalloc_caches and kmalloc_dma_caches · cc252eae
      Authored by Vlastimil Babka
      Patch series "kmalloc-reclaimable caches", v4.
      
      As discussed at LSF/MM [1] here's a patchset that introduces
      kmalloc-reclaimable caches (more details in the second patch) and uses
      them for dcache external names.  That allows us to repurpose the
      NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
      
      With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
      caches, eliminating the need for manual accounting.  More importantly, it
      also ensures the reclaimable kmalloc allocations are grouped in pages
      separate from the regular kmalloc allocations.  The need for proper
      accounting of dcache external names has shown it's easy for a misbehaving
      process to allocate lots of them, causing premature OOMs.  Without the
      added grouping, it's likely that a similar workload can interleave the
      dcache external names allocations with regular kmalloc allocations (note:
      I haven't searched myself for an example of such regular kmalloc
      allocation, but I would be very surprised if there wasn't some).  A
      pathological case would be e.g. one 64-byte regular allocation with 63
      external dcache names in a page (64x64=4096), which means the page is not
      freed even after reclaiming all dcache names, and the process can
      thus "steal" the whole page with a single 64-byte allocation.
      
      If other kmalloc users similar to dcache external names become identified,
      they can also benefit from the new functionality simply by adding
      __GFP_RECLAIMABLE to the kmalloc calls.
      
      Side benefits of the patchset (that could be also merged separately)
      include removed branch for detecting __GFP_DMA kmalloc(), and shortening
      kmalloc cache names in /proc/slabinfo output.  The latter is potentially
      an ABI break in case there are tools parsing the names and expecting the
      values to be in bytes.
      
      This is what /proc/slabinfo looks like after booting in virtme:
      
      ...
      kmalloc-rcl-4M         0      0 4194304    1 1024 : tunables    1    1    0 : slabdata      0      0      0
      ...
      kmalloc-rcl-96         7     32    128   32    1 : tunables  120   60    8 : slabdata      1      1      0
      kmalloc-rcl-64        25    128     64   64    1 : tunables  120   60    8 : slabdata      2      2      0
      kmalloc-rcl-32         0      0     32  124    1 : tunables  120   60    8 : slabdata      0      0      0
      kmalloc-4M             0      0 4194304    1 1024 : tunables    1    1    0 : slabdata      0      0      0
      kmalloc-2M             0      0 2097152    1  512 : tunables    1    1    0 : slabdata      0      0      0
      kmalloc-1M             0      0 1048576    1  256 : tunables    1    1    0 : slabdata      0      0      0
      ...
      
      /proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
      
      ...
      nr_slab_reclaimable 2817
      nr_slab_unreclaimable 1781
      ...
      nr_kernel_misc_reclaimable 0
      ...
      
      /proc/meminfo with new KReclaimable counter:
      
      ...
      Shmem:               564 kB
      KReclaimable:      11260 kB
      Slab:              18368 kB
      SReclaimable:      11260 kB
      SUnreclaim:         7108 kB
      KernelStack:        1248 kB
      ...
      
      This patch (of 6):
      
      The kmalloc caches currently maintain a separate (optional) array
      kmalloc_dma_caches for __GFP_DMA allocations.  There are tests for
      __GFP_DMA in the allocation hotpaths.  We can avoid the branches by
      combining kmalloc_caches and kmalloc_dma_caches into a single
      two-dimensional array where the outer dimension is cache "type".  This
      will also allow adding kmalloc-reclaimable caches as a third type.
      
      Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm: don't warn about large allocations for slab · 61448479
      Authored by Dmitry Vyukov
      Slub does not call kmalloc_slab() for sizes > KMALLOC_MAX_CACHE_SIZE,
      instead it falls back to kmalloc_large().
      
      For slab KMALLOC_MAX_CACHE_SIZE == KMALLOC_MAX_SIZE and it calls
      kmalloc_slab() for all allocations relying on NULL return value for
      over-sized allocations.
      
      This inconsistency leads to unwanted warnings from kmalloc_slab() for
      over-sized allocations for slab.  Returning NULL for failed allocations is
      the expected behavior.
      
      Make slub and slab code consistent by checking size >
      KMALLOC_MAX_CACHE_SIZE in slab before calling kmalloc_slab().
      
      While we are here also fix the check in kmalloc_slab().  We should check
      against KMALLOC_MAX_CACHE_SIZE rather than KMALLOC_MAX_SIZE.  It all kinda
      worked because for slab the constants are the same, and slub always checks
      the size against KMALLOC_MAX_CACHE_SIZE before kmalloc_slab().  But if we
      get there with size > KMALLOC_MAX_CACHE_SIZE anyhow bad things will
      happen.  For example, in case of a newly introduced bug in slub code.
      
      Also move the check in kmalloc_slab() from function entry to the size >
      192 case.  This partially compensates for the additional check in slab
      code and makes slub code a bit faster (at least theoretically).
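      The placement of the range check can be sketched as follows.  This is
      a simplified model, not the kernel function: TOY_MAX_CACHE_SIZE is an
      assumed limit, toy_kmalloc_slab_size() is a hypothetical helper that
      returns the capacity of the chosen cache rather than a cache pointer,
      and a return value of 0 stands in for the NULL returned for
      over-sized requests.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CACHE_SIZE 8192u /* assumed stand-in for KMALLOC_MAX_CACHE_SIZE */

/* Return the capacity of the cache serving size bytes, or 0 if over-sized. */
static size_t toy_kmalloc_slab_size(size_t size)
{
    static const size_t small[] = { 8, 16, 32, 64, 96, 128, 192 };

    assert(size > 0); /* zero-size requests are out of scope for this sketch */

    if (size <= 192) {
        /* Small sizes: table lookup, no range check on this path. */
        for (size_t i = 0; i < sizeof(small) / sizeof(small[0]); i++)
            if (size <= small[i])
                return small[i];
    }

    /* The range check sits on the large-size path only. */
    if (size > TOY_MAX_CACHE_SIZE)
        return 0;

    /* Large sizes: round up to the next power of two. */
    size_t cap = 256;
    while (cap < size)
        cap *= 2;
    return cap;
}
```

      Small requests never reach the comparison against the limit, which is
      the saving the paragraph above refers to.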
      
      Also drop __GFP_NOWARN in the warning check.  This warning means a bug in
      slab code itself, user-passed flags have nothing to do with it.
      
      None of this affects slob.
      
      Link: http://lkml.kernel.org/r/20180927171502.226522-1-dvyukov@gmail.com
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Reported-by: syzbot+87829a10073277282ad1@syzkaller.appspotmail.com
      Reported-by: syzbot+ef4e8fc3a06e9019bb40@syzkaller.appspotmail.com
      Reported-by: syzbot+6e438f4036df52cbb863@syzkaller.appspotmail.com
      Reported-by: syzbot+8574471d8734457d98aa@syzkaller.appspotmail.com
      Reported-by: syzbot+af1504df0807a083dbd9@syzkaller.appspotmail.com
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 18 Aug, 2018 1 commit
  10. 29 Jun, 2018 1 commit
    • M
      slub: fix failure when we delete and create a slab cache · d50d82fa
      Authored by Mikulas Patocka
      In kernel 4.17 I removed some code from dm-bufio that did slab cache
      merging (commit 21bb1327: "dm bufio: remove code that merges slab
      caches") - both slab and slub support merging caches with identical
      attributes, so dm-bufio now just calls kmem_cache_create and relies on
      implicit merging.
      
      This uncovered a bug in the slub subsystem - if we delete a cache and
      immediately create another cache with the same attributes, it fails
      because of duplicate filename in /sys/kernel/slab/.  The slub subsystem
      offloads freeing the cache to a workqueue - and if we create the new
      cache before the workqueue runs, it complains because of duplicate
      filename in sysfs.
      
      This patch fixes the bug by moving the call of kobject_del from
      sysfs_slab_remove_workfn to shutdown_cache.  kobject_del must be called
      while we hold slab_mutex - so that the sysfs entry is deleted before a
      cache with the same attributes could be created.
      
      Running device-mapper-test-suite with:
      
        dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/
      
      triggered:
      
        Buffer I/O error on dev dm-0, logical block 1572848, async page read
        device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
        device-mapper: thin: 253:1: aborting current metadata transaction
        sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
        CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
        Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
        Workqueue: dm-thin do_worker [dm_thin_pool]
        Call Trace:
         dump_stack+0x5a/0x73
         sysfs_warn_dup+0x58/0x70
         sysfs_create_dir_ns+0x77/0x80
         kobject_add_internal+0xba/0x2e0
         kobject_init_and_add+0x70/0xb0
         sysfs_slab_add+0xb1/0x250
         __kmem_cache_create+0x116/0x150
         create_cache+0xd9/0x1f0
         kmem_cache_create_usercopy+0x1c1/0x250
         kmem_cache_create+0x18/0x20
         dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
         dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
         __create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
         dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
         metadata_operation_failed+0x59/0x100 [dm_thin_pool]
         alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
         process_cell+0x2a3/0x550 [dm_thin_pool]
         do_worker+0x28d/0x8f0 [dm_thin_pool]
         process_one_work+0x171/0x370
         worker_thread+0x49/0x3f0
         kthread+0xf8/0x130
         ret_from_fork+0x35/0x40
        kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
        kmem_cache_create(dm_bufio_buffer-16) failed with error -17
      
      Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reported-by: Mike Snitzer <snitzer@redhat.com>
      Tested-by: Mike Snitzer <snitzer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 15 Jun, 2018 2 commits
  12. 06 Apr, 2018 12 commits