1. 13 July 2019, 40 commits
    • mm, memcg: add a memcg_slabinfo debugfs file · fcf8a1e4
      Committed by Waiman Long
      There are concerns about memory leaks from extensive use of memory cgroups
      as each memory cgroup creates its own set of kmem caches.  There is a
      possibility that the memcg kmem caches may remain even after the memory
      cgroups have been offlined.  Therefore, it will be useful to show the
      status of each of the memcg kmem caches.
      
      This patch introduces a new <debugfs>/memcg_slabinfo file which is
      somewhat similar to /proc/slabinfo in format, but lists only information
      about kmem caches that have child memcg kmem caches.  Information
      available in /proc/slabinfo is not repeated in memcg_slabinfo.
      
      A portion of a sample output of the file was:
      
        # <name> <css_id[:dead]> <active_objs> <num_objs> <active_slabs> <num_slabs>
        rpc_inode_cache   root          13     51      1      1
        rpc_inode_cache     48           0      0      0      0
        fat_inode_cache   root           1     45      1      1
        fat_inode_cache     41           2     45      1      1
        xfs_inode         root         770    816     24     24
        xfs_inode           92          22     34      1      1
        xfs_inode           88:dead      1     34      1      1
        xfs_inode           89:dead     23     34      1      1
        xfs_inode           85           4     34      1      1
        xfs_inode           84           9     34      1      1
      
      The css id of the memcg is also listed. If a memcg is not online,
      the tag ":dead" will be attached as shown above.
      
      [longman@redhat.com: memcg: add ":deact" tag for reparented kmem caches in memcg_slabinfo]
        Link: http://lkml.kernel.org/r/20190621173005.31514-1-longman@redhat.com
      [longman@redhat.com: set the flag in the common code as suggested by Roman]
        Link: http://lkml.kernel.org/r/20190627184324.5875-1-longman@redhat.com
      Link: http://lkml.kernel.org/r/20190619171621.26209-1-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Suggested-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: reparent memcg kmem_caches on cgroup removal · fb2f2b0a
      Committed by Roman Gushchin
      Let's reparent non-root kmem_caches on memcg offlining.  This allows us to
      release the memory cgroup without waiting for the last outstanding kernel
      object (e.g.  dentry used by another application).
      
      Since the parent cgroup is already charged, all we need to do is splice
      the list of kmem_caches onto the parent's kmem_caches list, swap the
      memcg pointer, drop the css refcount for each kmem_cache and adjust the
      parent's css refcount (see the sketch below).
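
      A simplified sketch of that splice-and-swap step is below.  The list and
      field names (kmem_caches, memcg_params.kmem_caches_node,
      memcg_params.memcg) are illustrative, and locking is reduced to the
      slab_mutex:

        /* Sketch only: move all kmem_caches of @memcg to @parent. */
        static void reparent_memcg_kmem_caches(struct mem_cgroup *memcg,
                                               struct mem_cgroup *parent)
        {
                struct kmem_cache *s;
                int nr_reparented = 0;

                mutex_lock(&slab_mutex);
                list_for_each_entry(s, &memcg->kmem_caches,
                                    memcg_params.kmem_caches_node) {
                        WRITE_ONCE(s->memcg_params.memcg, parent); /* swap the memcg pointer */
                        css_put(&memcg->css);                      /* drop the child's css ref */
                        nr_reparented++;
                }
                if (nr_reparented) {
                        /* splice the whole list and take the parent's css refs in bulk */
                        list_splice_init(&memcg->kmem_caches, &parent->kmem_caches);
                        css_get_many(&parent->css, nr_reparented);
                }
                mutex_unlock(&slab_mutex);
        }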
      
      Please note that kmem_cache->memcg_params.memcg isn't a stable pointer
      anymore.  It's safe to read it under rcu_read_lock(), with cgroup_mutex
      held, or with anything else that protects the memory cgroup from being
      released.
      
      We can race with the slab allocation and deallocation paths.  It's not a
      big problem: parent's charge and slab global stats are always correct, and
      we don't care anymore about the child usage and global stats.  The child
      cgroup is already offline, so we don't use or show it anywhere.
      
      Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't
      used anywhere except count_shadow_nodes().  But even there it won't break
      anything: after reparenting, "nodes" will be 0 at the child level (because
      we're already reparenting shrinker lists), and at the parent level page
      stats have always been 0, which this patch doesn't change.
      
      [guro@fb.com: properly handle kmem_caches reparented to root_mem_cgroup]
        Link: http://lkml.kernel.org/r/20190620213427.1691847-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20190611231813.3148843-11-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages · 4d96ba35
      Committed by Roman Gushchin
      Every slab page charged to a non-root memory cgroup has a pointer to the
      memory cgroup and holds a reference to it, which protects a non-empty
      memory cgroup from being released.  At the same time the page has a
      pointer to the corresponding kmem_cache, and also holds a reference to the
      kmem_cache.  And the kmem_cache itself holds a reference to the cgroup.
      
      So there is clearly some redundancy, which allows us to stop setting the
      page->mem_cgroup pointer and to get the memcg pointer indirectly via the
      kmem_cache instead.  It will also make it easier to change this pointer
      later, without having to walk over all charged pages.
      
      So let's stop setting page->mem_cgroup pointer for slab pages, and stop
      using the css refcounter directly for protecting the memory cgroup from
      going away.  Instead rely on kmem_cache as an intermediate object.
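
      The indirection then boils down to a small helper along these lines (a
      sketch only; helper and field names are illustrative):

        /* Sketch: resolve the memcg of a slab page via its kmem_cache. */
        static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
        {
                struct kmem_cache *s = READ_ONCE(page->slab_cache);

                if (s && !is_root_cache(s))
                        return s->memcg_params.memcg;
                return NULL;
        }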
      
      Make sure that vmstats and shrinker lists keep working as before, as
      well as the /proc/kpagecgroup interface.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-10-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: rework non-root kmem_cache lifecycle management · f0a3a24b
      Committed by Roman Gushchin
      Currently each charged slab page holds a reference to the cgroup to which
      it's charged.  Kmem_caches are held by the memcg and are released all
      together with the memory cgroup.  This means that no kmem_cache is
      released as long as any reference to the memcg exists, which is very
      far from optimal.
      
      Let's rework it in a way that allows releasing individual kmem_caches as
      soon as the cgroup is offline, the kmem_cache is empty and there are no
      pending allocations.
      
      To make it possible, let's introduce a new percpu refcounter for non-root
      kmem caches.  The counter is initialized in percpu mode, and is
      switched to atomic mode during kmem_cache deactivation.  The counter
      is bumped for every charged page and also for every running allocation,
      so the kmem_cache can't be released until all allocations complete.
      
      To shut down inactive empty kmem_caches, let's reuse the work queue
      previously used for kmem_cache deactivation.  Once the reference counter
      reaches 0, an asynchronous kmem_cache release is scheduled.
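
      A sketch of that lifetime scheme using the percpu_ref API is below; the
      memcg_params.refcnt and memcg_params.work field names are illustrative,
      and charged pages / in-flight allocations are expected to bracket their
      work with percpu_ref_get()/percpu_ref_put() on that counter:

        /* Called once the last reference is gone (counter already in atomic mode). */
        static void kmemcg_cache_release(struct percpu_ref *ref)
        {
                struct kmem_cache *s = container_of(ref, struct kmem_cache,
                                                    memcg_params.refcnt);

                /* no pages and no running allocations left: release asynchronously */
                schedule_work(&s->memcg_params.work);
        }

        static int kmemcg_cache_init_ref(struct kmem_cache *s)
        {
                /* starts in percpu mode, holding one implicit reference */
                return percpu_ref_init(&s->memcg_params.refcnt,
                                       kmemcg_cache_release, 0, GFP_KERNEL);
        }

        static void kmemcg_cache_deactivate_ref(struct kmem_cache *s)
        {
                /* switch to atomic mode and drop the initial reference */
                percpu_ref_kill(&s->memcg_params.refcnt);
        }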
      
      * I used the following simple approach to test the performance
      (stolen from another patchset by T. Harding):
      
          time find / -name fname-no-exist
          echo 2 > /proc/sys/vm/drop_caches
          repeat 10 times
      
      Results:
      
              orig		patched
      
      real	0m1.455s	real	0m1.355s
      user	0m0.206s	user	0m0.219s
      sys	0m0.855s	sys	0m0.807s
      
      real	0m1.487s	real	0m1.699s
      user	0m0.221s	user	0m0.256s
      sys	0m0.806s	sys	0m0.948s
      
      real	0m1.515s	real	0m1.505s
      user	0m0.183s	user	0m0.215s
      sys	0m0.876s	sys	0m0.858s
      
      real	0m1.291s	real	0m1.380s
      user	0m0.193s	user	0m0.198s
      sys	0m0.843s	sys	0m0.786s
      
      real	0m1.364s	real	0m1.374s
      user	0m0.180s	user	0m0.182s
      sys	0m0.868s	sys	0m0.806s
      
      real	0m1.352s	real	0m1.312s
      user	0m0.201s	user	0m0.212s
      sys	0m0.820s	sys	0m0.761s
      
      real	0m1.302s	real	0m1.349s
      user	0m0.205s	user	0m0.203s
      sys	0m0.803s	sys	0m0.792s
      
      real	0m1.334s	real	0m1.301s
      user	0m0.194s	user	0m0.201s
      sys	0m0.806s	sys	0m0.779s
      
      real	0m1.426s	real	0m1.434s
      user	0m0.216s	user	0m0.181s
      sys	0m0.824s	sys	0m0.864s
      
      real	0m1.350s	real	0m1.295s
      user	0m0.200s	user	0m0.190s
      sys	0m0.842s	sys	0m0.811s
      
      So it looks like the difference is not noticeable in this test.
      
      [cai@lca.pw: fix an use-after-free in kmemcg_workfn()]
        Link: http://lkml.kernel.org/r/1560977573-10715-1-git-send-email-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190611231813.3148843-9-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: synchronize access to kmem_cache dying flag using a spinlock · 63b02ef7
      Committed by Roman Gushchin
      Currently the memcg_params.dying flag and the corresponding workqueue used
      for the asynchronous deactivation of kmem_caches are synchronized using the
      slab_mutex.
      
      This makes it impossible to check the flag from irq context, which will be
      required in order to implement asynchronous release of kmem_caches.
      
      So let's switch over to the irq-save flavor of the spinlock-based
      synchronization.
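
      In pattern form, the check then looks roughly like the sketch below (the
      lock and field names are illustrative):

        static DEFINE_SPINLOCK(memcg_kmem_wq_lock);

        static bool kmem_cache_is_dying(struct kmem_cache *root_cache)
        {
                unsigned long flags;
                bool dying;

                /* usable from irq context, unlike the sleeping slab_mutex */
                spin_lock_irqsave(&memcg_kmem_wq_lock, flags);
                dying = root_cache->memcg_params.dying;
                spin_unlock_irqrestore(&memcg_kmem_wq_lock, flags);

                return dying;
        }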
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-8-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: don't check the dying flag on kmem_cache creation · 57033297
      Committed by Roman Gushchin
      There is no point in checking the root_cache->memcg_params.dying flag on
      the kmem_cache creation path.  New allocations shouldn't be performed using a
      dead root kmem_cache, so no new memcg kmem_cache creation can be scheduled
      after the flag is set.  And if it was scheduled before,
      flush_memcg_workqueue() will wait for it anyway.
      
      So let's drop this check to simplify the code.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-7-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: unify SLAB and SLUB page accounting · 6cea1d56
      Committed by Roman Gushchin
      Currently the page accounting code is duplicated in SLAB and SLUB
      internals.  Let's move it into new (un)charge_slab_page helpers in the
      slab_common.c file.  These helpers will be responsible for statistics
      (global and memcg-aware) and memcg charging, replacing the direct
      memcg_(un)charge_slab() calls.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-6-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: introduce __memcg_kmem_uncharge_memcg() · 49a18eae
      Committed by Roman Gushchin
      Let's separate the page counter modification code out of
      __memcg_kmem_uncharge(), similar to how __memcg_kmem_charge() and
      __memcg_kmem_charge_memcg() are split.
      
      This will allow this code to be reused later via a new
      memcg_kmem_uncharge_memcg() wrapper, which calls
      __memcg_kmem_uncharge_memcg() if the memcg_kmem_enabled()
      check passes, roughly as sketched below.
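
      A minimal sketch of such a wrapper, assuming the internal helper takes
      the memcg and a page count (the exact signature may differ):

        static inline void memcg_kmem_uncharge_memcg(struct page *page, int order,
                                                     struct mem_cgroup *memcg)
        {
                /*
                 * memcg_kmem_enabled() is a static-key check, so this is
                 * essentially free when kmem accounting is not enabled.
                 */
                if (memcg_kmem_enabled())
                        __memcg_kmem_uncharge_memcg(memcg, 1 << order);
        }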
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-5-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: generalize postponed non-root kmem_cache deactivation · 43486694
      Committed by Roman Gushchin
      Currently SLUB uses a work scheduled after an RCU grace period to
      deactivate a non-root kmem_cache.  This mechanism can be reused for
      kmem_cache release, but requires generalization for the SLAB case.
      
      Introduce a kmemcg_cache_deactivate() function, which calls the
      allocator-specific __kmemcg_cache_deactivate() and schedules execution of
      __kmemcg_cache_deactivate_after_rcu() with all necessary locks in a worker
      context after an rcu grace period.
      
      Here is the new calling scheme:
        kmemcg_cache_deactivate()
          __kmemcg_cache_deactivate()                  SLAB/SLUB-specific
          kmemcg_rcufn()                               rcu
            kmemcg_workfn()                            work
              __kmemcg_cache_deactivate_after_rcu()    SLAB/SLUB-specific
      
      instead of:
        __kmemcg_cache_deactivate()                    SLAB/SLUB-specific
          slab_deactivate_memcg_cache_rcu_sched()      SLUB-only
            kmemcg_rcufn()                             rcu
              kmemcg_workfn()                          work
                kmemcg_cache_deact_after_rcu()         SLUB-only
      
      For consistency, all allocator-specific functions start with "__".
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: rename slab delayed deactivation functions and fields · 0b14e8aa
      Committed by Roman Gushchin
      The delayed work/rcu deactivation infrastructure of non-root kmem_caches
      can also be used for asynchronous release of these objects.  Let's get rid
      of the word "deactivation" in the corresponding names to make the code look
      better after generalization.
      
      It's easier to do the renaming first, so that the generalized code will
      look consistent from the start.
      
      Let's rename struct memcg_cache_params fields:
        deact_fn -> work_fn
        deact_rcu_head -> rcu_head
        deact_work -> work
      
      And RCU/delayed work callbacks in slab common code:
        kmemcg_deactivate_rcufn -> kmemcg_rcufn
        kmemcg_deactivate_workfn -> kmemcg_workfn
      
      This patch contains no functional changes, only renamings.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-3-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache() · c03914b7
      Committed by Roman Gushchin
      Patch series "mm: reparent slab memory on cgroup removal", v7.
      
      # Why do we need this?
      
      We've noticed that the number of dying cgroups is steadily growing on most
      of our hosts in production.  The following investigation revealed an issue
      in the userspace memory reclaim code [1], accounting of kernel stacks [2],
      and also the main reason: slab objects.
      
      The underlying problem is quite simple: any page charged to a cgroup holds
      a reference to it, so the cgroup can't be reclaimed unless all charged
      pages are gone.  If a slab object is actively used by other cgroups, it
      won't be reclaimed, and will prevent the origin cgroup from being
      reclaimed.
      
      Slab objects, and the vfs cache first of all, are shared between cgroups
      that use the same underlying filesystem and, what's even more important,
      between multiple generations of the same workload.  So if something runs
      periodically, each time in a new cgroup (as systemd does), we do
      accumulate multiple dying cgroups.
      
      Strictly speaking, the pagecache isn't different here, but there is a key
      distinction: we disable protection and apply some extra pressure on LRUs of
      dying cgroups, and these LRUs contain all charged pages.  My experiments
      show that with kernel memory accounting disabled the number of dying
      cgroups stabilizes at a relatively small number (~100, depending on memory
      pressure and cgroup creation rate), and with kernel memory accounting it
      grows pretty steadily up to several thousand.
      
      Memory cgroups are quite complex and big objects (mostly due to percpu
      stats), so this leads to noticeable memory losses.  Memory occupied by
      dying cgroups is measured in hundreds of megabytes.  I've even seen a host
      with more than 100Gb of memory wasted on dying cgroups.  This degrades
      performance over uptime, and generally limits the usage of cgroups.
      
      My previous attempt [3] to fix the problem by applying extra pressure on
      slab shrinker lists caused regressions with xfs and ext4, and has been
      reverted [4].  The following attempts to find the right balance [5, 6]
      were not successful.
      
      So instead of trying to find a maybe non-existing balance, let's
      reparent accounted slab caches to the parent cgroup on cgroup removal.
      
      # Implementation approach
      
      There is however a significant problem with reparenting of slab memory:
      there is no list of charged pages.  Some of them are in shrinker lists,
      but not all.  Introducing a new list is really not an option.
      
      But fortunately there is a way forward: every slab page has a stable
      pointer to the corresponding kmem_cache.  So the idea is to reparent
      kmem_caches instead of slab pages.
      
      It's actually simpler and cheaper, but requires some underlying changes:
      1) Make kmem_caches hold a single reference to the memory cgroup,
         instead of a separate reference for every slab page.
      2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
         page->kmem_cache->memcg indirection instead. It's used only on
         slab page release, so performance overhead shouldn't be a big issue.
      3) Introduce a refcounter for non-root slab caches. It's required to
         be able to destroy kmem_caches when they become empty and release
         the associated memory cgroup.
      
      There is a bonus: currently we release all memcg kmem_caches all together
      with the memory cgroup itself.  This patchset allows individual
      kmem_caches to be released as soon as they become inactive and free.
      
      Some additional implementation details are provided in corresponding
      commit messages.
      
      # Results
      
      Below is the average number of dying cgroups on two groups of our
      production hosts.  They run some sort of web frontend workload, and the
      memory pressure is moderate.  As we can see, with kernel memory
      reparenting the number stabilizes in the 60s range; with the original
      version, however, it grows almost linearly and doesn't show any signs of
      plateauing.  The difference in slab and percpu usage between the patched
      and unpatched
      versions also grows linearly.  In 7 days it exceeded 200Mb.
      
      day           0    1    2    3    4    5    6    7
      original     56  362  628  752 1070 1250 1490 1560
      patched      23   46   51   55   60   57   67   69
      mem diff(Mb) 22   74  123  152  164  182  214  241
      
      # Links
      
      [1]: commit 68600f62 ("mm: don't miss the last page because of round-off error")
      [2]: commit 9b6f7e16 ("mm: rework memcg kernel stack accounting")
      [3]: commit 172b06c3 ("mm: slowly shrink slabs with a relatively small number of objects")
      [4]: commit a9a238e8 ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
      [5]: https://lkml.org/lkml/2019/1/28/1865
      [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2
      
      This patch (of 10):
      
      Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
      rather than in init_memcg_params().
      
      Once the kmem_cache holds a reference to the memory cgroup, this will
      simplify the refcounting.
      
      For non-root kmem_caches memcg_link_cache() is always called before the
      kmem_cache becomes visible to a user, so it's safe.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: dump memory.stat during cgroup OOM · c8713d0b
      Committed by Johannes Weiner
      The current cgroup OOM memory info dump doesn't include all the memory
      we are tracking, nor does it give insight into what the VM tried to do
      leading up to the OOM. All that useful info is in memory.stat.
      
      Furthermore, the recursive printing for every child cgroup can
      generate absurd amounts of data on the console for larger cgroup
      trees, and it's not like we provide a per-cgroup breakdown during
      global OOM kills.
      
      When an OOM kill is triggered, print one set of recursive memory.stat
      items at the level whose limit triggered the OOM condition.
      
      Example output:
      
          stress invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
          CPU: 2 PID: 210 Comm: stress Not tainted 5.2.0-rc2-mm1-00247-g47d49835983c #135
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
          Call Trace:
           dump_stack+0x46/0x60
           dump_header+0x4c/0x2d0
           oom_kill_process.cold.10+0xb/0x10
           out_of_memory+0x200/0x270
           ? try_to_free_mem_cgroup_pages+0xdf/0x130
           mem_cgroup_out_of_memory+0xb7/0xc0
           try_charge+0x680/0x6f0
           mem_cgroup_try_charge+0xb5/0x160
           __add_to_page_cache_locked+0xc6/0x300
           ? list_lru_destroy+0x80/0x80
           add_to_page_cache_lru+0x45/0xc0
           pagecache_get_page+0x11b/0x290
           filemap_fault+0x458/0x6d0
           ext4_filemap_fault+0x27/0x36
           __do_fault+0x2f/0xb0
           __handle_mm_fault+0x9c5/0x1140
           ? apic_timer_interrupt+0xa/0x20
           handle_mm_fault+0xc5/0x180
           __do_page_fault+0x1ab/0x440
           ? page_fault+0x8/0x30
           page_fault+0x1e/0x30
          RIP: 0033:0x55c32167fc10
          Code: Bad RIP value.
          RSP: 002b:00007fff1d031c50 EFLAGS: 00010206
          RAX: 000000000dc00000 RBX: 00007fd2db000010 RCX: 00007fd2db000010
          RDX: 0000000000000000 RSI: 0000000010001000 RDI: 0000000000000000
          RBP: 000055c321680a54 R08: 00000000ffffffff R09: 0000000000000000
          R10: 0000000000000022 R11: 0000000000000246 R12: ffffffffffffffff
          R13: 0000000000000002 R14: 0000000000001000 R15: 0000000010000000
          memory: usage 1024kB, limit 1024kB, failcnt 75131
          swap: usage 0kB, limit 9007199254740988kB, failcnt 0
          Memory cgroup stats for /foo:
          anon 0
          file 0
          kernel_stack 36864
          slab 274432
          sock 0
          shmem 0
          file_mapped 0
          file_dirty 0
          file_writeback 0
          anon_thp 0
          inactive_anon 126976
          active_anon 0
          inactive_file 0
          active_file 0
          unevictable 0
          slab_reclaimable 0
          slab_unreclaimable 274432
          pgfault 59466
          pgmajfault 1617
          workingset_refault 2145
          workingset_activate 0
          workingset_nodereclaim 0
          pgrefill 98952
          pgscan 200060
          pgsteal 59340
          pgactivate 40095
          pgdeactivate 96787
          pglazyfree 0
          pglazyfreed 0
          thp_fault_alloc 0
          thp_collapse_alloc 0
          Tasks state (memory values in pages):
          [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
          [    200]     0   200     1121      884    53248       29             0 bash
          [    209]     0   209      905      246    45056       19             0 stress
          [    210]     0   210    66442       56   499712    56349             0 stress
          oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),oom_memcg=/foo,task_memcg=/foo,task=stress,pid=210,uid=0
          Memory cgroup out of memory: Killed process 210 (stress) total-vm:265768kB, anon-rss:0kB, file-rss:224kB, shmem-rss:0kB
          oom_reaper: reaped process 210 (stress), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      [hannes@cmpxchg.org: s/kvmalloc/kmalloc/ per Michal]
        Link: http://lkml.kernel.org/r/20190605161133.GA12453@cmpxchg.org
      Link: http://lkml.kernel.org/r/20190604210509.9744-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: introduce memory.events.local · 1e577f97
      Committed by Shakeel Butt
      The memory controller in cgroup v2 exposes memory.events file for each
      memcg which shows the number of times events like low, high, max, oom
      and oom_kill have happened for the whole tree rooted at that memcg.
      Users can also poll or register notification to monitor the changes in
      that file.  Any event at any level of the tree rooted at memcg will
      notify all the listeners along the path till root_mem_cgroup.  There are
      existing users which depend on this behavior.
      
      However there are users which are only interested in the events
      happening at a specific level of the memcg tree and not in the events in
      the underlying tree rooted at that memcg.  One such use-case is a
      centralized resource monitor which can dynamically adjust the limits of
      the jobs running on a system.  The jobs can create their sub-hierarchy
      for their own sub-tasks.  The centralized monitor is only interested in
      the events at the top level memcgs of the jobs as it can then act and
      adjust the limits of the jobs.  Using the current memory.events for such a
      centralized monitor is very inconvenient.  The monitor will keep
      receiving events it is not interested in, and to find out whether a
      received event is interesting, it has to read the memory.events files of
      the next level and compare them with the top level one.  So, let's
      introduce memory.events.local, which shows and notifies about the events
      at that memcg's own level.
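
      A minimal userspace sketch of such a monitor is below.  The cgroup path is
      only an example; event files maintained via kernfs wake pollers with
      POLLPRI/POLLERR when their contents change:

        #include <fcntl.h>
        #include <poll.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                const char *path = "/sys/fs/cgroup/job1/memory.events.local";
                char buf[4096];
                struct pollfd pfd;
                ssize_t n;

                pfd.fd = open(path, O_RDONLY);
                if (pfd.fd < 0) {
                        perror("open");
                        return 1;
                }
                pfd.events = POLLPRI;

                for (;;) {
                        if (poll(&pfd, 1, -1) < 0) {
                                perror("poll");
                                return 1;
                        }
                        /* re-read the counters after each notification */
                        lseek(pfd.fd, 0, SEEK_SET);
                        n = read(pfd.fd, buf, sizeof(buf) - 1);
                        if (n > 0) {
                                buf[n] = '\0';
                                fputs(buf, stdout); /* low/high/max/oom/oom_kill lines */
                        }
                }
        }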
      
      Now, do memory.stat and memory.pressure need their local versions?  IMHO
      no, due to the no-internal-process constraint of cgroup v2.  The
      memory.stat file of the top level memcg of a job shows the stats and
      vmevents of the whole tree.  The local stats or vmevents of the top level
      memcg will only change if there is a process running in that memcg, but v2
      does not allow that.  Similarly, for memory.pressure there will not be any
      processes in the internal nodes and thus no chance of local pressure.
      
      Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, oom: no oom-kill for __GFP_RETRY_MAYFAIL · 38d38493
      Committed by Shakeel Butt
      The documentation of __GFP_RETRY_MAYFAIL clearly states that the OOM
      killer will not be triggered, and indeed the page allocator does not invoke
      the OOM killer for such allocations.  However, we do trigger the memcg OOM
      killer for __GFP_RETRY_MAYFAIL.  Fix that.  This flag will be used later to
      avoid triggering the oom-killer in the charging path for fanotify and
      inotify event allocations.
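
      For illustration, an accounted allocation of that kind would look roughly
      like the sketch below (the cache pointer is a placeholder, not an existing
      fanotify/inotify symbol):

        static void *alloc_notification_event(struct kmem_cache *event_cachep)
        {
                /*
                 * May fail under memcg pressure, but with __GFP_RETRY_MAYFAIL
                 * the charge path no longer invokes the memcg OOM killer;
                 * the caller simply handles the NULL return.
                 */
                return kmem_cache_alloc(event_cachep,
                                        GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
        }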
      
      Link: http://lkml.kernel.org/r/20190514212259.156585-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mincore.c: fix race between swapoff and mincore · aeb309b8
      Committed by Huang Ying
      Since commit 4b3ef9da ("mm/swap: split swap cache into 64MB trunks"),
      the address_space associated with the swap device is freed after swapoff.
      So swap_address_space() users which touch the address_space need some kind
      of mechanism to prevent the address_space from being freed while it is
      being accessed.
      
      When mincore processes an unmapped range for swapped shmem pages, it
      doesn't hold any lock to prevent the swap device from being swapped off.  So
      the following race is possible:
      
      CPU1					CPU2
      do_mincore()				swapoff()
        walk_page_range()
          mincore_unmapped_range()
            __mincore_unmapped_range
              mincore_page
      	  as = swap_address_space()
                ...				  exit_swap_address_space()
                ...				    kvfree(spaces)
      	  find_get_page(as)
      
      The address space may be accessed after being freed.
      
      To fix the race, get_swap_device()/put_swap_device() is used to enclose
      find_get_page() to check whether the swap entry is valid and to prevent the
      swap device from being swapped off while it is being accessed.
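
      The fixed lookup follows the pattern sketched below (a simplified
      illustration of the swap-cache check, not the exact mincore_page() code):

        static unsigned char swap_cache_page_present(swp_entry_t entry)
        {
                struct swap_info_struct *si;
                struct page *page = NULL;
                unsigned char present = 0;

                si = get_swap_device(entry);    /* blocks a concurrent swapoff */
                if (si) {
                        page = find_get_page(swap_address_space(entry),
                                             swp_offset(entry));
                        put_swap_device(si);
                }
                if (page) {
                        present = PageUptodate(page);
                        put_page(page);
                }
                return present;
        }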
      
      Link: http://lkml.kernel.org/r/20190611020510.28251-1-ying.huang@intel.com
      Fixes: 4b3ef9da ("mm/swap: split swap cache into 64MB trunks")
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, swap: use rbtree for swap_extent · 4efaceb1
      Committed by Aaron Lu
      swap_extent is used to map swap page offset to backing device's block
      offset.  For a continuous block range, one swap_extent is used and all
      these swap_extents are managed in a linked list.
      
      These swap_extents are used by map_swap_entry() during swap's read and
      write path.  To find out the backing device's block offset for a page
      offset, the swap_extent list will be traversed linearly, with
      curr_swap_extent being used as a cache to speed up the search.
      
      This works well as long as there are not many swap_extents or only a few
      processes access the swap device, but when the swap device has many
      extents and a number of processes access it concurrently, it can become a
      problem.  On one of our servers, the disk's remaining space is tight:
      
        $df -h
        Filesystem      Size  Used Avail Use% Mounted on
        ... ...
        /dev/nvme0n1p1  1.8T  1.3T  504G  72% /home/t4
      
      When creating an 80G swapfile there, there are as many as 84656 swap
      extents.  The end result is that the kernel spends about 30% of its time in
      map_swap_entry() and swap throughput is only 70MB/s.
      
      As a comparison, when I used a smaller swapfile, like 4G, the number of
      swap_extents dropped to 2000, swap throughput was back to 400-500MB/s and
      map_swap_entry() was about 3%.
      
      One downside of using an rbtree for swap_extent is that 'struct rb_node'
      takes 24 bytes while 'struct list_head' takes 16 bytes, that's 8 bytes more
      for each swap_extent.  For a swapfile that has 80k swap_extents, that
      means 625KiB more memory consumed.
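
      With the extents kept in an rbtree, the per-offset lookup becomes a plain
      binary search; a sketch (start_page and nr_pages follow the existing
      swap_extent layout, the tree-root and node member names are illustrative):

        static struct swap_extent *lookup_swap_extent(struct rb_root *root,
                                                      pgoff_t offset)
        {
                struct rb_node *rb = root->rb_node;
                struct swap_extent *se;

                while (rb) {
                        se = rb_entry(rb, struct swap_extent, rb_node);
                        if (offset < se->start_page)
                                rb = rb->rb_left;
                        else if (offset >= se->start_page + se->nr_pages)
                                rb = rb->rb_right;
                        else
                                return se;      /* offset falls inside this extent */
                }
                return NULL;
        }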
      
      Test:
      
      Since it's not possible to reboot that server, I could not test this patch
      directly there.  Instead, I tested it on another server with an NVMe disk.
      
      I created a 20G swapfile on an NVMe backed XFS fs.  By default, the
      filesystem is quite clean and the created swapfile has only 2 extents.
      Testing vanilla and this patch shows no obvious performance difference
      when swapfile is not fragmented.
      
      To see the patch's effects, I used some tweaks to manually fragment the
      swapfile by breaking the extent at 1M boundary.  This made the swapfile
      have 20K extents.
      
        nr_task=4
        kernel   swapout(KB/s) map_swap_entry(perf)  swapin(KB/s) map_swap_entry(perf)
        vanilla  165191           90.77%             171798          90.21%
        patched  858993 +420%      2.16%             715827 +317%     0.77%
      
        nr_task=8
        kernel   swapout(KB/s) map_swap_entry(perf)  swapin(KB/s) map_swap_entry(perf)
        vanilla  306783           92.19%             318145          87.76%
        patched  954437 +211%      2.35%            1073741 +237%     1.57%
      
      swapout: the throughput of swap out, in KB/s, higher is better
      1st map_swap_entry: cpu cycles percent sampled by perf
      swapin: the throughput of swap in, in KB/s, higher is better
      2nd map_swap_entry: cpu cycles percent sampled by perf
      
      nr_task=1 doesn't show any difference; this is because curr_swap_extent
      can effectively cache the correct swap extent for a single-task workload.
      
      [akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/]
      Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu
      Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swap_state.c: simplify total_swapcache_pages() with get_swap_device() · 054f1d1f
      Committed by Huang Ying
      total_swapcache_pages() may race with swapper_spaces[] allocation and
      freeing.  Previously, this was protected with a swapper_spaces[]-specific
      RCU mechanism.  To simplify the logic and reduce code complexity, it is
      replaced with get/put_swap_device().  The number of code lines is reduced
      too.  Although not so important, swapoff() performance improves as well
      because one synchronize_rcu() call during swapoff() is deleted.
      
      [ying.huang@intel.com: fix bad swap file entry warning]
        Link: http://lkml.kernel.org/r/20190531024102.21723-1-ying.huang@intel.com
      Link: http://lkml.kernel.org/r/20190527082714.12151-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, swap: fix race between swapoff and some swap operations · eb085574
      Committed by Huang Ying
      When swapin is performed, after getting the swap entry information from
      the page table, the system will swap in the swap entry without any lock
      held to prevent the swap device from being swapped off.  This may cause a
      race like the one below:
      
      CPU 1				CPU 2
      -----				-----
      				do_swap_page
      				  swapin_readahead
      				    __read_swap_cache_async
      swapoff				      swapcache_prepare
        p->swap_map = NULL		        __swap_duplicate
      					  p->swap_map[?] /* !!! NULL pointer access */
      
      Because swapoff is usually only done at system shutdown, the race may
      not hit many people in practice.  But it is still a race that needs to be fixed.
      
      To fix the race, get_swap_device() is added to check whether the specified
      swap entry is valid in its swap device.  If so, it will keep the swap
      entry valid by preventing the swap device from being swapped off, until
      put_swap_device() is called.
      
      Because swapoff() is a very rare code path, to make the normal path run as
      fast as possible, rcu_read_lock/unlock() and synchronize_rcu() are used to
      implement get/put_swap_device() instead of a reference count.  From
      get_swap_device() to put_swap_device(), the RCU read side is locked, so
      synchronize_rcu() in swapoff() will wait until put_swap_device() is
      called.
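
      The scheme then follows the usual RCU pattern, roughly as sketched below
      (simplified: the real get_swap_device() also validates the swap entry and
      uses a SWP_VALID flag cleared by swapoff()):

        struct swap_info_struct *get_swap_device_sketch(swp_entry_t entry)
        {
                struct swap_info_struct *si = swp_swap_info(entry);

                rcu_read_lock();
                if (!si || !(si->flags & SWP_VALID)) {
                        rcu_read_unlock();
                        return NULL;
                }
                return si;              /* RCU read side stays locked */
        }

        void put_swap_device_sketch(struct swap_info_struct *si)
        {
                rcu_read_unlock();
        }

        /*
         * swapoff() side, after clearing SWP_VALID:
         *   synchronize_rcu();   -- waits for all get/put pairs in flight
         *   ... then free swap_map, cluster_info, swap cache address_space ...
         */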
      
      In addition to the swap_map, cluster_info, etc. data structures in struct
      swap_info_struct, the swap cache radix tree will be freed after swapoff,
      so this patch fixes the race between swap cache lookup and swapoff too.
      
      Races between some other swap cache usages and swapoff are fixed too by
      calling synchronize_rcu() between clearing PageSwapCache() and freeing the
      swap cache data structure.
      
      Another possible method to fix this is to use preempt_off() +
      stop_machine() to prevent the swap device from being swapped off while its
      data structures are being accessed.  The overhead in the hot path of both
      methods is similar.  The advantages of the RCU-based method are:
      
      1. stop_machine() may disturb the normal execution code path on other
         CPUs.
      
      2. File cache uses RCU to protect its radix tree.  If the similar
         mechanism is used for swap cache too, it is easier to share code
         between them.
      
      3. RCU is used to protect swap cache in total_swapcache_pages() and
         exit_swap_address_space() already.  The two mechanisms can be
         merged to simplify the logic.
      
      Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
      Fixes: 235b6217 ("mm/swap: add cluster lock")
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NAndrea Parri <andrea.parri@amarulasolutions.com>
      Not-nacked-by: NHugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/filemap.c: correct the comment about VM_FAULT_RETRY · a4985833
      Committed by Yang Shi
      Commit 6b4c9f44 ("filemap: drop the mmap_sem for all blocking
      operations") changed when mmap_sem is dropped during filemap page fault
      and when returning VM_FAULT_RETRY.
      
      Correct the comment to reflect the change.
      
      Link: http://lkml.kernel.org/r/1556234531-108228-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/filemap: don't cast ->readpage to filler_t for do_read_cache_page · 6c45b454
      Committed by Christoph Hellwig
      We can just pass a NULL filler and do the right thing inside of
      do_read_cache_page based on the NULL parameter.
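
      In other words, the dispatch inside do_read_cache_page() becomes roughly
      the following (a sketch of the idea, not the exact committed hunk):

        /* filler == NULL means "call the mapping's ->readpage directly". */
        static int read_one_page(struct address_space *mapping, filler_t *filler,
                                 void *data, struct page *page)
        {
                if (filler)
                        return filler(data, page);
                return mapping->a_ops->readpage(data, page);
        }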
      
      Link: http://lkml.kernel.org/r/20190520055731.24538-3-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/filemap.c: fix an overly long line in read_cache_page · d322a8e5
      Committed by Christoph Hellwig
      Patch series "fix filler_t callback type mismatches", v2.
      
      Casting mapping->a_ops->readpage to filler_t causes an indirect call
      type mismatch with Control-Flow Integrity checking.  This change fixes
      the mismatch in read_cache_page_gfp and read_mapping_page by using a
      NULL filler argument as an indication to call ->readpage directly, and
      by passing callbacks with the right parameters in nfs and jffs2.
      
      This patch (of 4):
      
      Code cleanup.
      
      Link: http://lkml.kernel.org/r/20190520055731.24538-2-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, debug_pagealloc: use a page type instead of page_ext flag · 3972f6bb
      Committed by Vlastimil Babka
      When debug_pagealloc is enabled, we currently allocate the page_ext
      array to mark guard pages with the PAGE_EXT_DEBUG_GUARD flag.  Now that
      we have the page_type field in struct page, we can use that instead, as
      guard pages are neither PageSlab nor mapped to userspace.  This reduces
      memory overhead when debug_pagealloc is enabled and there are no other
      features requiring the page_ext array.
      
      Link: http://lkml.kernel.org/r/20190603143451.27353-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: more extensive free page checking with debug_pagealloc · 4462b32c
      Committed by Vlastimil Babka
      The page allocator checks struct pages for expected state (mapcount,
      flags etc) as pages are being allocated (check_new_page()) and freed
      (free_pages_check()) to provide some defense against errors in page
      allocator users.
      
      Prior to commits 479f854a ("mm, page_alloc: defer debugging checks of
      pages allocated from the PCP") and 4db7548c ("mm, page_alloc: defer
      debugging checks of freed pages until a PCP drain"), this happened
      for order-0 pages as they were allocated from or freed to the per-cpu
      caches (pcplists).  Since those are fast paths, the checks are now
      performed only when pages are moved between pcplists and global free
      lists.  This however lowers the chances of catching errors soon enough.
      
      In order to increase the chances of the checks to catch errors, the
      kernel has to be rebuilt with CONFIG_DEBUG_VM, which also enables
      multiple other internal debug checks (VM_BUG_ON() etc), which is
      suboptimal when the goal is to catch errors in mm users, not in mm code
      itself.
      
      To catch some wrong users of the page allocator we have
      CONFIG_DEBUG_PAGEALLOC, which is designed to have virtually no overhead
      unless enabled at boot time.  Memory corruption from writing to freed
      pages often has the same underlying causes (use-after-free, double free)
      as corruption of the corresponding struct pages, so this existing debugging
      functionality is a good fit to extend by also performing struct page checks
      at least as often as if CONFIG_DEBUG_VM were enabled.
      
      Specifically, after this patch, when debug_pagealloc is enabled on boot,
      and CONFIG_DEBUG_VM disabled, pages are checked when allocated from or
      freed to the pcplists *in addition* to being moved between pcplists and
      free lists.  When both debug_pagealloc and CONFIG_DEBUG_VM are enabled,
      pages are checked when being moved between pcplists and free lists *in
      addition* to when allocated from or freed to the pcplists.
      
      When debug_pagealloc is not enabled on boot, the overhead in fast paths
      should be virtually none thanks to the use of a static key.
      
      Link: http://lkml.kernel.org/r/20190603143451.27353-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, debug_pagelloc: use static keys to enable debugging · 96a2b03f
      Committed by Vlastimil Babka
      Patch series "debug_pagealloc improvements".
      
      I have been recently debugging some pcplist corruptions, where it would be
      useful to perform struct page checks immediately as pages are allocated
      from and freed to pcplists, which is now only possible by rebuilding the
      kernel with CONFIG_DEBUG_VM (details in Patch 2 changelog).
      
      To make this kind of debugging simpler in future on a distro kernel, I
      have improved CONFIG_DEBUG_PAGEALLOC so that it has even smaller overhead
      when not enabled at boot time (Patch 1) and also when enabled (Patch 3),
      and extended it to perform the struct page checks more often when enabled
      (Patch 2).  Now it can be configured in when building a distro kernel
      without extra overhead, and debugging page use after free or double free
      can be enabled simply by rebooting with debug_pagealloc=on.
      
      This patch (of 3):
      
      CONFIG_DEBUG_PAGEALLOC has been redesigned by 031bc574
      ("mm/debug-pagealloc: make debug-pagealloc boottime configurable") to
      allow being always enabled in a distro kernel, but only perform its
      expensive functionality when booted with debug_pagealloc=on.  We can
      further reduce the overhead when not boot-enabled (including page
      allocator fast paths) using static keys.  This patch introduces one for
      debug_pagealloc core functionality, and another for the optional guard
      page functionality (enabled by booting with debug_guardpage_minorder=X).
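
      The static-key pattern involved looks roughly like the sketch below; the
      key name follows the convention described above, while the boot-option
      plumbing is simplified and illustrative:

        #include <linux/init.h>
        #include <linux/jump_label.h>
        #include <linux/string.h>

        DEFINE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);
        static bool _debug_pagealloc_enabled_early __initdata;

        static inline bool debug_pagealloc_enabled(void)
        {
                /* compiles to a patched-out branch when not enabled at boot */
                return static_branch_unlikely(&_debug_pagealloc_enabled);
        }

        static int __init set_debug_pagealloc(char *str)
        {
                /* record the request; static keys cannot be flipped this early */
                _debug_pagealloc_enabled_early = str && !strcmp(str, "on");
                return 0;
        }
        early_param("debug_pagealloc", set_debug_pagealloc);

        static int __init init_debug_pagealloc(void)
        {
                if (_debug_pagealloc_enabled_early)
                        static_branch_enable(&_debug_pagealloc_enabled);
                return 0;
        }
        early_initcall(init_debug_pagealloc);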
      
      Link: http://lkml.kernel.org/r/20190603143451.27353-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/failslab.c: by default, do not fail allocations with direct reclaim only · a9659476
      Committed by Nicolas Boichat
      When failslab was originally written, the intention of the
      "ignore-gfp-wait" flag default value ("N") was to fail GFP_ATOMIC
      allocations.  Those were defined as (__GFP_HIGH), and the code would test
      for __GFP_WAIT (0x10u).
      
      However, since then, __GFP_WAIT was replaced by __GFP_RECLAIM
      (___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM), and GFP_ATOMIC is now
      defined as (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM).
      
      This means that when the flag is false, almost no allocation ever fails
      (as even GFP_ATOMIC allocations contain ___GFP_KSWAPD_RECLAIM).
      
      Restore the original intent of the code, by ignoring calls that directly
      reclaim only (__GFP_DIRECT_RECLAIM), and thus, failing GFP_ATOMIC calls
      again by default.
      
      Link: http://lkml.kernel.org/r/20190520214514.81360-1-drinkcat@chromium.org
      Fixes: 71baba4b ("mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM")
      Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>
      Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm: remove the exporting of totalram_pages · 98ef2046
      Denis Efremov committed
      Previously, totalram_pages was a global variable.  It is now a static
      inline function defined in include/linux/mm.h.  However, the function
      is still marked as EXPORT_SYMBOL, which is at best an odd combination.
      Because there is no point in exporting a static inline function from a
      public header, this commit removes the EXPORT_SYMBOL() marking.  It is
      still possible to use the function in modules because all the symbols
      it depends on are exported.
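
      For reference, a sketch of the accessor shape described above (assumed
      to match the conversion referenced in the Fixes tag):

        extern atomic_long_t _totalram_pages;

        static inline unsigned long totalram_pages(void)
        {
                /* Just an atomic read of an already-exported counter, so the
                 * EXPORT_SYMBOL() on the inline wrapper bought nothing. */
                return (unsigned long)atomic_long_read(&_totalram_pages);
        }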
      
      Link: http://lkml.kernel.org/r/20190710141031.15642-1-efremov@linux.com
      Fixes: ca79b0c2 ("mm: convert totalram_pages and totalhigh_pages variables to atomic")
      Signed-off-by: NDenis Efremov <efremov@linux.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98ef2046
    • C
      mm: remove the account_page_dirtied export · ac1c3e49
      Christoph Hellwig committed
      account_page_dirtied() is only used by our set_page_dirty() helpers and
      should not be used anywhere else.
      
      Link: http://lkml.kernel.org/r/20190605183702.30572-1-hch@lst.de
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac1c3e49
    • M
      mm/kasan: add object validation in ksize() · 0d4ca4c9
      Marco Elver committed
      ksize() has been unconditionally unpoisoning the whole shadow memory
      region associated with an allocation.  This can lead to various undetected
      bugs, for example, double-kzfree().
      
      Specifically, kzfree() uses ksize() to determine the actual allocation
      size, and subsequently zeroes the memory.  Since ksize() used to just
      unpoison the whole shadow memory region, no invalid free was detected.
      
      This patch addresses this as follows:
      
      1. Add a validity check to ksize(), and only unpoison the memory
         region when the check passes.
      
      2. Preserve kasan_unpoison_slab() semantics by explicitly unpoisoning
         the shadow memory region using the size obtained from __ksize().
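
      A sketch of the resulting ksize() (simplified; the exact helper calls
      are assumptions based on the series):

        size_t ksize(const void *objp)
        {
                size_t size;

                /* Check validity first; a freed or bogus pointer trips the
                 * KASAN access check and we return 0 without unpoisoning. */
                if (WARN_ON_ONCE(!objp || !__kasan_check_read(objp, 1)))
                        return 0;

                size = __ksize(objp);
                /* Unpoison only after the object is known to be valid. */
                kasan_unpoison_shadow((void *)objp, size);
                return size;
        }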
      
      Tested:
      1. With SLAB allocator: a) normal boot without warnings; b) verified the
         added double-kzfree() is detected.
      2. With SLUB allocator: a) normal boot without warnings; b) verified the
         added double-kzfree() is detected.
      
      [elver@google.com: s/BUG_ON/WARN_ON_ONCE/, per Kees]
        Link: http://lkml.kernel.org/r/20190627094445.216365-6-elver@google.com
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199359
      Link: http://lkml.kernel.org/r/20190626142014.141844-6-elver@google.com
      Signed-off-by: NMarco Elver <elver@google.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0d4ca4c9
    • M
      mm/slab: refactor common ksize KASAN logic into slab_common.c · 10d1f8cb
      Marco Elver committed
      This refactors common code of ksize() between the various allocators into
      slab_common.c: __ksize() is the allocator-specific implementation without
      instrumentation, whereas ksize() includes the required KASAN logic.
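
      In prototype form, the split looks roughly like this (a sketch, not
      the exact declarations):

        /* mm/slab.c, mm/slub.c, mm/slob.c each provide the raw lookup: */
        size_t __ksize(const void *objp);

        /* mm/slab_common.c provides the public, instrumented entry point: */
        size_t ksize(const void *objp);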
      
      Link: http://lkml.kernel.org/r/20190626142014.141844-5-elver@google.com
      Signed-off-by: NMarco Elver <elver@google.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10d1f8cb
    • M
      mm/kasan: change kasan_check_{read,write} to return boolean · b5f6e0fc
      Marco Elver committed
      This changes the {,__}kasan_check_{read,write} functions to return a
      boolean denoting whether the access was valid.
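
      A minimal sketch of the new contract (the internal check_memory_region()
      call is an assumption about the implementation):

        bool __kasan_check_read(const volatile void *p, unsigned int size)
        {
                /* True when the access is valid; callers such as ksize() can
                 * now bail out instead of touching invalid memory. */
                return check_memory_region((unsigned long)p, size, false,
                                           _RET_IP_);
        }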
      
      [sfr@canb.auug.org.au: include types.h for "bool"]
        Link: http://lkml.kernel.org/r/20190705184949.13cdd021@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20190626142014.141844-3-elver@google.com
      Signed-off-by: NMarco Elver <elver@google.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b5f6e0fc
    • M
      mm/kasan: introduce __kasan_check_{read,write} · 7d8ad890
      Marco Elver committed
      Patch series "mm/kasan: Add object validation in ksize()", v3.
      
      This patch (of 5):
      
      This introduces __kasan_check_{read,write}.  __kasan_check functions may
      be used from anywhere, even compilation units that disable instrumentation
      selectively.
      
      This change eliminates the need for the __KASAN_INTERNAL definition.
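
      A sketch of the declarations (shown with the boolean return type
      introduced by the follow-up patch above; the kbuild example is an
      assumption):

        /*
         * Always declared when KASAN is enabled, so even compilation units
         * built with instrumentation disabled (e.g. KASAN_SANITIZE_foo.o := n
         * in a Makefile) can perform an explicit check.
         */
        bool __kasan_check_read(const volatile void *p, unsigned int size);
        bool __kasan_check_write(const volatile void *p, unsigned int size);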
      
      [elver@google.com: v5]
        Link: http://lkml.kernel.org/r/20190708170706.174189-2-elver@google.com
      Link: http://lkml.kernel.org/r/20190626142014.141844-2-elver@google.com
      Signed-off-by: NMarco Elver <elver@google.com>
      Acked-by: NMark Rutland <mark.rutland@arm.com>
      Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d8ad890
    • M
      mm/kasan: print frame description for stack bugs · e8969219
      Marco Elver committed
      This adds support for printing stack frame description on invalid stack
      accesses.  The frame description is embedded by the compiler, which is
      parsed and then pretty-printed.
      
      Currently, we can only print the stack frame info for accesses to the
      task's own stack, but not accesses to other tasks' stacks.
      
      Example of what it looks like:
      
        page dumped because: kasan: bad access detected
      
        addr ffff8880673ef98a is located in stack of task insmod/2008 at offset 106 in frame:
         kasan_stack_oob+0x0/0xf5 [test_kasan]
      
        this frame has 2 objects:
         [32, 36) 'i'
         [96, 106) 'stack_array'
      
        Memory state around the buggy address:
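
      A hypothetical reproducer in the spirit of the report above (names and
      layout are illustrative, not copied from lib/test_kasan.c):

        static noinline void stack_oob_write(void)
        {
                char stack_array[10];
                volatile int i = 0;

                /* One-past-the-end write within the current frame; the new
                 * report format prints the frame's object layout, e.g.
                 * "[96, 106) 'stack_array'". */
                stack_array[ARRAY_SIZE(stack_array) + i] = 'x';
        }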
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=198435
      Link: http://lkml.kernel.org/r/20190522100048.146841-1-elver@google.com
      Signed-off-by: NMarco Elver <elver@google.com>
      Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8969219
    • A
      mm/kmemleak.c: change error at _write when kmemleak is disabled · 4e4dfce2
      André Almeida committed
      According to POSIX, EBUSY means that the "device or resource is busy",
      which can lead people to think that the file
      `/sys/kernel/debug/kmemleak/` is somehow locked or being used by
      another process.  Change this error code to a more appropriate one.
      
      Link: http://lkml.kernel.org/r/20190612155231.19448-1-andrealmeid@collabora.com
      Signed-off-by: NAndré Almeida <andrealmeid@collabora.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e4dfce2
    • D
      mm/kmemleak.c: fix check for softirq context · 6ef90569
      Dmitry Vyukov committed
      in_softirq() is the wrong predicate to check whether we are in a
      softirq context.  It also returns true when BH is disabled, so objects
      are falsely stamped with the "softirq" comm.  The correct predicate is
      in_serving_softirq().
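
      A sketch of the affected classification in kmemleak's create_object()
      (simplified; surrounding details are assumptions):

        if (in_irq()) {
                object->pid = 0;
                strncpy(object->comm, "hardirq", sizeof(object->comm));
        } else if (in_serving_softirq()) {      /* was: in_softirq() */
                object->pid = 0;
                strncpy(object->comm, "softirq", sizeof(object->comm));
        } else {
                object->pid = current->pid;
                strncpy(object->comm, current->comm, sizeof(object->comm));
        }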
      
      If a user cats /sys/kernel/debug/kmemleak, previously they would see
      the following, which is clearly wrong because this is system call
      context (see the comm):
      
      unreferenced object 0xffff88805bd661c0 (size 64):
        comm "softirq", pid 0, jiffies 4294942959 (age 12.400s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00  ................
          00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
        backtrace:
          [<0000000007dcb30c>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
          [<0000000007dcb30c>] slab_post_alloc_hook mm/slab.h:439 [inline]
          [<0000000007dcb30c>] slab_alloc mm/slab.c:3326 [inline]
          [<0000000007dcb30c>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
          [<00000000969722b7>] kmalloc include/linux/slab.h:547 [inline]
          [<00000000969722b7>] kzalloc include/linux/slab.h:742 [inline]
          [<00000000969722b7>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline]
          [<00000000969722b7>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085
          [<00000000a4134b5f>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475
          [<00000000d20248ad>] do_ip_setsockopt.isra.0+0x19fe/0x1c00 net/ipv4/ip_sockglue.c:957
          [<000000003d367be7>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246
          [<000000003c7c76af>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
          [<000000000c1aeb23>] sock_common_setsockopt+0x3e/0x50 net/core/sock.c:3130
          [<000000000157b92b>] __sys_setsockopt+0x9e/0x120 net/socket.c:2078
          [<00000000a9f3d058>] __do_sys_setsockopt net/socket.c:2089 [inline]
          [<00000000a9f3d058>] __se_sys_setsockopt net/socket.c:2086 [inline]
          [<00000000a9f3d058>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
          [<000000001b8da885>] do_syscall_64+0x7c/0x1a0 arch/x86/entry/common.c:301
          [<00000000ba770c62>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      now they will see this:
      
      unreferenced object 0xffff88805413c800 (size 64):
        comm "syz-executor.4", pid 8960, jiffies 4294994003 (age 14.350s)
        hex dump (first 32 bytes):
          00 7a 8a 57 80 88 ff ff e0 00 00 01 00 00 00 00  .z.W............
          00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000c5d3be64>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
          [<00000000c5d3be64>] slab_post_alloc_hook mm/slab.h:439 [inline]
          [<00000000c5d3be64>] slab_alloc mm/slab.c:3326 [inline]
          [<00000000c5d3be64>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
          [<0000000023865be2>] kmalloc include/linux/slab.h:547 [inline]
          [<0000000023865be2>] kzalloc include/linux/slab.h:742 [inline]
          [<0000000023865be2>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline]
          [<0000000023865be2>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085
          [<000000003029a9d4>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475
          [<00000000ccd0a87c>] do_ip_setsockopt.isra.0+0x19fe/0x1c00 net/ipv4/ip_sockglue.c:957
          [<00000000a85a3785>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246
          [<00000000ec13c18d>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
          [<0000000052d748e3>] sock_common_setsockopt+0x3e/0x50 net/core/sock.c:3130
          [<00000000512f1014>] __sys_setsockopt+0x9e/0x120 net/socket.c:2078
          [<00000000181758bc>] __do_sys_setsockopt net/socket.c:2089 [inline]
          [<00000000181758bc>] __se_sys_setsockopt net/socket.c:2086 [inline]
          [<00000000181758bc>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
          [<00000000d4b73623>] do_syscall_64+0x7c/0x1a0 arch/x86/entry/common.c:301
          [<00000000c1098bec>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Link: http://lkml.kernel.org/r/20190517171507.96046-1-dvyukov@gmail.com
      Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6ef90569
    • S
      slub: don't panic for memcg kmem cache creation failure · cb097cd4
      Shakeel Butt committed
      Currently with CONFIG_SLUB, if memcg kmem cache creation fails and the
      corresponding root kmem cache has the SLAB_PANIC flag, the kernel
      crashes.  This is unnecessary, as the kernel can handle the creation
      failures of memcg kmem caches.  Additionally, CONFIG_SLAB does not
      implement this behavior.  So, to keep the behavior consistent between
      SLAB and SLUB, remove the panic for memcg kmem cache creation
      failures.  Root kmem cache creation failure with SLAB_PANIC still
      correctly panics for both SLAB and SLUB.
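
      Illustratively, the resulting policy amounts to something like the
      following (a behavioural sketch, not the actual diff):

        /* On kmem_cache creation failure with SLAB_PANIC set: */
        if (is_root_cache(s))
                panic("kmem_cache_create: Failed to create slab '%s'\n",
                      s->name);
        else
                /* memcg child cache: failure is recoverable, just warn */
                pr_warn("failed to create memcg kmem cache '%s'\n", s->name);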
      
      Link: http://lkml.kernel.org/r/20190619232514.58994-1-shakeelb@google.com
      Reported-by: NDave Hansen <dave.hansen@intel.com>
      Signed-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb097cd4
    • Y
      mm/slub.c: avoid double string traverse in kmem_cache_flags() · 9cf3a8d8
      Yury Norov committed
      If ',' is not found, kmem_cache_flags() calls strlen() to find the end
      of the string.  We can do it in a single pass using strchrnul().
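
      A minimal before/after sketch (variable names are assumptions; the
      real loop lives in kmem_cache_flags()):

        /* before: up to two passes over the remaining string */
        end = strchr(iter, ',');
        if (!end)
                end = iter + strlen(iter);

        /* after: a single pass; strchrnul() returns a pointer to the ','
         * or, if there is none, to the terminating '\0' */
        end = strchrnul(iter, ',');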
      
      Link: http://lkml.kernel.org/r/20190501053111.7950-1-ynorov@marvell.com
      Signed-off-by: NYury Norov <ynorov@marvell.com>
      Acked-by: NAaron Tomlin <atomlin@redhat.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9cf3a8d8
    • K
      mm/slab: sanity-check page type when looking up cache · a64b5378
      Kees Cook committed
      This avoids any possible type confusion when looking up an object.  For
      example, if a non-slab were to be passed to kfree(), the invalid
      slab_cache pointer (i.e.  overlapped with some other value from the
      struct page union) would be used for subsequent slab manipulations that
      could lead to further memory corruption.
      
      Since the page is already in cache, adding the PageSlab() check will
      have nearly zero cost, so add a check and WARN() to virt_to_cache().
      It additionally replaces an open-coded virt_to_cache().  To support
      the failure mode, this also updates all callers of virt_to_cache() and
      cache_from_obj() to handle a NULL cache pointer return value (though
      note that several already handle this case gracefully).
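
      A sketch of the hardened lookup described above (assumed to be close
      to the mm/slab.h helper, simplified):

        static inline struct kmem_cache *virt_to_cache(const void *obj)
        {
                struct page *page = virt_to_head_page(obj);

                /* Refuse to dereference page->slab_cache on a non-slab page;
                 * that union member would overlap some unrelated field. */
                if (WARN_ONCE(!PageSlab(page),
                              "%s: Object is not a Slab page!\n", __func__))
                        return NULL;
                return page->slab_cache;
        }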
      
      [dan.carpenter@oracle.com: restore IRQs in kfree()]
        Link: http://lkml.kernel.org/r/20190613065637.GE16334@mwanda
      Link: http://lkml.kernel.org/r/20190530045017.15252-3-keescook@chromium.org
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Cc: Alexander Popov <alex.popov@linux.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a64b5378