1. 13 7月, 2019 10 次提交
    • A
      mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options · 6471384a
      Alexander Potapenko 提交于
      Patch series "add init_on_alloc/init_on_free boot options", v10.
      
      Provide init_on_alloc and init_on_free boot options.
      
      These are aimed at preventing possible information leaks and making the
      control-flow bugs that depend on uninitialized values more deterministic.
      
      Enabling either of the options guarantees that the memory returned by the
      page allocator and SL[AU]B is initialized with zeroes.  SLOB allocator
      isn't supported at the moment, as its emulation of kmem caches complicates
      handling of SLAB_TYPESAFE_BY_RCU caches correctly.
      
      Enabling init_on_free also guarantees that pages and heap objects are
      initialized right after they're freed, so it won't be possible to access
      stale data by using a dangling pointer.
      
      As suggested by Michal Hocko, right now we don't let the heap users to
      disable initialization for certain allocations.  There's not enough
      evidence that doing so can speed up real-life cases, and introducing ways
      to opt-out may result in things going out of control.
      
      This patch (of 2):
      
      The new options are needed to prevent possible information leaks and make
      control-flow bugs that depend on uninitialized values more deterministic.
      
      This is expected to be on-by-default on Android and Chrome OS.  And it
      gives the opportunity for anyone else to use it under distros too via the
      boot args.  (The init_on_free feature is regularly requested by folks
      where memory forensics is included in their threat models.)
      
      init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
      objects with zeroes.  Initialization is done at allocation time at the
      places where checks for __GFP_ZERO are performed.
      
      init_on_free=1 makes the kernel initialize freed pages and heap objects
      with zeroes upon their deletion.  This helps to ensure sensitive data
      doesn't leak via use-after-free accesses.
      
      Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
      returns zeroed memory.  The two exceptions are slab caches with
      constructors and SLAB_TYPESAFE_BY_RCU flag.  Those are never
      zero-initialized to preserve their semantics.
      
      Both init_on_alloc and init_on_free default to zero, but those defaults
      can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
      CONFIG_INIT_ON_FREE_DEFAULT_ON.
      
      If either SLUB poisoning or page poisoning is enabled, those options take
      precedence over init_on_alloc and init_on_free: initialization is only
      applied to unpoisoned allocations.
      
      Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
      
      hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
      hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
      
      Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
      Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
      Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
      Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
      
      The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
      is within the standard error.
      
      The new features are also going to pave the way for hardware memory
      tagging (e.g.  arm64's MTE), which will require both on_alloc and on_free
      hooks to set the tags for heap objects.  With MTE, tagging will have the
      same cost as memory initialization.
      
      Although init_on_free is rather costly, there are paranoid use-cases where
      in-memory data lifetime is desired to be minimized.  There are various
      arguments for/against the realism of the associated threat models, but
      given that we'll need the infrastructure for MTE anyway, and there are
      people who want wipe-on-free behavior no matter what the performance cost,
      it seems reasonable to include it in this series.
      
      [glider@google.com: v8]
        Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
      [glider@google.com: v9]
        Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
      [glider@google.com: v10]
        Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
      Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.comSigned-off-by: NAlexander Potapenko <glider@google.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>		[page and dmapool parts
      Acked-by: James Morris <jamorris@linux.microsoft.com>]
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Sandeep Patil <sspatil@android.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6471384a
    • R
      mm: memcg/slab: reparent memcg kmem_caches on cgroup removal · fb2f2b0a
      Roman Gushchin 提交于
      Let's reparent non-root kmem_caches on memcg offlining.  This allows us to
      release the memory cgroup without waiting for the last outstanding kernel
      object (e.g.  dentry used by another application).
      
      Since the parent cgroup is already charged, everything we need to do is to
      splice the list of kmem_caches to the parent's kmem_caches list, swap the
      memcg pointer, drop the css refcounter for each kmem_cache and adjust the
      parent's css refcounter.
      
      Please, note that kmem_cache->memcg_params.memcg isn't a stable pointer
      anymore.  It's safe to read it under rcu_read_lock(), cgroup_mutex held,
      or any other way that protects the memory cgroup from being released.
      
      We can race with the slab allocation and deallocation paths.  It's not a
      big problem: parent's charge and slab global stats are always correct, and
      we don't care anymore about the child usage and global stats.  The child
      cgroup is already offline, so we don't use or show it anywhere.
      
      Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't
      used anywhere except count_shadow_nodes().  But even there it won't break
      anything: after reparenting "nodes" will be 0 on child level (because
      we're already reparenting shrinker lists), and on parent level page stats
      always were 0, and this patch won't change anything.
      
      [guro@fb.com: properly handle kmem_caches reparented to root_mem_cgroup]
        Link: http://lkml.kernel.org/r/20190620213427.1691847-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20190611231813.3148843-11-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fb2f2b0a
    • R
      mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages · 4d96ba35
      Roman Gushchin 提交于
      Every slab page charged to a non-root memory cgroup has a pointer to the
      memory cgroup and holds a reference to it, which protects a non-empty
      memory cgroup from being released.  At the same time the page has a
      pointer to the corresponding kmem_cache, and also hold a reference to the
      kmem_cache.  And kmem_cache by itself holds a reference to the cgroup.
      
      So there is clearly some redundancy, which allows to stop setting the
      page->mem_cgroup pointer and rely on getting memcg pointer indirectly via
      kmem_cache.  Further it will allow to change this pointer easier, without
      a need to go over all charged pages.
      
      So let's stop setting page->mem_cgroup pointer for slab pages, and stop
      using the css refcounter directly for protecting the memory cgroup from
      going away.  Instead rely on kmem_cache as an intermediate object.
      
      Make sure that vmstats and shrinker lists are working as previously, as
      well as /proc/kpagecgroup interface.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-10-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d96ba35
    • R
      mm: memcg/slab: rework non-root kmem_cache lifecycle management · f0a3a24b
      Roman Gushchin 提交于
      Currently each charged slab page holds a reference to the cgroup to which
      it's charged.  Kmem_caches are held by the memcg and are released all
      together with the memory cgroup.  It means that none of kmem_caches are
      released unless at least one reference to the memcg exists, which is very
      far from optimal.
      
      Let's rework it in a way that allows releasing individual kmem_caches as
      soon as the cgroup is offline, the kmem_cache is empty and there are no
      pending allocations.
      
      To make it possible, let's introduce a new percpu refcounter for non-root
      kmem caches.  The counter is initialized to the percpu mode, and is
      switched to the atomic mode during kmem_cache deactivation.  The counter
      is bumped for every charged page and also for every running allocation.
      So the kmem_cache can't be released unless all allocations complete.
      
      To shutdown non-active empty kmem_caches, let's reuse the work queue,
      previously used for the kmem_cache deactivation.  Once the reference
      counter reaches 0, let's schedule an asynchronous kmem_cache release.
      
      * I used the following simple approach to test the performance
      (stolen from another patchset by T. Harding):
      
          time find / -name fname-no-exist
          echo 2 > /proc/sys/vm/drop_caches
          repeat 10 times
      
      Results:
      
              orig		patched
      
      real	0m1.455s	real	0m1.355s
      user	0m0.206s	user	0m0.219s
      sys	0m0.855s	sys	0m0.807s
      
      real	0m1.487s	real	0m1.699s
      user	0m0.221s	user	0m0.256s
      sys	0m0.806s	sys	0m0.948s
      
      real	0m1.515s	real	0m1.505s
      user	0m0.183s	user	0m0.215s
      sys	0m0.876s	sys	0m0.858s
      
      real	0m1.291s	real	0m1.380s
      user	0m0.193s	user	0m0.198s
      sys	0m0.843s	sys	0m0.786s
      
      real	0m1.364s	real	0m1.374s
      user	0m0.180s	user	0m0.182s
      sys	0m0.868s	sys	0m0.806s
      
      real	0m1.352s	real	0m1.312s
      user	0m0.201s	user	0m0.212s
      sys	0m0.820s	sys	0m0.761s
      
      real	0m1.302s	real	0m1.349s
      user	0m0.205s	user	0m0.203s
      sys	0m0.803s	sys	0m0.792s
      
      real	0m1.334s	real	0m1.301s
      user	0m0.194s	user	0m0.201s
      sys	0m0.806s	sys	0m0.779s
      
      real	0m1.426s	real	0m1.434s
      user	0m0.216s	user	0m0.181s
      sys	0m0.824s	sys	0m0.864s
      
      real	0m1.350s	real	0m1.295s
      user	0m0.200s	user	0m0.190s
      sys	0m0.842s	sys	0m0.811s
      
      So it looks like the difference is not noticeable in this test.
      
      [cai@lca.pw: fix an use-after-free in kmemcg_workfn()]
        Link: http://lkml.kernel.org/r/1560977573-10715-1-git-send-email-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190611231813.3148843-9-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Signed-off-by: NQian Cai <cai@lca.pw>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0a3a24b
    • R
      mm: memcg/slab: unify SLAB and SLUB page accounting · 6cea1d56
      Roman Gushchin 提交于
      Currently the page accounting code is duplicated in SLAB and SLUB
      internals.  Let's move it into new (un)charge_slab_page helpers in the
      slab_common.c file.  These helpers will be responsible for statistics
      (global and memcg-aware) and memcg charging.  So they are replacing direct
      memcg_(un)charge_slab() calls.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-6-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6cea1d56
    • R
      mm: memcg/slab: generalize postponed non-root kmem_cache deactivation · 43486694
      Roman Gushchin 提交于
      Currently SLUB uses a work scheduled after an RCU grace period to
      deactivate a non-root kmem_cache.  This mechanism can be reused for
      kmem_caches release, but requires generalization for SLAB case.
      
      Introduce kmemcg_cache_deactivate() function, which calls
      allocator-specific __kmem_cache_deactivate() and schedules execution of
      __kmem_cache_deactivate_after_rcu() with all necessary locks in a worker
      context after an rcu grace period.
      
      Here is the new calling scheme:
        kmemcg_cache_deactivate()
          __kmemcg_cache_deactivate()                  SLAB/SLUB-specific
          kmemcg_rcufn()                               rcu
            kmemcg_workfn()                            work
              __kmemcg_cache_deactivate_after_rcu()    SLAB/SLUB-specific
      
      instead of:
        __kmemcg_cache_deactivate()                    SLAB/SLUB-specific
          slab_deactivate_memcg_cache_rcu_sched()      SLUB-only
            kmemcg_rcufn()                             rcu
              kmemcg_workfn()                          work
                kmemcg_cache_deact_after_rcu()         SLUB-only
      
      For consistency, all allocator-specific functions start with "__".
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      43486694
    • R
      mm: memcg/slab: rename slab delayed deactivation functions and fields · 0b14e8aa
      Roman Gushchin 提交于
      The delayed work/rcu deactivation infrastructure of non-root kmem_caches
      can be also used for asynchronous release of these objects.  Let's get rid
      of the word "deactivation" in corresponding names to make the code look
      better after generalization.
      
      It's easier to make the renaming first, so that the generalized code will
      look consistent from scratch.
      
      Let's rename struct memcg_cache_params fields:
        deact_fn -> work_fn
        deact_rcu_head -> rcu_head
        deact_work -> work
      
      And RCU/delayed work callbacks in slab common code:
        kmemcg_deactivate_rcufn -> kmemcg_rcufn
        kmemcg_deactivate_workfn -> kmemcg_workfn
      
      This patch contains no functional changes, only renamings.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-3-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b14e8aa
    • R
      mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache() · c03914b7
      Roman Gushchin 提交于
      Patch series "mm: reparent slab memory on cgroup removal", v7.
      
      # Why do we need this?
      
      We've noticed that the number of dying cgroups is steadily growing on most
      of our hosts in production.  The following investigation revealed an issue
      in the userspace memory reclaim code [1], accounting of kernel stacks [2],
      and also the main reason: slab objects.
      
      The underlying problem is quite simple: any page charged to a cgroup holds
      a reference to it, so the cgroup can't be reclaimed unless all charged
      pages are gone.  If a slab object is actively used by other cgroups, it
      won't be reclaimed, and will prevent the origin cgroup from being
      reclaimed.
      
      Slab objects, and first of all vfs cache, is shared between cgroups, which
      are using the same underlying fs, and what's even more important, it's
      shared between multiple generations of the same workload.  So if something
      is running periodically every time in a new cgroup (like how systemd
      works), we do accumulate multiple dying cgroups.
      
      Strictly speaking pagecache isn't different here, but there is a key
      difference: we disable protection and apply some extra pressure on LRUs of
      dying cgroups, and these LRUs contain all charged pages.  My experiments
      show that with the disabled kernel memory accounting the number of dying
      cgroups stabilizes at a relatively small number (~100, depends on memory
      pressure and cgroup creation rate), and with kernel memory accounting it
      grows pretty steadily up to several thousands.
      
      Memory cgroups are quite complex and big objects (mostly due to percpu
      stats), so it leads to noticeable memory losses.  Memory occupied by dying
      cgroups is measured in hundreds of megabytes.  I've even seen a host with
      more than 100Gb of memory wasted for dying cgroups.  It leads to a
      degradation of performance with the uptime, and generally limits the usage
      of cgroups.
      
      My previous attempt [3] to fix the problem by applying extra pressure on
      slab shrinker lists caused a regressions with xfs and ext4, and has been
      reverted [4].  The following attempts to find the right balance [5, 6]
      were not successful.
      
      So instead of trying to find a maybe non-existing balance, let's do
      reparent accounted slab caches to the parent cgroup on cgroup removal.
      
      # Implementation approach
      
      There is however a significant problem with reparenting of slab memory:
      there is no list of charged pages.  Some of them are in shrinker lists,
      but not all.  Introducing of a new list is really not an option.
      
      But fortunately there is a way forward: every slab page has a stable
      pointer to the corresponding kmem_cache.  So the idea is to reparent
      kmem_caches instead of slab pages.
      
      It's actually simpler and cheaper, but requires some underlying changes:
      1) Make kmem_caches to hold a single reference to the memory cgroup,
         instead of a separate reference per every slab page.
      2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
         page->kmem_cache->memcg indirection instead. It's used only on
         slab page release, so performance overhead shouldn't be a big issue.
      3) Introduce a refcounter for non-root slab caches. It's required to
         be able to destroy kmem_caches when they become empty and release
         the associated memory cgroup.
      
      There is a bonus: currently we release all memcg kmem_caches all together
      with the memory cgroup itself.  This patchset allows individual
      kmem_caches to be released as soon as they become inactive and free.
      
      Some additional implementation details are provided in corresponding
      commit messages.
      
      # Results
      
      Below is the average number of dying cgroups on two groups of our
      production hosts.  They do run some sort of web frontend workload, the
      memory pressure is moderate.  As we can see, with the kernel memory
      reparenting the number stabilizes in 60s range; however with the original
      version it grows almost linearly and doesn't show any signs of plateauing.
      The difference in slab and percpu usage between patched and unpatched
      versions also grows linearly.  In 7 days it exceeded 200Mb.
      
      day           0    1    2    3    4    5    6    7
      original     56  362  628  752 1070 1250 1490 1560
      patched      23   46   51   55   60   57   67   69
      mem diff(Mb) 22   74  123  152  164  182  214  241
      
      # Links
      
      [1]: commit 68600f62 ("mm: don't miss the last page because of round-off error")
      [2]: commit 9b6f7e16 ("mm: rework memcg kernel stack accounting")
      [3]: commit 172b06c3 ("mm: slowly shrink slabs with a relatively small number of objects")
      [4]: commit a9a238e8 ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
      [5]: https://lkml.org/lkml/2019/1/28/1865
      [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2
      
      This patch (of 10):
      
      Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
      rather than in init_memcg_params().
      
      Once kmem_cache will hold a reference to the memory cgroup, it will
      simplify the refcounting.
      
      For non-root kmem_caches memcg_link_cache() is always called before the
      kmem_cache becomes visible to a user, so it's safe.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c03914b7
    • K
      mm/slab: sanity-check page type when looking up cache · a64b5378
      Kees Cook 提交于
      This avoids any possible type confusion when looking up an object.  For
      example, if a non-slab were to be passed to kfree(), the invalid
      slab_cache pointer (i.e.  overlapped with some other value from the
      struct page union) would be used for subsequent slab manipulations that
      could lead to further memory corruption.
      
      Since the page is already in cache, adding the PageSlab() check will
      have nearly zero cost, so add a check and WARN() to virt_to_cache().
      Additionally replaces an open-coded virt_to_cache().  To support the
      failure mode this also updates all callers of virt_to_cache() and
      cache_from_obj() to handle a NULL cache pointer return value (though
      note that several already handle this case gracefully).
      
      [dan.carpenter@oracle.com: restore IRQs in kfree()]
        Link: http://lkml.kernel.org/r/20190613065637.GE16334@mwanda
      Link: http://lkml.kernel.org/r/20190530045017.15252-3-keescook@chromium.orgSigned-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Cc: Alexander Popov <alex.popov@linux.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a64b5378
    • K
      mm/slab: validate cache membership under freelist hardening · 598a0717
      Kees Cook 提交于
      Patch series "mm/slab: Improved sanity checking".
      
      This adds defenses against slab cache confusion (as seen in real-world
      exploits[1]) and gracefully handles type confusions when trying to look
      up slab caches from an arbitrary page.  (Also is patch 3: new LKDTM
      tests for these defenses as well as for the existing double-free
      detection.
      
      This patch (of 3):
      
      When building under CONFIG_SLAB_FREELIST_HARDENING, it makes sense to
      perform sanity-checking on the assumed slab cache during
      kmem_cache_free() to make sure the kernel doesn't mix freelists across
      slab caches and corrupt memory (as seen in the exploitation of flaws
      like CVE-2018-9568[1]).  Note that the prior code might WARN() but still
      corrupt memory (i.e.  return the assumed cache instead of the owned
      cache).
      
      There is no noticeable performance impact (changes are within noise).
      Measuring parallel kernel builds, I saw the following with
      CONFIG_SLAB_FREELIST_HARDENED, before and after this patch:
      
      before:
      
      	Run times: 288.85 286.53 287.09 287.07 287.21
      	Min: 286.53 Max: 288.85 Mean: 287.35 Std Dev: 0.79
      
      after:
      
      	Run times: 289.58 287.40 286.97 287.20 287.01
      	Min: 286.97 Max: 289.58 Mean: 287.63 Std Dev: 0.99
      
      Delta: 0.1% which is well below the standard deviation
      
      [1] https://github.com/ThomasKing2014/slides/raw/master/Building%20universal%20Android%20rooting%20with%20a%20type%20confusion%20vulnerability.pdf
      
      Link: http://lkml.kernel.org/r/20190530045017.15252-2-keescook@chromium.orgSigned-off-by: NKees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Alexander Popov <alex.popov@linux.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      598a0717
  2. 30 3月, 2019 1 次提交
    • N
      mm: add support for kmem caches in DMA32 zone · 6d6ea1e9
      Nicolas Boichat 提交于
      Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
      v6.
      
      This is a followup to the discussion in [1], [2].
      
      IOMMUs using ARMv7 short-descriptor format require page tables (level 1
      and 2) to be allocated within the first 4GB of RAM, even on 64-bit
      systems.
      
      For L1 tables that are bigger than a page, we can just use
      __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
      use GFP_DMA).
      
      For L2 tables that only take 1KB, it would be a waste to allocate a full
      page, so we considered 3 approaches:
       1. This series, adding support for GFP_DMA32 slab caches.
       2. genalloc, which requires pre-allocating the maximum number of L2 page
          tables (4096, so 4MB of memory).
       3. page_frag, which is not very memory-efficient as it is unable to reuse
          freed fragments until the whole page is freed. [3]
      
      This series is the most memory-efficient approach.
      
      stable@ note:
        We confirmed that this is a regression, and IOMMU errors happen on 4.19
        and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
        most likely starts from commit ad67f5a6 ("arm64: replace ZONE_DMA
        with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
        platforms (and maybe others?).
      
      [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
      [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
      [3] https://patchwork.codeaurora.org/patch/671639/
      
      This patch (of 3):
      
      IOMMUs using ARMv7 short-descriptor format require page tables to be
      allocated within the first 4GB of RAM, even on 64-bit systems.  On arm64,
      this is done by passing GFP_DMA32 flag to memory allocation functions.
      
      For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
      a full page using get_free_pages, so we considered 3 approaches:
       1. This patch, adding support for GFP_DMA32 slab caches.
       2. genalloc, which requires pre-allocating the maximum number of L2
          page tables (4096, so 4MB of memory).
       3. page_frag, which is not very memory-efficient as it is unable
          to reuse freed fragments until the whole page is freed.
      
      This change makes it possible to create a custom cache in DMA32 zone using
      kmem_cache_create, then allocate memory using kmem_cache_alloc.
      
      We do not create a DMA32 kmalloc cache array, as there are currently no
      users of kmalloc(..., GFP_DMA32).  These calls will continue to trigger a
      warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.
      
      This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
      kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
      unnecessary).
      
      Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.orgSigned-off-by: NNicolas Boichat <drinkcat@chromium.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Sasha Levin <Alexander.Levin@microsoft.com>
      Cc: Huaisheng Ye <yehs1@lenovo.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yong Wu <yong.wu@mediatek.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Tomasz Figa <tfiga@google.com>
      Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d6ea1e9
  3. 06 3月, 2019 1 次提交
  4. 22 2月, 2019 2 次提交
  5. 29 12月, 2018 1 次提交
    • A
      kasan, mm: change hooks signatures · 0116523c
      Andrey Konovalov 提交于
      Patch series "kasan: add software tag-based mode for arm64", v13.
      
      This patchset adds a new software tag-based mode to KASAN [1].  (Initially
      this mode was called KHWASAN, but it got renamed, see the naming rationale
      at the end of this section).
      
      The plan is to implement HWASan [2] for the kernel with the incentive,
      that it's going to have comparable to KASAN performance, but in the same
      time consume much less memory, trading that off for somewhat imprecise bug
      detection and being supported only for arm64.
      
      The underlying ideas of the approach used by software tag-based KASAN are:
      
      1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
         pointer tags in the top byte of each kernel pointer.
      
      2. Using shadow memory, we can store memory tags for each chunk of kernel
         memory.
      
      3. On each memory allocation, we can generate a random tag, embed it into
         the returned pointer and set the memory tags that correspond to this
         chunk of memory to the same value.
      
      4. By using compiler instrumentation, before each memory access we can add
         a check that the pointer tag matches the tag of the memory that is being
         accessed.
      
      5. On a tag mismatch we report an error.
      
      With this patchset the existing KASAN mode gets renamed to generic KASAN,
      with the word "generic" meaning that the implementation can be supported
      by any architecture as it is purely software.
      
      The new mode this patchset adds is called software tag-based KASAN.  The
      word "tag-based" refers to the fact that this mode uses tags embedded into
      the top byte of kernel pointers and the TBI arm64 CPU feature that allows
      to dereference such pointers.  The word "software" here means that shadow
      memory manipulation and tag checking on pointer dereference is done in
      software.  As it is the only tag-based implementation right now, "software
      tag-based" KASAN is sometimes referred to as simply "tag-based" in this
      patchset.
      
      A potential expansion of this mode is a hardware tag-based mode, which
      would use hardware memory tagging support (announced by Arm [3]) instead
      of compiler instrumentation and manual shadow memory manipulation.
      
      Same as generic KASAN, software tag-based KASAN is strictly a debugging
      feature.
      
      [1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html
      
      [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
      
      [3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
      
      ====== Rationale
      
      On mobile devices generic KASAN's memory usage is significant problem.
      One of the main reasons to have tag-based KASAN is to be able to perform a
      similar set of checks as the generic one does, but with lower memory
      requirements.
      
      Comment from Vishwath Mohan <vishwath@google.com>:
      
      I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
      problematic to enable for environments that don't tolerate the increased
      memory pressure well.  This includes
      
      (a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
      (c) Connected components like Pixel's visual core [1].
      
      These are both places I'd love to have a low(er) memory footprint option at
      my disposal.
      
      Comment from Evgenii Stepanov <eugenis@google.com>:
      
      Looking at a live Android device under load, slab (according to
      /proc/meminfo) + kernel stack take 8-10% available RAM (~350MB).  KASAN's
      overhead of 2x - 3x on top of it is not insignificant.
      
      Not having this overhead enables near-production use - ex.  running
      KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
      not reproduce in test configuration.  These are the ones that often cost
      the most engineering time to track down.
      
      CPU overhead is bad, but generally tolerable.  RAM is critical, in our
      experience.  Once it gets low enough, OOM-killer makes your life
      miserable.
      
      [1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/
      
      ====== Technical details
      
      Software tag-based KASAN mode is implemented in a very similar way to the
      generic one. This patchset essentially does the following:
      
      1. TCR_TBI1 is set to enable Top Byte Ignore.
      
      2. Shadow memory is used (with a different scale, 1:16, so each shadow
         byte corresponds to 16 bytes of kernel memory) to store memory tags.
      
      3. All slab objects are aligned to shadow scale, which is 16 bytes.
      
      4. All pointers returned from the slab allocator are tagged with a random
         tag and the corresponding shadow memory is poisoned with the same value.
      
      5. Compiler instrumentation is used to insert tag checks. Either by
         calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
         CONFIG_KASAN_INLINE flags are reused).
      
      6. When a tag mismatch is detected in callback instrumentation mode
         KASAN simply prints a bug report. In case of inline instrumentation,
         clang inserts a brk instruction, and KASAN has it's own brk handler,
         which reports the bug.
      
      7. The memory in between slab objects is marked with a reserved tag, and
         acts as a redzone.
      
      8. When a slab object is freed it's marked with a reserved tag.
      
      Bug detection is imprecise for two reasons:
      
      1. We won't catch some small out-of-bounds accesses, that fall into the
         same shadow cell, as the last byte of a slab object.
      
      2. We only have 1 byte to store tags, which means we have a 1/256
         probability of a tag match for an incorrect access (actually even
         slightly less due to reserved tag values).
      
      Despite that there's a particular type of bugs that tag-based KASAN can
      detect compared to generic KASAN: use-after-free after the object has been
      allocated by someone else.
      
      ====== Testing
      
      Some kernel developers voiced a concern that changing the top byte of
      kernel pointers may lead to subtle bugs that are difficult to discover.
      To address this concern deliberate testing has been performed.
      
      It doesn't seem feasible to do some kind of static checking to find
      potential issues with pointer tagging, so a dynamic approach was taken.
      All pointer comparisons/subtractions have been instrumented in an LLVM
      compiler pass and a kernel module that would print a bug report whenever
      two pointers with different tags are being compared/subtracted (ignoring
      comparisons with NULL pointers and with pointers obtained by casting an
      error code to a pointer type) has been used.  Then the kernel has been
      booted in QEMU and on an Odroid C2 board and syzkaller has been run.
      
      This yielded the following results.
      
      The two places that look interesting are:
      
      is_vmalloc_addr in include/linux/mm.h
      is_kernel_rodata in mm/util.c
      
      Here we compare a pointer with some fixed untagged values to make sure
      that the pointer lies in a particular part of the kernel address space.
      Since tag-based KASAN doesn't add tags to pointers that belong to rodata
      or vmalloc regions, this should work as is.  To make sure debug checks to
      those two functions that check that the result doesn't change whether we
      operate on pointers with or without untagging has been added.
      
      A few other cases that don't look that interesting:
      
      Comparing pointers to achieve unique sorting order of pointee objects
      (e.g. sorting locks addresses before performing a double lock):
      
      tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
      pipe_double_lock in fs/pipe.c
      unix_state_double_lock in net/unix/af_unix.c
      lock_two_nondirectories in fs/inode.c
      mutex_lock_double in kernel/events/core.c
      
      ep_cmp_ffd in fs/eventpoll.c
      fsnotify_compare_groups fs/notify/mark.c
      
      Nothing needs to be done here, since the tags embedded into pointers
      don't change, so the sorting order would still be unique.
      
      Checks that a pointer belongs to some particular allocation:
      
      is_sibling_entry in lib/radix-tree.c
      object_is_on_stack in include/linux/sched/task_stack.h
      
      Nothing needs to be done here either, since two pointers can only belong
      to the same allocation if they have the same tag.
      
      Overall, since the kernel boots and works, there are no critical bugs.
      As for the rest, the traditional kernel testing way (use until fails) is
      the only one that looks feasible.
      
      Another point here is that tag-based KASAN is available under a separate
      config option that needs to be deliberately enabled. Even though it might
      be used in a "near-production" environment to find bugs that are not found
      during fuzzing or running tests, it is still a debug tool.
      
      ====== Benchmarks
      
      The following numbers were collected on Odroid C2 board. Both generic and
      tag-based KASAN were used in inline instrumentation mode.
      
      Boot time [1]:
      * ~1.7 sec for clean kernel
      * ~5.0 sec for generic KASAN
      * ~5.0 sec for tag-based KASAN
      
      Network performance [2]:
      * 8.33 Gbits/sec for clean kernel
      * 3.17 Gbits/sec for generic KASAN
      * 2.85 Gbits/sec for tag-based KASAN
      
      Slab memory usage after boot [3]:
      * ~40 kb for clean kernel
      * ~105 kb (~260% overhead) for generic KASAN
      * ~47 kb (~20% overhead) for tag-based KASAN
      
      KASAN memory overhead consists of three main parts:
      1. Increased slab memory usage due to redzones.
      2. Shadow memory (the whole reserved once during boot).
      3. Quaratine (grows gradually until some preset limit; the more the limit,
         the more the chance to detect a use-after-free).
      
      Comparing tag-based vs generic KASAN for each of these points:
      1. 20% vs 260% overhead.
      2. 1/16th vs 1/8th of physical memory.
      3. Tag-based KASAN doesn't require quarantine.
      
      [1] Time before the ext4 driver is initialized.
      [2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
      [3] Measured as `cat /proc/meminfo | grep Slab`.
      
      ====== Some notes
      
      A few notes:
      
      1. The patchset can be found here:
         https://github.com/xairy/kasan-prototype/tree/khwasan
      
      2. Building requires a recent Clang version (7.0.0 or later).
      
      3. Stack instrumentation is not supported yet and will be added later.
      
      This patch (of 25):
      
      Tag-based KASAN changes the value of the top byte of pointers returned
      from the kernel allocation functions (such as kmalloc).  This patch
      updates KASAN hooks signatures and their usage in SLAB and SLUB code to
      reflect that.
      
      Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.comSigned-off-by: NAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0116523c
  6. 18 8月, 2018 1 次提交
  7. 06 4月, 2018 6 次提交
  8. 01 2月, 2018 1 次提交
  9. 16 1月, 2018 2 次提交
    • D
      usercopy: Mark kmalloc caches as usercopy caches · 6c0c21ad
      David Windsor 提交于
      Mark the kmalloc slab caches as entirely whitelisted. These caches
      are frequently used to fulfill kernel allocations that contain data
      to be copied to/from userspace. Internal-only uses are also common,
      but are scattered in the kernel. For now, mark all the kmalloc caches
      as whitelisted.
      
      This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code are
      mine and don't reflect the original grsecurity/PaX code.
      Signed-off-by: NDavid Windsor <dave@nullcore.net>
      [kees: merged in moved kmalloc hunks, adjust commit log]
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-mm@kvack.org
      Cc: linux-xfs@vger.kernel.org
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NChristoph Lameter <cl@linux.com>
      6c0c21ad
    • D
      usercopy: Prepare for usercopy whitelisting · 8eb8284b
      David Windsor 提交于
      This patch prepares the slab allocator to handle caches having annotations
      (useroffset and usersize) defining usercopy regions.
      
      This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on
      my understanding of the code. Changes or omissions from the original
      code are mine and don't reflect the original grsecurity/PaX code.
      
      Currently, hardened usercopy performs dynamic bounds checking on slab
      cache objects. This is good, but still leaves a lot of kernel memory
      available to be copied to/from userspace in the face of bugs. To further
      restrict what memory is available for copying, this creates a way to
      whitelist specific areas of a given slab cache object for copying to/from
      userspace, allowing much finer granularity of access control. Slab caches
      that are never exposed to userspace can declare no whitelist for their
      objects, thereby keeping them unavailable to userspace via dynamic copy
      operations. (Note, an implicit form of whitelisting is the use of constant
      sizes in usercopy operations and get_user()/put_user(); these bypass
      hardened usercopy checks since these sizes cannot change at runtime.)
      
      To support this whitelist annotation, usercopy region offset and size
      members are added to struct kmem_cache. The slab allocator receives a
      new function, kmem_cache_create_usercopy(), that creates a new cache
      with a usercopy region defined, suitable for declaring spans of fields
      within the objects that get copied to/from userspace.
      
      In this patch, the default kmem_cache_create() marks the entire allocation
      as whitelisted, leaving it semantically unchanged. Once all fine-grained
      whitelists have been added (in subsequent patches), this will be changed
      to a usersize of 0, making caches created with kmem_cache_create() not
      copyable to/from userspace.
      
      After the entire usercopy whitelist series is applied, less than 15%
      of the slab cache memory remains exposed to potential usercopy bugs
      after a fresh boot:
      
      Total Slab Memory:           48074720
      Usercopyable Memory:          6367532  13.2%
               task_struct                    0.2%         4480/1630720
               RAW                            0.3%            300/96000
               RAWv6                          2.1%           1408/64768
               ext4_inode_cache               3.0%       269760/8740224
               dentry                        11.1%       585984/5273856
               mm_struct                     29.1%         54912/188448
               kmalloc-8                    100.0%          24576/24576
               kmalloc-16                   100.0%          28672/28672
               kmalloc-32                   100.0%          81920/81920
               kmalloc-192                  100.0%          96768/96768
               kmalloc-128                  100.0%        143360/143360
               names_cache                  100.0%        163840/163840
               kmalloc-64                   100.0%        167936/167936
               kmalloc-256                  100.0%        339968/339968
               kmalloc-512                  100.0%        350720/350720
               kmalloc-96                   100.0%        455616/455616
               kmalloc-8192                 100.0%        655360/655360
               kmalloc-1024                 100.0%        812032/812032
               kmalloc-4096                 100.0%        819200/819200
               kmalloc-2048                 100.0%      1310720/1310720
      
      After some kernel build workloads, the percentage (mainly driven by
      dentry and inode caches expanding) drops under 10%:
      
      Total Slab Memory:           95516184
      Usercopyable Memory:          8497452   8.8%
               task_struct                    0.2%         4000/1456000
               RAW                            0.3%            300/96000
               RAWv6                          2.1%           1408/64768
               ext4_inode_cache               3.0%     1217280/39439872
               dentry                        11.1%     1623200/14608800
               mm_struct                     29.1%         73216/251264
               kmalloc-8                    100.0%          24576/24576
               kmalloc-16                   100.0%          28672/28672
               kmalloc-32                   100.0%          94208/94208
               kmalloc-192                  100.0%          96768/96768
               kmalloc-128                  100.0%        143360/143360
               names_cache                  100.0%        163840/163840
               kmalloc-64                   100.0%        245760/245760
               kmalloc-256                  100.0%        339968/339968
               kmalloc-512                  100.0%        350720/350720
               kmalloc-96                   100.0%        563520/563520
               kmalloc-8192                 100.0%        655360/655360
               kmalloc-1024                 100.0%        794624/794624
               kmalloc-4096                 100.0%        819200/819200
               kmalloc-2048                 100.0%      1257472/1257472
      Signed-off-by: NDavid Windsor <dave@nullcore.net>
      [kees: adjust commit log, split out a few extra kmalloc hunks]
      [kees: add field names to function declarations]
      [kees: convert BUGs to WARNs and fail closed]
      [kees: add attack surface reduction analysis to commit log]
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-mm@kvack.org
      Cc: linux-xfs@vger.kernel.org
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NChristoph Lameter <cl@linux.com>
      8eb8284b
  10. 17 12月, 2017 1 次提交
  11. 16 11月, 2017 4 次提交
  12. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  13. 24 10月, 2017 1 次提交
  14. 10 8月, 2017 1 次提交
    • P
      locking/lockdep: Rework FS_RECLAIM annotation · d92a8cfc
      Peter Zijlstra 提交于
      A while ago someone, and I cannot find the email just now, asked if we
      could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
      like we use for other things like workqueues etc. I think this should
      be possible which allows reducing the 'irq' states and will reduce the
      amount of __bfs() lookups we do.
      
      Removing the 1 IRQ state results in 4 less __bfs() walks per
      dependency, improving lockdep performance. And by moving this
      annotation out of the lockdep code it becomes easier for the mm people
      to extend.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: iamjoonsoo.kim@lge.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      d92a8cfc
  15. 07 7月, 2017 3 次提交
  16. 19 4月, 2017 1 次提交
    • P
      mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU · 5f0d5a3a
      Paul E. McKenney 提交于
      A group of Linux kernel hackers reported chasing a bug that resulted
      from their assumption that SLAB_DESTROY_BY_RCU provided an existence
      guarantee, that is, that no block from such a slab would be reallocated
      during an RCU read-side critical section.  Of course, that is not the
      case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
      slab of blocks.
      
      However, there is a phrase for this, namely "type safety".  This commit
      therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
      to avoid future instances of this sort of confusion.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      [ paulmck: Add comments mentioning the old name, as requested by Eric
        Dumazet, in order to help people familiar with the old name find
        the new one. ]
      Acked-by: NDavid Rientjes <rientjes@google.com>
      5f0d5a3a
  17. 23 2月, 2017 3 次提交
    • T
      slab: remove synchronous synchronize_sched() from memcg cache deactivation path · 01fb58bc
      Tejun Heo 提交于
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      slub uses synchronize_sched() to deactivate a memcg cache.
      synchronize_sched() is an expensive and slow operation and doesn't scale
      when a huge number of caches are destroyed back-to-back.  While there
      used to be a simple batching mechanism, the batching was too restricted
      to be helpful.
      
      This patch implements slab_deactivate_memcg_cache_rcu_sched() which slub
      can use to schedule sched RCU callback instead of performing
      synchronize_sched() synchronously while holding cgroup_mutex.  While
      this adds online cpus, mems and slab_mutex operations, operating on
      these locks back-to-back from the same kworker, which is what's gonna
      happen when there are many to deactivate, isn't expensive at all and
      this gets rid of the scalability problem completely.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-9-tj@kernel.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJay Vana <jsvana@fb.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01fb58bc
    • T
      slab: introduce __kmemcg_cache_deactivate() · c9fc5864
      Tejun Heo 提交于
      __kmem_cache_shrink() is called with %true @deactivate only for memcg
      caches.  Remove @deactivate from __kmem_cache_shrink() and introduce
      __kmemcg_cache_deactivate() instead.  Each memcg-supporting allocator
      should implement it and it should deactivate and drain the cache.
      
      This is to allow memcg cache deactivation behavior to further deviate
      from simple shrinking without messing up __kmem_cache_shrink().
      
      This is pure reorganization and doesn't introduce any observable
      behavior changes.
      
      v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9fc5864
    • T
      slab: implement slab_root_caches list · 510ded33
      Tejun Heo 提交于
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      slab_caches currently lists all caches including root and memcg ones.
      This is the only data structure which lists the root caches and
      iterating root caches can only be done by walking the list while
      skipping over memcg caches.  As there can be a huge number of memcg
      caches, this can become very expensive.
      
      This also can make /proc/slabinfo behave very badly.  seq_file processes
      reads in 4k chunks and seeks to the previous Nth position on slab_caches
      list to resume after each chunk.  With a lot of memcg cache churns on
      the list, reading /proc/slabinfo can become very slow and its content
      often ends up with duplicate and/or missing entries.
      
      This patch adds a new list slab_root_caches which lists only the root
      caches.  When memcg is not enabled, it becomes just an alias of
      slab_caches.  memcg specific list operations are collected into
      memcg_[un]link_cache().
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-7-tj@kernel.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJay Vana <jsvana@fb.com>
      Acked-by: NVladimir Davydov <vdavydov@tarantool.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      510ded33