1. 08 8月, 2020 5 次提交
  2. 04 6月, 2020 1 次提交
  3. 14 1月, 2020 1 次提交
    • V
      mm, debug_pagealloc: don't rely on static keys too early · 8e57f8ac
      Vlastimil Babka 提交于
      Commit 96a2b03f ("mm, debug_pagelloc: use static keys to enable
      debugging") has introduced a static key to reduce overhead when
      debug_pagealloc is compiled in but not enabled.  It relied on the
      assumption that jump_label_init() is called before parse_early_param()
      as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
      it is safe to enable the static key.
      
      However, it turns out multiple architectures call parse_early_param()
      earlier from their setup_arch().  x86 also calls jump_label_init() even
      earlier, so no issue was found while testing the commit, but same is not
      true for e.g.  ppc64 and s390 where the kernel would not boot with
      debug_pagealloc=on as found by our QA.
      
      To fix this without tricky changes to init code of multiple
      architectures, this patch partially reverts the static key conversion
      from 96a2b03f.  Init-time and non-fastpath calls (such as in arch
      code) of debug_pagealloc_enabled() will again test a simple bool
      variable.  Fastpath mm code is converted to a new
      debug_pagealloc_enabled_static() variant that relies on the static key,
      which is enabled in a well-defined point in mm_init() where it's
      guaranteed that jump_label_init() has been called, regardless of
      architecture.
      
      [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
        Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
      Fixes: 96a2b03f ("mm, debug_pagelloc: use static keys to enable debugging")
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e57f8ac
  4. 01 12月, 2019 2 次提交
  5. 15 10月, 2019 1 次提交
  6. 13 7月, 2019 6 次提交
    • A
      mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options · 6471384a
      Alexander Potapenko 提交于
      Patch series "add init_on_alloc/init_on_free boot options", v10.
      
      Provide init_on_alloc and init_on_free boot options.
      
      These are aimed at preventing possible information leaks and making the
      control-flow bugs that depend on uninitialized values more deterministic.
      
      Enabling either of the options guarantees that the memory returned by the
      page allocator and SL[AU]B is initialized with zeroes.  SLOB allocator
      isn't supported at the moment, as its emulation of kmem caches complicates
      handling of SLAB_TYPESAFE_BY_RCU caches correctly.
      
      Enabling init_on_free also guarantees that pages and heap objects are
      initialized right after they're freed, so it won't be possible to access
      stale data by using a dangling pointer.
      
      As suggested by Michal Hocko, right now we don't let the heap users to
      disable initialization for certain allocations.  There's not enough
      evidence that doing so can speed up real-life cases, and introducing ways
      to opt-out may result in things going out of control.
      
      This patch (of 2):
      
      The new options are needed to prevent possible information leaks and make
      control-flow bugs that depend on uninitialized values more deterministic.
      
      This is expected to be on-by-default on Android and Chrome OS.  And it
      gives the opportunity for anyone else to use it under distros too via the
      boot args.  (The init_on_free feature is regularly requested by folks
      where memory forensics is included in their threat models.)
      
      init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
      objects with zeroes.  Initialization is done at allocation time at the
      places where checks for __GFP_ZERO are performed.
      
      init_on_free=1 makes the kernel initialize freed pages and heap objects
      with zeroes upon their deletion.  This helps to ensure sensitive data
      doesn't leak via use-after-free accesses.
      
      Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
      returns zeroed memory.  The two exceptions are slab caches with
      constructors and SLAB_TYPESAFE_BY_RCU flag.  Those are never
      zero-initialized to preserve their semantics.
      
      Both init_on_alloc and init_on_free default to zero, but those defaults
      can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
      CONFIG_INIT_ON_FREE_DEFAULT_ON.
      
      If either SLUB poisoning or page poisoning is enabled, those options take
      precedence over init_on_alloc and init_on_free: initialization is only
      applied to unpoisoned allocations.
      
      Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
      
      hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
      hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
      
      Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
      Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
      Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
      Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
      
      The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
      is within the standard error.
      
      The new features are also going to pave the way for hardware memory
      tagging (e.g.  arm64's MTE), which will require both on_alloc and on_free
      hooks to set the tags for heap objects.  With MTE, tagging will have the
      same cost as memory initialization.
      
      Although init_on_free is rather costly, there are paranoid use-cases where
      in-memory data lifetime is desired to be minimized.  There are various
      arguments for/against the realism of the associated threat models, but
      given that we'll need the infrastructure for MTE anyway, and there are
      people who want wipe-on-free behavior no matter what the performance cost,
      it seems reasonable to include it in this series.
      
      [glider@google.com: v8]
        Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
      [glider@google.com: v9]
        Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
      [glider@google.com: v10]
        Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
      Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.comSigned-off-by: NAlexander Potapenko <glider@google.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>		[page and dmapool parts
      Acked-by: James Morris <jamorris@linux.microsoft.com>]
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Sandeep Patil <sspatil@android.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6471384a
    • R
      mm: memcg/slab: unify SLAB and SLUB page accounting · 6cea1d56
      Roman Gushchin 提交于
      Currently the page accounting code is duplicated in SLAB and SLUB
      internals.  Let's move it into new (un)charge_slab_page helpers in the
      slab_common.c file.  These helpers will be responsible for statistics
      (global and memcg-aware) and memcg charging.  So they are replacing direct
      memcg_(un)charge_slab() calls.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-6-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6cea1d56
    • R
      mm: memcg/slab: generalize postponed non-root kmem_cache deactivation · 43486694
      Roman Gushchin 提交于
      Currently SLUB uses a work scheduled after an RCU grace period to
      deactivate a non-root kmem_cache.  This mechanism can be reused for
      kmem_caches release, but requires generalization for SLAB case.
      
      Introduce kmemcg_cache_deactivate() function, which calls
      allocator-specific __kmem_cache_deactivate() and schedules execution of
      __kmem_cache_deactivate_after_rcu() with all necessary locks in a worker
      context after an rcu grace period.
      
      Here is the new calling scheme:
        kmemcg_cache_deactivate()
          __kmemcg_cache_deactivate()                  SLAB/SLUB-specific
          kmemcg_rcufn()                               rcu
            kmemcg_workfn()                            work
              __kmemcg_cache_deactivate_after_rcu()    SLAB/SLUB-specific
      
      instead of:
        __kmemcg_cache_deactivate()                    SLAB/SLUB-specific
          slab_deactivate_memcg_cache_rcu_sched()      SLUB-only
            kmemcg_rcufn()                             rcu
              kmemcg_workfn()                          work
                kmemcg_cache_deact_after_rcu()         SLUB-only
      
      For consistency, all allocator-specific functions start with "__".
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      43486694
    • R
      mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache() · c03914b7
      Roman Gushchin 提交于
      Patch series "mm: reparent slab memory on cgroup removal", v7.
      
      # Why do we need this?
      
      We've noticed that the number of dying cgroups is steadily growing on most
      of our hosts in production.  The following investigation revealed an issue
      in the userspace memory reclaim code [1], accounting of kernel stacks [2],
      and also the main reason: slab objects.
      
      The underlying problem is quite simple: any page charged to a cgroup holds
      a reference to it, so the cgroup can't be reclaimed unless all charged
      pages are gone.  If a slab object is actively used by other cgroups, it
      won't be reclaimed, and will prevent the origin cgroup from being
      reclaimed.
      
      Slab objects, and first of all vfs cache, is shared between cgroups, which
      are using the same underlying fs, and what's even more important, it's
      shared between multiple generations of the same workload.  So if something
      is running periodically every time in a new cgroup (like how systemd
      works), we do accumulate multiple dying cgroups.
      
      Strictly speaking pagecache isn't different here, but there is a key
      difference: we disable protection and apply some extra pressure on LRUs of
      dying cgroups, and these LRUs contain all charged pages.  My experiments
      show that with the disabled kernel memory accounting the number of dying
      cgroups stabilizes at a relatively small number (~100, depends on memory
      pressure and cgroup creation rate), and with kernel memory accounting it
      grows pretty steadily up to several thousands.
      
      Memory cgroups are quite complex and big objects (mostly due to percpu
      stats), so it leads to noticeable memory losses.  Memory occupied by dying
      cgroups is measured in hundreds of megabytes.  I've even seen a host with
      more than 100Gb of memory wasted for dying cgroups.  It leads to a
      degradation of performance with the uptime, and generally limits the usage
      of cgroups.
      
      My previous attempt [3] to fix the problem by applying extra pressure on
      slab shrinker lists caused a regressions with xfs and ext4, and has been
      reverted [4].  The following attempts to find the right balance [5, 6]
      were not successful.
      
      So instead of trying to find a maybe non-existing balance, let's do
      reparent accounted slab caches to the parent cgroup on cgroup removal.
      
      # Implementation approach
      
      There is however a significant problem with reparenting of slab memory:
      there is no list of charged pages.  Some of them are in shrinker lists,
      but not all.  Introducing of a new list is really not an option.
      
      But fortunately there is a way forward: every slab page has a stable
      pointer to the corresponding kmem_cache.  So the idea is to reparent
      kmem_caches instead of slab pages.
      
      It's actually simpler and cheaper, but requires some underlying changes:
      1) Make kmem_caches to hold a single reference to the memory cgroup,
         instead of a separate reference per every slab page.
      2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
         page->kmem_cache->memcg indirection instead. It's used only on
         slab page release, so performance overhead shouldn't be a big issue.
      3) Introduce a refcounter for non-root slab caches. It's required to
         be able to destroy kmem_caches when they become empty and release
         the associated memory cgroup.
      
      There is a bonus: currently we release all memcg kmem_caches all together
      with the memory cgroup itself.  This patchset allows individual
      kmem_caches to be released as soon as they become inactive and free.
      
      Some additional implementation details are provided in corresponding
      commit messages.
      
      # Results
      
      Below is the average number of dying cgroups on two groups of our
      production hosts.  They do run some sort of web frontend workload, the
      memory pressure is moderate.  As we can see, with the kernel memory
      reparenting the number stabilizes in 60s range; however with the original
      version it grows almost linearly and doesn't show any signs of plateauing.
      The difference in slab and percpu usage between patched and unpatched
      versions also grows linearly.  In 7 days it exceeded 200Mb.
      
      day           0    1    2    3    4    5    6    7
      original     56  362  628  752 1070 1250 1490 1560
      patched      23   46   51   55   60   57   67   69
      mem diff(Mb) 22   74  123  152  164  182  214  241
      
      # Links
      
      [1]: commit 68600f62 ("mm: don't miss the last page because of round-off error")
      [2]: commit 9b6f7e16 ("mm: rework memcg kernel stack accounting")
      [3]: commit 172b06c3 ("mm: slowly shrink slabs with a relatively small number of objects")
      [4]: commit a9a238e8 ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
      [5]: https://lkml.org/lkml/2019/1/28/1865
      [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2
      
      This patch (of 10):
      
      Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
      rather than in init_memcg_params().
      
      Once kmem_cache will hold a reference to the memory cgroup, it will
      simplify the refcounting.
      
      For non-root kmem_caches memcg_link_cache() is always called before the
      kmem_cache becomes visible to a user, so it's safe.
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c03914b7
    • M
      mm/slab: refactor common ksize KASAN logic into slab_common.c · 10d1f8cb
      Marco Elver 提交于
      This refactors common code of ksize() between the various allocators into
      slab_common.c: __ksize() is the allocator-specific implementation without
      instrumentation, whereas ksize() includes the required KASAN logic.
      
      Link: http://lkml.kernel.org/r/20190626142014.141844-5-elver@google.comSigned-off-by: NMarco Elver <elver@google.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10d1f8cb
    • K
      mm/slab: sanity-check page type when looking up cache · a64b5378
      Kees Cook 提交于
      This avoids any possible type confusion when looking up an object.  For
      example, if a non-slab were to be passed to kfree(), the invalid
      slab_cache pointer (i.e.  overlapped with some other value from the
      struct page union) would be used for subsequent slab manipulations that
      could lead to further memory corruption.
      
      Since the page is already in cache, adding the PageSlab() check will
      have nearly zero cost, so add a check and WARN() to virt_to_cache().
      Additionally replaces an open-coded virt_to_cache().  To support the
      failure mode this also updates all callers of virt_to_cache() and
      cache_from_obj() to handle a NULL cache pointer return value (though
      note that several already handle this case gracefully).
      
      [dan.carpenter@oracle.com: restore IRQs in kfree()]
        Link: http://lkml.kernel.org/r/20190613065637.GE16334@mwanda
      Link: http://lkml.kernel.org/r/20190530045017.15252-3-keescook@chromium.orgSigned-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Cc: Alexander Popov <alex.popov@linux.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a64b5378
  7. 17 5月, 2019 1 次提交
    • Q
      slab: remove /proc/slab_allocators · 7878c231
      Qian Cai 提交于
      It turned out that DEBUG_SLAB_LEAK is still broken even after recent
      recue efforts that when there is a large number of objects like
      kmemleak_object which is normal on a debug kernel,
      
        # grep kmemleak /proc/slabinfo
        kmemleak_object   2243606 3436210 ...
      
      reading /proc/slab_allocators could easily loop forever while processing
      the kmemleak_object cache and any additional freeing or allocating
      objects will trigger a reprocessing. To make a situation worse,
      soft-lockups could easily happen in this sitatuion which will call
      printk() to allocate more kmemleak objects to guarantee an infinite
      loop.
      
      Also, since it seems no one had noticed when it was totally broken
      more than 2-year ago - see the commit fcf88917 ("slab: fix a crash
      by reading /proc/slab_allocators"), probably nobody cares about it
      anymore due to the decline of the SLAB. Just remove it entirely.
      Suggested-by: NVlastimil Babka <vbabka@suse.cz>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7878c231
  8. 15 5月, 2019 3 次提交
  9. 20 4月, 2019 1 次提交
    • Q
      slab: store tagged freelist for off-slab slabmgmt · 1a62b18d
      Qian Cai 提交于
      Commit 51dedad0 ("kasan, slab: make freelist stored without tags")
      calls kasan_reset_tag() for off-slab slab management object leading to
      freelist being stored non-tagged.
      
      However, cache_grow_begin() calls alloc_slabmgmt() which calls
      kmem_cache_alloc_node() assigns a tag for the address and stores it in
      the shadow address.  As the result, it causes endless errors below
      during boot due to drain_freelist() -> slab_destroy() ->
      kasan_slab_free() which compares already untagged freelist against the
      stored tag in the shadow address.
      
      Since off-slab slab management object freelist is such a special case,
      just store it tagged.  Non-off-slab management object freelist is still
      stored untagged which has not been assigned a tag and should not cause
      any other troubles with this inconsistency.
      
        BUG: KASAN: double-free or invalid-free in slab_destroy+0x84/0x88
        Pointer tag: [ff], memory tag: [99]
      
        CPU: 0 PID: 1376 Comm: kworker/0:4 Tainted: G        W 5.1.0-rc3+ #8
        Hardware name: HPE Apollo 70             /C01_APACHE_MB         , BIOS L50_5.13_1.0.6 07/10/2018
        Workqueue: cgroup_destroy css_killed_work_fn
        Call trace:
         print_address_description+0x74/0x2a4
         kasan_report_invalid_free+0x80/0xc0
         __kasan_slab_free+0x204/0x208
         kasan_slab_free+0xc/0x18
         kmem_cache_free+0xe4/0x254
         slab_destroy+0x84/0x88
         drain_freelist+0xd0/0x104
         __kmem_cache_shrink+0x1ac/0x224
         __kmemcg_cache_deactivate+0x1c/0x28
         memcg_deactivate_kmem_caches+0xa0/0xe8
         memcg_offline_kmem+0x8c/0x3d4
         mem_cgroup_css_offline+0x24c/0x290
         css_killed_work_fn+0x154/0x618
         process_one_work+0x9cc/0x183c
         worker_thread+0x9b0/0xe38
         kthread+0x374/0x390
         ret_from_fork+0x10/0x18
      
        Allocated by task 1625:
         __kasan_kmalloc+0x168/0x240
         kasan_slab_alloc+0x18/0x20
         kmem_cache_alloc_node+0x1f8/0x3a0
         cache_grow_begin+0x4fc/0xa24
         cache_alloc_refill+0x2f8/0x3e8
         kmem_cache_alloc+0x1bc/0x3bc
         sock_alloc_inode+0x58/0x334
         alloc_inode+0xb8/0x164
         new_inode_pseudo+0x20/0xec
         sock_alloc+0x74/0x284
         __sock_create+0xb0/0x58c
         sock_create+0x98/0xb8
         __sys_socket+0x60/0x138
         __arm64_sys_socket+0xa4/0x110
         el0_svc_handler+0x2c0/0x47c
         el0_svc+0x8/0xc
      
        Freed by task 1625:
         __kasan_slab_free+0x114/0x208
         kasan_slab_free+0xc/0x18
         kfree+0x1a8/0x1e0
         single_release+0x7c/0x9c
         close_pdeo+0x13c/0x43c
         proc_reg_release+0xec/0x108
         __fput+0x2f8/0x784
         ____fput+0x1c/0x28
         task_work_run+0xc0/0x1b0
         do_notify_resume+0xb44/0x1278
         work_pending+0x8/0x10
      
        The buggy address belongs to the object at ffff809681b89e00
         which belongs to the cache kmalloc-128 of size 128
        The buggy address is located 0 bytes inside of
         128-byte region [ffff809681b89e00, ffff809681b89e80)
        The buggy address belongs to the page:
        page:ffff7fe025a06e00 count:1 mapcount:0 mapping:01ff80082000fb00
        index:0xffff809681b8fe04
        flags: 0x17ffffffc000200(slab)
        raw: 017ffffffc000200 ffff7fe025a06d08 ffff7fe022ef7b88 01ff80082000fb00
        raw: ffff809681b8fe04 ffff809681b80000 00000001000000e0 0000000000000000
        page dumped because: kasan: bad access detected
        page allocated via order 0, migratetype Unmovable, gfp_mask
        0x2420c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE)
         prep_new_page+0x4e0/0x5e0
         get_page_from_freelist+0x4ce8/0x50d4
         __alloc_pages_nodemask+0x738/0x38b8
         cache_grow_begin+0xd8/0xa24
         ____cache_alloc_node+0x14c/0x268
         __kmalloc+0x1c8/0x3fc
         ftrace_free_mem+0x408/0x1284
         ftrace_free_init_mem+0x20/0x28
         kernel_init+0x24/0x548
         ret_from_fork+0x10/0x18
      
        Memory state around the buggy address:
         ffff809681b89c00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
         ffff809681b89d00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
        >ffff809681b89e00: 99 99 99 99 99 99 99 99 fe fe fe fe fe fe fe fe
                           ^
         ffff809681b89f00: 43 43 43 43 43 fe fe fe fe fe fe fe fe fe fe fe
         ffff809681b8a000: 6d fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
      
      Link: http://lkml.kernel.org/r/20190403022858.97584-1-cai@lca.pw
      Fixes: 51dedad0 ("kasan, slab: make freelist stored without tags")
      Signed-off-by: NQian Cai <cai@lca.pw>
      Reviewed-by: NAndrey Konovalov <andreyknvl@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1a62b18d
  10. 17 4月, 2019 1 次提交
  11. 08 4月, 2019 1 次提交
  12. 30 3月, 2019 1 次提交
    • N
      mm: add support for kmem caches in DMA32 zone · 6d6ea1e9
      Nicolas Boichat 提交于
      Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
      v6.
      
      This is a followup to the discussion in [1], [2].
      
      IOMMUs using ARMv7 short-descriptor format require page tables (level 1
      and 2) to be allocated within the first 4GB of RAM, even on 64-bit
      systems.
      
      For L1 tables that are bigger than a page, we can just use
      __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
      use GFP_DMA).
      
      For L2 tables that only take 1KB, it would be a waste to allocate a full
      page, so we considered 3 approaches:
       1. This series, adding support for GFP_DMA32 slab caches.
       2. genalloc, which requires pre-allocating the maximum number of L2 page
          tables (4096, so 4MB of memory).
       3. page_frag, which is not very memory-efficient as it is unable to reuse
          freed fragments until the whole page is freed. [3]
      
      This series is the most memory-efficient approach.
      
      stable@ note:
        We confirmed that this is a regression, and IOMMU errors happen on 4.19
        and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
        most likely starts from commit ad67f5a6 ("arm64: replace ZONE_DMA
        with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
        platforms (and maybe others?).
      
      [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
      [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
      [3] https://patchwork.codeaurora.org/patch/671639/
      
      This patch (of 3):
      
      IOMMUs using ARMv7 short-descriptor format require page tables to be
      allocated within the first 4GB of RAM, even on 64-bit systems.  On arm64,
      this is done by passing GFP_DMA32 flag to memory allocation functions.
      
      For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
      a full page using get_free_pages, so we considered 3 approaches:
       1. This patch, adding support for GFP_DMA32 slab caches.
       2. genalloc, which requires pre-allocating the maximum number of L2
          page tables (4096, so 4MB of memory).
       3. page_frag, which is not very memory-efficient as it is unable
          to reuse freed fragments until the whole page is freed.
      
      This change makes it possible to create a custom cache in DMA32 zone using
      kmem_cache_create, then allocate memory using kmem_cache_alloc.
      
      We do not create a DMA32 kmalloc cache array, as there are currently no
      users of kmalloc(..., GFP_DMA32).  These calls will continue to trigger a
      warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.
      
      This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
      kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
      unnecessary).
      
      Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.orgSigned-off-by: NNicolas Boichat <drinkcat@chromium.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Sasha Levin <Alexander.Levin@microsoft.com>
      Cc: Huaisheng Ye <yehs1@lenovo.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yong Wu <yong.wu@mediatek.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Tomasz Figa <tfiga@google.com>
      Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d6ea1e9
  13. 06 3月, 2019 3 次提交
    • M
      docs/core-api/mm: fix return value descriptions in mm/ · a862f68a
      Mike Rapoport 提交于
      Many kernel-doc comments in mm/ have the return value descriptions
      either misformatted or omitted at all which makes kernel-doc script
      unhappy:
      
      $ make V=1 htmldocs
      ...
      ./mm/util.c:36: info: Scanning doc for kstrdup
      ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
      ./mm/util.c:57: info: Scanning doc for kstrdup_const
      ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
      ./mm/util.c:75: info: Scanning doc for kstrndup
      ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
      ...
      
      Fixing the formatting and adding the missing return value descriptions
      eliminates ~100 such warnings.
      
      Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a862f68a
    • A
      numa: make "nr_node_ids" unsigned int · b9726c26
      Alexey Dobriyan 提交于
      Number of NUMA nodes can't be negative.
      
      This saves a few bytes on x86_64:
      
      	add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
      	Function                                     old     new   delta
      	hv_synic_alloc.cold                           88     110     +22
      	prealloc_shrinker                            260     262      +2
      	bootstrap                                    249     251      +2
      	sched_init_numa                             1566    1567      +1
      	show_slab_objects                            778     777      -1
      	s_show                                      1201    1200      -1
      	kmem_cache_init                              346     345      -1
      	__alloc_workqueue_key                       1146    1145      -1
      	mem_cgroup_css_alloc                        1614    1612      -2
      	__do_sys_swapon                             4702    4699      -3
      	__list_lru_init                              655     651      -4
      	nic_probe                                   2379    2374      -5
      	store_user_store                             118     111      -7
      	red_zone_store                               106      99      -7
      	poison_store                                 106      99      -7
      	wq_numa_init                                 348     338     -10
      	__kmem_cache_empty                            75      65     -10
      	task_numa_free                               186     173     -13
      	merge_across_nodes_store                     351     336     -15
      	irq_create_affinity_masks                   1261    1246     -15
      	do_numa_crng_init                            343     321     -22
      	task_numa_fault                             4760    4737     -23
      	swapfile_init                                179     156     -23
      	hv_synic_alloc                               536     492     -44
      	apply_wqattrs_prepare                        746     695     -51
      
      Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9726c26
    • Q
      mm/slab.c: kmemleak no scan alien caches · 92d1d07d
      Qian Cai 提交于
      Kmemleak throws endless warnings during boot due to in
      __alloc_alien_cache(),
      
          alc = kmalloc_node(memsize, gfp, node);
          init_arraycache(&alc->ac, entries, batch);
          kmemleak_no_scan(ac);
      
      Kmemleak does not track the array cache (alc->ac) but the alien cache
      (alc) instead, so let it track the latter by lifting kmemleak_no_scan()
      out of init_arraycache().
      
      There is another place that calls init_arraycache(), but
      alloc_kmem_cache_cpus() uses the percpu allocation where will never be
      considered as a leak.
      
        kmemleak: Found object by alias at 0xffff8007b9aa7e38
        CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
        Call trace:
         dump_backtrace+0x0/0x168
         show_stack+0x24/0x30
         dump_stack+0x88/0xb0
         lookup_object+0x84/0xac
         find_and_get_object+0x84/0xe4
         kmemleak_no_scan+0x74/0xf4
         setup_kmem_cache_node+0x2b4/0x35c
         __do_tune_cpucache+0x250/0x2d4
         do_tune_cpucache+0x4c/0xe4
         enable_cpucache+0xc8/0x110
         setup_cpu_cache+0x40/0x1b8
         __kmem_cache_create+0x240/0x358
         create_cache+0xc0/0x198
         kmem_cache_create_usercopy+0x158/0x20c
         kmem_cache_create+0x50/0x64
         fsnotify_init+0x58/0x6c
         do_one_initcall+0x194/0x388
         kernel_init_freeable+0x668/0x688
         kernel_init+0x18/0x124
         ret_from_fork+0x10/0x18
        kmemleak: Object 0xffff8007b9aa7e00 (size 256):
        kmemleak:   comm "swapper/0", pid 1, jiffies 4294697137
        kmemleak:   min_count = 1
        kmemleak:   count = 0
        kmemleak:   flags = 0x1
        kmemleak:   checksum = 0
        kmemleak:   backtrace:
             kmemleak_alloc+0x84/0xb8
             kmem_cache_alloc_node_trace+0x31c/0x3a0
             __kmalloc_node+0x58/0x78
             setup_kmem_cache_node+0x26c/0x35c
             __do_tune_cpucache+0x250/0x2d4
             do_tune_cpucache+0x4c/0xe4
             enable_cpucache+0xc8/0x110
             setup_cpu_cache+0x40/0x1b8
             __kmem_cache_create+0x240/0x358
             create_cache+0xc0/0x198
             kmem_cache_create_usercopy+0x158/0x20c
             kmem_cache_create+0x50/0x64
             fsnotify_init+0x58/0x6c
             do_one_initcall+0x194/0x388
             kernel_init_freeable+0x668/0x688
             kernel_init+0x18/0x124
        kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
        CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
        Call trace:
         dump_backtrace+0x0/0x168
         show_stack+0x24/0x30
         dump_stack+0x88/0xb0
         kmemleak_no_scan+0x90/0xf4
         setup_kmem_cache_node+0x2b4/0x35c
         __do_tune_cpucache+0x250/0x2d4
         do_tune_cpucache+0x4c/0xe4
         enable_cpucache+0xc8/0x110
         setup_cpu_cache+0x40/0x1b8
         __kmem_cache_create+0x240/0x358
         create_cache+0xc0/0x198
         kmem_cache_create_usercopy+0x158/0x20c
         kmem_cache_create+0x50/0x64
         fsnotify_init+0x58/0x6c
         do_one_initcall+0x194/0x388
         kernel_init_freeable+0x668/0x688
         kernel_init+0x18/0x124
         ret_from_fork+0x10/0x18
      
      Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
      Fixes: 1fe00d50 ("slab: factor out initialization of array cache")
      Signed-off-by: NQian Cai <cai@lca.pw>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      92d1d07d
  14. 22 2月, 2019 3 次提交
  15. 09 1月, 2019 1 次提交
  16. 29 12月, 2018 5 次提交
    • A
      mm: convert totalram_pages and totalhigh_pages variables to atomic · ca79b0c2
      Arun KS 提交于
      totalram_pages and totalhigh_pages are made static inline function.
      
      Main motivation was that managed_page_count_lock handling was complicating
      things.  It was discussed in length here,
      https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
      better to remove the lock and convert variables to atomic, with preventing
      poteintial store-to-read tearing as a bonus.
      
      [akpm@linux-foundation.org: coding style fixes]
      Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.orgSigned-off-by: NArun KS <arunks@codeaurora.org>
      Suggested-by: NMichal Hocko <mhocko@suse.com>
      Suggested-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ca79b0c2
    • A
      kasan, mm, arm64: tag non slab memory allocated via pagealloc · 2813b9c0
      Andrey Konovalov 提交于
      Tag-based KASAN doesn't check memory accesses through pointers tagged with
      0xff.  When page_address is used to get pointer to memory that corresponds
      to some page, the tag of the resulting pointer gets set to 0xff, even
      though the allocated memory might have been tagged differently.
      
      For slab pages it's impossible to recover the correct tag to return from
      page_address, since the page might contain multiple slab objects tagged
      with different values, and we can't know in advance which one of them is
      going to get accessed.  For non slab pages however, we can recover the tag
      in page_address, since the whole page was marked with the same tag.
      
      This patch adds tagging to non slab memory allocated with pagealloc.  To
      set the tag of the pointer returned from page_address, the tag gets stored
      to page->flags when the memory gets allocated.
      
      Link: http://lkml.kernel.org/r/d758ddcef46a5abc9970182b9137e2fbee202a2c.1544099024.git.andreyknvl@google.comSigned-off-by: NAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: NDmitry Vyukov <dvyukov@google.com>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2813b9c0
    • A
      mm: move obj_to_index to include/linux/slab_def.h · 5b7c4148
      Andrey Konovalov 提交于
      While with SLUB we can actually preassign tags for caches with contructors
      and store them in pointers in the freelist, SLAB doesn't allow that since
      the freelist is stored as an array of indexes, so there are no pointers to
      store the tags.
      
      Instead we compute the tag twice, once when a slab is created before
      calling the constructor and then again each time when an object is
      allocated with kmalloc.  Tag is computed simply by taking the lowest byte
      of the index that corresponds to the object.  However in kasan_kmalloc we
      only have access to the objects pointer, so we need a way to find out
      which index this object corresponds to.
      
      This patch moves obj_to_index from slab.c to include/linux/slab_def.h to
      be reused by KASAN.
      
      Link: http://lkml.kernel.org/r/c02cd9e574cfd93858e43ac94b05e38f891fef64.1544099024.git.andreyknvl@google.comSigned-off-by: NAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: NDmitry Vyukov <dvyukov@google.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b7c4148
    • A
      kasan: preassign tags to objects with ctors or SLAB_TYPESAFE_BY_RCU · 4d176711
      Andrey Konovalov 提交于
      An object constructor can initialize pointers within this objects based on
      the address of the object.  Since the object address might be tagged, we
      need to assign a tag before calling constructor.
      
      The implemented approach is to assign tags to objects with constructors
      when a slab is allocated and call constructors once as usual.  The
      downside is that such object would always have the same tag when it is
      reallocated, so we won't catch use-after-frees on it.
      
      Also pressign tags for objects from SLAB_TYPESAFE_BY_RCU caches, since
      they can be validy accessed after having been freed.
      
      Link: http://lkml.kernel.org/r/f158a8a74a031d66f0a9398a5b0ed453c37ba09a.1544099024.git.andreyknvl@google.comSigned-off-by: NAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d176711
    • A
      kasan, mm: change hooks signatures · 0116523c
      Andrey Konovalov 提交于
      Patch series "kasan: add software tag-based mode for arm64", v13.
      
      This patchset adds a new software tag-based mode to KASAN [1].  (Initially
      this mode was called KHWASAN, but it got renamed, see the naming rationale
      at the end of this section).
      
      The plan is to implement HWASan [2] for the kernel with the incentive,
      that it's going to have comparable to KASAN performance, but in the same
      time consume much less memory, trading that off for somewhat imprecise bug
      detection and being supported only for arm64.
      
      The underlying ideas of the approach used by software tag-based KASAN are:
      
      1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
         pointer tags in the top byte of each kernel pointer.
      
      2. Using shadow memory, we can store memory tags for each chunk of kernel
         memory.
      
      3. On each memory allocation, we can generate a random tag, embed it into
         the returned pointer and set the memory tags that correspond to this
         chunk of memory to the same value.
      
      4. By using compiler instrumentation, before each memory access we can add
         a check that the pointer tag matches the tag of the memory that is being
         accessed.
      
      5. On a tag mismatch we report an error.
      
      With this patchset the existing KASAN mode gets renamed to generic KASAN,
      with the word "generic" meaning that the implementation can be supported
      by any architecture as it is purely software.
      
      The new mode this patchset adds is called software tag-based KASAN.  The
      word "tag-based" refers to the fact that this mode uses tags embedded into
      the top byte of kernel pointers and the TBI arm64 CPU feature that allows
      to dereference such pointers.  The word "software" here means that shadow
      memory manipulation and tag checking on pointer dereference is done in
      software.  As it is the only tag-based implementation right now, "software
      tag-based" KASAN is sometimes referred to as simply "tag-based" in this
      patchset.
      
      A potential expansion of this mode is a hardware tag-based mode, which
      would use hardware memory tagging support (announced by Arm [3]) instead
      of compiler instrumentation and manual shadow memory manipulation.
      
      Same as generic KASAN, software tag-based KASAN is strictly a debugging
      feature.
      
      [1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html
      
      [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
      
      [3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
      
      ====== Rationale
      
      On mobile devices generic KASAN's memory usage is significant problem.
      One of the main reasons to have tag-based KASAN is to be able to perform a
      similar set of checks as the generic one does, but with lower memory
      requirements.
      
      Comment from Vishwath Mohan <vishwath@google.com>:
      
      I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
      problematic to enable for environments that don't tolerate the increased
      memory pressure well.  This includes
      
      (a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
      (c) Connected components like Pixel's visual core [1].
      
      These are both places I'd love to have a low(er) memory footprint option at
      my disposal.
      
      Comment from Evgenii Stepanov <eugenis@google.com>:
      
      Looking at a live Android device under load, slab (according to
      /proc/meminfo) + kernel stack take 8-10% available RAM (~350MB).  KASAN's
      overhead of 2x - 3x on top of it is not insignificant.
      
      Not having this overhead enables near-production use - ex.  running
      KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
      not reproduce in test configuration.  These are the ones that often cost
      the most engineering time to track down.
      
      CPU overhead is bad, but generally tolerable.  RAM is critical, in our
      experience.  Once it gets low enough, OOM-killer makes your life
      miserable.
      
      [1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/
      
      ====== Technical details
      
      Software tag-based KASAN mode is implemented in a very similar way to the
      generic one. This patchset essentially does the following:
      
      1. TCR_TBI1 is set to enable Top Byte Ignore.
      
      2. Shadow memory is used (with a different scale, 1:16, so each shadow
         byte corresponds to 16 bytes of kernel memory) to store memory tags.
      
      3. All slab objects are aligned to shadow scale, which is 16 bytes.
      
      4. All pointers returned from the slab allocator are tagged with a random
         tag and the corresponding shadow memory is poisoned with the same value.
      
      5. Compiler instrumentation is used to insert tag checks. Either by
         calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
         CONFIG_KASAN_INLINE flags are reused).
      
      6. When a tag mismatch is detected in callback instrumentation mode
         KASAN simply prints a bug report. In case of inline instrumentation,
         clang inserts a brk instruction, and KASAN has it's own brk handler,
         which reports the bug.
      
      7. The memory in between slab objects is marked with a reserved tag, and
         acts as a redzone.
      
      8. When a slab object is freed it's marked with a reserved tag.
      
      Bug detection is imprecise for two reasons:
      
      1. We won't catch some small out-of-bounds accesses, that fall into the
         same shadow cell, as the last byte of a slab object.
      
      2. We only have 1 byte to store tags, which means we have a 1/256
         probability of a tag match for an incorrect access (actually even
         slightly less due to reserved tag values).
      
      Despite that there's a particular type of bugs that tag-based KASAN can
      detect compared to generic KASAN: use-after-free after the object has been
      allocated by someone else.
      
      ====== Testing
      
      Some kernel developers voiced a concern that changing the top byte of
      kernel pointers may lead to subtle bugs that are difficult to discover.
      To address this concern deliberate testing has been performed.
      
      It doesn't seem feasible to do some kind of static checking to find
      potential issues with pointer tagging, so a dynamic approach was taken.
      All pointer comparisons/subtractions have been instrumented in an LLVM
      compiler pass and a kernel module that would print a bug report whenever
      two pointers with different tags are being compared/subtracted (ignoring
      comparisons with NULL pointers and with pointers obtained by casting an
      error code to a pointer type) has been used.  Then the kernel has been
      booted in QEMU and on an Odroid C2 board and syzkaller has been run.
      
      This yielded the following results.
      
      The two places that look interesting are:
      
      is_vmalloc_addr in include/linux/mm.h
      is_kernel_rodata in mm/util.c
      
      Here we compare a pointer with some fixed untagged values to make sure
      that the pointer lies in a particular part of the kernel address space.
      Since tag-based KASAN doesn't add tags to pointers that belong to rodata
      or vmalloc regions, this should work as is.  To make sure debug checks to
      those two functions that check that the result doesn't change whether we
      operate on pointers with or without untagging has been added.
      
      A few other cases that don't look that interesting:
      
      Comparing pointers to achieve unique sorting order of pointee objects
      (e.g. sorting locks addresses before performing a double lock):
      
      tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
      pipe_double_lock in fs/pipe.c
      unix_state_double_lock in net/unix/af_unix.c
      lock_two_nondirectories in fs/inode.c
      mutex_lock_double in kernel/events/core.c
      
      ep_cmp_ffd in fs/eventpoll.c
      fsnotify_compare_groups fs/notify/mark.c
      
      Nothing needs to be done here, since the tags embedded into pointers
      don't change, so the sorting order would still be unique.
      
      Checks that a pointer belongs to some particular allocation:
      
      is_sibling_entry in lib/radix-tree.c
      object_is_on_stack in include/linux/sched/task_stack.h
      
      Nothing needs to be done here either, since two pointers can only belong
      to the same allocation if they have the same tag.
      
      Overall, since the kernel boots and works, there are no critical bugs.
      As for the rest, the traditional kernel testing way (use until fails) is
      the only one that looks feasible.
      
      Another point here is that tag-based KASAN is available under a separate
      config option that needs to be deliberately enabled. Even though it might
      be used in a "near-production" environment to find bugs that are not found
      during fuzzing or running tests, it is still a debug tool.
      
      ====== Benchmarks
      
      The following numbers were collected on Odroid C2 board. Both generic and
      tag-based KASAN were used in inline instrumentation mode.
      
      Boot time [1]:
      * ~1.7 sec for clean kernel
      * ~5.0 sec for generic KASAN
      * ~5.0 sec for tag-based KASAN
      
      Network performance [2]:
      * 8.33 Gbits/sec for clean kernel
      * 3.17 Gbits/sec for generic KASAN
      * 2.85 Gbits/sec for tag-based KASAN
      
      Slab memory usage after boot [3]:
      * ~40 kb for clean kernel
      * ~105 kb (~260% overhead) for generic KASAN
      * ~47 kb (~20% overhead) for tag-based KASAN
      
      KASAN memory overhead consists of three main parts:
      1. Increased slab memory usage due to redzones.
      2. Shadow memory (the whole reserved once during boot).
      3. Quaratine (grows gradually until some preset limit; the more the limit,
         the more the chance to detect a use-after-free).
      
      Comparing tag-based vs generic KASAN for each of these points:
      1. 20% vs 260% overhead.
      2. 1/16th vs 1/8th of physical memory.
      3. Tag-based KASAN doesn't require quarantine.
      
      [1] Time before the ext4 driver is initialized.
      [2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
      [3] Measured as `cat /proc/meminfo | grep Slab`.
      
      ====== Some notes
      
      A few notes:
      
      1. The patchset can be found here:
         https://github.com/xairy/kasan-prototype/tree/khwasan
      
      2. Building requires a recent Clang version (7.0.0 or later).
      
      3. Stack instrumentation is not supported yet and will be added later.
      
      This patch (of 25):
      
      Tag-based KASAN changes the value of the top byte of pointers returned
      from the kernel allocation functions (such as kmalloc).  This patch
      updates KASAN hooks signatures and their usage in SLAB and SLUB code to
      reflect that.
      
      Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.comSigned-off-by: NAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0116523c
  17. 28 11月, 2018 1 次提交
    • P
      slab: Replace synchronize_sched() with synchronize_rcu() · 6564a25e
      Paul E. McKenney 提交于
      Now that synchronize_rcu() waits for preempt-disable regions of code
      as well as RCU read-side critical sections, synchronize_sched() can be
      replaced by synchronize_rcu().  This commit therefore makes this change.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      6564a25e
  18. 27 10月, 2018 2 次提交
    • V
      mm, slab: combine kmalloc_caches and kmalloc_dma_caches · cc252eae
      Vlastimil Babka 提交于
      Patch series "kmalloc-reclaimable caches", v4.
      
      As discussed at LSF/MM [1] here's a patchset that introduces
      kmalloc-reclaimable caches (more details in the second patch) and uses
      them for dcache external names.  That allows us to repurpose the
      NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
      
      With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
      caches, eliminating the need for manual accounting.  More importantly, it
      also ensures the reclaimable kmalloc allocations are grouped in pages
      separate from the regular kmalloc allocations.  The need for proper
      accounting of dcache external names has shown it's easy for misbehaving
      process to allocate lots of them, causing premature OOMs.  Without the
      added grouping, it's likely that a similar workload can interleave the
      dcache external names allocations with regular kmalloc allocations (note:
      I haven't searched myself for an example of such regular kmalloc
      allocation, but I would be very surprised if there wasn't some).  A
      pathological case would be e.g.  one 64byte regular allocations with 63
      external dcache names in a page (64x64=4096), which means the page is not
      freed even after reclaiming after all dcache names, and the process can
      thus "steal" the whole page with single 64byte allocation.
      
      If other kmalloc users similar to dcache external names become identified,
      they can also benefit from the new functionality simply by adding
      __GFP_RECLAIMABLE to the kmalloc calls.
      
      Side benefits of the patchset (that could be also merged separately)
      include removed branch for detecting __GFP_DMA kmalloc(), and shortening
      kmalloc cache names in /proc/slabinfo output.  The latter is potentially
      an ABI break in case there are tools parsing the names and expecting the
      values to be in bytes.
      
      This is how /proc/slabinfo looks like after booting in virtme:
      
      ...
      kmalloc-rcl-4M         0      0 4194304    1 1024 : tunables    1    1    0 : slabdata      0      0      0
      ...
      kmalloc-rcl-96         7     32    128   32    1 : tunables  120   60    8 : slabdata      1      1      0
      kmalloc-rcl-64        25    128     64   64    1 : tunables  120   60    8 : slabdata      2      2      0
      kmalloc-rcl-32         0      0     32  124    1 : tunables  120   60    8 : slabdata      0      0      0
      kmalloc-4M             0      0 4194304    1 1024 : tunables    1    1    0 : slabdata      0      0      0
      kmalloc-2M             0      0 2097152    1  512 : tunables    1    1    0 : slabdata      0      0      0
      kmalloc-1M             0      0 1048576    1  256 : tunables    1    1    0 : slabdata      0      0      0
      ...
      
      /proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
      
      ...
      nr_slab_reclaimable 2817
      nr_slab_unreclaimable 1781
      ...
      nr_kernel_misc_reclaimable 0
      ...
      
      /proc/meminfo with new KReclaimable counter:
      
      ...
      Shmem:               564 kB
      KReclaimable:      11260 kB
      Slab:              18368 kB
      SReclaimable:      11260 kB
      SUnreclaim:         7108 kB
      KernelStack:        1248 kB
      ...
      
      This patch (of 6):
      
      The kmalloc caches currently mainain separate (optional) array
      kmalloc_dma_caches for __GFP_DMA allocations.  There are tests for
      __GFP_DMA in the allocation hotpaths.  We can avoid the branches by
      combining kmalloc_caches and kmalloc_dma_caches into a single
      two-dimensional array where the outer dimension is cache "type".  This
      will also allow to add kmalloc-reclaimable caches as a third type.
      
      Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc252eae
    • D
      mm: don't warn about large allocations for slab · 61448479
      Dmitry Vyukov 提交于
      Slub does not call kmalloc_slab() for sizes > KMALLOC_MAX_CACHE_SIZE,
      instead it falls back to kmalloc_large().
      
      For slab KMALLOC_MAX_CACHE_SIZE == KMALLOC_MAX_SIZE and it calls
      kmalloc_slab() for all allocations relying on NULL return value for
      over-sized allocations.
      
      This inconsistency leads to unwanted warnings from kmalloc_slab() for
      over-sized allocations for slab.  Returning NULL for failed allocations is
      the expected behavior.
      
      Make slub and slab code consistent by checking size >
      KMALLOC_MAX_CACHE_SIZE in slab before calling kmalloc_slab().
      
      While we are here also fix the check in kmalloc_slab().  We should check
      against KMALLOC_MAX_CACHE_SIZE rather than KMALLOC_MAX_SIZE.  It all kinda
      worked because for slab the constants are the same, and slub always checks
      the size against KMALLOC_MAX_CACHE_SIZE before kmalloc_slab().  But if we
      get there with size > KMALLOC_MAX_CACHE_SIZE anyhow bad things will
      happen.  For example, in case of a newly introduced bug in slub code.
      
      Also move the check in kmalloc_slab() from function entry to the size >
      192 case.  This partially compensates for the additional check in slab
      code and makes slub code a bit faster (at least theoretically).
      
      Also drop __GFP_NOWARN in the warning check.  This warning means a bug in
      slab code itself, user-passed flags have nothing to do with it.
      
      Nothing of this affects slob.
      
      Link: http://lkml.kernel.org/r/20180927171502.226522-1-dvyukov@gmail.comSigned-off-by: NDmitry Vyukov <dvyukov@google.com>
      Reported-by: syzbot+87829a10073277282ad1@syzkaller.appspotmail.com
      Reported-by: syzbot+ef4e8fc3a06e9019bb40@syzkaller.appspotmail.com
      Reported-by: syzbot+6e438f4036df52cbb863@syzkaller.appspotmail.com
      Reported-by: syzbot+8574471d8734457d98aa@syzkaller.appspotmail.com
      Reported-by: syzbot+af1504df0807a083dbd9@syzkaller.appspotmail.com
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61448479
  19. 13 6月, 2018 1 次提交
    • K
      treewide: kzalloc() -> kcalloc() · 6396bb22
      Kees Cook 提交于
      The kzalloc() function has a 2-factor argument form, kcalloc(). This
      patch replaces cases of:
      
              kzalloc(a * b, gfp)
      
      with:
              kcalloc(a * b, gfp)
      
      as well as handling cases of:
      
              kzalloc(a * b * c, gfp)
      
      with:
      
              kzalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kzalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kzalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kzalloc
      + kcalloc
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc(C1 * C2 * C3, ...)
      |
        kzalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc(sizeof(THING) * C2, ...)
      |
        kzalloc(sizeof(TYPE) * C2, ...)
      |
        kzalloc(C1 * C2 * C3, ...)
      |
        kzalloc(C1 * C2, ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: NKees Cook <keescook@chromium.org>
      6396bb22