1. 27 10月, 2018 3 次提交
    • V
      mm, slab/slub: introduce kmalloc-reclaimable caches · 1291523f
      Vlastimil Babka 提交于
      Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
      indicates they contain objects which can be reclaimed under memory
      pressure (typically through a shrinker).  This makes the slab pages
      accounted as NR_SLAB_RECLAIMABLE in vmstat, which is reflected also the
      MemAvailable meminfo counter and in overcommit decisions.  The slab pages
      are also allocated with __GFP_RECLAIMABLE, which is good for
      anti-fragmentation through grouping pages by mobility.
      
      The generic kmalloc-X caches are created without this flag, but sometimes
      are used also for objects that can be reclaimed, which due to varying size
      cannot have a dedicated kmem cache with SLAB_RECLAIM_ACCOUNT flag.  A
      prominent example are dcache external names, which prompted the creation
      of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
      in commit f1782c9b ("dcache: account external names as indirectly
      reclaimable memory").
      
      To better handle this and any other similar cases, this patch introduces
      SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X.
      They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among
      gfp flags.  They are added to the kmalloc_caches array as a new type.
      Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type
      cache.
      
      This change only applies to SLAB and SLUB, not SLOB.  This is fine, since
      SLOB's target are tiny system and this patch does add some overhead of
      kmem management objects.
      
      Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1291523f
    • V
      mm, slab: combine kmalloc_caches and kmalloc_dma_caches · cc252eae
      Vlastimil Babka 提交于
      Patch series "kmalloc-reclaimable caches", v4.
      
      As discussed at LSF/MM [1] here's a patchset that introduces
      kmalloc-reclaimable caches (more details in the second patch) and uses
      them for dcache external names.  That allows us to repurpose the
      NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
      
      With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
      caches, eliminating the need for manual accounting.  More importantly, it
      also ensures the reclaimable kmalloc allocations are grouped in pages
      separate from the regular kmalloc allocations.  The need for proper
      accounting of dcache external names has shown it's easy for misbehaving
      process to allocate lots of them, causing premature OOMs.  Without the
      added grouping, it's likely that a similar workload can interleave the
      dcache external names allocations with regular kmalloc allocations (note:
      I haven't searched myself for an example of such regular kmalloc
      allocation, but I would be very surprised if there wasn't some).  A
      pathological case would be e.g.  one 64byte regular allocations with 63
      external dcache names in a page (64x64=4096), which means the page is not
      freed even after reclaiming after all dcache names, and the process can
      thus "steal" the whole page with single 64byte allocation.
      
      If other kmalloc users similar to dcache external names become identified,
      they can also benefit from the new functionality simply by adding
      __GFP_RECLAIMABLE to the kmalloc calls.
      
      Side benefits of the patchset (that could be also merged separately)
      include removed branch for detecting __GFP_DMA kmalloc(), and shortening
      kmalloc cache names in /proc/slabinfo output.  The latter is potentially
      an ABI break in case there are tools parsing the names and expecting the
      values to be in bytes.
      
      This is how /proc/slabinfo looks like after booting in virtme:
      
      ...
      kmalloc-rcl-4M         0      0 4194304    1 1024 : tunables    1    1    0 : slabdata      0      0      0
      ...
      kmalloc-rcl-96         7     32    128   32    1 : tunables  120   60    8 : slabdata      1      1      0
      kmalloc-rcl-64        25    128     64   64    1 : tunables  120   60    8 : slabdata      2      2      0
      kmalloc-rcl-32         0      0     32  124    1 : tunables  120   60    8 : slabdata      0      0      0
      kmalloc-4M             0      0 4194304    1 1024 : tunables    1    1    0 : slabdata      0      0      0
      kmalloc-2M             0      0 2097152    1  512 : tunables    1    1    0 : slabdata      0      0      0
      kmalloc-1M             0      0 1048576    1  256 : tunables    1    1    0 : slabdata      0      0      0
      ...
      
      /proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
      
      ...
      nr_slab_reclaimable 2817
      nr_slab_unreclaimable 1781
      ...
      nr_kernel_misc_reclaimable 0
      ...
      
      /proc/meminfo with new KReclaimable counter:
      
      ...
      Shmem:               564 kB
      KReclaimable:      11260 kB
      Slab:              18368 kB
      SReclaimable:      11260 kB
      SUnreclaim:         7108 kB
      KernelStack:        1248 kB
      ...
      
      This patch (of 6):
      
      The kmalloc caches currently mainain separate (optional) array
      kmalloc_dma_caches for __GFP_DMA allocations.  There are tests for
      __GFP_DMA in the allocation hotpaths.  We can avoid the branches by
      combining kmalloc_caches and kmalloc_dma_caches into a single
      two-dimensional array where the outer dimension is cache "type".  This
      will also allow to add kmalloc-reclaimable caches as a third type.
      
      Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc252eae
    • D
      mm: don't warn about large allocations for slab · 61448479
      Dmitry Vyukov 提交于
      Slub does not call kmalloc_slab() for sizes > KMALLOC_MAX_CACHE_SIZE,
      instead it falls back to kmalloc_large().
      
      For slab KMALLOC_MAX_CACHE_SIZE == KMALLOC_MAX_SIZE and it calls
      kmalloc_slab() for all allocations relying on NULL return value for
      over-sized allocations.
      
      This inconsistency leads to unwanted warnings from kmalloc_slab() for
      over-sized allocations for slab.  Returning NULL for failed allocations is
      the expected behavior.
      
      Make slub and slab code consistent by checking size >
      KMALLOC_MAX_CACHE_SIZE in slab before calling kmalloc_slab().
      
      While we are here also fix the check in kmalloc_slab().  We should check
      against KMALLOC_MAX_CACHE_SIZE rather than KMALLOC_MAX_SIZE.  It all kinda
      worked because for slab the constants are the same, and slub always checks
      the size against KMALLOC_MAX_CACHE_SIZE before kmalloc_slab().  But if we
      get there with size > KMALLOC_MAX_CACHE_SIZE anyhow bad things will
      happen.  For example, in case of a newly introduced bug in slub code.
      
      Also move the check in kmalloc_slab() from function entry to the size >
      192 case.  This partially compensates for the additional check in slab
      code and makes slub code a bit faster (at least theoretically).
      
      Also drop __GFP_NOWARN in the warning check.  This warning means a bug in
      slab code itself, user-passed flags have nothing to do with it.
      
      Nothing of this affects slob.
      
      Link: http://lkml.kernel.org/r/20180927171502.226522-1-dvyukov@gmail.comSigned-off-by: NDmitry Vyukov <dvyukov@google.com>
      Reported-by: syzbot+87829a10073277282ad1@syzkaller.appspotmail.com
      Reported-by: syzbot+ef4e8fc3a06e9019bb40@syzkaller.appspotmail.com
      Reported-by: syzbot+6e438f4036df52cbb863@syzkaller.appspotmail.com
      Reported-by: syzbot+8574471d8734457d98aa@syzkaller.appspotmail.com
      Reported-by: syzbot+af1504df0807a083dbd9@syzkaller.appspotmail.com
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61448479
  2. 18 8月, 2018 1 次提交
  3. 29 6月, 2018 1 次提交
    • M
      slub: fix failure when we delete and create a slab cache · d50d82fa
      Mikulas Patocka 提交于
      In kernel 4.17 I removed some code from dm-bufio that did slab cache
      merging (commit 21bb1327: "dm bufio: remove code that merges slab
      caches") - both slab and slub support merging caches with identical
      attributes, so dm-bufio now just calls kmem_cache_create and relies on
      implicit merging.
      
      This uncovered a bug in the slub subsystem - if we delete a cache and
      immediatelly create another cache with the same attributes, it fails
      because of duplicate filename in /sys/kernel/slab/.  The slub subsystem
      offloads freeing the cache to a workqueue - and if we create the new
      cache before the workqueue runs, it complains because of duplicate
      filename in sysfs.
      
      This patch fixes the bug by moving the call of kobject_del from
      sysfs_slab_remove_workfn to shutdown_cache.  kobject_del must be called
      while we hold slab_mutex - so that the sysfs entry is deleted before a
      cache with the same attributes could be created.
      
      Running device-mapper-test-suite with:
      
        dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/
      
      triggered:
      
        Buffer I/O error on dev dm-0, logical block 1572848, async page read
        device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
        device-mapper: thin: 253:1: aborting current metadata transaction
        sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
        CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
        Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
        Workqueue: dm-thin do_worker [dm_thin_pool]
        Call Trace:
         dump_stack+0x5a/0x73
         sysfs_warn_dup+0x58/0x70
         sysfs_create_dir_ns+0x77/0x80
         kobject_add_internal+0xba/0x2e0
         kobject_init_and_add+0x70/0xb0
         sysfs_slab_add+0xb1/0x250
         __kmem_cache_create+0x116/0x150
         create_cache+0xd9/0x1f0
         kmem_cache_create_usercopy+0x1c1/0x250
         kmem_cache_create+0x18/0x20
         dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
         dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
         __create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
         dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
         metadata_operation_failed+0x59/0x100 [dm_thin_pool]
         alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
         process_cell+0x2a3/0x550 [dm_thin_pool]
         do_worker+0x28d/0x8f0 [dm_thin_pool]
         process_one_work+0x171/0x370
         worker_thread+0x49/0x3f0
         kthread+0xf8/0x130
         ret_from_fork+0x35/0x40
        kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
        kmem_cache_create(dm_bufio_buffer-16) failed with error -17
      
      Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.comSigned-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Reported-by: NMike Snitzer <snitzer@redhat.com>
      Tested-by: NMike Snitzer <snitzer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d50d82fa
  4. 15 6月, 2018 2 次提交
  5. 06 4月, 2018 14 次提交
  6. 01 2月, 2018 1 次提交
  7. 16 1月, 2018 4 次提交
    • K
      usercopy: Restrict non-usercopy caches to size 0 · 6d07d1cd
      Kees Cook 提交于
      With all known usercopied cache whitelists now defined in the
      kernel, switch the default usercopy region of kmem_cache_create()
      to size 0. Any new caches with usercopy regions will now need to use
      kmem_cache_create_usercopy() instead of kmem_cache_create().
      
      This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code are
      mine and don't reflect the original grsecurity/PaX code.
      
      Cc: David Windsor <dave@nullcore.net>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-mm@kvack.org
      Signed-off-by: NKees Cook <keescook@chromium.org>
      6d07d1cd
    • D
      usercopy: Mark kmalloc caches as usercopy caches · 6c0c21ad
      David Windsor 提交于
      Mark the kmalloc slab caches as entirely whitelisted. These caches
      are frequently used to fulfill kernel allocations that contain data
      to be copied to/from userspace. Internal-only uses are also common,
      but are scattered in the kernel. For now, mark all the kmalloc caches
      as whitelisted.
      
      This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code are
      mine and don't reflect the original grsecurity/PaX code.
      Signed-off-by: NDavid Windsor <dave@nullcore.net>
      [kees: merged in moved kmalloc hunks, adjust commit log]
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-mm@kvack.org
      Cc: linux-xfs@vger.kernel.org
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NChristoph Lameter <cl@linux.com>
      6c0c21ad
    • K
      usercopy: Allow strict enforcement of whitelists · 2d891fbc
      Kees Cook 提交于
      This introduces CONFIG_HARDENED_USERCOPY_FALLBACK to control the
      behavior of hardened usercopy whitelist violations. By default, whitelist
      violations will continue to WARN() so that any bad or missing usercopy
      whitelists can be discovered without being too disruptive.
      
      If this config is disabled at build time or a system is booted with
      "slab_common.usercopy_fallback=0", usercopy whitelists will BUG() instead
      of WARN(). This is useful for admins that want to use usercopy whitelists
      immediately.
      Suggested-by: NMatthew Garrett <mjg59@google.com>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      2d891fbc
    • D
      usercopy: Prepare for usercopy whitelisting · 8eb8284b
      David Windsor 提交于
      This patch prepares the slab allocator to handle caches having annotations
      (useroffset and usersize) defining usercopy regions.
      
      This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on
      my understanding of the code. Changes or omissions from the original
      code are mine and don't reflect the original grsecurity/PaX code.
      
      Currently, hardened usercopy performs dynamic bounds checking on slab
      cache objects. This is good, but still leaves a lot of kernel memory
      available to be copied to/from userspace in the face of bugs. To further
      restrict what memory is available for copying, this creates a way to
      whitelist specific areas of a given slab cache object for copying to/from
      userspace, allowing much finer granularity of access control. Slab caches
      that are never exposed to userspace can declare no whitelist for their
      objects, thereby keeping them unavailable to userspace via dynamic copy
      operations. (Note, an implicit form of whitelisting is the use of constant
      sizes in usercopy operations and get_user()/put_user(); these bypass
      hardened usercopy checks since these sizes cannot change at runtime.)
      
      To support this whitelist annotation, usercopy region offset and size
      members are added to struct kmem_cache. The slab allocator receives a
      new function, kmem_cache_create_usercopy(), that creates a new cache
      with a usercopy region defined, suitable for declaring spans of fields
      within the objects that get copied to/from userspace.
      
      In this patch, the default kmem_cache_create() marks the entire allocation
      as whitelisted, leaving it semantically unchanged. Once all fine-grained
      whitelists have been added (in subsequent patches), this will be changed
      to a usersize of 0, making caches created with kmem_cache_create() not
      copyable to/from userspace.
      
      After the entire usercopy whitelist series is applied, less than 15%
      of the slab cache memory remains exposed to potential usercopy bugs
      after a fresh boot:
      
      Total Slab Memory:           48074720
      Usercopyable Memory:          6367532  13.2%
               task_struct                    0.2%         4480/1630720
               RAW                            0.3%            300/96000
               RAWv6                          2.1%           1408/64768
               ext4_inode_cache               3.0%       269760/8740224
               dentry                        11.1%       585984/5273856
               mm_struct                     29.1%         54912/188448
               kmalloc-8                    100.0%          24576/24576
               kmalloc-16                   100.0%          28672/28672
               kmalloc-32                   100.0%          81920/81920
               kmalloc-192                  100.0%          96768/96768
               kmalloc-128                  100.0%        143360/143360
               names_cache                  100.0%        163840/163840
               kmalloc-64                   100.0%        167936/167936
               kmalloc-256                  100.0%        339968/339968
               kmalloc-512                  100.0%        350720/350720
               kmalloc-96                   100.0%        455616/455616
               kmalloc-8192                 100.0%        655360/655360
               kmalloc-1024                 100.0%        812032/812032
               kmalloc-4096                 100.0%        819200/819200
               kmalloc-2048                 100.0%      1310720/1310720
      
      After some kernel build workloads, the percentage (mainly driven by
      dentry and inode caches expanding) drops under 10%:
      
      Total Slab Memory:           95516184
      Usercopyable Memory:          8497452   8.8%
               task_struct                    0.2%         4000/1456000
               RAW                            0.3%            300/96000
               RAWv6                          2.1%           1408/64768
               ext4_inode_cache               3.0%     1217280/39439872
               dentry                        11.1%     1623200/14608800
               mm_struct                     29.1%         73216/251264
               kmalloc-8                    100.0%          24576/24576
               kmalloc-16                   100.0%          28672/28672
               kmalloc-32                   100.0%          94208/94208
               kmalloc-192                  100.0%          96768/96768
               kmalloc-128                  100.0%        143360/143360
               names_cache                  100.0%        163840/163840
               kmalloc-64                   100.0%        245760/245760
               kmalloc-256                  100.0%        339968/339968
               kmalloc-512                  100.0%        350720/350720
               kmalloc-96                   100.0%        563520/563520
               kmalloc-8192                 100.0%        655360/655360
               kmalloc-1024                 100.0%        794624/794624
               kmalloc-4096                 100.0%        819200/819200
               kmalloc-2048                 100.0%      1257472/1257472
      Signed-off-by: NDavid Windsor <dave@nullcore.net>
      [kees: adjust commit log, split out a few extra kmalloc hunks]
      [kees: add field names to function declarations]
      [kees: convert BUGs to WARNs and fail closed]
      [kees: add attack surface reduction analysis to commit log]
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-mm@kvack.org
      Cc: linux-xfs@vger.kernel.org
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NChristoph Lameter <cl@linux.com>
      8eb8284b
  8. 16 11月, 2017 4 次提交
  9. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  10. 04 10月, 2017 1 次提交
    • J
      mm: memcontrol: use vmalloc fallback for large kmem memcg arrays · f80c7dab
      Johannes Weiner 提交于
      For quick per-memcg indexing, slab caches and list_lru structures
      maintain linear arrays of descriptors.  As the number of concurrent
      memory cgroups in the system goes up, this requires large contiguous
      allocations (8k cgroups = order-5, 16k cgroups = order-6 etc.) for every
      existing slab cache and list_lru, which can easily fail on loaded
      systems.  E.g.:
      
        mkdir: page allocation failure: order:5, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
        CPU: 1 PID: 6399 Comm: mkdir Not tainted 4.13.0-mm1-00065-g720bbe532b7c-dirty #481
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
        Call Trace:
         ? __alloc_pages_direct_compact+0x4c/0x110
         __alloc_pages_nodemask+0xf50/0x1430
         alloc_pages_current+0x60/0xc0
         kmalloc_order_trace+0x29/0x1b0
         __kmalloc+0x1f4/0x320
         memcg_update_all_list_lrus+0xca/0x2e0
         mem_cgroup_css_alloc+0x612/0x670
         cgroup_apply_control_enable+0x19e/0x360
         cgroup_mkdir+0x322/0x490
         kernfs_iop_mkdir+0x55/0x80
         vfs_mkdir+0xd0/0x120
         SyS_mkdirat+0x6c/0xe0
         SyS_mkdir+0x14/0x20
         entry_SYSCALL_64_fastpath+0x18/0xad
        Mem-Info:
        active_anon:2965 inactive_anon:19 isolated_anon:0
         active_file:100270 inactive_file:98846 isolated_file:0
         unevictable:0 dirty:0 writeback:0 unstable:0
         slab_reclaimable:7328 slab_unreclaimable:16402
         mapped:771 shmem:52 pagetables:278 bounce:0
         free:13718 free_pcp:0 free_cma:0
      
      This output is from an artificial reproducer, but we have repeatedly
      observed order-7 failures in production in the Facebook fleet.  These
      systems become useless as they cannot run more jobs, even though there
      is plenty of memory to allocate 128 individual pages.
      
      Use kvmalloc and kvzalloc to fall back to vmalloc space if these arrays
      prove too large for allocating them physically contiguous.
      
      Link: http://lkml.kernel.org/r/20170918184919.20644-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f80c7dab
  11. 07 7月, 2017 1 次提交
    • K
      mm: allow slab_nomerge to be set at build time · 7660a6fd
      Kees Cook 提交于
      Some hardened environments want to build kernels with slab_nomerge
      already set (so that they do not depend on remembering to set the kernel
      command line option).  This is desired to reduce the risk of kernel heap
      overflows being able to overwrite objects from merged caches and changes
      the requirements for cache layout control, increasing the difficulty of
      these attacks.  By keeping caches unmerged, these kinds of exploits can
      usually only damage objects in the same cache (though the risk to
      metadata exploitation is unchanged).
      
      Link: http://lkml.kernel.org/r/20170620230911.GA25238@beastSigned-off-by: NKees Cook <keescook@chromium.org>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: David Windsor <dave@nullcore.net>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: David Windsor <dave@nullcore.net>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Daniel Mack <daniel@zonque.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7660a6fd
  12. 19 4月, 2017 1 次提交
    • P
      mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU · 5f0d5a3a
      Paul E. McKenney 提交于
      A group of Linux kernel hackers reported chasing a bug that resulted
      from their assumption that SLAB_DESTROY_BY_RCU provided an existence
      guarantee, that is, that no block from such a slab would be reallocated
      during an RCU read-side critical section.  Of course, that is not the
      case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
      slab of blocks.
      
      However, there is a phrase for this, namely "type safety".  This commit
      therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
      to avoid future instances of this sort of confusion.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      [ paulmck: Add comments mentioning the old name, as requested by Eric
        Dumazet, in order to help people familiar with the old name find
        the new one. ]
      Acked-by: NDavid Rientjes <rientjes@google.com>
      5f0d5a3a
  13. 25 2月, 2017 1 次提交
    • G
      kasan: drain quarantine of memcg slab objects · f9fa1d91
      Greg Thelen 提交于
      Per memcg slab accounting and kasan have a problem with kmem_cache
      destruction.
       - kmem_cache_create() allocates a kmem_cache, which is used for
         allocations from processes running in root (top) memcg.
       - Processes running in non root memcg and allocating with either
         __GFP_ACCOUNT or from a SLAB_ACCOUNT cache use a per memcg
         kmem_cache.
       - Kasan catches use-after-free by having kfree() and kmem_cache_free()
         defer freeing of objects. Objects are placed in a quarantine.
       - kmem_cache_destroy() destroys root and non root kmem_caches. It takes
         care to drain the quarantine of objects from the root memcg's
         kmem_cache, but ignores objects associated with non root memcg. This
         causes leaks because quarantined per memcg objects refer to per memcg
         kmem cache being destroyed.
      
      To see the problem:
      
       1) create a slab cache with kmem_cache_create(,,,SLAB_ACCOUNT,)
       2) from non root memcg, allocate and free a few objects from cache
       3) dispose of the cache with kmem_cache_destroy() kmem_cache_destroy()
          will trigger a "Slab cache still has objects" warning indicating
          that the per memcg kmem_cache structure was leaked.
      
      Fix the leak by draining kasan quarantined objects allocated from non
      root memcg.
      
      Racing memcg deletion is tricky, but handled.  kmem_cache_destroy() =>
      shutdown_memcg_caches() => __shutdown_memcg_cache() => shutdown_cache()
      flushes per memcg quarantined objects, even if that memcg has been
      rmdir'd and gone through memcg_deactivate_kmem_caches().
      
      This leak only affects destroyed SLAB_ACCOUNT kmem caches when kasan is
      enabled.  So I don't think it's worth patching stable kernels.
      
      Link: http://lkml.kernel.org/r/1482257462-36948-1-git-send-email-gthelen@google.comSigned-off-by: NGreg Thelen <gthelen@google.com>
      Reviewed-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9fa1d91
  14. 23 2月, 2017 5 次提交
    • T
      slab: use memcg_kmem_cache_wq for slab destruction operations · 17cc4dfe
      Tejun Heo 提交于
      If there's contention on slab_mutex, queueing the per-cache destruction
      work item on the system_wq can unnecessarily create and tie up a lot of
      kworkers.
      
      Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq and make it
      global and use that workqueue for the destruction work items too.  While
      at it, convert the workqueue from an unbound workqueue to a per-cpu one
      with concurrency limited to 1.  It's generally preferable to use per-cpu
      workqueues and concurrency limit of 1 is safe enough.
      
      This is suggested by Joonsoo Kim.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJay Vana <jsvana@fb.com>
      Acked-by: NVladimir Davydov <vdavydov@tarantool.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      17cc4dfe
    • T
      slab: remove synchronous synchronize_sched() from memcg cache deactivation path · 01fb58bc
      Tejun Heo 提交于
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      slub uses synchronize_sched() to deactivate a memcg cache.
      synchronize_sched() is an expensive and slow operation and doesn't scale
      when a huge number of caches are destroyed back-to-back.  While there
      used to be a simple batching mechanism, the batching was too restricted
      to be helpful.
      
      This patch implements slab_deactivate_memcg_cache_rcu_sched() which slub
      can use to schedule sched RCU callback instead of performing
      synchronize_sched() synchronously while holding cgroup_mutex.  While
      this adds online cpus, mems and slab_mutex operations, operating on
      these locks back-to-back from the same kworker, which is what's gonna
      happen when there are many to deactivate, isn't expensive at all and
      this gets rid of the scalability problem completely.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-9-tj@kernel.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJay Vana <jsvana@fb.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01fb58bc
    • T
      slab: introduce __kmemcg_cache_deactivate() · c9fc5864
      Tejun Heo 提交于
      __kmem_cache_shrink() is called with %true @deactivate only for memcg
      caches.  Remove @deactivate from __kmem_cache_shrink() and introduce
      __kmemcg_cache_deactivate() instead.  Each memcg-supporting allocator
      should implement it and it should deactivate and drain the cache.
      
      This is to allow memcg cache deactivation behavior to further deviate
      from simple shrinking without messing up __kmem_cache_shrink().
      
      This is pure reorganization and doesn't introduce any observable
      behavior changes.
      
      v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9fc5864
    • T
      slab: implement slab_root_caches list · 510ded33
      Tejun Heo 提交于
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      slab_caches currently lists all caches including root and memcg ones.
      This is the only data structure which lists the root caches and
      iterating root caches can only be done by walking the list while
      skipping over memcg caches.  As there can be a huge number of memcg
      caches, this can become very expensive.
      
      This also can make /proc/slabinfo behave very badly.  seq_file processes
      reads in 4k chunks and seeks to the previous Nth position on slab_caches
      list to resume after each chunk.  With a lot of memcg cache churns on
      the list, reading /proc/slabinfo can become very slow and its content
      often ends up with duplicate and/or missing entries.
      
      This patch adds a new list slab_root_caches which lists only the root
      caches.  When memcg is not enabled, it becomes just an alias of
      slab_caches.  memcg specific list operations are collected into
      memcg_[un]link_cache().
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-7-tj@kernel.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJay Vana <jsvana@fb.com>
      Acked-by: NVladimir Davydov <vdavydov@tarantool.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      510ded33
    • T
      slab: link memcg kmem_caches on their associated memory cgroup · bc2791f8
      Tejun Heo 提交于
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      While a memcg kmem_cache is listed on its root cache's ->children list,
      there is no direct way to iterate all kmem_caches which are assocaited
      with a memory cgroup.  The only way to iterate them is walking all
      caches while filtering out caches which don't match, which would be most
      of them.
      
      This makes memcg destruction operations O(N^2) where N is the total
      number of slab caches which can be huge.  This combined with the
      synchronous RCU operations can tie up a CPU and affect the whole machine
      for many hours when memory reclaim triggers offlining and destruction of
      the stale memcgs.
      
      This patch adds mem_cgroup->kmem_caches list which goes through
      memcg_cache_params->kmem_caches_node of all kmem_caches which are
      associated with the memcg.  All memcg specific iterations, including
      stat file access, are updated to use the new list instead.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJay Vana <jsvana@fb.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc2791f8