1. 03 Aug, 2018 (1 commit)
  2. 02 Aug, 2018 (1 commit)
    • mm: delete historical BUG from zap_pmd_range() · 53406ed1
      Authored by Hugh Dickins
      Delete the old VM_BUG_ON_VMA() from zap_pmd_range(), which asserted
      that mmap_sem must be held when splitting an "anonymous" vma there.
      Whether that's still strictly true nowadays is not entirely clear,
      but the danger of sometimes crashing on the BUG is now fairly clear.
      
      Even with the new stricter rules for anonymous vma marking, the
      condition it checks for can possibly trigger. Commit 44960f2a
      ("staging: ashmem: Fix SIGBUS crash when traversing mmaped ashmem
      pages") is good, and originally I thought it was safe from that
      VM_BUG_ON_VMA(), because the /dev/ashmem fd exposed to the user is
      disconnected from the vm_file in the vma, and madvise(,,MADV_REMOVE)
      insists on VM_SHARED.
      
      But after I read John's earlier mail, drawing attention to the
      vfs_fallocate() in there: I may be wrong, and I don't know if Android
      has THP in the config anyway, but it looks to me like an
      unmap_mapping_range() from ashmem's vfs_fallocate() could hit precisely
      the VM_BUG_ON_VMA(), once it's vma_is_anonymous().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 27 Jul, 2018 (3 commits)
    • zswap: re-check zswap_is_full() after do zswap_shrink() · 16e536ef
      Authored by Li Wang
      /sys/../zswap/stored_pages keeps rising in a zswap test with
      "zswap.max_pool_percent=0" parameter.  But it should not compress or
      store pages any more since there is no space in the compressed pool.
      
      Reproduce steps:
        1. Boot kernel with "zswap.enabled=1"
        2. Set the max_pool_percent to 0
            # echo 0 > /sys/module/zswap/parameters/max_pool_percent
        3. Do memory stress test to see if some pages have been compressed
            # stress --vm 1 --vm-bytes $mem_available"M" --timeout 60s
        4. Watch whether the 'stored_pages' number keeps increasing
      
      The root cause is:
      
        When zswap_max_pool_percent is set to 0 via kernel parameter,
        zswap_is_full() will always return true, which triggers
        zswap_shrink().  But if the shrinking is able to reclaim a page
        successfully, the code then proceeds to compressing/storing another
        page, so the value of stored_pages will keep changing.
      
      To solve the issue, this patch adds a zswap_is_full() check again
      after zswap_shrink(), to make sure we are now under the
      max_pool_percent, and does not compress/store if the limit has been
      reached.
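
      A minimal sketch of the resulting store-path logic (paraphrased from
      the description above, not the literal diff):

      	if (zswap_is_full()) {
      		if (zswap_shrink()) {
      			ret = -ENOMEM;
      			goto reject;
      		}
      		/* re-check: reclaiming one page is not enough when the
      		 * pool limit was lowered, e.g. max_pool_percent=0 */
      		if (zswap_is_full()) {
      			ret = -ENOMEM;
      			goto reject;
      		}
      	}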
      
      Link: http://lkml.kernel.org/r/20180530103936.17812-1-liwang@redhat.com
      Signed-off-by: Li Wang <liwang@redhat.com>
      Acked-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Huang Ying <huang.ying.caritas@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix vma_is_anonymous() false-positives · bfd40eaf
      Authored by Kirill A. Shutemov
      vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous
      VMA.  This is unreliable as ->mmap may not set ->vm_ops.
      
      False-positive vma_is_anonymous() may lead to crashes:
      
      	next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
      	prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
      	pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
      	flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
      	------------[ cut here ]------------
      	kernel BUG at mm/memory.c:1422!
      	invalid opcode: 0000 [#1] SMP KASAN
      	CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
      	Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
      	01/01/2011
      	RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
      	RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
      	RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
      	RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
      	Call Trace:
      	 unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
      	 zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
      	 unmap_mapping_range_vma mm/memory.c:2792 [inline]
      	 unmap_mapping_range_tree mm/memory.c:2813 [inline]
      	 unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
      	 unmap_mapping_range+0x48/0x60 mm/memory.c:2880
      	 truncate_pagecache+0x54/0x90 mm/truncate.c:800
      	 truncate_setsize+0x70/0xb0 mm/truncate.c:826
      	 simple_setattr+0xe9/0x110 fs/libfs.c:409
      	 notify_change+0xf13/0x10f0 fs/attr.c:335
      	 do_truncate+0x1ac/0x2b0 fs/open.c:63
      	 do_sys_ftruncate+0x492/0x560 fs/open.c:205
      	 __do_sys_ftruncate fs/open.c:215 [inline]
      	 __se_sys_ftruncate fs/open.c:213 [inline]
      	 __x64_sys_ftruncate+0x59/0x80 fs/open.c:213
      	 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Reproducer:
      
      	#include <stdio.h>
      	#include <stddef.h>
      	#include <stdint.h>
      	#include <stdlib.h>
      	#include <string.h>
      	#include <sys/types.h>
      	#include <sys/stat.h>
      	#include <sys/ioctl.h>
      	#include <sys/mman.h>
      	#include <unistd.h>
      	#include <fcntl.h>
      
      	#define KCOV_INIT_TRACE			_IOR('c', 1, unsigned long)
      	#define KCOV_ENABLE			_IO('c', 100)
      	#define KCOV_DISABLE			_IO('c', 101)
      	#define COVER_SIZE			(1024<<10)
      
      	#define KCOV_TRACE_PC  0
      	#define KCOV_TRACE_CMP 1
      
      	int main(int argc, char **argv)
      	{
      		int fd;
      		unsigned long *cover;
      
      	system("mount -t debugfs none /sys/kernel/debug");
      	fd = open("/sys/kernel/debug/kcov", O_RDWR);
      	ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
      	cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
      			PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      	munmap(cover, COVER_SIZE * sizeof(unsigned long));
      	/* file-backed mapping, but kcov's ->mmap() leaves vma->vm_ops
      	 * NULL, so vma_is_anonymous() is fooled */
      	cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
      			PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
      	memset(cover, 0, COVER_SIZE * sizeof(unsigned long));
      	/* truncating the file unmaps the VMA and trips the BUG */
      	ftruncate(fd, 3UL << 20);
      		return 0;
      	}
      
      This can be fixed by assigning anonymous VMAs their own vm_ops and not
      relying on it being NULL.
      
      If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
      dummy_vm_ops.  This way we will have non-NULL ->vm_ops for all VMAs.
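
      A minimal sketch of that arrangement (assuming dummy_vm_ops is simply
      an empty const operations struct, as its name suggests; this is not
      the literal upstream hunk):

      	static const struct vm_operations_struct dummy_vm_ops = {};

      	/* in mmap_region(), after ->mmap() has been called: */
      	if (!vma->vm_ops)
      		vma->vm_ops = &dummy_vm_ops;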
      
      Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use vma_init() to initialize VMAs on stack and data segments · 2c4541e2
      Authored by Kirill A. Shutemov
      Make sure to initialize all VMAs properly, not only those which come
      from vm_area_cachep.
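
      A sketch of what such a common initializer plausibly looks like (my
      reconstruction of the vma_init() helper named in the title, not a
      quote from the patch):

      	static inline void vma_init(struct vm_area_struct *vma,
      				    struct mm_struct *mm)
      	{
      		vma->vm_mm = mm;
      		vma->vm_ops = &dummy_vm_ops;	/* non-NULL, see the commit above */
      		INIT_LIST_HEAD(&vma->anon_vma_chain);
      	}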
      
      Link: http://lkml.kernel.org/r/20180724121139.62570-3-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 22 Jul, 2018 (6 commits)
  5. 17 Jul, 2018 (1 commit)
  6. 15 Jul, 2018 (4 commits)
    • mm: do not bug_on on incorrect length in __mm_populate() · bb177a73
      Authored by Michal Hocko
      syzbot has noticed that a specially crafted library can easily hit
      VM_BUG_ON in __mm_populate
      
        kernel BUG at mm/gup.c:1242!
        invalid opcode: 0000 [#1] SMP
        CPU: 2 PID: 9667 Comm: a.out Not tainted 4.18.0-rc3 #644
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
        RIP: 0010:__mm_populate+0x1e2/0x1f0
        Code: 55 d0 65 48 33 14 25 28 00 00 00 89 d8 75 21 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 75 18 f1 ff 0f 0b e8 6e 18 f1 ff <0f> 0b 31 db eb c9 e8 93 06 e0 ff 0f 1f 00 55 48 89 e5 53 48 89 fb
        Call Trace:
           vm_brk_flags+0xc3/0x100
           vm_brk+0x1f/0x30
           load_elf_library+0x281/0x2e0
           __ia32_sys_uselib+0x170/0x1e0
           do_fast_syscall_32+0xca/0x420
           entry_SYSENTER_compat+0x70/0x7f
      
      The reason is that the length of the new brk is not page aligned when
      we try to populate it.  There is no reason to bug on that though.
      do_brk_flags already aligns the length properly so the mapping is
      expanded as it should.  All we need is to tell mm_populate about it.
      Besides that there is absolutely no reason to bug_on in the first
      place.  The worst thing that could happen is that the last page
      wouldn't get populated and that is far from putting the system into an
      inconsistent state.
      
      Fix the issue by moving the length sanitization code from do_brk_flags
      up to vm_brk_flags.  The only other caller of do_brk_flags is the brk
      syscall entry, and it makes sure to provide the proper length, so
      there is no need for sanitization and we can use do_brk_flags without
      it.
      
      Also remove the bogus BUG_ONs.
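
      A sketch of the resulting entry check in vm_brk_flags() (paraphrased;
      the exact local variable names are a guess):

      	int vm_brk_flags(unsigned long addr, unsigned long request,
      			 unsigned long flags)
      	{
      		unsigned long len = PAGE_ALIGN(request);

      		if (len < request)	/* alignment overflowed */
      			return -ENOMEM;
      		if (!len)
      			return 0;
      		...
      	}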
      
      [osalvador@techadventures.net: fix up vm_brk_flags s@request@len@]
      Link: http://lkml.kernel.org/r/20180706090217.GI32658@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: syzbot <syzbot+5dcb560fe12aa5091c06@syzkaller.appspotmail.com>
      Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock.c: do not complain about top-down allocations for !MEMORY_HOTREMOVE · e3d301ca
      Authored by Michal Hocko
      Mike Rapoport is converting architectures from the bootmem to the
      nobootmem allocator.  While doing so for m68k, Geert noticed that he
      gets a scary-looking warning:
      
        WARNING: CPU: 0 PID: 0 at mm/memblock.c:230
        memblock_find_in_range_node+0x11c/0x1be
        memblock: bottom-up allocation failed, memory hotunplug may be affected
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted
        4.18.0-rc3-atari-01343-gf2fb5f2e09a97a3c-dirty #7
        Call Trace: __warn+0xa8/0xc2
          kernel_pg_dir+0x0/0x1000
          netdev_lower_get_next+0x2/0x22
          warn_slowpath_fmt+0x2e/0x36
          memblock_find_in_range_node+0x11c/0x1be
          memblock_find_in_range_node+0x11c/0x1be
          memblock_find_in_range_node+0x0/0x1be
          vprintk_func+0x66/0x6e
          memblock_virt_alloc_internal+0xd0/0x156
          netdev_lower_get_next+0x2/0x22
          netdev_lower_get_next+0x2/0x22
          kernel_pg_dir+0x0/0x1000
          memblock_virt_alloc_try_nid_nopanic+0x58/0x7a
          netdev_lower_get_next+0x2/0x22
          kernel_pg_dir+0x0/0x1000
          kernel_pg_dir+0x0/0x1000
          EXPTBL+0x234/0x400
          EXPTBL+0x234/0x400
          alloc_node_mem_map+0x4a/0x66
          netdev_lower_get_next+0x2/0x22
          free_area_init_node+0xe2/0x29e
          EXPTBL+0x234/0x400
          paging_init+0x430/0x462
          kernel_pg_dir+0x0/0x1000
          printk+0x0/0x1a
          EXPTBL+0x234/0x400
          setup_arch+0x1b8/0x22c
          start_kernel+0x4a/0x40a
          _sinittext+0x344/0x9e8
      
      The warning is basically saying that a top-down allocation can break
      memory hotremove because memblock allocation is not movable.  But m68k
      doesn't even support MEMORY_HOTREMOVE, so there is no point in warning
      about it.

      Make the warning conditional on configurations that actually care.
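
      One plausible shape of the fix, assuming the message comes from a
      WARN_ONCE() in memblock_find_in_range_node() as the trace suggests:

      	WARN_ONCE(IS_ENABLED(CONFIG_MEMORY_HOTREMOVE),
      		  "memblock: bottom-up allocation failed, memory hotunplug may be affected\n");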
      
      Link: http://lkml.kernel.org/r/20180706061750.GH32658@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Sam Creasey <sammy@sammy.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: do not drop unused pages when userfaultfd is running · bce73e48
      Authored by Christian Borntraeger
      KVM guests on s390 can notify the host of unused pages.  This can
      result in the pte_unused callback returning true for KVM guest memory.

      If a page is unused (checked with pte_unused) we might drop this page
      instead of paging it out.  This can have side effects on userfaultfd,
      when the page in question was already migrated:
      
      The next access of that page will trigger a fault and a user fault
      instead of faulting in a new and empty zero page.  As QEMU does not
      expect a userfault on an already migrated page this migration will fail.
      
      The most straightforward solution is to ignore the pte_unused hint if a
      userfault context is active for this VMA.
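
      A sketch of that check as it would sit in the reclaim path (the
      userfaultfd_armed() helper is the existing test for a registered
      context on a VMA):

      	} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
      		/* safe to drop the page: no userfaultfd is watching */
      		...
      	}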
      
      Link: http://lkml.kernel.org/r/20180703171854.63981-1-borntraeger@de.ibm.com
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Cornelia Huck <cohuck@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: zero unavailable pages before memmap init · e181ae0c
      Authored by Pavel Tatashin
      We must zero struct pages for memory that is not backed by physical
      memory, or that the kernel does not have access to.
      
      Recently, there was a change which zeroed all memmap for all holes in
      e820.  Unfortunately, it introduced a bug that is discussed here:
      
        https://www.spinics.net/lists/linux-mm/msg156764.html
      
      Linus also saw this bug on his machine, and confirmed that reverting
      commit 124049de ("x86/e820: put !E820_TYPE_RAM regions into
      memblock.reserved") fixes the issue.
      
      The problem is that we incorrectly zero some struct pages after they
      were setup.
      
      The fix is to zero unavailable struct pages prior to the
      initialization of struct pages.
      
      A more detailed fix should come later that would avoid double zeroing
      cases: one in __init_single_page(), the other one in
      zero_resv_unavail().
      
      Fixes: 124049de ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved")
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 04 Jul, 2018 (3 commits)
  8. 29 Jun, 2018 (2 commits)
    • slub: fix failure when we delete and create a slab cache · d50d82fa
      Authored by Mikulas Patocka
      In kernel 4.17 I removed some code from dm-bufio that did slab cache
      merging (commit 21bb1327: "dm bufio: remove code that merges slab
      caches") - both slab and slub support merging caches with identical
      attributes, so dm-bufio now just calls kmem_cache_create and relies on
      implicit merging.
      
      This uncovered a bug in the slub subsystem - if we delete a cache and
      immediately create another cache with the same attributes, it fails
      because of a duplicate filename in /sys/kernel/slab/.  The slub
      subsystem offloads freeing the cache to a workqueue - and if we create
      the new cache before the workqueue runs, it complains because of a
      duplicate filename in sysfs.
      
      This patch fixes the bug by moving the call of kobject_del from
      sysfs_slab_remove_workfn to shutdown_cache.  kobject_del must be called
      while we hold slab_mutex - so that the sysfs entry is deleted before a
      cache with the same attributes could be created.
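
      In outline (a sketch of the new ordering, not the literal diff; only
      kobject_del, slab_mutex and the asynchronous workqueue are taken from
      the text above):

      	/* shutdown_cache(), running with slab_mutex held */
      	kobject_del(&s->kobj);	/* sysfs name is freed synchronously */
      	/* ... the rest of the kobject teardown stays on the workqueue */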
      
      Running device-mapper-test-suite with:
      
        dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/
      
      triggered:
      
        Buffer I/O error on dev dm-0, logical block 1572848, async page read
        device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
        device-mapper: thin: 253:1: aborting current metadata transaction
        sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
        CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
        Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
        Workqueue: dm-thin do_worker [dm_thin_pool]
        Call Trace:
         dump_stack+0x5a/0x73
         sysfs_warn_dup+0x58/0x70
         sysfs_create_dir_ns+0x77/0x80
         kobject_add_internal+0xba/0x2e0
         kobject_init_and_add+0x70/0xb0
         sysfs_slab_add+0xb1/0x250
         __kmem_cache_create+0x116/0x150
         create_cache+0xd9/0x1f0
         kmem_cache_create_usercopy+0x1c1/0x250
         kmem_cache_create+0x18/0x20
         dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
         dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
         __create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
         dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
         metadata_operation_failed+0x59/0x100 [dm_thin_pool]
         alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
         process_cell+0x2a3/0x550 [dm_thin_pool]
         do_worker+0x28d/0x8f0 [dm_thin_pool]
         process_one_work+0x171/0x370
         worker_thread+0x49/0x3f0
         kthread+0xf8/0x130
         ret_from_fork+0x35/0x40
        kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
        kmem_cache_create(dm_bufio_buffer-16) failed with error -17
      
      Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reported-by: Mike Snitzer <snitzer@redhat.com>
      Tested-by: Mike Snitzer <snitzer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "mm/vmstat.c: fix vmstat_update() preemption BUG" · 28557cc1
      Authored by Sebastian Andrzej Siewior
      Revert commit c7f26ccf ("mm/vmstat.c: fix vmstat_update() preemption
      BUG").  Steven saw a "using smp_processor_id() in preemptible" message
      and added a preempt_disable() section around it to keep it quiet.
      This is not the right thing to do; it does not fix the real problem.
      
      vmstat_update() is invoked by a kworker on a specific CPU.  This
      worker is bound to that CPU.  The name of the worker was "kworker/1:1"
      so it should have been a worker which was bound to CPU1.  A worker
      which can run on any CPU would have a `u' before the first digit.
      
      smp_processor_id() can be used in a preempt-enabled region as long as
      the task is bound to a single CPU, which is the case here.  If it
      could run on an arbitrary CPU then this is the problem we have and
      should seek to resolve.
      
      It is not only this smp_processor_id() that must not be migrated to
      another CPU, but also refresh_cpu_vm_stats(), which might otherwise
      access the wrong per-CPU variables.  Not to mention that other code
      relies on the fact that such a worker runs on one specific CPU only.
      
      Therefore revert that commit and instead look into what broke the
      affinity mask of the kworker.
      
      Link: http://lkml.kernel.org/r/20180504104451.20278-1-bigeasy@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven J. Hill <steven.hill@cavium.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 23 Jun, 2018 (1 commit)
    • bdi: Fix another oops in wb_workfn() · 3ee7e869
      Authored by Jan Kara
      syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
      wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
      WB_shutting_down after wb->bdi->dev became NULL.  This indicates that
      unregister_bdi() failed to call wb_shutdown() on one of the wb
      objects.
      
      The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
      drops bdi's reference to wb structures before going through the list of
      wbs again and calling wb_shutdown() on each of them.  This way the
      loop iterating through all wbs can easily miss a wb if that wb has
      already passed through cgwb_remove_from_bdi_list() called from
      wb_shutdown() from cgwb_release_workfn(), and as a result the bdi is
      fully shut down although wb_workfn() for this wb structure is still
      running.  In fact there are also other ways cgwb_bdi_unregister() can
      race with cgwb_release_workfn(), leading e.g. to use-after-free
      issues:
      
      CPU1                            CPU2
                                      cgwb_bdi_unregister()
                                        cgwb_kill(*slot);
      
      cgwb_release()
        queue_work(cgwb_release_wq, &wb->release_work);
      cgwb_release_workfn()
                                        wb = list_first_entry(&bdi->wb_list, ...)
                                        spin_unlock_irq(&cgwb_lock);
        wb_shutdown(wb);
        ...
        kfree_rcu(wb, rcu);
                                        wb_shutdown(wb); -> oops use-after-free
      
      We solve these issues by synchronizing writeback structure shutdown
      from cgwb_bdi_unregister() with cgwb_release_workfn() using a new
      mutex.  That way we also no longer need the synchronization using
      WB_shutting_down, as the mutex provides it for the
      CONFIG_CGROUP_WRITEBACK case, and without CONFIG_CGROUP_WRITEBACK
      wb_shutdown() can be called only once from bdi_unregister().
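
      A sketch of the synchronization (assuming the new lock is a per-bdi
      mutex; the field name here is my reading of the description, not
      quoted from the patch):

      	/* cgwb_bdi_unregister() */
      	mutex_lock(&bdi->cgwb_release_mutex);
      	/* ... kill wbs and call wb_shutdown() on each of them ... */
      	mutex_unlock(&bdi->cgwb_release_mutex);

      	/* cgwb_release_workfn() takes the same mutex around its own
      	 * wb_shutdown(), so the two shutdown paths serialize. */
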
      Reported-by: syzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 19 Jun, 2018 (1 commit)
  11. 15 Jun, 2018 (8 commits)
    • mm: fix oom_kill event handling · fe6bdfc8
      Authored by Roman Gushchin
      Commit e27be240 ("mm: memcg: make sure memory.events is uptodate
      when waking pollers") converted most of the memcg event counters to
      per-memcg atomics, which made them less confusing for a user.  The
      "oom_kill" counter remained untouched, so now it behaves differently
      than the other counters (including "oom").  This adds nothing but
      confusion.
      
      Let's fix this by adding the MEMCG_OOM_KILL event, and follow the
      MEMCG_OOM approach.
      
      This also removes a hack from count_memcg_event_mm(), introduced earlier
      specially for the OOM_KILL counter.
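
      After the change, recording an OOM kill plausibly reduces to the same
      event helper used for the other counters (a sketch; the enum value
      follows the MEMCG_OOM naming described above):

      	/* in the OOM killer path, for the victim's memcg: */
      	memcg_memory_event(memcg, MEMCG_OOM_KILL);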
      
      [akpm@linux-foundation.org: fix for droppage of memcg-replace-mm-owner-with-mm-memcg.patch]
      Link: http://lkml.kernel.org/r/20180508124637.29984-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use octal not symbolic permissions · 0825a6f9
      Authored by Joe Perches
      mm/*.c files use symbolic and octal styles for permissions.
      
      Using octal and not symbolic permissions is preferred by many as more
      readable.
      
      https://lkml.org/lkml/2016/8/2/1945
      
      Prefer the direct use of octal for permissions.
      
      Done using
      $ scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace mm/*.c
      and some typing.
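
      An illustrative before/after of one such conversion (the parameter
      name is made up, not a hunk from the patch):

      	/* before: symbolic permissions */
      	module_param(debug, int, S_IRUGO | S_IWUSR);
      	/* after: the octal equivalent (0444 | 0200) */
      	module_param(debug, int, 0644);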
      
      Before:	 $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
      44
      After:	 $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
      86
      
      Miscellanea:
      
      o Whitespace neatening around these conversions.
      
      Link: http://lkml.kernel.org/r/2e032ef111eebcd4c5952bae86763b541d373469.1522102887.git.joe@perches.com
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mremap: remove LATENCY_LIMIT from mremap to reduce the number of TLB shootdowns · 37a4094e
      Authored by Mel Gorman
      Commit 5d190420 ("mremap: fix race between mremap() and page
      cleanning") fixed races between mremap and other operations for both
      file-backed and anonymous mappings.  The file-backed case was the most
      critical as it allowed the possibility that data could be changed on a
      physical page after page_mkclean returned, which could trigger data
      loss or data integrity issues.
      
      A customer reported that the cost of the TLB flushes for anonymous
      mappings was excessive, resulting in a 30-50% drop in overall
      performance on a microbenchmark since this commit.  Unfortunately I
      neither have access to the test-case nor can I describe what it does
      other than saying that mremap operations dominate heavily.
      
      This patch removes the LATENCY_LIMIT so that TLB flushes are handled
      on a PMD boundary instead of every 64 pages, reducing the number of
      TLB shootdowns by a factor of 8 in the ideal case.  LATENCY_LIMIT was
      almost certainly used originally to limit the PTL hold times, but the
      latency savings are likely offset by the cost of IPIs in many cases.
      This patch is not reported to completely restore performance but gets
      it within an acceptable percentage.  The given metric here is simply
      described as "higher is better".
      
      Baseline that was known good
      002:  Metric:       91.05
      004:  Metric:      109.45
      008:  Metric:       73.08
      016:  Metric:       58.14
      032:  Metric:       61.09
      064:  Metric:       57.76
      128:  Metric:       55.43
      
      Current
      001:  Metric:       54.98
      002:  Metric:       56.56
      004:  Metric:       41.22
      008:  Metric:       35.96
      016:  Metric:       36.45
      032:  Metric:       35.71
      064:  Metric:       35.73
      128:  Metric:       34.96
      
      With patch
      001:  Metric:       61.43
      002:  Metric:       81.64
      004:  Metric:       67.92
      008:  Metric:       51.67
      016:  Metric:       50.47
      032:  Metric:       52.29
      064:  Metric:       50.01
      128:  Metric:       49.04
      
      So for low thread counts performance is not restored, but for larger
      numbers of threads it is closer to the "known good" baseline.

      Using a different mremap-intensive workload that is not representative
      of the real workload, there is little difference observed outside of
      noise in the headline metrics.  However, the TLB shootdowns are
      reduced by 11% on average, and at the peak TLB shootdowns were reduced
      by 21%.  Interrupts were sampled every second while the workload ran
      to get those figures.  It's known that the figures will vary as the
      non-representative load is non-deterministic.
      
      An alternative patch was posted that should have significantly reduced
      the TLB flushes but unfortunately it does not perform as well as this
      version on the customer test case.  If revisited, the two patches can
      stack on top of each other.
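
      For reference, a sketch of the copy loop in move_page_tables() that
      the patch simplifies (paraphrased; the commented line marks what is
      removed):

      	next = (old_addr + PMD_SIZE) & PMD_MASK;
      	extent = next - old_addr;
      	if (extent > old_end - old_addr)
      		extent = old_end - old_addr;
      	/* removed: if (extent > LATENCY_LIMIT) extent = LATENCY_LIMIT;
      	 * so each deferred TLB flush can now cover a full PMD's worth
      	 * of pages (512 on x86-64) instead of at most 64. */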
      
      Link: http://lkml.kernel.org/r/20180606183803.k7qaw2xnbvzshv34@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock: add missing include <linux/bootmem.h> · 69b5086b
      Authored by Mathieu Malaterre
      Commit 26f09e9b ("mm/memblock: add memblock memory allocation apis")
      introduced two new function definitions:
      
        memblock_virt_alloc_try_nid_nopanic()
        memblock_virt_alloc_try_nid()
      
      Commit ea1f5f37 ("mm: define memblock_virt_alloc_try_nid_raw")
      introduced the following function definition:
      
        memblock_virt_alloc_try_nid_raw()
      
      This commit adds an include of the header file <linux/bootmem.h> to
      provide the missing function prototypes, silencing the following gcc
      warnings (W=1):
      
        mm/memblock.c:1334:15: warning: no previous prototype for `memblock_virt_alloc_try_nid_raw' [-Wmissing-prototypes]
        mm/memblock.c:1371:15: warning: no previous prototype for `memblock_virt_alloc_try_nid_nopanic' [-Wmissing-prototypes]
        mm/memblock.c:1407:15: warning: no previous prototype for `memblock_virt_alloc_try_nid' [-Wmissing-prototypes]
      
      Link: http://lkml.kernel.org/r/20180606194144.16990-1-malat@debian.org
      Signed-off-by: Mathieu Malaterre <malat@debian.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix race between kmem_cache destroy, create and deactivate · 92ee383f
      Authored by Shakeel Butt
      The memcg kmem cache creation and deactivation (SLUB only) is
      asynchronous.  If a root kmem cache is destroyed whose memcg cache is in
      the process of creation or deactivation, the kernel may crash.
      
      Example of one such crash:
      	general protection fault: 0000 [#1] SMP PTI
      	CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp
      	...
      	Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn
      	RIP: 0010:has_cpu_slab
      	...
      	Call Trace:
      	? on_each_cpu_cond
      	__kmem_cache_shrink
      	kmemcg_cache_deact_after_rcu
      	kmemcg_deactivate_workfn
      	process_one_work
      	worker_thread
      	kthread
      	ret_from_fork+0x35/0x40
      
      To fix this race, on root kmem cache destruction, mark the cache as
      dying and flush the workqueue used for memcg kmem cache creation and
      deactivation.  SLUB's memcg kmem cache deactivation also includes an
      RCU callback, so also make sure all previously registered RCU
      callbacks have completed.
      
      [shakeelb@google.com: handle the RCU callbacks for SLUB deactivation]
        Link: http://lkml.kernel.org/r/20180611192951.195727-1-shakeelb@google.com
      [shakeelb@google.com: add more documentation, rename fields for readability]
        Link: http://lkml.kernel.org/r/20180522201336.196994-1-shakeelb@google.com
      [akpm@linux-foundation.org: fix build, per Shakeel]
      [shakeelb@google.com: v3.  Instead of refcount, flush the workqueue]
        Link: http://lkml.kernel.org/r/20180530001204.183758-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180521174116.171846-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swapfile.c: fix swap_count comment about nonexistent SWAP_HAS_CONT · 955c97f0
      Authored by Daniel Jordan
      Commit 570a335b ("swap_info: swap count continuations") introduces
      COUNT_CONTINUED but refers to it incorrectly as SWAP_HAS_CONT in a
      comment in swap_count.  Fix it.
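
      The function in question, with the corrected comment (reconstructed
      from the description; only the comment changes):

      	static inline unsigned char swap_count(unsigned char ent)
      	{
      		return ent & ~SWAP_HAS_CACHE;	/* may include COUNT_CONTINUED flag */
      	}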
      
      Link: http://lkml.kernel.org/r/20180612175919.30413-1-daniel.m.jordan@oracle.com
      Fixes: 570a335b ("swap_info: swap count continuations")
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix null pointer dereference in mem_cgroup_protected · df2a4196
      Authored by Roman Gushchin
      Shakeel reported a crash in mem_cgroup_protected(), which can be triggered
      by memcg reclaim if the legacy cgroup v1 use_hierarchy=0 mode is used:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000120
        PGD 8000001ff55da067 P4D 8000001ff55da067 PUD 1fdc7df067 PMD 0
        Oops: 0000 [#4] SMP PTI
        CPU: 0 PID: 15581 Comm: bash Tainted: G      D 4.17.0-smp-clean #5
        Hardware name: ...
        RIP: 0010:mem_cgroup_protected+0x54/0x130
        Code: 4c 8b 8e 00 01 00 00 4c 8b 86 08 01 00 00 48 8d 8a 08 ff ff ff 48 85 d2 ba 00 00 00 00 48 0f 44 ca 48 39 c8 0f 84 cf 00 00 00 <48> 8b 81 20 01 00 00 4d 89 ca 4c 39 c8 4c 0f 46 d0 4d 85 d2 74 05
        RSP: 0000:ffffabe64dfafa58 EFLAGS: 00010286
        RAX: ffff9fb6ff03d000 RBX: ffff9fb6f5b1b000 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: ffff9fb6f5b1b000 RDI: ffff9fb6f5b1b000
        RBP: ffffabe64dfafb08 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 000000000000c800 R12: ffffabe64dfafb88
        R13: ffff9fb6f5b1b000 R14: ffffabe64dfafb88 R15: ffff9fb77fffe000
        FS:  00007fed1f8ac700(0000) GS:ffff9fb6ff400000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000120 CR3: 0000001fdcf86003 CR4: 00000000001606f0
        Call Trace:
         ? shrink_node+0x194/0x510
         do_try_to_free_pages+0xfd/0x390
         try_to_free_mem_cgroup_pages+0x123/0x210
         try_charge+0x19e/0x700
         mem_cgroup_try_charge+0x10b/0x1a0
         wp_page_copy+0x134/0x5b0
         do_wp_page+0x90/0x460
         __handle_mm_fault+0x8e3/0xf30
         handle_mm_fault+0xfe/0x220
         __do_page_fault+0x262/0x500
         do_page_fault+0x28/0xd0
         ? page_fault+0x8/0x30
         page_fault+0x1e/0x30
        RIP: 0033:0x485b72
      
      The problem happens because parent_mem_cgroup() returns a NULL pointer,
      which is dereferenced later without a check.
      
      As cgroup v1 has no memory guarantee support, let's make
      mem_cgroup_protected() immediately return MEMCG_PROT_NONE, if the given
      cgroup has no parent (non-hierarchical mode is used).
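
      A sketch of the resulting early return in mem_cgroup_protected()
      (paraphrased from the description above):

      	parent = parent_mem_cgroup(memcg);
      	/* no parent means non-hierarchical mode on cgroup v1 */
      	if (!parent)
      		return MEMCG_PROT_NONE;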
      
      Link: http://lkml.kernel.org/r/20180611175418.7007-2-guro@fb.com
      Fixes: bf8d5d52 ("memcg: introduce memory.min")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Shakeel Butt <shakeelb@google.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Tested-by: John Stultz <john.stultz@linaro.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/ksm.c: ignore STABLE_FLAG of rmap_item->address in rmap_walk_ksm() · 1105a2fc
      Authored by Jia He
      On our armv8a server (QDF2400), I noticed lots of WARN_ONs caused by
      rmap_item->address not being PAGE_SIZE aligned, under memory pressure
      tests (start 20 guests and run memhog in the host).
      
        WARNING: CPU: 4 PID: 4641 at virt/kvm/arm/mmu.c:1826 kvm_age_hva_handler+0xc0/0xc8
        CPU: 4 PID: 4641 Comm: memhog Tainted: G        W 4.17.0-rc3+ #8
        Call trace:
         kvm_age_hva_handler+0xc0/0xc8
         handle_hva_to_gpa+0xa8/0xe0
         kvm_age_hva+0x4c/0xe8
         kvm_mmu_notifier_clear_flush_young+0x54/0x98
         __mmu_notifier_clear_flush_young+0x6c/0xa0
         page_referenced_one+0x154/0x1d8
         rmap_walk_ksm+0x12c/0x1d0
         rmap_walk+0x94/0xa0
         page_referenced+0x194/0x1b0
         shrink_page_list+0x674/0xc28
         shrink_inactive_list+0x26c/0x5b8
         shrink_node_memcg+0x35c/0x620
         shrink_node+0x100/0x430
         do_try_to_free_pages+0xe0/0x3a8
         try_to_free_pages+0xe4/0x230
         __alloc_pages_nodemask+0x564/0xdc0
         alloc_pages_vma+0x90/0x228
         do_anonymous_page+0xc8/0x4d0
         __handle_mm_fault+0x4a0/0x508
         handle_mm_fault+0xf8/0x1b0
         do_page_fault+0x218/0x4b8
         do_translation_fault+0x90/0xa0
         do_mem_abort+0x68/0xf0
         el0_da+0x24/0x28
      
      In rmap_walk_ksm, the rmap_item->address might still have the
      STABLE_FLAG set, so the start and end in handle_hva_to_gpa might not
      be PAGE_SIZE aligned.  Thus it will cause exceptions in
      handle_hva_to_gpa on arm64.

      This patch fixes it by ignoring (not removing) the low bits of the
      address when doing rmap_walk_ksm.
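
      One way to realize "ignore, not remove" is to mask the flag bits off
      only in the local variable used for the walk, leaving
      rmap_item->address itself untouched (a sketch; the exact mask used by
      the patch may differ):

      	/* in rmap_walk_ksm(): */
      	unsigned long addr = rmap_item->address & PAGE_MASK;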
      
      IMO, it should be backported to the stable tree.  The storm of
      WARN_ONs is very easy for me to reproduce.  More than that, I watched
      a panic (not reproducible) as follows:
      
        page:ffff7fe003742d80 count:-4871 mapcount:-2126053375 mapping: (null) index:0x0
        flags: 0x1fffc00000000000()
        raw: 1fffc00000000000 0000000000000000 0000000000000000 ffffecf981470000
        raw: dead000000000100 dead000000000200 ffff8017c001c000 0000000000000000
        page dumped because: nonzero _refcount
        CPU: 29 PID: 18323 Comm: qemu-kvm Tainted: G W 4.14.15-5.hxt.aarch64 #1
        Hardware name: <snip for confidential issues>
        Call trace:
          dump_backtrace+0x0/0x22c
          show_stack+0x24/0x2c
          dump_stack+0x8c/0xb0
          bad_page+0xf4/0x154
          free_pages_check_bad+0x90/0x9c
          free_pcppages_bulk+0x464/0x518
          free_hot_cold_page+0x22c/0x300
          __put_page+0x54/0x60
          unmap_stage2_range+0x170/0x2b4
          kvm_unmap_hva_handler+0x30/0x40
          handle_hva_to_gpa+0xb0/0xec
          kvm_unmap_hva_range+0x5c/0xd0
      
      I even injected a fault on purpose in kvm_unmap_hva_range by setting
      size=size-0x200; the call trace is similar to the above.  So I thought
      the panic was similarly caused by the root cause of the WARN_ONs.
      
      Andrea said:
      
      : It looks a straightforward safe fix, on x86 hva_to_gfn_memslot would
      : zap those bits and hide the misalignment caused by the low metadata
      : bits being erroneously left set in the address, but the arm code
      : notices when that's the last page in the memslot and the hva_end is
      : getting aligned and the size is below one page.
      :
      : I think the problem triggers in the addr += PAGE_SIZE of
      : unmap_stage2_ptes that never matches end because end is aligned but
      : addr is not.
      :
      : 	} while (pte++, addr += PAGE_SIZE, addr != end);
      :
      : x86 again only works on hva_start/hva_end after converting it to
      : gfn_start/end and that being in pfn units the bits are zapped before
      : they risk to cause trouble.
      
      Jia He said:
      
      : I've tested it myself on an arm64 server (QDF2400, 46 cpus, 96G
      : mem).  Without this patch, the WARN_ON is very easy to reproduce.
      : After this patch, I have run the same benchmark for a whole day
      : without any WARN_ONs.
      
      Link: http://lkml.kernel.org/r/1525403506-6750-1-git-send-email-hejianet@gmail.com
      Signed-off-by: Jia He <jia.he@hxt-semitech.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Jia He <hejianet@gmail.com>
      Cc: Suzuki K Poulose <Suzuki.Poulose@arm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 13 Jun, 2018 (4 commits)
    • treewide: Use array_size() in vmalloc() · 42bc47b3
      Authored by Kees Cook
      The vmalloc() function has no 2-factor argument form, so multiplication
      factors need to be wrapped in array_size(). This patch replaces cases of:
      
              vmalloc(a * b)
      
      with:
              vmalloc(array_size(a, b))
      
      as well as handling cases of:
      
              vmalloc(a * b * c)
      
      with:
      
              vmalloc(array3_size(a, b, c))
      
      This does, however, attempt to ignore constant size factors like:
      
              vmalloc(4 * 1024)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
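
      As a concrete illustration (a made-up allocation, not a hunk from the
      patch; array_size() lives in <linux/overflow.h> and saturates to
      SIZE_MAX on overflow, so vmalloc() fails cleanly instead of returning
      an undersized buffer):

      	struct sample { u32 a; u64 b; };	/* made-up element type */
      	struct sample *buf;

      	/* before: buf = vmalloc(count * sizeof(*buf)); -- may overflow */
      	buf = vmalloc(array_size(count, sizeof(*buf)));
      	if (!buf)
      		return -ENOMEM;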
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        vmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        vmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        vmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
        vmalloc(
      -	SIZE * COUNT
      +	array_size(COUNT, SIZE)
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        vmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        vmalloc(C1 * C2 * C3, ...)
      |
        vmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants.
      @@
      expression E1, E2;
      constant C1, C2;
      @@
      
      (
        vmalloc(C1 * C2, ...)
      |
        vmalloc(
      -	E1 * E2
      +	array_size(E1, E2)
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • treewide: kvzalloc() -> kvcalloc() · 778e1cdd
      Authored by Kees Cook
      The kvzalloc() function has a 2-factor argument form, kvcalloc(). This
      patch replaces cases of:
      
              kvzalloc(a * b, gfp)
      
      with:
              kvcalloc(a, b, gfp)
      
      as well as handling cases of:
      
              kvzalloc(a * b * c, gfp)
      
      with:
      
              kvzalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kvcalloc(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kvzalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
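
      As a concrete illustration (a made-up allocation, not a hunk from the
      patch; kvcalloc() takes the element count and element size separately
      and performs the overflow check itself):

      	u64 *table;

      	/* before: table = kvzalloc(nr_entries * sizeof(u64), GFP_KERNEL); */
      	table = kvcalloc(nr_entries, sizeof(u64), GFP_KERNEL);
      	if (!table)
      		return -ENOMEM;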
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kvzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kvzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kvzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kvzalloc
      + kvcalloc
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kvzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kvzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kvzalloc(C1 * C2 * C3, ...)
      |
        kvzalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvzalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvzalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kvzalloc(sizeof(THING) * C2, ...)
      |
        kvzalloc(sizeof(TYPE) * C2, ...)
      |
        kvzalloc(C1 * C2 * C3, ...)
      |
        kvzalloc(C1 * C2, ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
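
      As a hypothetical before/after illustration (struct item, n and p are
      made-up names), the rules above turn:

              p = kvzalloc(n * sizeof(struct item), GFP_KERNEL);

      into the overflow-checked two-factor form:

              p = kvcalloc(n, sizeof(struct item), GFP_KERNEL);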
      Signed-off-by: Kees Cook <keescook@chromium.org>
      778e1cdd
    • K
      treewide: kzalloc() -> kcalloc() · 6396bb22
      Authored by Kees Cook
      The kzalloc() function has a 2-factor argument form, kcalloc(). This
      patch replaces cases of:
      
              kzalloc(a * b, gfp)
      
      with:
              kcalloc(a, b, gfp)
      
      as well as handling cases of:
      
              kzalloc(a * b * c, gfp)
      
      with:
      
              kzalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kcalloc(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kzalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
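
      As a hypothetical example (struct item, n and p are made-up names),
      the conversion turns:

              p = kzalloc(n * sizeof(struct item), GFP_KERNEL);

      into the overflow-checked form:

              p = kcalloc(n, sizeof(struct item), GFP_KERNEL);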
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kzalloc
      + kcalloc
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc(C1 * C2 * C3, ...)
      |
        kzalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc(sizeof(THING) * C2, ...)
      |
        kzalloc(sizeof(TYPE) * C2, ...)
      |
        kzalloc(C1 * C2 * C3, ...)
      |
        kzalloc(C1 * C2, ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
      6396bb22
    • K
      treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Authored by Kees Cook
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
              kmalloc_array(a, b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
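
      As a hypothetical example (struct item, n and p are made-up names),
      the conversion turns:

              p = kmalloc(n * sizeof(struct item), GFP_KERNEL);

      into the overflow-checked form:

              p = kmalloc_array(n, sizeof(struct item), GFP_KERNEL);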
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
      6da2ec56
  13. 08 Jun 2018, 5 commits
    • M
      mm: kvmalloc does not fallback to vmalloc for incompatible gfp flags · ce91f6ee
      Authored by Michal Hocko
      kvmalloc warned about an incompatible gfp_mask to catch abusers
      (mostly GFP_NOFS), with the intention that this would motivate the
      authors of the code to fix those.  Linus argues that this just
      motivates people to do even
      more hacks like
      
      	if (gfp == GFP_KERNEL)
      		kvmalloc
      	else
      		kmalloc
      
      I haven't seen this happening much (Linus pointed to bucket_lock,
      which special-cases an atomic allocation, but my git foo hasn't found
      much more), but it is true that such hacks could grow in the future.
      Therefore Linus suggested simply not falling back to vmalloc for
      incompatible gfp flags and rather sticking with the kmalloc path.
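
      The resulting guard at the top of kvmalloc_node() looks roughly like
      this (a condensed sketch, not the verbatim diff):

              /*
               * vmalloc uses GFP_KERNEL for some internal allocations
               * (e.g. page tables), so only fully compatible gfp flags
               * may fall back to it; everything else stays on kmalloc.
               */
              if ((flags & GFP_KERNEL) != GFP_KERNEL)
                      return kmalloc_node(size, flags, node);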
      
      Link: http://lkml.kernel.org/r/20180601115329.27807-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Tom Herbert <tom@quantonium.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce91f6ee
    • K
      mm/shmem.c: zero out unused vma fields in shmem_pseudo_vma_init() · daa28075
      Authored by Kirill A. Shutemov
      shmem/tmpfs uses a pseudo vma to allocate pages with the correct NUMA
      policy.

      The pseudo vma doesn't have vm_page_prot set.  We are going to encode
      the encryption KeyID in vm_page_prot.  Having garbage there causes
      problems.
      
      Zero out all unused fields in the pseudo vma.
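
      A sketch of the fixed helper (condensed; the exact surrounding code
      may differ):

              static void shmem_pseudo_vma_init(struct vm_area_struct *vma,
                              struct shmem_inode_info *info, pgoff_t index)
              {
                      /* Create a pseudo vma that just contains the policy */
                      memset(vma, 0, sizeof(*vma));
                      /* Bias interleave by inode number across nodes */
                      vma->vm_pgoff = index + info->vfs_inode.i_ino;
                      vma->vm_policy =
                              mpol_shared_policy_lookup(&info->policy, index);
              }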
      
      Link: http://lkml.kernel.org/r/20180531135602.20321-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      daa28075
    • V
      mm, page_alloc: do not break __GFP_THISNODE by zonelist reset · 7810e678
      Authored by Vlastimil Babka
      In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
      allocations that can ignore memory policies.  The zonelist is obtained
      from current CPU's node.  This is a problem for __GFP_THISNODE
      allocations that want to allocate on a different node, e.g.  because the
      allocating thread has been migrated to a different CPU.
      
      This has been observed to break SLAB in our 4.4-based kernel, because
      there it relies on __GFP_THISNODE working as intended.  If a slab page
      is put on the wrong node's list, further list manipulations may
      corrupt the list, because page_to_nid() is used to determine which
      node's list_lock should be taken, and so we may take the wrong lock
      and race.
      
      The current SLAB implementation seems to be immune by luck, thanks to commit
      511e3a05 ("mm/slab: make cache_grow() handle the page allocated on
      arbitrary node") but there may be others assuming that __GFP_THISNODE
      works as promised.
      
      We can fix it by simply removing the zonelist reset completely.  There
      is actually no reason to reset it, because memory policies and cpusets
      don't affect the zonelist choice in the first place.  This was different
      when commit 183f6371 ("mm: ignore mempolicies when using
      ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
      own restricted zonelists.
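
      In code terms, the fix roughly amounts to dropping this line from the
      reset block in __alloc_pages_slowpath() (a sketch, not the verbatim
      hunk):

              /* removed: this used the *current* CPU's node and thus broke
               * __GFP_THISNODE allocations targeting a different node */
              ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);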
      
      We might consider this for 4.17 although I don't know if there's
      anything currently broken.
      
      SLAB is currently not affected, but in kernels older than 4.7 that don't
      yet have 511e3a05 ("mm/slab: make cache_grow() handle the page
      allocated on arbitrary node") it is.  That's at least 4.4 LTS.  Older
      ones I'll have to check.
      
      So stable backports should be more important, but will have to be
      reviewed carefully, as the code went through many changes.  BTW I think
      that also the ac->preferred_zoneref reset is currently useless if we
      don't also reset ac->nodemask from a mempolicy to NULL first (which we
      probably should for the OOM victims etc?), but I would leave that for a
      separate patch.
      
      Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Fixes: 183f6371 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7810e678
    • M
      userfaultfd: prevent non-cooperative events vs mcopy_atomic races · df2cc96e
      Authored by Mike Rapoport
      If a process monitored with userfaultfd changes its memory mappings or
      forks() at the same time as the uffd monitor fills the process memory
      with UFFDIO_COPY, the actual creation of page table entries and
      copying of the data in mcopy_atomic may happen either before or after
      the memory mapping modifications, and there is no way for the uffd
      monitor to maintain a consistent view of the process memory layout.
      
      For instance, let's consider fork() running in parallel with
      userfaultfd_copy():
      
      process                          |  uffd monitor
      ---------------------------------+------------------------------
      fork()                           |  userfaultfd_copy()
      ...                              |  ...
          dup_mmap()                   |      down_read(mmap_sem)
          down_write(mmap_sem)         |      /* create PTEs, copy data */
              dup_uffd()               |      up_read(mmap_sem)
              copy_page_range()        |
              up_write(mmap_sem)       |
              dup_uffd_complete()      |
                  /* notify monitor */ |
      
      If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
      be present by the time copy_page_range() is called and they will appear
      in the child's memory mappings.  However, if the fork() is the first to
      take the mmap_sem, the new pages won't be mapped in the child's address
      space.
      
      If the pages are not present and the child tries to access them, the
      monitor will get a page fault notification and everything is fine.
      However, if the pages *are present*, the child can access them without
      uffd noticing, and if we then copy them into the child it'll see the
      wrong data.  Since we are talking about a background copy, we'd need
      to decide whether the pages should be copied or not regardless of #PF
      notifications.
      
      Since the userfaultfd monitor has no way to determine the order, let's
      disallow userfaultfd_copy in parallel with the non-cooperative events.
      In that case we return -EAGAIN, and the uffd monitor can
      understand that userfaultfd_copy() clashed with a non-cooperative event
      and take an appropriate action.
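
      A condensed sketch of the gate in the UFFDIO_COPY path (field name and
      return value follow the description above; surrounding code is
      elided):

              /*
               * mmap_changing is set by the fork/mremap/madvise
               * notification paths and cleared once the monitor has
               * consumed the corresponding event.
               */
              if (READ_ONCE(ctx->mmap_changing))
                      return -EAGAIN;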
      
      Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df2cc96e
    • T
      mm: memcg: allow lowering memory.swap.max below the current usage · be09102b
      Authored by Tejun Heo
      Currently an attempt to set swap.max to a value lower than the actual
      swap usage fails, which causes configuration problems as there's no way
      of lowering the configuration below the current usage short of turning
      off swap entirely.  This makes swap.max difficult to use and allows
      delegatees to lock the delegator out of reducing swap allocation.
      
      This patch updates swap_max_write() so that the limit can be lowered
      below the current usage.  It doesn't implement active reclaiming of swap
      entries for the following reasons.
      
      * mem_cgroup_swap_full() already tells the swap machinery to
        aggressively reclaim swap entries if the usage is above 50% of the
        limit, so simply lowering the limit automatically triggers gradual
        reclaim.
      
      * Forcing back swapped-out pages is likely to heavily impact the
        workload and mess up the working set.  Given that swap usually is a
        lot less valuable and less scarce, letting the existing usage
        dissipate over time through the above gradual reclaim, and as the
        pages are faulted back in, is likely the better behavior.
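
      A sketch of the updated swap_max_write() (condensed from the patch
      description; field names are assumptions based on the page_counter
      layout at the time, and the unconditional update replaces the
      previously failing limit check):

              static ssize_t swap_max_write(struct kernfs_open_file *of,
                                            char *buf, size_t nbytes,
                                            loff_t off)
              {
                      struct mem_cgroup *memcg =
                              mem_cgroup_from_css(of_css(of));
                      unsigned long max;
                      int err;

                      buf = strstrip(buf);
                      err = page_counter_memparse(buf, "max", &max);
                      if (err)
                              return err;

                      /* apply the new limit even if it is below usage */
                      xchg(&memcg->swap.limit, max);

                      return nbytes;
              }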
      
      Link: http://lkml.kernel.org/r/20180523185041.GR1718769@devbig577.frc2.facebook.com
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Rik van Riel <riel@surriel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be09102b