  1. 18 Aug 2018, 1 commit
  2. 11 Aug 2018, 1 commit
  3. 06 Aug 2018, 2 commits
    • PM / reboot: Eliminate race between reboot and suspend · 55f2503c
      Authored by Pingfan Liu
      At present, "systemctl suspend" and "shutdown" can run in parallel. A
      system can suspend after devices_shutdown() and then resume, after which
      the shutdown task goes on to power off. This leaves many devices not
      properly shut off. Hence, replace reboot_mutex with
      system_transition_mutex (renamed from pm_mutex) to achieve the
      exclusion; the new name better reflects the purpose of the mutex.
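      
      A minimal sketch of the exclusion this achieves (illustrative
      kernel-style C based on the commit description, with both paths heavily
      simplified; not the verbatim diff):
      
      	/* one lock for every system-wide transition; was pm_mutex */
      	static DEFINE_MUTEX(system_transition_mutex);
      
      	/* reboot/poweroff path (kernel/reboot.c, simplified) */
      	void kernel_power_off(void)
      	{
      		/* serializes against suspend: no resume can sneak in
      		 * between devices_shutdown() and machine_power_off() */
      		mutex_lock(&system_transition_mutex);
      		kernel_shutdown_prepare(SYSTEM_POWER_OFF);
      		machine_power_off();
      	}
      
      	/* suspend path (kernel/power/suspend.c, simplified) */
      	int pm_suspend(suspend_state_t state)
      	{
      		int error;
      
      		mutex_lock(&system_transition_mutex);
      		error = enter_state(state);
      		mutex_unlock(&system_transition_mutex);
      		return error;
      	}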
      Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • mm: Allow non-direct-map arguments to free_reserved_area() · 0d834328
      Authored by Dave Hansen
      free_reserved_area() takes pointers as arguments to show which addresses
      should be freed.  However, it does this in a somewhat ambiguous way.  If it
      gets a kernel direct map address, it always works.  However, if it gets an
      address that is part of the kernel image alias mapping, it can fail.
      
      It fails if all of the following happen:
       * The specified address is part of the kernel image alias
       * Poisoning is requested (forcing a memset())
       * The address is in a read-only portion of the kernel image
      
      The memset() fails on the read-only mapping, of course.
      free_reserved_area() *is* called both on the direct map and on kernel image
      alias addresses.  We've just lucked out thus far that the kernel image
      alias areas it gets used on are read-write.  I'm fairly sure this has been
      just a happy accident.
      
      It is quite easy to make free_reserved_area() work for all cases: just
      convert the address to a direct map address before doing the memset(), and
      do this unconditionally.  There is little chance of a regression here
      because we previously did a virt_to_page() on the address for the memset,
      so we know these are not highmem pages for which virt_to_page() would fail.
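      
      A sketch of the resulting loop (close to the actual change, but
      reconstructed from the description above rather than quoted):
      
      	unsigned long free_reserved_area(void *start, void *end, int poison, char *s)
      	{
      		void *pos;
      		unsigned long pages = 0;
      
      		start = (void *)PAGE_ALIGN((unsigned long)start);
      		end = (void *)((unsigned long)end & PAGE_MASK);
      		for (pos = start; pos < end; pos += PAGE_SIZE, pages++) {
      			struct page *page = virt_to_page(pos);
      			void *direct_map_addr = page_address(page);
      
      			/* Poison via the direct-map alias unconditionally:
      			 * it is read-write even when 'pos' points into a
      			 * read-only part of the kernel image mapping. */
      			if ((unsigned int)poison <= 0xFF)
      				memset(direct_map_addr, poison, PAGE_SIZE);
      
      			free_reserved_page(page);
      		}
      
      		if (pages && s)
      			pr_info("Freeing %s memory: %ldK\n",
      				s, pages << (PAGE_SHIFT - 10));
      
      		return pages;
      	}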
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180802225826.1287AE3E@viggo.jf.intel.com
  4. 03 Aug 2018, 10 commits
  5. 02 Aug 2018, 2 commits
    • kconfig: add a "Memory Management options" menu · 59e0b520
      Authored by Christoph Hellwig
      This moves all the options under a proper menu.
      
      Based on a patch from Randy Dunlap.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Randy Dunlap <rdunlap@infradead.org>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
    • mm: delete historical BUG from zap_pmd_range() · 53406ed1
      Authored by Hugh Dickins
      Delete the old VM_BUG_ON_VMA() from zap_pmd_range(), which asserted
      that mmap_sem must be held when splitting an "anonymous" vma there.
      Whether that's still strictly true nowadays is not entirely clear,
      but the danger of sometimes crashing on the BUG is now fairly clear.
      
      Even with the new stricter rules for anonymous vma marking, the
      condition it checks for can possibly trigger. Commit 44960f2a
      ("staging: ashmem: Fix SIGBUS crash when traversing mmaped ashmem
      pages") is good, and originally I thought it was safe from that
      VM_BUG_ON_VMA(), because the /dev/ashmem fd exposed to the user is
      disconnected from the vm_file in the vma, and madvise(,,MADV_REMOVE)
      insists on VM_SHARED.
      
      But after I read John's earlier mail, drawing attention to the
      vfs_fallocate() in there: I may be wrong, and I don't know if Android
      has THP in the config anyway, but it looks to me like an
      unmap_mapping_range() from ashmem's vfs_fallocate() could hit precisely
      the VM_BUG_ON_VMA(), once it's vma_is_anonymous().
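      
      For reference, the assertion being deleted sat in the huge-pmd split
      path of zap_pmd_range() and looked roughly like this (a sketch of the
      4.17-era code, not a verbatim quote):
      
      	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
      		if (next - addr != HPAGE_PMD_SIZE) {
      			/* the historical BUG being removed: */
      			VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
      				      !rwsem_is_locked(&tlb->mm->mmap_sem),
      				      vma);
      			__split_huge_pmd(vma, pmd, addr, false, NULL);
      		} else if (zap_huge_pmd(tlb, vma, pmd, addr))
      			goto next;
      	}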
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 27 Jul 2018, 4 commits
    • readahead: stricter check for bdi io_pages · dc30b96a
      Authored by Markus Stockhausen
      ondemand_readahead() checks bdi->io_pages to cap the maximum pages
      that need to be processed. This works until the readit section. If
      If we do an async-only readahead (async size = sync size) and the
      target is at the beginning of the window, we expand the request by
      another get_next_ra_size() pages. btrace for large reads shows that
      the kernel always issues a doubled-size read at the beginning of
      processing. Add an additional check for io_pages in the lower part
      of the function.
      The fix helps devices that hard limit bio pages and rely on proper
      handling of max_hw_read_sectors (e.g. older FusionIO cards). For
      that reason it could qualify for stable.
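      
      A sketch of the added cap in ondemand_readahead()'s readit path
      (approximate; names follow the 4.18-era code, not the exact diff):
      
      	readit:
      		/* An async-only readahead (async size == sync size) that
      		 * starts at the window head used to grow unconditionally;
      		 * clamp the growth to max_pages, which is already derived
      		 * from bdi->io_pages above. */
      		if (offset == ra->start && ra->size == ra->async_size) {
      			add_pages = get_next_ra_size(ra, max_pages);
      			if (ra->size + add_pages <= max_pages) {
      				ra->async_size = add_pages;
      				ra->size += add_pages;
      			} else {
      				ra->size = max_pages;
      				ra->async_size = max_pages >> 1;
      			}
      		}
      
      		return ra_submit(ra, mapping, filp);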
      
      Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
      Cc: stable@vger.kernel.org
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • zswap: re-check zswap_is_full() after do zswap_shrink() · 16e536ef
      Authored by Li Wang
      /sys/../zswap/stored_pages keeps rising in a zswap test with
      "zswap.max_pool_percent=0" parameter.  But it should not compress or
      store pages any more since there is no space in the compressed pool.
      
      Reproduce steps:
        1. Boot kernel with "zswap.enabled=1"
        2. Set the max_pool_percent to 0
            # echo 0 > /sys/module/zswap/parameters/max_pool_percent
        3. Do memory stress test to see if some pages have been compressed
            # stress --vm 1 --vm-bytes $mem_available"M" --timeout 60s
        4. Watch whether the 'stored_pages' number keeps increasing
      
      The root cause is:
      
        When zswap_max_pool_percent is set to 0 via kernel parameter,
        zswap_is_full() will always return true due to zswap_shrink().  But if
        the shrinking is able to reclaim a page successfully, the code then
        proceeds to compressing/storing another page, so the value of
        stored_pages will keep changing.
      
      To solve the issue, this patch adds a zswap_is_full() check again after
      zswap_shrink() to make sure the pool is now under max_pool_percent, and
      does not compress/store if we have reached the limit.
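      
      A sketch of the store path after the fix (simplified from the
      description above; counter names follow mm/zswap.c):
      
      	/* in zswap_frontswap_store(), simplified */
      	if (zswap_is_full()) {
      		zswap_pool_limit_hit++;
      		if (zswap_shrink()) {
      			zswap_reject_reclaim_fail++;
      			ret = -ENOMEM;
      			goto reject;
      		}
      
      		/* Reclaiming one page is no proof we are under the
      		 * limit; with max_pool_percent=0 we never will be, so
      		 * re-check before compressing/storing another page. */
      		if (zswap_is_full()) {
      			ret = -ENOMEM;
      			goto reject;
      		}
      	}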
      
      Link: http://lkml.kernel.org/r/20180530103936.17812-1-liwang@redhat.com
      Signed-off-by: Li Wang <liwang@redhat.com>
      Acked-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Huang Ying <huang.ying.caritas@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix vma_is_anonymous() false-positives · bfd40eaf
      Authored by Kirill A. Shutemov
      vma_is_anonymous() relies on ->vm_ops being NULL to detect an anonymous
      VMA.  This is unreliable, as ->mmap() may not set ->vm_ops.
      
      False-positive vma_is_anonymous() may lead to crashes:
      
      	next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
      	prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
      	pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
      	flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
      	------------[ cut here ]------------
      	kernel BUG at mm/memory.c:1422!
      	invalid opcode: 0000 [#1] SMP KASAN
      	CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
      	Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
      	01/01/2011
      	RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
      	RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
      	RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
      	RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
      	Call Trace:
      	 unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
      	 zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
      	 unmap_mapping_range_vma mm/memory.c:2792 [inline]
      	 unmap_mapping_range_tree mm/memory.c:2813 [inline]
      	 unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
      	 unmap_mapping_range+0x48/0x60 mm/memory.c:2880
      	 truncate_pagecache+0x54/0x90 mm/truncate.c:800
      	 truncate_setsize+0x70/0xb0 mm/truncate.c:826
      	 simple_setattr+0xe9/0x110 fs/libfs.c:409
      	 notify_change+0xf13/0x10f0 fs/attr.c:335
      	 do_truncate+0x1ac/0x2b0 fs/open.c:63
      	 do_sys_ftruncate+0x492/0x560 fs/open.c:205
      	 __do_sys_ftruncate fs/open.c:215 [inline]
      	 __se_sys_ftruncate fs/open.c:213 [inline]
      	 __x64_sys_ftruncate+0x59/0x80 fs/open.c:213
      	 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Reproducer:
      
      	#include <stdio.h>
      	#include <stddef.h>
      	#include <stdint.h>
      	#include <stdlib.h>
      	#include <string.h>
      	#include <sys/types.h>
      	#include <sys/stat.h>
      	#include <sys/ioctl.h>
      	#include <sys/mman.h>
      	#include <unistd.h>
      	#include <fcntl.h>
      
      	#define KCOV_INIT_TRACE			_IOR('c', 1, unsigned long)
      	#define KCOV_ENABLE			_IO('c', 100)
      	#define KCOV_DISABLE			_IO('c', 101)
      	#define COVER_SIZE			(1024<<10)
      
      	#define KCOV_TRACE_PC  0
      	#define KCOV_TRACE_CMP 1
      
      	int main(int argc, char **argv)
      	{
      		int fd;
      		unsigned long *cover;
      
      		system("mount -t debugfs none /sys/kernel/debug");
      		fd = open("/sys/kernel/debug/kcov", O_RDWR);
      		ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
      		cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
      				PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      		munmap(cover, COVER_SIZE * sizeof(unsigned long));
      		cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
      				PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
      		memset(cover, 0, COVER_SIZE * sizeof(unsigned long));
      		ftruncate(fd, 3UL << 20);
      		return 0;
      	}
      
      This can be fixed by assigning anonymous VMAs their own vm_ops and not
      relying on ->vm_ops being NULL.
      
      If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
      dummy_vm_ops.  This way we will have a non-NULL ->vm_ops for all VMAs.
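      
      A sketch of the helpers involved (close to the 4.18 include/linux/mm.h,
      but reconstructed here rather than quoted):
      
      	/* non-NULL vm_ops for VMAs that never declared themselves anonymous */
      	static const struct vm_operations_struct dummy_vm_ops = {};
      
      	static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
      	{
      		vma->vm_mm = mm;
      		vma->vm_ops = &dummy_vm_ops;	/* non-NULL by default */
      		INIT_LIST_HEAD(&vma->anon_vma_chain);
      	}
      
      	/* only this explicit opt-in makes vma_is_anonymous() true */
      	static inline void vma_set_anonymous(struct vm_area_struct *vma)
      	{
      		vma->vm_ops = NULL;
      	}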
      
      Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use vma_init() to initialize VMAs on stack and data segments · 2c4541e2
      Authored by Kirill A. Shutemov
      Make sure to initialize all VMAs properly, not only those which come
      from vm_area_cachep.
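      
      For example, a VMA built on the stack now goes through the same helper
      as heap-allocated ones (illustrative use, not a specific call site from
      the patch):
      
      	struct vm_area_struct vma;
      
      	vma_init(&vma, mm);	/* instead of memset()/ad-hoc field setup */
      	vma.vm_flags = VM_READ | VM_WRITE;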
      
      Link: http://lkml.kernel.org/r/20180724121139.62570-3-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 22 Jul 2018, 6 commits
  8. 17 Jul 2018, 3 commits
  9. 15 Jul 2018, 4 commits
    • mm: do not bug_on on incorrect length in __mm_populate() · bb177a73
      Authored by Michal Hocko
      syzbot has noticed that a specially crafted library can easily hit
      VM_BUG_ON in __mm_populate
      
        kernel BUG at mm/gup.c:1242!
        invalid opcode: 0000 [#1] SMP
        CPU: 2 PID: 9667 Comm: a.out Not tainted 4.18.0-rc3 #644
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
        RIP: 0010:__mm_populate+0x1e2/0x1f0
        Code: 55 d0 65 48 33 14 25 28 00 00 00 89 d8 75 21 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 75 18 f1 ff 0f 0b e8 6e 18 f1 ff <0f> 0b 31 db eb c9 e8 93 06 e0 ff 0f 1f 00 55 48 89 e5 53 48 89 fb
        Call Trace:
           vm_brk_flags+0xc3/0x100
           vm_brk+0x1f/0x30
           load_elf_library+0x281/0x2e0
           __ia32_sys_uselib+0x170/0x1e0
           do_fast_syscall_32+0xca/0x420
           entry_SYSENTER_compat+0x70/0x7f
      
      The reason is that the length of the new brk is not page-aligned when we
      try to populate it.  There is no reason to bug on that though.
      do_brk_flags already aligns the length properly so the mapping is
      expanded as it should.  All we need is to tell mm_populate about it.
      Besides that, there is absolutely no reason to bug_on in the first
      place.  The worst thing that could happen is that the last page wouldn't
      get populated, and that is far from putting the system into an
      inconsistent state.
      
      Fix the issue by moving the length sanitization code from do_brk_flags
      up to vm_brk_flags.  The only other caller of do_brk_flags is the brk
      syscall entry, and it makes sure to provide the proper length, so there
      is no need for sanitization there and we can use do_brk_flags without it.
      
      Also remove the bogus BUG_ONs.
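      
      A sketch of vm_brk_flags() after the move (simplified; the real
      function also handles mmap_sem locking and userfaultfd bookkeeping):
      
      	int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
      	{
      		unsigned long len;
      		bool populate;
      		int ret;
      
      		/* Sanitize the length here, once, so do_brk_flags() and
      		 * mm_populate() agree on the size of the mapping. */
      		len = PAGE_ALIGN(request);
      		if (len < request)
      			return -ENOMEM;
      		if (!len)
      			return 0;
      
      		ret = do_brk_flags(addr, len, flags);	/* BUG_ONs gone */
      		populate = (current->mm->def_flags & VM_LOCKED) != 0;
      		if (populate && !ret)
      			mm_populate(addr, len);		/* page-aligned */
      		return ret;
      	}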
      
      [osalvador@techadventures.net: fix up vm_brk_flags s@request@len@]
      Link: http://lkml.kernel.org/r/20180706090217.GI32658@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: syzbot <syzbot+5dcb560fe12aa5091c06@syzkaller.appspotmail.com>
      Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock.c: do not complain about top-down allocations for !MEMORY_HOTREMOVE · e3d301ca
      Authored by Michal Hocko
      Mike Rapoport is converting architectures from the bootmem to the
      nobootmem allocator.  While doing so for m68k, Geert noticed that he
      gets a scary-looking warning:
      
        WARNING: CPU: 0 PID: 0 at mm/memblock.c:230
        memblock_find_in_range_node+0x11c/0x1be
        memblock: bottom-up allocation failed, memory hotunplug may be affected
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted
        4.18.0-rc3-atari-01343-gf2fb5f2e09a97a3c-dirty #7
        Call Trace: __warn+0xa8/0xc2
          kernel_pg_dir+0x0/0x1000
          netdev_lower_get_next+0x2/0x22
          warn_slowpath_fmt+0x2e/0x36
          memblock_find_in_range_node+0x11c/0x1be
          memblock_find_in_range_node+0x11c/0x1be
          memblock_find_in_range_node+0x0/0x1be
          vprintk_func+0x66/0x6e
          memblock_virt_alloc_internal+0xd0/0x156
          netdev_lower_get_next+0x2/0x22
          netdev_lower_get_next+0x2/0x22
          kernel_pg_dir+0x0/0x1000
          memblock_virt_alloc_try_nid_nopanic+0x58/0x7a
          netdev_lower_get_next+0x2/0x22
          kernel_pg_dir+0x0/0x1000
          kernel_pg_dir+0x0/0x1000
          EXPTBL+0x234/0x400
          EXPTBL+0x234/0x400
          alloc_node_mem_map+0x4a/0x66
          netdev_lower_get_next+0x2/0x22
          free_area_init_node+0xe2/0x29e
          EXPTBL+0x234/0x400
          paging_init+0x430/0x462
          kernel_pg_dir+0x0/0x1000
          printk+0x0/0x1a
          EXPTBL+0x234/0x400
          setup_arch+0x1b8/0x22c
          start_kernel+0x4a/0x40a
          _sinittext+0x344/0x9e8
      
      The warning is basically saying that a top-down allocation can break
      memory hotremove because memblock allocation is not movable.  But m68k
      doesn't even support MEMORY_HOTREMOVE, so there is no point in warning
      about it.
      
      Make the warning conditional on configurations that actually care.
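      
      The guard is essentially a one-liner (sketch; the message is the one
      quoted in the report above):
      
      	/* in memblock_find_in_range_node(), bottom-up fallback path */
      	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
      		WARN_ONCE(1, "memblock: bottom-up allocation failed, memory hotunplug may be affected\n");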
      
      Link: http://lkml.kernel.org/r/20180706061750.GH32658@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Sam Creasey <sammy@sammy.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: do not drop unused pages when userfaultfd is running · bce73e48
      Authored by Christian Borntraeger
      KVM guests on s390 can notify the host of unused pages.  This can result
      in pte_unused callbacks returning true for KVM guest memory.
      
      If a page is unused (checked with pte_unused) we might drop this page
      instead of paging it out.  This can have side effects on userfaultfd
      when the page in question was already migrated:
      
      The next access of that page will trigger a fault and a userfault
      instead of faulting in a new and empty zero page.  As QEMU does not
      expect a userfault on an already migrated page, this migration will
      fail.
      
      The most straightforward solution is to ignore the pte_unused hint if a
      userfault context is active for this VMA.
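      
      A sketch of the resulting check in the unmap path (simplified from the
      description above; try_to_unmap_one() in mm/rmap.c):
      
      	if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
      		/* The guest told us this page holds no data, so drop it
      		 * without writeback.  With an armed userfaultfd we must
      		 * not, or a post-migration access would raise a userfault
      		 * QEMU does not expect. */
      		dec_mm_counter(mm, mm_counter(page));
      		goto discard;
      	}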
      
      Link: http://lkml.kernel.org/r/20180703171854.63981-1-borntraeger@de.ibm.com
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Cornelia Huck <cohuck@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: zero unavailable pages before memmap init · e181ae0c
      Authored by Pavel Tatashin
      We must zero struct pages for memory that is not backed by physical
      memory, or that the kernel does not have access to.
      
      Recently, there was a change which zeroed all memmap for all holes in
      e820.  Unfortunately, it introduced a bug that is discussed here:
      
        https://www.spinics.net/lists/linux-mm/msg156764.html
      
      Linus also saw this bug on his machine, and confirmed that reverting
      commit 124049de ("x86/e820: put !E820_TYPE_RAM regions into
      memblock.reserved") fixes the issue.
      
      The problem is that we incorrectly zero some struct pages after they
      were setup.
      
      The fix is to zero unavailable struct pages prior to the initialization
      of struct pages.
      
      A more detailed fix should come later that would avoid double zeroing
      cases: one in __init_single_page(), the other one in
      zero_resv_unavail().
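      
      A sketch of the reordering (free_area_init_nodes(), simplified):
      
      	void __init free_area_init_nodes(unsigned long *max_zone_pfn)
      	{
      		int nid;
      
      		/* Zero struct pages for unavailable ranges first, so the
      		 * regular memmap init below is never zeroed over after it
      		 * has already set pages up.  Previously this ran last. */
      		zero_resv_unavail();
      
      		for_each_online_node(nid)
      			free_area_init_node(nid, NULL,
      					    find_min_pfn_for_node(nid), NULL);
      	}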
      
      Fixes: 124049de ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved")
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 12 Jul 2018, 3 commits
  11. 09 Jul 2018, 4 commits