1. 14 Apr 2017 (5 commits)
  2. 09 Apr 2017 (1 commit)
  3. 08 Apr 2017 (5 commits)
  4. 01 Apr 2017 (8 commits)
    • mm/hugetlb.c: don't call region_abort if region_chg fails · ff8c0c53
      Committed by Mike Kravetz
      Changing a hugetlbfs reservation map is a two-step process.  The first
      step is a call to region_chg to determine what needs to be changed and
      to prepare that change.  This should be followed by a call to
      region_add to commit the change, or region_abort to abort it.
      
      The error path in hugetlb_reserve_pages called region_abort after a
      failed call to region_chg.  As a result, the adds_in_progress counter in
      the reservation map is off by 1.  This is caught by a VM_BUG_ON in
      resv_map_release when the reservation map is freed.
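
      As a hedged sketch of the correct pairing (a hypothetical wrapper with
      simplified signatures; the real mm/hugetlb.c helpers take a struct
      resv_map and differ in detail):

          /* region_chg() bumps adds_in_progress and both region_add() and
           * region_abort() drop it, so aborting after a FAILED region_chg()
           * underflows the counter. */
          long region_chg(long from, long to);    /* step 1: prepare */
          long region_add(long from, long to);    /* step 2: commit */
          void region_abort(long from, long to);  /* undo a successful step 1 */

          long reserve_pages(long from, long to)
          {
                  long chg = region_chg(from, to);

                  if (chg < 0)
                          return chg;             /* nothing to abort here */
                  if (region_add(from, to) < 0) {
                          region_abort(from, to); /* pairs with the chg above */
                          return -1;
                  }
                  return 0;
          }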
      
      syzkaller fuzzer (when using an injected kmalloc failure) found this
      bug, that resulted in the following:
      
       kernel BUG at mm/hugetlb.c:742!
       Call Trace:
        hugetlbfs_evict_inode+0x7b/0xa0 fs/hugetlbfs/inode.c:493
        evict+0x481/0x920 fs/inode.c:553
        iput_final fs/inode.c:1515 [inline]
        iput+0x62b/0xa20 fs/inode.c:1542
        hugetlb_file_setup+0x593/0x9f0 fs/hugetlbfs/inode.c:1306
        newseg+0x422/0xd30 ipc/shm.c:575
        ipcget_new ipc/util.c:285 [inline]
        ipcget+0x21e/0x580 ipc/util.c:639
        SYSC_shmget ipc/shm.c:673 [inline]
        SyS_shmget+0x158/0x230 ipc/shm.c:657
        entry_SYSCALL_64_fastpath+0x1f/0xc2
       RIP: resv_map_release+0x265/0x330 mm/hugetlb.c:742
      
      Link: http://lkml.kernel.org/r/1490821682-23228-1-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kasan: report only the first error by default · b0845ce5
      Committed by Mark Rutland
      Disable kasan after the first report.  There are several reasons for
      this:
      
       - A single bug quite often causes multiple invalid memory accesses,
         resulting in a storm in the dmesg.

       - A write OOB access might corrupt metadata, so the next report will
         print bogus alloc/free stacktraces.

       - Reports after the first may well not be bugs themselves but just
         side effects of the first one.

      Given that multiple reports usually only do harm, it makes sense to
      disable kasan after the first one.  If the user wants to see all the
      reports, the kasan_multi_shot boot-time parameter must be used.
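
      A minimal userspace model of such a report-once gate (hedged; the real
      mm/kasan/report.c uses its own flag names and kernel atomics):

          #include <stdatomic.h>
          #include <stdbool.h>

          static atomic_bool report_seen;
          static bool multi_shot;                 /* kasan_multi_shot param */

          static bool report_enabled(void)
          {
                  if (multi_shot)
                          return true;    /* user asked for all reports */
                  /* only the first caller to flip the flag may report */
                  return !atomic_exchange(&report_seen, true);
          }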
      
      [aryabinin@virtuozzo.com: wrote changelog and doc, added missing include]
      Link: http://lkml.kernel.org/r/20170323154416.30257-1-aryabinin@virtuozzo.com
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix section name for .data..ro_after_init · 906f2a51
      Committed by Kees Cook
      A section name for .data..ro_after_init was added by both:
      
          commit d07a980c ("s390: add proper __ro_after_init support")
      
      and
      
          commit d7c19b06 ("mm: kmemleak: scan .data.ro_after_init")
      
      The latter adds incorrect wrapping around the existing s390 section and
      came later.  I'd prefer the s390 naming, so this moves the s390-specific
      name up to asm-generic/sections.h and renames the section as used by
      kmemleak (and, in the future, kernel/extable.c).
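
      For reference, this is the section behind the __ro_after_init attribute
      (a hedged example with a hypothetical variable; the marking itself is
      unchanged by this patch):

          /* Lands in .data..ro_after_init: writable during boot, then made
           * read-only once init completes.  __ro_after_init comes from
           * <linux/cache.h>. */
          static int max_widgets __ro_after_init = 16;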
      
      Link: http://lkml.kernel.org/r/20170327192213.GA129375@beast
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>	[s390 parts]
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Cc: Eddie Kovsky <ewk@edkovsky.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd() · c9d398fa
      Committed by Naoya Horiguchi
      I found a race condition that triggers the following bug when
      move_pages() and soft offline are called concurrently on a single
      hugetlb page.
      
          Soft offlining page 0x119400 at 0x700000000000
          BUG: unable to handle kernel paging request at ffffea0011943820
          IP: follow_huge_pmd+0x143/0x190
          PGD 7ffd2067
          PUD 7ffd1067
          PMD 0
              [61163.582052] Oops: 0000 [#1] SMP
          Modules linked in: binfmt_misc ppdev virtio_balloon parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cap_check]
          CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P           OE   4.11.0-rc2-mm1+ #2
          Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
          RIP: 0010:follow_huge_pmd+0x143/0x190
          RSP: 0018:ffffc90004bdbcd0 EFLAGS: 00010202
          RAX: 0000000465003e80 RBX: ffffea0004e34d30 RCX: 00003ffffffff000
          RDX: 0000000011943800 RSI: 0000000000080001 RDI: 0000000465003e80
          RBP: ffffc90004bdbd18 R08: 0000000000000000 R09: ffff880138d34000
          R10: ffffea0004650000 R11: 0000000000c363b0 R12: ffffea0011943800
          R13: ffff8801b8d34000 R14: ffffea0000000000 R15: 000077ff80000000
          FS:  00007fc977710740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: ffffea0011943820 CR3: 000000007a746000 CR4: 00000000001406f0
          Call Trace:
           follow_page_mask+0x270/0x550
           SYSC_move_pages+0x4ea/0x8f0
           SyS_move_pages+0xe/0x10
           do_syscall_64+0x67/0x180
           entry_SYSCALL64_slow_path+0x25/0x25
          RIP: 0033:0x7fc976e03949
          RSP: 002b:00007ffe72221d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
          RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc976e03949
          RDX: 0000000000c22390 RSI: 0000000000001400 RDI: 0000000000005827
          RBP: 00007ffe72221e00 R08: 0000000000c2c3a0 R09: 0000000000000004
          R10: 0000000000c363b0 R11: 0000000000000246 R12: 0000000000400650
          R13: 00007ffe72221ee0 R14: 0000000000000000 R15: 0000000000000000
          Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 74 90 <49> 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
          RIP: follow_huge_pmd+0x143/0x190 RSP: ffffc90004bdbcd0
          CR2: ffffea0011943820
          ---[ end trace e4f81353a2d23232 ]---
          Kernel panic - not syncing: Fatal exception
          Kernel Offset: disabled
      
      This bug is triggered when pmd_present() returns true for a non-present
      hugetlb entry, so fixing the present check in follow_huge_pmd() prevents
      it.  Using pmd_present() to determine present/non-present for hugetlb is
      not correct, because pmd_present() checks multiple bits (not only
      _PAGE_PRESENT) for historical reasons, and it can misjudge the hugetlb
      state.
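
      A hedged sketch of the corrected check (the exact code in
      follow_huge_pmd() may differ):

          pte_t entry = huge_ptep_get((pte_t *)pmd);

          if (!pte_present(entry)) {      /* was: !pmd_present(*pmd) */
                  /* non-present entry (e.g. migration/hwpoison): do not
                   * compute a struct page from it; wait or bail instead */
                  goto retry;
          }
          page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);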
      
      Fixes: e66f17ff ("mm/hugetlb: take page table lock in follow_huge_pmd()")
      Link: http://lkml.kernel.org/r/1490149898-20231-1-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: <stable@vger.kernel.org>        [4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: fix premature shadow node shrinking with cgroups · 0cefabda
      Committed by Johannes Weiner
      Commit 0a6b76dd ("mm: workingset: make shadow node shrinker memcg
      aware") enabled cgroup-awareness in the shadow node shrinker, but forgot
      to also enable cgroup-awareness in the list_lru the shadow nodes sit on.
      
      Consequently, all shadow nodes are sitting on a global (per-NUMA-node)
      list, while the shrinker applies the limits according to the amount of
      cache in the cgroup it's shrinking.  The result is excessive pressure on
      the shadow nodes from cgroups that have very little cache.
      
      Enable memcg-mode on the shadow node LRUs, such that per-cgroup limits
      are applied to per-cgroup lists.
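
      A hedged sketch of the change (assuming the __list_lru_init() helper
      with a memcg-aware flag; exact names and arguments may differ):

          /* Make the shadow-node LRU memcg aware so per-cgroup limits apply
           * to per-cgroup lists rather than one global list. */
          ret = __list_lru_init(&workingset_shadow_nodes,
                                true,             /* memcg_aware */
                                &shadow_nodes_key);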
      
      Fixes: 0a6b76dd ("mm: workingset: make shadow node shrinker memcg aware")
      Link: http://lkml.kernel.org/r/20170322005320.8165-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vladimir Davydov <vdavydov@tarantool.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>	[4.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rmap: fix huge file mmap accounting in the memcg stats · 553af430
      Committed by Johannes Weiner
      Huge pages are accounted as single units in the memcg's "file_mapped"
      counter.  Account the correct number of base pages, like we do in the
      corresponding node counter.
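
      A hedged sketch of the accounting (helper names assumed from the 4.11
      rmap/memcg code; the point is passing nr base pages rather than 1):

          /* Account all base pages of the mapping, as the node counter
           * already does; a THP must not count as a single unit here. */
          int nr = compound ? hpage_nr_pages(page) : 1;

          mod_node_page_state(page_pgdat(page), NR_FILE_MAPPED, nr);
          mem_cgroup_update_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, nr);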
      
      Link: http://lkml.kernel.org/r/20170322005111.3156-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move mm_percpu_wq initialization earlier · 597b7305
      Committed by Michal Hocko
      Yang Li has reported that drain_all_pages triggers a WARN_ON, which
      means that this function is called before mm_percpu_wq is initialized
      on arm64 with CMA configured:
      
        WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c
        Modules linked in:
        CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127
        Hardware name: Freescale Layerscape 2088A RDB Board (DT)
        task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000
        PC is at drain_all_pages+0x244/0x25c
        LR is at start_isolate_page_range+0x14c/0x1f0
        [...]
         drain_all_pages+0x244/0x25c
         start_isolate_page_range+0x14c/0x1f0
         alloc_contig_range+0xec/0x354
         cma_alloc+0x100/0x1fc
         dma_alloc_from_contiguous+0x3c/0x44
         atomic_pool_init+0x7c/0x208
         arm64_dma_init+0x44/0x4c
         do_one_initcall+0x38/0x128
         kernel_init_freeable+0x1a0/0x240
         kernel_init+0x10/0xfc
         ret_from_fork+0x10/0x20
      
      Fix this by moving the whole of setup_vmstat, which is an initcall
      right now, to init_mm_internals, which will be called right after the
      WQ subsystem is initialized.
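
      A hedged sketch of the resulting init order (function body simplified;
      the former setup_vmstat() contents are elided):

          void __init init_mm_internals(void)
          {
                  /* safe: called right after the workqueue subsystem is up */
                  mm_percpu_wq = alloc_workqueue("mm_percpu_wq",
                                                 WQ_MEM_RECLAIM, 0);
                  /* ...former setup_vmstat() body (hotplug notifier, proc
                   * files, vmstat shepherd) follows here... */
          }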
      
      Link: http://lkml.kernel.org/r/20170315164021.28532-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Yang Li <pku.leo@gmail.com>
      Tested-by: Yang Li <pku.leo@gmail.com>
      Tested-by: Xiaolong Ye <xiaolong.ye@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: migrate: fix remove_migration_pte() for ksm pages · 4b0ece6f
      Committed by Naoya Horiguchi
      I found that calling page migration for ksm pages causes the following
      bug:
      
          page:ffffea0004d51180 count:2 mapcount:2 mapping:ffff88013c785141 index:0x913
          flags: 0x57ffffc0040068(uptodate|lru|active|swapbacked)
          raw: 0057ffffc0040068 ffff88013c785141 0000000000000913 0000000200000001
          raw: ffffea0004d5f9e0 ffffea0004d53f60 0000000000000000 ffff88007d81b800
          page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
          page->mem_cgroup:ffff88007d81b800
          ------------[ cut here ]------------
          kernel BUG at /src/linux-dev/mm/rmap.c:1086!
          invalid opcode: 0000 [#1] SMP
          Modules linked in: ppdev parport_pc virtio_balloon i2c_piix4 pcspkr parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi ata_piix 8139too libata virtio_blk 8139cp crc32c_intel mii virtio_pci virtio_ring serio_raw virtio floppy dm_mirror dm_region_hash dm_log dm_mod
          CPU: 0 PID: 3162 Comm: bash Not tainted 4.11.0-rc2-mm1+ #1
          Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
          RIP: 0010:do_page_add_anon_rmap+0x1ba/0x260
          RSP: 0018:ffffc90002473b30 EFLAGS: 00010282
          RAX: 0000000000000021 RBX: ffffea0004d51180 RCX: 0000000000000006
          RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff88007dc0dfe0
          RBP: ffffc90002473b58 R08: 00000000fffffffe R09: 00000000000001c1
          R10: 0000000000000005 R11: 00000000000001c0 R12: ffff880139ab3d80
          R13: 0000000000000000 R14: 0000700000000200 R15: 0000160000000000
          FS:  00007f5195f50740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007fd450287000 CR3: 000000007a08e000 CR4: 00000000001406f0
          Call Trace:
           page_add_anon_rmap+0x18/0x20
           remove_migration_pte+0x220/0x2c0
           rmap_walk_ksm+0x143/0x220
           rmap_walk+0x55/0x60
           remove_migration_ptes+0x53/0x80
           migrate_pages+0x8ed/0xb60
           soft_offline_page+0x309/0x8d0
           store_soft_offline_page+0xaf/0xf0
           dev_attr_store+0x18/0x30
           sysfs_kf_write+0x3a/0x50
           kernfs_fop_write+0xff/0x180
           __vfs_write+0x37/0x160
           vfs_write+0xb2/0x1b0
           SyS_write+0x55/0xc0
           do_syscall_64+0x67/0x180
           entry_SYSCALL64_slow_path+0x25/0x25
          RIP: 0033:0x7f51956339e0
          RSP: 002b:00007ffcfa0dffc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
          RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f51956339e0
          RDX: 000000000000000c RSI: 00007f5195f53000 RDI: 0000000000000001
          RBP: 00007f5195f53000 R08: 000000000000000a R09: 00007f5195f50740
          R10: 000000000000000b R11: 0000000000000246 R12: 00007f5195907400
          R13: 000000000000000c R14: 0000000000000001 R15: 0000000000000000
          Code: fe ff ff 48 81 c2 00 02 00 00 48 89 55 d8 e8 2e c3 fd ff 48 8b 55 d8 e9 42 ff ff ff 48 c7 c6 e0 52 a1 81 48 89 df e8 46 ad fe ff <0f> 0b 48 83 e8 01 e9 7f fe ff ff 48 83 e8 01 e9 96 fe ff ff 48
          RIP: do_page_add_anon_rmap+0x1ba/0x260 RSP: ffffc90002473b30
          ---[ end trace a679d00f4af2df48 ]---
          Kernel panic - not syncing: Fatal exception
          Kernel Offset: disabled
          ---[ end Kernel panic - not syncing: Fatal exception
      
      The problem is in the following lines:
      
          new = page - pvmw.page->index +
              linear_page_index(vma, pvmw.address);
      
      'new' is calculated from 'page', which is given by the caller as the
      destination page, plus some offset adjustment for thp.  But this doesn't
      work properly for ksm pages, because pvmw.page->index doesn't change for
      each address while linear_page_index() does, which means that 'new'
      points to a different page for each address backed by the ksm page.  As
      a result, we try to set totally unrelated pages as destination pages,
      and that crashes the kernel.
      
      This patch fixes the miscalculation and makes ksm page migration work
      fine.
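
      A hedged sketch of the fix in remove_migration_pte(), modeled on the
      description above (exact code may differ):

          /* ksm pages map one physical page at many unrelated addresses,
           * so the linear-index arithmetic is only valid for non-ksm pages */
          if (PageKsm(page))
                  new = page;
          else
                  new = page - pvmw.page->index +
                          linear_page_index(vma, pvmw.address);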
      
      Fixes: 3fe87967 ("mm: convert remove_migration_pte() to use page_vma_mapped_walk()")
      Link: http://lkml.kernel.org/r/1489717683-29905-1-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 22 Mar 2017 (1 commit)
    • mm, swap: Remove WARN_ON_ONCE() in free_swap_slot() · 093b995e
      Committed by Huang Ying
      Before commit 452b94b8 ("mm/swap: don't BUG_ON() due to
      uninitialized swap slot cache"), the following bug was reported:
      
        ------------[ cut here ]------------
        kernel BUG at mm/swap_slots.c:270!
        invalid opcode: 0000 [#1] SMP
        CPU: 5 PID: 1745 Comm: (sd-pam) Not tainted 4.11.0-rc1-00243-g24c534bb #1
        Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
        RIP: 0010:free_swap_slot+0xba/0xd0
        Call Trace:
         swap_free+0x36/0x40
         do_swap_page+0x360/0x6d0
         __handle_mm_fault+0x880/0x1080
         handle_mm_fault+0xd0/0x240
         __do_page_fault+0x232/0x4d0
         do_page_fault+0x20/0x70
         page_fault+0x22/0x30
        ---[ end trace aefc9ede53e0ab21 ]---
      
      This is raised by the BUG_ON(!swap_slot_cache_initialized) in
      free_swap_slot().  The check is incorrect: even if the swap slots cache
      fails to be initialized, swap should operate properly without it, and
      the use_swap_slot_cache check later in the function protects the
      uninitialized case.

      Commit 452b94b8 replaced the BUG_ON() with a WARN_ON_ONCE().  This
      patch removes the WARN_ON_ONCE() as well.
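
      A hedged sketch of the resulting free_swap_slot() flow (field names
      taken from mm/swap_slots.c, slightly simplified):

          cache = &get_cpu_var(swp_slots);
          if (use_swap_slot_cache && cache->slots_ret) {
                  /* an uninitialized cache simply fails this check and
                   * falls through -- no assertion needed */
                  cache->slots_ret[cache->n_ret++] = entry;
          } else {
                  swapcache_free_entries(&entry, 1);  /* direct free */
          }
          put_cpu_var(swp_slots);
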
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 20 Mar 2017 (1 commit)
    • mm/swap: don't BUG_ON() due to uninitialized swap slot cache · 452b94b8
      Committed by Linus Torvalds
      This BUG_ON() triggered for me once at shutdown, and I don't see a
      reason for the check.  The code correctly checks whether the swap slot
      cache is usable or not, so an uninitialized swap slot cache is not
      actually problematic afaik.
      
      I've temporarily just switched the BUG_ON() to a WARN_ON_ONCE(), since
      I'm not sure why that seemingly pointless check was there.  I suspect
      the real fix is to just remove it entirely, but for now we'll warn about
      it but not bring the machine down.
      
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 17 Mar 2017 (3 commits)
  8. 13 Mar 2017 (1 commit)
  9. 10 Mar 2017 (11 commits)
    • kasan: fix races in quarantine_remove_cache() · ce5bec54
      Committed by Dmitry Vyukov
      quarantine_remove_cache() frees all pending objects that belong to the
      cache before we destroy the cache itself.  However, there are currently
      two ways it can fail to do so.

      First, another thread can hold some of the objects from the cache in the
      temp list in quarantine_put().  quarantine_put() has a window of enabled
      interrupts, and on_each_cpu() in quarantine_remove_cache() can finish
      right in that window.  These objects will later be freed into the
      destroyed cache.

      Second, quarantine_reduce() has the same problem.  It grabs a batch of
      objects from the global quarantine, then unlocks quarantine_lock and
      then frees the batch.  quarantine_remove_cache() can finish while some
      objects from the cache are still on the local to_free list in
      quarantine_reduce().
      
      Fix the race with quarantine_put() by disabling interrupts for the whole
      duration of quarantine_put().  In combination with on_each_cpu() in
      quarantine_remove_cache() it ensures that quarantine_remove_cache()
      either sees the objects in the per-cpu list or in the global list.
      
      Fix the race with quarantine_reduce() by protecting quarantine_reduce()
      with an srcu critical section and then doing synchronize_srcu() at the
      end of quarantine_remove_cache().
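
      A hedged sketch of the srcu pairing (the real mm/kasan/quarantine.c
      code also keeps the existing locking; bodies simplified):

          DEFINE_STATIC_SRCU(remove_cache_srcu);

          void quarantine_reduce(void)
          {
                  int idx = srcu_read_lock(&remove_cache_srcu);

                  /* grab a batch under quarantine_lock, unlock, free batch */

                  srcu_read_unlock(&remove_cache_srcu, idx);
          }

          void quarantine_remove_cache(struct kmem_cache *cache)
          {
                  /* flush per-cpu lists and scan the global quarantine */
                  on_each_cpu(per_cpu_remove_cache, cache, 1);

                  /* wait for quarantine_reduce() sections still holding
                   * objects from this cache on their local to_free list */
                  synchronize_srcu(&remove_cache_srcu);
          }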
      
      I've done some assessment of how well synchronize_srcu() works in this
      case.  On a 4-CPU VM I see that it blocks waiting for pending read
      critical sections in about 2-3% of cases, which looks good to me.
      
      I suspect that these races are the root cause of some GPFs that I
      episodically hit.  Previously I did not have any explanation for them.
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000c8
        IP: qlist_free_all+0x2e/0xc0 mm/kasan/quarantine.c:155
        PGD 6aeea067
        PUD 60ed7067
        PMD 0
        Oops: 0000 [#1] SMP KASAN
        Dumping ftrace buffer:
           (ftrace buffer empty)
        Modules linked in:
        CPU: 0 PID: 13667 Comm: syz-executor2 Not tainted 4.10.0+ #60
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        task: ffff88005f948040 task.stack: ffff880069818000
        RIP: 0010:qlist_free_all+0x2e/0xc0 mm/kasan/quarantine.c:155
        RSP: 0018:ffff88006981f298 EFLAGS: 00010246
        RAX: ffffea0000ffff00 RBX: 0000000000000000 RCX: ffffea0000ffff1f
        RDX: 0000000000000000 RSI: ffff88003fffc3e0 RDI: 0000000000000000
        RBP: ffff88006981f2c0 R08: ffff88002fed7bd8 R09: 00000001001f000d
        R10: 00000000001f000d R11: ffff88006981f000 R12: ffff88003fffc3e0
        R13: ffff88006981f2d0 R14: ffffffff81877fae R15: 0000000080000000
        FS:  00007fb911a2d700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000c8 CR3: 0000000060ed6000 CR4: 00000000000006f0
        Call Trace:
         quarantine_reduce+0x10e/0x120 mm/kasan/quarantine.c:239
         kasan_kmalloc+0xca/0xe0 mm/kasan/kasan.c:590
         kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:544
         slab_post_alloc_hook mm/slab.h:456 [inline]
         slab_alloc_node mm/slub.c:2718 [inline]
         kmem_cache_alloc_node+0x1d3/0x280 mm/slub.c:2754
         __alloc_skb+0x10f/0x770 net/core/skbuff.c:219
         alloc_skb include/linux/skbuff.h:932 [inline]
         _sctp_make_chunk+0x3b/0x260 net/sctp/sm_make_chunk.c:1388
         sctp_make_data net/sctp/sm_make_chunk.c:1420 [inline]
         sctp_make_datafrag_empty+0x208/0x360 net/sctp/sm_make_chunk.c:746
         sctp_datamsg_from_user+0x7e8/0x11d0 net/sctp/chunk.c:266
         sctp_sendmsg+0x2611/0x3970 net/sctp/socket.c:1962
         inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761
         sock_sendmsg_nosec net/socket.c:633 [inline]
         sock_sendmsg+0xca/0x110 net/socket.c:643
         SYSC_sendto+0x660/0x810 net/socket.c:1685
         SyS_sendto+0x40/0x50 net/socket.c:1653
      
      I am not sure about backporting.  The bug is quite hard to trigger; I've
      seen it a few times during our massive continuous testing (however, it
      could be the cause of some other episodic stray crashes, as it leads to
      memory corruption...).  If it is triggered, the consequences are very
      bad -- almost certainly bad memory corruption.  The fix is non-trivial
      and has a chance of introducing new bugs.  I am also not sure how
      actively people use KASAN on older releases.
      
      [dvyukov@google.com: sorted includes]
        Link: http://lkml.kernel.org/r/20170309094028.51088-1-dvyukov@google.com
      Link: http://lkml.kernel.org/r/20170308151532.5070-1-dvyukov@google.com
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kasan: resched in quarantine_remove_cache() · 68fd814a
      Committed by Dmitry Vyukov
      We see reported stalls/lockups in quarantine_remove_cache() on machines
      with large amounts of RAM.  quarantine_remove_cache() needs to scan the
      whole quarantine in order to take out all objects belonging to the
      cache.  The quarantine is currently 1/32nd of RAM; e.g. on a machine
      with 256GB of memory it will be 8GB.  Moreover, the quarantine scan is a
      walk over an uncached linked list, which is slow.

      Add cond_resched() after scanning each non-empty batch of objects.
      Batches are specifically kept at a reasonable size for quarantine_put().
      On a machine with 256GB of RAM we should have ~512 non-empty batches,
      each with 16MB of objects.
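
      A hedged sketch of the loop (names follow the description above;
      details may differ from the actual patch):

          for (i = 0; i < QUARANTINE_BATCHES; i++) {
                  if (qlist_empty(&global_quarantine[i]))
                          continue;
                  qlist_move_cache(&global_quarantine[i], &to_free, cache);
                  /* a batch is ~16MB of objects: give others a chance to
                   * run between batches instead of stalling for seconds */
                  cond_resched();
          }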
      
      Link: http://lkml.kernel.org/r/20170308154239.25440-1-dvyukov@google.com
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: do not call mem_cgroup_free() from within mem_cgroup_alloc() · 40e952f9
      Committed by Tahsin Erdogan
      mem_cgroup_free() indirectly calls wb_domain_exit() which is not
      prepared to deal with a struct wb_domain object that hasn't executed
      wb_domain_init().  For instance, the following warning message is
      printed by lockdep if alloc_percpu() fails in mem_cgroup_alloc():
      
        INFO: trying to register non-static key.
        the code is fine but needs lockdep annotation.
        turning off the locking correctness validator.
        CPU: 1 PID: 1950 Comm: mkdir Not tainted 4.10.0+ #151
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
         dump_stack+0x67/0x99
         register_lock_class+0x36d/0x540
         __lock_acquire+0x7f/0x1a30
         lock_acquire+0xcc/0x200
         del_timer_sync+0x3c/0xc0
         wb_domain_exit+0x14/0x20
         mem_cgroup_free+0x14/0x40
         mem_cgroup_css_alloc+0x3f9/0x620
         cgroup_apply_control_enable+0x190/0x390
         cgroup_mkdir+0x290/0x3d0
         kernfs_iop_mkdir+0x58/0x80
         vfs_mkdir+0x10e/0x1a0
         SyS_mkdirat+0xa8/0xd0
         SyS_mkdir+0x14/0x20
         entry_SYSCALL_64_fastpath+0x18/0xad
      
      Add __mem_cgroup_free() which skips wb_domain_exit().  This is used by
      both mem_cgroup_free() and the mem_cgroup_alloc() cleanup path.
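
      A hedged sketch of the split (helper bodies elided; exact contents
      differ in mm/memcontrol.c):

          static void __mem_cgroup_free(struct mem_cgroup *memcg)
          {
                  /* free per-node info, stat counters and the struct --
                   * everything that is safe to free even when
                   * wb_domain_init() never ran */
          }

          static void mem_cgroup_free(struct mem_cgroup *memcg)
          {
                  memcg_wb_domain_exit(memcg); /* needs an initialized domain */
                  __mem_cgroup_free(memcg);
          }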
      
      Fixes: 0b8f73e1 ("mm: memcontrol: clean up alloc, online, offline, free functions")
      Link: http://lkml.kernel.org/r/20170306192122.24262-1-tahsin@google.com
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: fix another corner case of munlock() vs. THPs · 6ebb4a1b
      Committed by Kirill A. Shutemov
      The following test case triggers BUG() in munlock_vma_pages_range():
      
      	#include <fcntl.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>

      	int main(int argc, char *argv[])
      	{
      		int fd;
      
      		system("mount -t tmpfs -o huge=always none /mnt");
      		fd = open("/mnt/test", O_CREAT | O_RDWR);
      		ftruncate(fd, 4UL << 20);
      		mmap(NULL, 4UL << 20, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_FIXED | MAP_LOCKED, fd, 0);
      		mmap(NULL, 4096, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_LOCKED, fd, 0);
      		munlockall();
      		return 0;
      	}
      
      The second mmap() creates a PTE-mapping of the first huge page in the
      file.  That makes the kernel munlock the page, as we never keep
      PTE-mapped pages mlocked.

      On munlockall(), when we handle the vma created by the first mmap(),
      munlock_vma_page() returns page_mask == 0, as the page is not mlocked
      anymore.  On the next iteration, follow_page_mask() returns the tail
      page, but page_mask is HPAGE_NR_PAGES - 1.  That makes us skip to the
      first tail page of the next huge page and step on
      VM_BUG_ON_PAGE(PageMlocked(page)).

      The fix is to not use the page_mask from follow_page_mask() at all.  It
      has no use for us.
      
      Link: http://lkml.kernel.org/r/20170302150252.34120-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>    [4.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rmap: fix NULL-pointer dereference on THP munlocking · 8346242a
      Committed by Kirill A. Shutemov
      The following test case triggers a NULL-pointer dereference in
      try_to_unmap_one():
      
      	#include <fcntl.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      
      	int main(int argc, char *argv[])
      	{
      		int fd;
      
      		system("mount -t tmpfs -o huge=always none /mnt");
      		fd = open("/mnt/test", O_CREAT | O_RDWR);
      		ftruncate(fd, 2UL << 20);
      		mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_FIXED | MAP_LOCKED, fd, 0);
      		mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_LOCKED, fd, 0);
      		munlockall();
      		return 0;
      	}
      
      Apparently, there's a case when we call try_to_unmap() on huge PMDs:
      it's TTU_MUNLOCK.
      
      Let's handle this case correctly.
      
      Fixes: c7ab0d2f ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()")
      Link: http://lkml.kernel.org/r/20170302151159.30592-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock.c: fix memblock_next_valid_pfn() · c9a1b80d
      Committed by AKASHI Takahiro
      Obviously, we should not access memblock.memory.regions[right] if
      'right' is outside of [0, memblock.memory.cnt).
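
      A hedged sketch of the missing guard (illustrative only; the actual
      patch differs in detail):

          /* the binary search may leave 'right' == memblock.memory.cnt;
           * indexing regions[] with it reads past the array */
          if (right == memblock.memory.cnt)
                  return ULONG_MAX;           /* no next valid pfn */

          return PFN_DOWN(memblock.memory.regions[right].base);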
      
      Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible")
      Link: http://lkml.kernel.org/r/20170303023745.9104-1-takahiro.akashi@linaro.org
      Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED · 70ccb92f
      Committed by Andrea Arcangeli
      userfaultfd_remove() has to be executed before zapping the pagetables,
      or UFFDIO_COPY could keep filling pages after zap_page_range returned,
      which would result in non-zero data after a MADV_DONTNEED.
      
      However userfaultfd_remove() may have to release the mmap_sem.  This was
      handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a
      potentially stale vma (the very vma passed to zap_page_range(vma, ...)).
      
      The fix consists of revalidating the vma in case userfaultfd_remove()
      had to release the mmap_sem.
      
      This also optimizes away an unnecessary down_read/up_read in the
      MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered.
      
      It all remains zero runtime cost when CONFIG_USERFAULTFD=n, as
      userfaultfd_remove() is defined as "true" at build time.
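
      A hedged sketch of the MADV_DONTNEED path after the fix (assuming
      userfaultfd_remove() returns false when it had to drop the mmap_sem;
      details may differ from the actual patch):

          if (!userfaultfd_remove(vma, start, end)) {
                  /* mmap_sem was dropped: the old vma may be stale */
                  down_read(&current->mm->mmap_sem);
                  vma = find_vma(current->mm, start);
                  if (!vma || start < vma->vm_start)
                          return -ENOMEM; /* range went away under us */
          }
          zap_page_range(vma, start, end - start);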
      
      Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/cgroup: avoid panic when init with low memory · bfc7228b
      Committed by Laurent Dufour
      The system may panic during initialisation when almost all memory has
      been assigned to huge pages via the kernel command-line parameter
      hugepages=xxxx.  The panic may look like this:
      
        Unable to handle kernel paging request for data at address 0x00000000
        Faulting instruction address: 0xc000000000302b88
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 [    0.082424] NUMA
        pSeries
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-15-generic #16-Ubuntu
        task: c00000021ed01600 task.stack: c00000010d108000
        NIP: c000000000302b88 LR: c000000000270e04 CTR: c00000000016cfd0
        REGS: c00000010d10b2c0 TRAP: 0300   Not tainted (4.9.0-15-generic)
        MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>[ 0.082770]   CR: 28424422  XER: 00000000
        CFAR: c0000000003d28b8 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1
        GPR00: c000000000270e04 c00000010d10b540 c00000000141a300 c00000010fff6300
        GPR04: 0000000000000000 00000000026012c0 c00000010d10b630 0000000487ab0000
        GPR08: 000000010ee90000 c000000001454fd8 0000000000000000 0000000000000000
        GPR12: 0000000000004400 c00000000fb80000 00000000026012c0 00000000026012c0
        GPR16: 00000000026012c0 0000000000000000 0000000000000000 0000000000000002
        GPR20: 000000000000000c 0000000000000000 0000000000000000 00000000024200c0
        GPR24: c0000000016eef48 0000000000000000 c00000010fff7d00 00000000026012c0
        GPR28: 0000000000000000 c00000010fff7d00 c00000010fff6300 c00000010d10b6d0
        NIP mem_cgroup_soft_limit_reclaim+0xf8/0x4f0
        LR do_try_to_free_pages+0x1b4/0x450
        Call Trace:
          do_try_to_free_pages+0x1b4/0x450
          try_to_free_pages+0xf8/0x270
          __alloc_pages_nodemask+0x7a8/0xff0
          new_slab+0x104/0x8e0
          ___slab_alloc+0x620/0x700
          __slab_alloc+0x34/0x60
          kmem_cache_alloc_node_trace+0xdc/0x310
          mem_cgroup_init+0x158/0x1c8
          do_one_initcall+0x68/0x1d0
          kernel_init_freeable+0x278/0x360
          kernel_init+0x24/0x170
          ret_from_kernel_thread+0x5c/0x74
        Instruction dump:
        eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3d230001 e9499a42 3d220004
        3929acd8 794a1f24 7d295214 eac90100 <e9360000> 2fa90000 419eff74 3b200000
        ---[ end trace 342f5208b00d01b6 ]---
      
      This is a chicken-and-egg issue where the kernel tries to get free
      memory when allocating per-node data in mem_cgroup_init(), but in that
      path mem_cgroup_soft_limit_reclaim() is called, which assumes that the
      data has already been allocated.

      As mem_cgroup_soft_limit_reclaim() is best effort, it should return
      early when the data is not yet allocated.
      
      This patch also fixes potential null pointer access in
      mem_cgroup_remove_from_trees() and mem_cgroup_update_tree().
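
      A hedged sketch of the early return (names follow mm/memcontrol.c;
      the actual patch may differ):

          mctz = soft_limit_tree_node(pgdat->node_id);
          if (!mctz)
                  return 0;       /* tree not allocated yet: the reclaim
                                   * is best effort, so just give up */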
      
      Link: http://lkml.kernel.org/r/1487856999-16581-2-git-send-email-ldufour@linux.vnet.ibm.com
      Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmstats: add thp_split_pud event for clarity · ce9311cf
      Committed by Yisheng Xie
      We added support for PUD-sized transparent hugepages, but we count the
      "thp split pud" event as a thp_split_pmd event.

      To separate the count of thp split pud events from pmd ones, add a new
      event named thp_split_pud.
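
      A hedged one-line sketch of the counting site in the PUD split path:

          count_vm_event(THP_SPLIT_PUD);      /* previously: THP_SPLIT_PMD */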
      
      Link: http://lkml.kernel.org/r/1488282380-5076-1-git-send-email-xieyisheng1@huawei.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce __p4d_alloc() · 90eceff1
      Committed by Kirill A. Shutemov
      For full 5-level paging we need a helper to allocate a p4d page table.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: convert generic code to 5-level paging · c2febafc
      Committed by Kirill A. Shutemov
      Convert all non-architecture-specific code to 5-level paging.
      
      It's mostly mechanical: add handling of one more page table level in
      places where we deal with pud_t.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 09 Mar 2017 (3 commits)
    • mm, page_alloc: Add missing check for memory holes · b4fb8f66
      Committed by Tony Luck
      Commit 13ad59df ("mm, page_alloc: avoid page_to_pfn() when merging
      buddies") moved the check for memory holes out of page_is_buddy() and
      had the callers do the check.
      
      But this wasn't done correctly in one place, which caused ia64 to crash
      very early in boot.
      
      Update to fix that and make ia64 boot again.
      
      [ v2: Vlastimil pointed out we don't need to call page_to_pfn()
            since we already have the result of that in "buddy_pfn" ]
      
      Fixes: 13ad59df ("mm, page_alloc: avoid page_to_pfn() when merging buddies")
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bdi: Fix use-after-free in wb_congested_put() · df23de55
      Committed by Jan Kara
      bdi_writeback_congested structures get created for each blkcg and bdi
      regardless of whether the bdi is registered or not.  When they are
      created for an unregistered bdi and the request queue (and thus the bdi)
      is then destroyed while a blkg still holds a reference to the
      bdi_writeback_congested structure, this structure will reference a freed
      bdi and the last wb_congested_put() will try to remove the structure
      from the already freed bdi.

      With commit 165a5e22 "block: Move bdi_unregister() to
      del_gendisk()", SCSI started to destroy bdis without calling
      bdi_unregister() first (previously it was calling bdi_unregister() even
      for unregistered bdis), so the code detaching bdi_writeback_congested
      in cgwb_bdi_destroy() was no longer triggered and we started hitting
      this use-after-free bug.  It is enough to boot a KVM instance with a
      virtio-scsi device to trigger this behavior.
      
      Fix the problem by detaching bdi_writeback_congested structures in
      bdi_exit() instead of bdi_unregister().  This is also more logical, as
      they can get attached to a bdi regardless of whether it ever got
      registered or not.
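
      A hedged sketch of the move (assuming a cgwb_bdi_exit()-style helper;
      exact naming and contents may differ from the actual patch):

          static void bdi_exit(struct backing_dev_info *bdi)
          {
                  WARN_ON_ONCE(bdi->dev);
                  /* detach bdi_writeback_congested structures here, in the
                   * path that runs for every bdi, registered or not */
                  cgwb_bdi_exit(bdi);
          }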
      
      Fixes: 165a5e22 ("block: Move bdi_unregister() to del_gendisk()")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Tested-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: Allow bdi re-registration · b6f8fec4
      Committed by Jan Kara
      SCSI can call device_add_disk() several times for one request queue when
      a device is unbound and bound, creating a new gendisk each time.  This
      leads to the bdi being repeatedly registered and unregistered.  This was
      not a big problem until commit 165a5e22 "block: Move bdi_unregister()
      to del_gendisk()", since the bdi was only registered repeatedly
      (bdi_register() handles repeated calls fine; we only ended up leaking a
      reference to the gendisk due to overwriting bdi->owner) but unregistered
      only in blk_cleanup_queue(), which doesn't get called repeatedly.  After
      165a5e22 we were doing correct bdi_register() - bdi_unregister()
      cycles, but bdi_unregister() is not prepared for them.  So make sure
      bdi_unregister() cleans up the bdi in such a way that it is prepared for
      a possible following bdi_register() call.
      
      An easy way to provoke this behavior is to enable
      CONFIG_DEBUG_TEST_DRIVER_REMOVE and use the scsi_debug driver to create
      a scsi disk, which immediately hangs without this fix.
      
      Fixes: 165a5e22 ("block: Move bdi_unregister() to del_gendisk()")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Tested-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  11. 07 Mar 2017 (1 commit)