1. 15 November 2020 (6 commits)
    • hugetlbfs: fix anon huge page migration race · 336bf30e
      Committed by Mike Kravetz
      Qian Cai reported the following BUG in [1]:
      
        LTP: starting move_pages12
        BUG: unable to handle page fault for address: ffffffffffffffe0
        ...
        RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
        Call Trace:
          rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
          try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
          migrate_pages+0x1005/0x1fb0
          move_pages_and_store_status.isra.47+0xd7/0x1a0
          __x64_sys_move_pages+0xa5c/0x1100
          do_syscall_64+0x5f/0x310
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Hugh Dickins diagnosed this as a migration bug caused by code introduced
      to use i_mmap_rwsem for pmd sharing synchronization.  Specifically, the
      routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
      flag to try_to_unmap() while holding i_mmap_rwsem.  This is wrong for
      anon pages as the anon_vma lock should be held in this case.  Further
      analysis suggested that i_mmap_rwsem was not required to be held at
      all when calling try_to_unmap() for anon pages, as an anon page could
      never be part of a shared pmd mapping.
      
      Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
      to drop the page lock and acquire i_mmap_rwsem is wrong.  There is no
      way to keep the mapping valid while dropping the page lock.
      
      This patch does the following:
      
       - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
         calling try_to_unmap.
      
       - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
         will now simply do a 'trylock' while still holding the page lock. If
         the trylock fails, it will return NULL. This could impact the
         callers:
      
       - migration calling code will receive -EAGAIN and retry up to the
         hard-coded limit (10).

       - memory error handling code will treat the page as BUSY.  This
         will force any mapping tasks to be killed (SIGKILL) instead of
         receiving SIGBUS.
      
         Do note that this change in behavior only happens when there is a
         race. None of the standard kernel testing suites actually hit this
         race, but it is possible.
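      For illustration, a hedged sketch of the resulting unmap logic in
      unmap_and_move_huge_page() (simplified from the description above;
      flag setup and error paths are elided, and exact variable names are
      assumptions):

          if (!PageAnon(hpage)) {
                  /*
                   * Only file-backed (potentially shared) hugetlb pages
                   * need i_mmap_rwsem; an anon page can never be part of
                   * a shared pmd mapping.  The trylock may fail under
                   * race, in which case migration sees -EAGAIN and
                   * retries.
                   */
                  struct address_space *mapping =
                          hugetlb_page_mapping_lock_write(hpage);

                  if (!mapping)
                          return -EAGAIN;
                  try_to_unmap(hpage, ttu_flags | TTU_RMAP_LOCKED);
                  i_mmap_unlock_write(mapping);
          } else {
                  try_to_unmap(hpage, ttu_flags);
          }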
      
      [1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
      [2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/
      
      Fixes: c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
      Reported-by: Qian Cai <cai@lca.pw>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: use unpin_user_pages() in __gup_longterm_locked() · 96e1fac1
      Committed by Jason Gunthorpe
      When FOLL_PIN is passed to __get_user_pages(), the page list must be
      put back using unpin_user_pages(); otherwise the page pin reference
      persists in a corrupted state.
      
      There are two places in the unwind of __gup_longterm_locked() that put
      the pages back without checking.  Normally on error this function
      would return the partial page list, making this the caller's
      responsibility, but in these two cases the caller is not allowed to
      see these pages at all.
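      A hedged sketch of the unwind fix (the check itself mirrors the
      description above; the surrounding context and the nr_pinned name
      are assumptions):

          /*
           * FOLL_PIN references must be dropped with unpin_user_pages();
           * a plain put_page() would leave the pin accounting corrupted.
           */
          if (gup_flags & FOLL_PIN)
                  unpin_user_pages(pages, nr_pinned);
          else
                  while (nr_pinned--)
                          put_page(pages[nr_pinned]);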
      
      Fixes: 3faa52c0 ("mm/gup: track FOLL_PIN pages")
      Reported-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/0-v2-3ae7d9d162e2+2a7-gup_cma_fix_jgg@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slub: fix panic in slab_alloc_node() · 22e4663e
      Committed by Laurent Dufour
      While doing a memory hot-unplug operation on a PowerPC VM with 1024
      CPUs and 11TB of RAM, I hit the following panic:
      
          BUG: Kernel NULL pointer dereference on read at 0x00000007
          Faulting instruction address: 0xc000000000456048
          Oops: Kernel access of bad area, sig: 11 [#2]
          LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS= 2048 NUMA pSeries
          Modules linked in: rpadlpar_io rpaphp
          CPU: 160 PID: 1 Comm: systemd Tainted: G      D           5.9.0 #1
          NIP:  c000000000456048 LR: c000000000455fd4 CTR: c00000000047b350
          REGS: c00006028d1b77a0 TRAP: 0300   Tainted: G      D            (5.9.0)
          MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24004228  XER: 00000000
          CFAR: c00000000000f1b0 DAR: 0000000000000007 DSISR: 40000000 IRQMASK: 0
          GPR00: c000000000455fd4 c00006028d1b7a30 c000000001bec800 0000000000000000
          GPR04: 0000000000000dc0 0000000000000000 00000000000374ef c00007c53df99320
          GPR08: 000007c53c980000 0000000000000000 000007c53c980000 0000000000000000
          GPR12: 0000000000004400 c00000001e8e4400 0000000000000000 0000000000000f6a
          GPR16: 0000000000000000 c000000001c25930 c000000001d62528 00000000000000c1
          GPR20: c000000001d62538 c00006be469e9000 0000000fffffffe0 c0000000003c0ff8
          GPR24: 0000000000000018 0000000000000000 0000000000000dc0 0000000000000000
          GPR28: c00007c513755700 c000000001c236a4 c00007bc4001f800 0000000000000001
          NIP [c000000000456048] __kmalloc_node+0x108/0x790
          LR [c000000000455fd4] __kmalloc_node+0x94/0x790
          Call Trace:
            kvmalloc_node+0x58/0x110
            mem_cgroup_css_online+0x10c/0x270
            online_css+0x48/0xd0
            cgroup_apply_control_enable+0x2c4/0x470
            cgroup_mkdir+0x408/0x5f0
            kernfs_iop_mkdir+0x90/0x100
            vfs_mkdir+0x138/0x250
            do_mkdirat+0x154/0x1c0
            system_call_exception+0xf8/0x200
            system_call_common+0xf0/0x27c
          Instruction dump:
          e93e0000 e90d0030 39290008 7cc9402a e94d0030 e93e0000 7ce95214 7f89502a
          2fbc0000 419e0018 41920230 e9270010 <89290007> 7f994800 419e0220 7ee6bb78
      
      This points to the following code:
      
          mm/slub.c:2851
                  if (unlikely(!object || !node_match(page, node))) {
          c000000000456038:       00 00 bc 2f     cmpdi   cr7,r28,0
          c00000000045603c:       18 00 9e 41     beq     cr7,c000000000456054 <__kmalloc_node+0x114>
          node_match():
          mm/slub.c:2491
                  if (node != NUMA_NO_NODE && page_to_nid(page) != node)
          c000000000456040:       30 02 92 41     beq     cr4,c000000000456270 <__kmalloc_node+0x330>
          page_to_nid():
          include/linux/mm.h:1294
          c000000000456044:       10 00 27 e9     ld      r9,16(r7)
          c000000000456048:       07 00 29 89     lbz     r9,7(r9)	<<<< r9 = NULL
          node_match():
          mm/slub.c:2491
          c00000000045604c:       00 48 99 7f     cmpw    cr7,r25,r9
          c000000000456050:       20 02 9e 41     beq     cr7,c000000000456270 <__kmalloc_node+0x330>
      
      The panic occurred in slab_alloc_node() when checking for the page's node:
      
      	object = c->freelist;
      	page = c->page;
      	if (unlikely(!object || !node_match(page, node))) {
      		object = __slab_alloc(s, gfpflags, node, addr, c);
      		stat(s, ALLOC_SLOWPATH);
      
      The issue is that object is not NULL while page is NULL, which is odd
      but may happen if the cache flush happened after loading object but
      before loading page.  Thus checking the page pointer is required too.
      
      The cache flush is done through an inter-processor interrupt when a
      piece of memory is off-lined.  That interrupt is triggered when a
      memory hot-unplug operation is initiated, and offline_pages() calls
      slub's MEM_GOING_OFFLINE callback slab_mem_going_offline_callback(),
      which in turn calls flush_cpu_slab().  If that interrupt is caught
      between the reading of c->freelist and the reading of c->page, this
      can lead to such a situation.  That situation is expected, and the
      later call to this_cpu_cmpxchg_double() will detect the change to
      c->freelist and redo the whole operation.
      
      Commit 6159d0f5 ("mm/slub.c: page is always non-NULL in
      node_match()") removed the check on the page pointer, assuming that
      page is always valid when node_match() is called.  That assumption
      does not hold in this particular case, so check for page before
      calling node_match() here.
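      Per the description above, the fixed fast-path check becomes
      (sketch):

          object = c->freelist;
          page = c->page;
          if (unlikely(!object || !page || !node_match(page, node))) {
                  object = __slab_alloc(s, gfpflags, node, addr, c);
                  stat(s, ALLOC_SLOWPATH);
          }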
      
      Fixes: 6159d0f5 ("mm/slub.c: page is always non-NULL in node_match()")
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201027190406.33283-1-ldufour@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit · 2da9f630
      Committed by Nicholas Piggin
      Previously, the negated unsigned long would be cast back to a signed
      long, which would have the correct negative value.  After commit
      730ec8c0 ("mm/vmscan.c: change prototype for shrink_page_list"), the
      negated unsigned int instead converts to a large positive signed long.
      
      Symptoms include CMA allocations hanging forever holding the cma_mutex
      due to alloc_contig_range->...->isolate_migratepages_block waiting
      forever in "while (unlikely(too_many_isolated(pgdat)))".
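      The underlying C conversion pitfall, in a standalone illustration
      (this is not the patch itself):

          unsigned int nr_taken = 5;

          /*
           * -nr_taken wraps around within unsigned int *before* widening
           * to long, so the counter gets a huge positive delta instead
           * of -5.
           */
          long bad  = -nr_taken;        /* 4294967291 on a 64-bit kernel */
          long good = -(long)nr_taken;  /* -5: widen first, then negate */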
      
      [akpm@linux-foundation.org: fix -stat.nr_lazyfree_fail as well, per Michal]
      
      Fixes: 730ec8c0 ("mm/vmscan.c: change prototype for shrink_page_list")
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vaneet Narang <v.narang@samsung.com>
      Cc: Maninder Singh <maninder1.s@samsung.com>
      Cc: Amit Sahrawat <a.sahrawat@samsung.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201029032320.1448441-1-npiggin@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate · d20bdd57
      Committed by Zi Yan
      In isolate_migratepages_block, if we have too many isolated pages and
      nr_migratepages is not zero, we should try to migrate what we have
      without wasting time on isolating.
      
      In theory it's possible that multiple parallel compactions will cause
      too_many_isolated() to become true even though each has isolated fewer
      than COMPACT_CLUSTER_MAX pages, and then they loop forever in the
      while loop.  Bailing out immediately prevents that.
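      A hedged sketch of the bail-out at the top of
      isolate_migratepages_block() (simplified; returning 0 here aborts
      the isolation attempt, and the wait logic is condensed):

          while (unlikely(too_many_isolated(pgdat))) {
                  /* stop isolating: migrate what we already hold */
                  if (cc->nr_migratepages)
                          return 0;

                  /* async compaction should just abort rather than wait */
                  if (cc->mode == MIGRATE_ASYNC)
                          return 0;

                  congestion_wait(BLK_RW_ASYNC, HZ/10);

                  if (fatal_signal_pending(current))
                          return 0;
          }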
      
      [vbabka@suse.cz: changelog addition]
      
      Fixes: 1da2f328 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: <stable@vger.kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Link: https://lkml.kernel.org/r/20201030183809.3616803-2-zi.yan@sent.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/compaction: count pages and stop correctly during page isolation · 38935861
      Committed by Zi Yan
      In isolate_migratepages_block, when cc->alloc_contig is true, we are
      able to isolate compound pages.  But nr_migratepages and nr_isolated did
      not count compound pages correctly, causing us to isolate more pages
      than we thought.
      
      So count compound pages as the number of base pages they contain.
      Otherwise, we might be trapped in the too_many_isolated() while loop:
      because we only stop isolation after cc->nr_migratepages reaches
      COMPACT_CLUSTER_MAX (32), the actual number of isolated base pages
      can go up to COMPACT_CLUSTER_MAX * 512 = 16384.

      In addition, after we fix the issue above, cc->nr_migratepages could
      never be equal to COMPACT_CLUSTER_MAX if compound pages are isolated,
      thus page isolation would not stop as we intended.  Change the
      isolation stop condition to '>='.
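      A hedged sketch of the two counting changes in
      isolate_migratepages_block() (surrounding loop elided):

          /* count a compound page as the base pages it contains */
          cc->nr_migratepages += compound_nr(page);
          nr_isolated += compound_nr(page);

          /*
           * '>=' rather than '==': with compound pages, nr_migratepages
           * can jump past COMPACT_CLUSTER_MAX without ever equalling it.
           */
          if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX)
                  break;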
      
      The issue can be triggered as follows:
      
      In a system with 16GB memory and an 8GB CMA region reserved by
      hugetlb_cma, if we first allocate 10GB THPs and mlock them (so some
      THPs are allocated in the CMA region and mlocked), reserving six 1GB
      hugetlb pages via
      /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages will get
      stuck (looping in the too_many_isolated() function) until we kill
      either task.  With the patch applied, the OOM killer will kill the
      application with the 10GB THPs and let the hugetlb page reservation
      finish.
      
      [ziy@nvidia.com: v3]
      
      Link: https://lkml.kernel.org/r/20201030183809.3616803-1-zi.yan@sent.com
      Fixes: 1da2f328 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201029200435.3386066-1-zi.yan@sent.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 03 November 2020 (6 commits)
  3. 28 October 2020 (2 commits)
  4. 19 October 2020 (21 commits)
  5. 18 October 2020 (2 commits)
    • mm: use limited read-ahead to satisfy read · 324bcf54
      Committed by Jens Axboe
      For the case where read-ahead is disabled on the file, or if the
      cgroup is congested, ensure that we can do at least one page of
      read-ahead to make progress on the read in an async fashion.  This
      could potentially be larger, but it's not needed in terms of
      functionality, so let's err on the side of caution, as larger page
      counts may run into reclaim issues (particularly if we're congested).
      
      This makes sure we're not hitting the potentially sync ->readpage()
      path for IO that is marked IOCB_WAITQ, which could cause us to block.
      It also means we'll use the same path for IO, regardless of whether
      or not read-ahead happens to be disabled on the lower-level device.
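      A hedged sketch of the idea in page_cache_sync_ra() (names per the
      ractl API mentioned below; simplified):

          if (!ra->ra_pages || blk_cgroup_congested()) {
                  if (!ractl->file)
                          return;
                  /*
                   * Force a limited read-ahead of a single page so an
                   * IOCB_WAITQ read can still make progress
                   * asynchronously instead of hitting the potentially
                   * sync ->readpage() path.
                   */
                  req_count = 1;
                  do_forced_ra = true;
          }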
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reported-by: Hao_Xu <haoxu@linux.alibaba.com>
      [axboe: updated for new ractl API]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • mm: mark async iocb read as NOWAIT once some data has been copied · 13bd6914
      Committed by Jens Axboe
      Once we've copied some data for an iocb that is marked with
      IOCB_WAITQ, we should no longer attempt to async lock a new page.
      Instead, make sure we return the copied amount and let the caller
      retry, rather than returning -EIOCBQUEUED for a new page.
      
      This should only be possible with read-ahead disabled on the
      underlying device, and multiple threads racing on the same file.  I
      haven't been able to reproduce it on anything else.
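      A hedged sketch of the retry handling in
      generic_file_buffered_read() (simplified; the label and variable
      names are assumptions):

          written += ret;
          if (!iov_iter_count(iter))
                  goto out;

          /*
           * Once some data has been copied for an IOCB_WAITQ iocb,
           * switch to NOWAIT so we return the partial count and let the
           * caller retry, rather than returning -EIOCBQUEUED while
           * async-locking a new page.
           */
          if (iocb->ki_flags & IOCB_WAITQ)
                  iocb->ki_flags |= IOCB_NOWAIT;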
      
      Cc: stable@vger.kernel.org # v5.9
      Fixes: 1a0a7853 ("mm: support async buffered reads in generic_file_buffered_read()")
      Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 17 October 2020 (3 commits)
    • mm: remove the now-unnecessary mmget_still_valid() hack · 4d45e75a
      Committed by Jann Horn
      The preceding patches have ensured that core dumping properly takes the
      mmap_lock.  Thanks to that, we can now remove mmget_still_valid() and all
      its users.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: take mmap_lock in get_dump_page() · 7f3bfab5
      Committed by Jann Horn
      Properly take the mmap_lock before calling into the GUP code from
      get_dump_page(); and play nice, allowing the GUP code to drop the
      mmap_lock if it has to sleep.
      
      As Linus pointed out, we don't actually need the VMA because
      __get_user_pages() will flush the dcache for us if necessary.
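      A hedged sketch of the resulting get_dump_page() (condensed from the
      description above):

          struct page *get_dump_page(unsigned long addr)
          {
                  struct mm_struct *mm = current->mm;
                  struct page *page;
                  int locked = 1;
                  int ret;

                  if (mmap_read_lock_killable(mm))
                          return NULL;
                  /* GUP may drop and re-take mmap_lock if it must sleep */
                  ret = __get_user_pages_locked(mm, addr, 1, &page, NULL,
                                                &locked,
                                                FOLL_FORCE | FOLL_DUMP |
                                                FOLL_GET);
                  if (locked)
                          mmap_read_unlock(mm);
                  return (ret == 1) ? page : NULL;
          }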
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200827114932.3572699-7-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • binfmt_elf_fdpic: stop using dump_emit() on user pointers on !MMU · 8f942eea
      Committed by Jann Horn
      Patch series "Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there", v5.
      
      At the moment, we have that rather ugly mmget_still_valid() helper to work
      around <https://crbug.com/project-zero/1790>: ELF core dumping doesn't
      take the mmap_sem while traversing the task's VMAs, and if anything (like
      userfaultfd) then remotely messes with the VMA tree, fireworks ensue.  So
      at the moment we use mmget_still_valid() to bail out in any writers that
      might be operating on a remote mm's VMAs.
      
      With this series, I'm trying to get rid of the need for that as cleanly as
      possible.  ("cleanly" meaning "avoid holding the mmap_lock across
      unbounded sleeps".)
      
      Patches 1, 2, 3 and 4 are relatively unrelated cleanups in the core
      dumping code.
      
      Patches 5 and 6 implement the main change: Instead of repeatedly accessing
      the VMA list with sleeps in between, we snapshot it at the start with
      proper locking, and then later we just use our copy of the VMA list.  This
      ensures that the kernel won't crash, that VMA metadata in the coredump is
      consistent even in the presence of concurrent modifications, and that any
      virtual addresses that aren't being concurrently modified have their
      contents show up in the core dump properly.
      
      The disadvantage of this approach is that we need a bit more memory during
      core dumping for storing metadata about all VMAs.
      
      At the end of the series, patch 7 removes the old workaround for this
      issue (mmget_still_valid()).
      
      I have tested:
      
       - Creating a simple core dump on X86-64 still works.
       - The created coredump on X86-64 opens in GDB and looks plausible.
       - X86-64 core dumps contain the first page for executable mappings at
         offset 0, and don't contain the first page for non-executable file
         mappings or executable mappings at offset !=0.
       - NOMMU 32-bit ARM can still generate plausible-looking core dumps
         through the FDPIC implementation. (I can't test this with GDB because
         GDB is missing some structure definition for nommu ARM, but I've
         poked around in the hexdump and it looked decent.)
      
      This patch (of 7):
      
      dump_emit() is for kernel pointers, and VMAs describe userspace memory.
      Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
      even if it probably doesn't matter much on !MMU systems - especially given
      that it looks like we can just use the same get_dump_page() as on MMU if
      we move it out of the CONFIG_MMU block.
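      A hedged sketch of the per-page dump loop this enables (modeled on
      the MMU ELF dumper; exact variable names are assumptions):

          for (addr = vma->vm_start; addr < vma->vm_end;
               addr += PAGE_SIZE) {
                  struct page *page = get_dump_page(addr);

                  if (page) {
                          /* contents reach dump_emit() as a kernel ptr */
                          void *kaddr = kmap(page);

                          stop = !dump_emit(cprm, kaddr, PAGE_SIZE);
                          kunmap(page);
                          put_page(page);
                  } else {
                          stop = !dump_skip(cprm, PAGE_SIZE);
                  }
                  if (stop)
                          return 0;
          }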
      
      One small change we have to make in get_dump_page() is to use
      __get_user_pages_locked() instead of __get_user_pages(), since the latter
      doesn't exist on nommu.  On mmu builds, __get_user_pages_locked() will
      just call __get_user_pages() for us.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200827114932.3572699-1-jannh@google.com
      Link: http://lkml.kernel.org/r/20200827114932.3572699-2-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>