1. 08 4月, 2020 3 次提交
  2. 03 4月, 2020 2 次提交
  3. 02 12月, 2019 1 次提交
  4. 16 11月, 2019 1 次提交
    • S
      mm,thp: recheck each page before collapsing file THP · 4655e5e5
      Song Liu 提交于
      In collapse_file(), for !is_shmem case, current check cannot guarantee
      the locked page is up-to-date.  Specifically, xas_unlock_irq() should
      not be called before lock_page() and get_page(); and it is necessary to
      recheck PageUptodate() after locking the page.
      
      With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text
      may contain corrupted data.  This is because khugepaged mistakenly
      collapses some not up-to-date sub pages into a huge page, and assumes
      the huge page is up-to-date.  This will NOT corrupt data in the disk,
      because the page is read-only and never written back.  Fix this by
      properly checking PageUptodate() after locking the page.  This check
      replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);".
      
      Also, move PageDirty() check after locking the page.  Current khugepaged
      should not try to collapse dirty file THP, because it is limited to
      read-only .text.  The only case we hit a dirty page here is when the
      page hasn't been written since write.  Bail out and retry when this
      happens.
      
      syzbot reported bug on previous version of this patch.
      
      Link: http://lkml.kernel.org/r/20191106060930.2571389-2-songliubraving@fb.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Reported-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4655e5e5
  5. 07 11月, 2019 1 次提交
  6. 25 9月, 2019 5 次提交
  7. 03 9月, 2019 1 次提交
    • M
      sched/topology: Improve load balancing on AMD EPYC systems · a55c7454
      Matt Fleming 提交于
      SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
      for any sched domains with a NUMA distance greater than 2 hops
      (RECLAIM_DISTANCE). The idea being that it's expensive to balance
      across domains that far apart.
      
      However, as is rather unfortunately explained in:
      
        commit 32e45ff4 ("mm: increase RECLAIM_DISTANCE to 30")
      
      the value for RECLAIM_DISTANCE is based on node distance tables from
      2011-era hardware.
      
      Current AMD EPYC machines have the following NUMA node distances:
      
       node distances:
       node   0   1   2   3   4   5   6   7
         0:  10  16  16  16  32  32  32  32
         1:  16  10  16  16  32  32  32  32
         2:  16  16  10  16  32  32  32  32
         3:  16  16  16  10  32  32  32  32
         4:  32  32  32  32  10  16  16  16
         5:  32  32  32  32  16  10  16  16
         6:  32  32  32  32  16  16  10  16
         7:  32  32  32  32  16  16  16  10
      
      where 2 hops is 32.
      
      The result is that the scheduler fails to load balance properly across
      NUMA nodes on different sockets -- 2 hops apart.
      
      For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
      (CPUs 32-39) like so,
      
        $ numactl -C 0-7,32-39 ./spinner 16
      
      causes all threads to fork and remain on node 0 until the active
      balancer kicks in after a few seconds and forcibly moves some threads
      to node 4.
      
      Override node_reclaim_distance for AMD Zen.
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suravee.Suthikulpanit@amd.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas.Lendacky@amd.com
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.ukSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a55c7454
  8. 06 7月, 2019 1 次提交
    • L
      Revert "mm: page cache: store only head pages in i_pages" · 69bf4b6b
      Linus Torvalds 提交于
      This reverts commit 5fd4ca2d.
      
      Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
      __delete_from_swap_cache() to trigger:
      
         page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
         anon
         flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
         raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
         raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
         page dumped because: VM_BUG_ON_PAGE(entry != page)
         page->mem_cgroup:ffff978433ace000
         ------------[ cut here ]------------
         kernel BUG at mm/swap_state.c:170!
         invalid opcode: 0000 [#1] SMP NOPTI
         CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
         Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
         RIP: 0010:__delete_from_swap_cache+0x20d/0x240
         Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
         RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
         RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
         RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
         RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
         R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
         R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
         FS:  0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
         Call Trace:
          delete_from_swap_cache+0x46/0xa0
          try_to_free_swap+0xbc/0x110
          swap_writepage+0x13/0x70
          pageout.isra.0+0x13c/0x350
          shrink_page_list+0xc14/0xdf0
          shrink_inactive_list+0x1e5/0x3c0
          shrink_node_memcg+0x202/0x760
          shrink_node+0xe0/0x470
          balance_pgdat+0x2d1/0x510
          kswapd+0x220/0x420
          kthread+0xfb/0x130
          ret_from_fork+0x22/0x40
      
      and it's not immediately obvious why it happens.  It's too late in the
      rc cycle to do anything but revert for now.
      
      Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/Reported-and-bisected-by: NMikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Suggested-by: NJan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69bf4b6b
  9. 14 6月, 2019 1 次提交
    • A
      coredump: fix race condition between collapse_huge_page() and core dumping · 59ea6d06
      Andrea Arcangeli 提交于
      When fixing the race conditions between the coredump and the mmap_sem
      holders outside the context of the process, we focused on
      mmget_not_zero()/get_task_mm() callers in 04f5866e ("coredump: fix
      race condition between mmget_not_zero()/get_task_mm() and core
      dumping"), but those aren't the only cases where the mmap_sem can be
      taken outside of the context of the process as Michal Hocko noticed
      while backporting that commit to older -stable kernels.
      
      If mmgrab() is called in the context of the process, but then the
      mm_count reference is transferred outside the context of the process,
      that can also be a problem if the mmap_sem has to be taken for writing
      through that mm_count reference.
      
      khugepaged registration calls mmgrab() in the context of the process,
      but the mmap_sem for writing is taken later in the context of the
      khugepaged kernel thread.
      
      collapse_huge_page() after taking the mmap_sem for writing doesn't
      modify any vma, so it's not obvious that it could cause a problem to the
      coredump, but it happens to modify the pmd in a way that breaks an
      invariant that pmd_trans_huge_lock() relies upon.  collapse_huge_page()
      needs the mmap_sem for writing just to block concurrent page faults that
      call pmd_trans_huge_lock().
      
      Specifically the invariant that "!pmd_trans_huge()" cannot become a
      "pmd_trans_huge()" doesn't hold while collapse_huge_page() runs.
      
      The coredump will call __get_user_pages() without mmap_sem for reading,
      which eventually can invoke a lockless page fault which will need a
      functional pmd_trans_huge_lock().
      
      So collapse_huge_page() needs to use mmget_still_valid() to check it's
      not running concurrently with the coredump...  as long as the coredump
      can invoke page faults without holding the mmap_sem for reading.
      
      This has "Fixes: khugepaged" to facilitate backporting, but in my view
      it's more a bug in the coredump code that will eventually have to be
      rewritten to stop invoking page faults without the mmap_sem for reading.
      So the long term plan is still to drop all mmget_still_valid().
      
      Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
      Fixes: ba76149f ("thp: khugepaged")
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      59ea6d06
  10. 15 5月, 2019 3 次提交
  11. 06 3月, 2019 1 次提交
  12. 29 12月, 2018 1 次提交
  13. 01 12月, 2018 7 次提交
  14. 13 11月, 2018 1 次提交
    • L
      mm: Replace spin_is_locked() with lockdep · 35f3aa39
      Lance Roy 提交于
      lockdep_assert_held() is better suited to checking locking requirements,
      since it only checks if the current thread holds the lock regardless of
      whether someone else does. This is also a step towards possibly removing
      spin_is_locked().
      Signed-off-by: NLance Roy <ldr709@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <linux-mm@kvack.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.ibm.com>
      35f3aa39
  15. 21 10月, 2018 2 次提交
  16. 30 9月, 2018 1 次提交
    • M
      xarray: Replace exceptional entries · 3159f943
      Matthew Wilcox 提交于
      Introduce xarray value entries and tagged pointers to replace radix
      tree exceptional entries.  This is a slight change in encoding to allow
      the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
      value entry).  It is also a change in emphasis; exceptional entries are
      intimidating and different.  As the comment explains, you can choose
      to store values or pointers in the xarray and they are both first-class
      citizens.
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      3159f943
  17. 24 8月, 2018 1 次提交
  18. 18 8月, 2018 3 次提交
  19. 12 4月, 2018 3 次提交
  20. 23 3月, 2018 1 次提交