1. 01 Dec 2019: 2 commits
    • mm: emit tracepoint when RSS changes · b3d1411b
      Committed by Joel Fernandes (Google)
      Useful to track how RSS is changing per TGID to detect spikes in RSS and
      memory hogs.  Several Android teams have been using this patch in
      various kernel trees for half a year now.  Many reported to me it is
      really useful so I'm posting it upstream.
      
      Initial patch developed by Tim Murray.  Changes I made from the original
      patch: prevent any additional space being consumed by mm_struct.
      
      Regarding the fact that the RSS may change too often and thus flood the
      traces - note that there is already some "hysteresis" here.  That is, we
      update the counter only after 64 page faults, due to SPLIT_RSS_ACCOUNTING.
      However, during zapping or copying of a pte range, the RSS is updated
      immediately, which can become noisy/flooding.  In a previous discussion,
      we agreed that BPF or ftrace can be used to rate limit the signal if this
      becomes an issue.
      
      Also note that I added wrappers to trace_rss_stat to prevent compiler
      errors where linux/mm.h is included from tracing code, causing errors
      such as:
      
          CC      kernel/trace/power-traces.o
        In file included from ./include/trace/define_trace.h:102,
                         from ./include/trace/events/kmem.h:342,
                         from ./include/linux/mm.h:31,
                         from ./include/linux/ring_buffer.h:5,
                         from ./include/linux/trace_events.h:6,
                         from ./include/trace/events/power.h:12,
                         from kernel/trace/power-traces.c:15:
        ./include/trace/trace_events.h:113:22: error: field `ent' has incomplete type
           struct trace_entry ent;    \
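
      A minimal sketch of the wrapper approach mentioned above (names and
      placement are illustrative, not necessarily the exact upstream code):
      only the .c file that defines the helper includes the trace header, so
      linux/mm.h never pulls in trace_events.h and the error above goes away.

          /* include/linux/mm.h: declaration only, no trace headers needed */
          void mm_trace_rss_stat(struct mm_struct *mm, int member, long count);

          static inline void add_mm_counter(struct mm_struct *mm, int member,
                                            long value)
          {
                  long count = atomic_long_add_return(value,
                                                      &mm->rss_stat.count[member]);

                  mm_trace_rss_stat(mm, member, count);
          }

          /* mm/memory.c: the only file that includes <trace/events/kmem.h> */
          void mm_trace_rss_stat(struct mm_struct *mm, int member, long count)
          {
                  trace_rss_stat(mm, member, count);
          }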
      
      Link: http://lore.kernel.org/r/20190903200905.198642-1-joel@joelfernandes.org
      Link: http://lkml.kernel.org/r/20191001172817.234886-1-joel@joelfernandes.org
      Co-developed-by: Tim Murray <timmurray@google.com>
      Signed-off-by: Tim Murray <timmurray@google.com>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Carmen Jackson <carmenjackson@google.com>
      Cc: Mayank Gupta <mayankgupta@google.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3d1411b
    • mm: drop mmap_sem before calling balance_dirty_pages() in write fault · 89b15332
      Committed by Johannes Weiner
      One of our services is observing hanging ps/top/etc under heavy write
      IO, and the task states show this is an mmap_sem priority inversion:
      
      A write fault is holding the mmap_sem in read-mode and waiting for
      (heavily cgroup-limited) IO in balance_dirty_pages():
      
          balance_dirty_pages+0x724/0x905
          balance_dirty_pages_ratelimited+0x254/0x390
          fault_dirty_shared_page.isra.96+0x4a/0x90
          do_wp_page+0x33e/0x400
          __handle_mm_fault+0x6f0/0xfa0
          handle_mm_fault+0xe4/0x200
          __do_page_fault+0x22b/0x4a0
          page_fault+0x45/0x50
      
      Somebody tries to change the address space, contending for the mmap_sem in
      write-mode:
      
          call_rwsem_down_write_failed_killable+0x13/0x20
          do_mprotect_pkey+0xa8/0x330
          SyS_mprotect+0xf/0x20
          do_syscall_64+0x5b/0x100
          entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      The waiting writer locks out all subsequent readers to avoid lock
      starvation, and several threads can be seen hanging like this:
      
          call_rwsem_down_read_failed+0x14/0x30
          proc_pid_cmdline_read+0xa0/0x480
          __vfs_read+0x23/0x140
          vfs_read+0x87/0x130
          SyS_read+0x42/0x90
          do_syscall_64+0x5b/0x100
          entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      To fix this, do what we do for cache read faults already: drop the
      mmap_sem before calling into anything IO bound, in this case the
      balance_dirty_pages() function, and return VM_FAULT_RETRY.
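
      A hedged sketch of the pattern (simplified; the real patch uses a
      dedicated helper, so names and details here are illustrative): pin the
      file, drop the mmap_sem, do the IO-bound throttling, and let the caller
      retry the fault.

          static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
          {
                  struct address_space *mapping = vmf->vma->vm_file->f_mapping;

                  /* ... set_page_dirty(), unlock_page(), file update ... */

                  if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
                          struct file *fpin = get_file(vmf->vma->vm_file);

                          up_read(&vmf->vma->vm_mm->mmap_sem); /* stop blocking writers */
                          balance_dirty_pages_ratelimited(mapping);
                          fput(fpin);             /* pin kept mapping valid until here */
                          return VM_FAULT_RETRY;  /* fault is restarted later */
                  }

                  balance_dirty_pages_ratelimited(mapping);
                  return 0;
          }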
      
      Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89b15332
  2. 18 Oct 2019: 1 commit
    • mm: fix double page fault on arm64 if PTE_AF is cleared · 83d116c5
      Committed by Jia He
      When we tested the pmdk unit test [1] vmmalloc_fork TEST3 in an arm64
      guest, we hit a double page fault in __copy_from_user_inatomic() of
      cow_user_page().
      
      To reproduce the bug, run the following command after everything is deployed:
      make -C src/test/vmmalloc_fork/ TEST_TIME=60m check
      
      The call trace below is from arm64 do_page_fault(), for debugging purposes:
      [  110.016195] Call trace:
      [  110.016826]  do_page_fault+0x5a4/0x690
      [  110.017812]  do_mem_abort+0x50/0xb0
      [  110.018726]  el1_da+0x20/0xc4
      [  110.019492]  __arch_copy_from_user+0x180/0x280
      [  110.020646]  do_wp_page+0xb0/0x860
      [  110.021517]  __handle_mm_fault+0x994/0x1338
      [  110.022606]  handle_mm_fault+0xe8/0x180
      [  110.023584]  do_page_fault+0x240/0x690
      [  110.024535]  do_mem_abort+0x50/0xb0
      [  110.025423]  el0_da+0x20/0x24
      
      The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
      [ffff9b007000] pgd=000000023d4f8003, pud=000000023da9b003,
                     pmd=000000023d4b3003, pte=360000298607bd3
      
      As told by Catalin: "On arm64 without hardware Access Flag, copying from
      user will fail because the pte is old and cannot be marked young. So we
      always end up with zeroed page after fork() + CoW for pfn mappings. we
      don't always have a hardware-managed access flag on arm64."
      
      This patch fixes it by calling pte_mkyoung.  Also, the parameters are
      changed because vmf should be passed to cow_user_page().

      Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error,
      in case some obscure use-case hits it (suggested by Kirill).
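
      A simplified, hedged sketch of the fix (the real cow_user_page() also
      takes the page table lock and re-checks pte_same() before touching the
      pte; that is omitted here for brevity):

          static inline bool cow_user_page(struct page *dst, struct page *src,
                                           struct vm_fault *vmf)
          {
                  void *kaddr;
                  void __user *uaddr;

                  if (likely(src)) {
                          copy_user_highpage(dst, src, vmf->address, vmf->vma);
                          return true;
                  }

                  /*
                   * The source pfn has no struct page (e.g. a pfn mapping), so
                   * copy through the user mapping.  Mark the pte young first:
                   * on arm64 without hardware AF the in-atomic copy would
                   * otherwise fault and we would hand back a zero-filled page.
                   */
                  if (!pte_young(vmf->orig_pte))
                          ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte,
                                                pte_mkyoung(vmf->orig_pte), 0);

                  kaddr = kmap_atomic(dst);
                  uaddr = (void __user *)(vmf->address & PAGE_MASK);

                  if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
                          WARN_ON_ONCE(1);  /* obscure case: fall back to zeroes */
                          clear_page(kaddr);
                  }

                  kunmap_atomic(kaddr);
                  flush_dcache_page(dst);
                  return true;
          }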
      
      [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
      Signed-off-by: Jia He <justin.he@arm.com>
      Reported-by: Yibo Cai <Yibo.Cai@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      83d116c5
  3. 25 Sep 2019: 3 commits
  4. 20 Aug 2019: 1 commit
  5. 19 Jul 2019: 1 commit
  6. 16 Jul 2019: 2 commits
  7. 15 Jul 2019: 1 commit
  8. 13 Jul 2019: 5 commits
  9. 03 Jul 2019: 2 commits
  10. 18 Jun 2019: 2 commits
    • mm: Add an apply_to_pfn_range interface · 29875a52
      Committed by Thomas Hellstrom
      This is basically apply_to_page_range with added functionality:
      allocating missing parts of the page table becomes optional, which
      means that the function can be guaranteed not to error if allocation
      is disabled.  Also, the closure struct and callback function are passed
      differently, more in line with how things are done elsewhere.

      Finally, apply_to_page_range is kept as a wrapper around apply_to_pfn_range.
      
      The reason for not using the page-walk code is that we want to perform
      the page-walk on vmas pointing to an address space without requiring the
      mmap_sem to be held rather than on vmas belonging to a process with the
      mmap_sem held.
      
      Notable changes since RFC:
      Don't export apply_to_pfn_range.
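
      A hedged sketch of what the closure-plus-callback shape could look
      like, based on the description above (identifiers are illustrative,
      not the exact interface):

          struct pfn_range_apply;
          typedef int (*pter_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
                                   struct pfn_range_apply *closure);

          struct pfn_range_apply {
                  struct mm_struct *mm;   /* address space to walk               */
                  pter_fn_t ptefn;        /* called for each pte in the range    */
                  unsigned int alloc;     /* allocate missing page-table levels? */
          };

          int apply_to_pfn_range(struct pfn_range_apply *closure,
                                 unsigned long address, unsigned long size);

          /*
           * apply_to_page_range() can then remain a thin wrapper that sets
           * ->alloc and adapts the old callback signature.
           */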
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: linux-mm@kvack.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> #v1
      29875a52
    • mm: Allow the [page|pfn]_mkwrite callbacks to drop the mmap_sem · c9e5f41f
      Committed by Thomas Hellstrom
      Driver fault callbacks are allowed to drop the mmap_sem when expecting
      long hardware waits to avoid blocking other mm users. Allow the mkwrite
      callbacks to do the same by returning early on VM_FAULT_RETRY.
      
      In particular we want to be able to drop the mmap_sem when waiting for
      a reservation object lock on a GPU buffer object. These locks may be
      held while waiting for the GPU.
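
      A hedged sketch of a driver ->page_mkwrite() using the new rule; all
      my_bo_* helpers and the my_bo type are hypothetical placeholders for a
      driver's buffer-object locking, not a real API.

          static vm_fault_t my_bo_page_mkwrite(struct vm_fault *vmf)
          {
                  struct my_bo *bo = vmf->vma->vm_private_data;  /* illustrative */

                  if (!my_bo_trylock(bo)) {  /* reservation held, GPU may be busy */
                          if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
                              !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
                                  struct file *f = get_file(vmf->vma->vm_file);

                                  up_read(&vmf->vma->vm_mm->mmap_sem);
                                  my_bo_lock_wait(bo);   /* long wait, no mmap_sem */
                                  my_bo_unlock(bo);
                                  fput(f);
                                  return VM_FAULT_RETRY; /* core mm retries the fault */
                          }
                          my_bo_lock_wait(bo);           /* no retry allowed: block */
                  }

                  /* ... make the page writable under the reservation ... */
                  my_bo_unlock(bo);
                  return 0;
          }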
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: linux-mm@kvack.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      c9e5f41f
  11. 21 May 2019: 1 commit
  12. 15 May 2019: 3 commits
    • mm: introduce new vm_map_pages() and vm_map_pages_zero() API · a667d745
      Committed by Souptick Joarder
      Patch series "mm: Use vm_map_pages() and vm_map_pages_zero() API", v5.
      
      This patch (of 5):
      
      Previously, drivers had their own way of mapping a range of kernel
      pages/memory into a user vma, done by invoking vm_insert_page()
      within a loop.
      
      As this pattern is common across different drivers, it can be generalized
      by creating new functions and using them across the drivers.
      
      vm_map_pages() is the API which can be used to map kernel memory/pages
      in drivers which have considered vm_pgoff.

      vm_map_pages_zero() is the API which can be used to map a range of kernel
      memory/pages in drivers which have not considered vm_pgoff.  vm_pgoff is
      passed as a default of 0 for those drivers.
      
      We _could_ then at a later point "fix" these drivers which are using
      vm_map_pages_zero() to behave according to the normal vm_pgoff offsetting
      simply by removing the _zero suffix on the function name and, if that
      causes regressions, it gives us an easy way to revert.
      
      Tested on Rockchip hardware and display is working, including talking to
      Lima via prime.
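
      A hedged sketch of the intended use in a driver's ->mmap(); the driver
      structure and field names are illustrative, vm_map_pages() is the new
      API described above.

          static int my_drv_mmap(struct file *file, struct vm_area_struct *vma)
          {
                  struct my_drv_buf *buf = file->private_data;   /* illustrative */

                  /*
                   * Old pattern: loop over buf->pages[] calling vm_insert_page()
                   * once per page and advancing the user address by PAGE_SIZE.
                   *
                   * New pattern: one call that also honours vma->vm_pgoff and
                   * checks that the vma is large enough for the page array.
                   */
                  return vm_map_pages(vma, buf->pages, buf->num_pages);
          }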
      
      Link: http://lkml.kernel.org/r/751cb8a0f4c3e67e95c58a3b072937617f338eea.1552921225.git.jrdr.linux@gmail.com
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Suggested-by: Russell King <linux@armlinux.org.uk>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Tested-by: Heiko Stuebner <heiko@sntech.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
      Cc: Sandy Huang <hjc@rock-chips.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Pawel Osciak <pawel@osciak.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a667d745
    • mm/mmu_notifier: use correct mmu_notifier events for each invalidation · 7269f999
      Committed by Jérôme Glisse
      This updates each existing invalidation to use the correct mmu notifier
      event that represents what is happening to the CPU page table.  See the
      patch which introduced the events for the rationale behind this.
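
      A hedged illustration of the kind of change made at each call site
      (the exact event chosen depends on the invalidation path):

          /* before: every invalidation reported the catch-all UNMAP event */
          mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, mm,
                                  start, end);

          /* after: a write-protection path reports what actually happens */
          mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0, vma, mm,
                                  start, end);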
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7269f999
    • mm/mmu_notifier: contextual information for event triggering invalidation · 6f4f13e8
      Committed by Jérôme Glisse
      CPU page table updates can happen for many reasons, not only as a result
      of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
      a result of kernel activities (memory compression, reclaim, migration,
      ...).

      Users of the mmu notifier API track changes to the CPU page table and take
      specific action for them.  The current API only provides the range of
      virtual addresses affected by the change, not why the change is happening.
      
      This patchset does the initial mechanical conversion of all the places that
      call mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
      event as well as the vma if it is known (most invalidations happen against
      a given vma).  Passing down the vma allows the users of mmu notifier to
      inspect the new vma page protection.

      MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier
      should assume that every mapping in the range is going away when that event
      happens.  A later patch converts the mm call paths to use more appropriate
      events for each call.
      
      This is done as 2 patches so that no call site is forgotten, especially
      as it uses the following coccinelle patch:
      
      %<----------------------------------------------------------------------
      @@
      identifier I1, I2, I3, I4;
      @@
      static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
      +enum mmu_notifier_event event,
      +unsigned flags,
      +struct vm_area_struct *vma,
      struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
      
      @@
      @@
      -#define mmu_notifier_range_init(range, mm, start, end)
      +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
      
      @@
      expression E1, E3, E4;
      identifier I1;
      @@
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, I1,
      I1->vm_mm, E3, E4)
      ...>
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(..., struct vm_area_struct *VMA, ...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(...) {
      struct vm_area_struct *VMA;
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN;
      @@
      FN(...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, NULL,
      E2, E3, E4)
      ...> }
      ---------------------------------------------------------------------->%
      
      Applied with:
      spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
      spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
      spatch --sp-file mmu-notifier.spatch --dir mm --in-place
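
      A hedged example of what the script produces at a typical call site
      inside a function that has a vma at hand:

          /* before the conversion */
          mmu_notifier_range_init(&range, mm, start, end);

          /* after the conversion: default event, no flags, vma passed through */
          mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, mm,
                                  start, end);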
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f4f13e8
  13. 09 Apr 2019: 1 commit
    • treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively · d75f773c
      Committed by Sakari Ailus
      %pF and %pf are functionally equivalent to the %pS and %ps conversion
      specifiers.  The former are deprecated; therefore, switch the current
      users to the preferred variant.
      
      The changes have been produced by the following command:
      
      	git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \
      	while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done
      
      And verifying the result.
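
      Hedged, illustrative call sites (not taken from the patch) showing the
      replacement and the difference between the two surviving forms:

          /* before: deprecated specifiers */
          pr_info("registered handler %pf\n", handler);
          pr_info("stalled in %pF\n", (void *)ip);

          /* after: %ps prints the bare symbol, %pS prints symbol+offset/size */
          pr_info("registered handler %ps\n", handler);
          pr_info("stalled in %pS\n", (void *)ip);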
      
      Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-um@lists.infradead.org
      Cc: xen-devel@lists.xenproject.org
      Cc: linux-acpi@vger.kernel.org
      Cc: linux-pm@vger.kernel.org
      Cc: drbd-dev@lists.linbit.com
      Cc: linux-block@vger.kernel.org
      Cc: linux-mmc@vger.kernel.org
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-pci@vger.kernel.org
      Cc: linux-scsi@vger.kernel.org
      Cc: linux-btrfs@vger.kernel.org
      Cc: linux-f2fs-devel@lists.sourceforge.net
      Cc: linux-mm@kvack.org
      Cc: ceph-devel@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
      Acked-by: David Sterba <dsterba@suse.com> (for btrfs)
      Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c)
      Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci)
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      d75f773c
  14. 03 Apr 2019: 2 commits
    • asm-generic/tlb: Remove tlb_flush_mmu_free() · fa0aafb8
      Committed by Peter Zijlstra
      As the comment notes, it is a potentially dangerous operation.  Just
      use tlb_flush_mmu(), which will skip the (double) TLB invalidate if
      it really isn't needed anyway.
      
      No change in behavior intended.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fa0aafb8
    • asm-generic/tlb, arch: Provide CONFIG_HAVE_MMU_GATHER_PAGE_SIZE · ed6a7935
      Committed by Peter Zijlstra
      Move the mmu_gather::page_size things into the generic code instead of
      PowerPC specific bits.
      
      No change in behavior intended.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ed6a7935
  15. 30 Mar 2019: 1 commit
  16. 06 Mar 2019: 9 commits
  17. 09 Jan 2019: 2 commits
    • mm/memory.c: initialise mmu_notifier_range correctly · 1ed7293a
      Committed by Matthew Wilcox
      One of the paths in follow_pte_pmd() initialised the mmu_notifier_range
      incorrectly.
      
      Link: http://lkml.kernel.org/r/20190103002126.GM6310@bombadil.infradead.org
      Fixes: ac46d4f3 ("mm/mmu_notifier: use structure for invalidate_range_start/end calls v2")
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Tested-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ed7293a
    • mm, memcg: fix reclaim deadlock with writeback · 63f3655f
      Committed by Michal Hocko
      Liu Bo has experienced a deadlock between memcg (legacy) reclaim and
      ext4 writeback:
      
        task1:
          wait_on_page_bit+0x82/0xa0
          shrink_page_list+0x907/0x960
          shrink_inactive_list+0x2c7/0x680
          shrink_node_memcg+0x404/0x830
          shrink_node+0xd8/0x300
          do_try_to_free_pages+0x10d/0x330
          try_to_free_mem_cgroup_pages+0xd5/0x1b0
          try_charge+0x14d/0x720
          memcg_kmem_charge_memcg+0x3c/0xa0
          memcg_kmem_charge+0x7e/0xd0
          __alloc_pages_nodemask+0x178/0x260
          alloc_pages_current+0x95/0x140
          pte_alloc_one+0x17/0x40
          __pte_alloc+0x1e/0x110
          alloc_set_pte+0x5fe/0xc20
          do_fault+0x103/0x970
          handle_mm_fault+0x61e/0xd10
          __do_page_fault+0x252/0x4d0
          do_page_fault+0x30/0x80
          page_fault+0x28/0x30
      
        task2:
          __lock_page+0x86/0xa0
          mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
          ext4_writepages+0x479/0xd60
          do_writepages+0x1e/0x30
          __writeback_single_inode+0x45/0x320
          writeback_sb_inodes+0x272/0x600
          __writeback_inodes_wb+0x92/0xc0
          wb_writeback+0x268/0x300
          wb_workfn+0xb4/0x390
          process_one_work+0x189/0x420
          worker_thread+0x4e/0x4b0
          kthread+0xe6/0x100
          ret_from_fork+0x41/0x50
      
      He adds
       "task1 is waiting for the PageWriteback bit of the page that task2 has
        collected in mpd->io_submit->io_bio, and tasks2 is waiting for the
        LOCKED bit the page which tasks1 has locked"
      
      More precisely task1 is handling a page fault and it has a page locked
      while it charges a new page table to a memcg.  That in turn hits a
      memory limit reclaim and the memcg reclaim for legacy controller is
      waiting on the writeback but that is never going to finish because the
      writeback itself is waiting for the page locked in the #PF path.  So
      this is essentially ABBA deadlock:
      
                                              lock_page(A)
                                              SetPageWriteback(A)
                                              unlock_page(A)
        lock_page(B)
                                              lock_page(B)
        pte_alloc_pne
          shrink_page_list
            wait_on_page_writeback(A)
                                              SetPageWriteback(B)
                                              unlock_page(B)
      
                                              # flush A, B to clear the writeback
      
      This accumulation of more pages to flush is used by several filesystems
      to generate more optimal IO patterns.
      
      Waiting for the writeback in the legacy memcg controller is a workaround
      for premature OOM killer invocations, because there is no dirty IO
      throttling available for the controller.  There is no easy way around
      that unfortunately.  Therefore fix this specific issue by pre-allocating
      the page table outside of the page lock.  We already have handy
      infrastructure for that, so simply reuse the fault-around pattern which
      already does this.
      
      There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
      made under a locked fs page, but they should be really rare.  I am not
      aware of a better solution unfortunately.
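
      A hedged sketch of the fix in __do_fault(), simplified from the real
      code: the pte page table (an accounted allocation that may enter memcg
      reclaim) is allocated before the filesystem's fault handler can lock a
      page.

          static vm_fault_t __do_fault(struct vm_fault *vmf)
          {
                  struct vm_area_struct *vma = vmf->vma;
                  vm_fault_t ret;

                  /* Pre-allocate while no page is locked; reclaim may sleep here. */
                  if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
                          vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
                          if (!vmf->prealloc_pte)
                                  return VM_FAULT_OOM;
                          smp_wmb(); /* pairs with the barrier in __pte_alloc() */
                  }

                  ret = vma->vm_ops->fault(vmf);  /* may return with a locked page */

                  /* ... normal completion; no fresh allocations under the lock ... */
                  return ret;
          }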
      
      [akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
      [akpm@linux-foundation.org: coding-style fixes]
      [mhocko@kernel.org: enhance comment, per Johannes]
        Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
      Fixes: c3b94f44 ("memcg: further prevent OOM with too many dirty pages")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Liu Bo <bo.liu@linux.alibaba.com>
      Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63f3655f
  18. 05 Jan 2019: 1 commit
    • mm: treewide: remove unused address argument from pte_alloc functions · 4cf58924
      Committed by Joel Fernandes (Google)
      Patch series "Add support for fast mremap".
      
      This series speeds up the mremap(2) syscall by copying page tables at
      the PMD level even for non-THP systems.  There is concern that the extra
      'address' argument that mremap passes to pte_alloc may do something
      subtle, architecture-related in the future that may make the scheme not
      work.  Also, we find that there is no point in passing the 'address' to
      pte_alloc since it is unused.  This patch therefore removes this argument
      tree-wide, resulting in a nice negative diff as well.  Along the way we
      also ensure that the enabled architectures do not do anything funky with
      the 'address' argument that goes unnoticed by the optimization.
      
      Build and boot tested on x86-64.  Build tested on arm64.  The config
      enablement patch for arm64 will be posted in the future after more
      testing.
      
      The changes were obtained by applying the following Coccinelle script
      (thanks Julia for answering all Coccinelle questions!).
      The following fix-ups were done manually:
      * Removal of the address argument from pte_fragment_alloc
      * Removal of the pte_alloc_one_fast definitions from m68k and microblaze.
      
      // Options: --include-headers --no-includes
      // Note: I split the 'identifier fn' line, so if you are manually
      // running it, please unsplit it so it runs for you.
      
      virtual patch
      
      @pte_alloc_func_def depends on patch exists@
      identifier E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      type T2;
      @@
      
       fn(...
      - , T2 E2
       )
       { ... }
      
      @pte_alloc_func_proto_noarg depends on patch exists@
      type T1, T2, T3, T4;
      identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1, T2);
      + T3 fn(T1);
      |
      - T3 fn(T1, T2, T4);
      + T3 fn(T1, T2);
      )
      
      @pte_alloc_func_proto depends on patch exists@
      identifier E1, E2, E4;
      type T1, T2, T3, T4;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1 E1, T2 E2);
      + T3 fn(T1 E1);
      |
      - T3 fn(T1 E1, T2 E2, T4 E4);
      + T3 fn(T1 E1, T2 E2);
      )
      
      @pte_alloc_func_call depends on patch exists@
      expression E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
       fn(...
      -,  E2
       )
      
      @pte_alloc_macro depends on patch exists@
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      identifier a, b, c;
      expression e;
      position p;
      @@
      
      (
      - #define fn(a, b, c) e
      + #define fn(a, b) e
      |
      - #define fn(a, b) e
      + #define fn(a) e
      )
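
      A hedged before/after illustration of the effect on one of the affected
      prototypes (per-architecture signatures vary):

          /* before: callers and implementations carry an unused address */
          pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address);
          pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address);

          /* after: the argument is dropped tree-wide */
          pgtable_t pte_alloc_one(struct mm_struct *mm);
          pte_t *pte_alloc_one_kernel(struct mm_struct *mm);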
      
      Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4cf58924