1. 01 May 2021, 3 commits
  2. 10 Apr 2021, 1 commit
    • mm/gup: check page poison status for coredump. · d3378e86
      Authored by Aili Yao
      When we do a coredump for a user process, the signal may be a SIGBUS
      with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means it results from
      an ECC memory failure such as SRAR or SRAO.  We expect the memory
      recovery work to have finished correctly, in which case get_dump_page()
      will not return the error page, since the process's pte was invalidated
      by memory_failure().
      
      But memory_failure() may fail, leaving the process's related pte not
      correctly invalidated.  With the current code we would then return the
      poisoned page, dump it, and panic the system, since the access happens
      from kernel code.
      
      So check the poison status in get_dump_page(), and if the page is
      poisoned, return NULL.
      
      There may be other scenarios where it is also better to check the
      poison status and avoid a panic, so make a wrapper, is_page_poisoned(),
      for this check (a minimal sketch follows below).  Thanks to David
      Hildenbrand (<david@redhat.com>) for the suggestion.
      
      [akpm@linux-foundation.org: s/0/false/]
      [yaoaili@kingsoft.com: is_page_poisoned() arg cannot be null, per Matthew]
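
      A minimal sketch of the wrapper and its use, assuming the page-flag
      checks described above (the hugetlb handling is simplified here):

        /* mm/internal.h: the caller must pass a non-NULL page */
        static inline bool is_page_poisoned(struct page *page)
        {
                if (PageHWPoison(page))
                        return true;
                else if (PageHuge(page) && PageHWPoison(compound_head(page)))
                        return true;

                return false;
        }

        /* mm/gup.c, get_dump_page(): bail out rather than dump the page */
        if (ret == 1 && is_page_poisoned(page))
                return NULL;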
      
      Link: https://lkml.kernel.org/r/20210322115233.05e4e82a@alex-virtual-machine
      Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
      Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d3378e86
  3. 25 Feb 2021, 1 commit
  4. 16 Dec 2020, 5 commits
  5. 03 Dec 2020, 1 commit
  6. 15 Nov 2020, 1 commit
  7. 17 Oct 2020, 2 commits
    • mm/gup: take mmap_lock in get_dump_page() · 7f3bfab5
      Authored by Jann Horn
      Properly take the mmap_lock before calling into the GUP code from
      get_dump_page(); and play nice, allowing the GUP code to drop the
      mmap_lock if it has to sleep.
      
      As Linus pointed out, we don't actually need the VMA because
      __get_user_pages() will flush the dcache for us if necessary.
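
      A sketch of the resulting function, assuming this simplified excerpt
      matches the final code (error handling abbreviated):

        struct page *get_dump_page(unsigned long addr)
        {
                struct mm_struct *mm = current->mm;
                struct page *page;
                int locked = 1;
                int ret;

                /* take the lock ourselves; GUP may drop it to sleep */
                if (mmap_read_lock_killable(mm))
                        return NULL;
                ret = __get_user_pages_locked(mm, addr, 1, &page, NULL,
                                              &locked,
                                              FOLL_FORCE | FOLL_DUMP | FOLL_GET);
                if (locked)
                        mmap_read_unlock(mm);
                return (ret == 1) ? page : NULL;
        }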
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200827114932.3572699-7-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f3bfab5
    • binfmt_elf_fdpic: stop using dump_emit() on user pointers on !MMU · 8f942eea
      Authored by Jann Horn
      Patch series "Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there", v5.
      
      At the moment, we have that rather ugly mmget_still_valid() helper to work
      around <https://crbug.com/project-zero/1790>: ELF core dumping doesn't
      take the mmap_sem while traversing the task's VMAs, and if anything (like
      userfaultfd) then remotely messes with the VMA tree, fireworks ensue.  So
      at the moment we use mmget_still_valid() to bail out in any writers that
      might be operating on a remote mm's VMAs.
      
      With this series, I'm trying to get rid of the need for that as cleanly as
      possible.  ("cleanly" meaning "avoid holding the mmap_lock across
      unbounded sleeps".)
      
      Patches 1, 2, 3 and 4 are relatively unrelated cleanups in the core
      dumping code.
      
      Patches 5 and 6 implement the main change: Instead of repeatedly accessing
      the VMA list with sleeps in between, we snapshot it at the start with
      proper locking, and then later we just use our copy of the VMA list.  This
      ensures that the kernel won't crash, that VMA metadata in the coredump is
      consistent even in the presence of concurrent modifications, and that any
      virtual addresses that aren't being concurrently modified have their
      contents show up in the core dump properly.
      
      The disadvantage of this approach is that we need a bit more memory during
      core dumping for storing metadata about all VMAs.
      
      At the end of the series, patch 7 removes the old workaround for this
      issue (mmget_still_valid()).
      
      I have tested:
      
       - Creating a simple core dump on X86-64 still works.
       - The created coredump on X86-64 opens in GDB and looks plausible.
       - X86-64 core dumps contain the first page for executable mappings at
         offset 0, and don't contain the first page for non-executable file
         mappings or executable mappings at offset !=0.
       - NOMMU 32-bit ARM can still generate plausible-looking core dumps
         through the FDPIC implementation. (I can't test this with GDB because
         GDB is missing some structure definition for nommu ARM, but I've
         poked around in the hexdump and it looked decent.)
      
      This patch (of 7):
      
      dump_emit() is for kernel pointers, and VMAs describe userspace memory.
      Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
      even if it probably doesn't matter much on !MMU systems - especially given
      that it looks like we can just use the same get_dump_page() as on MMU if
      we move it out of the CONFIG_MMU block.
      
      One small change we have to make in get_dump_page() is to use
      __get_user_pages_locked() instead of __get_user_pages(), since the latter
      doesn't exist on nommu.  On mmu builds, __get_user_pages_locked() will
      just call __get_user_pages() for us.
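
      A sketch of the adjusted call in get_dump_page() at this point in the
      series (hedged; the flags and the vma use are as described above):

        if (__get_user_pages_locked(current->mm, addr, 1, &page, &vma, NULL,
                                    FOLL_FORCE | FOLL_DUMP | FOLL_GET) < 1)
                return NULL;
        flush_cache_page(vma, addr, page_to_pfn(page));
        return page;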
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200827114932.3572699-1-jannh@google.com
      Link: http://lkml.kernel.org/r/20200827114932.3572699-2-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f942eea
  8. 14 Oct 2020, 2 commits
  9. 29 Sep 2020, 1 commit
    • mm: do not rely on mm == current->mm in __get_user_pages_locked · a4d63c37
      Authored by Jason A. Donenfeld
      It seems likely this block was pasted from internal_get_user_pages_fast,
      which is not passed an mm struct and therefore uses current's.  But
      __get_user_pages_locked is passed an explicit mm, and current->mm is not
      always valid.  This was hit when called from i915, which uses:
      
        pin_user_pages_remote->
          __get_user_pages_remote->
            __gup_longterm_locked->
              __get_user_pages_locked
      
      Before, this would lead to an OOPS:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000064
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        CPU: 10 PID: 1431 Comm: kworker/u33:1 Tainted: P S   U     O      5.9.0-rc7+ #140
        Hardware name: LENOVO 20QTCTO1WW/20QTCTO1WW, BIOS N2OET47W (1.34 ) 08/06/2020
        Workqueue: i915-userptr-acquire __i915_gem_userptr_get_pages_worker [i915]
        RIP: 0010:__get_user_pages_remote+0xd7/0x310
        Call Trace:
         __i915_gem_userptr_get_pages_worker+0xc8/0x260 [i915]
         process_one_work+0x1ca/0x390
         worker_thread+0x48/0x3c0
         kthread+0x114/0x130
         ret_from_fork+0x1f/0x30
        CR2: 0000000000000064
      
      This commit fixes the problem by using the mm pointer passed to the
      function rather than the bogus one in current.
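
      A sketch of the one-line fix in __get_user_pages_locked() (hedged;
      surrounding context elided):

        if (flags & FOLL_PIN)
                atomic_set(&mm->has_pinned, 1); /* was &current->mm->has_pinned */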
      
      Fixes: 008cfe44 ("mm: Introduce mm_struct.has_pinned")
      Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
      Reported-by: Harald Arnesen <harald@skogtun.org>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4d63c37
  10. 28 Sep 2020, 1 commit
    • mm: Introduce mm_struct.has_pinned · 008cfe44
      Authored by Peter Xu
      (Commit message mostly collected from Jason Gunthorpe)
      
      Reduce the chance of false positives from page_maybe_dma_pinned() by
      keeping track of whether the mm_struct has ever been used with
      pin_user_pages().  This allows cases that might drive up the page
      ref_count to avoid any penalty from handling dma_pinned pages.
      
      Future work is planned to provide a more sophisticated solution, likely
      turning this into a real counter.  For now, make it atomic_t but use it
      as a boolean for simplicity.
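
      A sketch of the new field and how it is set (hedged, simplified
      excerpt):

        /* include/linux/mm_types.h */
        struct mm_struct {
                ...
                /*
                 * Set once this mm has been used with pin_user_pages();
                 * used as a boolean for now, may become a real counter.
                 */
                atomic_t has_pinned;
                ...
        };

        /* mm/gup.c, on the pinning paths */
        if (flags & FOLL_PIN)
                atomic_set(&current->mm->has_pinned, 1);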
      Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      008cfe44
  11. 27 Sep 2020, 1 commit
    • mm/gup: fix gup_fast with dynamic page table folding · d3f7b1bb
      Authored by Vasily Gorbik
      Currently, to make sure that every page table entry is read just once,
      the gup_fast walks perform a READ_ONCE and pass the pXd value down to
      the next gup_pXd_range function by value, e.g.:
      
        static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                                 unsigned int flags, struct page **pages, int *nr)
        ...
                pudp = pud_offset(&p4d, addr);
      
      This function passes a reference to that local value copy to
      pXd_offset, and might get the very same pointer back.  This happens
      when the level is folded (on most arches), and that pointer should not
      be iterated.
      
      On s390, each task may use 5-, 4- or 3-level address translation, so
      different levels may be folded per task.  The logic is therefore more
      complex, and a non-iterable pointer to a local copy leads to severe
      problems.
      
      Here is an example of what happens with gup_fast on s390, for a task
      with 3-level paging, crossing a 2 GB pud boundary:
      
        // addr = 0x1007ffff000, end = 0x10080001000
        static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                                 unsigned int flags, struct page **pages, int *nr)
        {
              unsigned long next;
              pud_t *pudp;
      
              // pud_offset returns &p4d itself (a pointer to a value on stack)
              pudp = pud_offset(&p4d, addr);
              do {
                // on the second iteration this reads a "random" stack value
                      pud_t pud = READ_ONCE(*pudp);
      
                      // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                      next = pud_addr_end(addr, end);
                      ...
              } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
      
              return 1;
        }
      
      This happens since s390 moved to common gup code with commit
      d1874a0c ("s390/mm: make the pxd_offset functions more robust") and
      commit 1a42010c ("s390/mm: convert to the generic
      get_user_pages_fast code").
      
      s390 tried to mimic static level folding by changing the pXd_offset
      primitives to always calculate the top-level page table offset in
      pgd_offset, and to just return the value passed in when pXd_offset has
      to act as folded.
      
      What is crucial for gup_fast and what has been overlooked is that
      PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly.
      And the latter is not possible with dynamic folding.
      
      To fix the issue, in addition to the pXd values, pass the original
      pXdp pointers down to the gup_pXd_range functions, and introduce
      pXd_offset_lockless helpers that take an additional pXd entry value
      parameter (a sketch follows below).  This has already been discussed
      in
      
        https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
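
      A sketch of the new helper and its use, assuming the generic fallback
      shown (s390 supplies its own lockless variants):

        /* generic fallback: same behavior as before on non-folding arches */
        #ifndef pud_offset_lockless
        #define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
        #endif

        static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr,
                                 unsigned long end, unsigned int flags,
                                 struct page **pages, int *nr)
        {
                pud_t *pudp;

                /* with the original p4dp available, s390 can return a
                 * pointer into the real page table rather than into the
                 * on-stack copy, so pudp++ iterates correctly */
                pudp = pud_offset_lockless(p4dp, p4d, addr);
                ...
        }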
      
      Fixes: 1a42010c ("s390/mm: convert to the generic get_user_pages_fast code")
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: <stable@vger.kernel.org>	[5.2+]
      Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d3f7b1bb
  12. 05 Sep 2020, 1 commit
  13. 04 Sep 2020, 1 commit
    • mm: fix pin vs. gup mismatch with gate pages · 9fa2dd94
      Authored by Dave Hansen
      Gate pages were missed when converting from get_user_pages() to
      pin_user_pages().  This can lead to refcount imbalances.  It is
      reliably and quickly reproducible by running the x86 selftests when
      vsyscall=emulate is enabled (the default).  Fix by using
      try_grab_page() with the appropriate flags passed.
      
      The long story:
      
      Today, pin_user_pages() and get_user_pages() are similar interfaces for
      manipulating page reference counts.  However, "pins" use a "bias" value
      and manipulate the actual reference count by 1024 instead of the 1 used
      by plain "gets".
      
      That means that pin_user_pages() must be matched with unpin_user_pages()
      and can't be mixed with a plain put_user_pages() or put_page().
      
      Enter gate pages, like the vsyscall page.  They are pages usually in the
      kernel image, but which are mapped to userspace.  Userspace is allowed
      access to them, including interfaces using get/pin_user_pages().  The
      refcount of these kernel pages is manipulated just like a normal user
      page on the get/pin side so that the put/unpin side can work the same
      for normal user pages or gate pages.
      
      get_gate_page() uses try_get_page() which only bumps the refcount by
      1, not 1024, even if called in the pin_user_pages() path.  If someone
      pins a gate page, this happens:
      
      	pin_user_pages()
      		get_gate_page()
      			try_get_page() // bump refcount +1
      	... some time later
      	unpin_user_pages()
      		page_ref_sub_and_test(page, 1024))
      
      ... and boom, we get a refcount off by 1023.  This is reliably and
      quickly reproducible running the x86 selftests when booted with
      vsyscall=emulate (the default).  The selftests use ptrace(), but I
      suspect anything using pin_user_pages() on gate pages could hit this.
      
      To fix it, simply use try_grab_page() instead of try_get_page(), and
      pass 'gup_flags' in so that FOLL_PIN can be respected.
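
      A sketch of the fix in get_gate_page() (hedged; the error path is
      abbreviated):

        /* was try_get_page(*page): always +1, never the 1024 pin bias */
        if (unlikely(!try_grab_page(*page, gup_flags))) {
                ret = -ENOMEM;
                goto unmap;
        }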
      
      This bug traces back to the very beginning of the FOLL_PIN support in
      commit 3faa52c0 ("mm/gup: track FOLL_PIN pages"), which showed up in
      the 5.7 release.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Fixes: 3faa52c0 ("mm/gup: track FOLL_PIN pages")
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: x86@kernel.org
      Cc: Jann Horn <jannh@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9fa2dd94
  14. 15 Aug 2020, 1 commit
  15. 13 Aug 2020, 6 commits
    • mm/gup: remove task_struct pointer for all gup code · 64019a2e
      Authored by Peter Xu
      After the cleanup of page fault accounting, gup does not need to pass
      task_struct around any more.  Remove that parameter in the whole gup
      stack.
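
      A sketch of one resulting signature after the parameter removal
      (hedged; the tsk argument used to come first):

        long get_user_pages_remote(struct mm_struct *mm,
                                   unsigned long start, unsigned long nr_pages,
                                   unsigned int gup_flags, struct page **pages,
                                   struct vm_area_struct **vmas, int *locked);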
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64019a2e
    • mm: clean up the last pieces of page fault accounting · a2beb5f1
      Authored by Peter Xu
      Here are the last pieces of page fault accounting that were still done
      outside handle_mm_fault(), where we still have regs==NULL when calling
      handle_mm_fault():
      
      arch/powerpc/mm/copro_fault.c:   copro_handle_mm_fault
      arch/sparc/mm/fault_32.c:        force_user_fault
      arch/um/kernel/trap.c:           handle_page_fault
      mm/gup.c:                        faultin_page
                                       fixup_user_fault
      mm/hmm.c:                        hmm_vma_fault
      mm/ksm.c:                        break_ksm
      
      Some of them have the issue of duplicated accounting for page fault
      retries; some of them didn't do the accounting at all.
      
      This patch cleans all these up by letting handle_mm_fault() do the
      per-task page fault accounting even if regs==NULL (though we still skip
      the perf event accounting in that case).  With that, we can safely
      remove all the outliers now.
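
      A sketch of the relevant logic in handle_mm_fault()'s accounting
      helper (hedged, simplified):

        if (major)
                current->maj_flt++;
        else
                current->min_flt++;

        /*
         * regs == NULL means the fault came from gup or another kernel
         * path: still account to the task, but skip the perf events.
         */
        if (!regs)
                return;

        if (major)
                perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
        else
                perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);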
      
      There's another functional change: we now account the page faults to
      the caller of gup rather than to the task_struct that was passed into
      the gup code.  More information on this can be found at [1].
      
      After this patch, the following should never be touched again outside
      handle_mm_fault():
      
        - task_struct.[maj|min]_flt
        - PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]
      
      [1] https://lore.kernel.org/lkml/CAHk-=wj_V2Tps2QrMn20_W0OJF9xqNh52XSGA42s-ZJ8Y+GyKw@mail.gmail.com/
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200707225021.200906-25-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2beb5f1
    • mm: do page fault accounting in handle_mm_fault · bce617ed
      Authored by Peter Xu
      Patch series "mm: Page fault accounting cleanups", v5.
      
      This is v5 of the page fault accounting cleanup series.  It originates
      from Gerald Schaefer's report a week ago of an issue regarding
      incorrect page fault accounting for retried page faults after commit
      4064b982 ("mm: allow VM_FAULT_RETRY for multiple times"):
      
        https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/
      
      What this series did:
      
        - Correct page fault accounting: we do accounting for a page fault
          (no matter whether it's from #PF handling, or gup, or anything
          else) only with the one attempt that completed the fault.  For
          example, page fault retries should not be counted in the page
          fault counters.  The same applies to the perf events.
      
        - Unify the definition of PERF_COUNT_SW_PAGE_FAULTS: currently this
          perf event is used in an ad-hoc way across different archs.

          Case (1): for many archs it's done at the entry of a page fault
          handler, so that it will also cover e.g. erroneous faults.

          Case (2): for some other archs, it is only accounted when the page
          fault is resolved successfully.

          Case (3): there are still quite a few archs that have not enabled
          this perf event.
      
          Since this series touches nearly all the archs, we unify this
          perf event to always follow case (1), which is the one that makes
          the most sense.  And since we moved the accounting into
          handle_mm_fault, the other two MAJ/MIN perf events are well taken
          care of naturally.
      
        - Unify definition of "major faults": the definition of "major
          fault" is slightly changed when used in accounting (not
          VM_FAULT_MAJOR).  More information in patch 1.
      
        - Always account the page fault onto the task that triggered it.
          This does not matter much for #PF handling, but mostly matters
          for gup.  More information on this in patch 25.
      
      Patchset layout:
      
      Patch 1:     Introduced the accounting in handle_mm_fault(), not enabled.
      Patch 2-23:  Enable the new accounting for arch #PF handlers one by one.
      Patch 24:    Enable the new accounting for the rest outliers (gup, iommu, etc.)
      Patch 25:    Cleanup GUP task_struct pointer since it's not needed any more
      
      This patch (of 25):
      
      This is a preparation patch to move page fault accounting into the
      general code in handle_mm_fault().  This includes both the per-task
      maj_flt/min_flt counters and the major/minor page fault perf events.
      To do this, the pt_regs pointer is passed into handle_mm_fault().
      
      PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
      handlers.
      
      So far, every pt_regs pointer passed into handle_mm_fault() is NULL,
      which means this patch should have no intended functional change.
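
      A hedged sketch of the new prototype (the regs parameter is what this
      patch adds; all callers pass NULL at this point in the series):

        vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
                                   unsigned long address, unsigned int flags,
                                   struct pt_regs *regs);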
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
      Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bce617ed
    • mm/gup: use a standard migration target allocation callback · ed03d924
      Authored by Joonsoo Kim
      There is a well-defined migration target allocation callback. Use it.
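
      A sketch of the switch to the common callback (hedged; assuming the
      simplified control structure below, from the CMA migration path in
      mm/gup.c):

        struct migration_target_control mtc = {
                .nid = NUMA_NO_NODE,
                .gfp_mask = GFP_USER | __GFP_NOWARN,
        };

        /* replaces the open-coded new_non_cma_page() callback */
        migrate_pages(&cma_page_list, alloc_migration_target, NULL,
                      (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE);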
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/1596180906-8442-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ed03d924
    • mm/hugetlb: make hugetlb migration callback CMA aware · bbe88753
      Authored by Joonsoo Kim
      new_non_cma_page() in gup.c needs to allocate a new page that is not in
      a CMA area.  It implements this by using the allocation scope APIs.
      
      However, there is a work-around for hugetlb.  The normal hugetlb page
      allocation API for migration is alloc_huge_page_nodemask().  It
      consists of two steps: first, dequeuing from the pool; second, if no
      page is available on the queue, allocating with the page allocator.
      
      new_non_cma_page() can't use this API since the first step (dequeue)
      isn't aware of the scope API that excludes the CMA area.  So,
      new_non_cma_page() exports the hugetlb-internal function for the
      second step, alloc_migrate_huge_page(), to global scope and uses it
      directly.  This is suboptimal since hugetlb pages on the queue cannot
      be utilized.
      
      This patch fixes this situation by making the hugetlb dequeue function
      CMA aware.  In the dequeue function, CMA memory is skipped if the
      PF_MEMALLOC_NOCMA flag is set.
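
      A sketch of the dequeue-side change (hedged, simplified from
      mm/hugetlb.c):

        static struct page *dequeue_huge_page_node_exact(struct hstate *h,
                                                         int nid)
        {
                struct page *page;
                bool nocma = !!(current->flags & PF_MEMALLOC_NOCMA);

                list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
                        /* honor the allocation scope: skip CMA pages */
                        if (nocma && is_migrate_cma_page(page))
                                continue;
                        ...
                }
                ...
        }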
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/1596180906-8442-2-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbe88753
    • mm/gup: restrict CMA region by using allocation scope API · 41b4dc14
      Authored by Joonsoo Kim
      We have a well-defined scope API to exclude the CMA region.  Use it
      rather than manipulating gfp_mask manually.  With this change, we can
      now restore __GFP_MOVABLE in gfp_mask as for usual migration target
      allocation, so ZONE_MOVABLE is also searched by the page allocator.
      For hugetlb, gfp_mask is redefined since it has a regular allocation
      mask filter for migration targets.  __GFP_NOWARN is added to the
      hugetlb gfp_mask filter since its new user, gup, wants to be silent
      when allocation fails.
      
      Note that this can be considered a fix for commit 9a4e9f3b ("mm:
      update get_user_pages_longterm to migrate pages allocated from CMA
      region").  However, a "Fixes" tag isn't added here since the old
      behavior was merely suboptimal and didn't cause any problem.
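
      A sketch of the scope API usage in the gup long-term path (hedged,
      simplified):

        unsigned long flags = 0;

        if (gup_flags & FOLL_LONGTERM)
                flags = memalloc_nocma_save(); /* exclude CMA in this scope */

        rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
                                     NULL, gup_flags);

        if (gup_flags & FOLL_LONGTERM) {
                /* ... migrate any CMA pages that were still returned ... */
                memalloc_nocma_restore(flags);
        }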
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Link: http://lkml.kernel.org/r/1596180906-8442-1-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41b4dc14
  16. 08 Aug 2020, 1 commit
  17. 20 Jun 2020, 2 commits
  18. 10 Jun 2020, 6 commits
  19. 09 Jun 2020, 3 commits