1. 05 Sep 2020, 1 commit
  2. 24 Aug 2020, 1 commit
  3. 13 Aug 2020, 1 commit
    • mm: do page fault accounting in handle_mm_fault · bce617ed
      Authored by Peter Xu
      Patch series "mm: Page fault accounting cleanups", v5.
      
      This is v5 of the pf accounting cleanup series.  It originates from Gerald
      Schaefer's report a week ago of an issue regarding incorrect page fault
      accounting for retried page faults after commit 4064b982 ("mm: allow
      VM_FAULT_RETRY for multiple times"):
      
        https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/
      
      What this series did:
      
        - Correct page fault accounting: we account a page fault (no matter
          whether it's from #PF handling, gup, or anything else) only once, for
          the attempt that completed the fault.  For example, page fault
          retries should not be counted in the page fault counters.  The same
          applies to the perf events.
      
        - Unify the definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
          event is used in an ad hoc way across different archs.
      
          Case (1): for many archs it's done at the entry of a page fault
          handler, so that it will also cover e.g.  erroneous faults.
      
          Case (2): for some other archs, it is only accounted when the page
          fault is resolved successfully.
      
          Case (3): there are still quite a few archs that have not enabled
          this perf event.
      
          Since this series will touch nearly all the archs, we unify this
          perf event to always follow case (1), which is the one that makes the
          most sense.  And since we moved the accounting into handle_mm_fault,
          the other two MAJ/MIN perf events are taken care of naturally.
      
        - Unify definition of "major faults": the definition of "major
          fault" is slightly changed when used in accounting (not
          VM_FAULT_MAJOR).  More information in patch 1.
      
        - Always account the page fault to the task that triggered the page
          fault.  This does not matter much for #PF handling, but it does for
          gup.  More information on this in patch 25.
      
      Patchset layout:
      
      Patch 1:     Introduced the accounting in handle_mm_fault(), not enabled.
      Patch 2-23:  Enable the new accounting for arch #PF handlers one by one.
      Patch 24:    Enable the new accounting for the remaining outliers (gup, iommu, etc.)
      Patch 25:    Clean up the GUP task_struct pointer since it is no longer needed
      
      This patch (of 25):
      
      This is a preparation patch to move page fault accounting into the
      general code in handle_mm_fault().  This includes both the per task
      flt_maj/flt_min counters, and the major/minor page fault perf events.  To
      do this, the pt_regs pointer is passed into handle_mm_fault().
      
      PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
      handlers.
      
      So far, every pt_regs pointer passed into handle_mm_fault() is NULL,
      which means this patch should have no intended functional change.
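
      A rough sketch of the resulting interface, pieced together from the
      description above (the exact prototype and call sites may differ):

        vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
                                   unsigned int flags, struct pt_regs *regs);

        /* arch #PF handlers pass their real regs once converted by patches 2-23 */
        fault = handle_mm_fault(vma, address, flags, regs);

        /* gup and the other callers keep passing NULL for now, so no accounting
         * is done inside handle_mm_fault() and behaviour is unchanged */
        fault = handle_mm_fault(vma, address, flags, NULL);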
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
      Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 21 Jul 2020, 1 commit
    • powerpc/64s: Remove PROT_SAO support · 5c9fa16e
      Authored by Nicholas Piggin
      ISA v3.1 does not support the SAO storage control attribute required to
      implement PROT_SAO. PROT_SAO was used by specialised system software
      (Lx86) that has been discontinued for about 7 years, and is not thought
      to be used elsewhere, so removal should not cause problems.
      
      We would rather remove it than keep support for older processors, because
      live migrating guest partitions to newer processors may not be possible
      if SAO is in use (or worse, allowed with silent races).
      
      - PROT_SAO stays in the uapi header so code using it would still build.
      - arch_validate_prot() is removed; the generic version rejects PROT_SAO,
        so applications would get a failure at mmap() time.
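
      A minimal userspace sketch of the resulting behaviour (the PROT_SAO value
      is the powerpc uapi one; the failure mode is as described above):

        #include <stdio.h>
        #include <sys/mman.h>

        #ifndef PROT_SAO
        #define PROT_SAO 0x10            /* powerpc uapi value, kept in the header */
        #endif

        int main(void)
        {
                void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_SAO,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED)
                        perror("mmap(PROT_SAO)");   /* now expected to fail */
                else
                        munmap(p, 4096);
                return 0;
        }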
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Drop KVM change for the time being]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200703011958.1166620-3-npiggin@gmail.com
  5. 17 Jul 2020, 1 commit
  6. 10 Jun 2020, 3 commits
  7. 05 Jun 2020, 1 commit
  8. 22 Apr 2020, 1 commit
    • mm/ksm: fix NULL pointer dereference when KSM zero page is enabled · 56df70a6
      Authored by Muchun Song
      find_mergeable_vma() can return NULL.  In this case, it leads to a crash
      when we access vm_mm (its offset is 0x40) later in write_protect_page.
      And this case did happen on our server.  The following call trace was
      captured in kernel 4.19 with the following patch applied and the KSM zero
      page enabled on our server.
      
        commit e86c59b1 ("mm/ksm: improve deduplication of zero pages with colouring")
      
      So add a vma check to fix it.
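
      A simplified sketch of the fix in the KSM zero-page merge path (variable
      names assumed; the point is simply to skip the merge when the vma is gone):

        down_read(&mm->mmap_sem);
        vma = find_mergeable_vma(mm, rmap_item->address);
        if (vma)
                err = try_to_merge_one_page(vma, page,
                                            ZERO_PAGE(rmap_item->address));
        else
                err = 0;        /* the vma is out of date, just exit */
        up_read(&mm->mmap_sem);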
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
        Oops: 0000 [#1] SMP NOPTI
        CPU: 9 PID: 510 Comm: ksmd Kdump: loaded Tainted: G OE 4.19.36.bsk.9-amd64 #4.19.36.bsk.9
        RIP: try_to_merge_one_page+0xc7/0x760
        Code: 24 58 65 48 33 34 25 28 00 00 00 89 e8 0f 85 a3 06 00 00 48 83 c4
              60 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 46 08 a8 01 75 b8 <49>
              8b 44 24 40 4c 8d 7c 24 20 b9 07 00 00 00 4c 89 e6 4c 89 ff 48
        RSP: 0018:ffffadbdd9fffdb0 EFLAGS: 00010246
        RAX: ffffda83ffd4be08 RBX: ffffda83ffd4be40 RCX: 0000002c6e800000
        RDX: 0000000000000000 RSI: ffffda83ffd4be40 RDI: 0000000000000000
        RBP: ffffa11939f02ec0 R08: 0000000094e1a447 R09: 00000000abe76577
        R10: 0000000000000962 R11: 0000000000004e6a R12: 0000000000000000
        R13: ffffda83b1e06380 R14: ffffa18f31f072c0 R15: ffffda83ffd4be40
        FS: 0000000000000000(0000) GS:ffffa0da43b80000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000040 CR3: 0000002c77c0a003 CR4: 00000000007626e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        PKRU: 55555554
        Call Trace:
          ksm_scan_thread+0x115e/0x1960
          kthread+0xf5/0x130
          ret_from_fork+0x1f/0x30
      
      [songmuchun@bytedance.com: if the vma is out of date, just exit]
        Link: http://lkml.kernel.org/r/20200416025034.29780-1-songmuchun@bytedance.com
      [akpm@linux-foundation.org: add the conventional braces, replace /** with /*]
      Fixes: e86c59b1 ("mm/ksm: improve deduplication of zero pages with colouring")
      Co-developed-by: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Cc: Markus Elfring <Markus.Elfring@web.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200416025034.29780-1-songmuchun@bytedance.com
      Link: http://lkml.kernel.org/r/20200414132905.83819-1-songmuchun@bytedance.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 08 Apr 2020, 2 commits
  10. 28 Nov 2019, 1 commit
  11. 23 Nov 2019, 1 commit
  12. 25 Sep 2019, 1 commit
  13. 19 Jun 2019, 1 commit
  14. 15 May 2019, 2 commits
    • mm/mmu_notifier: use correct mmu_notifier events for each invalidation · 7269f999
      Authored by Jérôme Glisse
      This updates each existing invalidation to use the correct mmu notifier
      event that represents what is happening to the CPU page table.  See the
      patch which introduced the events for the rationale behind this.
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier: contextual information for event triggering invalidation · 6f4f13e8
      Authored by Jérôme Glisse
      CPU page table updates can happen for many reasons, not only as a result
      of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
      a result of kernel activities (memory compression, reclaim, migration,
      ...).
      
      Users of the mmu notifier API track changes to the CPU page table and
      take specific action for them.  However, the current API only provides
      the range of virtual addresses affected by the change, not why the change
      is happening.
      
      This patchset does the initial mechanical conversion of all the places
      that call mmu_notifier_range_init to also provide the default
      MMU_NOTIFY_UNMAP event, as well as the vma if it is known (most
      invalidations happen against a given vma).  Passing down the vma allows
      the users of the mmu notifier to inspect the new vma page protection.
      
      MMU_NOTIFY_UNMAP is always the safe default, as users of the mmu notifier
      should assume that everything in the range is going away when that event
      happens.  A later patch converts each mm call path to use a more
      appropriate event.
      
      This is done as 2 patches so that no call site is forgotten, especially
      as the conversion uses the following coccinelle patch:
      
      %<----------------------------------------------------------------------
      @@
      identifier I1, I2, I3, I4;
      @@
      static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
      +enum mmu_notifier_event event,
      +unsigned flags,
      +struct vm_area_struct *vma,
      struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
      
      @@
      @@
      -#define mmu_notifier_range_init(range, mm, start, end)
      +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
      
      @@
      expression E1, E3, E4;
      identifier I1;
      @@
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, I1,
      I1->vm_mm, E3, E4)
      ...>
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(..., struct vm_area_struct *VMA, ...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(...) {
      struct vm_area_struct *VMA;
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN;
      @@
      FN(...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, NULL,
      E2, E3, E4)
      ...> }
      ---------------------------------------------------------------------->%
      
      Applied with:
      spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
      spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
      spatch --sp-file mmu-notifier.spatch --dir mm --in-place
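
      After the conversion, a typical call site therefore looks roughly like
      this (arguments illustrative):

        mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma,
                                vma->vm_mm, start, end);
        mmu_notifier_invalidate_range_start(&range);
        /* ... update the CPU page table ... */
        mmu_notifier_invalidate_range_end(&range);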
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 06 Mar 2019, 3 commits
    • mm: ksm: do not block on page lock when searching stable tree · 2cee57d1
      Authored by Yang Shi
      ksmd needs to search the stable tree to look for a suitable KSM page, but
      the KSM page might be locked for a while due to, e.g., a KSM page rmap
      walk.  Basically this is not a big deal since commit 2c653d0e ("ksm:
      introduce ksm_max_page_sharing per page deduplication limit"), because
      max_page_sharing limits the number of shared KSM pages.
      
      But it still does not seem worth waiting for the lock; the page can be
      skipped and merged in the next scan instead, avoiding a potential stall
      if its content is still intact.
      
      Introduce a trylock mode to get_ksm_page() so it does not block on the
      page lock, like what try_to_merge_one_page() does.  And define the three
      possible operations (nolock, lock and trylock) as an enum type to avoid
      stacking up bools and make the code more readable.
      
      Return -EBUSY if the trylock fails, since NULL means no suitable KSM page
      was found, which is a valid case.
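
      A sketch of the resulting interface as described above (identifier names
      assumed):

        enum get_ksm_page_flags {
                GET_KSM_PAGE_NOLOCK,
                GET_KSM_PAGE_LOCK,
                GET_KSM_PAGE_TRYLOCK
        };

        /* in the stable tree search: do not sleep on a locked KSM page */
        tree_page = get_ksm_page(stable_node_dup, GET_KSM_PAGE_TRYLOCK);
        if (PTR_ERR(tree_page) == -EBUSY)
                return ERR_PTR(-EBUSY);   /* skip it, try again on the next scan */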
      
      With the default max_page_sharing setting (256), there is almost no
      observed change comparing lock vs trylock.
      
      However, with LTP's ksm02, which sets max_page_sharing to 786432, a
      reduced ksmd full scan time can be observed.  With the lock version, ksmd
      may take 10s - 11s to run two full scans; with the trylock version, ksmd
      may take 8s - 11s to run two full scans.  And the numbers of
      pages_sharing and pages_to_scan stay the same.  Basically, this change
      does no harm.
      
      [hughd@google.com: fix BUG_ON()]
        Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1902182122280.6914@eggly.anvils
      Link: http://lkml.kernel.org/r/1548793753-62377-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Suggested-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: reuse only-pte-mapped KSM page in do_wp_page() · 52d1e606
      Authored by Kirill Tkhai
      Add an optimization for KSM pages almost in the same way that we have for
      ordinary anonymous pages.  If there is a write fault in a page which is
      mapped by only one pte and is not related to the swap cache, the page may
      be reused without copying its content.
      
      [ Note that we do not consider PageSwapCache() pages at least for now,
        since we don't want to complicate __get_ksm_page(), which has a nice
        optimization based on this (for the migration case).  Currently it is
        spinning on PageSwapCache() pages, waiting for them to have unfrozen
        counters (i.e., for the migration to finish).  But we don't want to
        make it also spin on swap cache pages which we try to reuse, since
        there is not a very high probability of reusing them.  So, for now we
        do not consider PageSwapCache() pages at all. ]
      
      So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
      page_stable_node(), to skip a page which KSM is currently trying to link
      to the stable tree.  Then we do page_ref_freeze() to prohibit KSM from
      merging one more page into the page we are reusing.  After that, nobody
      can refer to the page being reused: KSM skips !PageSwapCache() pages with
      zero refcount; and the protection against all other participants is the
      same as for reused ordinary anon pages: pte lock, page lock and
      mmap_sem.
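
      A simplified sketch of reuse_ksm_page() following the description above
      (details assumed):

        bool reuse_ksm_page(struct page *page, struct vm_area_struct *vma,
                            unsigned long address)
        {
                if (PageSwapCache(page) || !page_stable_node(page))
                        return false;
                /* prohibit KSM from merging one more mapping into this page */
                if (!page_ref_freeze(page, 1))
                        return false;

                page_move_anon_rmap(page, vma);
                page->index = linear_page_index(vma, address);
                page_ref_unfreeze(page, 1);

                return true;
        }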
      
      [akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
      Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: replace all open encodings for NUMA_NO_NODE · 98fa15f3
      Authored by Anshuman Khandual
      Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
      
      All these places for replacement were found by running the following
      grep patterns on the entire kernel code.  Please let me know if this
      might have missed some instances.  This might also have replaced some
      false positives.  I would appreciate suggestions, inputs and review.
      
      1. git grep "nid == -1"
      2. git grep "node == -1"
      3. git grep "nid = -1"
      4. git grep "node = -1"
      
      This patch (of 2):
      
      At present there are multiple places where an invalid node number is
      encoded as -1.  Even though it is implicitly understood, it is always
      better to have a macro for it.  Replace these open encodings of an
      invalid node number with the global macro NUMA_NO_NODE.  This helps
      remove NUMA-related assumptions like 'invalid node' from various places,
      redirecting them to a common definition.
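
      For illustration, a typical replacement looks like this (the surrounding
      code is made up):

        /* before */
        int nid = -1;
        if (nid == -1)
                nid = numa_mem_id();

        /* after */
        int nid = NUMA_NO_NODE;
        if (nid == NUMA_NO_NODE)
                nid = numa_mem_id();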
      
      Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
      Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
      Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 29 Dec 2018, 3 commits
  17. 23 Aug 2018, 2 commits
  18. 18 Aug 2018, 2 commits
    • mm: convert return type of handle_mm_fault() caller to vm_fault_t · 50a7ca3c
      Authored by Souptick Joarder
      Use the new return type vm_fault_t for fault handlers.  For now, this is
      just documenting that the function returns a VM_FAULT value rather than
      an errno.  Once all instances are converted, vm_fault_t will become a
      distinct type.
      
      Ref-> commit 1c8f4220 ("mm: change return type to vm_fault_t")
      
      In this patch all the callers of handle_mm_fault() are changed to use the
      vm_fault_t type.
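
      A sketch of what this looks like in an arch fault handler (fragment only;
      the error handling shown is illustrative):

        vm_fault_t fault;

        fault = handle_mm_fault(vma, address, flags);
        if (fault & VM_FAULT_ERROR)
                return fault;           /* OOM, SIGBUS, ... handled by the caller */
        if (fault & VM_FAULT_MAJOR)
                current->maj_flt++;
        else
                current->min_flt++;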
      
      Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: James E.J. Bottomley <jejb@parisc-linux.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: remove VM_MIXEDMAP for fsdax and device dax · e1fb4a08
      Authored by Dave Jiang
      This patch is reworked from an earlier patch that Dan has posted:
      https://patchwork.kernel.org/patch/10131727/
      
      VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
      the memory page it is dealing with is not typical memory from the linear
      map.  The get_user_pages_fast() path, since it does not resolve the vma,
      is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
      use that as a VM_MIXEDMAP replacement in some locations.  In the cases
      where there is no pte to consult we fall back to using vma_is_dax() to
      detect the VM_MIXEDMAP special case.
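
      Illustrative only (the helper name is made up): where a path has no pte
      to consult, the DAX case can now be detected from the vma itself instead
      of from VM_MIXEDMAP:

        static bool vma_needs_special_page_handling(struct vm_area_struct *vma)
        {
                /* DAX pages are not typical linear-map memory, VM_MIXEDMAP or not */
                return vma_is_dax(vma) ||
                       (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP));
        }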
      
      Now that we have explicit driver pfn_t-flag opt-in/opt-out for
      get_user_pages() support for DAX we can stop setting VM_MIXEDMAP.  This
      also means we no longer need to worry about safely manipulating vm_flags
      in a future where we support dynamically changing the dax mode of a
      file.
      
      DAX should also now be supported with madvise_behavior(), vma_merge(),
      and copy_page_range().
      
      This patch has been tested against the ndctl unit tests.  It has also
      been tested against xfstests commit 625515d using fake pmem created by
      memmap, and no additional issues have been observed.
      
      Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 15 Jun 2018, 1 commit
    • mm/ksm.c: ignore STABLE_FLAG of rmap_item->address in rmap_walk_ksm() · 1105a2fc
      Authored by Jia He
      On our armv8a server (QDF2400), I noticed lots of WARN_ONs caused by
      rmap_item->address not being PAGE_SIZE aligned under memory pressure
      tests (start 20 guests and run memhog in the host).
      
        WARNING: CPU: 4 PID: 4641 at virt/kvm/arm/mmu.c:1826 kvm_age_hva_handler+0xc0/0xc8
        CPU: 4 PID: 4641 Comm: memhog Tainted: G        W 4.17.0-rc3+ #8
        Call trace:
         kvm_age_hva_handler+0xc0/0xc8
         handle_hva_to_gpa+0xa8/0xe0
         kvm_age_hva+0x4c/0xe8
         kvm_mmu_notifier_clear_flush_young+0x54/0x98
         __mmu_notifier_clear_flush_young+0x6c/0xa0
         page_referenced_one+0x154/0x1d8
         rmap_walk_ksm+0x12c/0x1d0
         rmap_walk+0x94/0xa0
         page_referenced+0x194/0x1b0
         shrink_page_list+0x674/0xc28
         shrink_inactive_list+0x26c/0x5b8
         shrink_node_memcg+0x35c/0x620
         shrink_node+0x100/0x430
         do_try_to_free_pages+0xe0/0x3a8
         try_to_free_pages+0xe4/0x230
         __alloc_pages_nodemask+0x564/0xdc0
         alloc_pages_vma+0x90/0x228
         do_anonymous_page+0xc8/0x4d0
         __handle_mm_fault+0x4a0/0x508
         handle_mm_fault+0xf8/0x1b0
         do_page_fault+0x218/0x4b8
         do_translation_fault+0x90/0xa0
         do_mem_abort+0x68/0xf0
         el0_da+0x24/0x28
      
      In rmap_walk_ksm, rmap_item->address might still have the STABLE_FLAG
      set, so the start and end passed to handle_hva_to_gpa might not be
      PAGE_SIZE aligned.  This causes exceptions in handle_hva_to_gpa on
      arm64.

      This patch fixes it by ignoring (not removing) the low bits of the
      address when doing rmap_walk_ksm.
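
      A sketch of the idea (simplified; the actual patch masks only the KSM
      flag bits kept in the low bits of rmap_item->address):

        anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root, 0, ULONG_MAX) {
                /* ignore the stable/unstable/seqnr flags in the low bits */
                unsigned long addr = rmap_item->address & PAGE_MASK;
                struct vm_area_struct *vma = vmac->vma;

                if (addr < vma->vm_start || addr >= vma->vm_end)
                        continue;
                /* walk this vma with the flag-free, page aligned address */
        }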
      
      IMO, it should be backported to the stable tree.  The storm of WARN_ONs
      is very easy for me to reproduce.  More than that, I observed a panic
      (not reproducible) as follows:
      
        page:ffff7fe003742d80 count:-4871 mapcount:-2126053375 mapping: (null) index:0x0
        flags: 0x1fffc00000000000()
        raw: 1fffc00000000000 0000000000000000 0000000000000000 ffffecf981470000
        raw: dead000000000100 dead000000000200 ffff8017c001c000 0000000000000000
        page dumped because: nonzero _refcount
        CPU: 29 PID: 18323 Comm: qemu-kvm Tainted: G W 4.14.15-5.hxt.aarch64 #1
        Hardware name: <snip for confidential issues>
        Call trace:
          dump_backtrace+0x0/0x22c
          show_stack+0x24/0x2c
          dump_stack+0x8c/0xb0
          bad_page+0xf4/0x154
          free_pages_check_bad+0x90/0x9c
          free_pcppages_bulk+0x464/0x518
          free_hot_cold_page+0x22c/0x300
          __put_page+0x54/0x60
          unmap_stage2_range+0x170/0x2b4
          kvm_unmap_hva_handler+0x30/0x40
          handle_hva_to_gpa+0xb0/0xec
          kvm_unmap_hva_range+0x5c/0xd0
      
      I even injected a fault on purpose in kvm_unmap_hva_range by setting
      size=size-0x200; the call trace is similar to the above.  So I think the
      panic has the same root cause as the WARN_ONs.
      
      Andrea said:
      
      : It looks a straightforward safe fix, on x86 hva_to_gfn_memslot would
      : zap those bits and hide the misalignment caused by the low metadata
      : bits being erroneously left set in the address, but the arm code
      : notices when that's the last page in the memslot and the hva_end is
      : getting aligned and the size is below one page.
      :
      : I think the problem triggers in the addr += PAGE_SIZE of
      : unmap_stage2_ptes that never matches end because end is aligned but
      : addr is not.
      :
      : 	} while (pte++, addr += PAGE_SIZE, addr != end);
      :
      : x86 again only works on hva_start/hva_end after converting it to
      : gfn_start/end and that being in pfn units the bits are zapped before
      : they risk to cause trouble.
      
      Jia He said:
      
      : I've tested it myself on an arm64 server (QDF2400, 46 cpus, 96G mem).
      : Without this patch, the WARN_ON is very easy to reproduce.  After this
      : patch, I have run the same benchmark for a whole day without any WARN_ONs
      
      Link: http://lkml.kernel.org/r/1525403506-6750-1-git-send-email-hejianet@gmail.com
      Signed-off-by: Jia He <jia.he@hxt-semitech.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Jia He <hejianet@gmail.com>
      Cc: Suzuki K Poulose <Suzuki.Poulose@arm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 08 Jun 2018, 1 commit
  21. 28 Apr 2018, 1 commit
  22. 17 Apr 2018, 1 commit
  23. 12 Apr 2018, 1 commit
  24. 06 Apr 2018, 2 commits
  25. 18 Mar 2018, 1 commit
    • sparc64: Add support for ADI (Application Data Integrity) · 74a04967
      Authored by Khalid Aziz
      ADI is a new feature supported on SPARC M7 and newer processors to allow
      hardware to catch rogue accesses to memory. ADI is supported for data
      fetches only and not instruction fetches. An app can enable ADI on its
      data pages, set version tags on them and use versioned addresses to
      access the data pages. Upper bits of the address contain the version
      tag. On M7 processors, upper four bits (bits 63-60) contain the version
      tag. If a rogue app attempts to access ADI enabled data pages, its
      access is blocked and processor generates an exception. Please see
      Documentation/sparc/adi.txt for further details.
      
      This patch extends mprotect to enable ADI (TSTATE.mcde), enable/disable
      MCD (Memory Corruption Detection) on selected memory ranges, enable
      TTE.mcd in PTEs, return ADI parameters to userspace and save/restore ADI
      version tags on page swap out/in or migration. ADI is not enabled by
      default for any task. A task must explicitly enable ADI on a memory
      range and set version tag for ADI to be effective for the task.
      Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Khalid Aziz <khalid@gonehiking.org>
      Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 07 Feb 2018, 1 commit
  27. 05 Dec 2017, 1 commit
    • mm/ksm: Remove now-redundant smp_read_barrier_depends() · 08df4774
      Authored by Paul E. McKenney
      Because READ_ONCE() now implies smp_read_barrier_depends(), the
      smp_read_barrier_depends() in get_ksm_page() is now redundant.
      This commit removes it and updates the comments.
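
      Roughly the shape of the change in get_ksm_page() (simplified sketch):

        /* before */
        kpfn = READ_ONCE(stable_node->kpfn);
        smp_read_barrier_depends();
        page = pfn_to_page(kpfn);

        /* after: READ_ONCE() already orders the dependent pfn_to_page() access */
        kpfn = READ_ONCE(stable_node->kpfn);
        page = pfn_to_page(kpfn);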
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Cc: <linux-mm@kvack.org>
  28. 16 Nov 2017, 1 commit
    • mm/mmu_notifier: avoid double notification when it is useless · 0f10851e
      Authored by Jérôme Glisse
      This patch only affects users of the mmu_notifier->invalidate_range
      callback, which are device drivers related to ATS/PASID, CAPI, IOMMUv2,
      SVM ... and it is an optimization for those users.  Everyone else is
      unaffected by it.
      
      When clearing a pte/pmd we are given a choice to notify the event under
      the page table lock (the notify versions of the *_clear_flush helpers do
      call mmu_notifier_invalidate_range).  But that notification is not
      necessary in all cases.
      
      This patch removes the call to mmu_notifier_invalidate_range before
      mmu_notifier_invalidate_range_end in almost all of the cases where it is
      useless.  It also adds documentation in all those cases explaining why.
      
      Below is a more in-depth analysis of why it is fine to do this:
      
      For a secondary TLB (non-CPU TLB) like an IOMMU TLB or a device TLB (when
      a device uses something like ATS/PASID to get the IOMMU to walk the CPU
      page table to access a process virtual address space), there are only 2
      cases when you need to notify those secondary TLBs while holding the page
      table lock when clearing a pte/pmd:
      
        A) the page backing the address is freed before mmu_notifier_invalidate_range_end
        B) a page table entry is updated to point to a new page (COW, write fault
           on zero page, __replace_page(), ...)
      
      Case A is obvious: you do not want to take the risk of the device writing
      to a page that might now be used by something completely different.
      
      Case B is more subtle. For correctness it requires the following sequence
      to happen:
        - take page table lock
        - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
        - set page table entry to point to new page
      
      If clearing the page table entry is not followed by a notify before setting
      the new pte/pmd value, then you can break a memory model like C11 or C++11
      for the device.
      
      Consider the following scenario (the device uses a feature similar to
      ATS/PASID):

      Two addresses addrA and addrB such that |addrA - addrB| >= PAGE_SIZE; we
      assume they are write protected for COW (the other cases of B apply too).
      
      [Time N] -----------------------------------------------------------------
      CPU-thread-0  {try to write to addrA}
      CPU-thread-1  {try to write to addrB}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA and populate device TLB}
      DEV-thread-2  {read addrB and populate device TLB}
      [Time N+1] ---------------------------------------------------------------
      CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
      CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+2] ---------------------------------------------------------------
      CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
      CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {write to addrA which is a write to new page}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {}
      CPU-thread-3  {write to addrB which is a write to new page}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+4] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+5] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA from old page}
      DEV-thread-2  {read addrB from new page}
      
      So here, because at time N+2 the clearing of the page table entry was not
      paired with a notification to invalidate the secondary TLB, the device
      sees the new value for addrB before seeing the new value for addrA.  This
      breaks total memory ordering for the device.
      
      When changing a pte to write protect it, or to point it at a new write
      protected page with the same content (KSM), it is ok to delay the
      invalidate_range callback to mmu_notifier_invalidate_range_end() outside
      the page table lock.  This is true even if the thread doing the page
      table update is preempted right after releasing the page table lock,
      before calling mmu_notifier_invalidate_range_end().
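
      A sketch of the two patterns described above (fragments only; the
      surrounding locking and the exact pte values are elided):

        /* case B, new page with different content (e.g. COW): notify under the
         * page table lock, between the clear and the install of the new value */
        orig_pte = ptep_clear_flush_notify(vma, addr, ptep);
        set_pte_at_notify(mm, addr, ptep, new_pte);

        /* write protect, or new page with identical content (KSM): the
         * invalidate_range callback may be deferred to
         * mmu_notifier_invalidate_range_end() */
        orig_pte = ptep_clear_flush(vma, addr, ptep);
        set_pte_at(mm, addr, ptep, wrprot_pte);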
      
      Thanks to Andrea for thinking of a problematic scenario for COW.
      
      [jglisse@redhat.com: v2]
        Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  29. 04 Oct 2017, 1 commit