1. 03 June 2021, 1 commit
    • mm: notify remote TLBs when dirtying a PTE · d30a2a28
      By Jean-Philippe Brucker
      maillist inclusion
      category: feature
      bugzilla: 51855
      CVE: NA
      
      Reference: https://jpbrucker.net/git/linux/commit/?h=sva/2021-03-01&id=d32d8baaf293aaefef8a1c9b8a4508ab2ec46c61
      
      ---------------------------------------------
      
      The ptep_set_access_flags path in handle_pte_fault can cause a change of
      the pte's permissions on some architectures: a Read-Only and
      writeable-clean entry becomes Read-Write and dirty. This requires us to
      call the MMU notifier to invalidate the entry in remote TLBs, for instance
      in a PCIe Address Translation Cache (ATC).
      
      Here is a scenario where the lack of a notifier call ends up locking up a
      device:
      
      1) A shared anonymous buffer is mapped with READ|WRITE prot, at VA.
      
      2) A PCIe device with ATS/PRI/PASID capabilities wants to read the buffer,
         using its virtual address.
      
         a) Device asks for translation of VA for reading (NW=1)
      
         b) The IOMMU cannot fulfill the request, so the device does a Page
            Request for VA. The fault is handled with do_read_fault, after which
            the PTE has flags young, write and rdonly.
      
         c) Device retries the translation; IOMMU sends a Translation Completion
            with the PA and Read-Only permission.
      
         d) The VA->PA translation is stored in the ATC, with Read-Only
            permission. From the device's point of view, the page may or may not
            be writeable. It didn't ask for writeability, so it doesn't get a
            definite answer on that point.
      
      3) The same device now wants to write the buffer. It needs to restart
         the AT-PR-AT dance for writing this time.
      
         a) Device could ask for translation of VA for reading and writing
            (NW=0). The IOMMU would reply with the same Read-Only mapping, so
            this time the device is certain that the page isn't writeable. Some
            implementations might update their ATC entry to store that
            information. The ATS specification is pretty fuzzy on the behaviour
            to adopt.
      
         b) The entry is Read-Only, so we fault again. The PTE exists and is
            valid, all we need to do is mark it dirty. TLBs are invalidated, but
            not the ATC since there is no notifier.
      
         c) Now the behaviour depends on the device implementation. If 3a)
            didn't update the ATC entry, the device is still uncertain about
            the writeability of the page: goto 3a) - repeat the Translation
            Request and get Read-Write permissions.
      
            But if 3a) updated the ATC entry, the device is certain of the
            PTE's permissions, and will goto 3b) instead - repeat the page
            fault, again and again. This time we take the "spurious fault" path
            in the same function, which invalidates the TLB but doesn't call an
            MMU notifier either.
      
      To avoid this page request loop, call mmu_notifier_change_pte after
      dirtying the PTE.
      
      Note: if the IOMMU supports hardware update of the access/dirty bits, 3a)
      dirties the PTE, and the IOMMU returns RW permission to the device, so
      there is no need to do a Page Request.
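
      As a rough illustration (a sketch only, assuming the change lands in the
      dirty/young update path of handle_pte_fault(); the exact context in the
      referenced tree may differ):

        entry = pte_mkyoung(vmf->orig_pte);
        if (vmf->flags & FAULT_FLAG_WRITE)
                entry = pte_mkdirty(entry);
        if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
                                  vmf->flags & FAULT_FLAG_WRITE)) {
                update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
                /* New: notify remote TLBs (e.g. a PCIe ATC) of the RW entry */
                mmu_notifier_change_pte(vmf->vma->vm_mm, vmf->address, entry);
        }
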
      Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
      Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
      Reviewed-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      d30a2a28
  2. 22 April 2021, 1 commit
    • mm: fix race by making init_zero_pfn() early_initcall · a202e050
      By Ilya Lipnitskiy
      stable inclusion
      from stable-5.10.28
      commit ec3e06e06f763d8ef5ba3919e2c4e59366b5b92a
      bugzilla: 51779
      
      --------------------------------
      
      commit e720e7d0 upstream.
      
      There are code paths that rely on zero_pfn to be fully initialized
      before core_initcall.  For example, wq_sysfs_init() is a core_initcall
      function that eventually results in a call to kernel_execve, which
      causes a page fault with a subsequent mmput.  If zero_pfn is not
      initialized by then it may not get cleaned up properly and result in an
      error:
      
        BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1
      
      Here is an analysis of the race as seen on a MIPS device. On this
      particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until
      initialized, at which point it becomes PFN 5120:
      
        1. wq_sysfs_init calls into kobject_uevent_env at core_initcall:
             kobject_uevent_env+0x7e4/0x7ec
             kset_register+0x68/0x88
             bus_register+0xdc/0x34c
             subsys_virtual_register+0x34/0x78
             wq_sysfs_init+0x1c/0x4c
             do_one_initcall+0x50/0x1a8
             kernel_init_freeable+0x230/0x2c8
             kernel_init+0x10/0x100
             ret_from_kernel_thread+0x14/0x1c
      
        2. kobject_uevent_env() calls call_usermodehelper_exec() which executes
           kernel_execve asynchronously.
      
        3. Memory allocations in kernel_execve cause a page fault, bumping the
           MM reference counter:
             add_mm_counter_fast+0xb4/0xc0
             handle_mm_fault+0x6e4/0xea0
             __get_user_pages.part.78+0x190/0x37c
             __get_user_pages_remote+0x128/0x360
             get_arg_page+0x34/0xa0
             copy_string_kernel+0x194/0x2a4
             kernel_execve+0x11c/0x298
             call_usermodehelper_exec_async+0x114/0x194
      
        4. In case zero_pfn has not been initialized yet, zap_pte_range does
           not decrement the MM_ANONPAGES RSS counter and the BUG message is
           triggered shortly afterwards when __mmdrop checks the ref counters:
             __mmdrop+0x98/0x1d0
             free_bprm+0x44/0x118
             kernel_execve+0x160/0x1d8
             call_usermodehelper_exec_async+0x114/0x194
             ret_from_kernel_thread+0x14/0x1c
      
      To avoid races such as the one described above, run init_zero_pfn() at
      early_initcall level.  Depending on the architecture, ZERO_PAGE is
      either constant or gets initialized even earlier, at paging_init, so
      there is no issue with initializing zero_pfn earlier.
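
      For reference, the change boils down to registering the existing
      initializer at an earlier initcall level (sketch, mirroring the shape of
      init_zero_pfn() in mm/memory.c):

        static int __init init_zero_pfn(void)
        {
                zero_pfn = page_to_pfn(ZERO_PAGE(0));
                return 0;
        }
        early_initcall(init_zero_pfn);  /* was: core_initcall(init_zero_pfn) */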
      
      Link: https://lkml.kernel.org/r/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niAXOpqEbt7w@mail.gmail.com
      Signed-off-by: Ilya Lipnitskiy <ilya.lipnitskiy@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: stable@vger.kernel.org
      Tested-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      a202e050
  3. 09 April 2021, 6 commits
  4. 12 January 2021, 1 commit
  5. 19 October 2020, 1 commit
  6. 17 October 2020, 1 commit
  7. 14 October 2020, 4 commits
  8. 09 October 2020, 1 commit
    • mm: avoid early COW write protect games during fork() · f3c64eda
      By Linus Torvalds
      In commit 70e806e4 ("mm: Do early cow for pinned pages during fork()
      for ptes") we write-protected the PTE before doing the page pinning
      check, in order to avoid a race with concurrent fast-GUP pinning (which
      doesn't take the mm semaphore or the page table lock).
      
      That trick doesn't actually work - it doesn't handle memory ordering
      properly, and doing so would be prohibitively expensive.
      
      It also isn't really needed.  While we're moving in the direction of
      allowing and supporting page pinning without marking the pinned area
      with MADV_DONTFORK, the fact is that we've never really supported this
      kind of odd "concurrent fork() and page pinning", and doing the
      serialization on a pte level is just wrong.
      
      We can add serialization with a per-mm sequence counter, so we know how
      to solve that race properly, but we'll do that at a more appropriate
      time.  Right now this just removes the write protect games.
      
      It also turns out that the write protect games actually break on Power,
      as reported by Aneesh Kumar:
      
       "Architecture like ppc64 expects set_pte_at to be not used for updating
        a valid pte. This is further explained in commit 56eecdb9 ("mm:
        Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit")"
      
      and the code triggered a warning there:
      
        WARNING: CPU: 0 PID: 30613 at arch/powerpc/mm/pgtable.c:185 set_pte_at+0x2a8/0x3a0 arch/powerpc/mm/pgtable.c:185
        Call Trace:
          copy_present_page mm/memory.c:857 [inline]
          copy_present_pte mm/memory.c:899 [inline]
          copy_pte_range mm/memory.c:1014 [inline]
          copy_pmd_range mm/memory.c:1092 [inline]
          copy_pud_range mm/memory.c:1127 [inline]
          copy_p4d_range mm/memory.c:1150 [inline]
          copy_page_range+0x1f6c/0x2cc0 mm/memory.c:1212
          dup_mmap kernel/fork.c:592 [inline]
          dup_mm+0x77c/0xab0 kernel/fork.c:1355
          copy_mm kernel/fork.c:1411 [inline]
          copy_process+0x1f00/0x2740 kernel/fork.c:2070
          _do_fork+0xc4/0x10b0 kernel/fork.c:2429
      
      Link: https://lore.kernel.org/lkml/CAHk-=wiWr+gO0Ro4LvnJBMs90OiePNyrE3E+pJvc9PzdBShdmw@mail.gmail.com/
      Link: https://lore.kernel.org/linuxppc-dev/20201008092541.398079-1-aneesh.kumar@linux.ibm.com/
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Tested-by: Leon Romanovsky <leonro@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3c64eda
  9. 06 October 2020, 1 commit
  10. 28 September 2020, 2 commits
    • mm: Do early cow for pinned pages during fork() for ptes · 70e806e4
      By Peter Xu
      This allows copy_pte_range() to do early cow if the pages were pinned on
      the source mm.
      
      Currently we don't have an accurate way to know whether a page is pinned
      or not.  The only thing we have is page_maybe_dma_pinned().  However
      that's good enough for now.  Especially, with the newly added
      mm->has_pinned flag to make sure we won't affect processes that never
      pinned any pages.
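
      A condensed sketch of the decision this gives the copy path; the helper
      name need_early_cow() is illustrative, the real check sits inside
      copy_present_pte()/copy_present_page():

        /* Sketch: must fork() copy this page now instead of sharing it COW? */
        static bool need_early_cow(struct mm_struct *src_mm, struct page *page)
        {
                /* Fast path: this mm never pinned anything, plain COW is fine */
                if (likely(!atomic_read(&src_mm->has_pinned)))
                        return false;
                /* page_maybe_dma_pinned() may report false positives; that
                 * only costs an extra copy, never a missed one */
                return page_maybe_dma_pinned(page);
        }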
      
      It would be easier if we can do GFP_KERNEL allocation within
      copy_one_pte().  Unluckily, we can't because we're with the page table
      locks held for both the parent and child processes.  So the page
      allocation needs to be done outside copy_one_pte().
      
      There is some trickery in copy_present_pte(), mainly the wrprotect trick
      to block concurrent fast-gup.  The comments in the function explain it in
      place.
      
      Oleg Nesterov reported a (probably harmless) bug during review: we didn't
      reset entry.val properly in copy_pte_range(), so there is a chance of
      calling add_swap_count_continuation() multiple times on the same swp
      entry.  However, that should be harmless since even if it happens, the
      same function (add_swap_count_continuation()) will return directly after
      noticing that there is enough space for the swp counter.  So instead of a
      standalone stable patch, it is touched up in this patch directly.
      
      Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70e806e4
    • mm/fork: Pass new vma pointer into copy_page_range() · 7a4830c3
      By Peter Xu
      This prepares for the future work to trigger early cow on pinned pages
      during fork().
      
      No functional change intended.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a4830c3
  11. 24 September 2020, 3 commits
    • mm: fix misplaced unlock_page in do_wp_page() · be068f29
      By Linus Torvalds
      Commit 09854ba9 ("mm: do_wp_page() simplification") reorganized all
      the code around the page re-use vs copy, but in the process also moved
      the final unlock_page() around to after the wp_page_reuse() call.
      
      That normally doesn't matter - but it means that the unlock_page() is
      now done after releasing the page table lock.  Again, not a big deal,
      you'd think.
      
      But it turns out that it's very wrong indeed, because once we've
      released the page table lock, we've basically lost our only reference to
      the page - the page tables - and it could now be free'd at any time.  We
      do hold the mmap_sem, so no actual unmap() can happen, but madvise can
      come in and a MADV_DONTNEED will zap the page range - and free the page.
      
      So now the page may be free'd just as we're unlocking it, which in turn
      will usually trigger a "Bad page state" error in the freeing path.  To
      make matters more confusing, by the time the debug code prints out the
      page state, the unlock has typically completed and everything looks fine
      again.
      
      This all doesn't happen in any normal situations, but it does trigger
      with the dirtyc0w_child LTP test.  And it seems to trigger much more
      easily (but not exclusively) on s390 than elsewhere, probably because
      s390 doesn't do the "batch pages up for freeing after the TLB flush"
      that gives the unlock_page() more time to complete and makes the race
      harder to hit.
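
      A minimal sketch of the corrected ordering in the reuse path of
      do_wp_page() (context abridged):

        /* Unlock the page while the page table lock still pins it for us;
         * wp_page_reuse() drops the ptl, after which the page could be freed */
        unlock_page(page);
        wp_page_reuse(vmf);
        return VM_FAULT_WRITE;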
      
      Fixes: 09854ba9 ("mm: do_wp_page() simplification")
      Link: https://lore.kernel.org/lkml/a46e9bbef2ed4e17778f5615e818526ef848d791.camel@redhat.com/
      Link: https://lore.kernel.org/linux-mm/c41149a8-211e-390b-af1d-d5eee690fecb@linux.alibaba.com/
      Reported-by: Qian Cai <cai@redhat.com>
      Reported-by: Alex Shi <alex.shi@linux.alibaba.com>
      Bisected-and-analyzed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be068f29
    • mm: move the copy_one_pte() pte_present check into the caller · 79a1971c
      By Linus Torvalds
      This completes the split of the non-present and present pte cases by
      moving the check for the source pte being present into the single
      caller, which also means that we clearly separate out the very different
      return value case for a non-present pte.
      
      The present pte case currently always succeeds.
      
      This is a pure code re-organization with no semantic change: the intent
      is to make it much easier to add a new return case to the present pte
      case for when we do early COW at page table copy time.
      
      This was split out from the previous commit simply to make it easy to
      visually see that there were no semantic changes from this code
      re-organization.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79a1971c
    • mm: split out the non-present case from copy_one_pte() · df3a57d1
      By Linus Torvalds
      This is a purely mechanical split of the copy_one_pte() function.  It's
      not immediately obvious when looking at the diff because of the
      indentation change, but the way to see what is going on in this commit
      is to use the "-w" flag to not show pure whitespace changes, and you see
      how the first part of copy_one_pte() is simply lifted out into a
      separate function.
      
      And since the non-present case is marked unlikely, don't make the new
      function be inlined.  Not that gcc really seems to care, since it looks
      like it will inline it anyway due to the whole "single callsite for
      static function" logic.  In fact, code generation with the function
      split is almost identical to before.  But not marking it inline is the
      right thing to do.
      
      This is pure prep-work and cleanup for subsequent changes.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df3a57d1
  12. 06 September 2020, 1 commit
    • mm: track page table modifications in __apply_to_page_range() · e80d3909
      By Joerg Roedel
      __apply_to_page_range() is also used to change and/or allocate
      page-table pages in the vmalloc area of the address space.  Make sure
      these changes get synchronized to other page-tables in the system by
      calling arch_sync_kernel_mappings() when necessary.
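
      A sketch of the shape of the fix, assuming the per-level helpers take a
      pgtbl_mod_mask pointer the same way the vmalloc path does:

        static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
                                         unsigned long size, pte_fn_t fn,
                                         void *data, bool create)
        {
                pgd_t *pgd;
                unsigned long start = addr, next;
                unsigned long end = addr + size;
                pgtbl_mod_mask mask = 0;        /* levels modified by the walk */
                int err = 0;

                if (WARN_ON(addr >= end))
                        return -EINVAL;

                pgd = pgd_offset(mm, addr);
                do {
                        next = pgd_addr_end(addr, end);
                        if (!create && pgd_none_or_clear_bad(pgd))
                                continue;
                        err = apply_to_p4d_range(mm, pgd, addr, next, fn, data,
                                                 create, &mask);
                        if (err)
                                break;
                } while (pgd++, addr = next, addr != end);

                /* New: sync kernel page-tables when upper levels changed */
                if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
                        arch_sync_kernel_mappings(start, start + size);

                return err;
        }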
      
      The impact appears limited to x86-32, where apply_to_page_range may miss
      updating the PMD.  That leads to explosions in drivers like
      
        BUG: unable to handle page fault for address: fe036000
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        *pde = 00000000
        Oops: 0002 [#1] SMP
        CPU: 3 PID: 1300 Comm: gem_concurrent_ Not tainted 5.9.0-rc1+ #16
        Hardware name:  /NUC6i3SYB, BIOS SYSKLi35.86A.0024.2015.1027.2142 10/27/2015
        EIP: __execlists_context_alloc+0x132/0x2d0 [i915]
        Code: 31 d2 89 f0 e8 2f 55 02 00 89 45 e8 3d 00 f0 ff ff 0f 87 11 01 00 00 8b 4d e8 03 4b 30 b8 5a 5a 5a 5a ba 01 00 00 00 8d 79 04 <c7> 01 5a 5a 5a 5a c7 81 fc 0f 00 00 5a 5a 5a 5a 83 e7 fc 29 f9 81
        EAX: 5a5a5a5a EBX: f60ca000 ECX: fe036000 EDX: 00000001
        ESI: f43b7340 EDI: fe036004 EBP: f6389cb8 ESP: f6389c9c
        DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010286
        CR0: 80050033 CR2: fe036000 CR3: 2d361000 CR4: 001506d0
        DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
        DR6: fffe0ff0 DR7: 00000400
        Call Trace:
          execlists_context_alloc+0x10/0x20 [i915]
          intel_context_alloc_state+0x3f/0x70 [i915]
          __intel_context_do_pin+0x117/0x170 [i915]
          i915_gem_do_execbuffer+0xcc7/0x2500 [i915]
          i915_gem_execbuffer2_ioctl+0xcd/0x1f0 [i915]
          drm_ioctl_kernel+0x8f/0xd0
          drm_ioctl+0x223/0x3d0
          __ia32_sys_ioctl+0x1ab/0x760
          __do_fast_syscall_32+0x3f/0x70
          do_fast_syscall_32+0x29/0x60
          do_SYSENTER_32+0x15/0x20
          entry_SYSENTER_32+0x9f/0xf2
        EIP: 0xb7f28559
        Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
        EAX: ffffffda EBX: 00000005 ECX: c0406469 EDX: bf95556c
        ESI: b7e68000 EDI: c0406469 EBP: 00000005 ESP: bf9554d8
        DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000296
        Modules linked in: i915 x86_pkg_temp_thermal intel_powerclamp crc32_pclmul crc32c_intel intel_cstate intel_uncore intel_gtt drm_kms_helper intel_pch_thermal video button autofs4 i2c_i801 i2c_smbus fan
        CR2: 00000000fe036000
      
      It looks like kasan, xen and i915 are vulnerable.
      
      Actual impact is "on thinkpad X60 in 5.9-rc1, screen starts blinking
      after 30-or-so minutes, and machine is unusable"
      
      [sfr@canb.auug.org.au: ARCH_PAGE_TABLE_SYNC_MASK needs vmalloc.h]
        Link: https://lkml.kernel.org/r/20200825172508.16800a4f@canb.auug.org.au
      [chris@chris-wilson.co.uk: changelog addition]
      [pavel@ucw.cz: changelog addition]
      
      Fixes: 2ba3e694 ("mm/vmalloc: track which page-table levels were modified")
      Fixes: 86cf69f1 ("x86/mm/32: implement arch_sync_kernel_mappings()")
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Chris Wilson <chris@chris-wilson.co.uk>	[x86-32]
      Tested-by: Pavel Machek <pavel@ucw.cz>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@vger.kernel.org>	[5.8+]
      Link: https://lkml.kernel.org/r/20200821123746.16904-1-joro@8bytes.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e80d3909
  13. 05 September 2020, 2 commits
  14. 19 August 2020, 1 commit
  15. 15 August 2020, 2 commits
    • mm/swapfile: fix and annotate various data races · a449bf58
      By Qian Cai
      The swap_info_struct fields si.highest_bit, si.swap_map[offset] and
      si.flags can each be accessed concurrently, as noticed by KCSAN:
      
      === si.highest_bit ===
      
       write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
        swap_range_alloc+0x81/0x130
        swap_range_alloc at mm/swapfile.c:681
        scan_swap_map_slots+0x371/0xb90
        get_swap_pages+0x39d/0x5c0
        get_swap_page+0xf2/0x524
        add_to_swap+0xe4/0x1c0
        shrink_page_list+0x1795/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
       read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
        scan_swap_map_slots+0x4a6/0xb90
        scan_swap_map_slots at mm/swapfile.c:892
        get_swap_pages+0x39d/0x5c0
        get_swap_page+0xf2/0x524
        add_to_swap+0xe4/0x1c0
        shrink_page_list+0x1795/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 70 PID: 6672 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #3
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
      
      === si.swap_map[offset] ===
      
       write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
        __swap_entry_free_locked+0x8c/0x100
        __swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
        __swap_entry_free.constprop.20+0x69/0xb0
        free_swap_and_cache+0x53/0xa0
        unmap_page_range+0x7f8/0x1d70
        unmap_single_vma+0xcd/0x170
        unmap_vmas+0x18b/0x220
        exit_mmap+0xee/0x220
        mmput+0x10e/0x270
        do_exit+0x59b/0xf40
        do_group_exit+0x8b/0x180
      
       read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
        _swap_info_get+0x81/0xa0
        _swap_info_get at mm/swapfile.c:1140
        free_swap_and_cache+0x40/0xa0
        unmap_page_range+0x7f8/0x1d70
        unmap_single_vma+0xcd/0x170
        unmap_vmas+0x18b/0x220
        exit_mmap+0xee/0x220
        mmput+0x10e/0x270
        do_exit+0x59b/0xf40
        do_group_exit+0x8b/0x180
      
      === si.flags ===
      
       write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
        scan_swap_map_slots+0x6fe/0xb50
        scan_swap_map_slots at mm/swapfile.c:887
        get_swap_pages+0x39d/0x5c0
        get_swap_page+0x377/0x524
        add_to_swap+0xe4/0x1c0
        shrink_page_list+0x1795/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
       read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
        _swap_info_get+0x41/0xa0
        __swap_info_get at mm/swapfile.c:1114
        put_swap_page+0x84/0x490
        __remove_mapping+0x384/0x5f0
        shrink_page_list+0xff1/0x2870
        shrink_inactive_list+0x316/0x880
        shrink_lruvec+0x8dc/0x1380
        shrink_node+0x317/0xd80
        do_try_to_free_pages+0x1f7/0xa10
        try_to_free_pages+0x26c/0x5e0
        __alloc_pages_slowpath+0x458/0x1290
      
      The writes are under si->lock but the reads are not.  For si.highest_bit
      and si.swap_map[offset], a data race could trigger logic bugs, so fix
      them by using WRITE_ONCE() for the writes and READ_ONCE() for the reads,
      except for those isolated reads that only compare against zero, where a
      data race causes no harm.  Annotate those as intentional data races using
      the data_race() macro.
      
      For si.flags, the readers are only interested in a single bit, so a data
      race there causes no issue.
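
      The annotation pattern, in rough form (not the exact hunks of this
      patch):

        /* Writer side, under si->lock: plain stores become WRITE_ONCE() */
        WRITE_ONCE(si->highest_bit, end);

        /* Lockless reader side: pair it with READ_ONCE() */
        if (offset > READ_ONCE(si->highest_bit))
                goto no_page;

        /* Reads that only compare against zero are marked as intentional,
         * harmless data races */
        if (data_race(si->swap_map[offset]))
                goto checked;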
      
      [cai@lca.pw: add a missing annotation for si->flags in memory.c]
        Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a449bf58
    • dma-debug: remove debug_dma_assert_idle() function · 5848dc5b
      By Linus Torvalds
      This removes the code from the COW path to call debug_dma_assert_idle(),
      which was added many years ago.
      
      Google shows that it hasn't caught anything in the 6+ years we've had it
      apart from a false positive, and Hugh just noticed how it had a very
      unfortunate spinlock serialization in the COW path.
      
      He fixed that issue in the previous commit (a85ffd59: "dma-debug: fix
      debug_dma_assert_idle(), use rcu_read_lock()"), but let's see if anybody
      even notices when we remove this function entirely.
      
      NOTE! We keep the dma tracking infrastructure that was added by the
      commit that introduced it.  Partly to make it easier to resurrect this
      debug code if we ever decide to, and partly because that tracking by pfn
      and offset looks quite reasonable.
      
      The problem with this debug code was simply that it was expensive and
      didn't seem worth it, not that it was wrong per se.
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5848dc5b
  16. 13 August 2020, 6 commits
    • mm/gup: remove task_struct pointer for all gup code · 64019a2e
      By Peter Xu
      After the cleanup of page fault accounting, gup does not need to pass
      task_struct around any more.  Remove that parameter in the whole gup
      stack.
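
      One representative prototype before and after (other gup entry points
      change the same way):

        /* Before: callers passed a task_struct that gup no longer uses */
        long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
                                   unsigned long start, unsigned long nr_pages,
                                   unsigned int gup_flags, struct page **pages,
                                   struct vm_area_struct **vmas, int *locked);

        /* After: only the mm is passed down the gup stack */
        long get_user_pages_remote(struct mm_struct *mm,
                                   unsigned long start, unsigned long nr_pages,
                                   unsigned int gup_flags, struct page **pages,
                                   struct vm_area_struct **vmas, int *locked);
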
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64019a2e
    • mm: clean up the last pieces of page fault accountings · a2beb5f1
      By Peter Xu
      Here're the last pieces of page fault accounting that were still done
      outside handle_mm_fault() where we still have regs==NULL when calling
      handle_mm_fault():
      
      arch/powerpc/mm/copro_fault.c:   copro_handle_mm_fault
      arch/sparc/mm/fault_32.c:        force_user_fault
      arch/um/kernel/trap.c:           handle_page_fault
      mm/gup.c:                        faultin_page
                                       fixup_user_fault
      mm/hmm.c:                        hmm_vma_fault
      mm/ksm.c:                        break_ksm
      
      Some of them have the issue of duplicated accounting for page fault
      retries.  Some of them didn't do the accounting at all.
      
      This patch cleans all these up by letting handle_mm_fault() do the
      per-task page fault accounting even if regs==NULL (though we still skip
      the perf event accounting there).  With that, we can safely remove all
      the outliers now.
      
      There's another functional change in that now we account the page faults
      to the caller of gup, rather than the task_struct that was passed into
      the gup code.  More information on this can be found at [1].
      
      After this patch, below things should never be touched again outside
      handle_mm_fault():
      
        - task_struct.[maj|min]_flt
        - PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]
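
      A rough sketch of the shape the centralized accounting takes, with an
      mm_account_fault()-style helper called from handle_mm_fault() (names and
      exact conditions simplified):

        static inline void mm_account_fault(struct pt_regs *regs,
                                            unsigned long address,
                                            unsigned int flags, vm_fault_t ret)
        {
                bool major;

                /* Only the attempt that completes the fault is accounted:
                 * skip retries and errors */
                if (ret & (VM_FAULT_RETRY | VM_FAULT_ERROR))
                        return;

                /* A fault that needed a retry counts as major once it completes */
                major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);

                /* Per-task counters are updated even for regs==NULL (gup) callers */
                if (major)
                        current->maj_flt++;
                else
                        current->min_flt++;

                /* Perf events need pt_regs, so they are skipped for gup callers */
                if (!regs)
                        return;

                if (major)
                        perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
                else
                        perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
        }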
      
      [1] https://lore.kernel.org/lkml/CAHk-=wj_V2Tps2QrMn20_W0OJF9xqNh52XSGA42s-ZJ8Y+GyKw@mail.gmail.com/
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200707225021.200906-25-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2beb5f1
    • mm: do page fault accounting in handle_mm_fault · bce617ed
      By Peter Xu
      Patch series "mm: Page fault accounting cleanups", v5.
      
      This is v5 of the pf accounting cleanup series.  It originates from
      Gerald Schaefer's report a week ago of incorrect page fault accounting
      for retried page faults after commit 4064b982 ("mm: allow
      VM_FAULT_RETRY for multiple times"):
      
        https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/
      
      What this series did:
      
        - Correct page fault accounting: we do accounting for a page fault
          (no matter whether it's from #PF handling, or gup, or anything else)
          only with the one that completed the fault.  For example, page fault
          retries should not be counted in page fault counters.  Same to the
          perf events.
      
        - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
          event is used in an adhoc way across different archs.
      
          Case (1): for many archs it's done at the entry of a page fault
          handler, so that it will also cover e.g.  erroneous faults.
      
          Case (2): for some other archs, it is only accounted when the page
          fault is resolved successfully.
      
          Case (3): there're still quite some archs that have not enabled
          this perf event.
      
          Since this series will touch nearly all the archs, we unify this
          perf event to always follow case (1), which is the one that makes most
          sense.  And since we moved the accounting into handle_mm_fault, the
          other two MAJ/MIN perf events are well taken care of naturally.
      
        - Unify definition of "major faults": the definition of "major
          fault" is slightly changed when used in accounting (not
          VM_FAULT_MAJOR).  More information in patch 1.
      
        - Always account the page fault onto the one that triggered the page
          fault.  This does not matter much for #PF handlings, but mostly for
          gup.  More information on this in patch 25.
      
      Patchset layout:
      
      Patch 1:     Introduced the accounting in handle_mm_fault(), not enabled.
      Patch 2-23:  Enable the new accounting for arch #PF handlers one by one.
      Patch 24:    Enable the new accounting for the rest outliers (gup, iommu, etc.)
      Patch 25:    Cleanup GUP task_struct pointer since it's not needed any more
      
      This patch (of 25):
      
      This is a preparation patch to move page fault accountings into the
      general code in handle_mm_fault().  This includes both the per task
      flt_maj/flt_min counters, and the major/minor page fault perf events.  To
      do this, the pt_regs pointer is passed into handle_mm_fault().
      
      PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
      handlers.
      
      So far, all the pt_regs pointers passed into handle_mm_fault() are NULL,
      which means this patch should have no intended functional change.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
      Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bce617ed
    • mm/swap: implement workingset detection for anonymous LRU · aae466b0
      By Joonsoo Kim
      This patch implements workingset detection for anonymous LRU.  All the
      infrastructure is implemented by the previous patches so this patch just
      activates the workingset detection by installing/retrieving the shadow
      entry and adding refault calculation.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aae466b0
    • mm/vmscan: protect the workingset on anonymous LRU · b518154e
      By Joonsoo Kim
      In the current implementation, a newly created or swapped-in anonymous
      page starts on the active list.  Growing the active list results in
      rebalancing the active/inactive lists, so old pages on the active list
      are demoted to the inactive list.  Hence, pages on the active list aren't
      protected at all.
      
      Following is an example of this situation.
      
      Assume there are 50 hot pages on the active list.  Numbers denote the
      number of pages on the active/inactive lists (active | inactive).
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(uo) | 50(h)
      
      3. workload: another 50 newly created (used-once) pages
      50(uo) | 50(uo), swap-out 50(h)
      
      This patch tries to fix this issue.  As with the file LRU, newly created
      or swapped-in anonymous pages will be inserted into the inactive list.
      They are promoted to the active list if enough references happen.  This
      simple modification changes the above example as follows.
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(h) | 50(uo)
      
      3. workload: another 50 newly created (used-once) pages
      50(h) | 50(uo), swap-out 50(uo)
      
      As you can see, hot pages on the active list would be protected.
      
      Note that this implementation has a drawback: a page cannot be promoted
      and will be swapped out if its re-access interval is greater than the
      size of the inactive list but less than the size of the total
      (active+inactive) list.  To solve this potential issue, a following patch
      will apply workingset detection similar to the one that's already applied
      to the file LRU.
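
      In code terms, the fault paths switch the LRU placement helper; a sketch
      using the helper name introduced by this series:

        /* e.g. in the anonymous fault / swap-in paths of mm/memory.c */
        page_add_new_anon_rmap(page, vma, vmf->address, false);
        /* Was lru_cache_add_active_or_unevictable(): new anonymous pages now
         * start on the inactive list and are promoted only when referenced
         * enough */
        lru_cache_add_inactive_or_unevictable(page, vma);
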
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b518154e
  17. 08 August 2020, 2 commits
  18. 25 July 2020, 1 commit
  19. 21 July 2020, 1 commit
  20. 17 July 2020, 1 commit
  21. 26 June 2020, 1 commit