1. 14 April 2021, 3 commits
  2. 15 March 2021, 1 commit
  3. 04 November 2020, 2 commits
    • mm: avoid data corruption on CoW fault into PFN-mapped VMA · cba490ed
      Authored by Kirill A. Shutemov
      stable inclusion
      from linux-4.19.149
      commit 2b294ac325c7ce3f36854b74d0d1d89dc1d1d8b8
      
      --------------------------------
      
      [ Upstream commit c3e5ea6e ]
      
      Jeff Moyer has reported that one of the xfstests triggers a warning when run
      on a DAX-enabled filesystem:
      
      	WARNING: CPU: 76 PID: 51024 at mm/memory.c:2317 wp_page_copy+0xc40/0xd50
      	...
      	wp_page_copy+0x98c/0xd50 (unreliable)
      	do_wp_page+0xd8/0xad0
      	__handle_mm_fault+0x748/0x1b90
      	handle_mm_fault+0x120/0x1f0
      	__do_page_fault+0x240/0xd70
      	do_page_fault+0x38/0xd0
      	handle_page_fault+0x10/0x30
      
      The warning happens on failed __copy_from_user_inatomic() which tries to
      copy data into a CoW page.
      
      This happens because of a race between MADV_DONTNEED and a CoW page fault:
      
      	CPU0					CPU1
       handle_mm_fault()
         do_wp_page()
           wp_page_copy()
             do_wp_page()
      					madvise(MADV_DONTNEED)
      					  zap_page_range()
      					    zap_pte_range()
      					      ptep_get_and_clear_full()
      					      <TLB flush>
      	 __copy_from_user_inatomic()
      	 sees empty PTE and fails
      	 WARN_ON_ONCE(1)
      	 clear_page()
      
      The solution is to re-try __copy_from_user_inatomic() under PTL after
      checking that the PTE matches the orig_pte.
      
      The second copy attempt can still fail, for example due to a non-readable
      PTE, but there's nothing reasonable we can do about it except clearing the
      CoW page.
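      In code, the retry looks roughly like the sketch below; this is a
      simplified rendering of the fixed cow_user_page(), not the literal
      upstream hunk, and the _sketch suffix only marks it as illustrative:

      static bool cow_user_page_sketch(struct page *dst, struct vm_fault *vmf)
      {
      	struct vm_area_struct *vma = vmf->vma;
      	unsigned long addr = vmf->address;
      	void *kaddr = kmap_atomic(dst);
      	void __user *uaddr = (void __user *)(addr & PAGE_MASK);
      	bool ret = true;

      	if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
      		/* First attempt raced with e.g. MADV_DONTNEED: recheck under PTL. */
      		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
      		if (!pte_same(*vmf->pte, vmf->orig_pte)) {
      			/* PTE changed under us: let the caller retry the fault. */
      			ret = false;
      		} else if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
      			/* Still failing (e.g. unreadable PTE): fall back to a zeroed page. */
      			clear_page(kaddr);
      		}
      		pte_unmap_unlock(vmf->pte, vmf->ptl);
      	}

      	flush_dcache_page(dst);
      	kunmap_atomic(kaddr);
      	return ret;
      }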
      Reported-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: <stable@vger.kernel.org>
      Cc: Justin He <Justin.He@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Link: http://lkml.kernel.org/r/20200218154151.13349-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: fix double page fault on arm64 if PTE_AF is cleared · daa1e1e9
      Authored by Jia He
      stable inclusion
      from linux-4.19.149
      commit 8579a0440381353e0a71dd6a4d4371be8457eac4
      
      --------------------------------
      
      [ Upstream commit 83d116c5 ]
      
      When we tested the pmdk unit test [1] vmmalloc_fork TEST3 on an arm64 guest,
      there was a double page fault in __copy_from_user_inatomic of cow_user_page.
      
      To reproduce the bug, run the following command after everything is deployed:
      make -C src/test/vmmalloc_fork/ TEST_TIME=60m check
      
      Below call trace is from arm64 do_page_fault for debugging purpose:
      [  110.016195] Call trace:
      [  110.016826]  do_page_fault+0x5a4/0x690
      [  110.017812]  do_mem_abort+0x50/0xb0
      [  110.018726]  el1_da+0x20/0xc4
      [  110.019492]  __arch_copy_from_user+0x180/0x280
      [  110.020646]  do_wp_page+0xb0/0x860
      [  110.021517]  __handle_mm_fault+0x994/0x1338
      [  110.022606]  handle_mm_fault+0xe8/0x180
      [  110.023584]  do_page_fault+0x240/0x690
      [  110.024535]  do_mem_abort+0x50/0xb0
      [  110.025423]  el0_da+0x20/0x24
      
      The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
      [ffff9b007000] pgd=000000023d4f8003, pud=000000023da9b003,
                     pmd=000000023d4b3003, pte=360000298607bd3
      
      As told by Catalin: "On arm64 without hardware Access Flag, copying from
      user will fail because the pte is old and cannot be marked young. So we
      always end up with zeroed page after fork() + CoW for pfn mappings. we
      don't always have a hardware-managed access flag on arm64."
      
      This patch fixes it by calling pte_mkyoung. Also, the parameter is
      changed because vmf should be passed to cow_user_page().
      
      Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error,
      in case there is some obscure use case (suggested by Kirill).
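      A minimal sketch of the idea is shown below; arch_faults_on_old_pte() is
      the helper this patch introduces, while the wrapper name and exact
      placement here are illustrative only:

      static void mark_pte_young_before_copy(struct vm_fault *vmf)
      {
      	struct vm_area_struct *vma = vmf->vma;

      	/* On CPUs without hardware AF management, an old PTE would fault
      	 * again during the kernel's copy, so mark it young under the PTL. */
      	if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
      		pte_t entry = pte_mkyoung(vmf->orig_pte);

      		if (ptep_set_access_flags(vma, vmf->address & PAGE_MASK,
      					  vmf->pte, entry, 0))
      			update_mmu_cache(vma, vmf->address, vmf->pte);
      	}
      }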
      
      [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
      Signed-off-by: Jia He <justin.he@arm.com>
      Reported-by: Yibo Cai <Yibo.Cai@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  4. 27 December 2019, 11 commits
    • mm, thp, proc: report THP eligibility for each vma · ef2fef06
      Authored by Michal Hocko
      [ Upstream commit 7635d9cb ]
      
      Userspace falls short when trying to find out whether a specific memory
      range is eligible for THP.  There are use cases that would like to know
      that, for example
      http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
      : This is used to identify heap mappings that should be able to fault thp
      : but do not, and they normally point to a low-on-memory or fragmentation
      : issue.
      
      The only way to deduce this now is to query for the hg resp. nh flags and
      compare that state with the global setting.  Except that there is also
      PR_SET_THP_DISABLE that might change the picture.  So the final logic is
      not trivial.  Moreover the eligibility of the vma depends on the type of
      VMA as well.  In the past we have supported only anonymous memory VMAs
      but things have changed and shmem based vmas are supported as well these
      days and the query logic gets even more complicated because the
      eligibility depends on the mount option and another global configuration
      knob.
      
      Simplify the current state and report the THP eligibility in
      /proc/<pid>/smaps for each existing vma.  Reuse
      transparent_hugepage_enabled for this purpose.  The original
      implementation of this function assumes that the caller knows that the vma
      itself is supported for THP so make the core checks into
      __transparent_hugepage_enabled and use it for existing callers.
      __show_smap just uses the new transparent_hugepage_enabled which also
      checks the vma support status (please note that this one has to be out of
      line due to include dependency issues).
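      The resulting split looks roughly like the sketch below (the policy checks
      inside __transparent_hugepage_enabled() are omitted; helper names follow
      the upstream patch):

      bool transparent_hugepage_enabled(struct vm_area_struct *vma)
      {
      	/* Anonymous VMAs follow the usual global/per-vma policy. */
      	if (vma_is_anonymous(vma))
      		return __transparent_hugepage_enabled(vma);

      	/* shmem-backed VMAs also depend on the mount option / sysfs knob. */
      	if (vma_is_shmem(vma) && shmem_huge_enabled(vma))
      		return __transparent_hugepage_enabled(vma);

      	return false;
      }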
      
      [mhocko@kernel.org: fix oops with NULL ->f_mapping]
        Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Oppenheimer <bepvte@gmail.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm, swap: fix race between swapoff and some swap operations · 3a270bde
      Authored by Huang Ying
      mainline inclusion
      from mainline-v5.3-rc1
      commit eb085574
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      ------------------------------------------------
      
      When swapin is performed, after getting the swap entry information from
      the page table, the system will swap in the swap entry without any lock
      held to prevent the swap device from being swapped off.  This may cause a
      race like the one below,
      
      CPU 1				CPU 2
      -----				-----
      				do_swap_page
      				  swapin_readahead
      				    __read_swap_cache_async
      swapoff				      swapcache_prepare
        p->swap_map = NULL		        __swap_duplicate
      					  p->swap_map[?] /* !!! NULL pointer access */
      
      Because swapoff is usually done only at system shutdown, the race may
      not hit many people in practice.  But it is still a race that needs to be fixed.
      
      To fix the race, get_swap_device() is added to check whether the specified
      swap entry is valid in its swap device.  If so, it will keep the swap
      entry valid by preventing the swap device from being swapped off, until
      put_swap_device() is called.
      
      Because swapoff() is a very rare code path, to make the normal path run as
      fast as possible, rcu_read_lock/unlock() and synchronize_rcu() are used
      instead of a reference count to implement get/put_swap_device().  From
      get_swap_device() to put_swap_device(), the RCU reader side is locked, so
      synchronize_rcu() in swapoff() will wait until put_swap_device() is
      called.
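      A rough sketch of the get/put pairing is shown below; SWP_VALID is the
      flag added by this patch, and the bounds and locking details of the real
      implementation are simplified:

      struct swap_info_struct *get_swap_device(swp_entry_t entry)
      {
      	struct swap_info_struct *si;

      	if (!entry.val)
      		return NULL;
      	si = swp_swap_info(entry);	/* map the entry's type to its swap device */
      	if (!si)
      		return NULL;

      	rcu_read_lock();
      	/* SWP_VALID is cleared by swapoff() before it calls synchronize_rcu(). */
      	if (!(si->flags & SWP_VALID) || swp_offset(entry) >= si->max) {
      		rcu_read_unlock();
      		return NULL;
      	}
      	return si;	/* caller must pair this with put_swap_device(si) */
      }

      static inline void put_swap_device(struct swap_info_struct *si)
      {
      	rcu_read_unlock();
      }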
      
      In addition to the swap_map, cluster_info, etc. data structures in struct
      swap_info_struct, the swap cache radix tree will be freed after swapoff,
      so this patch fixes the race between swap cache lookup and swapoff
      too.
      
      Races between some other swap cache usages and swapoff are fixed too via
      calling synchronize_rcu() between clearing PageSwapCache() and freeing
      swap cache data structure.
      
      Another possible method to fix this is to use preempt_off() +
      stop_machine() to prevent the swap device from being swapped off while its
      data structure is being accessed.  The overhead in the hot path of both
      methods is similar.  The advantages of the RCU based method are:
      
      1. stop_machine() may disturb the normal execution code path on other
         CPUs.
      
      2. File cache uses RCU to protect its radix tree.  If the similar
         mechanism is used for swap cache too, it is easier to share code
         between them.
      
      3. RCU is used to protect swap cache in total_swapcache_pages() and
         exit_swap_address_space() already.  The two mechanisms can be
         merged to simplify the logic.
      
      Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
      Fixes: 235b6217 ("mm/swap: add cluster lock")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Not-nacked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • ktask: change the chunk size for some ktask thread functions · 135e9232
      Authored by Hongbo Yao
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      
      ---------------------------
      
      This patch fixes some issues in the original series:
      1) PMD_SIZE chunks made thread finishing times too spread out in some
         cases, so KTASK_MEM_CHUNK (128M) seems to be a reasonable compromise.
      2) If hugepagesz=1G, then pages_per_huge_page = 1G / 4K = 256K; using
         KTASK_MEM_CHUNK would leave only one ktask thread, which does not
         improve the performance of clearing a gigantic page.
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: parallelize clear_gigantic_page · ae0cd4d4
      Authored by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Parallelize clear_gigantic_page, which zeroes any page size larger than
      8M (e.g. 1G on x86).
      
      Performance results (the default number of threads is 4; higher thread
      counts shown for context only):
      
      Machine:  Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
      Test:     Clear a range of gigantic pages (triggered via fallocate)
      
      nthread   speedup   size (GiB)   min time (s)   stdev
            1                    100          41.13    0.03
            2     2.03x          100          20.26    0.14
            4     4.28x          100           9.62    0.09
            8     8.39x          100           4.90    0.05
           16    10.44x          100           3.94    0.03
      
            1                    200          89.68    0.35
            2     2.21x          200          40.64    0.18
            4     4.64x          200          19.33    0.32
            8     8.99x          200           9.98    0.04
           16    11.27x          200           7.96    0.04
      
            1                    400         188.20    1.57
            2     2.30x          400          81.84    0.09
            4     4.63x          400          40.62    0.26
            8     8.92x          400          21.09    0.50
           16    11.78x          400          15.97    0.25
      
            1                    800         434.91    1.81
            2     2.54x          800         170.97    1.46
            4     4.98x          800          87.38    1.91
            8    10.15x          800          42.86    2.59
           16    12.99x          800          33.48    0.83
      
      The speedups are mostly due to the fact that more threads can use more
      memory bandwidth.  The loop we're stressing on the x86 chip in this test
      is clear_page_erms, which tops out at a bandwidth of 2550 MiB/s with one
      thread.  We get the same bandwidth per thread for 2, 4, or 8 threads,
      but at 16 threads the per-thread bandwidth drops to 1420 MiB/s.
      
      However, the performance also improves over a single thread because of
      the ktask threads' NUMA awareness (ktask migrates worker threads to the
      node local to the work being done).  This becomes a bigger factor as the
      amount of pages to zero grows to include memory from multiple nodes, so
      that speedups increase as the size increases.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/memory: Move mmu_gather and TLB invalidation code into its own file · 1f7b2415
      Authored by Peter Zijlstra
      mainline inclusion
      from mainline-4.20-rc1
      commit: 196d9d8bb71deaa2d1c7170c88a2f1a318363047
      category: feature
      feature: Reduce synchronous TLB invalidation on ARM64
      bugzilla: NA
      CVE: NA
      
      --------------------------------------------------
      
      In preparation for maintaining the mmu_gather code as its own entity,
      move the implementation out of memory.c and into its own file.
      
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • asm-generic/tlb: Track which levels of the page tables have been cleared · aa951a96
      Authored by Will Deacon
      mainline inclusion
      from mainline-4.20-rc1
      commit: a6d60245
      category: feature
      feature: Reduce synchronous TLB invalidation on ARM64
      bugzilla: NA
      CVE: NA
      
      --------------------------------------------------
      
      It is common for architectures with hugepage support to require only a
      single TLB invalidation operation per hugepage during unmap(), rather than
      iterating through the mapping at a PAGE_SIZE increment. Currently,
      however, the level in the page table where the unmap() operation occurs
      is not stored in the mmu_gather structure, therefore forcing
      architectures to issue additional TLB invalidation operations or to give
      up and over-invalidate by e.g. invalidating the entire TLB.
      
      Ideally, we could add an interval rbtree to the mmu_gather structure,
      which would allow us to associate the correct mapping granule with the
      various sub-mappings within the range being invalidated. However, this
      is costly in terms of book-keeping and memory management, so instead we
      approximate by keeping track of the page table levels that are cleared
      and provide a means to query the smallest granule required for invalidation.
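      The tracking amounts to a few one-bit fields in struct mmu_gather plus a
      query helper, roughly as below (a sketch following the upstream naming;
      the existing fields of the structure are elided):

      struct mmu_gather {
      	/* ... existing fields ... */
      	unsigned int		cleared_ptes : 1;
      	unsigned int		cleared_pmds : 1;
      	unsigned int		cleared_puds : 1;
      	unsigned int		cleared_p4ds : 1;
      };

      static inline unsigned long tlb_get_unmap_shift(struct mmu_gather *tlb)
      {
      	/* Report the smallest level that was cleared during this gather. */
      	if (tlb->cleared_ptes)
      		return PAGE_SHIFT;
      	if (tlb->cleared_pmds)
      		return PMD_SHIFT;
      	if (tlb->cleared_puds)
      		return PUD_SHIFT;
      	if (tlb->cleared_p4ds)
      		return P4D_SHIFT;

      	return PAGE_SHIFT;
      }

      static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
      {
      	return 1UL << tlb_get_unmap_shift(tlb);
      }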
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: use down_read_killable for locking mmap_sem in access_remote_vm · db2b2d20
      Authored by Konstantin Khlebnikov
      [ Upstream commit 1e426fe2 ]
      
      This function is used by ptrace and proc files like /proc/pid/cmdline and
      /proc/pid/environ.
      
      access_remote_vm() never returns error codes; all errors are ignored and
      only the size of successfully read data is returned.  So, if the current
      task was killed, we'll simply return 0 (bytes read).
      
      Mmap_sem could be locked for a long time or forever if something goes
      wrong.  Using a killable lock permits cleanup of stuck tasks and
      simplifies investigation.
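      The shape of the change is roughly the fragment below (a sketch of the
      relevant hunk in __access_remote_vm(), not a complete function):

      	if (down_read_killable(&mm->mmap_sem))
      		return 0;	/* fatal signal pending: report "0 bytes read" */

      	/* ... get_user_pages_remote() / copy loop unchanged ... */

      	up_read(&mm->mmap_sem);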
      
      Link: http://lkml.kernel.org/r/156007494202.3335.16782303099589302087.stgit@buzz
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/memory.c: fix modifying of page protection by insert_pfn() · c044377e
      Authored by Jan Kara
      [ Upstream commit cae85cb8 ]
      
      Aneesh has reported that PPC triggers the following warning when
      exercising DAX code:
      
        IP set_pte_at+0x3c/0x190
        LR insert_pfn+0x208/0x280
        Call Trace:
           insert_pfn+0x68/0x280
           dax_iomap_pte_fault.isra.7+0x734/0xa40
           __xfs_filemap_fault+0x280/0x2d0
           do_wp_page+0x48c/0xa40
           __handle_mm_fault+0x8d0/0x1fd0
           handle_mm_fault+0x140/0x250
           __do_page_fault+0x300/0xd60
           handle_page_fault+0x18
      
      Now that is the WARN_ON in set_pte_at(), which is
      
              VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
      
      The problem is that on some architectures set_pte_at() cannot cope with
      a situation where there is already some (different) valid entry present.
      
      Use ptep_set_access_flags() instead to modify the pfn; it is built to
      deal with modifying an existing PTE.
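      The mkwrite path of insert_pfn() then looks roughly like the fragment
      below (a sketch, with labels and surrounding code omitted):

      	if (!pte_none(*pte)) {
      		if (mkwrite) {
      			/* Only touch the PTE if it still maps the expected pfn. */
      			if (pte_pfn(*pte) != pfn_t_to_pfn(pfn))
      				goto out_unlock;
      			entry = pte_mkyoung(*pte);
      			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
      			/* Upgrade the existing entry instead of calling set_pte_at(). */
      			if (ptep_set_access_flags(vma, addr, pte, entry, 1))
      				update_mmu_cache(vma, addr, pte);
      		}
      		goto out_unlock;
      	}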
      
      Link: http://lkml.kernel.org/r/20190311084537.16029-1-jack@suse.cz
      Fixes: b2770da6 "mm: add vm_insert_mixed_mkwrite()"
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reported-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Chandan Rajendra <chandan@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: Fix warning in insert_pfn() · 80ded77d
      Authored by Jan Kara
      commit f2c57d91 upstream.
      
      In DAX mode a write pagefault can race with write(2) in the following
      way:
      
      CPU0                            CPU1
                                      write fault for mapped zero page (hole)
      dax_iomap_rw()
        iomap_apply()
          xfs_file_iomap_begin()
            - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - invalidates radix tree entries in given range
                                      dax_iomap_pte_fault()
                                        grab_mapping_entry()
                                          - no entry found, creates empty
                                        ...
                                        xfs_file_iomap_begin()
                                          - finds already allocated block
                                        ...
                                        vmf_insert_mixed_mkwrite()
                                          - WARNs and does nothing because there
                                            is still zero page mapped in PTE
              unmap_mapping_pages()
      
      This race results in WARN_ON from insert_pfn() and is occasionally
      triggered by fstest generic/344. Note that the race is otherwise
      harmless as before write(2) on CPU0 is finished, we will invalidate page
      tables properly and thus user of mmap will see modified data from
      write(2) from that point on. So just restrict the warning only to the
      case when the PFN in PTE is not zero page.
      
      Link: http://lkml.kernel.org/r/20180824154542.26872-1-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/memory.c: do_fault: avoid usage of stale vm_area_struct · affc42a6
      Authored by Jan Stancek
      mainline inclusion
      from mainline-5.0
      commit fc8efd2d
      category: bugfix
      bugzilla: 11617
      CVE: NA
      
      ------------------------------------------------
      
      LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
      This is a stress test, where one thread mmaps/writes/munmaps a memory area
      and another thread tries to read from it:
      
        CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
        Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
        Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
        Call Trace:
        ([<0000000000000000>]           (null))
         [<00000000001adae4>] lock_acquire+0xec/0x258
         [<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98
         [<000000000012a780>] page_table_free+0x48/0x1a8
         [<00000000002f6e54>] do_fault+0xdc/0x670
         [<00000000002fadae>] __handle_mm_fault+0x416/0x5f0
         [<00000000002fb138>] handle_mm_fault+0x1b0/0x320
         [<00000000001248cc>] do_dat_exception+0x19c/0x2c8
         [<000000000080e5ee>] pgm_check_handler+0x19e/0x200
      
      page_table_free() is called with NULL mm parameter, but because
      "0" is a valid address on s390 (see S390_lowcore), it keeps
      going until it eventually crashes in lockdep's lock_acquire.
      This crash is reproducible at least since 4.14.
      
      The problem is that "vmf->vma" used in do_fault() can become stale.
      Because mmap_sem may be released, other threads can come in,
      call munmap() and cause "vma" to be returned to the kmem cache, and
      get zeroed/re-initialized and re-used:
      
      handle_mm_fault                           |
        __handle_mm_fault                       |
          do_fault                              |
            vma = vmf->vma                      |
            do_read_fault                       |
              __do_fault                        |
                vma->vm_ops->fault(vmf);        |
                  mmap_sem is released          |
                                                |
                                                | do_munmap()
                                                |   remove_vma_list()
                                                |     remove_vma()
                                                |       vm_area_free()
                                                |         # vma is released
                                                | ...
                                                | # same vma is allocated
                                                | # from kmem cache
                                                | do_mmap()
                                                |   vm_area_alloc()
                                                |     memset(vma, 0, ...)
                                                |
            pte_free(vma->vm_mm, ...);          |
              page_table_free                   |
                spin_lock_bh(&mm->context.lock);|
                  <crash>                       |
      
      Cache mm_struct to avoid using potentially stale "vma".
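      Roughly, the fixed do_fault() reads vma->vm_mm once, up front (a
      simplified sketch, with the fault dispatch elided):

      static vm_fault_t do_fault_sketch(struct vm_fault *vmf)
      {
      	struct vm_area_struct *vma = vmf->vma;
      	/* Cache the mm before ->fault() may drop mmap_sem and "vma" goes stale. */
      	struct mm_struct *vm_mm = vma->vm_mm;
      	vm_fault_t ret;

      	ret = do_read_fault(vmf);	/* or do_cow_fault()/do_shared_fault() */

      	/* If the preallocated page table was not used, free it against the
      	 * cached mm rather than dereferencing the possibly-reused vma. */
      	if (vmf->prealloc_pte) {
      		pte_free(vm_mm, vmf->prealloc_pte);
      		vmf->prealloc_pte = NULL;
      	}
      	return ret;
      }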
      
      [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
      
      Link: http://lkml.kernel.org/r/5b3fdf19e2a5be460a384b936f5b56e13733f1b8.1551595137.git.jstancek@redhat.com
      Signed-off-by: Jan Stancek <jstancek@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm, memcg: fix reclaim deadlock with writeback · a9cdf229
      Authored by Michal Hocko
      commit 63f3655f upstream.
      
      Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
      ext4 writeback
      
        task1:
          wait_on_page_bit+0x82/0xa0
          shrink_page_list+0x907/0x960
          shrink_inactive_list+0x2c7/0x680
          shrink_node_memcg+0x404/0x830
          shrink_node+0xd8/0x300
          do_try_to_free_pages+0x10d/0x330
          try_to_free_mem_cgroup_pages+0xd5/0x1b0
          try_charge+0x14d/0x720
          memcg_kmem_charge_memcg+0x3c/0xa0
          memcg_kmem_charge+0x7e/0xd0
          __alloc_pages_nodemask+0x178/0x260
          alloc_pages_current+0x95/0x140
          pte_alloc_one+0x17/0x40
          __pte_alloc+0x1e/0x110
          alloc_set_pte+0x5fe/0xc20
          do_fault+0x103/0x970
          handle_mm_fault+0x61e/0xd10
          __do_page_fault+0x252/0x4d0
          do_page_fault+0x30/0x80
          page_fault+0x28/0x30
      
        task2:
          __lock_page+0x86/0xa0
          mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
          ext4_writepages+0x479/0xd60
          do_writepages+0x1e/0x30
          __writeback_single_inode+0x45/0x320
          writeback_sb_inodes+0x272/0x600
          __writeback_inodes_wb+0x92/0xc0
          wb_writeback+0x268/0x300
          wb_workfn+0xb4/0x390
          process_one_work+0x189/0x420
          worker_thread+0x4e/0x4b0
          kthread+0xe6/0x100
          ret_from_fork+0x41/0x50
      
      He adds
       "task1 is waiting for the PageWriteback bit of the page that task2 has
        collected in mpd->io_submit->io_bio, and tasks2 is waiting for the
        LOCKED bit the page which tasks1 has locked"
      
      More precisely task1 is handling a page fault and it has a page locked
      while it charges a new page table to a memcg.  That in turn hits a
      memory limit reclaim and the memcg reclaim for legacy controller is
      waiting on the writeback but that is never going to finish because the
      writeback itself is waiting for the page locked in the #PF path.  So
      this is essentially ABBA deadlock:
      
                                              lock_page(A)
                                              SetPageWriteback(A)
                                              unlock_page(A)
        lock_page(B)
                                              lock_page(B)
        pte_alloc_one
          shrink_page_list
            wait_on_page_writeback(A)
                                              SetPageWriteback(B)
                                              unlock_page(B)
      
                                              # flush A, B to clear the writeback
      
      This accumulating of more pages to flush is used by several filesystems
      to generate more optimal IO patterns.
      
      Waiting for the writeback in legacy memcg controller is a workaround for
      pre-mature OOM killer invocations because there is no dirty IO
      throttling available for the controller.  There is no easy way around
      that unfortunately.  Therefore fix this specific issue by pre-allocating
      the page table outside of the page lock.  We already have handy
      infrastructure for that, so simply reuse the fault-around pattern,
      which already does this.
      
      There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
      from under a locked fs page, but they should be really rare.  I am not
      aware of a better solution unfortunately.
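      The preallocation amounts to roughly the sketch below in __do_fault()
      (simplified; error handling around the fault return value is omitted):

      static vm_fault_t __do_fault_sketch(struct vm_fault *vmf)
      {
      	struct vm_area_struct *vma = vmf->vma;

      	/* Allocate the page table before ->fault() can lock a page, so the
      	 * memcg-charged allocation can never recurse into reclaim while we
      	 * hold a page lock that the flusher is waiting for. */
      	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
      		vmf->prealloc_pte = pte_alloc_one(vma->vm_mm, vmf->address);
      		if (!vmf->prealloc_pte)
      			return VM_FAULT_OOM;
      		smp_wmb();	/* see the comment in __pte_alloc() */
      	}

      	return vma->vm_ops->fault(vmf);
      }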
      
      [akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
      [akpm@linux-foundation.org: coding-style fixes]
      [mhocko@kernel.org: enhance comment, per Johannes]
        Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
      Fixes: c3b94f44 ("memcg: further prevent OOM with too many dirty pages")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Liu Bo <bo.liu@linux.alibaba.com>
      Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  5. 01 December 2018, 1 commit
  6. 26 August 2018, 1 commit
    • mm/cow: don't bother write protecting already write-protected pages · 1b2de5d0
      Authored by Linus Torvalds
      This is not normally noticeable, but repeated forks are unnecessarily
      expensive because they repeatedly dirty the parent page tables during
      the page table copy operation.
      
      It's trivial to just avoid write protecting the page table entry if it
      was already not writable.
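      The change amounts to adding a pte_write() test to the copy-on-write
      write-protect condition in copy_one_pte(), roughly:

      	/* Only dirty the parent's page table when the source PTE is actually
      	 * writable; an already write-protected PTE needs nothing. */
      	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
      		ptep_set_wrprotect(src_mm, addr, src_pte);
      		pte = pte_wrprotect(pte);
      	}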
      
      This patch was inspired by
      
          https://bugzilla.kernel.org/show_bug.cgi?id=200447
      
      which points to an ancient "waste time re-doing fork" issue in the
      presence of lots of signals.
      
      That bug was fixed by Eric Biederman's signal handling series
      culminating in commit c3ad2c3b ("signal: Don't restart fork when
      signals come in"), but the unnecessary work for repeated forks is still
      worth just fixing, particularly since the fix is trivial.
      
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 24 August 2018, 5 commits
  8. 23 August 2018, 1 commit
  9. 18 August 2018, 6 commits
    • Revert "mm: always flush VMA ranges affected by zap_page_range" · 50c150f2
      Authored by Rik van Riel
      There was a bug in Linux that could cause madvise (and mprotect?) system
      calls to return to userspace without the TLB having been flushed for all
      the pages involved.
      
      This could happen when multiple threads of a process made simultaneous
      madvise and/or mprotect calls.
      
      This was noticed in the summer of 2017, at which time two solutions
      were created:
      
        56236a59 ("mm: refactor TLB gathering API")
        99baac21 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
      and
        4647706e ("mm: always flush VMA ranges affected by zap_page_range")
      
      We need only one of these solutions, and the former appears to be a
      little more efficient than the latter, so revert that one.
      
      This reverts 4647706e ("mm: always flush VMA ranges affected by
      zap_page_range")
      
      Link: http://lkml.kernel.org/r/20180706131019.51e3a5f0@imladris.surriel.com
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, oom: move out_of_memory back to the charge path · 29ef680a
      Authored by Michal Hocko
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") has changed the ENOMEM semantic of memcg charges.
      Rather than invoking the oom killer from the charging context it delays
      the oom killer to the page fault path (pagefault_out_of_memory).  This
      in turn means that many users (e.g.  slab or g-u-p) will get ENOMEM when
      the corresponding memcg hits the hard limit and the memcg is is OOM.
      This is behavior is inconsistent with !memcg case where the oom killer
      is invoked from the allocation context and the allocator keeps retrying
      until it succeeds.
      
      The difference in the behavior is user visible.  mmap(MAP_POPULATE)
      might result in not fully populated ranges while the mmap return code
      doesn't tell that to the userspace.  Random syscalls might fail with
      ENOMEM etc.
      
      The primary motivation of the different memcg oom semantic was the
      deadlock avoidance.  Things have changed since then, though.  We have an
      async oom teardown by the oom reaper now and so we do not have to rely
      on the victim to tear down its memory anymore.  Therefore we can return
      to the original semantic as long as the memcg oom killer is not handed
      over to user space.
      
      There is still one thing to be careful about here though.  If the oom
      killer is not able to make any forward progress - e.g.  because there is
      no eligible task to kill - then we have to bail out of the charge path
      to prevent the same class of deadlocks.  We have basically two options
      here.  Either we fail the charge with ENOMEM or force the charge and
      allow overcharge.  The first option has been considered more harmful
      than useful because rare inconsistencies in the ENOMEM behavior are hard
      to test for and error prone.  Basically the same reason why the page
      allocator doesn't fail allocations under such conditions.  The latter
      might allow runaways but those should be really unlikely unless somebody
      misconfigures the system.  E.g.  allowing to migrate tasks away from the
      memcg to a different unlimited memcg with move_charge_at_immigrate
      disabled.
      
      Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, huge page: copy target sub-page last when copy huge page · c9f4cd71
      Authored by Huang Ying
      Huge pages help to reduce the TLB miss rate, but they have a higher cache
      footprint, and sometimes this may cause an issue.  For example, when
      copying a huge page on the x86_64 platform, the cache footprint is 4M.  But on
      a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC
      (last level cache).  That is, on average, there is 2.5M of LLC for each
      core and 1.25M of LLC for each thread.
      
      If the cache contention is heavy when copying the huge page, and we copy
      the huge page from the beginning to the end, it is possible that the beginning
      of the huge page is evicted from the cache after we finish copying the
      end of the huge page.  And it is possible for the application to access
      the beginning of the huge page right after the huge page has been copied.
      
      In c79b57e4 ("mm: hugetlb: clear target sub-page last when clearing
      huge page"), to keep the cache lines of the target subpage hot, the
      order to clear the subpages in the huge page in clear_huge_page() is
      changed to clearing the subpage which is furthest from the target
      subpage first, and the target subpage last.  A similar ordering change
      helps huge page copying too.  That is what this patch implements.
      Because we have already put the order algorithm into a separate
      function, the implementation is quite simple.
      
      The patch is a generic optimization which should benefit quite some
      workloads, not for a specific use case.  To demonstrate the performance
      benefit of the patch, we tested it with vm-scalability run on
      transparent huge page.
      
      With this patch, the throughput increases ~16.6% in vm-scalability
      anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699
      system (36 cores, 72 threads).  The test case sets
      /sys/kernel/mm/transparent_hugepage/enabled to always, mmap()s a big
      anonymous memory area and populates it, then forks 36 child processes,
      each of which writes to the anonymous memory area from the beginning to the
      end, causing copy-on-write.  For each child process, the other child processes
      could be seen as other workloads which generate heavy cache pressure.
      At the same time, the IPC (instruction per cycle) increased from 0.63 to
      0.78, and the time spent in user space is reduced ~7.2%.
      
      Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, clear_huge_page: move order algorithm into a separate function · c6ddfb6c
      Authored by Huang Ying
      Patch series "mm, huge page: Copy target sub-page last when copy huge
      page", v2.
      
      Huge pages help to reduce the TLB miss rate, but they have a higher cache
      footprint, and sometimes this may cause an issue.  For example, when
      copying a huge page on the x86_64 platform, the cache footprint is 4M.  But on
      a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC
      (last level cache).  That is, on average, there is 2.5M of LLC for each
      core and 1.25M of LLC for each thread.
      
      If the cache contention is heavy when copying the huge page, and we copy
      the huge page from the beginning to the end, it is possible that the beginning
      of the huge page is evicted from the cache after we finish copying the
      end of the huge page.  And it is possible for the application to access
      the beginning of the huge page right after the huge page has been copied.
      
      In c79b57e4 ("mm: hugetlb: clear target sub-page last when clearing
      huge page"), to keep the cache lines of the target subpage hot, the
      order to clear the subpages in the huge page in clear_huge_page() is
      changed to clearing the subpage which is furthest from the target
      subpage first, and the target subpage last.  A similar ordering change
      helps huge page copying too.  That is implemented in this
      patchset.
      
      The patchset is a generic optimization which should benefit quite some
      workloads, not for a specific use case.  To demonstrate the performance
      benefit of the patchset, we have tested it with vm-scalability run on
      transparent huge page.
      
      With this patchset, the throughput increases ~16.6% in vm-scalability
      anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699
      system (36 cores, 72 threads).  The test case sets
      /sys/kernel/mm/transparent_hugepage/enabled to always, mmap()s a big
      anonymous memory area and populates it, then forks 36 child processes,
      each of which writes to the anonymous memory area from the beginning to the
      end, causing copy-on-write.  For each child process, the other child processes
      could be seen as other workloads which generate heavy cache pressure.
      At the same time, the IPC (instruction per cycle) increased from 0.63 to
      0.78, and the time spent in user space is reduced ~7.2%.
      
      This patch (of 4):
      
      In c79b57e4 ("mm: hugetlb: clear target sub-page last when clearing
      huge page"), to keep the cache lines of the target subpage hot, the
      order to clear the subpages in the huge page in clear_huge_page() is
      changed to clearing the subpage which is furthest from the target
      subpage first, and the target subpage last.  This optimization can
      be applied to copying huge pages too, with the same order algorithm.  To
      avoid code duplication and reduce maintenance overhead, this patch moves
      the order algorithm out of clear_huge_page() into a separate
      function, process_huge_page(), so that we can use it for copying huge
      pages too.
      
      This changes the direct calls to clear_user_highpage() into indirect
      calls.  But with proper inlining support in the compilers, the indirect
      call will be optimized back into a direct call.  Our tests
      show no performance change with the patch.
      
      This patch is a code cleanup without functionality change.
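      A condensed sketch of the extracted helper is shown below.  The real
      process_huge_page() additionally walks inward from both ends toward the
      target sub-page; here the ordering is simplified to "all other sub-pages
      first, target sub-page last":

      static void process_huge_page(unsigned long addr_hint,
      			      unsigned int pages_per_huge_page,
      			      void (*process_subpage)(unsigned long addr, int idx, void *arg),
      			      void *arg)
      {
      	unsigned long addr = addr_hint &
      		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
      	/* Sub-page that the faulting address falls into: handle it last. */
      	int target = (addr_hint - addr) >> PAGE_SHIFT;
      	int i;

      	for (i = pages_per_huge_page - 1; i >= 0; i--) {
      		if (i == target)
      			continue;
      		cond_resched();
      		process_subpage(addr + i * PAGE_SIZE, i, arg);
      	}
      	/* Touch the target sub-page last so its cache lines stay hot. */
      	cond_resched();
      	process_subpage(addr + target * PAGE_SIZE, target, arg);
      }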
      
      Link: http://lkml.kernel.org/r/20180524005851.4079-2-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: use mm_file_counter to determine update which rss counter · fadae295
      Authored by Yang Shi
      Since commit eca56ff9 ("mm, shmem: add internal shmem resident
      memory accounting"), MM_SHMEMPAGES has been added to separate the shmem
      accounting from regular files.  So, all shmem pages should be accounted
      to MM_SHMEMPAGES instead of MM_FILEPAGES.
      
      And, normal 4K shmem pages have been accounted to MM_SHMEMPAGES, so
      shmem THP pages should not be treated differently.  Account them to
      MM_SHMEMPAGES via mm_counter_file(), since shmem pages are swap backed,
      to keep consistent with normal 4K shmem pages.
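      For reference, mm_counter_file() is essentially the helper below, so the
      THP accounting sites only need to switch from a hard-coded MM_FILEPAGES
      to mm_counter_file(page) in their add_mm_counter() calls:

      static inline int mm_counter_file(struct page *page)
      {
      	/* shmem pages are swap backed, so they land in MM_SHMEMPAGES. */
      	if (PageSwapBacked(page))
      		return MM_SHMEMPAGES;
      	return MM_FILEPAGES;
      }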
      
      This will not change the rss counter of processes since shmem pages are
      still a part of it.
      
      The /proc/pid/status and /proc/pid/statm counters will however be more
      accurate wrt shmem usage, as originally intended.  And as eca56ff9
      ("mm, shmem: add internal shmem resident memory accounting") mentioned,
      oom also could report more accurate "shmem-rss".
      
      Link: http://lkml.kernel.org/r/1529442518-17398-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: remove VM_MIXEDMAP for fsdax and device dax · e1fb4a08
      Authored by Dave Jiang
      This patch is reworked from an earlier patch that Dan has posted:
      https://patchwork.kernel.org/patch/10131727/
      
      VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
      the memory page it is dealing with is not typical memory from the linear
      map.  The get_user_pages_fast() path, since it does not resolve the vma,
      is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
      use that as a VM_MIXEDMAP replacement in some locations.  In the cases
      where there is no pte to consult we fallback to using vma_is_dax() to
      detect the VM_MIXEDMAP special case.
      
      Now that we have explicit driver pfn_t-flag opt-in/opt-out for
      get_user_pages() support for DAX we can stop setting VM_MIXEDMAP.  This
      also means we no longer need to worry about safely manipulating vm_flags
      in a future where we support dynamically changing the dax mode of a
      file.
      
      DAX should also now be supported with madvise_behavior(), vma_merge(),
      and copy_page_range().
      
      This patch has been tested against ndctl unit test.  It has also been
      tested against xfstests commit: 625515d using fake pmem created by
      memmap and no additional issues have been observed.
      
      Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 11 August 2018, 1 commit
  11. 02 August 2018, 1 commit
    • mm: delete historical BUG from zap_pmd_range() · 53406ed1
      Authored by Hugh Dickins
      Delete the old VM_BUG_ON_VMA() from zap_pmd_range(), which asserted
      that mmap_sem must be held when splitting an "anonymous" vma there.
      Whether that's still strictly true nowadays is not entirely clear,
      but the danger of sometimes crashing on the BUG is now fairly clear.
      
      Even with the new stricter rules for anonymous vma marking, the
      condition it checks for can possibly trigger. Commit 44960f2a
      ("staging: ashmem: Fix SIGBUS crash when traversing mmaped ashmem
      pages") is good, and originally I thought it was safe from that
      VM_BUG_ON_VMA(), because the /dev/ashmem fd exposed to the user is
      disconnected from the vm_file in the vma, and madvise(,,MADV_REMOVE)
      insists on VM_SHARED.
      
      But after I read John's earlier mail, drawing attention to the
      vfs_fallocate() in there: I may be wrong, and I don't know if Android
      has THP in the config anyway, but it looks to me like an
      unmap_mapping_range() from ashmem's vfs_fallocate() could hit precisely
      the VM_BUG_ON_VMA(), once it's vma_is_anonymous().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 17 July 2018, 1 commit
    • x86/mm/tlb: Leave lazy TLB mode at page table free time · 2ff6ddf1
      Authored by Rik van Riel
      Andy discovered that speculative memory accesses while in lazy
      TLB mode can crash a system, when a CPU tries to dereference a
      speculative access using memory contents that used to be valid
      page table memory, but have since been reused for something else
      and point into la-la land.
      
      The latter problem can be prevented in two ways. The first is to
      always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
      the second one is to only send the TLB shootdown at page table
      freeing time.
      
      The second should result in fewer IPIs, since operations like
      mprotect and madvise are very common with some workloads, but
      do not involve page table freeing. Also, on munmap, batching
      of page table freeing covers much larger ranges of virtual
      memory than the batching of unmapped user pages.
      Tested-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: efault@gmx.de
      Cc: kernel-team@fb.com
      Cc: luto@kernel.org
      Link: http://lkml.kernel.org/r/20180716190337.26133-3-riel@surriel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  13. 09 July 2018, 1 commit
  14. 21 June 2018, 1 commit
    • x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings · 42e4089c
      Authored by Andi Kleen
      For L1TF, PROT_NONE mappings are protected by inverting the PFN in the page
      table entry. This sets the high bits in the CPU's address space, thus
      making sure an unmapped entry does not point to valid cached memory.
      
      Some server system BIOSes put the MMIO mappings high up in the physical
      address space. If such a high mapping was mapped to unprivileged users
      they could attack low memory by setting such a mapping to PROT_NONE. This
      could happen through a special device driver which is not access
      protected. Normal /dev/mem is of course access protected.
      
      To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.
      
      Valid page mappings are allowed because the system is then unsafe anyway.
      
      It's not expected that users commonly use PROT_NONE on MMIO. But to
      minimize any impact this is only enforced if the mapping actually refers to
      a high MMIO address (defined as the MAX_PA-1 bit being set), and the check
      is also skipped for root.
      
      For mmaps this is straightforward and can be handled in vm_insert_pfn and
      in remap_pfn_range().
      
      For mprotect it's a bit trickier. At the point where the actual PTEs are
      accessed a lot of state has been changed and it would be difficult to undo
      on an error. Since this is a uncommon case use a separate early page talk
      walk pass for MMIO PROT_NONE mappings that checks for this condition
      early. For non MMIO and non PROT_NONE there are no changes.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
  15. 08 June 2018, 3 commits
  16. 01 June 2018, 1 commit