1. 23 Aug 2023, 3 commits
  2. 11 Nov 2022, 1 commit
  3. 19 Oct 2022, 1 commit
  4. 18 Jul 2022, 1 commit
    • mm: fix race between MADV_FREE reclaim and blkdev direct IO read · b873d752
      Committed by Mauricio Faria de Oliveira
      stable inclusion
      from stable-v5.10.111
      commit 3c3fbfa6dddb022e91be18902c3785a82aa4867d
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I5GL1Z
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3c3fbfa6dddb022e91be18902c3785a82aa4867d
      
      --------------------------------
      
      commit 6c8e2a25 upstream.
      
      Problem:
      =======
      
      Userspace might read the zero-page instead of actual data from a direct IO
      read on a block device if madvise(MADV_FREE) was called on the buffers
      earlier (discussed below), due to a race between page reclaim on
      MADV_FREE and blkdev direct IO read.
      
      - Race condition:
        ==============
      
      During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
      if the page is not dirty, then discards its rmap PTE(s) (vs.  remap back
      if the page is dirty).
      
      However, after try_to_unmap_one() returns to shrink_page_list(), it might
      keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
      _one_ page reference, from the isolation for page reclaim).
      
      Well, blkdev_direct_IO() gets references for all pages, and on READ
      operations it only sets them dirty _later_.
      
      So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
      IO read from block devices, and page reclaim happens during
      __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
      returns, but BEFORE the pages are set dirty, the problem occurs.
      
      The direct IO read eventually completes.  Now, when userspace reads the
      buffers, the PTE is no longer there and the page fault handler
      do_anonymous_page() services that with the zero-page, NOT the data!
      
      A synthetic reproducer is provided.
      
      - Page faults:
        ===========
      
      If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
      happen, because that faults-in all pages as writeable, so
      do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
      direct IO.  The userspace reads don't fault as the PTE is there (thus
      zero-page is not used/setup).
      
      But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE
      is no longer there; the subsequent page faults can't help:
      
      The data-read from the block device probably won't generate faults due to
      DMA (no MMU), but even if it didn't use DMA, the read happens on
      different virtual addresses (not user-mapped addresses) because `struct
      bio_vec` stores `struct page` to figure out addresses (which are different
      from user-mapped addresses) for the read.
      
      Thus userspace reads (to user-mapped addresses) still fault, then
      do_anonymous_page() gets another `struct page` that would address/map to
      other memory than the `struct page` used by `struct bio_vec` for the read.
      (The original `struct page` is not available, since it wasn't freed, as
      page_ref_freeze() failed due to more page refs.  And even if it were
      available, its data cannot be trusted anymore.)
      
      Solution:
      ========
      
      One solution is to check for the expected page reference count in
      try_to_unmap_one().
      
      There should be one reference from the isolation (that is also checked in
      shrink_page_list() with page_ref_freeze()) plus one or more references
      from page mapping(s) (put in discard: label).  Further references mean
      that rmap/PTE cannot be unmapped/nuked.
      
      (Note: there might be more than one reference from mapping due to
      fork()/clone() without CLONE_VM, which use the same `struct page` for
      references, until the copy-on-write page gets copied.)
      
      So, additional page references (e.g., from direct IO read) now prevent the
      rmap/PTE from being unmapped/dropped, similarly to how the page itself is
      not freed per shrink_page_list()/page_ref_freeze().
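
      For illustration, the shape of the added check, as a simplified sketch of
      the fix (not the verbatim hunk; see the Reference link for the exact code):

      	/* In try_to_unmap_one(), for a clean MADV_FREE (lazyfree) page: */
      	if (!PageSwapBacked(page)) {
      		int ref_count, map_count;

      		smp_mb();			/* pairs with the GUP fast path */
      		ref_count = page_ref_count(page);
      		map_count = page_mapcount(page);

      		smp_rmb();			/* refcount before dirty flag */
      		if (ref_count == 1 + map_count && !PageDirty(page))
      			goto discard;		/* only isolation + rmap refs */

      		/* Extra references (e.g. direct IO read): keep the mapping. */
      		set_pte_at(mm, address, pvmw.pte, pteval);
      		SetPageSwapBacked(page);
      		ret = false;
      	}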
      
      - Races and Barriers:
        ==================
      
      The new check in try_to_unmap_one() should be safe in races with
      bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
      done under the PTE lock.
      
      The fast path doesn't take the lock, but it checks if the PTE has changed
      and if so, it drops the reference and leaves the page for the slow path
      (which does take that lock).
      
      The fast path requires synchronization w/ full memory barrier: it writes
      the page reference count first then it reads the PTE later, while
      try_to_unmap() writes PTE first then it reads page refcount.
      
      And a second barrier is needed, as the page dirty flag should not be read
      before the page reference count (as in __remove_mapping()).  (This can be
      a load memory barrier only; no writes are involved.)
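
      Sketched as a two-CPU pairing ('W' = write, 'R' = read):

      	/*
      	 * try_to_unmap_one()             gup_pte_range() (fast path)
      	 * ------------------             ---------------------------
      	 * ptep_get_and_clear() // W PTE  try_grab_compound_head() // W refcount
      	 * smp_mb()                       (atomic RMW acts as full barrier)
      	 * page_ref_count()     // R ref  pte_val(*ptep) != pte_val(pte) // R PTE
      	 *
      	 * Whichever side runs second observes the other side's write, so
      	 * either the refcount check or the PTE re-check fails safely.
      	 */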
      
      Call stack/comments:
      
      - try_to_unmap_one()
        - page_vma_mapped_walk()
          - map_pte()			# see pte_offset_map_lock():
              pte_offset_map()
              spin_lock()
      
        - ptep_get_and_clear()	# write PTE
        - smp_mb()			# (new barrier) GUP fast path
        - page_ref_count()		# (new check) read refcount
      
        - page_vma_mapped_walk_done()	# see pte_unmap_unlock():
            pte_unmap()
            spin_unlock()
      
      - bio_iov_iter_get_pages()
        - __bio_iov_iter_get_pages()
          - iov_iter_get_pages()
            - get_user_pages_fast()
              - internal_get_user_pages_fast()
      
                # fast path
                - lockless_pages_from_mm()
                  - gup_{pgd,p4d,pud,pmd,pte}_range()
                      ptep = pte_offset_map()		# not _lock()
                      pte = ptep_get_lockless(ptep)
      
                      page = pte_page(pte)
                      try_grab_compound_head(page)	# inc refcount
                                                  	# (RMW/barrier
                                                   	#  on success)
      
                      if (pte_val(pte) != pte_val(*ptep)) # read PTE
                              put_compound_head(page) # dec refcount
                              			# go slow path
      
                # slow path
                - __gup_longterm_unlocked()
                  - get_user_pages_unlocked()
                    - __get_user_pages_locked()
                      - __get_user_pages()
                        - follow_{page,p4d,pud,pmd}_mask()
                          - follow_page_pte()
                              ptep = pte_offset_map_lock()
                              pte = *ptep
                              page = vm_normal_page(pte)
                              try_grab_page(page)	# inc refcount
                              pte_unmap_unlock()
      
      - Huge Pages:
        ==========
      
      Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
      (aka lazyfree) pages are PageAnon() && !PageSwapBacked()
      (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
      thus should reach shrink_page_list() -> split_huge_page_to_list() before
      try_to_unmap[_one](), so it deals with normal pages only.
      
      (And in case the unlikely TTU_SPLIT_HUGE_PMD/split_huge_pmd_address()
      path happens, which should not occur or be rare, the page refcount should
      be greater than mapcount: the head page is referenced by tail pages.
      That also prevents checking the head `page` and then incorrectly calling
      page_remove_rmap(subpage) for a tail page that isn't even in the
      shrink_page_list()'s page_list (an effect of split huge pmd/pvmw), as
      might happen today in this unlikely scenario.)
      
      MADV_FREE'd buffers:
      ===================
      
      So, back to the "if MADV_FREE pages are used as buffers" note.  The case
      is arguable, and subject to multiple interpretations.
      
      The madvise(2) manual page on the MADV_FREE advice value says:
      
      1) 'After a successful MADV_FREE ... data will be lost when
         the kernel frees the pages.'
      2) 'the free operation will be canceled if the caller writes
         into the page' / 'subsequent writes ... will succeed and
         then [the] kernel cannot free those dirtied pages'
      3) 'If there is no subsequent write, the kernel can free the
         pages at any time.'
      
      Thoughts, questions, considerations... respectively:
      
      1) Since the kernel didn't actually free the page (page_ref_freeze()
         failed), should the data not have been lost? (on userspace read.)
      2) Should writes performed by the direct IO read be able to cancel
         the free operation?
         - Should the direct IO read be considered as 'the caller' too,
           as it's been requested by 'the caller'?
         - Should the bio technique to dirty pages on return to userspace
           (bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
           be considered in another/special way here?
      3) Should an upcoming write from a previously requested direct IO
         read be considered as a subsequent write, so the kernel should
         not free the pages? (as it's known at the time of page reclaim.)
      
      And lastly:
      
      Technically, the last point seems a reasonable consideration and balance,
      as the madvise(2) manual page apparently (and fairly) assumes that
      'writes' are memory accesses from the userspace process (not explicitly
      considering writes from the kernel or its corner cases; again, fairly).
      Plus, the kernel fix for the corner case of the largely 'non-atomic
      write' encompassed by a direct IO read operation is relatively simple,
      and it helps.
      
      Reproducer:
      ==========
      
      @ test.c (simplified, but works)
      
      	#define _GNU_SOURCE
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      
      	int main() {
      		int fd, i;
      		char *buf;
      
      		fd = open(DEV, O_RDONLY | O_DIRECT);
      
      		buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                      	   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			buf[i] = 1; // init to non-zero
      
      		madvise(buf, BUF_SIZE, MADV_FREE);
      
      		read(fd, buf, BUF_SIZE);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			printf("%p: 0x%x\n", &buf[i], buf[i]);
      
      		return 0;
      	}
      
      @ block/fops.c (formerly fs/block_dev.c)
      
      	+#include <linux/swap.h>
      	...
      	... __blkdev_direct_IO[_simple](...)
      	{
      	...
      	+	if (!strcmp(current->comm, "good"))
      	+		shrink_all_memory(ULONG_MAX);
      	+
               	ret = bio_iov_iter_get_pages(...);
      	+
      	+	if (!strcmp(current->comm, "bad"))
      	+		shrink_all_memory(ULONG_MAX);
      	...
      	}
      
      @ shell
      
              # NUM_PAGES=4
              # PAGE_SIZE=$(getconf PAGE_SIZE)
      
              # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
              # DEV=$(losetup -f --show test.img)
      
              # gcc -DDEV=\"$DEV\" \
                    -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
                    -DPAGE_SIZE=${PAGE_SIZE} \
                     test.c -o test
      
              # od -tx1 $DEV
              0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
              *
              0040000
      
              # mv test good
              # ./good
              0x7f7c10418000: 0x79
              0x7f7c10419000: 0x79
              0x7f7c1041a000: 0x79
              0x7f7c1041b000: 0x79
      
              # mv good bad
              # ./bad
              0x7fa1b8050000: 0x0
              0x7fa1b8051000: 0x0
              0x7fa1b8052000: 0x0
              0x7fa1b8053000: 0x0
      
      Note: the issue is consistent on v5.17-rc3, but it's intermittent with the
      support of MADV_FREE on v4.5 (60%-70% error; needs swap).  [wrap
      do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c].
      
      - v5.17-rc3:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x0
      
              # free | grep Swap
              Swap:             0           0           0
      
      - v4.5:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 2702  0x0
                 1298  0x79
      
              # swapoff -av
              swapoff /swap
      
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
      Ceph/TCMalloc:
      =============
      
      For documentation purposes, the use case driving the analysis/fix is Ceph
      on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to
      release unused memory to the system from the mmap'ed page heap (it might
      be committed back/used again; it's not munmap'ed):
      - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise()
      - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing
      
      Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
      release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
      just 'disappeared' on Ceph on later Ubuntu releases but is still present
      in the kernel, and can be hit by other use cases.
      
      The observed issue seems to be the old Ceph bug #22464 [1], where checksum
      mismatches are observed (and instrumentation with buffer dumps shows
      zero-pages read from mmap'ed/MADV_FREE'd page ranges).
      
      The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
      mostly worked around with a retry mechanism, but other parts of Ceph could
      still hit that (rocksdb).  Anyway, it's less likely to be hit again as
      TCMalloc switched out of MADV_FREE by default.
      
      (Some kernel versions/reports from the Ceph bug, and relation with
      the MADV_FREE introduction/changes; TCMalloc versions not checked.)
      - 4.4 good
      - 4.5 (madv_free: introduction)
      - 4.9 bad
      - 4.10 good? maybe a swapless system
      - 4.12 (madv_free: no longer free instantly on swapless systems)
      - 4.13 bad
      
      [1] https://tracker.ceph.com/issues/22464
      
      Thanks:
      ======
      
      Several people contributed to analysis/discussions/tests/reproducers in
      the first stages when drilling down on ceph/tcmalloc/linux kernel:
      
      - Dan Hill
      - Dan Streetman
      - Dongdong Tao
      - Gavin Guo
      - Gerald Yang
      - Heitor Alves de Siqueira
      - Ioanna Alifieraki
      - Jay Vosburgh
      - Matthew Ruffell
      - Ponnuvel Palaniyappan
      
      Reviews, suggestions, corrections, comments:
      
      - Minchan Kim
      - Yu Zhao
      - Huang, Ying
      - John Hubbard
      - Christoph Hellwig
      
      [mfo@canonical.com: v4]
        Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.com
        Link: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com
      
      Fixes: 802a3a92 ("mm: reclaim MADV_FREE pages")
      Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Hill <daniel.hill@canonical.com>
      Cc: Dan Streetman <dan.streetman@canonical.com>
      Cc: Dongdong Tao <dongdong.tao@canonical.com>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Gerald Yang <gerald.yang@canonical.com>
      Cc: Heitor Alves de Siqueira <halves@canonical.com>
      Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
      Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>
      Cc: <stable@vger.kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [mfo: backport: replace folio/test_flag with page/flag equivalents;
       real Fixes: 854e9ed0 ("mm: support madvise(MADV_FREE)") in v4.]
      Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Reviewed-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  5. 22 Jan 2022, 1 commit
    • mm/pin_mem: improve pin mem pages rmap and free method · a8576876
      Committed by Jingxian He
      euleros inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4QPBH
      CVE: NA
      
      ------------
      
      1. Improve pin mem pages rmap:
      Add a Hotreplace flag for pin mem pages to avoid splitting.
      When the Hotreplace flag is set, the page will not be
      added to the deferred_split page list during rmap.
      If pin mem pages were added to the deferred_split page list,
      the deferred_split_scan ops would split pages which have been pinned.
      If a pin mem page is split, we can't remap the page into the process
      being recovered with the recorded pin mem mapping rule.
      Moreover, the deferred_split page list node can be corrupted when the
      deferred_split_scan function and pin pages remapping execute at the
      same time.
      
      2. Improve the free method for pin mem pages:
      Use the put_page method instead of calling free_pages directly.
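
      A minimal sketch of the first idea; the flag and helper names below
      (PageHotreplace) are assumptions for illustration, not necessarily the
      names used by the actual patch:

      	/* Assumption: pin mem pages carry a hotreplace page flag. */
      	void deferred_split_huge_page(struct page *page)
      	{
      		if (PageHotreplace(page))	/* hypothetical helper */
      			return;		/* never queue pinned pages for split */
      		/* ... existing deferred_split_huge_page() logic ... */
      	}
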
      Signed-off-by: Jingxian He <hejingxian@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  6. 15 Nov 2021, 1 commit
  7. 12 Oct 2021, 5 commits
  8. 14 Jul 2021, 2 commits
  9. 12 Jan 2021, 1 commit
    • mm/rmap: always do TTU_IGNORE_ACCESS · 6d38f6ae
      Committed by Shakeel Butt
      stable inclusion
      from stable-5.10.4
      commit dd156e3fcabff9ac2f102ae92f9b2f5dd8525e4d
      bugzilla: 46903
      
      --------------------------------
      
      [ Upstream commit 013339df ]
      
      Since commit 369ea824 ("mm/rmap: update to new mmu_notifier semantic
      v2"), the code to check the secondary MMU's page table access bit is
      broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
      secondary MMU's page table before the check.  More specifically for those
      secondary MMUs which unmap the memory in
      mmu_notifier_invalidate_range_start() like kvm.
      
      However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS), i.e. the
      absence of TTU_IGNORE_ACCESS, and it explicitly performs the page table
      access check before trying to unmap the page.  So, at worst the reclaim
      will miss accesses in a very short window if we remove the page table
      access check in the unmapping code.
      
      There is an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg
      reclaim.  From memcg reclaim, page_referenced() only accounts the
      accesses from the processes which are in the same memcg as the target
      page, but the unmapping code considers accesses from all processes, thus
      decreasing the effectiveness of memcg reclaim.
      
      The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
      code.
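
      Roughly, this is the branch removed from try_to_unmap_one() (a simplified
      sketch, not the full diff):

      	/* Removed: with TTU_IGNORE_ACCESS now implied, reclaim no longer
      	 * bails out here when the page was recently accessed. */
      	if (!(flags & TTU_IGNORE_ACCESS)) {
      		if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) {
      			ret = false;
      			page_vma_mapped_walk_done(&pvmw);
      			break;
      		}
      	}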
      
      Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
      Fixes: 369ea824 ("mm/rmap: update to new mmu_notifier semantic v2")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
  10. 15 Nov 2020, 1 commit
    • hugetlbfs: fix anon huge page migration race · 336bf30e
      Committed by Mike Kravetz
      Qian Cai reported the following BUG in [1]
      
        LTP: starting move_pages12
        BUG: unable to handle page fault for address: ffffffffffffffe0
        ...
        RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
        Call Trace:
          rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
          try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
          migrate_pages+0x1005/0x1fb0
          move_pages_and_store_status.isra.47+0xd7/0x1a0
          __x64_sys_move_pages+0xa5c/0x1100
          do_syscall_64+0x5f/0x310
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Hugh Dickins diagnosed this as a migration bug caused by code introduced
      to use i_mmap_rwsem for pmd sharing synchronization.  Specifically, the
      routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
      flag to try_to_unmap() while holding i_mmap_rwsem.  This is wrong for
      anon pages as the anon_vma_lock should be held in this case.  Further
      analysis suggested that i_mmap_rwsem was not required to be held at all
      when calling try_to_unmap for anon pages, as an anon page could never be
      part of a shared pmd mapping.
      
      Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
      to drop the page lock and acquire i_mmap_rwsem is wrong.  There is no way
      to keep the mapping valid while dropping the page lock.
      
      This patch does the following:
      
       - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
         calling try_to_unmap.
      
       - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
         will now simply do a 'trylock' while still holding the page lock
         (sketched below). If the trylock fails, it will return NULL. This
         could impact the callers:
      
          - migration calling code will receive -EAGAIN and retry up to the
            hard coded limit (10).
      
          - memory error code will treat the page as BUSY. This will force
            killing (SIGKILL) any mapping tasks instead of sending SIGBUS.
      
         Do note that this change in behavior only happens when there is a
         race. None of the standard kernel testing suites actually hit this
         race, but it is possible.
      
      [1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
      [2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/
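
      A sketch of the simplified routine after this change (details may differ
      from the actual patch):

      	struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage)
      	{
      		struct address_space *mapping = page_mapping(hpage);

      		if (!mapping)
      			return mapping;

      		/* The page lock stays held; just try for the semaphore. */
      		if (i_mmap_trylock_write(mapping))
      			return mapping;

      		return NULL;	/* callers retry (migration) or treat as BUSY */
      	}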
      
      Fixes: c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
      Reported-by: Qian Cai <cai@lca.pw>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 17 Oct 2020, 1 commit
  12. 06 Sep 2020, 1 commit
    • mm/rmap: fixup copying of soft dirty and uffd ptes · ad7df764
      Committed by Alistair Popple
      During memory migration a pte is temporarily replaced with a migration
      swap pte.  Some pte bits from the existing mapping such as the soft-dirty
      and uffd write-protect bits are preserved by copying these to the
      temporary migration swap pte.
      
      However these bits are not stored at the same location for swap and
      non-swap ptes.  Therefore testing these bits requires using the
      appropriate helper function for the given pte type.
      
      Unfortunately several code locations were found where the wrong helper
      function is being used to test soft_dirty and uffd_wp bits, which leads
      to them getting incorrectly set or cleared during page migration.
      
      Fix these by using the correct tests based on pte type.
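
      The pattern of the fix, sketched (the actual diff touches the migration
      and device-private paths in mm/rmap.c and mm/migrate.c):

      	/* Here pte holds a swap entry (e.g. a device-private page), so the
      	 * non-swap helper reads the wrong bit: */
      	if (pte_soft_dirty(pte))		/* WRONG for a swap pte */
      		swp_pte = pte_swp_mksoft_dirty(swp_pte);

      	if (pte_swp_soft_dirty(pte))		/* correct swap-pte helper */
      		swp_pte = pte_swp_mksoft_dirty(swp_pte);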
      
      Fixes: a5430dda ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
      Fixes: 8c3328f1 ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
      Fixes: f45ec5ff ("userfaultfd: wp: support swap and page migration")
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200825064232.10023-2-alistair@popple.id.au
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 15 Aug 2020, 2 commits
  14. 13 Aug 2020, 1 commit
    • hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem · 34ae204f
      Committed by Mike Kravetz
      Commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
      synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
      in at least read mode.  This is because the explicit locking in
      huge_pmd_share (called by huge_pte_alloc) was removed.  When restructuring
      the code, the call to huge_pte_alloc in the else block at the beginning of
      hugetlb_fault was missed.
      
      Unfortunately, that else clause is exercised when there is no page table
      entry.  This will likely lead to a call to huge_pmd_share.  If
      huge_pmd_share thinks pmd sharing is possible, it will traverse the
      mapping tree (i_mmap) without holding i_mmap_rwsem.  If someone else is
      modifying the tree, bad things such as addressing exceptions or worse
      could happen.
      
      Simply remove the else clause.  It should have been removed previously.
      The code following the else will call huge_pte_alloc with the appropriate
      locking.
      
      To prevent this type of issue in the future, add routines to assert that
      i_mmap_rwsem is held, and call these routines in huge pmd sharing
      routines.
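
      The assertion helpers are thin lockdep wrappers; a sketch per the
      description:

      	static inline void i_mmap_assert_locked(struct address_space *mapping)
      	{
      		lockdep_assert_held(&mapping->i_mmap_rwsem);
      	}

      	static inline void i_mmap_assert_write_locked(struct address_space *mapping)
      	{
      		lockdep_assert_held_write(&mapping->i_mmap_rwsem);
      	}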
      
      Fixes: c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A.Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 10 Jun 2020, 1 commit
  16. 04 Jun 2020, 2 commits
  17. 08 Apr 2020, 4 commits
    • mm: prevent a warning when casting void* -> enum · 4708f318
      Committed by Palmer Dabbelt
      I recently built the RISC-V port with LLVM trunk, which has introduced a
      new warning when casting from a pointer to an enum of a smaller size.
      This patch simply casts to a long in the middle to stop the warning.  I'd
      be surprised if this is the only one in the kernel, but it's the only one
      I saw.
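
      The shape of the fix (illustrative), for a callback that receives enum
      ttu_flags through a void * argument:

      	/* Casting void * straight to a narrower enum warns under LLVM;
      	 * widening through long first silences it: */
      	enum ttu_flags flags = (enum ttu_flags)(long)arg;
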
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200227211741.83165-1-palmer@dabbelt.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: wp: support swap and page migration · f45ec5ff
      Committed by Peter Xu
      For both swap and page migration, we use bit 2 of the entry to
      identify whether the entry is uffd write-protected.  It plays a similar
      role to the existing soft dirty bit in swap entries but only keeps
      the uffd-wp tracking for a specific PTE/PMD.
      
      Something special here is that when we want to recover the uffd-wp bit
      from a swap/migration entry to the PTE bit we'll also need to take care of
      the _PAGE_RW bit and make sure it's cleared, otherwise even with the
      _PAGE_UFFD_WP bit we can't trap it at all.
      
      In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
      That can lead to data mismatch if the page that we are going to write
      protect is swapped out when sending the UFFDIO_WRITEPROTECT.  This patch
      also applies/removes the uffd-wp bit even for the swap entries.
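
      A sketch of the restore side of the idea (not the exact upstream hunk):
      when a migration entry is turned back into a present PTE, the bit is
      carried over and the PTE kept write-protected:

      	/* Recover uffd-wp from the migration swap pte; the write bit must
      	 * stay clear or the next write can't be trapped: */
      	if (pte_swp_uffd_wp(*pvmw.pte))
      		pte = pte_mkuffd_wp(pte_wrprotect(pte));
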
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove CONFIG_TRANSPARENT_HUGE_PAGECACHE · 396bcc52
      Committed by Matthew Wilcox (Oracle)
      Commit e496cf3d ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
      notes that it should be reverted when the PowerPC problem was fixed.  The
      commit fixing the PowerPC problem (953c66c2) did not revert the
      commit; instead it set CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as
      CONFIG_TRANSPARENT_HUGEPAGE.  Checking with Kirill and Aneesh, this was an
      oversight, so remove the Kconfig symbol and undo the work of commit
      e496cf3d.
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "mm/rmap.c: reuse mergeable anon_vma as parent when fork" · 23ab76bf
      Committed by Li Xinhai
      This reverts commit 4e4a9eb9 ("mm/rmap.c: reuse mergeable
      anon_vma as parent when fork").
      
      In dup_mmap(), anon_vma_fork() is called for attaching anon_vma and
      parameter 'tmp' (i.e., the new vma of the child) has the same ->vm_next
      and ->vm_prev as its parent vma.  That causes the anon_vma used by the
      parent to be mistakenly shared by the child (in anon_vma_clone(), the code
      added by that commit does this reuse work).
      
      Besides this issue, the design of reusing anon_vma from vma which has gone
      through fork should be avoided ([1]).  So, this patch reverts that commit
      and maintains the consistent logic of reusing anon_vma for
      fork/split/merge vma.
      
      Reusing anon_vma within the process is fine.  But if a vma has gone
      through fork(), then that vma's anon_vma should not be shared with its
      neighbor vma.  As explained in [1], when a vma has gone through fork(),
      the check for list_is_singular(vma->anon_vma_chain) will be false, so the
      anon_vma is not shared.
      
      With the current issue, one example can clarify more.  The parent process
      does below two steps:
      
      1. p_vma_1 is created and p_anon_vma_1 is prepared;
      
      2. p_vma_2 is created and shares p_anon_vma_1 (this is allowed,
         because p_vma_1 didn't go through fork());
      
      then the parent process does fork():
      
      3. c_vma_1 is dup from p_vma_1, and has its own c_anon_vma_1
         prepared; at this point, c_vma_1->anon_vma_chain has two items, one
         for p_anon_vma_1 and one for c_anon_vma_1;
      
      4. c_vma_2 is dup from p_vma_2; it is not allowed to share
         c_anon_vma_1, because c_vma_1->anon_vma_chain has two items.
      
      [1] commit d0e9fe17 ("Simplify and comment on anon_vma re-use for
          anon_vma_prepare()") explains the test of "list_is_singular()".
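
      The gatekeeping test from [1] that this revert restores as the single
      sharing rule, sketched:

      	/* An old vma's anon_vma is only reusable when it is that vma's
      	 * sole anon_vma, i.e. the vma never went through fork(): */
      	anon_vma = READ_ONCE(old->anon_vma);
      	if (anon_vma && list_is_singular(&old->anon_vma_chain))
      		return anon_vma;	/* safe to reuse */
      	return NULL;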
      
      Fixes: 4e4a9eb9 ("mm/rmap.c: reuse mergeable anon_vma as parent when fork")
      Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/1581150928-3214-3-git-send-email-lixinhai.lxh@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 03 Apr 2020, 3 commits
    • hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization · c0d0381a
      Committed by Mike Kravetz
      Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.
      
      While discussing the issue with huge_pte_offset [1], I remembered that
      there were more outstanding hugetlb races.  These issues are:
      
      1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
         invalid via a call to huge_pmd_unshare by another thread.
      2) hugetlbfs page faults can race with truncation causing invalid global
         reserve counts and state.
      
      A previous attempt was made to use i_mmap_rwsem in this manner as
      described at [2].  However, those patches were reverted starting with [3]
      due to locking issues.
      
      To effectively use i_mmap_rwsem to address the above issues it needs to be
      held (in read mode) during page fault processing.  However, during fault
      processing we need to lock the page we will be adding.  Lock ordering
      requires we take page lock before i_mmap_rwsem.  Waiting until after
      taking the page lock is too late in the fault process for the
      synchronization we want to do.
      
      To address this lock ordering issue, the following patches change the lock
      ordering for hugetlb pages.  This is not too invasive as hugetlbfs
      processing is done separate from core mm in many places.  However, I don't
      really like this idea.  Much ugliness is contained in the new routine
      hugetlb_page_mapping_lock_write() of patch 1.
      
      The only other way I can think of to address these issues is by catching
      all the races.  After catching a race, cleanup, backout, retry ...  etc,
      as needed.  This can get really ugly, especially for huge page
      reservations.  At one time, I started writing some of the reservation
      backout code for page faults and it got so ugly and complicated I went
      down the path of adding synchronization to avoid the races.  Any other
      suggestions would be welcome.
      
      [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
      [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
      [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
      [4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
      [5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/
      
      This patch (of 2):
      
      While looking at BUGs associated with invalid huge page map counts, it was
      discovered and observed that a huge pte pointer could become 'invalid' and
      point to another task's page table.  Consider the following:
      
      A task takes a page fault on a shared hugetlbfs file and calls
      huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
      shared pmd.
      
      Now, another task truncates the hugetlbfs file.  As part of truncation, it
      unmaps everyone who has the file mapped.  If the range being truncated is
      covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
      last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
      to the pmd.  If the task in the middle of the page fault is not the last
      user, the ptep returned by huge_pte_alloc now points to another task's
      page table or worse.  This leads to bad things such as incorrect page
      map/reference counts or invalid memory references.
      
      To fix, expand the use of i_mmap_rwsem as follows:
      - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
        huge_pmd_share is only called via huge_pte_alloc, so callers of
        huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
        of huge_pte_alloc continue to hold the semaphore until finished with
        the ptep.
      - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
      
      One problem with this scheme is that it requires taking i_mmap_rwsem
      before taking the page lock during page faults.  This is not the order
      specified in the rest of mm code.  Handling of hugetlbfs pages is mostly
      isolated today.  Therefore, we use this alternative locking order for
      PageHuge() pages.
      
               mapping->i_mmap_rwsem
                 hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
                   page->flags PG_locked (lock_page)
      
      To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
      introduced to write lock the i_mmap_rwsem associated with a page.
      
      In most cases it is easy to get address_space via vma->vm_file->f_mapping.
      However, in the case of migration or memory errors for anon pages we do
      not have an associated vma.  A new routine _get_hugetlb_page_mapping()
      will use anon_vma to get address_space in these cases.
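
      A sketch of the caller pattern this series establishes (illustrative;
      'mapping' comes from vma->vm_file->f_mapping, or via the new routines for
      anon pages):

      	/* Callers of huge_pte_alloc() hold i_mmap_rwsem (read mode) across
      	 * the whole use of the returned ptep: */
      	i_mmap_lock_read(mapping);
      	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
      	if (ptep) {
      		/* ... fault processing using ptep ... */
      	}
      	i_mmap_unlock_read(mapping);
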
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vma: make is_vma_temporary_stack() available for general use · 222100ee
      Committed by Anshuman Khandual
      Currently the declaration and definition for is_vma_temporary_stack() are
      scattered.  Let's make the is_vma_temporary_stack() helper available for
      general use and also drop the declaration from include/linux/huge_mm.h,
      which is no longer required.  While at it, rename it as
      vma_is_temporary_stack() in line with existing helpers.  This should not
      cause any functional change.
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1582782965-3274-4-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages · 47e29d32
      Committed by John Hubbard
      For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
      scheme tends to overflow too easily: each tail page increments the head
      page->_refcount by GUP_PIN_COUNTING_BIAS (1024).  That limits the number
      of huge pages that can be pinned.
      
      This patch removes that limitation, by using an exact form of pin counting
      for compound pages of order > 1.  The "order > 1" is required because this
      approach uses the 3rd struct page in the compound page, and order 1
      compound pages only have two pages, so that won't work there.
      
      A new struct page field, hpage_pinned_refcount, has been added, replacing
      a padding field in the union (so no new space is used).
      
      This enhancement also has a useful side effect: huge pages and compound
      pages (of order > 1) do not suffer from the "potential false positives"
      problem that is discussed in the page_dma_pinned() comment block.  That is
      because these compound pages have extra space for tracking things, so they
      get exact pin counts instead of overloading page->_refcount.
      
      Documentation/core-api/pin_user_pages.rst is updated accordingly.
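
      Per the description, the exact count lives in the third struct page; a
      sketch of the accessor:

      	/* Compound pages of order > 1 keep an exact pin count in page[2],
      	 * so page->_refcount is no longer overloaded with the bias: */
      	static inline atomic_t *compound_pincount_ptr(struct page *page)
      	{
      		return &page[2].hpage_pinned_refcount;
      	}
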
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 02 Dec 2019, 1 commit
  20. 01 Dec 2019, 4 commits
  21. 19 Oct 2019, 1 commit
  22. 25 Sep 2019, 2 commits