1. 04 June 2020, 6 commits
    • mm: memcontrol: delete unused lrucare handling · d9eb1ea2
      Committed by Johannes Weiner
      Swapin faults were the last event to charge pages after they had already
      been put on the LRU list.  Now that we charge directly on swapin, the
      lrucare portion of the charge code is unused.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: charge swapin pages on instantiation · 4c6355b2
      Committed by Johannes Weiner
      Right now, users that are otherwise memory controlled can easily escape
      their containment and allocate significant amounts of memory that they're
      not being charged for.  That's because swap readahead pages are not being
      charged until somebody actually faults them into their page table.  This
      can be exploited with MADV_WILLNEED, which triggers arbitrary readahead
      allocations without charging the pages.
      
      There are additional problems with the delayed charging of swap pages:
      
      1. To implement refault/workingset detection for anonymous pages, we
         need to have a target LRU available at swapin time, but the LRU is not
         determinable until the page has been charged.
      
      2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be
         stable when the page is isolated from the LRU; otherwise, the locks
          change under us.  But swapcache gets charged after it's already on the
          LRU, and we cannot even isolate it ourselves beforehand (since charging
          is not exactly optional).
      
      The previous patch ensured we always maintain cgroup ownership records for
      swap pages.  This patch moves the swapcache charging point from the fault
      handler to swapin time to fix all of the above problems.
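
      As a rough illustration of the new charging point (a simplified sketch,
      not the actual diff: the helper swapin_instantiate() is made up, and the
      mem_cgroup_charge()/add_to_swap_cache() calls assume the post-series
      signatures):

      static struct page *swapin_instantiate(swp_entry_t entry, gfp_t gfp,
                                             struct vm_area_struct *vma)
      {
              struct page *page = alloc_page_vma(gfp, vma, vma->vm_start);

              if (!page)
                      return NULL;
              if (add_to_swap_cache(page, entry, gfp)) {
                      put_page(page);
                      return NULL;
              }
              /*
               * New charging point: the page is charged to the faulting mm's
               * memcg before anyone can map it, isolate it or fault it in.
               */
              if (mem_cgroup_charge(page, vma->vm_mm, gfp)) {
                      delete_from_swap_cache(page);
                      put_page(page);
                      return NULL;
              }
              lru_cache_add(page);    /* safe: page->mem_cgroup is set */
              return page;
      }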
      
      v2: simplify swapin error checking (Joonsoo)
      
      [hughd@google.com: fix livelock in __read_swap_cache_async()]
        Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005212246080.8458@eggly.anvils
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-17-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: switch to native NR_FILE_PAGES and NR_SHMEM counters · 0d1c2072
      Committed by Johannes Weiner
      Memcg maintains private MEMCG_CACHE and NR_SHMEM counters.  This
      divergence from the generic VM accounting means unnecessary code overhead,
      and creates a dependency for memcg that page->mapping is set up at the
      time of charging, so that page types can be told apart.
      
      Convert the generic accounting sites to mod_lruvec_page_state and friends
      to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM.
      The page is already locked in these places, so page->mem_cgroup is stable;
      we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure
      it's set up in time.
      
      Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private
      NR_SHMEM accounting sites.
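
      For illustration, an accounting site then looks roughly like this (a
      sketch only: the helper names are made up, while mod_lruvec_page_state()
      and the NR_FILE_PAGES/NR_SHMEM items are the generic vmstat interface the
      patch switches to):

      static void sketch_account_cache_insert(struct page *page, bool shmem)
      {
              /*
               * The page is locked and already charged, so page->mem_cgroup is
               * stable; one call updates node, zone and memcg counters.
               */
              mod_lruvec_page_state(page, NR_FILE_PAGES, 1);
              if (shmem)
                      mod_lruvec_page_state(page, NR_SHMEM, 1);
      }

      static void sketch_account_cache_delete(struct page *page, bool shmem)
      {
              mod_lruvec_page_state(page, NR_FILE_PAGES, -1);
              if (shmem)
                      mod_lruvec_page_state(page, NR_SHMEM, -1);
      }
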
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-10-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: convert page cache to a new mem_cgroup_charge() API · 3fea5a49
      Committed by Johannes Weiner
      The try/commit/cancel protocol that memcg uses dates back to when pages
      used to be uncharged upon removal from the page cache, and thus couldn't
      be committed before the insertion had succeeded.  Nowadays, pages are
      uncharged when they are physically freed; it doesn't matter whether the
      insertion was successful or not.  For the page cache, the transaction
      dance has become unnecessary.
      
      Introduce a mem_cgroup_charge() function that simply charges a newly
      allocated page to a cgroup and sets up page->mem_cgroup in one single
      step.  If the insertion fails, the caller doesn't have to do anything but
      free/put the page.
      
      Then switch the page cache over to this new API.
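
      The calling-convention change, roughly (an illustrative sketch: the
      wrapper below is hypothetical and the argument lists are assumed to match
      the post-series API).  Previously an insertion had to bracket the store
      with mem_cgroup_try_charge()/commit/cancel; now one call charges the page
      and sets page->mem_cgroup, and a failed insertion needs no cleanup beyond
      freeing the page:

      static int sketch_add_to_cache_charged(struct page *page,
                      struct address_space *mapping, pgoff_t index,
                      struct mm_struct *mm, gfp_t gfp)
      {
              int err;

              /* One step: charge the page and set page->mem_cgroup. */
              err = mem_cgroup_charge(page, mm, gfp);
              if (err)
                      return err;

              err = add_to_page_cache_locked(page, mapping, index, gfp);
              if (err)
                      put_page(page); /* uncharged when the page is freed */
              return err;
      }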
      
      Subsequent patches will also convert anon pages, but it needs a bit more
      prep work.  Right now, memcg depends on page->mapping being already set up
      at the time of charging, so that it can maintain its own MEMCG_CACHE and
      MEMCG_RSS counters.  For anon, page->mapping is set under the same pte
      lock under which the page is published, so a single charge point that can
      block doesn't work there just yet.
      
      The following prep patches will replace the private memcg counters with
      the generic vmstat counters, thus removing the page->mapping dependency,
      then complete the transition to the new single-point charge API and delete
      the old transactional scheme.
      
      v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
      v3: rebase on preceding shmem simplification patch
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: shmem: remove rare optimization when swapin races with hole punching · 14235ab3
      Committed by Johannes Weiner
      Commit 215c02bc ("tmpfs: fix shmem_getpage_gfp() VM_BUG_ON")
      recognized that hole punching can race with swapin and removed the
      BUG_ON() for a truncated entry from the swapin path.
      
      The patch also added a swapcache deletion to optimize this rare case:
      Since swapin has the page locked, and free_swap_and_cache() merely
      trylocks, this situation can leave the page stranded in swapcache.
      Usually, page reclaim picks up stale swapcache pages, and the race can
      happen at any other time when the page is locked.  (The same happens for
      non-shmem swapin racing with page table zapping.) The thinking here was:
      we already observed the race and we have the page locked, we may as well
      do the cleanup instead of waiting for reclaim.
      
      However, this optimization complicates the next patch which moves the
      cgroup charging code around.  As this is just a minor speedup for a race
      condition that is so rare that it required a fuzzer to trigger the
      original BUG_ON(), it's no longer worth the complications.
      Suggested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200511181056.GA339505@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: drop @compound parameter from memcg charging API · 3fba69a5
      Committed by Johannes Weiner
      The memcg charging API carries a boolean @compound parameter that tells
      whether the page we're dealing with is a hugepage.
      mem_cgroup_commit_charge() has another boolean @lrucare that indicates
      whether the page needs LRU locking or not while charging.  The majority of
      callsites know those parameters at compile time, which results in a lot of
      naked "false, false" argument lists.  This makes for cryptic code and is a
      breeding ground for subtle mistakes.
      
      Thankfully, the huge page state can be inferred from the page itself and
      doesn't need to be passed along.  This is safe because charging completes
      before the page is published, and thus before anybody can split it.
      
      Simplify the callsites by removing @compound, and let memcg infer the
      state by using hpage_nr_pages() unconditionally.  That function does
      PageTransHuge() to identify huge pages, which also helpfully asserts that
      nobody passes in tail pages by accident.
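
      Roughly, the charge-size derivation becomes (a sketch; the wrapper is
      made up, hpage_nr_pages() is the pre-rename kernel helper assumed here):

      static unsigned int sketch_charge_nr_pages(struct page *page)
      {
              /*
               * Was: compound ? HPAGE_PMD_NR : 1, passed in by every caller.
               * Now the size is read off the page itself; the PageTransHuge()
               * check inside hpage_nr_pages() also trips on stray tail pages.
               */
              return hpage_nr_pages(page);
      }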
      
      The following patches will introduce a new charging API, best not to carry
      over unnecessary weight.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 22 April 2020, 3 commits
  3. 08 April 2020, 7 commits
  4. 17 March 2020, 1 commit
  5. 19 February 2020, 1 commit
  6. 08 February 2020, 3 commits
  7. 07 February 2020, 2 commits
  8. 14 January 2020, 1 commit
    • mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment · 99158997
      Committed by Kirill A. Shutemov
      Shmem/tmpfs tries to provide THP-friendly mappings if huge pages are
      enabled.  But it doesn't work well with an above-47bit hint address.
      
      Normally, the kernel doesn't create userspace mappings above 47-bit,
      even if the machine allows this (such as with 5-level paging on x86-64).
      Not all user space is ready to handle wide addresses.  It's known that
      at least some JIT compilers use higher bits in pointers to encode their
      information.
      
      Userspace can ask for allocation from the full address space by specifying
      a hint address (with or without MAP_FIXED) above 47 bits.  If the
      application doesn't need a particular address, but wants to allocate from
      the whole address space, it can specify -1 as the hint address.
      
      Unfortunately, this trick breaks THP alignment in shmem/tmpfs:
      shmem_get_unmapped_area() would not try to allocate a PMD-aligned area if
      *any* hint address is specified.
      
      This can be fixed by requesting the aligned area if we failed to allocate
      at the user-specified hint address.  The request with inflated length will
      also take the user-specified hint address.  This way we will not lose an
      allocation request from the full address space.
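
      In outline, the retry looks roughly like this (a simplified sketch, not
      the literal diff: the wrapper name is made up and the bookkeeping around
      the final alignment is abbreviated):

      static unsigned long sketch_shmem_get_unmapped_area(struct file *file,
                      unsigned long uaddr, unsigned long len,
                      unsigned long pgoff, unsigned long flags)
      {
              unsigned long (*get_area)(struct file *, unsigned long,
                              unsigned long, unsigned long, unsigned long);
              unsigned long addr, inflated_len, inflated_addr;

              get_area = current->mm->get_unmapped_area;
              addr = get_area(file, uaddr, len, pgoff, flags);
              if (IS_ERR_VALUE(addr) || addr == uaddr)
                      return addr;    /* error, or the hint was honoured */

              /*
               * Retry with room to round up to a PMD boundary, still passing
               * the user's hint so a >47-bit hint keeps selecting the full
               * address space.
               */
              inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
              inflated_addr = get_area(file, uaddr, inflated_len, pgoff, flags);
              if (IS_ERR_VALUE(inflated_addr))
                      return addr;

              return round_up(inflated_addr, HPAGE_PMD_SIZE);
      }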
      
      [kirill@shutemov.name: fold in a fixup]
        Link: http://lkml.kernel.org/r/20191223231309.t6bh5hkbmokihpfu@box
      Link: http://lkml.kernel.org/r/20191220142548.7118-3-kirill.shutemov@linux.intel.com
      Fixes: b569bab7 ("x86/mm: Prepare to expose larger address space to userspace")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Willhalm, Thomas" <thomas.willhalm@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Bruggeman, Otto G" <otto.g.bruggeman@intel.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 02 December 2019, 4 commits
  10. 01 December 2019, 1 commit
    • shmem: pin the file in shmem_fault() if mmap_sem is dropped · 8897c1b1
      Committed by Kirill A. Shutemov
      syzbot found the following crash:
      
        BUG: KASAN: use-after-free in perf_trace_lock_acquire+0x401/0x530 include/trace/events/lock.h:13
        Read of size 8 at addr ffff8880a5cf2c50 by task syz-executor.0/26173
      
        CPU: 0 PID: 26173 Comm: syz-executor.0 Not tainted 5.3.0-rc6 #146
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
           perf_trace_lock_acquire+0x401/0x530 include/trace/events/lock.h:13
           trace_lock_acquire include/trace/events/lock.h:13 [inline]
           lock_acquire+0x2de/0x410 kernel/locking/lockdep.c:4411
           __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
           _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151
           spin_lock include/linux/spinlock.h:338 [inline]
           shmem_fault+0x5ec/0x7b0 mm/shmem.c:2034
           __do_fault+0x111/0x540 mm/memory.c:3083
           do_shared_fault mm/memory.c:3535 [inline]
           do_fault mm/memory.c:3613 [inline]
           handle_pte_fault mm/memory.c:3840 [inline]
           __handle_mm_fault+0x2adf/0x3f20 mm/memory.c:3964
           handle_mm_fault+0x1b5/0x6b0 mm/memory.c:4001
           do_user_addr_fault arch/x86/mm/fault.c:1441 [inline]
           __do_page_fault+0x536/0xdd0 arch/x86/mm/fault.c:1506
           do_page_fault+0x38/0x590 arch/x86/mm/fault.c:1530
           page_fault+0x39/0x40 arch/x86/entry/entry_64.S:1202
      
      It happens if the VMA got unmapped under us while we dropped mmap_sem
      and the inode got freed.
      
      Pinning the file if we drop mmap_sem fixes the issue.
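
      The shape of the fix, roughly (an illustrative sketch with a made-up
      function; get_file()/fput() and the mmap_sem naming of that era are the
      interfaces assumed here):

      static vm_fault_t sketch_shmem_fault_wait(struct vm_fault *vmf)
      {
              struct file *fpin;

              /*
               * About to drop mmap_sem and sleep until hole punching ends:
               * pin the file so the inode (and the lock embedded in its
               * shmem_inode_info) cannot be freed by a racing unmap.
               */
              fpin = get_file(vmf->vma->vm_file);
              up_read(&vmf->vma->vm_mm->mmap_sem);

              /* ... wait for the hole-punching range to be released ... */

              fput(fpin);     /* the inode stayed valid until here */
              return VM_FAULT_RETRY;
      }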
      
      Link: http://lkml.kernel.org/r/20190927083908.rhifa4mmaxefc24r@box
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: syzbot+03ee87124ee05af991bd@syzkaller.appspotmail.com
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 10 October 2019, 1 commit
  12. 29 September 2019, 1 commit
    • Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""" · 19deb769
      Committed by David Rientjes
      This reverts commit 92717d42.
      
      Since commit a8282608 ("Revert "mm, thp: restore node-local hugepage
      allocations"") is reverted in this series, it is better to restore the
      previous 5.2 behavior between the thp allocation and the page allocator
      rather than to attempt any consolidation or cleanup for a policy that is
      now reverted.  It's less risky during an rc cycle and subsequent patches
      in this series further modify the same policy that the pre-5.3 behavior
      implements.
      
      Consolidation and cleanup can be done subsequent to a sane default page
      allocation strategy, so this patch reverts a cleanup done on a strategy
      that is now reverted and thus is the least risky option.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 25 September 2019, 3 commits
  14. 13 September 2019, 5 commits
    • vfs: Convert ramfs, shmem, tmpfs, devtmpfs, rootfs to use the new mount API · f3235626
      Committed by David Howells
      Convert the ramfs, shmem, tmpfs, devtmpfs and rootfs filesystems to the new
      internal mount API as the old one will be obsoleted and removed.  This
      allows greater flexibility in communication of mount parameters between
      userspace, the VFS and the filesystem.
      
      See Documentation/filesystems/mount_api.txt for more information.
      
      Note that tmpfs is slightly tricky as its mount options can contain
      embedded commas, so they can't be trivially split up using strsep() to
      break on commas in generic_parse_monolithic().  Instead, tmpfs has to
      supply its own generic parser.
      
      However, if tmpfs changes, then devtmpfs and rootfs, which are wrappers
      around tmpfs or ramfs, must change too - and thus so must ramfs, so these
      had to be converted also.
      
      [AV: rewritten]
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Hugh Dickins <hughd@google.com>
      cc: linux-mm@kvack.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • shmem_parse_one(): switch to use of fs_parse() · 626c3920
      Committed by Al Viro
      This thing will eventually become our ->parse_param(), while
      shmem_parse_options() will become ->parse_monolithic().  At that point
      shmem_parse_options() will start calling vfs_parse_fs_string(),
      rather than calling shmem_parse_one() directly.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • shmem_parse_options(): take handling a single option into a helper · e04dc423
      Committed by Al Viro
      mechanical move.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • shmem_parse_options(): don't bother with mpol in separate variable · f6490b7f
      Committed by Al Viro
      just use ctx->mpol (note that callers always set ctx->mpol to NULL when
      calling that).
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • shmem_parse_options(): use a separate structure to keep the results · 0b5071dd
      Committed by Al Viro
      ... and copy the data from it into sbinfo in the callers.
      For use by remount we need to keep track of whether there'd
      been options setting max_inodes, max_blocks and huge respectively,
      and do the sanity checks (and copying) only if such options
      had been seen.  uid/gid/mode is ignored by remount and
      NULL mpol is already explicitly treated as "ignore it",
      so we don't need to keep track of those.
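
      As a rough picture (field names are illustrative, not the exact ones the
      patch uses): parsing fills a scratch context, and explicit "seen" flags
      record which of those options actually appeared, so remount validates and
      copies only them:

      struct sketch_shmem_options {
              unsigned long max_blocks;
              unsigned long max_inodes;
              int huge;
              kuid_t uid;
              kgid_t gid;
              umode_t mode;
              struct mempolicy *mpol;
              bool seen_blocks, seen_inodes, seen_huge;       /* set by parser */
      };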
      
      Note: theoretically, mpol_parse_string() may return NULL
      not in case of error (for default policy), so the assumption
      that NULL mpol means "change nothing" is incorrect.  However,
      that's the mainline behaviour and any changes belong in
      a separate patch.  If we go for that, we'll need to keep
      track of having encountered mpol= option too.
      
      [changes in remount logic from Hugh Dickins folded]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  15. 06 September 2019, 1 commit