1. 08 4月, 2020 2 次提交
    • H
      mm: huge tmpfs: try to split_huge_page() when punching hole · 71725ed1
      Hugh Dickins 提交于
      Yang Shi writes:
      
      Currently, when truncating a shmem file, if the range is partly in a THP
      (start or end is in the middle of THP), the pages actually will just get
      cleared rather than being freed, unless the range covers the whole THP.
      Even though all the subpages are truncated (randomly or sequentially), the
      THP may still be kept in page cache.
      
      This might be fine for some usecases which prefer preserving THP, but
      balloon inflation is handled in base page size.  So when using shmem THP
      as memory backend, QEMU inflation actually doesn't work as expected since
      it doesn't free memory.  But the inflation usecase really needs to get the
      memory freed.  (Anonymous THP will also not get freed right away, but will
      be freed eventually when all subpages are unmapped: whereas shmem THP
      still stays in page cache.)
      
      Split THP right away when doing partial hole punch, and if split fails
      just clear the page so that read of the punched area will return zeroes.
      
      Hugh Dickins adds:
      
      Our earlier "team of pages" huge tmpfs implementation worked in the way
      that Yang Shi proposes; and we have been using this patch to continue to
      split the huge page when hole-punched or truncated, since converting over
      to the compound page implementation.  Although huge tmpfs gives out huge
      pages when available, if the user specifically asks to truncate or punch a
      hole (perhaps to free memory, perhaps to reduce the memcg charge), then
      the filesystem should do so as best it can, splitting the huge page.
      
      That is not always possible: any additional reference to the huge page
      prevents split_huge_page() from succeeding, so the result can be flaky.
      But in practice it works successfully enough that we've not seen any
      problem from that.
      
      Add shmem_punch_compound() to encapsulate the decision of when a split is
      needed, and doing the split if so.  Using this simplifies the flow in
      shmem_undo_range(); and the first (trylock) pass does not need to do any
      page clearing on failure, because the second pass will either succeed or
      do that clearing.  Following the example of zero_user_segment() when
      clearing a partial page, add flush_dcache_page() and set_page_dirty() when
      clearing a hole - though I'm not certain that either is needed.
      
      But: split_huge_page() would be sure to fail if shmem_undo_range()'s
      pagevec holds further references to the huge page.  The easiest way to fix
      that is for find_get_entries() to return early, as soon as it has put one
      compound head or tail into the pagevec.  At first this felt like a hack;
      but on examination, this convention better suits all its callers - or will
      do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
      and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
      speedup by checking for compound pages there.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvilsSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71725ed1
    • H
      mm: code cleanup for MADV_FREE · 9de4f22a
      Huang Ying 提交于
      Some comments for MADV_FREE is revised and added to help people understand
      the MADV_FREE code, especially the page flag, PG_swapbacked.  This makes
      page_is_file_cache() isn't consistent with its comments.  So the function
      is renamed to page_is_file_lru() to make them consistent again.  All these
      are put in one patch as one logical change.
      Suggested-by: NDavid Hildenbrand <david@redhat.com>
      Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
      Suggested-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@kernel.org>
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9de4f22a
  2. 03 4月, 2020 2 次提交
  3. 01 2月, 2020 1 次提交
    • J
      mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages · 07d80269
      John Hubbard 提交于
      An upcoming patch changes and complicates the refcounting and especially
      the "put page" aspects of it.  In order to keep everything clean,
      refactor the devmap page release routines:
      
      * Rename put_devmap_managed_page() to page_is_devmap_managed(), and
        limit the functionality to "read only": return a bool, with no side
        effects.
      
      * Add a new routine, put_devmap_managed_page(), to handle decrementing
        the refcount for ZONE_DEVICE pages.
      
      * Change callers (just release_pages() and put_page()) to check
        page_is_devmap_managed() before calling the new
        put_devmap_managed_page() routine.  This is a performance point:
        put_page() is a hot path, so we need to avoid non- inline function calls
        where possible.
      
      * Rename __put_devmap_managed_page() to free_devmap_managed_page(), and
        limit the functionality to unconditionally freeing a devmap page.
      
      This is originally based on a separate patch by Ira Weiny, which applied
      to an early version of the put_user_page() experiments.  Since then,
      Jérôme Glisse suggested the refactoring described above.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-5-jhubbard@nvidia.comSigned-off-by: NIra Weiny <ira.weiny@intel.com>
      Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com>
      Suggested-by: NJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07d80269
  4. 01 12月, 2019 2 次提交
  5. 26 9月, 2019 1 次提交
    • M
      mm: introduce MADV_COLD · 9c276cc6
      Minchan Kim 提交于
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To increase hot start, Android userspace manages the order that apps
      should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd(low memory killer daemon).  They are likely to be killed by
      lmkd if the system has to reclaim memory.  In that sense they are similar
      to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidate for swap.  Under investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed(we measured performance swapping out vs.
      zapping the memory by killing a process.  Unsurprisingly, zapping is 10x
      times faster even though we use zram which is much faster than real
      storage) so kill from lmkd will often satisfy the high zone watermark,
      resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it could provide many chances for platform to use much information to
      optimize memory efficiency.
      
      To achieve the goal, the patchset introduce two new options for madvise.
      One is MADV_COLD which will deactivate activated pages and the other is
      MADV_PAGEOUT which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed immediately; MADV_COLD is similar
      to MADV_FREE in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed when memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it could
      give a hint to kernel that the pages can be reclaimed when memory pressure
      happens but data should be preserved for future use.  This could reduce
      workingset eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_COLD hint to madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help kernel in deciding which
      pages to evict early during memory pressure.
      
      It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
      
      	active file page -> inactive file LRU
      	active anon page -> inacdtive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
      LRU's head because MADV_COLD is a little bit different symantic.
      MADV_FREE means it's okay to discard when the memory pressure because the
      content of the page is *garbage* so freeing such pages is almost zero
      overhead since we don't need to swap out and access afterward causes just
      minor fault.  Thus, it would make sense to put those freeable pages in
      inactive file LRU to compete other used-once pages.  It makes sense for
      implmentaion point of view, too because it's not swapbacked memory any
      longer until it would be re-dirtied.  Even, it could give a bonus to make
      them be reclaimed on swapless system.  However, MADV_COLD doesn't mean
      garbage so reclaiming them requires swap-out/in in the end so it's bigger
      cost.  Since we have designed VM LRU aging based on cost-model, anonymous
      cold pages would be better to position inactive anon's LRU list, not file
      LRU.  Furthermore, it would help to avoid unnecessary scanning if system
      doesn't have a swap device.  Let's start simpler way without adding
      complexity at this moment.  However, keep in mind, too that it's a caveat
      that workloads with a lot of pages cache are likely to ignore MADV_COLD on
      anonymous memory because we rarely age anonymous LRU lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c276cc6
  6. 25 9月, 2019 2 次提交
  7. 15 7月, 2019 2 次提交
  8. 03 7月, 2019 1 次提交
  9. 02 7月, 2019 1 次提交
  10. 21 5月, 2019 1 次提交
  11. 15 5月, 2019 1 次提交
  12. 06 3月, 2019 1 次提交
  13. 22 2月, 2019 1 次提交
  14. 05 1月, 2019 1 次提交
  15. 29 12月, 2018 1 次提交
  16. 13 11月, 2018 1 次提交
    • L
      mm: Replace spin_is_locked() with lockdep · 35f3aa39
      Lance Roy 提交于
      lockdep_assert_held() is better suited to checking locking requirements,
      since it only checks if the current thread holds the lock regardless of
      whether someone else does. This is also a step towards possibly removing
      spin_is_locked().
      Signed-off-by: NLance Roy <ldr709@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <linux-mm@kvack.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.ibm.com>
      35f3aa39
  17. 27 10月, 2018 1 次提交
  18. 21 10月, 2018 1 次提交
  19. 30 9月, 2018 1 次提交
    • M
      xarray: Replace exceptional entries · 3159f943
      Matthew Wilcox 提交于
      Introduce xarray value entries and tagged pointers to replace radix
      tree exceptional entries.  This is a slight change in encoding to allow
      the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
      value entry).  It is also a change in emphasis; exceptional entries are
      intimidating and different.  As the comment explains, you can choose
      to store values or pointers in the xarray and they are both first-class
      citizens.
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      3159f943
  20. 22 5月, 2018 1 次提交
    • D
      mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS · e7638488
      Dan Williams 提交于
      In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
      be able to rely on the fact that they will get wakeups on dev_pagemap
      page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
      generic_dax_page_free() as common indicator / infrastructure for dax
      filesytems to require. With this change there are no users of the
      MEMORY_DEVICE_HOST designation, so remove it.
      
      The HMM sub-system extended dev_pagemap to arrange a callback when a
      dev_pagemap managed page is freed. Since a dev_pagemap page is free /
      idle when its reference count is 1 it requires an additional branch to
      check the page-type at put_page() time. Given put_page() is a hot-path
      we do not want to incur that check if HMM is not in use, so a static
      branch is used to avoid that overhead when not necessary.
      
      Now, the FS_DAX implementation wants to reuse this mechanism for
      receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
      static-key into a generic mechanism that either HMM or FS_DAX code paths
      can enable.
      
      For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
      care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
      However, we still need to support FS_DAX in the FS_DAX_LIMITED case
      implemented by the s390/dcssblk driver.
      
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Reported-by: NThomas Meyer <thomas@m3y3r.de>
      Reported-by: NDave Jiang <dave.jiang@intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      e7638488
  21. 06 4月, 2018 1 次提交
  22. 22 2月, 2018 2 次提交
    • M
      mm/swap.c: make functions and their kernel-doc agree (again) · cb6f0f34
      Mike Rapoport 提交于
      There was a conflict between the commit e02a9f04 ("mm/swap.c: make
      functions and their kernel-doc agree") and the commit f144c390 ("mm:
      docs: fix parameter names mismatch") that both tried to fix mismatch
      betweeen pagevec_lookup_entries() parameter names and their description.
      
      Since nr_entries is a better name for the parameter, fix the description
      again.
      
      Link: http://lkml.kernel.org/r/1518116946-20947-1-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: NRandy Dunlap <rdunlap@infradead.org>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb6f0f34
    • S
      mm, mlock, vmscan: no more skipping pagevecs · 9c4e6b1a
      Shakeel Butt 提交于
      When a thread mlocks an address space backed either by file pages which
      are currently not present in memory or swapped out anon pages (not in
      swapcache), a new page is allocated and added to the local pagevec
      (lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
      On I/O completion, the thread can wake on a different CPU, the mlock
      syscall will then sets the PageMlocked() bit of the page but will not be
      able to put that page in unevictable LRU as the page is on the pagevec
      of a different CPU.  Even on drain, that page will go to evictable LRU
      because the PageMlocked() bit is not checked on pagevec drain.
      
      The page will eventually go to right LRU on reclaim but the LRU stats
      will remain skewed for a long time.
      
      This patch puts all the pages, even unevictable, to the pagevecs and on
      the drain, the pages will be added on their LRUs correctly by checking
      their evictability.  This resolves the mlocked pages on pagevec of other
      CPUs issue because when those pagevecs will be drained, the mlocked file
      pages will go to unevictable LRU.  Also this makes the race with munlock
      easier to resolve because the pagevec drains happen in LRU lock.
      
      However there is still one place which makes a page evictable and does
      PageLRU check on that page without LRU lock and needs special attention.
      TestClearPageMlocked() and isolate_lru_page() in clear_page_mlock().
      
      	#0: __pagevec_lru_add_fn	#1: clear_page_mlock
      
      	SetPageLRU()			if (!TestClearPageMlocked())
      					  return
      	smp_mb() // <--required
      					// inside does PageLRU
      	if (!PageMlocked())		if (isolate_lru_page())
      	  move to evictable LRU		  putback_lru_page()
      	else
      	  move to unevictable LRU
      
      In '#1', TestClearPageMlocked() provides full memory barrier semantics
      and thus the PageLRU check (inside isolate_lru_page) can not be
      reordered before it.
      
      In '#0', without explicit memory barrier, the PageMlocked() check can be
      reordered before SetPageLRU().  If that happens, '#0' can put a page in
      unevictable LRU and '#1' might have just cleared the Mlocked bit of that
      page but fails to isolate as PageLRU fails as '#0' still hasn't set
      PageLRU bit of that page.  That page will be stranded on the unevictable
      LRU.
      
      There is one (good) side effect though.  Without this patch, the pages
      allocated for System V shared memory segment are added to evictable LRUs
      even after shmctl(SHM_LOCK) on that segment.  This patch will correctly
      put such pages to unevictable LRU.
      
      Link: http://lkml.kernel.org/r/20171121211241.18877-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c4e6b1a
  23. 07 2月, 2018 1 次提交
  24. 01 2月, 2018 2 次提交
  25. 16 11月, 2017 8 次提交
  26. 04 10月, 2017 1 次提交