1. 17 Oct, 2020 3 commits
  2. 14 Oct, 2020 5 commits
  3. 29 Sep, 2020 1 commit
    • io_uring: fix async buffered reads when readahead is disabled · c8d317aa
      Hao Xu committed
      The async buffered reads feature does not work when readahead is
      turned off. There are two problems:
      
      - when retrying in io_read, not only the IOCB_WAITQ flag but also
        the IOCB_NOWAIT flag is still set, which makes it go to the would_block
        path in generic_file_buffered_read() and return -EAGAIN. After
        that, the work is queued to an io-wq thread, which later does the
        read the old way.
      
      - even if we remove IOCB_NOWAIT when retrying, the feature still
        does not work properly, since generic_file_buffered_read() goes to
        lock_page_killable() after calling mapping->a_ops->readpage() to do
        the IO, which causes the process to sleep.
      
      Fixes: 1a0a7853 ("mm: support async buffered reads in generic_file_buffered_read()")
      Fixes: 3b2a4439 ("io_uring: get rid of kiocb_wait_page_queue_init()")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 25 Sep, 2020 1 commit
  5. 21 Sep, 2020 1 commit
  6. 18 Sep, 2020 1 commit
    • mm: allow a controlled amount of unfairness in the page lock · 5ef64cc8
      Linus Torvalds committed
      Commit 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic") made
      the page locking entirely fair, in that if a waiter came in while the
      lock was held, the lock would be transferred to the lockers strictly in
      order.
      
      That was intended to finally get rid of the long-reported watchdog
      failures that involved the page lock under extreme load, where a process
      could end up waiting essentially forever, as other page lockers stole
      the lock from under it.
      
      It also improved some benchmarks, but it ended up causing huge
      performance regressions on others, simply because fair lock behavior
      doesn't end up giving out the lock as aggressively, causing better
      worst-case latency, but potentially much worse average latencies and
      throughput.
      
      Instead of reverting that change entirely, this introduces a controlled
      amount of unfairness, with a sysctl knob to tune it if somebody needs
      to.  But the default value should hopefully be good for any normal load,
      allowing a few rounds of lock stealing, but enforcing the strict
      ordering before the lock has been stolen too many times.
      
      There is also a hint from Matthieu Baerts that the fair page coloring
      may end up exposing an ABBA deadlock that is hidden by the usual
      optimistic lock stealing, and while the unfairness doesn't fix the
      fundamental issue (and I'm still looking at that), it avoids it in
      practice.
      
      The amount of unfairness can be modified by writing a new value to the
      'sysctl_page_lock_unfairness' variable (default value of 5, exposed
      through /proc/sys/vm/page_lock_unfairness), but that is hopefully
      something we'd use mainly for debugging rather than being necessary for
      any deep system tuning.
      
      This whole issue has exposed just how critical the page lock can be, and
      how contended it gets under certain locks.  And the main contention
      doesn't really seem to be anything related to IO (which was the origin
      of this lock), but for things like just verifying that the page file
      mapping is stable while faulting in the page into a page table.
      
      Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
      Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
      Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
      Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
      Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
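The bounded lock stealing described in this commit can be pictured with a toy model. This is only a sketch under stated assumptions, not the kernel implementation: the function name `releases_until_waiter_wins` and the list-of-booleans input are hypothetical, and only the default budget of 5 comes from the commit message.

```python
def releases_until_waiter_wins(races, unfairness=5):
    """Toy model of a queued waiter competing for a lock.

    races[i] is True when a newcomer races for the lock at the i-th
    release. The newcomer may steal the lock while the waiter's budget
    of tolerated steals (unfairness) is not used up; after that the
    lock is handed to the waiter in strict order.
    Returns the index of the release at which the waiter gets the lock,
    or None if it is still waiting when the list ends.
    """
    stolen = 0
    for i, newcomer_races in enumerate(races):
        if newcomer_races and stolen < unfairness:
            stolen += 1  # the newcomer steals the lock this round
            continue
        return i  # no race, or budget exhausted: the waiter wins
    return None
```

With a budget of 0 the model is strictly fair, and a larger budget trades worst-case waiting time for more aggressive lock handout, which is the trade-off the commit message describes.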
  7. 15 Aug, 2020 2 commits
  8. 13 Aug, 2020 1 commit
  9. 08 Aug, 2020 2 commits
  10. 03 Aug, 2020 2 commits
  11. 08 Jul, 2020 1 commit
  12. 22 Jun, 2020 4 commits
  13. 10 Jun, 2020 3 commits
  14. 05 Jun, 2020 1 commit
  15. 04 Jun, 2020 6 commits
    • mm: memcontrol: delete unused lrucare handling · d9eb1ea2
      Johannes Weiner committed
      Swapin faults were the last event to charge pages after they had already
      been put on the LRU list.  Now that we charge directly on swapin, the
      lrucare portion of the charge code is unused.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: convert anon and file-thp to new mem_cgroup_charge() API · 9d82c694
      Johannes Weiner committed
      With the page->mapping requirement gone from memcg, we can charge anon and
      file-thp pages in one single step, right after they're allocated.
      
      This removes two out of three API calls - especially the tricky commit
      step that needed to happen at just the right time between when the page is
      "set up" and when it's "published" - somewhat vague and fluid concepts
      that varied by page type.  All we need is a freshly allocated page and a
      memcg context to charge.
      
      v2: prevent double charges on pre-allocated hugepages in khugepaged
      
      [hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
        Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: switch to native NR_FILE_PAGES and NR_SHMEM counters · 0d1c2072
      Johannes Weiner committed
      Memcg maintains private MEMCG_CACHE and NR_SHMEM counters.  This
      divergence from the generic VM accounting means unnecessary code overhead,
      and creates a dependency for memcg that page->mapping is set up at the
      time of charging, so that page types can be told apart.
      
      Convert the generic accounting sites to mod_lruvec_page_state and friends
      to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM.
      The page is already locked in these places, so page->mem_cgroup is stable;
      we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure
      it's set up in time.
      
      Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private
      NR_SHMEM accounting sites.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-10-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: convert page cache to a new mem_cgroup_charge() API · 3fea5a49
      Johannes Weiner committed
      The try/commit/cancel protocol that memcg uses dates back to when pages
      used to be uncharged upon removal from the page cache, and thus couldn't
      be committed before the insertion had succeeded.  Nowadays, pages are
      uncharged when they are physically freed; it doesn't matter whether the
      insertion was successful or not.  For the page cache, the transaction
      dance has become unnecessary.
      
      Introduce a mem_cgroup_charge() function that simply charges a newly
      allocated page to a cgroup and sets up page->mem_cgroup in one single
      step.  If the insertion fails, the caller doesn't have to do anything but
      free/put the page.
      
      Then switch the page cache over to this new API.
      
      Subsequent patches will also convert anon pages, but it needs a bit more
      prep work.  Right now, memcg depends on page->mapping being already set up
      at the time of charging, so that it can maintain its own MEMCG_CACHE and
      MEMCG_RSS counters.  For anon, page->mapping is set under the same pte
      lock under which the page is published, so a single charge point that can
      block doesn't work there just yet.
      
      The following prep patches will replace the private memcg counters with
      the generic vmstat counters, thus removing the page->mapping dependency,
      then complete the transition to the new single-point charge API and delete
      the old transactional scheme.
      
      v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
      v3: rebase on preceding shmem simplification patch
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
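The single-step charge described in this commit can be illustrated with a small toy model. It is a sketch under stated assumptions, not the kernel code: the `Page` and `Memcg` classes, the dict-based cache and the one-unit-per-page accounting are all hypothetical stand-ins for the real structures.

```python
class Page:
    def __init__(self):
        self.mem_cgroup = None  # owning cgroup, set at charge time


class Memcg:
    def __init__(self, limit):
        self.limit = limit
        self.usage = 0

    def charge(self, page):
        """Charge the page and set page.mem_cgroup in one single step."""
        if self.usage + 1 > self.limit:
            return False
        self.usage += 1
        page.mem_cgroup = self
        return True


def free_page(page):
    """Uncharging happens when the page is physically freed."""
    if page.mem_cgroup is not None:
        page.mem_cgroup.usage -= 1
        page.mem_cgroup = None


def add_to_page_cache(cache, index, page, memcg):
    """Charge first, then insert; no commit or cancel step is needed."""
    if not memcg.charge(page):
        return False
    if index in cache:
        return False  # insertion failed: the caller only frees the page
    cache[index] = page
    return True
```

When the insertion fails, the page stays charged until `free_page()` runs, so the caller has nothing to undo beyond dropping the page.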
    • mm: memcontrol: drop @compound parameter from memcg charging API · 3fba69a5
      Johannes Weiner committed
      The memcg charging API carries a boolean @compound parameter that tells
      whether the page we're dealing with is a hugepage.
      mem_cgroup_commit_charge() has another boolean @lrucare that indicates
      whether the page needs LRU locking or not while charging.  The majority of
      callsites know those parameters at compile time, which results in a lot of
      naked "false, false" argument lists.  This makes for cryptic code and is a
      breeding ground for subtle mistakes.
      
      Thankfully, the huge page state can be inferred from the page itself and
      doesn't need to be passed along.  This is safe because charging completes
      before the page is published and somebody may split it.
      
      Simplify the callsites by removing @compound, and let memcg infer the
      state by using hpage_nr_pages() unconditionally.  That function does
      PageTransHuge() to identify huge pages, which also helpfully asserts that
      nobody passes in tail pages by accident.
      
      The following patches will introduce a new charging API, so it is best not
      to carry over unnecessary weight.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix NUMA node file count error in replace_page_cache() · f4129ea3
      Johannes Weiner committed
      Patch series "mm: memcontrol: charge swapin pages on instantiation", v2.
      
      This patch series reworks memcg to charge swapin pages directly at
      swapin time, rather than at fault time, which may be much later, or
      not happen at all.
      
      Changes in version 2:
      - prevent double charges on pre-allocated hugepages in khugepaged
      - leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
      - fix temporary accounting bug by switching rmap<->commit (Joonsoo)
      - fix double swap charge bug in cgroup1/cgroup2 code gating
      - simplify swapin error checking (Joonsoo)
      - mm: memcontrol: document the new swap control behavior (Alex)
      - review tags
      
      The delayed swapin charging scheme we have right now causes problems:
      
      - Alex's per-cgroup lru_lock patches rely on pages that have been
        isolated from the LRU to have a stable page->mem_cgroup; otherwise
        the lock may change underneath him. Swapcache pages are charged only
        after they are added to the LRU, and charging doesn't follow the LRU
        isolation protocol.
      
      - Joonsoo's anon workingset patches need a suitable LRU at the time
        the page enters the swap cache and displaces the non-resident
        info. But the correct LRU is only available after charging.
      
      - It's a containment hole / DoS vector. Users can trigger arbitrarily
        large swap readahead using MADV_WILLNEED. The memory is never
        charged unless somebody actually touches it.
      
      - It complicates the page->mem_cgroup stabilization rules
      
      In order to charge pages directly at swapin time, the memcg code base
      needs to be prepared, and several overdue cleanups become a necessity:
      
      To charge pages at swapin time, we need to always have cgroup
      ownership tracking of swap records. We also cannot rely on
      page->mapping to tell apart page types at charge time, because that's
      only set up during a page fault.
      
      To eliminate the page->mapping dependency, memcg needs to ditch its
      private page type counters (MEMCG_CACHE, MEMCG_RSS, NR_SHMEM) in favor
      of the generic vmstat counters and accounting sites, such as
      NR_FILE_PAGES, NR_ANON_MAPPED etc.
      
      To switch to generic vmstat counters, the charge sequence must be
      adjusted such that page->mem_cgroup is set up by the time these
      counters are modified.
      
      The series is structured as follows:
      
      1. Bug fixes
      2. Decoupling charging from rmap
      3. Swap controller integration into memcg
      4. Direct swapin charging
      
      This patch (of 19):
      
      When replacing one page with another one in the cache, we have to decrease
      the file count of the old page's NUMA node and increase the one of the new
      NUMA node, otherwise the old node leaks the count and the new node
      eventually underflows its counter.
      
      Fixes: 74d60958 ("page cache: Add and replace pages using the XArray")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-1-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20200508183105.225460-2-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
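The per-node accounting error fixed by this patch can be reproduced in a toy model. This is a sketch under stated assumptions: the `Counter` keyed by node id and the helper names `replace_page` and `remove_page` are hypothetical and stand in for the kernel's per-node file page counters.

```python
from collections import Counter


def replace_page(nr_file_pages, old_node, new_node):
    """Move one file page of accounting from the old node to the new one."""
    nr_file_pages[old_node] -= 1
    nr_file_pages[new_node] += 1


def remove_page(nr_file_pages, node):
    """Drop a page from the cache and from its node's file count."""
    nr_file_pages[node] -= 1


# Correct accounting: the page starts on node 0 and is replaced by one on node 1.
fixed = Counter({0: 1})
replace_page(fixed, 0, 1)
remove_page(fixed, 1)

# Accounting without the adjustment: node 0 leaks, node 1 goes negative.
buggy = Counter({0: 1})
remove_page(buggy, 1)
```

Skipping the adjustment leaves the old node with a stale count and drives the new node below zero once the replacement page is removed, which matches the leak and underflow the commit message describes.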
  16. 03 Jun, 2020 1 commit
  17. 25 May, 2020 1 commit
  18. 08 Apr, 2020 1 commit
    • mm: huge tmpfs: try to split_huge_page() when punching hole · 71725ed1
      Hugh Dickins committed
      Yang Shi writes:
      
      Currently, when truncating a shmem file, if the range is partly in a THP
      (start or end is in the middle of THP), the pages actually will just get
      cleared rather than being freed, unless the range covers the whole THP.
      Even though all the subpages are truncated (randomly or sequentially), the
      THP may still be kept in page cache.
      
      This might be fine for some usecases which prefer preserving THP, but
      balloon inflation is handled in base page size.  So when using shmem THP
      as memory backend, QEMU inflation actually doesn't work as expected since
      it doesn't free memory.  But the inflation usecase really needs to get the
      memory freed.  (Anonymous THP will also not get freed right away, but will
      be freed eventually when all subpages are unmapped: whereas shmem THP
      still stays in page cache.)
      
      Split THP right away when doing partial hole punch, and if split fails
      just clear the page so that read of the punched area will return zeroes.
      
      Hugh Dickins adds:
      
      Our earlier "team of pages" huge tmpfs implementation worked in the way
      that Yang Shi proposes; and we have been using this patch to continue to
      split the huge page when hole-punched or truncated, since converting over
      to the compound page implementation.  Although huge tmpfs gives out huge
      pages when available, if the user specifically asks to truncate or punch a
      hole (perhaps to free memory, perhaps to reduce the memcg charge), then
      the filesystem should do so as best it can, splitting the huge page.
      
      That is not always possible: any additional reference to the huge page
      prevents split_huge_page() from succeeding, so the result can be flaky.
      But in practice it works successfully enough that we've not seen any
      problem from that.
      
      Add shmem_punch_compound() to encapsulate the decision of when a split is
      needed, and doing the split if so.  Using this simplifies the flow in
      shmem_undo_range(); and the first (trylock) pass does not need to do any
      page clearing on failure, because the second pass will either succeed or
      do that clearing.  Following the example of zero_user_segment() when
      clearing a partial page, add flush_dcache_page() and set_page_dirty() when
      clearing a hole - though I'm not certain that either is needed.
      
      But: split_huge_page() would be sure to fail if shmem_undo_range()'s
      pagevec holds further references to the huge page.  The easiest way to fix
      that is for find_get_entries() to return early, as soon as it has put one
      compound head or tail into the pagevec.  At first this felt like a hack;
      but on examination, this convention better suits all its callers - or will
      do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
      and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
      speedup by checking for compound pages there.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
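The decision this commit describes for a hole punch that touches a huge page can be sketched as a small decision function. This is a toy model under stated assumptions, not the logic of shmem_punch_compound() itself: the function name `punch_compound`, the `split_ok` flag standing in for the outcome of split_huge_page(), and the constant of 512 subpages per huge page are all illustrative.

```python
HPAGE_NR = 512  # assumed number of base pages per huge page


def punch_compound(head, start, end, split_ok):
    """Decide what happens to a huge page hit by a hole [start, end).

    head is the page index of the huge page's first subpage.
    Returns (action, affected subpage indices as a range).
    """
    first, last = head, head + HPAGE_NR
    lo, hi = max(start, first), min(end, last)
    if lo >= hi:
        return ("untouched", range(0))
    if lo == first and hi == last:
        return ("truncate", range(lo, hi))  # hole covers the whole huge page
    if split_ok:
        return ("split_and_free", range(lo, hi))  # split, then free the subpages
    return ("zero", range(lo, hi))  # split failed: clear so reads return zeroes
```

For example, `punch_compound(0, 10, 20, split_ok=False)` reports that subpages 10 through 19 are only zeroed, which is the fallback the commit message mentions when an extra reference prevents the split.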
  19. 03 Apr, 2020 3 commits