  1. 18 October 2020 · 1 commit
    • mm: mark async iocb read as NOWAIT once some data has been copied · 13bd6914
      Committed by Jens Axboe
      Once we've copied some data for an iocb that is marked with IOCB_WAITQ,
      we should no longer attempt to async lock a new page. Instead, return
      the copied amount and let the caller retry, rather than returning
      -EIOCBQUEUED for a new page.
      
      This should only be possible with read-ahead disabled on the below
      device, and multiple threads racing on the same file. Haven't been able
      to reproduce on anything else.
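
      The gist of the fix, as a rough sketch (simplified from the
      generic_file_buffered_read() loop, not the literal patch): once some
      bytes have been copied for an IOCB_WAITQ iocb, flip it to NOWAIT so
      that the next page that would need locking makes us return the partial
      count instead of queueing async work.

        /* 'written' is the number of bytes already copied to userspace. */
        if (written && (iocb->ki_flags & IOCB_WAITQ)) {
                /*
                 * Don't try to async-lock another page; bail out with what
                 * we have and let the caller retry for the rest.
                 */
                iocb->ki_flags |= IOCB_NOWAIT;
        }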
      
      Cc: stable@vger.kernel.org # v5.9
      Fixes: 1a0a7853 ("mm: support async buffered reads in generic_file_buffered_read()")
      Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 17 October 2020 · 4 commits
  3. 16 October 2020 · 1 commit
  4. 15 October 2020 · 1 commit
    • vfs: move generic_remap_checks out of mm · 02e83f46
      Committed by Darrick J. Wong
      I would like to move all the generic helpers for the vfs remap range
      functionality (aka clonerange and dedupe) into a separate file so that
      they won't be scattered across the vfs and the mm subsystems.  The
      eventual goal is to be able to deselect remap_range.c if none of the
      filesystems need that code, but the tricky part here is picking a
      stable(ish) part of the merge window to rearrange code.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  5. 14 October 2020 · 5 commits
  6. 29 September 2020 · 1 commit
    • io_uring: fix async buffered reads when readahead is disabled · c8d317aa
      Committed by Hao Xu
      The async buffered reads feature does not work when readahead is
      turned off. There are two problems:
      
      - when doing a retry in io_read, not only the IOCB_WAITQ flag but also
        the IOCB_NOWAIT flag is still set, which makes the read take the
        would_block path in generic_file_buffered_read() and return -EAGAIN.
        After that, the io-wq thread work is queued, and the async read is
        later done the old way.
      
      - even if we remove IOCB_NOWAIT when doing the retry, the feature still
        does not work properly, since generic_file_buffered_read() goes to
        lock_page_killable() after calling mapping->a_ops->readpage() to do
        the IO, causing the process to sleep (see the sketch below).
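
      A rough sketch of the two adjustments described above (simplified and
      illustrative, not the literal patches): the io_uring retry path drops
      IOCB_NOWAIT while setting IOCB_WAITQ, and the post-readpage wait in
      generic_file_buffered_read() uses the async page-lock path instead of
      lock_page_killable() when IOCB_WAITQ is set.

        /* io_uring side (io_read retry path), roughly: */
        kiocb->ki_flags &= ~IOCB_NOWAIT;
        kiocb->ki_flags |= IOCB_WAITQ;

        /* mm side, after mapping->a_ops->readpage() has kicked off the IO: */
        if (iocb->ki_flags & IOCB_WAITQ)
                error = lock_page_async(page, iocb->ki_waitq);  /* queue a wake callback */
        else
                error = lock_page_killable(page);               /* may sleep */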
      
      Fixes: 1a0a7853 ("mm: support async buffered reads in generic_file_buffered_read()")
      Fixes: 3b2a4439 ("io_uring: get rid of kiocb_wait_page_queue_init()")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 25 September 2020 · 1 commit
  8. 21 September 2020 · 1 commit
  9. 18 September 2020 · 1 commit
    • mm: allow a controlled amount of unfairness in the page lock · 5ef64cc8
      Committed by Linus Torvalds
      Commit 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic") made
      the page locking entirely fair, in that if a waiter came in while the
      lock was held, the lock would be transferred to the lockers strictly in
      order.
      
      That was intended to finally get rid of the long-reported watchdog
      failures that involved the page lock under extreme load, where a process
      could end up waiting essentially forever, as other page lockers stole
      the lock from under it.
      
      It also improved some benchmarks, but it ended up causing huge
      performance regressions on others, simply because fair lock behavior
      doesn't end up giving out the lock as aggressively, causing better
      worst-case latency, but potentially much worse average latencies and
      throughput.
      
      Instead of reverting that change entirely, this introduces a controlled
      amount of unfairness, with a sysctl knob to tune it if somebody needs
      to.  But the default value should hopefully be good for any normal load,
      allowing a few rounds of lock stealing, but enforcing the strict
      ordering before the lock has been stolen too many times.
      
      There is also a hint from Matthieu Baerts that the fair page coloring
      may end up exposing an ABBA deadlock that is hidden by the usual
      optimistic lock stealing, and while the unfairness doesn't fix the
      fundamental issue (and I'm still looking at that), it avoids it in
      practice.
      
      The amount of unfairness can be modified by writing a new value to the
      'sysctl_page_lock_unfairness' variable (default value of 5, exposed
      through /proc/sys/vm/page_lock_unfairness), but that is hopefully
      something we'd use mainly for debugging rather than being necessary for
      any deep system tuning.
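
      For reference, the knob behaves like any other sysctl and can be read
      or written from userspace; a minimal, runnable example (assuming a
      kernel that exposes /proc/sys/vm/page_lock_unfairness):

        #include <stdio.h>
        #include <stdlib.h>

        /* Print the current page-lock unfairness value (default 5). */
        int main(void)
        {
                FILE *f = fopen("/proc/sys/vm/page_lock_unfairness", "r");
                int val;

                if (!f || fscanf(f, "%d", &val) != 1) {
                        fprintf(stderr, "cannot read page_lock_unfairness\n");
                        return EXIT_FAILURE;
                }
                fclose(f);
                printf("page lock may be stolen up to %d times before strict handoff\n", val);
                return 0;
        }

      Writing works the same way, e.g. 'sysctl -w vm.page_lock_unfairness=5'.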
      
      This whole issue has exposed just how critical the page lock can be, and
      how contended it gets under certain loads.  And the main contention
      doesn't really seem to be anything related to IO (which was the origin
      of this lock), but for things like just verifying that the page file
      mapping is stable while faulting the page into a page table.
      
      Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
      Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
      Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
      Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
      Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 29 August 2020 · 1 commit
  11. 15 August 2020 · 2 commits
  12. 13 August 2020 · 1 commit
  13. 08 August 2020 · 2 commits
  14. 03 August 2020 · 2 commits
  15. 08 July 2020 · 1 commit
  16. 22 June 2020 · 4 commits
  17. 10 June 2020 · 3 commits
  18. 05 June 2020 · 1 commit
  19. 04 June 2020 · 6 commits
    • mm: memcontrol: delete unused lrucare handling · d9eb1ea2
      Committed by Johannes Weiner
      Swapin faults were the last event to charge pages after they had already
      been put on the LRU list.  Now that we charge directly on swapin, the
      lrucare portion of the charge code is unused.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: convert anon and file-thp to new mem_cgroup_charge() API · 9d82c694
      Committed by Johannes Weiner
      With the page->mapping requirement gone from memcg, we can charge anon and
      file-thp pages in one single step, right after they're allocated.
      
      This removes two out of three API calls - especially the tricky commit
      step that needed to happen at just the right time between when the page is
      "set up" and when it's "published" - somewhat vague and fluid concepts
      that varied by page type.  All we need is a freshly allocated page and a
      memcg context to charge.
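
      A sketch of the resulting pattern for a freshly allocated anonymous
      page (simplified from the anon fault path, assuming the final
      three-argument form of mem_cgroup_charge() that this series ends up
      with):

        page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
        if (!page)
                return VM_FAULT_OOM;
        /* Charge right after allocation; no separate commit step anymore. */
        if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL)) {
                put_page(page);
                return VM_FAULT_OOM;
        }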
      
      v2: prevent double charges on pre-allocated hugepages in khugepaged
      
      [hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
        Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: switch to native NR_FILE_PAGES and NR_SHMEM counters · 0d1c2072
      Committed by Johannes Weiner
      Memcg maintains private MEMCG_CACHE and NR_SHMEM counters.  This
      divergence from the generic VM accounting means unnecessary code overhead,
      and creates a dependency for memcg that page->mapping is set up at the
      time of charging, so that page types can be told apart.
      
      Convert the generic accounting sites to mod_lruvec_page_state and friends
      to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM.
      The page is already locked in these places, so page->mem_cgroup is stable;
      we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure
      it's set up in time.
      
      Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private
      NR_SHMEM accounting sites.
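
      As an illustration of what a converted accounting site can look like
      (a sketch, not the exact diff), the page cache insertion path bumps
      the generic per-lruvec counters directly once page->mem_cgroup is set
      up; the NR_SHMEM half shown here is an assumption about where shmem
      pages get accounted:

        /* With the page locked and already charged to its memcg: */
        if (!PageHuge(page))
                __inc_lruvec_page_state(page, NR_FILE_PAGES);
        if (PageSwapBacked(page))
                __inc_lruvec_page_state(page, NR_SHMEM);
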
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-10-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: convert page cache to a new mem_cgroup_charge() API · 3fea5a49
      Committed by Johannes Weiner
      The try/commit/cancel protocol that memcg uses dates back to when pages
      used to be uncharged upon removal from the page cache, and thus couldn't
      be committed before the insertion had succeeded.  Nowadays, pages are
      uncharged when they are physically freed; it doesn't matter whether the
      insertion was successful or not.  For the page cache, the transaction
      dance has become unnecessary.
      
      Introduce a mem_cgroup_charge() function that simply charges a newly
      allocated page to a cgroup and sets up page->mem_cgroup in one single
      step.  If the insertion fails, the caller doesn't have to do anything but
      free/put the page.
      
      Then switch the page cache over to this new API.
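
      To illustrate the difference (a sketch based on the description above,
      not the exact kernel code; the new call is shown in the three-argument
      form it ends up with after the lrucare argument is dropped later in
      the series), a page cache insertion used to juggle the
      try/commit/cancel triple and can now charge in one step:

        /* Old transactional scheme, roughly: */
        error = mem_cgroup_try_charge(page, mm, gfp, &memcg, false);
        if (error)
                goto out;
        error = add_to_page_cache_locked(page, mapping, index, gfp);
        if (error) {
                mem_cgroup_cancel_charge(page, memcg, false);
                goto out;
        }
        mem_cgroup_commit_charge(page, memcg, false, false);

        /* New API: charge the freshly allocated page in a single step. */
        error = mem_cgroup_charge(page, mm, gfp);
        if (error)
                goto out;       /* just free/put the page, nothing to unwind */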
      
      Subsequent patches will also convert anon pages, but it needs a bit more
      prep work.  Right now, memcg depends on page->mapping being already set up
      at the time of charging, so that it can maintain its own MEMCG_CACHE and
      MEMCG_RSS counters.  For anon, page->mapping is set under the same pte
      lock under which the page is published, so a single charge point that can
      block doesn't work there just yet.
      
      The following prep patches will replace the private memcg counters with
      the generic vmstat counters, thus removing the page->mapping dependency,
      then complete the transition to the new single-point charge API and delete
      the old transactional scheme.
      
      v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
      v3: rebase on preceding shmem simplification patch
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: drop @compound parameter from memcg charging API · 3fba69a5
      Committed by Johannes Weiner
      The memcg charging API carries a boolean @compound parameter that tells
      whether the page we're dealing with is a hugepage.
      mem_cgroup_commit_charge() has another boolean @lrucare that indicates
      whether the page needs LRU locking or not while charging.  The majority of
      callsites know those parameters at compile time, which results in a lot of
      naked "false, false" argument lists.  This makes for cryptic code and is a
      breeding ground for subtle mistakes.
      
      Thankfully, the huge page state can be inferred from the page itself and
      doesn't need to be passed along.  This is safe because charging completes
      before the page is published, i.e. before anybody could split it.
      
      Simplify the callsites by removing @compound, and let memcg infer the
      state by using hpage_nr_pages() unconditionally.  That function does
      PageTransHuge() to identify huge pages, which also helpfully asserts that
      nobody passes in tail pages by accident.
      
      The following patches will introduce a new charging API, best not to carry
      over unnecessary weight.
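
      A minimal sketch of the inference described above (illustrative, not
      the literal diff): instead of threading a boolean through every call,
      the charge path derives the page count from the page itself.

        /* Before: call sites spelled the page type out by hand. */
        mem_cgroup_try_charge(page, mm, gfp_mask, &memcg, PageTransHuge(page));

        /* After: the charge path infers the size from the page itself;
         * hpage_nr_pages() returns 1, or HPAGE_PMD_NR for a huge page,
         * and complains if a tail page is passed in by accident. */
        unsigned int nr_pages = hpage_nr_pages(page);
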
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix NUMA node file count error in replace_page_cache() · f4129ea3
      Committed by Johannes Weiner
      Patch series "mm: memcontrol: charge swapin pages on instantiation", v2.
      
      This patch series reworks memcg to charge swapin pages directly at
      swapin time, rather than at fault time, which may be much later, or
      not happen at all.
      
      Changes in version 2:
      - prevent double charges on pre-allocated hugepages in khugepaged
      - leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
      - fix temporary accounting bug by switching rmap<->commit (Joonsoo)
      - fix double swap charge bug in cgroup1/cgroup2 code gating
      - simplify swapin error checking (Joonsoo)
      - mm: memcontrol: document the new swap control behavior (Alex)
      - review tags
      
      The delayed swapin charging scheme we have right now causes problems:
      
      - Alex's per-cgroup lru_lock patches rely on pages that have been
        isolated from the LRU to have a stable page->mem_cgroup; otherwise
        the lock may change underneath him. Swapcache pages are charged only
        after they are added to the LRU, and charging doesn't follow the LRU
        isolation protocol.
      
      - Joonsoo's anon workingset patches need a suitable LRU at the time
        the page enters the swap cache and displaces the non-resident
        info. But the correct LRU is only available after charging.
      
      - It's a containment hole / DoS vector. Users can trigger arbitrarily
        large swap readahead using MADV_WILLNEED. The memory is never
        charged unless somebody actually touches it.
      
      - It complicates the page->mem_cgroup stabilization rules
      
      In order to charge pages directly at swapin time, the memcg code base
      needs to be prepared, and several overdue cleanups become a necessity:
      
      To charge pages at swapin time, we need to always have cgroup
      ownership tracking of swap records. We also cannot rely on
      page->mapping to tell apart page types at charge time, because that's
      only set up during a page fault.
      
      To eliminate the page->mapping dependency, memcg needs to ditch its
      private page type counters (MEMCG_CACHE, MEMCG_RSS, NR_SHMEM) in favor
      of the generic vmstat counters and accounting sites, such as
      NR_FILE_PAGES, NR_ANON_MAPPED etc.
      
      To switch to generic vmstat counters, the charge sequence must be
      adjusted such that page->mem_cgroup is set up by the time these
      counters are modified.
      
      The series is structured as follows:
      
      1. Bug fixes
      2. Decoupling charging from rmap
      3. Swap controller integration into memcg
      4. Direct swapin charging
      
      This patch (of 19):
      
      When replacing one page with another one in the cache, we have to decrease
      the file count of the old page's NUMA node and increase the one of the new
      NUMA node, otherwise the old node leaks the count and the new node
      eventually underflows its counter.
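
      Conceptually (a sketch of the corrected accounting, not the literal
      hunk in replace_page_cache_page()), the counter has to be adjusted
      against each page's own node rather than only against the new one:

        /* Decrement against the node the old page actually lives on ... */
        if (!PageHuge(old))
                __dec_node_page_state(old, NR_FILE_PAGES);
        /* ... and increment against the node of its replacement. */
        if (!PageHuge(new))
                __inc_node_page_state(new, NR_FILE_PAGES);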
      
      Fixes: 74d60958 ("page cache: Add and replace pages using the XArray")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Balbir Singh <bsingharora@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-1-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20200508183105.225460-2-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 03 June 2020 · 1 commit