1. 25 Feb, 2021 (1 commit)
  2. 11 Feb, 2021 (1 commit)
  3. 06 Feb, 2021 (1 commit)
    • mm/filemap: add missing mem_cgroup_uncharge() to __add_to_page_cache_locked() · da74240e
      Committed by Waiman Long
      Commit 3fea5a49 ("mm: memcontrol: convert page cache to a new
      mem_cgroup_charge() API") introduced a bug in __add_to_page_cache_locked()
      causing the following splat:
      
        page dumped because: VM_BUG_ON_PAGE(page_memcg(page))
        pages's memcg:ffff8889a4116000
        ------------[ cut here ]------------
        kernel BUG at mm/memcontrol.c:2924!
        invalid opcode: 0000 [#1] SMP KASAN PTI
        CPU: 35 PID: 12345 Comm: cat Tainted: G S      W I       5.11.0-rc4-debug+ #1
        Hardware name: HP HP Z8 G4 Workstation/81C7, BIOS P60 v01.25 12/06/2017
        RIP: commit_charge+0xf4/0x130
        Call Trace:
          mem_cgroup_charge+0x175/0x770
          __add_to_page_cache_locked+0x712/0xad0
          add_to_page_cache_lru+0xc5/0x1f0
          cachefiles_read_or_alloc_pages+0x895/0x2e10 [cachefiles]
          __fscache_read_or_alloc_pages+0x6c0/0xa00 [fscache]
          __nfs_readpages_from_fscache+0x16d/0x630 [nfs]
          nfs_readpages+0x24e/0x540 [nfs]
          read_pages+0x5b1/0xc40
          page_cache_ra_unbounded+0x460/0x750
          generic_file_buffered_read_get_pages+0x290/0x1710
          generic_file_buffered_read+0x2a9/0xc30
          nfs_file_read+0x13f/0x230 [nfs]
          new_sync_read+0x3af/0x610
          vfs_read+0x339/0x4b0
          ksys_read+0xf1/0x1c0
          do_syscall_64+0x33/0x40
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Before that commit, __add_to_page_cache_locked() used separate
      try_charge() and commit_charge() calls.  Those two charge functions were
      replaced by a single mem_cgroup_charge(), but no matching
      mem_cgroup_uncharge() was added for the case where the xarray insertion
      fails and the page is released back to the pool.

      Fix this by adding a mem_cgroup_uncharge() call on the insertion error
      path.
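
      A minimal sketch of the error path in question (editor's illustration of
      the change described above, not the exact upstream diff; the xarray
      insertion is reduced to a hypothetical placeholder helper):

        static int __add_to_page_cache_locked(struct page *page,
                                              struct address_space *mapping,
                                              pgoff_t offset, gfp_t gfp,
                                              void **shadowp)
        {
                bool huge = PageHuge(page);
                int error;

                if (!huge) {
                        error = mem_cgroup_charge(page, current->mm, gfp);
                        if (error)
                                return error;
                }

                error = insert_into_xarray(mapping, page, offset, gfp);  /* placeholder */
                if (error)
                        goto error;
                return 0;

        error:
                /*
                 * The fix: drop the charge taken by mem_cgroup_charge() above
                 * before releasing the page, otherwise the stale memcg binding
                 * later trips VM_BUG_ON_PAGE(page_memcg(page)).
                 */
                if (!huge)
                        mem_cgroup_uncharge(page);
                page->mapping = NULL;
                put_page(page);
                return error;
        }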
      
      Link: https://lkml.kernel.org/r/20210125042441.20030-1-longman@redhat.com
      Fixes: 3fea5a49 ("mm: memcontrol: convert page cache to a new mem_cgroup_charge() API")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <smuchun@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da74240e
  4. 21 Jan, 2021 (1 commit)
    • mm: Pass 'address' to map to do_set_pte() and drop FAULT_FLAG_PREFAULT · 9d3af4b4
      Committed by Will Deacon
      Rather than modifying the 'address' field of the 'struct vm_fault'
      passed to do_set_pte(), leave that to identify the real faulting address
      and pass in the virtual address to be mapped by the new pte as a
      separate argument.
      
      This makes FAULT_FLAG_PREFAULT redundant, as a prefault entry can be
      identified simply by comparing the new address parameter with the
      faulting address, so remove the redundant flag at the same time.
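
      A condensed sketch of the resulting interface (editor's illustration based
      on the description above; rmap, LRU and dirty-tracking details are elided):

        void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
        {
                struct vm_area_struct *vma = vmf->vma;
                bool write = vmf->flags & FAULT_FLAG_WRITE;
                /*
                 * vmf->address keeps identifying the real faulting address;
                 * 'addr' is the address being mapped, so a prefault is simply
                 * a mismatch between the two (no FAULT_FLAG_PREFAULT needed).
                 */
                bool prefault = vmf->address != addr;
                pte_t entry;

                flush_icache_page(vma, page);
                entry = mk_pte(page, vma->vm_page_prot);

                if (prefault && arch_wants_old_prefaulted_pte())
                        entry = pte_mkold(entry);

                if (write)
                        entry = maybe_mkwrite(pte_mkdirty(entry), vma);

                set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
        }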
      
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Will Deacon <will@kernel.org>
      9d3af4b4
  5. 20 Jan, 2021 (2 commits)
    • mm: Allow architectures to request 'old' entries when prefaulting · 46bdb427
      Committed by Will Deacon
      Commit 5c0a85fa ("mm: make faultaround produce old ptes") changed
      the "faultaround" behaviour to initialise prefaulted PTEs as 'old',
      since this avoids vmscan wrongly assuming that they are hot, despite
      having never been explicitly accessed by userspace. The change has been
      shown to benefit numerous arm64 micro-architectures (with hardware
      access flag) running Android, where both application launch latency and
      direct reclaim time are significantly reduced (by 10%+ and ~80%
      respectively).
      
      Unfortunately, commit 315d09bf ("Revert "mm: make faultaround
      produce old ptes"") reverted the change due to it being identified as
      the cause of a ~6% regression in unixbench on x86. Experiments on a
      variety of recent arm64 micro-architectures indicate that unixbench is
      not affected by the original commit, which appears to yield a 0-1%
      performance improvement.
      
      Since one size does not fit all for the initial state of prefaulted
      PTEs, introduce arch_wants_old_prefaulted_pte(), which allows an
      architecture to opt-in to 'old' prefaulted PTEs at runtime based on
      whatever criteria it may have.
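
      A sketch of the opt-in hook (editor's illustration; the generic fallback
      sits behind an #ifndef so that an architecture such as arm64 can override
      it, for example based on hardware access flag support):

        #ifndef arch_wants_old_prefaulted_pte
        static inline bool arch_wants_old_prefaulted_pte(void)
        {
                /*
                 * Transitioning a PTE from 'old' to 'young' can be expensive
                 * on some architectures, even when the access was only a
                 * prefault, so default to 'young' prefaulted PTEs.
                 */
                return false;
        }
        #endif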
      
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Will Deacon <will@kernel.org>
      46bdb427
    • mm: Cleanup faultaround and finish_fault() codepaths · f9ce0be7
      Committed by Kirill A. Shutemov
      alloc_set_pte() has two users with different requirements: in the
      faultaround code it is called from atomic context and the PTE page table
      has to be preallocated, whereas finish_fault() can sleep and allocate the
      page table as needed.
      
      PTL locking rules are also strange, hard to follow and overkill for
      finish_fault().
      
      Let's untangle the mess.  alloc_set_pte() is gone now, and all locking is
      explicit.

      The price is some code duplication to handle huge pages in the
      faultaround path, but that should be fine given the overall improvement
      in readability.
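
      A simplified sketch of the reworked finish_fault() (editor's illustration;
      huge-page mapping and prealloc_pte reuse are elided) showing the explicit
      page table allocation and PTL locking:

        vm_fault_t finish_fault(struct vm_fault *vmf)
        {
                struct vm_area_struct *vma = vmf->vma;
                struct page *page;
                vm_fault_t ret;

                /* COW faults install the private copy, others the cached page. */
                if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
                        page = vmf->cow_page;
                else
                        page = vmf->page;

                /* Sleeping context: allocate the PTE page table if needed. */
                if (pmd_none(*vmf->pmd) && unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
                        return VM_FAULT_OOM;

                /* Locking is explicit: take the PTL, install, unlock. */
                vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                               vmf->address, &vmf->ptl);
                ret = 0;
                if (likely(pte_none(*vmf->pte)))
                        do_set_pte(vmf, page);  /* gains an 'addr' argument in 9d3af4b4 */
                else
                        ret = VM_FAULT_NOPAGE;

                update_mmu_tlb(vma, vmf->address, vmf->pte);
                pte_unmap_unlock(vmf->pte, vmf->ptl);
                return ret;
        }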
      
      Link: https://lore.kernel.org/r/20201229132819.najtavneutnf7ajp@box
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      [will: s/from from/from/ in comment; spotted by willy]
      Signed-off-by: Will Deacon <will@kernel.org>
      f9ce0be7
  6. 19 Dec, 2020 (1 commit)
  7. 16 Dec, 2020 (6 commits)
  8. 12 Dec, 2020 (1 commit)
  9. 07 Dec, 2020 (1 commit)
  10. 02 Dec, 2020 (1 commit)
  11. 25 Nov, 2020 (1 commit)
    • mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback) · 073861ed
      Committed by Hugh Dickins
      Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
      on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
      end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
      no longer an ext4 page at all.
      
      The problem is that PageWriteback is not accompanied by a page reference
      (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
      soon as TestClearPageWriteback has been done, that page could be removed
      from page cache, freed, and reused for something else by the time that
      wake_up_page() is reached.
      
      https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
      Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
      check; but I'm paranoid about even looking at an unreferenced struct page,
      lest its memory might itself have already been reused or hotremoved (and
      wake_up_page_bit() may modify that memory with its ClearPageWaiters()).
      
      Then on crashing a second time, realized there's a stronger reason against
      that approach.  If my testing just occasionally crashes on that check,
      when the page is reused for part of a compound page, wouldn't it be much
      more common for the page to get reused as an order-0 page before reaching
      wake_up_page()?  And on rare occasions, might that reused page already be
      marked PageWriteback by its new user, and already be waited upon?  What
      would that look like?
      
      It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
      in write_cache_pages() (though I have never seen that crash myself).
      
      Matthew Wilcox explaining this to himself:
       "page is allocated, added to page cache, dirtied, writeback starts,
      
        --- thread A ---
        filesystem calls end_page_writeback()
              test_clear_page_writeback()
        --- context switch to thread B ---
        truncate_inode_pages_range() finds the page, it doesn't have writeback set,
        we delete it from the page cache.  Page gets reallocated, dirtied, writeback
        starts again.  Then we call write_cache_pages(), see
        PageWriteback() set, call wait_on_page_writeback()
        --- context switch back to thread A ---
        wake_up_page(page, PG_writeback);
        ... thread B is woken, but because the wakeup was for the old use of
        the page, PageWriteback is still set.
      
        Devious"
      
      And prior to 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic")
      this would have been much less likely: before that, wake_page_function()'s
      non-exclusive case would stop walking and not wake if it found Writeback
      already set again; whereas now the non-exclusive case proceeds to wake.
      
      I have not thought of a fix that does not add a little overhead: the
      simplest fix is for end_page_writeback() to get_page() before calling
      test_clear_page_writeback(), then put_page() after wake_up_page().
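
      A sketch of that fix (editor's illustration; the PageReclaim/rotation
      handling at the top of end_page_writeback() is elided):

        void end_page_writeback(struct page *page)
        {
                /* ... PageReclaim / rotate_reclaimable_page() handling ... */

                /*
                 * Writeback does not hold a reference of its own, so once
                 * TestClearPageWriteback succeeds the page can be freed and
                 * reused.  Pin it across the wakeup so that wake_up_page()
                 * never touches somebody else's page.
                 */
                get_page(page);
                if (!test_clear_page_writeback(page))
                        BUG();

                smp_mb__after_atomic();
                wake_up_page(page, PG_writeback);
                put_page(page);
        }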
      
      Was there a chance of missed wakeups before, since a page freed before
      reaching wake_up_page() would have PageWaiters cleared?  I think not,
      because each waiter does hold a reference on the page.  This bug comes
      when the old use of the page, the one we do TestClearPageWriteback on,
      had *no* waiters, so no additional page reference beyond the page cache
      (and whoever racily freed it).  The reuse of the page has a waiter
      holding a reference, and its own PageWriteback set; but the belated
      wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).
      
      Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
      Reported-by: Qian Cai <cai@lca.pw>
      Fixes: 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org # v5.8+
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      073861ed
  12. 17 Nov, 2020 (1 commit)
    • mm: never attempt async page lock if we've transferred data already · 0abed7c6
      Committed by Jens Axboe
      We catch the case where we enter generic_file_buffered_read() with data
      already transferred, but we also need to be careful not to allow an async
      page lock if we're looping transferring data. If not, we could be
      returning -EIOCBQUEUED instead of the transferred amount, and it could
      result in double waitqueue additions as well.
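
      A sketch of the guarded locking step (editor's illustration, abbreviated
      from generic_file_buffered_read(); 'written' is the running count of
      bytes already copied to userspace):

        if (iocb->ki_flags & IOCB_WAITQ) {
                /*
                 * Never arm an async page lock once data has been
                 * transferred: return the partial count instead of
                 * -EIOCBQUEUED, and avoid a second waitqueue addition.
                 */
                if (written) {
                        put_page(page);
                        goto out;
                }
                error = lock_page_async(page, iocb->ki_waitq);
        } else {
                error = lock_page_killable(page);
        }
        if (unlikely(error))
                goto readpage_error;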
      
      Cc: stable@vger.kernel.org # v5.9
      Fixes: 1a0a7853 ("mm: support async buffered reads in generic_file_buffered_read()")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0abed7c6
  13. 18 Oct, 2020 (1 commit)
    • mm: mark async iocb read as NOWAIT once some data has been copied · 13bd6914
      Committed by Jens Axboe
      Once we've copied some data for an iocb that is marked with IOCB_WAITQ,
      we should no longer attempt to async lock a new page.  Make sure we
      return the copied amount and let the caller retry, rather than returning
      -EIOCBQUEUED for a new page.
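
      A sketch of the change (editor's illustration; the hunk sits after the
      copy-to-user step in generic_file_buffered_read()'s main loop):

        ret = copy_page_to_iter(page, offset, nr, iter);
        written += ret;

        /*
         * Once some data has been copied we can no longer safely return
         * -EIOCBQUEUED, so stop taking async page locks for later pages
         * and treat the remainder of the request as non-blocking.
         */
        if (written && (iocb->ki_flags & IOCB_WAITQ))
                iocb->ki_flags |= IOCB_NOWAIT;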
      
      This should only be possible with read-ahead disabled on the below
      device, and multiple threads racing on the same file. Haven't been able
      to reproduce on anything else.
      
      Cc: stable@vger.kernel.org # v5.9
      Fixes: 1a0a7853 ("mm: support async buffered reads in generic_file_buffered_read()")
      Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      13bd6914
  14. 17 Oct, 2020 (4 commits)
  15. 16 Oct, 2020 (1 commit)
  16. 15 Oct, 2020 (1 commit)
    • vfs: move generic_remap_checks out of mm · 02e83f46
      Committed by Darrick J. Wong
      I would like to move all the generic helpers for the vfs remap range
      functionality (aka clonerange and dedupe) into a separate file so that
      they won't be scattered across the vfs and the mm subsystems.  The
      eventual goal is to be able to deselect remap_range.c if none of the
      filesystems need that code, but the tricky part here is picking a
      stable(ish) part of the merge window to rearrange code.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      02e83f46
  17. 14 Oct, 2020 (5 commits)
  18. 29 Sep, 2020 (1 commit)
    • io_uring: fix async buffered reads when readahead is disabled · c8d317aa
      Committed by Hao Xu
      The async buffered reads feature does not work when readahead is turned
      off.  There are two issues, addressed by the sketch below:

      - when retrying in io_read, not only the IOCB_WAITQ flag but also the
        IOCB_NOWAIT flag is still set, which sends the read down the
        would_block path in generic_file_buffered_read() and returns -EAGAIN.
        After that, the io-wq thread work is queued and the async read is
        later done the old way.

      - even if we clear IOCB_NOWAIT when retrying, the feature still does not
        run properly, since generic_file_buffered_read() goes to
        lock_page_killable() after calling mapping->a_ops->readpage() to do
        the IO, and thus puts the process to sleep.
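
      A sketch of the two adjustments (editor's illustration, not the exact
      diffs):

        /*
         * io_uring read retry: keep IOCB_WAITQ for the async wait, but drop
         * IOCB_NOWAIT so generic_file_buffered_read() no longer bails out
         * with -EAGAIN on the would_block path.
         */
        kiocb->ki_flags |= IOCB_WAITQ;
        kiocb->ki_flags &= ~IOCB_NOWAIT;

        /*
         * mm/filemap.c, after mapping->a_ops->readpage() has been issued:
         * wait for the IO via the async waitqueue rather than sleeping in
         * lock_page_killable().
         */
        if (iocb->ki_flags & IOCB_WAITQ)
                error = lock_page_async(page, iocb->ki_waitq);
        else
                error = lock_page_killable(page);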
      
      Fixes: 1a0a7853 ("mm: support async buffered reads in generic_file_buffered_read()")
      Fixes: 3b2a4439 ("io_uring: get rid of kiocb_wait_page_queue_init()")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c8d317aa
  19. 25 Sep, 2020 (1 commit)
  20. 21 Sep, 2020 (1 commit)
  21. 18 Sep, 2020 (1 commit)
    • mm: allow a controlled amount of unfairness in the page lock · 5ef64cc8
      Committed by Linus Torvalds
      Commit 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic") made
      the page locking entirely fair, in that if a waiter came in while the
      lock was held, the lock would be transferred to the lockers strictly in
      order.
      
      That was intended to finally get rid of the long-reported watchdog
      failures that involved the page lock under extreme load, where a process
      could end up waiting essentially forever, as other page lockers stole
      the lock from under it.
      
      It also improved some benchmarks, but it ended up causing huge
      performance regressions on others, simply because fair lock behavior
      doesn't end up giving out the lock as aggressively, causing better
      worst-case latency, but potentially much worse average latencies and
      throughput.
      
      Instead of reverting that change entirely, this introduces a controlled
      amount of unfairness, with a sysctl knob to tune it if somebody needs
      to.  But the default value should hopefully be good for any normal load,
      allowing a few rounds of lock stealing, but enforcing the strict
      ordering before the lock has been stolen too many times.
      
      There is also a hint from Matthieu Baerts that the fair page coloring
      may end up exposing an ABBA deadlock that is hidden by the usual
      optimistic lock stealing, and while the unfairness doesn't fix the
      fundamental issue (and I'm still looking at that), it avoids it in
      practice.
      
      The amount of unfairness can be modified by writing a new value to the
      'sysctl_page_lock_unfairness' variable (default value of 5, exposed
      through /proc/sys/vm/page_lock_unfairness), but that is hopefully
      something we'd use mainly for debugging rather than being necessary for
      any deep system tuning.
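
      An editor's conceptual sketch of the policy (not the upstream wait-queue
      implementation): a waiter tolerates a bounded number of rounds in which
      the lock is stolen from under it, then falls back to a strictly ordered
      acquisition:

        int sysctl_page_lock_unfairness = 5;    /* /proc/sys/vm/page_lock_unfairness */

        static void lock_page_bounded_unfairness(struct page *page)
        {
                int rounds = READ_ONCE(sysctl_page_lock_unfairness);

                while (rounds-- > 0) {
                        /* Optimistic path: try to steal the lock, as before 2a9127fc. */
                        if (trylock_page(page))
                                return;
                        /* Lost the race; sleep until the holder unlocks, then retry. */
                        wait_on_page_locked(page);
                }
                /* Unfairness budget exhausted: queue up for a fair, ordered handoff. */
                lock_page(page);
        }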
      
      This whole issue has exposed just how critical the page lock can be, and
      how contended it gets under certain loads.  And the main contention
      doesn't really seem to be anything related to IO (which was the origin
      of this lock), but for things like just verifying that the page file
      mapping is stable while faulting in the page into a page table.
      
      Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
      Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
      Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
      Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
      Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ef64cc8
  22. 29 Aug, 2020 (1 commit)
  23. 15 Aug, 2020 (2 commits)
  24. 13 Aug, 2020 (1 commit)
  25. 08 Aug, 2020 (2 commits)