1. 12 12月, 2020 1 次提交
  2. 07 12月, 2020 1 次提交
  3. 25 11月, 2020 1 次提交
    • H
      mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback) · 073861ed
      Hugh Dickins 提交于
      Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
      on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
      end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
      no longer an ext4 page at all.
      
      The problem is that PageWriteback is not accompanied by a page reference
      (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
      soon as TestClearPageWriteback has been done, that page could be removed
      from page cache, freed, and reused for something else by the time that
      wake_up_page() is reached.
      
      https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
      Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
      check; but I'm paranoid about even looking at an unreferenced struct page,
      lest its memory might itself have already been reused or hotremoved (and
      wake_up_page_bit() may modify that memory with its ClearPageWaiters()).
      
      Then on crashing a second time, realized there's a stronger reason against
      that approach.  If my testing just occasionally crashes on that check,
      when the page is reused for part of a compound page, wouldn't it be much
      more common for the page to get reused as an order-0 page before reaching
      wake_up_page()?  And on rare occasions, might that reused page already be
      marked PageWriteback by its new user, and already be waited upon?  What
      would that look like?
      
      It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
      in write_cache_pages() (though I have never seen that crash myself).
      
      Matthew Wilcox explaining this to himself:
       "page is allocated, added to page cache, dirtied, writeback starts,
      
        --- thread A ---
        filesystem calls end_page_writeback()
              test_clear_page_writeback()
        --- context switch to thread B ---
        truncate_inode_pages_range() finds the page, it doesn't have writeback set,
        we delete it from the page cache.  Page gets reallocated, dirtied, writeback
        starts again.  Then we call write_cache_pages(), see
        PageWriteback() set, call wait_on_page_writeback()
        --- context switch back to thread A ---
        wake_up_page(page, PG_writeback);
        ... thread B is woken, but because the wakeup was for the old use of
        the page, PageWriteback is still set.
      
        Devious"
      
      And prior to 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic")
      this would have been much less likely: before that, wake_page_function()'s
      non-exclusive case would stop walking and not wake if it found Writeback
      already set again; whereas now the non-exclusive case proceeds to wake.
      
      I have not thought of a fix that does not add a little overhead: the
      simplest fix is for end_page_writeback() to get_page() before calling
      test_clear_page_writeback(), then put_page() after wake_up_page().
      
      Was there a chance of missed wakeups before, since a page freed before
      reaching wake_up_page() would have PageWaiters cleared?  I think not,
      because each waiter does hold a reference on the page.  This bug comes
      when the old use of the page, the one we do TestClearPageWriteback on,
      had *no* waiters, so no additional page reference beyond the page cache
      (and whoever racily freed it).  The reuse of the page has a waiter
      holding a reference, and its own PageWriteback set; but the belated
      wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).
      
      Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
      Reported-by: NQian Cai <cai@lca.pw>
      Fixes: 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic")
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org # v5.8+
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      073861ed
  4. 17 11月, 2020 1 次提交
    • J
      mm: never attempt async page lock if we've transferred data already · 0abed7c6
      Jens Axboe 提交于
      We catch the case where we enter generic_file_buffered_read() with data
      already transferred, but we also need to be careful not to allow an async
      page lock if we're looping transferring data. If not, we could be
      returning -EIOCBQUEUED instead of the transferred amount, and it could
      result in double waitqueue additions as well.
      
      Cc: stable@vger.kernel.org # v5.9
      Fixes: 1a0a7853 ("mm: support async buffered reads in generic_file_buffered_read()")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0abed7c6
  5. 18 10月, 2020 1 次提交
    • J
      mm: mark async iocb read as NOWAIT once some data has been copied · 13bd6914
      Jens Axboe 提交于
      Once we've copied some data for an iocb that is marked with IOCB_WAITQ,
      we should no longer attempt to async lock a new page. Instead make sure
      we return the copied amount, and let the caller retry, instead of
      returning -EIOCBQUEUED for a new page.
      
      This should only be possible with read-ahead disabled on the below
      device, and multiple threads racing on the same file. Haven't been able
      to reproduce on anything else.
      
      Cc: stable@vger.kernel.org # v5.9
      Fixes: 1a0a7853 ("mm: support async buffered reads in generic_file_buffered_read()")
      Reported-by: NKent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      13bd6914
  6. 17 10月, 2020 4 次提交
  7. 16 10月, 2020 1 次提交
  8. 15 10月, 2020 1 次提交
    • D
      vfs: move generic_remap_checks out of mm · 02e83f46
      Darrick J. Wong 提交于
      I would like to move all the generic helpers for the vfs remap range
      functionality (aka clonerange and dedupe) into a separate file so that
      they won't be scattered across the vfs and the mm subsystems.  The
      eventual goal is to be able to deselect remap_range.c if none of the
      filesystems need that code, but the tricky part here is picking a
      stable(ish) part of the merge window to rearrange code.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      02e83f46
  9. 14 10月, 2020 5 次提交
  10. 29 9月, 2020 1 次提交
    • H
      io_uring: fix async buffered reads when readahead is disabled · c8d317aa
      Hao Xu 提交于
      The async buffered reads feature is not working when readahead is
      turned off. There are two things to concern:
      
      - when doing retry in io_read, not only the IOCB_WAITQ flag but also
        the IOCB_NOWAIT flag is still set, which makes it goes to would_block
        phase in generic_file_buffered_read() and then return -EAGAIN. After
        that, the io-wq thread work is queued, and later doing the async
        reads in the old way.
      
      - even if we remove IOCB_NOWAIT when doing retry, the feature is still
        not running properly, since in generic_file_buffered_read() it goes to
        lock_page_killable() after calling mapping->a_ops->readpage() to do
        IO, and thus causing process to sleep.
      
      Fixes: 1a0a7853 ("mm: support async buffered reads in generic_file_buffered_read()")
      Fixes: 3b2a4439 ("io_uring: get rid of kiocb_wait_page_queue_init()")
      Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c8d317aa
  11. 25 9月, 2020 1 次提交
  12. 21 9月, 2020 1 次提交
  13. 18 9月, 2020 1 次提交
    • L
      mm: allow a controlled amount of unfairness in the page lock · 5ef64cc8
      Linus Torvalds 提交于
      Commit 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic") made
      the page locking entirely fair, in that if a waiter came in while the
      lock was held, the lock would be transferred to the lockers strictly in
      order.
      
      That was intended to finally get rid of the long-reported watchdog
      failures that involved the page lock under extreme load, where a process
      could end up waiting essentially forever, as other page lockers stole
      the lock from under it.
      
      It also improved some benchmarks, but it ended up causing huge
      performance regressions on others, simply because fair lock behavior
      doesn't end up giving out the lock as aggressively, causing better
      worst-case latency, but potentially much worse average latencies and
      throughput.
      
      Instead of reverting that change entirely, this introduces a controlled
      amount of unfairness, with a sysctl knob to tune it if somebody needs
      to.  But the default value should hopefully be good for any normal load,
      allowing a few rounds of lock stealing, but enforcing the strict
      ordering before the lock has been stolen too many times.
      
      There is also a hint from Matthieu Baerts that the fair page coloring
      may end up exposing an ABBA deadlock that is hidden by the usual
      optimistic lock stealing, and while the unfairness doesn't fix the
      fundamental issue (and I'm still looking at that), it avoids it in
      practice.
      
      The amount of unfairness can be modified by writing a new value to the
      'sysctl_page_lock_unfairness' variable (default value of 5, exposed
      through /proc/sys/vm/page_lock_unfairness), but that is hopefully
      something we'd use mainly for debugging rather than being necessary for
      any deep system tuning.
      
      This whole issue has exposed just how critical the page lock can be, and
      how contended it gets under certain locks.  And the main contention
      doesn't really seem to be anything related to IO (which was the origin
      of this lock), but for things like just verifying that the page file
      mapping is stable while faulting in the page into a page table.
      
      Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
      Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
      Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/Reported-and-tested-by: NMichael Larabel <Michael@michaellarabel.com>
      Tested-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ef64cc8
  14. 29 8月, 2020 1 次提交
  15. 15 8月, 2020 2 次提交
  16. 13 8月, 2020 1 次提交
  17. 08 8月, 2020 2 次提交
  18. 03 8月, 2020 2 次提交
  19. 08 7月, 2020 1 次提交
  20. 22 6月, 2020 4 次提交
  21. 10 6月, 2020 3 次提交
  22. 05 6月, 2020 1 次提交
  23. 04 6月, 2020 3 次提交