1. 25 Feb 2021 (1 commit)
  2. 23 Nov 2020 (1 commit)
    • mm: fix readahead_page_batch for retry entries · 4349a83a
      Committed by Matthew Wilcox (Oracle)
      Both btrfs and fuse have reported faults caused by seeing a retry entry
      instead of the page they were looking for.  This was caused by a missing
      check in the iterator.
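      A simplified sketch of the batch iterator with the missing check added
      (the real helper also tracks batch counts and compound pages; this is
      not the literal upstream diff):

        static inline unsigned int __readahead_batch(struct readahead_control *rac,
                                                     struct page **array,
                                                     unsigned int array_sz)
        {
                XA_STATE(xas, &rac->mapping->i_pages, rac->_index);
                struct page *page;
                unsigned int i = 0;

                xas_for_each(&xas, page, rac->_index + rac->_nr_pages - 1) {
                        /* The fix: skip XArray retry entries instead of
                         * handing them back to the caller as pages. */
                        if (xas_retry(&xas, page))
                                continue;
                        array[i++] = page;
                        if (i == array_sz)
                                break;
                }
                return i;
        }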
      
      As can be seen in the panic log below, accessing 0x402 causes the
      panic.  In xarray.h, 0x402 is the RETRY_ENTRY value.
      
        BUG: kernel NULL pointer dereference, address: 0000000000000402
        CPU: 14 PID: 306003 Comm: as Not tainted 5.9.0-1-amd64 #1 Debian 5.9.1-1
        Hardware name: Lenovo ThinkSystem SR665/7D2VCTO1WW, BIOS D8E106Q-1.01 05/30/2020
        RIP: 0010:fuse_readahead+0x152/0x470 [fuse]
        Code: 41 8b 57 18 4c 8d 54 10 ff 4c 89 d6 48 8d 7c 24 10 e8 d2 e3 28 f9 48 85 c0 0f 84 fe 00 00 00 44 89 f2 49 89 04 d4 44 8d 72 01 <48> 8b 10 41 8b 4f 1c 48 c1 ea 10 83 e2 01 80 fa 01 19 d2 81 e2 01
        RSP: 0018:ffffad99ceaebc50 EFLAGS: 00010246
        RAX: 0000000000000402 RBX: 0000000000000001 RCX: 0000000000000002
        RDX: 0000000000000000 RSI: ffff94c5af90bd98 RDI: ffffad99ceaebc60
        RBP: ffff94ddc1749a00 R08: 0000000000000402 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000100 R12: ffff94de6c429ce0
        R13: ffff94de6c4d3700 R14: 0000000000000001 R15: ffffad99ceaebd68
        FS:  00007f228c5c7040(0000) GS:ffff94de8ed80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000402 CR3: 0000001dbd9b4000 CR4: 0000000000350ee0
        Call Trace:
          read_pages+0x83/0x270
          page_cache_readahead_unbounded+0x197/0x230
          generic_file_buffered_read+0x57a/0xa20
          new_sync_read+0x112/0x1a0
          vfs_read+0xf8/0x180
          ksys_read+0x5f/0xe0
          do_syscall_64+0x33/0x80
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 042124cc ("mm: add new readahead_control API")
      Reported-by: David Sterba <dsterba@suse.com>
      Reported-by: Wonhyuk Yang <vvghjk1234@gmail.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201103142852.8543-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20201103124349.16722-1-vvghjk1234@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 29 Oct 2020 (1 commit)
  4. 17 Oct 2020 (5 commits)
  5. 14 Oct 2020 (3 commits)
  6. 12 Oct 2020 (1 commit)
  7. 21 Sep 2020 (1 commit)
  8. 15 Aug 2020 (1 commit)
  9. 22 Jun 2020 (3 commits)
  10. 10 Jun 2020 (1 commit)
  11. 03 Jun 2020 (5 commits)
    • include/linux/pagemap.h: introduce attach/detach_page_private · b03143ac
      Committed by Guoqing Jiang
      Patch series "Introduce attach/detach_page_private to cleanup code".
      
      This patch (of 10):
      
      The logic in attach_page_buffers and __clear_page_buffers is clearly
      paired, but
      
      1. they are located in different files.
      
      2. attach_page_buffers is implemented in buffer_head.h, so it can be
         used by other files.  But __clear_page_buffers is a static function
         in buffer.c, so other potential users cannot call it; md-bitmap even
         copied the function.
      
      So introduce the new attach/detach_page_private pair to replace them.
      With the new pair of functions, the next patches remove the usages of
      attach_page_buffers and __clear_page_buffers.  Thanks to Alexander
      Viro, Andreas Grünbacher, Christoph Hellwig and Matthew Wilcox for
      suggestions about the function names.
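      A sketch of the new pair, based on the description above (reference
      counting plus the PagePrivate flag; the upstream helpers may differ in
      detail):

        static inline void attach_page_private(struct page *page, void *data)
        {
                get_page(page);
                set_page_private(page, (unsigned long)data);
                SetPagePrivate(page);
        }

        static inline void *detach_page_private(struct page *page)
        {
                void *data = (void *)page_private(page);

                if (!PagePrivate(page))
                        return NULL;
                ClearPagePrivate(page);
                set_page_private(page, 0);
                put_page(page);

                return data;
        }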
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Cc: Song Liu <song@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Cc: Martin Brandenburg <martin@omnibond.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Andreas Dilger <adilger@dilger.ca>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Link: http://lkml.kernel.org/r/20200517214718.468-1-guoqing.jiang@cloud.ionos.com
      Link: http://lkml.kernel.org/r/20200517214718.468-2-guoqing.jiang@cloud.ionos.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add page_cache_readahead_unbounded · 2c684234
      Committed by Matthew Wilcox (Oracle)
      ext4 and f2fs have duplicated the guts of the readahead code so they can
      read past i_size.  Instead, separate out the guts of the readahead code
      so they can call it directly.
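      As a rough illustration, a filesystem that keeps metadata past i_size
      could use it like this (hypothetical caller; read_meta_page() and the
      exact parameter list are assumptions based on the description above):

        /* Read a metadata page that may live beyond i_size, with readahead. */
        static struct page *read_meta_page(struct inode *inode, pgoff_t index,
                                           unsigned long nr_ra_pages)
        {
                if (nr_ra_pages > 1)
                        page_cache_readahead_unbounded(inode->i_mapping, NULL,
                                                       index, nr_ra_pages, 0);
                return read_mapping_page(inode->i_mapping, index, NULL);
        }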
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Eric Biggers <ebiggers@google.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-14-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add new readahead_control API · 042124cc
      Committed by Matthew Wilcox (Oracle)
      Filesystems which implement the upcoming ->readahead method will get
      their pages by calling readahead_page() or readahead_page_batch().
      These functions support large pages, even though none of the filesystems
      to be converted do so yet.
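      A minimal sketch of what a converted ->readahead implementation looks
      like (example_read_page() is a hypothetical per-page I/O helper; real
      filesystems batch the I/O):

        static void example_readahead(struct readahead_control *rac)
        {
                struct page *page;

                /* Pages come locked and with a reference held; the
                 * filesystem submits the read, unlocks the page when the
                 * read completes, and drops its reference when done. */
                while ((page = readahead_page(rac))) {
                        example_read_page(page);
                        put_page(page);
                }
        }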
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-6-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move readahead prototypes from mm.h · cee9a0c4
      Committed by Matthew Wilcox (Oracle)
      Patch series "Change readahead API", v11.
      
      This series adds a readahead address_space operation to replace the
      readpages operation.  The key difference is that pages are added to the
      page cache as they are allocated (and then looked up by the filesystem)
      instead of passing them on a list to the readpages operation and having
      the filesystem add them to the page cache.  It's a net reduction in code
      for each implementation, more efficient than walking a list, and solves
      the direct-write vs buffered-read problem reported by yu kuai at
      http://lkml.kernel.org/r/20200116063601.39201-1-yukuai3@huawei.com
      
      The only unconverted filesystems are those which use fscache.  Their
      conversion is pending Dave Howells' rewrite which will make the
      conversion substantially easier.  This should be completed by the end of
      the year.
      
      I want to thank the reviewers/testers; Dave Chinner, John Hubbard, Eric
      Biggers, Johannes Thumshirn, Dave Sterba, Zi Yan, Christoph Hellwig and
      Miklos Szeredi have done a marvellous job of providing constructive
      criticism.
      
      These patches pass an xfstests run on ext4, xfs & btrfs with no
      regressions that I can tell (some of the tests seem a little flaky
      before and remain flaky afterwards).
      
      This patch (of 25):
      
      The readahead code is part of the page cache so should be found in the
      pagemap.h file.  force_page_cache_readahead is only used within mm, so
      move it to mm/internal.h instead.  Remove the parameter names where they
      add no value, and rename the ones which were actively misleading.
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-1-willy@infradead.org
      Link: http://lkml.kernel.org/r/20200414150233.24495-2-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: track per-sb writeback errors and report them to syncfs · 735e4ae5
      Committed by Jeff Layton
      Patch series "vfs: have syncfs() return error when there are writeback
      errors", v6.
      
      Currently, syncfs does not return errors when one of the inodes fails to
      be written back.  It will return errors based on the legacy AS_EIO and
      AS_ENOSPC flags when syncing out the block device fails, but that's not
      particularly helpful for filesystems that aren't backed by a blockdev.
      It's also possible for a stray sync to lose those errors.
      
      The basic idea in this set is to track writeback errors at the
      superblock level, so that we can quickly and easily check whether
      something bad happened without having to fsync each file individually.
      syncfs is then changed to reliably report writeback errors after they
      occur, much in the same fashion as fsync does now.
      
      This patch (of 2):
      
      Usually we suggest that applications call fsync when they want to ensure
      that all data written to the file has made it to the backing store, but
      that can be inefficient when there are a lot of open files.
      
      Calling syncfs on the filesystem can be more efficient in some
      situations, but the error reporting doesn't currently work the way most
      people expect.  If a single inode on a filesystem reports a writeback
      error, syncfs won't necessarily return an error.  syncfs only returns an
      error if __sync_blockdev fails, and on some filesystems that's a no-op.
      
      It would be better if syncfs reported an error if there were any
      writeback failures.  Then applications could call syncfs to see if there
      are any errors on any open files, and could then call fsync on all of
      the other descriptors to figure out which one failed.
      
      This patch adds a new errseq_t to struct super_block, and has
      mapping_set_error also record writeback errors there.
      
      To report those errors, we also need to keep an errseq_t in struct file
      to act as a cursor.  This patch adds a dedicated field for that purpose,
      which slots nicely into 4 bytes of padding at the end of struct file on
      x86_64.
      
      An earlier version of this patch used an O_PATH file descriptor to cue
      the kernel that the open file should track the superblock error and not
      the inode's writeback error.
      
      I think that API is just too weird though.  This is simpler and should
      make syncfs error reporting "just work" even if someone is multiplexing
      fsync and syncfs on the same fds.
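      From userspace the intended pattern is simple (illustrative only;
      /mnt/data is a hypothetical mount point):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("/mnt/data", O_RDONLY);   /* any fd on the fs */
                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                /* With per-sb error tracking, syncfs() also reports writeback
                 * errors recorded since this fd last observed them. */
                if (syncfs(fd) < 0)
                        perror("syncfs");
                close(fd);
                return 0;
        }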
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Andres Freund <andres@anarazel.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: David Howells <dhowells@redhat.com>
      Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
      Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 08 Apr 2020 (1 commit)
  13. 03 Apr 2020 (3 commits)
  14. 07 Jan 2020 (1 commit)
    • fs: Fix page_mkwrite off-by-one errors · 243145bc
      Committed by Andreas Gruenbacher
      The check in block_page_mkwrite that is meant to determine whether an
      offset is within the inode size is off by one.  This bug has been copied
      into iomap_page_mkwrite and several filesystems (ubifs, ext4, f2fs,
      ceph).
      
      Fix that by introducing a new page_mkwrite_check_truncate helper that
      checks for truncate and computes the bytes in the page up to EOF.  Use
      the helper in iomap.
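      A sketch of what such a helper can look like, reconstructed from the
      description (the upstream version may differ in details):

        /* Return the bytes of the page inside i_size, or -EFAULT if the page
         * was truncated away or lies wholly past EOF. */
        static inline int page_mkwrite_check_truncate(struct page *page,
                                                      struct inode *inode)
        {
                loff_t size = i_size_read(inode);
                pgoff_t end_index = size >> PAGE_SHIFT;
                unsigned int offset = offset_in_page(size);

                if (page->mapping != inode->i_mapping)
                        return -EFAULT;         /* truncated */
                if (page->index < end_index)
                        return PAGE_SIZE;       /* wholly inside EOF */
                if (page->index > end_index || offset == 0)
                        return -EFAULT;         /* wholly past EOF */
                return offset;                  /* straddles EOF */
        }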
      
      NOTE from Darrick: The original patch fixed a number of filesystems, but
      then there were merge conflicts with the f2fs for-next tree; a
      subsequent re-submission of the patch had different btrfs changes with
      no explanation; and Christoph complained that each per-fs fix should be
      a separate patch.  In my view that's too much risk to take on, so I
      decided to drop all the hunks except for iomap, since I've actually QA'd
      XFS.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      [darrick: drop everything but the iomap parts]
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  15. 25 Sep 2019 (1 commit)
  16. 13 Jul 2019 (2 commits)
  17. 06 Jul 2019 (1 commit)
    • Revert "mm: page cache: store only head pages in i_pages" · 69bf4b6b
      Committed by Linus Torvalds
      This reverts commit 5fd4ca2d.
      
      Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
      __delete_from_swap_cache() to trigger:
      
         page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
         anon
         flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
         raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
         raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
         page dumped because: VM_BUG_ON_PAGE(entry != page)
         page->mem_cgroup:ffff978433ace000
         ------------[ cut here ]------------
         kernel BUG at mm/swap_state.c:170!
         invalid opcode: 0000 [#1] SMP NOPTI
         CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
         Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
         RIP: 0010:__delete_from_swap_cache+0x20d/0x240
         Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
         RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
         RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
         RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
         RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
         R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
         R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
         FS:  0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
         Call Trace:
          delete_from_swap_cache+0x46/0xa0
          try_to_free_swap+0xbc/0x110
          swap_writepage+0x13/0x70
          pageout.isra.0+0x13c/0x350
          shrink_page_list+0xc14/0xdf0
          shrink_inactive_list+0x1e5/0x3c0
          shrink_node_memcg+0x202/0x760
          shrink_node+0xe0/0x470
          balance_pgdat+0x2d1/0x510
          kswapd+0x220/0x420
          kthread+0xfb/0x130
          ret_from_fork+0x22/0x40
      
      and it's not immediately obvious why it happens.  It's too late in the
      rc cycle to do anything but revert for now.
      
      Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
      Reported-and-bisected-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 15 May 2019 (3 commits)
  19. 16 Mar 2019 (1 commit)
    • filemap: kill page_cache_read usage in filemap_fault · a75d4c33
      Committed by Josef Bacik
      Patch series "drop the mmap_sem when doing IO in the fault path", v6.
      
      Now that we have proper isolation in place with cgroups2 we have started
      going through and fixing the various priority inversions.  Most are all
      gone now, but this one is sort of weird since it's not necessarily a
      priority inversion that happens within the kernel, but rather because of
      something userspace does.
      
      We have giant applications that we want to protect, and parts of these
      giant applications do things like watch the system state to determine how
      healthy the box is for load balancing and such.  This involves running
      'ps' or other such utilities.  These utilities will often walk
      /proc/<pid>/whatever, and these files can sometimes need to
      down_read(&task->mmap_sem).  Not usually a big deal, but we noticed when
      we are stress testing that sometimes our protected application has latency
      spikes trying to get the mmap_sem for tasks that are in lower priority
      cgroups.
      
      This is because any down_write() on a semaphore essentially turns it into
      a mutex, so even if we currently have it held for reading, any new readers
      will not be allowed on to keep from starving the writer.  This is fine,
      except a lower priority task could be stuck doing IO because it has been
      throttled to the point that its IO is taking much longer than normal.  But
      because a higher priority group depends on this completing it is now stuck
      behind lower priority work.
      
      In order to avoid this particular priority inversion we want to use the
      existing retry mechanism to stop from holding the mmap_sem at all if we
      are going to do IO.  This already exists in the read case sort of, but
      needed to be extended for more than just grabbing the page lock.  With
      io.latency we throttle at submit_bio() time, so the readahead stuff can
      block and even page_cache_read can block, so all these paths need to have
      the mmap_sem dropped.
      
      The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
      because we have to reserve space for the dirty page, which can be a very
      expensive operation.  We use the same retry method as the read path, and
      simply cache the page and verify the page is still setup properly the next
      pass through ->page_mkwrite().
      
      I've tested these patches with xfstests and there are no regressions.
      
      This patch (of 3):
      
      If we do not have a page at filemap_fault time we'll do this weird forced
      page_cache_read thing to populate the page, and then drop it again and
      loop around and find it.  This makes for 2 ways we can read a page in
      filemap_fault, and it's not really needed.  Instead add a FGP_FOR_MMAP
      flag so that pagecache_get_page() will return an unlocked page that's in
      pagecache.  Then use the normal page locking and readpage logic already in
      filemap_fault.  This simplifies the no page in page cache case
      significantly.
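      Roughly, the no-page path in filemap_fault() then becomes (simplified
      sketch, not the exact upstream code):

        /* Allocate and add the page without locking it for the caller. */
        page = pagecache_get_page(mapping, offset,
                                  FGP_CREAT | FGP_FOR_MMAP, vmf->gfp_mask);
        if (!page)
                return vmf_error(-ENOMEM);
        /* Fall through to the normal lock_page() + readpage logic that
         * filemap_fault() already uses for any other not-uptodate page. */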
      
      [akpm@linux-foundation.org: fix comment text]
      [josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
        Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
      Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 06 Mar 2019 (1 commit)
    • mm: page_cache_add_speculative(): refactor out some code duplication · 494eec70
      Committed by john.hubbard@gmail.com
      From: John Hubbard <jhubbard@nvidia.com>
      
      This combines the common elements of these routines:
      
          page_cache_get_speculative()
          page_cache_add_speculative()
      
      This was anticipated by the original author, as shown by the comment in
      commit ce0ad7f0 ("powerpc/mm: Lockless get_user_pages_fast() for
      64-bit (v3)"):
      
          "Same as above, but add instead of inc (could just be merged)"
      
      There is no intention to introduce any behavioral change, but there is a
      small risk of that, due to slightly differing ways of expressing the
      TINY_RCU and related configurations.
      
      This also removes the VM_BUG_ON(in_interrupt()) that was in
      page_cache_add_speculative(), but not in page_cache_get_speculative().
      This provides slightly less detection of such bugs, but given that it
      was only there on the "add" path anyway, we can likely do without it
      just fine.
      
      And it removes the
      VM_BUG_ON_PAGE(PageCompound(page) && page != compound_head(page), page);
      that page_cache_add_speculative() had.
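      The resulting shape is roughly the following (simplified; the real
      helper also covers the TINY_RCU configuration and keeps some VM_BUG_ON
      checks):

        static inline int __page_cache_add_speculative(struct page *page, int count)
        {
                /* Fail if the page is already free (refcount == 0). */
                if (unlikely(!page_ref_add_unless(page, count, 0)))
                        return 0;
                return 1;
        }

        static inline int page_cache_get_speculative(struct page *page)
        {
                return __page_cache_add_speculative(page, 1);
        }

        static inline int page_cache_add_speculative(struct page *page, int count)
        {
                return __page_cache_add_speculative(page, count);
        }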
      
      Link: http://lkml.kernel.org/r/20190206231016.22734-2-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 29 Dec 2018 (1 commit)
    • mm: put_and_wait_on_page_locked() while page is migrated · 9a1ea439
      Committed by Hugh Dickins
      Waiting on a page migration entry has used wait_on_page_locked() all along
      since 2006: but you cannot safely wait_on_page_locked() without holding a
      reference to the page, and that extra reference is enough to make
      migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
      on the entry before migrate_page_move_mapping() gets there.
      
      And that failure is retried nine times, amplifying the pain when trying to
      migrate a popular page.  With a single persistent faulter, migration
      sometimes succeeds; with two or three concurrent faulters, success becomes
      much less likely (and the more the page was mapped, the worse the overhead
      of unmapping and remapping it on each try).
      
      This is especially a problem for memory offlining, where the outer level
      retries forever (or until terminated from userspace), because a heavy
      refault workload can trigger an endless loop of migration failures.
      wait_on_page_locked() is the wrong tool for the job.
      
      David Herrmann (but was he the first?) noticed this issue in 2014:
      https://marc.info/?l=linux-mm&m=140110465608116&w=2
      
      Tim Chen started a thread in August 2017 which appears relevant:
      https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
      on to implicate __migration_entry_wait():
      https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
      up with the v4.14 commits: 2554db91 ("sched/wait: Break up long wake
      list walk") 11a19c7b ("sched/wait: Introduce wakeup boomark in
      wake_up_page_bit")
      
      Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
      https://marc.info/?l=linux-mm&m=154217936431300&w=2
      
      We have all assumed that it is essential to hold a page reference while
      waiting on a page lock: partly to guarantee that there is still a struct
      page when MEMORY_HOTREMOVE is configured, but also to protect against
      reuse of the struct page going to someone who then holds the page locked
      indefinitely, when the waiter can reasonably expect timely unlocking.
      
      But in fact, so long as wait_on_page_bit_common() does the put_page(), and
      is careful not to rely on struct page contents thereafter, there is no
      need to hold a reference to the page while waiting on it.  That does mean
      that this case cannot go back through the loop: but that's fine for the
      page migration case, and even if used more widely, is limited by the "Stop
      walking if it's locked" optimization in wake_page_function().
      
      Add interface put_and_wait_on_page_locked() to do this, using "behavior"
      enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
      No interruptible or killable variant needed yet, but they might follow: I
      have a vague notion that reporting -EINTR should take precedence over
      return from wait_on_page_bit_common() without knowing the page state, so
      arrange it accordingly - but that may be nothing but pedantic.
      
      __migration_entry_wait() still has to take a brief reference to the page,
      prior to calling put_and_wait_on_page_locked(): but now that it is dropped
      before waiting, the chance of impeding page migration is very much
      reduced.  Should we perhaps disable preemption across this?
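      Sketch of the migration wait path as described above (not the exact
      upstream code):

        /* __migration_entry_wait(), roughly: take only a brief reference to
         * pin the struct page, drop the page table lock, then let
         * put_and_wait_on_page_locked() drop that reference before sleeping
         * on the lock bit. */
        page = migration_entry_to_page(entry);
        get_page(page);
        pte_unmap_unlock(ptep, ptl);
        put_and_wait_on_page_locked(page);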
      
      shrink_page_list()'s __ClearPageLocked(): that was a surprise!  This
      survived a lot of testing before that showed up.  PageWaiters may have
      been set by wait_on_page_bit_common(), and the reference dropped, just
      before shrink_page_list() succeeds in freezing its last page reference: in
      such a case, unlock_page() must be used.  Follow the suggestion from
      Michal Hocko, just revert a978d6f5 ("mm: unlockless reclaim") now:
      that optimization predates PageWaiters, and won't buy much these days; but
      we can reinstate it for the !PageWaiters case if anyone notices.
      
      It does raise the question: should vmscan.c's is_page_cache_freeable() and
      __remove_mapping() now treat a PageWaiters page as if an extra reference
      were held?  Perhaps, but I don't think it matters much, since
      shrink_page_list() already had to win its trylock_page(), so waiters are
      not very common there: I noticed no difference when trying the bigger
      change, and it's surely not needed while put_and_wait_on_page_locked() is
      only used for page migration.
      
      [willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Baoquan He <bhe@redhat.com>
      Tested-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 21 Oct 2018 (2 commits)