1. 10 6月, 2019 1 次提交
  2. 21 5月, 2019 1 次提交
  3. 15 5月, 2019 4 次提交
  4. 16 3月, 2019 3 次提交
    • L
      filemap: add a comment about FAULT_FLAG_RETRY_NOWAIT behavior · 8b0f9fa2
      Linus Torvalds 提交于
      I thought Josef Bacik's patch to drop the mmap_sem was buggy, because
      when looking at the error cases, there was one case where we returned
      VM_FAULT_RETRY without actually dropping the mmap_sem.
      
      Josef had to explain to me (using small words) that yes, that's actually
      what we're supposed to do, and his patch was correct.  Which not only
      convinced me he knew what he was doing and I should stop arguing with
      him, but also that I should add a comment to the case I was confused
      about.
      Patiently-pointed-out-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8b0f9fa2
    • J
      filemap: drop the mmap_sem for all blocking operations · 6b4c9f44
      Josef Bacik 提交于
      Currently we only drop the mmap_sem if there is contention on the page
      lock.  The idea is that we issue readahead and then go to lock the page
      while it is under IO and we want to not hold the mmap_sem during the IO.
      
      The problem with this is the assumption that the readahead does anything.
      In the case that the box is under extreme memory or IO pressure we may end
      up not reading anything at all for readahead, which means we will end up
      reading in the page under the mmap_sem.
      
      Even if the readahead does something, it could get throttled because of io
      pressure on the system and the process is in a lower priority cgroup.
      
      Holding the mmap_sem while doing IO is problematic because it can cause
      system-wide priority inversions.  Consider some large company that does a
      lot of web traffic.  This large company has load balancing logic in it's
      core web server, cause some engineer thought this was a brilliant plan.
      This load balancing logic gets statistics from /proc about the system,
      which trip over processes mmap_sem for various reasons.  Now the web
      server application is in a protected cgroup, but these other processes may
      not be, and if they are being throttled while their mmap_sem is held we'll
      stall, and cause this nice death spiral.
      
      Instead rework filemap fault path to drop the mmap sem at any point that
      we may do IO or block for an extended period of time.  This includes while
      issuing readahead, locking the page, or needing to call ->readpage because
      readahead did not occur.  Then once we have a fully uptodate page we can
      return with VM_FAULT_RETRY and come back again to find our nicely in-cache
      page that was gotten outside of the mmap_sem.
      
      This patch also adds a new helper for locking the page with the mmap_sem
      dropped.  This doesn't make sense currently as generally speaking if the
      page is already locked it'll have been read in (unless there was an error)
      before it was unlocked.  However a forthcoming patchset will change this
      with the ability to abort read-ahead bio's if necessary, making it more
      likely that we could contend for a page lock and still have a not uptodate
      page.  This allows us to deal with this case by grabbing the lock and
      issuing the IO without the mmap_sem held, and then returning
      VM_FAULT_RETRY to come back around.
      
      [josef@toxicpanda.com: v6]
        Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
      [kirill@shutemov.name: fix race in filemap_fault()]
        Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
      [akpm@linux-foundation.org: coding style fixes]
      Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.comSigned-off-by: NJosef Bacik <josef@toxicpanda.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b4c9f44
    • J
      filemap: kill page_cache_read usage in filemap_fault · a75d4c33
      Josef Bacik 提交于
      Patch series "drop the mmap_sem when doing IO in the fault path", v6.
      
      Now that we have proper isolation in place with cgroups2 we have started
      going through and fixing the various priority inversions.  Most are all
      gone now, but this one is sort of weird since it's not necessarily a
      priority inversion that happens within the kernel, but rather because of
      something userspace does.
      
      We have giant applications that we want to protect, and parts of these
      giant applications do things like watch the system state to determine how
      healthy the box is for load balancing and such.  This involves running
      'ps' or other such utilities.  These utilities will often walk
      /proc/<pid>/whatever, and these files can sometimes need to
      down_read(&task->mmap_sem).  Not usually a big deal, but we noticed when
      we are stress testing that sometimes our protected application has latency
      spikes trying to get the mmap_sem for tasks that are in lower priority
      cgroups.
      
      This is because any down_write() on a semaphore essentially turns it into
      a mutex, so even if we currently have it held for reading, any new readers
      will not be allowed on to keep from starving the writer.  This is fine,
      except a lower priority task could be stuck doing IO because it has been
      throttled to the point that its IO is taking much longer than normal.  But
      because a higher priority group depends on this completing it is now stuck
      behind lower priority work.
      
      In order to avoid this particular priority inversion we want to use the
      existing retry mechanism to stop from holding the mmap_sem at all if we
      are going to do IO.  This already exists in the read case sort of, but
      needed to be extended for more than just grabbing the page lock.  With
      io.latency we throttle at submit_bio() time, so the readahead stuff can
      block and even page_cache_read can block, so all these paths need to have
      the mmap_sem dropped.
      
      The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
      because we have to reserve space for the dirty page, which can be a very
      expensive operation.  We use the same retry method as the read path, and
      simply cache the page and verify the page is still setup properly the next
      pass through ->page_mkwrite().
      
      I've tested these patches with xfstests and there are no regressions.
      
      This patch (of 3):
      
      If we do not have a page at filemap_fault time we'll do this weird forced
      page_cache_read thing to populate the page, and then drop it again and
      loop around and find it.  This makes for 2 ways we can read a page in
      filemap_fault, and it's not really needed.  Instead add a FGP_FOR_MMAP
      flag so that pagecache_get_page() will return a unlocked page that's in
      pagecache.  Then use the normal page locking and readpage logic already in
      filemap_fault.  This simplifies the no page in page cache case
      significantly.
      
      [akpm@linux-foundation.org: fix comment text]
      [josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
        Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
      Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.comSigned-off-by: NJosef Bacik <josef@toxicpanda.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a75d4c33
  5. 15 3月, 2019 1 次提交
  6. 06 3月, 2019 5 次提交
  7. 05 1月, 2019 1 次提交
  8. 29 12月, 2018 3 次提交
  9. 30 10月, 2018 5 次提交
  10. 27 10月, 2018 6 次提交
  11. 24 10月, 2018 1 次提交
  12. 21 10月, 2018 9 次提交