1. 05 Dec 2021: 3 commits
  2. 04 Dec 2021: 4 commits
  3. 17 Nov 2021: 1 commit
  4. 10 Nov 2021: 1 commit
    • vfs: keep inodes with page cache off the inode shrinker LRU · 51b8c1fe
      Committed by Johannes Weiner
      Historically (pre-2.5), the inode shrinker used to reclaim only empty
      inodes and skip over those that still contained page cache.  This caused
      problems on highmem hosts: struct inode allocations could fill up the
      lowmem zones before the cache in the highmem zones was reclaimed.
      
      To address this, the inode shrinker started to strip page cache to
      facilitate reclaiming lowmem.  However, this comes with its own set of
      problems: the shrinkers may drop actively used page cache just because
      the inodes are not currently open or dirty - think working with a large
      git tree.  It further doesn't respect cgroup memory protection settings
      and can cause priority inversions between containers.
      
      Nowadays, the page cache also holds non-resident info for evicted cache
      pages in order to detect refaults.  We've come to rely heavily on this
      data inside reclaim for protecting the cache workingset and driving swap
      behavior.  We also use it to quantify and report workload health through
      psi.  The latter in turn is used for fleet health monitoring, as well as
      driving automated memory sizing of workloads and containers, proactive
      reclaim and memory offloading schemes.
      
      The consequence of dropping page cache prematurely is that we're
      seeing subtle and not-so-subtle failures in all of the
      above-mentioned scenarios, with workloads generally entering
      unexpected thrashing states while losing the ability to reliably
      detect it.
      
      To fix this on non-highmem systems at least, going back to rotating
      inodes on the LRU isn't feasible.  We've tried (commit a76cf1a4
      ("mm: don't reclaim inodes with many attached pages")) and failed
      (commit 69056ee6 ("Revert "mm: don't reclaim inodes with many
      attached pages"")).
      
      The issue is mostly that shrinker pools attract pressure based on their
      size, and when objects get skipped the shrinkers remember this as
      deferred reclaim work.  This accumulates excessive pressure on the
      remaining inodes, and we can quickly eat into heavily used ones, or
      dirty ones that require IO to reclaim, when there potentially is plenty
      of cold, clean cache around still.
      
      Instead, this patch keeps populated inodes off the inode LRU in the
      first place - just like an open file or dirty state would.  An otherwise
      clean and unused inode then gets queued when the last cache entry
      disappears.  This solves the problem without reintroducing the reclaim
      issues, and generally is a bit more scalable than having to wade through
      potentially hundreds of thousands of busy inodes.
      
      Locking is a bit tricky because the locks protecting the inode state
      (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
      irq-safe page cache lock (i_pages.xa_lock).  Page cache deletions are
      serialized through i_lock, taken before the i_pages lock, to make sure
      depopulated inodes are queued reliably.  Additions may race with
      deletions, but we'll check again in the shrinker.  If additions race
      with the shrinker itself, we're protected by the i_lock: if find_inode()
      or iput() win, the shrinker will bail on the elevated i_count or
      I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
      will set I_FREEING and inhibit further igets(), which will cause the
      other side to create a new instance of the inode instead.
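      The admission rule the patch describes can be sketched in plain
      userspace C. The `toy_inode` type, its fields, and the helper names
      below are illustrative stand-ins under stated assumptions, not the
      kernel's actual structures or API:

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Illustrative model of an inode: these fields mirror, but are not,
       * the kernel's struct inode. */
      struct toy_inode {
          int i_count;         /* reference count: >0 means in use */
          bool dirty;          /* dirty state that still needs writeback */
          long nr_cache_pages; /* pages attached in the page cache */
      };

      /* The patched rule: an inode is a candidate for the shrinker LRU
       * only when it is unreferenced, clean, AND its page cache is empty.
       * Previously, populated-but-unused inodes also went on the LRU. */
      bool lru_eligible(const struct toy_inode *inode)
      {
          return inode->i_count == 0 && !inode->dirty &&
                 inode->nr_cache_pages == 0;
      }

      /* When the last cache entry disappears, an otherwise clean and
       * unused inode only then becomes queueable on the LRU. */
      bool becomes_eligible_on_last_page_removal(struct toy_inode *inode)
      {
          if (inode->nr_cache_pages > 0)
              inode->nr_cache_pages--;
          return lru_eligible(inode);
      }
      ```

      The point of the sketch is only the admission condition: populated
      inodes never reach the LRU, so the shrinker no longer accumulates
      deferred work by skipping them.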
      
      Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 07 Nov 2021: 1 commit
  6. 03 Nov 2021: 1 commit
  7. 27 Oct 2021: 1 commit
  8. 26 Oct 2021: 1 commit
  9. 19 Oct 2021: 1 commit
  10. 18 Oct 2021: 2 commits
  11. 04 Sep 2021: 1 commit
  12. 27 Aug 2021: 1 commit
  13. 25 Aug 2021: 1 commit
  14. 24 Aug 2021: 2 commits
  15. 23 Aug 2021: 4 commits
    • fs: kill sync_inode · 5662c967
      Committed by Josef Bacik
      Now that all users of sync_inode() have been deleted, remove
      sync_inode().
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • fs: add a filemap_fdatawrite_wbc helper · 5a798493
      Committed by Josef Bacik
      Btrfs sometimes needs to flush dirty pages on a bunch of dirty inodes in
      order to reclaim metadata reservations.  Unfortunately most helpers in
      this area are too smart for us:
      
      1) The normal filemap_fdata* helpers only take range and sync modes, and
         don't give any indication of how much was written, so we can only
         flush full inodes, which isn't what we want in most cases.
      2) The normal writeback path requires us to have the s_umount sem held,
         but we can't unconditionally take it in this path because we could
         deadlock.
       3) The normal writeback path also skips inodes with I_SYNC set if we
          write with WB_SYNC_NONE.  This isn't the behavior we want under
          heavy ENOSPC pressure: we want to make sure the pages are
          actually under writeback before returning.  If another thread is
          in the middle of writing the file, we may return before its pages
          are under writeback, miss our ordered extents, and fail to wait
          for completion properly.
      4) sync_inode() uses the normal writeback path and has the same problem
         as #3.
      
      What we really want is to call do_writepages() with our wbc.  This way
      we can make sure that writeback is actually started on the pages, and we
      can control how many pages are written as a whole as we write many
      inodes using the same wbc.  Accomplish this with a new helper that does
      just that so we can use it for our ENOSPC flushing infrastructure.
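       The idea of many inodes consuming one shared writeback budget can be
       illustrated with a userspace sketch. The `toy_wbc` type and helpers
       below are hypothetical stand-ins for the kernel's wbc and
       do_writepages(), not their real signatures:

       ```c
       #include <assert.h>

       /* Userspace sketch: several inodes written against one shared
        * writeback budget, the way one wbc's nr_to_write would be consumed
        * across repeated helper calls. All names are illustrative. */
       struct toy_wbc {
           long nr_to_write;   /* remaining page budget, shared across inodes */
       };

       /* Start writeback on up to wbc->nr_to_write of this inode's dirty
        * pages; returns pages actually written and debits the budget. */
       long toy_writepages(long dirty_pages, struct toy_wbc *wbc)
       {
           long written = dirty_pages < wbc->nr_to_write ? dirty_pages
                                                         : wbc->nr_to_write;
           wbc->nr_to_write -= written;
           return written;
       }

       /* Flush a list of dirty inodes with one wbc, stopping when the
        * shared budget runs out -- the caller controls total IO issued. */
       long flush_inodes(const long *dirty, int n, struct toy_wbc *wbc)
       {
           long total = 0;
           for (int i = 0; i < n && wbc->nr_to_write > 0; i++)
               total += toy_writepages(dirty[i], wbc);
           return total;
       }
       ```

       This is the property the commit asks for: writeback is actually
       started on each inode's pages, and the caller decides how many pages
       are written as a whole across many inodes.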
       Reviewed-by: Nikolay Borisov <nborisov@suse.com>
       Reviewed-by: Christoph Hellwig <hch@lst.de>
       Signed-off-by: Josef Bacik <josef@toxicpanda.com>
       Reviewed-by: David Sterba <dsterba@suse.com>
       Signed-off-by: David Sterba <dsterba@suse.com>
    • fs: remove mandatory file locking support · f7e33bdb
      Committed by Jeff Layton
      We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
      off in Fedora and RHEL8. Several other distros have followed suit.
      
       I've heard of one problem in all that time: someone migrated from an
       older distro that supported "-o mand" to one that didn't, and the
       host had an fstab entry with "mand" in it, which broke on reboot.
       They didn't actually _use_ mandatory locking, so they just removed
       the mount option and moved on.
      
      This patch rips out mandatory locking support wholesale from the kernel,
      along with the Kconfig option and the Documentation file. It also
      changes the mount code to ignore the "mand" mount option instead of
      erroring out, and to throw a big, ugly warning.
       Signed-off-by: Jeff Layton <jlayton@kernel.org>
    • fs: simplify get_filesystem_list / get_all_fs_names · 6e7c1770
      Committed by Christoph Hellwig
       Just output the '\0'-separated list of supported file systems for
       block devices directly, rather than going through a pointless round
       of string manipulation.
      
      Based on an earlier patch from Al Viro <viro@zeniv.linux.org.uk>.
      
      Vivek:
       Modified list_bdev_fs_names() and split_fs_names() to return the
       number of null-terminated strings to the caller.  Callers now use
       that information to loop through all the strings instead of relying
       on one extra null char being present at the end.
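       The change can be illustrated with a small userspace sketch of
       walking a '\0'-separated name list by count rather than by a
       trailing-empty-string sentinel; `count_matching` is an illustrative
       helper, not the kernel function:

       ```c
       #include <assert.h>
       #include <string.h>

       /* Walk a '\0'-separated list of filesystem names when the caller is
        * told how many entries there are, instead of relying on one extra
        * '\0' at the end as a sentinel. Returns how many entries equal
        * `want`. */
       int count_matching(const char *names, int count, const char *want)
       {
           int matches = 0;
           const char *p = names;
           for (int i = 0; i < count; i++) {
               if (strcmp(p, want) == 0)
                   matches++;
               p += strlen(p) + 1;  /* step past this name and its '\0' */
           }
           return matches;
       }
       ```

       Iterating by count is what lets the producer drop the extra
       terminating null char that callers previously depended on.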
       Signed-off-by: Christoph Hellwig <hch@lst.de>
       Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  16. 19 Aug 2021: 2 commits
  17. 17 Aug 2021: 1 commit
  18. 13 Aug 2021: 1 commit
  19. 11 Aug 2021: 1 commit
  20. 13 Jul 2021: 2 commits
    • mm: Add functions to lock invalidate_lock for two mappings · 7506ae6a
      Committed by Jan Kara
      Some operations such as reflinking blocks among files will need to lock
      invalidate_lock for two mappings. Add helper functions to do that.
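       One standard way such a helper avoids deadlock, sketched here in
       userspace C with mutexes standing in for invalidate_lock, is to
       always lock the lower-addressed mapping first so that concurrent
       lockers agree on the order. The names are illustrative, not the
       kernel helpers:

       ```c
       #include <pthread.h>

       /* Toy stand-in for an address_space holding an invalidate lock. */
       struct toy_mapping {
           pthread_mutex_t invalidate_lock;
       };

       /* Lock two mappings without deadlocking: sort by address so every
        * caller takes the locks in the same order, and take only one lock
        * when both arguments name the same mapping. */
       void lock_two_mappings(struct toy_mapping *a, struct toy_mapping *b)
       {
           if (a > b) { struct toy_mapping *t = a; a = b; b = t; }
           pthread_mutex_lock(&a->invalidate_lock);
           if (a != b)
               pthread_mutex_lock(&b->invalidate_lock);
       }

       void unlock_two_mappings(struct toy_mapping *a, struct toy_mapping *b)
       {
           pthread_mutex_unlock(&a->invalidate_lock);
           if (a != b)
               pthread_mutex_unlock(&b->invalidate_lock);
       }
       ```

       With this ordering rule, two reflink operations that each need the
       same pair of mappings can never hold one lock while waiting on the
       other in opposite order.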
       Reviewed-by: Darrick J. Wong <djwong@kernel.org>
       Reviewed-by: Christoph Hellwig <hch@lst.de>
       Signed-off-by: Jan Kara <jack@suse.cz>
    • mm: Protect operations adding pages to page cache with invalidate_lock · 730633f0
      Committed by Jan Kara
      Currently, serializing operations such as page fault, read, or readahead
       against hole punching is rather difficult. The basic race looks
       like this:
      
      fallocate(FALLOC_FL_PUNCH_HOLE)			read / fault / ..
        truncate_inode_pages_range()
      						  <create pages in page
      						   cache here>
        <update fs block mapping and free blocks>
      
       The problem is that this way, read / page fault / readahead can
       instantiate pages in the page cache with potentially stale data (if
       the blocks get quickly reused).  Avoiding this race is not simple:
       page locks do not work because we want to make sure there are *no*
       pages in the given range; inode->i_rwsem does not work because page
       faults happen under mmap_sem, which ranks below inode->i_rwsem, and
       using it for reads makes performance suffer for mixed read-write
       workloads.
      
      So create a new rw_semaphore in the address_space - invalidate_lock -
      that protects adding of pages to page cache for page faults / reads /
      readahead.
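       The locking discipline can be sketched with a userspace rwlock
       standing in for the new rw_semaphore; all names here are
       illustrative, not the kernel API:

       ```c
       #include <pthread.h>

       /* Toy address_space: a reader-writer lock guards page instantiation,
        * plus a page counter standing in for the cache contents. */
       typedef struct {
           pthread_rwlock_t invalidate_lock;
           int cached_pages;
       } toy_mapping_t;

       /* Page-cache fillers (read / fault / readahead) take the lock
        * shared, so they can run concurrently with each other. */
       void filler_add_page(toy_mapping_t *m)
       {
           pthread_rwlock_rdlock(&m->invalidate_lock);
           m->cached_pages++;               /* instantiate a cache page */
           pthread_rwlock_unlock(&m->invalidate_lock);
       }

       /* Hole punching takes the lock exclusive: with fillers excluded, no
        * page with stale data can appear between removing the pages and
        * freeing the underlying blocks. */
       void punch_hole(toy_mapping_t *m)
       {
           pthread_rwlock_wrlock(&m->invalidate_lock);
           m->cached_pages = 0;             /* truncate the page range */
           /* ...update the block mapping and free blocks here... */
           pthread_rwlock_unlock(&m->invalidate_lock);
       }
       ```

       Because reads only take the lock shared, the mixed read-write
       penalty of using an exclusive inode lock is avoided.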
       Reviewed-by: Darrick J. Wong <djwong@kernel.org>
       Reviewed-by: Christoph Hellwig <hch@lst.de>
       Signed-off-by: Jan Kara <jack@suse.cz>
  21. 30 Jun 2021: 2 commits
  22. 24 Jun 2021: 1 commit
  23. 04 Jun 2021: 1 commit
  24. 07 May 2021: 1 commit
    • drivers/char: remove /dev/kmem for good · bbcd53c9
      Committed by David Hildenbrand
      Patch series "drivers/char: remove /dev/kmem for good".
      
      Exploring /dev/kmem and /dev/mem in the context of memory hot(un)plug and
      memory ballooning, I started questioning the existence of /dev/kmem.
      
      Comparing it with the /proc/kcore implementation, it does not seem to be
      able to deal with things like
      
      a) Pages unmapped from the direct mapping (e.g., to be used by secretmem)
        -> kern_addr_valid(). virt_addr_valid() is not sufficient.
      
      b) Special cases like gart aperture memory that is not to be touched
        -> mem_pfn_is_ram()
      
      Unless I am missing something, it's at least broken in some cases and might
      fault/crash the machine.
      
       Its existence has been questioned before, in 2005 and 2010 [1];
       after ~11 additional years, it makes sense to revive the discussion.
      
      CONFIG_DEVKMEM is only enabled in a single defconfig (on purpose or by
      mistake?).  All distributions disable it: in Ubuntu it has been disabled
      for more than 10 years, in Debian since 2.6.31, in Fedora at least
      starting with FC3, in RHEL starting with RHEL4, in SUSE starting from
      15sp2, and OpenSUSE has it disabled as well.
      
      1) /dev/kmem was popular for rootkits [2] before it got disabled
         basically everywhere. Ubuntu documents [3] "There is no modern user of
         /dev/kmem any more beyond attackers using it to load kernel rootkits.".
         RHEL documents in a BZ [5] "it served no practical purpose other than to
         serve as a potential security problem or to enable binary module drivers
         to access structures/functions they shouldn't be touching"
      
       2) /proc/kcore is a decent interface for a controlled way to read
          kernel memory for debugging purposes (it will need some
          extensions to deal with memory offlining/unplug, memory
          ballooning, and poisoned pages, though).
      
       3) It might be useful for corner-case debugging [1]. KDB/KGDB might
          be a better fit, especially for writing random memory; it's
          harder to shoot yourself in the foot.
      
      4) "Kernel Memory Editor" [4] hasn't seen any updates since 2000 and seems
         to be incompatible with 64bit [1]. For educational purposes,
         /proc/kcore might be used to monitor value updates -- or older
         kernels can be used.
      
      5) It's broken on arm64, and therefore, completely disabled there.
      
      Looks like it's essentially unused and has been replaced by better
      suited interfaces for individual tasks (/proc/kcore, KDB/KGDB). Let's
      just remove it.
      
      [1] https://lwn.net/Articles/147901/
      [2] https://www.linuxjournal.com/article/10505
      [3] https://wiki.ubuntu.com/Security/Features#A.2Fdev.2Fkmem_disabled
      [4] https://sourceforge.net/projects/kme/
      [5] https://bugzilla.redhat.com/show_bug.cgi?id=154796
      
       Link: https://lkml.kernel.org/r/20210324102351.6932-1-david@redhat.com
       Link: https://lkml.kernel.org/r/20210324102351.6932-2-david@redhat.com
       Signed-off-by: David Hildenbrand <david@redhat.com>
       Acked-by: Michal Hocko <mhocko@suse.com>
       Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Alexander A. Klimov" <grandmaster@al2klimov.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Corentin Labbe <clabbe@baylibre.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Gregory Clement <gregory.clement@bootlin.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: James Troup <james.troup@canonical.com>
      Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
      Cc: Liviu Dudau <liviu.dudau@arm.com>
      Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: openrisc@lists.librecores.org
      Cc: Palmer Dabbelt <palmerdabbelt@google.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Pavel Machek (CIP)" <pavel@denx.de>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Pierre Morel <pmorel@linux.ibm.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
      Cc: sparclinux@vger.kernel.org
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Theodore Dubois <tblodt@icloud.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: William Cohen <wcohen@redhat.com>
      Cc: Xiaoming Ni <nixiaoming@huawei.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 06 May 2021: 1 commit
  26. 01 May 2021: 1 commit
    • mm: provide filemap_range_needs_writeback() helper · 63135aa3
      Committed by Jens Axboe
      Patch series "Improve IOCB_NOWAIT O_DIRECT reads", v3.
      
      An internal workload complained because it was using too much CPU, and
      when I took a look, we had a lot of io_uring workers going to town.
      
       For an async buffered-read-like workload, I would normally expect
       _zero_ offloads to a worker thread, but this one had tons of them.
       I'd drop caches and things would look good again, but then a minute
       later we'd
      regress back to using workers.  Turns out that every minute something
      was reading parts of the device, which would add page cache for that
      inode.  I put patches like these in for our kernel, and the problem was
      solved.
      
       Don't return -EAGAIN for IOCB_NOWAIT dio reads just because we have
       page cache entries for the given range.  That causes unnecessary
       work on the caller's side, when the IO could have been issued just
       fine without blocking on writeback when there is none.
      
      This patch (of 3):
      
      For O_DIRECT reads/writes, we check if we need to issue a call to
      filemap_write_and_wait_range() to issue and/or wait for writeback for any
       page in the given range.  The existing mechanism just checks for a
       page in the range, which is suboptimal for IOCB_NOWAIT, as we'll
       fall back to the slow path (and need to retry) if there's just a
       clean page cache page in the range.
      
      Provide filemap_range_needs_writeback() which tries a little harder to
      check if we actually need to issue and/or wait for writeback in the range.
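       The sharper check can be sketched in userspace C; the enum and
       helper below are illustrative stand-ins, not the kernel
       implementation:

       ```c
       #include <assert.h>
       #include <stdbool.h>

       /* Toy per-page state for a cached range. */
       enum toy_page_state { TOY_ABSENT, TOY_CLEAN, TOY_DIRTY, TOY_WRITEBACK };

       /* Instead of "is there any page in the range?", ask "is there any
        * page that is dirty or under writeback?". Only the latter should
        * force an IOCB_NOWAIT direct read onto the slow path. */
       bool range_needs_writeback(const enum toy_page_state *pages,
                                  int first, int last)
       {
           for (int i = first; i <= last; i++)
               if (pages[i] == TOY_DIRTY || pages[i] == TOY_WRITEBACK)
                   return true;
           return false;  /* clean or absent pages alone don't block NOWAIT */
       }
       ```

       A range full of clean cache pages now passes the check, so the
       NOWAIT read proceeds instead of being bounced for a pointless retry.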
      
       Link: https://lkml.kernel.org/r/20210224164455.1096727-1-axboe@kernel.dk
       Link: https://lkml.kernel.org/r/20210224164455.1096727-2-axboe@kernel.dk
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
       Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
       Reviewed-by: Jan Kara <jack@suse.cz>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  27. 23 Apr 2021: 1 commit