1. 25 September 2019: 5 commits
  2. 31 August 2019: 1 commit
  3. 14 August 2019: 1 commit
    • mm, vmscan: do not special-case slab reclaim when watermarks are boosted · 28360f39
      Committed by Mel Gorman
      Dave Chinner reported a problem pointing a finger at commit 1c30844d
      ("mm: reclaim small amounts of memory when an external fragmentation
      event occurs").
      
      The report is extensive:
      
        https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/
      
      and it's worth recording the most relevant parts (colorful language and
      typos included).
      
      	When running a simple, steady state 4kB file creation test to
      	simulate extracting tarballs larger than memory full of small
      	files into the filesystem, I noticed that once memory fills up
      	the cache balance goes to hell.
      
      	The workload is creating one dirty cached inode for every dirty
      	page, both of which should require a single IO each to clean and
      	reclaim, and creation of inodes is throttled by the rate at which
      	dirty writeback runs at (via balance dirty pages). Hence the ingest
      	rate of new cached inodes and page cache pages is identical and
      	steady. As a result, memory reclaim should quickly find a steady
      	balance between page cache and inode caches.
      
      	The moment memory fills, the page cache is reclaimed at a much
      	faster rate than the inode cache, and evidence suggests that
      	the inode cache shrinker is not being called when large batches
      	of pages are being reclaimed. In roughly the same time period
      	that it takes to fill memory with 50% pages and 50% slab caches,
      	memory reclaim reduces the page cache down to just dirty pages
      	and slab caches fill the entirety of memory.
      
      	The LRU is largely full of dirty pages, and we're getting spikes
      	of random writeback from memory reclaim so it's all going to shit.
      	Behaviour never recovers, the page cache remains pinned at just
      	dirty pages, and nothing I could tune would make any difference.
      	vfs_cache_pressure makes no difference - I would set it so high
      	it should trim the entire inode caches in a single pass, yet it
      	didn't do anything. It was clear from tracing and live telemetry
      	that the shrinkers were pretty much not running except when
      	there was absolutely no memory free at all, and then they did
      	the minimum necessary to free memory to make progress.
      
      	So I went looking at the code, trying to find places where pages
      	got reclaimed and the shrinkers weren't called. There's only one
      	- kswapd doing boosted reclaim as per commit 1c30844d ("mm:
      	reclaim small amounts of memory when an external fragmentation
      	event occurs").
      
      The watermark boosting introduced by the commit is triggered in response
      to an allocation "fragmentation event".  The boosting was not intended
      to target THP specifically and triggers even if THP is disabled.
      However, with Dave's perfectly reasonable workload, fragmentation events
      can be very common given the ratio of slab to page cache allocations so
      boosting remains active for long periods of time.
      
      As high-order allocations might use compaction, and compaction cannot
      move slab pages, the decision was made in the commit to special-case
      kswapd when watermarks are boosted -- kswapd avoids reclaiming slab, as
      reclaiming slab does not directly help compaction.
      
      As Dave notes, this decision means that slab can be artificially
      protected for long periods of time and messes up the balance between
      slab and page caches.
      
      Removing the special casing can still indirectly help avoid
      fragmentation by avoiding fragmentation-causing events due to slab
      allocation as pages from a slab pageblock will have some slab objects
      freed.  Furthermore, with the special casing, reclaim behaviour is
      unpredictable as kswapd sometimes examines slab and sometimes does not
      in a manner that is tricky to tune or analyse.
      
      This patch removes the special casing.  The downside is that this is not
      a universal performance win.  Some benchmarks that depend on the
      residency of data when rereading metadata may see a regression when slab
      reclaim is restored to its original behaviour.  Similarly, some
      benchmarks that only read-once or write-once may perform better when
      page reclaim is too aggressive.  The primary upside is that the slab
      shrinker is less surprising (arguably more sane, but that's a matter of
      opinion), behaves consistently regardless of the fragmentation state of
      the system, and properly obeys VM sysctls.
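
      For reference, the shape of the change is roughly the following sketch
      (not the literal diff; the scan_control field name may_shrinkslab and the
      variable nr_boost_reclaim are assumptions based on the 1c30844d-era
      code):

      	/* Before: kswapd suppressed slab reclaim while the boost was active */
      	sc.may_shrinkslab = !nr_boost_reclaim;
      	...
      	if (sc->may_shrinkslab)
      		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);

      	/* After this patch the guard is gone and shrink_slab() runs
      	 * unconditionally, so slab is no longer special-cased. */
      	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);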
      
      A fsmark benchmark configuration was constructed similar to what Dave
      reported and is codified by the mmtest configuration
      config-io-fsmark-small-file-stream.  It was evaluated on a 1-socket
      machine to avoid dealing with NUMA-related issues and the timing of
      reclaim.  The storage was an SSD Samsung Evo and a fresh trimmed XFS
      filesystem was used for the test data.
      
      This is not an exact replication of Dave's setup.  The configuration
      scales its parameters depending on the memory size of the SUT to behave
      similarly across machines.  The parameters mean the first sample
      reported by fs_mark is using 50% of RAM which will barely be throttled
      and look like a big outlier.  Dave used fake NUMA to have multiple
      kswapd instances which I didn't replicate.  Finally, the number of
      iterations differ from Dave's test as the target disk was not large
      enough.  While not identical, it should be representative.
      
        fsmark
                                           5.3.0-rc3              5.3.0-rc3
                                             vanilla          shrinker-v1r1
        Min       1-files/sec     4444.80 (   0.00%)     4765.60 (   7.22%)
        1st-qrtle 1-files/sec     5005.10 (   0.00%)     5091.70 (   1.73%)
        2nd-qrtle 1-files/sec     4917.80 (   0.00%)     4855.60 (  -1.26%)
        3rd-qrtle 1-files/sec     4667.40 (   0.00%)     4831.20 (   3.51%)
        Max-1     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-5     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-10    1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-90    1-files/sec     4649.60 (   0.00%)     4780.70 (   2.82%)
        Max-95    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max-99    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max       1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Hmean     1-files/sec     5004.75 (   0.00%)     5075.96 (   1.42%)
        Stddev    1-files/sec     1778.70 (   0.00%)     1369.66 (  23.00%)
        CoeffVar  1-files/sec       33.70 (   0.00%)       26.05 (  22.71%)
        BHmean-99 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-95 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-90 1-files/sec     5107.05 (   0.00%)     5131.41 (   0.48%)
        BHmean-75 1-files/sec     5208.45 (   0.00%)     5206.68 (  -0.03%)
        BHmean-50 1-files/sec     5405.53 (   0.00%)     5381.62 (  -0.44%)
        BHmean-25 1-files/sec     6179.75 (   0.00%)     6095.14 (  -1.37%)
      
                           5.3.0-rc3      5.3.0-rc3
                             vanilla  shrinker-v1r1
        Duration User         501.82      497.29
        Duration System      4401.44     4424.08
        Duration Elapsed     8124.76     8358.05
      
      This shows a slight skew for the max result, which represents a large
      outlier; the 1st, 2nd and 3rd quartiles are similar, indicating that
      the bulk of the results show little difference.  Note that an earlier
      version of the fsmark configuration showed a regression, but that run
      included more samples taken while memory was still filling.
      
      Note that the elapsed time is higher.  Part of this is that the
      configuration includes the time to delete all the test files when the
      test completes -- the test automation handles the possibility of testing
      fsmark with multiple thread counts.  Without the patch, many of these
      objects would be memory resident, which is part of what the patch
      addresses.
      
      There are other important observations that justify the patch.
      
      1. With the vanilla kernel, the number of dirty pages in the system is
         very low for much of the test. With this patch, the dirty page count
         is generally kept at 10%, which matches vm.dirty_background_ratio and
         is the normal, historically expected behaviour.
      
      2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
         0.95 for much of the test i.e. Slab is being left alone and
         dominating memory consumption. With the patch applied, the ratio
         varies between 0.35 and 0.45 with the bulk of the measured ratios
         roughly half way between those values. This is a different balance
         from what Dave reported but it is at least consistent.
      
      3. Slabs are scanned throughout the entire test with the patch applied.
         The vanilla kernel has periods with no scan activity and then
         relatively massive spikes.
      
      4. Without the patch, kswapd scan rates are very variable. With the
         patch, the scan rates remain quite steady.
      
      5. Overall vmstats are closer to normal expectations:
      
      	                                5.3.0-rc3      5.3.0-rc3
      	                                  vanilla  shrinker-v1r1
          Ops Direct pages scanned             99388.00      328410.00
          Ops Kswapd pages scanned          45382917.00    33451026.00
          Ops Kswapd pages reclaimed        30869570.00    25239655.00
          Ops Direct pages reclaimed           74131.00        5830.00
          Ops Kswapd efficiency %                 68.02          75.45
          Ops Kswapd velocity                   5585.75        4002.25
          Ops Page reclaim immediate         1179721.00      430927.00
          Ops Slabs scanned                 62367361.00    73581394.00
          Ops Direct inode steals               2103.00        1002.00
          Ops Kswapd inode steals             570180.00     5183206.00
      
      	o The vanilla kernel is hitting direct reclaim more frequently;
      	  not much in absolute terms, but the fact that the patch
      	  reduces it is interesting
      	o "Page reclaim immediate" in the vanilla kernel indicates
      	  dirty pages are being encountered at the tail of the LRU.
      	  This is generally bad and means in this case that the LRU
      	  is not long enough for dirty pages to be cleaned by the
      	  background flush in time. This is much reduced by the
      	  patch.
      	o With the patch, kswapd is reclaiming 10 times more slab
      	  pages than with the vanilla kernel. This is indicative
      	  of the watermark boosting over-protecting slab.
      
      A more complete set of tests, part of the original basis for introducing
      boosting, was also run; while there are some differences, they are well
      within tolerances.
      
      Bottom line: special-casing kswapd to avoid slab reclaim makes behaviour
      unpredictable and can lead to abnormal results for normal workloads.
      
      This patch restores the expected behaviour that slab and page cache are
      balanced consistently for a workload with a steady allocation ratio of
      slab/pagecache pages.  It also means that workloads which favour
      preserving slab over pagecache can tune that via vm.vfs_cache_pressure,
      whereas the vanilla kernel effectively ignores the parameter while
      boosting is active.
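
      To see why vm.vfs_cache_pressure cannot help while the shrinkers are
      being skipped, recall how the dentry/inode shrinkers apply it; a minimal
      sketch of the vfs_pressure_ratio() scaling (the helper lives in
      include/linux/dcache.h, exact form may differ by version):

      	/* The dcache/icache shrinkers scale their reported counts by the sysctl */
      	static inline unsigned long vfs_pressure_ratio(unsigned long val)
      	{
      		return mult_frac(val, sysctl_vfs_cache_pressure, 100);
      	}

      If kswapd never calls shrink_slab(), this scaling never gets a chance to
      apply, which is exactly the behaviour restored by this patch.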
      
      Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
      Fixes: 1c30844d ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 03 August 2019: 1 commit
  5. 17 July 2019: 3 commits
  6. 13 July 2019: 3 commits
    • mm: vmscan: correct some vmscan counters for THP swapout · 98879b3b
      Committed by Yang Shi
      Commit bd4c82c2 ("mm, THP, swap: delay splitting THP after swapped
      out"), THP can be swapped out in a whole.  But, nr_reclaimed and some
      other vm counters still get inc'ed by one even though a whole THP (512
      pages) gets swapped out.
      
      This doesn't make too much sense to memory reclaim.
      
      For example, direct reclaim may just need reclaim SWAP_CLUSTER_MAX
      pages, reclaiming one THP could fulfill it.  But, if nr_reclaimed is not
      increased correctly, direct reclaim may just waste time to reclaim more
      pages, SWAP_CLUSTER_MAX * 512 pages in worst case.
      
      It may also cause pgsteal_{kswapd|direct} to be greater than
      pgscan_{kswapd|direct}, like the below:
      
      pgsteal_kswapd 122933
      pgsteal_direct 26600225
      pgscan_kswapd 174153
      pgscan_direct 14678312
      
      nr_reclaimed and nr_scanned must be fixed in parallel, otherwise it would
      break some page reclaim logic, e.g.:
      
      vmpressure: this looks at the scanned/reclaimed ratio so it won't change
      semantics as long as scanned & reclaimed are fixed in parallel.
      
      compaction/reclaim: compaction wants a certain number of physical pages
      freed up before going back to compacting.
      
      kswapd priority raising: kswapd raises priority if we scan fewer pages
      than the reclaim target (which itself is obviously expressed in order-0
      pages).  As a result, kswapd can falsely raise its aggressiveness even
      when it's making great progress.
      
      Other than nr_scanned and nr_reclaimed, some other counters, e.g.
      pgactivate, nr_skipped, nr_ref_keep and nr_unmap_fail, need to be fixed
      too, since they are user visible via cgroup, /proc/vmstat or trace points
      and would otherwise be under-reported.
      
      When isolating pages from the LRUs, nr_taken is already accounted in base
      pages, but nr_scanned and nr_skipped are still accounted in THPs.  That is
      inconsistent and may cause the trace points to under-report the numbers
      as well.
      
      So account those counters in base pages instead of counting each THP as
      one page.
      
      nr_dirty, nr_unqueued_dirty, nr_congested and nr_writeback are used by
      file cache, so they are not impacted by THP swap.
      
      This change may result in a lower steal/scan ratio in some cases, since a
      THP may get split during page reclaim and then only a part of the tail
      pages gets reclaimed instead of the whole 512 pages, while nr_scanned is
      accounted as 512, particularly for direct reclaim.  But this should not
      be a significant issue.
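
      Conceptually the fix is to bump each affected counter by the number of
      base pages a page represents rather than by one; a sketch, assuming the
      hpage_nr_pages() helper of that era:

      	/* shrink_page_list()-style accounting in base pages */
      	unsigned int nr_pages = hpage_nr_pages(page);	/* 512 for a THP, 1 otherwise */

      	sc->nr_scanned += nr_pages;		/* was: sc->nr_scanned++ */
      	/* ... try_to_unmap(), pageout(), etc ... */
      	nr_reclaimed += nr_pages;		/* was: nr_reclaimed++ */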
      
      Link: http://lkml.kernel.org/r/1559025859-72759-2-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned · af5d4403
      Committed by Yang Shi
      Commit 9092c71b ("mm: use sc->priority for slab shrink targets") has
      broken up the relationship between sc->nr_scanned and slab pressure.
      The sc->nr_scanned can't double slab pressure anymore.  So, it sounds no
      sense to still keep sc->nr_scanned inc'ed.  Actually, it would prevent
      from adding pressure on slab shrink since excessive sc->nr_scanned would
      prevent from scan->priority raise.
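
      For context, since 9092c71b the slab scan target is derived from the
      reclaim priority rather than from sc->nr_scanned, roughly as in this
      sketch of the do_shrink_slab() calculation (details vary by version),
      which is why the extra increment no longer translates into slab pressure:

      	/* do_shrink_slab() sizes the scan from priority, not nr_scanned */
      	freeable = shrinker->count_objects(shrinker, shrinkctl);
      	delta = freeable >> priority;	/* priority comes from sc->priority */
      	delta *= 4;
      	do_div(delta, shrinker->seeks);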
      
      The bonnie test does not show that this changes the behavior of slab
      shrinkers.
      
      				w/		w/o
      			  /sec    %CP      /sec      %CP
      Sequential delete: 	3960.6    94.6    3997.6     96.2
      Random delete: 		2518      63.8    2561.6     64.6
      
      The slight increase of "/sec" without the patch would be caused by the
      slight increase of CPU usage.
      
      Link: http://lkml.kernel.org/r/1559025859-72759-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: scan anonymous pages on file refaults · 2c012a4a
      Committed by Kuo-Hsin Yang
      When file refaults are detected and there are many inactive file pages,
      the system never reclaims anonymous pages; file pages are dropped
      aggressively even when there are still a lot of cold anonymous pages, and
      the system thrashes.  This issue impacts the performance of applications
      with large executables, e.g. chrome.
      
      With this patch, when a file refault is detected, inactive_list_is_low()
      always returns true for file pages in get_scan_count(), enabling the
      scanning of anonymous pages.
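
      A sketch of the mechanism (names taken from the vmscan code of that era;
      details may differ): when the file LRU shows recent refaults,
      inactive_list_is_low() reports the inactive file list as low, which stops
      get_scan_count() from skipping the anon lists.

      	/* Refault check in inactive_list_is_low() for the file LRU */
      	refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
      	if (file && lruvec->refaults != refaults)
      		inactive_ratio = 0;	/* "inactive is low" => anon gets scanned too */
      	else
      		inactive_ratio = int_sqrt(10 * gb);	/* usual size-based ratio */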
      
      The problem can be reproduced by the following test program.
      
      ---8<---
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <sys/time.h>
      #include <sys/types.h>
      #include <unistd.h>
      
      void fallocate_file(const char *filename, off_t size)
      {
      	struct stat st;
      	int fd;
      
      	if (!stat(filename, &st) && st.st_size >= size)
      		return;
      
      	fd = open(filename, O_WRONLY | O_CREAT, 0600);
      	if (fd < 0) {
      		perror("create file");
      		exit(1);
      	}
      	if (posix_fallocate(fd, 0, size)) {
      		perror("fallocate");
      		exit(1);
      	}
      	close(fd);
      }
      
      long *alloc_anon(long size)
      {
      	long *start = malloc(size);
      	memset(start, 1, size);
      	return start;
      }
      
      long access_file(const char *filename, long size, long rounds)
      {
      	int fd, i;
      	volatile char *start1, *end1, *start2;
      	const int page_size = getpagesize();
      	long sum = 0;
      
      	fd = open(filename, O_RDONLY);
      	if (fd == -1) {
      		perror("open");
      		exit(1);
      	}
      
      	/*
      	 * Some applications, e.g. chrome, use a lot of executable file
      	 * pages, map some of the pages with PROT_EXEC flag to simulate
      	 * the behavior.
      	 */
      	start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
      		      fd, 0);
      	if (start1 == MAP_FAILED) {
      		perror("mmap");
      		exit(1);
      	}
      	end1 = start1 + size / 2;
      
      	start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
      	if (start2 == MAP_FAILED) {
      		perror("mmap");
      		exit(1);
      	}
      
      	for (i = 0; i < rounds; ++i) {
      		struct timeval before, after;
      		volatile char *ptr1 = start1, *ptr2 = start2;
      		gettimeofday(&before, NULL);
      		for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
      			sum += *ptr1 + *ptr2;
      		gettimeofday(&after, NULL);
      		printf("File access time, round %d: %f (sec)
      ", i,
      		       (after.tv_sec - before.tv_sec) +
      		       (after.tv_usec - before.tv_usec) / 1000000.0);
      	}
      	return sum;
      }
      
      int main(int argc, char *argv[])
      {
      	const long MB = 1024 * 1024;
      	long anon_mb, file_mb, file_rounds;
      	const char filename[] = "large";
      	long *ret1;
      	long ret2;
      
      	if (argc != 4) {
      		printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS
      ");
      		exit(0);
      	}
      	anon_mb = atoi(argv[1]);
      	file_mb = atoi(argv[2]);
      	file_rounds = atoi(argv[3]);
      
      	fallocate_file(filename, file_mb * MB);
      	printf("Allocate %ld MB anonymous pages
      ", anon_mb);
      	ret1 = alloc_anon(anon_mb * MB);
      	printf("Access %ld MB file pages
      ", file_mb);
      	ret2 = access_file(filename, file_mb * MB, file_rounds);
      	printf("Print result to prevent optimization: %ld
      ",
      	       *ret1 + ret2);
      	return 0;
      }
      ---8<---
      
      Running the test program on a 2GB RAM VM with kernel 5.2.0-rc5, the
      program fills RAM with 2048 MB of anonymous memory and accesses a 200 MB
      file 10 times.  Without this patch, the file cache is dropped aggressively
      and every access to the file is served from disk.
      
        $ ./thrash 2048 200 10
        Allocate 2048 MB anonymous pages
        Access 200 MB file pages
        File access time, round 0: 2.489316 (sec)
        File access time, round 1: 2.581277 (sec)
        File access time, round 2: 2.487624 (sec)
        File access time, round 3: 2.449100 (sec)
        File access time, round 4: 2.420423 (sec)
        File access time, round 5: 2.343411 (sec)
        File access time, round 6: 2.454833 (sec)
        File access time, round 7: 2.483398 (sec)
        File access time, round 8: 2.572701 (sec)
        File access time, round 9: 2.493014 (sec)
      
      With this patch, these file pages can be cached.
      
        $ ./thrash 2048 200 10
        Allocate 2048 MB anonymous pages
        Access 200 MB file pages
        File access time, round 0: 2.475189 (sec)
        File access time, round 1: 2.440777 (sec)
        File access time, round 2: 2.411671 (sec)
        File access time, round 3: 1.955267 (sec)
        File access time, round 4: 0.029924 (sec)
        File access time, round 5: 0.000808 (sec)
        File access time, round 6: 0.000771 (sec)
        File access time, round 7: 0.000746 (sec)
        File access time, round 8: 0.000738 (sec)
        File access time, round 9: 0.000747 (sec)
      
      Checking the swap-out stats during the test [1]: 19006 pages were swapped
      out with this patch and 3418 pages without it.  There is more swap-out,
      but I think it is within a reasonable range when the file-backed data set
      does not fit into memory.
      
      $ ./thrash 2000 100 2100 5 1 # ANON_MB FILE_EXEC FILE_NOEXEC ROUNDS PROCESSES
      Allocate 2000 MB anonymous pages
      active_anon: 1613644, inactive_anon: 348656, active_file: 892, inactive_file: 1384 (kB)
      pswpout: 7972443, pgpgin: 478615246
      Access 100 MB executable file pages
      Access 2100 MB regular file pages
      File access time, round 0: 12.165, (sec)
      active_anon: 1433788, inactive_anon: 478116, active_file: 17896, inactive_file: 24328 (kB)
      File access time, round 1: 11.493, (sec)
      active_anon: 1430576, inactive_anon: 477144, active_file: 25440, inactive_file: 26172 (kB)
      File access time, round 2: 11.455, (sec)
      active_anon: 1427436, inactive_anon: 476060, active_file: 21112, inactive_file: 28808 (kB)
      File access time, round 3: 11.454, (sec)
      active_anon: 1420444, inactive_anon: 473632, active_file: 23216, inactive_file: 35036 (kB)
      File access time, round 4: 11.479, (sec)
      active_anon: 1413964, inactive_anon: 471460, active_file: 31728, inactive_file: 32224 (kB)
      pswpout: 7991449 (+ 19006), pgpgin: 489924366 (+ 11309120)
      
      With 4 processes accessing non-overlapping parts of a large file, 30316
      pages were swapped out with this patch and 5152 pages without it.  The
      swap-out number is small compared to pgpgin.
      
      [1]: https://github.com/vovo/testing/blob/master/mem_thrash.c
      
      Link: http://lkml.kernel.org/r/20190701081038.GA83398@google.com
      Fixes: e9868505 ("mm,vmscan: only evict file pages when we have plenty")
      Fixes: 7c5bd705 ("mm: memcg: only evict file pages when we have plenty")
      Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sonny Rao <sonnyrao@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 05 July 2019: 1 commit
    • mm/vmscan.c: prevent useless kswapd loops · dffcac2c
      Committed by Shakeel Butt
      In production we have noticed hard lockups on large machines running
      large jobs, due to kswapd hoarding the LRU lock within isolate_lru_pages()
      when sc->reclaim_idx is 0, which is a small zone.  The LRU was a couple
      hundred GiB and the condition (page_zonenum(page) > sc->reclaim_idx) in
      isolate_lru_pages() was basically skipping GiBs of pages while holding
      the LRU spinlock with interrupts disabled.
      
      On further inspection, it seems like there are two issues:
      
      (1) If kswapd on the return from balance_pgdat() could not sleep (i.e.
          node is still unbalanced), the classzone_idx is unintentionally set
          to 0 and the whole reclaim cycle of kswapd will try to reclaim only
          the lowest and smallest zone while traversing the whole memory.
      
      (2) Fundamentally isolate_lru_pages() is really bad when the
          allocation has woken kswapd for a smaller zone on a very large machine
          running very large jobs.  It can hoard the LRU spinlock while skipping
          over 100s of GiBs of pages.
      
      This patch only fixes (1).  (2) needs a more fundamental solution.  To
      fix (1): in the kswapd context, if pgdat->kswapd_classzone_idx is
      invalid, use the classzone_idx of the previous kswapd loop; otherwise use
      the one the waker has requested, as sketched below.
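
      Roughly, the fix takes this shape (a sketch; the helper name
      kswapd_classzone_idx() and MAX_NR_ZONES as the "invalid" marker are
      assumptions based on the code of that era):

      	/* Fall back to the previous loop's classzone_idx when no waker set one */
      	static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
      						   enum zone_type prev_classzone_idx)
      	{
      		if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
      			return prev_classzone_idx;
      		return pgdat->kswapd_classzone_idx;
      	}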
      
      Link: http://lkml.kernel.org/r/20190701201847.251028-1-shakeelb@google.com
      Fixes: e716f2eb ("mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 14 June 2019: 2 commits
  9. 15 May 2019: 10 commits
  10. 20 April 2019: 1 commit
  11. 09 April 2019: 1 commit
    • treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively · d75f773c
      Committed by Sakari Ailus
      %pF and %pf are functionally equivalent to %pS and %ps conversion
      specifiers. The former are deprecated, therefore switch the current users
      to use the preferred variant.
      
      The changes have been produced by the following command:
      
      	git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \
      	while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done
      
      And verifying the result.
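
      For reference, the two preferred specifiers differ only in whether the
      offset into the symbol is printed; an illustrative use (not taken from
      this patch, and assuming a work item pointer in scope):

      	pr_info("entered via %pS\n", (void *)_RET_IP_);	/* symbol+offset */
      	pr_info("work handler: %ps\n", work->func);	/* bare symbol name */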
      
      Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-um@lists.infradead.org
      Cc: xen-devel@lists.xenproject.org
      Cc: linux-acpi@vger.kernel.org
      Cc: linux-pm@vger.kernel.org
      Cc: drbd-dev@lists.linbit.com
      Cc: linux-block@vger.kernel.org
      Cc: linux-mmc@vger.kernel.org
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-pci@vger.kernel.org
      Cc: linux-scsi@vger.kernel.org
      Cc: linux-btrfs@vger.kernel.org
      Cc: linux-f2fs-devel@lists.sourceforge.net
      Cc: linux-mm@kvack.org
      Cc: ceph-devel@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
      Acked-by: David Sterba <dsterba@suse.com> (for btrfs)
      Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c)
      Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci)
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
  12. 06 March 2019: 7 commits
  13. 13 February 2019: 1 commit
    • Revert "mm: slowly shrink slabs with a relatively small number of objects" · a9a238e8
      Committed by Dave Chinner
      This reverts commit 172b06c3 ("mm: slowly shrink slabs with a
      relatively small number of objects").
      
      This change altered the aggressiveness of shrinker reclaim, causing small
      cache and low priority reclaim to greatly increase scanning pressure on
      small caches.  As a result, light memory pressure has a disproportionate
      effect on small caches, and causes large caches to be reclaimed much
      faster than previously.
      
      As a result, it greatly perturbs the delicate balance of the VFS caches
      (dentry/inode vs file page cache) such that the inode/dentry caches are
      reclaimed much, much faster than the page cache and this drives us into
      several other caching imbalance related problems.
      
      As such, this is a bad change and needs to be reverted.
      
      [ Needs some massaging to retain the later seekless shrinker
        modifications.]
      
      Link: http://lkml.kernel.org/r/20190130041707.27750-3-david@fromorbit.com
      Fixes: 172b06c3 ("mm: slowly shrink slabs with a relatively small number of objects")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Cc: Wolfgang Walter <linux@stwm.de>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Spock <dairinin@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 29 December 2018: 2 commits
    • mm: put_and_wait_on_page_locked() while page is migrated · 9a1ea439
      Committed by Hugh Dickins
      Waiting on a page migration entry has used wait_on_page_locked() all along
      since 2006: but you cannot safely wait_on_page_locked() without holding a
      reference to the page, and that extra reference is enough to make
      migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
      on the entry before migrate_page_move_mapping() gets there.
      
      And that failure is retried nine times, amplifying the pain when trying to
      migrate a popular page.  With a single persistent faulter, migration
      sometimes succeeds; with two or three concurrent faulters, success becomes
      much less likely (and the more the page was mapped, the worse the overhead
      of unmapping and remapping it on each try).
      
      This is especially a problem for memory offlining, where the outer level
      retries forever (or until terminated from userspace), because a heavy
      refault workload can trigger an endless loop of migration failures.
      wait_on_page_locked() is the wrong tool for the job.
      
      David Herrmann (but was he the first?) noticed this issue in 2014:
      https://marc.info/?l=linux-mm&m=140110465608116&w=2
      
      Tim Chen started a thread in August 2017 which appears relevant:
      https://marc.info/?l=linux-mm&m=150275941014915&w=2
      where Kan Liang went on to implicate __migration_entry_wait():
      https://marc.info/?l=linux-mm&m=150300268411980&w=2
      and the thread ended up with the v4.14 commits:
      2554db91 ("sched/wait: Break up long wake list walk")
      11a19c7b ("sched/wait: Introduce wakeup boomark in wake_up_page_bit")
      
      Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
      https://marc.info/?l=linux-mm&m=154217936431300&w=2
      
      We have all assumed that it is essential to hold a page reference while
      waiting on a page lock: partly to guarantee that there is still a struct
      page when MEMORY_HOTREMOVE is configured, but also to protect against
      reuse of the struct page going to someone who then holds the page locked
      indefinitely, when the waiter can reasonably expect timely unlocking.
      
      But in fact, so long as wait_on_page_bit_common() does the put_page(), and
      is careful not to rely on struct page contents thereafter, there is no
      need to hold a reference to the page while waiting on it.  That does mean
      that this case cannot go back through the loop: but that's fine for the
      page migration case, and even if used more widely, is limited by the "Stop
      walking if it's locked" optimization in wake_page_function().
      
      Add interface put_and_wait_on_page_locked() to do this, using "behavior"
      enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
      No interruptible or killable variant needed yet, but they might follow: I
      have a vague notion that reporting -EINTR should take precedence over
      return from wait_on_page_bit_common() without knowing the page state, so
      arrange it accordingly - but that may be nothing but pedantic.
      
      __migration_entry_wait() still has to take a brief reference to the page,
      prior to calling put_and_wait_on_page_locked(): but now that it is dropped
      before waiting, the chance of impeding page migration is very much
      reduced.  Should we perhaps disable preemption across this?
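
      The call site in __migration_entry_wait() then follows roughly this
      pattern (a sketch, not the exact code):

      	/* Take a brief page reference, drop the page table lock, then drop
      	 * the reference and wait for migration to unlock the page. */
      	page = migration_entry_to_page(entry);
      	if (!get_page_unless_zero(page))
      		goto out;
      	pte_unmap_unlock(ptep, ptl);
      	put_and_wait_on_page_locked(page);
      	return;
      out:
      	pte_unmap_unlock(ptep, ptl);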
      
      shrink_page_list()'s __ClearPageLocked(): that was a surprise!  This
      survived a lot of testing before that showed up.  PageWaiters may have
      been set by wait_on_page_bit_common(), and the reference dropped, just
      before shrink_page_list() succeeds in freezing its last page reference: in
      such a case, unlock_page() must be used.  Follow the suggestion from
      Michal Hocko, just revert a978d6f5 ("mm: unlockless reclaim") now:
      that optimization predates PageWaiters, and won't buy much these days; but
      we can reinstate it for the !PageWaiters case if anyone notices.
      
      It does raise the question: should vmscan.c's is_page_cache_freeable() and
      __remove_mapping() now treat a PageWaiters page as if an extra reference
      were held?  Perhaps, but I don't think it matters much, since
      shrink_page_list() already had to win its trylock_page(), so waiters are
      not very common there: I noticed no difference when trying the bigger
      change, and it's surely not needed while put_and_wait_on_page_locked() is
      only used for page migration.
      
      [willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Baoquan He <bhe@redhat.com>
      Tested-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: reclaim small amounts of memory when an external fragmentation event occurs · 1c30844d
      Committed by Mel Gorman
      An external fragmentation event was previously described as
      
          When the page allocator fragments memory, it records the event using
          the mm_page_alloc_extfrag event. If the fallback_order is smaller
          than a pageblock order (order-9 on 64-bit x86) then it's considered
          an event that will cause external fragmentation issues in the future.
      
      The kernel reduces the probability of such events by increasing the
      watermark sizes by calling set_recommended_min_free_kbytes early in the
      lifetime of the system.  This works reasonably well in general but if
      there are enough sparsely populated pageblocks then the problem can still
      occur as enough memory is free overall and kswapd stays asleep.
      
      This patch introduces a watermark_boost_factor sysctl that allows a zone
      watermark to be temporarily boosted when an external fragmentation-causing
      event occurs.  The boosting will stall allocations that would decrease
      free memory below the boosted low watermark, and kswapd is woken, if the
      calling context allows it, to reclaim an amount of memory relative to the
      size of the high watermark and the watermark_boost_factor until the boost is
      cleared.  When kswapd finishes, it wakes kcompactd at the pageblock order
      to clean some of the pageblocks that may have been affected by the
      fragmentation event.  kswapd avoids any writeback, slab shrinkage and swap
      from reclaim context during this operation to avoid excessive system
      disruption in the name of fragmentation avoidance.  Care is taken so that
      kswapd will do normal reclaim work if the system is really low on memory.
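
      The boost itself is a bounded bump of the zone watermark, roughly as in
      the sketch below (based on the boost_watermark() logic this patch adds;
      field names such as _watermark and watermark_boost are assumptions about
      this era of the code):

      	/* boost_watermark() caps the boost at a fraction of the high watermark */
      	max_boost = mult_frac(zone->_watermark[WMARK_HIGH],
      			      watermark_boost_factor, 10000);
      	max_boost = max(pageblock_nr_pages, max_boost);
      	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
      				    max_boost);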
      
      This was evaluated using the same workloads as "mm, page_alloc: Spread
      allocations across zones before introducing fragmentation".
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      4.20-rc3+patch1-4:                    18421 (98% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
      Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)
      
      Note that external fragmentation-causing events are massively reduced by
      this patch, whether in comparison to the previous kernel or to the vanilla
      kernel.  The fault latency for huge pages appears to be increased, but that
      is only because THP allocations were successful with the patch applied.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      4.20-rc3+patch1-4:                   13464 (95% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
      Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
      Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
      Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)
      
      As before, massive reduction in external fragmentation events, some jitter
      on latencies and an increase in THP allocation success rates.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      4.20-rc3+patch1-4:                   14263 (93% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
      Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)
      
      There is a 93% reduction in fragmentation causing events, there is a big
      reduction in the huge page fault latency and allocation success rate is
      higher.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      4.20-rc3+patch1-4:                  11095 (93% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
      Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)
      
      There is a large reduction in fragmentation events with some jitter around
      the latencies and success rates.  As before, the high THP allocation
      success rate does mean the system is under a lot of pressure.  However, as
      the fragmentation events are reduced, it would be expected that the
      long-term allocation success rate would be higher.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 07 November 2018: 1 commit