1. 27 9月, 2021 19 次提交
  2. 25 9月, 2021 8 次提交
  3. 24 9月, 2021 1 次提交
    • S
      memcg: flush lruvec stats in the refault · 1f828223
      Shakeel Butt 提交于
      Prior to the commit 7e1c0d6f ("memcg: switch lruvec stats to rstat")
      and the commit aa48e47e ("memcg: infrastructure to flush memcg
      stats"), each lruvec memcg stats can be off by (nr_cgroups * nr_cpus *
      32) at worst and for unbounded amount of time.  The commit aa48e47e
      moved the lruvec stats to rstat infrastructure and the commit
      7e1c0d6f bounded the error for all the lruvec stats to (nr_cpus *
      32) at worst for at most 2 seconds.  More specifically it decoupled the
      number of stats and the number of cgroups from the error rate.
      
      However this reduction in error comes with the cost of triggering the
      slowpath of stats update more frequently.  Previously in the slowpath
      the kernel adds the stats up the memcg tree.  After aa48e47e, the
      kernel triggers the asyn lruvec stats flush through queue_work().  This
      causes regression reports from 0day kernel bot [1] as well as from
      phoronix test suite [2].
      
      We tried two options to fix the regression:
      
       1) Increase the threshold to trigger the slowpath in lruvec stats
          update codepath from 32 to 512.
      
       2) Remove the slowpath from lruvec stats update codepath and instead
          flush the stats in the page refault codepath. The assumption is that
          the kernel timely flush the stats, so, the update tree would be
          small in the refault codepath to not cause the preformance impact.
      
      Following are the results of will-it-scale/page_fault[1|2|3] benchmark
      on four settings i.e.  (1) 5.15-rc1 as baseline (2) 5.15-rc1 with
      aa48e47e and 7e1c0d6f reverted (3) 5.15-rc1 with option-1
      (4) 5.15-rc1 with option-2.
      
        test       (1)      (2)               (3)               (4)
        pg_f1   368563   406277 (10.23%)   399693  (8.44%)   416398 (12.97%)
        pg_f2   338399   372133  (9.96%)   369180  (9.09%)   381024 (12.59%)
        pg_f3   500853   575399 (14.88%)   570388 (13.88%)   576083 (15.02%)
      
      From the above result, it seems like the option-2 not only solves the
      regression but also improves the performance for at least these
      benchmarks.
      
      Feng Tang (intel) ran the aim7 benchmark with these two options and
      confirms that option-1 reduces the regression but option-2 removes the
      regression.
      
      Michael Larabel (phoronix) ran multiple benchmarks with these options
      and reported the results at [3] and it shows for most benchmarks
      option-2 removes the regression introduced by the commit aa48e47e
      ("memcg: infrastructure to flush memcg stats").
      
      Based on the experiment results, this patch proposed the option-2 as the
      solution to resolve the regression.
      
      Link: https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020 [1]
      Link: https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress [2]
      Link: https://openbenchmarking.org/result/2109226-DEBU-LINUX5104 [3]
      Fixes: aa48e47e ("memcg: infrastructure to flush memcg stats")
      Signed-off-by: NShakeel Butt <shakeelb@google.com>
      Tested-by: NMichael Larabel <Michael@phoronix.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hillf Danton <hdanton@sina.com>,
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>,
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f828223
  4. 15 9月, 2021 1 次提交
    • L
      memblock: introduce saner 'memblock_free_ptr()' interface · 77e02cf5
      Linus Torvalds 提交于
      The boot-time allocation interface for memblock is a mess, with
      'memblock_alloc()' returning a virtual pointer, but then you are
      supposed to free it with 'memblock_free()' that takes a _physical_
      address.
      
      Not only is that all kinds of strange and illogical, but it actually
      causes bugs, when people then use it like a normal allocation function,
      and it fails spectacularly on a NULL pointer:
      
         https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/
      
      or just random memory corruption if the debug checks don't catch it:
      
         https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/
      
      I really don't want to apply patches that treat the symptoms, when the
      fundamental cause is this horribly confusing interface.
      
      I started out looking at just automating a sane replacement sequence,
      but because of this mix or virtual and physical addresses, and because
      people have used the "__pa()" macro that can take either a regular
      kernel pointer, or just the raw "unsigned long" address, it's all quite
      messy.
      
      So this just introduces a new saner interface for freeing a virtual
      address that was allocated using 'memblock_alloc()', and that was kept
      as a regular kernel pointer.  And then it converts a couple of users
      that are obvious and easy to test, including the 'xbc_nodes' case in
      lib/bootconfig.c that caused problems.
      Reported-by: Nkernel test robot <oliver.sang@intel.com>
      Fixes: 40caa127 ("init: bootconfig: Remove all bootconfig data when the init memory is removed")
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      77e02cf5
  5. 14 9月, 2021 1 次提交
  6. 13 9月, 2021 1 次提交
    • D
      afs: Fix mmap coherency vs 3rd-party changes · 6e0e99d5
      David Howells 提交于
      Fix the coherency management of mmap'd data such that 3rd-party changes
      become visible as soon as possible after the callback notification is
      delivered by the fileserver.  This is done by the following means:
      
       (1) When we break a callback on a vnode specified by the CB.CallBack call
           from the server, we queue a work item (vnode->cb_work) to go and
           clobber all the PTEs mapping to that inode.
      
           This causes the CPU to trip through the ->map_pages() and
           ->page_mkwrite() handlers if userspace attempts to access the page(s)
           again.
      
           (Ideally, this would be done in the service handler for CB.CallBack,
           but the server is waiting for our reply before considering, and we
           have a list of vnodes, all of which need breaking - and the process of
           getting the mmap_lock and stripping the PTEs on all CPUs could be
           quite slow.)
      
       (2) Call afs_validate() from the ->map_pages() handler to check to see if
           the file has changed and to get a new callback promise from the
           server.
      
      Also handle the fileserver telling us that it's dropping all callbacks,
      possibly after it's been restarted by sending us a CB.InitCallBackState*
      call by the following means:
      
       (3) Maintain a per-cell list of afs files that are currently mmap'd
           (cell->fs_open_mmaps).
      
       (4) Add a work item to each server that is invoked if there are any open
           mmaps when CB.InitCallBackState happens.  This work item goes through
           the aforementioned list and invokes the vnode->cb_work work item for
           each one that is currently using this server.
      
           This causes the PTEs to be cleared, causing ->map_pages() or
           ->page_mkwrite() to be called again, thereby calling afs_validate()
           again.
      
      I've chosen to simply strip the PTEs at the point of notification reception
      rather than invalidate all the pages as well because (a) it's faster, (b)
      we may get a notification for other reasons than the data being altered (in
      which case we don't want to clobber the pagecache) and (c) we need to ask
      the server to find out - and I don't want to wait for the reply before
      holding up userspace.
      
      This was tested using the attached test program:
      
      	#include <stdbool.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <fcntl.h>
      	#include <sys/mman.h>
      	int main(int argc, char *argv[])
      	{
      		size_t size = getpagesize();
      		unsigned char *p;
      		bool mod = (argc == 3);
      		int fd;
      		if (argc != 2 && argc != 3) {
      			fprintf(stderr, "Format: %s <file> [mod]\n", argv[0]);
      			exit(2);
      		}
      		fd = open(argv[1], mod ? O_RDWR : O_RDONLY);
      		if (fd < 0) {
      			perror(argv[1]);
      			exit(1);
      		}
      
      		p = mmap(NULL, size, mod ? PROT_READ|PROT_WRITE : PROT_READ,
      			 MAP_SHARED, fd, 0);
      		if (p == MAP_FAILED) {
      			perror("mmap");
      			exit(1);
      		}
      		for (;;) {
      			if (mod) {
      				p[0]++;
      				msync(p, size, MS_ASYNC);
      				fsync(fd);
      			}
      			printf("%02x", p[0]);
      			fflush(stdout);
      			sleep(1);
      		}
      	}
      
      It runs in two modes: in one mode, it mmaps a file, then sits in a loop
      reading the first byte, printing it and sleeping for a second; in the
      second mode it mmaps a file, then sits in a loop incrementing the first
      byte and flushing, then printing and sleeping.
      
      Two instances of this program can be run on different machines, one doing
      the reading and one doing the writing.  The reader should see the changes
      made by the writer, but without this patch, they aren't because validity
      checking is being done lazily - only on entry to the filesystem.
      
      Testing the InitCallBackState change is more complicated.  The server has
      to be taken offline, the saved callback state file removed and then the
      server restarted whilst the reading-mode program continues to run.  The
      client machine then has to poke the server to trigger the InitCallBackState
      call.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NMarkus Suvanto <markus.suvanto@gmail.com>
      cc: linux-afs@lists.infradead.org
      Link: https://lore.kernel.org/r/163111668833.283156.382633263709075739.stgit@warthog.procyon.org.uk/
      6e0e99d5
  7. 09 9月, 2021 9 次提交