1. 07 Oct, 2021 2 commits
    • E
      coredump: Don't perform any cleanups before dumping core · 92307383
      Eric W. Biederman committed
      Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
      before PTRACE_EVENT_EXIT, and before any cleanup work for a task
      happens.  This ensures that an accurate copy of the process can be
      captured in the coredump as no cleanup for the process happens before
      the coredump completes.  This also ensures that PTRACE_EVENT_EXIT
      will not be visited by any thread until the coredump is complete.
      
      Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
      coredump_task_exit can be recognized and ignored in zap_process.
      
      Now that all of the coredumping happens before exit_mm, remove the code
      in mm_release that tests for a coredump in progress.
      
      Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
      The other tests in may_ptrace_stop all concern avoiding stopping
      during a coredump.  These tests are no longer necessary as it is now
      guaranteed that fatal_signal_pending will be set if the code enters
      ptrace_stop during a coredump.  The code in ptrace_stop is guaranteed
      not to stop if fatal_signal_pending returns true.
      
      Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
      ptrace_stop without fatal_signal_pending being true, as signals are
      dequeued in get_signal before calling do_exit.  This is no longer
      an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
      until after the coredump completes.
      
      Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      ptrace: Remove the unnecessary arguments from arch_ptrace_stop · 4f627af8
      Eric W. Biederman committed
      Both arch_ptrace_stop_needed and arch_ptrace_stop are called with an
      exit_code and a siginfo structure.  Neither argument is used by any of
      the implementations so just remove the unneeded arguments.
      
      The two architectures that implement arch_ptrace_stop are ia64 and
      sparc.  Both architectures flush their register stacks before a
      ptrace_stop so that all of the register information can be accessed
      by debuggers.
      
      Whether a register stack needs to be flushed is independent of why
      ptrace is stopping, so not needing arguments makes sense.
      
      Cc: David Miller <davem@davemloft.net>
      Cc: sparclinux@vger.kernel.org
      Link: https://lkml.kernel.org/r/87lf3mx290.fsf@disp2133
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  2. 11 Sep, 2021 2 commits
  3. 10 Sep, 2021 1 commit
  4. 09 Sep, 2021 30 commits
    • L
      mmap_lock: change trace and locking order · 10994316
      Liam Howlett committed
      Print to the trace log before releasing the lock to avoid racing with
      other trace log printers of the same lock type.
      
      Link: https://lkml.kernel.org/r/20210903022041.1843024-1-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michel Lespinasse <walken.cr@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      mm/hugetlb: initialize hugetlb_usage in mm_init · 13db8c50
      Liu Zixian committed
      After fork, the child process will get incorrect (2x) hugetlb_usage.  If
      a process uses 5 2MB hugetlb pages in an anonymous mapping,
      
      	HugetlbPages:	   10240 kB
      
      and then forks, the child will show,
      
      	HugetlbPages:	   20480 kB
      
      The amount is doubled because hugetlb_usage is copied from the parent
      and then increased again when the page tables are copied from parent to
      child, so the child reports 2x its actual usage.
      
      Fix this by adding hugetlb_count_init in mm_init.
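
The double counting can be reproduced with a toy model. The sketch below is not kernel code: `demo_mm`, `demo_fork` and the `with_fix` switch are hypothetical names, and the counter is kept in pages rather than kB. It only assumes what the message states, namely that the counter is copied with the parent's state and then incremented again while page tables are copied.

```c
/* Toy stand-in for the part of the mm state that matters here. */
struct demo_mm {
	long nr_huge_mapped;	/* hugetlb pages actually mapped */
	long hugetlb_usage;	/* counter reported as HugetlbPages */
};

/*
 * Model of fork: the child starts as a copy of the parent, so the
 * counter is inherited. Copying the page tables then counts every
 * mapped hugetlb page again. With with_fix set, the counter is reset
 * first, which is the effect the commit attributes to calling
 * hugetlb_count_init from mm_init.
 */
static struct demo_mm demo_fork(const struct demo_mm *parent, int with_fix)
{
	struct demo_mm child = *parent;
	long i;

	if (with_fix)
		child.hugetlb_usage = 0;

	for (i = 0; i < parent->nr_huge_mapped; i++)
		child.hugetlb_usage++;

	return child;
}
```

With five mapped pages, the unfixed model reports ten pages in the child (20480 kB at 2 MB per page) and the fixed model reports five (10240 kB), matching the figures in the message.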
      
      Link: https://lkml.kernel.org/r/20210826071742.877-1-liuzixian4@huawei.com
      Fixes: 5d317b2b ("mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status")
      Signed-off-by: Liu Zixian <liuzixian4@huawei.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • N
      compiler_attributes.h: move __compiletime_{error|warning} · b83a9084
      Nick Desaulniers committed
      Clang 14 will add support for __attribute__((__error__(""))) and
      __attribute__((__warning__(""))). To make use of these in
      __compiletime_error and __compiletime_warning (as used by BUILD_BUG and
      friends) for newer clang and detect/fallback for older versions of
      clang, move these to compiler_attributes.h and guard them with
      __has_attribute preprocessor guards.
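
A minimal sketch of the guard pattern described above. The `demo_` names are hypothetical and this is not the content of compiler_attributes.h; it only shows how `__has_attribute` lets one definition serve compilers with and without the `__error__` attribute.

```c
/* Compilers that predate __has_attribute get a conservative fallback. */
#ifndef __has_attribute
# define __has_attribute(x) 0
#endif

/* Use the attribute when the compiler advertises it, else expand to nothing. */
#if __has_attribute(__error__)
# define demo_compiletime_error(msg) __attribute__((__error__(msg)))
# define DEMO_HAS_ERROR_ATTR 1
#else
# define demo_compiletime_error(msg)
# define DEMO_HAS_ERROR_ATTR 0
#endif

/*
 * Declared but never defined or called in this sketch. On a compiler
 * that supports the attribute, a call that survives optimization would
 * stop the build with the given message.
 */
extern void demo_bad_size(void) demo_compiletime_error("unexpected size");

/* Lets callers see which branch of the guard was taken. */
static inline int demo_has_error_attr(void)
{
	return DEMO_HAS_ERROR_ATTR;
}
```

The same shape would apply to `__warning__`; only the attribute name in the guard and in the expansion changes.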
      
      Link: https://reviews.llvm.org/D106030
      Link: https://bugs.llvm.org/show_bug.cgi?id=16428
      Link: https://github.com/ClangBuiltLinux/linux/issues/1173
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      [Reworded, landed in Clang 14]
      Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
    • A
      arch: remove compat_alloc_user_space · a7a08b27
      Arnd Bergmann committed
      All users of compat_alloc_user_space() and copy_in_user() have been
      removed from the kernel; only a few functions in sparc remain, and these
      can be changed to call arch_copy_in_user() instead.
      
      Link: https://lkml.kernel.org/r/20210727144859.4150043-7-arnd@kernel.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      compat: remove some compat entry points · 59ab844e
      Arnd Bergmann committed
      These are all handled correctly when calling the native system call entry
      point, so remove the special cases.
      
      Link: https://lkml.kernel.org/r/20210727144859.4150043-6-arnd@kernel.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm: simplify compat numa syscalls · e130242d
      Arnd Bergmann committed
      The compat implementations for mbind, get_mempolicy, set_mempolicy and
      migrate_pages are just there to handle the subtly different layout of
      bitmaps on 32-bit hosts.
      
      The compat implementation however lacks some of the checks that are
      present in the native one, in particular for checking that the extra bits
      are all zero when user space has a larger mask size than the kernel.
      Worse, those extra bits do not get cleared when copying in or out of the
      kernel, which can lead to incorrect data as well.
      
      Unify the implementation to handle the compat bitmap layout directly in
      the get_nodes() and copy_nodes_to_user() helpers.  Splitting out the
      get_bitmap() helper from get_nodes() also helps readability of the native
      case.
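
To illustrate the layout difference: a 32-bit task hands over an array of 32-bit words, while a 64-bit kernel works on `unsigned long` words. The sketch below shows one way to convert between the two while leaving every bit beyond `nbits` cleared. `demo_compat_to_native` and `demo_test_bit` are hypothetical helpers written for this illustration, not the kernel's get_bitmap() or get_nodes().

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Copy the first nbits bits from an array of 32-bit words into an
 * array of native longs. dst must hold enough longs for nbits bits.
 * Every long that is touched is zeroed first, so bits at or beyond
 * nbits never leak through from the source or from old dst contents.
 */
static void demo_compat_to_native(unsigned long *dst, const uint32_t *src,
				  size_t nbits)
{
	size_t nlongs = (nbits + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG;
	size_t i, bit;

	for (i = 0; i < nlongs; i++)
		dst[i] = 0;

	for (bit = 0; bit < nbits; bit++) {
		if ((src[bit / 32] >> (bit % 32)) & 1u)
			dst[bit / DEMO_BITS_PER_LONG] |=
				1UL << (bit % DEMO_BITS_PER_LONG);
	}
}

/* Test one bit of a native bitmap. */
static int demo_test_bit(const unsigned long *map, size_t bit)
{
	return (int)((map[bit / DEMO_BITS_PER_LONG] >>
		      (bit % DEMO_BITS_PER_LONG)) & 1UL);
}
```

Copying bit by bit is slow but keeps the sketch independent of word size and byte order; a real implementation would work on whole words.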
      
      On x86, two additional problems are addressed by this: compat tasks can
      pass a bitmap at the end of a mapping, causing a fault when reading across
      the page boundary for a 64-bit word.  x32 tasks might also run into
      problems with get_mempolicy corrupting data when an odd number of 32-bit
      words gets passed.
      
      On parisc the migrate_pages() system call apparently had the wrong calling
      convention, as big-endian architectures expect the words inside of a
      bitmap to be swapped.  This is not a problem though since parisc has no
      NUMA support.
      
      [arnd@arndb.de: fix mempolicy crash]
        Link: https://lkml.kernel.org/r/20210730143417.3700653-1-arnd@kernel.org
        Link: https://lore.kernel.org/lkml/YQPLG20V3dmOfq3a@osiris/
      
      Link: https://lkml.kernel.org/r/20210727144859.4150043-5-arnd@kernel.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • T
      pid: cleanup the stale comment mentioning pidmap_init(). · 5b91a75b
      Takahiro Itazuri committed
      pidmap_init() has already been replaced with pid_idr_init() in commit
      95846ecf ("pid: replace pid bitmap implementation with IDR API").
      Clean up the stale comment which still mentions it.
      
      Link: https://lkml.kernel.org/r/20210714120713.19825-1-itazur@amazon.com
      Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
      Cc: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • N
      fs/epoll: use a per-cpu counter for user's watches count · 1e1c1583
      Nicholas Piggin committed
      This counter tracks the number of watches a user has, to compare against
      the 'max_user_watches' limit. This causes a scalability bottleneck on
      SPECjbb2015 on large systems as there is only one user. Changing to a
      per-cpu counter increases throughput of the benchmark by about 30% on a
      16-socket, > 1000 thread system.
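
The idea behind a per-cpu counter can be sketched in user space: each updater touches only its own slot, and a reader sums all slots when it needs the total. The code below is a rough stand-in using C11 atomics and a fixed slot array; `demo_sharded_counter` and its helpers are hypothetical and do not reflect the kernel's per-cpu counter API.

```c
#include <stdatomic.h>

#define DEMO_NR_SLOTS 8

/* One slot per (pretend) CPU, so updaters rarely share a cache line owner. */
struct demo_sharded_counter {
	atomic_long slot[DEMO_NR_SLOTS];
};

static void demo_counter_init(struct demo_sharded_counter *c)
{
	int i;

	for (i = 0; i < DEMO_NR_SLOTS; i++)
		atomic_init(&c->slot[i], 0);
}

/* Update only the slot that belongs to the calling CPU. */
static void demo_counter_add(struct demo_sharded_counter *c,
			     unsigned int cpu, long delta)
{
	atomic_fetch_add_explicit(&c->slot[cpu % DEMO_NR_SLOTS], delta,
				  memory_order_relaxed);
}

/* Sum every slot; the result is approximate while updates are in flight. */
static long demo_counter_sum(struct demo_sharded_counter *c)
{
	long sum = 0;
	int i;

	for (i = 0; i < DEMO_NR_SLOTS; i++)
		sum += atomic_load_explicit(&c->slot[i], memory_order_relaxed);
	return sum;
}
```

The trade-off is that reading the total costs a pass over all slots, which is acceptable when updates are far more frequent than limit checks that need an exact value.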
      
      [rdunlap@infradead.org: fix build errors in kernel/user.c when CONFIG_EPOLL=n]
      [npiggin@gmail.com: move ifdefs into wrapper functions, slightly improve panic message]
        Link: https://lkml.kernel.org/r/1628051945.fens3r99ox.astroid@bobo.none
      [akpm@linux-foundation.org: tweak user_epoll_alloc(), per Guenter]
        Link: https://lkml.kernel.org/r/20210804191421.GA1900577@roeck-us.net
      
      Link: https://lkml.kernel.org/r/20210802032013.2751916-1-npiggin@gmail.com
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reported-by: Anton Blanchard <anton@ozlabs.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      units: add the HZ macros · e2c77032
      Daniel Lezcano committed
      The macros for the unit conversion for frequency are duplicated in
      different places.
      
      Provide these macros in the 'units' header, so they can be reused.
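
A short sketch of what such shared conversion macros look like in use. The names HZ_PER_KHZ and HZ_PER_MHZ are the ones this patch series mentions; the values simply follow the SI prefixes, and the two `demo_` helpers are hypothetical, added only to exercise the macros.

```c
/* Frequency unit conversion factors, unsigned because Hz cannot be negative. */
#define HZ_PER_KHZ	1000UL
#define HZ_PER_MHZ	1000000UL

/* Convert a clock rate given in MHz to Hz. */
static inline unsigned long demo_mhz_to_hz(unsigned long mhz)
{
	return mhz * HZ_PER_MHZ;
}

/* Convert a clock rate given in Hz to whole kHz, rounding down. */
static inline unsigned long demo_hz_to_khz(unsigned long hz)
{
	return hz / HZ_PER_KHZ;
}
```

Keeping one definition in a shared header avoids drivers each carrying a private copy with a possibly different type suffix.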
      
      Link: https://lkml.kernel.org/r/20210816114732.1834145-3-daniel.lezcano@linaro.org
      Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Reviewed-by: Christian Eggers <ceggers@arri.de>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Chanwoo Choi <cw00.choi@samsung.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Jonathan Cameron <jic23@kernel.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Lars-Peter Clausen <lars@metafoo.de>
      Cc: Lukasz Luba <lukasz.luba@arm.com>
      Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: MyungJoo Ham <myungjoo.ham@samsung.com>
      Cc: Peter Meerwald <pmeerw@pmeerw.net>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      units: change from 'L' to 'UL' · c9221919
      Daniel Lezcano committed
      Patch series "Add Hz macros", v3.
      
      There are multiple definitions of HZ_PER_MHZ or HZ_PER_KHZ in
      different drivers.  Instead of duplicating this definition again and
      again, add one to the units.h header so it can be reused in all the
      places where the redefinition occurs.
      
      At the same time, change the type of the Watts, as they can not be
      negative.
      
      This patch (of 10):
      
      Users of the macros can safely take an unsigned value instead of a
      signed one, as the variables using them are themselves unsigned.
      
      Link: https://lkml.kernel.org/r/20210816114732.1834145-1-daniel.lezcano@linaro.org
      Link: https://lkml.kernel.org/r/20210816114732.1834145-2-daniel.lezcano@linaro.org
      Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Jonathan Cameron <jic23@kernel.org>
      Cc: Christian Eggers <ceggers@arri.de>
      Cc: Lukasz Luba <lukasz.luba@arm.com>
      Cc: MyungJoo Ham <myungjoo.ham@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Lars-Peter Clausen <lars@metafoo.de>
      Cc: Peter Meerwald <pmeerw@pmeerw.net>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Chanwoo Choi <cw00.choi@samsung.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mm/damon: implement a debugfs-based user space interface · 4bc05954
      SeongJae Park committed
      DAMON is designed to be used by kernel space code such as the memory
      management subsystems, and therefore it provides only kernel space API.
      That said, letting the user space control DAMON could provide some
      benefits to them.  For example, it will allow user space to analyze their
      specific workloads and make their own special optimizations.
      
      For such cases, this commit implements a simple DAMON application kernel
      module, namely 'damon-dbgfs', which merely wraps the DAMON API and exports
      it to user space via debugfs.
      
      'damon-dbgfs' exports three files, ``attrs``, ``target_ids``, and
      ``monitor_on`` under its debugfs directory, ``<debugfs>/damon/``.
      
      Attributes
      ----------
      
      Users can read and write the ``sampling interval``, ``aggregation
      interval``, ``regions update interval``, and min/max number of monitoring
      target regions by reading from and writing to the ``attrs`` file.  For
      example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10,
      1000 and check it again::
      
          # cd <debugfs>/damon
          # echo 5000 100000 1000000 10 1000 > attrs
          # cat attrs
          5000 100000 1000000 10 1000
      
      Target IDs
      ----------
      
      Some types of address spaces support multiple monitoring targets.  For
      example, virtual memory address space monitoring can have multiple
      processes as the monitoring targets.  Users can set the targets by writing
      the relevant id values of the targets to, and get the ids of the current
      targets by reading from, the ``target_ids`` file.  In the case of virtual
      address space monitoring, the values should be pids of the monitoring
      target processes.  For example, the commands below set the processes with
      pids 42 and 4242 as the monitoring targets and check them again::
      
          # cd <debugfs>/damon
          # echo 42 4242 > target_ids
          # cat target_ids
          42 4242
      
      Note that setting the target ids doesn't start the monitoring.
      
      Turning On/Off
      --------------
      
      Setting the files as described above has no effect unless you
      explicitly start the monitoring.  You can start, stop, and check the
      current status of the monitoring by writing to and reading from the
      ``monitor_on`` file.  Writing ``on`` to the file starts the monitoring of
      the targets with the attributes.  Writing ``off`` to the file stops it.
      DAMON also stops if all targets are invalidated (in the case of virtual
      memory monitoring, target processes are invalidated when terminated).
      The example commands below turn DAMON on and off, then check its status::
      
          # cd <debugfs>/damon
          # echo on > monitor_on
          # echo off > monitor_on
          # cat monitor_on
          off
      
      Please note that you cannot write to the above-mentioned debugfs files
      while the monitoring is turned on.  If you write to the files while DAMON
      is running, an error code such as ``-EBUSY`` will be returned.
      
      [akpm@linux-foundation.org: remove unneeded "alloc failed" printks]
      [akpm@linux-foundation.org: replace macro with static inline]
      
      Link: https://lkml.kernel.org/r/20210716081449.22187-8-sj38.park@gmail.com
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Reviewed-by: Leonard Foerster <foersleo@amazon.de>
      Reviewed-by: Fernand Sieber <sieberf@amazon.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Fan Du <fan.du@intel.com>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Maximilian Heyne <mheyne@amazon.de>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mm/damon: implement primitives for the virtual memory address spaces · 3f49584b
      SeongJae Park committed
      This commit introduces a reference implementation of the address space
      specific low level primitives for the virtual address space, so that users
      of DAMON can easily monitor the data accesses on virtual address spaces of
      specific processes by simply configuring the implementation to be used by
      DAMON.
      
      The low level primitives for the fundamental access monitoring are defined
      in two parts:
      
      1. Identification of the monitoring target address range for the address
         space.
      2. Access check of specific address range in the target space.
      
      The reference implementation for the virtual address space works as
      follows.
      
      PTE Accessed-bit Based Access Check
      -----------------------------------
      
      The implementation uses PTE Accessed-bit for basic access checks.  That
      is, it clears the bit for the next sampling target page and checks whether
      it is set again after one sampling period.  This could disturb the reclaim
      logic.  DAMON uses ``PG_idle`` and ``PG_young`` page flags to solve the
      conflict, as Idle page tracking does.
      
      VMA-based Target Address Range Construction
      -------------------------------------------
      
      Only small parts of the super-huge virtual address space of a process
      are mapped to physical memory and accessed.  Thus, tracking the unmapped
      address regions is just wasteful.  However, because DAMON can deal with
      some level of noise using the adaptive regions adjustment mechanism,
      tracking every mapping is not strictly required and could even incur a
      high overhead in some cases.  That said, excessively large unmapped areas
      inside the monitoring target should be removed so that the adaptive
      mechanism does not waste time on them.
      
      For this reason, this implementation converts the complex mappings to three
      distinct regions that cover every mapped area of the address space.  The
      two gaps between the three regions are the two biggest unmapped areas
      in the given address space.  In most cases, these would be the
      gap between the heap and the uppermost mmap()-ed region, and the gap
      between the lowermost mmap()-ed region and the stack.
      Because these gaps are exceptionally huge in usual address spaces,
      excluding them is sufficient to make a reasonable trade-off.  The
      layout below shows this in detail::
      
          <heap>
          <BIG UNMAPPED REGION 1>
          <uppermost mmap()-ed region>
          (small mmap()-ed regions and munmap()-ed regions)
          <lowermost mmap()-ed region>
          <BIG UNMAPPED REGION 2>
          <stack>
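
The construction above can be sketched as a search for the two largest gaps between sorted mappings. This is an illustrative user-space version, not the code in mm/damon/vaddr.c; `demo_range` and `demo_three_regions` are hypothetical names, and the input is assumed to be sorted by address and non-overlapping.

```c
#include <stddef.h>

struct demo_range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/*
 * Find the two biggest gaps between consecutive mappings and return
 * the three regions they separate. Gap index i means the gap between
 * maps[i - 1] and maps[i]; index 0 is used as "not found".
 * Returns 0 on success, -1 if fewer than two non-empty gaps exist.
 */
static int demo_three_regions(const struct demo_range *maps, size_t n,
			      struct demo_range out[3])
{
	size_t first = 0, second = 0, i, tmp;
	unsigned long first_sz = 0, second_sz = 0;

	for (i = 1; i < n; i++) {
		unsigned long gap = maps[i].start - maps[i - 1].end;

		if (gap > first_sz) {
			second = first;
			second_sz = first_sz;
			first = i;
			first_sz = gap;
		} else if (gap > second_sz) {
			second = i;
			second_sz = gap;
		}
	}
	if (first == 0 || second == 0)
		return -1;

	/* Order the two gaps by address so the regions come out sorted. */
	if (first > second) {
		tmp = first;
		first = second;
		second = tmp;
	}

	out[0].start = maps[0].start;
	out[0].end = maps[first - 1].end;
	out[1].start = maps[first].start;
	out[1].end = maps[second - 1].end;
	out[2].start = maps[second].start;
	out[2].end = maps[n - 1].end;
	return 0;
}
```

Small gaps inside each of the three regions are deliberately kept; per the text above, the adaptive regions adjustment is expected to absorb that noise.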
      
      [akpm@linux-foundation.org: mm/damon/vaddr.c needs highmem.h for kunmap_atomic()]
      [sjpark@amazon.de: remove unnecessary PAGE_EXTENSION setup]
        Link: https://lkml.kernel.org/r/20210806095153.6444-2-sj38.park@gmail.com
      [sjpark@amazon.de: safely walk page table]
        Link: https://lkml.kernel.org/r/20210831161800.29419-1-sj38.park@gmail.com
      
      Link: https://lkml.kernel.org/r/20210716081449.22187-6-sj38.park@gmail.com
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Reviewed-by: Leonard Foerster <foersleo@amazon.de>
      Reviewed-by: Fernand Sieber <sieberf@amazon.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Fan Du <fan.du@intel.com>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Maximilian Heyne <mheyne@amazon.de>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mm/idle_page_tracking: make PG_idle reusable · 1c676e0d
      SeongJae Park committed
      PG_idle and PG_young allow the two users of the PTE Accessed bit, Idle
      Page Tracking and the reclaim logic, to work concurrently without
      interfering with each other.  That is, when either of them needs to clear
      the Accessed bit, it sets PG_young to record the previous state of the
      bit.  And when either needs to read the bit and finds it cleared, it
      further reads PG_young to know whether the other has cleared the bit in
      the meantime.
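
The handshake described above can be modelled with two booleans. This is a deliberately simplified toy, assuming only what the paragraph states; `demo_page` and both helpers are hypothetical and ignore PG_idle, locking, and the real page table walk.

```c
#include <stdbool.h>

struct demo_page {
	bool pte_accessed;	/* stands in for the hardware Accessed bit */
	bool pg_young;		/* stands in for the PG_young page flag */
};

/* One user clears the Accessed bit but records that it had been set. */
static void demo_clear_accessed(struct demo_page *p)
{
	if (p->pte_accessed)
		p->pg_young = true;
	p->pte_accessed = false;
}

/*
 * The other user reads the bit and, if it is clear, falls back to
 * PG_young to learn whether it was cleared behind its back. Both
 * markers are consumed by the read.
 */
static bool demo_test_and_clear_referenced(struct demo_page *p)
{
	bool referenced = p->pte_accessed || p->pg_young;

	p->pte_accessed = false;
	p->pg_young = false;
	return referenced;
}
```

Without the second flag, the reader in this model would see a cleared bit and wrongly conclude that the page had not been touched.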
      
      For yet another user of the PTE Accessed bit, we could add another page
      flag, or extend the mechanism to use the flags.  For the DAMON usecase,
      however, we don't need to do that just yet.  IDLE_PAGE_TRACKING and DAMON
      are mutually exclusive, so there's only ever going to be one user of the
      current set of flags.
      
      In this commit, we split out the CONFIG options to allow for the use of
      PG_young and PG_idle outside of idle page tracking.
      
      In the next commit, DAMON's reference implementation of the virtual memory
      address space monitoring primitives will use it.
      
      [sjpark@amazon.de: set PAGE_EXTENSION for non-64BIT]
        Link: https://lkml.kernel.org/r/20210806095153.6444-1-sj38.park@gmail.com
      [akpm@linux-foundation.org: tweak Kconfig text]
      [sjpark@amazon.de: hide PAGE_IDLE_FLAG from users]
        Link: https://lkml.kernel.org/r/20210813081238.34705-1-sj38.park@gmail.com
      
      Link: https://lkml.kernel.org/r/20210716081449.22187-5-sj38.park@gmail.com
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Fernand Sieber <sieberf@amazon.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Fan Du <fan.du@intel.com>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Maximilian Heyne <mheyne@amazon.de>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mm/damon: adaptively adjust regions · b9a6ac4e
      SeongJae Park committed
      Even if the initial monitoring target regions are well constructed to
      fulfill the assumption (pages in the same region have similar access
      frequencies), the data access pattern can change dynamically.  This
      will result in low monitoring quality.  To keep the assumption as much as
      possible, DAMON adaptively merges and splits each region based on its
      access frequency.
      
      For each ``aggregation interval``, it compares the access frequencies of
      adjacent regions and merges those if the frequency difference is small.
      Then, after it reports and clears the aggregated access frequency of each
      region, it splits each region into two or three regions if the total
      number of regions will not exceed the user-specified maximum number of
      regions after the split.
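
The merge half of this step can be sketched as a single pass over a sorted array of regions. The code is an illustration only: `demo_region`, `demo_merge_regions` and the `threshold` parameter are hypothetical, and the size-weighted average used for the merged access count is an assumption of this sketch rather than something the message specifies. The split step is not shown.

```c
#include <stddef.h>

struct demo_region {
	unsigned long start;		/* inclusive */
	unsigned long end;		/* exclusive, end > start */
	unsigned int nr_accesses;	/* aggregated access count */
};

/*
 * Merge adjacent regions in place when their access counts differ by
 * at most threshold. Regions must be sorted by address and non-empty.
 * Returns the number of regions left in r[].
 */
static size_t demo_merge_regions(struct demo_region *r, size_t n,
				 unsigned int threshold)
{
	size_t out = 0, i;

	if (n == 0)
		return 0;

	for (i = 1; i < n; i++) {
		struct demo_region *prev = &r[out];
		unsigned int diff = prev->nr_accesses > r[i].nr_accesses ?
				    prev->nr_accesses - r[i].nr_accesses :
				    r[i].nr_accesses - prev->nr_accesses;

		if (prev->end == r[i].start && diff <= threshold) {
			unsigned long sz_prev = prev->end - prev->start;
			unsigned long sz_cur = r[i].end - r[i].start;

			/* Assumed policy: weight each count by region size. */
			prev->nr_accesses = (unsigned int)
				((prev->nr_accesses * sz_prev +
				  r[i].nr_accesses * sz_cur) /
				 (sz_prev + sz_cur));
			prev->end = r[i].end;
		} else {
			r[++out] = r[i];
		}
	}
	return out + 1;
}
```

Merging keeps the region count, and with it the per-interval checking cost, from growing when neighbouring regions behave alike.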
      
      In this way, DAMON provides its best-effort quality and minimal overhead
      while keeping the upper-bound overhead that users set.
      
      Link: https://lkml.kernel.org/r/20210716081449.22187-4-sj38.park@gmail.com
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Reviewed-by: Leonard Foerster <foersleo@amazon.de>
      Reviewed-by: Fernand Sieber <sieberf@amazon.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Fan Du <fan.du@intel.com>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Maximilian Heyne <mheyne@amazon.de>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9a6ac4e
    • S
      mm/damon/core: implement region-based sampling · f23b8eee
      SeongJae Park committed
      To avoid the unbounded increase of the overhead, DAMON groups adjacent
      pages that are assumed to have the same access frequencies into a
      region.  As long as the assumption (pages in a region have the same
      access frequencies) is kept, only one page in the region is required to
      be checked.  Thus, for each ``sampling interval``,
      
       1. the 'prepare_access_checks' primitive picks one page in each region,
       2. waits for one ``sampling interval``,
       3. checks whether the page is accessed meanwhile, and
       4. increases the access count of the region if so.
      
      Therefore, the monitoring overhead is controllable by adjusting the
      number of regions.  DAMON allows both the underlying primitives and user
      callbacks to adjust regions for the trade-off.  In other words, this
      commit makes DAMON use not only time-based sampling but also
      space-based sampling.
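      
      The four sampling steps above can be sketched as below.  This is only an
      illustration under assumptions: regions are plain ``(start, end)`` tuples,
      and the ``was_accessed`` and ``wait`` callbacks are hypothetical stand-ins
      for the real access check primitive and the sleep for one ``sampling
      interval``.  The Python function named ``prepare_access_checks`` merely
      mirrors the primitive's name and is not its kernel implementation.

```python
import random

PAGE_SIZE = 4096


def prepare_access_checks(regions, rng):
    """Step 1: pick one page-aligned sampling address in each region."""
    picks = []
    for start, end in regions:
        nr_pages = (end - start) // PAGE_SIZE
        picks.append(start + rng.randrange(nr_pages) * PAGE_SIZE)
    return picks


def check_accesses(picks, nr_accesses, was_accessed):
    """Steps 3 and 4: bump a region's counter if its sampled page was accessed."""
    for idx, addr in enumerate(picks):
        if was_accessed(addr):
            nr_accesses[idx] += 1


def sample_once(regions, nr_accesses, was_accessed, rng, wait):
    """Run one sampling interval over all regions."""
    picks = prepare_access_checks(regions, rng)
    wait()  # step 2: wait for one sampling interval
    check_accesses(picks, nr_accesses, was_accessed)
```

      The cost of ``sample_once`` depends on the number of regions and not on
      the number of pages, which is why adjusting the regions controls the
      overhead.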
      
      This scheme, however, cannot preserve the quality of the output if the
      assumption does not hold.  The next commit will address this problem.
      
      Link: https://lkml.kernel.org/r/20210716081449.22187-3-sj38.park@gmail.com
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Reviewed-by: Leonard Foerster <foersleo@amazon.de>
      Reviewed-by: Fernand Sieber <sieberf@amazon.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Fan Du <fan.du@intel.com>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Maximilian Heyne <mheyne@amazon.de>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f23b8eee
    • S
      mm: introduce Data Access MONitor (DAMON) · 2224d848
      SeongJae Park committed
      Patch series "Introduce Data Access MONitor (DAMON)", v34.
      
      Introduction
      ============
      
      DAMON is a data access monitoring framework for the Linux kernel.  The
      core mechanisms of DAMON called 'region based sampling' and 'adaptive
      regions adjustment' (refer to 'mechanisms.rst' in the 11th patch of this
      patchset for the detail) make it
      
      - accurate (The monitored information is useful for DRAM level memory
        management.  It might not be appropriate for cache-level accuracy,
        though.),
      
      - light-weight (The monitoring overhead is low enough to be applied
        online while making no impact on the performance of the target
        workloads.), and
      
      - scalable (the upper-bound of the instrumentation overhead is
        controllable regardless of the size of target workloads.).
      
      Using this framework, therefore, several memory management mechanisms such
      as reclamation and THP can be optimized to be aware of real data access
      patterns.  Experimental access pattern aware memory management
      optimization works that incurred high instrumentation overhead will be
      able to have another try.
      
      Though DAMON is for kernel subsystems, it can be easily exposed to the
      user space by writing a DAMON-wrapper kernel subsystem.  Then, user space
      users who have some special workloads will be able to write personalized
      tools or applications for deeper understanding and specialized
      optimizations of their systems.
      
      DAMON is also merged in two public Amazon Linux kernel trees that are based
      on v5.4.y[1] and v5.10.y[2].
      
      [1] https://github.com/amazonlinux/linux/tree/amazon-5.4.y/master/mm/damon
      [2] https://github.com/amazonlinux/linux/tree/amazon-5.10.y/master/mm/damon
      
      The userspace tool[1] is available, released under GPLv2, and actively
      being maintained.  I am also planning to implement another basic user
      interface in perf[2].  Also, the basic test suite for DAMON is available
      under GPLv2[3].
      
      [1] https://github.com/awslabs/damo
      [2] https://lore.kernel.org/linux-mm/20210107120729.22328-1-sjpark@amazon.com/
      [3] https://github.com/awslabs/damon-tests
      
      Long-term Plan
      --------------
      
      DAMON is a part of a project called Data Access-aware Operating System
      (DAOS).  As the name implies, I want to improve the performance and
      efficiency of systems using fine-grained data access patterns.  The
      optimizations are for both kernel and user spaces.  I will therefore
      modify or create kernel subsystems, export some of those to user space and
      implement user space library / tools.  Below shows the layers and
      components for the project.
      
          ---------------------------------------------------------------------------
          Primitives:     PTE Accessed bit, PG_idle, rmap, (Intel CMT), ...
          Framework:      DAMON
          Features:       DAMOS, virtual addr, physical addr, ...
          Applications:   DAMON-debugfs, (DARC), ...
          ^^^^^^^^^^^^^^^^^^^^^^^    KERNEL SPACE    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      
          Raw Interface:  debugfs, (sysfs), (damonfs), tracepoints, (sys_damon), ...
      
          vvvvvvvvvvvvvvvvvvvvvvv    USER SPACE      vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
          Library:        (libdamon), ...
          Tools:          DAMO, (perf), ...
          ---------------------------------------------------------------------------
      
      The components in parentheses or marked as '...' are not implemented yet
      but in the future plan.  IOW, those are the TODO tasks of DAOS project.
      For more detail, please refer to the plans:
      https://lore.kernel.org/linux-mm/20201202082731.24828-1-sjpark@amazon.com/
      
      Evaluations
      ===========
      
      We evaluated DAMON's overhead, monitoring quality and usefulness using 24
      realistic workloads on my QEMU/KVM based virtual machine running a kernel
      to which the v24 DAMON patchset is applied.
      
      DAMON is lightweight.  It increases system memory usage by 0.39% and slows
      target workloads down by 1.16%.
      
      DAMON is accurate and useful for memory management optimizations.  An
      experimental DAMON-based operation scheme for THP, namely 'ethp', removes
      76.15% of THP memory overheads while preserving 51.25% of THP speedup.
      Another experimental DAMON-based 'proactive reclamation' implementation,
      'prcl', reduces 93.38% of resident sets and 23.63% of system memory
      footprint while incurring only 1.22% runtime overhead in the best case
      (parsec3/freqmine).
      
      NOTE that the experimental THP optimization and proactive reclamation are
      not for production but only for proof of concept.
      
      Please refer to the official document[1] or "Documentation/admin-guide/mm:
      Add a document for DAMON" patch in this patchset for detailed evaluation
      setup and results.
      
      [1] https://damonitor.github.io/doc/html/latest-damon/admin-guide/mm/damon/eval.html
      
      Real-world User Story
      =====================
      
      In summary, DAMON has been used on production systems and has proved its
      usefulness.
      
      DAMON as a profiler
      -------------------
      
      We analyzed characteristics of large-scale production systems of our
      customers using DAMON.  The systems utilize 70GB DRAM and 36 CPUs.  From
      this, we were able to find the interesting things below.
      
      There were obviously different access patterns under the idle workload and
      the active workload.  Under the idle workload, the system accessed large
      memory regions with low frequency, while under the active workload it
      accessed small memory regions with high frequency.
      
      DAMON found a 7GB memory region that showed obviously high access
      frequency under the active workload.  We believe this is the
      performance-effective working set and needs to be protected.
      
      There was a 4KB memory region that showed the highest access frequency
      under not only the active but also the idle workload.  We think this must
      be something like the hottest code section that should never be paged out.
      
      For this analysis, DAMON used only 0.3-1% of a single CPU's time.  Because
      we used recording-based analysis, it consumed about 3-12 MB of disk space
      per 20 minutes.  This is only a small amount of disk space, but we can
      further reduce the disk usage by using non-recording-based DAMON features.
      I'd like to argue that only DAMON can do such detailed analysis (finding
      the hottest 4KB region in 70GB memory) with such light overhead.
      
      DAMON as a system optimization tool
      -----------------------------------
      
      We also found the potential performance problems below on the systems and
      made DAMON-based solutions.
      
      The system doesn't want to make the workload suffer from page
      reclamation and thus it utilizes enough DRAM but no swap device.  However,
      we found the system is actively reclaiming file-backed pages, because the
      system has intensive file IO.  The file IO turned out not to be
      performance critical for the workload, but the customer wanted to ensure
      that performance critical file-backed pages, like the code section, are
      not mistakenly evicted.
      
      Using direct IO or `mlock()` would be a straightforward solution,
      but modifying the user space code is not easy for the customer.
      Alternatively, we could use a DAMON-based operation scheme[1].  By using it,
      we can ask DAMON to track the access frequency of each region and make a
      'process_madvise(MADV_WILLNEED)[2]' call for regions having a specific size
      and access frequency for a time interval.
      
      We also found the system has a high number of TLB misses.  We tried the
      'always' THP enabled policy and it greatly reduced TLB misses, but
      page reclamation also became more frequent due to the memory bloat caused
      by THP internal fragmentation.  We could try another DAMON-based
      operation scheme that applies 'MADV_HUGEPAGE' to memory regions having
      >=2MB size and high access frequency, while applying 'MADV_NOHUGEPAGE' to
      regions having <2MB size and low access frequency.
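      
      The decision rule of such a THP scheme could look roughly like the sketch
      below.  The text does not define what counts as "high" or "low" access
      frequency, so the ``hot_threshold`` and ``cold_threshold`` parameters, the
      ``Advice`` enum and the ``thp_advice`` function are hypothetical names
      introduced only for this example.  The sketch returns the advice as a
      label and does not call madvise itself.

```python
from enum import Enum

HUGE_PAGE_SIZE = 2 * 1024 * 1024  # the 2MB boundary named in the text


class Advice(Enum):
    HUGEPAGE = "MADV_HUGEPAGE"
    NOHUGEPAGE = "MADV_NOHUGEPAGE"
    NONE = "no action"


def thp_advice(size, nr_accesses, hot_threshold, cold_threshold):
    """Choose the advice for one region from its size and access count."""
    if size >= HUGE_PAGE_SIZE and nr_accesses >= hot_threshold:
        return Advice.HUGEPAGE
    if size < HUGE_PAGE_SIZE and nr_accesses <= cold_threshold:
        return Advice.NOHUGEPAGE
    # Large-but-cold and small-but-hot regions are left untouched.
    return Advice.NONE
```

      Leaving the remaining regions untouched follows the text, which names
      only the two combinations of size and access frequency.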
      
      We do not own the systems so we only reported the analysis results and
      possible optimization solutions to the customers.  The customers were
      satisfied with the analysis results and promised to try the optimization
      guides.
      
      [1] https://lore.kernel.org/linux-mm/20201006123931.5847-1-sjpark@amazon.com/
      [2] https://lore.kernel.org/linux-api/20200622192900.22757-4-minchan@kernel.org/
      
      Comparison with Idle Page Tracking
      ==================================
      
      Idle Page Tracking allows users to set and read idleness of pages using a
      bitmap file which represents each page with each bit of the file.  One
      recommended usage of it is working set size detection.  Users can do that
      by
      
          1. finding the PFN of each page of the workloads of interest,
          2. setting all the pages as idle by writing to the bitmap file,
          3. waiting until the workload accesses its working set, and
          4. reading the idleness of the pages again and counting the pages that
             are no longer idle.
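      
      Steps 2 to 4 can be sketched on an in-memory stand-in for the bitmap
      file.  The layout follows the Idle Page Tracking documentation (an array
      of 8-byte words, PFN #i mapped to bit #i%64 of word #i/64), but everything
      else is an assumption of this example: a ``bytearray`` replaces
      /sys/kernel/mm/page_idle/bitmap, little-endian words are used where the
      real file uses native byte order, the PFN lookup of step 1 is skipped, and
      all function names are hypothetical.

```python
WORD_BITS = 64   # one bit per PFN, packed into 8-byte words
WORD_BYTES = 8


def bit_position(pfn):
    """Return (byte offset of the word, bit index in the word) for a PFN."""
    return (pfn // WORD_BITS) * WORD_BYTES, pfn % WORD_BITS


def _update(bitmap, pfn, idle):
    offset, bit = bit_position(pfn)
    word = int.from_bytes(bitmap[offset:offset + WORD_BYTES], "little")
    word = word | (1 << bit) if idle else word & ~(1 << bit)
    bitmap[offset:offset + WORD_BYTES] = word.to_bytes(WORD_BYTES, "little")


def mark_idle(bitmap, pfns):
    """Step 2: set the idle bit of every page of interest."""
    for pfn in pfns:
        _update(bitmap, pfn, idle=True)


def simulate_access(bitmap, pfns):
    """Stand-in for step 3: an access clears the idle bit."""
    for pfn in pfns:
        _update(bitmap, pfn, idle=False)


def is_idle(bitmap, pfn):
    offset, bit = bit_position(pfn)
    word = int.from_bytes(bitmap[offset:offset + WORD_BYTES], "little")
    return bool((word >> bit) & 1)


def working_set_pages(bitmap, pfns):
    """Step 4: count the pages that are no longer idle."""
    return sum(1 for pfn in pfns if not is_idle(bitmap, pfn))
```

      With the real interface each of these steps is a read or write on the
      bitmap file from user space.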
      
      NOTE: While Idle Page Tracking is for user space users, DAMON is primarily
      designed for kernel subsystems though it can easily be exposed to the user
      space.  Hence, this section only assumes such user space use of DAMON.
      
      For what use cases would Idle Page Tracking be better?
      ------------------------------------------------------
      
      1. Flexible usecases other than hotness monitoring.
      
      Because Idle Page Tracking allows users to control the primitive (page
      idleness) by themselves, Idle Page Tracking users can do anything they
      want.  Meanwhile, DAMON is primarily designed to monitor the hotness of
      each memory region.  For this, DAMON asks users to provide a sampling
      interval and an aggregation interval.  For this reason, there could be some
      use cases where using Idle Page Tracking is simpler.
      
      2. Physical memory monitoring.
      
      Idle Page Tracking receives a PFN range as input, so it natively supports
      physical memory monitoring.
      
      DAMON is designed to be extensible for multiple address spaces and use
      cases by implementing and using primitives for the given use case.
      Therefore, in theory, DAMON has no limitation on the type of target
      address space as long as primitives for the given address space exist.
      However, the default primitives introduced by this patchset support only
      virtual address spaces.
      
      Therefore, for physical memory monitoring, you should implement your own
      primitives and use them, or simply use Idle Page Tracking.
      
      Nonetheless, an RFC patchset[1] for the physical memory address space
      primitives is already available.  It also supports user memory, the same
      as Idle Page Tracking.
      
      [1] https://lore.kernel.org/linux-mm/20200831104730.28970-1-sjpark@amazon.com/
      
      For what use cases is DAMON better?
      -----------------------------------
      
      1. Hotness Monitoring.
      
      Idle Page Tracking lets users know only whether a page frame is accessed
      or not.  For a hotness check, the user should write more code and use more
      memory.  DAMON does that by itself.
      
      2. Low Monitoring Overhead
      
      DAMON receives the user's monitoring request in one step and then provides
      the results.  So, roughly speaking, DAMON requires only O(1) user/kernel
      context switches.
      
      In the case of Idle Page Tracking, however, because the interface receives
      contiguous page frames, the number of user/kernel context switches
      increases as the monitoring target becomes complex and huge.  As a result,
      the context switch overhead could be non-negligible.
      
      Moreover, DAMON was born to handle the monitoring overhead.  Because
      the core mechanism is purely logical, Idle Page Tracking users might be
      able to implement the mechanism on their own, but it would be time
      consuming and the user/kernel context switching would still be more
      frequent than that of DAMON.  Also, the kernel subsystems cannot use the
      logic in this case.
      
      3. Page granularity working set size detection.
      
      Until v22 of this patchset, this was categorized as the thing Idle Page
      Tracking could do better, because DAMON basically maintains additional
      metadata for each of the monitoring target regions.  So, in the page
      granularity working set size detection use case, DAMON would incur (number
      of monitoring target pages * size of metadata) memory overhead.  Size of
      the single metadata item is about 54 bytes, so assuming 4KB pages, about
      1.3% of monitoring target pages will be additionally used.
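      
      The 1.3% figure follows directly from the two numbers in the paragraph
      above, as the short calculation below shows.  The 1 GiB target size is
      an arbitrary value chosen only for illustration, and the function names
      are hypothetical.

```python
METADATA_BYTES = 54   # approximate per-region metadata size from the text
PAGE_BYTES = 4096     # 4KB pages


def metadata_overhead_bytes(nr_pages, metadata_bytes=METADATA_BYTES):
    """Metadata needed when every page is its own monitoring target region."""
    return nr_pages * metadata_bytes


def overhead_ratio(metadata_bytes=METADATA_BYTES, page_bytes=PAGE_BYTES):
    """Metadata size relative to the monitored memory."""
    return metadata_bytes / page_bytes


if __name__ == "__main__":
    nr_pages = (1 << 30) // PAGE_BYTES          # a 1 GiB target (illustrative)
    print(f"{overhead_ratio():.2%}")            # prints 1.32%
    print(metadata_overhead_bytes(nr_pages) / (1 << 20), "MiB")  # prints 13.5 MiB
```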
      
      All essential metadata for Idle Page Tracking are embedded in 'struct
      page' and page table entries.  Therefore, in this use case, only one
      counter variable for working set size accounting is required if Idle Page
      Tracking is used.
      
      There are more details to consider, but roughly speaking, this is true in
      most cases.
      
      However, the situation changed from v23.  Now DAMON supports arbitrary
      types of monitoring targets, which don't use the metadata.  Using that,
      DAMON can do the working set size detection with no additional space
      overhead and fewer user-kernel context switches.  A first draft for the
      implementation of monitoring primitives for this usage is available in a
      DAMON development tree[1].  An RFC patchset for it based on this patchset
      will also be available soon.
      
      Since v24, the arbitrary type support is dropped from this patchset
      because this patchset doesn't introduce real use of the type.  You can
      still get it from the DAMON development tree[2], though.
      
      [1] https://github.com/sjp38/linux/tree/damon/pgidle_hack
      [2] https://github.com/sjp38/linux/tree/damon/master
      
      4. More future usecases
      
      While Idle Page Tracking has tight coupling with base primitives (PG_idle
      and page table Accessed bits), DAMON is designed to be extensible for many
      use cases and address spaces.  If you need some special address type or
      want to use special h/w access check primitives, you can write your own
      primitives for that and configure DAMON to use those.  Therefore, if your
      use case could change a lot in the future, using DAMON could be better.
      
      Can I use both Idle Page Tracking and DAMON?
      --------------------------------------------
      
      Yes, though using them concurrently for overlapping memory regions could
      result in interference with each other.  Nevertheless, such a use case
      would be rare or make no sense at all.  Even in that case, the noise would
      not be really significant.  So, you can choose whatever you want depending
      on the characteristics of your use cases.
      
      More Information
      ================
      
      We prepared a showcase web site[1] where you can get more information.
      There are
      
      - the official documentation[2],
      - the heatmap format dynamic access patterns of various realistic workloads
        for the heap area[3], mmap()-ed area[4], and stack[5] area,
      - the dynamic working set size distribution[6] and chronological working set
        size changes[7], and
      - the latest performance test results[8].
      
      [1] https://damonitor.github.io/_index
      [2] https://damonitor.github.io/doc/html/latest-damon
      [3] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.0.png.html
      [4] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
      [5] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.2.png.html
      [6] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
      [7] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
      [8] https://damonitor.github.io/test/result/perf/latest/html/index.html
      
      Baseline and Complete Git Trees
      ===============================
      
      The patches are based on the latest -mm tree, specifically
      v5.14-rc1-mmots-2021-07-15-18-47 of https://github.com/hnaz/linux-mm.  You can
      also clone the complete git tree:
      
          $ git clone git://github.com/sjp38/linux -b damon/patches/v34
      
      A web view is also available:
      https://github.com/sjp38/linux/releases/tag/damon/patches/v34
      
      Development Trees
      -----------------
      
      There are a couple of trees for the entire DAMON patchset series and
      features for future releases.
      
      - For latest release: https://github.com/sjp38/linux/tree/damon/master
      - For next release: https://github.com/sjp38/linux/tree/damon/next
      
      Long-term Support Trees
      -----------------------
      
      For people who want to test DAMON on LTS kernels, there are another
      couple of trees, based on the two latest LTS kernels respectively and
      containing the 'damon/master' backports.
      
      - For v5.4.y: https://github.com/sjp38/linux/tree/damon/for-v5.4.y
      - For v5.10.y: https://github.com/sjp38/linux/tree/damon/for-v5.10.y
      
      Amazon Linux Kernel Trees
      -------------------------
      
      DAMON is also merged in two public Amazon Linux kernel trees that are based
      on v5.4.y[1] and v5.10.y[2].
      
      [1] https://github.com/amazonlinux/linux/tree/amazon-5.4.y/master/mm/damon
      [2] https://github.com/amazonlinux/linux/tree/amazon-5.10.y/master/mm/damon
      
      Git Tree for Diff of Patches
      ============================
      
      For easy review of diff between different versions of each patch, I
      prepared a git tree containing all versions of the DAMON patchset series:
      https://github.com/sjp38/damon-patches
      
      You can clone it and use 'diff' for easy review of changes between
      different versions of the patchset.  For example:
      
          $ git clone https://github.com/sjp38/damon-patches && cd damon-patches
          $ diff -u damon/v33 damon/v34
      
      Sequence Of Patches
      ===================
      
      The first three patches implement the core logic of DAMON.  The 1st patch
      introduces basic sampling based hotness monitoring for arbitrary types of
      targets.  The following two patches implement the core mechanisms for
      control of overhead and accuracy, namely region based sampling (patch 2)
      and adaptive regions adjustment (patch 3).
      
      Now the essential parts of DAMON are complete, but it cannot work unless
      someone provides monitoring primitives for a specific use case.  The
      following two patches make it just work for virtual address space
      monitoring.  The 4th patch makes 'PG_idle' usable by DAMON and the
      5th patch implements the virtual memory address space specific monitoring
      primitives using page table Accessed bits and the 'PG_idle' page flag.
      
      Now DAMON just works for virtual address space monitoring via the kernel
      space API.  To let user space users use DAMON, the following four
      patches add interfaces for them.  The 6th patch adds a tracepoint for
      monitoring results.  The 7th patch implements a DAMON application kernel
      module, namely damon-dbgfs, that simply wraps DAMON and exposes the DAMON
      interface to the user space via the debugfs interface.  The 8th patch
      further exports the pid of the monitoring thread (kdamond) to user space
      for easier CPU usage accounting, and the 9th patch makes the debugfs
      interface support multiple contexts.
      
      Three patches for maintainability follow.  The 10th patch adds
      documentation for both the user space and the kernel space.  The 11th
      patch provides unit tests (based on kunit) while the 12th patch adds
      user space tests (based on kselftest).
      
      Finally, the last patch (13th) updates the MAINTAINERS file.
      
      This patch (of 13):
      
      DAMON is a data access monitoring framework for the Linux kernel.  The
      core mechanisms of DAMON make it
      
       - accurate (the monitoring output is useful enough for DRAM level
         performance-centric memory management; It might be inappropriate for
         CPU cache levels, though),
       - light-weight (the monitoring overhead is normally low enough to be
         applied online), and
       - scalable (the upper-bound of the overhead is in constant range
         regardless of the size of target workloads).
      
      Using this framework, hence, we can easily write efficient kernel space
      data access monitoring applications.  For example, the kernel's memory
      management mechanisms can make advanced decisions using this.
      Experimental data access aware optimization works that incurred high
      access monitoring overhead could again be implemented on top of this.
      
      Due to its simple and flexible interface, providing a user space interface
      would also be easy.  Then, user space users who have some special
      workloads can write personalized applications for better understanding and
      optimizations of their workloads and systems.
      
      ===
      
      Nevertheless, this commit defines and implements only the basic access
      check part, without the overhead-accuracy handling core logic.  The basic
      access check is as below.
      
      The output of DAMON says which memory regions are accessed how frequently
      for a given duration.  The resolution of the access frequency is
      controlled by setting ``sampling interval`` and ``aggregation interval``.
      In detail, DAMON checks access to each page per ``sampling interval`` and
      aggregates the results.  In other words, it counts the number of accesses
      to each region.  After each ``aggregation interval`` passes, DAMON calls
      callback functions that were previously registered by users so that users
      can read the aggregated results, and then clears the results.  This can be
      described in the simple pseudo-code below::
      
          init()
          while monitoring_on:
              for page in monitoring_target:
                  if accessed(page):
                      nr_accesses[page] += 1
              if time() % aggregation_interval == 0:
                  for callback in user_registered_callbacks:
                      callback(monitoring_target, nr_accesses)
                  for page in monitoring_target:
                      nr_accesses[page] = 0
              if time() % regions_update_interval == 0:
                  update()
              sleep(sampling_interval)
      
      The target regions are constructed at the beginning of the monitoring and
      updated after each ``regions_update_interval``, because the target regions
      could be dynamically changed (e.g., mmap() or memory hotplug).  The
      monitoring overhead of this mechanism will increase without bound as the
      size of the target workload grows.
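      
      A runnable variant of the basic per-page loop, counting sampling ticks
      in place of wall-clock time, might look like the sketch below.  It is
      an illustration under assumptions and not DAMON's code: the ``monitor``
      function, its parameters and the ``accessed(page, tick)`` callback are
      hypothetical, and the target regions update is left out.  The returned
      number of checks shows why the overhead grows with the number of pages.

```python
def monitor(pages, accessed, callbacks, nr_samples, samples_per_aggregation):
    """Run the basic access check for nr_samples sampling intervals.

    Returns the number of individual page checks that were performed.
    """
    nr_accesses = {page: 0 for page in pages}
    nr_checks = 0
    for tick in range(1, nr_samples + 1):
        # One sampling interval: every page is checked once.
        for page in pages:
            nr_checks += 1
            if accessed(page, tick):
                nr_accesses[page] += 1
        # One aggregation interval has passed: report, then reset.
        if tick % samples_per_aggregation == 0:
            for callback in callbacks:
                callback(dict(nr_accesses))  # hand out a snapshot
            for page in pages:
                nr_accesses[page] = 0
    return nr_checks
```

      For example, monitoring three pages for four sampling intervals
      performs twelve checks, and doubling the number of pages doubles that
      cost.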
      
      The basic monitoring primitives for the actual access check and dynamic
      target regions construction aren't in the core part of DAMON.  Instead, it
      allows users to implement their own primitives that are optimized for
      their use case and configure DAMON to use those.  In other words, users
      cannot use the current version of DAMON without some additional work.
      
      The following commits will implement the core mechanisms for the
      overhead-accuracy control and default primitives implementations.
      
      Link: https://lkml.kernel.org/r/20210716081449.22187-1-sj38.park@gmail.com
      Link: https://lkml.kernel.org/r/20210716081449.22187-2-sj38.park@gmail.com
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Reviewed-by: Leonard Foerster <foersleo@amazon.de>
      Reviewed-by: Fernand Sieber <sieberf@amazon.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Fan Du <fan.du@intel.com>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Maximilian Heyne <mheyne@amazon.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2224d848
    • M
      mm: introduce PAGEFLAGS_MASK to replace ((1UL << NR_PAGEFLAGS) - 1) · 41c961b9
      Muchun Song committed
      Instead of hard-coding ((1UL << NR_PAGEFLAGS) - 1) everywhere, introduce
      PAGEFLAGS_MASK to make the code that gets the page flags clearer.
      
      Link: https://lkml.kernel.org/r/20210819150712.59948-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41c961b9
    • S
      highmem: don't disable preemption on RT in kmap_atomic() · 51386120
      Sebastian Andrzej Siewior committed
      kmap_atomic() disables preemption and pagefaults for historical reasons.
      The conversion to kmap_local(), which only disables migration, cannot be
      done wholesale because quite a few call sites need to be updated to
      accommodate the changed semantics.
      
      On PREEMPT_RT enabled kernels the kmap_atomic() semantics are problematic
      due to the implicit disabling of preemption which makes it impossible to
      acquire 'sleeping' spinlocks within the kmap atomic sections.
      
      PREEMPT_RT has replaced the preempt_disable() with a migrate_disable() for
      more than a decade.  It could be argued that this is a justification to do
      this unconditionally, but PREEMPT_RT covers only a limited number of
      architectures and it disables some functionality which limits the coverage
      further.
      
      Limit the replacement to PREEMPT_RT for now.
      
      Link: https://lkml.kernel.org/r/20210810091116.pocdmaatdcogvdso@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51386120
    • C
      mm: move ioremap_page_range to vmalloc.c · 82a70ce0
      Christoph Hellwig committed
      Patch series "small ioremap cleanups".
      
      The first patch moves a little code around the vmalloc/ioremap boundary
      following a bigger move by Nick earlier.  The second enforces
      non-executable mapping on ioremap just like we do for vmap.  No driver
      currently uses executable mappings anyway, as they should.
      
      This patch (of 2):
      
      This keeps it together with the implementation and allows removing the
      vmap_range wrapper.
      
      Link: https://lkml.kernel.org/r/20210824091259.1324527-1-hch@lst.de
      Link: https://lkml.kernel.org/r/20210824091259.1324527-2-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82a70ce0
    • M
      mm: remove redundant compound_head() calling · fe3df441
      Muchun Song committed
      The compound_head() macro contains a READ_ONCE(), which prevents the
      compiler from optimizing the code when it is called more than once in
      a function.  Remove the redundant calls to compound_head() from
      page_to_index() and page_add_file_rmap() for better code generation.
      
      Link: https://lkml.kernel.org/r/20210811101431.83940-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: David Howells <dhowells@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fe3df441
    • D
      mm/memory_hotplug: improved dynamic memory group aware "auto-movable" online policy · 3fcebf90
      David Hildenbrand committed
      Currently, the "auto-movable" online policy does not allow for hotplugged
      KERNEL (ZONE_NORMAL) memory to increase the amount of MOVABLE memory we
      can have, primarily because there is no coordination across memory
      devices and we don't want to create zone imbalances accidentally when
      unplugging memory.
      
      However, within a single memory device it's different.  Let's allow for
      KERNEL memory within a dynamic memory group to allow for more MOVABLE
      within the same memory group.  The only thing we have to take care of is
      that the managing driver avoids zone imbalances by unplugging MOVABLE
      memory first, otherwise there can be corner cases where unplug of memory
      could result in (accidental) zone imbalances.
      
      virtio-mem is the only user of dynamic memory groups and recently added
      support for prioritizing unplug of ZONE_MOVABLE over ZONE_NORMAL, so we
      don't need a new toggle to enable it for dynamic memory groups.
      
      We limit this handling to dynamic memory groups, because:
      
      * We want to keep the runtime overhead for collecting stats when
        onlining a single memory block small.  We tend to have only a handful of
        dynamic memory groups, but we can have quite some static memory groups
        (e.g., 256 DIMMs).
      
      * It doesn't make too much sense for static memory groups, as we try
        onlining all applicable memory blocks either completely to ZONE_MOVABLE
        or not.  In ordinary operation, we won't have a mixture of zones within
        a static memory group.
      
      When adding memory to a dynamic memory group, we'll first online memory to
      ZONE_MOVABLE as long as early KERNEL memory allows for it.  Then, we'll
      online the next unit(s) to ZONE_NORMAL, until we can online the next
      unit(s) to ZONE_MOVABLE.
      
      For a simple virtio-mem device with a MOVABLE:KERNEL ratio of 3:1, it will
      result in a layout like:
      
        [M][M][M][M][M][M][M][M][N][M][M][M][N][M][M][M]...
        ^ movable memory due to early kernel memory
                                 ^ allows for more movable memory ...
                                    ^-----^ ... here
                                             ^ allows for more movable memory ...
                                                ^-----^ ... here
      
      While the created layout is sub-optimal when it comes to contiguous zones,
      it gives us the maximum flexibility when dynamically growing/shrinking a
      device; we can grow small VMs really big in small steps, and still shrink
      reliably to e.g., 1/4 of the maximum VM size in this example, removing
      full memory blocks along with meta data more reliably.
      
      Mark dynamic memory groups in the xarray such that we can efficiently
      iterate over them when collecting stats.  In usual setups, we have one
      virtio-mem device per NUMA node, and usually only a small number of NUMA
      nodes.
      
      Note: for now, there seems to be no compelling reason to make this
      behavior configurable.
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fcebf90
    • D
      mm/memory_hotplug: memory group aware "auto-movable" online policy · 445fcf7c
      David Hildenbrand committed
      Use memory groups to improve our "auto-movable" onlining policy:
      
      1. For static memory groups (e.g., a DIMM), online a memory block MOVABLE
         only if all other memory blocks in the group are either MOVABLE or could
         be onlined MOVABLE. A DIMM will either be MOVABLE or not, not a mixture.
      
      2. For dynamic memory groups (e.g., a virtio-mem device), online a
         memory block MOVABLE only if all other memory blocks inside the
         current unit are either MOVABLE or could be onlined MOVABLE. For a
         virtio-mem device with a device block size of 512 MiB, all 128 MiB
         memory blocks within a 512 MiB unit will either be MOVABLE or not, not
         a mixture.
      
      We have to pass the memory group to zone_for_pfn_range() to take the
      memory group into account.
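
      The unit rule for dynamic memory groups can be sketched as follows.
      This is a simplified Python model, not the kernel implementation: the
      helper names, the device-relative block indices, and the assumption
      that every block not yet onlined as KERNEL "could be onlined MOVABLE"
      are all made up for illustration.

```python
MIB = 1 << 20


def blocks_in_unit(block, block_size=128 * MIB, unit_size=512 * MIB):
    """Indices of all memory blocks sharing a unit with `block`.

    Indices are relative to the start of the device (an assumption of
    this sketch).
    """
    per_unit = unit_size // block_size
    first = (block // per_unit) * per_unit
    return list(range(first, first + per_unit))


def may_online_movable(block, kernel_blocks, **sizes):
    """True if no other block of the same unit is already a KERNEL block."""
    return all(b not in kernel_blocks
               for b in blocks_in_unit(block, **sizes) if b != block)


print(blocks_in_unit(5))               # [4, 5, 6, 7]
print(may_online_movable(5, {6}))      # False: block 6 shares the unit
print(may_online_movable(5, {8}))      # True: block 8 is in the next unit
```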
      
      Note: for now, there seems to be no compelling reason to make this
      behavior configurable.
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      445fcf7c
    • D
      mm/memory_hotplug: track present pages in memory groups · 836809ec
      David Hildenbrand committed
      Let's track all present pages in each memory group.  In particular, track
      memory present in ZONE_MOVABLE and memory present in one of the kernel
      zones (which really only is ZONE_NORMAL right now as memory groups only
      apply to hotplugged memory) separately within a memory group, to prepare
      for making smart auto-online decisions for individual memory blocks within
      a memory group based on group statistics.
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      836809ec
    • D
      drivers/base/memory: introduce "memory groups" to logically group memory blocks · 028fc57a
      David Hildenbrand committed
      In our "auto-movable" memory onlining policy, we want to make decisions
      across memory blocks of a single memory device.  Examples of memory
      devices include ACPI memory devices (in the simplest case a single DIMM)
      and virtio-mem.  For now, we don't have a connection between a single
      memory block device and the real memory device.  Each memory device
      consists of 1..X memory block devices.
      
      Let's logically group memory blocks belonging to the same memory device in
      "memory groups".  Memory groups can span multiple physical ranges and a
      memory group itself does not contain any information regarding physical
      ranges, only properties (e.g., "max_pages") necessary for improved memory
      onlining.
      
      Introduce two memory group types:
      
      1) Static memory group: E.g., a single ACPI memory device, consisting
         of 1..X memory resources.  A memory group consists of 1..Y memory
         blocks.  The whole group is added/removed in one go.  If any part
         cannot get offlined, the whole group cannot be removed.
      
      2) Dynamic memory group: E.g., a single virtio-mem device.  Memory is
         dynamically added/removed in a fixed granularity, called a "unit",
         consisting of 1..X memory blocks.  A unit is added/removed in one go.
         If any part of a unit cannot get offlined, the whole unit cannot be
         removed.
      
      In case of 1) we usually want either all memory managed by ZONE_MOVABLE or
      none.  In case of 2) we usually want to have as many units as possible
      managed by ZONE_MOVABLE.  We want a single unit to be of the same type.
      
      For now, memory groups are an internal concept that is not exposed to user
      space; we might want to change that in the future, though.
      
      add_memory() users can specify a mgid instead of a nid when passing the
      MHP_NID_IS_MGID flag.
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      028fc57a
    • D
      mm: track present early pages per zone · 4b097002
      David Hildenbrand committed
      Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.
      
      I. Goal
      
      The goal of this series is improving in-kernel auto-online support.  It
      tackles the fundamental problems that:
      
       1) We can create zone imbalances when onlining all memory blindly to
          ZONE_MOVABLE, in the worst case crashing the system. We have to know
          upfront how much memory we are going to hotplug such that we can
          safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
          via "online_movable". This is far from practical and only applicable in
          limited setups -- like inside VMs under the RHV/oVirt hypervisor which
          will never hotplug more than 3 times the boot memory (and the
          limitation is only in place due to the Linux limitation).
      
       2) We see more setups that implement dynamic VM resizing, hot(un)plugging
          memory to resize VM memory. In these setups, we might hotplug a lot of
          memory, but it might happen in various small steps in both directions
          (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
          primary driver of this upstream right now, performing such dynamic
          resizing NUMA-aware via multiple virtio-mem devices.
      
          Onlining all hotplugged memory to ZONE_NORMAL means we basically have
          no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
          easily run into zone imbalances when growing a VM. We want a mixture,
          and we want as much memory as reasonable/configured in ZONE_MOVABLE.
          Details regarding zone imbalances can be found at [1].
      
       3) Memory devices consist of 1..X memory block devices, however, the
          kernel doesn't really track the relationship. Consequently, user space
          has no idea either. We want to make per-device decisions.
      
          As one example, for memory hotunplug it doesn't make sense to use a
          mixture of zones within a single DIMM: we want all MOVABLE if
          possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
          block the whole DIMM from getting hotunplugged.
      
          As another example, virtio-mem operates on individual units that span
          1..X memory blocks. Similar to a DIMM, we want a unit to either be all
          MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
          all units of a virtio-mem device logically belong together and are
          managed (added/removed) by a single driver. We want as much memory of
          a virtio-mem device to be MOVABLE as possible.
      
       4) We want memory onlining to be done right from the kernel while adding
          memory, not triggered by user space via udev rules; for example, this
          is required for fast memory hotplug for drivers that add individual
          memory blocks, like virtio-mem. We want a way to configure a policy in
          the kernel and avoid implementing advanced policies in user space.
      
      The auto-onlining support we have in the kernel is not sufficient.  All we
      have is a) online everything MOVABLE (online_movable) b) online everything
      !MOVABLE (online_kernel) c) keep zones contiguous (online).  This series
      allows configuring c) to mean instead "online movable if possible
      according to the configuration, driven by a maximum MOVABLE:KERNEL ratio"
      -- a new onlining policy.
      
      II. Approach
      
      This series does 3 things:
      
       1) Introduces the "auto-movable" online policy that initially operates on
          individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
          to make a decision whether a memory block will be onlined to
          ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
          memory does not allow for more MOVABLE memory (details in the
          patches). CMA memory is treated like MOVABLE memory.
      
       2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
          groups and uses group information to make decisions in the
          "auto-movable" online policy across memory blocks of a single memory
          device (modeled as memory group). More details can be found in patch
          #3 or in the DIMM example below.
      
       3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
          allowing ZONE_NORMAL memory within a dynamic memory group to allow for
          more ZONE_MOVABLE memory within the same memory group. The target use
          case is dynamic VM resizing using virtio-mem. See the virtio-mem
          example below.
      
      I remember that the basic idea of using a ratio to implement a policy in
      the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
      lost the pointer to that discussion).
      
      For me, the main use case is using it along with virtio-mem (and DIMMs /
      ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
      amount of memory we can hotunplug reliably again if we might eventually
      hotplug a lot of memory to a VM.
      
      III. Target Usage
      
      The target usage will be:
      
       1) Linux boots with "mhp_default_online_type=offline"
      
       2) User space (e.g., systemd unit) configures memory onlining (according
          to a config file and system properties), for example:
          * Setting memory_hotplug.online_policy=auto-movable
          * Setting memory_hotplug.auto_movable_ratio=301
          * Setting memory_hotplug.auto_movable_numa_aware=true
      
       3) User space enables auto onlining via "echo online >
          /sys/devices/system/memory/auto_online_blocks"
      
       4) User space triggers manual onlining of all already-offline memory
          blocks (go over offline memory blocks and set them to "online")
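
      Steps 2 to 4 could be scripted roughly as below.  This is a hedged
      sketch, not a shipped tool: the function name and its "root" argument
      are hypothetical, and the assumption that the memory_hotplug module
      parameters are writable under /sys/module/memory_hotplug/parameters
      should be checked against the running kernel.  Step 1 is a boot
      parameter and cannot be applied at runtime.

```python
from pathlib import Path


def configure_auto_movable(root="/", ratio=301, numa_aware=True):
    """Apply steps 2-4 under the given filesystem root.

    Returns the names of the memory blocks that were switched to online.
    """
    base = Path(root)
    # Assumed sysfs location of the memory_hotplug module parameters.
    params = base / "sys/module/memory_hotplug/parameters"
    memory = base / "sys/devices/system/memory"

    # Step 2: select the policy and its tunables.
    (params / "online_policy").write_text("auto-movable\n")
    (params / "auto_movable_ratio").write_text(f"{ratio}\n")
    (params / "auto_movable_numa_aware").write_text(
        "1\n" if numa_aware else "0\n")

    # Step 3: enable auto onlining for memory hotplugged from now on.
    (memory / "auto_online_blocks").write_text("online\n")

    # Step 4: online the memory blocks that are still offline.
    onlined = []
    for state in sorted(memory.glob("memory*/state")):
        if state.read_text().strip() == "offline":
            state.write_text("online\n")
            onlined.append(state.parent.name)
    return onlined
```

      Passing a scratch directory as "root" allows a dry run against a fake
      sysfs tree; with the default root the function writes to the live
      system and needs root privileges.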
      
      IV. Example
      
      For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
      301% results in the following layout:
      	Memory block 0-15:    DMA32   (early)
      	Memory block 32-47:   Normal  (early)
      	Memory block 48-79:   Movable (DIMM 0)
      	Memory block 80-111:  Movable (DIMM 1)
      	Memory block 112-143: Movable (DIMM 2)
      	Memory block 144-175: Normal  (DIMM 3)
      	Memory block 176-207: Normal  (DIMM 4)
      	... all Normal
      	(-> hotplugged Normal memory does not allow for more Movable memory)
      
      For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
      will result in the following layout:
      	Memory block 0-15:    DMA32   (early)
      	Memory block 32-47:   Normal  (early)
      	Memory block 48-143:  Movable (virtio-mem, first 12 GiB)
      	Memory block 144:     Normal  (virtio-mem, next 128 MiB)
      	Memory block 145-147: Movable (virtio-mem, next 384 MiB)
      	Memory block 148:     Normal  (virtio-mem, next 128 MiB)
      	Memory block 149-151: Movable (virtio-mem, next 384 MiB)
      	... Normal/Movable mixture as above
      	(-> hotplugged Normal memory allows for more Movable memory within
      	    the same device)
      
      Which gives us maximum flexibility when dynamically growing/shrinking a
      VM in smaller steps.
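
      Both layouts can be reproduced with a toy model of the ratio check.
      The sketch below is an assumption-laden simplification (one greedy
      pass, no NUMA awareness, no CMA accounting, hypothetical function
      name), counting memory in 128 MiB memory blocks so that the 4 GiB of
      early memory correspond to 32 blocks.

```python
def simulate_online(early_kernel, units, ratio_percent,
                    hotplugged_kernel_counts):
    """Greedy model of the MOVABLE:KERNEL ratio check.

    early_kernel: early (boot) kernel memory, in memory blocks
    units: sizes of the hotplugged units, in memory blocks
    hotplugged_kernel_counts: True if hotplugged Normal memory raises the
        MOVABLE budget (models a dynamic group), False otherwise (DIMMs)
    """
    kernel = early_kernel
    movable = 0
    layout = []
    for size in units:
        # Online MOVABLE only if the ratio still holds afterwards.
        if (movable + size) * 100 <= ratio_percent * kernel:
            movable += size
            layout.append("Movable")
        else:
            if hotplugged_kernel_counts:
                kernel += size
            layout.append("Normal")
    return layout


# Five 4 GiB DIMMs (32 blocks each) added to a 4 GiB VM, ratio 301%.
dimms = simulate_online(32, [32] * 5, 301, hotplugged_kernel_counts=False)
print(dimms)  # ['Movable', 'Movable', 'Movable', 'Normal', 'Normal']

# One virtio-mem device growing in single 128 MiB blocks.
vmem = simulate_online(32, [1] * 104, 301, hotplugged_kernel_counts=True)
print(vmem[:96].count("Movable"), vmem[96:])
```

      In this model the first 96 virtio-mem blocks (12 GiB) come out
      Movable, followed by a repeating Normal, Movable, Movable, Movable
      pattern, matching the block ranges listed above.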
      
      V. Doc Update
      
      I'll update the memory-hotplug.rst documentation, once the overhaul [1] is
      upstream. Until then, details can be found in patch #2.
      
      VI. Future Work
      
       1) Use memory groups for ppc64 dlpar
       2) Being able to specify a portion of (early) kernel memory that will be
          excluded from the ratio. Like "128 MiB globally/per node" are excluded.
      
          This might be helpful when starting VMs with extremely small memory
          footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
          the first hotplugged units getting onlined to ZONE_MOVABLE. One
          alternative would be a trigger to not consider ZONE_DMA memory
          in the ratio. We'll have to see if this is really required.
       3) Indicate to user space that MOVABLE might be a bad idea -- especially
          relevant when memory ballooning without support for balloon compaction
          is active.
      
      This patch (of 9):
      
      For implementing a new memory onlining policy, which determines when to
      online memory blocks to ZONE_MOVABLE semi-automatically, we need the
      number of present early (boot) pages -- present pages excluding hotplugged
      pages.  Let's track these pages per zone.
      
      Pass a page instead of the zone to adjust_present_page_count(), similar to
      adjust_managed_page_count(), and derive the zone from the page.
      
      It's worth noting that a memory block to be offlined/onlined is either
      completely "early" or "not early".  add_memory() and friends can only add
      complete memory blocks and we only online/offline complete (individual)
      memory blocks.
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b097002
    • D
      mm/memory_hotplug: remove nid parameter from remove_memory() and friends · e1c158e4
      David Hildenbrand committed
      There is only a single user remaining.  We can simply look up the nid only
      used for node offlining purposes when walking our memory blocks.  We don't
      expect to remove multi-nid ranges; and if we'd ever do, we most probably
      don't care about removing multi-nid ranges that actually result in empty
      nodes.
      
      If ever required, we can detect the "multi-nid" scenario and simply try
      offlining all online nodes.
      
      Link: https://lkml.kernel.org/r/20210712124052.26491-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@ionos.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pierre Morel <pmorel@linux.ibm.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Sergei Trofimovich <slyfox@gentoo.org>
      Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e1c158e4
    • D
      mm/memory_hotplug: remove nid parameter from arch_remove_memory() · 65a2aa5f
      David Hildenbrand committed
      The parameter is unused, let's remove it.
      
      Link: https://lkml.kernel.org/r/20210712124052.26491-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
      Acked-by: Heiko Carstens <hca@linux.ibm.com>	[s390]
      Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Sergei Trofimovich <slyfox@gentoo.org>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Pierre Morel <pmorel@linux.ibm.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65a2aa5f
    • D
      mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range() · 7cf209ba
      David Hildenbrand committed
      Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory"
      
      These are all cleanups and one fix previously sent as part of [1]:
      [PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory
      groups.
      
      These patches make sense even without the other series, therefore I pulled
      them out to make the other series easier to digest.
      
      [1] https://lkml.kernel.org/r/20210607195430.48228-1-david@redhat.com
      
      This patch (of 4):
      
      Checkpatch complained on a follow-up patch that we are using "unsigned"
      here, which defaults to "unsigned int" and checkpatch is correct.
      
      As we will search for a fitting zone using the wrong pfn, we might end
      up onlining memory to one of the special kernel zones, such as ZONE_DMA,
      which can end badly as the onlined memory does not satisfy properties of
      these zones.
      
      Use "unsigned long" instead, just as we do in other places when handling
      PFNs.  This can bite us once we have physical addresses in the range of
      multiple TB.
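      
      The truncation described above can be illustrated with a small userspace
      sketch. It is not the kernel's zone_for_pfn_range(); every sketch_* name is
      hypothetical, and it assumes 4 KiB pages, a 32-bit "unsigned", and a 16 MiB
      low DMA range. Under those assumptions a PFN stops fitting once the
      physical address reaches 16 TiB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants, not the kernel's definitions. */
#define SKETCH_PAGE_SHIFT 12	/* assumes 4 KiB pages */
#define SKETCH_DMA_LIMIT_PFN ((16ULL << 20) >> SKETCH_PAGE_SHIFT)	/* assumes a 16 MiB DMA range */

/* Convert a physical address to a page frame number (PFN). */
static uint64_t sketch_phys_to_pfn(uint64_t phys)
{
	return phys >> SKETCH_PAGE_SHIFT;
}

/*
 * Models passing a PFN through a parameter declared as plain "unsigned",
 * assumed here to be 32 bits wide: the upper bits are silently dropped.
 */
static uint32_t sketch_pfn_via_unsigned(uint64_t pfn)
{
	return (uint32_t)pfn;
}

/* Hypothetical zone test: does this PFN fall inside the low DMA range? */
static bool sketch_pfn_in_dma_range(uint64_t pfn)
{
	return pfn < SKETCH_DMA_LIMIT_PFN;
}

/* Self-check: a 16 TiB address has PFN 2^32, which truncates to 0. */
void sketch_pfn_demo(void)
{
	uint64_t pfn = sketch_phys_to_pfn(1ULL << 44);

	assert(!sketch_pfn_in_dma_range(pfn));
	assert(sketch_pfn_in_dma_range(sketch_pfn_via_unsigned(pfn)));
}
```

      In this model the full-width PFN is far above the DMA range, while the
      truncated value wraps to 0 and appears to belong to it, which is the kind
      of misplacement the commit message warns about.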
      
      Link: https://lkml.kernel.org/r/20210712124052.26491-2-david@redhat.com
      Fixes: e5e68930 ("mm, memory_hotplug: display allowed zones in the preferred ordering")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: virtualization@lists.linux-foundation.org
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pierre Morel <pmorel@linux.ibm.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Sergei Trofimovich <slyfox@gentoo.org>
      Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cf209ba
    • M
      mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE · 859a85dd
      Mike Rapoport committed
      Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE".
      
      After recent updates to freeing unused parts of the memory map, no
      architecture can have holes in the memory map within a pageblock.  This
      makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration
      option redundant.
      
      The first patch removes them both in a mechanical way and the second patch
      simplifies memory_hotplug::test_pages_in_a_zone() that had
      pfn_valid_within() surrounded by more logic than simple if.
      
      This patch (of 2):
      
      After recent changes in freeing of the unused parts of the memory map and
      rework of pfn_valid() in arm and arm64 there are no architectures that can
      have holes in the memory map within a pageblock and so nothing can enable
      CONFIG_HOLES_IN_ZONE which guards non trivial implementation of
      pfn_valid_within().
      
      With that, pfn_valid_within() is always hardwired to 1 and can be
      completely removed.
      
      Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE.
      
      Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      859a85dd
  5. 08 Sep, 2021 3 commits
    • L
      time: Handle negative seconds correctly in timespec64_to_ns() · 39ff83f2
      Lukas Hannen committed
      timespec64_to_ns() prevents multiplication overflows by comparing the seconds
      value of the timespec to KTIME_SEC_MAX. If the value is greater than or equal
      to it, the function returns KTIME_MAX.
      
      But that check casts the signed seconds value to unsigned, which makes the
      comparison true for all negative values and therefore wrongly returns
      KTIME_MAX.
      
      Negative second values are perfectly valid and required in some places,
      e.g. ptp_clock_adjtime().
      
      Remove the cast and add a check for the negative boundary which is required
      to prevent undefined behaviour due to multiplication underflow.
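      
      The before/after behaviour can be sketched in userspace C. This is an
      approximation written for illustration, not the kernel source: the sk_*
      names and the struct are hypothetical, and the constants assume that
      KTIME_MAX and KTIME_MIN span the full signed 64-bit range.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel constants (assumed values). */
#define SK_NSEC_PER_SEC  1000000000LL
#define SK_KTIME_MAX     INT64_MAX
#define SK_KTIME_MIN     INT64_MIN
#define SK_KTIME_SEC_MAX (SK_KTIME_MAX / SK_NSEC_PER_SEC)
#define SK_KTIME_SEC_MIN (SK_KTIME_MIN / SK_NSEC_PER_SEC)

/* Hypothetical stand-in for struct timespec64. */
struct sk_timespec64 {
	int64_t tv_sec;
	long tv_nsec;
};

/* Old check: the unsigned cast turns every negative tv_sec into a huge value. */
static int64_t sk_to_ns_buggy(const struct sk_timespec64 *ts)
{
	if ((uint64_t)ts->tv_sec >= (uint64_t)SK_KTIME_SEC_MAX)
		return SK_KTIME_MAX;

	return ts->tv_sec * SK_NSEC_PER_SEC + ts->tv_nsec;
}

/* New check: signed comparison plus a lower bound against underflow. */
static int64_t sk_to_ns_fixed(const struct sk_timespec64 *ts)
{
	if (ts->tv_sec >= SK_KTIME_SEC_MAX)
		return SK_KTIME_MAX;

	if (ts->tv_sec <= SK_KTIME_SEC_MIN)
		return SK_KTIME_MIN;

	return ts->tv_sec * SK_NSEC_PER_SEC + ts->tv_nsec;
}

/* Self-check: one second before the epoch. */
void sk_to_ns_demo(void)
{
	struct sk_timespec64 ts = { .tv_sec = -1, .tv_nsec = 0 };

	assert(sk_to_ns_buggy(&ts) == SK_KTIME_MAX);
	assert(sk_to_ns_fixed(&ts) == -SK_NSEC_PER_SEC);
}
```

      With the cast in place, a tv_sec of -1 is clamped to the maximum; with the
      signed comparison and the added lower bound it converts to -1000000000 ns,
      and only values at or beyond either boundary are clamped.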
      
      Fixes: cb477557 ("time: Prevent undefined behaviour in timespec64_to_ns()")
      Signed-off-by: Lukas Hannen <lukas.hannen@opensource.tttech-industrial.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/AM6PR01MB541637BD6F336B8FFB72AF80EEC69@AM6PR01MB5416.eurprd01.prod.exchangelabs.com
      39ff83f2
    • L
      PM: EM: fix kernel-doc comments · ca67408a
      Lukasz Luba committed
      Fix the kernel-doc comments for the improved Energy Model documentation.
      
      Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      ca67408a
    • L
      Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly" · cd1adf1b
      Linus Torvalds committed
      This reverts commit 9857a17f.
      
      That commit was completely broken, and I should have caught on to it
      earlier.  But happily, the kernel test robot noticed the breakage fairly
      quickly.
      
      The breakage is because "try_get_page()" is about avoiding the page
      reference count overflow case, but is otherwise the exact same as a
      plain "get_page()".
      
      In contrast, "try_get_compound_head()" is an entirely different beast,
      and uses __page_cache_add_speculative() because it's not just about the
      page reference count, but also about possibly racing with the underlying
      page going away.
      
      So all the commentary about how
      
       "try_get_page() has fallen a little behind in terms of maintenance,
        try_get_compound_head() handles speculative page references more
        thoroughly"
      
      was just completely wrong: yes, try_get_compound_head() handles
      speculative page references, but the point is that try_get_page() does
      not, and must not.
      
      So there's no lack of maintenance - there are fundamentally different
      semantics.
      
      A speculative page reference would be entirely wrong in "get_page()",
      and it's entirely wrong in "try_get_page()".  It's not about
      speculation, it's purely about "uhhuh, you can't get this page because
      you've tried to increment the reference count too much already".
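      
      The difference in semantics can be sketched with C11 atomics. This is a
      loose userspace model, not the kernel implementation: struct sk_page and
      both functions are hypothetical, the first mimics a "caller already holds
      a reference, refuse only if the count is not sane" helper, and the second
      mimics a speculative "take a reference unless the count is already zero"
      helper. The context checks that the real speculative path performs are
      not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page with a signed reference count. */
struct sk_page {
	atomic_int refcount;
};

/*
 * try_get_page()-like: the caller already owns a reference, so the page
 * cannot go away. Refuse only when the count is zero or has gone
 * negative, e.g. after too many increments.
 */
static bool sk_try_get_page(struct sk_page *page)
{
	if (atomic_load(&page->refcount) <= 0)
		return false;

	atomic_fetch_add(&page->refcount, 1);
	return true;
}

/*
 * Speculative-style: the caller owns nothing, so the page may be freed
 * concurrently. The increment must be atomic with the "not zero" check.
 */
static bool sk_get_page_unless_zero(struct sk_page *page)
{
	int old = atomic_load(&page->refcount);

	do {
		if (old == 0)
			return false;
	} while (!atomic_compare_exchange_weak(&page->refcount, &old, old + 1));

	return true;
}

/* Self-check: a wrapped (negative) count is refused, a live one is not. */
void sk_page_demo(void)
{
	struct sk_page page;

	atomic_init(&page.refcount, -5);
	assert(!sk_try_get_page(&page));

	atomic_store(&page.refcount, 1);
	assert(sk_try_get_page(&page));
}
```

      The first helper never has to worry about the page disappearing, so a
      plain check followed by an increment is enough; the second has to close
      the race with a compare-and-exchange loop, which is why the two are not
      interchangeable.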
      
      The reason the kernel test robot noticed this bug was that it hit the
      VM_BUG_ON() in __page_cache_add_speculative(), which is all about
      verifying that the context of any speculative page access is correct.
      But since that isn't what try_get_page() is all about, the VM_BUG_ON()
      tests things that are not correct to test for try_get_page().
      
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd1adf1b
  6. 07 Sep, 2021 1 commit
  7. 06 Sep, 2021 1 commit