1. 15 May, 2019 40 commits
    • G
      lib/sort: avoid indirect calls to built-in swap · 8fb583c4
      George Spelvin committed
      Similar to what's being done in the net code, this takes advantage of
      the fact that most invocations use only a few common swap functions, and
      replaces indirect calls to them with (highly predictable) conditional
      branches.  (The downside, of course, is that if you *do* use a custom
      swap function, there are a few extra predicted branches on the code
      path.)
      
      This actually *shrinks* the x86-64 code, because it inlines the various
      swap functions inside do_swap, eliding function prologues & epilogues.
      
      x86-64 code size 767 -> 703 bytes (-64)
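      
      The dispatch described above can be sketched in user-space C roughly as
      follows.  This is a minimal illustration, not the code from lib/sort.c:
      the names swap_func_t, SWAP_WORDS_32, SWAP_BYTES, do_swap and custom_swap
      are assumptions chosen for the example, and the sentinel values rely on
      the implementation-defined conversion of small integers to function
      pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*swap_func_t)(void *a, void *b, size_t size);

/* Sentinels that stand for the built-in swaps; they are never called. */
#define SWAP_WORDS_32 ((swap_func_t)(uintptr_t)0)
#define SWAP_BYTES    ((swap_func_t)(uintptr_t)1)

/* Built-in swap: one byte at a time, works for any size. */
static void swap_bytes(void *a, void *b, size_t n)
{
    unsigned char *pa = a, *pb = b;

    while (n--) {
        unsigned char t = *pa;
        *pa++ = *pb;
        *pb++ = t;
    }
}

/* Built-in swap: 32 bits at a time, n must be a multiple of 4. */
static void swap_words_32(void *a, void *b, size_t n)
{
    unsigned char *pa = a, *pb = b;

    assert(n % sizeof(uint32_t) == 0);
    for (size_t i = 0; i < n; i += sizeof(uint32_t)) {
        uint32_t ta, tb;

        memcpy(&ta, pa + i, sizeof ta);
        memcpy(&tb, pb + i, sizeof tb);
        memcpy(pa + i, &tb, sizeof tb);
        memcpy(pb + i, &ta, sizeof ta);
    }
}

/* Example of a caller-supplied swap; counts how often it is invoked. */
static int custom_calls;

static void custom_swap(void *a, void *b, size_t n)
{
    custom_calls++;
    swap_bytes(a, b, n);
}

/*
 * Compare against the sentinels first, so the common cases become direct
 * (inlinable) calls behind predictable branches.  Only a genuinely custom
 * swap function pays for an indirect call.
 */
static void do_swap(void *a, void *b, size_t size, swap_func_t swap_func)
{
    if (swap_func == SWAP_WORDS_32)
        swap_words_32(a, b, size);
    else if (swap_func == SWAP_BYTES)
        swap_bytes(a, b, size);
    else
        swap_func(a, b, size);
}
```

      A caller would pass SWAP_WORDS_32 or SWAP_BYTES for the built-in cases
      and a real function pointer such as custom_swap otherwise; only the last
      case goes through an indirect call.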
      
      Link: http://lkml.kernel.org/r/d10c5d4b393a1847f32f5b26f4bbaa2857140e1e.1552704200.git.lkml@sdf.org
      Signed-off-by: George Spelvin <lkml@sdf.org>
      Acked-by: Andrey Abramov <st5pub@yandex.ru>
      Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Daniel Wagner <daniel.wagner@siemens.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Don Mullis <don.mullis@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • G
      lib/sort: use more efficient bottom-up heapsort variant · 22a241cc
      George Spelvin committed
      This uses fewer comparisons than the previous code (approaching half as
      many for large random inputs), but produces identical results; it
      actually performs the exact same series of swap operations.
      
      Specifically, it reduces the average number of compares from
        2*n*log2(n) - 3*n + o(n)
      to
          n*log2(n) + 0.37*n + o(n).
      
      This is still 1.63*n worse than glibc qsort() which manages n*log2(n) -
      1.26*n, but at least the leading coefficient is correct.
      
      Standard heapsort, when sifting down, performs two comparisons per
      level: one to find the greater child, and a second to see if the current
      node should be exchanged with that child.
      
      Bottom-up heapsort observes that it's better to postpone the second
      comparison and search for the leaf where -infinity would be sent to,
      then search back *up* for the current node's destination.
      
      Since sifting down usually proceeds to the leaf level (that's where half
      the nodes are), this does O(1) second comparisons rather than log2(n).
      That saves a lot of (expensive since Spectre) indirect function calls.
      
      The one time it's worse than the previous code is if there are large
      numbers of duplicate keys, when the top-down algorithm is O(n) and
      bottom-up is O(n log n).  For distinct keys, it's provably always
      better, doing 1.5*n*log2(n) + O(n) in the worst case.
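      
      The three phases described above (descend to a leaf, climb back up,
      rotate the path) can be sketched for a plain int array as follows.  This
      is a simplified illustration under stated assumptions, not the kernel
      implementation: the function names, the int element type and the
      n_compares counter are all hypothetical, and lib/sort.c works on opaque
      byte offsets with caller-supplied compare and swap callbacks instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counter, only here to make the comparison cost observable. */
static unsigned long n_compares;

static int cmp_int(int a, int b)
{
    n_compares++;
    return (a > b) - (a < b);
}

/* Restore the max-heap property for the subtree rooted at `root` in a[0..n). */
static void sift_down_bottom_up(int *a, size_t root, size_t n)
{
    size_t leaf = root, child;
    int v;

    /* Phase 1: follow the larger child down to a leaf, one compare per level. */
    while ((child = 2 * leaf + 2) < n)
        leaf = cmp_int(a[child - 1], a[child]) >= 0 ? child - 1 : child;
    if (child == n)             /* a lone left child at the very bottom */
        leaf = child - 1;

    /* Phase 2: climb back up to the first element larger than a[root]. */
    while (leaf != root && cmp_int(a[root], a[leaf]) >= 0)
        leaf = (leaf - 1) / 2;

    /* Phase 3: put a[root] there and shift the path above it up one level. */
    v = a[leaf];
    a[leaf] = a[root];
    while (leaf != root) {
        int t;

        leaf = (leaf - 1) / 2;
        t = a[leaf];
        a[leaf] = v;
        v = t;
    }
}

static void heapsort_bottom_up(int *a, size_t n)
{
    size_t i;

    if (n < 2)
        return;
    for (i = n / 2; i-- > 0;)           /* build the heap */
        sift_down_bottom_up(a, i, n);
    for (i = n - 1; i > 0; i--) {       /* repeatedly extract the maximum */
        int t = a[0];

        a[0] = a[i];
        a[i] = t;
        sift_down_bottom_up(a, 0, i);
    }
}
```

      Because most elements end up near the bottom of the heap, phase 2
      usually stops after very few steps, which is where the saving over the
      two-compares-per-level sift-down comes from.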
      
      (The code is not significantly more complex.  This patch also merges the
      heap-building and -extracting sift-down loops, resulting in a net code
      size savings.)
      
      x86-64 code size 885 -> 767 bytes (-118)
      
      (I see the checkpatch complaint about "else if (n -= size)".  The
      alternative is significantly uglier.)
      
      Link: http://lkml.kernel.org/r/2de8348635a1a421a72620677898c7fd5bd4b19d.1552704200.git.lkml@sdf.org
      Signed-off-by: George Spelvin <lkml@sdf.org>
      Acked-by: Andrey Abramov <st5pub@yandex.ru>
      Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Daniel Wagner <daniel.wagner@siemens.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Don Mullis <don.mullis@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • G
      lib/sort: make swap functions more generic · 37d0ec34
      George Spelvin committed
      Patch series "lib/sort & lib/list_sort: faster and smaller", v2.
      
      Because CONFIG_RETPOLINE has made indirect calls much more expensive, I
      thought I'd try to reduce the number made by the library sort functions.
      
      The first three patches apply to lib/sort.c.
      
      Patch #1 is a simple optimization.  The built-in swap has special cases
      for aligned 4- and 8-byte objects.  But those are almost never used;
      most calls to sort() work on larger structures, which fall back to the
      byte-at-a-time loop.  This generalizes them to aligned *multiples* of 4
      and 8 bytes.  (If nothing else, it saves an awful lot of energy by not
      thrashing the store buffers as much.)
      
      Patch #2 grabs a juicy piece of low-hanging fruit.  I agree that nice
      simple solid heapsort is preferable to more complex algorithms (sorry,
      Andrey), but it's possible to implement heapsort with far fewer
      comparisons (50% asymptotically, 25-40% reduction for realistic sizes)
      than the way it's been done up to now.  And with some care, the code
      ends up smaller, as well.  This is the "big win" patch.
      
      Patch #3 adds the same sort of indirect call bypass that has been added
      to the net code of late.  The great majority of the callers use the
      builtin swap functions, so replace the indirect call to sort_func with a
      (highly predictable) series of if() statements.  Rather surprisingly,
      this decreased code size, as the swap functions were inlined and their
      prologue & epilogue code eliminated.
      
      lib/list_sort.c is a bit trickier, as merge sort is already close to
      optimal, and we don't want to introduce triumphs of theory over
      practicality like the Ford-Johnson merge-insertion sort.
      
      Patch #4, without changing the algorithm, chops 32% off the code size
      and removes the part[MAX_LIST_LENGTH+1] pointer array (and the
      corresponding upper limit on efficiently sortable input size).
      
      Patch #5 improves the algorithm.  The previous code is already optimal
      for power-of-two (or slightly smaller) size inputs, but when the input
      size is just over a power of 2, there's a very unbalanced final merge.
      
      There are, in the literature, several algorithms which solve this, but
      they all depend on the "breadth-first" merge order which was replaced by
      commit 835cc0c8 with a more cache-friendly "depth-first" order.
      Some hard thinking came up with a depth-first algorithm which defers
      merges as little as possible while avoiding bad merges.  This saves
      0.2*n compares, averaged over all sizes.
      
      The code size increase is minimal (64 bytes on x86-64, reducing the net
      savings to 26%), but the comments expanded significantly to document the
      clever algorithm.
      
      TESTING NOTES: I have some ugly user-space benchmarking code which I
      used for testing before moving this code into the kernel.  Shout if you
      want a copy.
      
      I'm running this code right now, with CONFIG_TEST_SORT and
      CONFIG_TEST_LIST_SORT, but I confess I haven't rebooted since the last
      round of minor edits to quell checkpatch.  I figure there will be at
      least one round of comments and final testing.
      
      This patch (of 5):
      
      Rather than having special-case swap functions for 4- and 8-byte
      objects, special-case aligned multiples of 4 or 8 bytes.  This speeds up
      most users of sort() by avoiding fallback to the byte copy loop.
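      
      A rough user-space sketch of this selection logic is shown below.  The
      names is_aligned, choose_swap, swap_words_64 and the swap_kind enum are
      assumptions made for illustration; the sketch also ignores
      architecture-specific handling of unaligned accesses and simply requires
      both the base address and the element size to be multiples of the word
      size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum swap_kind { KIND_WORDS_64, KIND_WORDS_32, KIND_BYTES };

/* True if both the base address and the element size are multiples of
 * `align`, which must be a power of two. */
static bool is_aligned(const void *base, size_t size, size_t align)
{
    return (((uintptr_t)base | size) & (align - 1)) == 0;
}

/* Pick the widest word-at-a-time swap the layout allows; fall back to bytes. */
static enum swap_kind choose_swap(const void *base, size_t size)
{
    if (is_aligned(base, size, 8))
        return KIND_WORDS_64;
    if (is_aligned(base, size, 4))
        return KIND_WORDS_32;
    return KIND_BYTES;
}

/* Swap n bytes, n a multiple of 8, one 64-bit word at a time.
 * memcpy keeps the sketch free of strict-aliasing problems. */
static void swap_words_64(void *a, void *b, size_t n)
{
    unsigned char *pa = a, *pb = b;

    assert(n % sizeof(uint64_t) == 0);
    for (size_t i = 0; i < n; i += sizeof(uint64_t)) {
        uint64_t ta, tb;

        memcpy(&ta, pa + i, sizeof ta);
        memcpy(&tb, pb + i, sizeof tb);
        memcpy(pa + i, &tb, sizeof tb);
        memcpy(pb + i, &ta, sizeof ta);
    }
}
```

      With this rule, an array of 40-byte structures at an 8-byte-aligned
      address is swapped five 64-bit words at a time rather than 40 single
      bytes at a time.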
      
      Despite what ca96ab85 ("lib/sort: Add 64 bit swap function") claims,
      very few users of sort() sort pointers (or pointer-sized objects); most
      sort structures containing at least two words.  (E.g.
      drivers/acpi/fan.c:acpi_fan_get_fps() sorts an array of 40-byte struct
      acpi_fan_fps.)
      
      The functions also got renamed to reflect the fact that they support
      multiple words.  In the great tradition of bikeshedding, the names were
      by far the most contentious issue during review of this patch series.
      
      x86-64 code size 872 -> 886 bytes (+14)
      
      With feedback from Andy Shevchenko, Rasmus Villemoes and Geert
      Uytterhoeven.
      
      Link: http://lkml.kernel.org/r/f24f932df3a7fa1973c1084154f1cea596bcf341.1552704200.git.lkml@sdf.org
      Signed-off-by: George Spelvin <lkml@sdf.org>
      Acked-by: Andrey Abramov <st5pub@yandex.ru>
      Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Daniel Wagner <daniel.wagner@siemens.com>
      Cc: Don Mullis <don.mullis@gmail.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      lib/plist: rename DEBUG_PI_LIST to DEBUG_PLIST · 8e18faea
      Davidlohr Bueso committed
      DEBUG_PLIST is a much more appropriate name than DEBUG_PI_LIST: in the
      kernel, one would assume that PI_LIST has to do with priority
      inheritance, which it does not.  Furthermore, futexes make use of
      plists, so the old name is even more confusing, despite the debug
      nature of the config option.
      
      Link: http://lkml.kernel.org/r/20190317185434.1626-1-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      lib/bitmap.c: guard exotic bitmap functions by CONFIG_NUMA · cdc90a18
      Rasmus Villemoes committed
      The bitmap_remap, _bitremap, _onto and _fold functions are only used,
      via their node_ wrappers, in mm/mempolicy.c, which is only built for
      CONFIG_NUMA.  The helper bitmap_ord_to_pos used by these functions is
      global, but its only external caller is node_random() in lib/nodemask.c,
      which is also guarded by CONFIG_NUMA.
      
      For !CONFIG_NUMA:
      
      add/remove: 0/6 grow/shrink: 0/0 up/down: 0/-621 (-621)
      Function                                     old     new   delta
      bitmap_pos_to_ord                             20       -     -20
      bitmap_ord_to_pos                             70       -     -70
      bitmap_bitremap                               81       -     -81
      bitmap_fold                                  113       -    -113
      bitmap_onto                                  123       -    -123
      bitmap_remap                                 214       -    -214
      Total: Before=4776, After=4155, chg -13.00%
      
      Link: http://lkml.kernel.org/r/20190329205353.6010-2-linux@rasmusvillemoes.dk
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Cc: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      lib/bitmap.c: remove unused EXPORT_SYMBOLs · 5f239f65
      Rasmus Villemoes committed
      AFAICT, there have never been any callers of these functions outside
      mm/mempolicy.c (via their nodemask.h wrappers).  In particular, no
      modular code has ever used them, and given their somewhat exotic
      semantics, I highly doubt they will ever find such a use.  In any case,
      no need to export them currently.
      
      Link: http://lkml.kernel.org/r/20190329205353.6010-1-linux@rasmusvillemoes.dk
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Cc: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      kernel/user.c: clean up some leftover code · 6c4e121f
      Rasmus Villemoes committed
      The out_unlock label is misleading; no unlocking happens after it, so
      just return NULL directly.
      
      Also, nothing between the kmem_cache_zalloc() that creates new and the
      two key_put() can initialize new->uid_keyring or new->session_keyring,
      so those calls are no-ops.
      
      Link: http://lkml.kernel.org/r/20190424200404.9114-1-linux@rasmusvillemoes.dk
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      kernel/latencytop.c: rename clear_all_latency_tracing to clear_tsk_latency_tracing · e02c9b0d
      Lin Feng committed
      The name clear_all_latency_tracing is misleading: the function only
      clears the per-task latency_record[], and there is another function,
      clear_global_latency_tracing, which clears the global latency_record[]
      buffer.
      
      Link: http://lkml.kernel.org/r/20190226114602.16902-1-linf@wangsu.com
      Signed-off-by: Lin Feng <linf@wangsu.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      kernel/latencytop.c: remove unnecessary checks for latencytop_enabled · 0cc75888
      Lin Feng committed
      1. In the latencytop source code, there is only one call chain:
      
      account_scheduler_latency(struct task_struct *task, int usecs, int inter)
      {
              if (unlikely(latencytop_enabled)) /* the outermost check */
                      __account_scheduler_latency(task, usecs, inter);
      }
      __account_scheduler_latency
          account_global_scheduler_latency
              if (!latencytop_enabled)
      
      So, the inner check for latencytop_enabled is not necessary at all.
      
      2. In clear_all_latency_tracing (now renamed clear_tsk_latency_tracing),
         the check for latencytop_enabled is redundant and, to some extent,
         buggy.
      
         There is no reason to refuse to clear /proc/$pid/latency when
         latencytop_enabled is 0: after disabling latencytop manually with
         echo 0 > /proc/sys/kernel/latencytop, an attempt to clear
         /proc/$pid/latency fails.
      
         Also, the sibling function clear_global_latency_tracing has no such
         check.
      
      Note: These changes are only visible to users who set
         CONFIG_LATENCYTOP, and they do not change the behavior of the
         latencytop user tool.
      
      Link: http://lkml.kernel.org/r/20190226114602.16902-2-linf@wangsu.com
      Signed-off-by: Lin Feng <linf@wangsu.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      kernel/notifier.c: double register detection · 83124657
      Vasily Averin committed
      By design, a notifier can be registered only once; a second registration
      attempt made by mistake silently corrupts the notifier list.
      
      A few years ago I investigated this problem after a host was power
      cycled because of notifier list corruption.  I prepared this patch,
      applied it to the OpenVZ kernel and sent it out, but nobody commented
      on it.  Later it helped us to detect a similar problem in the OpenVZ
      kernel.
      
      Mistakes with notifier registration can happen for example during
      subsystem initialization from different namespaces, or because of a lost
      unregister in the roll-back path on initialization failures.
      
      The proposed check cannot prevent the described problem, however it
      allows us to detect its reason quickly without coredump analysis.
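      
      To illustrate the failure mode and the kind of check involved, here is a
      small user-space sketch of a priority-ordered singly linked chain.  The
      struct notifier type and chain_register() are hypothetical stand-ins, not
      the kernel's notifier API, and unlike the patch (which, as described
      above, only makes the mistake detectable) this sketch goes one step
      further and refuses the second registration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical notifier entry; higher priority entries come first. */
struct notifier {
    int priority;
    struct notifier *next;
};

/*
 * Insert n into the priority-sorted chain at *head.
 * Returns false and leaves the chain untouched if n is already on it.
 * Without that check, linking n a second time would overwrite n->next
 * and could, for example, make the chain point back into itself.
 */
static bool chain_register(struct notifier **head, struct notifier *n)
{
    struct notifier **nl, **slot = NULL;

    for (nl = head; *nl != NULL; nl = &(*nl)->next) {
        if (*nl == n)
            return false;       /* double registration detected */
        if (slot == NULL && n->priority > (*nl)->priority)
            slot = nl;          /* first entry with a lower priority */
    }
    if (slot == NULL)
        slot = nl;              /* no such entry: append at the tail */

    n->next = *slot;
    *slot = n;
    return true;
}
```

      The whole chain is scanned before anything is linked, so a duplicate is
      caught even when it sits behind the position where the new entry would
      be inserted.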
      
      Link: http://lkml.kernel.org/r/04127e71-4782-9bbb-fe5a-7c01e93a99b0@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING · 9012d011
      Masahiro Yamada committed
      Commit 60a3cdd0 ("x86: add optimized inlining") introduced
      CONFIG_OPTIMIZE_INLINING, but it has been available only for x86.
      
      The idea is obviously arch-agnostic.  This commit moves the config entry
      from arch/x86/Kconfig.debug to lib/Kconfig.debug so that all
      architectures can benefit from it.
      
      This can make a huge difference in kernel image size especially when
      CONFIG_OPTIMIZE_FOR_SIZE is enabled.
      
      For example, I got a 3.5% smaller arm64 kernel for v5.1-rc1.
      
        dec       file
        18983424  arch/arm64/boot/Image.before
        18321920  arch/arm64/boot/Image.after
      
      This also slightly improves the "Kernel hacking" Kconfig menu as
      e61aca51 ("Merge branch 'kconfig-diet' from Dave Hansen") suggested;
      this config option would be a good fit in the "compiler option" menu.
      
      Link: http://lkml.kernel.org/r/20190423034959.13525-12-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Brezillon <bbrezillon@kernel.org>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Marek Vasut <marek.vasut@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Stefan Agner <stefan@agner.ch>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      powerpc/mm/radix: mark __tlbie_pid() and friends as __always_inline · efc344c5
      Masahiro Yamada committed
      This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common
      place.  We need to eliminate potential issues beforehand.
      
      If it is enabled for powerpc, the following errors are reported:
      
        arch/powerpc/mm/tlb-radix.c: In function '__tlbie_lpid':
        arch/powerpc/mm/tlb-radix.c:148:2: warning: asm operand 3 probably doesn't match constraints
          asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
          ^~~
        arch/powerpc/mm/tlb-radix.c:148:2: error: impossible constraint in 'asm'
        arch/powerpc/mm/tlb-radix.c: In function '__tlbie_pid':
        arch/powerpc/mm/tlb-radix.c:118:2: warning: asm operand 3 probably doesn't match constraints
          asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
          ^~~
        arch/powerpc/mm/tlb-radix.c: In function '__tlbiel_pid':
        arch/powerpc/mm/tlb-radix.c:104:2: warning: asm operand 3 probably doesn't match constraints
          asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
          ^~~
      
      Link: http://lkml.kernel.org/r/20190423034959.13525-11-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Brezillon <bbrezillon@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Marek Vasut <marek.vasut@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Stefan Agner <stefan@agner.ch>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      powerpc/mm/radix: mark __radix__flush_tlb_range_psize() as __always_inline · e12d6d7d
      Masahiro Yamada committed
      This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common
      place.  We need to eliminate potential issues beforehand.
      
      If it is enabled for powerpc, the following error is reported:
      
        arch/powerpc/mm/tlb-radix.c: In function '__radix__flush_tlb_range_psize':
        arch/powerpc/mm/tlb-radix.c:104:2: error: asm operand 3 probably doesn't match constraints [-Werror]
          asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
          ^~~
        arch/powerpc/mm/tlb-radix.c:104:2: error: impossible constraint in 'asm'
      
      Link: http://lkml.kernel.org/r/20190423034959.13525-10-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Brezillon <bbrezillon@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Marek Vasut <marek.vasut@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Stefan Agner <stefan@agner.ch>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      powerpc/prom_init: mark prom_getprop() and prom_getproplen() as __init · 480795a0
      Masahiro Yamada committed
      This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common
      place.  We need to eliminate potential issues beforehand.
      
      If it is enabled for powerpc, the following modpost warnings are
      reported:
      
        WARNING: vmlinux.o(.text.unlikely+0x20): Section mismatch in reference from the function .prom_getprop() to the function .init.text:.call_prom()
        The function .prom_getprop() references the function __init .call_prom().
        This is often because .prom_getprop lacks a __init annotation or the annotation of .call_prom is wrong.
      
        WARNING: vmlinux.o(.text.unlikely+0x3c): Section mismatch in reference from the function .prom_getproplen() to the function .init.text:.call_prom()
        The function .prom_getproplen() references the function __init .call_prom().
        This is often because .prom_getproplen lacks a __init annotation or the annotation of .call_prom is wrong.
      
      Link: http://lkml.kernel.org/r/20190423034959.13525-9-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Brezillon <bbrezillon@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Marek Vasut <marek.vasut@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Stefan Agner <stefan@agner.ch>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      ARM: mark setup_machine_tags() stub as __init __noreturn · 2e0168a7
      Masahiro Yamada committed
      This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common
      place.  We need to eliminate potential issues beforehand.
      
      If it is enabled for arm, Clang build results in the following modpost
      warning:
      
        WARNING: vmlinux.o(.text+0x1124): Section mismatch in reference from the function setup_machine_tags() to the function .init.text:early_print()
        The function setup_machine_tags() references the function __init early_print().
        This is often because setup_machine_tags lacks a __init annotation or the annotation of early_print is wrong.
      
      Link: http://lkml.kernel.org/r/20190423034959.13525-8-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Brezillon <bbrezillon@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Marek Vasut <marek.vasut@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Stefan Agner <stefan@agner.ch>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      MIPS: mark __fls() and __ffs() as __always_inline · e9ea596c
      Masahiro Yamada committed
      This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common
      place.  We need to eliminate potential issues beforehand.
      
      If it is enabled for mips, the following errors are reported:
      
        arch/mips/mm/sc-mips.o: In function `mips_sc_prefetch_enable.part.2':
        sc-mips.c:(.text+0x98): undefined reference to `mips_gcr_base'
        sc-mips.c:(.text+0x9c): undefined reference to `mips_gcr_base'
        sc-mips.c:(.text+0xbc): undefined reference to `mips_gcr_base'
        sc-mips.c:(.text+0xc8): undefined reference to `mips_gcr_base'
        sc-mips.c:(.text+0xdc): undefined reference to `mips_gcr_base'
        arch/mips/mm/sc-mips.o:sc-mips.c:(.text.unlikely+0x44): more undefined references to `mips_gcr_base'
      
      Link: http://lkml.kernel.org/r/20190423034959.13525-7-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Brezillon <bbrezillon@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Marek Vasut <marek.vasut@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Stefan Agner <stefan@agner.ch>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mtd: rawnand: vf610_nfc: add initializer to avoid -Wmaybe-uninitialized · 21279828
      Masahiro Yamada committed
      This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common
      place.  We need to eliminate potential issues beforehand.
      
      The kbuild test robot has never reported a -Wmaybe-uninitialized warning
      for this, probably because vf610_nfc_run() is inlined by the x86
      compiler's inlining heuristic.
      
      If CONFIG_OPTIMIZE_INLINING is enabled for a different architecture and
      vf610_nfc_run() is not inlined, the following warning is reported:
      
        drivers/mtd/nand/raw/vf610_nfc.c: In function `vf610_nfc_cmd':
        drivers/mtd/nand/raw/vf610_nfc.c:455:3: warning: `offset' may be used uninitialized in this function [-Wmaybe-uninitialized]
           vf610_nfc_rd_from_sram(instr->ctx.data.buf.in + offset,
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                    nfc->regs + NFC_MAIN_AREA(0) + offset,
                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                    trfr_sz, !nfc->data_access);
                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Link: http://lkml.kernel.org/r/20190423034959.13525-6-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Brezillon <bbrezillon@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Marek Vasut <marek.vasut@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Stefan Agner <stefan@agner.ch>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21279828
    • M
      s390/cpacf: mark cpacf_query() as __always_inline · e60fb8bf
      Masahiro Yamada committed
      This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common
      place.  We need to eliminate potential issues beforehand.
      
      If it is enabled for s390, the following error is reported:
      
        In file included from arch/s390/crypto/des_s390.c:19:
        arch/s390/include/asm/cpacf.h: In function 'cpacf_query':
        arch/s390/include/asm/cpacf.h:170:2: warning: asm operand 3 probably doesn't match constraints
          asm volatile(
          ^~~
        arch/s390/include/asm/cpacf.h:170:2: error: impossible constraint in 'asm'
      
      Link: http://lkml.kernel.org/r/20190423034959.13525-5-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Brezillon <bbrezillon@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Marek Vasut <marek.vasut@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Stefan Agner <stefan@agner.ch>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e60fb8bf
    • M
      MIPS: mark mult_sh_align_mod() as __always_inline · 1221a585
      Masahiro Yamada committed
      This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common
      place.  We need to eliminate potential issues beforehand.
      
      If it is enabled for mips, the following error is reported:
      
        arch/mips/kernel/cpu-bugs64.c: In function 'mult_sh_align_mod.constprop':
        arch/mips/kernel/cpu-bugs64.c:33:2: error: asm operand 1 probably doesn't match constraints [-Werror]
          asm volatile(
          ^~~
        arch/mips/kernel/cpu-bugs64.c:33:2: error: asm operand 1 probably doesn't match constraints [-Werror]
          asm volatile(
          ^~~
        arch/mips/kernel/cpu-bugs64.c:33:2: error: impossible constraint in 'asm'
          asm volatile(
          ^~~
        arch/mips/kernel/cpu-bugs64.c:33:2: error: impossible constraint in 'asm'
          asm volatile(
          ^~~
      
      Link: http://lkml.kernel.org/r/20190423034959.13525-4-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Brezillon <bbrezillon@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Marek Vasut <marek.vasut@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Stefan Agner <stefan@agner.ch>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1221a585
    • M
      arm64: mark (__)cpus_have_const_cap as __always_inline · 02166b88
      Masahiro Yamada committed
      This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common
      place.  We need to eliminate potential issues beforehand.
      
      If it is enabled for arm64, the following errors are reported:
      
        In file included from include/linux/compiler_types.h:68,
                         from <command-line>:
        arch/arm64/include/asm/jump_label.h: In function 'cpus_have_const_cap':
        include/linux/compiler-gcc.h:120:38: warning: asm operand 0 probably doesn't match constraints
         #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
                                              ^~~
        arch/arm64/include/asm/jump_label.h:32:2: note: in expansion of macro 'asm_volatile_goto'
          asm_volatile_goto(
          ^~~~~~~~~~~~~~~~~
        include/linux/compiler-gcc.h:120:38: error: impossible constraint in 'asm'
         #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
                                              ^~~
        arch/arm64/include/asm/jump_label.h:32:2: note: in expansion of macro 'asm_volatile_goto'
          asm_volatile_goto(
          ^~~~~~~~~~~~~~~~~
      
      Link: http://lkml.kernel.org/r/20190423034959.13525-3-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Brezillon <bbrezillon@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Marek Vasut <marek.vasut@gmail.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Stefan Agner <stefan@agner.ch>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      02166b88
    • A
      ARM: prevent tracing IPI_CPU_BACKTRACE · be167862
      Arnd Bergmann committed
      Patch series "compiler: allow all arches to enable
      CONFIG_OPTIMIZE_INLINING", v3.
      
      This patch (of 11):
      
      When function tracing for IPIs is enabled, we get a warning for an
      overflow of the ipi_types array with the IPI_CPU_BACKTRACE type as
      triggered by raise_nmi():
      
        arch/arm/kernel/smp.c: In function 'raise_nmi':
        arch/arm/kernel/smp.c:489:2: error: array subscript is above array bounds [-Werror=array-bounds]
          trace_ipi_raise(target, ipi_types[ipinr]);
      
      This is a correct warning as we actually overflow the array here.
      
      This patch changes raise_nmi() to call __smp_cross_call() instead of
      smp_cross_call(), to avoid calling into ftrace.  For clarification, I'm
      also adding two new code comments describing how this one is special.
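      
      The overflow and the fix can be sketched in user-space C.  The enum
      values, the table contents, and every sketch_ name below are
      hypothetical stand-ins, not the code in arch/arm/kernel/smp.c; the
      point is only that an IPI number outside the name table must not
      reach the traced path.

```c
#include <assert.h>

/* Hypothetical IPI numbers: only the first NR_TRACED_IPI have names. */
enum sketch_ipi {
	SKETCH_IPI_WAKEUP,
	SKETCH_IPI_TIMER,
	SKETCH_IPI_RESCHEDULE,
	NR_TRACED_IPI,
	/* Special IPI numbered past the end of the name table. */
	SKETCH_IPI_CPU_BACKTRACE = NR_TRACED_IPI + 1,
};

static const char *const sketch_ipi_types[NR_TRACED_IPI] = {
	"wakeup", "timer", "reschedule",
};

static unsigned int sketch_sent_mask;	/* records which IPIs were raised */
static const char *sketch_last_traced;	/* last name handed to the tracer */

/* Untraced primitive: raises the IPI and never touches the name table. */
static void sketch_raw_cross_call(unsigned int ipinr)
{
	assert(ipinr < 32);	/* the mask in this sketch has 32 bits */
	sketch_sent_mask |= 1u << ipinr;
}

/* Traced wrapper: only valid for ipinr < NR_TRACED_IPI. */
static int sketch_traced_cross_call(unsigned int ipinr)
{
	if (ipinr >= NR_TRACED_IPI)
		return -1;	/* would index past the end of the table */
	sketch_last_traced = sketch_ipi_types[ipinr];
	sketch_raw_cross_call(ipinr);
	return 0;
}

/* The backtrace IPI goes straight to the untraced primitive. */
static void sketch_raise_nmi(void)
{
	sketch_raw_cross_call(SKETCH_IPI_CPU_BACKTRACE);
}
```

      In this sketch the traced wrapper rejects the out-of-range number
      explicitly, whereas the patch described above avoids the traced
      path altogether for this one IPI.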
      
      The warning appears to have shown up after commit e7273ff4 ("ARM:
      8488/1: Make IPI_CPU_BACKTRACE a "non-secure" SGI"), which changed the
      number assignment from '15' to '8', but as far as I can tell has existed
      since the IPI tracepoints were first introduced.  If we decide to
      backport this patch to stable kernels, we probably need to backport
      e7273ff4 as well.
      
      [yamada.masahiro@socionext.com: rebase on v5.1-rc1]
      Link: http://lkml.kernel.org/r/20190423034959.13525-2-yamada.masahiro@socionext.com
      Fixes: e7273ff4 ("ARM: 8488/1: Make IPI_CPU_BACKTRACE a "non-secure" SGI")
      Fixes: 365ec7b1 ("ARM: add IPI tracepoints") # v3.17
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Stefan Agner <stefan@agner.ch>
      Cc: Boris Brezillon <bbrezillon@kernel.org>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Marek Vasut <marek.vasut@gmail.com>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be167862
    • M
      treewide: remove SPDX "WITH Linux-syscall-note" from kernel-space headers · 687a3e4d
      Masahiro Yamada committed
      The "WITH Linux-syscall-note" exception should be added only to headers
      exported to user space.
      
      Some kernel-space headers have "WITH Linux-syscall-note", which seems to
      be a mistake.
      
      [1] arch/x86/include/asm/hyperv-tlfs.h
      
      Commit 5a485803 ("x86/hyper-v: move hyperv.h out of uapi") moved
      this file out of uapi, but did not update the SPDX license tag.
      
      [2] include/asm-generic/shmparam.h
      
      Commit 76ce2a80 ("Rename include/{uapi => }/asm-generic/shmparam.h
      really") moved this file out of uapi, but did not update the SPDX
      license tag.
      
      [3] include/linux/qcom-geni-se.h
      
      Commit eddac5af ("soc: qcom: Add GENI based QUP Wrapper driver")
      added this file, but I do not see a good reason why its license tag must
      include "WITH Linux-syscall-note".
      
      Link: http://lkml.kernel.org/r/1554196104-3522-1-git-send-email-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      687a3e4d
    • A
      fs/select: avoid clang stack usage warning · ad312f95
      Arnd Bergmann committed
      The select() implementation is carefully tuned to put a sensible amount
      of data on the stack for holding a copy of the user space fd_set, but
      not so much that it risks overflowing the kernel stack.
      
      When building a 32-bit kernel with clang, we need a little more space
      than with gcc, which often triggers a warning:
      
        fs/select.c:619:5: error: stack frame size of 1048 bytes in function 'core_sys_select' [-Werror,-Wframe-larger-than=]
        int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
      
      I experimentally found that for 32-bit ARM, reducing the maximum stack
      usage by 64 bytes keeps us reliably under the warning limit again.
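      
      The general pattern being tuned here, a fixed on-stack buffer with a
      heap fallback for larger requests, might look like the following
      user-space sketch.  The function name and the 256-byte budget are
      illustrative assumptions, not the values used in fs/select.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative on-stack budget in bytes; not the kernel's actual value. */
#define SKETCH_STACK_ALLOC 256

/*
 * Sum n longs through a scratch copy.  Small inputs use the on-stack
 * buffer; larger ones fall back to the heap so the stack frame stays
 * bounded.  *used_heap reports which path was taken.  Returns -1 if
 * the heap allocation fails.
 */
static int sketch_sum_copy(const long *src, size_t n, long *sum, int *used_heap)
{
	long stack_buf[SKETCH_STACK_ALLOC / sizeof(long)];
	long *buf = stack_buf;
	size_t i;

	assert(src && sum && used_heap);

	*used_heap = 0;
	if (n > sizeof(stack_buf) / sizeof(stack_buf[0])) {
		buf = malloc(n * sizeof(*buf));
		if (!buf)
			return -1;
		*used_heap = 1;
	}

	memcpy(buf, src, n * sizeof(*buf));
	*sum = 0;
	for (i = 0; i < n; i++)
		*sum += buf[i];

	if (buf != stack_buf)
		free(buf);
	return 0;
}
```

      Lowering the budget shrinks the stack frame at the cost of sending
      slightly more requests down the heap path.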
      
      Link: http://lkml.kernel.org/r/20190307090146.1874906-1-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad312f95
    • J
      mm/mincore.c: make mincore() more conservative · 134fca90
      Jiri Kosina committed
      The semantics of what mincore() considers to be resident is not
      completely clear, but Linux has always (since 2.3.52, which is when
      mincore() was initially done) treated it as "page is available in page
      cache".
      
      That's potentially a problem, as that [in]directly exposes
      meta-information about pagecache / memory mapping state even about
      memory not strictly belonging to the process executing the syscall,
      opening possibilities for sidechannel attacks.
      
      Change the semantics of mincore() so that it only reveals pagecache
      information for non-anonymous mappings that belong to files that the
      calling process could (if it tried to) successfully open for writing;
      otherwise we'd be including shared non-exclusive mappings, which
      
       - is the sidechannel
      
       - is not the usecase for mincore(), as that's primarily used for data,
         not (shared) text
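      
      The access rule described above can be modelled as a small predicate.
      This is a simplified user-space sketch: struct sketch_vma and its
      fields are hypothetical, and treating anonymous mappings as always
      allowed is an assumption drawn from the wording "non-anonymous
      mappings" rather than from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a memory mapping. */
struct sketch_vma {
	bool anonymous;		/* not backed by a file */
	bool has_file;		/* file-backed, with a known file */
	bool caller_may_write;	/* caller could open that file for writing */
};

/* May the caller learn page cache residency for this mapping? */
static bool sketch_can_do_mincore(const struct sketch_vma *vma)
{
	assert(vma != NULL);

	if (vma->anonymous)
		return true;	/* assumption: no shared page cache involved */
	if (!vma->has_file)
		return false;
	/* Only files the caller could write to reveal residency. */
	return vma->caller_may_write;
}
```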
      
      [jkosina@suse.cz: v2]
        Link: http://lkml.kernel.org/r/20190312141708.6652-2-vbabka@suse.cz
      [mhocko@suse.com: restructure can_do_mincore() conditions]
      Link: http://lkml.kernel.org/r/nycvar.YFH.7.76.1903062342020.19912@cbobk.fhfr.pm
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Josh Snyder <joshs@netflix.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Originally-by: Linus Torvalds <torvalds@linux-foundation.org>
      Originally-by: Dominique Martinet <asmadeus@codewreck.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Kevin Easton <kevin@guarana.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Cyril Hrubis <chrubis@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daniel Gruss <daniel@gruss.cc>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      134fca90
    • D
      mm: maintain randomization of page free lists · 97500a4a
      Dan Williams committed
      When freeing a page with an order >= shuffle_page_order, randomly select
      the front or back of the list for insertion.
      
      While the mm tries to defragment physical pages into huge pages, this can
      tend to make the page allocator more predictable over time.  Inject the
      front-back randomness to preserve the initial randomness established by
      shuffle_free_memory() when the kernel was booted.
      
      The overhead of this manipulation is constrained by only being applied
      for MAX_ORDER sized pages by default.
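      
      A minimal sketch of the front-or-back choice, using a small circular
      list in the spirit of list_head.  All names are hypothetical, the
      random bit is supplied by the caller so the behavior is
      deterministic, and always placing smaller pages at the front is an
      assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list node. */
struct sketch_node {
	struct sketch_node *prev, *next;
	unsigned int order;
};

static void sketch_list_init(struct sketch_node *head)
{
	head->prev = head->next = head;
}

static void sketch_insert_between(struct sketch_node *n,
				  struct sketch_node *prev,
				  struct sketch_node *next)
{
	n->prev = prev;
	n->next = next;
	prev->next = n;
	next->prev = n;
}

/*
 * Free a page onto the list.  Pages at or above shuffle_order go to
 * the front or the back depending on random_bit; smaller pages always
 * go to the front.
 */
static void sketch_free_page(struct sketch_node *head, struct sketch_node *page,
			     unsigned int shuffle_order, bool random_bit)
{
	assert(head != NULL && page != NULL);

	if (page->order >= shuffle_order && random_bit)
		sketch_insert_between(page, head->prev, head);	/* back */
	else
		sketch_insert_between(page, head, head->next);	/* front */
}
```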
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Robert Elliott <elliott@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97500a4a
    • D
      mm: move buddy list manipulations into helpers · b03641af
      Dan Williams committed
      In preparation for runtime randomization of the zone lists, take all
      (well, most of) the list_*() functions in the buddy allocator and put
      them in helper functions.  Provide a common control point for injecting
      additional behavior when freeing pages.
      
      [dan.j.williams@intel.com: fix buddy list helpers]
        Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
      [vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
        Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
      Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Robert Elliott <elliott@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b03641af
    • D
      mm: shuffle initial free memory to improve memory-side-cache utilization · e900a918
      Dan Williams committed
      Patch series "mm: Randomize free memory", v10.
      
      This patch (of 3):
      
      Randomization of the page allocator improves the average utilization of
      a direct-mapped memory-side-cache.  Memory side caching is a platform
      capability that Linux has been previously exposed to in HPC
      (high-performance computing) environments on specialty platforms.  In
      that instance it was a smaller pool of high-bandwidth-memory relative to
      higher-capacity / lower-bandwidth DRAM.  Now, this capability is going
      to be found on general purpose server platforms where DRAM is a cache in
      front of higher latency persistent memory [1].
      
      Robert offered an explanation of the state of the art of Linux
      interactions with memory-side-caches [2], and I copy it here:
      
          It's been a problem in the HPC space:
          http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/
      
          A kernel module called zonesort is available to try to help:
          https://software.intel.com/en-us/articles/xeon-phi-software
      
          and this abandoned patch series proposed that for the kernel:
          https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com
      
          Dan's patch series doesn't attempt to ensure buffers won't conflict, but
          also reduces the chance that the buffers will. This will make performance
          more consistent, albeit slower than "optimal" (which is near impossible
          to attain in a general-purpose kernel).  That's better than forcing
          users to deploy remedies like:
              "To eliminate this gradual degradation, we have added a Stream
               measurement to the Node Health Check that follows each job;
               nodes are rebooted whenever their measured memory bandwidth
               falls below 300 GB/s."
      
      A replacement for zonesort was merged upstream in commit cc9aec03
      ("x86/numa_emulation: Introduce uniform split capability").  With this
      numa_emulation capability, memory can be split into cache sized
      ("near-memory" sized) numa nodes.  A bind operation to such a node, and
      disabling workloads on other nodes, enables full cache performance.
      However, once the workload exceeds the cache size then cache conflicts
      are unavoidable.  While HPC environments might be able to tolerate
      time-scheduling of cache sized workloads, for general purpose server
      platforms, the oversubscribed cache case will be the common case.
      
      The worst case scenario is that a server system owner benchmarks a
      workload at boot with an un-contended cache only to see that performance
      degrade over time, even below the average cache performance due to
      excessive conflicts.  Randomization clips the peaks and fills in the
      valleys of cache utilization to yield steady average performance.
      
      Here are some performance impact details of the patches:
      
      1/ An Intel internal synthetic memory bandwidth measurement tool saw a
         3X speedup in a contrived case that tries to force cache conflicts.
         The contrived case used the numa_emulation capability to force an
         instance of the benchmark to be run in two of the near-memory sized
         numa nodes.  If both instances were placed on the same emulated node,
         they would fit and cause zero conflicts.  On separate emulated nodes
         without randomization, they underutilized the cache and conflicted
         unnecessarily due to the in-order allocation per node.
      
      2/ A well known Java server application benchmark was run with a heap
         size that exceeded cache size by 3X.  The cache conflict rate was 8%
         for the first run and degraded to 21% after page allocator aging.  With
         randomization enabled the rate levelled out at 11%.
      
      3/ A MongoDB workload did not observe a measurable difference in
         cache-conflict rates, but the overall throughput dropped by 7% with
         randomization in one case.
      
      4/ Mel Gorman ran his suite of performance workloads with randomization
         enabled on platforms without a memory-side-cache and saw a mix of some
         improvements and some losses [3].
      
      While there is potentially significant improvement for applications that
      depend on low latency access across a wide working-set, the performance
      impact may be negligible to negative for other workloads.  For this reason the
      shuffle capability defaults to off unless a direct-mapped
      memory-side-cache is detected.  Even then, the page_alloc.shuffle=0
      parameter can be specified to disable the randomization on those systems.
      
      Outside of memory-side-cache utilization concerns, there is a potential
      security benefit from randomization.  Some data exfiltration and
      return-oriented-programming attacks rely on the ability to infer the
      location of sensitive data objects.  The kernel page allocator, especially
      early in system boot, has predictable first-in-first-out behavior for
      physical pages.  Pages are freed in physical address order when first
      onlined.
      
      Quoting Kees:
          "While we already have a base-address randomization
           (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
           memory layouts would certainly be using the predictability of
           allocation ordering (i.e. for attacks where the base address isn't
           important: only the relative positions between allocated memory).
           This is common in lots of heap-style attacks. They try to gain
           control over ordering by spraying allocations, etc.
      
           I'd really like to see this because it gives us something similar
           to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."
      
      While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
      caches, it leaves the vast bulk of memory to be allocated predictably in
      order.  However, it should be noted that the concrete security benefits
      are hard to quantify, and no known CVE is mitigated by this randomization.
      
      Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
      a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
      are initially populated with free memory at boot and at hotplug time.  Do
      this based on either the presence of a page_alloc.shuffle=Y command line
      parameter, or autodetection of a memory-side-cache (to be added in a
      follow-on patch).
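      
      For reference, a textbook Fisher-Yates shuffle over an array of block
      indices can be sketched as below.  This is a user-space illustration
      under stated assumptions: the kernel operates on free lists rather
      than a plain array, and the xorshift generator and all sketch_ names
      are hypothetical choices made to keep the example deterministic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift generator so the sketch is self-contained. */
static uint32_t sketch_rand(uint32_t *state)
{
	uint32_t x = *state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	*state = x;
	return x;
}

/*
 * Fisher-Yates shuffle of n block indices: walk from the last slot
 * down, swapping each slot with a randomly chosen slot at or below it.
 * The modulo introduces a slight bias, acceptable for a sketch.
 */
static void sketch_shuffle(unsigned long *blocks, size_t n, uint32_t seed)
{
	uint32_t state = seed ? seed : 1;	/* xorshift state must be nonzero */
	size_t i;

	assert(blocks != NULL || n == 0);

	for (i = n; i > 1; i--) {
		size_t j = sketch_rand(&state) % i;	/* 0 <= j < i */
		unsigned long tmp = blocks[i - 1];

		blocks[i - 1] = blocks[j];
		blocks[j] = tmp;
	}
}
```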
      
      The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
      pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
      10 (4MB); this trades off randomization granularity for time spent
      shuffling.  MAX_ORDER-1 was chosen to be minimally invasive to the page
      allocator while still showing memory-side cache behavior improvements,
      and with the expectation that the security implications of finer
      granularity randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM.
      The performance impact of the shuffling appears to be in the noise
      compared to other memory initialization work.
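      
      The 4MB figure follows from the block size at order 10 with 4KiB base
      pages, which is an assumption about the page size here.  A short
      sketch of the arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <limits.h>

/* Size in bytes of one free block of the given buddy order. */
static unsigned long sketch_block_bytes(unsigned long page_size,
					unsigned int order)
{
	assert(order < sizeof(unsigned long) * CHAR_BIT);
	return page_size << order;
}

/* Number of shuffle units (blocks) that cover mem_bytes of memory. */
static unsigned long sketch_shuffle_units(unsigned long mem_bytes,
					  unsigned long page_size,
					  unsigned int order)
{
	return mem_bytes / sketch_block_bytes(page_size, order);
}
```

      With these inputs, 1GiB of memory is shuffled as 256 units of 4MiB
      each, which is why the cost stays small compared with shuffling
      individual pages.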
      
      This initial randomization can be undone over time, so a follow-on patch
      is introduced to inject entropy on page free decisions.  It is reasonable
      to ask whether the page free entropy alone is sufficient, but it is not,
      due to the in-order initial freeing of pages.  At the start of that
      process, putting page1 in front of or behind page0 still keeps them close
      together, and page2 is still near page1 and has a high chance of being
      adjacent.  As more pages are added, ordering diversity improves, but there
      is still high page locality for the low address pages, and this leads to
      no significant impact on the cache conflict rate.
      
      [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
      [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
      [3]: https://lkml.org/lkml/2018/10/12/309
      
      [dan.j.williams@intel.com: fix shuffle enable]
        Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
      [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
        Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
      Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Robert Elliott <elliott@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e900a918
    • U
      mm/vmalloc.c: convert vmap_lazy_nr to atomic_long_t · 4d36e6f8
      Uladzislau Rezki (Sony) committed
      The vmap_lazy_nr variable has type atomic_t, which is a 4-byte integer
      on both 32-bit and 64-bit systems.  lazy_max_pages() deals with
      "unsigned long", which is 8 bytes on 64-bit systems, so vmap_lazy_nr
      should be 8 bytes on 64-bit systems as well.
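      
      The idea of keeping the counter as wide as the threshold it is
      compared against can be sketched with C11 atomics.  This is a
      user-space analogue, not the kernel's atomic_long_t API; the sketch_
      names and the threshold value are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Counter with the same width as the "unsigned long" threshold. */
static atomic_long sketch_lazy_nr;

/* Hypothetical threshold in pages (32MiB worth of 4KiB pages). */
static unsigned long sketch_lazy_max_pages(void)
{
	return 32UL * 1024 * 1024 / 4096;
}

/*
 * Account nr_pages as lazily freed.  Returns 1 when the backlog has
 * grown past the threshold and a purge should be attempted.
 */
static int sketch_free_lazily(unsigned long nr_pages)
{
	long nr_lazy;

	assert(nr_pages > 0);
	nr_lazy = atomic_fetch_add(&sketch_lazy_nr, (long)nr_pages) +
		  (long)nr_pages;
	return (unsigned long)nr_lazy > sketch_lazy_max_pages();
}
```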
      
      Link: http://lkml.kernel.org/r/20190131162452.25879-1-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d36e6f8
    • U
      mm/vmalloc.c: add priority threshold to __purge_vmap_area_lazy() · 68571be9
      Uladzislau Rezki (Sony) committed
      Commit 763b218d ("mm: add preempt points into __purge_vmap_area_lazy()")
      introduced some preempt points, one of which prioritizes an allocation
      over the lazy freeing of vmap areas.
      
      Prioritizing allocation over freeing does not work well all the time,
      i.e. it should rather be a compromise.
      
      1) The number of lazy pages directly influences the busy list length and
         thus operations such as allocation, lookup, unmap, remove, etc.
      
      2) Under heavy stress of the vmalloc subsystem I ran into a situation
         where memory usage keeps increasing until it hits the out_of_memory ->
         panic state, because the logic that frees vmap areas in the
         __purge_vmap_area_lazy() function is completely blocked.
      
      Establish a threshold beyond which freeing is prioritized over
      allocation again, creating a balance between the two.
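      
      The balancing rule can be sketched as a purge loop that yields to
      waiting allocators only while the backlog is below a threshold.  The
      function name, its parameters, and the counted "yield" standing in
      for a preemption point are all hypothetical; the actual threshold
      value is not stated in this message.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Free up to `areas` lazily queued areas of `pages_per_area` pages
 * each.  After each one, yield only while the remaining backlog is
 * below resched_threshold; above it, keep freeing without a break.
 * Returns how many times the loop yielded.
 */
static unsigned long sketch_purge(unsigned long *nr_lazy, unsigned long areas,
				  unsigned long pages_per_area,
				  unsigned long resched_threshold)
{
	unsigned long yields = 0;

	assert(nr_lazy != NULL && pages_per_area > 0);

	while (areas-- && *nr_lazy >= pages_per_area) {
		*nr_lazy -= pages_per_area;	/* one area freed */
		if (*nr_lazy < resched_threshold)
			yields++;	/* stand-in for a preemption point */
	}
	return yields;
}
```

      With a backlog of 100 pages, areas of 10 pages, and a threshold of
      50, the first five frees run back to back and only the last five
      yield.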
      
      Using the vmalloc test driver in "stress mode", i.e. when all available
      test cases are run simultaneously on all online CPUs, applying
      pressure on the vmalloc subsystem, my HiKey 960 board runs out of
      memory because the __purge_vmap_area_lazy() logic simply is
      not able to free pages in time.
      
      How I run it:
      
      1) You should build your kernel with CONFIG_TEST_VMALLOC=m
      2) ./tools/testing/selftests/vm/test_vmalloc.sh stress
      
      During this test the number of "vmap_lazy_nr" pages goes far beyond the
      acceptable lazy_max_pages() threshold, which leads to an enormous busy
      list size and other problems, including increased allocation time.
      
      Link: http://lkml.kernel.org/r/20190124115648.9433-3-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68571be9
    • D
      kernel/sched/psi.c: expose pressure metrics on root cgroup · df5ba5be
      Dan Schatzberg committed
      Pressure metrics are already recorded and exposed in procfs for the
      entire system, but any tool which monitors cgroup pressure has to
      special case the root cgroup to read from procfs.  This patch exposes
      the already recorded pressure metrics on the root cgroup.
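      
      With the root cgroup exposing the same metrics, a monitoring tool can
      use one parser for every cgroup.  The sketch below parses a single
      pressure line of the form "some avg10=... avg60=... avg300=...
      total=..."; the struct and function names are hypothetical, and the
      caller is assumed to have read the line from a pressure file already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct sketch_psi {
	double avg10, avg60, avg300;	/* percent stalled, running averages */
	unsigned long long total_us;	/* cumulative stall time in usecs */
};

/*
 * Parse one pressure line.  kind receives the leading word ("some" or
 * "full").  Returns 0 on success, -1 if the line does not match.
 */
static int sketch_parse_psi(const char *line, char kind[8],
			    struct sketch_psi *out)
{
	int n;

	assert(line != NULL && kind != NULL && out != NULL);

	n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   kind, &out->avg10, &out->avg60, &out->avg300,
		   &out->total_us);
	return n == 5 ? 0 : -1;
}
```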
      
      Link: http://lkml.kernel.org/r/20190510174938.3361741-1-dschatzberg@fb.com
      Signed-off-by: Dan Schatzberg <dschatzberg@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df5ba5be
    • S
      psi: introduce psi monitor · 0e94682b
      Suren Baghdasaryan committed
      Psi monitor aims to provide a low-latency short-term pressure detection
      mechanism configurable by users.  It allows users to monitor psi metrics
      growth and trigger events whenever a metric rises above a user-defined
      threshold within a user-defined time window.
      
      Time window and threshold are both expressed in usecs.  Multiple psi
      resources with different thresholds and window sizes can be monitored
      concurrently.
      
      Psi monitors activate when the system enters the stall state for the
      monitored psi metric and deactivate upon exit from the stall state.
      While the system is in the stall state, psi signal growth is monitored at
      a rate of 10 times per tracking window.  The min window size is 500ms,
      therefore the min monitoring interval is 50ms.  The max window size is
      10s, with a monitoring interval of 1s.
      
      When activated, a psi monitor stays active for at least the duration of
      one tracking window to avoid repeated activations/deactivations when the
      psi signal is bouncing.
      
      Notifications to the users are rate-limited to one per tracking window.
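
      The window arithmetic above can be sketched in a few lines.  This is a
      minimal illustration, not the kernel code: the constant and function
      names are assumptions, and only the numbers (500ms, 10s, 10 updates
      per window, one notification per window) come from the description.

```c
#include <assert.h>
#include <stdint.h>

/* Limits taken from the commit text; the names are illustrative. */
#define WINDOW_MIN_US      500000ULL    /* 500ms */
#define WINDOW_MAX_US      10000000ULL  /* 10s */
#define UPDATES_PER_WINDOW 10           /* checks per tracking window */

/* Monitoring interval in usecs, or 0 if the window is out of range. */
static uint64_t monitor_interval_us(uint64_t window_us)
{
	if (window_us < WINDOW_MIN_US || window_us > WINDOW_MAX_US)
		return 0;
	return window_us / UPDATES_PER_WINDOW;
}

/*
 * Rate limit: allow a notification only if at least one full window has
 * passed since the previous one. Assumes now_us >= last_event_us.
 */
static int may_notify(uint64_t now_us, uint64_t last_event_us,
                      uint64_t window_us)
{
	assert(now_us >= last_event_us);
	return now_us - last_event_us >= window_us;
}
```

      For a 500ms window this yields a 50ms interval, and for a 10s window a
      1s interval, matching the figures quoted above.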
      
      Link: http://lkml.kernel.org/r/20190319235619.260832-8-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0e94682b
    • S
      include/: refactor headers to allow kthread.h inclusion in psi_types.h · 8af0c18a
      Suren Baghdasaryan committed
      kthread.h can't be included in psi_types.h because doing so creates a
      circular inclusion: kthread.h eventually includes psi_types.h, which
      then fails to compile because the kthread structures it needs are
      defined further down in kthread.h.  Resolve this by removing the
      psi_types.h inclusion from the headers included from kthread.h.
      
      Link: http://lkml.kernel.org/r/20190319235619.260832-7-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8af0c18a
    • S
      psi: track changed states · 333f3017
      Suren Baghdasaryan committed
      Introduce changed_states parameter into collect_percpu_times to track
      the states changed since the last update.
      
      This will be needed to detect whether polled states activated in the
      monitor patch.
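
      One way to picture the changed_states idea is a bitmask with one bit
      per state whose accumulated time moved since the previous sample.  The
      sketch below is a simplified assumption, not the kernel function:
      collect_changed_states(), NR_STATES, and the flat arrays are
      hypothetical stand-ins for the per-CPU bookkeeping.

```c
#include <assert.h>
#include <stdint.h>

#define NR_STATES 6	/* hypothetical number of tracked states */

/*
 * Compare the current per-state times against the previous sample and
 * return a mask with bit s set for every state s whose time changed.
 * prev[] is updated so that the next call reports only newer changes.
 */
static uint32_t collect_changed_states(const uint32_t now[NR_STATES],
                                       uint32_t prev[NR_STATES])
{
	uint32_t changed = 0;

	assert(now && prev);

	for (int s = 0; s < NR_STATES; s++) {
		uint32_t delta = now[s] - prev[s];

		if (delta)
			changed |= 1u << s;
		prev[s] = now[s];
	}
	return changed;
}
```

      A caller can then test the returned mask against the set of polled
      states to decide whether any of them became active.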
      
      Link: http://lkml.kernel.org/r/20190319235619.260832-6-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      333f3017
    • S
      psi: split update_stats into parts · 7fc70a39
      Suren Baghdasaryan committed
      Split update_stats into collect_percpu_times and update_averages so
      that collect_percpu_times can be reused later inside the psi monitor.
      
      Link: http://lkml.kernel.org/r/20190319235619.260832-5-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7fc70a39
    • S
      psi: rename psi fields in preparation for psi trigger addition · bcc78db6
      Suren Baghdasaryan committed
      Rename psi_group structure member fields used for calculating psi totals
      and averages for clear distinction between them and for trigger-related
      fields that will be added by "psi: introduce psi monitor".
      
      [surenb@google.com: v6]
        Link: http://lkml.kernel.org/r/20190319235619.260832-4-surenb@google.com
      Link: http://lkml.kernel.org/r/20190124211518.244221-5-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bcc78db6
    • S
      9289c5e6
    • S
      psi: introduce state_mask to represent stalled psi states · 33b2d630
      Suren Baghdasaryan committed
      Patch series "psi: pressure stall monitors", v6.
      
      This is a respin of:
        https://lwn.net/ml/linux-kernel/20190308184311.144521-1-surenb%40google.com/
      
      Android is adopting psi to detect and remedy memory pressure that
      results in stuttering and decreased responsiveness on mobile devices.
      
      Psi gives us the stall information, but because we're dealing with
      latencies in the millisecond range, periodically reading the pressure
      files to detect stalls in a timely fashion is not feasible.  Psi also
      doesn't aggregate its averages at a high-enough frequency right now.
      
      This patch series extends the psi interface such that users can
      configure sensitive latency thresholds and use poll() and friends to be
      notified when these are breached.
      
      As high-frequency aggregation is costly, it implements an aggregation
      method that is optimized for fast, short-interval averaging, and makes
      the aggregation frequency adaptive, such that high-frequency updates
      only happen while monitored stall events are actively occurring.
      
      With these patches applied, Android can monitor for, and ward off,
      mounting memory shortages before they cause problems for the user.  For
      example, using memory stall monitors in the userspace low memory killer
      daemon (lmkd), we can detect mounting pressure and kill less important
      processes before the device becomes visibly sluggish.  In our memory
      stress testing, psi memory monitors produce roughly 10x fewer false
      positives than vmpressure signals.  The ability to specify multiple
      triggers for the same psi metric allows other parts of the Android
      framework to monitor the memory state of the device and act accordingly.
      
      The new interface is straightforward.  The user opens one of the
      pressure files for writing and writes a trigger description into the
      file descriptor.  The description defines the stall state (some or
      full) and the maximum stall time over a given window of time.  E.g.:
      
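              /* Needs <fcntl.h>, <poll.h> and <unistd.h> */
              /* Signal when stall time exceeds 100ms of a 1s window */
              char trigger[] = "full 100000 1000000";
              struct pollfd fds;

              fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
              fds.events = POLLPRI;
              write(fds.fd, trigger, sizeof(trigger));
              while (poll(&fds, 1, -1) >= 0) {
                      ...
              }
              close(fds.fd);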
      
      When the monitored stall state is entered, psi adapts its aggregation
      frequency according to what the configured time window requires in order
      to emit event signals in a timely fashion.  Once the stalling subsides,
      aggregation reverts back to normal.
      
      The trigger is associated with the open file descriptor.  To stop
      monitoring, the user only needs to close the file descriptor and the
      trigger is discarded.
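
      The open/write/poll/close sequence described above might be wrapped
      as follows.  This is a hedged sketch: the three helper names are
      hypothetical, and waiting on POLLPRI follows the kernel's psi
      documentation rather than anything stated in this message.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Format "<some|full> <stall usecs> <window usecs>"; 0 on success. */
static int psi_format_trigger(char *buf, size_t len, const char *state,
                              unsigned long stall_us, unsigned long window_us)
{
	int n;

	assert(buf && state);
	n = snprintf(buf, len, "%s %lu %lu", state, stall_us, window_us);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Open a pressure file and register a trigger; returns the fd or -1. */
static int psi_register_trigger(const char *path, const char *trigger)
{
	int fd = open(path, O_RDWR | O_NONBLOCK);

	if (fd < 0)
		return -1;
	/* Write the string including its terminating NUL byte. */
	if (write(fd, trigger, strlen(trigger) + 1) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Wait for one event: 1 = threshold breached, 0 = timeout, -1 = error. */
static int psi_wait_event(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int n = poll(&pfd, 1, timeout_ms);

	if (n < 0 || (n > 0 && (pfd.revents & POLLERR)))
		return -1;
	return (n > 0 && (pfd.revents & POLLPRI)) ? 1 : 0;
}
```

      A caller would format a trigger, register it on a file such as
      /proc/pressure/memory, loop on psi_wait_event(), and close() the
      descriptor to drop the trigger.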
      
      Patches 1-6 prepare the psi code for polling support.  Patch 7
      implements the adaptive polling logic, the pressure growth detection
      optimized for short intervals, and hooks up write() and poll() on the
      pressure files.
      
      The patches were developed in collaboration with Johannes Weiner.
      
      This patch (of 7):
      
      The psi monitoring patches will need to determine the same states as
      record_times().  To avoid calculating them twice, maintain a state mask
      that can be consulted cheaply.  Do this in a separate patch to keep the
      churn in the main feature patch at a minimum.
      
      This adds 4-byte state_mask member into psi_group_cpu struct which
      results in its first cacheline-aligned part becoming 52 bytes long.  Add
      explicit values to enumeration element counters that affect
      psi_group_cpu struct size.
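
      The state mask idea can be illustrated with a simplified model.
      Everything named below (enum stall_state, struct cpu_tasks,
      state_active(), compute_state_mask()) is hypothetical, and the
      "some"/"full" rules are a simplification assumed for the sketch: the
      states are evaluated once, stored as bits, and later consulted with a
      single bit test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state list; explicit values keep the count stable. */
enum stall_state {
	STALL_IO_SOME   = 0,
	STALL_IO_FULL   = 1,
	STALL_MEM_SOME  = 2,
	STALL_MEM_FULL  = 3,
	NR_STALL_STATES = 4,
};

/* Simplified per-CPU task counters (assumed for this sketch). */
struct cpu_tasks {
	unsigned int iowait;
	unsigned int memstall;
	unsigned int running;
};

/* "some": at least one stalled task; "full": stalled and nothing running. */
static bool state_active(const struct cpu_tasks *t, enum stall_state s)
{
	switch (s) {
	case STALL_IO_SOME:  return t->iowait > 0;
	case STALL_IO_FULL:  return t->iowait > 0 && t->running == 0;
	case STALL_MEM_SOME: return t->memstall > 0;
	case STALL_MEM_FULL: return t->memstall > 0 && t->running == 0;
	default:             return false;
	}
}

/* Evaluate every state once and remember the result as one bit each. */
static uint32_t compute_state_mask(const struct cpu_tasks *t)
{
	uint32_t mask = 0;

	assert(t);
	for (int s = 0; s < NR_STALL_STATES; s++)
		if (state_active(t, (enum stall_state)s))
			mask |= 1u << s;
	return mask;
}
```

      A later consumer can check `mask & (1u << STALL_MEM_FULL)` instead of
      re-deriving the state from the task counters.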
      
      Link: http://lkml.kernel.org/r/20190124211518.244221-4-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33b2d630
    • B
      mm: update references to page _refcount · 136ac591
      Baruch Siach committed
      Commit 0139aa7b ("mm: rename _count, field of the struct page, to
      _refcount") left out a couple of references to the old field name.  Fix
      that.
      
      Link: http://lkml.kernel.org/r/cedf87b02eb8a6b3eac57e8e91da53fb15c3c44c.1556537475.git.baruch@tkos.co.il
      Fixes: 0139aa7b ("mm: rename _count, field of the struct page, to _refcount")
      Signed-off-by: Baruch Siach <baruch@tkos.co.il>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      136ac591
    • A
      mm: change mm_update_next_owner() to update mm->owner with WRITE_ONCE · 987717e5
      Andrea Arcangeli committed
      The RCU reader uses rcu_dereference() inside rcu_read_lock critical
      sections, so the writer shall use WRITE_ONCE.  Just a cleanup, we still
      rely on gcc to emit atomic writes in other places.
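
      As a rough userspace approximation of the pattern, the sketch below
      defines WRITE_ONCE and READ_ONCE as volatile accesses.  The kernel
      macros do more than this, rcu_dereference() is not modeled, and the
      struct and function names are hypothetical; __typeof__ is a GCC/Clang
      extension.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: force a single, non-torn access via volatile. */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

struct task { int pid; };		/* hypothetical task structure */
struct mm   { struct task *owner; };	/* hypothetical mm with an owner */

/* Writer side: publish the new owner with one store. */
static void mm_set_owner(struct mm *mm, struct task *new_owner)
{
	assert(mm);
	WRITE_ONCE(mm->owner, new_owner);
}

/* Reader side: load the pointer once, then use only the local copy. */
static struct task *mm_get_owner(struct mm *mm)
{
	assert(mm);
	return READ_ONCE(mm->owner);
}
```

      The point of the pairing is that the compiler may neither split the
      store nor reload the pointer between the NULL check and the use.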
      
      Link: http://lkml.kernel.org/r/20190325225636.11635-3-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      987717e5
    • A
      userfaultfd: use RCU to free the task struct when fork fails · c3f3ce04
      Andrea Arcangeli committed
      The task structure is freed while get_mem_cgroup_from_mm() holds
      rcu_read_lock() and dereferences mm->owner.
      
        get_mem_cgroup_from_mm()                failing fork()
        ----                                    ---
        task = mm->owner
                                                mm->owner = NULL;
                                                free(task)
        if (task) *task; /* use after free */
      
      The fix consists in freeing the task with RCU also in the fork failure
      case, exactly like it always happens for the regular exit(2) path.  That
      is enough to make the rcu_read_lock hold in get_mem_cgroup_from_mm()
      (left side above) effective to avoid a use after free when dereferencing
      the task structure.
      
      An alternate possible fix would be to defer the delivery of the
      userfaultfd contexts to the monitor until after fork() is guaranteed to
      succeed.  Such a change would require more changes because it would
      create a strict ordering dependency where the uffd methods would need to
      be called beyond the last potentially failing branch in order to be
      safe.  This solution, by contrast, only adds a dependency on common code
      to set mm->owner to NULL and to free, with RCU, the task struct that
      mm->owner pointed to, if fork ends up failing.  The userfaultfd methods
      can still be called anywhere during the fork runtime and the monitor
      will keep discarding orphaned "mm" coming from failed forks in userland.
      
      This race condition couldn't trigger if CONFIG_MEMCG was set =n at build
      time.
      
      [aarcange@redhat.com: improve changelog, reduce #ifdefs per Michal]
        Link: http://lkml.kernel.org/r/20190429035752.4508-1-aarcange@redhat.com
      Link: http://lkml.kernel.org/r/20190325225636.11635-2-aarcange@redhat.com
      Fixes: 893e26e6 ("userfaultfd: non-cooperative: Add fork() event")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: zhong jiang <zhongjiang@huawei.com>
      Reported-by: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3f3ce04