1. 14 Feb 2020 (1 commit)
  2. 09 Feb 2020 (1 commit)
    • pipe: use exclusive waits when reading or writing · 0ddad21d
      Linus Torvalds authored
      This makes the pipe code use separate wait-queues and exclusive waiting
      for readers and writers, avoiding a nasty thundering herd problem when
      there are lots of readers waiting for data on a pipe (or, less commonly,
      lots of writers waiting for a pipe to have space).
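
      As a rough sketch of the shape of the change (illustrative only: the
      field and helper names below are assumptions, not necessarily the
      actual pipe code):

          /* Readers and writers sleep on separate queues, and sleep
           * exclusively, so a wakeup wakes a single waiter instead of the
           * whole herd. */
          struct pipe_inode_info {
                  wait_queue_head_t rd_wait;      /* readers waiting for data */
                  wait_queue_head_t wr_wait;      /* writers waiting for space */
                  /* other fields omitted */
          };

          /* reader side: exclusive wait, so only one reader is woken */
          wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe));

          /* writer side, after producing data: wake exactly one reader */
          wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);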
      
      While this isn't a common occurrence in the traditional "use a pipe as a
      data transport" case, where you typically only have a single reader and
      a single writer process, there is one common special case: using a pipe
      as a source of "locking tokens" rather than for data communication.
      
      In particular, the GNU make jobserver code ends up using a pipe as a way
      to limit parallelism, where each job consumes a token by reading a byte
      from the jobserver pipe, and releases the token by writing a byte back
      to the pipe.
      
      This pattern is fairly traditional on Unix, and works very well, but
      will waste a lot of time waking up a lot of processes when only a single
      reader needs to be woken up when a writer releases a new token.
      
      A simplified test-case of just this pipe interaction is to create 64
      processes, and then pass a single token around between them (this
      test-case also intentionally passes another token that gets ignored to
      test the "wake up next" logic too, in case anybody wonders about it):
      
          #include <unistd.h>
      
          int main(int argc, char **argv)
          {
              int fd[2], counters[2];
      
              pipe(fd);
              counters[0] = 0;
              counters[1] = -1;
              write(fd[1], counters, sizeof(counters));
      
              /* 64 processes */
              fork(); fork(); fork(); fork(); fork(); fork();
      
              do {
                      int i;
                      read(fd[0], &i, sizeof(i));
                      if (i < 0)
                              continue;
                      counters[0] = i+1;
                      write(fd[1], counters, (1+(i & 1)) *sizeof(int));
              } while (counters[0] < 1000000);
              return 0;
          }
      
      and in a perfect world, passing that token around should only cause one
      context switch per transfer, when the writer of a token causes a
      directed wakeup of just a single reader.
      
      But with the "writer wakes all readers" model we traditionally had, on
      my test box the above case causes more than an order of magnitude more
      scheduling: instead of the expected ~1M context switches, "perf stat"
      shows
      
              231,852.37 msec task-clock                #   15.857 CPUs utilized
              11,250,961      context-switches          #    0.049 M/sec
                 616,304      cpu-migrations            #    0.003 M/sec
                   1,648      page-faults               #    0.007 K/sec
       1,097,903,998,514      cycles                    #    4.735 GHz
         120,781,778,352      instructions              #    0.11  insn per cycle
          27,997,056,043      branches                  #  120.754 M/sec
             283,581,233      branch-misses             #    1.01% of all branches
      
            14.621273891 seconds time elapsed
      
             0.018243000 seconds user
             3.611468000 seconds sys
      
      before this commit.
      
      After this commit, I get
      
                5,229.55 msec task-clock                #    3.072 CPUs utilized
               1,212,233      context-switches          #    0.232 M/sec
                 103,951      cpu-migrations            #    0.020 M/sec
                   1,328      page-faults               #    0.254 K/sec
          21,307,456,166      cycles                    #    4.074 GHz
          12,947,819,999      instructions              #    0.61  insn per cycle
           2,881,985,678      branches                  #  551.096 M/sec
              64,267,015      branch-misses             #    2.23% of all branches
      
             1.702148350 seconds time elapsed
      
             0.004868000 seconds user
             0.110786000 seconds sys
      
      instead. Much better.
      
      [ Note! This kernel improvement seems to be very good at triggering a
        race condition in the make jobserver (in GNU make 4.2.1) for me. It's
        a long known bug that was fixed back in June 2017 by GNU make commit
        b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
        avoid hangs.").
      
        But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
        so a number of distributions may still have the buggy version. Some
        have backported the fix to their 4.2.1 release, though, and even
        without the fix it's quite timing-dependent whether the bug actually
        is hit. ]
      
      Josh Triplett says:
       "I've been hammering on your pipe fix patch (switching to exclusive
        wait queues) for a month or so, on several different systems, and I've
        run into no issues with it. The patch *substantially* improves
        parallel build times on large (~100 CPU) systems, both with parallel
        make and with other things that use make's pipe-based jobserver.
      
        All current distributions (including stable and long-term stable
        distributions) have versions of GNU make that no longer have the
        jobserver bug"
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ddad21d
  3. 08 Feb 2020 (12 commits)
  4. 07 Feb 2020 (4 commits)
  5. 06 Feb 2020 (2 commits)
    • skbuff: fix a data race in skb_queue_len() · 86b18aaa
      Qian Cai authored
      sk_buff.qlen can be accessed concurrently as noticed by KCSAN,
      
       BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg
      
       read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96:
        unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821
      				 net/unix/af_unix.c:1761
        ____sys_sendmsg+0x33e/0x370
        ___sys_sendmsg+0xa6/0xf0
        __sys_sendmsg+0x69/0xf0
        __x64_sys_sendmsg+0x51/0x70
        do_syscall_64+0x91/0xb47
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
       write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99:
        __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029
        __skb_try_recv_datagram+0xbe/0x220
        unix_dgram_recvmsg+0xee/0x850
        ____sys_recvmsg+0x1fb/0x210
        ___sys_recvmsg+0xa2/0xf0
        __sys_recvmsg+0x66/0xf0
        __x64_sys_recvmsg+0x51/0x70
        do_syscall_64+0x91/0xb47
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Since the read is lockless, load tearing could introduce a logic bug
      in unix_recvq_full(). Fix it by adding lockless variants of
      skb_queue_len() and unix_recvq_full() that use READ_ONCE() on the read
      side and WRITE_ONCE() on the write side, similar to commit
      d7d16a89 ("net: add skb_queue_empty_lockless()").
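
      A minimal userspace model of the pattern this adds is sketched below;
      READ_ONCE()/WRITE_ONCE() here are simplified stand-ins for the kernel
      macros, and the struct is reduced to the one field that matters:

          #include <stdio.h>

          #define READ_ONCE(x)        (*(const volatile __typeof__(x) *)&(x))
          #define WRITE_ONCE(x, val)  (*(volatile __typeof__(x) *)&(x) = (val))

          struct sk_buff_head { unsigned int qlen; };

          /* Lockless variant: a single marked load, so the compiler cannot
           * tear or refetch it. */
          static unsigned int skb_queue_len_lockless(const struct sk_buff_head *list)
          {
              return READ_ONCE(list->qlen);
          }

          int main(void)
          {
              struct sk_buff_head q = { 0 };

              WRITE_ONCE(q.qlen, 3);  /* writer side; under the queue lock in the kernel */
              printf("queue length: %u\n", skb_queue_len_lockless(&q));
              return 0;
          }
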
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86b18aaa
    • of: clk: Make <linux/of_clk.h> self-contained · 5df86714
      Geert Uytterhoeven authored
      Depending on include order:
      
          include/linux/of_clk.h:11:45: warning: ‘struct device_node’ declared inside parameter list will not be visible outside of this definition or declaration
           unsigned int of_clk_get_parent_count(struct device_node *np);
      						 ^~~~~~~~~~~
          include/linux/of_clk.h:12:43: warning: ‘struct device_node’ declared inside parameter list will not be visible outside of this definition or declaration
           const char *of_clk_get_parent_name(struct device_node *np, int index);
      					       ^~~~~~~~~~~
          include/linux/of_clk.h:13:31: warning: ‘struct of_device_id’ declared inside parameter list will not be visible outside of this definition or declaration
           void of_clk_init(const struct of_device_id *matches);
      				   ^~~~~~~~~~~~
      
      Fix this by adding forward declarations for struct device_node and
      struct of_device_id.
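
      The fix, sketched against the prototypes shown in the warnings above
      (a forward declaration is enough because the header only uses pointers
      to these types):

          /* include/linux/of_clk.h (sketch) */
          struct device_node;
          struct of_device_id;

          unsigned int of_clk_get_parent_count(struct device_node *np);
          const char *of_clk_get_parent_name(struct device_node *np, int index);
          void of_clk_init(const struct of_device_id *matches);
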
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Link: https://lkml.kernel.org/r/20200205194649.31309-1-geert+renesas@glider.be
      Signed-off-by: Stephen Boyd <sboyd@kernel.org>
      5df86714
  6. 05 Feb 2020 (2 commits)
  7. 04 Feb 2020 (18 commits)
    • nfs: optimise readdir cache page invalidation · 227823d2
      Dai Ngo authored
      When the directory is large and is being modified by one client while
      another client is doing 'ls -l' on the same directory, the cache page
      invalidation from nfs_force_use_readdirplus causes the reading client
      to keep restarting READDIRPLUS from cookie 0. This makes the 'ls -l'
      take a very long time to complete, possibly never completing.
      
      Currently when nfs_force_use_readdirplus is called to switch from
      READDIR to READDIRPLUS, it invalidates all the cached pages of the
      directory. This cache page invalidation causes the next nfs_readdir
      to re-read the directory content from cookie 0.
      
      This patch optimises the cache invalidation in
      nfs_force_use_readdirplus by truncating only the cached pages from the
      last accessed page index to the end of the file. It also marks the
      inode to delay invalidating all the cached pages of the directory
      until the initial nfs_readdir of the next 'ls' instance.
      Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
      Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      [Anna - Fix conflicts with Trond's readdir patches]
      [Anna - Remove redundant call to nfs_zap_mapping()]
      [Anna - Replace d_inode(file_dentry(desc->file)) with file_inode(desc->file)]
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      227823d2
    • include/linux/cpumask.h: don't calculate length of the input string · 190535f7
      Yury Norov authored
      The new design of the inner bitmap_parse() makes it possible to avoid
      calculating the length of a null-terminated string.
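
      As a hedged sketch of what the caller can now look like (illustrative;
      the exact kernel code may differ), cpumask_parse() no longer needs a
      strchrnul()-based length calculation and can pass UINT_MAX to mean
      "the string is null-terminated":

          static inline int cpumask_parse(const char *buf, struct cpumask *dstp)
          {
                  /* UINT_MAX: let bitmap_parse() find the end of the string itself */
                  return bitmap_parse(buf, UINT_MAX, cpumask_bits(dstp), nr_cpumask_bits);
          }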
      
      Link: http://lkml.kernel.org/r/20200102043031.30357-8-yury.norov@gmail.com
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Amritha Nambiar <amritha.nambiar@intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: "Tobin C . Harding" <tobin@kernel.org>
      Cc: Vineet Gupta <vineet.gupta1@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      190535f7
    • lib: rework bitmap_parse() · 2d626158
      Yury Norov authored
      bitmap_parse() is inefficient and full of opaque variables and
      open-coded parts, which makes it hard to understand and use.  This
      rework includes:
      
      - remove the bitmap_shift_left() call from the loop; it is what makes
        the complexity of the current algorithm O(nbits^2).  In the suggested
        approach the input string is parsed in reverse direction, so no
        shifts are needed (a userspace sketch of this idea follows the list
        below);

      - relax the requirement of a single comma and no whitespace between
        chunks.  This is considered useful in scripting, and it aligns with
        bitmap_parselist();

      - split bitmap_parse() into small readable helpers;

      - make an explicit calculation of the end of the input line at the
        beginning, so users of bitmap_parse() don't have to do it themselves.
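
      A small userspace sketch of the reverse-direction idea (illustrative
      only, not the kernel implementation): walking the comma-separated hex
      chunks from the end of the string lets each chunk land directly in the
      right 32-bit word, with no whole-bitmap shift:

          #include <stdio.h>
          #include <stdlib.h>
          #include <string.h>
          #include <stdint.h>

          int main(void)
          {
              const char *input = "400,0";   /* sets bit 42: 0x400 in the second 32-bit word */
              uint32_t words[4] = { 0 };     /* LE word order: words[0] holds bits 0..31 */
              size_t widx = 0;
              const char *end = input + strlen(input);

              while (end > input && widx < 4) {
                  const char *start = end;

                  /* find the start of the last (rightmost) remaining chunk */
                  while (start > input && start[-1] != ',')
                      start--;
                  words[widx++] = (uint32_t)strtoul(start, NULL, 16);
                  end = (start == input) ? input : start - 1;   /* skip the comma */
              }

              printf("bit 42 set? %s\n", ((words[1] >> 10) & 1) ? "yes" : "no");
              return 0;
          }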
      
      Link: http://lkml.kernel.org/r/20200102043031.30357-6-yury.norov@gmail.com
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Cc: Amritha Nambiar <amritha.nambiar@intel.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: "Tobin C . Harding" <tobin@kernel.org>
      Cc: Vineet Gupta <vineet.gupta1@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d626158
    • bitops: more BITS_TO_* macros · 0bddc1bd
      Yury Norov authored
      Introduce BITS_TO_U64, BITS_TO_U32 and BITS_TO_BYTES as they are handy
      in the following patches (BITS_TO_U32 specifically).  Reimplement the
      tools/ version of the macros to match the kernel implementation.
      
      Also fix indentation for BITS_PER_TYPE definition.
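
      As a rough userspace illustration of the rounding these helpers do
      (the names mirror the commit; the kernel versions are built on
      DIV_ROUND_UP()/BITS_PER_TYPE(), so treat this as a sketch):

          #include <stdio.h>
          #include <stdint.h>

          #define BITS_PER_TYPE(type)  (sizeof(type) * 8)
          #define BITS_TO_U64(nr)      (((nr) + BITS_PER_TYPE(uint64_t) - 1) / BITS_PER_TYPE(uint64_t))
          #define BITS_TO_U32(nr)      (((nr) + BITS_PER_TYPE(uint32_t) - 1) / BITS_PER_TYPE(uint32_t))
          #define BITS_TO_BYTES(nr)    (((nr) + BITS_PER_TYPE(char) - 1) / BITS_PER_TYPE(char))

          int main(void)
          {
              /* 130 bits need 3 u64 words, 5 u32 words, or 17 bytes */
              printf("%zu u64, %zu u32, %zu bytes\n",
                     BITS_TO_U64(130), BITS_TO_U32(130), BITS_TO_BYTES(130));
              return 0;
          }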
      
      Link: http://lkml.kernel.org/r/20200102043031.30357-3-yury.norov@gmail.com
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Amritha Nambiar <amritha.nambiar@intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: "Tobin C . Harding" <tobin@kernel.org>
      Cc: Vineet Gupta <vineet.gupta1@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0bddc1bd
    • lib/string: add strnchrnul() · 0bee0cec
      Yury Norov authored
      Patch series "lib: rework bitmap_parse", v5.
      
      Similarly to the recently revisited bitmap_parselist(), bitmap_parse()
      is inefficient and overcomplicated.  This series reworks it, aligns its
      interface with bitmap_parselist() and makes it simpler to use.

      The series also adds a test for the function and fixes its usage in
      cpumask_parse() according to the new design: it drops the calculation
      of the length of the input string.
      
      bitmap_parse() takes the array of numbers to be put into the map in BE
      order, which is the reverse of the natural LE order for bitmaps.  For
      example, to construct a bitmap with bit 42 set, we have to pass the
      line '400,0'.  The current implementation reads chunks one by one from
      the beginning ('400' before '0') and shifts the bitmap after each
      successfully parsed chunk, which makes the complexity of the whole
      process O(n^2).  We can parse in the reverse direction ('0' before
      '400') and avoid shifting, but that requires reverse parsing helpers.
      
      This patch (of 7):
      
      The new function works like strchrnul(), but on a length-limited string.
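
      A hedged sketch of those semantics (close to how such a helper is
      typically written, not necessarily byte-for-byte the kernel version):
      return a pointer to the first occurrence of c within the first n bytes
      of s, or to the byte where scanning stopped if c was not found:

          #include <stdio.h>
          #include <stddef.h>

          static char *strnchrnul(const char *s, size_t count, int c)
          {
              while (count-- && *s && *s != (char)c)
                  s++;
              return (char *)s;
          }

          int main(void)
          {
              char line[] = "ff00,3";

              printf("%s\n", strnchrnul(line, sizeof(line) - 1, ','));  /* ",3" */
              printf("%s\n", strnchrnul(line, 2, ','));                 /* "00,3": only 2 bytes examined */
              return 0;
          }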
      
      Link: http://lkml.kernel.org/r/20200102043031.30357-2-yury.norov@gmail.com
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Amritha Nambiar <amritha.nambiar@intel.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Tobin C . Harding" <tobin@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Cc: Vineet Gupta <vineet.gupta1@synopsys.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0bee0cec
    • proc: convert everything to "struct proc_ops" · 97a32539
      Alexey Dobriyan authored
      The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
      seq_file.h.
      
      Conversion rule is:
      
      	llseek		=> proc_lseek
      	unlocked_ioctl	=> proc_ioctl
      
      	xxx		=> proc_xxx
      
      	delete ".owner = THIS_MODULE" line
      
      [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
      [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
        Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97a32539
    • proc: decouple proc from VFS with "struct proc_ops" · d56c0d45
      Alexey Dobriyan authored
      Currently, core /proc code uses "struct file_operations" for custom
      hooks; however, VFS doesn't directly call them.  Every time VFS expands
      the file_operations hook set, /proc code bloats for no reason.
      
      Introduce "struct proc_ops" which contains only those hooks which /proc
      allows to call into (open, release, read, write, ioctl, mmap, poll).  It
      doesn't contain module pointer as well.
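
      A hedged sketch of that hook set (field names follow the proc_xxx
      convention; the prototypes mirror the corresponding file_operations
      hooks and are illustrative rather than exhaustive):

          struct proc_ops {
                  int      (*proc_open)(struct inode *, struct file *);
                  int      (*proc_release)(struct inode *, struct file *);
                  ssize_t  (*proc_read)(struct file *, char __user *, size_t, loff_t *);
                  ssize_t  (*proc_write)(struct file *, const char __user *, size_t, loff_t *);
                  loff_t   (*proc_lseek)(struct file *, loff_t, int);
                  long     (*proc_ioctl)(struct file *, unsigned int, unsigned long);
                  int      (*proc_mmap)(struct file *, struct vm_area_struct *);
                  __poll_t (*proc_poll)(struct file *, struct poll_table_struct *);
                  /* note: no "struct module *owner" */
          };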
      
      Save ~184 bytes per usage:
      
      	add/remove: 26/26 grow/shrink: 1/4 up/down: 1922/-6674 (-4752)
      	Function                                     old     new   delta
      	sysvipc_proc_ops                               -      72     +72
      				...
      	config_gz_proc_ops                             -      72     +72
      	proc_get_inode                               289     339     +50
      	proc_reg_get_unmapped_area                   110     107      -3
      	close_pdeo                                   227     224      -3
      	proc_reg_open                                289     284      -5
      	proc_create_data                              60      53      -7
      	rt_cpu_seq_fops                              256       -    -256
      				...
      	default_affinity_proc_fops                   256       -    -256
      	Total: Before=5430095, After=5425343, chg -0.09%
      
      Link: http://lkml.kernel.org/r/20191225172228.GA13378@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d56c0d45
    • x86: mm: avoid allocating struct mm_struct on the stack · e47690d7
      Steven Price authored
      struct mm_struct is quite large (~1664 bytes) and so allocating on the
      stack may cause problems as the kernel stack size is small.
      
      Since ptdump_walk_pgd_level_core() was only allocating the structure
      so that it could modify the pgd argument, we can instead introduce a
      pgd override in struct mm_walk and pass this down the call stack to
      where it is needed.

      Since the correct mm_struct is now being passed down, it is also
      unnecessary to take the mmap_sem semaphore here, because
      ptdump_walk_pgd() will take the semaphore on the real mm.
      
      [steven.price@arm.com: restore missed arm64 changes]
        Link: http://lkml.kernel.org/r/20200108145710.34314-1-steven.price@arm.com
      Link: http://lkml.kernel.org/r/20200108145710.34314-1-steven.price@arm.com
      Signed-off-by: Steven Price <steven.price@arm.com>
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e47690d7
    • mm: ptdump: reduce level numbers by 1 in note_page() · f8f0d0b6
      Steven Price authored
      Rather than having to increment the 'depth' number by 1 in
      ptdump_hole(), let's change the meaning of 'level' in note_page(),
      since that makes the code simpler.
      
      Note that for x86, the level numbers were previously increased by 1 in
      commit 45dcd209 ("x86/mm/dump_pagetables: Fix printout of p4d level")
      and the comment "Bit 7 has a different meaning" was not updated, so this
      change also makes the code match the comment again.
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-24-steven.price@arm.com
      Signed-off-by: Steven Price <steven.price@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f8f0d0b6
    • mm: add generic ptdump · 30d621f6
      Steven Price authored
      Add a generic version of page table dumping that architectures can opt-in
      to.
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-20-steven.price@arm.com
      Signed-off-by: Steven Price <steven.price@arm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      30d621f6
    • mm: pagewalk: add 'depth' parameter to pte_hole · b7a16c7a
      Steven Price authored
      The pte_hole() callback is called at multiple levels of the page
      tables.  Code dumping the kernel page tables needs to know at what
      depth the missing entry is.  Add this as an extra parameter to
      pte_hole().  When the depth isn't known (e.g. when processing a vma),
      -1 is passed.
      
      The depth that is reported is the actual level where the entry is missing
      (ignoring any folding that is in place), i.e.  any levels where
      PTRS_PER_P?D is set to 1 are ignored.
      
      Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
      natural numbers as levels 2/3/4.
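
      The resulting callback shape and depth convention, as a sketch (this
      is the pte_hole member of struct mm_walk_ops; exact formatting may
      differ):

          /* depth: -1 = unknown (e.g. walking a vma without a known level),
           *         0 = PGD, 1 = P4D, 2 = PUD, 3 = PMD, 4 = PTE */
          int (*pte_hole)(unsigned long addr, unsigned long next,
                          int depth, struct mm_walk *walk);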
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
      Signed-off-by: Steven Price <steven.price@arm.com>
      Tested-by: Zong Li <zong.li@sifive.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b7a16c7a
    • mm: pagewalk: allow walking without vma · 488ae6a2
      Steven Price authored
      Since 48684a65: "mm: pagewalk: fix misbehavior of walk_page_range for
      vma(VM_PFNMAP)", the page table walker will report any kernel area as
      a hole, because it lacks a vma.
      
      This means each arch has re-implemented page table walking when needed,
      for example in the per-arch ptdump walker.
      
      Remove the requirement to have a vma in the generic code and add a new
      function walk_page_range_novma() which ignores the VMAs and simply walks
      the page tables.
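
      A hedged sketch of the new entry point's shape as described here (the
      parameter list is illustrative and was extended by later patches, e.g.
      the pgd override from the x86 patch earlier in this log):

          /* Walk the page tables of @mm in [start, end) without consulting VMAs. */
          int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
                                    unsigned long end,
                                    const struct mm_walk_ops *ops,
                                    void *private);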
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-13-steven.price@arm.com
      Signed-off-by: Steven Price <steven.price@arm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      488ae6a2
    • mm: pagewalk: add p4d_entry() and pgd_entry() · 3afc4236
      Steven Price authored
      pgd_entry() and pud_entry() were removed by commit 0b1fbfe5
      ("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were no
      users.  We're about to add users so reintroduce them, along with
      p4d_entry() as we now have 5 levels of tables.
      
      Note that commit a00cc7d9 ("mm, x86: add support for PUD-sized
      transparent hugepages") already re-added pud_entry() but with different
      semantics to the other callbacks.  This commit reverts the semantics back
      to match the other callbacks.
      
      To support hmm.c, which now uses the new semantics of pud_entry(), a
      new member ('action') of struct mm_walk is added, which allows the
      callbacks to either descend (ACTION_SUBTREE, the default), skip
      (ACTION_CONTINUE) or repeat the callback (ACTION_AGAIN).  hmm.c is
      then updated to call pud_trans_huge_lock() itself and make use of the
      splitting/retry logic of the core code.
      
      After this change pud_entry() is called for all entries, not just
      transparent huge pages.
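
      The control values described above, roughly (a sketch of the enum
      stored in the new 'action' member):

          enum page_walk_action {
                  ACTION_SUBTREE = 0,     /* descend into this entry's subtree (default) */
                  ACTION_CONTINUE,        /* skip the subtree and continue the walk */
                  ACTION_AGAIN,           /* re-run the callback for this entry */
          };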
      
      [arnd@arndb.de: fix unused variable warning]
       Link: http://lkml.kernel.org/r/20200107204607.1533842-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/20191218162402.45610-12-steven.price@arm.com
      Signed-off-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3afc4236
    • mm: remove __krealloc · 1c948715
      Florian Westphal authored
      Since 5.5-rc1 the last user of this function is gone, so remove the
      functionality.
      
      See commit
      2ad9d774 ("netfilter: conntrack: free extension area immediately")
      for details.
      
      Link: http://lkml.kernel.org/r/20191212223442.22141-1-fw@strlen.de
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c948715
    • mm/memory_hotplug: drop valid_start/valid_end from test_pages_in_a_zone() · 92917998
      David Hildenbrand authored
      The callers are only interested in the actual zone; they don't care
      about boundaries.  Return the zone instead to simplify.
      
      Link: http://lkml.kernel.org/r/20200110183308.11849-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      92917998
    • mm: factor out next_present_section_nr() · 4c605881
      David Hildenbrand authored
      Let's move it to the header and use the shorter variant from
      mm/page_alloc.c (the original one will also check
      "__highest_present_section_nr + 1", which is not necessary).  While at
      it, make the section_nr in next_pfn() const.
      
      In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once we
      exceed __highest_present_section_nr, which doesn't make a difference in
      the caller as it is big enough (>= all sane end_pfn).
      
      Link: http://lkml.kernel.org/r/20200113144035.10848-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Jin, Zhi" <zhi.jin@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c605881
    • mm/page_alloc.c: initialize memmap of unavailable memory directly · 4b094b78
      David Hildenbrand authored
      Let's make sure that all memory holes are actually marked PageReserved(),
      that page_to_pfn() produces reliable results, and that these pages are not
      detected as "mmap" pages due to the mapcount.
      
      E.g., booting an x86-64 QEMU guest with 4160 MB:
      
      [    0.010585] Early memory node ranges
      [    0.010586]   node   0: [mem 0x0000000000001000-0x000000000009efff]
      [    0.010588]   node   0: [mem 0x0000000000100000-0x00000000bffdefff]
      [    0.010589]   node   0: [mem 0x0000000100000000-0x0000000143ffffff]
      
      max_pfn is 0x144000.
      
      Before this change:
      
      [root@localhost ~]# ./page-types -r -a 0x144000,
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000800           16384       64  ___________M_______________________________        mmap
                   total           16384       64
      
      After this change:
      
      [root@localhost ~]# ./page-types -r -a 0x144000,
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000100000000           16384       64  ___________________________r_______________        reserved
                   total           16384       64
      
      IOW, especially the unavailable physical memory ("memory hole") in the
      last section would not get properly marked PageReserved() and is indicated
      to be "mmap" memory.
      
      Drop the trace of that function from include/linux/mm.h - nobody else
      needs it, and rename it accordingly.
      
      Note: The fake zone/node might not be covered by the zone/node span.  This
      is not an urgent issue (for now, we had the same node/zone due to the
      zeroing).  We'll need a clean way to mark memory holes (e.g., using a page
      type PageHole() if possible or a fake ZONE_INVALID) and eventually stop
      marking these memory holes PageReserved().
      
      Link: http://lkml.kernel.org/r/20191211163201.17179-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b094b78
    • eventfd: track eventfd_signal() recursion depth · b5e683d5
      Jens Axboe authored
      eventfd use cases from aio and io_uring can deadlock due to circular
      or recursive calling, when eventfd_signal() tries to grab the waitqueue
      lock. On top of that, it's also possible to construct notification
      chains that are deep enough that we could blow the stack.
      
      Add a percpu counter that tracks the percpu recursion depth, and warn
      if we exceed it. The counter is also exposed so that users of
      eventfd_signal() can do the right thing if it's non-zero in the
      context where it is called.
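
      A hedged sketch of the guard this describes (illustrative; the real
      code lives in fs/eventfd.c and may differ in details):

          static DEFINE_PER_CPU(int, eventfd_wake_count);

          /* Exposed so callers (e.g. aio/io_uring paths) can detect nesting. */
          bool eventfd_signal_count(void)
          {
                  return this_cpu_read(eventfd_wake_count);
          }

          __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
          {
                  unsigned long flags;

                  /* A nested call would self-deadlock on ctx->wqh.lock (or, via a
                   * long notification chain, overflow the stack), so warn and bail. */
                  if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
                          return 0;

                  spin_lock_irqsave(&ctx->wqh.lock, flags);
                  this_cpu_inc(eventfd_wake_count);
                  if (ULLONG_MAX - ctx->count < n)
                          n = ULLONG_MAX - ctx->count;
                  ctx->count += n;
                  if (waitqueue_active(&ctx->wqh))
                          wake_up_locked_poll(&ctx->wqh, EPOLLIN);
                  this_cpu_dec(eventfd_wake_count);
                  spin_unlock_irqrestore(&ctx->wqh.lock, flags);

                  return n;
          }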
      
      Cc: stable@vger.kernel.org # 4.19+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b5e683d5