1. 23 Apr 2018, 1 commit
    • earlycon: Use a pointer table to fix __earlycon_table stride · dd709e72
      Committed by Daniel Kurtz
      Commit 99492c39 ("earlycon: Fix __earlycon_table stride") tried to fix
      __earlycon_table stride by forcing the earlycon_id struct alignment to 32
      and asking the linker to 32-byte align the __earlycon_table symbol.  This
      fix was based on commit 07fca0e5 ("tracing: Properly align linker
      defined symbols") which tried a similar fix for the tracing subsystem.
      
      However, this fix doesn't quite work because there is no guarantee that
      gcc will pack these structures into an array format.  In fact, gcc 4.9
      chooses to 64-byte align these structs by inserting additional padding
      between the entries because it has no clue that they are supposed to be in
      an array.  If we are unlucky, the linker will assign symbol
      "__earlycon_table" to a 32-byte aligned address which does not correspond
      to the 64-byte aligned contents of section "__earlycon_table".
      
      To address this same problem, the fix to the tracing system was
      subsequently re-implemented using a more robust table of pointers approach
      by commits:
       3d56e331 ("tracing: Replace syscall_meta_data struct array with pointer array")
       65498646 ("tracepoints: Fix section alignment using pointer array")
       e4a9ea5e ("tracing: Replace trace_event struct array with pointer array")
      
      Let's use this same "array of pointers to structs" approach for
      EARLYCON_TABLE.
      
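      To illustrate the difference, here is a hedged user-space sketch
      (hypothetical struct and field names, not the kernel's macros):
      forcing a struct's alignment does not guarantee that independently
      emitted objects form a contiguous array, while a table of pointers
      always has a fixed stride of sizeof(void *).
      
      ```
      #include <stdio.h>
      
      /* Hypothetical entry type; aligned(32) mirrors the reverted fix.
       * The compiler may still pad individual objects placed into a named
       * section to a larger boundary, breaking the assumed 32-byte stride. */
      struct entry {
              char name[20];
              int (*setup)(void);
      } __attribute__((aligned(32)));
      
      static struct entry e1 = { "uart0" }, e2 = { "uart1" };
      
      /* The robust scheme: the table holds pointers, whose stride is
       * always exactly sizeof(struct entry *). */
      static const struct entry *table[] = { &e1, &e2 };
      
      int main(void)
      {
              for (unsigned int i = 0; i < sizeof(table) / sizeof(table[0]); i++)
                      printf("%s\n", table[i]->name);
              return 0;
      }
      ```
      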
      Fixes: 99492c39 ("earlycon: Fix __earlycon_table stride")
      Signed-off-by: Daniel Kurtz <djkurtz@chromium.org>
      Suggested-by: Aaron Durbin <adurbin@chromium.org>
      Reviewed-by: Rob Herring <robh@kernel.org>
      Tested-by: Guenter Roeck <groeck@chromium.org>
      Reviewed-by: Guenter Roeck <groeck@chromium.org>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 10 Apr 2018, 2 commits
    • io: change writeX_relaxed() to remove barriers · a71e7c44
      Committed by Sinan Kaya
      Now that we have hardened the writeX() API in the asm-generic version,
      the writeX_relaxed() API violates the rules when writeX_relaxed() ==
      writeX() in the default implementation.
      
      The relaxed API shouldn't have any barriers in it and it doesn't provide
      any ordering with respect to the memory transactions. The only requirement
      is for writes to be ordered with respect to each other. This is achieved
      by the volatile in the __raw_writeX() API.
      
      Open code the relaxed API and remove any barriers in it.
      Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    • io: change readX_relaxed() to remove barriers · 8875c554
      Committed by Sinan Kaya
      Now that we have hardened the readX() API in the asm-generic version,
      the readX_relaxed() API violates the rules when readX_relaxed() ==
      readX() in the default implementation.
      
      The relaxed API shouldn't have any barriers in it and it doesn't provide
      any ordering with respect to the memory transactions. The only requirement
      is for reads to be ordered with respect to each other. This is achieved
      by the volatile in the __raw_readX() API.
      
      Open code the relaxed API and remove any barriers in it.
      Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
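      
      A minimal sketch of the shape shared by this and the previous commit,
      assuming the asm-generic style (simplified: 32-bit accessors only, no
      byte-swapping, user-space types):
      
      ```
      #include <stdint.h>
      
      /* Raw accessors: the volatile qualifier alone keeps these loads and
       * stores ordered with respect to each other, which is all the
       * relaxed API promises. */
      #define __raw_writel(v, a)      (*(volatile uint32_t *)(a) = (v))
      #define __raw_readl(a)          (*(const volatile uint32_t *)(a))
      
      /* Relaxed variants: open-coded, no barriers before or after. */
      static inline void writel_relaxed(uint32_t value, volatile void *addr)
      {
              __raw_writel(value, addr);
      }
      
      static inline uint32_t readl_relaxed(const volatile void *addr)
      {
              return __raw_readl(addr);
      }
      ```
      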
  3. 06 Apr 2018, 5 commits
  4. 04 Apr 2018, 1 commit
  5. 29 Mar 2018, 1 commit
    • bpf: introduce BPF_RAW_TRACEPOINT · c4f6699d
      Committed by Alexei Starovoitov
      Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
      kernel internal arguments of the tracepoints in their raw form.
      
      From the bpf program's point of view, access to the arguments looks like:
      struct bpf_raw_tracepoint_args {
             __u64 args[0];
      };
      
      int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
      {
        // program can read args[N], where N depends on the tracepoint
        // and is statically verified at program load+attach time
        return 0;
      }
      
      The kprobe+bpf infrastructure allows programs to access function
      arguments.  This feature allows programs to access raw tracepoint
      arguments.
      
      Similar to the proposed 'dynamic ftrace events', there are no ABI
      guarantees about what the tracepoint arguments are and what their
      meaning is.  The program needs to cast args properly and use the
      bpf_probe_read() helper to access struct fields when an argument is
      a pointer.
      
      For every tracepoint, a __bpf_trace_##call function is prepared.
      In assembly it looks like:
      (gdb) disassemble __bpf_trace_xdp_exception
      Dump of assembler code for function __bpf_trace_xdp_exception:
         0xffffffff81132080 <+0>:     mov    %ecx,%ecx
         0xffffffff81132082 <+2>:     jmpq   0xffffffff811231f0 <bpf_trace_run3>
      
      where
      
      TRACE_EVENT(xdp_exception,
              TP_PROTO(const struct net_device *dev,
                       const struct bpf_prog *xdp, u32 act),
      
      The above assembly snippet casts the 32-bit 'act' field to 'u64'
      to pass into bpf_trace_run3(), while the 'dev' and 'xdp' args are
      passed as-is.  All ~500 __bpf_trace_*() functions are only 5-10
      bytes long, and in total this approach adds 7k bytes to .text.
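      
      A hedged sketch of the wrapper shape (simplified; the real first
      argument of bpf_trace_run3() is kernel-internal, and the macros that
      generate these functions are omitted): each tracepoint gets a tiny
      trampoline that widens its arguments to u64 and tail-calls the
      arity-specific runner.
      
      ```
      typedef unsigned long long u64;
      
      struct net_device;
      struct bpf_prog;
      
      void bpf_trace_run3(void *call_data, u64 arg1, u64 arg2, u64 arg3);
      
      static void __bpf_trace_xdp_exception(void *call_data,
                                            const struct net_device *dev,
                                            const struct bpf_prog *xdp,
                                            unsigned int act)
      {
              /* Zero-extend the 32-bit 'act' (the 'mov %ecx,%ecx' above);
               * pointer arguments pass through unchanged. */
              bpf_trace_run3(call_data, (u64)(unsigned long)dev,
                             (u64)(unsigned long)xdp, (u64)act);
      }
      ```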
      
      This approach gives the lowest possible overhead
      when calling trace_xdp_exception() from kernel C code and
      transitioning into bpf land.
      Since tracepoint+bpf are used at speeds of 1M+ events per second,
      this is a valuable optimization.
      
      A new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
      that returns an anon_inode FD of a 'bpf-raw-tracepoint' object.
      
      The user space looks like:
      // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
      prog_fd = bpf_prog_load(...);
      // receive anon_inode fd for given bpf_raw_tracepoint with prog attached
      raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);
      
      Ctrl-C of a tracing daemon or cmdline tool that uses this feature
      will automatically detach the bpf program, unload it and
      unregister the tracepoint probe.
      
      On the kernel side, the __bpf_raw_tp_map section of pointers to the
      tracepoint definition and to the __bpf_trace_*() probe function is
      used to find the tracepoint named "xdp_exception" and its
      corresponding __bpf_trace_xdp_exception() probe function, which are
      passed to tracepoint_probe_register() to connect the probe
      with the tracepoint.
      
      Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
      tracepoint mechanisms. perf_event_open() can be used in parallel
      on the same tracepoint.
      Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) calls are
      permitted, each with its own bpf program.  The kernel will execute
      all tracepoint probes and all attached bpf programs.
      
      In the future bpf_raw_tracepoints can be extended with
      query/introspection logic.
      
      The __bpf_raw_tp_map section logic was contributed by Steven Rostedt.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  6. 23 Mar 2018, 1 commit
    • mm/vmalloc: add interfaces to free unmapped page table · b6bdb751
      Committed by Toshi Kani
      On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may
      create pud/pmd mappings.  A kernel panic was observed on arm64 systems
      with Cortex-A75 in the following steps as described by Hanjun Guo.
      
       1. ioremap a 4K region; a valid page table is built,
       2. iounmap it; pte0 is set to 0,
       3. ioremap the same address with a 2M size; pgd/pmd entries are
          unchanged, then a new value is set for the pmd,
       4. pte0 is leaked,
       5. the CPU may take an exception because the old pmd is still in the
          TLB, which leads to a kernel panic.
      
      This panic is not reproducible on x86.  INVLPG, called from iounmap,
      purges all levels of entries associated with the purged address on
      x86.  x86 still has the memory leak, however.
      
      The patch changes the ioremap path to free unmapped page table(s) since
      doing so in the unmap path has the following issues:
      
       - The iounmap() path is shared with vunmap().  Since vmap() only
         supports pte mappings, making vunmap() free a pte page is an
         overhead for regular vmap users, as they do not need a pte page
         freed up.
      
       - Checking if all entries in a pte page are cleared in the unmap path
         is racy, and serializing this check is expensive.
      
       - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
         Clearing a pud/pmd entry before the lazy TLB purges would need an
         extra TLB purge.
      
      Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which
      clear a given pud/pmd entry and free up a page for the lower level
      entries.
      
      This patch implements their stub functions on x86 and arm64, which
      work as a workaround.
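      
      A sketch of the interface contract (kernel-style pseudo-code with a
      simplified body; the real x86/arm64 implementations add the sanity
      checks and TLB maintenance discussed above):
      
      ```
      /* Clear a pud entry that points to a pmd page and free that page.
       * Returns 1 on success, 0 if the architecture cannot safely free it
       * (the generic stub behaviour). */
      int pud_free_pmd_page(pud_t *pud)
      {
              pmd_t *pmd;
      
              if (pud_none(*pud))
                      return 1;       /* nothing mapped, trivially done */
      
              pmd = (pmd_t *)pud_page_vaddr(*pud);
              pud_clear(pud);
              free_page((unsigned long)pmd);
              return 1;
      }
      ```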
      
      [akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub]
      Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com
      Fixes: e61ce6ad ("mm: change ioremap to set up huge I/O mappings")
      Reported-by: Lei Li <lious.lilei@hisilicon.com>
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Wang Xuefeng <wxf.wang@hisilicon.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 22 Mar 2018, 1 commit
  8. 18 Mar 2018, 1 commit
  9. 12 Mar 2018, 3 commits
  10. 10 Mar 2018, 1 commit
  11. 08 Mar 2018, 1 commit
  12. 07 Mar 2018, 1 commit
  13. 22 Feb 2018, 2 commits
    • asm-generic/io.h: move ioremap_nocache/ioremap_uc/ioremap_wc/ioremap_wt out of ifndef CONFIG_MMU · b3ada9d0
      Committed by Greentime Hu
      This allows some architectures to use these generic macros instead of
      defining their own.
      
      sparc: io: use the definitions of ioremap_[nocache|wc|wb] from
      asm-generic/io.h.  This moves ioremap_nocache out of the CONFIG_MMU
      ifdef, which means that in order to suppress redefinition errors we
      need to remove the #define in arch/sparc/include/asm/io_32.h.  Also,
      the change adds a prototype for ioremap() where size is size_t and
      offset is phys_addr_t, so fix that as well.
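      
      A sketch of the resulting asm-generic shape (guard style assumed; the
      real header may use static inline wrappers instead of defines):
      
      ```
      /* Now outside #ifndef CONFIG_MMU, so !MMU architectures also get
       * these defaults unless they provide their own definitions. */
      #ifndef ioremap_nocache
      #define ioremap_nocache ioremap
      #endif
      
      #ifndef ioremap_uc
      #define ioremap_uc ioremap_nocache
      #endif
      
      #ifndef ioremap_wc
      #define ioremap_wc ioremap_nocache
      #endif
      
      #ifndef ioremap_wt
      #define ioremap_wt ioremap_nocache
      #endif
      ```
      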
      Signed-off-by: Greentime Hu <greentime@andestech.com>
    • bug.h: work around GCC PR82365 in BUG() · 173a3efd
      Committed by Arnd Bergmann
      Looking at functions with large stack frames across all architectures
      led me to discover that BUG() suffers from the same problem as
      fortify_panic(), which I've already added a workaround for.
      
      In short, variables that go out of scope when we call a noreturn
      function or __builtin_unreachable() keep using stack space in the
      rest of the function.
      
      A workaround that was identified is to insert an empty assembler
      statement just before calling the function that doesn't return.  I'm
      adding a macro "barrier_before_unreachable()" to document this, and
      inserting calls to it in all instances of BUG() that currently suffer
      from this problem.
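      
      A minimal sketch of the idea, with simplified macro bodies rather
      than the exact kernel definitions:
      
      ```
      /* The empty asm statement acts as a compiler barrier: GCC then
       * releases the stack slots of out-of-scope variables before the
       * noreturn call instead of keeping them live (GCC PR82365). */
      #define barrier_before_unreachable() __asm__ __volatile__("")
      
      #define MY_BUG() do {                             \
              barrier_before_unreachable();             \
              __builtin_trap();   /* does not return */ \
      } while (0)
      ```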
      
      The files that saw the largest change from this had these frame sizes
      before, and much less with my patch:
      
        fs/ext4/inode.c:82:1: warning: the frame size of 1672 bytes is larger than 800 bytes [-Wframe-larger-than=]
        fs/ext4/namei.c:434:1: warning: the frame size of 904 bytes is larger than 800 bytes [-Wframe-larger-than=]
        fs/ext4/super.c:2279:1: warning: the frame size of 1160 bytes is larger than 800 bytes [-Wframe-larger-than=]
        fs/ext4/xattr.c:146:1: warning: the frame size of 1168 bytes is larger than 800 bytes [-Wframe-larger-than=]
        fs/f2fs/inode.c:152:1: warning: the frame size of 1424 bytes is larger than 800 bytes [-Wframe-larger-than=]
        net/netfilter/ipvs/ip_vs_core.c:1195:1: warning: the frame size of 1068 bytes is larger than 800 bytes [-Wframe-larger-than=]
        net/netfilter/ipvs/ip_vs_core.c:395:1: warning: the frame size of 1084 bytes is larger than 800 bytes [-Wframe-larger-than=]
        net/netfilter/ipvs/ip_vs_ftp.c:298:1: warning: the frame size of 928 bytes is larger than 800 bytes [-Wframe-larger-than=]
        net/netfilter/ipvs/ip_vs_ftp.c:418:1: warning: the frame size of 908 bytes is larger than 800 bytes [-Wframe-larger-than=]
        net/netfilter/ipvs/ip_vs_lblcr.c:718:1: warning: the frame size of 960 bytes is larger than 800 bytes [-Wframe-larger-than=]
        drivers/net/xen-netback/netback.c:1500:1: warning: the frame size of 1088 bytes is larger than 800 bytes [-Wframe-larger-than=]
      
      In the case of ARC and CRIS, it turns out that the BUG() implementation
      actually does return (or at least the compiler thinks it does),
      resulting in lots of warnings about uninitialized variable use and
      about falling off the end of noreturn functions, such as:
      
        block/cfq-iosched.c: In function 'cfq_async_queue_prio':
        block/cfq-iosched.c:3804:1: error: control reaches end of non-void function [-Werror=return-type]
        include/linux/dmaengine.h: In function 'dma_maxpq':
        include/linux/dmaengine.h:1123:1: error: control reaches end of non-void function [-Werror=return-type]
      
      This makes them call __builtin_trap() instead, which should normally
      dump the stack and kill the current process, like some of the other
      architectures already do.
      
      I tried adding barrier_before_unreachable() to panic() and
      fortify_panic() as well, but that had very little effect, so I'm not
      submitting that patch.
      
      Vineet said:
      
      : For ARC, it is double win.
      :
      : 1. Fixes 3 -Wreturn-type warnings
      :
      : | ../net/core/ethtool.c:311:1: warning: control reaches end of non-void function
      : [-Wreturn-type]
      : | ../kernel/sched/core.c:3246:1: warning: control reaches end of non-void function
      : [-Wreturn-type]
      : | ../include/linux/sunrpc/svc_xprt.h:180:1: warning: control reaches end of
      : non-void function [-Wreturn-type]
      :
      : 2.  bloat-o-meter reports code size improvements as gcc elides the
      :    generated code for stack return.
      
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82365
      Link: http://lkml.kernel.org/r/20171219114112.939391-1-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>	[arch/arc]
      Tested-by: Vineet Gupta <vgupta@synopsys.com>	[arch/arc]
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Christopher Li <sparse@chrisli.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 14 Feb 2018, 1 commit
  15. 13 Feb 2018, 1 commit
  16. 07 Feb 2018, 1 commit
    • lib: optimize cpumask_next_and() · 0ade34c3
      Committed by Clement Courbet
      We've measured that we spend ~0.6% of sys cpu time in cpumask_next_and().
      It's essentially a joined iteration in search of a non-zero bit, which
      is currently implemented as a lookup join (find a non-zero bit on the
      lhs, look up the rhs to see if it's set there).
      
      Implement a direct join (find a nonzero bit on the incrementally built
      join).  Also add generic bitmap benchmarks in the new `test_find_bit`
      module for the new function (see `find_next_and_bit` in [2] and [3]
      below).
      
      For cpumask_next_and, direct benchmarking shows that it's 1.17x to 14x
      faster with a geometric mean of 2.1 on 32 CPUs [1].  No impact on memory
      usage.  Note that on Arm, the new pure-C implementation still outperforms
      the old one that uses a mix of C and asm (`find_next_bit`) [3].
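      
      A simplified user-space rendition of the direct join (the kernel
      version handles arbitrary offsets and is more heavily optimized):
      
      ```
      #include <limits.h>
      
      #define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
      
      /* Scan the AND of the two bitmaps word by word instead of looking up
       * each candidate bit of addr1 in addr2. Returns the first common set
       * bit >= start, or size if there is none. */
      static unsigned long find_next_and_bit(const unsigned long *addr1,
                                             const unsigned long *addr2,
                                             unsigned long size,
                                             unsigned long start)
      {
              unsigned long i = start;
      
              while (i < size) {
                      unsigned long word = addr1[i / BITS_PER_LONG] &
                                           addr2[i / BITS_PER_LONG];
      
                      word &= ~0UL << (i % BITS_PER_LONG); /* drop bits below i */
                      if (word) {
                              unsigned long bit = i - i % BITS_PER_LONG +
                                                  __builtin_ctzl(word);
                              return bit < size ? bit : size;
                      }
                      i = i - i % BITS_PER_LONG + BITS_PER_LONG; /* next word */
              }
              return size;
      }
      ```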
      
      [1] Approximate benchmark code:
      
      ```
        unsigned long src1p[nr_cpumask_longs] = {pattern1};
        unsigned long src2p[nr_cpumask_longs] = {pattern2};
        for (/*a bunch of repetitions*/) {
          for (int n = -1; n <= nr_cpu_ids; ++n) {
            asm volatile("" : "+rm"(src1p)); // prevent any optimization
            asm volatile("" : "+rm"(src2p));
            unsigned long result = cpumask_next_and(n, src1p, src2p);
            asm volatile("" : "+rm"(result));
          }
        }
      ```
      
      Results:
      pattern1    pattern2     time_before/time_after
      0x0000ffff  0x0000ffff   1.65
      0x0000ffff  0x00005555   2.24
      0x0000ffff  0x00001111   2.94
      0x0000ffff  0x00000000   14.0
      0x00005555  0x0000ffff   1.67
      0x00005555  0x00005555   1.71
      0x00005555  0x00001111   1.90
      0x00005555  0x00000000   6.58
      0x00001111  0x0000ffff   1.46
      0x00001111  0x00005555   1.49
      0x00001111  0x00001111   1.45
      0x00001111  0x00000000   3.10
      0x00000000  0x0000ffff   1.18
      0x00000000  0x00005555   1.18
      0x00000000  0x00001111   1.17
      0x00000000  0x00000000   1.25
      -----------------------------
                     geo.mean  2.06
      
      [2] test_find_next_bit, X86 (skylake)
      
       [ 3913.477422] Start testing find_bit() with random-filled bitmap
       [ 3913.477847] find_next_bit: 160868 cycles, 16484 iterations
       [ 3913.477933] find_next_zero_bit: 169542 cycles, 16285 iterations
       [ 3913.478036] find_last_bit: 201638 cycles, 16483 iterations
       [ 3913.480214] find_first_bit: 4353244 cycles, 16484 iterations
       [ 3913.480216] Start testing find_next_and_bit() with random-filled bitmap
       [ 3913.481074] find_next_and_bit: 89604 cycles, 8216 iterations
       [ 3913.481075] Start testing find_bit() with sparse bitmap
       [ 3913.481078] find_next_bit: 2536 cycles, 66 iterations
       [ 3913.481252] find_next_zero_bit: 344404 cycles, 32703 iterations
       [ 3913.481255] find_last_bit: 2006 cycles, 66 iterations
       [ 3913.481265] find_first_bit: 17488 cycles, 66 iterations
       [ 3913.481266] Start testing find_next_and_bit() with sparse bitmap
       [ 3913.481272] find_next_and_bit: 764 cycles, 1 iterations
      
      [3] test_find_next_bit, arm (v7 odroid XU3).
      
      [  267.206928] Start testing find_bit() with random-filled bitmap
      [  267.214752] find_next_bit: 4474 cycles, 16419 iterations
      [  267.221850] find_next_zero_bit: 5976 cycles, 16350 iterations
      [  267.229294] find_last_bit: 4209 cycles, 16419 iterations
      [  267.279131] find_first_bit: 1032991 cycles, 16420 iterations
      [  267.286265] Start testing find_next_and_bit() with random-filled bitmap
      [  267.302386] find_next_and_bit: 2290 cycles, 8140 iterations
      [  267.309422] Start testing find_bit() with sparse bitmap
      [  267.316054] find_next_bit: 191 cycles, 66 iterations
      [  267.322726] find_next_zero_bit: 8758 cycles, 32703 iterations
      [  267.329803] find_last_bit: 84 cycles, 66 iterations
      [  267.336169] find_first_bit: 4118 cycles, 66 iterations
      [  267.342627] Start testing find_next_and_bit() with sparse bitmap
      [  267.356919] find_next_and_bit: 91 cycles, 1 iterations
      
      [courbet@google.com: v6]
        Link: http://lkml.kernel.org/r/20171129095715.23430-1-courbet@google.com
      [geert@linux-m68k.org: m68k/bitops: always include <asm-generic/bitops/find.h>]
        Link: http://lkml.kernel.org/r/1512556816-28627-1-git-send-email-geert@linux-m68k.org
      Link: http://lkml.kernel.org/r/20171128131334.23491-1-courbet@google.com
      Signed-off-by: Clement Courbet <courbet@google.com>
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Yury Norov <ynorov@caviumnetworks.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 06 Feb 2018, 1 commit
    • locking/qrwlock: include asm/byteorder.h as needed · ca66e797
      Committed by Arnd Bergmann
      Moving the qrwlock struct definition into a header file introduced
      a subtle bug on all little-endian machines, where some files in some
      configurations would see the fields in an incorrect order.  This was
      found by building with an LTO enabled compiler that warns every time we
      try to link together files with incompatible data structures.
      
      A second patch changes linux/kconfig.h to always define the symbols,
      but this seems to be the root cause of most of the issues, so I'd suggest
      we do both.
      
      On a current linux-next kernel, I verified that this header is
      responsible for all type mismatches resulting from the endianness
      confusion.
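      
      A sketch of the failure mode, with assumed field names: the layout is
      selected by endianness macros, so a translation unit compiled without
      asm/byteorder.h sees neither macro defined and silently takes the
      wrong branch.
      
      ```
      /* If __LITTLE_ENDIAN is undefined because the header defining it was
       * never included, every file falls into the #else branch, and two
       * object files can disagree on the field order of the same struct. */
      struct qrwlock_like {
              union {
                      unsigned int cnts;
                      struct {
      #ifdef __LITTLE_ENDIAN
                              unsigned char wlocked;  /* write-lock byte first */
                              unsigned char pad[3];
      #else
                              unsigned char pad[3];
                              unsigned char wlocked;  /* write-lock byte last */
      #endif
                      };
              };
      };
      ```
      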
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Babu Moger <babu.moger@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nicolas Pitre <nico@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Fixes: e0d02285 ("locking/qrwlock: Use 'struct qrwlock' instead of 'struct __qrwlock'")
      Link: http://lkml.kernel.org/r/20180202154104.1522809-1-arnd@arndb.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  18. 01 Feb 2018, 3 commits
    • mm/thp: remove pmd_huge_split_prepare() · 423ac9af
      Committed by Aneesh Kumar K.V
      Instead of marking the pmd ready for split, invalidate the pmd.  This
      should take care of the powerpc requirement.  The only side effect is
      that we mark the pmd invalid early.  This can result in us blocking
      access to the page a bit longer if we race against a thp split.
      
      [kirill.shutemov@linux.intel.com: rebased, dirty THP once]
      Link: http://lkml.kernel.org/r/20171213105756.69879-13-kirill.shutemov@linux.intel.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitin Gupta <nitin.m.gupta@oracle.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: do not lose dirty and accessed bits in pmdp_invalidate() · d52605d7
      Committed by Kirill A. Shutemov
      Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
      dirty and accessed bits if the CPU sets them after the pmdp
      dereference, but before set_pmd_at().
      
      The patch changes pmdp_invalidate() to make the entry non-present
      atomically and to return the previous value of the entry.  This value
      can be used to check if the CPU set dirty/accessed bits under us.
      
      The race window is very small and I haven't seen any reports that can
      be attributed to the bug.  For this reason, I don't think backporting
      to stable trees is needed.
      
      Link: http://lkml.kernel.org/r/20171213105756.69879-11-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitin Gupta <nitin.m.gupta@oracle.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • asm-generic: provide generic_pmdp_establish() · c58f0bb7
      Committed by Kirill A. Shutemov
      Patch series "Do not lose dirty bit on THP pages", v4.
      
      Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
      dirty and accessed bits if the CPU sets them after the pmdp
      dereference, but before set_pmd_at().
      
      The bug can lead to data loss, but the race window is tiny and I
      haven't seen any reports suggesting that it happens in reality.  So I
      don't think it's worth sending to stable.
      
      Unfortunately, there's no way to address the issue in a generic way.  We
      need to fix all architectures that support THP one-by-one.
      
      All architectures that support THP have to provide an atomic
      pmdp_invalidate() that returns the previous value.
      
      If the generic implementation of pmdp_invalidate() is used, the
      architecture needs to provide an atomic pmdp_establish().
      
      pmdp_establish() is not used outside the generic implementation of
      pmdp_invalidate() so far, but I think this can change in the future.
      
      This patch (of 12):
      
      This is an implementation of pmdp_establish() that is only suitable
      for an architecture that doesn't have hardware dirty/accessed bits.
      In this case we can't race with a CPU setting these bits, and the
      non-atomic approach is fine.
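      
      The helper described above amounts to a plain, non-atomic exchange; a
      kernel-style sketch:
      
      ```
      /* Only correct when hardware never sets dirty/accessed bits on its
       * own: read the old entry, install the new one, return the old value. */
      static inline pmd_t generic_pmdp_establish(struct vm_area_struct *vma,
                      unsigned long address, pmd_t *pmdp, pmd_t pmd)
      {
              pmd_t old_pmd = *pmdp;
      
              set_pmd_at(vma->vm_mm, address, pmdp, pmd);
              return old_pmd;
      }
      ```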
      
      Link: http://lkml.kernel.org/r/20171213105756.69879-2-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Nitin Gupta <nitin.m.gupta@oracle.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 31 Jan 2018, 1 commit
  20. 29 Jan 2018, 1 commit
  21. 15 Jan 2018, 2 commits
  22. 13 Jan 2018, 2 commits
    • error-injection: Add injectable error types · 663faf9f
      Committed by Masami Hiramatsu
      Add injectable error types for each error-injectable function.
      
      One motivation of error injection testing is to find software flaws,
      mistakes or mis-handlings of expectable errors.  If we find such
      flaws by the test, that is a program bug, so we need to fix it.
      
      But if the tester injects the wrong kind of error (e.g. just returns a
      success code without processing anything), it causes unexpected
      behavior even if the caller is correctly programmed to handle any
      errors.  That is not what we want to test by error injection.
      
      To clarify what type of errors the caller must expect for each
      injectable function, this introduces injectable error types:
      
       - EI_ETYPE_NULL : means the function will return NULL if it fails.
         No ERR_PTR, just a NULL.
       - EI_ETYPE_ERRNO : means the function will return -ERRNO if it fails.
       - EI_ETYPE_ERRNO_NULL : means the function will return -ERRNO
         (ERR_PTR) or NULL.
      
      The ALLOW_ERROR_INJECTION() macro is extended to take one of
      NULL, ERRNO, or ERRNO_NULL to record the error type for
      each function, e.g.
      
       ALLOW_ERROR_INJECTION(open_ctree, ERRNO)
      
      These error types are shown in debugfs as below.
      
        ====
        / # cat /sys/kernel/debug/error_injection/list
        open_ctree [btrfs]	ERRNO
        io_ctl_init [btrfs]	ERRNO
        ====
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • error-injection: Separate error-injection from kprobe · 540adea3
      Committed by Masami Hiramatsu
      The error-injection framework is not limited to use by kprobes or
      bpf; other kernel subsystems can use it freely for checking the
      safety of error injection, e.g. livepatch, ftrace etc.
      So separate the error-injection framework from kprobes.
      
      Some differences have been made:
      
      - The word "kprobe" is removed from all APIs/structures.
      - BPF_ALLOW_ERROR_INJECTION() is renamed to
        ALLOW_ERROR_INJECTION() since it is no longer limited to BPF.
      - CONFIG_FUNCTION_ERROR_INJECTION is the config item of this
        feature.  It is automatically enabled if the arch supports
        the error injection feature for kprobes, ftrace etc.
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  23. 10 Jan 2018, 1 commit
  24. 09 Jan 2018, 1 commit
    • sections: split dereference_function_descriptor() · b865ea64
      Committed by Sergey Senozhatsky
      There are two format specifiers to print out a pointer in symbolic
      format: '%pS/%ps' and '%pF/%pf'. On most architectures, the two
      mean exactly the same thing, but some architectures (ia64, ppc64,
      parisc64) use an indirect pointer for C function pointers, where
      the function pointer points to a function descriptor (which in
      turn contains the actual pointer to the code). The '%pF/%pf', when
      used appropriately, automatically does the appropriate function
      descriptor dereference on such architectures.
      
      The "when used appropriately" part is tricky. Basically this is
      a subtle ABI detail, specific to some platforms, that made it to
      the API level, so people can be unaware of it and miss the whole
      "we need to dereference the function" business. [1] proves
      that point (note that it fixes only '%pF' and '%pS'; there might
      be '%pf' and '%ps' cases as well).
      
      It appears that we can handle everything within the affected
      arches and make '%pS/%ps' smart enough to retire '%pF/%pf'.
      Function descriptors live in the .opd ELF section, and all affected
      arches (ia64, ppc64, parisc64) handle it properly for kernel
      and modules. So we, technically, can decide if the dereference
      is needed by simply looking at the pointer: if it belongs to
      .opd section then we need to dereference it.
      
      The kernel and modules have their own .opd sections, obviously;
      that's why we need to split dereference_function_descriptor()
      and use separate kernel and module dereference arch callbacks.
      
      This patch does the first step, it
      a) adds dereference_kernel_function_descriptor() function.
      b) adds a weak alias to dereference_module_function_descriptor()
         function.
      
      So, for the time being, we will have:
      1) dereference_function_descriptor()
         A generic function that simply dereferences the pointer.  There is
         a bunch of places that call it: kgdbts, init/main.c, extable, etc.
      
      2) dereference_kernel_function_descriptor()
         A function to call on kernel symbols that does kernel .opd section
         address range test.
      
      3) dereference_module_function_descriptor()
         A function to call on modules' symbols that does modules' .opd
         section address range test.
      
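      A hedged sketch of the split (simplified signatures; the exact kernel
      prototypes may differ): the module variant is a weak default that
      ia64/ppc64/parisc64 override with an .opd range check plus
      dereference.
      
      ```
      struct module;
      
      /* Generic case: the pointer already is the code address. */
      void *dereference_function_descriptor(void *ptr)
      {
              return ptr;
      }
      
      /* Weak default for module symbols; descriptor-using architectures
       * override it to range-check ptr against the module's .opd section
       * and dereference when it falls inside. */
      void * __attribute__((weak))
      dereference_module_function_descriptor(struct module *mod, void *ptr)
      {
              (void)mod;
              return ptr;
      }
      ```
      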
      [1] https://marc.info/?l=linux-kernel&m=150472969730573
      
      Link: http://lkml.kernel.org/r/20171109234830.5067-2-sergey.senozhatsky@gmail.com
      To: Fenghua Yu <fenghua.yu@intel.com>
      To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      To: Paul Mackerras <paulus@samba.org>
      To: Michael Ellerman <mpe@ellerman.id.au>
      To: James Bottomley <jejb@parisc-linux.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Tested-by: Tony Luck <tony.luck@intel.com> #ia64
      Tested-by: Santosh Sivaraj <santosh@fossix.org> #powerpc
      Tested-by: Helge Deller <deller@gmx.de> #parisc64
      Signed-off-by: Petr Mladek <pmladek@suse.com>
  25. 04 Jan 2018, 1 commit
  26. 23 Dec 2017, 2 commits
    • init: Invoke init_espfix_bsp() from mm_init() · 613e396b
      Committed by Thomas Gleixner
      init_espfix_bsp() needs to be invoked before the page table isolation
      initialization. Move it into mm_init(), which is the place where
      pti_init() will be added.
      
      While at it, get rid of the #ifdeffery and provide proper stub functions.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • arch, mm: Allow arch_dup_mmap() to fail · c10e83f5
      Committed by Thomas Gleixner
      In order to sanitize the LDT initialization on x86, arch_dup_mmap()
      must be allowed to fail.  Fix up all instances.
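      
      A sketch of the interface change, assuming the generic (asm-generic
      mm hooks) stub: the hook now returns int so a failure can propagate
      out of dup_mmap().
      
      ```
      struct mm_struct;
      
      static inline int arch_dup_mmap(struct mm_struct *oldmm,
                                      struct mm_struct *mm)
      {
              return 0;       /* generic stub: nothing to copy, cannot fail */
      }
      ```
      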
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: dan.j.williams@intel.com
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: kirill.shutemov@linux.intel.com
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  27. 13 Dec 2017, 1 commit