1. 17 Aug 2017 (1 commit)
    • membarrier: Provide expedited private command · 22e4ebb9
      Authored by Mathieu Desnoyers
      Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs, using a cpumask
      built from all runqueues for which the current thread's mm is the same
      as that of the thread calling sys_membarrier.  It executes faster than
      the non-expedited variant (no blocking).  It also works on NOHZ_FULL
      configurations.
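
      As a rough usage sketch (hedged: this assumes a kernel and uapi headers
      carrying this patch; there is no glibc wrapper, so the raw syscall is
      used, and availability is probed via CMD_QUERY first):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <stdio.h>

        static int membarrier(int cmd, int flags)
        {
                return syscall(__NR_membarrier, cmd, flags);
        }

        int main(void)
        {
                int mask = membarrier(MEMBARRIER_CMD_QUERY, 0);

                if (mask < 0 || !(mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED)) {
                        fprintf(stderr, "private expedited not supported\n");
                        return 1;
                }
                /* Acts as a full barrier on every CPU currently running
                 * a thread that shares this process's mm. */
                membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
                return 0;
        }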
      
      Scheduler-wise, it requires a memory barrier before and after context
      switching between processes (which have different mm). The memory
      barrier before context switch is already present. For the barrier after
      context switch:
      
      * Our TSO archs can do RELEASE without being a full barrier. Look at
        x86 spin_unlock() being a regular STORE for example.  But for those
        archs, all atomics imply smp_mb and all of them have atomic ops in
        switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
        barrier.
      
      * Of all the weakly ordered machines, only ARM64 and PPC can do
        RELEASE; the rest do smp_mb(), so there the spin_unlock() is a full
        barrier and we're good.
      
      * ARM64 has a very heavy barrier in switch_to(), which suffices.
      
      * PPC just removed its barrier from switch_to(), but appears to be
        talking about adding something to switch_mm(). So add a
        smp_mb__after_unlock_lock() for now, until this is settled on the PPC
        side.
      
      Changes since v3:
      - Properly document the memory barriers provided by each architecture.
      
      Changes since v2:
      - Address comments from Peter Zijlstra,
      - Add smp_mb__after_unlock_lock() after finish_lock_switch() in
        finish_task_switch() to add the memory barrier we need after storing
        to rq->curr. This is much simpler than the previous approach relying
        on atomic_dec_and_test() in mmdrop(), which actually added a memory
        barrier in the common case of switching between userspace processes.
      - Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full
        kernel, rather than having the whole membarrier system call returning
        -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full.
        Adapt the CMD_QUERY mask accordingly.
      
      Changes since v1:
      - move membarrier code under kernel/sched/ because it uses the
        scheduler runqueue,
      - only add the barrier when we switch from a kernel thread. The case
        where we switch from a user-space thread is already handled by
        the atomic_dec_and_test() in mmdrop().
      - add a comment to mmdrop() documenting the requirement on the implicit
        memory barrier.
      
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      CC: Boqun Feng <boqun.feng@gmail.com>
      CC: Andrew Hunter <ahh@google.com>
      CC: Maged Michael <maged.michael@gmail.com>
      CC: gromer@google.com
      CC: Avi Kivity <avi@scylladb.com>
      CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: Paul Mackerras <paulus@samba.org>
      CC: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Dave Watson <davejwatson@fb.com>
      22e4ebb9
  2. 13 Jul 2017 (1 commit)
  3. 09 May 2017 (1 commit)
    • crash: move crashkernel parsing and vmcore related code under CONFIG_CRASH_CORE · 692f66f2
      Authored by Hari Bathini
      Patch series "kexec/fadump: remove dependency with CONFIG_KEXEC and
      reuse crashkernel parameter for fadump", v4.
      
      Traditionally, kdump is used to save vmcore in case of a crash.  Some
      architectures like powerpc can save vmcore using architecture specific
      support instead of kexec/kdump mechanism.  Such architecture specific
      support also needs to reserve memory, to be used by dump capture kernel.
      The crashkernel parameter can be reused, for memory reservation, by
      such architecture-specific infrastructure.
      
      This patchset removes the dependency on CONFIG_KEXEC for the
      crashkernel parameter and vmcoreinfo-related code, as they can be
      reused without kexec support.  Also, the crashkernel parameter is
      reused instead of fadump_reserve_mem to reserve memory for fadump.
      
      The first patch moves crashkernel parameter parsing and vmcoreinfo
      related code under CONFIG_CRASH_CORE instead of CONFIG_KEXEC_CORE.  The
      second patch reuses the definitions of append_elf_note() & final_note()
      functions under CONFIG_CRASH_CORE in IA64 arch code.  The third patch
      removes dependency on CONFIG_KEXEC for firmware-assisted dump (fadump)
      in powerpc.  The next patch reuses crashkernel parameter for reserving
      memory for fadump, instead of the fadump_reserve_mem parameter.  This
      has the advantage of making all the syntaxes the crashkernel parameter
      supports available for fadump as well.  The last patch updates the
      fadump kernel documentation about the use of the crashkernel parameter.
      
      This patch (of 5):
      
      Traditionally, kdump is used to save vmcore in case of a crash.  Some
      architectures like powerpc can save vmcore using architecture specific
      support instead of kexec/kdump mechanism.  Such architecture specific
      support also needs to reserve memory, to be used by dump capture kernel.
      The crashkernel parameter can be reused, for memory reservation, by
      such architecture-specific infrastructure.
      
      But currently, code related to vmcoreinfo and parsing of crashkernel
      parameter is built under CONFIG_KEXEC_CORE.  This patch introduces
      CONFIG_CRASH_CORE and moves the above mentioned code under this config,
      allowing code reuse without dependency on CONFIG_KEXEC.  There is no
      functional change with this patch.
      
      Link: http://lkml.kernel.org/r/149035338104.6881.4550894432615189948.stgit@hbathini.in.ibm.com
      Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Acked-by: Dave Young <dyoung@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      692f66f2
  4. 28 Dec 2016 (1 commit)
  5. 15 Dec 2016 (1 commit)
  6. 14 Dec 2016 (1 commit)
  7. 09 Aug 2016 (1 commit)
  8. 24 May 2016 (1 commit)
    • ELF/MIPS build fix · f43edca7
      Authored by Ralf Baechle
      CONFIG_MIPS32_N32=y but CONFIG_BINFMT_ELF disabled results in the
      following linker errors:
      
        arch/mips/built-in.o: In function `elf_core_dump':
        binfmt_elfn32.c:(.text+0x23dbc): undefined reference to `elf_core_extra_phdrs'
        binfmt_elfn32.c:(.text+0x246e4): undefined reference to `elf_core_extra_data_size'
        binfmt_elfn32.c:(.text+0x248d0): undefined reference to `elf_core_write_extra_phdrs'
        binfmt_elfn32.c:(.text+0x24ac4): undefined reference to `elf_core_write_extra_data'
      
      CONFIG_MIPS32_O32=y but CONFIG_BINFMT_ELF disabled results in the following
      linker errors:
      
        arch/mips/built-in.o: In function `elf_core_dump':
        binfmt_elfo32.c:(.text+0x28a04): undefined reference to `elf_core_extra_phdrs'
        binfmt_elfo32.c:(.text+0x29330): undefined reference to `elf_core_extra_data_size'
        binfmt_elfo32.c:(.text+0x2951c): undefined reference to `elf_core_write_extra_phdrs'
        binfmt_elfo32.c:(.text+0x29710): undefined reference to `elf_core_write_extra_data'
      
      This is because binfmt_elfn32 and binfmt_elfo32 are using symbols from
      elfcore but for these configurations elfcore will not be built.
      
      Fixed by making elfcore selectable via a separate config symbol which,
      unlike the current mechanism, can also be used from directories other
      than kernel/, then having each flavor of ELF that relies on elfcore.o
      select it in Kconfig, including CONFIG_MIPS32_N32 and CONFIG_MIPS32_O32,
      which fixes this issue.
      
      Link: http://lkml.kernel.org/r/20160520141705.GA1913@linux-mips.org
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Reviewed-by: James Hogan <james.hogan@imgtec.com>
      Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f43edca7
  9. 23 Mar 2016 (1 commit)
    • kernel: add kcov code coverage · 5c9a8750
      Authored by Dmitry Vyukov
      kcov provides code coverage collection for coverage-guided fuzzing
      (randomized testing).  Coverage-guided fuzzing is a testing technique
      that uses coverage feedback to determine new interesting inputs to a
      system.  A notable user-space example is AFL
      (http://lcamtuf.coredump.cx/afl/).  However, this technique is not
      widely used for kernel testing due to missing compiler and kernel
      support.
      
      kcov does not aim to collect as much coverage as possible.  It aims to
      collect more or less stable coverage that is a function of syscall
      inputs.  To achieve this goal it does not collect coverage in soft/hard
      interrupts, and instrumentation of some inherently non-deterministic or
      uninteresting parts of the kernel is disabled (e.g.  scheduler,
      locking).
      
      Currently there is a single coverage collection mode (tracing), but the
      API anticipates additional collection modes.  Initially I also
      implemented a second mode which exposes coverage in a fixed-size hash
      table of counters (what Quentin used in his original patch).  I've
      dropped the second mode for simplicity.
      
      This patch adds the necessary support on the kernel side.  The
      complementary compiler support was added in gcc revision 231296.
      
      We've used this support to build syzkaller system call fuzzer, which has
      found 90 kernel bugs in just 2 months:
      
        https://github.com/google/syzkaller/wiki/Found-Bugs
      
      We've also found 30+ bugs in our internal systems with syzkaller.
      Another (yet unexplored) direction where kcov coverage would greatly
      help is more traditional "blob mutation".  For example, mounting a
      random blob as a filesystem, or receiving a random blob over wire.
      
      Why not gcov?  A typical fuzzing loop looks as follows: (1) reset
      coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
      typical coverage set can be just a dozen basic blocks (e.g.  an invalid
      input).  In such a context gcov becomes prohibitively expensive, as the
      reset/collect coverage steps depend on the total number of basic
      blocks/edges in the program (in the case of the kernel it is about 2M).
      The cost of kcov depends only on the number of executed basic
      blocks/edges.  On top of that, the kernel requires per-thread coverage
      because there are always background threads and unrelated processes
      that also produce coverage.  With inlined gcov instrumentation
      per-thread coverage is not possible.
      
      kcov exposes kernel PCs and control flow to user space, which is
      insecure.  But debugfs should not be mapped as user-accessible.
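
      For reference, a userspace sketch of the tracing mode this patch adds
      (a minimal sketch following the kcov interface; treat the ioctl
      encodings and buffer size below as illustrative of the uapi header,
      not authoritative):

        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <sys/mman.h>

        #define KCOV_INIT_TRACE  _IOR('c', 1, unsigned long)
        #define KCOV_ENABLE      _IO('c', 100)
        #define KCOV_DISABLE     _IO('c', 101)
        #define COVER_SIZE       (64 << 10)   /* entries, not bytes */

        int main(void)
        {
                unsigned long *cover, n, i;
                int fd = open("/sys/kernel/debug/kcov", O_RDWR);

                if (fd == -1)
                        return 1;
                /* Set trace mode and buffer size, then share the buffer. */
                if (ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE))
                        return 1;
                cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
                             PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
                if (cover == MAP_FAILED)
                        return 1;
                /* Enable coverage collection for the current thread. */
                if (ioctl(fd, KCOV_ENABLE, 0))
                        return 1;
                cover[0] = 0;               /* reset the PC counter */
                read(-1, NULL, 0);          /* the code being tested */
                n = cover[0];               /* number of PCs collected */
                for (i = 0; i < n; i++)
                        printf("0x%lx\n", cover[i + 1]);
                ioctl(fd, KCOV_DISABLE, 0);
                return 0;
        }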
      
      Based on a patch by Quentin Casasnovas.
      
      [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
      [akpm@linux-foundation.org: unbreak allmodconfig]
      [akpm@linux-foundation.org: follow x86 Makefile layout standards]
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tavis Ormandy <taviso@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Drysdale <drysdale@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c9a8750
  10. 31 Jan 2016 (1 commit)
  11. 12 Sep 2015 (1 commit)
    • sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
      Authored by Mathieu Desnoyers
      Here is an implementation of a new system call, sys_membarrier(), which
      executes a memory barrier on all threads running on the system.  It is
      implemented by calling synchronize_sched().  It can be used to
      distribute the cost of user-space memory barriers asymmetrically by
      transforming pairs of memory barriers into pairs consisting of
      sys_membarrier() and a compiler barrier.  For synchronization primitives
      that distinguish between read-side and write-side (e.g.  userspace RCU
      [1], rwlocks), the read-side can be accelerated significantly by moving
      the bulk of the memory barrier overhead to the write-side.
      
      The existing applications of which I am aware that would be improved by
      this system call are as follows:
      
      * Through Userspace RCU library (http://urcu.so)
        - DNS server (Knot DNS) https://www.knot-dns.cz/
        - Network sniffer (http://netsniff-ng.org/)
        - Distributed object storage (https://sheepdog.github.io/sheepdog/)
        - User-space tracing (http://lttng.org)
        - Network storage system (https://www.gluster.org/)
        - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
        - Financial software (https://lkml.org/lkml/2015/3/23/189)
      
      Those projects use RCU in userspace to increase read-side speed and
      scalability compared to locking.  Especially in the case of RCU used by
      libraries, sys_membarrier can speed up the read-side by moving the bulk of
      the memory barrier cost to synchronize_rcu().
      
      * Direct users of sys_membarrier
        - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
      
      Microsoft core dotnet GC developers are planning to use the mprotect()
      side-effect of issuing memory barriers through IPIs as a way to implement
      Windows FlushProcessWriteBuffers() on Linux.  They are referring to
      sys_membarrier in their github thread, specifically stating that
      sys_membarrier() is what they are looking for.
      
      To explain the benefit of this scheme, let's introduce two example threads:
      
      Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
      Thread B (frequent, e.g. executing liburcu
      rcu_read_lock()/rcu_read_unlock())
      
      In a scheme where all smp_mb() in thread A are ordering memory accesses
      with respect to smp_mb() present in Thread B, we can change each
      smp_mb() within Thread A into calls to sys_membarrier() and each
      smp_mb() within Thread B into compiler barriers "barrier()".
      
      Before the change, we had, for each smp_mb() pair:
      
      Thread A                    Thread B
      previous mem accesses       previous mem accesses
      smp_mb()                    smp_mb()
      following mem accesses      following mem accesses
      
      After the change, these pairs become:
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      As we can see, there are two possible scenarios: either Thread B memory
      accesses do not happen concurrently with Thread A accesses (1), or they
      do (2).
      
      1) Non-concurrent Thread A vs Thread B accesses:
      
      Thread A                    Thread B
      prev mem accesses
      sys_membarrier()
      follow mem accesses
                                  prev mem accesses
                                  barrier()
                                  follow mem accesses
      
      In this case, thread B accesses will be weakly ordered. This is OK,
      because at that point, thread A is not particularly interested in
      ordering them with respect to its own accesses.
      
      2) Concurrent Thread A vs Thread B accesses
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      In this case, thread B accesses, which are ensured to be in program
      order thanks to the compiler barrier, will be "upgraded" to full
      smp_mb() by synchronize_sched().
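
      In code, the transformation above looks roughly like this (an
      illustrative sketch only; data/flag stand in for real shared state,
      and membarrier() is a raw-syscall wrapper since there is no glibc one):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #define barrier() __asm__ __volatile__("" ::: "memory")

        static int membarrier(int cmd, int flags)
        {
                return syscall(__NR_membarrier, cmd, flags);
        }

        int data, flag;                 /* stand-ins for real shared state */

        void thread_a_update(void)      /* non-frequent path */
        {
                data = 42;                              /* prev accesses */
                membarrier(MEMBARRIER_CMD_SHARED, 0);   /* was smp_mb() */
                flag = 1;                               /* following accesses */
        }

        int thread_b_read(void)         /* frequent path */
        {
                int f = flag;           /* prev accesses */
                barrier();              /* was smp_mb() */
                return f ? data : -1;   /* following accesses */
        }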
      
      * Benchmarks
      
      On Intel Xeon E5405 (8 cores)
      (one thread is calling sys_membarrier, the other 7 threads are busy
      looping)
      
      1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
      
      * User-space user of this system call: Userspace RCU library
      
      Both the signal-based and the sys_membarrier userspace RCU schemes
      permit us to remove the memory barrier from the userspace RCU
      rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
      accelerating them. These memory barriers are replaced by compiler
      barriers on the read-side, and all matching memory barriers on the
      write-side are turned into an invocation of a memory barrier on all
      active threads in the process. By letting the kernel perform this
      synchronization rather than dumbly sending a signal to every thread of
      the process (as we currently do), we diminish the number of unnecessary
      wake-ups and only issue the memory barriers on active threads.
      Non-running threads do not need to execute such a barrier anyway,
      because it is implied by the scheduler's context switches.
      
      Results in liburcu:
      
      Operations in 10s, 6 readers, 2 writers:
      
      memory barriers in reader:    1701557485 reads, 2202847 writes
      signal-based scheme:          9830061167 reads,    6700 writes
      sys_membarrier:               9952759104 reads,     425 writes
      sys_membarrier (dyn. check):  7970328887 reads,     425 writes
      
      The dynamic sys_membarrier availability check adds some overhead to
      the read-side compared to the signal-based scheme, but besides that,
      sys_membarrier slightly outperforms the signal-based scheme. However,
      this non-expedited sys_membarrier implementation has a much slower grace
      period than signal and memory barrier schemes.
      
      Besides diminishing the number of wake-ups, one major advantage of the
      membarrier system call over the signal-based scheme is that it does not
      need to reserve a signal.  This plays much more nicely with libraries,
      and with processes that are injected into for tracing purposes, for
      which we cannot expect that signals will be left unused by the
      application.
      
      An expedited version of this system call can be added later on to speed
      up the grace period. Its implementation will likely depend on reading
      the cpu_curr()->mm without holding each CPU's rq lock.
      
      This patch adds the system call to x86 and to asm-generic.
      
      [1] http://urcu.so
      
      membarrier(2) man page:
      
      MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
      
      NAME
             membarrier - issue memory barriers on a set of threads
      
      SYNOPSIS
             #include <linux/membarrier.h>
      
             int membarrier(int cmd, int flags);
      
      DESCRIPTION
             The cmd argument is one of the following:
      
             MEMBARRIER_CMD_QUERY
                    Query  the  set  of  supported commands. It returns a bitmask of
                    supported commands.
      
             MEMBARRIER_CMD_SHARED
                    Execute a memory barrier on all threads running on  the  system.
                    Upon  return from system call, the caller thread is ensured that
                    all running threads have passed through a state where all memory
                    accesses  to  user-space  addresses  match program order between
                    entry to and return from the system  call  (non-running  threads
                     are de facto in such a state).  This covers threads from
                     all processes running on the system.  This command
                     returns 0.
      
              The flags argument must be 0; it is reserved for future
              extensions.
      
              All memory accesses performed in program order from each targeted
              thread are guaranteed to be ordered with respect to sys_membarrier(). If
             we use the semantic "barrier()" to represent a compiler barrier forcing
             memory  accesses  to  be performed in program order across the barrier,
             and smp_mb() to represent explicit memory barriers forcing full  memory
             ordering  across  the barrier, we have the following ordering table for
             each pair of barrier(), sys_membarrier() and smp_mb():
      
             The pair ordering is detailed as (O: ordered, X: not ordered):
      
                                    barrier()   smp_mb() sys_membarrier()
                    barrier()          X           X            O
                    smp_mb()           X           O            O
                    sys_membarrier()   O           O            O
      
      RETURN VALUE
             On success, these system calls return zero.  On error, -1 is  returned,
             and errno is set appropriately. For a given command, with flags
             argument set to 0, this system call is guaranteed to always return the
             same value until reboot.
      
      ERRORS
             ENOSYS System call is not implemented.
      
             EINVAL Invalid arguments.
      
      Linux                             2015-04-15                     MEMBARRIER(2)
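
      As a sketch of the runtime availability check the man page implies
      (hedged: the caching variable and fallback comment are ours, and error
      handling is abbreviated):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <errno.h>

        static int has_membarrier_shared;   /* hypothetical cache */

        void membarrier_init(void)
        {
                int mask = syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0);

                if (mask >= 0 && (mask & MEMBARRIER_CMD_SHARED))
                        has_membarrier_shared = 1;
                /* mask < 0 with errno == ENOSYS means a kernel without
                 * membarrier; fall back to the signal-based scheme. */
        }
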
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Nicholas Miell <nmiell@comcast.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b25b13a
  12. 11 Sep 2015 (2 commits)
    • kexec: split kexec_load syscall from kexec core code · 2965faa5
      Authored by Dave Young
      There are two kexec load syscalls: kexec_load and kexec_file_load.
      kexec_file_load has been split out as kernel/kexec_file.c.  In this
      patch I split the kexec_load syscall code out to kernel/kexec.c.
      
      And add a new kconfig option KEXEC_CORE, so we can disable kexec_load
      and use kexec_file_load only, or vice versa.
      
      The original requirement is from Ted Ts'o: he wants the kexec kernel
      signature to be checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But
      kexec-tools, using the kexec_load syscall, can bypass that checking.
      
      Vivek Goyal proposed to create a common kconfig option so user can compile
      in only one syscall for loading kexec kernel.  KEXEC/KEXEC_FILE selects
      KEXEC_CORE so that old config files still work.
      
      Because there is generic code that needs CONFIG_KEXEC_CORE, I updated
      all the architecture Kconfigs with the new option KEXEC_CORE, and let
      KEXEC select KEXEC_CORE in the arch Kconfig.  Generic kernel code was
      also updated accordingly for the split-out kexec_load syscall.
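
      For context, the file-based load path looks like this from userspace
      (a hedged sketch per the kexec_file_load(2) interface; there is no
      glibc wrapper, and the file paths would be caller-supplied):

        #include <sys/syscall.h>
        #include <fcntl.h>
        #include <string.h>
        #include <unistd.h>

        /* Load a kernel via kexec_file_load, the path that can enforce
         * in-kernel signature verification. */
        int load_kernel(const char *kernel, const char *initrd,
                        const char *cmdline)
        {
                int kfd = open(kernel, O_RDONLY);
                int ifd = open(initrd, O_RDONLY);

                if (kfd < 0 || ifd < 0)
                        return -1;
                /* cmdline length includes the trailing NUL */
                return syscall(__NR_kexec_file_load, kfd, ifd,
                               strlen(cmdline) + 1, cmdline, 0UL);
        }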
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Dave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2965faa5
    • kexec: split kexec_file syscall code to kexec_file.c · a43cac0d
      Authored by Dave Young
      Split the kexec_file syscall related code out to a separate file,
      kernel/kexec_file.c, so that the #ifdef CONFIG_KEXEC_FILE in kexec.c
      can be dropped.
      
      Shared variables and functions are moved to kernel/kexec_internal.h,
      per suggestions from Vivek and Petr.
      
      [akpm@linux-foundation.org: fix bisectability]
      [akpm@linux-foundation.org: declare the various arch_kexec functions]
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Dave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a43cac0d
  13. 15 Aug 2015 (1 commit)
    • arch: introduce memremap() · 92281dee
      Authored by Dan Williams
      Existing users of ioremap_cache() are mapping memory that is known in
      advance to not have i/o side effects.  These users are forced to cast
      away the __iomem annotation, or otherwise neglect to fix the sparse
      errors thrown when dereferencing pointers to this memory.  Provide
      memremap() as a non-__iomem-annotated ioremap_*() for the case when
      ioremap would otherwise return a pointer to cacheable memory.
      Empirically, ioremap_<cacheable-type>() call sites are seeking
      memory-like semantics (e.g.  speculative reads and prefetching
      permitted).
      
      memremap() is a break from the ioremap implementation pattern of adding
      a new memremap_<type>() for each mapping type and having silent
      compatibility fall backs.  Instead, the implementation defines flags
      that are passed to the central memremap() and if a mapping type is not
      supported by an arch memremap returns NULL.
      
      We introduce a memremap prototype as a trivial wrapper of
      ioremap_cache() and ioremap_wt().  Later, once all ioremap_cache() and
      ioremap_wt() usage has been removed from drivers we teach archs to
      implement arch_memremap() with the ability to strictly enforce the
      mapping type.
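
      A sketch of the kernel-side call pattern this enables (the physical
      range is a placeholder, and memunmap() is assumed as the companion
      release call):

        #include <linux/io.h>
        #include <linux/string.h>
        #include <linux/errno.h>

        #define EXAMPLE_PHYS 0x80000000UL   /* placeholder resource */
        #define EXAMPLE_LEN  0x1000UL

        static int example_init(void)
        {
                /* Plain pointer, not void __iomem *: no casts, no sparse
                 * warnings, ordinary loads/stores are permitted. */
                void *buf = memremap(EXAMPLE_PHYS, EXAMPLE_LEN, MEMREMAP_WB);

                if (!buf)
                        return -ENOMEM;
                memset(buf, 0, EXAMPLE_LEN);
                memunmap(buf);
                return 0;
        }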
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      92281dee
  14. 14 Aug 2015 (1 commit)
  15. 13 Aug 2015 (1 commit)
  16. 07 Aug 2015 (5 commits)
    • modsign: Add explicit CONFIG_SYSTEM_TRUSTED_KEYS option · 99d27b1b
      Authored by David Woodhouse
      Let the user explicitly provide a file containing trusted keys, instead of
      just automatically finding files matching *.x509 in the build tree and
      trusting whatever we find. This really ought to be an *explicit*
      configuration, and the build rules for dealing with the files were
      fairly painful too.
      
      A fix from James Morris is applied that removes an '=' from a macro
      definition in kernel/Makefile, as that is a feature which only exists
      in GNU make 3.82 onwards.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      99d27b1b
    • modsign: Use single PEM file for autogenerated key · fb117949
      Authored by David Woodhouse
      The current rule for generating signing_key.priv and signing_key.x509 is
      a classic example of a bad rule which has a tendency to break parallel
      make. When invoked to create *either* target, it generates the other
      target as a side-effect that make didn't predict.
      
      So let's switch to using a single file signing_key.pem which contains
      both key and certificate. That matches what we do in the case of an
      external key specified by CONFIG_MODULE_SIG_KEY anyway, so it's also
      slightly cleaner.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      fb117949
    • modsign: Extract signing cert from CONFIG_MODULE_SIG_KEY if needed · 1329e8cc
      Authored by David Woodhouse
      Where an external PEM file or PKCS#11 URI is given, we can get the cert
      from it for ourselves instead of making the user drop signing_key.x509
      in place for us.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      1329e8cc
    • audit: clean simple fsnotify implementation · 7f492942
      Authored by Richard Guy Briggs
      This is to be used for audit-by-executable-path rules; audit watches
      should be able to share this code eventually.
      
      At the moment the audit watch code is a lot more complex.  That code only
      creates one fsnotify watch per parent directory.  That 'audit_parent' in
      turn has a list of 'audit_watches' which contain the name, ino, dev of
      the specific object we care about.  This just creates one fsnotify watch
      per object we care about.  So if you watch 100 inodes in /etc this code
      will create 100 fsnotify watches on /etc.  The audit_watch code will
      instead create 1 fsnotify watch on /etc (the audit_parent) and then 100
      individual watches chained from that fsnotify mark.
      
      We should be able to convert the audit_watch code to do one fsnotify
      mark per watch and simplify things/remove a whole lot of code.  After
      that conversion we should be able to convert the audit_fsnotify code to
      support that hierarchy if the optimization is necessary.
      
      Move the access to the entry for audit_match_signal() to the beginning of
      the audit_del_rule() function in case the entry found is the same one passed
      in.  This will enable it to be used by audit_autoremove_mark_rule(),
      kill_rules() and audit_remove_parent_watches().
      
      This is a heavily modified and merged version of two patches originally
      submitted by Eric Paris.
      
      Cc: Peter Moody <peter@hda3.com>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
      [PM: added a space after a declaration to keep ./scripts/checkpatch happy]
      Signed-off-by: Paul Moore <pmoore@redhat.com>
      7f492942
  17. 15 Jul 2015 (1 commit)
    • cgroup: implement the PIDs subsystem · 49b786ea
      Authored by Aleksa Sarai
      Adds a new single-purpose PIDs subsystem to limit the number of
      tasks that can be forked inside a cgroup. Essentially this is an
      implementation of RLIMIT_NPROC that applies to a cgroup rather than a
      process tree.
      
      However, it should be noted that organisational operations (adding and
      removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
      the number of tasks in the hierarchy cannot exceed the limit through
      forking. This is due to the fact that, in the unified hierarchy, attach
      cannot fail (and it is not possible for a task to overcome its PIDs
      cgroup policy limit by attaching to a child cgroup -- even if migrating
      mid-fork it must be able to fork in the parent first).
      
      PIDs are fundamentally a global resource, and it is possible to reach
      PID exhaustion inside a cgroup without hitting any reasonable kmemcg
      policy. Once you've hit PID exhaustion, you're only in a marginally
      better state than OOM. This subsystem allows PID exhaustion inside a
      cgroup to be prevented.
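
      As an illustration of the resulting interface (a sketch; the mount
      point and group name are placeholders, and pids.max/cgroup.procs are
      the subsystem's control files as this patch defines them):

        #include <stdio.h>
        #include <unistd.h>

        static int write_str(const char *path, const char *val)
        {
                FILE *f = fopen(path, "w");

                if (!f)
                        return -1;
                fprintf(f, "%s", val);
                return fclose(f);
        }

        int main(void)
        {
                char pid[16];

                /* Cap the group at 64 tasks... */
                if (write_str("/sys/fs/cgroup/pids/jail/pids.max", "64"))
                        return 1;
                /* ...and move ourselves into it; forks that would push the
                 * hierarchy past the limit will now fail. */
                snprintf(pid, sizeof(pid), "%d", (int)getpid());
                return write_str("/sys/fs/cgroup/pids/jail/cgroup.procs",
                                 pid) ? 1 : 0;
        }
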
      Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      49b786ea
  18. 03 Jul 2015 (1 commit)
  19. 01 May 2015 (1 commit)
  20. 16 Apr 2015 (1 commit)
  21. 29 Jan 2015 (1 commit)
  22. 23 Jan 2015 (1 commit)
  23. 22 Dec 2014 (1 commit)
  24. 11 Dec 2014 (1 commit)
  25. 28 Oct 2014 (1 commit)
    • bpf: split eBPF out of NET · f89b7755
      Authored by Alexei Starovoitov
      introduce two configs:
      - hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
        depend on
      - visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use
      
      that solves several problems:
      - tracing and others that wish to use eBPF don't need to depend on NET.
        They can use BPF_SYSCALL to allow loading from userspace or select BPF
        to use it directly from kernel in NET-less configs.
      - in 3.18 programs cannot be attached to events yet, so don't force it on
      - when the rest of eBPF infra is there in 3.19+, it's still useful to
        switch it off to minimize kernel size
      
      bloat-o-meter on x64 shows:
      add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)
      
      tested with many different config combinations. Hopefully didn't miss anything.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f89b7755
  26. 09 Aug 2014 (1 commit)
    • bin2c: move bin2c in scripts/basic · 8370edea
      Authored by Vivek Goyal
      This patch series does not do kernel signature verification yet.  I
      plan to post another patch series for that.  Distributions are already
      signing the PE/COFF bzImage with a PKCS7 signature; I plan to parse and
      verify those signatures.
      
      Primary goal of this patchset is to prepare groundwork so that kernel
      image can be signed and signatures be verified during kexec load.  This
      should help with two things.
      
      - It should allow kexec/kdump on secureboot enabled machines.
      
      - In general it can help even without secureboot.  By being able to verify
        the kernel image signature in kexec, it should help with avoiding module
        signing restrictions.  Matthew Garrett showed how to boot into a custom
        kernel, modify the first kernel's memory, then jump back to the old
        kernel and bypass any policy one wants to.
      
      This patch (of 15):
      
      Kexec wants to use bin2c, and it wants to use it really early in the
      build process.  See the arch/x86/purgatory/ code in later patches.
      
      So move bin2c into scripts/basic so that it can be built very early and
      be usable by arch/x86/purgatory/.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8370edea
  27. 24 Jul 2014 (1 commit)
  28. 23 Jun 2014 (1 commit)
  29. 24 Feb 2014 (1 commit)
  30. 14 Feb 2014 (1 commit)
  31. 11 Feb 2014 (1 commit)
  32. 13 Dec 2013 (2 commits)
    • KEYS: Remove files generated when SYSTEM_TRUSTED_KEYRING=y · f46a3cbb
      Authored by Kirill Tkhai
      Always remove generated SYSTEM_TRUSTED_KEYRING files while doing make mrproper.
      Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: David Howells <dhowells@redhat.com>
      f46a3cbb
    • X.509: Fix certificate gathering · d7ec435f
      Authored by David Howells
      Fix the gathering of certificates from both the source tree and the build tree
      to correctly calculate the pathnames of all the certificates.
      
      The problem was that if the default generated cert, signing_key.x509,
      didn't exist then it would not have a path attached, and if it did, it
      would have a path attached.
      
      This means that the contents of kernel/.x509.list would change between the
      first compilation in a directory and the second.  After the second it would
      remain stable because the signing_key.x509 file exists.
      
      The consequence was that the kernel would get relinked unconditionally on the
      second recompilation.  The second recompilation would also show something like
      this:
      
         X.509 certificate list changed
           CERTS   kernel/x509_certificate_list
           - Including cert /home/torvalds/v2.6/linux/signing_key.x509
           AS      kernel/system_certificates.o
           LD      kernel/built-in.o
      
      which is why the relink would happen.
      
      
      Unfortunately, it isn't a simple matter of just sticking a path on the front
      of the filename of the certificate in the build directory as make can't then
      work out how to build it.
      
      So the path has to be prepended to the name for sorting and duplicate
      elimination and then removed for the make rule if it is in the build tree.
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      d7ec435f
  33. 06 Nov 2013 (2 commits)