1. 20 5月, 2016 9 次提交
    • J
      mm: rename _count, field of the struct page, to _refcount · 0139aa7b
      Joonsoo Kim 提交于
      Many developers already know that field for reference count of the
      struct page is _count and atomic type.  They would try to handle it
      directly and this could break the purpose of page reference count
      tracepoint.  To prevent direct _count modification, this patch rename it
      to _refcount and add warning message on the code.  After that, developer
      who need to handle reference count will find that field should not be
      accessed directly.
      
      [akpm@linux-foundation.org: fix comments, per Vlastimil]
      [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
      [sfr@canb.auug.org.au: sync ethernet driver changes]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Sunil Goutham <sgoutham@cavium.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Manish Chopra <manish.chopra@qlogic.com>
      Cc: Yuval Mintz <yuval.mintz@qlogic.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0139aa7b
    • T
      mm: SLAB freelist randomization · c7ce4f60
      Thomas Garnier 提交于
      Provides an optional config (CONFIG_SLAB_FREELIST_RANDOM) to randomize
      the SLAB freelist.  The list is randomized during initialization of a
      new set of pages.  The order on different freelist sizes is pre-computed
      at boot for performance.  Each kmem_cache has its own randomized
      freelist.  Before pre-computed lists are available freelists are
      generated dynamically.  This security feature reduces the predictability
      of the kernel SLAB allocator against heap overflows rendering attacks
      much less stable.
      
      For example this attack against SLUB (also applicable against SLAB)
      would be affected:
      
        https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/
      
      Also, since v4.6 the freelist was moved at the end of the SLAB.  It
      means a controllable heap is opened to new attacks not yet publicly
      discussed.  A kernel heap overflow can be transformed to multiple
      use-after-free.  This feature makes this type of attack harder too.
      
      To generate entropy, we use get_random_bytes_arch because 0 bits of
      entropy is available in the boot stage.  In the worse case this function
      will fallback to the get_random_bytes sub API.  We also generate a shift
      random number to shift pre-computed freelist for each new set of pages.
      
      The config option name is not specific to the SLAB as this approach will
      be extended to other allocators like SLUB.
      
      Performance results highlighted no major changes:
      
      Hackbench (running 90 10 times):
      
        Before average: 0.0698
        After average: 0.0663 (-5.01%)
      
      slab_test 1 run on boot.  Difference only seen on the 2048 size test
      being the worse case scenario covered by freelist randomization.  New
      slab pages are constantly being created on the 10000 allocations.
      Variance should be mainly due to getting new pages every few
      allocations.
      
      Before:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        10000 times kmalloc(8) -> 99 cycles kfree -> 112 cycles
        10000 times kmalloc(16) -> 109 cycles kfree -> 140 cycles
        10000 times kmalloc(32) -> 129 cycles kfree -> 137 cycles
        10000 times kmalloc(64) -> 141 cycles kfree -> 141 cycles
        10000 times kmalloc(128) -> 152 cycles kfree -> 148 cycles
        10000 times kmalloc(256) -> 195 cycles kfree -> 167 cycles
        10000 times kmalloc(512) -> 257 cycles kfree -> 199 cycles
        10000 times kmalloc(1024) -> 393 cycles kfree -> 251 cycles
        10000 times kmalloc(2048) -> 649 cycles kfree -> 228 cycles
        10000 times kmalloc(4096) -> 806 cycles kfree -> 370 cycles
        10000 times kmalloc(8192) -> 814 cycles kfree -> 411 cycles
        10000 times kmalloc(16384) -> 892 cycles kfree -> 455 cycles
        2. Kmalloc: alloc/free test
        10000 times kmalloc(8)/kfree -> 121 cycles
        10000 times kmalloc(16)/kfree -> 121 cycles
        10000 times kmalloc(32)/kfree -> 121 cycles
        10000 times kmalloc(64)/kfree -> 121 cycles
        10000 times kmalloc(128)/kfree -> 121 cycles
        10000 times kmalloc(256)/kfree -> 119 cycles
        10000 times kmalloc(512)/kfree -> 119 cycles
        10000 times kmalloc(1024)/kfree -> 119 cycles
        10000 times kmalloc(2048)/kfree -> 119 cycles
        10000 times kmalloc(4096)/kfree -> 121 cycles
        10000 times kmalloc(8192)/kfree -> 119 cycles
        10000 times kmalloc(16384)/kfree -> 119 cycles
      
      After:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        10000 times kmalloc(8) -> 130 cycles kfree -> 86 cycles
        10000 times kmalloc(16) -> 118 cycles kfree -> 86 cycles
        10000 times kmalloc(32) -> 121 cycles kfree -> 85 cycles
        10000 times kmalloc(64) -> 176 cycles kfree -> 102 cycles
        10000 times kmalloc(128) -> 178 cycles kfree -> 100 cycles
        10000 times kmalloc(256) -> 205 cycles kfree -> 109 cycles
        10000 times kmalloc(512) -> 262 cycles kfree -> 136 cycles
        10000 times kmalloc(1024) -> 342 cycles kfree -> 157 cycles
        10000 times kmalloc(2048) -> 701 cycles kfree -> 238 cycles
        10000 times kmalloc(4096) -> 803 cycles kfree -> 364 cycles
        10000 times kmalloc(8192) -> 835 cycles kfree -> 404 cycles
        10000 times kmalloc(16384) -> 896 cycles kfree -> 441 cycles
        2. Kmalloc: alloc/free test
        10000 times kmalloc(8)/kfree -> 121 cycles
        10000 times kmalloc(16)/kfree -> 121 cycles
        10000 times kmalloc(32)/kfree -> 123 cycles
        10000 times kmalloc(64)/kfree -> 142 cycles
        10000 times kmalloc(128)/kfree -> 121 cycles
        10000 times kmalloc(256)/kfree -> 119 cycles
        10000 times kmalloc(512)/kfree -> 119 cycles
        10000 times kmalloc(1024)/kfree -> 119 cycles
        10000 times kmalloc(2048)/kfree -> 119 cycles
        10000 times kmalloc(4096)/kfree -> 119 cycles
        10000 times kmalloc(8192)/kfree -> 119 cycles
        10000 times kmalloc(16384)/kfree -> 119 cycles
      
      [akpm@linux-foundation.org: propagate gfp_t into cache_random_seq_create()]
      Signed-off-by: NThomas Garnier <thgarnie@google.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c7ce4f60
    • R
      kernel/padata.c: removed unused code · 815613da
      Richard Cochran 提交于
      By accident I stumbled across code that has never been used.  This
      driver has EXPORT_SYMBOL functions, and the only user of the code is
      pcrypt.c, but this only uses a subset of the exported symbols.
      
      According to 'git log -G', the functions, padata_set_cpumasks,
      padata_add_cpu, and padata_remove_cpu have never been used since they
      were first introduced.  This patch removes the unused code.
      
      On one 64 bit build, with CRYPTO_PCRYPT built in, the text is more than
      4k smaller.
      
        kbuild_hp> size $KBUILD_OUTPUT/vmlinux
            text    data     bss      dec hex    filename
        10566658 4678360 1122304 16367322 f9beda vmlinux
        10561984 4678360 1122304 16362648 f9ac98 vmlinux
      
      On another config, 32 bit, the saving is about 0.5k bytes.
      
        kbuild_hp-x86> size $KBUILD_OUTPUT/vmlinux
        6012005 2409513 2785280 11206798 ab008e vmlinux
        6011491 2409513 2785280 11206284 aafe8c vmlinux
      Signed-off-by: NRichard Cochran <rcochran@linutronix.de>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      815613da
    • D
      debugobjects: insulate non-fixup logic related to static obj from fixup callbacks · b9fdac7f
      Du, Changbin 提交于
      When activating a static object we need make sure that the object is
      tracked in the object tracker.  If it is a non-static object then the
      activation is illegal.
      
      In previous implementation, each subsystem need take care of this in
      their fixup callbacks.  Actually we can put it into debugobjects core.
      Thus we can save duplicated code, and have *pure* fixup callbacks.
      
      To achieve this, a new callback "is_static_object" is introduced to let
      the type specific code decide whether a object is static or not.  If
      yes, we take it into object tracker, otherwise give warning and invoke
      fixup callback.
      
      This change has paassed debugobjects selftest, and I also do some test
      with all debugobjects supports enabled.
      
      At last, I have a concern about the fixups that can it change the object
      which is in incorrect state on fixup? Because the 'addr' may not point
      to any valid object if a non-static object is not tracked.  Then Change
      such object can overwrite someone's memory and cause unexpected
      behaviour.  For example, the timer_fixup_activate bind timer to function
      stub_timer.
      
      Link: http://lkml.kernel.org/r/1462576157-14539-1-git-send-email-changbin.du@intel.com
      [changbin.du@intel.com: improve code comments where invoke the new is_static_object callback]
        Link: http://lkml.kernel.org/r/1462777431-8171-1-git-send-email-changbin.du@intel.comSigned-off-by: NDu, Changbin <changbin.du@intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Triplett <josh@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9fdac7f
    • D
      debugobjects: make fixup functions return bool instead of int · b1e4d9d8
      Du, Changbin 提交于
      I am going to introduce debugobjects infrastructure to USB subsystem.
      But before this, I found the code of debugobjects could be improved.
      This patchset will make fixup functions return bool type instead of int.
      Because fixup only need report success or no.  boolean is the 'real'
      type.
      
      This patch (of 7):
      
      The object debugging infrastructure core provides some fixup callbacks
      for the subsystem who use it.  These callbacks are called from the debug
      code whenever a problem in debug_object_init is detected.  And
      debugobjects core suppose them returns 1 when the fixup was successful,
      otherwise 0.  So the return type is boolean.
      
      A bad thing is that debug_object_fixup use the return value for
      arithmetic operation.  It confused me that what is the reall return
      type.
      
      Reading over the whole code, I found some place do use the return value
      incorrectly(see next patch).  So why use bool type instead?
      Signed-off-by: NDu, Changbin <changbin.du@intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Triplett <josh@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1e4d9d8
    • D
      time: remove timespec_add_safe() · 8e4f70e2
      Deepa Dinamani 提交于
      All references to timespec_add_safe() now use timespec64_add_safe().
      
      The plan is to replace struct timespec references with struct timespec64
      throughout the kernel as timespec is not y2038 safe.
      
      Drop timespec_add_safe() and use timespec64_add_safe() for all
      architectures.
      
      Link: http://lkml.kernel.org/r/1461947989-21926-4-git-send-email-deepa.kernel@gmail.comSigned-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Acked-by: NJohn Stultz <john.stultz@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e4f70e2
    • D
      fs: poll/select/recvmmsg: use timespec64 for timeout events · 766b9f92
      Deepa Dinamani 提交于
      struct timespec is not y2038 safe.  Even though timespec might be
      sufficient to represent timeouts, use struct timespec64 here as the plan
      is to get rid of all timespec reference in the kernel.
      
      The patch transitions the common functions: poll_select_set_timeout()
      and select_estimate_accuracy() to use timespec64.  And, all the syscalls
      that use these functions are transitioned in the same patch.
      
      The restart block parameters for poll uses monotonic time.  Use
      timespec64 here as well to assign timeout value.  This parameter in the
      restart block need not change because this only holds the monotonic
      timestamp at which timeout should occur.  And, unsigned long data type
      should be big enough for this timestamp.
      
      The system call interfaces will be handled in a separate series.
      
      Compat interfaces need not change as timespec64 is an alias to struct
      timespec on a 64 bit system.
      
      Link: http://lkml.kernel.org/r/1461947989-21926-3-git-send-email-deepa.kernel@gmail.comSigned-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Acked-by: NJohn Stultz <john.stultz@linaro.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      766b9f92
    • D
      time: add missing implementation for timespec64_add_safe() · bc2c53e5
      Deepa Dinamani 提交于
      timespec64_add_safe() has been defined in time64.h for 64 bit systems.
      But, 32 bit systems only have an extern function prototype defined.
      Provide a definition for the above function.
      
      The function will be necessary as part of y2038 changes.  struct
      timespec is not y2038 safe.  All references to timespec will be replaced
      by struct timespec64.  The function is meant to be a replacement for
      timespec_add_safe().
      
      The implementation is similar to timespec_add_safe().
      
      Link: http://lkml.kernel.org/r/1461947989-21926-2-git-send-email-deepa.kernel@gmail.comSigned-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Acked-by: NJohn Stultz <john.stultz@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc2c53e5
    • J
      fsnotify: avoid spurious EMFILE errors from inotify_init() · 35e48176
      Jan Kara 提交于
      Inotify instance is destroyed when all references to it are dropped.
      That not only means that the corresponding file descriptor needs to be
      closed but also that all corresponding instance marks are freed (as each
      mark holds a reference to the inotify instance).  However marks are
      freed only after SRCU period ends which can take some time and thus if
      user rapidly creates and frees inotify instances, number of existing
      inotify instances can exceed max_user_instances limit although from user
      point of view there is always at most one existing instance.  Thus
      inotify_init() returns EMFILE error which is hard to justify from user
      point of view.  This problem is exposed by LTP inotify06 testcase on
      some machines.
      
      We fix the problem by making sure all group marks are properly freed
      while destroying inotify instance.  We wait for SRCU period to end in
      that path anyway since we have to make sure there is no event being
      added to the instance while we are tearing down the instance.  So it
      takes only some plumbing to allow for marks to be destroyed in that path
      as well and not from a dedicated work item.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reported-by: NXiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
      Tested-by: NXiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      35e48176
  2. 18 5月, 2016 2 次提交
  3. 17 5月, 2016 7 次提交
    • D
      bpf: add generic constant blinding for use in jits · 4f3446bb
      Daniel Borkmann 提交于
      This work adds a generic facility for use from eBPF JIT compilers
      that allows for further hardening of JIT generated images through
      blinding constants. In response to the original work on BPF JIT
      spraying published by Keegan McAllister [1], most BPF JITs were
      changed to make images read-only and start at a randomized offset
      in the page, where the rest was filled with trap instructions. We
      have this nowadays in x86, arm, arm64 and s390 JIT compilers.
      Additionally, later work also made eBPF interpreter images read
      only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
      arm, arm64 and s390 archs as well currently. This is done by
      default for mentioned JITs when JITing is enabled. Furthermore,
      we had a generic and configurable constant blinding facility on our
      todo for quite some time now to further make spraying harder, and
      first implementation since around netconf 2016.
      
      We found that for systems where untrusted users can load cBPF/eBPF
      code where JIT is enabled, start offset randomization helps a bit
      to make jumps into crafted payload harder, but in case where larger
      programs that cross page boundary are injected, we again have some
      part of the program opcodes at a page start offset. With improved
      guessing and more reliable payload injection, chances can increase
      to jump into such payload. Elena Reshetova recently wrote a test
      case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
      can leave some more room for payloads. Note that for all this,
      additional bugs in the kernel are still required to make the jump
      (and of course to guess right, to not jump into a trap) and naturally
      the JIT must be enabled, which is disabled by default.
      
      For helping mitigation, the general idea is to provide an option
      bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
      that for cases where JIT should be enabled for performance reasons,
      the generated image can be further hardened with blinding constants
      for unpriviledged users (bpf_jit_harden == 1), with trading off
      performance for these, but not for privileged ones. We also added
      the option of blinding for all users (bpf_jit_harden == 2), which
      is quite helpful for testing f.e. with test_bpf.ko. There are no
      further e.g. hardening levels of bpf_jit_harden switch intended,
      rationale is to have it dead simple to use as on/off. Since this
      functionality would need to be duplicated over and over for JIT
      compilers to use, which are already complex enough, we provide a
      generic eBPF byte-code level based blinding implementation, which is
      then just transparently JITed. JIT compilers need to make only a few
      changes to integrate this facility and can be migrated one by one.
      
      This option is for eBPF JITs and will be used in x86, arm64, s390
      without too much effort, and soon ppc64 JITs, thus that native eBPF
      can be blinded as well as cBPF to eBPF migrations, so that both can
      be covered with a single implementation. The rule for JITs is that
      bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
      and in case blinding is disabled, we follow normally with JITing the
      passed program. In case blinding is enabled and we fail during the
      process of blinding itself, we must return with the interpreter.
      Similarly, in case the JITing process after the blinding failed, we
      return normally to the interpreter with the non-blinded code. Meaning,
      interpreter doesn't change in any way and operates on eBPF code as
      usual. For doing this pre-JIT blinding step, we need to make use of
      a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
      to the JIT and not in any way part of the eBPF architecture. Just like
      in the same way as JITs internally make use of some helper registers
      when emitting code, only that here the helper register is one
      abstraction level higher in eBPF bytecode, but nevertheless in JIT
      phase. That helper register is needed since f.e. manually written
      program can issue loads to all registers of eBPF architecture.
      
      The core concept with the additional register is: blind out all 32
      and 64 bit constants by converting BPF_K based instructions into a
      small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
      is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
      and REG <OP> BPF_REG_AX, so actual operation on the target register
      is translated from BPF_K into BPF_X one that is operating on
      BPF_REG_AX's content. During rewriting phase when blinding, RND is
      newly generated via prandom_u32() for each processed instruction.
      64 bit loads are split into two 32 bit loads to make translation and
      patching not too complex. Only basic thing required by JITs is to
      call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
      pair, and to map BPF_REG_AX into an unused register.
      
      Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
      
      echo 0 > /proc/sys/net/core/bpf_jit_harden
      
        ffffffffa034f5e9 + <x>:
        [...]
        39:   mov    $0xa8909090,%eax
        3e:   mov    $0xa8909090,%eax
        43:   mov    $0xa8ff3148,%eax
        48:   mov    $0xa89081b4,%eax
        4d:   mov    $0xa8900bb0,%eax
        52:   mov    $0xa810e0c1,%eax
        57:   mov    $0xa8908eb4,%eax
        5c:   mov    $0xa89020b0,%eax
        [...]
      
      echo 1 > /proc/sys/net/core/bpf_jit_harden
      
        ffffffffa034f1e5 + <x>:
        [...]
        39:   mov    $0xe1192563,%r10d
        3f:   xor    $0x4989b5f3,%r10d
        46:   mov    %r10d,%eax
        49:   mov    $0xb8296d93,%r10d
        4f:   xor    $0x10b9fd03,%r10d
        56:   mov    %r10d,%eax
        59:   mov    $0x8c381146,%r10d
        5f:   xor    $0x24c7200e,%r10d
        66:   mov    %r10d,%eax
        69:   mov    $0xeb2a830e,%r10d
        6f:   xor    $0x43ba02ba,%r10d
        76:   mov    %r10d,%eax
        79:   mov    $0xd9730af,%r10d
        7f:   xor    $0xa5073b1f,%r10d
        86:   mov    %r10d,%eax
        89:   mov    $0x9a45662b,%r10d
        8f:   xor    $0x325586ea,%r10d
        96:   mov    %r10d,%eax
        [...]
      
      As can be seen, original constants that carry payload are hidden
      when enabled, actual operations are transformed from constant-based
      to register-based ones, making jumps into constants ineffective.
      Above extract/example uses single BPF load instruction over and
      over, but of course all instructions with constants are blinded.
      
      Performance wise, JIT with blinding performs a bit slower than just
      JIT and faster than interpreter case. This is expected, since we
      still get all the performance benefits from JITing and in normal
      use-cases not every single instruction needs to be blinded. Summing
      up all 296 test cases averaged over multiple runs from test_bpf.ko
      suite, interpreter was 55% slower than JIT only and JIT with blinding
      was 8% slower than JIT only. Since there are also some extremes in
      the test suite, I expect for ordinary workloads that the performance
      for the JIT with blinding case is even closer to JIT only case,
      f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
      35 (+ blinding), and 151 (interpreter).
      
      BPF test suite, seccomp test suite, eBPF sample code and various
      bigger networking eBPF programs have been tested with this and were
      running fine. For testing purposes, I also adapted interpreter and
      redirected blinded eBPF image to interpreter and also here all tests
      pass.
      
        [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
        [2] https://github.com/01org/jit-spray-poc-for-ksp/
        [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NElena Reshetova <elena.reshetova@intel.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f3446bb
    • D
      bpf: prepare bpf_int_jit_compile/bpf_prog_select_runtime apis · d1c55ab5
      Daniel Borkmann 提交于
      Since the blinding is strictly only called from inside eBPF JITs,
      we need to change signatures for bpf_int_jit_compile() and
      bpf_prog_select_runtime() first in order to prepare that the
      eBPF program we're dealing with can change underneath. Hence,
      for call sites, we need to return the latest prog. No functional
      change in this patch.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d1c55ab5
    • D
      bpf: add bpf_patch_insn_single helper · c237ee5e
      Daniel Borkmann 提交于
      Move the functionality to patch instructions out of the verifier
      code and into the core as the new bpf_patch_insn_single() helper
      will be needed later on for blinding as well. No changes in
      functionality.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c237ee5e
    • D
      bpf: move bpf_jit_enable declaration · c94987e4
      Daniel Borkmann 提交于
      Move the bpf_jit_enable declaration to the filter.h file where
      most other core code is declared, also since we're going to add
      a second knob there.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c94987e4
    • A
      net/mlx5_core: Flow counters infrastructure · 43a335e0
      Amir Vadai 提交于
      If a counter has the aging flag set when created, it is added to a list
      of counters that will be queried periodically from a workqueue.  query
      result and last use timestamp are cached.
      add/del counter must be very efficient since thousands of such
      operations might be issued in a second.
      There is only a single reference to counters without aging, therefore
      no need for locks.
      But, counters with aging enabled are stored in a list. In order to make
      code as lockless as possible, all the list manipulation and access to
      hardware is done from a single context - the periodic counters query
      thread.
      
      The hardware supports multiple counters per FTE, however currently we
      are using one counter for each FTE.
      Signed-off-by: NAmir Vadai <amirva@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      43a335e0
    • A
      net/mlx5_core: Introduce flow steering destination of type counter · bd5251db
      Amir Vadai 提交于
      When adding a flow steering rule with a counter, need to supply a
      destination of type MLX5_FLOW_DESTINATION_TYPE_COUNTER, with a pointer
      to a struct mlx5_fc.
      Also, MLX5_FLOW_CONTEXT_ACTION_COUNT bit should be set in the action.
      Signed-off-by: NAmir Vadai <amirva@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd5251db
    • A
      net/mlx5_core: Firmware commands to support flow counters · 9dc0b289
      Amir Vadai 提交于
      Getting packet/byte statistics on flows is done through flow counters.
      Implement the firmware commands to alloc, free and query flow counters.
      Signed-off-by: NAmir Vadai <amirva@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9dc0b289
  4. 16 5月, 2016 2 次提交
    • D
      ibft: Expose iBFT acpi header via sysfs · b3c8eb50
      David Bond 提交于
      Some ethernet adapter vendors are supplying products which support optional
      (payed license) features. On some adapters this includes a hardware iscsi
      initiator.  The same adapters in a normal (no extra licenses) mode of
      operation can be used as a software iscsi initiator.  In addition, software
      iscsi boot initiators are becoming a standard part of many vendors uefi
      implementations.  This is creating difficulties during early boot/install
      determining the proper configuration method for these adapters when they
      are used as a boot device.
      
      The attached patch creates sysfs entries to expose information from the
      acpi header of the ibft table.  This information allows for a method to
      easily determining if an ibft table was created by a ethernet card's
      firmware or the system uefi/bios.  In the case of a hardware initiator this
      information in combination with the pci vendor and device id can be used
      to ascertain any vendor specific behaviors that need to be accommodated.
      Reviewed-by: NLee Duncan <lduncan@suse.com>
      Signed-off-by: NDavid Bond <dbond@suse.com>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      b3c8eb50
    • H
      iscsi_ibft: Add prefix-len attr and display netmask · 9a99425f
      Hannes Reinecke 提交于
      The iBFT table only specifies a prefix length, not a netmask.
      And the netmask is pretty much pointless for IPv6.
      So introduce a new attribute 'prefix-len'.
      
      Some older user-space code might rely on the netmask attribute
      being present, so we should always display it.
      
      Changes from v1:
       - Combined two patches into one
      
      Changes from v2:
       - Cleaned up/corrected wording for patch description
      Signed-off-by: NHannes Reinecke <hare@suse.de>
      Signed-off-by: NLee Duncan <lduncan@suse.com>
      Reviewed-by: NMike Christie <michaelc@cs.wisc.edu>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad@kernel.org>
      9a99425f
  5. 13 5月, 2016 5 次提交
    • A
      ext4: switch to ->iterate_shared() · ae05327a
      Al Viro 提交于
      Note that we need relax_dir() equivalent for directories
      locked shared.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ae05327a
    • A
      mm: thp: calculate the mapcount correctly for THP pages during WP faults · 6d0a07ed
      Andrea Arcangeli 提交于
      This will provide fully accuracy to the mapcount calculation in the
      write protect faults, so page pinning will not get broken by false
      positive copy-on-writes.
      
      total_mapcount() isn't the right calculation needed in
      reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
      that is effectively the full accurate return value for page_mapcount()
      if dealing with Transparent Hugepages, however we only use the
      page_trans_huge_mapcount() during COW faults where it strictly needed,
      due to its higher runtime cost.
      
      This also provide at practical zero cost the total_mapcount
      information which is needed to know if we can still relocate the page
      anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
      can reuse the page no matter if it's a pte or a pmd_trans_huge
      triggering the fault, but we can only relocate the page anon_vma to
      the local vma->anon_vma if we're sure it's only this "vma" mapping the
      whole THP physical range.
      
      Kirill A. Shutemov discovered the problem with moving the page
      anon_vma to the local vma->anon_vma in a previous version of this
      patch and another problem in the way page_move_anon_rmap() was called.
      
      Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
      previous version, because reuse_swap_page must be a macro to call
      page_trans_huge_mapcount from swap.h, so this uses a macro again
      instead of an inline function. With this change at least it's a less
      dangerous usage than it was before, because "page" is used only once
      now, while with the previous code reuse_swap_page(page++) would have
      called page_mapcount on page+1 and it would have increased page twice
      instead of just once.
      
      Dean Luick noticed an uninitialized variable that could result in a
      rmap inefficiency for the non-THP case in a previous version.
      
      Mike Marciniszyn said:
      
      : Our RDMA tests are seeing an issue with memory locking that bisects to
      : commit 61f5d698 ("mm: re-enable THP")
      :
      : The test program registers two rather large MRs (512M) and RDMA
      : writes data to a passive peer using the first and RDMA reads it back
      : into the second MR and compares that data.  The sizes are chosen randomly
      : between 0 and 1024 bytes.
      :
      : The test will get through a few (<= 4 iterations) and then gets a
      : compare error.
      :
      : Tracing indicates the kernel logical addresses associated with the individual
      : pages at registration ARE correct , the data in the "RDMA read response only"
      : packets ARE correct.
      :
      : The "corruption" occurs when the packet crosse two pages that are not physically
      : contiguous.   The second page reads back as zero in the program.
      :
      : It looks like the user VA at the point of the compare error no longer points to
      : the same physical address as was registered.
      :
      : This patch totally resolves the issue!
      
      Link: http://lkml.kernel.org/r/1462547040-1737-2-git-send-email-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Reviewed-by: NDean Luick <dean.luick@intel.com>
      Tested-by: NAlex Williamson <alex.williamson@redhat.com>
      Tested-by: NMike Marciniszyn <mike.marciniszyn@intel.com>
      Tested-by: NJosh Collier <josh.d.collier@intel.com>
      Cc: Marc Haber <mh+linux-kernel@zugschlus.de>
      Cc: <stable@vger.kernel.org>	[4.5]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d0a07ed
    • B
      remoteproc: Add additional crash reasons · b3d39032
      Bjorn Andersson 提交于
      The Qualcomm WCNSS can crash by watchdog or a fatal software error. Add
      these types to the list of remoteproc crash reasons.
      Signed-off-by: NBjorn Andersson <bjorn.andersson@sonymobile.com>
      Signed-off-by: NBjorn Andersson <bjorn.andersson@linaro.org>
      b3d39032
    • O
      coredump: only charge written data against RLIMIT_CORE · 2c4cb043
      Omar Sandoval 提交于
      Commit 9b56d543 ("dump_skip(): dump_seek() replacement taking
      coredump_params") introduced a regression with regard to RLIMIT_CORE.
      Previously, when a core dump was sparse, only the data that was actually
      written out would count against the limit. Now, the sparse ranges are
      also included, which leads to truncated core dumps when the actual disk
      usage is still well below the limit. Restore the old behavior by only
      counting what gets emitted and ignoring what gets skipped.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2c4cb043
    • O
      coredump: get rid of coredump_params->written · a0083939
      Omar Sandoval 提交于
      cprm->written is redundant with cprm->file->f_pos, so use that instead.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a0083939
  6. 12 5月, 2016 12 次提交
  7. 11 5月, 2016 3 次提交