1. 04 April 2014 (12 commits)
  2. 03 April 2014 (1 commit)
  3. 31 March 2014 (2 commits)
    • net: filter: rework/optimize internal BPF interpreter's instruction set · bd4cf0ed
      Committed by Alexei Starovoitov
      This patch replaces/reworks the kernel-internal BPF interpreter with
      an optimized BPF instruction set format that is modelled more closely
      on native instruction sets and is designed to be JITed with a
      one-to-one mapping. Thus, the new interpreter is noticeably faster
      than the current implementation of sk_run_filter(), mainly for two
      reasons:
      
      1. Fall-through jumps:
      
        BPF jump instructions are forced to take either the 'true' or the
        'false' branch, which causes a branch-miss penalty. The new BPF
        jump instructions have only one branch and fall through otherwise,
        which fits the CPU branch predictor logic better. `perf stat`
        shows a drastic difference in branch-misses between the old and
        the new code.
      
      2. Jump-threaded implementation of interpreter vs switch
         statement:
      
        Instead of a single table-jump at the top of the 'switch'
        statement, gcc now generates multiple table-jump instructions,
        which helps the CPU branch predictor logic (see the dispatch
        sketch below).
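
      To illustrate the jump-threading point, here is a minimal sketch of
      computed-goto dispatch (a gcc extension); it is not the kernel's
      actual interpreter code, only an assumed toy with two opcodes,
      showing why every handler ends in its own indirect branch instead of
      looping back to one shared switch:

        #include <stdint.h>

        struct insn { uint8_t code; int32_t imm; };
        enum { OP_ADD, OP_EXIT };

        static int64_t run(const struct insn *insn)
        {
            /* one label per opcode; table of label addresses (gcc extension) */
            static const void *jumptable[] = {
                [OP_ADD]  = &&do_add,
                [OP_EXIT] = &&do_exit,
            };
            int64_t acc = 0;

            /* each handler dispatches the next insn itself, so the CPU sees
             * many independent indirect branches rather than a single shared
             * table-jump at the top of a switch statement */
            #define NEXT() goto *jumptable[(++insn)->code]

            goto *jumptable[insn->code];
        do_add:
            acc += insn->imm;
            NEXT();
        do_exit:
            return acc;
        }

      Running this over { {OP_ADD, 5}, {OP_ADD, 7}, {OP_EXIT, 0} } returns 12.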
      
      Note that the verification of filters is still being done through
      sk_chk_filter() in classical BPF format, so filters from user- or
      kernel space are verified in the same way as we do now, and same
      restrictions/constraints hold as well.
      
      The current BPF JIT compilers are reused in such a way that this
      upgrade is fine even as is, while nevertheless allowing for a
      successive migration of the BPF JIT compilers to the new format.
      
      The internal instruction set migration is being done after the
      probing for JIT compilation, so in case JIT compilers are able to
      create a native opcode image, we're going to use that, and in all
      other cases we're doing a follow-up migration of the BPF program's
      instruction set, so that it can be transparently run in the new
      interpreter.
      
      In short, the *internal* format extends BPF in the following ways
      (more details can be found in the appended documentation; an
      illustrative encoding sketch follows the list):
      
        - Number of registers increases from 2 to 10
        - Register width increases from 32-bit to 64-bit
        - Conditional jt/jf targets replaced with jt/fall-through
        - Adds signed > and >= insns
        - 16 4-byte stack slots for register spill-fill replaced
          with up to 512 bytes of multi-use stack space
        - Introduction of bpf_call insn and register passing convention
          for zero overhead calls from/to other kernel functions
        - Adds arithmetic right shift and endianness conversion insns
        - Adds atomic_add insn
        - Old tax/txa insns are replaced with 'mov dst,src' insn
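
      To make the wider register file and fall-through jumps concrete, here
      is a sketch of how such a fixed-size 64-bit instruction word could be
      laid out; the struct and field names are illustrative assumptions,
      not necessarily the kernel's exact definition:

        #include <stdint.h>

        struct ebpf_insn_sketch {
            uint8_t code;        /* opcode */
            uint8_t dst_reg:4;   /* destination register, r0..r9 */
            uint8_t src_reg:4;   /* source register, r0..r9 */
            int16_t off;         /* signed jump/memory offset */
            int32_t imm;         /* signed immediate operand */
        };

      Ten registers fit in a 4-bit field, and a conditional jump needs only
      one target offset because the not-taken case simply falls through to
      the next instruction.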
      
      Performance of two BPF filters, generated by libpcap and bpf_asm
      respectively, was measured on x86_64, i386 and arm32 (other libpcap
      programs show similar performance differences):
      
      fprog #1 is taken from Documentation/networking/filter.txt:
      tcpdump -i eth0 port 22 -dd
      
      fprog #2 is taken from 'man tcpdump':
      tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
         ((tcp[12]&0xf0)>>2)) != 0)' -dd
      
      Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
      same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
      smaller is better:
      
      --x86_64--
               fprog #1  fprog #1   fprog #2  fprog #2
               cache-hit cache-miss cache-hit cache-miss
      old BPF      90       101        192       202
      new BPF      31        71         47        97
      old BPF jit  12        34         17        44
      new BPF jit TBD
      
      --i386--
               fprog #1  fprog #1   fprog #2  fprog #2
               cache-hit cache-miss cache-hit cache-miss
      old BPF     107       136        227       252
      new BPF      40       119         69       172
      
      --arm32--
               fprog #1  fprog #1   fprog #2  fprog #2
               cache-hit cache-miss cache-hit cache-miss
      old BPF     202       300        475       540
      new BPF     180       270        330       470
      old BPF jit  26       182         37       202
      new BPF jit TBD
      
      Thus, without changing any userland BPF filters, applications on
      top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
      classifier, netfilter's xt_bpf, team driver's load-balancing mode,
      and many more will have better interpreter filtering performance.
      
      While we are replacing the internal BPF interpreter, we also need
      to convert seccomp BPF in the same step to make use of the new
      internal structure, since seccomp BPF uses lower-level API details
      and is not decoupled through higher-level calls such as
      sk_unattached_filter_{create,destroy}().
      
      Just as for normal socket filtering, also seccomp BPF experiences
      a time-to-verdict speedup:
      
      05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
      
        seccomp_rule_add_exact(ctx,...
        seccomp_rule_add_exact(ctx,...
      
        rc = seccomp_load(ctx);
      
        for (i = 0; i < 10000000; i++)
           syscall(199, 100);
      
      'short filter' has 2 rules
      'large filter' has 200 rules
      
      'short filter' performance is slightly better on x86_64/i386/arm32
      'large filter' is much faster on x86_64 and i386 and shows no
                     difference on arm32
      
      --x86_64-- short filter
      old BPF: 2.7 sec
       39.12%  bench  libc-2.15.so       [.] syscall
        8.10%  bench  [kernel.kallsyms]  [k] sk_run_filter
        6.31%  bench  [kernel.kallsyms]  [k] system_call
        5.59%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
        4.37%  bench  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
        3.70%  bench  [kernel.kallsyms]  [k] __secure_computing
        3.67%  bench  [kernel.kallsyms]  [k] lock_is_held
        3.03%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
      new BPF: 2.58 sec
       42.05%  bench  libc-2.15.so       [.] syscall
        6.91%  bench  [kernel.kallsyms]  [k] system_call
        6.25%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
        6.07%  bench  [kernel.kallsyms]  [k] __secure_computing
        5.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
      
      --arm32-- short filter
      old BPF: 4.0 sec
       39.92%  bench  [kernel.kallsyms]  [k] vector_swi
       16.60%  bench  [kernel.kallsyms]  [k] sk_run_filter
       14.66%  bench  libc-2.17.so       [.] syscall
        5.42%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
        5.10%  bench  [kernel.kallsyms]  [k] __secure_computing
      new BPF: 3.7 sec
       35.93%  bench  [kernel.kallsyms]  [k] vector_swi
       21.89%  bench  libc-2.17.so       [.] syscall
       13.45%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
        6.25%  bench  [kernel.kallsyms]  [k] __secure_computing
        3.96%  bench  [kernel.kallsyms]  [k] syscall_trace_exit
      
      --x86_64-- large filter
      old BPF: 8.6 seconds
          73.38%    bench  [kernel.kallsyms]  [k] sk_run_filter
          10.70%    bench  libc-2.15.so       [.] syscall
           5.09%    bench  [kernel.kallsyms]  [k] seccomp_bpf_load
           1.97%    bench  [kernel.kallsyms]  [k] system_call
      new BPF: 5.7 seconds
          66.20%    bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
          16.75%    bench  libc-2.15.so       [.] syscall
           3.31%    bench  [kernel.kallsyms]  [k] system_call
           2.88%    bench  [kernel.kallsyms]  [k] __secure_computing
      
      --i386-- large filter
      old BPF: 5.4 sec
      new BPF: 3.8 sec
      
      --arm32-- large filter
      old BPF: 13.5 sec
       73.88%  bench  [kernel.kallsyms]  [k] sk_run_filter
       10.29%  bench  [kernel.kallsyms]  [k] vector_swi
        6.46%  bench  libc-2.17.so       [.] syscall
        2.94%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
        1.19%  bench  [kernel.kallsyms]  [k] __secure_computing
        0.87%  bench  [kernel.kallsyms]  [k] sys_getuid
      new BPF: 13.5 sec
       76.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
       10.98%  bench  [kernel.kallsyms]  [k] vector_swi
        5.87%  bench  libc-2.17.so       [.] syscall
        1.77%  bench  [kernel.kallsyms]  [k] __secure_computing
        0.93%  bench  [kernel.kallsyms]  [k] sys_getuid
      
      BPF filters generated by seccomp are very branchy, so the new
      internal BPF performance is better than the old one. Performance
      gains will be even higher when BPF JIT is committed for the
      new structure, which is planned in future work (as successive
      JIT migrations).
      
      BPF has also been stress-tested with trinity's BPF fuzzer.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Paul Moore <pmoore@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: linux-kernel@vger.kernel.org
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bd4cf0ed
    • AUDIT: Allow login in non-init namespaces · aa4af831
      Committed by Eric Paris
      It is possible to configure your PAM stack to refuse login if audit
      messages (about the login) could not be sent.  This is common in
      many distros and thus the normal configuration of many containers.
      The PAM modules determine whether audit is enabled or disabled in the
      kernel based on the return value from sending an audit message on the
      netlink socket.  If userspace gets back ECONNREFUSED, it believes
      audit is disabled in the kernel.  If it gets back any other error, it
      refuses to let the login proceed.
      
      Just about ever since the introduction of namespaces the kernel audit
      subsystem has returned EPERM if the task sending a message was not in
      the init user or pid namespace.  So many forms of containers have never
      worked if audit was enabled in the kernel.
      
      BUT if the container was not in init_net, then the kernel network code
      would send ECONNREFUSED (instead of the audit code sending EPERM).  Thus,
      by pure accident/dumb luck/bug, if an admin configured the PAM stack to
      reject all logins that didn't talk to audit, but then ran the login
      utility in a non-init_net namespace, it would work!! Clearly this was
      a bug, but it is a bug some people expected.
      
      With the introduction of network namespace support in 3.14-rc1 the two
      bugs stopped cancelling each other out.  Now, containers in the
      non-init_net namespace refused to let users log in (just like PAM was
      configured!).  Obviously some people were not happy that what used to
      let users log in now didn't!
      
      This fix is kinda hacky.  We return ECONNREFUSED for all relevant
      non-init namespaces.  That means that not only will the old broken
      non-init_net setups continue to work, now the broken non-init_pid or
      non-init_user setups will 'work' too.  They don't really work, since
      audit isn't logging things.  But it's what most users want.
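
      A hedged sketch of the check described above (the function name and
      exact placement are assumptions, not the kernel's precise code):
      treat any sender outside the init user and pid namespaces as if
      audit were disabled, by returning the same -ECONNREFUSED that the
      netlink layer used to produce for non-init_net senders:

        #include <linux/cred.h>
        #include <linux/errno.h>
        #include <linux/pid_namespace.h>
        #include <linux/sched.h>
        #include <linux/user_namespace.h>

        static int audit_netlink_ok_sketch(void)
        {
            /* outside init_user_ns or init_pid_ns: pretend audit is off,
             * so PAM sees ECONNREFUSED and lets the login proceed */
            if (current_user_ns() != &init_user_ns)
                return -ECONNREFUSED;
            if (task_active_pid_ns(current) != &init_pid_ns)
                return -ECONNREFUSED;

            return 0;   /* init namespaces: normal audit permission checks apply */
        }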
      
      In 3.15 we should have patches to support not only the non-init_net
      (3.14) namespace but also the non-init_pid and non-init_user namespace.
      So all will be right in the world.  This just opens the doors wide open
      on 3.14 and hopefully makes users happy, if not the audit system...
      Reported-by: Andre Tomt <andre@tomt.net>
      Reported-by: Adam Richter <adam_richter2004@yahoo.com>
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa4af831
  4. 28 March 2014 (1 commit)
  5. 26 March 2014 (4 commits)
  6. 22 March 2014 (1 commit)
  7. 21 March 2014 (4 commits)
    • futex: revert back to the explicit waiter counting code · 11d4616b
      Committed by Linus Torvalds
      Srikar Dronamraju reports that commit b0c29f79 ("futexes: Avoid
      taking the hb->lock if there's nothing to wake up") causes java threads
      to get stuck on futexes when running specjbb on a power7 numa box.

      The cause appears to be that the powerpc spinlocks aren't using the same
      ticket lock model that we use on x86 (and other) architectures, which in
      turn results in the "spin_is_locked()" test in hb_waiters_pending()
      occasionally reporting an unlocked spinlock even when there are pending
      waiters.
      
      So this reinstates Davidlohr Bueso's original explicit waiter counting
      code, which I had convinced Davidlohr to drop in favor of figuring out
      the pending waiters by just using the existing state of the spinlock and
      the wait queue.
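
      A hedged sketch of what explicit waiter counting looks like (the
      helper names follow the ones mentioned above, but the details are
      illustrative rather than the exact futex.c code): each hash bucket
      carries an atomic waiter count that the wake-up path reads instead of
      guessing from the spinlock's state:

        #include <linux/atomic.h>
        #include <linux/plist.h>
        #include <linux/spinlock.h>

        struct futex_hash_bucket_sketch {
            atomic_t waiters;           /* tasks queued or about to queue here */
            spinlock_t lock;
            struct plist_head chain;
        };

        static inline void hb_waiters_inc(struct futex_hash_bucket_sketch *hb)
        {
            atomic_inc(&hb->waiters);
            /* full barrier so the waker's futex value check cannot be
             * reordered before the incremented count becomes visible */
            smp_mb__after_atomic();
        }

        static inline void hb_waiters_dec(struct futex_hash_bucket_sketch *hb)
        {
            atomic_dec(&hb->waiters);
        }

        static inline int hb_waiters_pending(struct futex_hash_bucket_sketch *hb)
        {
            /* waker may skip taking hb->lock entirely when this is zero */
            return atomic_read(&hb->waiters);
        }
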
      Reported-and-tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Original-code-by: Davidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11d4616b
    • rcu: Provide grace-period piggybacking API · 765a3f4f
      Committed by Paul E. McKenney
      The following pattern is currently not well supported by RCU:
      
      1.	Make data element inaccessible to RCU readers.
      
      2.	Do work that probably lasts for more than one grace period.
      
      3.	Do something to make sure RCU readers in flight before #1 above
      	have completed.
      
      Here are some things that could currently be done:
      
      a.	Do a synchronize_rcu() unconditionally at either #1 or #3 above.
      	This works, but imposes needless work and latency.
      
      b.      Post an RCU callback at #1 above that does a wakeup, then
              wait for the wakeup at #3.  This works well, but likely results
              in an extra unneeded grace period.  Open-coding this is also
              somewhat trickier than would be good.
      
      This commit therefore adds get_state_synchronize_rcu() and
      cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
      above and pass its return value to cond_synchronize_rcu() at #3 above.
      This results in a call to synchronize_rcu() if no grace period has
      elapsed between #1 and #3, but requires only a load, comparison, and
      memory barrier if a full grace period did elapse.
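
      For illustration, a minimal usage sketch of the new API pair (the
      element type and the unpublish/work helpers are hypothetical; only
      get_state_synchronize_rcu() and cond_synchronize_rcu() come from
      this commit):

        #include <linux/rcupdate.h>
        #include <linux/slab.h>

        struct my_elem;                                  /* hypothetical element type */
        void unpublish_element(struct my_elem *p);       /* hypothetical: step #1 helper */
        void do_long_running_work(struct my_elem *p);    /* hypothetical: step #2 work */

        void elem_teardown(struct my_elem *p)
        {
            unsigned long gp_snap;

            unpublish_element(p);                    /* #1: make element unreachable to readers */
            gp_snap = get_state_synchronize_rcu();   /* snapshot current grace-period state */

            do_long_running_work(p);                 /* #2: likely spans at least one grace period */

            cond_synchronize_rcu(gp_snap);           /* #3: calls synchronize_rcu() only if no GP elapsed */
            kfree(p);                                /* safe: all readers that could see p are done */
        }
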
      Requested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      765a3f4f
    • Rename TAINT_UNSAFE_SMP to TAINT_CPU_OUT_OF_SPEC · 8c90487c
      Committed by Dave Jones
      Rename TAINT_UNSAFE_SMP to TAINT_CPU_OUT_OF_SPEC, so we can repurpose
      the flag to encompass a wider range of ways of pushing the CPU beyond
      its warranty.
      Signed-off-by: Dave Jones <davej@fedoraproject.org>
      Link: http://lkml.kernel.org/r/20140226154949.GA770@redhat.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      8c90487c
    • tracing: Fix array size mismatch in format string · 87291347
      Committed by Vaibhav Nagarnaik
      In event format strings, the array size is reported in two locations:
      once in the array subscript and then via the "size:" attribute. The
      values reported in these two locations do not match.

      For example, in sched:sched_switch the prev_comm and next_comm character
      arrays have a subscript value of [32], whereas the actual field size is
      16.
      
      name: sched_switch
      ID: 301
      format:
              field:unsigned short common_type;       offset:0;       size:2; signed:0;
              field:unsigned char common_flags;       offset:2;       size:1; signed:0;
              field:unsigned char common_preempt_count;       offset:3;       size:1;signed:0;
              field:int common_pid;   offset:4;       size:4; signed:1;
      
              field:char prev_comm[32];       offset:8;       size:16;        signed:1;
              field:pid_t prev_pid;   offset:24;      size:4; signed:1;
              field:int prev_prio;    offset:28;      size:4; signed:1;
              field:long prev_state;  offset:32;      size:8; signed:1;
              field:char next_comm[32];       offset:40;      size:16;        signed:1;
              field:pid_t next_pid;   offset:56;      size:4; signed:1;
              field:int next_prio;    offset:60;      size:4; signed:1;
      
      After bisection, the following commit was blamed:
      92edca07 tracing: Use direct field, type and system names
      
      This commit removes the duplication of strings for field->name and
      field->type, assuming that all the strings passed to
      __trace_define_field() are immutable. This is not true for arrays, where
      the type string is created in the event_storage variable and field->type
      for all array fields points to event_storage.

      Use __stringify() to create a string constant for the type string
      instead, as sketched below.
      
      Also, get rid of event_storage and event_storage_mutex that are not
      needed anymore.
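
      A hedged sketch of the idea (the macro shape is simplified and the
      trace_define_field() call abbreviated; this is not the exact ftrace.h
      change): the type string for an array field becomes a compile-time
      string constant built with __stringify(), so every field keeps its
      own immutable "char[16]"-style string instead of pointing at a
      shared, mutable event_storage buffer:

        /* __stringify() as defined in include/linux/stringify.h */
        #define __stringify_1(x...) #x
        #define __stringify(x...)   __stringify_1(x)

        /* simplified array-field definition: #type "[" __stringify(len) "]"
         * expands to a constant such as "char[16]" at compile time */
        #define SKETCH_DEFINE_ARRAY(type, item, len)                         \
                trace_define_field(event_call,                               \
                                   #type "[" __stringify(len) "]",           \
                                   #item,                                    \
                                   offsetof(typeof(field), item),            \
                                   sizeof(field.item),                       \
                                   is_signed_type(type), FILTER_OTHER)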
      
      Also, an added benefit is that this reduces the overhead of events a bit more:
      
         text    data     bss     dec     hex filename
      8424787 2036472 1302528 11763787         b3804b vmlinux
      8420814 2036408 1302528 11759750         b37086 vmlinux.patched
      
      Link: http://lkml.kernel.org/r/1392349908-29685-1-git-send-email-vnagarnaik@google.com
      
      Cc: Laurent Chavey <chavey@google.com>
      Cc: stable@vger.kernel.org # 3.10+
      Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      87291347
  8. 20 March 2014 (3 commits)
  9. 19 March 2014 (3 commits)
  10. 14 March 2014 (1 commit)
  11. 13 March 2014 (3 commits)
    • block: remove old blk_iopoll_enabled variable · 89f8b33c
      Committed by Jens Axboe
      This was a debugging measure to toggle enabled/disabled
      when testing. But for real production setups, it's not
      safe to toggle this setting without either reloading
      drivers or quiescing IO first, neither of which the toggle
      enforces.

      Additionally, it makes drivers deal with the conditional
      state.
      
      Remove it completely. It's up to the driver whether iopoll
      is enabled or not.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      89f8b33c
    • sched: Remove needless round trip nsecs <-> tick conversion of steal time · 300a9d88
      Committed by Frederic Weisbecker
      When update_rq_clock_task() accounts the pending steal time for a task,
      it converts the steal delta from nsecs to ticks and then from ticks
      back to nsecs.

      There is no apparent good reason for doing that, because both
      the task clock and the prev steal delta are u64 and store values
      in nsecs.

      So let's remove the needless conversion.
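
      A hedged before/after sketch of the accounting step (helper and field
      names are illustrative, not the exact update_rq_clock_task() code);
      the round trip through ticks only discards sub-tick precision, since
      both sides are u64 nanosecond values:

        #include <stdint.h>

        #define TICK_NSEC_SKETCH 10000000ULL   /* 10 ms tick, i.e. HZ=100 (assumption) */

        /* before: nsecs -> ticks -> nsecs, losing the sub-tick remainder */
        static void account_steal_old(uint64_t *prev_steal_time_rq, uint64_t steal_ns)
        {
            uint64_t ticks = steal_ns / TICK_NSEC_SKETCH;
            *prev_steal_time_rq += ticks * TICK_NSEC_SKETCH;
        }

        /* after: account the nanosecond delta directly */
        static void account_steal_new(uint64_t *prev_steal_time_rq, uint64_t steal_ns)
        {
            *prev_steal_time_rq += steal_ns;
        }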
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      300a9d88
    • cputime: Fix jiffies based cputime assumption on steal accounting · dee08a72
      Committed by Frederic Weisbecker
      The steal guest time accounting code assumes that cputime_t is based on
      jiffies. So when CONFIG_NO_HZ_FULL=y, which implies that cputime_t
      is based on nsecs, steal_account_process_tick() passes the delta in
      jiffies to account_steal_time() which then accounts it as if it's a
      value in nsecs.
      
      As a result, accounting 1 second of steal time (with HZ=100 that would
      be 100 jiffies) is spuriously accounted as 100 nsecs.
      
      As such /proc/stat may report 0 values of steal time even when two
      guests have run concurrently for a few seconds on the same host and
      same CPU.
      
      In order to fix this, let's convert the nsecs based steal delta to
      cputime instead of jiffies by using the right conversion API.

      Given that the steal time is stored in cputime_t and this type can have
      a coarser granularity than nsecs, we only account the rounded converted
      value and leave the remaining nsecs for the next deltas.
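
      A hedged sketch of the remainder-preserving conversion (toy types and
      names; not the exact steal_account_process_tick() code): only the
      nanoseconds covered by whole cputime units are consumed from the
      pending delta, and the leftover nanoseconds stay behind for the next
      accounting pass:

        #include <stdint.h>

        typedef uint64_t cputime_sketch_t;
        #define NSEC_PER_CPUTIME 10000000ULL   /* jiffies-based cputime with HZ=100 (assumption) */

        static cputime_sketch_t account_steal_sketch(uint64_t steal_clock_ns,
                                                     uint64_t *prev_steal_time_ns)
        {
            uint64_t delta_ns = steal_clock_ns - *prev_steal_time_ns;

            /* nsecs -> cputime, rounding down to whole cputime units */
            cputime_sketch_t steal_ct = delta_ns / NSEC_PER_CPUTIME;

            /* consume only the rounded amount; the remainder is picked up
             * by the next delta instead of being lost */
            *prev_steal_time_ns += steal_ct * NSEC_PER_CPUTIME;

            return steal_ct;   /* the value handed to the steal-time accounting */
        }
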
      Reported-by: Huiqingding <huding@redhat.com>
      Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      dee08a72
  12. 12 March 2014 (5 commits)