1. 01 Jun 2016, 1 commit
  2. 26 May 2016, 1 commit
  3. 21 May 2016, 1 commit
  4. 17 May 2016, 4 commits
    • net: also make sch_handle_egress() drop monitor ready · 7e2c3aea
      Committed by Daniel Borkmann
      Follow-up for 8a3a4c6e ("net: make sch_handle_ingress() drop
      monitor ready") to also make the egress side drop monitor ready.
      
      Here, too, only TC_ACT_SHOT is a clear indication that something
      went wrong. Hence don't provide false positives to drop monitors
      such as 'perf record -e skb:kfree_skb ...'.
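
      The resulting pattern in sch_handle_egress() looks roughly like this
      (simplified; stats and return-code handling elided):

        switch (tc_classify(skb, cl, &cl_res, false)) {
        case TC_ACT_OK:
        case TC_ACT_RECLASSIFY:
                skb->tc_index = TC_H_MIN(cl_res.classid);
                break;
        case TC_ACT_SHOT:
                /* a real drop: keep it visible to drop monitors */
                kfree_skb(skb);
                return NULL;
        case TC_ACT_STOLEN:
        case TC_ACT_QUEUED:
                /* not an error: consume silently, no kfree_skb tracepoint */
                consume_skb(skb);
                return NULL;
        default:
                break;
        }
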
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add generic constant blinding for use in jits · 4f3446bb
      Committed by Daniel Borkmann
      This work adds a generic facility for use from eBPF JIT compilers
      that allows for further hardening of JIT generated images through
      blinding constants. In response to the original work on BPF JIT
      spraying published by Keegan McAllister [1], most BPF JITs were
      changed to make images read-only and start at a randomized offset
      in the page, where the rest was filled with trap instructions. We
      have this nowadays in x86, arm, arm64 and s390 JIT compilers.
      Additionally, later work also made eBPF interpreter images read
      only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
      arm, arm64 and s390 archs as well currently. This is done by
      default for the mentioned JITs when JITing is enabled. Furthermore,
      a generic and configurable constant blinding facility has been on
      our todo list for quite some time now to make spraying harder still;
      this is its first implementation, in the works since around netconf 2016.
      
      We found that for systems where untrusted users can load cBPF/eBPF
      code where JIT is enabled, start offset randomization helps a bit
      to make jumps into crafted payload harder, but in cases where larger
      programs that cross a page boundary are injected, we again have some
      part of the program opcodes at a page start offset. With improved
      guessing and more reliable payload injection, chances can increase
      to jump into such payload. Elena Reshetova recently wrote a test
      case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
      can leave some more room for payloads. Note that for all this,
      additional bugs in the kernel are still required to make the jump
      (and of course to guess right, to not jump into a trap) and naturally
      the JIT must be enabled, which is disabled by default.
      
      For helping mitigation, the general idea is to provide an option
      bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
      that for cases where JIT should be enabled for performance reasons,
      the generated image can be further hardened with blinding constants
      for unprivileged users (bpf_jit_harden == 1), trading off
      performance for them, but not for privileged ones. We also added
      the option of blinding for all users (bpf_jit_harden == 2), which
      is quite helpful for testing f.e. with test_bpf.ko. No further
      hardening levels of the bpf_jit_harden switch are intended; the
      rationale is to keep it dead simple to use as on/off. Since this
      functionality would need to be duplicated over and over for JIT
      compilers to use, which are already complex enough, we provide a
      generic eBPF byte-code level based blinding implementation, which is
      then just transparently JITed. JIT compilers need to make only a few
      changes to integrate this facility and can be migrated one by one.
      
      This option is for eBPF JITs and will be used in x86, arm64, s390
      without too much effort, and soon ppc64 JITs, thus that native eBPF
      can be blinded as well as cBPF to eBPF migrations, so that both can
      be covered with a single implementation. The rule for JITs is that
      bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
      and in case blinding is disabled, we follow normally with JITing the
      passed program. In case blinding is enabled and we fail during the
      process of blinding itself, we must return with the interpreter.
      Similarly, in case the JITing process after the blinding failed, we
      return normally to the interpreter with the non-blinded code. Meaning,
      interpreter doesn't change in any way and operates on eBPF code as
      usual. For doing this pre-JIT blinding step, we need to make use of
      a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
      to the JIT and not in any way part of the eBPF architecture. Just as
      JITs internally make use of some helper registers when emitting code,
      here the helper register sits one abstraction level higher, in eBPF
      bytecode, but is still confined to the JIT phase. That helper register
      is needed since f.e. a manually written program can issue loads to all
      registers of the eBPF architecture.
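
      A rough sketch of that integration contract, as an eBPF JIT might
      implement it (illustrative only, not the actual x86 or arm64 code):

        struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
        {
                struct bpf_prog *orig_prog = prog, *tmp;
                bool blinded = false;

                tmp = bpf_jit_blind_constants(prog);
                if (IS_ERR(tmp))
                        return orig_prog;   /* blinding failed: use interpreter */
                if (tmp != prog) {
                        blinded = true;
                        prog = tmp;         /* JIT the blinded clone instead */
                }

                /* ... arch-specific code generation for 'prog' goes here;
                 * on failure, set prog = orig_prog so the interpreter runs
                 * the original, non-blinded image ...
                 */

                if (blinded)
                        bpf_jit_prog_release_other(prog, prog == orig_prog ?
                                                   tmp : orig_prog);
                return prog;
        }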
      
      The core concept with the additional register is: blind out all 32
      and 64 bit constants by converting BPF_K based instructions into a
      small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
      is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
      and REG <OP> BPF_REG_AX, so actual operation on the target register
      is translated from a BPF_K into a BPF_X one operating on
      BPF_REG_AX's content. During rewriting phase when blinding, RND is
      newly generated via prandom_u32() for each processed instruction.
      64 bit loads are split into two 32 bit loads to make translation and
      patching not too complex. Only basic thing required by JITs is to
      call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
      pair, and to map BPF_REG_AX into an unused register.
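
      A minimal sketch of the per-instruction rewrite, using the insn macros
      from include/linux/filter.h (K and the BPF_ADD op are just an example;
      the real rewriting lives in the generic blinding code):

        /* original insn:  r0 += K   i.e.  BPF_ALU32_IMM(BPF_ADD, BPF_REG_0, K) */
        u32 imm_rnd = prandom_u32();          /* fresh per rewritten insn */

        struct bpf_insn blinded[] = {
                /* AX  = RND ^ K  -- the raw constant K never hits the image */
                BPF_MOV32_IMM(BPF_REG_AX, imm_rnd ^ K),
                /* AX ^= RND      -- AX holds K again, but only at runtime   */
                BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd),
                /* r0 += AX       -- BPF_K form replaced by its BPF_X form   */
                BPF_ALU32_REG(BPF_ADD, BPF_REG_0, BPF_REG_AX),
        };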
      
      Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
      
      echo 0 > /proc/sys/net/core/bpf_jit_harden
      
        ffffffffa034f5e9 + <x>:
        [...]
        39:   mov    $0xa8909090,%eax
        3e:   mov    $0xa8909090,%eax
        43:   mov    $0xa8ff3148,%eax
        48:   mov    $0xa89081b4,%eax
        4d:   mov    $0xa8900bb0,%eax
        52:   mov    $0xa810e0c1,%eax
        57:   mov    $0xa8908eb4,%eax
        5c:   mov    $0xa89020b0,%eax
        [...]
      
      echo 1 > /proc/sys/net/core/bpf_jit_harden
      
        ffffffffa034f1e5 + <x>:
        [...]
        39:   mov    $0xe1192563,%r10d
        3f:   xor    $0x4989b5f3,%r10d
        46:   mov    %r10d,%eax
        49:   mov    $0xb8296d93,%r10d
        4f:   xor    $0x10b9fd03,%r10d
        56:   mov    %r10d,%eax
        59:   mov    $0x8c381146,%r10d
        5f:   xor    $0x24c7200e,%r10d
        66:   mov    %r10d,%eax
        69:   mov    $0xeb2a830e,%r10d
        6f:   xor    $0x43ba02ba,%r10d
        76:   mov    %r10d,%eax
        79:   mov    $0xd9730af,%r10d
        7f:   xor    $0xa5073b1f,%r10d
        86:   mov    %r10d,%eax
        89:   mov    $0x9a45662b,%r10d
        8f:   xor    $0x325586ea,%r10d
        96:   mov    %r10d,%eax
        [...]
      
      As can be seen, the original constants that carry the payload are
      hidden when blinding is enabled, and the actual operations are
      transformed from constant-based to register-based ones, making jumps
      into constants ineffective. The extract above uses a single BPF load
      instruction over and over, but of course all instructions with
      constants are blinded.
      
      Performance-wise, JIT with blinding performs a bit slower than plain
      JIT and faster than the interpreter. This is expected, since we
      still get all the performance benefits from JITing and in normal
      use-cases not every single instruction needs to be blinded. Summing
      up all 296 test cases averaged over multiple runs from test_bpf.ko
      suite, interpreter was 55% slower than JIT only and JIT with blinding
      was 8% slower than JIT only. Since there are also some extremes in
      the test suite, I expect for ordinary workloads that the performance
      for the JIT with blinding case is even closer to JIT only case,
      f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
      35 (+ blinding), and 151 (interpreter).
      
      BPF test suite, seccomp test suite, eBPF sample code and various
      bigger networking eBPF programs have been tested with this and were
      running fine. For testing purposes, I also adapted the interpreter
      and redirected the blinded eBPF image to it; here too, all tests
      pass.
      
        [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
        [2] https://github.com/01org/jit-spray-poc-for-ksp/
        [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: prepare bpf_int_jit_compile/bpf_prog_select_runtime apis · d1c55ab5
      Committed by Daniel Borkmann
      Since the blinding is strictly only called from inside eBPF JITs,
      we need to change signatures for bpf_int_jit_compile() and
      bpf_prog_select_runtime() first, in order to prepare for the fact that
      the eBPF program we're dealing with can change underneath us. Hence,
      for call sites, we need to return the latest prog. No functional
      change in this patch.
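
      Sketch of the resulting call-site pattern (illustrative; the label
      name is not taken from the patch):

        /* call sites must now use the returned prog, since constant
         * blinding may later clone and replace it underneath
         */
        prog = bpf_prog_select_runtime(prog, &err);
        if (err < 0)
                goto free_prog;
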
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: minor cleanups in ebpf code · 4936e352
      Committed by Daniel Borkmann
      Among other things, remove redundant comments where the code is
      self-documenting enough, and properly indent various bpf_verifier_ops
      and bpf_prog_type_list declarations. Moreover, remove two exports
      that actually have no module user.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 12 May 2016, 1 commit
    • net: l3mdev: Add hook in ip and ipv6 · 74b20582
      Committed by David Ahern
      Currently the VRF driver uses the rx_handler to switch the skb device
      to the VRF device. Switching the dev prior to the ip / ipv6 layer
      means the VRF driver has to duplicate IP/IPv6 processing which adds
      overhead and makes features such as retaining the ingress device index
      more complicated than necessary.
      
      This patch moves the hook to the L3 layer just after the first NF_HOOK
      for PRE_ROUTING. This location makes exposing the original ingress device
      trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
      in the future.
      
      dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
      with the switched device through the packet taps to maintain current
      behavior (tcpdump can be used on either the vrf device or the enslaved
      devices).
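
      Roughly, the hook ends up at the top of the IPv4 receive finish path,
      just after the PRE_ROUTING netfilter hook has run (the IPv6 side is
      analogous; heavily simplified here):

        static int ip_rcv_finish(struct net *net, struct sock *sk,
                                 struct sk_buff *skb)
        {
                /* if the ingress device is enslaved to an L3 master device,
                 * pass the skb to its handler for processing
                 */
                skb = l3mdev_ip_rcv(skb);
                if (!skb)
                        return NET_RX_SUCCESS;

                /* ... normal route lookup / input processing continues ... */
                return dst_input(skb);
        }
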
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 11 May 2016, 1 commit
  7. 09 May 2016, 1 commit
  8. 07 May 2016, 1 commit
  9. 05 May 2016, 4 commits
  10. 04 May 2016, 2 commits
  11. 03 May 2016, 4 commits
    • net: rtnetlink: add linkxstats callbacks and attribute · 97a47fac
      Committed by Nikolay Aleksandrov
      Add callbacks to calculate the size and fill link extended statistics
      which can be split into multiple messages and are dumped via the new
      rtnl stats API (RTM_GETSTATS) with the IFLA_STATS_LINK_XSTATS attribute.
      Also add that attribute to the idx mask check since it is expected to
      be able to save state and resume dumping (e.g. future bridge per-vlan
      stats will be dumped via this attribute and callbacks).
      Each link type should nest its private attributes under the per-link-type
      attribute. This makes it possible to have any number of separate private
      attributes and avoids an extra call to get the device's link type.
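
      The callbacks land in rtnl_link_ops, approximately as follows
      (signatures recalled from this patch and may differ slightly):

        struct rtnl_link_ops {
                /* ... existing ops ... */
                size_t  (*get_linkxstats_size)(const struct net_device *dev);
                int     (*fill_linkxstats)(struct sk_buff *skb,
                                           const struct net_device *dev,
                                           int *prividx);
        };
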
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: rtnetlink: allow rtnl_fill_statsinfo to save private state counter · e8872a25
      Committed by Nikolay Aleksandrov
      The new prividx argument allows the currently dumped device to save a
      private state counter, which enables it to continue dumping from
      where it left off. The idxattr is used to record the current idx user,
      so that multiple prividx-using attributes can be requested at the same
      time, as suggested by Roopa Prabhu.
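
      An illustrative resume pattern for a filler using prividx (all names
      and attributes here are hypothetical, just to show the idea):

        static int fill_example_xstats(struct sk_buff *skb,
                                       const struct net_device *dev,
                                       int *prividx)
        {
                int i;

                for (i = *prividx; i < EXAMPLE_NUM_ENTRIES; i++) {
                        if (nla_put_u64_64bit(skb, EXAMPLE_XSTATS_ENTRY,
                                              example_stats[i],
                                              EXAMPLE_XSTATS_PAD)) {
                                *prividx = i;   /* remember where we stopped */
                                return -EMSGSIZE;
                        }
                }
                *prividx = 0;                   /* full dump done, reset */
                return 0;
        }
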
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: make tcp_sendmsg() aware of socket backlog · d41a69f1
      Committed by Eric Dumazet
      Large sendmsg()/write() calls hold the socket lock for the duration of
      the call, unless the sk->sk_sndbuf limit is hit. This is bad because
      incoming packets are parked in the socket backlog for a long time.
      Critical decisions like fast retransmit might be delayed.
      Receivers have to maintain a big out of order queue with additional cpu
      overhead, and also possible stalls in TX once windows are full.
      
      Bidirectional flows are particularly hurt since the backlog can become
      quite big if the copy from user space triggers IO (page faults).
      
      Some applications learnt to use sendmsg() (or sendmmsg()) with small
      chunks to avoid this issue.
      
      The kernel should know better, right?
      
      Add a generic sk_flush_backlog() helper and use it right
      before a new skb is allocated. Typically we put 64KB of payload
      per skb (unless MSG_EOR is requested) and checking socket backlog
      every 64KB gives good results.
      
      As a matter of fact, tests with TSO/GSO disabled give very nice
      results, as we manage to keep a small write queue and smaller
      perceived rtt.
      
      Note that sk_flush_backlog() maintains socket ownership, so it is not
      equivalent to a {release_sock(sk); lock_sock(sk);} pair; this preserves
      the implicit atomicity rules that sendmsg() was giving to (possibly
      buggy) applications.
      
      In this simple implementation, I chose to not call tcp_release_cb(),
      but we might consider this later.
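
      A hedged sketch of the usage pattern (not the exact diff): drain the
      backlog right before each new skb allocation, i.e. roughly every 64KB
      of copied payload:

        new_segment:
                if (!sk_stream_memory_free(sk))
                        goto wait_for_sndbuf;

                /* process packets parked by softirq while we were copying;
                 * unlike release_sock()/lock_sock(), socket ownership is
                 * kept, preserving sendmsg()'s implicit atomicity
                 */
                sk_flush_backlog(sk);

                skb = sk_stream_alloc_skb(sk, select_size(sk, sg),
                                          sk->sk_allocation,
                                          skb_queue_empty(&sk->sk_write_queue));
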
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: do not block BH while processing socket backlog · 5413d1ba
      Committed by Eric Dumazet
      Socket backlog processing is a major latency source.
      
      With current TCP socket sk_rcvbuf limits, I have sampled __release_sock()
      holding the cpu for more than 5 ms, and packets being dropped by the NIC
      once the ring buffer is filled.
      
      All users are now ready to be called from process context,
      we can unblock BH and let interrupts be serviced faster.
      
      cond_resched_softirq() could be removed, as it no longer has any users.
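
      The resulting shape of __release_sock(), roughly (simplified; it is
      entered with sk_lock.slock held and BH disabled):

        static void __release_sock(struct sock *sk)
        {
                struct sk_buff *skb = sk->sk_backlog.head;

                do {
                        sk->sk_backlog.head = sk->sk_backlog.tail = NULL;

                        /* was bh_unlock_sock(): now also re-enable BH so
                         * interrupts are serviced while the backlog is
                         * drained in process context
                         */
                        spin_unlock_bh(&sk->sk_lock.slock);

                        do {
                                struct sk_buff *next = skb->next;

                                skb->next = NULL;
                                sk_backlog_rcv(sk, skb);
                                cond_resched();   /* was cond_resched_softirq() */
                                skb = next;
                        } while (skb != NULL);

                        spin_lock_bh(&sk->sk_lock.slock);
                } while ((skb = sk->sk_backlog.head) != NULL);
        }
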
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 30 Apr 2016, 1 commit
  13. 29 Apr 2016, 1 commit
    • tuntap: calculate rps hash only when needed · 3df97ba8
      Committed by Jason Wang
      There's no need to calculate the rps hash if rps is not enabled. So this
      patch exports rps_needed and checks it before trying to get the rps
      hash. Tests (using pktgen to inject packets into a guest) show this can
      improve pps by about 13% (when rps is disabled).
      
      Before:
      ~1150000 pps
      After:
      ~1300000 pps
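
      Roughly the shape of the check added in tun_net_xmit();
      tun_flow_update_rxhash() is a placeholder standing in for the existing
      flow-table update:

        #ifdef CONFIG_RPS
                if (numqueues == 1 && static_key_false(&rps_needed)) {
                        /* select_queue was not called for this skb, so only
                         * compute the flow hash when RPS is actually enabled
                         */
                        __u32 rxhash = skb_get_hash(skb);

                        if (rxhash)
                                tun_flow_update_rxhash(tun, rxhash);
                }
        #endif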
      
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      ----
      Changes from V1:
      - Fix build when CONFIG_RPS is not set
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 28 Apr 2016, 1 commit
  15. 27 Apr 2016, 5 commits
  16. 26 Apr 2016, 2 commits
    • skbuff: Add pskb_extract() helper function · 6fa01ccd
      Committed by Sowmini Varadhan
      A pattern of skb usage seen in modules such as RDS-TCP is to
      extract `to_copy' bytes from the received TCP segment, starting
      at some offset `off' into a new skb `clone'. This is done in
      the ->data_ready callback, where the clone skb is queued up for rx on
      the PF_RDS socket, while the parent TCP segment is returned unchanged
      back to the TCP engine.
      
      The existing code uses the sequence
      	clone = skb_clone(..);
      	pskb_pull(clone, off, ..);
      	pskb_trim(clone, to_copy, ..);
      with the intention of discarding the first `off' bytes. However,
      skb_clone() + pskb_pull() implies pskb_expand_head(), which ends
      up doing a redundant memcpy of bytes that will then get discarded
      in __pskb_pull_tail().
      
      To avoid this inefficiency, this commit adds pskb_extract() that
      creates the clone, and memcpy's only the relevant header/frag/frag_list
      to the start of `clone'. pskb_trim() is then invoked to trim clone
      down to the requested to_copy bytes.
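
      A sketch of the intended replacement pattern in such a ->data_ready
      path (error handling trimmed; assumes the signature
      pskb_extract(skb, off, to_copy, gfp)):

        struct sk_buff *clone;

        clone = pskb_extract(skb, off, to_copy, GFP_ATOMIC);
        if (!clone)
                return -ENOMEM;
        /* clone already starts at 'off' and is trimmed to to_copy bytes,
         * without the redundant copy that skb_clone() + pskb_pull() +
         * pskb_trim() would have done
         */
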
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rtnl: use nla_put_u64_64bit() · 343a6d8e
      Committed by Nicolas Dichtel
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 24 Apr 2016, 2 commits
  18. 22 Apr 2016, 2 commits
  19. 21 Apr 2016, 1 commit
    • rtnetlink: add new RTM_GETSTATS message to dump link stats · 10c9ead9
      Committed by Roopa Prabhu
      This patch adds a new RTM_GETSTATS message to query link stats via netlink
      from the kernel. RTM_NEWLINK also dumps stats today, but it returns a
      lot more than just stats and is expensive in cases where frequent
      polling for stats from userspace is a common operation.
      
      RTM_GETSTATS is an attempt to provide a lightweight netlink message
      to explicitly query only link stats from the kernel on an interface.
      The idea is to also keep it extensible so that new kinds of stats can be
      added to it in the future.
      
      This patch adds the following attribute for NETDEV stats:
      struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = {
              [IFLA_STATS_LINK_64]  = { .len = sizeof(struct rtnl_link_stats64) },
      };
      
      Like any other rtnetlink message, RTM_GETSTATS can be used to get stats of
      a single interface or all interfaces with NLM_F_DUMP.
      
      Future possible new types of stat attributes:
      link af stats:
          - IFLA_STATS_LINK_IPV6  (nested. for ipv6 stats)
          - IFLA_STATS_LINK_MPLS  (nested. for mpls/mdev stats)
      extended stats:
          - IFLA_STATS_LINK_EXTENDED (nested. extended software netdev stats like bridge,
            vlan, vxlan etc)
          - IFLA_STATS_LINK_HW_EXTENDED (nested. extended hardware stats which are
            available via ethtool today)
      
      This patch also declares a filter mask for all stat attributes.
      The user has to provide a mask of stats attributes to query. The filter
      mask can be specified in the new header 'struct if_stats_msg' for stats
      messages. The other important field in the header is the ifindex.
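
      For reference, the new request/dump header is along these lines (as
      added to include/uapi/linux/if_link.h by this patch):

        struct if_stats_msg {
                __u8   family;
                __u8   pad1;
                __u16  pad2;
                __u32  ifindex;
                __u32  filter_mask;
        };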
      
      This API can also include attributes for global stats (e.g. tcp) in the
      future. When global stats are included in a stats msg, the ifindex in
      the header must be zero. A single stats message cannot contain both
      global and netdev-specific stats. To easily distinguish them, names of
      netdev-specific stat attributes are prefixed with IFLA_STATS_LINK_.
      
      Without any attributes in the filter_mask, no stats will be returned.
      
      This patch has been tested with a modified iproute2 ifstat.
      Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 20 Apr 2016, 3 commits
    • bpf: add event output helper for notifications/sampling/logging · bd570ff9
      Committed by Daniel Borkmann
      This patch adds a new helper for cls/act programs that can push events
      to user space applications. For networking, this can be f.e. for sampling,
      debugging, logging purposes or pushing of arbitrary wake-up events. The
      idea is similar to a43eec30 ("bpf: introduce bpf_perf_event_output()
      helper") and 39111695 ("samples: bpf: add bpf_perf_event_output example").
      
      The eBPF program utilizes a perf event array map that user space populates
      with fds from perf_event_open(), the eBPF program calls into the helper
      f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
      so that the raw data is pushed into the fd f.e. at the map index of the
      current CPU.
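
      A hedged, iproute2-style sketch of such a cls program; the map
      declaration, section names and event layout are illustrative, only the
      helper call itself is taken from the description above:

        struct bpf_elf_map __section("maps") event_map = {
                .type       = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
                .size_key   = sizeof(int),
                .size_value = sizeof(__u32),
                .max_elem   = __NR_CPUS__,
                .pinning    = PIN_GLOBAL_NS,
        };

        __section("cls") int cls_main(struct __sk_buff *skb)
        {
                struct { __u32 mark, len; } raw = {
                        .mark = skb->mark,
                        .len  = skb->len,
                };

                /* push the raw bytes to the perf fd at this CPU's map slot */
                skb_event_output(skb, &event_map, BPF_F_CURRENT_CPU,
                                 &raw, sizeof(raw));
                return TC_ACT_OK;
        }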
      
      User space can poll/mmap/etc on this and has a data channel for receiving
      events that can be post-processed. The nice thing is that since the eBPF
      program and user space application making use of it are tightly coupled,
      they can define their own arbitrary raw data format and what/when they
      want to push.
      
      While f.e. packet headers could be one part of the meta data that is being
      pushed, this is not a substitute for things like packet sockets, as the
      whole packet is not being pushed and the push only goes in a single
      direction. The intention is more of a generically usable, efficient
      event pipe to applications.
      Workflow is that tc can pin the map and applications can attach themselves
      e.g. after cls/act setup to one or multiple map slots, demuxing is done by
      the eBPF program.
      
      Adding this facility takes minimal effort: it reuses the helper
      introduced in a43eec30 ("bpf: introduce bpf_perf_event_output() helper")
      and we get its functionality for free by overloading its BPF_FUNC_
      identifier for cls/act programs. ctx is currently unused, but will be
      made use of in the future. An example will be added to iproute2's BPF
      example files.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Align IFLA_STATS64 attributes properly on architectures that need it. · 18402843
      Committed by David S. Miller
      Since the nlattr header is 4 bytes in size, it can cause the netlink
      attribute payload to not be 8-byte aligned.
      
      This is particularly troublesome for IFLA_STATS64 which contains 64-bit
      statistic values.
      
      Solve this by creating a dummy IFLA_PAD attribute whose payload is
      zero bytes in size.  When HAVE_EFFICIENT_UNALIGNED_ACCESS is
      false, we insert an IFLA_PAD attribute into the netlink response when
      necessary such that the IFLA_STATS64 payload will be properly aligned.
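
      A sketch of the idea (helper name hypothetical): since the nlattr
      header is 4 bytes, the 64-bit payload lands on an 8-byte boundary
      exactly when the current tail is not 8-byte aligned; otherwise a
      zero-length IFLA_PAD attribute is emitted first to shift it:

        static struct nlattr *example_reserve_stats64(struct sk_buff *skb)
        {
                if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
                    IS_ALIGNED((unsigned long)skb_tail_pointer(skb), 8) &&
                    !nla_reserve(skb, IFLA_PAD, 0))
                        return NULL;    /* no room left for the pad */

                /* the 4-byte IFLA_STATS64 header now places the 64-bit
                 * counters on an 8-byte boundary
                 */
                return nla_reserve(skb, IFLA_STATS64,
                                   sizeof(struct rtnl_link_stats64));
        }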
      
      With help and suggestions from Eric Dumazet.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 19 Apr 2016, 1 commit