1. 21 May, 2016 8 commits
    • T
      fou: Support IPv6 in fou · 5f914b68
      Tom Herbert committed
      This patch adds receive path support for IPv6 with fou.
      
      - Add address family to fou structure for open sockets. This supports
        AF_INET and AF_INET6. Lookups for fou ports are performed on both the
        port number and family.
      - In fou and gue receive adjust tot_len in IPv4 header or payload_len
        based on address family.
      - Allow AF_INET6 in FOU_ATTR_AF netlink attribute.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5f914b68
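The lookup rule described in this commit (match on both port number and address family) can be sketched in user space as follows. This is only an illustration: `struct fou_entry`, `fou_lookup` and the `SK_AF_*` constants are hypothetical names, not the kernel's data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch-only stand-ins for AF_INET / AF_INET6. */
enum { SK_AF_INET = 2, SK_AF_INET6 = 10 };

/* Hypothetical, simplified record for one open fou socket. */
struct fou_entry {
    uint16_t port;    /* UDP port the socket listens on */
    int      family;  /* SK_AF_INET or SK_AF_INET6 */
};

/* Return the entry matching BOTH port and family, or NULL if none does. */
static const struct fou_entry *
fou_lookup(const struct fou_entry *tbl, size_t n, uint16_t port, int family)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].port == port && tbl[i].family == family)
            return &tbl[i];
    return NULL;
}
```

With a key of (port, family), the same port number can be open once for IPv4 and once for IPv6 without the two entries shadowing each other.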
    • T
      fou: Split out {fou,gue}_build_header · dc969b81
      Tom Herbert committed
      Create __fou_build_header and __gue_build_header. These implement the
      protocol generic parts of building the fou and gue header.
      fou_build_header and gue_build_header implement the IPv4 specific
      functions and call the __*_build_header functions.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dc969b81
    • T
      fou: Call setup_udp_tunnel_sock · 440924bb
      Tom Herbert committed
      Use helper function to set up UDP tunnel related information for a fou
      socket.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      440924bb
    • T
      net: Cleanup encap items in ip_tunnels.h · 55c2bc14
      Tom Herbert committed
      Consolidate all the ip_tunnel_encap definitions in one spot in the
      header file. Also, move ip_encap_hlen and ip_tunnel_encap from
      ip_tunnel.c to ip_tunnels.h so they can be called without a dependency
      on the ip_tunnel module. Similarly, move iptun_encaps to ip_tunnel_core.c.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55c2bc14
    • T
      ipv6: Change "final" protocol processing for encapsulation · 1da44f9c
      Tom Herbert committed
      When performing foo-over-UDP, UDP packets are processed by the
      encapsulation handler which returns another protocol to process.
      This may result in processing two (or more) protocols in the
      loop that are marked as INET6_PROTO_FINAL. The actions taken
      for hitting a final protocol, in particular the skb_postpull_rcsum
      can only be performed once.
      
      This patch set adds a check of whether a final protocol has been seen. The
      rules are:
        - If the final protocol has not been seen any protocol is processed
          (final and non-final). In the case of a final protocol, the final
          actions are taken (like the skb_postpull_rcsum)
        - If a final protocol has been seen (e.g. an encapsulating UDP
          header) then no further non-final protocols are allowed
          (e.g. extension headers). For more final protocols the
          final actions are not taken (e.g. skb_postpull_rcsum).
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1da44f9c
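The two rules in this commit message can be sketched as a small decision function. This is a user-space illustration under stated assumptions: `classify`, the `enum action` values and the `have_final` flag are hypothetical names, and "reject" stands in for whatever the kernel does with a header that is not allowed.

```c
#include <stdbool.h>

/* What to do with one header in the receive loop (illustrative names). */
enum action {
    ACT_REJECT,         /* non-final header after a final one: not allowed */
    ACT_PROCESS,        /* process the header, skip the final actions */
    ACT_PROCESS_FINAL   /* process and run the one-time final actions */
};

/* Apply the rules to one protocol, remembering whether a final one was seen. */
static enum action classify(bool is_final, bool *have_final)
{
    if (!is_final)
        return *have_final ? ACT_REJECT : ACT_PROCESS;
    if (*have_final)
        return ACT_PROCESS;   /* final actions were already taken once */
    *have_final = true;
    return ACT_PROCESS_FINAL;
}
```

Walking an extension header, an encapsulating UDP header and then an inner final protocol through `classify` runs the final actions (such as skb_postpull_rcsum in the original text) exactly once, on the UDP header.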
    • T
      ipv6: Fix nexthdr for reinjection · 4c64242a
      Tom Herbert committed
      In ip6_input_finish the nexthdr protocol is retrieved from the
      next header offset that is returned in the cb of the skb.
      This method does not work for UDP encapsulation that may not
      even have a concept of a nexthdr field (e.g. FOU).
      
      This patch checks for a final protocol (INET6_PROTO_FINAL) when a
      protocol handler returns > 0. If the protocol is not final then
      resubmission is performed on nhoff value. If the protocol is final
      then the nexthdr is taken to be the return value.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4c64242a
    • T
      net: define gso types for IPx over IPv4 and IPv6 · 7e13318d
      Tom Herbert committed
      This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
      SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
      NETIF_F_GSO_IPXIP6. These are used to describe an IP in IP
      tunnel and what the outer protocol is. The inner protocol
      can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
      SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
      are removed (these are both instances of SKB_GSO_IPXIP4).
      SKB_GSO_IPXIP6 will be used when support for GSO with IP
      encapsulation over IPv6 is added.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e13318d
    • T
      gso: Remove arbitrary checks for unsupported GSO · 5c7cdf33
      Tom Herbert committed
      In several gso_segment functions there are checks of gso_type against
      a seemingly arbitrary list of SKB_GSO_* flags. This seems like an
      attempt to identify unsupported GSO types, but since the stack is
      the one that set these GSO types in the first place this seems
      unnecessary to do. If a combination isn't valid in the first
      place, the stack should not allow setting it.
      
      This is a code simplification, especially for adding new GSO types.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5c7cdf33
  2. 20 May, 2016 3 commits
  3. 18 May, 2016 11 commits
    • M
      batman-adv: initialize ELP orig address on secondary interfaces · ebe24cea
      Marek Lindner committed
      This fix prevents nodes from wrongly creating a 00:00:00:00:00:00 originator
      which can potentially interfere with the rest of the neighbor statistics.
      
      Fixes: d6f94d91 ("batman-adv: ELP - adding basic infrastructure")
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
      Signed-off-by: Antonio Quartulli <a@unstable.cc>
      ebe24cea
    • L
      batman-adv: Avoid duplicate neigh_node additions · e123705e
      Linus Lüssing committed
      Two parallel calls to batadv_neigh_node_new() might race for creating
      and adding the same neigh_node. Fix this by including the check for any
      already existing, identical neigh_node within the spin-lock.
      
      This fixes splats like the following:
      
      [  739.535069] ------------[ cut here ]------------
      [  739.535079] WARNING: CPU: 0 PID: 0 at /usr/src/batman-adv/git/batman-adv/net/batman-adv/bat_iv_ogm.c:1004 batadv_iv_ogm_process_per_outif+0xe3f/0xe60 [batman_adv]()
      [  739.535092] too many matching neigh_nodes
      [  739.535094] Modules linked in: dm_mod tun ip6table_filter ip6table_mangle ip6table_nat nf_nat_ipv6 ip6_tables xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_TCPMSS xt_mark iptable_mangle xt_tcpudp xt_conntrack iptable_filter ip_tables x_tables ip_gre ip_tunnel gre bridge stp llc thermal_sys kvm_intel kvm crct10dif_pclmul crc32_pclmul sha256_ssse3 sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd evdev pcspkr ip6_gre ip6_tunnel tunnel6 batman_adv(O) libcrc32c nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack autofs4 ext4 crc16 mbcache jbd2 xen_netfront xen_blkfront crc32c_intel
      [  739.535177] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W  O    4.2.0-0.bpo.1-amd64 #1 Debian 4.2.6-3~bpo8+2
      [  739.535186]  0000000000000000 ffffffffa013b050 ffffffff81554521 ffff88007d003c18
      [  739.535201]  ffffffff8106fa01 0000000000000000 ffff8800047a087a ffff880079c3a000
      [  739.735602]  ffff88007b82bf40 ffff88007bc2d1c0 ffffffff8106fa7a ffffffffa013aa8e
      [  739.735624] Call Trace:
      [  739.735639]  <IRQ>  [<ffffffff81554521>] ? dump_stack+0x40/0x50
      [  739.735677]  [<ffffffff8106fa01>] ? warn_slowpath_common+0x81/0xb0
      [  739.735692]  [<ffffffff8106fa7a>] ? warn_slowpath_fmt+0x4a/0x50
      [  739.735715]  [<ffffffffa012448f>] ? batadv_iv_ogm_process_per_outif+0xe3f/0xe60 [batman_adv]
      [  739.735740]  [<ffffffffa0124813>] ? batadv_iv_ogm_receive+0x363/0x380 [batman_adv]
      [  739.735762]  [<ffffffffa0124813>] ? batadv_iv_ogm_receive+0x363/0x380 [batman_adv]
      [  739.735783]  [<ffffffff810b0841>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
      [  739.735804]  [<ffffffffa012cb39>] ? batadv_batman_skb_recv+0xc9/0x110 [batman_adv]
      [  739.735825]  [<ffffffff81464891>] ? __netif_receive_skb_core+0x841/0x9a0
      [  739.735838]  [<ffffffff810b0841>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
      [  739.735853]  [<ffffffff81465681>] ? process_backlog+0xa1/0x140
      [  739.735864]  [<ffffffff81464f1a>] ? net_rx_action+0x20a/0x320
      [  739.735878]  [<ffffffff81073aa7>] ? __do_softirq+0x107/0x270
      [  739.735891]  [<ffffffff81073d82>] ? irq_exit+0x92/0xa0
      [  739.735905]  [<ffffffff8137e0d1>] ? xen_evtchn_do_upcall+0x31/0x40
      [  739.735924]  [<ffffffff8155b8fe>] ? xen_do_hypervisor_callback+0x1e/0x40
      [  739.735939]  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
      [  739.735965]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
      [  739.735979]  [<ffffffff8100a39c>] ? xen_safe_halt+0xc/0x20
      [  739.735991]  [<ffffffff8101da6c>] ? default_idle+0x1c/0xa0
      [  739.736004]  [<ffffffff810abf6b>] ? cpu_startup_entry+0x2eb/0x350
      [  739.736019]  [<ffffffff81b2af5e>] ? start_kernel+0x480/0x48b
      [  739.736032]  [<ffffffff81b2d116>] ? xen_start_kernel+0x507/0x511
      [  739.736048] ---[ end trace c106bb901244bc8c ]---
      
      Fixes: f987ed6e ("batman-adv: protect neighbor list with rcu locks")
      Reported-by: Martin Weinelt <martin@darmstadt.freifunk.net>
      Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
      Signed-off-by: Antonio Quartulli <a@unstable.cc>
      e123705e
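The fix pattern named in this commit (do the "does it already exist?" check inside the same critical section as the insertion) can be sketched as a user-space analogue. The kernel code uses a spinlock; this sketch assumes a pthread mutex instead, and `struct neigh_node`, `neigh_list` and `neigh_get_or_create` are simplified, hypothetical stand-ins.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Simplified neighbor entry keyed by a 6-byte MAC address. */
struct neigh_node {
    unsigned char addr[6];
    struct neigh_node *next;
};

static struct neigh_node *neigh_list;
static pthread_mutex_t neigh_lock = PTHREAD_MUTEX_INITIALIZER;

/* Search the list; the caller must hold neigh_lock. */
static struct neigh_node *neigh_find_locked(const unsigned char *addr)
{
    for (struct neigh_node *n = neigh_list; n; n = n->next)
        if (memcmp(n->addr, addr, 6) == 0)
            return n;
    return NULL;
}

/* Return the existing node for addr, or create and add exactly one. */
static struct neigh_node *neigh_get_or_create(const unsigned char *addr)
{
    struct neigh_node *n;

    pthread_mutex_lock(&neigh_lock);
    n = neigh_find_locked(addr);      /* check and insert share one lock hold */
    if (!n) {
        n = calloc(1, sizeof(*n));
        if (n) {
            memcpy(n->addr, addr, 6);
            n->next = neigh_list;
            neigh_list = n;
        }
    }
    pthread_mutex_unlock(&neigh_lock);
    return n;
}
```

Because the lookup and the insertion happen under one lock hold, two callers racing on the same address cannot both observe "not found" and both add a node.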
    • S
      batman-adv: Fix integer overflow in batadv_iv_ogm_calc_tq · d285f52c
      Sven Eckelmann committed
      The undefined behavior sanitizer detected a signed integer overflow in a
      setup with near perfect link quality
      
          UBSAN: Undefined behaviour in net/batman-adv/bat_iv_ogm.c:1246:25
          signed integer overflow:
          8713350 * 255 cannot be represented in type 'int'
      
      The problem happens because the calculation on mixed unsigned and signed
      integers resulted in a signed integer multiplication.
      
            batadv_ogm_packet::tq (u8 255)
          * tq_own (u8 255)
          * tq_asym_penalty (int 134; max 255)
          * tq_iface_penalty (int 255; max 255)
      
      The tq_iface_penalty, tq_asym_penalty and inv_asym_penalty can just be
      changed to unsigned int because they are not expected to become negative.
      
      Fixes: c0398768 ("batman-adv: add WiFi penalty")
      Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
      Signed-off-by: Antonio Quartulli <a@unstable.cc>
      d285f52c
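The arithmetic in this report can be reproduced with a short sketch. It assumes a 32-bit `int` (as in the UBSAN message); `tq_product` is a hypothetical helper, not the kernel function. With the values from the commit message, 255 * 255 * 134 = 8713350, and 8713350 * 255 = 2221904250, which is larger than INT_MAX (2147483647) but fits in an unsigned int.

```c
#include <limits.h>
#include <stdint.h>

/* Multiply the four factors in unsigned arithmetic throughout.
 * The cast on the first operand forces every later multiplication
 * to be done as unsigned int rather than int. */
static unsigned int tq_product(uint8_t tq, uint8_t tq_own,
                               unsigned int asym_penalty,
                               unsigned int iface_penalty)
{
    return (unsigned int)tq * tq_own * asym_penalty * iface_penalty;
}
```

Keeping the penalties as `unsigned int`, as the commit describes, makes the whole product unsigned, so the large intermediate value is well defined.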
    • A
      batman-adv: make sure ELP/OGM orig MAC is updated on address change · 1653f61d
      Antonio Quartulli committed
      When the MAC address of the primary interface is changed,
      update the originator address in the ELP and OGM skb buffers as
      well in order to reflect the change.
      
      Fixes: d6f94d91 ("batman-adv: ELP - adding basic infrastructure")
      Reported-by: Marek Lindner <marek@neomailbox.ch>
      Signed-off-by: Antonio Quartulli <a@unstable.cc>
      1653f61d
    • S
      batman-adv: Fix unexpected free of bcast_own on add_if error · f7dcdf5f
      Sven Eckelmann committed
      The function batadv_iv_ogm_orig_add_if allocates new buffers for bcast_own
      and bcast_own_sum. It is expected that these buffers are unchanged in case
      either bcast_own or bcast_own_sum couldn't be resized.
      
      But the error handling of this function frees the already resized buffer
      for bcast_own when the allocation of the new bcast_own_sum buffer failed.
      This will lead to an invalid memory access when some code later tries to
      access bcast_own.
      
      Instead the resized new bcast_own buffer has to be kept. This will not lead
      to problems because the size of the buffer was only increased and therefore
      no user of the buffer will try to access bytes outside of the new buffer.
      
      Fixes: d0015fdd ("batman-adv: provide orig_node routing API")
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
      Signed-off-by: Antonio Quartulli <a@unstable.cc>
      f7dcdf5f
    • S
      batman-adv: Fix refcnt leak in batadv_v_neigh_* · 71f9d27d
      Sven Eckelmann committed
      The function batadv_neigh_ifinfo_get increases the reference counter of the
      batadv_neigh_ifinfo. It has to be decreased again when the reference is
      no longer used so that the objects are freed correctly.
      
      Fixes: 97869060 ("batman-adv: B.A.T.M.A.N. V - implement neighbor comparison API calls")
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
      Signed-off-by: Antonio Quartulli <a@unstable.cc>
      71f9d27d
    • S
      batman-adv: Avoid nullptr dereference in batadv_v_neigh_is_sob · a45e932a
      Sven Eckelmann committed
      batadv_neigh_ifinfo_get can return NULL when it can no longer find (even if
      only temporarily) the neigh_ifinfo in the list neigh->ifinfo_list. This
      has to be checked to avoid kernel Oopses when the ifinfo is dereferenced.
      
      This is a situation which isn't expected but is already handled by functions
      like batadv_v_neigh_cmp. The same kind of warning is therefore used before
      the function returns without dereferencing the pointers.
      
      Fixes: 97869060 ("batman-adv: B.A.T.M.A.N. V - implement neighbor comparison API calls")
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
      Signed-off-by: Antonio Quartulli <a@unstable.cc>
      a45e932a
    • F
      batman-adv: fix skb deref after free · 63d443ef
      Florian Westphal committed
      batadv_send_skb_to_orig() calls dev_queue_xmit() so we can't use skb->len.
      
      Fixes: 95332477 ("batman-adv: network coding - buffer unicast packets before forward")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Reviewed-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
      Signed-off-by: Antonio Quartulli <a@unstable.cc>
      63d443ef
    • J
      switchdev: pass pointer to fib_info instead of copy · da4ed551
      Jiri Pirko committed
      The problem is that fib_info->nh is [0] so the struct fib_info
      allocation size depends on number of nexthops. If we just copy fib_info,
      we do not copy the nexthops info and driver accesses memory which is not
      ours.
      
      Given the fact that fib4 does not defer operations and therefore it does
      not need copy, just pass the pointer down to drivers as it was done
      before.
      
      Fixes: 850d0cbc ("switchdev: remove pointers from switchdev objects")
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      da4ed551
    • W
      net_sched: close another race condition in tcf_mirred_release() · dc327f89
      WANG Cong committed
      We saw the following extra refcount release on veth device:
      
        kernel: [7957821.463992] unregister_netdevice: waiting for mesos50284 to become free. Usage count = -1
      
      Since we heavily use mirred action to redirect packets to veth, I think
      this is caused by the following race condition:
      
      CPU0:
      tcf_mirred_release(): (in RCU callback)
      	struct net_device *dev = rcu_dereference_protected(m->tcfm_dev, 1);
      
      CPU1:
      mirred_device_event():
              spin_lock_bh(&mirred_list_lock);
              list_for_each_entry(m, &mirred_list, tcfm_list) {
                      if (rcu_access_pointer(m->tcfm_dev) == dev) {
                              dev_put(dev);
                              /* Note : no rcu grace period necessary, as
                               * net_device are already rcu protected.
                               */
                              RCU_INIT_POINTER(m->tcfm_dev, NULL);
                      }
              }
              spin_unlock_bh(&mirred_list_lock);
      
      CPU0:
      tcf_mirred_release():
              spin_lock_bh(&mirred_list_lock);
              list_del(&m->tcfm_list);
              spin_unlock_bh(&mirred_list_lock);
              if (dev)               // <======== Still refers to the old m->tcfm_dev
                      dev_put(dev);  // <======== dev_put() is called on it again
      
      The action init code path is good because it is impossible to modify
      an action that is being removed.
      
      So, fix this by moving everything under the spinlock.
      
      Fixes: 2ee22a90 ("net_sched: act_mirred: remove spinlock in fast path")
      Fixes: 6bd00b85 ("act_mirred: fix a race condition on mirred_list")
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dc327f89
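The idea of "moving everything under the spinlock" can be sketched in user space. This is an analogue only, assuming a pthread mutex in place of the spinlock and a plain integer in place of the device reference count; `struct device`, `struct mirred`, `dev_put`, `mirred_device_event` and `mirred_release` here are simplified stand-ins, not the kernel code.

```c
#include <pthread.h>
#include <stddef.h>

struct device { int refcnt; };          /* stand-in for a refcounted net_device */
struct mirred { struct device *dev; };  /* stand-in for the mirred action */

static pthread_mutex_t mirred_lock = PTHREAD_MUTEX_INITIALIZER;

static void dev_put(struct device *d) { d->refcnt--; }

/* Device-unregister event: drop the action's reference, clear the pointer. */
static void mirred_device_event(struct mirred *m, struct device *dev)
{
    pthread_mutex_lock(&mirred_lock);
    if (m->dev == dev) {
        dev_put(dev);
        m->dev = NULL;
    }
    pthread_mutex_unlock(&mirred_lock);
}

/* Release: read the pointer only while holding the lock, so a pointer
 * already cleared by the event handler is never put a second time. */
static void mirred_release(struct mirred *m)
{
    pthread_mutex_lock(&mirred_lock);
    if (m->dev) {
        dev_put(m->dev);
        m->dev = NULL;
    }
    pthread_mutex_unlock(&mirred_lock);
}
```

In the racy version quoted above, the release path samples the pointer before taking the lock and can still hold the old value after the event handler has dropped the reference; reading it under the lock removes that window.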
    • R
      tipc: fix nametable publication field in nl compat · 03aaaa9b
      Richard Alpe committed
      The publication field of the old netlink API should contain the
      publication key and not the publication reference.
      
      Fixes: 44a8ae94 (tipc: convert legacy nl name table dump to nl compat)
      Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      03aaaa9b
  4. 17 May, 2016 15 commits
    • H
      netlink: Fix dump skb leak/double free · 92964c79
      Herbert Xu committed
      When we free cb->skb after a dump, we do it after releasing the
      lock.  This means that a new dump could have started in the
      meantime and we'll end up freeing its skb instead of ours.
      
      This patch saves the skb and module before we unlock so we free
      the right memory.
      
      Fixes: 16b304f3 ("netlink: Eliminate kmalloc in netlink dump operation.")
      Reported-by: Baozeng Ding <sploving1@gmail.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      92964c79
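The "save before unlock" pattern from this commit can be sketched as follows. It is a user-space illustration with a pthread mutex; `struct dump_cb` and `dump_detach` are hypothetical names, and clearing the slot inside the lock is an assumption of this sketch rather than something the commit message states.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical callback state: the buffer owned by the running dump. */
struct dump_cb { void *skb; };

static pthread_mutex_t cb_lock = PTHREAD_MUTEX_INITIALIZER;

/* Detach our buffer while still holding the lock and hand it back.
 * The caller frees the returned pointer after the lock is released,
 * instead of re-reading cb->skb, which may already belong to a new dump. */
static void *dump_detach(struct dump_cb *cb)
{
    void *mine;

    pthread_mutex_lock(&cb_lock);
    mine = cb->skb;    /* saved while the slot is still ours */
    cb->skb = NULL;    /* sketch assumption: leave the slot empty */
    pthread_mutex_unlock(&cb_lock);

    return mine;
}
```

A caller would then free the returned pointer; even if another dump installs its own buffer in `cb->skb` right after the unlock, that buffer is never touched.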
    • R
      tipc: check nl sock before parsing nested attributes · 45e093ae
      Richard Alpe committed
      Make sure the socket for which the user is listing publication exists
      before parsing the socket netlink attributes.
      
      Prior to this patch a call without any socket caused a NULL pointer
      dereference in tipc_nl_publ_dump().
      Tested-and-reported-by: Baozeng Ding <sploving1@gmail.com>
      Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.cm>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      45e093ae
    • E
      fq_codel: fix memory limitation drift · 77f57761
      Eric Dumazet committed
      memory_usage must be decreased in dequeue_func(), not in
      fq_codel_dequeue(), otherwise packets dropped by Codel algo
      are missing this decrease.
      
      Also we need to clear memory_usage in fq_codel_reset()
      
      Fixes: 95b58430 ("fq_codel: add memory limitation per queue")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      77f57761
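Why the accounting belongs in the low-level dequeue can be shown with a toy queue. Everything here is a hypothetical simplification (`struct pkt`, the `drop` marker, the fixed-size ring); it only illustrates that packets dropped by the AQM also leave the queue through the inner function, so that is where `memory_usage` has to shrink.

```c
#include <stddef.h>

#define QLEN 8

struct pkt { size_t truesize; int drop; };  /* drop: sketch-only AQM verdict */

struct queue {
    struct pkt ring[QLEN];
    size_t head, count;
    size_t memory_usage;
};

static int enqueue(struct queue *q, struct pkt p)
{
    if (q->count == QLEN)
        return -1;
    q->ring[(q->head + q->count) % QLEN] = p;
    q->count++;
    q->memory_usage += p.truesize;
    return 0;
}

/* Inner dequeue: every packet leaving the queue passes through here,
 * whether it is delivered or dropped, so the accounting lives here. */
static int dequeue_func(struct queue *q, struct pkt *out)
{
    if (q->count == 0)
        return -1;
    *out = q->ring[q->head];
    q->head = (q->head + 1) % QLEN;
    q->count--;
    q->memory_usage -= out->truesize;
    return 0;
}

/* Outer dequeue: skip packets marked for dropping, deliver the next one. */
static int queue_dequeue(struct queue *q, struct pkt *out)
{
    while (dequeue_func(q, out) == 0)
        if (!out->drop)
            return 0;
    return -1;
}

/* Reset must also clear the accounting. */
static void queue_reset(struct queue *q)
{
    q->head = 0;
    q->count = 0;
    q->memory_usage = 0;
}
```

If the decrement lived in `queue_dequeue` instead, the dropped packet's size would never be subtracted and `memory_usage` would drift upward over time.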
    • D
      net: also make sch_handle_egress() drop monitor ready · 7e2c3aea
      Daniel Borkmann committed
      Follow-up for 8a3a4c6e ("net: make sch_handle_ingress() drop
      monitor ready") to also make the egress side drop monitor ready.
      
      Also here only TC_ACT_SHOT is a clear indication that something
      went wrong. Hence don't provide false positives to drop monitors
      such as 'perf record -e skb:kfree_skb ...'.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e2c3aea
    • M
      net/hsr: Use setup_timer and mod_timer. · 15db6e0d
      Muhammad Falak R Wani committed
      The function setup_timer combines the initialization of a timer with the
      initialization of the timer's function and data fields. The multiline
      code for timer initialization is now replaced with function setup_timer.
      
      Also, quoting the mod_timer() function comment:
      -> mod_timer() is a more efficient way to update the expire field of an
         active timer (if the timer is inactive it will be activated).
      
      Use setup_timer() and mod_timer() to set up and arm a timer, making the
      code compact and aiding readability.
      Signed-off-by: Muhammad Falak R Wani <falakreyaz@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      15db6e0d
    • D
      bpf: add generic constant blinding for use in jits · 4f3446bb
      Daniel Borkmann committed
      This work adds a generic facility for use from eBPF JIT compilers
      that allows for further hardening of JIT generated images through
      blinding constants. In response to the original work on BPF JIT
      spraying published by Keegan McAllister [1], most BPF JITs were
      changed to make images read-only and start at a randomized offset
      in the page, where the rest was filled with trap instructions. We
      have this nowadays in x86, arm, arm64 and s390 JIT compilers.
      Additionally, later work also made eBPF interpreter images read
      only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
      arm, arm64 and s390 archs as well currently. This is done by
      default for mentioned JITs when JITing is enabled. Furthermore,
      we had a generic and configurable constant blinding facility on our
      todo for quite some time now to make spraying even harder, with a
      first implementation existing since around netconf 2016.
      
      We found that for systems where untrusted users can load cBPF/eBPF
      code where JIT is enabled, start offset randomization helps a bit
      to make jumps into crafted payload harder, but in case where larger
      programs that cross page boundary are injected, we again have some
      part of the program opcodes at a page start offset. With improved
      guessing and more reliable payload injection, chances can increase
      to jump into such payload. Elena Reshetova recently wrote a test
      case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
      can leave some more room for payloads. Note that for all this,
      additional bugs in the kernel are still required to make the jump
      (and of course to guess right, to not jump into a trap) and naturally
      the JIT must be enabled, which is disabled by default.
      
      For helping mitigation, the general idea is to provide an option
      bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
      that for cases where JIT should be enabled for performance reasons,
      the generated image can be further hardened with blinding constants
      for unprivileged users (bpf_jit_harden == 1), with trading off
      performance for these, but not for privileged ones. We also added
      the option of blinding for all users (bpf_jit_harden == 2), which
      is quite helpful for testing f.e. with test_bpf.ko. No further
      hardening levels of the bpf_jit_harden switch are intended; the
      rationale is to keep it dead simple to use as on/off. Since this
      functionality would need to be duplicated over and over for JIT
      compilers to use, which are already complex enough, we provide a
      generic eBPF byte-code level based blinding implementation, which is
      then just transparently JITed. JIT compilers need to make only a few
      changes to integrate this facility and can be migrated one by one.
      
      This option is for eBPF JITs and will be used in x86, arm64, s390
      without too much effort, and soon ppc64 JITs, thus that native eBPF
      can be blinded as well as cBPF to eBPF migrations, so that both can
      be covered with a single implementation. The rule for JITs is that
      bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
      and in case blinding is disabled, we follow normally with JITing the
      passed program. In case blinding is enabled and we fail during the
      process of blinding itself, we must return with the interpreter.
      Similarly, in case the JITing process after the blinding failed, we
      return normally to the interpreter with the non-blinded code. Meaning,
      interpreter doesn't change in any way and operates on eBPF code as
      usual. For doing this pre-JIT blinding step, we need to make use of
      a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
      to the JIT and not in any way part of the eBPF architecture. JITs
      already make internal use of some helper registers when emitting
      code; the only difference is that here the helper register is one
      abstraction level higher, in eBPF bytecode, but nevertheless in the
      JIT phase. That helper register is needed since f.e. a manually written
      program can issue loads to all registers of the eBPF architecture.
      
      The core concept with the additional register is: blind out all 32
      and 64 bit constants by converting BPF_K based instructions into a
      small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
      is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
      and REG <OP> BPF_REG_AX, so actual operation on the target register
      is translated from BPF_K into BPF_X one that is operating on
      BPF_REG_AX's content. During rewriting phase when blinding, RND is
      newly generated via prandom_u32() for each processed instruction.
      64 bit loads are split into two 32 bit loads to make translation and
      patching not too complex. Only basic thing required by JITs is to
      call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
      pair, and to map BPF_REG_AX into an unused register.
      
      Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
      
      echo 0 > /proc/sys/net/core/bpf_jit_harden
      
        ffffffffa034f5e9 + <x>:
        [...]
        39:   mov    $0xa8909090,%eax
        3e:   mov    $0xa8909090,%eax
        43:   mov    $0xa8ff3148,%eax
        48:   mov    $0xa89081b4,%eax
        4d:   mov    $0xa8900bb0,%eax
        52:   mov    $0xa810e0c1,%eax
        57:   mov    $0xa8908eb4,%eax
        5c:   mov    $0xa89020b0,%eax
        [...]
      
      echo 1 > /proc/sys/net/core/bpf_jit_harden
      
        ffffffffa034f1e5 + <x>:
        [...]
        39:   mov    $0xe1192563,%r10d
        3f:   xor    $0x4989b5f3,%r10d
        46:   mov    %r10d,%eax
        49:   mov    $0xb8296d93,%r10d
        4f:   xor    $0x10b9fd03,%r10d
        56:   mov    %r10d,%eax
        59:   mov    $0x8c381146,%r10d
        5f:   xor    $0x24c7200e,%r10d
        66:   mov    %r10d,%eax
        69:   mov    $0xeb2a830e,%r10d
        6f:   xor    $0x43ba02ba,%r10d
        76:   mov    %r10d,%eax
        79:   mov    $0xd9730af,%r10d
        7f:   xor    $0xa5073b1f,%r10d
        86:   mov    %r10d,%eax
        89:   mov    $0x9a45662b,%r10d
        8f:   xor    $0x325586ea,%r10d
        96:   mov    %r10d,%eax
        [...]
      
      As can be seen, original constants that carry payload are hidden
      when enabled, actual operations are transformed from constant-based
      to register-based ones, making jumps into constants ineffective.
      Above extract/example uses single BPF load instruction over and
      over, but of course all instructions with constants are blinded.
      
      Performance wise, JIT with blinding performs a bit slower than just
      JIT and faster than interpreter case. This is expected, since we
      still get all the performance benefits from JITing and in normal
      use-cases not every single instruction needs to be blinded. Summing
      up all 296 test cases averaged over multiple runs from test_bpf.ko
      suite, interpreter was 55% slower than JIT only and JIT with blinding
      was 8% slower than JIT only. Since there are also some extremes in
      the test suite, I expect for ordinary workloads that the performance
      for the JIT with blinding case is even closer to JIT only case,
      f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
      35 (+ blinding), and 151 (interpreter).
      
      BPF test suite, seccomp test suite, eBPF sample code and various
      bigger networking eBPF programs have been tested with this and were
      running fine. For testing purposes, I also adapted interpreter and
      redirected blinded eBPF image to interpreter and also here all tests
      pass.
      
        [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
        [2] https://github.com/01org/jit-spray-poc-for-ksp/
        [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4f3446bb
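The core identity behind the blinding, ((RND ^ K_VAL) ^ RND) == K_VAL, can be checked with a few lines of C. This is a sketch of the arithmetic only, not the kernel's instruction rewriter; `struct blinded`, `blind_const` and `unblind` are hypothetical names. The first mov/xor pair in the hardened listing above (0xe1192563 and 0x4989b5f3) XORs back to 0xa8909090, the first constant in the unhardened listing.

```c
#include <stdint.h>

/* The two immediates emitted in place of a single constant K. */
struct blinded {
    uint32_t masked;  /* RND ^ K, loaded into the helper register */
    uint32_t rnd;     /* RND, XORed into the helper register afterwards */
};

/* Rewrite step: K itself never appears in the emitted immediates. */
static struct blinded blind_const(uint32_t k, uint32_t rnd)
{
    struct blinded b = { rnd ^ k, rnd };
    return b;
}

/* Run-time effect of "AX := masked; AX ^= rnd": AX holds K again. */
static uint32_t unblind(struct blinded b)
{
    uint32_t ax = b.masked;
    ax ^= b.rnd;
    return ax;
}
```

Because a fresh RND is drawn per instruction, an attacker who controls K cannot predict either immediate that ends up in the JIT image, while the value the program computes with is unchanged.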
    • D
      bpf: prepare bpf_int_jit_compile/bpf_prog_select_runtime apis · d1c55ab5
      Daniel Borkmann committed
      Since the blinding is strictly only called from inside eBPF JITs,
      we need to change signatures for bpf_int_jit_compile() and
      bpf_prog_select_runtime() first in order to prepare that the
      eBPF program we're dealing with can change underneath. Hence,
      for call sites, we need to return the latest prog. No functional
      change in this patch.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d1c55ab5
    • D
      bpf: split HAVE_BPF_JIT into cBPF and eBPF variant · 6077776b
      Daniel Borkmann committed
      Split the HAVE_BPF_JIT into two for distinguishing cBPF and eBPF JITs.
      
      Current cBPF ones:
      
        # git grep -n HAVE_CBPF_JIT arch/
        arch/arm/Kconfig:44:    select HAVE_CBPF_JIT
        arch/mips/Kconfig:18:   select HAVE_CBPF_JIT if !CPU_MICROMIPS
        arch/powerpc/Kconfig:129:       select HAVE_CBPF_JIT
        arch/sparc/Kconfig:35:  select HAVE_CBPF_JIT
      
      Current eBPF ones:
      
        # git grep -n HAVE_EBPF_JIT arch/
        arch/arm64/Kconfig:61:  select HAVE_EBPF_JIT
        arch/s390/Kconfig:126:  select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES
        arch/x86/Kconfig:94:    select HAVE_EBPF_JIT                    if X86_64
      
      Later code also needs this facility to check for eBPF JITs.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6077776b
    • D
      bpf: minor cleanups in ebpf code · 4936e352
      Daniel Borkmann committed
      Besides others, remove redundant comments where the code is self
      documenting enough, and properly indent various bpf_verifier_ops
      and bpf_prog_type_list declarations. Moreover, remove two exports
      that actually have no module user.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4936e352
    • E
      tcp: minor optimizations around tcp_hdr() usage · ea1627c2
      Eric Dumazet committed
      tcp_hdr() is slightly more expensive than using skb->data in contexts
      where we know they point to the same byte.
      
      In receive path, tcp_v4_rcv() and tcp_v6_rcv() are in this situation,
      as tcp header has not been pulled yet.
      
      In output path, the same can be said when we just pushed the tcp header
      in the skb, in tcp_transmit_skb() and tcp_make_synack().
      
      Also factorize the two checks for tcb->tcp_flags & TCPHDR_SYN in
      tcp_transmit_skb() and pass tcp header pointer to tcp_ecn_send(),
      so that compiler can further optimize and avoid a reload.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ea1627c2
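The idea behind this optimization can be sketched in userspace, assuming a simplified buffer type. `struct pkt` and both accessors are hypothetical and only loosely modelled on an skb. The generic accessor computes the header address from the buffer start plus an offset, while the fast path reuses the data pointer in contexts where both are known to point at the same byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified packet buffer; not the kernel's struct sk_buff. */
struct pkt {
    uint8_t *head;              /* start of the backing buffer */
    uint8_t *data;              /* start of the not-yet-consumed bytes */
    uint16_t transport_header;  /* offset of the transport header from head */
};

/* Generic accessor: load head, load the offset, add them. */
uint8_t *pkt_transport_header(const struct pkt *p)
{
    return p->head + p->transport_header;
}

/*
 * Fast path for contexts where the transport header has not been pulled yet
 * (receive) or has just been pushed (transmit): data already points at the
 * header, so reading one field is enough. The assert documents the
 * precondition that makes the shortcut valid.
 */
uint8_t *pkt_transport_header_fast(const struct pkt *p)
{
    assert(p->data == p->head + p->transport_header);
    return p->data;
}
```

The shortcut is only correct while the precondition holds, which is why the commit limits it to specific receive and output call sites.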
    • E
      sock: propagate __sock_cmsg_send() error · 2632616b
      Eric Dumazet committed
      __sock_cmsg_send() might return different error codes, not only -EINVAL.
      
      Fixes: 24025c46 ("ipv4: process socket-level control messages in IPv4")
      Fixes: ad1e46a8 ("ipv6: process socket-level control messages in IPv6")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2632616b
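The difference between collapsing and propagating an error code can be shown with a small sketch. The message types, the `privileged` parameter and all function names are hypothetical and chosen only for illustration. They do not reproduce the kernel's control-message handling.

```c
#include <errno.h>

/* Hypothetical control-message types, for illustration only. */
enum { CMSG_PLAIN = 1, CMSG_PRIVILEGED = 2 };

/* Stand-in for a socket-level handler that can fail for different reasons. */
int sock_level_cmsg(int type, int privileged)
{
    switch (type) {
    case CMSG_PLAIN:
        return 0;
    case CMSG_PRIVILEGED:
        return privileged ? 0 : -EPERM;   /* caller lacks the privilege */
    default:
        return -EINVAL;                   /* unknown message type */
    }
}

/* Before: every failure is collapsed into -EINVAL, hiding the real cause. */
int proto_cmsg_send_old(int type, int privileged)
{
    if (sock_level_cmsg(type, privileged))
        return -EINVAL;
    return 0;
}

/* After: the callee's error code is passed through unchanged. */
int proto_cmsg_send_new(int type, int privileged)
{
    int err = sock_level_cmsg(type, privileged);

    if (err)
        return err;
    return 0;
}
```

With the old wrapper an unprivileged caller would see `-EINVAL`; with the new one it sees `-EPERM`, which matches what actually went wrong.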
    • A
      net: qrtr: fix build problems · a986a05d
      Arnd Bergmann committed
      Having multiple loadable modules with the same name cannot work
      with modprobe, and having both net/qrtr/smd.ko and drivers/soc/qcom/smd.ko
      results in a (somewhat cryptic) build error:
      
      ERROR: "qcom_smd_driver_unregister" [net/qrtr/smd.ko] undefined!
      ERROR: "qcom_smd_driver_register" [net/qrtr/smd.ko] undefined!
      ERROR: "qcom_smd_set_drvdata" [net/qrtr/smd.ko] undefined!
      ERROR: "qcom_smd_send" [net/qrtr/smd.ko] undefined!
      ERROR: "qcom_smd_get_drvdata" [net/qrtr/smd.ko] undefined!
      ERROR: "qcom_smd_driver_unregister" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
      ERROR: "qcom_smd_driver_register" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
      ERROR: "qcom_smd_set_drvdata" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
      ERROR: "qcom_smd_send" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
      ERROR: "qcom_smd_get_drvdata" [drivers/soc/qcom/wcnss_ctrl.ko] undefined!
      
      Also, the qrtr driver uses the SMD interface and has a Kconfig dependency,
      but also allows for compile-testing when SMD is disabled. However,
      with QCOM_SMD=m and COMPILE_TEST=y we can end up with QRTR_SMD=y and
      that fails with a related link error.
      
      This changes the dependency so we can still compile-test the driver but
      not have it built-in if SMD is a module, to avoid running in the broken
      configuration, and changes the Makefile to provide the driver under
      a different module name.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: bdabad3e ("net: Add Qualcomm IPC router")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a986a05d
    • A
      net/sched: cls_flower: Hardware offloaded filters statistics support · 10cbc684
      Amir Vadai committed
      Introduce a new command in ndo_setup_tc() for hardware offloaded
      filters, to call the NIC driver, and make it update the statistics.
      This will be done before dumping the filter and its statistics.
      Signed-off-by: Amir Vadai <amirva@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10cbc684
    • A
      net/sched: act_gact: Update statistics when offloaded to hardware · 9fea47d9
      Amir Vadai committed
      Implement the stats_update callback that will be called by NIC drivers
      for hardware offloaded filters.
      Signed-off-by: Amir Vadai <amirva@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9fea47d9
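The callback flow described in this commit and the preceding cls_flower one can be sketched as follows. The types and function names (`struct action`, `gact_stats_update`, `driver_report_stats`) are hypothetical userspace stand-ins, and keeping the newest `lastuse` value is an assumption of this sketch rather than something the commit message states.

```c
#include <stdint.h>

/* Hypothetical per-action counters; not the kernel's struct tc_action. */
struct action_stats {
    uint64_t bytes;
    uint32_t packets;
    uint64_t lastuse;   /* timestamp of the most recent hit */
};

struct action;

/* Callback type a driver invokes to report hardware-side counters. */
typedef void (*stats_update_fn)(struct action *a, uint64_t bytes,
                                uint32_t packets, uint64_t lastuse);

struct action {
    struct action_stats stats;
    stats_update_fn stats_update;   /* may be NULL if unsupported */
};

/* Action side: fold the reported deltas into the software counters. */
void gact_stats_update(struct action *a, uint64_t bytes,
                       uint32_t packets, uint64_t lastuse)
{
    a->stats.bytes += bytes;
    a->stats.packets += packets;
    if (lastuse > a->stats.lastuse)   /* assumption: keep the newest value */
        a->stats.lastuse = lastuse;
}

/*
 * Driver side: called before a dump to report what the hardware counted
 * since the previous query. Actions without a callback are skipped.
 */
void driver_report_stats(struct action *a, uint64_t hw_bytes,
                         uint32_t hw_packets, uint64_t hw_lastuse)
{
    if (a->stats_update)
        a->stats_update(a, hw_bytes, hw_packets, hw_lastuse);
}
```

Reporting deltas rather than absolute totals keeps the callback additive, so software and hardware hits can accumulate in the same counters.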
    • S
      net: cls_u32: Add support for skip-sw flag to tc u32 classifier. · d34e3e18
      Samudrala, Sridhar committed
      On devices that support TC U32 offloads, this flag enables a filter to be
      added only to HW. skip-sw and skip-hw are mutually exclusive flags. By
      default without any flags, the filter is added to both HW and SW, but no
      error checks are done in case of failure to add to HW. With skip-sw,
      failure to add to HW is treated as an error.
      
      Here is a sample script that adds 2 filters, one with skip-sw and the other
      with skip-hw flag.
      
         # add ingress qdisc
         tc qdisc add dev p4p1 ingress
      
         # enable hw tc offload.
         ethtool -K p4p1 hw-tc-offload on
      
         # add u32 filter with skip-sw flag.
         tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
            handle 800:0:1 u32 ht 800: flowid 800:1 \
            skip-sw \
            match ip src 192.168.1.0/24 \
            action drop
      
         # add u32 filter with skip-hw flag.
         tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
            handle 800:0:2 u32 ht 800: flowid 800:2 \
            skip-hw \
            match ip src 192.168.2.0/24 \
            action drop
      Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Acked-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d34e3e18
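The flag semantics described in this commit message can be summarized as a small decision function. This is a userspace sketch under stated assumptions: the flag values, `struct placement` and `add_filter` are hypothetical, and the hardware add is modelled by an `hw_err` parameter (0 on success, a negative errno on failure) instead of a real driver call.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
#define FLAG_SKIP_HW 0x1u
#define FLAG_SKIP_SW 0x2u

/* Where the filter ended up. */
struct placement {
    bool in_hw;
    bool in_sw;
};

/*
 * Decide where a filter is installed.
 * - skip-sw and skip-hw together are rejected.
 * - no flags: try hardware, ignore a hardware failure, always add to software.
 * - skip-sw: hardware only, and a hardware failure is an error.
 * - skip-hw: software only, hardware is never tried.
 */
int add_filter(unsigned int flags, int hw_err, struct placement *out)
{
    out->in_hw = false;
    out->in_sw = false;

    if ((flags & FLAG_SKIP_HW) && (flags & FLAG_SKIP_SW))
        return -EINVAL;             /* mutually exclusive flags */

    if (!(flags & FLAG_SKIP_HW)) {
        if (hw_err == 0)
            out->in_hw = true;
        else if (flags & FLAG_SKIP_SW)
            return hw_err;          /* nothing to fall back to */
    }

    if (!(flags & FLAG_SKIP_SW))
        out->in_sw = true;

    return 0;
}
```

In the default case a hardware failure is silent, so checking `in_hw` afterwards is the only way for a caller of this sketch to learn whether the offload happened.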
  5. 15 May, 2016 3 commits