1. 01 8月, 2015 1 次提交
    • A
      bpf: add helpers to access tunnel metadata · d3aa45ce
      Alexei Starovoitov 提交于
      Introduce helpers to let eBPF programs attached to TC manipulate tunnel metadata:
      bpf_skb_[gs]et_tunnel_key(skb, key, size, flags)
      skb: pointer to skb
      key: pointer to 'struct bpf_tunnel_key'
      size: size of 'struct bpf_tunnel_key'
      flags: room for future extensions
      
      First eBPF program that uses these helpers will allocate per_cpu
      metadata_dst structures that will be used on TX.
      On RX metadata_dst is allocated by tunnel driver.
      
      Typical usage for TX:
      struct bpf_tunnel_key tkey;
      ... populate tkey ...
      bpf_skb_set_tunnel_key(skb, &tkey, sizeof(tkey), 0);
      bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);
      
      RX:
      struct bpf_tunnel_key tkey = {};
      bpf_skb_get_tunnel_key(skb, &tkey, sizeof(tkey), 0);
      ... lookup or redirect based on tkey ...
      
      'struct bpf_tunnel_key' will be extended in the future by adding
      elements to the end and the 'size' argument will indicate which fields
      are populated, thereby keeping backwards compatibility.
      The 'flags' argument may be used as well when the 'size' is not enough or
      to indicate completely different layout of bpf_tunnel_key.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3aa45ce
  2. 30 7月, 2015 3 次提交
  3. 27 7月, 2015 2 次提交
  4. 22 7月, 2015 5 次提交
    • T
      openvswitch: Use regular VXLAN net_device device · 614732ea
      Thomas Graf 提交于
      This gets rid of all OVS specific VXLAN code in the receive and
      transmit path by using a VXLAN net_device to represent the vport.
      Only a small shim layer remains which takes care of handling the
      VXLAN specific OVS Netlink configuration.
      
      Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb()
      since they are no longer needed.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      614732ea
    • T
      fib: Add fib rule match on tunnel id · e7030878
      Thomas Graf 提交于
      This add the ability to select a routing table based on the tunnel
      id which allows to maintain separate routing tables for each virtual
      tunnel network.
      
      ip rule add from all tunnel-id 100 lookup 100
      ip rule add from all tunnel-id 200 lookup 200
      
      A new static key controls the collection of metadata at tunnel level
      upon demand.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7030878
    • T
      dst: Metadata destinations · f38a9eb1
      Thomas Graf 提交于
      Introduces a new dst_metadata which enables to carry per packet metadata
      between forwarding and processing elements via the skb->dst pointer.
      
      The structure is set up to be a union. Thus, each separate type of
      metadata requires its own dst instance. If demand arises to carry
      multiple types of metadata concurrently, metadata dst entries can be
      made stackable.
      
      The metadata dst entry is refcnt'ed as expected for now but a non
      reference counted use is possible if the reference is forced before
      queueing the skb.
      
      In order to allow allocating dsts with variable length, the existing
      dst_alloc() is split into a dst_alloc() and dst_init() function. The
      existing dst_init() function to initialize the subsystem is being
      renamed to dst_subsys_init() to make it clear what is what.
      
      The check before ip_route_input() is changed to ignore metadata dsts
      and drop the dst inside the routing function thus allowing to interpret
      metadata in a later commit.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f38a9eb1
    • R
      lwtunnel: support dst output redirect function · ffce4196
      Roopa Prabhu 提交于
      This patch introduces lwtunnel_output function to call corresponding
      lwtunnels output function to xmit the packet.
      
      It adds two variants lwtunnel_output and lwtunnel_output6 for ipv4 and
      ipv6 respectively today. But this is subject to change when lwtstate will
      reside in dst or dst_metadata (as per upstream discussions).
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ffce4196
    • R
      lwtunnel: infrastructure for handling light weight tunnels like mpls · 499a2425
      Roopa Prabhu 提交于
      Provides infrastructure to parse/dump/store encap information for
      light weight tunnels like mpls. Encap information for such tunnels
      is associated with fib routes.
      
      This infrastructure is based on previous suggestions from
      Eric Biederman to follow the xfrm infrastructure.
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      499a2425
  5. 21 7月, 2015 5 次提交
    • K
      net: ratelimit warnings about dst entry refcount underflow or overflow · 8bf4ada2
      Konstantin Khlebnikov 提交于
      Kernel generates a lot of warnings when dst entry reference counter
      overflows and becomes negative. That bug was seen several times at
      machines with outdated 3.10.y kernels. Most like it's already fixed
      in upstream. Anyway that flood completely kills machine and makes
      further debugging impossible.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8bf4ada2
    • A
      test_bpf: add bpf_skb_vlan_push/pop() tests · 4d9c5c53
      Alexei Starovoitov 提交于
      improve accuracy of timing in test_bpf and add two stress tests:
      - {skb->data[0], get_smp_processor_id} repeated 2k times
      - {skb->data[0], vlan_push} x 68 followed by {skb->data[0], vlan_pop} x 68
      
      1st test is useful to test performance of JIT implementation of BPF_LD_ABS
      together with BPF_CALL instructions.
      2nd test is stressing skb_vlan_push/pop logic together with skb->data access
      via BPF_LD_ABS insn which checks that re-caching of skb->data is done correctly.
      
      In order to call bpf_skb_vlan_push() from test_bpf.ko have to add
      three export_symbol_gpl.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4d9c5c53
    • A
      bpf: introduce bpf_skb_vlan_push/pop() helpers · 4e10df9a
      Alexei Starovoitov 提交于
      Allow eBPF programs attached to TC qdiscs call skb_vlan_push/pop via
      helper functions. These functions may change skb->data/hlen which are
      cached by some JITs to improve performance of ld_abs/ld_ind instructions.
      Therefore JITs need to recognize bpf_skb_vlan_push/pop() calls,
      re-compute header len and re-cache skb->data/hlen back into cpu registers.
      Note, skb->data/hlen are not directly accessible from the programs,
      so any changes to skb->data done either by these helpers or by other
      TC actions are safe.
      
      eBPF JIT supported by three architectures:
      - arm64 JIT is using bpf_load_pointer() without caching, so it's ok as-is.
      - x64 JIT re-caches skb->data/hlen unconditionally after vlan_push/pop calls
        (experiments showed that conditional re-caching is slower).
      - s390 JIT falls back to interpreter for now when bpf_skb_vlan_push() is present
        in the program (re-caching is tbd).
      
      These helpers allow more scalable handling of vlan from the programs.
      Instead of creating thousands of vlan netdevs on top of eth0 and attaching
      TC+ingress+bpf to all of them, the program can be attached to eth0 directly
      and manipulate vlans as necessary.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4e10df9a
    • S
      net: don't reforward packets already forwarded by offload device · 0c4f691f
      Scott Feldman 提交于
      Just before queuing skb for xmit on port, check if skb has been marked by
      switchdev port driver as already fordwarded by device.  If so, drop skb.  A
      non-zero skb->offload_fwd_mark field is set by the switchdev port
      driver/device on ingress to indicate the skb has already been forwarded by
      the device to egress ports with matching dev->skb_mark.  The switchdev port
      driver would assign a non-zero dev->offload_skb_mark for each device port
      netdev during registration, for example.
      Signed-off-by: NScott Feldman <sfeldma@gmail.com>
      Acked-by: NJiri Pirko <jiri@resnulli.us>
      Acked-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c4f691f
    • D
      ebpf: add helper to retrieve net_cls's classid cookie · 8d20aabe
      Daniel Borkmann 提交于
      It would be very useful to retrieve the net_cls's classid from an eBPF
      program to allow for a more fine-grained classification, it could be
      directly used or in conjunction with additional policies. I.e. docker,
      but also tooling such as cgexec, can easily run applications via net_cls
      cgroups:
      
        cgcreate -g net_cls:/foo
        echo 42 > foo/net_cls.classid
        cgexec -g net_cls:foo <prog>
      
      Thus, their respecitve classid cookie of foo can then be looked up on
      the egress path to apply further policies. The helper is desigend such
      that a non-zero value returns the cgroup id.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Thomas Graf <tgraf@suug.ch>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8d20aabe
  6. 16 7月, 2015 5 次提交
  7. 11 7月, 2015 2 次提交
    • J
      net: call rcu_read_lock early in process_backlog · 2c17d27c
      Julian Anastasov 提交于
      Incoming packet should be either in backlog queue or
      in RCU read-side section. Otherwise, the final sequence of
      flush_backlog() and synchronize_net() may miss packets
      that can run without device reference:
      
      CPU 1                  CPU 2
                             skb->dev: no reference
                             process_backlog:__skb_dequeue
                             process_backlog:local_irq_enable
      
      on_each_cpu for
      flush_backlog =>       IPI(hardirq): flush_backlog
                             - packet not found in backlog
      
                             CPU delayed ...
      synchronize_net
      - no ongoing RCU
      read-side sections
      
      netdev_run_todo,
      rcu_barrier: no
      ongoing callbacks
                             __netif_receive_skb_core:rcu_read_lock
                             - too late
      free dev
                             process packet for freed dev
      
      Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c17d27c
    • J
      net: do not process device backlog during unregistration · e9e4dd32
      Julian Anastasov 提交于
      commit 381c759d ("ipv4: Avoid crashing in ip_error")
      fixes a problem where processed packet comes from device
      with destroyed inetdev (dev->ip_ptr). This is not expected
      because inetdev_destroy is called in NETDEV_UNREGISTER
      phase and packets should not be processed after
      dev_close_many() and synchronize_net(). Above fix is still
      required because inetdev_destroy can be called for other
      reasons. But it shows the real problem: backlog can keep
      packets for long time and they do not hold reference to
      device. Such packets are then delivered to upper levels
      at the same time when device is unregistered.
      Calling flush_backlog after NETDEV_UNREGISTER_FINAL still
      accounts all packets from backlog but before that some packets
      continue to be delivered to upper levels long after the
      synchronize_net call which is supposed to wait the last
      ones. Also, as Eric pointed out, processed packets, mostly
      from other devices, can continue to add new packets to backlog.
      
      Fix the problem by moving flush_backlog early, after the
      device driver is stopped and before the synchronize_net() call.
      Then use netif_running check to make sure we do not add more
      packets to backlog. We have to do it in enqueue_to_backlog
      context when the local IRQ is disabled. As result, after the
      flush_backlog and synchronize_net sequence all packets
      should be accounted.
      
      Thanks to Eric W. Biederman for the test script and his
      valuable feedback!
      Reported-by: NVittorio Gambaletta <linuxbugs@vittgam.net>
      Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e9e4dd32
  8. 10 7月, 2015 3 次提交
  9. 09 7月, 2015 5 次提交
  10. 01 7月, 2015 1 次提交
    • C
      sock_diag: don't broadcast kernel sockets · b922622e
      Craig Gallek 提交于
      Kernel sockets do not hold a reference for the network namespace to
      which they point.  Socket destruction broadcasting relies on the
      network namespace and will cause the splat below when a kernel socket
      is destroyed.
      
      This fix simply ignores kernel sockets when they are destroyed.
      
      Reported as:
      general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      CPU: 1 PID: 9130 Comm: kworker/1:1 Not tainted 4.1.0-gelk-debug+ #1
      Workqueue: sock_diag_events sock_diag_broadcast_destroy_work
      Stack:
       ffff8800b9c586c0 ffff8800b9c586c0 ffff8800ac4692c0 ffff8800936d4a90
       ffff8800352efd38 ffffffff8469a93e ffff8800352efd98 ffffffffc09b9b90
       ffff8800352efd78 ffff8800ac4692c0 ffff8800b9c586c0 ffff8800831b6ab8
      Call Trace:
       [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10
       [<ffffffffc09b9b90>] ? inet_diag_handler_get_info+0x110/0x1fb [inet_diag]
       [<ffffffff845c868d>] netlink_broadcast+0x1d/0x20
       [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10
       [<ffffffff845b2bf5>] sock_diag_broadcast_destroy_work+0xd5/0x160
       [<ffffffff8408ea97>] process_one_work+0x147/0x420
       [<ffffffff8408f0f9>] worker_thread+0x69/0x470
       [<ffffffff8409fda3>] ? preempt_count_sub+0xa3/0xf0
       [<ffffffff8408f090>] ? rescuer_thread+0x320/0x320
       [<ffffffff84093cd7>] kthread+0x107/0x120
       [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0
       [<ffffffff8469d31f>] ret_from_fork+0x3f/0x70
       [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0
      
      Tested:
        Using a debug kernel while 'ss -E' is running:
        ip netns add test-ns
        ip netns delete test-ns
      
      Fixes: eb4cb008 sock_diag: define destruction multicast groups
      Fixes: 26abe143 net: Modify sk_alloc to not reference count the
        netns of kernel sockets.
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b922622e
  11. 29 6月, 2015 2 次提交
  12. 23 6月, 2015 1 次提交
    • S
      switchdev; add VLAN support for port's bridge_getlink · 7d4f8d87
      Scott Feldman 提交于
      One more missing piece of the puzzle.  Add vlan dump support to switchdev
      port's bridge_getlink.  iproute2 "bridge vlan show" cmd already knows how
      to show the vlans installed on the bridge and the device , but (until now)
      no one implemented the port vlan part of the netlink PF_BRIDGE:RTM_GETLINK
      msg.  Before this patch, "bridge vlan show":
      
      	$ bridge -c vlan show
      	port    vlan ids
      	sw1p1    30-34			<< bridge side vlans
      		 57
      
      	sw1p1				<< device side vlans (missing)
      
      	sw1p2    57
      
      	sw1p2
      
      	sw1p3
      
      	sw1p4
      
      	br0     None
      
      (When the port is bridged, the output repeats the vlan list for the vlans
      on the bridge side of the port and the vlans on the device side of the
      port.  The listing above show no vlans for the device side even though they
      are installed).
      
      After this patch:
      
      	$ bridge -c vlan show
      	port    vlan ids
      	sw1p1    30-34			<< bridge side vlan
      		 57
      
      	sw1p1    30-34			<< device side vlans
      		 57
      		 3840 PVID
      
      	sw1p2    57
      
      	sw1p2    57
      		 3840 PVID
      
      	sw1p3    3842 PVID
      
      	sw1p4    3843 PVID
      
      	br0     None
      
      I re-used ndo_dflt_bridge_getlink to add vlan fill call-back func.
      switchdev support adds an obj dump for VLAN objects, using the same
      call-back scheme as FDB dump.  Support included for both compressed and
      un-compressed vlan dumps.
      Signed-off-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7d4f8d87
  13. 22 6月, 2015 1 次提交
    • J
      neigh: do not modify unlinked entries · 2c51a97f
      Julian Anastasov 提交于
      The lockless lookups can return entry that is unlinked.
      Sometimes they get reference before last neigh_cleanup_and_release,
      sometimes they do not need reference. Later, any
      modification attempts may result in the following problems:
      
      1. entry is not destroyed immediately because neigh_update
      can start the timer for dead entry, eg. on change to NUD_REACHABLE
      state. As result, entry lives for some time but is invisible
      and out of control.
      
      2. __neigh_event_send can run in parallel with neigh_destroy
      while refcnt=0 but if timer is started and expired refcnt can
      reach 0 for second time leading to second neigh_destroy and
      possible crash.
      
      Thanks to Eric Dumazet and Ying Xue for their work and analyze
      on the __neigh_event_send change.
      
      Fixes: 767e97e1 ("neigh: RCU conversion of struct neighbour")
      Fixes: a263b309 ("ipv4: Make neigh lookups directly in output packet path.")
      Fixes: 6fd6ce20 ("ipv6: Do not depend on rt->n in ip6_finish_output2().")
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c51a97f
  14. 16 6月, 2015 4 次提交