1. 22 7月, 2015 9 次提交
    • T
      route: Extend flow representation with tunnel key · 1b7179d3
      Thomas Graf 提交于
      Add a new flowi_tunnel structure which is a subset of ip_tunnel_key to
      allow routes to match on tunnel metadata. For now, the tunnel id is
      added to flowi_tunnel which allows for routes to be bound to specific
      virtual tunnels.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b7179d3
    • T
      vxlan: Flow based tunneling · ee122c79
      Thomas Graf 提交于
      Allows putting a VXLAN device into a new flow-based mode in which
      skbs with a ip_tunnel_info dst metadata attached will be encapsulated
      according to the instructions stored in there with the VXLAN device
      defaults taken into consideration.
      
      Similar on the receive side, if the VXLAN_F_COLLECT_METADATA flag is
      set, the packet processing will populate a ip_tunnel_info struct for
      each packet received and attach it to the skb using the new metadata
      dst.  The metadata structure will contain the outer header and tunnel
      header fields which have been stripped off. Layers further up in the
      stack such as routing, tc or netfitler can later match on these fields
      and perform forwarding. It is the responsibility of upper layers to
      ensure that the flag is set if the metadata is needed. The flag limits
      the additional cost of metadata collecting based on demand.
      
      This prepares the VXLAN device to be steered by the routing and other
      subsystems which allows to support encapsulation for a large number
      of tunnel endpoints and tunnel ids through a single net_device which
      improves the scalability.
      
      It also allows for OVS to leverage this mode which in turn allows for
      the removal of the OVS specific VXLAN code.
      
      Because the skb is currently scrubed in vxlan_rcv(), the attachment of
      the new dst metadata is postponed until after scrubing which requires
      the temporary addition of a new member to vxlan_metadata. This member
      is removed again in a later commit after the indirect VXLAN receive API
      has been removed.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee122c79
    • T
      dst: Metadata destinations · f38a9eb1
      Thomas Graf 提交于
      Introduces a new dst_metadata which enables to carry per packet metadata
      between forwarding and processing elements via the skb->dst pointer.
      
      The structure is set up to be a union. Thus, each separate type of
      metadata requires its own dst instance. If demand arises to carry
      multiple types of metadata concurrently, metadata dst entries can be
      made stackable.
      
      The metadata dst entry is refcnt'ed as expected for now but a non
      reference counted use is possible if the reference is forced before
      queueing the skb.
      
      In order to allow allocating dsts with variable length, the existing
      dst_alloc() is split into a dst_alloc() and dst_init() function. The
      existing dst_init() function to initialize the subsystem is being
      renamed to dst_subsys_init() to make it clear what is what.
      
      The check before ip_route_input() is changed to ignore metadata dsts
      and drop the dst inside the routing function thus allowing to interpret
      metadata in a later commit.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f38a9eb1
    • T
      ip_tunnel: Make ovs_tunnel_info and ovs_key_ipv4_tunnel generic · 1d8fff90
      Thomas Graf 提交于
      Rename the tunnel metadata data structures currently internal to
      OVS and make them generic for use by all IP tunnels.
      
      Both structures are kernel internal and will stay that way. Their
      members are exposed to user space through individual Netlink
      attributes by OVS. It will therefore be possible to extend/modify
      these structures without affecting user ABI.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1d8fff90
    • R
      mpls: ip tunnel support · e3e4712e
      Roopa Prabhu 提交于
      This implementation uses lwtunnel infrastructure to register
      hooks for mpls tunnel encaps.
      
      It picks cues from iptunnel_encaps infrastructure and previous
      mpls iptunnel RFC patches from Eric W. Biederman and Robert Shearman
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e3e4712e
    • R
      lwtunnel: support dst output redirect function · ffce4196
      Roopa Prabhu 提交于
      This patch introduces lwtunnel_output function to call corresponding
      lwtunnels output function to xmit the packet.
      
      It adds two variants lwtunnel_output and lwtunnel_output6 for ipv4 and
      ipv6 respectively today. But this is subject to change when lwtstate will
      reside in dst or dst_metadata (as per upstream discussions).
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ffce4196
    • R
      ipv6: support for fib route lwtunnel encap attributes · 19e42e45
      Roopa Prabhu 提交于
      This patch adds support in ipv6 fib functions to parse Netlink
      RTA encap attributes and attach encap state data to rt6_info.
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19e42e45
    • R
      ipv4: support for fib route lwtunnel encap attributes · 571e7226
      Roopa Prabhu 提交于
      This patch adds support in ipv4 fib functions to parse user
      provided encap attributes and attach encap state data to fib_nh
      and rtable.
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      571e7226
    • R
      lwtunnel: infrastructure for handling light weight tunnels like mpls · 499a2425
      Roopa Prabhu 提交于
      Provides infrastructure to parse/dump/store encap information for
      light weight tunnels like mpls. Encap information for such tunnels
      is associated with fib routes.
      
      This infrastructure is based on previous suggestions from
      Eric Biederman to follow the xfrm infrastructure.
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      499a2425
  2. 21 7月, 2015 2 次提交
  3. 10 7月, 2015 6 次提交
  4. 09 7月, 2015 5 次提交
  5. 29 6月, 2015 2 次提交
  6. 24 6月, 2015 2 次提交
    • A
      net: ipv4 sysctl option to ignore routes when nexthop link is down · 0eeb075f
      Andy Gospodarek 提交于
      This feature is only enabled with the new per-interface or ipv4 global
      sysctls called 'ignore_routes_with_linkdown'.
      
      net.ipv4.conf.all.ignore_routes_with_linkdown = 0
      net.ipv4.conf.default.ignore_routes_with_linkdown = 0
      net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
      ...
      
      When the above sysctls are set, will report to userspace that a route is
      dead and will no longer resolve to this nexthop when performing a fib
      lookup.  This will signal to userspace that the route will not be
      selected.  The signalling of a RTNH_F_DEAD is only passed to userspace
      if the sysctl is enabled and link is down.  This was done as without it
      the netlink listeners would have no idea whether or not a nexthop would
      be selected.   The kernel only sets RTNH_F_DEAD internally if the
      interface has IFF_UP cleared.
      
      With the new sysctl set, the following behavior can be observed
      (interface p8p1 is link-down):
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15
          cache
      
      While the route does remain in the table (so it can be modified if
      needed rather than being wiped away as it would be if IFF_UP was
      cleared), the proper next-hop is chosen automatically when the link is
      down.  Now interface p8p1 is linked-up:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
      90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      
      and the output changes to what one would expect.
      
      If the sysctl is not set, the following output would be expected when
      p8p1 is down:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      
      Since the dead flag does not appear, there should be no expectation that
      the kernel would skip using this route due to link being down.
      
      v2: Split kernel changes into 2 patches, this actually makes a
      behavioral change if the sysctl is set.  Also took suggestion from Alex
      to simplify code by only checking sysctl during fib lookup and
      suggestion from Scott to add a per-interface sysctl.
      
      v3: Code clean-ups to make it more readable and efficient as well as a
      reverse path check fix.
      
      v4: Drop binary sysctl
      
      v5: Whitespace fixups from Dave
      
      v6: Style changes from Dave and checkpatch suggestions
      
      v7: One more checkpatch fixup
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0eeb075f
    • A
      net: track link-status of ipv4 nexthops · 8a3d0316
      Andy Gospodarek 提交于
      Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
      reachable via an interface where carrier is off.  No action is taken,
      but additional flags are passed to userspace to indicate carrier status.
      
      This also includes a cleanup to fib_disable_ip to more clearly indicate
      what event made the function call to replace the more cryptic force
      option previously used.
      
      v2: Split out kernel functionality into 2 patches, this patch simply
      sets and clears new nexthop flag RTNH_F_LINKDOWN.
      
      v3: Cleanups suggested by Alex as well as a bug noticed in
      fib_sync_down_dev and fib_sync_up when multipath was not enabled.
      
      v5: Whitespace and variable declaration fixups suggested by Dave.
      
      v6: Style fixups noticed by Dave; ran checkpatch to be sure I got them
      all.
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a3d0316
  7. 23 6月, 2015 2 次提交
    • S
      switchdev: rename vlan vid_start to vid_begin · 3e3a78b4
      Scott Feldman 提交于
      Use vid_begin/end to be consistent with BRIDGE_VLAN_INFO_RANGE_BEGIN/END.
      Signed-off-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e3a78b4
    • E
      netfilter: nf_qeueue: Drop queue entries on nf_unregister_hook · 8405a8ff
      Eric W. Biederman 提交于
      Add code to nf_unregister_hook to flush the nf_queue when a hook is
      unregistered.  This guarantees that the pointer that the nf_queue code
      retains into the nf_hook list will remain valid while a packet is
      queued.
      
      I tested what would happen if we do not flush queued packets and was
      trivially able to obtain the oops below.  All that was required was
      to stop the nf_queue listening process, to delete all of the nf_tables,
      and to awaken the nf_queue listening process.
      
      > BUG: unable to handle kernel paging request at 0000000100000001
      > IP: [<0000000100000001>] 0x100000001
      > PGD b9c35067 PUD 0
      > Oops: 0010 [#1] SMP
      > Modules linked in:
      > CPU: 0 PID: 519 Comm: lt-nfqnl_test Not tainted
      > task: ffff8800b9c8c050 ti: ffff8800ba9d8000 task.ti: ffff8800ba9d8000
      > RIP: 0010:[<0000000100000001>]  [<0000000100000001>] 0x100000001
      > RSP: 0018:ffff8800ba9dba40  EFLAGS: 00010a16
      > RAX: ffff8800bab48a00 RBX: ffff8800ba9dba90 RCX: ffff8800ba9dba90
      > RDX: ffff8800b9c10128 RSI: ffff8800ba940900 RDI: ffff8800bab48a00
      > RBP: ffff8800b9c10128 R08: ffffffff82976660 R09: ffff8800ba9dbb28
      > R10: dead000000100100 R11: dead000000200200 R12: ffff8800ba940900
      > R13: ffffffff8313fd50 R14: ffff8800b9c95200 R15: 0000000000000000
      > FS:  00007fb91fc34700(0000) GS:ffff8800bfa00000(0000) knlGS:0000000000000000
      > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > CR2: 0000000100000001 CR3: 00000000babfb000 CR4: 00000000000007f0
      > Stack:
      >  ffffffff8206ab0f ffffffff82982240 ffff8800bab48a00 ffff8800b9c100a8
      >  ffff8800b9c10100 0000000000000001 ffff8800ba940900 ffff8800b9c10128
      >  ffffffff8206bd65 ffff8800bfb0d5e0 ffff8800bab48a00 0000000000014dc0
      > Call Trace:
      >  [<ffffffff8206ab0f>] ? nf_iterate+0x4f/0xa0
      >  [<ffffffff8206bd65>] ? nf_reinject+0x125/0x190
      >  [<ffffffff8206dee5>] ? nfqnl_recv_verdict+0x255/0x360
      >  [<ffffffff81386290>] ? nla_parse+0x80/0xf0
      >  [<ffffffff8206c42c>] ? nfnetlink_rcv_msg+0x13c/0x240
      >  [<ffffffff811b2fec>] ? __memcg_kmem_get_cache+0x4c/0x150
      >  [<ffffffff8206c2f0>] ? nfnl_lock+0x20/0x20
      >  [<ffffffff82068159>] ? netlink_rcv_skb+0xa9/0xc0
      >  [<ffffffff820677bf>] ? netlink_unicast+0x12f/0x1c0
      >  [<ffffffff82067ade>] ? netlink_sendmsg+0x28e/0x650
      >  [<ffffffff81fdd814>] ? sock_sendmsg+0x44/0x50
      >  [<ffffffff81fde07b>] ? ___sys_sendmsg+0x2ab/0x2c0
      >  [<ffffffff810e8f73>] ? __wake_up+0x43/0x70
      >  [<ffffffff8141a134>] ? tty_write+0x1c4/0x2a0
      >  [<ffffffff81fde9f4>] ? __sys_sendmsg+0x44/0x80
      >  [<ffffffff823ff8d7>] ? system_call_fastpath+0x12/0x6a
      > Code:  Bad RIP value.
      > RIP  [<0000000100000001>] 0x100000001
      >  RSP <ffff8800ba9dba40>
      > CR2: 0000000100000001
      > ---[ end trace 08eb65d42362793f ]---
      
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8405a8ff
  8. 22 6月, 2015 1 次提交
  9. 19 6月, 2015 9 次提交
  10. 16 6月, 2015 2 次提交