1. 09 May, 2017 1 commit
    • M
      treewide: use kv[mz]alloc* rather than opencoded variants · 752ade68
      Michal Hocko committed
      There are many code paths opencoding kvmalloc.  Let's use the helper
      instead.  The main difference to kvmalloc is that those users are
      usually not considering all the aspects of the memory allocator.  E.g.
      allocation requests <= 32kB (with 4kB pages) are basically never failing
      and invoke OOM killer to satisfy the allocation.  This sounds too
      disruptive for something that has a reasonable fallback - the vmalloc.
      On the other hand those requests might fallback to vmalloc even when the
      memory allocator would succeed after several more reclaim/compaction
      attempts previously.  There is no guarantee something like that happens
      though.
      
      This patch converts many of those places to kv[mz]alloc* helpers because
      they are more conservative.
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
      Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
      Acked-by: David Sterba <dsterba@suse.com> # btrfs
      Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
      Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Santosh Raspatur <santosh@chelsio.com>
      Cc: Hariprasad S <hariprasad@chelsio.com>
      Cc: Yishai Hadas <yishaih@mellanox.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      752ade68
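The fallback idea behind 752ade68 can be sketched in user space. This is only an illustration under stated assumptions: `try_contiguous`, `try_virtual` and `kv_alloc_sketch` are hypothetical stand-ins (both backed by `malloc`), the `fragmented` flag merely simulates a large contiguous request failing, and the real kernel helpers take different arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Which simulated allocator satisfied the request. */
enum alloc_path { PATH_NONE, PATH_CONTIGUOUS, PATH_VIRTUAL };

/* 32 kB with 4 kB pages: the size below which, per the commit message,
 * the page allocator basically never fails. */
#define SKETCH_SMALL_LIMIT (8u * 4096u)

/* Hypothetical stand-in for a kmalloc-style attempt that gives up early.
 * "fragmented" simulates memory where large contiguous requests fail. */
static void *try_contiguous(size_t size, int fragmented)
{
	if (fragmented && size > SKETCH_SMALL_LIMIT)
		return NULL;
	return malloc(size);
}

/* Hypothetical stand-in for vmalloc: no contiguity requirement. */
static void *try_virtual(size_t size)
{
	return malloc(size);
}

/* Try the contiguous allocator first, then fall back to the virtual one. */
static void *kv_alloc_sketch(size_t size, int fragmented, enum alloc_path *path)
{
	void *p;

	assert(path != NULL);
	p = try_contiguous(size, fragmented);
	if (p) {
		*path = PATH_CONTIGUOUS;
		return p;
	}
	p = try_virtual(size);
	*path = p ? PATH_VIRTUAL : PATH_NONE;
	return p;
}
```

A caller would free the result with `free()` here; in the kernel the matching release helper is `kvfree()`.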
  2. 18 Apr, 2017 1 commit
  3. 14 Apr, 2017 1 commit
  4. 02 Apr, 2017 6 commits
    • D
      net: mpls: Increase max number of labels for lwt encap · 1511009c
      David Ahern committed
      Allow users to push down more labels per MPLS encap. Similar to LSR case,
      move label array to the end of mpls_iptunnel_encap and allocate based on
      the number of labels for the route.
      
      For consistency with the LSR case, re-use the same maximum number of
      labels.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1511009c
    • D
      net: mpls: bump maximum number of labels · a4ac8c98
      David Ahern committed
      Allow users to push down more labels per MPLS route. With the previous
      patches, no memory allocations are based on MAX_NEW_LABELS; the limit
      is only used to keep userspace in check.
      
      At this point MAX_NEW_LABELS is only used for mpls_route_config (copying
      route data from userspace) and processing nexthops looking for the max
      number of labels across the route spec.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a4ac8c98
    • D
      net: mpls: Limit memory allocation for mpls_route · df1c6316
      David Ahern committed
      Limit memory allocation size for mpls_route to 4096.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      df1c6316
    • D
      net: mpls: change mpls_route layout · 59b20966
      David Ahern committed
      Move labels to the end of mpls_nh as a 0-sized array and within mpls_route
      move the via for a nexthop after the mpls_nh. The new layout becomes:
      
         +----------------------+
         | mpls_route           |
         +----------------------+
         | mpls_nh 0            |
         +----------------------+
         | alignment padding    |   4 bytes for odd number of labels; 0 for even
         +----------------------+
         | via[rt_max_alen] 0   |
         +----------------------+
         | alignment padding    |   via's aligned on sizeof(unsigned long)
         +----------------------+
         | ...                  |
         +----------------------+
         | mpls_nh n-1          |
         +----------------------+
         | via[rt_max_alen] n-1 |
         +----------------------+
      
      Memory allocated for nexthop + via is constant across all nexthops and
      their via. It is based on the maximum number of labels across all nexthops
      and the maximum via length. The size is saved in the mpls_route as
      rt_nh_size. Accessing a nexthop becomes rt->rt_nh + index * rt->rt_nh_size.
      
      The offset of the via address from a nexthop is saved as rt_via_offset
      so that given an mpls_nh pointer the via for that hop is simply
      nh + rt->rt_via_offset.
      
      With prior code, memory allocated per mpls_route with 1 nexthop:
           via is an ethernet address - 64 bytes
           via is an ipv4 address     - 64
           via is an ipv6 address     - 72
      
      With this patch set, memory allocated per mpls_route with 1 nexthop and
      1 or 2 labels:
           via is an ethernet address - 56 bytes
           via is an ipv4 address     - 56
           via is an ipv6 address     - 64
      
      The 8-byte reduction is due to the previous patch; the change introduced
      by this patch has no impact on the size of allocations for 1 or 2 labels.
      
      Performance impact of this change was examined using network namespaces
      with veth pairs connecting namespaces. ns0 inserts the packet to the
      label-switched path using an lwt route with encap mpls. ns1 adds 1 or 2
      labels depending on test, ns2 (and ns3 for 2-label test) pops the label
      and forwards. ns3 (or ns4) for a 2-label is the destination. Similar
      series of namespaces used for 2-nexthop test.
      
      Intent is to measure changes to latency (overhead in manipulating the
      packet) in the forwarding path. Tests used netperf with UDP_RR.
      
      IPv4:                     current   patches
         1 label, 1 nexthop      29908     30115
         2 label, 1 nexthop      29071     29612
         1 label, 2 nexthop      29582     29776
         2 label, 2 nexthop      29086     29149
      
      IPv6:                     current   patches
         1 label, 1 nexthop      24502     24960
         2 label, 1 nexthop      24041     24407
         1 label, 2 nexthop      23795     23899
         2 label, 2 nexthop      23074     22959
      
      In short, the change has no effect to a modest increase in performance.
      This is expected since this patch does not really have an impact on routes
      with 1 or 2 labels (the current limit) and 1 or 2 nexthops.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59b20966
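The layout described in 59b20966 can be illustrated with a small user-space sketch. The structures below are simplified, hypothetical stand-ins (field names and sizes are assumptions, not the kernel definitions); only the arithmetic follows the commit message: a constant per-nexthop stride, a via offset aligned on `sizeof(unsigned long)`, and access by `base + index * stride`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define VIA_ALIGN_SKETCH sizeof(unsigned long)
#define ALIGN_UP(x, a)   ((((x) + (a) - 1) / (a)) * (a))

/* Hypothetical nexthop: labels live at the end as a flexible array. */
struct nh_sketch {
	void    *dev;
	uint32_t flags;
	uint8_t  labels;
	uint8_t  via_alen;
	uint8_t  via_table;
	uint32_t label[];
};

/* Hypothetical route header followed by the nexthop + via area. */
struct route_sketch {
	uint8_t nhn;
	uint8_t max_alen;
	uint8_t nh_size;      /* stride of one nexthop + via slot */
	uint8_t via_offset;   /* offset of the via inside one slot */
	unsigned long nh_area[];
};

/* Offset of the via: nexthop plus labels, rounded up for alignment. */
static size_t via_offset_for(unsigned max_labels)
{
	return ALIGN_UP(sizeof(struct nh_sketch) + max_labels * sizeof(uint32_t),
			VIA_ALIGN_SKETCH);
}

/* Stride of one slot: via offset plus the aligned maximum via length. */
static size_t nh_size_for(unsigned max_labels, unsigned max_alen)
{
	return via_offset_for(max_labels) + ALIGN_UP((size_t)max_alen, VIA_ALIGN_SKETCH);
}

static struct route_sketch *route_alloc(uint8_t nhn, uint8_t max_labels, uint8_t max_alen)
{
	size_t nh_size = nh_size_for(max_labels, max_alen);
	struct route_sketch *rt;

	assert(nh_size <= UINT8_MAX);
	rt = calloc(1, sizeof(*rt) + (size_t)nhn * nh_size);
	if (!rt)
		return NULL;
	rt->nhn = nhn;
	rt->max_alen = max_alen;
	rt->nh_size = (uint8_t)nh_size;
	rt->via_offset = (uint8_t)via_offset_for(max_labels);
	return rt;
}

/* Nexthop i sits at base + i * stride. */
static struct nh_sketch *nh_at(struct route_sketch *rt, unsigned i)
{
	return (struct nh_sketch *)((char *)rt->nh_area + (size_t)i * rt->nh_size);
}

/* The via for a nexthop sits at a fixed offset from the nexthop itself. */
static uint8_t *via_of(struct route_sketch *rt, struct nh_sketch *nh)
{
	return (uint8_t *)nh + rt->via_offset;
}
```

Because every slot has the same size, neither a per-nexthop pointer nor a per-nexthop offset needs to be stored.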
    • D
      net: mpls: Convert number of nexthops to u8 · 77ef013a
      David Ahern committed
      Number of nexthops and number of alive nexthops are tracked using an
      unsigned int. A route should never have more than 255 nexthops so
      convert both to u8. Update all references and intermediate variables
      to consistently use u8 as well.
      
      Shrinks the size of mpls_route from 32 bytes to 24 bytes with a 2-byte
      hole before the nexthops.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      77ef013a
    • D
      net: mpls: rt_nhn_alive and nh_flags should be accessed using READ_ONCE · 39eb8cd1
      David Ahern committed
      The number of alive nexthops for a route (rt->rt_nhn_alive) and the
      flags for a next hop (nh->nh_flags) are modified by netdev event
      handlers. The event handlers run with rtnl_lock held so updates are
      always done with the lock held. The packet path accesses the fields
      under the rcu lock. Since those fields can change at any moment in
      the packet path, both fields should be accessed using READ_ONCE. Updates
      to both fields should use WRITE_ONCE.
      
      Update mpls_select_multipath (packet path) and mpls_ifdown and mpls_ifup
      (event handlers) accordingly.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      39eb8cd1
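The access pattern from 39eb8cd1 can be approximated outside the kernel. In this sketch the two macros are user-space approximations of `READ_ONCE`/`WRITE_ONCE` (they assume the GNU C `__typeof__` extension), and the structures, the `NH_DEAD_SKETCH` bit and the function names are hypothetical. The point it tries to show is that the packet path takes one snapshot of each shared field and then works only on that snapshot.

```c
#include <assert.h>
#include <stdint.h>

/* User-space approximations of the kernel macros (GNU C assumed). */
#define READ_ONCE_SKETCH(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

#define NH_DEAD_SKETCH 0x1u   /* hypothetical flag bit */

struct nh_sketch { unsigned int flags; };
struct route_sketch {
	uint8_t nhn;
	uint8_t nhn_alive;
	struct nh_sketch nh[4];
};

/* Packet-path side: returns the index of the chosen nexthop, or -1. */
static int select_nh_sketch(struct route_sketch *rt, unsigned int hash)
{
	/* One snapshot, so the modulo below can never see a later 0. */
	uint8_t alive = READ_ONCE_SKETCH(rt->nhn_alive);
	unsigned int target, seen = 0;
	int i;

	if (alive == 0)
		return -1;
	target = hash % alive;
	for (i = 0; i < rt->nhn; i++) {
		unsigned int flags = READ_ONCE_SKETCH(rt->nh[i].flags);

		if (flags & NH_DEAD_SKETCH)
			continue;
		if (seen++ == target)
			return i;
	}
	return -1;
}

/* Event-handler side: assumed to run with the writer lock held. */
static void mark_dead_sketch(struct route_sketch *rt, int i)
{
	unsigned int flags = READ_ONCE_SKETCH(rt->nh[i].flags);

	assert(i >= 0 && i < rt->nhn);
	if (flags & NH_DEAD_SKETCH)
		return;
	WRITE_ONCE_SKETCH(rt->nh[i].flags, flags | NH_DEAD_SKETCH);
	WRITE_ONCE_SKETCH(rt->nhn_alive, (uint8_t)(rt->nhn_alive - 1));
}
```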
  5. 30 Mar, 2017 1 commit
  6. 29 Mar, 2017 2 commits
  7. 28 Mar, 2017 2 commits
  8. 25 Mar, 2017 1 commit
  9. 17 Mar, 2017 1 commit
    • D
      net: mpls: Fix nexthop alive tracking on down events · 61733c91
      David Ahern committed
      Alive tracking of nexthops can account for a link twice if the carrier
      goes down followed by an admin down of the same link rendering multipath
      routes useless. This is similar to 79099aab for UNREGISTER events and
      DOWN events.
      
      Fix by tracking number of alive nexthops in mpls_ifdown similar to the
      logic in mpls_ifup. Checking the flags per nexthop once after all events
      have been processed is simpler than trying to maintain a running count
      through all event combinations.
      
      Also, WRITE_ONCE is used instead of ACCESS_ONCE to set rt_nhn_alive
      per a comment from checkpatch:
          WARNING: Prefer WRITE_ONCE(<FOO>, <BAR>) over ACCESS_ONCE(<FOO>) = <BAR>
      
      Fixes: c89359a4 ("mpls: support for dead routes")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Acked-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      61733c91
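The recount approach from 61733c91 can be sketched as follows. Everything here is a simplified, hypothetical model (the flag bits, event names and structures are assumptions): instead of decrementing a counter per event, the handler updates the flags and then recounts the nexthops that have neither flag set, so a carrier-down followed by an admin-down of the same link is counted once.

```c
#include <assert.h>
#include <stdint.h>

#define NH_DEAD_SKETCH     0x1u   /* hypothetical flag bits */
#define NH_LINKDOWN_SKETCH 0x2u

enum down_event_sketch { EV_CARRIER_DOWN, EV_ADMIN_DOWN };

struct nh_sketch { int ifindex; unsigned int flags; };
struct route_sketch {
	uint8_t nhn;
	uint8_t nhn_alive;
	struct nh_sketch nh[4];
};

/* Apply a down event for one device, then recount alive nexthops. */
static void ifdown_sketch(struct route_sketch *rt, int ifindex,
			  enum down_event_sketch event)
{
	uint8_t alive = 0;
	int i;

	assert(rt->nhn <= 4);
	for (i = 0; i < rt->nhn; i++) {
		struct nh_sketch *nh = &rt->nh[i];

		if (nh->ifindex == ifindex) {
			if (event == EV_ADMIN_DOWN)
				nh->flags |= NH_DEAD_SKETCH;
			nh->flags |= NH_LINKDOWN_SKETCH;
		}
		/* Count from the flags, not from the event history. */
		if (!(nh->flags & (NH_DEAD_SKETCH | NH_LINKDOWN_SKETCH)))
			alive++;
	}
	rt->nhn_alive = alive;
}
```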
  10. 14 Mar, 2017 2 commits
    • R
      mpls: allow TTL propagation from IP packets to be configured · a59166e4
      Robert Shearman committed
      Allow TTL propagation from IP packets to MPLS packets to be
      configured. Add a new optional LWT attribute, MPLS_IPTUNNEL_TTL, which
      allows the TTL to be set in the resulting MPLS packet, with the value
      of 0 having the semantics of enabling propagation of the TTL from the
      IP header (i.e. non-zero values disable propagation).
      
      Also allow the configuration to be overridden globally by reusing the
      same sysctl to control whether the TTL is propagated from IP packets
      into the MPLS header. If the per-LWT attribute is set then it
      overrides the global configuration. If the TTL isn't propagated then a
      default TTL value is used which can be configured via a new sysctl,
      "net.mpls.default_ttl". This is kept separate from the configuration
      of whether IP TTL propagation is enabled as it can be used in the
      future when non-IP payloads are supported (i.e. where there is no
      payload TTL that can be propagated).
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Tested-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a59166e4
    • R
      mpls: allow TTL propagation to IP packets to be configured · 5b441ac8
      Robert Shearman committed
      Provide the ability to control on a per-route basis whether the TTL
      value from an MPLS packet is propagated to an IPv4/IPv6 packet when
      the last label is popped as per the theoretical model in RFC 3443
      through a new route attribute, RTA_TTL_PROPAGATE which can be 0 to
      mean disable propagation and 1 to mean enable propagation.
      
      In order to provide the ability to change the behaviour for packets
      arriving with IPv4/IPv6 Explicit Null labels and to provide an easy
      way for a user to change the behaviour for all existing routes without
      having to reprogram them, a global knob is provided. This is done
      through the addition of a new per-namespace sysctl,
      "net.mpls.ip_ttl_propagate", which defaults to enabled. If the
      per-route attribute is set (either enabled or disabled) then it
      overrides the global configuration.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Tested-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5b441ac8
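For 5b441ac8 the per-route attribute is effectively tri-state: unset, disabled (0) or enabled (1). A minimal sketch of that override rule, with a hypothetical enum and function name, might look like this:

```c
#include <assert.h>

/* Hypothetical tri-state for the per-route RTA_TTL_PROPAGATE setting. */
enum ttl_prop_sketch {
	TTL_PROP_UNSET,      /* attribute not supplied */
	TTL_PROP_DISABLED,   /* attribute supplied with value 0 */
	TTL_PROP_ENABLED     /* attribute supplied with value 1 */
};

/* Returns 1 if the MPLS TTL should be copied into the IP header when
 * the last label is popped, 0 otherwise. */
static int propagate_to_ip_sketch(enum ttl_prop_sketch per_route,
				  int global_ip_ttl_propagate)
{
	switch (per_route) {
	case TTL_PROP_ENABLED:
		return 1;
	case TTL_PROP_DISABLED:
		return 0;
	case TTL_PROP_UNSET:
		break;
	}
	/* Only an unset route falls back to net.mpls.ip_ttl_propagate. */
	return global_ip_ttl_propagate != 0;
}
```

Keeping "unset" distinct from "disabled" is what lets the global knob change the behaviour of existing routes without reprogramming them.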
  11. 13 Mar, 2017 2 commits
    • D
      mpls: Do not decrement alive counter for unregister events · 79099aab
      David Ahern committed
      Multipath routes can be rendered useless when a device in one of the
      paths is deleted. For example:
      
      $ ip -f mpls ro ls
      100
      	nexthop as to 200 via inet 172.16.2.2  dev virt12
      	nexthop as to 300 via inet 172.16.3.2  dev br0
      101
      	nexthop as to 201 via inet6 2000:2::2  dev virt12
      	nexthop as to 301 via inet6 2000:3::2  dev br0
      
      $ ip li del br0
      
      When br0 is deleted the other hop is not considered in
      mpls_select_multipath because of the alive check -- rt_nhn_alive
      is 0.
      
      rt_nhn_alive is decremented once in mpls_ifdown when the device is taken
      down (NETDEV_DOWN) and again when it is deleted (NETDEV_UNREGISTER). For
      a 2 hop route, deleting one device drops the alive count to 0. Since
      devices are taken down before unregistering, the decrement on
      NETDEV_UNREGISTER is redundant.
      
      Fixes: c89359a4 ("mpls: support for dead routes")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      79099aab
    • D
      mpls: Send route delete notifications when router module is unloaded · e37791ec
      David Ahern committed
      When the mpls_router module is unloaded, mpls routes are deleted but
      notifications are not sent to userspace leaving userspace caches
      out of sync. Add the call to mpls_notify_route in mpls_net_exit as
      routes are freed.
      
      Fixes: 0189197f ("mpls: Basic routing support")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e37791ec
  12. 21 Feb, 2017 1 commit
  13. 24 Jan, 2017 1 commit
    • D
      net: mpls: Fix multipath selection for LSR use case · 9f427a0e
      David Ahern committed
      MPLS multipath for LSR is broken -- always selecting the first nexthop
      in the one label case. For example:
      
          $ ip -f mpls ro ls
          100
                  nexthop as to 200 via inet 172.16.2.2  dev virt12
                  nexthop as to 300 via inet 172.16.3.2  dev virt13
          101
                  nexthop as to 201 via inet6 2000:2::2  dev virt12
                  nexthop as to 301 via inet6 2000:3::2  dev virt13
      
      In this example incoming packets have a single MPLS label which means
      BOS bit is set. The BOS bit is passed from mpls_forward down to
      mpls_multipath_hash which never processes the hash loop because BOS is 1.
      
      Update mpls_multipath_hash to process the entire label stack. mpls_hdr_len
      tracks the total mpls header length on each pass (on pass N mpls_hdr_len
      is N * sizeof(mpls_shim_hdr)). When the label is found with the BOS set
      it verifies the skb has sufficient header for ipv4 or ipv6, and finds the
      IPv4 or IPv6 header by using the last mpls_hdr pointer and adding 1 to
      advance past it.
      
      With these changes I have verified the code correctly sees the label,
      BOS, IPv4 and IPv6 addresses in the network header and icmp/tcp/udp
      traffic for ipv4 and ipv6 are distributed across the nexthops.
      
      Fixes: 1c78efa8 ("mpls: flow-based multipath selection")
      Acked-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f427a0e
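The label-stack walk described in 9f427a0e can be sketched over a plain byte buffer. The sketch assumes the standard 4-byte label stack entry with the bottom-of-stack (BOS) flag at bit 8, and minimum IPv4/IPv6 header sizes of 20 and 40 bytes; the function and structure names are hypothetical, and the hashing itself is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LSE_SIZE_SKETCH 4u        /* one label stack entry */
#define LSE_BOS_BIT     0x100u    /* bottom-of-stack flag */

struct walk_result_sketch {
	size_t       mpls_hdr_len;   /* N * 4 after N entries */
	unsigned int labels;         /* entries seen, including the BOS one */
	int          ip_version;     /* 4 or 6 if a full header follows, else 0 */
};

/* Read a big-endian 32-bit value from a byte buffer. */
static uint32_t be32_sketch(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Walk the whole stack, including a single entry that already has BOS set.
 * Returns 0 on success, -1 if the buffer ends before a BOS entry. */
static int walk_label_stack_sketch(const uint8_t *pkt, size_t len,
				   struct walk_result_sketch *res)
{
	size_t off = 0;
	uint32_t lse;

	assert(pkt != NULL && res != NULL);
	res->mpls_hdr_len = 0;
	res->labels = 0;
	res->ip_version = 0;

	do {
		if (len - off < LSE_SIZE_SKETCH)
			return -1;
		lse = be32_sketch(pkt + off);
		off += LSE_SIZE_SKETCH;
		res->labels++;
	} while (!(lse & LSE_BOS_BIT));

	res->mpls_hdr_len = off;
	if (off < len) {
		unsigned int version = pkt[off] >> 4;
		size_t need = version == 4 ? 20 : version == 6 ? 40 : 0;

		/* Only report a payload type when the full header is present. */
		if (need && len - off >= need)
			res->ip_version = (int)version;
	}
	return 0;
}
```

Using a do/while loop means the first entry is always examined, which covers the one-label case the fix is about.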
  14. 18 Jan, 2017 1 commit
    • R
      mpls: Packet stats · 27d69105
      Robert Shearman committed
      Having MPLS packet stats is useful for observing network operation and
      for diagnosing network problems. In the absence of anything better,
      RFC2863 and RFC3813 are used for guidance for which stats to expose
      and the semantics of them. In particular rx_noroutes maps to in
      unknown protos in RFC2863. The stats are exposed to userspace via
      AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of
      RTM_GETSTATS messages.
      
      All the introduced fields are 64-bit, even error ones, to ensure no
      overflow with long uptimes. Per-CPU counters are used to avoid
      cache-line contention on the commonly used fields. The other fields
      have also been made per-CPU for code to avoid performance problems in
      error conditions on the assumption that on some platforms the cost of
      atomic operations could be more expensive than sending the packet
      (which is what would be done in the success case). If that's not the
      case, we could instead not use per-CPU counters for these fields.
      
      Only unicast and non-fragment are exposed at the moment, but other
      counters can be exposed in the future either by adding to the end of
      struct mpls_link_stats or by additional netlink attributes in the
      AF_MPLS IFLA_STATS_AF_SPEC nested attribute.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      27d69105
  15. 06 Dec, 2016 1 commit
  16. 02 Sep, 2016 1 commit
  17. 10 Jul, 2016 1 commit
  18. 17 Jun, 2016 1 commit
    • S
      mpls: allow routes on ipgre devices · 0d227a86
      Simon Horman committed
      This appears to be necessary and sufficient to provide
      MPLS in GRE (RFC4023) support.
      
      This can be used by establishing an ipgre tunnel device
      and then routing MPLS over it.
      
      The following example will forward MPLS frames received with an outermost
      MPLS label 100 over tun1, a GRE tunnel. The forwarded packet will have the
      outermost MPLS LSE removed and two new LSEs added with labels 200
      (outermost) and 300 (next).
      
      ip link add name tun1 type gre remote 10.0.99.193 local 10.0.99.192 ttl 225
      ip link set up dev tun1
      ip addr add 10.0.98.192/24 dev tun1
      ip route sh
      
      echo 1 > /proc/sys/net/mpls/conf/eth0/input
      echo 101 > /proc/sys/net/mpls/platform_labels
      ip -f mpls route add 100 as 200/300 via inet 10.0.98.193
      ip -f mpls route sh
      
      Also remove unnecessary braces.
      Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
      Signed-off-by: Simon Horman <simon.horman@netronome.com>
      Acked-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d227a86
  19. 04 Jun, 2016 1 commit
  20. 09 Apr, 2016 1 commit
    • R
      mpls: find_outdev: check for err ptr in addition to NULL check · 94a57f1f
      Roopa Prabhu committed
      find_outdev calls inet{,6}_fib_lookup_dev() or dev_get_by_index() to
      find the output device. In case of an error, inet{,6}_fib_lookup_dev()
      returns error pointer and dev_get_by_index() returns NULL. But the function
      only checks for NULL and thus can end up calling dev_put on an ERR_PTR.
      This patch adds an additional check for err ptr after the NULL check.
      
      Before: Trying to add an mpls route with no oif from user, no available
      path to 10.1.1.8 and no default route:
      $ip -f mpls route add 100 as 200 via inet 10.1.1.8
      [  822.337195] BUG: unable to handle kernel NULL pointer dereference at
      00000000000003a3
      [  822.340033] IP: [<ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182
      [  822.340033] PGD 1db38067 PUD 1de9e067 PMD 0
      [  822.340033] Oops: 0000 [#1] SMP
      [  822.340033] Modules linked in:
      [  822.340033] CPU: 0 PID: 11148 Comm: ip Not tainted 4.5.0-rc7+ #54
      [  822.340033] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org
      04/01/2014
      [  822.340033] task: ffff88001db82580 ti: ffff88001dad4000 task.ti:
      ffff88001dad4000
      [  822.340033] RIP: 0010:[<ffffffff8148781e>]  [<ffffffff8148781e>]
      mpls_nh_assign_dev+0x10b/0x182
      [  822.340033] RSP: 0018:ffff88001dad7a88  EFLAGS: 00010282
      [  822.340033] RAX: ffffffffffffff9b RBX: ffffffffffffff9b RCX:
      0000000000000002
      [  822.340033] RDX: 00000000ffffff9b RSI: 0000000000000008 RDI:
      0000000000000000
      [  822.340033] RBP: ffff88001ddc9ea0 R08: ffff88001e9f1768 R09:
      0000000000000000
      [  822.340033] R10: ffff88001d9c1100 R11: ffff88001e3c89f0 R12:
      ffffffff8187e0c0
      [  822.340033] R13: ffffffff8187e0c0 R14: ffff88001ddc9e80 R15:
      0000000000000004
      [  822.340033] FS:  00007ff9ed798700(0000) GS:ffff88001fc00000(0000)
      knlGS:0000000000000000
      [  822.340033] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  822.340033] CR2: 00000000000003a3 CR3: 000000001de89000 CR4:
      00000000000006f0
      [  822.340033] Stack:
      [  822.340033]  0000000000000000 0000000100000000 0000000000000000
      0000000000000000
      [  822.340033]  0000000000000000 0801010a00000000 0000000000000000
      0000000000000000
      [  822.340033]  0000000000000004 ffffffff8148749b ffffffff8187e0c0
      000000000000001c
      [  822.340033] Call Trace:
      [  822.340033]  [<ffffffff8148749b>] ? mpls_rt_alloc+0x2b/0x3e
      [  822.340033]  [<ffffffff81488e66>] ? mpls_rtm_newroute+0x358/0x3e2
      [  822.340033]  [<ffffffff810e7bbc>] ? get_page+0x5/0xa
      [  822.340033]  [<ffffffff813b7d94>] ? rtnetlink_rcv_msg+0x17e/0x191
      [  822.340033]  [<ffffffff8111794e>] ? __kmalloc_track_caller+0x8c/0x9e
      [  822.340033]  [<ffffffff813c9393>] ?
      rht_key_hashfn.isra.20.constprop.57+0x14/0x1f
      [  822.340033]  [<ffffffff813b7c16>] ? __rtnl_unlock+0xc/0xc
      [  822.340033]  [<ffffffff813cb794>] ? netlink_rcv_skb+0x36/0x82
      [  822.340033]  [<ffffffff813b4507>] ? rtnetlink_rcv+0x1f/0x28
      [  822.340033]  [<ffffffff813cb2b1>] ? netlink_unicast+0x106/0x189
      [  822.340033]  [<ffffffff813cb5b3>] ? netlink_sendmsg+0x27f/0x2c8
      [  822.340033]  [<ffffffff81392ede>] ? sock_sendmsg_nosec+0x10/0x1b
      [  822.340033]  [<ffffffff81393df1>] ? ___sys_sendmsg+0x182/0x1e3
      [  822.340033]  [<ffffffff810e4f35>] ?
      __alloc_pages_nodemask+0x11c/0x1e4
      [  822.340033]  [<ffffffff8110619c>] ? PageAnon+0x5/0xd
      [  822.340033]  [<ffffffff811062fe>] ? __page_set_anon_rmap+0x45/0x52
      [  822.340033]  [<ffffffff810e7bbc>] ? get_page+0x5/0xa
      [  822.340033]  [<ffffffff810e85ab>] ? __lru_cache_add+0x1a/0x3a
      [  822.340033]  [<ffffffff81087ea9>] ? current_kernel_time64+0x9/0x30
      [  822.340033]  [<ffffffff813940c4>] ? __sys_sendmsg+0x3c/0x5a
      [  822.340033]  [<ffffffff8148f597>] ?
      entry_SYSCALL_64_fastpath+0x12/0x6a
      [  822.340033] Code: 83 08 04 00 00 65 ff 00 48 8b 3c 24 e8 40 7c f2 ff
      eb 13 48 c7 c3 9f ff ff ff eb 0f 89 ce e8 f1 ae f1 ff 48 89 c3 48 85 db
      74 15 <48> 8b 83 08 04 00 00 65 ff 08 48 81 fb 00 f0 ff ff 76 0d eb 07
      [  822.340033] RIP  [<ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182
      [  822.340033]  RSP <ffff88001dad7a88>
      [  822.340033] CR2: 00000000000003a3
      [  822.435363] ---[ end trace 98cc65e6f6b8bf11 ]---
      
      After patch:
      $ip -f mpls route add 100 as 200 via inet 10.1.1.8
      RTNETLINK answers: Network is unreachable
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Reported-by: David Miller <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      94a57f1f
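The bug class fixed by 94a57f1f is a lookup that can fail in two different ways: with NULL or with an errno encoded in the pointer. Below is a user-space sketch of that convention; the helpers mimic the kernel's `ERR_PTR`/`IS_ERR`/`PTR_ERR` but are local approximations, and `dev_sketch` and `assign_dev_sketch` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO_SKETCH 4095

/* Encode a negative errno in a pointer value. */
static void *err_ptr_sketch(long err)
{
	return (void *)(intptr_t)err;
}

/* Decode the errno from an error pointer. */
static long ptr_err_sketch(const void *p)
{
	return (long)(intptr_t)p;
}

/* Error pointers occupy the top MAX_ERRNO_SKETCH addresses. */
static int is_err_sketch(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

struct dev_sketch { int refcnt; };

/* Check both failure styles before touching the device. */
static int assign_dev_sketch(struct dev_sketch *dev)
{
	if (!dev)
		return -ENODEV;
	if (is_err_sketch(dev))
		return (int)ptr_err_sketch(dev);
	dev->refcnt--;   /* stands in for dropping the lookup reference */
	return 0;
}
```

With only the NULL check, the second branch would be skipped and the error value would be dereferenced as if it were a device.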
  21. 12 Dec, 2015 4 commits
    • R
      mpls: make via address optional for multipath routes · f20367df
      Robert Shearman committed
      The via address is optional for a single path route, yet is mandatory
      when the multipath attribute is used:
      
        # ip -f mpls route add 100 dev lo
        # ip -f mpls route add 101 nexthop dev lo
        RTNETLINK answers: Invalid argument
      
      Make them consistent by making the via address optional when the
      RTA_MULTIPATH attribute is being parsed so that both forms of
      specifying the route work.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f20367df
    • R
      mpls: fix out-of-bounds access when via address not specified · eb7809f0
      Robert Shearman committed
      When a via address isn't specified, the via table is left initialised
      to 0 (NEIGH_ARP_TABLE), and the via address length also left
      initialised to 0. This results in a via address array of length 0
      being allocated (contiguous with route and nexthop array), meaning
      that when a packet is sent using neigh_xmit the neighbour lookup and
      creation will cause an out-of-bounds access when accessing the 4 bytes
      of the IPv4 address it assumes it has been given a pointer to.
      
      This could be fixed by allocating the 4 bytes of via address necessary
      and leaving it as all zeroes. However, it seems wrong to me to use an
      ipv4 nexthop (including possibly ARPing for 0.0.0.0) when the user
      didn't specify to do so.
      
      Instead, set the via address table to NEIGH_NR_TABLES to signify it
      hasn't been specified and use this at forwarding time to signify a
      neigh_xmit using an L2 address consisting of the device address. This
      mechanism is the same as that used for both ARP and ND for loopback
      interfaces and those flagged as no-arp, which are all we can really
      support in this case.
      
      Fixes: cf4b24f0 ("mpls: reduce memory usage of routes")
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eb7809f0
    • R
      mpls: don't dump RTA_VIA attribute if not specified · 72dcac96
      Robert Shearman committed
      The problem seen is that when adding a route with a nexthop with no
      via address specified, iproute2 generates bogus output:
      
        # ip -f mpls route add 100 dev lo
        # ip -f mpls route list
        100 via inet 0.0.8.0 dev lo
      
      The reason for this is that the kernel generates an RTA_VIA attribute
      with the family set to AF_INET, but the via address data having zero
      length. The cause of family being AF_INET is that on route insert
      cfg->rc_via_table is left set to 0, which just happens to be
      NEIGH_ARP_TABLE which is then translated into AF_INET.
      
      iproute2 doesn't validate the length prior to printing and so prints
      garbage. Although it could be fixed to do the validation, I would
      argue that AF_INET addresses should always be exactly 4 bytes so the
      kernel is really giving userspace bogus data.
      
      Therefore, avoid generating the RTA_VIA attribute when dumping the
      route if the via address wasn't specified on add/modify. This is
      indicated by NEIGH_ARP_TABLE and a zero via address length - if the
      user specified a via address the address length would have been
      validated such that it was 4 bytes. Although this is a change in
      behaviour that is visible to userspace, I believe that what was
      generated before was invalid and as such userspace wouldn't be
      expecting it.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      72dcac96
    • R
      mpls: validate L2 via address length · a3e948e8
      Robert Shearman committed
      If an L2 via address for an mpls nexthop is specified, the length of
      the L2 address must match that expected by the output device,
      otherwise it could access memory beyond the end of the via address
      buffer in the route.
      
      This check was present prior to commit f8efb73c ("mpls: multipath
      route support"), but got lost in the refactoring, so add it back,
      applying it to all nexthops in multipath routes.
      
      Fixes: f8efb73c ("mpls: multipath route support")
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3e948e8
  22. 04 Dec, 2015 1 commit
    • R
      mpls: support for dead routes · c89359a4
      Roopa Prabhu committed
      Adds support for RTNH_F_DEAD and RTNH_F_LINKDOWN flags on mpls
      routes due to link events. Also adds code to ignore dead
      routes during route selection.
      
      Unlike ip routes, mpls routes are not deleted when the route goes
      dead. This is current mpls behaviour and this patch does not change
      that. With this patch however, routes will be marked dead.
      dead routes are not notified to userspace (this is consistent with ipv4
      routes).
      
      dead routes:
      -----------
      $ip -f mpls route show
      100
          nexthop as to 200 via inet 10.1.1.2  dev swp1
          nexthop as to 700 via inet 10.1.1.6  dev swp2
      
      $ip link set dev swp1 down
      
      $ip link show dev swp1
      4: swp1: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN mode
      DEFAULT group default qlen 1000
          link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff
      
      $ip -f mpls route show
      100
          nexthop as to 200 via inet 10.1.1.2  dev swp1 dead linkdown
          nexthop as to 700 via inet 10.1.1.6  dev swp2
      
      linkdown routes:
      ----------------
      $ip -f mpls route show
      100
          nexthop as to 200 via inet 10.1.1.2  dev swp1
          nexthop as to 700 via inet 10.1.1.6  dev swp2
      
      $ip link show dev swp1
      4: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
      state UP mode DEFAULT group default qlen 1000
          link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff
      
      /* carrier goes down */
      $ip link show dev swp1
      4: swp1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast
      state DOWN mode DEFAULT group default qlen 1000
          link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff
      
      $ip -f mpls route show
      100
          nexthop as to 200 via inet 10.1.1.2  dev swp1 linkdown
          nexthop as to 700 via inet 10.1.1.6  dev swp2
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Acked-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c89359a4
  23. 28 Oct, 2015 2 commits
    • R
      mpls: reduce memory usage of routes · cf4b24f0
      Robert Shearman committed
      Nexthops for MPLS routes have a via address field sized for the
      largest via address that is expected, which is 32 bytes. This means
      that in the most common case of having ipv4 via addresses, 28 bytes of
      memory more than required are used per nexthop. In the other common
      case of an ipv6 nexthop then 16 bytes more than required are
      used. With large numbers of MPLS routes this extra memory usage could
      start to become significant.
      
      To avoid allocating memory for a maximum-length via address when not
      all of it is required, and to keep iterating over nexthops simple,
      the via addresses are now stored in the same memory block as the
      route and nexthops, in an array placed after the end of the array of
      nexthops. New accessors are provided to retrieve a pointer to the
      via address.
      
      To allow O(1) access without storing a pointer or offset per
      nexthop, the via address slot for each nexthop is sized according to
      the maximum via address length of any nexthop in the route. That
      length is stored in a new route field, rt_max_alen, which occupies an
      existing hole in struct mpls_route and so does not increase the size
      of the structure. Each via address is aligned to VIA_ALEN_ALIGN to
      account for architectures that don't allow unaligned accesses.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
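      The layout described in commit cf4b24f0 can be sketched in user-space C
      as follows. This is only an illustration under stated assumptions, not
      the kernel's code: the demo_* names, the fields, and VIA_ALIGN are
      hypothetical stand-ins for struct mpls_route, struct mpls_nh and
      VIA_ALEN_ALIGN. It shows one block holding the route, the nexthop
      array, and then equally sized, aligned via slots, so that a via
      pointer is computed by arithmetic alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* All names below are hypothetical; they are not the kernel's definitions. */
#define VIA_ALIGN sizeof(unsigned long)
#define ALIGN_UP(x, a) ((((x) + (a) - 1) / (a)) * (a))

struct demo_nh {
    uint32_t label;
    uint8_t  via_alen;      /* bytes of the via slot this nexthop really uses */
};

struct demo_route {
    uint8_t  num_nh;
    uint8_t  max_alen;      /* largest via_alen of any nexthop in the route */
    struct demo_nh nh[];    /* nexthop array; the via array follows it */
};

/* Size of one via slot: the route-wide maximum, rounded up for alignment. */
static size_t via_slot(uint8_t max_alen)
{
    return ALIGN_UP((size_t)max_alen, VIA_ALIGN);
}

/* Offset of the first via slot: just past the nexthop array, aligned. */
static size_t via_base(uint8_t num_nh)
{
    size_t end_of_nh = sizeof(struct demo_route) +
                       (size_t)num_nh * sizeof(struct demo_nh);
    return ALIGN_UP(end_of_nh, VIA_ALIGN);
}

/* Total allocation: header + nexthops + one via slot per nexthop. */
static size_t route_size(uint8_t num_nh, uint8_t max_alen)
{
    return via_base(num_nh) + (size_t)num_nh * via_slot(max_alen);
}

static struct demo_route *route_alloc(uint8_t num_nh, uint8_t max_alen)
{
    struct demo_route *rt = calloc(1, route_size(num_nh, max_alen));

    if (rt) {
        rt->num_nh = num_nh;
        rt->max_alen = max_alen;
    }
    return rt;
}

/* O(1) accessor: no per-nexthop pointer or offset is stored. */
static uint8_t *nh_via(struct demo_route *rt, unsigned int idx)
{
    assert(idx < rt->num_nh);
    return (uint8_t *)rt + via_base(rt->num_nh) +
           (size_t)idx * via_slot(rt->max_alen);
}
```

      With this layout, a route whose nexthops all use IPv4 via addresses
      calls route_alloc(n, 4) and pays for one aligned 4-byte slot per
      nexthop rather than a fixed 32-byte field.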
    • R
      mpls: fix forwarding using v4/v6 explicit null · b4e04fc7
      Robert Shearman committed
      Fill in the via address length for the predefined IPv4 and IPv6
      explicit-null label routes.
      
      Fixes: f8efb73c ("mpls: multipath route support")
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 23 Oct, 2015 2 commits
    • R
      mpls: flow-based multipath selection · 1c78efa8
      Robert Shearman committed
      Change the selection of a multipath route to use a flow-based
      hash. This is more suitable for traffic sensitive to reordering
      within a flow (e.g. TCP, L2VPN), whilst still allowing a good
      distribution of traffic given enough flows.
      
      Selection of the path for a multipath route is done using a hash of:
      1. Label stack up to MAX_MP_SELECT_LABELS labels or up to and
         including entropy label, whichever is first.
      2. 3-tuple of (L3 src, L3 dst, proto) from IPv4/IPv6 header in MPLS
         payload, if present.
      
      Naturally, a 5-tuple hash using L4 information in addition would be
      possible and be better in some scenarios, but there is a tradeoff
      between looking deeper into the packet to achieve good distribution,
      and packet forwarding performance, and I have erred on the side of the
      latter as the default.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
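      The selection scheme described in commit 1c78efa8 can be sketched
      roughly as below. This is a simplified illustration, not the kernel
      implementation: struct demo_flow, DEMO_MAX_LABELS and the helper names
      are hypothetical, only an IPv4-sized 3-tuple is modelled, and an
      FNV-1a style mix is swapped in as the hash purely to keep the example
      self-contained. The value 7 for the entropy label indicator comes from
      RFC 6790; the label that follows it is the entropy label.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_LABELS 4u  /* hypothetical stand-in for MAX_MP_SELECT_LABELS */
#define DEMO_LABEL_ELI  7u  /* entropy label indicator (RFC 6790) */

/* Hypothetical, pre-parsed view of a packet; not a kernel structure. */
struct demo_flow {
    uint32_t labels[8];     /* label values, top of stack first */
    size_t   num_labels;
    int      has_ip;        /* nonzero if an IP header was found in the payload */
    uint32_t l3_src;
    uint32_t l3_dst;
    uint8_t  proto;
};

/* FNV-1a style byte mixing, used here only as a simple stand-in hash. */
static uint32_t mix(uint32_t h, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 16777619u;
    }
    return h;
}

static uint32_t flow_hash(const struct demo_flow *f)
{
    uint32_t h = 2166136261u;
    int prev_was_eli = 0;

    /* 1. Label stack: stop at the label limit or just after the entropy label. */
    for (size_t i = 0; i < f->num_labels && i < DEMO_MAX_LABELS; i++) {
        uint32_t label = f->labels[i];

        h = mix(h, label);
        if (prev_was_eli)
            break;          /* this label was the entropy label */
        prev_was_eli = (label == DEMO_LABEL_ELI);
    }

    /* 2. L3 3-tuple from the payload, if present. No L4 ports are used. */
    if (f->has_ip) {
        h = mix(h, f->l3_src);
        h = mix(h, f->l3_dst);
        h = mix(h, f->proto);
    }
    return h;
}

/* Every packet of a flow hashes alike, so it always takes the same nexthop. */
static unsigned int select_nexthop(const struct demo_flow *f, unsigned int num_nh)
{
    return num_nh ? flow_hash(f) % num_nh : 0;
}
```

      Because the hash depends only on fields that are constant within a
      flow, packets of one flow are never spread across nexthops, which is
      what avoids reordering.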
    • R
      mpls: multipath route support · f8efb73c
      Roopa Prabhu committed
      This patch adds support for MPLS multipath routes.
      
      It includes the following changes to support multipath:
      - splits struct mpls_route into 'struct mpls_route + struct mpls_nh'

      - 'struct mpls_nh' represents an MPLS nexthop label forwarding entry
      
      - moves mpls route and nexthop structures into internal.h
      
      - An mpls_route can point to multiple mpls_nh structs

      - the nexthops are maintained as an array (similar to ipv4 fib)
      
      - In the process of restructuring, this patch also consistently changes
        all labels to u8
      
      - Adds support to parse/fill the RTA_MULTIPATH netlink attribute for
        multipath routes, similar to ipv4/v6 fib

      - In this patch, the multipath route nexthop selection algorithm
        simply returns the first nexthop. It is replaced by a hash-based
        algorithm from Robert Shearman in the next patch
      
      - mpls_route_update cleanup: remove 'dev' handling in mpls_route_update.
        Although mpls_route_update was implemented to update based on dev, it
        was never used that way, and the dev handling gets tricky with
        multiple nexthops because there is no single nexthop's dev to match
        against. So this patch removes the unused 'dev' handling in
        mpls_route_update.
      
      - dead route/path handling will be implemented in a subsequent patch
      
      Example:
      
      $ip -f mpls route add 100 nexthop as 200 via inet 10.1.1.2 dev swp1 \
                      nexthop as 700 via inet 10.1.1.6 dev swp2 \
                      nexthop as 800 via inet 40.1.1.2 dev swp3
      
      $ip  -f mpls route show
      100
              nexthop as to 200 via inet 10.1.1.2  dev swp1
              nexthop as to 700 via inet 10.1.1.6  dev swp2
              nexthop as to 800 via inet 40.1.1.2  dev swp3
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Acked-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 01 Sep, 2015 1 commit
  26. 10 Aug, 2015 1 commit
    • R
      mpls: Enforce payload type of traffic sent using explicit NULL · 118d5234
      Robert Shearman committed
      RFC 4182 s2 states that if an IPv4 Explicit NULL label is the only
      label on the stack, then after popping the resulting packet must be
      treated as an IPv4 packet and forwarded based on the IPv4 header. The
      same is true for IPv6 Explicit NULL with an IPv6 packet following.
      
      Therefore, when installing the IPv4/IPv6 Explicit NULL label routes,
      add an attribute that specifies the expected payload type for use at
      forwarding time for determining the type of the encapsulated packet
      instead of inspecting the first nibble of the packet.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
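      The behaviour described in commit 118d5234 amounts to preferring a
      per-route expected payload type over sniffing the packet. A minimal
      sketch, assuming a hypothetical enum and helper name (these are not
      the kernel's identifiers): when the route carries an expected type it
      is used as-is, and only routes without one fall back to the first
      nibble of the payload.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical payload-type attribute; names are illustrative only. */
enum demo_payload {
    DEMO_PT_UNSPEC = 0,     /* no expectation recorded on the route */
    DEMO_PT_IPV4   = 4,
    DEMO_PT_IPV6   = 6
};

/*
 * Decide how to treat the packet left after popping the last label.
 * Returns DEMO_PT_UNSPEC when the type cannot be determined.
 */
static enum demo_payload resolve_payload(enum demo_payload expected,
                                         const uint8_t *payload, size_t len)
{
    /* Explicit NULL routes carry an expected type: trust it. */
    if (expected != DEMO_PT_UNSPEC)
        return expected;

    /* Otherwise fall back to the IP version nibble of the payload. */
    if (len == 0)
        return DEMO_PT_UNSPEC;

    switch (payload[0] >> 4) {
    case 4:
        return DEMO_PT_IPV4;
    case 6:
        return DEMO_PT_IPV6;
    default:
        return DEMO_PT_UNSPEC;
    }
}
```

      In this sketch a packet arriving on the IPv4 Explicit NULL route is
      handled as IPv4 even if its first nibble says otherwise, which matches
      the RFC 4182 rule quoted in the commit message.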