1. 16 9月, 2015 5 次提交
    • M
      ipv6: Replace spinlock with seqlock and rcu in ip6_tunnel · 70da5b5c
      Martin KaFai Lau 提交于
      This patch uses a seqlock to ensure consistency between idst->dst and
      idst->cookie.  It also makes dst freeing from fib tree to undergo a
      rcu grace period.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70da5b5c
    • M
      ipv6: Avoid double dst_free · 8e3d5be7
      Martin KaFai Lau 提交于
      It is a prep work to get dst freeing from fib tree undergo
      a rcu grace period.
      
      The following is a common paradigm:
      if (ip6_del_rt(rt))
      	dst_free(rt)
      
      which means, if rt cannot be deleted from the fib tree, dst_free(rt) now.
      1. We don't know the ip6_del_rt(rt) failure is because it
         was not managed by fib tree (e.g. DST_NOCACHE) or it had already been
         removed from the fib tree.
      2. If rt had been managed by the fib tree, ip6_del_rt(rt) failure means
         dst_free(rt) has been called already.  A second
         dst_free(rt) is not always obviously safe.  The rt may have
         been destroyed already.
      3. If rt is a DST_NOCACHE, dst_free(rt) should not be called.
      4. It is a stopper to make dst freeing from fib tree undergo a
         rcu grace period.
      
      This patch is to use a DST_NOCACHE flag to indicate a rt is
      not managed by the fib tree.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e3d5be7
    • M
      ipv6: Fix dst_entry refcnt bugs in ip6_tunnel · cdf3464e
      Martin KaFai Lau 提交于
      Problems in the current dst_entry cache in the ip6_tunnel:
      
      1. ip6_tnl_dst_set is racy.  There is no lock to protect it:
         - One major problem is that the dst refcnt gets messed up. F.e.
           the same dst_cache can be released multiple times and then
           triggering the infamous dst refcnt < 0 warning message.
         - Another issue is the inconsistency between dst_cache and
           dst_cookie.
      
         It can be reproduced by adding and removing the ip6gre tunnel
         while running a super_netperf TCP_CRR test.
      
      2. ip6_tnl_dst_get does not take the dst refcnt before returning
         the dst.
      
      This patch:
      1. Create a percpu dst_entry cache in ip6_tnl
      2. Use a spinlock to protect the dst_cache operations
      3. ip6_tnl_dst_get always takes the dst refcnt before returning
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cdf3464e
    • M
      ipv6: Rename the dst_cache helper functions in ip6_tunnel · f230d1e8
      Martin KaFai Lau 提交于
      It is a prep work to fix the dst_entry refcnt bugs in
      ip6_tunnel.
      
      This patch rename:
      1. ip6_tnl_dst_check() to ip6_tnl_dst_get() to better
         reflect that it will take a dst refcnt in the next patch.
      2. ip6_tnl_dst_store() to ip6_tnl_dst_set() to have a more
         conventional name matching with ip6_tnl_dst_get().
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f230d1e8
    • M
      ipv6: Refactor common ip6gre_tunnel_init codes · a3c119d3
      Martin KaFai Lau 提交于
      It is a prep work to fix the dst_entry refcnt bugs in ip6_tunnel.
      
      This patch refactors some common init codes used by both
      ip6gre_tunnel_init and ip6gre_tap_init.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a3c119d3
  2. 10 9月, 2015 3 次提交
    • W
      ipv6: fix ifnullfree.cocci warnings · 52fe51f8
      Wu Fengguang 提交于
      net/ipv6/route.c:2946:3-8: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values.
      
       NULL check before some freeing functions is not needed.
      
       Based on checkpatch warning
       "kfree(NULL) is safe this check is probably not required"
       and kfreeaddr.cocci by Julia Lawall.
      
      Generated by: scripts/coccinelle/free/ifnullfree.cocci
      
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52fe51f8
    • P
      net: ipv6: use common fib_default_rule_pref · f53de1e9
      Phil Sutter 提交于
      This switches IPv6 policy routing to use the shared
      fib_default_rule_pref() function of IPv4 and DECnet. It is also used in
      multicast routing for IPv4 as well as IPv6.
      
      The motivation for this patch is a complaint about iproute2 behaving
      inconsistent between IPv4 and IPv6 when adding policy rules: Formerly,
      IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the
      assigned priority value was decreased with each rule added.
      
      Since then all users of the default_pref field have been converted to
      assign the generic function fib_default_rule_pref(), fib_nl_newrule()
      may just use it directly instead. Therefore get rid of the function
      pointer altogether and make fib_default_rule_pref() static, as it's not
      used outside fib_rules.c anymore.
      Signed-off-by: NPhil Sutter <phil@nwl.cc>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f53de1e9
    • R
      ipv6: fix multipath route replace error recovery · 6b9ea5a6
      Roopa Prabhu 提交于
      Problem:
      The ecmp route replace support for ipv6 in the kernel, deletes the
      existing ecmp route too early, ie when it installs the first nexthop.
      If there is an error in installing the subsequent nexthops, its too late
      to recover the already deleted existing route leaving the fib
      in an inconsistent state.
      
      This patch reduces the possibility of this by doing the following:
      a) Changes the existing multipath route add code to a two stage process:
        build rt6_infos + insert them
      	ip6_route_add rt6_info creation code is moved into
      	ip6_route_info_create.
      b) This ensures that most errors are caught during building rt6_infos
        and we fail early
      c) Separates multipath add and del code. Because add needs the special
        two stage mode in a) and delete essentially does not care.
      d) In any event if the code fails during inserting a route again, a
        warning is printed (This should be unlikely)
      
      Before the patch:
      $ip -6 route show
      3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
      3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
      3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024
      
      /* Try replacing the route with a duplicate nexthop */
      $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
      fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
      swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
      RTNETLINK answers: File exists
      
      $ip -6 route show
      /* previously added ecmp route 3000:1000:1000:1000::2 dissappears from
       * kernel */
      
      After the patch:
      $ip -6 route show
      3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
      3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
      3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024
      
      /* Try replacing the route with a duplicate nexthop */
      $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
      fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
      swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
      RTNETLINK answers: File exists
      
      $ip -6 route show
      3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
      3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
      3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024
      
      Fixes: 27596472 ("ipv6: fix ECMP route replacement")
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Reviewed-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b9ea5a6
  3. 06 9月, 2015 1 次提交
  4. 03 9月, 2015 2 次提交
    • D
      netfilter: nf_dup{4, 6}: fix build error when nf_conntrack disabled · a82b0e63
      Daniel Borkmann 提交于
      While testing various Kconfig options on another issue, I found that
      the following one triggers as well on allmodconfig and nf_conntrack
      disabled:
      
        net/ipv4/netfilter/nf_dup_ipv4.c: In function ‘nf_dup_ipv4’:
        net/ipv4/netfilter/nf_dup_ipv4.c:72:20: error: ‘nf_skb_duplicated’ undeclared (first use in this function)
          if (this_cpu_read(nf_skb_duplicated))
        [...]
        net/ipv6/netfilter/nf_dup_ipv6.c: In function ‘nf_dup_ipv6’:
        net/ipv6/netfilter/nf_dup_ipv6.c:66:20: error: ‘nf_skb_duplicated’ undeclared (first use in this function)
          if (this_cpu_read(nf_skb_duplicated))
      
      Fix it by including directly the header where it is defined.
      
      Fixes: bbde9fc1 ("netfilter: factor out packet duplication for IPv4/IPv6")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a82b0e63
    • D
      ipv6: fix exthdrs offload registration in out_rt path · e41b0bed
      Daniel Borkmann 提交于
      We previously register IPPROTO_ROUTING offload under inet6_add_offload(),
      but in error path, we try to unregister it with inet_del_offload(). This
      doesn't seem correct, it should actually be inet6_del_offload(), also
      ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
      also uses rthdr_offload twice), but it got removed entirely later on.
      
      Fixes: 3336288a ("ipv6: Switch to using new offload infrastructure.")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e41b0bed
  5. 01 9月, 2015 5 次提交
  6. 31 8月, 2015 2 次提交
    • R
      net: Optimize snmp stat aggregation by walking all the percpu data at once · a3a77372
      Raghavendra K T 提交于
      Docker container creation linearly increased from around 1.6 sec to 7.5 sec
      (at 1000 containers) and perf data showed 50% ovehead in snmp_fold_field.
      
      reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks
      through per cpu data of an item (iteratively for around 36 items).
      
      idea: This patch tries to aggregate the statistics by going through
      all the items of each cpu sequentially which is reducing cache
      misses.
      
      Docker creation got faster by more than 2x after the patch.
      
      Result:
                             Before           After
      Docker creation time   6.836s           3.25s
      cache miss             2.7%             1.41%
      
      perf before:
          50.73%  docker           [kernel.kallsyms]       [k] snmp_fold_field
           9.07%  swapper          [kernel.kallsyms]       [k] snooze_loop
           3.49%  docker           [kernel.kallsyms]       [k] veth_stats_one
           2.85%  swapper          [kernel.kallsyms]       [k] _raw_spin_lock
      
      perf after:
          10.57%  docker           docker                [.] scanblock
           8.37%  swapper          [kernel.kallsyms]     [k] snooze_loop
           6.91%  docker           [kernel.kallsyms]     [k] snmp_get_cpu_field
           6.67%  docker           [kernel.kallsyms]     [k] veth_stats_one
      
      changes/ideas suggested:
      Using buffer in stack (Eric), Usage of memset (David), Using memcpy in
      place of unaligned_put (Joe).
      Signed-off-by: NRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a3a77372
    • M
      net/ipv6: Export addrconf_ifid_eui48 · 399e6f95
      Matan Barak 提交于
      For loopback purposes, RoCE devices should have a default GID in the
      port GID table, even when the interface is down. In order to do so,
      we use the IPv6 link local address which would have been genenrated
      for the related Ethernet netdevice when it goes up as a default GID.
      
      addrconf_ifid_eui48 is used to gernerate this address, export it.
      Signed-off-by: NMatan Barak <matanb@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      399e6f95
  7. 30 8月, 2015 2 次提交
  8. 29 8月, 2015 2 次提交
  9. 28 8月, 2015 1 次提交
  10. 27 8月, 2015 1 次提交
  11. 26 8月, 2015 3 次提交
  12. 25 8月, 2015 2 次提交
  13. 22 8月, 2015 1 次提交
    • P
      netfilter: nf_dup: fix sparse warnings · 59e26423
      Pablo Neira Ayuso 提交于
      >> net/ipv4/netfilter/nft_dup_ipv4.c:29:37: sparse: incorrect type in initializer (different base types)
         net/ipv4/netfilter/nft_dup_ipv4.c:29:37:    expected restricted __be32 [user type] s_addr
         net/ipv4/netfilter/nft_dup_ipv4.c:29:37:    got unsigned int [unsigned] <noident>
      
      >> net/ipv6/netfilter/nf_dup_ipv6.c:48:23: sparse: incorrect type in assignment (different base types)
         net/ipv6/netfilter/nf_dup_ipv6.c:48:23:    expected restricted __be32 [addressable] [assigned] [usertype] flowlabel
         net/ipv6/netfilter/nf_dup_ipv6.c:48:23:    got int
      Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      59e26423
  14. 21 8月, 2015 4 次提交
  15. 18 8月, 2015 6 次提交
    • T
      net: Identifier Locator Addressing module · 65d7ab8d
      Tom Herbert 提交于
      Adding new module name ila. This implements ILA translation. Light
      weight tunnel redirection is used to perform the translation in
      the data path. This is configured by the "ip -6 route" command
      using the "encap ila <locator>" option, where <locator> is the
      value to set in destination locator of the packet. e.g.
      
      ip -6 route add 3333:0:0:1:5555:0:1:0/128 \
            encap ila 2001:0:0:1 via 2401:db00:20:911a:face:0:25:0
      
      Sets a route where 3333:0:0:1 will be overwritten by
      2001:0:0:1 on output.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      65d7ab8d
    • T
      net: Change pseudohdr argument of inet_proto_csum_replace* to be a bool · 4b048d6d
      Tom Herbert 提交于
      inet_proto_csum_replace4,2,16 take a pseudohdr argument which indicates
      the checksum field carries a pseudo header. This argument should be a
      boolean instead of an int.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b048d6d
    • T
      lwt: Add support to redirect dst.input · 25368623
      Tom Herbert 提交于
      This patch adds the capability to redirect dst input in the same way
      that dst output is redirected by LWT.
      
      Also, save the original dst.input and and dst.out when setting up
      lwtunnel redirection. These can be called by the client as a pass-
      through.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      25368623
    • D
      netfilter: nf_conntrack: add efficient mark to zone mapping · 5e8018fc
      Daniel Borkmann 提交于
      This work adds the possibility of deriving the zone id from the skb->mark
      field in a scalable manner. This allows for having only a single template
      serving hundreds/thousands of different zones, for example, instead of the
      need to have one match for each zone as an extra CT jump target.
      
      Note that we'd need to have this information attached to the template as at
      the time when we're trying to lookup a possible ct object, we already need
      to know zone information for a possible match when going into
      __nf_conntrack_find_get(). This work provides a minimal implementation for
      a possible mapping.
      
      In order to not add/expose an extra ct->status bit, the zone structure has
      been extended to carry a flag for deriving the mark.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      5e8018fc
    • D
      netfilter: nf_conntrack: add direction support for zones · deedb590
      Daniel Borkmann 提交于
      This work adds a direction parameter to netfilter zones, so identity
      separation can be performed only in original/reply or both directions
      (default). This basically opens up the possibility of doing NAT with
      conflicting IP address/port tuples from multiple, isolated tenants
      on a host (e.g. from a netns) without requiring each tenant to NAT
      twice resp. to use its own dedicated IP address to SNAT to, meaning
      overlapping tuples can be made unique with the zone identifier in
      original direction, where the NAT engine will then allocate a unique
      tuple in the commonly shared default zone for the reply direction.
      In some restricted, local DNAT cases, also port redirection could be
      used for making the reply traffic unique w/o requiring SNAT.
      
      The consensus we've reached and discussed at NFWS and since the initial
      implementation [1] was to directly integrate the direction meta data
      into the existing zones infrastructure, as opposed to the ct->mark
      approach we proposed initially.
      
      As we pass the nf_conntrack_zone object directly around, we don't have
      to touch all call-sites, but only those, that contain equality checks
      of zones. Thus, based on the current direction (original or reply),
      we either return the actual id, or the default NF_CT_DEFAULT_ZONE_ID.
      CT expectations are direction-agnostic entities when expectations are
      being compared among themselves, so we can only use the identifier
      in this case.
      
      Note that zone identifiers can not be included into the hash mix
      anymore as they don't contain a "stable" value that would be equal
      for both directions at all times, f.e. if only zone->id would
      unconditionally be xor'ed into the table slot hash, then replies won't
      find the corresponding conntracking entry anymore.
      
      If no particular direction is specified when configuring zones, the
      behaviour is exactly as we expect currently (both directions).
      
      Support has been added for the CT netlink interface as well as the
      x_tables raw CT target, which both already offer existing interfaces
      to user space for the configuration of zones.
      
      Below a minimal, simplified collision example (script in [2]) with
      netperf sessions:
      
        +--- tenant-1 ---+   mark := 1
        |    netperf     |--+
        +----------------+  |                CT zone := mark [ORIGINAL]
         [ip,sport] := X   +--------------+  +--- gateway ---+
                           | mark routing |--|     SNAT      |-- ... +
                           +--------------+  +---------------+       |
        +--- tenant-2 ---+  |                                     ~~~|~~~
        |    netperf     |--+                +-----------+           |
        +----------------+   mark := 2       | netserver |------ ... +
         [ip,sport] := X                     +-----------+
                                              [ip,port] := Y
      On the gateway netns, example:
      
        iptables -t raw -A PREROUTING -j CT --zone mark --zone-dir ORIGINAL
        iptables -t nat -A POSTROUTING -o <dev> -j SNAT --to-source <ip> --random-fully
      
        iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark
        iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark
      
      conntrack dump from gateway netns:
      
        netperf -H 10.1.1.2 -t TCP_STREAM -l60 -p12865,5555 from each tenant netns
      
        tcp 6 431995 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=1
                                 src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=1024
                     [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1
      
        tcp 6 431994 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=2
                                 src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=5555
                     [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=1
      
        tcp 6 299 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=39438 dport=33768 zone-orig=1
                              src=10.1.1.2 dst=10.1.1.1 sport=33768 dport=39438
                     [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1
      
        tcp 6 300 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=32889 dport=40206 zone-orig=2
                              src=10.1.1.2 dst=10.1.1.1 sport=40206 dport=32889
                     [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=2
      
      Taking this further, test script in [2] creates 200 tenants and runs
      original-tuple colliding netperf sessions each. A conntrack -L dump in
      the gateway netns also confirms 200 overlapping entries, all in ESTABLISHED
      state as expected.
      
      I also did run various other tests with some permutations of the script,
      to mention some: SNAT in random/random-fully/persistent mode, no zones (no
      overlaps), static zones (original, reply, both directions), etc.
      
        [1] http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/57412/
        [2] https://paste.fedoraproject.org/242835/65657871/Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      deedb590
    • I
      ipv6: trivial whitespace fix · ec120da6
      Ian Morris 提交于
      Change brace placement to be in line with coding standards
      Signed-off-by: NIan Morris <ipm@chirality.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ec120da6