1. 04 Apr, 2017 1 commit
    • M
      can: initial support for network namespaces · 8e8cda6d
      Mario Kicherer committed
      This patch adds initial support for network namespaces. The changes only
      enable support in the CAN raw, proc and af_can code. GW and BCM still
      have their checks that ensure that they are used only from the main
      namespace.
      
      The patch boils down to moving the global structures, i.e. the global
      filter list and their /proc stats, into a per-namespace structure and passing
      around the corresponding "struct net" in a lot of different places.
      
      Changes since v1:
       - rebased on current HEAD (2bfe01ef)
       - fixed overlong line
      Signed-off-by: Mario Kicherer <dev@kicherer.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      8e8cda6d
  2. 25 Mar, 2017 1 commit
    • S
      net: Add sysctl to toggle early demux for tcp and udp · dddb64bc
      subashab@codeaurora.org committed
      Certain systems process a significant unconnected UDP workload.
      It would be preferable to disable UDP early demux for those systems
      and enable it for TCP only.
      
      By disabling UDP demux, we see these slight gains on an ARM64 system-
      782 -> 788Mbps unconnected single stream UDPv4
      633 -> 654Mbps unconnected UDPv4 different sources
      
      The performance impact can change based on CPU architecture and cache
      sizes. There will not be much difference if the entire UDP hash table
      is in cache.
      
      Both sysctls are enabled by default to preserve existing behavior.
      
      v1->v2: Change function pointer instead of adding conditional as
      suggested by Stephen.
      
      v2->v3: Read once in callers to avoid issues due to compiler
      optimizations. Also update commit message with the tests.
      
      v3->v4: Store and use read once result instead of querying pointer
      again incorrectly.
      
      v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}
      Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dddb64bc
  3. 22 Mar, 2017 1 commit
    • N
      net: ipv4: add support for ECMP hash policy choice · bf4e0a3d
      Nikolay Aleksandrov committed
      This patch adds support for ECMP hash policy choice via a new sysctl
      called fib_multipath_hash_policy and also adds support for L4 hashes.
      The current values for fib_multipath_hash_policy are:
       0 - layer 3 (default)
       1 - layer 4
      If there's an skb hash already set and it matches the chosen policy then it
      will be used instead of being calculated (currently only for L4).
      In L3 mode we always calculate the hash due to the ICMP error special
      case. The flow dissector's field consistentification should handle the
      address order, so we can remove the address reversals.
      If the skb is provided we always use it for the hash calculation,
      otherwise we fall back to fl4; that is, if skb is NULL, fl4 has to be set.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf4e0a3d
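The difference between the two policy values can be pictured with a small user-space sketch. It is an illustration only: `struct flow_keys_sketch`, `mix32` and `multipath_hash_sketch` are hypothetical names, and the mixing function is a stand-in rather than the kernel's flow dissector hash.

```c
#include <stdint.h>

/* Hypothetical flow description; not a kernel structure. */
struct flow_keys_sketch {
	uint32_t saddr, daddr;	/* IPv4 addresses */
	uint16_t sport, dport;	/* L4 ports */
	uint8_t proto;		/* IP protocol number */
};

/* Stand-in integer mixer (assumption); any reasonable hash would do. */
static uint32_t mix32(uint32_t h, uint32_t v)
{
	h ^= v;
	h *= 0x9e3779b1u;
	return h ^ (h >> 15);
}

/*
 * policy 0: hash the addresses only (layer 3).
 * policy 1: additionally hash ports and protocol (layer 4).
 */
static uint32_t multipath_hash_sketch(const struct flow_keys_sketch *k,
				      int policy)
{
	uint32_t h = mix32(0, k->saddr);

	h = mix32(h, k->daddr);
	if (policy == 1) {
		h = mix32(h, ((uint32_t)k->sport << 16) | k->dport);
		h = mix32(h, k->proto);
	}
	return h;
}
```

With policy 0, two connections between the same pair of hosts always hash alike and so take the same path; with policy 1 they can be spread over different next hops.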
  4. 17 Mar, 2017 1 commit
    • S
      tcp: remove tcp_tw_recycle · 4396e461
      Soheil Hassas Yeganeh committed
      The tcp_tw_recycle was already broken for connections
      behind NAT, since the per-destination timestamp is not
      monotonically increasing for multiple machines behind
      a single destination address.
      
      After the randomization of TCP timestamp offsets
      in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets
      for each connection), the tcp_tw_recycle is broken for all
      types of connections for the same reason: the timestamps
      received from a single machine are no longer monotonically
      increasing.
      
      Remove tcp_tw_recycle, since it is not functional. Also, remove
      the PAWSPassive SNMP counter since it is only used for
      tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req
      since the strict argument is only set when tcp_tw_recycle is
      enabled.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Cc: Lutz Vieweg <lvml@5t9.de>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4396e461
  5. 14 Mar, 2017 2 commits
    • R
      mpls: allow TTL propagation from IP packets to be configured · a59166e4
      Robert Shearman committed
      Allow TTL propagation from IP packets to MPLS packets to be
      configured. Add a new optional LWT attribute, MPLS_IPTUNNEL_TTL, which
      allows the TTL to be set in the resulting MPLS packet, with the value
      of 0 having the semantics of enabling propagation of the TTL from the
      IP header (i.e. non-zero values disable propagation).
      
      Also allow the configuration to be overridden globally by reusing the
      same sysctl to control whether the TTL is propagated from IP packets
      into the MPLS header. If the per-LWT attribute is set then it
      overrides the global configuration. If the TTL isn't propagated then a
      default TTL value is used which can be configured via a new sysctl,
      "net.mpls.default_ttl". This is kept separate from the configuration
      of whether IP TTL propagation is enabled as it can be used in the
      future when non-IP payloads are supported (i.e. where there is no
      payload TTL that can be propagated).
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Tested-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a59166e4
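The precedence rules in this commit message (per-LWT attribute first, then the global sysctl, then the default TTL) can be written down as a small decision function. This is a sketch under stated assumptions: `mpls_encap_ttl_sketch` and its parameters are hypothetical names, and any TTL decrement done while forwarding is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Pick the TTL for the MPLS header pushed onto an IP packet.
 * attr_present/attr_ttl model the optional MPLS_IPTUNNEL_TTL attribute,
 * global_propagate models the propagation sysctl and default_ttl models
 * net.mpls.default_ttl. All parameter names are assumptions.
 */
static uint8_t mpls_encap_ttl_sketch(bool attr_present, uint8_t attr_ttl,
				     bool global_propagate,
				     uint8_t default_ttl, uint8_t ip_ttl)
{
	if (attr_present)
		/* 0 means "propagate"; non-zero is the TTL to use. */
		return attr_ttl == 0 ? ip_ttl : attr_ttl;

	/* No per-LWT attribute: the global configuration decides. */
	return global_propagate ? ip_ttl : default_ttl;
}
```

For example, a route carrying the attribute with value 0 propagates the IP TTL even when the global sysctl has propagation switched off.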
    • R
      mpls: allow TTL propagation to IP packets to be configured · 5b441ac8
      Robert Shearman committed
      Provide the ability to control on a per-route basis whether the TTL
      value from an MPLS packet is propagated to an IPv4/IPv6 packet when
      the last label is popped as per the theoretical model in RFC 3443
      through a new route attribute, RTA_TTL_PROPAGATE which can be 0 to
      mean disable propagation and 1 to mean enable propagation.
      
      In order to provide the ability to change the behaviour for packets
      arriving with IPv4/IPv6 Explicit Null labels and to provide an easy
      way for a user to change the behaviour for all existing routes without
      having to reprogram them, a global knob is provided. This is done
      through the addition of a new per-namespace sysctl,
      "net.mpls.ip_ttl_propagate", which defaults to enabled. If the
      per-route attribute is set (either enabled or disabled) then it
      overrides the global configuration.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Tested-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5b441ac8
  6. 31 Jan, 2017 1 commit
    • R
      net: Avoid receiving packets with an l3mdev on unbound UDP sockets · 63a6fff3
      Robert Shearman committed
      Packets arriving in a VRF currently are delivered to UDP sockets that
      aren't bound to any interface. TCP defaults to not delivering packets
      arriving in a VRF to unbound sockets. IP route lookup and socket
      transmit both assume that unbound means using the default table and
      UDP applications that haven't been changed to be aware of VRFs may not
      function correctly in this case since they may not be able to handle
      overlapping IP address ranges, or be able to send packets back to the
      original sender if required.
      
      So add a sysctl, udp_l3mdev_accept, to control this behaviour. It is
      analogous to the existing tcp_l3mdev_accept, namely it allows a
      process to have a VRF-global listen socket. Have this default to off
      as this is the behaviour that users will expect, given that there is
      no explicit mechanism to put unmodified VRF-unaware applications into a
      default VRF.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Tested-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      63a6fff3
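A rough model of the delivery decision described in this commit message follows. It is not the kernel's socket lookup: `udp_deliver_sketch` is a hypothetical name, interface index 0 is used to mean "none", and the handling of bound sockets is an assumption added only to make the function complete.

```c
#include <stdbool.h>

/*
 * Hypothetical model. Interface index 0 stands for "none":
 * sk_bound_dev == 0 means the socket is unbound,
 * pkt_l3mdev == 0 means the packet did not arrive in a VRF.
 */
static bool udp_deliver_sketch(int sk_bound_dev, int pkt_dev, int pkt_l3mdev,
			       bool udp_l3mdev_accept)
{
	if (sk_bound_dev != 0)
		/* Assumption: a bound socket matches its device or its VRF. */
		return sk_bound_dev == pkt_dev || sk_bound_dev == pkt_l3mdev;

	if (pkt_l3mdev == 0)
		return true;		/* default table, unbound socket */

	return udp_l3mdev_accept;	/* VRF packet, unbound socket */
}
```

With the sysctl at its default of off, an unbound socket only sees packets that did not arrive in a VRF.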
  7. 25 Jan, 2017 1 commit
    • K
      Introduce a sysctl that modifies the value of PROT_SOCK. · 4548b683
      Krister Johansen committed
      Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl
      that denotes the first unprivileged inet port in the namespace.  To
      disable all privileged ports set this to zero.  It also checks for
      overlap with the local port range.  The privileged and local range may
      not overlap.
      
      The use case for this change is to allow containerized processes to bind
      to privileged ports, but prevent them from ever being allowed to modify
      their container's network configuration.  The latter is accomplished by
      ensuring that the network namespace is not a child of the user
      namespace.  This modification was needed to allow the container manager
      to disable a namespace's privileged port restrictions without exposing
      control of the network namespace to processes in the user namespace.
      Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4548b683
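The two checks this commit message describes, the privilege test at bind time and the no-overlap rule against the local port range, reduce to two comparisons. The sketch below uses hypothetical function names and plain integers in place of the per-namespace sysctl values.

```c
#include <stdbool.h>

/*
 * A port below the configured start is privileged. A start of 0
 * therefore means that no port is privileged.
 */
static bool bind_needs_privilege_sketch(unsigned int port,
					unsigned int unpriv_start)
{
	return port < unpriv_start;
}

/*
 * The privileged range [0, unpriv_start) must not overlap the local
 * port range, whose lower bound is local_lo.
 */
static bool unpriv_start_valid_sketch(unsigned int unpriv_start,
				      unsigned int local_lo)
{
	return unpriv_start <= local_lo;
}
```

For instance, with a local port range starting at 32768, a start value of 1024 is acceptable while 40000 would be rejected.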
  8. 19 Jan, 2017 1 commit
    • X
      sctp: add reconf_enable in asoc ep and netns · c28445c3
      Xin Long committed
      This patch adds a reconf_enable field to asoc, ep and netns to
      indicate whether they support stream reset.
      
      On initialization, asoc reconf_enable takes its default value from ep
      reconf_enable, which in turn defaults to netns reconf_enable.
      
      It also adds reconf_capable to the asoc peer part to record whether
      the peer supports reconf. The value is set if the ext params include
      reconf chunk support when the init chunk is processed, as RFC 6525
      section 5.1.1 demands.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c28445c3
  9. 03 Jan, 2017 1 commit
    • F
      netfilter: merge udp and udplite conntrack helpers · e4781421
      Florian Westphal committed
      udplite was copied from udp, they are virtually 100% identical.
      
      This adds udplite tracker to udp instead, removes udplite module,
      and then makes the udplite tracker builtin.
      
      udplite will then simply re-use udp timeout settings.
      It makes little sense to add separate sysctls, nowadays we have
      fine-grained timeout policy support via the CT target.
      
      old:
       text    data     bss     dec     hex filename
       1633     672       0    2305     901 nf_conntrack_proto_udp.o
       1756     672       0    2428     97c nf_conntrack_proto_udplite.o
      69526   17937     268   87731   156b3 nf_conntrack.ko
      
      new:
       text    data     bss     dec     hex filename
       2442    1184       0    3626     e2a nf_conntrack_proto_udp.o
      68565   17721     268   86554   1521a nf_conntrack.ko
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      e4781421
  10. 30 Dec, 2016 2 commits
  11. 28 Dec, 2016 1 commit
  12. 07 Dec, 2016 1 commit
    • F
      netfilter: defrag: only register defrag functionality if needed · 834184b1
      Florian Westphal committed
      nf_defrag modules for ipv4 and ipv6 export an empty stub function.
      Any module that needs the defragmentation hooks registered simply 'calls'
      this empty function to create a phony module dependency -- modprobe will
      then load the defrag module too.
      
      This extends netfilter ipv4/ipv6 defragmentation modules to delay the hook
      registration until the functionality is requested within a network namespace
      instead of module load time for all namespaces.
      
      Hooks are only un-registered on module unload or when a namespace that used
      such defrag functionality exits.
      
      We have to use struct net for this as the register hooks can be called
      before netns initialization here from the ipv4/ipv6 conntrack module
      init path.
      
      There is no unregister functionality support, defrag will always be
      active once it was requested inside a net namespace.
      
      The reason is that defrag has impact on nft and iptables rulesets
      (without defrag we might see fragments).
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      834184b1
  13. 05 Dec, 2016 3 commits
    • D
      netfilter: conntrack: built-in support for UDPlite · 9b91c96c
      Davide Caratti committed
      CONFIG_NF_CT_PROTO_UDPLITE is no longer a tristate. When set to y,
      connection tracking support for the UDPlite protocol is built into
      nf_conntrack.ko.
      
      footprint test:
      $ ls -l net/netfilter/nf_conntrack{_proto_udplite,}.ko \
              net/ipv4/netfilter/nf_conntrack_ipv4.ko \
              net/ipv6/netfilter/nf_conntrack_ipv6.ko
      
      (builtin)|| udplite|  ipv4  |  ipv6  |nf_conntrack
      ---------++--------+--------+--------+--------------
      none     || 432538 | 828755 | 828676 | 6141434
      UDPlite  ||   -    | 829649 | 829362 | 6498204
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      9b91c96c
    • D
      netfilter: conntrack: built-in support for SCTP · a85406af
      Davide Caratti committed
      CONFIG_NF_CT_PROTO_SCTP is no longer a tristate. When set to y, connection
      tracking support for the SCTP protocol is built into nf_conntrack.ko.
      
      footprint test:
      $ ls -l net/netfilter/nf_conntrack{_proto_sctp,}.ko \
              net/ipv4/netfilter/nf_conntrack_ipv4.ko \
              net/ipv6/netfilter/nf_conntrack_ipv6.ko
      
      (builtin)||  sctp  |  ipv4  |  ipv6  | nf_conntrack
      ---------++--------+--------+--------+--------------
      none     || 498243 | 828755 | 828676 | 6141434
      SCTP     ||   -    | 829254 | 829175 | 6547872
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      a85406af
    • D
      netfilter: conntrack: built-in support for DCCP · c51d3901
      Davide Caratti committed
      CONFIG_NF_CT_PROTO_DCCP is no longer a tristate. When set to y, connection
      tracking support for the DCCP protocol is built into nf_conntrack.ko.
      
      footprint test:
      $ ls -l net/netfilter/nf_conntrack{_proto_dccp,}.ko \
              net/ipv4/netfilter/nf_conntrack_ipv4.ko \
              net/ipv6/netfilter/nf_conntrack_ipv6.ko
      
      (builtin)||  dccp  |  ipv4  |  ipv6  | nf_conntrack
      ---------++--------+--------+--------+--------------
      none     || 469140 | 828755 | 828676 | 6141434
      DCCP     ||   -    | 830566 | 829935 | 6533526
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      c51d3901
  14. 04 Dec, 2016 3 commits
    • I
      ipv4: fib: Allow for consistent FIB dumping · cacaad11
      Ido Schimmel committed
      The next patch will enable listeners of the FIB notification chain to
      request a dump of the FIB tables. However, since RTNL isn't taken during
      the dump, it's possible for the FIB tables to change mid-dump, which
      will result in inconsistency between the listener's table and the
      kernel's.
      
      Allow listeners to know about changes that occurred mid-dump, by adding
      a change sequence counter to each net namespace. The counter is
      incremented just before a notification is sent in the FIB chain.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cacaad11
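The change sequence counter lets a listener detect that the tables moved underneath a dump: read the counter, dump, read it again, and treat a mismatch as "retry". The single-threaded sketch below illustrates only that comparison. All names are hypothetical, the table is reduced to a route count, and real concurrent code would need proper atomic reads.

```c
#include <stdbool.h>

/* Hypothetical stand-in for per-namespace FIB state. */
struct fib_table_sketch {
	unsigned int seq;	/* change sequence counter */
	int routes;		/* the "table": just a route count here */
};

/* A change bumps the counter just before the notification is sent. */
static void fib_notify_sketch(struct fib_table_sketch *t, int delta)
{
	t->seq++;
	t->routes += delta;
}

typedef void (*dump_fn_sketch)(struct fib_table_sketch *t, int *copy);

/* Returns true if nothing changed while dump() ran. */
static bool fib_dump_consistent_sketch(struct fib_table_sketch *t,
				       dump_fn_sketch dump, int *copy)
{
	unsigned int before = t->seq;

	dump(t, copy);
	return t->seq == before;
}

/* Demo helper: a dump with no concurrent change. */
static void dump_quiet_sketch(struct fib_table_sketch *t, int *copy)
{
	*copy = t->routes;
}

/* Demo helper: a route is added right after the copy was taken. */
static void dump_racy_sketch(struct fib_table_sketch *t, int *copy)
{
	*copy = t->routes;
	fib_notify_sketch(t, 1);
}
```

A listener that gets `false` back knows its copy may be stale and can discard it and dump again.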
    • A
      netns: fix net_generic() "id - 1" bloat · 6af2d5ff
      Alexey Dobriyan committed
      net_generic() function is both a) inline and b) used ~600 times.
      
      It has the following code inside
      
      		...
      	ptr = ng->ptr[id - 1];
      		...
      
      "id" is never compile time constant so compiler is forced to subtract 1.
      And those decrements or LEA [r32 - 1] instructions add up.
      
      We also start id'ing from 1 to catch bugs where the pernet subsystem id
      is not initialized and 0. This is quite a pointless idea in general
      (nothing will work, or there is immediate interference with the first
      registered subsystem) but it hints at what needs to be done for code size reduction.
      
      Namely, overlaying allocation of pointer array and fixed part of
      structure in the beginning and using usual base-0 addressing.
      
      Ids are just cookies, their exact values do not matter, so let's start
      with 3 on x86_64.
      
      Code size savings (oh boy): -4.2 KB
      
      As usual, ignore the initial compiler stupidity part of the table.
      
      	add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
      	function                                     old     new   delta
      	tipc_nametbl_insert_publ                    1250    1270     +20
      	nlmclnt_lookup_host                          686     703     +17
      	nfsd4_encode_fattr                          5930    5941     +11
      	nfs_get_client                              1050    1061     +11
      	register_pernet_operations                   333     342      +9
      	tcf_mirred_init                              843     849      +6
      	tcf_bpf_init                                1143    1149      +6
      	gss_setup_upcall                             990     994      +4
      	idmap_name_to_id                             432     434      +2
      	ops_init                                     274     275      +1
      	nfsd_inject_forget_client                    259     260      +1
      	nfs4_alloc_client                            612     613      +1
      	tunnel_key_walker                            164     163      -1
      
      		...
      
      	tipc_bcbase_select_primary                   392     360     -32
      	mac80211_hwsim_new_radio                    2808    2767     -41
      	ipip6_tunnel_ioctl                          2228    2186     -42
      	tipc_bcast_rcv                               715     672     -43
      	tipc_link_build_proto_msg                   1140    1089     -51
      	nfsd4_lock                                  3851    3796     -55
      	tipc_mon_rcv                                1012     956     -56
      	Total: Before=156643951, After=156639743, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6af2d5ff
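The overlay idea can be imitated in user space: let the fixed header occupy the first slots of the pointer array and hand out ids starting after it, so that a lookup is a plain `slots[id]`. The sketch below is an assumption-laden model, not the kernel's `struct net_generic`. The header has a single field, so the first usable id is computed rather than fixed at 3, and `memcpy` is used to stay clear of aliasing problems.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed header; the real structure has more fields. */
struct ng_hdr_sketch {
	unsigned int len;	/* number of slots, header slots included */
};

/* Number of leading pointer slots that the header overlays. */
#define NG_MIN_ID_SKETCH \
	((sizeof(struct ng_hdr_sketch) + sizeof(void *) - 1) / sizeof(void *))

/* Allocate len slots and store the header in the first ones. */
static void **ng_alloc_sketch(unsigned int len)
{
	struct ng_hdr_sketch hdr;
	void **slots;

	if (len < NG_MIN_ID_SKETCH)
		return NULL;
	slots = calloc(len, sizeof(void *));
	if (slots) {
		hdr.len = len;
		memcpy(slots, &hdr, sizeof(hdr));
	}
	return slots;
}

/* Read the header back out of the leading slots. */
static unsigned int ng_len_sketch(void **slots)
{
	struct ng_hdr_sketch hdr;

	memcpy(&hdr, slots, sizeof(hdr));
	return hdr.len;
}

/* Base-0 lookup: no "id - 1" adjustment is needed. */
static void *ng_get_sketch(void **slots, unsigned int id)
{
	return slots[id];
}
```

Ids below `NG_MIN_ID_SKETCH` are never handed out in this model, which keeps the property that an uninitialized id of 0 does not name a valid slot.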
    • A
      netns: add dummy struct inside "struct net_generic" · 9bfc7b99
      Alexey Dobriyan committed
      This is precursor to fixing "[id - 1]" bloat inside net_generic().
      
      Name "s" is chosen to complement name "u" often used for dummy unions.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9bfc7b99
  15. 18 Nov, 2016 1 commit
    • A
      netns: make struct pernet_operations::id unsigned int · c7d03a00
      Alexey Dobriyan committed
      Make struct pernet_operations::id unsigned.
      
      There are 2 reasons to do so:
      
      1)
      This field is really an index into a zero-based array and
      thus is an unsigned entity. Using a negative value is an out-of-bounds
      access by definition.
      
      2)
      On x86_64 unsigned 32-bit data which are mixed with pointers
      via array indexing or offsets added or subtracted to pointers
      are preferred to signed 32-bit data.
      
      "int" being used as an array index needs to be sign-extended
      to 64-bit before being used.
      
      	void f(long *p, int i)
      	{
      		g(p[i]);
      	}
      
        roughly translates to
      
      	movsx	rsi, esi
      	mov	rdi, [rsi+...]
      	call 	g
      
      MOVSX is a 3-byte instruction which isn't necessary if the variable is
      unsigned because x86_64 zero-extends by default.
      
      Now, there is net_generic() function which, you guessed it right, uses
      "int" as an array index:
      
      	static inline void *net_generic(const struct net *net, int id)
      	{
      		...
      		ptr = ng->ptr[id - 1];
      		...
      	}
      
      And this function is used a lot, so those sign extensions add up.
      
      Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
      messing with code generation):
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      
      Unfortunately some functions actually grow bigger.
      This is a seemingly random artefact of code generation with register
      allocator being used differently. gcc decides that some variable
      needs to live in new r8+ registers and every access now requires REX
      prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
      used which is longer than [r8]
      
      However, overall balance is in negative direction:
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      	function                                     old     new   delta
      	nfsd4_lock                                  3886    3959     +73
      	tipc_link_build_proto_msg                   1096    1140     +44
      	mac80211_hwsim_new_radio                    2776    2808     +32
      	tipc_mon_rcv                                1032    1058     +26
      	svcauth_gss_legacy_init                     1413    1429     +16
      	tipc_bcbase_select_primary                   379     392     +13
      	nfsd4_exchange_id                           1247    1260     +13
      	nfsd4_setclientid_confirm                    782     793     +11
      		...
      	put_client_renew_locked                      494     480     -14
      	ip_set_sockfn_get                            730     716     -14
      	geneve_sock_add                              829     813     -16
      	nfsd4_sequence_done                          721     703     -18
      	nlmclnt_lookup_host                          708     686     -22
      	nfsd4_lockt                                 1085    1063     -22
      	nfs_get_client                              1077    1050     -27
      	tcf_bpf_init                                1106    1076     -30
      	nfsd4_encode_fattr                          5997    5930     -67
      	Total: Before=154856051, After=154854321, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c7d03a00
  16. 14 Nov, 2016 1 commit
  17. 10 Nov, 2016 1 commit
    • D
      ipv6: sr: add code base for control plane support of SR-IPv6 · 915d7e5e
      David Lebrun committed
      This patch adds the necessary hooks and structures to provide support
      for SR-IPv6 control plane, essentially the Generic Netlink commands
      that will be used for userspace control over the Segment Routing
      kernel structures.
      
      The genetlink commands provide control over two different structures:
      tunnel source and HMAC data. The tunnel source is the source address
      that will be used by default when encapsulating packets into an
      outer IPv6 header + SRH. If the tunnel source is set to :: then an
      address of the outgoing interface will be selected as the source.
      
      The HMAC commands currently just return ENOTSUPP and will be implemented
      in a future patch.
      Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      915d7e5e
  18. 25 Sep, 2016 1 commit
  19. 24 Aug, 2016 1 commit
  20. 13 Aug, 2016 1 commit
    • P
      netfilter: remove ip_conntrack* sysctl compat code · adf05168
      Pablo Neira Ayuso committed
      This backward compatibility has been around for more than ten years,
      since Yasuyuki Kozakai introduced IPv6 in conntrack. These days, we have
      alternate /proc/net/nf_conntrack* entries, the ctnetlink interface and
      the conntrack utility got adopted by many people in the user community
      according to what I observed on the netfilter user mailing list.
      
      So let's get rid of this.
      
      Note that nf_conntrack_htable_size and unsigned int nf_conntrack_max do
      not need to be exported as symbol anymore.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      adf05168
  21. 12 Aug, 2016 2 commits
  22. 10 Aug, 2016 2 commits
  23. 25 May, 2016 1 commit
    • E
      netfilter: nf_queue: Make the queue_handler pernet · dc3ee32e
      Eric W. Biederman committed
      Florian Weber reported:
      > Under full load (unshare() in loop -> OOM conditions) we can
      > get kernel panic:
      >
      > BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      > IP: [<ffffffff81476c85>] nfqnl_nf_hook_drop+0x35/0x70
      > [..]
      > task: ffff88012dfa3840 ti: ffff88012dffc000 task.ti: ffff88012dffc000
      > RIP: 0010:[<ffffffff81476c85>]  [<ffffffff81476c85>] nfqnl_nf_hook_drop+0x35/0x70
      > RSP: 0000:ffff88012dfffd80  EFLAGS: 00010206
      > RAX: 0000000000000008 RBX: ffffffff81add0c0 RCX: ffff88013fd80000
      > [..]
      > Call Trace:
      >  [<ffffffff81474d98>] nf_queue_nf_hook_drop+0x18/0x20
      >  [<ffffffff814738eb>] nf_unregister_net_hook+0xdb/0x150
      >  [<ffffffff8147398f>] netfilter_net_exit+0x2f/0x60
      >  [<ffffffff8141b088>] ops_exit_list.isra.4+0x38/0x60
      >  [<ffffffff8141b652>] setup_net+0xc2/0x120
      >  [<ffffffff8141bd09>] copy_net_ns+0x79/0x120
      >  [<ffffffff8106965b>] create_new_namespaces+0x11b/0x1e0
      >  [<ffffffff810698a7>] unshare_nsproxy_namespaces+0x57/0xa0
      >  [<ffffffff8104baa2>] SyS_unshare+0x1b2/0x340
      >  [<ffffffff81608276>] entry_SYSCALL_64_fastpath+0x1e/0xa8
      > Code: 65 00 48 89 e5 41 56 41 55 41 54 53 83 e8 01 48 8b 97 70 12 00 00 48 98 49 89 f4 4c 8b 74 c2 18 4d 8d 6e 08 49 81 c6 88 00 00 00 <49> 8b 5d 00 48 85 db 74 1a 48 89 df 4c 89 e2 48 c7 c6 90 68 47
      >
      
      The simple fix for this requires a new pernet variable for struct
      nf_queue that indicates when it is safe to use the dynamically
      allocated nf_queue state.
      
      As we need a variable anyway make nf_register_queue_handler and
      nf_unregister_queue_handler pernet.  This allows the existing logic of
      when it is safe to use the state from the nfnetlink_queue module to be
      reused with no changes except for making it per net.
      
      The synchronize_rcu from nf_unregister_queue_handler is moved to a new
      function nfnl_queue_net_exit_batch so that the worst case of having a
      synchronize_rcu in the pernet exit path is not experienced in batch
      mode.
      Reported-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      dc3ee32e
  24. 09 May, 2016 2 commits
  25. 06 May, 2016 1 commit
  26. 05 May, 2016 1 commit
  27. 25 Apr, 2016 1 commit
  28. 12 Apr, 2016 1 commit
    • D
      net: ipv4: Consider failed nexthops in multipath routes · a6db4494
      David Ahern committed
      Multipath route lookups should consider knowledge about next hops and not
      select a hop that is known to be failed.
      
      Example:
      
                           [h2]                   [h3]   15.0.0.5
                            |                      |
                           3|                     3|
                          [SP1]                  [SP2]--+
                           1  2                   1     2
                           |  |     /-------------+     |
                           |   \   /                    |
                           |     X                      |
                           |    / \                     |
                           |   /   \---------------\    |
                           1  2                     1   2
               12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
                           4                         4
                            \                       /
                              \                    /
                               \                  /
                                -------|   |-----/
                                       1   2
                                      [TOR3]
                                        3|
                                         |
                                        [h1]  12.0.0.1
      
      host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:
      
          root@h1:~# ip ro ls
          ...
          12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
          15.0.0.0/16
                  nexthop via 12.0.0.2  dev swp1 weight 1
                  nexthop via 12.0.0.3  dev swp1 weight 1
          ...
      
      If the link between tor3 and tor1 is down and the link between tor1
      and tor2 is also down, then tor1 is effectively cut off from h1. Yet the route lookups
      in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
      ssh 15.0.0.5 gets the other. Connections that attempt to use the
      12.0.0.2 nexthop fail since that neighbor is not reachable:
      
          root@h1:~# ip neigh show
          ...
          12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
          12.0.0.2 dev swp1  FAILED
          ...
      
      The failed path can be avoided by considering known neighbor information
      when selecting next hops. If the neighbor lookup fails we have no
      knowledge about the nexthop, so give it a shot. If there is an entry
      then only select the nexthop if the state is sane. This is similar to
      what fib_detect_death does.
      
      To maintain backward compatibility use of the neighbor information is
      based on a new sysctl, fib_multipath_use_neigh.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Reviewed-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6db4494
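The selection rule in this commit message reads naturally as a predicate on one candidate next hop. The sketch below is a simplification with hypothetical names: what counts as a "sane" neighbor state is passed in as a boolean, because the commit message does not spell it out.

```c
#include <stdbool.h>

/*
 * Decide whether a next hop may be selected.
 * use_neigh   models the fib_multipath_use_neigh sysctl.
 * has_entry   is true when the neighbor lookup found an entry.
 * state_sane  is true when that entry is in a usable state
 *             (assumption: for example, not FAILED).
 */
static bool nexthop_usable_sketch(bool use_neigh, bool has_entry,
				  bool state_sane)
{
	if (!use_neigh)
		return true;	/* old behaviour: neighbor state is ignored */
	if (!has_entry)
		return true;	/* nothing known about it, give it a shot */
	return state_sane;
}
```

In the example topology, 12.0.0.2 has a FAILED entry and would be skipped once the sysctl is enabled, while 12.0.0.3 stays eligible.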
  29. 17 Mar, 2016 1 commit
  30. 09 Mar, 2016 2 commits