1. 01 Dec, 2016 7 commits
    • netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed in 64bit kernel · 17a49cd5
      Hongxu Jia committed
      Since 09d96860 ("netfilter: x_tables: do compat validation via
      translate_table"), the compatr structure is used to fill in the newinfo
      structure.  In translate_compat_table() in ip_tables.c and ip6_tables.c,
      compatr->hook_entry replaces info->hook_entry and compatr->underflow
      replaces info->underflow, but the same replacement was not made in
      arp_tables.c.

      As a result, invoking a 32-bit "arptables -P INPUT ACCEPT" fails on a
      64-bit kernel.
      --------------------------------------
      root@qemux86-64:~# arptables -P INPUT ACCEPT
      root@qemux86-64:~# arptables -P INPUT ACCEPT
      ERROR: Policy for `INPUT' offset 448 != underflow 0
      arptables: Incompatible with this kernel
      --------------------------------------
      
      Fixes: 09d96860 ("netfilter: x_tables: do compat validation via translate_table")
      Signed-off-by: Hongxu Jia <hongxu.jia@windriver.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      17a49cd5
    • l2tp: fix address test in __l2tp_ip6_bind_lookup() · 31e2f21f
      Guillaume Nault committed
      The '!(addr && ipv6_addr_equal(addr, laddr))' part of the conditional
      matches if addr is NULL or if addr != laddr.
      But the intent of __l2tp_ip6_bind_lookup() is to find a socket with
      the same address, so the ipv6_addr_equal() condition needs to be
      inverted.
      
      For better clarity and consistency with the rest of the expression, the
      (!X || X == Y) notation is used instead of !(X && X != Y).
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      31e2f21f
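
The difference between the two forms of the address test can be sketched in plain C. This is only an illustration of the boolean logic described in the message: `struct addr6`, `addr6_equal()`, `old_match()` and `new_match()` are hypothetical stand-ins, not kernel code.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for an IPv6 address. */
struct addr6 {
    unsigned char b[16];
};

/* Hypothetical stand-in for ipv6_addr_equal(). */
bool addr6_equal(const struct addr6 *a, const struct addr6 *b)
{
    return memcmp(a->b, b->b, sizeof(a->b)) == 0;
}

/* Old test: matches when addr is NULL or when addr differs from laddr. */
bool old_match(const struct addr6 *addr, const struct addr6 *laddr)
{
    return !(addr && addr6_equal(addr, laddr));
}

/* Fixed test in the (!X || X == Y) form: matches when addr is NULL
 * or when addr equals laddr. */
bool new_match(const struct addr6 *addr, const struct addr6 *laddr)
{
    return !addr || addr6_equal(addr, laddr);
}
```

With two different addresses, `old_match()` reports a match while `new_match()` does not; with identical addresses the results are reversed. A NULL `addr` matches in both forms.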
    • l2tp: fix lookup for sockets not bound to a device in l2tp_ip · df90e688
      Guillaume Nault committed
      When looking up an l2tp socket, we must consider a null netdevice id as
      a wildcard. There are currently two problems caused by
      __l2tp_ip_bind_lookup() not considering 'dif' as a wildcard when set to 0:
      
        * A socket bound to a device (i.e. with sk->sk_bound_dev_if != 0)
          never receives any packet. Since __l2tp_ip_bind_lookup() is called
          with dif == 0 in l2tp_ip_recv(), sk->sk_bound_dev_if is always
          different from 'dif' so the socket doesn't match.
      
        * Two sockets, one bound to a device but not the other, can be bound
          to the same address. If the first socket binding to the address is
          the one that is also bound to a device, the second socket can bind
          to the same address without __l2tp_ip_bind_lookup() noticing the
          overlap.
      
      To fix this issue, we need to consider that any null device index, be
      it 'sk->sk_bound_dev_if' or 'dif', matches with any other value.
      We also need to pass the input device index to __l2tp_ip_bind_lookup()
      on reception so that sockets bound to a device never receive packets
      from other devices.
      
      This patch fixes l2tp_ip6 in the same way.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      df90e688
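
The wildcard rule for device indexes can be written as a small predicate. The sketch below assumes plain integer indexes; `strict_dev_match()` and `wildcard_dev_match()` are hypothetical names used only to contrast the two behaviours.

```c
#include <stdbool.h>

/* Strict comparison: a socket bound to a device never matches a lookup
 * done with dif == 0, and an unbound socket never matches dif != 0. */
bool strict_dev_match(int sk_bound_dev_if, int dif)
{
    return sk_bound_dev_if == dif;
}

/* Wildcard comparison: a null device index on either side matches any
 * other value; two non-null indexes must be equal. */
bool wildcard_dev_match(int sk_bound_dev_if, int dif)
{
    return sk_bound_dev_if == 0 || dif == 0 || sk_bound_dev_if == dif;
}
```

Under the wildcard rule, a socket bound to device 3 and an unbound socket asking for the same address are seen as overlapping, which is what lets the bind-time lookup notice the conflict described in the second bullet.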
    • l2tp: fix racy socket lookup in l2tp_ip and l2tp_ip6 bind() · d5e3a190
      Guillaume Nault committed
      It's not enough to check for sockets bound to the same address at the
      beginning of l2tp_ip{,6}_bind(): even if no socket is found at that
      time, a socket with the same address could be bound before we take
      the l2tp lock again.
      
      This patch moves the lookup right before inserting the new socket, so
      that no change can ever happen to the list between address lookup and
      socket insertion.
      
      Care is taken to avoid side effects on the socket in case of failure.
      That is, modifications of the socket are done after the lookup, when
      binding is guaranteed to succeed, and before releasing the l2tp lock,
      so that concurrent lookups will always see fully initialised sockets.
      
      For l2tp_ip, 'ret' is set to -EINVAL before checking the SOCK_ZAPPED
      bit. Error code was mistakenly set to -EADDRINUSE on error by commit
      32c23116 ("l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()").
      Using -EINVAL restores original behaviour.
      
      For l2tp_ip6, the lookup is now always done with the correct bound
      device. Before this patch, when binding to a link-local address, the
      lookup was done with the original sk->sk_bound_dev_if, which was later
      overwritten with addr->l2tp_scope_id. Lookup is now performed with the
      final sk->sk_bound_dev_if value.
      
      Finally, the (addr_len >= sizeof(struct sockaddr_in6)) check has been
      dropped: addr is a sockaddr_l2tpip6 not sockaddr_in6 and addr_len has
      already been checked at this point (this part of the code seems to have
      been copy-pasted from net/ipv6/raw.c).
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d5e3a190
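
The ordering described in this message (lookup immediately before insertion, socket modified only once success is certain) can be modelled in user space. Everything below is a simplified assumption: `struct sock_model`, `bind_table`, `table_lock()` and `bind_model()` are hypothetical, the lock functions are no-op placeholders for the l2tp lock, and `-ENOSPC` exists only because the model uses a fixed-size table.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TABLE_SIZE 8            /* arbitrary capacity for the model */

struct sock_model {
    unsigned int addr;          /* bound address, meaningful once bound */
    bool zapped;                /* true until the socket has been bound */
};

struct sock_model *bind_table[TABLE_SIZE];
size_t bind_count;

/* No-op placeholders standing in for taking and releasing the l2tp lock. */
void table_lock(void) { }
void table_unlock(void) { }

bool addr_in_use(unsigned int addr)
{
    for (size_t i = 0; i < bind_count; i++)
        if (bind_table[i]->addr == addr)
            return true;
    return false;
}

int bind_model(struct sock_model *sk, unsigned int addr)
{
    int ret = -EINVAL;

    if (!sk->zapped)            /* already bound: report -EINVAL */
        return ret;

    table_lock();
    /* Lookup and insertion share one critical section, so no other
     * socket can claim 'addr' between the two steps. */
    if (addr_in_use(addr)) {
        ret = -EADDRINUSE;      /* socket left untouched on failure */
    } else if (bind_count == TABLE_SIZE) {
        ret = -ENOSPC;          /* limitation of this model only */
    } else {
        /* Modify the socket only now that binding cannot fail, and
         * before the lock is released. */
        sk->addr = addr;
        sk->zapped = false;
        bind_table[bind_count++] = sk;
        ret = 0;
    }
    table_unlock();
    return ret;
}
```

A second socket binding to an address that is already taken gets `-EADDRINUSE` and keeps its unbound state, while rebinding an already bound socket yields `-EINVAL`.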
    • l2tp: hold socket before dropping lock in l2tp_ip{, 6}_recv() · a3c18422
      Guillaume Nault committed
      Socket must be held while under the protection of the l2tp lock; there
      is no guarantee that sk remains valid after the read_unlock_bh() call.
      
      Same issue for l2tp_ip and l2tp_ip6.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3c18422
    • l2tp: lock socket before checking flags in connect() · 0382a25a
      Guillaume Nault committed
      Socket flags aren't updated atomically, so the socket must be locked
      while reading the SOCK_ZAPPED flag.
      
      This issue exists for both l2tp_ip and l2tp_ip6. For IPv6, this patch
      also brings error handling for __ip6_datagram_connect() failures.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0382a25a
    • openvswitch: Fix skb leak in IPv6 reassembly. · f92a80a9
      Daniele Di Proietto committed
      If nf_ct_frag6_gather() returns an error other than -EINPROGRESS, it
      means that we still have a reference to the skb.  We should free it
      before returning from handle_fragments, as stated in the comment above.
      
      Fixes: daaa7d64 ("netfilter: ipv6: avoid nf_iterate recursion")
      CC: Florian Westphal <fw@strlen.de>
      CC: Pravin B Shelar <pshelar@ovn.org>
      CC: Joe Stringer <joe@ovn.org>
      Signed-off-by: Daniele Di Proietto <diproiettod@ovn.org>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f92a80a9
  2. 30 Nov, 2016 12 commits
  3. 29 Nov, 2016 3 commits
    • net: dsa: fix unbalanced dsa_switch_tree reference counting · 7a99cd6e
      Nikita Yushchenko committed
      _dsa_register_switch() gets a dsa_switch_tree object either via
      dsa_get_dst() or via dsa_add_dst(). The former path does not increase
      the kref of the returned object (so the caller does not own a
      reference), while the latter path creates a new object (so the caller
      does own a reference).

      The rest of _dsa_register_switch() assumes that it owns a reference, and
      calls dsa_put_dst().

      This corrupts memory if the first switch in the tree initialized
      successfully but the second failed to initialize. In particular, the
      freed dsa_switch_tree object is left referenced by the switch that was
      initialized, and a later access to the sysfs attributes of that switch
      causes an oops.

      To fix this, add a kref_get() call to dsa_get_dst().
      
      Fixes: 83c0afae ("net: dsa: Add new binding implementation")
      Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7a99cd6e
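
The reference imbalance can be illustrated with a plain counter in place of a kref. This is a sketch under that assumption: `struct tree_model`, `tree_get()` and `tree_put()` are hypothetical names, and no memory is actually freed.

```c
#include <stdbool.h>
#include <stddef.h>

struct tree_model {
    int refcount;               /* stand-in for the kref */
};

/* Lookup path: hand out an existing tree and take a reference on behalf
 * of the caller, so that every caller owns exactly one reference. */
struct tree_model *tree_get(struct tree_model *existing)
{
    if (existing)
        existing->refcount++;
    return existing;
}

/* Drop one reference. Returns true when the last reference is gone,
 * i.e. when the real object would be freed. */
bool tree_put(struct tree_model *t)
{
    return --t->refcount == 0;
}
```

With the reference taken in `tree_get()`, a failed registration of a second switch that ends in `tree_put()` leaves the count at 1, so the tree stays alive for the first switch. Without the increment, the same `tree_put()` would drop the count to 0 while the first switch still points at the tree.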
    • net: handle no dst on skb in icmp6_send · 79dc7e3f
      David Ahern committed
      Andrey reported the following while fuzzing the kernel with syzkaller:
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 0 PID: 3859 Comm: a.out Not tainted 4.9.0-rc6+ #429
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      task: ffff8800666d4200 task.stack: ffff880067348000
      RIP: 0010:[<ffffffff833617ec>]  [<ffffffff833617ec>]
      icmp6_send+0x5fc/0x1e30 net/ipv6/icmp.c:451
      RSP: 0018:ffff88006734f2c0  EFLAGS: 00010206
      RAX: ffff8800666d4200 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000018
      RBP: ffff88006734f630 R08: ffff880064138418 R09: 0000000000000003
      R10: dffffc0000000000 R11: 0000000000000005 R12: 0000000000000000
      R13: ffffffff84e7e200 R14: ffff880064138484 R15: ffff8800641383c0
      FS:  00007fb3887a07c0(0000) GS:ffff88006cc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020000000 CR3: 000000006b040000 CR4: 00000000000006f0
      Stack:
       ffff8800666d4200 ffff8800666d49f8 ffff8800666d4200 ffffffff84c02460
       ffff8800666d4a1a 1ffff1000ccdaa2f ffff88006734f498 0000000000000046
       ffff88006734f440 ffffffff832f4269 ffff880064ba7456 0000000000000000
      Call Trace:
       [<ffffffff83364ddc>] icmpv6_param_prob+0x2c/0x40 net/ipv6/icmp.c:557
       [<     inline     >] ip6_tlvopt_unknown net/ipv6/exthdrs.c:88
       [<ffffffff83394405>] ip6_parse_tlv+0x555/0x670 net/ipv6/exthdrs.c:157
       [<ffffffff8339a759>] ipv6_parse_hopopts+0x199/0x460 net/ipv6/exthdrs.c:663
       [<ffffffff832ee773>] ipv6_rcv+0xfa3/0x1dc0 net/ipv6/ip6_input.c:191
       ...
      
      icmp6_send / icmpv6_send is invoked for both rx and tx paths. In both
      cases the dst->dev should be preferred for determining the L3 domain
      if the dst has been set on the skb. Fall back to skb->dev if it has
      not. This covers the case reported here where icmp6_send is invoked on
      Rx before the route lookup.
      
      Fixes: 5d41ce29 ("net: icmp6_send should use dst dev to determine L3 domain")
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      79dc7e3f
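
The device selection rule from the last paragraph of the message can be sketched as follows. The three structures are minimal hypothetical models carrying only the fields the rule needs, and `l3_domain_dev()` is an illustrative name rather than a kernel function.

```c
#include <stddef.h>

struct net_device_model {
    int ifindex;
};

struct dst_model {
    struct net_device_model *dev;
};

struct skb_model {
    struct dst_model *dst;          /* NULL before the route lookup */
    struct net_device_model *dev;   /* device the packet arrived on */
};

/* Prefer the device of the attached dst; fall back to the skb's own
 * device when no dst has been set yet. */
struct net_device_model *l3_domain_dev(const struct skb_model *skb)
{
    return skb->dst ? skb->dst->dev : skb->dev;
}
```

The fallback branch is the one exercised by the reported crash, where the skb reaches the ICMPv6 error path on Rx before any route lookup has attached a dst.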
    • J
      4df21dfc
  4. 28 Nov, 2016 3 commits
    • net, sched: respect rcu grace period on cls destruction · d9363774
      Daniel Borkmann committed
      Roi reported a crash in flower where tp->root was NULL in ->classify()
      callbacks. The reason is that in ->destroy() tp->root is set to NULL via
      RCU_INIT_POINTER(). It's problematic for some of the classifiers, because
      this doesn't respect RCU grace period for them, and as a result, still
      outstanding readers from tc_classify() will try to blindly dereference
      a NULL tp->root.
      
      The tp->root object is strictly private to the classifier implementation
      and holds internal data the core such as tc_ctl_tfilter() doesn't know
      about. Within some classifiers, such as cls_bpf, cls_basic, etc, tp->root
      is only checked for NULL in ->get() callback, but nowhere else. This is
      misleading and seemed to be copied from old classifier code that was not
      cleaned up properly. For example, d3fa76ee ("[NET_SCHED]: cls_basic:
      fix NULL pointer dereference") moved tp->root initialization into ->init()
      routine, where before it was part of ->change(), so ->get() had to deal
      with tp->root being NULL back then. That was indeed a valid case then,
      but after d3fa76ee it no longer is. We used to set tp->root to NULL
      long ago in ->destroy(), see 47a1a1d4 ("pkt_sched: remove unnecessary
      xchg() in packet classifiers"). The NULLifying was reintroduced with
      the RCUification, but it is not correct for every classifier
      implementation.
      
      In the cases that are fixed here with one exception of cls_cgroup, tp->root
      object is allocated and initialized inside ->init() callback, which is always
      performed at a point in time after we allocate a new tp, which means tp and
      thus tp->root was not globally visible in the tp chain yet (see tc_ctl_tfilter()).
      Also, on destruction tp->root is strictly kfree_rcu()'ed in ->destroy()
      handler, same for the tp which is kfree_rcu()'ed right when we return
      from ->destroy() in tcf_destroy(). This means, the head object's lifetime
      for such classifiers is always tied to the tp lifetime. The RCU callback
      invocation for the two kfree_rcu() could be out of order, but that's fine
      since both are independent.
      
      Dropping the RCU_INIT_POINTER(tp->root, NULL) for these classifiers here
      means that 1) we don't need a useless NULL check in the fast path, and 2)
      outstanding readers of that tp in tc_classify() can still finish
      within the RCU grace period, as is actually expected.
      
      Things that haven't been touched here: cls_fw and cls_route. They each
      handle tp->root being NULL in ->classify() path for historic reasons, so
      their ->destroy() implementation can stay as is. If someone actually
      cares, they could get cleaned up at some point to avoid the test in fast
      path. cls_u32 doesn't set tp->root to NULL. For cls_rsvp, I just added a
      !head check should anyone actually be using/testing it, so it at least aligns with
      cls_fw and cls_route. For cls_flower we additionally need to defer rhashtable
      destruction (to a sleepable context) after RCU grace period as concurrent
      readers might still access it. (Note that in this case we need to hold module
      reference to keep work callback address intact, since we only wait on module
      unload for all call_rcu()s to finish.)
      
      This fixes one race to bring RCU grace period guarantees back. Next step
      as worked on by Cong however is to fix 1e052be6 ("net_sched: destroy
      proto tp when all filters are gone") to get the order of unlinking the tp
      in tc_ctl_tfilter() for the RTM_DELTFILTER case right by moving
      RCU_INIT_POINTER() before tcf_destroy() and let the notification for
      removal be done through the prior ->delete() callback. Both are independent
      issues. Once we have that right, we can then clean tp->root up for a number
      of classifiers by not making them RCU pointers, which requires a new callback
      (->uninit) that is triggered from tp's RCU callback, where we just kfree()
      tp->root from there.
      
      Fixes: 1f947bf1 ("net: sched: rcu'ify cls_bpf")
      Fixes: 9888faef ("net: sched: cls_basic use RCU")
      Fixes: 70da9f0b ("net: sched: cls_flow use RCU")
      Fixes: 77b9900e ("tc: introduce Flower classifier")
      Fixes: bf3994d2 ("net/sched: introduce Match-all classifier")
      Fixes: 952313bd ("net: sched: cls_cgroup use RCU")
      Reported-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Roi Dayan <roid@mellanox.com>
      Cc: Jiri Pirko <jiri@mellanox.com>
      Acked-by: John Fastabend <john.r.fastabend@intel.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d9363774
    • tipc: fix link statistics counter errors · 95901122
      Jon Paul Maloy committed
      In commit e4bf4f76 ("tipc: simplify packet sequence number
      handling") we changed the internal representation of the packet
      sequence number counters from u32 to u16, reflecting what is really
      sent over the wire.
      
      Since then some link statistics counters have been displaying incorrect
      values, partially because the counters meant to be used as sequence
      number snapshots are now used as direct counters, stored as u32, and
      partially because some counter updates are just missing in the code.
      
      In this commit we correct this in two ways. First, we base the
      displayed packet sent/received values on direct counters instead
      of, as previously, on a calculated difference between the current
      sequence number and a snapshot. Second, we add the missing updates
      of the counters.
      
      This change is compatible with the current netlink API, and requires
      no changes to the user space tools.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95901122
    • net: dsa: fix fixed-link-phy device leaks · fd05d7b1
      Johan Hovold committed
      Make sure to drop the reference taken by of_phy_find_device() when
      registering and deregistering the fixed-link PHY-device.
      
      Fixes: 39b0c705 ("net: dsa: Allow configuration of CPU & DSA port speeds/duplex")
      Signed-off-by: Johan Hovold <johan@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fd05d7b1
  5. 26 Nov, 2016 4 commits
    • tipc: resolve connection flow control compatibility problem · 6998cc6e
      Jon Paul Maloy committed
      In commit 10724cc7 ("tipc: redesign connection-level flow control")
      we replaced the previous message-based flow control with one based on
      1k blocks. To ensure backwards compatibility, the mechanism falls
      back to using messages as the base unit when it senses that the peer
      doesn't support the new algorithm. The default flow control window,
      i.e., how many units can be sent before the sender blocks and waits
      for an acknowledgement (aka advertisement), is 512. This was tested
      against the previous version, which uses an acknowledgement frequency
      of one ack per 256 received messages, and found to work fine.

      However, we missed the fact that versions older than Linux 3.15 use an
      acknowledgement frequency of 512, which is exactly the limit where a
      4.6+ sender will stop and wait for an acknowledgement. This would also
      work fine if it weren't for the fact that if the first message sent on
      a 4.6+ server side is an empty SYNACK, it is also counted as a sent
      message, while it is not counted as a received message on a legacy
      pre-3.15 receiver. This leads to the sender always being one step
      ahead of the receiver, a scenario causing the sender to block after
      512 sent messages, while the receiver has only registered 511 read
      messages. Hence, the legacy receiver is not triggered to send an
      acknowledgement, with a permanently blocked sender as a result.

      We solve this deadlock by simply allowing the sender to send one more
      message before it blocks, i.e., by making a minimal change to the
      condition used for determining connection congestion.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6998cc6e
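
The off-by-one deadlock can be reproduced with a small counting model. The sketch assumes, as the message describes, a window of 512 messages, a legacy receiver that acknowledges once per 512 received messages, and an initial SYNACK counted only by the sender. The function name and the `>=` versus `>` comparison are illustrative assumptions about the shape of the congestion test, not the kernel code itself.

```c
#include <stdbool.h>

#define WINDOW    512   /* sender's flow control window, in messages */
#define ACK_EVERY 512   /* legacy receiver acks once per 512 received */

/* Returns true if the sender blocks before the receiver has seen enough
 * messages to send an acknowledgement, i.e. if the connection deadlocks.
 * 'allow_one_extra' selects the relaxed congestion test. */
bool sender_deadlocks(bool allow_one_extra)
{
    unsigned int sent_unacked = 1;  /* the empty SYNACK counts as sent... */
    unsigned int rcv_unacked = 0;   /* ...but not as received by the peer */

    for (;;) {
        bool congested = allow_one_extra ? sent_unacked > WINDOW
                                         : sent_unacked >= WINDOW;
        if (congested)
            return true;            /* sender blocks, no ack is coming */

        sent_unacked++;             /* one more data message is sent */
        rcv_unacked++;              /* and received by the peer */

        if (rcv_unacked >= ACK_EVERY)
            return false;           /* receiver acks, sender is released */
    }
}
```

With the strict test the sender stops at 512 sent messages while the receiver has counted 511, so the function returns true. Allowing one more message lets the receiver reach 512 and acknowledge, so the function returns false.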
    • net: ethtool: don't require CAP_NET_ADMIN for ETHTOOL_GLINKSETTINGS · 8006f6bf
      Miroslav Lichvar committed
      The ETHTOOL_GLINKSETTINGS command supersedes the ETHTOOL_GSET
      command and, like it, shouldn't require the CAP_NET_ADMIN capability.
      Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8006f6bf
    • tipc: improve sanity check for received domain records · d876a4d2
      Jon Paul Maloy committed
      In commit 35c55c98 ("tipc: add neighbor monitoring framework") we
      added a data area to the link monitor STATE messages under the
      assumption that previous versions did not use any such data area.
      
      For versions older than Linux 4.3 this assumption is not correct. In
      those versions, all STATE messages sent out from a node inadvertently
      contain a 16-byte data area containing a string, a leftover from
      previous RESET messages which were using this during the setup phase.
      This string serves no purpose in STATE messages, and should not be there.
      
      Unfortunately, this data area is delivered to the link monitor
      framework, where a sanity check catches that it is not a correct domain
      record, and drops it. It also issues a rate limited warning about the
      event.
      
      Since such events occur much more frequently than anticipated, we now
      choose to remove the warning in order to not fill the kernel log with
      useless contents. We also make the sanity check stricter, to further
      reduce the risk that such data is inadvertently admitted.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d876a4d2
    • tipc: fix compatibility bug in link monitoring · f7967556
      Jon Paul Maloy committed
      Commit 81729810 ("tipc: fix link priority propagation") introduced a
      compatibility problem between TIPC versions newer than Linux 4.6 and
      those older than Linux 4.4. In versions later than 4.4, link STATE
      messages only contain a non-zero link priority value when the sender
      wants the receiver to change its priority. This has the effect that the
      receiver resets itself in order to apply the new priority. This works
      well, and is consistent with the said commit.
      
      However, in versions older than 4.4 a valid link priority is present in
      all sent link STATE messages, leading to cyclic link establishment and
      reset on the 4.6+ node.
      
      We fix this by adding a test that the received value should not only
      be valid, but also differ from the current value in order to cause the
      receiving link endpoint to reset.
      Reported-by: Amar Nv <amar.nv005@gmail.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f7967556
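
The added test can be summarised as a predicate: reset only when the received priority is valid and also differs from the current one. In the sketch below, `MAX_PRIORITY` and the function name are assumptions made for illustration, and a value of 0 is treated as "no priority change requested", as the message describes for newer senders.

```c
#include <stdbool.h>

/* Assumed upper bound for a valid link priority in this model. */
#define MAX_PRIORITY 31

/* Decide whether a priority carried in a STATE message should make the
 * receiving link endpoint reset itself. */
bool should_reset_for_priority(unsigned int received, unsigned int current)
{
    bool valid = received != 0 && received <= MAX_PRIORITY;

    /* An old peer repeats its valid priority in every STATE message.
     * Requiring a difference stops the endless reset cycle. */
    return valid && received != current;
}
```

An old peer that keeps sending priority 10 to a link already running at priority 10 no longer triggers a reset, while a genuine change to 12 still does.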
  6. 25 Nov, 2016 3 commits
  7. 24 Nov, 2016 8 commits
    • netfilter: nft_range: add the missing NULL pointer check · 49cdc4c7
      Liping Zhang committed
      Otherwise, the kernel will panic if the user does not specify
      the related attributes.
      
      Fixes: 0f3cd9b3 ("netfilter: nf_tables: add range expression")
      Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      49cdc4c7
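
A check of this kind usually rejects the request up front when a required attribute pointer is missing. The sketch below is a user-space model only: the attribute enum, `struct attr_model` and `range_init_check()` are hypothetical, and the real attribute names are not given in the message.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical set of attributes the expression requires. */
enum {
    ATTR_SREG,
    ATTR_OP,
    ATTR_FROM_DATA,
    ATTR_TO_DATA,
    ATTR_MAX
};

struct attr_model {
    int len;
};

/* Reject the request when any required attribute is absent, instead of
 * dereferencing a NULL pointer later on. */
int range_init_check(struct attr_model *tb[])
{
    for (int i = 0; i < ATTR_MAX; i++)
        if (!tb[i])
            return -EINVAL;
    return 0;
}
```

A table with any slot left NULL yields `-EINVAL`, and only a fully populated table passes.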
    • netfilter: nf_tables: fix inconsistent element expiration calculation · d3e2a111
      Anders K. Pedersen committed
      As Liping Zhang reports, after commit a8b1e36d ("netfilter: nft_dynset:
      fix element timeout for HZ != 1000"), priv->timeout was stored in jiffies,
      while set->timeout was stored in milliseconds. This is inconsistent and
      incorrect.
      
      Firstly, we already call msecs_to_jiffies in nft_set_elem_init, so
      priv->timeout will be converted to jiffies twice.
      
      Secondly, if the user did not specify the NFTA_DYNSET_TIMEOUT attr,
      set->timeout will be used, but we forget to call msecs_to_jiffies
      when updating elements.
      
      Fix this by using jiffies internally for traditional sets and doing the
      conversions to/from msec when interacting with userspace - as dynset
      already does.
      
      This is preferable to doing the conversions when elements are inserted or
      updated, because this can happen very frequently on busy dynsets.
      
      Fixes: a8b1e36d ("netfilter: nft_dynset: fix element timeout for HZ != 1000")
      Reported-by: Liping Zhang <zlpnobody@gmail.com>
      Signed-off-by: Anders K. Pedersen <akp@cohaesio.com>
      Acked-by: Liping Zhang <zlpnobody@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      d3e2a111
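
The effect of converting a timeout twice is easy to see with numbers. The helper below is a simplified stand-in for msecs_to_jiffies under an assumed tick rate of 250 Hz. `HZ_MODEL` and `msecs_to_jiffies_model()` are hypothetical, and the rounding is a plain round-up rather than the kernel's exact arithmetic.

```c
#define HZ_MODEL 250    /* assumed tick rate for this example */

/* Simplified millisecond-to-jiffies conversion, rounding up so that a
 * timeout never expires early. */
unsigned long msecs_to_jiffies_model(unsigned long ms)
{
    return (ms * HZ_MODEL + 999) / 1000;
}
```

A 4000 ms timeout converts to 1000 jiffies. Passing that result through the conversion again treats it as 1000 ms and yields 250 jiffies, which is 1 second at 250 Hz instead of the intended 4 seconds. At a tick rate of 1000 Hz the conversion is the identity, so the double conversion goes unnoticed.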
    • netfilter: nat: switch to new rhlist interface · 7223ecd4
      Florian Westphal committed
      I got an off-list bug report about failing connections and high CPU usage.
      This happens because we hit the 'elasticity' checks in rhashtable, which
      refuse bucket lists exceeding 16 entries.

      The nat bysrc hash unfortunately needs to insert distinct objects that
      share the same key and are identical (have the same source tuple); this
      cannot be avoided.

      Switch to the rhlist interface, which is designed for this.

      The nulls_base is removed here; I don't think it's needed:

      An (unlikely) false positive results in unneeded port clash resolution,
      a false negative results in packet drop during conntrack confirmation,
      when we try to insert the duplicate into main conntrack hash table.
      
      Tested by adding multiple ip addresses to host, then adding
      iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
      
      ... and then creating multiple connections, from same source port but
      different addresses:
      
      for i in $(seq 2000 2032);do nc -p 1234 192.168.7.1 $i > /dev/null  & done
      
      (all of these then get hashed to same bysource slot)
      
      Then, to test that nat conflict resolution is working:
      
      nc -s 10.0.0.1 -p 1234 192.168.7.1 2000
      nc -s 10.0.0.2 -p 1234 192.168.7.1 2000
      
      tcp  .. src=10.0.0.1 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1024 [ASSURED]
      tcp  .. src=10.0.0.2 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1025 [ASSURED]
      tcp  .. src=192.168.7.10 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1234 [ASSURED]
      tcp  .. src=192.168.7.10 dst=192.168.7.1 sport=1234 dport=2001 src=192.168.7.1 dst=192.168.7.10 sport=2001 dport=1234 [ASSURED]
      [..]
      
      -> nat altered source ports to 1024 and 1025, respectively.
      This can also be confirmed on destination host which shows
      ESTAB      0      0   192.168.7.1:2000      192.168.7.10:1024
      ESTAB      0      0   192.168.7.1:2000      192.168.7.10:1025
      ESTAB      0      0   192.168.7.1:2000      192.168.7.10:1234
      
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Fixes: 870190a9 ("netfilter: nat: convert nat bysrc hash to rhashtable")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      7223ecd4
    • netfilter: nat: fix cmp return value · 728e87b4
      Florian Westphal committed
      The comparator works like memcmp, i.e. 0 means objects are equal.
      In other words, when objects are distinct they are treated as identical,
      and when they are identical they are treated as distinct.

      The first case is rare (distinct objects are unlikely to get hashed to
      the same bucket).

      The second case results in unneeded port conflict resolution attempts.
      
      Fixes: 870190a9 ("netfilter: nat: convert nat bysrc hash to rhashtable")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      728e87b4
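
The memcmp convention is easy to get backwards when the comparator is built on a boolean "is the same" helper. The sketch below shows the intended relationship. `struct tuple_model`, `same_src()` and `tuple_cmp()` are hypothetical and far simpler than the real conntrack tuple.

```c
#include <stdbool.h>

struct tuple_model {
    unsigned int src_ip;
    unsigned short src_port;
};

/* Boolean helper: true when both objects have the same source. */
bool same_src(const struct tuple_model *a, const struct tuple_model *b)
{
    return a->src_ip == b->src_ip && a->src_port == b->src_port;
}

/* Comparator following the memcmp convention: 0 means "equal", any
 * non-zero value means "different". Returning same_src() directly
 * would invert the meaning. */
int tuple_cmp(const struct tuple_model *key, const struct tuple_model *obj)
{
    return !same_src(key, obj);
}
```

A hash table that treats 0 as a hit will now find entries whose source matches the key and skip the rest.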
    • netfilter: nft_hash: validate maximum value of u32 netlink hash attribute · abd66e9f
      Laura Garcia Liebana committed
      Use the function nft_parse_u32_check() to fetch the u32 attribute and
      validate that its value fits into the u8 hash len field.
      
      This patch revisits 4da449ae ("netfilter: nft_exthdr: Add size check
      on u8 nft_exthdr attributes").
      
      Fixes: cb1b69b0 ("netfilter: nf_tables: add hash expression")
      Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      abd66e9f
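
The idea of range-checking a 32-bit attribute before storing it in an 8-bit field can be sketched in user space. `parse_u32_check()` below is a hypothetical analogue written for illustration. It is not the kernel helper, whose exact signature is not shown in the message.

```c
#include <errno.h>
#include <stdint.h>

/* Store 'val' in '*dest' only if it does not exceed 'max'. On failure
 * the destination is left untouched. */
int parse_u32_check(uint32_t val, uint32_t max, uint32_t *dest)
{
    if (val > max)
        return -ERANGE;
    *dest = val;
    return 0;
}
```

Called with `UINT8_MAX` as the limit, a length of 16 is accepted, while 256 is rejected with `-ERANGE` instead of being silently truncated to 0 when narrowed to a u8.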
    • netfilter: Update nf_send_reset6 to consider L3 domain · 00b4422f
      David Ahern committed
      nf_send_reset6 is not considering the L3 domain and lookups are sent
      to the wrong table. For example consider the following output rule:
      
      ip6tables -A OUTPUT -p tcp --dport 12345 -j REJECT --reject-with tcp-reset
      
      using perf to analyze lookups via the fib6_table_lookup tracepoint shows:
      
      swapper     0 [001]   248.787816: fib6:fib6_table_lookup: table 255 oif 0 iif 1 src 2100:1::3 dst 2100:1:
              ffffffff81439cdc perf_trace_fib6_table_lookup ([kernel.kallsyms])
              ffffffff814c1ce3 trace_fib6_table_lookup ([kernel.kallsyms])
              ffffffff814c3e89 ip6_pol_route ([kernel.kallsyms])
              ffffffff814c40d5 ip6_pol_route_output ([kernel.kallsyms])
              ffffffff814e7b6f fib6_rule_action ([kernel.kallsyms])
              ffffffff81437f60 fib_rules_lookup ([kernel.kallsyms])
              ffffffff814e7c79 fib6_rule_lookup ([kernel.kallsyms])
              ffffffff814c2541 ip6_route_output_flags ([kernel.kallsyms])
                           528 nf_send_reset6 ([nf_reject_ipv6])
      
      The lookup is directed to table 255 rather than the table associated with
      the device via the L3 domain. Update nf_send_reset6 to pull the L3 domain
      from the dst currently attached to the skb.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      00b4422f
    • netfilter: Update ip_route_me_harder to consider L3 domain · 6d8b49c3
      David Ahern committed
      ip_route_me_harder is not considering the L3 domain and sends lookups
      to the wrong table. For example consider the following output rule:
      
      iptables -I OUTPUT -p tcp --dport 12345 -j REJECT --reject-with tcp-reset
      
      using perf to analyze lookups via the fib_table_lookup tracepoint shows:
      
      vrf-test  1187 [001] 46887.295927: fib:fib_table_lookup: table 255 oif 0 iif 0 src 0.0.0.0 dst 10.100.1.254 tos 0 scope 0 flags 0
              ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms])
              ffffffff81493aac fib_table_lookup ([kernel.kallsyms])
              ffffffff8148dda3 __inet_dev_addr_type ([kernel.kallsyms])
              ffffffff8148ddf6 inet_addr_type ([kernel.kallsyms])
              ffffffff8149e344 ip_route_me_harder ([kernel.kallsyms])
      
      and
      
      vrf-test  1187 [001] 46887.295933: fib:fib_table_lookup: table 255 oif 0 iif 1 src 10.100.1.254 dst 10.100.1.2 tos 0 scope 0 flags
              ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms])
              ffffffff81493aac fib_table_lookup ([kernel.kallsyms])
              ffffffff814998ff fib4_rule_action ([kernel.kallsyms])
              ffffffff81437f35 fib_rules_lookup ([kernel.kallsyms])
              ffffffff81499758 __fib_lookup ([kernel.kallsyms])
              ffffffff8144f010 fib_lookup.constprop.34 ([kernel.kallsyms])
              ffffffff8144f759 __ip_route_output_key_hash ([kernel.kallsyms])
              ffffffff8144fc6a ip_route_output_flow ([kernel.kallsyms])
              ffffffff8149e39b ip_route_me_harder ([kernel.kallsyms])
      
      In both cases the lookups are directed to table 255 rather than the
      table associated with the device via the L3 domain. Update both
      lookups to pull the L3 domain from the dst currently attached to the
      skb.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      6d8b49c3
    • net: revert "net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit" · a4cd0271
      WANG Cong committed
      This reverts commit 7c6ae610, because l2tp_xmit_skb() never
      returns NET_XMIT_CN: it ignores the return value of l2tp_xmit_core().
      
      Cc: Gao Feng <gfree.wind@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a4cd0271