1. 14 Nov, 2016 2 commits
    • ipv4: use new_gw for redirect neigh lookup · 969447f2
      Authored by Stephen Suryaputra Lin
      In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0
      and then the state of the neigh for the new_gw is checked. If the state
      isn't valid then the redirected route is deleted. This behavior is
      maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway
      is assigned to peer->redirect_learned.a4 before calling
      ipv4_neigh_lookup().
      
      After commit 5943634f ("ipv4: Maintain redirect and PMTU info in
      struct rtable again."), ipv4_neigh_lookup() is performed without the
      rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw)
      isn't zero, the function uses it as the key. The neigh is most likely
      valid since the old_gw is the one that sends the ICMP redirect message.
      Then the new_gw is assigned to fib_nh_exception. The problem is that
      ARP for the new_gw may never get resolved, so the traffic is blackholed.
      
      So, use the new_gw for neigh lookup.
      
      Changes from v1:
       - use __ipv4_neigh_lookup instead (per Eric Dumazet).
      
      Fixes: 5943634f ("ipv4: Maintain redirect and PMTU info in struct rtable again.")
      Signed-off-by: Stephen Suryaputra Lin <ssurya@ieee.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8152: Fix error path in open function · ca0a7531
      Authored by Guenter Roeck
      If usb_submit_urb() called from the open function fails, the following
      crash may be observed.
      
      r8152 8-1:1.0 eth0: intr_urb submit failed: -19
      ...
      r8152 8-1:1.0 eth0: v1.08.3
      Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b7b
      pgd = ffffffc0e7305000
      [6b6b6b6b6b6b6b7b] *pgd=0000000000000000, *pud=0000000000000000
      Internal error: Oops: 96000004 [#1] PREEMPT SMP
      ...
      PC is at notifier_chain_register+0x2c/0x58
      LR is at blocking_notifier_chain_register+0x54/0x70
      ...
      Call trace:
      [<ffffffc0002407f8>] notifier_chain_register+0x2c/0x58
      [<ffffffc000240bdc>] blocking_notifier_chain_register+0x54/0x70
      [<ffffffc00026991c>] register_pm_notifier+0x24/0x2c
      [<ffffffbffc183200>] rtl8152_open+0x3dc/0x3f8 [r8152]
      [<ffffffc000808000>] __dev_open+0xac/0x104
      [<ffffffc0008082f8>] __dev_change_flags+0xb0/0x148
      [<ffffffc0008083c4>] dev_change_flags+0x34/0x70
      [<ffffffc000818344>] do_setlink+0x2c8/0x888
      [<ffffffc0008199d4>] rtnl_newlink+0x328/0x644
      [<ffffffc000819e98>] rtnetlink_rcv_msg+0x1a8/0x1d4
      [<ffffffc0008373c8>] netlink_rcv_skb+0x68/0xd0
      [<ffffffc000817990>] rtnetlink_rcv+0x2c/0x3c
      [<ffffffc000836d1c>] netlink_unicast+0x16c/0x234
      [<ffffffc00083720c>] netlink_sendmsg+0x340/0x364
      [<ffffffc0007e85d0>] sock_sendmsg+0x48/0x60
      [<ffffffc0007e9c30>] SyS_sendto+0xe0/0x120
      [<ffffffc0007e9cb0>] SyS_send+0x40/0x4c
      [<ffffffc000203e34>] el0_svc_naked+0x24/0x28
      
      Clean up error handling to avoid registering the notifier if the open
      function is going to fail.
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 13 Nov, 2016 5 commits
    • net: bpqether.h: remove if_ether.h guard · 10b21768
      Authored by Baruch Siach
      __LINUX_IF_ETHER_H is not defined anywhere, and if_ether.h can keep itself from
      double inclusion, though it uses a single underscore prefix.
      Signed-off-by: Baruch Siach <baruch@tkos.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: __skb_flow_dissect() must cap its return value · 34fad54c
      Authored by Eric Dumazet
      After Tom's patch, the thoff field could point past the end of the
      buffer, which could fool some callers.
      
      If an skb was provided, skb->len should be the upper limit.
      If not, hlen is supposed to be the upper limit.
      
      Fixes: a6e544b0 ("flow_dissector: Jump to exit code in __skb_flow_dissect")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Yibin Yang <yibyang@cisco.com>
      Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
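The capping rule stated in this commit message can be sketched as a small Python model. This is not the kernel code; the function name `cap_thoff` and its parameters are hypothetical and exist only for illustration.

```python
from typing import Optional


def cap_thoff(thoff: int, skb_len: Optional[int], hlen: int) -> int:
    """Clamp a dissected transport-header offset to the buffer's upper limit.

    skb_len is None when no skb was provided; in that case hlen is the limit.
    """
    limit = skb_len if skb_len is not None else hlen
    return min(thoff, limit)
```

A caller that indexes the buffer with the returned offset can then never be pointed past its end.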
    • Merge branch 'fix-bpf_redirect' · 79774d6b
      Authored by David S. Miller
      Martin KaFai Lau says:
      
      ====================
      bpf: Fix bpf_redirect to an ipip/ip6tnl dev
      
      This patch set fixes a bug in bpf_redirect(dev, flags) when dev is an
      ipip/ip6tnl.  The current problem is IP-EthHdr-IP is sent out instead of
      IP-IP.
      
      Patch 1 adds a dev->type test similar to dev_is_mac_header_xmit()
      in act_mirred.c, which is only available in net-next.  We can consider
      refactoring it once this patch is pulled into net-next from net.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Add test for bpf_redirect to ipip/ip6tnl · 90e02896
      Authored by Martin KaFai Lau
      The test creates two netns, ns1 and ns2.  The host (the default netns)
      has an ipip or ip6tnl dev configured for tunneling traffic to the ns2.
      
          ping VIPs from ns1 <----> host <--tunnel--> ns2 (VIPs at loopback)
      
      The test is to have ns1 pinging VIPs configured at the loopback
      interface in ns2.
      
      The VIPs are 10.10.1.102 and 2401:face::66 (which are configured
      at lo@ns2). [Note: 0x66 => 102].
      
      At ns1, the VIPs are routed _via_ the host.
      
      At the host, bpf programs are installed at the veth to redirect packets
      from a veth to the ipip/ip6tnl.  The test is configured in a way so
      that both ingress and egress can be tested.
      
      At ns2, the ipip/ip6tnl dev is configured with the local and remote address
      specified.  The return path is routed to the dev ipip/ip6tnl.
      
      During the egress test, the host also locally tests pinging the VIPs to ensure
      that bpf_redirect at egress also works for the direct egress (i.e. not
      forwarding from dev ve1 to ve2).
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Fix bpf_redirect to an ipip/ip6tnl dev · 4e3264d2
      Authored by Martin KaFai Lau
      If the bpf program calls bpf_redirect(dev, 0) and dev is
      an ipip/ip6tnl, it currently includes the mac header.
      e.g. If dev is ipip, the end result is IP-EthHdr-IP instead
      of IP-IP.
      
      The fix is to pull the mac header.  At ingress, skb_postpull_rcsum()
      is not needed because the ethhdr should have been pulled once already
      and then got pushed back just before calling the bpf_prog.
      At egress, this patch calls skb_postpull_rcsum().
      
      If bpf_redirect(dev, BPF_F_INGRESS) is called,
      it also fails now because it calls dev_forward_skb(), which
      eventually calls eth_type_trans(skb, dev).  eth_type_trans()
      will set skb->pkt_type = PACKET_OTHERHOST because the mac address
      does not match the redirecting dev->dev_addr.  PACKET_OTHERHOST
      will eventually cause ip_rcv() to error out.  To fix this,
      ____dev_forward_skb() is added.
      
      Joint work with Daniel Borkmann.
      
      Fixes: cfc7381b ("ip_tunnel: add collect_md mode to IPIP tunnel")
      Fixes: 8d79266b ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
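The egress-side idea of this fix (pull the mac header before handing the packet to a tunnel device) can be illustrated with a toy Python model. The device-type labels and the function name `prepare_for_redirect` are hypothetical; the real change lives in the kernel's BPF redirect path and also deals with checksums, which this sketch ignores.

```python
ETH_HLEN = 14  # length of an Ethernet header in bytes

# Hypothetical device-type labels used only in this sketch.
NO_MAC_HEADER_TYPES = {"ipip", "ip6tnl"}


def prepare_for_redirect(frame: bytes, dev_type: str) -> bytes:
    """Return the bytes handed to the target device.

    Devices that do not carry a mac header get the Ethernet header pulled,
    so the tunnel encapsulates IP directly (IP-IP, not IP-EthHdr-IP).
    """
    if dev_type in NO_MAC_HEADER_TYPES:
        return frame[ETH_HLEN:]
    return frame
```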
  3. 11 Nov, 2016 7 commits
  4. 10 Nov, 2016 16 commits
    • net: tcp response should set oif only if it is L3 master · 9b6c14d5
      Authored by David Ahern
      Lorenzo noted an Android unit test failed due to e0d56fdd:
      "The expectation in the test was that the RST replying to a SYN sent to a
      closed port should be generated with oif=0. In other words it should not
      prefer the interface where the SYN came in on, but instead should follow
      whatever the routing table says it should do."
      
      Revert the change to ip_send_unicast_reply and tcp_v6_send_response such
      that the oif in the flow is set to the skb_iif only if skb_iif is an L3
      master.
      
      Fixes: e0d56fdd ("net: l3mdev: remove redundant calls")
      Reported-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Tested-by: Lorenzo Colitti <lorenzo@google.com>
      Acked-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Net Driver: Add Cypress GX3 VID=04b4 PID=3610. · 8da3cf2a
      Authored by Allan Chou
      Add support for Cypress GX3 SuperSpeed to Gigabit Ethernet
      Bridge Controller (Vendor=04b4 ProdID=3610).
      
      Patch verified on x64 linux kernel 4.7.4, 4.8.6, 4.9-rc4 systems
      with the Kensington SD4600P USB-C Universal Dock with Power,
      which uses the Cypress GX3 SuperSpeed to Gigabit Ethernet Bridge
      Controller.
      
      A similar patch was signed-off and tested-by Allan Chou
      <allan@asix.com.tw> on 2015-12-01.
      
      Allan verified his similar patch on x86 Linux kernel 4.1.6 system
      with Cypress GX3 SuperSpeed to Gigabit Ethernet Bridge Controller.
      Tested-by: Allan Chou <allan@asix.com.tw>
      Tested-by: Chris Roth <chris.roth@usask.ca>
      Tested-by: Artjom Simon <artjom.simon@gmail.com>
      Signed-off-by: Allan Chou <allan@asix.com.tw>
      Signed-off-by: Chris Roth <chris.roth@usask.ca>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 9fa684ec
      Authored by David S. Miller
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains a larger than usual batch of Netfilter
      fixes for your net tree. This series contains a mixture of old bugs and
      recently introduced bugs, they are:
      
      1) Fix a crash when using nft_dynset with nft_set_rbtree, which doesn't
         support the set element updates from the packet path. From Liping
         Zhang.
      
      2) Fix leak when nft_expr_clone() fails, from Liping Zhang.
      
      3) Fix a race when inserting new elements to the set hash from the
         packet path, also from Liping.
      
      4) Handle segmented TCP SIP packets properly, i.e. prevent the
         INVITE in the Allow header from creating bogus expectations by
         performing stricter SIP message parsing, from Ulrich Weber.
      
      5) nft_parse_u32_check() should return signed integer for errors, from
         John Linville.
      
      6) Fix wrong allocation instead of connlabels, allocate 16 instead of
         32 bytes, from Florian Westphal.
      
      7) Fix compilation breakage when building the ip_vs_sync code with
         CONFIG_OPTIMIZE_INLINING on x86, from Arnd Bergmann.
      
      8) Destroy the new set if the transaction object cannot be allocated,
         also from Liping Zhang.
      
      9) Use device to route duplicated packets via nft_dup only when set by
         the user, otherwise packets may not follow the right route, again
         from Liping.
      
      10) Fix wrong maximum genetlink attribute definition in IPVS, from
          WANG Cong.
      
      11) Ignore untracked conntrack objects from xt_connmark, from Florian
          Westphal.
      
      12) Allow the use of conntrack helpers that are registered as
          NFPROTO_UNSPEC via the CT target; otherwise we cannot use the
          h.245 helper, from Florian.
      
      13) Revisit garbage collection heuristic in the new workqueue-based
          timer approach for conntrack to evict objects earlier, again from
          Florian.
      
      14) Fix crash in nf_tables when inserting an element into a verdict map,
          from Liping Zhang.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rtnl: reset calcit fptr in rtnl_unregister() · f567e950
      Authored by Mathias Krause
      To avoid having dangling function pointers left behind, reset calcit in
      rtnl_unregister(), too.
      
      This is no issue so far, as only the rtnl core registers a netlink
      handler with a calcit hook which won't be unregistered, but may become
      one if new code makes use of the calcit hook.
      
      Fixes: c7ac8679 ("rtnetlink: Compute and store minimum ifinfo...")
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Greg Rose <gregory.v.rose@intel.com>
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
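The point of this change, clearing every hook of a handler slot on unregistration so that no stale function pointer survives, can be sketched in Python. The `Handlers` class, the `table` dictionary and the `register`/`unregister` helpers are hypothetical stand-ins, not the rtnetlink API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Handlers:
    doit: Optional[Callable] = None
    dumpit: Optional[Callable] = None
    calcit: Optional[Callable] = None


# (protocol, msgtype) -> Handlers; a stand-in for the kernel's handler table.
table: Dict[Tuple[int, int], Handlers] = {}


def register(protocol, msgtype, doit=None, dumpit=None, calcit=None):
    table[(protocol, msgtype)] = Handlers(doit, dumpit, calcit)


def unregister(protocol, msgtype):
    """Clear every hook of the slot, calcit included, so none is left dangling."""
    slot = table[(protocol, msgtype)]
    slot.doit = None
    slot.dumpit = None
    slot.calcit = None
```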
    • vxlan: hide unused local variable · 4053ab1b
      Authored by Arnd Bergmann
      A bugfix introduced a harmless warning in v4.9-rc4:
      
      drivers/net/vxlan.c: In function 'vxlan_group_used':
      drivers/net/vxlan.c:947:21: error: unused variable 'sock6' [-Werror=unused-variable]
      
      This hides the variable inside of the same #ifdef that is
      around its user. The extraneous initialization, which was
      accidentally introduced in the same commit, is removed at the
      same time.
      
      Fixes: c6fcc4fc ("vxlan: avoid using stale vxlan socket.")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ibmvnic: Start completion queue negotiation at server-provided optimum values · 6dbcd8fb
      Authored by John Allen
      Use the opt_* fields to determine the starting point for negotiating the
      number of tx/rx completion queues with the vnic server. These contain the
      number of queues that the vnic server estimates that it will be able to
      allocate. While renegotiation may still occur, using the opt_* fields will
      reduce the number of times this needs to happen and will prevent driver
      probe timeout on systems using large numbers of ibmvnic client devices per
      vnic port.
      Signed-off-by: John Allen <jallen@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: icmp_route_lookup should use rt dev to determine L3 domain · 9d1a6c4e
      Authored by David Ahern
      icmp_send is called in response to some event. The skb may not have
      the device set (skb->dev is NULL), but it is expected to have an rt.
      Update icmp_route_lookup to use the rt on the skb to determine L3
      domain.
      
      Fixes: 613d09b3 ("net: Use VRF device index for lookups on TX")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'qcom-emac-pause' · fd6f24d7
      Authored by David S. Miller
      Timur Tabi says:
      
      ====================
      net: qcom/emac: ensure that pause frames are enabled
      
      The qcom emac driver experiences significant packet loss (through frame
      check sequence errors) if flow control is not enabled and the phy is
      not configured to allow pause frames to pass through it.  Therefore, we
      need to enable flow control and force the phy to pass pause frames.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: qcom/emac: enable flow control if requested · df63022e
      Authored by Timur Tabi
      If the PHY has been configured to allow pause frames, then the MAC
      should be configured to generate and/or accept those frames.
      Signed-off-by: Timur Tabi <timur@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: qcom/emac: configure the external phy to allow pause frames · 3e884493
      Authored by Timur Tabi
      Pause frames are used to enable flow control.  A MAC can send and
      receive pause frames in order to throttle traffic.  However, the PHY
      must be configured to allow those frames to pass through.
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Timur Tabi <timur@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bgmac: fix reversed checks for clock control flag · cdb26d33
      Authored by Rafał Miłecki
      This fixes a regression introduced by the patch adding feature flags. It
      was already reported and a patch followed (and was accepted), but that
      patch turned out to be incorrect: instead of fixing the reversed
      condition, it broke a good one.

      This patch was verified to actually fix SoC hangs caused by bgmac on
      BCM47186B0.
      
      Fixes: db791eb2 ("net: ethernet: bgmac: convert to feature flags")
      Fixes: 4af1474e ("net: bgmac: Fix errant feature flag check")
      Cc: Jon Mason <jon.mason@broadcom.com>
      Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bna: Add synchronization for tx ring. · d667f785
      Authored by Benjamin Poirier
      We received two reports of BUG_ON in bnad_txcmpl_process() where
      hw_consumer_index appeared to be ahead of producer_index. Out of order
      write/read of these variables could explain these reports.
      
      bnad_start_xmit(), as a producer of tx descriptors, has a few memory
      barriers sprinkled around writes to producer_index and the device's
      doorbell but they're not paired with anything in bnad_txcmpl_process(), a
      consumer.
      
      Since we are synchronizing with a device, we must use mandatory barriers,
      not smp_*. Also, I didn't see the purpose of the last smp_mb() in
      bnad_start_xmit().
      Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "net/mlx4_en: Fix panic during reboot" · f91d7181
      Authored by Tariq Toukan
      This reverts commit 9d2afba0.
      
      The original issue would possibly exist if an external module
      tried calling our "ethtool_ops" without checking if it still
      exists.
      
      The right way of solving it is by simply doing the check in
      the caller side.
      Currently, no action is required as there's no such use case.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-ipv6: on device mtu change do not add mtu to mtu-less routes · fb56be83
      Authored by Maciej Żenczykowski
      Routes can specify an mtu explicitly or inherit the mtu from
      the underlying device - this inheritance is implemented in
      dst->ops->mtu handlers ip6_mtu() and ip6_blackhole_mtu().
      
      Currently changing the mtu of a device adds mtu explicitly
      to routes using that device.
      
      i.e.
        # ip link set dev lo mtu 65536
        # ip -6 route add local 2000::1 dev lo
        # ip -6 route get 2000::1
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
      
        # ip link set dev lo mtu 65535
        # ip -6 route get 2000::1
        local 2000::1 dev lo  table local  src ...  metric 1024  mtu 65535 pref medium
      
        # ip link set dev lo mtu 65536
        # ip -6 route get 2000::1
        local 2000::1 dev lo  table local  src ...  metric 1024  mtu 65536 pref medium
      
        # ip -6 route del local 2000::1
      
      After this patch the route entry no longer changes unless it already has an mtu.
      There is no need: this inheritance is already done in ip6_mtu().
      
        # ip link set dev lo mtu 65536
        # ip -6 route add local 2000::1 dev lo
        # ip -6 route add local 2000::2 dev lo mtu 2000
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 2000 pref medium
      
        # ip link set dev lo mtu 65535
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 2000 pref medium
      
        # ip link set dev lo mtu 1501
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 1501 pref medium
      
        # ip link set dev lo mtu 65536
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 65536 pref medium
      
        # ip -6 route del local 2000::1
        # ip -6 route del local 2000::2
      
      This is desirable because changing device mtu and then resetting it
      to the previous value shouldn't change the user visible routing table.
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      CC: Eric Dumazet <edumazet@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
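The post-patch transcript in this commit can be replayed with a small Python model. The `Route` and `Device` classes are hypothetical, and the update rule in `set_mtu` is an assumption chosen to be consistent with the transcript shown (mtu-less routes are never touched; a route with an explicit mtu is lowered when it no longer fits, or follows the device when it was equal to the old device mtu). It is not a copy of the kernel logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    explicit_mtu: Optional[int] = None  # None means "inherit from the device"


class Device:
    def __init__(self, mtu: int):
        self.mtu = mtu
        self.routes = []

    def add_route(self, explicit_mtu: Optional[int] = None) -> Route:
        route = Route(explicit_mtu)
        self.routes.append(route)
        return route

    def set_mtu(self, new_mtu: int) -> None:
        old_mtu = self.mtu
        for route in self.routes:
            if route.explicit_mtu is None:
                continue  # mtu-less routes stay mtu-less
            # Assumed rule: lower a route mtu that no longer fits, or let it
            # follow the device if it was tracking the old device mtu.
            if route.explicit_mtu >= new_mtu or route.explicit_mtu == old_mtu:
                route.explicit_mtu = new_mtu
        self.mtu = new_mtu

    def effective_mtu(self, route: Route) -> int:
        # Inheritance happens at lookup time, as the commit notes for ip6_mtu().
        return route.explicit_mtu if route.explicit_mtu is not None else self.mtu
```

Replaying the sequence 65536, 65535, 1501, 65536 on a device with one mtu-less route and one route created with mtu 2000 leaves the first route without an explicit mtu throughout, while the second goes 2000, 2000, 1501, 65536, matching the output above.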
    • sock: fix sendmmsg for partial sendmsg · 3023898b
      Authored by Soheil Hassas Yeganeh
      Do not send the next message in sendmmsg for partial sendmsg
      invocations.
      
      sendmmsg assumes that it can continue sending the next message
      when the return value of the individual sendmsg invocations
      is positive. It results in corrupting the data for TCP,
      SCTP, and UNIX streams.
      
      For example, sendmmsg([["abcd"], ["efgh"]]) can result in a stream
      of "aefgh" if the first sendmsg invocation sends only the first
      byte while the second sendmsg goes through.
      
      Datagram sockets either send the entire datagram or fail, so
      this patch affects only sockets of type SOCK_STREAM and
      SOCK_SEQPACKET.
      
      Fixes: 228e548e ("net: Add sendmmsg socket system call")
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
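The "aefgh" corruption described in this commit can be reproduced with a toy Python simulation. `FakeStream` is a hypothetical stream socket whose first `send()` accepts a single byte; `sendmmsg_buggy` and `sendmmsg_fixed` are simplified models of the loop before and after the fix, not the kernel implementation.

```python
from typing import Callable, List


def sendmmsg_buggy(send: Callable[[bytes], int], msgs: List[bytes]) -> int:
    """Pre-fix behaviour: keep going whenever send() returns a positive count."""
    sent = 0
    for msg in msgs:
        if send(msg) <= 0:
            break
        sent += 1
    return sent


def sendmmsg_fixed(send: Callable[[bytes], int], msgs: List[bytes]) -> int:
    """Post-fix behaviour: stop after the first partially sent message."""
    sent = 0
    for msg in msgs:
        n = send(msg)
        if n <= 0:
            break
        sent += 1
        if n < len(msg):
            break  # partial send: do not start the next message
    return sent


class FakeStream:
    """Hypothetical stream socket whose first send() accepts only one byte."""

    def __init__(self):
        self.data = b""
        self.first = True

    def send(self, msg: bytes) -> int:
        n = 1 if self.first else len(msg)
        self.first = False
        self.data += msg[:n]
        return n
```

With the messages `b"abcd"` and `b"efgh"`, the buggy loop leaves `b"aefgh"` in the stream, while the fixed loop stops after `b"a"` so the caller can resend the remainder in order.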
    • driver: macvlan: Destroy new macvlan port if macvlan_common_newlink failed. · aa5fd0fb
      Authored by Gao Feng
      When there is no existing macvlan port in lowdev, a new macvlan port
      is created. However, it is not destroyed when something fails later,
      which causes a memory leak.

      Add a flag to indicate whether a new macvlan port was created.
      Signed-off-by: Gao Feng <fgao@ikuai8.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
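The flag-based cleanup described in this commit follows a common pattern: remember whether this call created the resource, and undo only that on failure. A minimal Python sketch, in which `LowerDev`, `common_newlink` and the `setup` callback are all hypothetical names:

```python
class LowerDev:
    """Hypothetical stand-in for the lower device holding an optional port."""

    def __init__(self):
        self.port = None


def common_newlink(lowerdev: LowerDev, setup) -> None:
    """Create a port on demand and destroy it again if a later step fails."""
    created_port = False  # the flag the commit message refers to
    if lowerdev.port is None:
        lowerdev.port = object()
        created_port = True
    try:
        setup()  # any later step that may fail
    except Exception:
        if created_port:
            lowerdev.port = None  # undo only what this call created
        raise
```

A port that existed before the call is left alone on failure, since other users may still depend on it.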
  5. 09 Nov, 2016 5 commits
    • netfilter: nf_tables: fix oops when inserting an element into a verdict map · 58c78e10
      Authored by Liping Zhang
      Dalegaard says:
       The following ruleset, when loaded with 'nft -f bad.txt'
       ----snip----
       flush ruleset
       table ip inlinenat {
         map sourcemap {
           type ipv4_addr : verdict;
         }
      
         chain postrouting {
           ip saddr vmap @sourcemap accept
         }
       }
       add chain inlinenat test
       add element inlinenat sourcemap { 100.123.10.2 : jump test }
       ----snip----
      
       results in a kernel oops:
       BUG: unable to handle kernel paging request at 0000000000001344
       IP: [<ffffffffa07bf704>] nf_tables_check_loops+0x114/0x1f0 [nf_tables]
       [...]
       Call Trace:
        [<ffffffffa07c2aae>] ? nft_data_init+0x13e/0x1a0 [nf_tables]
        [<ffffffffa07c1950>] nft_validate_register_store+0x60/0xb0 [nf_tables]
        [<ffffffffa07c74b5>] nft_add_set_elem+0x545/0x5e0 [nf_tables]
        [<ffffffffa07bfdd0>] ? nft_table_lookup+0x30/0x60 [nf_tables]
        [<ffffffff8132c630>] ? nla_strcmp+0x40/0x50
        [<ffffffffa07c766e>] nf_tables_newsetelem+0x11e/0x210 [nf_tables]
        [<ffffffff8132c400>] ? nla_validate+0x60/0x80
        [<ffffffffa030d9b4>] nfnetlink_rcv+0x354/0x5a7 [nfnetlink]
      
      This happens because we forget to fill in the net pointer in bind_ctx,
      so dereferencing it may cause a kernel crash.
      Reported-by: Dalegaard <dalegaard@gmail.com>
      Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: conntrack: refine gc worker heuristics · e0df8cae
      Authored by Florian Westphal
      Nicolas Dichtel says:
        After commit b87a2f91 ("netfilter: conntrack: add gc worker to
        remove timed-out entries"), netlink conntrack deletion events may be
        sent with a huge delay.
      
      Nicolas further points at this line:
      
        goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);
      
      and indeed, this isn't optimal at all.  Rationale here was to ensure that
      we don't block other work items for too long, even if
      nf_conntrack_htable_size is huge.  But in order to have some guarantee
      about maximum time period where a scan of the full conntrack table
      completes we should always use a fixed slice size, so that once every
      N scans the full table has been examined at least once.
      
      We also need to balance this vs. the case where the system is either idle
      (i.e., conntrack table (almost) empty) or very busy (i.e. eviction happens
      from packet path).
      
      So, after some discussion with Nicolas:
      
      1. want hard guarantee that we scan entire table at least once every X s
      -> need to scan fraction of table (get rid of upper bound)
      
      2. don't want to eat cycles on idle or very busy system
      -> increase interval if we did not evict any entries
      
      3. don't want to block other worker items for too long
      -> make fraction really small, and prefer small scan interval instead
      
      4. Want a reasonably short time until we detect timed-out entries when
      the system went idle after a burst of traffic, while not doing scans
      all the time.
      -> Store next gc scan in worker, increasing delays when no eviction
      happened and shrinking delay when we see timed out entries.
      
      The old gc interval is turned into a maximum; scans can now happen
      every jiffy if stale entries are present.
      
      Longest possible time period until an entry is evicted is now 2 minutes
      in worst case (entry expires right after it was deemed 'not expired').
      Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
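Point 4 of the list above (shrink the delay when stale entries are seen, grow it when nothing was evicted) can be sketched as an adaptive back-off. The function name `next_gc_delay`, the halving and doubling steps, and the default bounds are all illustrative assumptions; the commit message does not state the exact step sizes used by the kernel.

```python
def next_gc_delay(current_delay: int, expired: int,
                  min_delay: int = 1, max_delay: int = 200) -> int:
    """Return the delay (in ticks) before the next partial table scan.

    min_delay and max_delay are illustrative values, not the kernel's.
    """
    if expired > 0:
        # Stale entries were found: scan again sooner.
        return max(min_delay, current_delay // 2)
    # Nothing evicted: back off, with the old fixed interval as a ceiling.
    return min(max_delay, current_delay * 2)
```

Because each scan covers a fixed fraction of the table, bounding the delay also bounds the time until the whole table has been examined once.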
    • netfilter: conntrack: fix CT target for UNSPEC helpers · 6114cc51
      Authored by Florian Westphal
      Thomas reports it's not possible to attach the H.245 helper:
      
      iptables -t raw -A PREROUTING -p udp -j CT --helper H.245
      iptables: No chain/target/match by that name.
      xt_CT: No such helper "H.245"
      
      This is because H.245 registers as NFPROTO_UNSPEC, but the CT target
      passes NFPROTO_IPV4/IPV6 to nf_conntrack_helper_try_module_get.
      
      We should treat UNSPEC as wildcard and ignore the l3num instead.
      Reported-by: Thomas Woerner <twoerner@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
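The wildcard treatment of NFPROTO_UNSPEC can be sketched as a lookup rule in Python. The `helpers` registry, the `example-v4` entry and the `find_helper` function are hypothetical; only the idea (a helper registered as UNSPEC matches any requested l3 family) comes from the commit message.

```python
NFPROTO_UNSPEC = 0  # wildcard family
NFPROTO_IPV4 = 2
NFPROTO_IPV6 = 10

# name -> l3 family the helper registered with (illustrative entries only)
helpers = {"H.245": NFPROTO_UNSPEC, "example-v4": NFPROTO_IPV4}


def find_helper(name: str, l3num: int) -> bool:
    """Return True if a helper matches, treating UNSPEC as a wildcard."""
    registered = helpers.get(name)
    if registered is None:
        return False
    return registered == NFPROTO_UNSPEC or registered == l3num
```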
    • netfilter: connmark: ignore skbs with magic untracked conntrack objects · fb9c9649
      Authored by Florian Westphal
      The (percpu) untracked conntrack entries can end up with nonzero connmarks.
      
      The 'untracked' conntrack objects are merely a way to distinguish INVALID
      (i.e. protocol connection tracker says payload doesn't meet some
      requirements or packet was never seen by the connection tracking code)
      from packets that are intentionally not tracked (some icmpv6 types such as
      neigh solicitation, or by using 'iptables -j CT --notrack' option).
      
      Untracked conntrack objects are an implementation detail; we might as well use
      an invalid magic address instead to tell INVALID and UNTRACKED apart.
      
      Check skb->nfct for untracked dummy and behave as if skb->nfct is NULL.
      Reported-by: XU Tianwen <evan.xu.tianwen@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • ipvs: use IPVS_CMD_ATTR_MAX for family.maxattr · 8fbfef7f
      Authored by WANG Cong
      family.maxattr is the max index for policy[], the size of
      ops[] is determined with ARRAY_SIZE().
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Tested-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  6. 08 Nov, 2016 5 commits