1. 02 5月, 2016 4 次提交
    • J
      gre: do not pull header in ICMP error processing · b7f8fe25
      Jiri Benc 提交于
      iptunnel_pull_header expects that IP header was already pulled; with this
      expectation, it pulls the tunnel header. This is not true in gre_err.
      Furthermore, ipv4_update_pmtu and ipv4_redirect expect that skb->data points
      to the IP header.
      
      We cannot pull the tunnel header in this path. It's just a matter of not
      calling iptunnel_pull_header - we don't need any of its effects.
      
      Fixes: bda7bb46 ("gre: Allow multiple protocol listener for gre protocol.")
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7f8fe25
    • H
      tipc: only process unicast on intended node · efe79050
      Hamish Martin 提交于
      We have observed complete lock up of broadcast-link transmission due to
      unacknowledged packets never being removed from the 'transmq' queue. This
      is traced to nodes having their ack field set beyond the sequence number
      of packets that have actually been transmitted to them.
      Consider an example where node 1 has sent 10 packets to node 2 on a
      link and node 3 has sent 20 packets to node 2 on another link. We
      see examples of an ack from node 2 destined for node 3 being treated as
      an ack from node 2 at node 1. This leads to the ack on the node 1 to node
      2 link being increased to 20 even though we have only sent 10 packets.
      When node 1 does get around to sending further packets, none of the
      packets with sequence numbers less than 21 are actually removed from the
      transmq.
      To resolve this we reinstate some code lost in commit d999297c ("tipc:
      reduce locking scope during packet reception") which ensures that only
      messages destined for the receiving node are processed by that node. This
      prevents the sequence numbers from getting out of sync and resolves the
      packet leakage, thereby resolving the broadcast-link transmission
      lock-ups we observed.
      
      While we are aware that this change only patches over a root problem that
      we still haven't identified, this is a sanity test that it is always
      legitimate to do. It will remain in the code even after we identify and
      fix the real problem.
      Reviewed-by: NChris Packham <chris.packham@alliedtelesis.co.nz>
      Reviewed-by: NJohn Thompson <john.thompson@alliedtelesis.co.nz>
      Signed-off-by: NHamish Martin <hamish.martin@alliedtelesis.co.nz>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      efe79050
    • C
      soreuseport: Fix TCP listener hash collision · 90e5d0db
      Craig Gallek 提交于
      I forgot to include a check for listener port equality when deciding
      if two sockets should belong to the same reuseport group.  This was
      not caught previously because it's only necessary when two listening
      sockets for the same user happen to hash to the same listener bucket.
      The same error does not exist in the UDP path.
      
      Fixes: c125e80b("soreuseport: fast reuseport TCP socket selection")
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      90e5d0db
    • W
      net: l2tp: fix reversed udp6 checksum flags · 018f8258
      Wang Shanker 提交于
      This patch fixes a bug which causes the behavior of whether to ignore
      udp6 checksum of udp6 encapsulated l2tp tunnel contrary to what
      userspace program requests.
      
      When the flag `L2TP_ATTR_UDP_ZERO_CSUM6_RX` is set by userspace, it is
      expected that udp6 checksums of received packets of the l2tp tunnel
      to create should be ignored. In `l2tp_netlink.c`:
      `l2tp_nl_cmd_tunnel_create()`, `cfg.udp6_zero_rx_checksums` is set
      according to the flag, and then passed to `l2tp_core.c`:
      `l2tp_tunnel_create()` and then `l2tp_tunnel_sock_create()`. In
      `l2tp_tunnel_sock_create()`, `udp_conf.use_udp6_rx_checksums` is set
      the same to `cfg.udp6_zero_rx_checksums`. However, if we want the
      checksum to be ignored, `udp_conf.use_udp6_rx_checksums` should be set
      to `false`, i.e. be set to the contrary. Similarly, the same should be
      done to `udp_conf.use_udp6_tx_checksums`.
      Signed-off-by: NMiao Wang <shankerwangmiao@gmail.com>
      Acked-by: NJames Chapman <jchapman@katalix.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      018f8258
  2. 30 4月, 2016 1 次提交
  3. 29 4月, 2016 3 次提交
    • J
      gre: reject GUE and FOU in collect metadata mode · 946b636f
      Jiri Benc 提交于
      The collect metadata mode does not support GUE nor FOU. This might be
      implemented later; until then, we should reject such config.
      
      I think this is okay to be changed. It's unlikely anyone has such
      configuration (as it doesn't work anyway) and we may need a way to
      distinguish whether it's supported or not by the kernel later.
      
      For backwards compatibility with iproute2, it's not possible to just check
      the attribute presence (iproute2 always includes the attribute), the actual
      value has to be checked, too.
      
      Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.")
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      946b636f
    • J
      gre: build header correctly for collect metadata tunnels · 2090714e
      Jiri Benc 提交于
      In ipgre (i.e. not gretap) + collect metadata mode, the skb was assumed to
      contain Ethernet header and was encapsulated as ETH_P_TEB. This is not the
      case, the interface is ARPHRD_IPGRE and the protocol to be used for
      encapsulation is skb->protocol.
      
      Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.")
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2090714e
    • J
      gre: do not assign header_ops in collect metadata mode · a64b04d8
      Jiri Benc 提交于
      In ipgre mode (i.e. not gretap) with collect metadata flag set, the tunnel
      is incorrectly assumed to be mGRE in NBMA mode (see commit 6a5f44d7).
      This is not the case, we're controlling the encapsulation addresses by
      lwtunnel metadata. And anyway, assigning dev->header_ops in collect metadata
      mode does not make sense.
      
      Although it would be more user firendly to reject requests that specify
      both the collect metadata flag and a remote/local IP address, this would
      break current users of gretap or introduce ugly code and differences in
      handling ipgre and gretap configuration. Keep the current behavior of
      remote/local IP address being ignored in such case.
      
      v3: Back to v1, added explanation paragraph.
      v2: Reject configuration specifying both remote/local address and collect
          metadata flag.
      
      Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.")
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a64b04d8
  4. 27 4月, 2016 1 次提交
  5. 26 4月, 2016 3 次提交
    • D
      net: ipv6: Delete host routes on an ifdown · 38bd10c4
      David Ahern 提交于
      It was a simple idea -- save IPv6 configured addresses on a link down
      so that IPv6 behaves similar to IPv4. As always the devil is in the
      details and the IPv6 stack as too many behavioral differences from IPv4
      making the simple idea more complicated than it needs to be.
      
      The current implementation for keeping IPv6 addresses can panic or spit
      out a warning in one of many paths:
      
      1. IPv6 route gets an IPv4 route as its 'next' which causes a panic in
         rt6_fill_node while handling a route dump request.
      
      2. rt->dst.obsolete is set to DST_OBSOLETE_DEAD hitting the WARN_ON in
         fib6_del
      
      3. Panic in fib6_purge_rt because rt6i_ref count is not 1.
      
      The root cause of all these is references related to the host route for
      an address that is retained.
      
      So, this patch deletes the host route every time the ifdown loop runs.
      Since the host route is deleted and will be re-generated an up there is
      no longer a need for the l3mdev fix up. On the 'admin up' side move
      addrconf_permanent_addr into the NETDEV_UP event handling so that it
      runs only once versus on UP and CHANGE events.
      
      All of the current panics and warnings appear to be related to
      addresses on the loopback device, but given the catastrophic nature when
      a bug is triggered this patch takes the conservative approach and evicts
      all host routes rather than trying to determine when it can be re-used
      and when it can not. That can be a later optimizaton if desired.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38bd10c4
    • D
      Revert "ipv6: Revert optional address flusing on ifdown." · 6a923934
      David S. Miller 提交于
      This reverts commit 841645b5.
      
      Ok, this puts the feature back.  I've decided to apply David A.'s
      bug fix and run with that rather than make everyone wait another
      whole release for this feature.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6a923934
    • D
      ipv6: Revert optional address flusing on ifdown. · 841645b5
      David S. Miller 提交于
      This reverts the following three commits:
      
      70af921d
      799977d9
      f1705ec1
      
      The feature was ill conceived, has terrible semantics, and has added
      nothing but regressions to the already fragile ipv6 stack.
      
      Fixes: f1705ec1 ("net: ipv6: Make address flushing on ifdown optional")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      841645b5
  6. 25 4月, 2016 4 次提交
  7. 24 4月, 2016 5 次提交
  8. 22 4月, 2016 5 次提交
    • S
      openvswitch: use flow protocol when recalculating ipv6 checksums · b4f70527
      Simon Horman 提交于
      When using masked actions the ipv6_proto field of an action
      to set IPv6 fields may be zero rather than the prevailing protocol
      which will result in skipping checksum recalculation.
      
      This patch resolves the problem by relying on the protocol
      in the flow key rather than that in the set field action.
      
      Fixes: 83d2b9ba ("net: openvswitch: Support masked set actions.")
      Cc: Jarno Rajahalme <jrajahalme@nicira.com>
      Signed-off-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4f70527
    • M
      tcp: Merge tx_flags and tskey in tcp_shifted_skb · cfea5a68
      Martin KaFai Lau 提交于
      After receiving sacks, tcp_shifted_skb() will collapse
      skbs if possible.  tx_flags and tskey also have to be
      merged.
      
      This patch reuses the tcp_skb_collapse_tstamp() to handle
      them.
      
      BPF Output Before:
      ~~~~~
      <no-output-due-to-missing-tstamp-event>
      
      BPF Output After:
      ~~~~~
      <...>-2024  [007] d.s.    88.644374: : ee_data:14599
      
      Packetdrill Script:
      ~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 1460) = 1460
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 13140) = 13140
      
      0.200 > P. 1:1461(1460) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:14601(5840) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 14601 win 257
      
      0.400 close(4) = 0
      0.400 > F. 14601:14601(0) ack 1
      0.500 < F. 1:1(0) ack 14602 win 257
      0.500 > . 14602:14602(0) ack 2
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Tested-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cfea5a68
    • M
      tcp: Merge tx_flags and tskey in tcp_collapse_retrans · 082ac2d5
      Martin KaFai Lau 提交于
      If two skbs are merged/collapsed during retransmission, the current
      logic does not merge the tx_flags and tskey.  The end result is
      the SCM_TSTAMP_ACK timestamp could be missing for a packet.
      
      The patch:
      1. Merge the tx_flags
      2. Overwrite the prev_skb's tskey with the next_skb's tskey
      
      BPF Output Before:
      ~~~~~~
      <no-output-due-to-missing-tstamp-event>
      
      BPF Output After:
      ~~~~~~
      packetdrill-2092  [001] d.s.   453.998486: : ee_data:1459
      
      Packetdrill Script:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 730) = 730
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 730) = 730
      +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
      0.200 write(4, ..., 11680) = 11680
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      
      0.200 > P. 1:731(730) ack 1
      0.200 > P. 731:1461(730) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:13141(4380) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 13141 win 257
      
      0.400 close(4) = 0
      0.400 > F. 13141:13141(0) ack 1
      0.500 < F. 1:1(0) ack 13142 win 257
      0.500 > . 13142:13142(0) ack 2
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Tested-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      082ac2d5
    • M
      tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks · 479f85c3
      Martin KaFai Lau 提交于
      Assuming SOF_TIMESTAMPING_TX_ACK is on. When dup acks are received,
      it could incorrectly think that a skb has already
      been acked and queue a SCM_TSTAMP_ACK cmsg to the
      sk->sk_error_queue.
      
      In tcp_ack_tstamp(), it checks
      'between(shinfo->tskey, prior_snd_una, tcp_sk(sk)->snd_una - 1)'.
      If prior_snd_una == tcp_sk(sk)->snd_una like the following packetdrill
      script, between() returns true but the tskey is actually not acked.
      e.g. try between(3, 2, 1).
      
      The fix is to replace between() with one before() and one !before().
      By doing this, the -1 offset on the tcp_sk(sk)->snd_una can also be
      removed.
      
      A packetdrill script is used to reproduce the dup ack scenario.
      Due to the lacking cmsg support in packetdrill (may be I
      cannot find it),  a BPF prog is used to kprobe to
      sock_queue_err_skb() and print out the value of
      serr->ee.ee_data.
      
      Both the packetdrill and the bcc BPF script is attached at the end of
      this commit message.
      
      BPF Output Before Fix:
      ~~~~~~
            <...>-2056  [001] d.s.   433.927987: : ee_data:1459  #incorrect
      packetdrill-2056  [001] d.s.   433.929563: : ee_data:1459  #incorrect
      packetdrill-2056  [001] d.s.   433.930765: : ee_data:1459  #incorrect
      packetdrill-2056  [001] d.s.   434.028177: : ee_data:1459
      packetdrill-2056  [001] d.s.   434.029686: : ee_data:14599
      
      BPF Output After Fix:
      ~~~~~~
            <...>-2049  [000] d.s.   113.517039: : ee_data:1459
            <...>-2049  [000] d.s.   113.517253: : ee_data:14599
      
      BCC BPF Script:
      ~~~~~~
      #!/usr/bin/env python
      
      from __future__ import print_function
      from bcc import BPF
      
      bpf_text = """
      #include <uapi/linux/ptrace.h>
      #include <net/sock.h>
      #include <bcc/proto.h>
      #include <linux/errqueue.h>
      
      #ifdef memset
      #undef memset
      #endif
      
      int trace_err_skb(struct pt_regs *ctx)
      {
      	struct sk_buff *skb = (struct sk_buff *)ctx->si;
      	struct sock *sk = (struct sock *)ctx->di;
      	struct sock_exterr_skb *serr;
      	u32 ee_data = 0;
      
      	if (!sk || !skb)
      		return 0;
      
      	serr = SKB_EXT_ERR(skb);
      	bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data);
      	bpf_trace_printk("ee_data:%u\\n", ee_data);
      
      	return 0;
      };
      """
      
      b = BPF(text=bpf_text)
      b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb")
      print("Attached to kprobe")
      b.trace_print()
      
      Packetdrill Script:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 1460) = 1460
      0.200 write(4, ..., 13140) = 13140
      
      0.200 > P. 1:1461(1460) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:14601(5840) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 14601 win 257
      
      0.400 close(4) = 0
      0.400 > F. 14601:14601(0) ack 1
      0.500 < F. 1:1(0) ack 14602 win 257
      0.500 > . 14602:14602(0) ack 2
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Tested-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      479f85c3
    • J
      openvswitch: Orphan skbs before IPv6 defrag · 49e261a8
      Joe Stringer 提交于
      This is the IPv6 counterpart to commit 8282f274 ("inet: frag: Always
      orphan skbs inside ip_defrag()").
      
      Prior to commit 029f7f3b ("netfilter: ipv6: nf_defrag: avoid/free
      clone operations"), ipv6 fragments sent to nf_ct_frag6_gather() would be
      cloned (implicitly orphaning) prior to queueing for reassembly. As such,
      when the IPv6 message is eventually reassembled, the skb->sk for all
      fragments would be NULL. After that commit was introduced, rather than
      cloning, the original skbs were queued directly without orphaning. The
      end result is that all frags except for the first and last may have a
      socket attached.
      
      This commit explicitly orphans such skbs during nf_ct_frag6_gather() to
      prevent BUG_ON(skb->sk) during a later call to ip6_fragment().
      
      kernel BUG at net/ipv6/ip6_output.c:631!
      [...]
      Call Trace:
       <IRQ>
       [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
       [<ffffffffa042c7c0>] ? do_output.isra.28+0x1b0/0x1b0 [openvswitch]
       [<ffffffff810bb8a2>] ? __lock_is_held+0x52/0x70
       [<ffffffffa042c587>] ovs_fragment+0x1f7/0x280 [openvswitch]
       [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
       [<ffffffff817be416>] ? _raw_spin_unlock_irqrestore+0x36/0x50
       [<ffffffff81697ea0>] ? dst_discard_out+0x20/0x20
       [<ffffffff81697e80>] ? dst_ifdown+0x80/0x80
       [<ffffffffa042c703>] do_output.isra.28+0xf3/0x1b0 [openvswitch]
       [<ffffffffa042d279>] do_execute_actions+0x709/0x12c0 [openvswitch]
       [<ffffffffa04340a4>] ? ovs_flow_stats_update+0x74/0x1e0 [openvswitch]
       [<ffffffffa04340d1>] ? ovs_flow_stats_update+0xa1/0x1e0 [openvswitch]
       [<ffffffff817be387>] ? _raw_spin_unlock+0x27/0x40
       [<ffffffffa042de75>] ovs_execute_actions+0x45/0x120 [openvswitch]
       [<ffffffffa0432d65>] ovs_dp_process_packet+0x85/0x150 [openvswitch]
       [<ffffffff817be387>] ? _raw_spin_unlock+0x27/0x40
       [<ffffffffa042def4>] ovs_execute_actions+0xc4/0x120 [openvswitch]
       [<ffffffffa0432d65>] ovs_dp_process_packet+0x85/0x150 [openvswitch]
       [<ffffffffa04337f2>] ? key_extract+0x442/0xc10 [openvswitch]
       [<ffffffffa043b26d>] ovs_vport_receive+0x5d/0xb0 [openvswitch]
       [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
       [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
       [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
       [<ffffffff817be416>] ? _raw_spin_unlock_irqrestore+0x36/0x50
       [<ffffffffa043c11d>] internal_dev_xmit+0x6d/0x150 [openvswitch]
       [<ffffffffa043c0b5>] ? internal_dev_xmit+0x5/0x150 [openvswitch]
       [<ffffffff8168fb5f>] dev_hard_start_xmit+0x2df/0x660
       [<ffffffff8168f5ea>] ? validate_xmit_skb.isra.105.part.106+0x1a/0x2b0
       [<ffffffff81690925>] __dev_queue_xmit+0x8f5/0x950
       [<ffffffff81690080>] ? __dev_queue_xmit+0x50/0x950
       [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
       [<ffffffff81690990>] dev_queue_xmit+0x10/0x20
       [<ffffffff8169a418>] neigh_resolve_output+0x178/0x220
       [<ffffffff81752759>] ? ip6_finish_output2+0x219/0x7b0
       [<ffffffff81752759>] ip6_finish_output2+0x219/0x7b0
       [<ffffffff817525a5>] ? ip6_finish_output2+0x65/0x7b0
       [<ffffffff816cde2b>] ? ip_idents_reserve+0x6b/0x80
       [<ffffffff8175488f>] ? ip6_fragment+0x93f/0xc50
       [<ffffffff81754af1>] ip6_fragment+0xba1/0xc50
       [<ffffffff81752540>] ? ip6_flush_pending_frames+0x40/0x40
       [<ffffffff81754c6b>] ip6_finish_output+0xcb/0x1d0
       [<ffffffff81754dcf>] ip6_output+0x5f/0x1a0
       [<ffffffff81754ba0>] ? ip6_fragment+0xc50/0xc50
       [<ffffffff81797fbd>] ip6_local_out+0x3d/0x80
       [<ffffffff817554df>] ip6_send_skb+0x2f/0xc0
       [<ffffffff817555bd>] ip6_push_pending_frames+0x4d/0x50
       [<ffffffff817796cc>] icmpv6_push_pending_frames+0xac/0xe0
       [<ffffffff8177a4be>] icmpv6_echo_reply+0x42e/0x500
       [<ffffffff8177acbf>] icmpv6_rcv+0x4cf/0x580
       [<ffffffff81755ac7>] ip6_input_finish+0x1a7/0x690
       [<ffffffff81755925>] ? ip6_input_finish+0x5/0x690
       [<ffffffff817567a0>] ip6_input+0x30/0xa0
       [<ffffffff81755920>] ? ip6_rcv_finish+0x1a0/0x1a0
       [<ffffffff817557ce>] ip6_rcv_finish+0x4e/0x1a0
       [<ffffffff8175640f>] ipv6_rcv+0x45f/0x7c0
       [<ffffffff81755fe6>] ? ipv6_rcv+0x36/0x7c0
       [<ffffffff81755780>] ? ip6_make_skb+0x1c0/0x1c0
       [<ffffffff8168b649>] __netif_receive_skb_core+0x229/0xb80
       [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
       [<ffffffff8168c07f>] ? process_backlog+0x6f/0x230
       [<ffffffff8168bfb6>] __netif_receive_skb+0x16/0x70
       [<ffffffff8168c088>] process_backlog+0x78/0x230
       [<ffffffff8168c0ed>] ? process_backlog+0xdd/0x230
       [<ffffffff8168db43>] net_rx_action+0x203/0x480
       [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
       [<ffffffff817c156e>] __do_softirq+0xde/0x49f
       [<ffffffff81752768>] ? ip6_finish_output2+0x228/0x7b0
       [<ffffffff817c070c>] do_softirq_own_stack+0x1c/0x30
       <EOI>
       [<ffffffff8106f88b>] do_softirq.part.18+0x3b/0x40
       [<ffffffff8106f946>] __local_bh_enable_ip+0xb6/0xc0
       [<ffffffff81752791>] ip6_finish_output2+0x251/0x7b0
       [<ffffffff81754af1>] ? ip6_fragment+0xba1/0xc50
       [<ffffffff816cde2b>] ? ip_idents_reserve+0x6b/0x80
       [<ffffffff8175488f>] ? ip6_fragment+0x93f/0xc50
       [<ffffffff81754af1>] ip6_fragment+0xba1/0xc50
       [<ffffffff81752540>] ? ip6_flush_pending_frames+0x40/0x40
       [<ffffffff81754c6b>] ip6_finish_output+0xcb/0x1d0
       [<ffffffff81754dcf>] ip6_output+0x5f/0x1a0
       [<ffffffff81754ba0>] ? ip6_fragment+0xc50/0xc50
       [<ffffffff81797fbd>] ip6_local_out+0x3d/0x80
       [<ffffffff817554df>] ip6_send_skb+0x2f/0xc0
       [<ffffffff817555bd>] ip6_push_pending_frames+0x4d/0x50
       [<ffffffff81778558>] rawv6_sendmsg+0xa28/0xe30
       [<ffffffff81719097>] ? inet_sendmsg+0xc7/0x1d0
       [<ffffffff817190d6>] inet_sendmsg+0x106/0x1d0
       [<ffffffff81718fd5>] ? inet_sendmsg+0x5/0x1d0
       [<ffffffff8166d078>] sock_sendmsg+0x38/0x50
       [<ffffffff8166d4d6>] SYSC_sendto+0xf6/0x170
       [<ffffffff8100201b>] ? trace_hardirqs_on_thunk+0x1b/0x1d
       [<ffffffff8166e38e>] SyS_sendto+0xe/0x10
       [<ffffffff817bebe5>] entry_SYSCALL_64_fastpath+0x18/0xa8
      Code: 06 48 83 3f 00 75 26 48 8b 87 d8 00 00 00 2b 87 d0 00 00 00 48 39 d0 72 14 8b 87 e4 00 00 00 83 f8 01 75 09 48 83 7f 18 00 74 9a <0f> 0b 41 8b 86 cc 00 00 00 49 8#
      RIP  [<ffffffff8175468a>] ip6_fragment+0x73a/0xc50
       RSP <ffff880072803120>
      
      Fixes: 029f7f3b ("netfilter: ipv6: nf_defrag: avoid/free clone
      operations")
      Reported-by: NDaniele Di Proietto <diproiettod@vmware.com>
      Signed-off-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      49e261a8
  9. 20 4月, 2016 1 次提交
  10. 17 4月, 2016 2 次提交
  11. 16 4月, 2016 1 次提交
    • D
      vlan: pull on __vlan_insert_tag error path and fix csum correction · 9241e2df
      Daniel Borkmann 提交于
      When __vlan_insert_tag() fails from skb_vlan_push() path due to the
      skb_cow_head(), we need to undo the __skb_push() in the error path
      as well that was done earlier to move skb->data pointer to mac header.
      
      Moreover, I noticed that when in the non-error path the __skb_pull()
      is done and the original offset to mac header was non-zero, we fixup
      from a wrong skb->data offset in the checksum complete processing.
      
      So the skb_postpush_rcsum() really needs to be done before __skb_pull()
      where skb->data still points to the mac header start and thus operates
      under the same conditions as in __vlan_insert_tag().
      
      Fixes: 93515d53 ("net: move vlan pop/push functions into common code")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9241e2df
  12. 15 4月, 2016 5 次提交
    • C
      soreuseport: fix ordering for mixed v4/v6 sockets · d894ba18
      Craig Gallek 提交于
      With the SO_REUSEPORT socket option, it is possible to create sockets
      in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
      This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
      the AF_INET6 sockets.
      
      Prior to the commits referenced below, an incoming IPv4 packet would
      always be routed to a socket of type AF_INET when this mixed-mode was used.
      After those changes, the same packet would be routed to the most recently
      bound socket (if this happened to be an AF_INET6 socket, it would
      have an IPv4 mapped IPv6 address).
      
      The change in behavior occurred because the recent SO_REUSEPORT optimizations
      short-circuit the socket scoring logic as soon as they find a match.  They
      did not take into account the scoring logic that favors AF_INET sockets
      over AF_INET6 sockets in the event of a tie.
      
      To fix this problem, this patch changes the insertion order of AF_INET
      and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
      have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
      list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
      the tail of the list.  This will force AF_INET sockets to always be
      considered first.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d894ba18
    • M
      ipv6: udp: Do a route lookup and update during release_cb · e646b657
      Martin KaFai Lau 提交于
      This patch adds a release_cb for UDPv6.  It does a route lookup
      and updates sk->sk_dst_cache if it is needed.  It picks up the
      left-over job from ip6_sk_update_pmtu() if the sk was owned
      by user during the pmtu update.
      
      It takes a rcu_read_lock to protect the __sk_dst_get() operations
      because another thread may do ip6_dst_store() without taking the
      sk lock (e.g. sendmsg).
      
      Fixes: 45e4fd26 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Reported-by: NWei Wang <weiwan@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e646b657
    • M
      ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update · 33c162a9
      Martin KaFai Lau 提交于
      There is a case in connected UDP socket such that
      getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
      sequence could be the following:
      1. Create a connected UDP socket
      2. Send some datagrams out
      3. Receive a ICMPV6_PKT_TOOBIG
      4. No new outgoing datagrams to trigger the sk_dst_check()
         logic to update the sk->sk_dst_cache.
      5. getsockopt(IPV6_MTU) returns the mtu from the invalid
         sk->sk_dst_cache instead of the newly created RTF_CACHE clone.
      
      This patch updates the sk->sk_dst_cache for a connected datagram sk
      during pmtu-update code path.
      
      Note that the sk->sk_v6_daddr is used to do the route lookup
      instead of skb->data (i.e. iph).  It is because a UDP socket can become
      connected after sending out some datagrams in un-connected state.  or
      It can be connected multiple times to different destinations.  Hence,
      iph may not be related to where sk is currently connected to.
      
      It is done under '!sock_owned_by_user(sk)' condition because
      the user may make another ip6_datagram_connect()  (i.e changing
      the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update
      code path.
      
      For the sock_owned_by_user(sk) == true case, the next patch will
      introduce a release_cb() which will update the sk->sk_dst_cache.
      
      Test:
      
      Server (Connected UDP Socket):
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Route Details:
      [root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
      2fac::/64 dev eth0  proto kernel  metric 256  pref medium
      2fac:face::/64 via 2fac::face dev eth0  metric 1024  pref medium
      
      A simple python code to create a connected UDP socket:
      
      import socket
      import errno
      
      HOST = '2fac::1'
      PORT = 8080
      
      s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
      s.bind((HOST, PORT))
      s.connect(('2fac:face::face', 53))
      print("connected")
      while True:
          try:
      	data = s.recv(1024)
          except socket.error as se:
      	if se.errno == errno.EMSGSIZE:
      		pmtu = s.getsockopt(41, 24)
      		print("PMTU:%d" % pmtu)
      		break
      s.close()
      
      Python program output after getting a ICMPV6_PKT_TOOBIG:
      [root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
      connected
      PMTU:1300
      
      Cache routes after recieving TOOBIG:
      [root@arch-fb-vm1 ~]# ip -6 r show table cache
      2fac:face::face via 2fac::face dev eth0  metric 0
          cache  expires 463sec mtu 1300 pref medium
      
      Client (Send the ICMPV6_PKT_TOOBIG):
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      scapy is used to generate the TOOBIG message.  Here is the scapy script I have
      used:
      
      >>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', type=0x86dd)/IPv6(src='2fac::face', dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac::
      1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53)
      >>> sendp(p, iface='qemubr0')
      
      Fixes: 45e4fd26 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Reported-by: NWei Wang <weiwan@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      33c162a9
    • M
      ipv6: datagram: Refactor dst lookup and update codes to a new function · 7e2040db
      Martin KaFai Lau 提交于
      This patch moves the route lookup and update codes for connected
      datagram sk to a newly created function ip6_datagram_dst_update()
      
      It will be reused during the pmtu update in the later patch.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7e2040db
    • M
      ipv6: datagram: Refactor flowi6 init codes to a new function · 80fbdb20
      Martin KaFai Lau 提交于
      Move flowi6 init codes for connected datagram sk to a newly created
      function ip6_datagram_flow_key_init().
      
      Notes:
      1. fl6_flowlabel is used instead of fl6.flowlabel in __ip6_datagram_connect
      2. ipv6_addr_is_multicast(&fl6->daddr) is used instead of
         (addr_type & IPV6_ADDR_MULTICAST) in ip6_datagram_flow_key_init()
      
      This new function will be reused during pmtu update in the later patch.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80fbdb20
  13. 14 4月, 2016 4 次提交
  14. 13 4月, 2016 1 次提交