1. 20 6月, 2014 1 次提交
    • N
      tcp: fix tcp_match_skb_to_sack() for unaligned SACK at end of an skb · 2cd0d743
      Neal Cardwell 提交于
      If there is an MSS change (or misbehaving receiver) that causes a SACK
      to arrive that covers the end of an skb but is less than one MSS, then
      tcp_match_skb_to_sack() was rounding up pkt_len to the full length of
      the skb ("Round if necessary..."), then chopping all bytes off the skb
      and creating a zero-byte skb in the write queue.
      
      This was visible now because the recently simplified TLP logic in
      bef1909e ("tcp: fixing TLP's FIN recovery") could find that 0-byte
      skb at the end of the write queue, and now that we do not check that
      skb's length we could send it as a TLP probe.
      
      Consider the following example scenario:
      
       mss: 1000
       skb: seq: 0 end_seq: 4000  len: 4000
       SACK: start_seq: 3999 end_seq: 4000
      
      The tcp_match_skb_to_sack() code will compute:
      
       in_sack = false
       pkt_len = start_seq - TCP_SKB_CB(skb)->seq = 3999 - 0 = 3999
       new_len = (pkt_len / mss) * mss = (3999/1000)*1000 = 3000
       new_len += mss = 4000
      
      Previously we would find the new_len > skb->len check failing, so we
      would fall through and set pkt_len = new_len = 4000 and chop off
      pkt_len of 4000 from the 4000-byte skb, leaving a 0-byte segment
      afterward in the write queue.
      
      With this new commit, we notice that the new new_len >= skb->len check
      succeeds, so that we return without trying to fragment.
      
      Fixes: adb92db8 ("tcp: Make SACK code to split only at mss boundaries")
      Reported-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2cd0d743
  2. 17 6月, 2014 1 次提交
  3. 14 6月, 2014 1 次提交
  4. 13 6月, 2014 1 次提交
  5. 12 6月, 2014 3 次提交
    • T
      net: Add skb_gro_postpull_rcsum to udp and vxlan · 6bae1d4c
      Tom Herbert 提交于
      Need to gro_postpull_rcsum for GRO to work with checksum complete.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6bae1d4c
    • T
      net: Save software checksum complete · 7e3cead5
      Tom Herbert 提交于
      In skb_checksum complete, if we need to compute the checksum for the
      packet (via skb_checksum) save the result as CHECKSUM_COMPLETE.
      Subsequent checksum verification can use this.
      
      Also, added csum_complete_sw flag to distinguish between software and
      hardware generated checksum complete, we should always be able to trust
      the software computation.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7e3cead5
    • E
      ipv4: fix a race in ip4_datagram_release_cb() · 9709674e
      Eric Dumazet 提交于
      Alexey gave a AddressSanitizer[1] report that finally gave a good hint
      at where was the origin of various problems already reported by Dormando
      in the past [2]
      
      Problem comes from the fact that UDP can have a lockless TX path, and
      concurrent threads can manipulate sk_dst_cache, while another thread,
      is holding socket lock and calls __sk_dst_set() in
      ip4_datagram_release_cb() (this was added in linux-3.8)
      
      It seems that all we need to do is to use sk_dst_check() and
      sk_dst_set() so that all the writers hold same spinlock
      (sk->sk_dst_lock) to prevent corruptions.
      
      TCP stack do not need this protection, as all sk_dst_cache writers hold
      the socket lock.
      
      [1]
      https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
      
      AddressSanitizer: heap-use-after-free in ipv4_dst_check
      Read of size 2 by thread T15453:
       [<ffffffff817daa3a>] ipv4_dst_check+0x1a/0x90 ./net/ipv4/route.c:1116
       [<ffffffff8175b789>] __sk_dst_check+0x89/0xe0 ./net/core/sock.c:531
       [<ffffffff81830a36>] ip4_datagram_release_cb+0x46/0x390 ??:0
       [<ffffffff8175eaea>] release_sock+0x17a/0x230 ./net/core/sock.c:2413
       [<ffffffff81830882>] ip4_datagram_connect+0x462/0x5d0 ??:0
       [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
       [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
       [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
       [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
      ./arch/x86/kernel/entry_64.S:629
      
      Freed by thread T15455:
       [<ffffffff8178d9b8>] dst_destroy+0xa8/0x160 ./net/core/dst.c:251
       [<ffffffff8178de25>] dst_release+0x45/0x80 ./net/core/dst.c:280
       [<ffffffff818304c1>] ip4_datagram_connect+0xa1/0x5d0 ??:0
       [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
       [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
       [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
       [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
      ./arch/x86/kernel/entry_64.S:629
      
      Allocated by thread T15453:
       [<ffffffff8178d291>] dst_alloc+0x81/0x2b0 ./net/core/dst.c:171
       [<ffffffff817db3b7>] rt_dst_alloc+0x47/0x50 ./net/ipv4/route.c:1406
       [<     inlined    >] __ip_route_output_key+0x3e8/0xf70
      __mkroute_output ./net/ipv4/route.c:1939
       [<ffffffff817dde08>] __ip_route_output_key+0x3e8/0xf70 ./net/ipv4/route.c:2161
       [<ffffffff817deb34>] ip_route_output_flow+0x14/0x30 ./net/ipv4/route.c:2249
       [<ffffffff81830737>] ip4_datagram_connect+0x317/0x5d0 ??:0
       [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
       [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
       [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
       [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
      ./arch/x86/kernel/entry_64.S:629
      
      [2]
      <4>[196727.311203] general protection fault: 0000 [#1] SMP
      <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
      <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
      <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
      <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
      <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
      <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
      <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
      <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
      <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
      <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
      <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
      <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
      <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>[196727.311713] Stack:
      <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
      <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
      <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
      <4>[196727.311885] Call Trace:
      <4>[196727.311907]  <IRQ>
      <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
      <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
      <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
      <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
      <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
      <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
      <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
      <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
      <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
      <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
      <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
      <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
      <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
      <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
      <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
      <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
      <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
      <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
      <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
      <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
      <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
      <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
      <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
      <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
      <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
      <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
      <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
      <4>[196727.312722]  <EOI>
      <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
      <4>[196727.313100]  RSP <ffff885effd23a70>
      <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
      <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
      Reported-by: NAlexey Preobrazhensky <preobr@google.com>
      Reported-by: Ndormando <dormando@rydia.ne>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Fixes: 8141ed9f ("ipv4: Add a socket release callback for datagram sockets")
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9709674e
  6. 11 6月, 2014 5 次提交
  7. 06 6月, 2014 1 次提交
  8. 05 6月, 2014 8 次提交
  9. 03 6月, 2014 3 次提交
    • Y
      tcp: fix cwnd undo on DSACK in F-RTO · 0cfa5c07
      Yuchung Cheng 提交于
      This bug is discovered by an recent F-RTO issue on tcpm list
      https://www.ietf.org/mail-archive/web/tcpm/current/msg08794.html
      
      The bug is that currently F-RTO does not use DSACK to undo cwnd in
      certain cases: upon receiving an ACK after the RTO retransmission in
      F-RTO, and the ACK has DSACK indicating the retransmission is spurious,
      the sender only calls tcp_try_undo_loss() if some never retransmisted
      data is sacked (FLAG_ORIG_DATA_SACKED).
      
      The correct behavior is to unconditionally call tcp_try_undo_loss so
      the DSACK information is used properly to undo the cwnd reduction.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0cfa5c07
    • D
      fib_trie: use seq_file_net rather than seq->private · 30f38d2f
      David Ahern 提交于
      Make fib_triestat_seq_show consistent with other /proc/net files and
      use seq_file_net.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30f38d2f
    • E
      inetpeer: get rid of ip_id_count · 73f156a6
      Eric Dumazet 提交于
      Ideally, we would need to generate IP ID using a per destination IP
      generator.
      
      linux kernels used inet_peer cache for this purpose, but this had a huge
      cost on servers disabling MTU discovery.
      
      1) each inet_peer struct consumes 192 bytes
      
      2) inetpeer cache uses a binary tree of inet_peer structs,
         with a nominal size of ~66000 elements under load.
      
      3) lookups in this tree are hitting a lot of cache lines, as tree depth
         is about 20.
      
      4) If server deals with many tcp flows, we have a high probability of
         not finding the inet_peer, allocating a fresh one, inserting it in
         the tree with same initial ip_id_count, (cf secure_ip_id())
      
      5) We garbage collect inet_peer aggressively.
      
      IP ID generation do not have to be 'perfect'
      
      Goal is trying to avoid duplicates in a short period of time,
      so that reassembly units have a chance to complete reassembly of
      fragments belonging to one message before receiving other fragments
      with a recycled ID.
      
      We simply use an array of generators, and a Jenkin hash using the dst IP
      as a key.
      
      ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
      belongs (it is only used from this file)
      
      secure_ip_id() and secure_ipv6_id() no longer are needed.
      
      Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
      unnecessary decrement/increment of the number of segments.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73f156a6
  10. 31 5月, 2014 1 次提交
  11. 24 5月, 2014 3 次提交
  12. 23 5月, 2014 2 次提交
    • L
      ipv4: initialise the itag variable in __mkroute_input · fbdc0ad0
      Li RongQing 提交于
      the value of itag is a random value from stack, and may not be initiated by
      fib_validate_source, which called fib_combine_itag if CONFIG_IP_ROUTE_CLASSID
      is not set
      
      This will make the cached dst uncertainty
      Signed-off-by: NLi RongQing <roy.qing.li@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fbdc0ad0
    • N
      tcp: make cwnd-limited checks measurement-based, and gentler · ca8a2263
      Neal Cardwell 提交于
      Experience with the recent e114a710 ("tcp: fix cwnd limited
      checking to improve congestion control") has shown that there are
      common cases where that commit can cause cwnd to be much larger than
      necessary. This leads to TSO autosizing cooking skbs that are too
      large, among other things.
      
      The main problems seemed to be:
      
      (1) That commit attempted to predict the future behavior of the
      connection by looking at the write queue (if TSO or TSQ limit
      sending). That prediction sometimes overestimated future outstanding
      packets.
      
      (2) That commit always allowed cwnd to grow to twice the number of
      outstanding packets (even in congestion avoidance, where this is not
      needed).
      
      This commit improves both of these, by:
      
      (1) Switching to a measurement-based approach where we explicitly
      track the largest number of packets in flight during the past window
      ("max_packets_out"), and remember whether we were cwnd-limited at the
      moment we finished sending that flight.
      
      (2) Only allowing cwnd to grow to twice the number of outstanding
      packets ("max_packets_out") in slow start. In congestion avoidance
      mode we now only allow cwnd to grow if it was fully utilized.
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ca8a2263
  13. 22 5月, 2014 1 次提交
  14. 21 5月, 2014 1 次提交
  15. 19 5月, 2014 1 次提交
  16. 17 5月, 2014 2 次提交
    • T
      ipv4: ip_tunnels: disable cache for nbma gre tunnels · 22fb22ea
      Timo Teräs 提交于
      The connected check fails to check for ip_gre nbma mode tunnels
      properly. ip_gre creates temporary tnl_params with daddr specified
      to pass-in the actual target on per-packet basis from neighbor
      layer. Detect these tunnels by inspecting the actual tunnel
      configuration.
      
      Minimal test case:
       ip route add 192.168.1.1/32 via 10.0.0.1
       ip route add 192.168.1.2/32 via 10.0.0.2
       ip tunnel add nbma0 mode gre key 1 tos c0
       ip addr add 172.17.0.0/16 dev nbma0
       ip link set nbma0 up
       ip neigh add 172.17.0.1 lladdr 192.168.1.1 dev nbma0
       ip neigh add 172.17.0.2 lladdr 192.168.1.2 dev nbma0
       ping 172.17.0.1
       ping 172.17.0.2
      
      The second ping should be going to 192.168.1.2 and head 10.0.0.2;
      but cached gre tunnel level route is used and it's actually going
      to 192.168.1.1 via 10.0.0.1.
      
      The lladdr's need to go to separate dst for the bug to trigger.
      Test case uses separate route entries, but this can also happen
      when the route entry is same: if there is a nexthop exception or
      the GRE tunnel is IPsec'ed in which case the dst points to xfrm
      bundle unique to the gre lladdr.
      
      Fixes: 7d442fab ("ipv4: Cache dst in tunnels")
      Signed-off-by: NTimo Teräs <timo.teras@iki.fi>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      22fb22ea
    • D
      ip_tunnel: don't add tunnel twice · ee30ef4d
      Duan Jiong 提交于
      When using command "ip tunnel add" to add a tunnel, the tunnel will be added twice,
      through ip_tunnel_create() and ip_tunnel_update().
      
      Because the second is unnecessary, so we can just break after adding tunnel
      through ip_tunnel_create().
      Signed-off-by: NDuan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee30ef4d
  17. 16 5月, 2014 1 次提交
  18. 15 5月, 2014 2 次提交
  19. 14 5月, 2014 2 次提交
    • L
      net: support marking accepting TCP sockets · 84f39b08
      Lorenzo Colitti 提交于
      When using mark-based routing, sockets returned from accept()
      may need to be marked differently depending on the incoming
      connection request.
      
      This is the case, for example, if different socket marks identify
      different networks: a listening socket may want to accept
      connections from all networks, but each connection should be
      marked with the network that the request came in on, so that
      subsequent packets are sent on the correct network.
      
      This patch adds a sysctl to mark TCP sockets based on the fwmark
      of the incoming SYN packet. If enabled, and an unmarked socket
      receives a SYN, then the SYN packet's fwmark is written to the
      connection's inet_request_sock, and later written back to the
      accepted socket when the connection is established.  If the
      socket already has a nonzero mark, then the behaviour is the same
      as it is today, i.e., the listening socket's fwmark is used.
      
      Black-box tested using user-mode linux:
      
      - IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
        mark of the incoming SYN packet.
      - The socket returned by accept() is marked with the mark of the
        incoming SYN packet.
      - Tested with syncookies=1 and syncookies=2.
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84f39b08
    • L
      net: Use fwmark reflection in PMTU discovery. · 1b3c61dc
      Lorenzo Colitti 提交于
      Currently, routing lookups used for Path PMTU Discovery in
      absence of a socket or on unmarked sockets use a mark of 0.
      This causes PMTUD not to work when using routing based on
      netfilter fwmark mangling and fwmark ip rules, such as:
      
        iptables -j MARK --set-mark 17
        ip rule add fwmark 17 lookup 100
      
      This patch causes these route lookups to use the fwmark from the
      received ICMP error when the fwmark_reflect sysctl is enabled.
      This allows the administrator to make PMTUD work by configuring
      appropriate fwmark rules to mark the inbound ICMP packets.
      
      Black-box tested using user-mode linux by pointing different
      fwmarks at routing tables egressing on different interfaces, and
      using iptables mangling to mark packets inbound on each interface
      with the interface's fwmark. ICMPv4 and ICMPv6 PMTU discovery
      work as expected when mark reflection is enabled and fail when
      it is disabled.
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b3c61dc
新手
引导
客服 返回
顶部