1. 21 12月, 2017 1 次提交
  2. 13 12月, 2017 1 次提交
  3. 04 12月, 2017 1 次提交
    • E
      tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb() · eeea10b8
      Eric Dumazet 提交于
      James Morris reported kernel stack corruption bug [1] while
      running the SELinux testsuite, and bisected to a recent
      commit bffa72cf ("net: sk_buff rbnode reorg")
      
      We believe this commit is fine, but exposes an older bug.
      
      SELinux code runs from tcp_filter() and might send an ICMP,
      expecting IP options to be found in skb->cb[] using regular IPCB placement.
      
      We need to defer TCP mangling of skb->cb[] after tcp_filter() calls.
      
      This patch adds tcp_v4_fill_cb()/tcp_v4_restore_cb() in a very
      similar way we added them for IPv6.
      
      [1]
      [  339.806024] SELinux: failure in selinux_parse_skb(), unable to parse packet
      [  339.822505] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff81745af5
      [  339.822505]
      [  339.852250] CPU: 4 PID: 3642 Comm: client Not tainted 4.15.0-rc1-test #15
      [  339.868498] Hardware name: LENOVO 10FGS0VA1L/30BC, BIOS FWKT68A   01/19/2017
      [  339.885060] Call Trace:
      [  339.896875]  <IRQ>
      [  339.908103]  dump_stack+0x63/0x87
      [  339.920645]  panic+0xe8/0x248
      [  339.932668]  ? ip_push_pending_frames+0x33/0x40
      [  339.946328]  ? icmp_send+0x525/0x530
      [  339.958861]  ? kfree_skbmem+0x60/0x70
      [  339.971431]  __stack_chk_fail+0x1b/0x20
      [  339.984049]  icmp_send+0x525/0x530
      [  339.996205]  ? netlbl_skbuff_err+0x36/0x40
      [  340.008997]  ? selinux_netlbl_err+0x11/0x20
      [  340.021816]  ? selinux_socket_sock_rcv_skb+0x211/0x230
      [  340.035529]  ? security_sock_rcv_skb+0x3b/0x50
      [  340.048471]  ? sk_filter_trim_cap+0x44/0x1c0
      [  340.061246]  ? tcp_v4_inbound_md5_hash+0x69/0x1b0
      [  340.074562]  ? tcp_filter+0x2c/0x40
      [  340.086400]  ? tcp_v4_rcv+0x820/0xa20
      [  340.098329]  ? ip_local_deliver_finish+0x71/0x1a0
      [  340.111279]  ? ip_local_deliver+0x6f/0xe0
      [  340.123535]  ? ip_rcv_finish+0x3a0/0x3a0
      [  340.135523]  ? ip_rcv_finish+0xdb/0x3a0
      [  340.147442]  ? ip_rcv+0x27c/0x3c0
      [  340.158668]  ? inet_del_offload+0x40/0x40
      [  340.170580]  ? __netif_receive_skb_core+0x4ac/0x900
      [  340.183285]  ? rcu_accelerate_cbs+0x5b/0x80
      [  340.195282]  ? __netif_receive_skb+0x18/0x60
      [  340.207288]  ? process_backlog+0x95/0x140
      [  340.218948]  ? net_rx_action+0x26c/0x3b0
      [  340.230416]  ? __do_softirq+0xc9/0x26a
      [  340.241625]  ? do_softirq_own_stack+0x2a/0x40
      [  340.253368]  </IRQ>
      [  340.262673]  ? do_softirq+0x50/0x60
      [  340.273450]  ? __local_bh_enable_ip+0x57/0x60
      [  340.285045]  ? ip_finish_output2+0x175/0x350
      [  340.296403]  ? ip_finish_output+0x127/0x1d0
      [  340.307665]  ? nf_hook_slow+0x3c/0xb0
      [  340.318230]  ? ip_output+0x72/0xe0
      [  340.328524]  ? ip_fragment.constprop.54+0x80/0x80
      [  340.340070]  ? ip_local_out+0x35/0x40
      [  340.350497]  ? ip_queue_xmit+0x15c/0x3f0
      [  340.361060]  ? __kmalloc_reserve.isra.40+0x31/0x90
      [  340.372484]  ? __skb_clone+0x2e/0x130
      [  340.382633]  ? tcp_transmit_skb+0x558/0xa10
      [  340.393262]  ? tcp_connect+0x938/0xad0
      [  340.403370]  ? ktime_get_with_offset+0x4c/0xb0
      [  340.414206]  ? tcp_v4_connect+0x457/0x4e0
      [  340.424471]  ? __inet_stream_connect+0xb3/0x300
      [  340.435195]  ? inet_stream_connect+0x3b/0x60
      [  340.445607]  ? SYSC_connect+0xd9/0x110
      [  340.455455]  ? __audit_syscall_entry+0xaf/0x100
      [  340.466112]  ? syscall_trace_enter+0x1d0/0x2b0
      [  340.476636]  ? __audit_syscall_exit+0x209/0x290
      [  340.487151]  ? SyS_connect+0xe/0x10
      [  340.496453]  ? do_syscall_64+0x67/0x1b0
      [  340.506078]  ? entry_SYSCALL64_slow_path+0x25/0x25
      
      Fixes: 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJames Morris <james.l.morris@oracle.com>
      Tested-by: NJames Morris <james.l.morris@oracle.com>
      Tested-by: NCasey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eeea10b8
  4. 30 11月, 2017 1 次提交
  5. 10 11月, 2017 1 次提交
  6. 24 10月, 2017 1 次提交
  7. 18 10月, 2017 1 次提交
  8. 09 9月, 2017 1 次提交
  9. 29 8月, 2017 1 次提交
  10. 24 8月, 2017 1 次提交
    • M
      tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE to TCP recvmsg · 98aaa913
      Mike Maloney 提交于
      When SOF_TIMESTAMPING_RX_SOFTWARE is enabled for tcp sockets, return the
      timestamp corresponding to the highest sequence number data returned.
      
      Previously the skb->tstamp is overwritten when a TCP packet is placed
      in the out of order queue.  While the packet is in the ooo queue, save the
      timestamp in the TCB_SKB_CB.  This space is shared with the gso_*
      options which are only used on the tx path, and a previously unused 4
      byte hole.
      
      When skbs are coalesced either in the sk_receive_queue or the
      out_of_order_queue always choose the timestamp of the appended skb to
      maintain the invariant of returning the timestamp of the last byte in
      the recvmsg buffer.
      Signed-off-by: NMike Maloney <maloney@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      98aaa913
  11. 15 8月, 2017 1 次提交
  12. 08 8月, 2017 1 次提交
  13. 02 8月, 2017 1 次提交
  14. 01 8月, 2017 1 次提交
    • F
      tcp: remove prequeue support · e7942d06
      Florian Westphal 提交于
      prequeue is a tcp receive optimization that moves part of rx processing
      from bh to process context.
      
      This only works if the socket being processed belongs to a process that
      is blocked in recv on that socket.
      
      In practice, this doesn't happen anymore that often because nowadays
      servers tend to use an event driven (epoll) model.
      
      Even normal client applications (web browsers) commonly use many tcp
      connections in parallel.
      
      This has measureable impact only in netperf (which uses plain recv and
      thus allows prequeue use) from host to locally running vm (~4%), however,
      there were no changes when using netperf between two physical hosts with
      ixgbe interfaces.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7942d06
  15. 29 7月, 2017 1 次提交
  16. 25 7月, 2017 1 次提交
  17. 01 7月, 2017 2 次提交
  18. 22 6月, 2017 1 次提交
  19. 20 6月, 2017 2 次提交
  20. 16 6月, 2017 1 次提交
    • J
      networking: make skb_push & __skb_push return void pointers · d58ff351
      Johannes Berg 提交于
      It seems like a historic accident that these return unsigned char *,
      and in many places that means casts are required, more often than not.
      
      Make these functions return void * and remove all the casts across
      the tree, adding a (u8 *) cast only where the unsigned char pointer
      was used directly, all done with the following spatch:
      
          @@
          expression SKB, LEN;
          typedef u8;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          @@
          - *(fn(SKB, LEN))
          + *(u8 *)fn(SKB, LEN)
      
          @@
          expression E, SKB, LEN;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          type T;
          @@
          - E = ((T *)(fn(SKB, LEN)))
          + E = fn(SKB, LEN)
      
          @@
          expression SKB, LEN;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          @@
          - fn(SKB, LEN)[0]
          + *(u8 *)fn(SKB, LEN)
      
      Note that the last part there converts from push(...)[0] to the
      more idiomatic *(u8 *)push(...).
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d58ff351
  21. 11 6月, 2017 1 次提交
  22. 08 6月, 2017 2 次提交
    • E
      tcp: add TCPMemoryPressuresChrono counter · 06044751
      Eric Dumazet 提交于
      DRAM supply shortage and poor memory pressure tracking in TCP
      stack makes any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
      limits) and tcp_mem[] quite hazardous.
      
      TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl
      limits being hit, but only tracking number of transitions.
      
      If TCP stack behavior under stress was perfect :
      1) It would maintain memory usage close to the limit.
      2) Memory pressure state would be entered for short times.
      
      We certainly prefer 100 events lasting 10ms compared to one event
      lasting 200 seconds.
      
      This patch adds a new SNMP counter tracking cumulative duration of
      memory pressure events, given in ms units.
      
      $ cat /proc/sys/net/ipv4/tcp_mem
      3088    4117    6176
      $ grep TCP /proc/net/sockstat
      TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
      $ nstat -n ; sleep 10 ; nstat |grep Pressure
      TcpExtTCPMemoryPressures        1700
      TcpExtTCPMemoryPressuresChrono  5209
      
      v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
      instructed.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      06044751
    • E
      tcp: Namespaceify sysctl_tcp_timestamps · 5d2ed052
      Eric Dumazet 提交于
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5d2ed052
  23. 18 5月, 2017 1 次提交
  24. 12 5月, 2017 1 次提交
  25. 06 5月, 2017 1 次提交
    • E
      tcp: randomize timestamps on syncookies · 84b114b9
      Eric Dumazet 提交于
      Whole point of randomization was to hide server uptime, but an attacker
      can simply start a syn flood and TCP generates 'old style' timestamps,
      directly revealing server jiffies value.
      
      Also, TSval sent by the server to a particular remote address vary
      depending on syncookies being sent or not, potentially triggering PAWS
      drops for innocent clients.
      
      Lets implement proper randomization, including for SYNcookies.
      
      Also we do not need to export sysctl_tcp_timestamps, since it is not
      used from a module.
      
      In v2, I added Florian feedback and contribution, adding tsoff to
      tcp_get_cookie_sock().
      
      v3 removed one unused variable in tcp_v4_connect() as Florian spotted.
      
      Fixes: 95a22cae ("tcp: randomize tcp timestamp offsets for each connection")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NFlorian Westphal <fw@strlen.de>
      Tested-by: NFlorian Westphal <fw@strlen.de>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84b114b9
  26. 19 4月, 2017 1 次提交
    • P
      mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU · 5f0d5a3a
      Paul E. McKenney 提交于
      A group of Linux kernel hackers reported chasing a bug that resulted
      from their assumption that SLAB_DESTROY_BY_RCU provided an existence
      guarantee, that is, that no block from such a slab would be reallocated
      during an RCU read-side critical section.  Of course, that is not the
      case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
      slab of blocks.
      
      However, there is a phrase for this, namely "type safety".  This commit
      therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
      to avoid future instances of this sort of confusion.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      [ paulmck: Add comments mentioning the old name, as requested by Eric
        Dumazet, in order to help people familiar with the old name find
        the new one. ]
      Acked-by: NDavid Rientjes <rientjes@google.com>
      5f0d5a3a
  27. 25 3月, 2017 2 次提交
    • A
      tcp: Record Rx hash and NAPI ID in tcp_child_process · e5907459
      Alexander Duyck 提交于
      While working on some recent busy poll changes we found that child sockets
      were being instantiated without NAPI ID being set.  In our first attempt to
      fix it, it was suggested that we should just pull programming the NAPI ID
      into the function itself since all callers will need to have it set.
      
      In addition to the NAPI ID change I have dropped the code that was
      populating the Rx hash since it was actually being populated in
      tcp_get_cookie_sock.
      Reported-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e5907459
    • S
      net: Add sysctl to toggle early demux for tcp and udp · dddb64bc
      subashab@codeaurora.org 提交于
      Certain system process significant unconnected UDP workload.
      It would be preferrable to disable UDP early demux for those systems
      and enable it for TCP only.
      
      By disabling UDP demux, we see these slight gains on an ARM64 system-
      782 -> 788Mbps unconnected single stream UDPv4
      633 -> 654Mbps unconnected UDPv4 different sources
      
      The performance impact can change based on CPU architecure and cache
      sizes. There will not much difference seen if entire UDP hash table
      is in cache.
      
      Both sysctls are enabled by default to preserve existing behavior.
      
      v1->v2: Change function pointer instead of adding conditional as
      suggested by Stephen.
      
      v2->v3: Read once in callers to avoid issues due to compiler
      optimizations. Also update commit message with the tests.
      
      v3->v4: Store and use read once result instead of querying pointer
      again incorrectly.
      
      v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}
      Signed-off-by: NSubash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dddb64bc
  28. 17 3月, 2017 2 次提交
  29. 14 3月, 2017 1 次提交
    • J
      dccp/tcp: fix routing redirect race · 45caeaa5
      Jon Maxwell 提交于
      As Eric Dumazet pointed out this also needs to be fixed in IPv6.
      v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.
      
      We have seen a few incidents lately where a dst_enty has been freed
      with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
      dst_entry. If the conditions/timings are right a crash then ensues when the
      freed dst_entry is referenced later on. A Common crashing back trace is:
      
       #8 [] page_fault at ffffffff8163e648
          [exception RIP: __tcp_ack_snd_check+74]
      .
      .
       #9 [] tcp_rcv_established at ffffffff81580b64
      #10 [] tcp_v4_do_rcv at ffffffff8158b54a
      #11 [] tcp_v4_rcv at ffffffff8158cd02
      #12 [] ip_local_deliver_finish at ffffffff815668f4
      #13 [] ip_local_deliver at ffffffff81566bd9
      #14 [] ip_rcv_finish at ffffffff8156656d
      #15 [] ip_rcv at ffffffff81566f06
      #16 [] __netif_receive_skb_core at ffffffff8152b3a2
      #17 [] __netif_receive_skb at ffffffff8152b608
      #18 [] netif_receive_skb at ffffffff8152b690
      #19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
      #20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
      #21 [] net_rx_action at ffffffff8152bac2
      #22 [] __do_softirq at ffffffff81084b4f
      #23 [] call_softirq at ffffffff8164845c
      #24 [] do_softirq at ffffffff81016fc5
      #25 [] irq_exit at ffffffff81084ee5
      #26 [] do_IRQ at ffffffff81648ff8
      
      Of course it may happen with other NIC drivers as well.
      
      It's found the freed dst_entry here:
      
       224 static bool tcp_in_quickack_mode(struct sock *sk)
       225 {
       226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);
       227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);
       228 
       229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
       230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
       231 }
      
      But there are other backtraces attributed to the same freed dst_entry in
      netfilter code as well.
      
      All the vmcores showed 2 significant clues:
      
      - Remote hosts behind the default gateway had always been redirected to a
      different gateway. A rtable/dst_entry will be added for that host. Making
      more dst_entrys with lower reference counts. Making this more probable.
      
      - All vmcores showed a postitive LockDroppedIcmps value, e.g:
      
      LockDroppedIcmps                  267
      
      A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
      regardless of whether user space has the socket locked. This can result in a
      race condition where the same dst_entry cached in sk->sk_dst_entry can be
      decremented twice for the same socket via:
      
      do_redirect()->__sk_dst_check()-> dst_release().
      
      Which leads to the dst_entry being prematurely freed with another socket
      pointing to it via sk->sk_dst_cache and a subsequent crash.
      
      To fix this skip do_redirect() if usespace has the socket locked. Instead let
      the redirect take place later when user space does not have the socket
      locked.
      
      The dccp/IPv6 code is very similar in this respect, so fixing it there too.
      
      As Eric Garver pointed out the following commit now invalidates routes. Which
      can set the dst->obsolete flag so that ipv4_dst_check() returns null and
      triggers the dst_release().
      
      Fixes: ceb33206 ("ipv4: Kill routes during PMTU/redirect updates.")
      Cc: Eric Garver <egarver@redhat.com>
      Cc: Hannes Sowa <hsowa@redhat.com>
      Signed-off-by: NJon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      45caeaa5
  30. 10 3月, 2017 1 次提交
  31. 23 2月, 2017 1 次提交
    • A
      tcp: setup timestamp offset when write_seq already set · 00355fa5
      Alexey Kodanev 提交于
      Found that when randomized tcp offsets are enabled (by default)
      TCP client can still start new connections without them. Later,
      if server does active close and re-uses sockets in TIME-WAIT
      state, new SYN from client can be rejected on PAWS check inside
      tcp_timewait_state_process(), because either tw_ts_recent or
      rcv_tsval doesn't really have an offset set.
      
      Here is how to reproduce it with LTP netstress tool:
          netstress -R 1 &
          netstress -H 127.0.0.1 -lr 1000000 -a1
      
          [...]
          < S  seq 1956977072 win 43690 TS val 295618 ecr 459956970
          > .  ack 1956911535 win 342 TS val 459967184 ecr 1547117608
          < R  seq 1956911535 win 0 length 0
      +1. < S  seq 1956977072 win 43690 TS val 296640 ecr 459956970
          > S. seq 657450664 ack 1956977073 win 43690 TS val 459968205 ecr 296640
      
      Fixes: 95a22cae ("tcp: randomize tcp timestamp offsets for each connection")
      Signed-off-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      00355fa5
  32. 15 2月, 2017 1 次提交
    • J
      ipv6: Handle IPv4-mapped src to in6addr_any dst. · 052d2369
      Jonathan T. Leighton 提交于
      This patch adds a check on the type of the source address for the case
      where the destination address is in6addr_any. If the source is an
      IPv4-mapped IPv6 source address, the destination is changed to
      ::ffff:127.0.0.1, and otherwise the destination is changed to ::1. This
      is done in three locations to handle UDP calls to either connect() or
      sendmsg() and TCP calls to connect(). Note that udpv6_sendmsg() delays
      handling an in6addr_any destination until very late, so the patch only
      needs to handle the case where the source is an IPv4-mapped IPv6
      address.
      Signed-off-by: NJonathan T. Leighton <jtleight@udel.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      052d2369
  33. 06 2月, 2017 1 次提交
  34. 04 2月, 2017 1 次提交
  35. 27 1月, 2017 1 次提交