1. 20 3月, 2020 1 次提交
  2. 10 3月, 2020 3 次提交
  3. 22 2月, 2020 1 次提交
  4. 16 1月, 2020 1 次提交
  5. 10 1月, 2020 5 次提交
  6. 03 1月, 2020 1 次提交
    • D
      tcp: Add l3index to tcp_md5sig_key and md5 functions · dea53bb8
      David Ahern 提交于
      Add l3index to tcp_md5sig_key to represent the L3 domain of a key, and
      add l3index to tcp_md5_do_add and tcp_md5_do_del to fill in the key.
      
      With the key now based on an l3index, add the new parameter to the
      lookup functions and consider the l3index when looking for a match.
      
      The l3index comes from the skb when processing ingress packets leveraging
      the helpers created for socket lookups, tcp_v4_sdif and inet_iif (and the
      v6 variants). When the sdif index is set it means the packet ingressed a
      device that is part of an L3 domain and inet_iif points to the VRF device.
      For egress, the L3 domain is determined from the socket binding and
      sk_bound_dev_if.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dea53bb8
  7. 14 12月, 2019 1 次提交
  8. 07 12月, 2019 3 次提交
    • G
      tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE() · 721c8daf
      Guillaume Nault 提交于
      Syncookies borrow the ->rx_opt.ts_recent_stamp field to store the
      timestamp of the last synflood. Protect them with READ_ONCE() and
      WRITE_ONCE() since reads and writes aren't serialised.
      
      Use of .rx_opt.ts_recent_stamp for storing the synflood timestamp was
      introduced by a0f82f64 ("syncookies: remove last_synq_overflow from
      struct tcp_sock"). But unprotected accesses were already there when
      timestamp was stored in .last_synq_overflow.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      721c8daf
    • G
      tcp: tighten acceptance of ACKs not matching a child socket · cb44a08f
      Guillaume Nault 提交于
      When no synflood occurs, the synflood timestamp isn't updated.
      Therefore it can be so old that time_after32() can consider it to be
      in the future.
      
      That's a problem for tcp_synq_no_recent_overflow() as it may report
      that a recent overflow occurred while, in fact, it's just that jiffies
      has grown past 'last_overflow' + TCP_SYNCOOKIE_VALID + 2^31.
      
      Spurious detection of recent overflows lead to extra syncookie
      verification in cookie_v[46]_check(). At that point, the verification
      should fail and the packet dropped. But we should have dropped the
      packet earlier as we didn't even send a syncookie.
      
      Let's refine tcp_synq_no_recent_overflow() to report a recent overflow
      only if jiffies is within the
      [last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval. This
      way, no spurious recent overflow is reported when jiffies wraps and
      'last_overflow' becomes in the future from the point of view of
      time_after32().
      
      However, if jiffies wraps and enters the
      [last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval (with
      'last_overflow' being a stale synflood timestamp), then
      tcp_synq_no_recent_overflow() still erroneously reports an
      overflow. In such cases, we have to rely on syncookie verification
      to drop the packet. We unfortunately have no way to differentiate
      between a fresh and a stale syncookie timestamp.
      
      In practice, using last_overflow as lower bound is problematic.
      If the synflood timestamp is concurrently updated between the time
      we read jiffies and the moment we store the timestamp in
      'last_overflow', then 'now' becomes smaller than 'last_overflow' and
      tcp_synq_no_recent_overflow() returns true, potentially dropping a
      valid syncookie.
      
      Reading jiffies after loading the timestamp could fix the problem,
      but that'd require a memory barrier. Let's just accommodate for
      potential timestamp growth instead and extend the interval using
      'last_overflow - HZ' as lower bound.
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb44a08f
    • G
      tcp: fix rejected syncookies due to stale timestamps · 04d26e7b
      Guillaume Nault 提交于
      If no synflood happens for a long enough period of time, then the
      synflood timestamp isn't refreshed and jiffies can advance so much
      that time_after32() can't accurately compare them any more.
      
      Therefore, we can end up in a situation where time_after32(now,
      last_overflow + HZ) returns false, just because these two values are
      too far apart. In that case, the synflood timestamp isn't updated as
      it should be, which can trick tcp_synq_no_recent_overflow() into
      rejecting valid syncookies.
      
      For example, let's consider the following scenario on a system
      with HZ=1000:
      
        * The synflood timestamp is 0, either because that's the timestamp
          of the last synflood or, more commonly, because we're working with
          a freshly created socket.
      
        * We receive a new SYN, which triggers synflood protection. Let's say
          that this happens when jiffies == 2147484649 (that is,
          'synflood timestamp' + HZ + 2^31 + 1).
      
        * Then tcp_synq_overflow() doesn't update the synflood timestamp,
          because time_after32(2147484649, 1000) returns false.
          With:
            - 2147484649: the value of jiffies, aka. 'now'.
            - 1000: the value of 'last_overflow' + HZ.
      
        * A bit later, we receive the ACK completing the 3WHS. But
          cookie_v[46]_check() rejects it because tcp_synq_no_recent_overflow()
          says that we're not under synflood. That's because
          time_after32(2147484649, 120000) returns false.
          With:
            - 2147484649: the value of jiffies, aka. 'now'.
            - 120000: the value of 'last_overflow' + TCP_SYNCOOKIE_VALID.
      
          Of course, in reality jiffies would have increased a bit, but this
          condition will last for the next 119 seconds, which is far enough
          to accommodate for jiffie's growth.
      
      Fix this by updating the overflow timestamp whenever jiffies isn't
      within the [last_overflow, last_overflow + HZ] range. That shouldn't
      have any performance impact since the update still happens at most once
      per second.
      
      Now we're guaranteed to have fresh timestamps while under synflood, so
      tcp_synq_no_recent_overflow() can safely use it with time_after32() in
      such situations.
      
      Stale timestamps can still make tcp_synq_no_recent_overflow() return
      the wrong verdict when not under synflood. This will be handled in the
      next patch.
      
      For 64 bits architectures, the problem was introduced with the
      conversion of ->tw_ts_recent_stamp to 32 bits integer by commit
      cca9bab1 ("tcp: use monotonic timestamps for PAWS").
      The problem has always been there on 32 bits architectures.
      
      Fixes: cca9bab1 ("tcp: use monotonic timestamps for PAWS")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04d26e7b
  9. 08 11月, 2019 1 次提交
  10. 14 10月, 2019 3 次提交
  11. 10 10月, 2019 2 次提交
    • E
      net: silence KCSAN warnings about sk->sk_backlog.len reads · 70c26558
      Eric Dumazet 提交于
      sk->sk_backlog.len can be written by BH handlers, and read
      from process contexts in a lockless way.
      
      Note the write side should also use WRITE_ONCE() or a variant.
      We need some agreement about the best way to do this.
      
      syzbot reported :
      
      BUG: KCSAN: data-race in tcp_add_backlog / tcp_grow_window.isra.0
      
      write to 0xffff88812665f32c of 4 bytes by interrupt on cpu 1:
       sk_add_backlog include/net/sock.h:934 [inline]
       tcp_add_backlog+0x4a0/0xcc0 net/ipv4/tcp_ipv4.c:1737
       tcp_v4_rcv+0x1aba/0x1bf0 net/ipv4/tcp_ipv4.c:1925
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
       virtnet_receive drivers/net/virtio_net.c:1323 [inline]
       virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
       napi_poll net/core/dev.c:6352 [inline]
       net_rx_action+0x3ae/0xa50 net/core/dev.c:6418
      
      read to 0xffff88812665f32c of 4 bytes by task 7292 on cpu 0:
       tcp_space include/net/tcp.h:1373 [inline]
       tcp_grow_window.isra.0+0x6b/0x480 net/ipv4/tcp_input.c:413
       tcp_event_data_recv+0x68f/0x990 net/ipv4/tcp_input.c:717
       tcp_rcv_established+0xbfe/0xf50 net/ipv4/tcp_input.c:5618
       tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
       sk_backlog_rcv include/net/sock.h:945 [inline]
       __release_sock+0x135/0x1e0 net/core/sock.c:2427
       release_sock+0x61/0x160 net/core/sock.c:2943
       tcp_recvmsg+0x63b/0x1a30 net/ipv4/tcp.c:2181
       inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:871 [inline]
       sock_recvmsg net/socket.c:889 [inline]
       sock_recvmsg+0x92/0xb0 net/socket.c:885
       sock_read_iter+0x15f/0x1e0 net/socket.c:967
       call_read_iter include/linux/fs.h:1864 [inline]
       new_sync_read+0x389/0x4f0 fs/read_write.c:414
       __vfs_read+0xb1/0xc0 fs/read_write.c:427
       vfs_read fs/read_write.c:461 [inline]
       vfs_read+0x143/0x2c0 fs/read_write.c:446
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 7292 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      70c26558
    • E
      tcp: annotate lockless access to tcp_memory_pressure · 1f142c17
      Eric Dumazet 提交于
      tcp_memory_pressure is read without holding any lock,
      and its value could be changed on other cpus.
      
      Use READ_ONCE() to annotate these lockless reads.
      
      The write side is already using atomic ops.
      
      Fixes: b8da51eb ("tcp: introduce tcp_under_memory_pressure()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      1f142c17
  12. 01 9月, 2019 1 次提交
  13. 10 8月, 2019 1 次提交
  14. 31 7月, 2019 1 次提交
  15. 22 7月, 2019 2 次提交
    • J
      bpf: sockmap/tls, close can race with map free · 95fa1454
      John Fastabend 提交于
      When a map free is called and in parallel a socket is closed we
      have two paths that can potentially reset the socket prot ops, the
      bpf close() path and the map free path. This creates a problem
      with which prot ops should be used from the socket closed side.
      
      If the map_free side completes first then we want to call the
      original lowest level ops. However, if the tls path runs first
      we want to call the sockmap ops. Additionally there was no locking
      around prot updates in TLS code paths so the prot ops could
      be changed multiple times once from TLS path and again from sockmap
      side potentially leaving ops pointed at either TLS or sockmap
      when psock and/or tls context have already been destroyed.
      
      To fix this race first only update ops inside callback lock
      so that TLS, sockmap and lowest level all agree on prot state.
      Second and a ULP callback update() so that lower layers can
      inform the upper layer when they are being removed allowing the
      upper layer to reset prot ops.
      
      This gets us close to allowing sockmap and tls to be stacked
      in arbitrary order but will save that patch for *next trees.
      
      v4:
       - make sure we don't free things for device;
       - remove the checks which swap the callbacks back
         only if TLS is at the top.
      
      Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
      Fixes: 02c558b2 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NDirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      95fa1454
    • E
      tcp: be more careful in tcp_fragment() · b617158d
      Eric Dumazet 提交于
      Some applications set tiny SO_SNDBUF values and expect
      TCP to just work. Recent patches to address CVE-2019-11478
      broke them in case of losses, since retransmits might
      be prevented.
      
      We should allow these flows to make progress.
      
      This patch allows the first and last skb in retransmit queue
      to be split even if memory limits are hit.
      
      It also adds the some room due to the fact that tcp_sendmsg()
      and tcp_sendpage() might overshoot sk_wmem_queued by about one full
      TSO skb (64KB size). Note this allowance was already present
      in stable backports for kernels < 4.15
      
      Note for < 4.15 backports :
       tcp_rtx_queue_tail() will probably look like :
      
      static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
      {
      	struct sk_buff *skb = tcp_send_head(sk);
      
      	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
      }
      
      Fixes: f070ef2a ("tcp: tcp_fragment() should apply sane memory limits")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrew Prout <aprout@ll.mit.edu>
      Tested-by: NAndrew Prout <aprout@ll.mit.edu>
      Tested-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Tested-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NChristoph Paasch <cpaasch@apple.com>
      Cc: Jonathan Looney <jtl@netflix.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b617158d
  16. 19 7月, 2019 1 次提交
  17. 08 7月, 2019 1 次提交
  18. 03 7月, 2019 1 次提交
  19. 23 6月, 2019 1 次提交
    • A
      net: fastopen: robustness and endianness fixes for SipHash · 438ac880
      Ard Biesheuvel 提交于
      Some changes to the TCP fastopen code to make it more robust
      against future changes in the choice of key/cookie size, etc.
      
      - Instead of keeping the SipHash key in an untyped u8[] buffer
        and casting it to the right type upon use, use the correct
        type directly. This ensures that the key will appear at the
        correct alignment if we ever change the way these data
        structures are allocated. (Currently, they are only allocated
        via kmalloc so they always appear at the correct alignment)
      
      - Use DIV_ROUND_UP when sizing the u64[] array to hold the
        cookie, so it is always of sufficient size, even if
        TCP_FASTOPEN_COOKIE_MAX is no longer a multiple of 8.
      
      - Drop the 'len' parameter from the tcp_fastopen_reset_cipher()
        function, which is no longer used.
      
      - Add endian swabbing when setting the keys and calculating the hash,
        to ensure that cookie values are the same for a given key and
        source/destination address pair regardless of the endianness of
        the server.
      
      Note that none of these are functional changes wrt the current
      state of the code, with the exception of the swabbing, which only
      affects big endian systems.
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      438ac880
  20. 18 6月, 2019 1 次提交
    • A
      net: ipv4: move tcp_fastopen server side code to SipHash library · c681edae
      Ard Biesheuvel 提交于
      Using a bare block cipher in non-crypto code is almost always a bad idea,
      not only for security reasons (and we've seen some examples of this in
      the kernel in the past), but also for performance reasons.
      
      In the TCP fastopen case, we call into the bare AES block cipher one or
      two times (depending on whether the connection is IPv4 or IPv6). On most
      systems, this results in a call chain such as
      
        crypto_cipher_encrypt_one(ctx, dst, src)
          crypto_cipher_crt(tfm)->cit_encrypt_one(crypto_cipher_tfm(tfm), ...);
            aesni_encrypt
              kernel_fpu_begin();
              aesni_enc(ctx, dst, src); // asm routine
              kernel_fpu_end();
      
      It is highly unlikely that the use of special AES instructions has a
      benefit in this case, especially since we are doing the above twice
      for IPv6 connections, instead of using a transform which can process
      the entire input in one go.
      
      We could switch to the cbcmac(aes) shash, which would at least get
      rid of the duplicated overhead in *some* cases (i.e., today, only
      arm64 has an accelerated implementation of cbcmac(aes), while x86 will
      end up using the generic cbcmac template wrapping the AES-NI cipher,
      which basically ends up doing exactly the above). However, in the given
      context, it makes more sense to use a light-weight MAC algorithm that
      is more suitable for the purpose at hand, such as SipHash.
      
      Since the output size of SipHash already matches our chosen value for
      TCP_FASTOPEN_COOKIE_SIZE, and given that it accepts arbitrary input
      sizes, this greatly simplifies the code as well.
      
      NOTE: Server farms backing a single server IP for load balancing purposes
            and sharing a single fastopen key will be adversely affected by
            this change unless all systems in the pool receive their kernel
            upgrades at the same time.
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c681edae
  21. 16 6月, 2019 1 次提交
    • E
      tcp: limit payload size of sacked skbs · 3b4929f6
      Eric Dumazet 提交于
      Jonathan Looney reported that TCP can trigger the following crash
      in tcp_shifted_skb() :
      
      	BUG_ON(tcp_skb_pcount(skb) < pcount);
      
      This can happen if the remote peer has advertized the smallest
      MSS that linux TCP accepts : 48
      
      An skb can hold 17 fragments, and each fragment can hold 32KB
      on x86, or 64KB on PowerPC.
      
      This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs
      can overflow.
      
      Note that tcp_sendmsg() builds skbs with less than 64KB
      of payload, so this problem needs SACK to be enabled.
      SACK blocks allow TCP to coalesce multiple skbs in the retransmit
      queue, thus filling the 17 fragments to maximal capacity.
      
      CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs
      
      Fixes: 832d11c5 ("tcp: Try to restore large SKBs while SACK processing")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJonathan Looney <jtl@netflix.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Reviewed-by: NTyler Hicks <tyhicks@canonical.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b4929f6
  22. 15 6月, 2019 1 次提交
  23. 13 6月, 2019 1 次提交
    • E
      tcp: add optional per socket transmit delay · a842fe14
      Eric Dumazet 提交于
      Adding delays to TCP flows is crucial for studying behavior
      of TCP stacks, including congestion control modules.
      
      Linux offers netem module, but it has unpractical constraints :
      - Need root access to change qdisc
      - Hard to setup on egress if combined with non trivial qdisc like FQ
      - Single delay for all flows.
      
      EDT (Earliest Departure Time) adoption in TCP stack allows us
      to enable a per socket delay at a very small cost.
      
      Networking tools can now establish thousands of flows, each of them
      with a different delay, simulating real world conditions.
      
      This requires FQ packet scheduler or a EDT-enabled NIC.
      
      This patchs adds TCP_TX_DELAY socket option, to set a delay in
      usec units.
      
        unsigned int tx_delay = 10000; /* 10 msec */
      
        setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));
      
      Note that FQ packet scheduler limits might need some tweaking :
      
      man tc-fq
      
      PARAMETERS
         limit
             Hard  limit  on  the  real  queue  size. When this limit is
             reached, new packets are dropped. If the value is  lowered,
             packets  are  dropped so that the new limit is met. Default
             is 10000 packets.
      
         flow_limit
             Hard limit on the maximum  number  of  packets  queued  per
             flow.  Default value is 100.
      
      Use of TCP_TX_DELAY option will increase number of skbs in FQ qdisc,
      so packets would be dropped if any of the previous limit is hit.
      
      Use of a jump label makes this support runtime-free, for hosts
      never using the option.
      
      Also note that TSQ (TCP Small Queues) limits are slightly changed
      with this patch : we need to account that skbs artificially delayed
      wont stop us providind more skbs to feed the pipe (netem uses
      skb_orphan_partial() for this purpose, but FQ can not use this trick)
      
      Because of that, using big delays might very well trigger
      old bugs in TSO auto defer logic and/or sndbuf limited detection.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a842fe14
  24. 31 5月, 2019 2 次提交
  25. 10 5月, 2019 1 次提交
  26. 23 4月, 2019 1 次提交
  27. 27 2月, 2019 1 次提交