1. 20 5月, 2020 2 次提交
  2. 19 5月, 2020 1 次提交
  3. 17 5月, 2020 1 次提交
  4. 16 5月, 2020 2 次提交
  5. 15 5月, 2020 3 次提交
    • A
      ipmr: Add lockdep expression to ipmr_for_each_table macro · 7013908c
      Amol Grover 提交于
      During the initialization process, ipmr_new_table() is called
      to create new tables which in turn calls ipmr_get_table() which
      traverses net->ipv4.mr_tables without holding the writer lock.
      However, this is safe to do so as no tables exist at this time.
      Hence add a suitable lockdep expression to silence the following
      false-positive warning:
      
      =============================
      WARNING: suspicious RCU usage
      5.7.0-rc3-next-20200428-syzkaller #0 Not tainted
      -----------------------------
      net/ipv4/ipmr.c:136 RCU-list traversed in non-reader section!!
      
      ipmr_get_table+0x130/0x160 net/ipv4/ipmr.c:136
      ipmr_new_table net/ipv4/ipmr.c:403 [inline]
      ipmr_rules_init net/ipv4/ipmr.c:248 [inline]
      ipmr_net_init+0x133/0x430 net/ipv4/ipmr.c:3089
      
      Fixes: f0ad0860 ("ipv4: ipmr: support multiple tables")
      Reported-by: syzbot+1519f497f2f9f08183c6@syzkaller.appspotmail.com
      Suggested-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NAmol Grover <frextrite@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7013908c
    • A
      ipmr: Fix RCU list debugging warning · a14fbcd4
      Amol Grover 提交于
      ipmr_for_each_table() macro uses list_for_each_entry_rcu()
      for traversing outside of an RCU read side critical section
      but under the protection of rtnl_mutex. Hence, add the
      corresponding lockdep expression to silence the following
      false-positive warning at boot:
      
      [    4.319347] =============================
      [    4.319349] WARNING: suspicious RCU usage
      [    4.319351] 5.5.4-stable #17 Tainted: G            E
      [    4.319352] -----------------------------
      [    4.319354] net/ipv4/ipmr.c:1757 RCU-list traversed in non-reader section!!
      
      Fixes: f0ad0860 ("ipv4: ipmr: support multiple tables")
      Signed-off-by: NAmol Grover <frextrite@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a14fbcd4
    • E
      tcp: fix error recovery in tcp_zerocopy_receive() · e776af60
      Eric Dumazet 提交于
      If user provides wrong virtual address in TCP_ZEROCOPY_RECEIVE
      operation we want to return -EINVAL error.
      
      But depending on zc->recv_skip_hint content, we might return
      -EIO error if the socket has SOCK_DONE set.
      
      Make sure to return -EINVAL in this case.
      
      BUG: KMSAN: uninit-value in tcp_zerocopy_receive net/ipv4/tcp.c:1833 [inline]
      BUG: KMSAN: uninit-value in do_tcp_getsockopt+0x4494/0x6320 net/ipv4/tcp.c:3685
      CPU: 1 PID: 625 Comm: syz-executor.0 Not tainted 5.7.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c9/0x220 lib/dump_stack.c:118
       kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:121
       __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
       tcp_zerocopy_receive net/ipv4/tcp.c:1833 [inline]
       do_tcp_getsockopt+0x4494/0x6320 net/ipv4/tcp.c:3685
       tcp_getsockopt+0xf8/0x1f0 net/ipv4/tcp.c:3728
       sock_common_getsockopt+0x13f/0x180 net/core/sock.c:3131
       __sys_getsockopt+0x533/0x7b0 net/socket.c:2177
       __do_sys_getsockopt net/socket.c:2192 [inline]
       __se_sys_getsockopt+0xe1/0x100 net/socket.c:2189
       __x64_sys_getsockopt+0x62/0x80 net/socket.c:2189
       do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:297
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x45c829
      Code: 0d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f1deeb72c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000037
      RAX: ffffffffffffffda RBX: 00000000004e01e0 RCX: 000000000045c829
      RDX: 0000000000000023 RSI: 0000000000000006 RDI: 0000000000000009
      RBP: 000000000078bf00 R08: 0000000020000200 R09: 0000000000000000
      R10: 00000000200001c0 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 00000000000001d8 R14: 00000000004d3038 R15: 00007f1deeb736d4
      
      Local variable ----zc@do_tcp_getsockopt created at:
       do_tcp_getsockopt+0x1a74/0x6320 net/ipv4/tcp.c:3670
       do_tcp_getsockopt+0x1a74/0x6320 net/ipv4/tcp.c:3670
      
      Fixes: 05255b82 ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e776af60
  6. 13 5月, 2020 3 次提交
  7. 12 5月, 2020 1 次提交
    • C
      net: cleanly handle kernel vs user buffers for ->msg_control · 1f466e1f
      Christoph Hellwig 提交于
      The msg_control field in struct msghdr can either contain a user
      pointer when used with the recvmsg system call, or a kernel pointer
      when used with sendmsg.  To complicate things further kernel_recvmsg
      can stuff a kernel pointer in and then use set_fs to make the uaccess
      helpers accept it.
      
      Replace it with a union of a kernel pointer msg_control field, and
      a user pointer msg_control_user one, and allow kernel_recvmsg operate
      on a proper kernel pointer using a bitfield to override the normal
      choice of a user pointer for recvmsg.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1f466e1f
  8. 09 5月, 2020 4 次提交
  9. 08 5月, 2020 1 次提交
  10. 07 5月, 2020 2 次提交
    • E
      tcp: defer xmit timer reset in tcp_xmit_retransmit_queue() · 916e6d1a
      Eric Dumazet 提交于
      As hinted in prior change ("tcp: refine tcp_pacing_delay()
      for very low pacing rates"), it is probably best arming
      the xmit timer only when all the packets have been scheduled,
      rather than when the head of rtx queue has been re-sent.
      
      This does matter for flows having extremely low pacing rates,
      since their tp->tcp_wstamp_ns could be far in the future.
      
      Note that the regular xmit path has a stronger limit
      in tcp_small_queue_check(), meaning it is less likely to
      go beyond the pacing horizon.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      916e6d1a
    • E
      tcp: refine tcp_pacing_delay() for very low pacing rates · 8dc242ad
      Eric Dumazet 提交于
      With the addition of horizon feature to sch_fq, we noticed some
      suboptimal behavior of extremely low pacing rate TCP flows, especially
      when TCP is not aware of a drop happening in lower stacks.
      
      Back in commit 3f80e08f ("tcp: add tcp_reset_xmit_timer() helper"),
      tcp_pacing_delay() was added to estimate an extra delay to add to standard
      rto timers.
      
      This patch removes the skb argument from this helper and
      tcp_reset_xmit_timer() because it makes more sense to simply
      consider the time at which next packet is allowed to be sent,
      instead of the time of whatever packet has been sent.
      
      This avoids arming RTO timer too soon and removes
      spurious horizon drops.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8dc242ad
  11. 06 5月, 2020 2 次提交
    • J
      bpf, sockmap: bpf_tcp_ingress needs to subtract bytes from sg.size · 81aabbb9
      John Fastabend 提交于
      In bpf_tcp_ingress we used apply_bytes to subtract bytes from sg.size
      which is used to track total bytes in a message. But this is not
      correct because apply_bytes is itself modified in the main loop doing
      the mem_charge.
      
      Then at the end of this we have sg.size incorrectly set and out of
      sync with actual sk values. Then we can get a splat if we try to
      cork the data later and again try to redirect the msg to ingress. To
      fix instead of trying to track msg.size do the easy thing and include
      it as part of the sk_msg_xfer logic so that when the msg is moved the
      sg.size is always correct.
      
      To reproduce the below users will need ingress + cork and hit an
      error path that will then try to 'free' the skmsg.
      
      [  173.699981] BUG: KASAN: null-ptr-deref in sk_msg_free_elem+0xdd/0x120
      [  173.699987] Read of size 8 at addr 0000000000000008 by task test_sockmap/5317
      
      [  173.700000] CPU: 2 PID: 5317 Comm: test_sockmap Tainted: G          I       5.7.0-rc1+ #43
      [  173.700005] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
      [  173.700009] Call Trace:
      [  173.700021]  dump_stack+0x8e/0xcb
      [  173.700029]  ? sk_msg_free_elem+0xdd/0x120
      [  173.700034]  ? sk_msg_free_elem+0xdd/0x120
      [  173.700042]  __kasan_report+0x102/0x15f
      [  173.700052]  ? sk_msg_free_elem+0xdd/0x120
      [  173.700060]  kasan_report+0x32/0x50
      [  173.700070]  sk_msg_free_elem+0xdd/0x120
      [  173.700080]  __sk_msg_free+0x87/0x150
      [  173.700094]  tcp_bpf_send_verdict+0x179/0x4f0
      [  173.700109]  tcp_bpf_sendpage+0x3ce/0x5d0
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/158861290407.14306.5327773422227552482.stgit@john-Precision-5820-Tower
      81aabbb9
    • W
      erspan: Add type I version 0 support. · f989d546
      William Tu 提交于
      The Type I ERSPAN frame format is based on the barebones
      IP + GRE(4-byte) encapsulation on top of the raw mirrored frame.
      Both type I and II use 0x88BE as protocol type. Unlike type II
      and III, no sequence number or key is required.
      To creat a type I erspan tunnel device:
        $ ip link add dev erspan11 type erspan \
                  local 172.16.1.100 remote 172.16.1.200 \
                  erspan_ver 0
      Signed-off-by: NWilliam Tu <u9012063@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f989d546
  12. 03 5月, 2020 1 次提交
  13. 02 5月, 2020 1 次提交
    • C
      net: Replace the limit of TCP_LINGER2 with TCP_FIN_TIMEOUT_MAX · f0628c52
      Cambda Zhu 提交于
      This patch changes the behavior of TCP_LINGER2 about its limit. The
      sysctl_tcp_fin_timeout used to be the limit of TCP_LINGER2 but now it's
      only the default value. A new macro named TCP_FIN_TIMEOUT_MAX is added
      as the limit of TCP_LINGER2, which is 2 minutes.
      
      Since TCP_LINGER2 used sysctl_tcp_fin_timeout as the default value
      and the limit in the past, the system administrator cannot set the
      default value for most of sockets and let some sockets have a greater
      timeout. It might be a mistake that let the sysctl to be the limit of
      the TCP_LINGER2. Maybe we can add a new sysctl to set the max of
      TCP_LINGER2, but FIN-WAIT-2 timeout is usually no need to be too long
      and 2 minutes are legal considering TCP specs.
      
      Changes in v3:
      - Remove the new socket option and change the TCP_LINGER2 behavior so
        that the timeout can be set to value between sysctl_tcp_fin_timeout
        and 2 minutes.
      
      Changes in v2:
      - Add int overflow check for the new socket option.
      
      Changes in v1:
      - Add a new socket option to set timeout greater than
        sysctl_tcp_fin_timeout.
      Signed-off-by: NCambda Zhu <cambda@linux.alibaba.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f0628c52
  14. 01 5月, 2020 7 次提交
    • E
      tcp: add hrtimer slack to sack compression · a70437cc
      Eric Dumazet 提交于
      Add a sysctl to control hrtimer slack, default of 100 usec.
      
      This gives the opportunity to reduce system overhead,
      and help very short RTT flows.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a70437cc
    • E
      tcp: tcp_sack_new_ofo_skb() should be more conservative · ccd0628f
      Eric Dumazet 提交于
      Currently, tcp_sack_new_ofo_skb() sends an ack if prior
      acks were 'compressed', if room has to be made in tp->selective_acks[]
      
      But there is no guarantee all four sack ranges can be included
      in SACK option. As a matter of fact, when TCP timestamps option
      is used, only three SACK ranges can be included.
      
      Lets assume only two ranges can be included, and force the ack:
      
      - When we touch more than 2 ranges in the reordering
        done if tcp_sack_extend() could be done.
      
      - If we have at least 2 ranges when adding a new one.
      
      This enforces that before a range is in third or fourth
      position, at least one ACK packet included it in first/second
      position.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ccd0628f
    • E
      tcp: add tp->dup_ack_counter · 2b195850
      Eric Dumazet 提交于
      In commit 86de5921 ("tcp: defer SACK compression after DupThresh")
      I added a TCP_FASTRETRANS_THRESH bias to tp->compressed_ack in order
      to enable sack compression only after 3 dupacks.
      
      Since we plan to relax this rule for flows that involve
      stacks not requiring this old rule, this patch adds
      a distinct tp->dup_ack_counter.
      
      This means the TCP_FASTRETRANS_THRESH value is now used
      in a single location that a future patch can adjust:
      
      	if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
      		tp->dup_ack_counter++;
      		goto send_now;
      	}
      
      This patch also introduces tcp_sack_compress_send_ack()
      helper to ease following patch comprehension.
      
      This patch refines LINUX_MIB_TCPACKCOMPRESSED to not
      count the acks that we had to send if the timer expires
      or tcp_sack_compress_send_ack() is sending an ack.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b195850
    • D
      inet_diag: add support for cgroup filter · b1f3e43d
      Dmitry Yakunin 提交于
      This patch adds ability to filter sockets based on cgroup v2 ID.
      Such filter is helpful in ss utility for filtering sockets by
      cgroup pathname.
      Signed-off-by: NDmitry Yakunin <zeil@yandex-team.ru>
      Reviewed-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b1f3e43d
    • D
      inet_diag: add cgroup id attribute · 6e3a401f
      Dmitry Yakunin 提交于
      This patch adds cgroup v2 ID to common inet diag message attributes.
      Cgroup v2 ID is kernfs ID (ino or ino+gen). This attribute allows filter
      inet diag output by cgroup ID obtained by name_to_handle_at() syscall.
      When net_cls or net_prio cgroup is activated this ID is equal to 1 (root
      cgroup ID) for newly created sockets.
      
      Some notes about this ID:
      
      1) gets initialized in socket() syscall
      2) incoming socket gets ID from listening socket
         (not during accept() syscall)
      3) not changed when process get moved to another cgroup
      4) can point to deleted cgroup (refcounting)
      
      v2:
        - use CONFIG_SOCK_CGROUP_DATA instead if CONFIG_CGROUPS
      
      v3:
        - fix attr size by using nla_total_size_64bit() (Eric Dumazet)
        - more detailed commit message (Konstantin Khlebnikov)
      Signed-off-by: NDmitry Yakunin <zeil@yandex-team.ru>
      Reviewed-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-By: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e3a401f
    • P
      mptcp: move option parsing into mptcp_incoming_options() · cfde141e
      Paolo Abeni 提交于
      The mptcp_options_received structure carries several per
      packet flags (mp_capable, mp_join, etc.). Such fields must
      be cleared on each packet, even on dropped ones or packet
      not carrying any MPTCP options, but the current mptcp
      code clears them only on TCP option reset.
      
      On several races/corner cases we end-up with stray bits in
      incoming options, leading to WARN_ON splats. e.g.:
      
      [  171.164906] Bad mapping: ssn=32714 map_seq=1 map_data_len=32713
      [  171.165006] WARNING: CPU: 1 PID: 5026 at net/mptcp/subflow.c:533 warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
      [  171.167632] Modules linked in: ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel geneve ip6_udp_tunnel udp_tunnel macsec macvtap tap ipvlan macvlan 8021q garp mrp xfrm_interface veth netdevsim nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun binfmt_misc intel_rapl_msr intel_rapl_common rfkill kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_console ata_generic virtio_blk virtio_net net_failover failover ata_piix libata
      [  171.199464] CPU: 1 PID: 5026 Comm: repro Not tainted 5.7.0-rc1.mptcp_f227fdf5d388+ #95
      [  171.200886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
      [  171.202546] RIP: 0010:warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
      [  171.206537] Code: c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 1d 8b 55 3c 44 89 e6 48 c7 c7 20 51 13 95 e8 37 8b 22 fe <0f> 0b 48 83 c4 08 5b 5d 41 5c c3 89 4c 24 04 e8 db d6 94 fe 8b 4c
      [  171.220473] RSP: 0018:ffffc90000150560 EFLAGS: 00010282
      [  171.221639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [  171.223108] RDX: 0000000000000000 RSI: 0000000000000008 RDI: fffff5200002a09e
      [  171.224388] RBP: ffff8880aa6e3c00 R08: 0000000000000001 R09: fffffbfff2ec9955
      [  171.225706] R10: ffffffff9764caa7 R11: fffffbfff2ec9954 R12: 0000000000007fca
      [  171.227211] R13: ffff8881066f4a7f R14: ffff8880aa6e3c00 R15: 0000000000000020
      [  171.228460] FS:  00007f8623719740(0000) GS:ffff88810be00000(0000) knlGS:0000000000000000
      [  171.230065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  171.231303] CR2: 00007ffdab190a50 CR3: 00000001038ea006 CR4: 0000000000160ee0
      [  171.232586] Call Trace:
      [  171.233109]  <IRQ>
      [  171.233531] get_mapping_status (linux-mptcp/net/mptcp/subflow.c:691)
      [  171.234371] mptcp_subflow_data_available (linux-mptcp/net/mptcp/subflow.c:736 linux-mptcp/net/mptcp/subflow.c:832)
      [  171.238181] subflow_state_change (linux-mptcp/net/mptcp/subflow.c:1085 (discriminator 1))
      [  171.239066] tcp_fin (linux-mptcp/net/ipv4/tcp_input.c:4217)
      [  171.240123] tcp_data_queue (linux-mptcp/./include/linux/compiler.h:199 linux-mptcp/net/ipv4/tcp_input.c:4822)
      [  171.245083] tcp_rcv_established (linux-mptcp/./include/linux/skbuff.h:1785 linux-mptcp/./include/net/tcp.h:1774 linux-mptcp/./include/net/tcp.h:1847 linux-mptcp/net/ipv4/tcp_input.c:5238 linux-mptcp/net/ipv4/tcp_input.c:5730)
      [  171.254089] tcp_v4_rcv (linux-mptcp/./include/linux/spinlock.h:393 linux-mptcp/net/ipv4/tcp_ipv4.c:2009)
      [  171.258969] ip_protocol_deliver_rcu (linux-mptcp/net/ipv4/ip_input.c:204 (discriminator 1))
      [  171.260214] ip_local_deliver_finish (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/ipv4/ip_input.c:232)
      [  171.261389] ip_local_deliver (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:252)
      [  171.265884] ip_rcv (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:539)
      [  171.273666] process_backlog (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/core/dev.c:6135)
      [  171.275328] net_rx_action (linux-mptcp/net/core/dev.c:6572 linux-mptcp/net/core/dev.c:6640)
      [  171.280472] __do_softirq (linux-mptcp/./arch/x86/include/asm/jump_label.h:25 linux-mptcp/./include/linux/jump_label.h:200 linux-mptcp/./include/trace/events/irq.h:142 linux-mptcp/kernel/softirq.c:293)
      [  171.281379] do_softirq_own_stack (linux-mptcp/arch/x86/entry/entry_64.S:1083)
      [  171.282358]  </IRQ>
      
      We could address the issue clearing explicitly the relevant fields
      in several places - tcp_parse_option, tcp_fast_parse_options,
      possibly others.
      
      Instead we move the MPTCP option parsing into the already existing
      mptcp ingress hook, so that we need to clear the fields in a single
      place.
      
      This allows us dropping an MPTCP hook from the TCP code and
      removing the quite large mptcp_options_received from the tcp_sock
      struct. On the flip side, the MPTCP sockets will traverse the
      option space twice (in tcp_parse_option() and in
      mptcp_incoming_options(). That looks acceptable: we already
      do that for syn and 3rd ack packets, plain TCP socket will
      benefit from it, and even MPTCP sockets will experience better
      code locality, reducing the jumps between TCP and MPTCP code.
      
      v1 -> v2:
       - rebased on current '-net' tree
      
      Fixes: 648ef4b8 ("mptcp: Implement MPTCP receive path")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cfde141e
    • P
      mptcp: consolidate synack processing. · 263e1201
      Paolo Abeni 提交于
      Currently the MPTCP code uses 2 hooks to process syn-ack
      packets, mptcp_rcv_synsent() and the sk_rx_dst_set()
      callback.
      
      We can drop the first, moving the relevant code into the
      latter, reducing the hooking into the TCP code. This is
      also needed by the next patch.
      
      v1 -> v2:
       - use local tcp sock ptr instead of casting the sk variable
         several times - DaveM
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      263e1201
  15. 29 4月, 2020 3 次提交
  16. 28 4月, 2020 1 次提交
  17. 27 4月, 2020 1 次提交
  18. 26 4月, 2020 1 次提交
    • F
      tcp: mptcp: use mptcp receive buffer space to select rcv window · 071c8ed6
      Florian Westphal 提交于
      In MPTCP, the receive window is shared across all subflows, because it
      refers to the mptcp-level sequence space.
      
      MPTCP receivers already place incoming packets on the mptcp socket
      receive queue and will charge it to the mptcp socket rcvbuf until
      userspace consumes the data.
      
      Update __tcp_select_window to use the occupancy of the parent/mptcp
      socket instead of the subflow socket in case the tcp socket is part
      of a logical mptcp connection.
      
      This commit doesn't change choice of initial window for passive or active
      connections.
      While it would be possible to change those as well, this adds complexity
      (especially when handling MP_JOIN requests).  Furthermore, the MPTCP RFC
      specifically says that a MPTCP sender 'MUST NOT use the RCV.WND field
      of a TCP segment at the connection level if it does not also carry a DSS
      option with a Data ACK field.'
      
      SYN/SYNACK packets do not carry a DSS option with a Data ACK field.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      071c8ed6
  19. 23 4月, 2020 2 次提交
    • D
      ipv4: Update fib_select_default to handle nexthop objects · 7c74b0be
      David Ahern 提交于
      A user reported [0] hitting the WARN_ON in fib_info_nh:
      
          [ 8633.839816] ------------[ cut here ]------------
          [ 8633.839819] WARNING: CPU: 0 PID: 1719 at include/net/nexthop.h:251 fib_select_path+0x303/0x381
          ...
          [ 8633.839846] RIP: 0010:fib_select_path+0x303/0x381
          ...
          [ 8633.839848] RSP: 0018:ffffb04d407f7d00 EFLAGS: 00010286
          [ 8633.839850] RAX: 0000000000000000 RBX: ffff9460b9897ee8 RCX: 00000000000000fe
          [ 8633.839851] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 0000000000000000
          [ 8633.839852] RBP: ffff946076049850 R08: 0000000059263a83 R09: ffff9460840e4000
          [ 8633.839853] R10: 0000000000000014 R11: 0000000000000000 R12: ffffb04d407f7dc0
          [ 8633.839854] R13: ffffffffa4ce3240 R14: 0000000000000000 R15: ffff9460b7681f60
          [ 8633.839857] FS:  00007fcac2e02700(0000) GS:ffff9460bdc00000(0000) knlGS:0000000000000000
          [ 8633.839858] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [ 8633.839859] CR2: 00007f27beb77e28 CR3: 0000000077734000 CR4: 00000000000006f0
          [ 8633.839867] Call Trace:
          [ 8633.839871]  ip_route_output_key_hash_rcu+0x421/0x890
          [ 8633.839873]  ip_route_output_key_hash+0x5e/0x80
          [ 8633.839876]  ip_route_output_flow+0x1a/0x50
          [ 8633.839878]  __ip4_datagram_connect+0x154/0x310
          [ 8633.839880]  ip4_datagram_connect+0x28/0x40
          [ 8633.839882]  __sys_connect+0xd6/0x100
          ...
      
      The WARN_ON is triggered in fib_select_default which is invoked when
      there are multiple default routes. Update the function to use
      fib_info_nhc and convert the nexthop checks to use fib_nh_common.
      
      Add test case that covers the affected code path.
      
      [0] https://github.com/FRRouting/frr/issues/6089
      
      Fixes: 493ced1a ("ipv4: Allow routes to use nexthop objects")
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7c74b0be
    • D
      xfrm: Always set XFRM_TRANSFORMED in xfrm{4,6}_output_finish · 0c922a48
      David Ahern 提交于
      IPSKB_XFRM_TRANSFORMED and IP6SKB_XFRM_TRANSFORMED are skb flags set by
      xfrm code to tell other skb handlers that the packet has been passed
      through the xfrm output functions. Simplify the code and just always
      set them rather than conditionally based on netfilter enabled thus
      making the flag available for other users.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c922a48
  20. 21 4月, 2020 1 次提交