1. 13 7月, 2022 6 次提交
  2. 03 5月, 2022 2 次提交
    • T
      net: sysctl: introduce sysctl SYSCTL_THREE · 4c7f24f8
      Tonghao Zhang 提交于
      This patch introdues the SYSCTL_THREE.
      
      KUnit:
      [00:10:14] ================ sysctl_test (10 subtests) =================
      [00:10:14] [PASSED] sysctl_test_api_dointvec_null_tbl_data
      [00:10:14] [PASSED] sysctl_test_api_dointvec_table_maxlen_unset
      [00:10:14] [PASSED] sysctl_test_api_dointvec_table_len_is_zero
      [00:10:14] [PASSED] sysctl_test_api_dointvec_table_read_but_position_set
      [00:10:14] [PASSED] sysctl_test_dointvec_read_happy_single_positive
      [00:10:14] [PASSED] sysctl_test_dointvec_read_happy_single_negative
      [00:10:14] [PASSED] sysctl_test_dointvec_write_happy_single_positive
      [00:10:14] [PASSED] sysctl_test_dointvec_write_happy_single_negative
      [00:10:14] [PASSED] sysctl_test_api_dointvec_write_single_less_int_min
      [00:10:14] [PASSED] sysctl_test_api_dointvec_write_single_greater_int_max
      [00:10:14] =================== [PASSED] sysctl_test ===================
      
      ./run_kselftest.sh -c sysctl
      ...
      ok 1 selftests: sysctl: sysctl.sh
      
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Lorenz Bauer <lmb@cloudflare.com>
      Cc: Akhmat Karakotov <hmukos@yandex-team.ru>
      Signed-off-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
      Reviewed-by: NSimon Horman <horms@verge.net.au>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      4c7f24f8
    • T
      net: sysctl: use shared sysctl macro · bd8a5367
      Tonghao Zhang 提交于
      This patch replace two, four and long_one to SYSCTL_XXX.
      
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Lorenz Bauer <lmb@cloudflare.com>
      Cc: Akhmat Karakotov <hmukos@yandex-team.ru>
      Signed-off-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      bd8a5367
  3. 10 3月, 2022 1 次提交
    • E
      tcp: adjust TSO packet sizes based on min_rtt · 65466904
      Eric Dumazet 提交于
      Back when tcp_tso_autosize() and TCP pacing were introduced,
      our focus was really to reduce burst sizes for long distance
      flows.
      
      The simple heuristic of using sk_pacing_rate/1024 has worked
      well, but can lead to too small packets for hosts in the same
      rack/cluster, when thousands of flows compete for the bottleneck.
      
      Neal Cardwell had the idea of making the TSO burst size
      a function of both sk_pacing_rate and tcp_min_rtt()
      
      Indeed, for local flows, sending bigger bursts is better
      to reduce cpu costs, as occasional losses can be repaired
      quite fast.
      
      This patch is based on Neal Cardwell implementation
      done more than two years ago.
      bbr is adjusting max_pacing_rate based on measured bandwidth,
      while cubic would over estimate max_pacing_rate.
      
      /proc/sys/net/ipv4/tcp_tso_rtt_log can be used to tune or disable
      this new feature, in logarithmic steps.
      
      Tested:
      
      100Gbit NIC, two hosts in the same rack, 4K MTU.
      600 flows rate-limited to 20000000 bytes per second.
      
      Before patch: (TSO sizes would be limited to 20000000/1024/4096 -> 4 segments per TSO)
      
      ~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log
      ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
        96005
      
       Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
      
               65,945.29 msec task-clock                #    2.845 CPUs utilized
               1,314,632      context-switches          # 19935.279 M/sec
                   5,292      cpu-migrations            #   80.249 M/sec
                 940,641      page-faults               # 14264.023 M/sec
         201,117,030,926      cycles                    # 3049769.216 GHz                   (83.45%)
          17,699,435,405      stalled-cycles-frontend   #    8.80% frontend cycles idle     (83.48%)
         136,584,015,071      stalled-cycles-backend    #   67.91% backend cycles idle      (83.44%)
          53,809,530,436      instructions              #    0.27  insn per cycle
                                                        #    2.54  stalled cycles per insn  (83.36%)
           9,062,315,523      branches                  # 137422329.563 M/sec               (83.22%)
             153,008,621      branch-misses             #    1.69% of all branches          (83.32%)
      
            23.182970846 seconds time elapsed
      
      TcpInSegs                       15648792           0.0
      TcpOutSegs                      58659110           0.0  # Average of 3.7 4K segments per TSO packet
      TcpExtTCPDelivered              58654791           0.0
      TcpExtTCPDeliveredCE            19                 0.0
      
      After patch:
      
      ~# echo 9 >/proc/sys/net/ipv4/tcp_tso_rtt_log
      ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
        96046
      
       Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
      
               48,982.58 msec task-clock                #    2.104 CPUs utilized
                 186,014      context-switches          # 3797.599 M/sec
                   3,109      cpu-migrations            #   63.472 M/sec
                 941,180      page-faults               # 19214.814 M/sec
         153,459,763,868      cycles                    # 3132982.807 GHz                   (83.56%)
          12,069,861,356      stalled-cycles-frontend   #    7.87% frontend cycles idle     (83.32%)
         120,485,917,953      stalled-cycles-backend    #   78.51% backend cycles idle      (83.24%)
          36,803,672,106      instructions              #    0.24  insn per cycle
                                                        #    3.27  stalled cycles per insn  (83.18%)
           5,947,266,275      branches                  # 121417383.427 M/sec               (83.64%)
              87,984,616      branch-misses             #    1.48% of all branches          (83.43%)
      
            23.281200256 seconds time elapsed
      
      TcpInSegs                       1434706            0.0
      TcpOutSegs                      58883378           0.0  # Average of 41 4K segments per TSO packet
      TcpExtTCPDelivered              58878971           0.0
      TcpExtTCPDeliveredCE            9664               0.0
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NNeal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20220309015757.2532973-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      65466904
  4. 27 1月, 2022 1 次提交
    • E
      tcp: allocate tcp_death_row outside of struct netns_ipv4 · fbb82952
      Eric Dumazet 提交于
      I forgot tcp had per netns tracking of timewait sockets,
      and their sysctl to change the limit.
      
      After 0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()"),
      whole struct net can be freed before last tw socket is freed.
      
      We need to allocate a separate struct inet_timewait_death_row
      object per netns.
      
      tw_count becomes a refcount and gains associated debugging infrastructure.
      
      BUG: KASAN: use-after-free in inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
      Read of size 8 at addr ffff88807d5f9f40 by task kworker/1:7/3690
      
      CPU: 1 PID: 3690 Comm: kworker/1:7 Not tainted 5.16.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events pwq_unbound_release_workfn
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
       inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
       call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421
       expire_timers kernel/time/timer.c:1466 [inline]
       __run_timers.part.0+0x67c/0xa30 kernel/time/timer.c:1734
       __run_timers kernel/time/timer.c:1715 [inline]
       run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1747
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
       sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
       </IRQ>
       <TASK>
       asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
      RIP: 0010:lockdep_unregister_key+0x1c9/0x250 kernel/locking/lockdep.c:6328
      Code: 00 00 00 48 89 ee e8 46 fd ff ff 4c 89 f7 e8 5e c9 ff ff e8 09 cc ff ff 9c 58 f6 c4 02 75 26 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c 41 5d 41 5e 41 5f e9 19 4a 08 00 0f 0b 5b 5d 41 5c 41 5d
      RSP: 0018:ffffc90004077cb8 EFLAGS: 00000206
      RAX: 0000000000000046 RBX: ffff88807b61b498 RCX: 0000000000000001
      RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff888077027128 R08: 0000000000000001 R09: ffffffff8f1ea4fc
      R10: fffffbfff1ff93ee R11: 000000000000af1e R12: 0000000000000246
      R13: 0000000000000000 R14: ffffffff8ffc89b8 R15: ffffffff90157fb0
       wq_unregister_lockdep kernel/workqueue.c:3508 [inline]
       pwq_unbound_release_workfn+0x254/0x340 kernel/workqueue.c:3746
       process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
       worker_thread+0x657/0x1110 kernel/workqueue.c:2454
       kthread+0x2e9/0x3a0 kernel/kthread.c:377
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
       </TASK>
      
      Allocated by task 3635:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:437 [inline]
       __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:470
       kasan_slab_alloc include/linux/kasan.h:260 [inline]
       slab_post_alloc_hook mm/slab.h:732 [inline]
       slab_alloc_node mm/slub.c:3230 [inline]
       slab_alloc mm/slub.c:3238 [inline]
       kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3243
       kmem_cache_zalloc include/linux/slab.h:705 [inline]
       net_alloc net/core/net_namespace.c:407 [inline]
       copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
       create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
       unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
       ksys_unshare+0x445/0x920 kernel/fork.c:3048
       __do_sys_unshare kernel/fork.c:3119 [inline]
       __se_sys_unshare kernel/fork.c:3117 [inline]
       __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The buggy address belongs to the object at ffff88807d5f9a80
       which belongs to the cache net_namespace of size 6528
      The buggy address is located 1216 bytes inside of
       6528-byte region [ffff88807d5f9a80, ffff88807d5fb400)
      The buggy address belongs to the page:
      page:ffffea0001f57e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88807d5f9a80 pfn:0x7d5f8
      head:ffffea0001f57e00 order:3 compound_mapcount:0 compound_pincount:0
      memcg:ffff888070023001
      flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
      raw: 00fff00000010200 ffff888010dd4f48 ffffea0001404e08 ffff8880118fd000
      raw: ffff88807d5f9a80 0000000000040002 00000001ffffffff ffff888070023001
      page dumped because: kasan: bad access detected
      page_owner tracks the page as allocated
      page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 3634, ts 119694798460, free_ts 119693556950
       prep_new_page mm/page_alloc.c:2434 [inline]
       get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165
       __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389
       alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271
       alloc_slab_page mm/slub.c:1799 [inline]
       allocate_slab mm/slub.c:1944 [inline]
       new_slab+0x28a/0x3b0 mm/slub.c:2004
       ___slab_alloc+0x87c/0xe90 mm/slub.c:3018
       __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105
       slab_alloc_node mm/slub.c:3196 [inline]
       slab_alloc mm/slub.c:3238 [inline]
       kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3243
       kmem_cache_zalloc include/linux/slab.h:705 [inline]
       net_alloc net/core/net_namespace.c:407 [inline]
       copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
       create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
       unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
       ksys_unshare+0x445/0x920 kernel/fork.c:3048
       __do_sys_unshare kernel/fork.c:3119 [inline]
       __se_sys_unshare kernel/fork.c:3117 [inline]
       __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      page last free stack trace:
       reset_page_owner include/linux/page_owner.h:24 [inline]
       free_pages_prepare mm/page_alloc.c:1352 [inline]
       free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404
       free_unref_page_prepare mm/page_alloc.c:3325 [inline]
       free_unref_page+0x19/0x690 mm/page_alloc.c:3404
       skb_free_head net/core/skbuff.c:655 [inline]
       skb_release_data+0x65d/0x790 net/core/skbuff.c:677
       skb_release_all net/core/skbuff.c:742 [inline]
       __kfree_skb net/core/skbuff.c:756 [inline]
       consume_skb net/core/skbuff.c:914 [inline]
       consume_skb+0xc2/0x160 net/core/skbuff.c:908
       skb_free_datagram+0x1b/0x1f0 net/core/datagram.c:325
       netlink_recvmsg+0x636/0xea0 net/netlink/af_netlink.c:1998
       sock_recvmsg_nosec net/socket.c:948 [inline]
       sock_recvmsg net/socket.c:966 [inline]
       sock_recvmsg net/socket.c:962 [inline]
       ____sys_recvmsg+0x2c4/0x600 net/socket.c:2632
       ___sys_recvmsg+0x127/0x200 net/socket.c:2674
       __sys_recvmsg+0xe2/0x1a0 net/socket.c:2704
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Memory state around the buggy address:
       ffff88807d5f9e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88807d5f9e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff88807d5f9f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                 ^
       ffff88807d5f9f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88807d5fa000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Reported-by: NPaolo Abeni <pabeni@redhat.com>
      Tested-by: NPaolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220126180714.845362-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      fbb82952
  5. 23 9月, 2021 1 次提交
  6. 21 9月, 2021 1 次提交
  7. 16 6月, 2021 1 次提交
  8. 20 5月, 2021 1 次提交
  9. 19 5月, 2021 2 次提交
    • I
      ipv4: Add custom multipath hash policy · 4253b498
      Ido Schimmel 提交于
      Add a new multipath hash policy where the packet fields used for hash
      calculation are determined by user space via the
      fib_multipath_hash_fields sysctl that was introduced in the previous
      patch.
      
      The current set of available packet fields includes both outer and inner
      fields, which requires two invocations of the flow dissector. Avoid
      unnecessary dissection of the outer or inner flows by skipping
      dissection if none of the outer or inner fields are required.
      
      In accordance with the existing policies, when an skb is not available,
      packet fields are extracted from the provided flow key. In which case,
      only outer fields are considered.
      Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4253b498
    • I
      ipv4: Add a sysctl to control multipath hash fields · ce5c9c20
      Ido Schimmel 提交于
      A subsequent patch will add a new multipath hash policy where the packet
      fields used for multipath hash calculation are determined by user space.
      This patch adds a sysctl that allows user space to set these fields.
      
      The packet fields are represented using a bitmask and are common between
      IPv4 and IPv6 to allow user space to use the same numbering across both
      protocols. For example, to hash based on standard 5-tuple:
      
       # sysctl -w net.ipv4.fib_multipath_hash_fields=0x0037
       net.ipv4.fib_multipath_hash_fields = 0x0037
      
      The kernel rejects unknown fields, for example:
      
       # sysctl -w net.ipv4.fib_multipath_hash_fields=0x1000
       sysctl: setting key "net.ipv4.fib_multipath_hash_fields": Invalid argument
      
      More fields can be added in the future, if needed.
      Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ce5c9c20
  10. 14 4月, 2021 1 次提交
    • J
      net: Make tcp_allowed_congestion_control readonly in non-init netns · 97684f09
      Jonathon Reinhart 提交于
      Currently, tcp_allowed_congestion_control is global and writable;
      writing to it in any net namespace will leak into all other net
      namespaces.
      
      tcp_available_congestion_control and tcp_allowed_congestion_control are
      the only sysctls in ipv4_net_table (the per-netns sysctl table) with a
      NULL data pointer; their handlers (proc_tcp_available_congestion_control
      and proc_allowed_congestion_control) have no other way of referencing a
      struct net. Thus, they operate globally.
      
      Because ipv4_net_table does not use designated initializers, there is no
      easy way to fix up this one "bad" table entry. However, the data pointer
      updating logic shouldn't be applied to NULL pointers anyway, so we
      instead force these entries to be read-only.
      
      These sysctls used to exist in ipv4_table (init-net only), but they were
      moved to the per-net ipv4_net_table, presumably without realizing that
      tcp_allowed_congestion_control was writable and thus introduced a leak.
      
      Because the intent of that commit was only to know (i.e. read) "which
      congestion algorithms are available or allowed", this read-only solution
      should be sufficient.
      
      The logic added in recent commit
      31c4d2f1: ("net: Ensure net namespace isolation of sysctls")
      does not and cannot check for NULL data pointers, because
      other table entries (e.g. /proc/sys/net/netfilter/nf_log/) have
      .data=NULL but use other methods (.extra2) to access the struct net.
      
      Fixes: 9cb8e048 ("net/ipv4/sysctl: show tcp_{allowed, available}_congestion_control in non-initial netns")
      Signed-off-by: NJonathon Reinhart <jonathon.reinhart@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      97684f09
  11. 01 4月, 2021 5 次提交
  12. 31 3月, 2021 2 次提交
  13. 30 3月, 2021 1 次提交
  14. 26 3月, 2021 4 次提交
  15. 09 2月, 2021 1 次提交
  16. 03 2月, 2021 1 次提交
    • A
      net: ipv4: Emit notification when fib hardware flags are changed · 680aea08
      Amit Cohen 提交于
      After installing a route to the kernel, user space receives an
      acknowledgment, which means the route was installed in the kernel,
      but not necessarily in hardware.
      
      The asynchronous nature of route installation in hardware can lead to a
      routing daemon advertising a route before it was actually installed in
      hardware. This can result in packet loss or mis-routed packets until the
      route is installed in hardware.
      
      It is also possible for a route already installed in hardware to change
      its action and therefore its flags. For example, a host route that is
      trapping packets can be "promoted" to perform decapsulation following
      the installation of an IPinIP/VXLAN tunnel.
      
      Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
      are changed. The aim is to provide an indication to user-space
      (e.g., routing daemons) about the state of the route in hardware.
      
      Introduce a sysctl that controls this behavior.
      
      Keep the default value at 0 (i.e., do not emit notifications) for several
      reasons:
      - Multiple RTM_NEWROUTE notification per-route might confuse existing
        routing daemons.
      - Convergence reasons in routing daemons.
      - The extra notifications will negatively impact the insertion rate.
      - Not all users are interested in these notifications.
      Signed-off-by: NAmit Cohen <amcohen@nvidia.com>
      Acked-by: NRoopa Prabhu <roopa@nvidia.com>
      Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      680aea08
  17. 11 9月, 2020 1 次提交
    • W
      tcp: reflect tos value received in SYN to the socket · ac8f1710
      Wei Wang 提交于
      This commit adds a new TCP feature to reflect the tos value received in
      SYN, and send it out on the SYN-ACK, and eventually set the tos value of
      the established socket with this reflected tos value. This provides a
      way to set the traffic class/QoS level for all traffic in the same
      connection to be the same as the incoming SYN request. It could be
      useful in data centers to provide equivalent QoS according to the
      incoming request.
      This feature is guarded by /proc/sys/net/ipv4/tcp_reflect_tos, and is by
      default turned off.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ac8f1710
  18. 11 8月, 2020 1 次提交
    • J
      tcp: correct read of TFO keys on big endian systems · f19008e6
      Jason Baron 提交于
      When TFO keys are read back on big endian systems either via the global
      sysctl interface or via getsockopt() using TCP_FASTOPEN_KEY, the values
      don't match what was written.
      
      For example, on s390x:
      
      # echo "1-2-3-4" > /proc/sys/net/ipv4/tcp_fastopen_key
      # cat /proc/sys/net/ipv4/tcp_fastopen_key
      02000000-01000000-04000000-03000000
      
      Instead of:
      
      # cat /proc/sys/net/ipv4/tcp_fastopen_key
      00000001-00000002-00000003-00000004
      
      Fix this by converting to the correct endianness on read. This was
      reported by Colin Ian King when running the 'tcp_fastopen_backup_key' net
      selftest on s390x, which depends on the read value matching what was
      written. I've confirmed that the test now passes on big and little endian
      systems.
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Fixes: 438ac880 ("net: fastopen: robustness and endianness fixes for SipHash")
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Reported-and-tested-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f19008e6
  19. 01 5月, 2020 1 次提交
  20. 29 4月, 2020 1 次提交
    • R
      net: ipv4: add sysctl for nexthop api compatibility mode · 4f80116d
      Roopa Prabhu 提交于
      Current route nexthop API maintains user space compatibility
      with old route API by default. Dumps and netlink notifications
      support both new and old API format. In systems which have
      moved to the new API, this compatibility mode cancels some
      of the performance benefits provided by the new nexthop API.
      
      This patch adds new sysctl nexthop_compat_mode which is on
      by default but provides the ability to turn off compatibility
      mode allowing systems to run entirely with the new routing
      API. Old route API behaviour and support is not modified by this
      sysctl.
      
      Uses a single sysctl to cover both ipv4 and ipv6 following
      other sysctls. Covers dumps and delete notifications as
      suggested by David Ahern.
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f80116d
  21. 27 4月, 2020 1 次提交
  22. 13 3月, 2020 1 次提交
    • K
      tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted. · 4b01a967
      Kuniyuki Iwashima 提交于
      Commit aacd9289 ("tcp: bind() use stronger
      condition for bind_conflict") introduced a restriction to forbid to bind
      SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
      assign ports dispersedly so that we can connect to the same remote host.
      
      The change results in accelerating port depletion so that we fail to bind
      sockets to the same local port even if we want to connect to the different
      remote hosts.
      
      You can reproduce this issue by following instructions below.
      
        1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
        2. set SO_REUSEADDR to two sockets.
        3. bind two sockets to (localhost, 0) and the latter fails.
      
      Therefore, when ephemeral ports are exhausted, bind(0) should fallback to
      the legacy behaviour to enable the SO_REUSEADDR option and make it possible
      to connect to different remote (addr, port) tuples.
      
      This patch allows us to bind SO_REUSEADDR enabled sockets to the same
      (addr, port) only when net.ipv4.ip_autobind_reuse is set 1 and all
      ephemeral ports are exhausted. This also allows connect() and listen() to
      share ports in the following way and may break some applications. So the
      ip_autobind_reuse is 0 by default and disables the feature.
      
        1. setsockopt(sk1, SO_REUSEADDR)
        2. setsockopt(sk2, SO_REUSEADDR)
        3. bind(sk1, saddr, 0)
        4. bind(sk2, saddr, 0)
        5. connect(sk1, daddr)
        6. listen(sk2)
      
      If it is set 1, we can fully utilize the 4-tuples, but we should use
      IP_BIND_ADDRESS_NO_PORT for bind()+connect() as possible.
      
      The notable thing is that if all sockets bound to the same port have
      both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
      ephemeral port and also do listen().
      Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b01a967
  23. 20 2月, 2020 1 次提交
    • C
      net/ipv4/sysctl: show tcp_{allowed, available}_congestion_control in non-initial netns · 9cb8e048
      Christian Brauner 提交于
      It is currenty possible to switch the TCP congestion control algorithm
      in non-initial network namespaces:
      
      unshare -U --map-root --net --fork --pid --mount-proc
      echo "reno" > /proc/sys/net/ipv4/tcp_congestion_control
      
      works just fine. But currently non-initial network namespaces have no
      way of kowing which congestion algorithms are available or allowed other
      than through trial and error by writing the names of the algorithms into
      the aforementioned file.
      Since we already allow changing the congestion algorithm in non-initial
      network namespaces by exposing the tcp_congestion_control file there is
      no reason to not also expose the
      tcp_{allowed,available}_congestion_control files to non-initial network
      namespaces. After this change a container with a separate network
      namespace will show:
      
      root@f1:~# ls -al /proc/sys/net/ipv4/tcp_* | grep congestion
      -rw-r--r-- 1 root root 0 Feb 19 11:54 /proc/sys/net/ipv4/tcp_allowed_congestion_control
      -r--r--r-- 1 root root 0 Feb 19 11:54 /proc/sys/net/ipv4/tcp_available_congestion_control
      -rw-r--r-- 1 root root 0 Feb 19 11:54 /proc/sys/net/ipv4/tcp_congestion_control
      
      Link: https://github.com/lxc/lxc/issues/3267Reported-by: NHaw Loeung <haw.loeung@canonical.com>
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9cb8e048
  24. 10 12月, 2019 1 次提交
    • K
      net-tcp: Disable TCP ssthresh metrics cache by default · 65e6d901
      Kevin(Yudong) Yang 提交于
      This patch introduces a sysctl knob "net.ipv4.tcp_no_ssthresh_metrics_save"
      that disables TCP ssthresh metrics cache by default. Other parts of TCP
      metrics cache, e.g. rtt, cwnd, remain unchanged.
      
      As modern networks becoming more and more dynamic, TCP metrics cache
      today often causes more harm than benefits. For example, the same IP
      address is often shared by different subscribers behind NAT in residential
      networks. Even if the IP address is not shared by different users,
      caching the slow-start threshold of a previous short flow using loss-based
      congestion control (e.g. cubic) often causes the future longer flows of
      the same network path to exit slow-start prematurely with abysmal
      throughput.
      
      Caching ssthresh is very risky and can lead to terrible performance.
      Therefore it makes sense to make disabling ssthresh caching by
      default and opt-in for specific networks by the administrators.
      This practice also has worked well for several years of deployment with
      CUBIC congestion control at Google.
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NKevin(Yudong) Yang <yyd@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      65e6d901
  25. 21 11月, 2019 1 次提交