1. 20 Jul 2022 (3 commits)
  2. 19 Jul 2022 (1 commit)
    • amt: use workqueue for gateway side message handling · 30e22a6e
      Authored by Taehee Yoo
      There are synchronization issues (amt->status, amt->req_cnt, etc.)
      when the interface is in gateway mode, because the gateway message
      handlers are processed concurrently.
      This applies a work queue for processing these messages instead of
      expanding the locking context.
      
      So, the purposes of this patch are to fix the existing race conditions
      and to let the gateway validate its status more reliably.
      
      When the AMT gateway interface is created, it tries to establish a
      connection to the relay. The establishment step looks stateless, but
      it must be managed carefully.
      In order to handle messages in the gateway, it saves the current
      status (i.e. AMT_STATUS_XXX).
      This patch makes the gateway code run in a single thread.
      
      Now, all messages except multicast data are triggered (received or
      delay expired), and these messages are stored in the event
      queue (amt->events).
      Then, the single worker processes the stored messages asynchronously,
      one by one.
      The multicast data message type is still processed immediately.
      
      Now, amt->lock is only needed to access the event queue (amt->events)
      when an interface is in gateway mode.
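      
      Below is a minimal sketch of this queue-plus-single-worker pattern; the
      structure and field names (amt_dev_sketch, amt_event, events, event_wq)
      are illustrative assumptions, not the driver's exact definitions.
      
      struct amt_event {
              struct list_head list;
              int type;                       /* AMT_STATUS_XXX-style event id */
              struct sk_buff *skb;            /* message to process, may be NULL */
      };
      
      struct amt_dev_sketch {
              spinlock_t lock;                /* protects the event queue only */
              struct list_head events;
              struct work_struct event_wq;
              int status;                     /* AMT_STATUS_XXX */
      };
      
      static void amt_queue_event(struct amt_dev_sketch *amt, struct amt_event *ev)
      {
              spin_lock_bh(&amt->lock);
              list_add_tail(&ev->list, &amt->events);
              spin_unlock_bh(&amt->lock);
              schedule_work(&amt->event_wq);  /* wake the single worker */
      }
      
      static void amt_event_work(struct work_struct *work)
      {
              struct amt_dev_sketch *amt =
                      container_of(work, struct amt_dev_sketch, event_wq);
              struct amt_event *ev;
      
              for (;;) {
                      spin_lock_bh(&amt->lock);
                      ev = list_first_entry_or_null(&amt->events,
                                                    struct amt_event, list);
                      if (ev)
                              list_del(&ev->list);
                      spin_unlock_bh(&amt->lock);
                      if (!ev)
                              break;
                      /* handle one gateway message; amt->status is now only
                       * touched from this single context */
                      kfree(ev);
              }
      }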
      
      Fixes: cbc21dc1 ("amt: add data plane of amt interface")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  3. 18 Jul 2022 (3 commits)
  4. 16 Jul 2022 (1 commit)
    • tcp/udp: Make early_demux back namespacified. · 11052589
      Authored by Kuniyuki Iwashima
      Commit e21145a9 ("ipv4: namespacify ip_early_demux sysctl knob") made
      it possible to enable/disable early_demux on a per-netns basis.  Then, we
      introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for
      TCP/UDP in commit dddb64bc ("net: Add sysctl to toggle early demux for
      tcp and udp").  However, the .proc_handler() was wrong and actually
      prevented us from changing the behaviour in each netns.
      
      We can execute early_demux if net.ipv4.ip_early_demux is on and each proto
      .early_demux() handler is not NULL.  When we toggle (tcp|udp)_early_demux,
      the change itself is saved in each netns variable, but the .early_demux()
      handler is a global variable, so the handler is switched based on the
      init_net's sysctl variable.  Thus, netns (tcp|udp)_early_demux knobs have
      nothing to do with the logic.  Whether we CAN execute proto .early_demux()
      is always decided by init_net's sysctl knob, and whether we DO it or not is
      by each netns ip_early_demux knob.
      
      This patch namespacifies (tcp|udp)_early_demux again.  For now, the users
      of the .early_demux() handler are TCP and UDP only, and they are called
      directly to avoid retpoline.  So, we can remove the .early_demux() handler
      from inet6?_protos and need not dereference them in ip6?_rcv_finish_core().
      If another proto needs .early_demux(), we can restore it at that time.
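      
      A rough sketch of the resulting per-netns dispatch is shown below; it is
      a simplification based on this description, not the exact hunk from the
      patch, and the sysctl field names are assumptions.
      
      static void ip_rcv_early_demux_sketch(struct net *net, struct sk_buff *skb,
                                            const struct iphdr *iph)
      {
              if (!READ_ONCE(net->ipv4.sysctl_ip_early_demux))
                      return;
      
              switch (iph->protocol) {
              case IPPROTO_TCP:
                      /* direct call, no indirect .early_demux() pointer */
                      if (READ_ONCE(net->ipv4.sysctl_tcp_early_demux))
                              tcp_v4_early_demux(skb);
                      break;
              case IPPROTO_UDP:
                      if (READ_ONCE(net->ipv4.sysctl_udp_early_demux))
                              udp_v4_early_demux(skb);
                      break;
              }
      }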
      
      Fixes: dddb64bc ("net: Add sysctl to toggle early demux for tcp and udp")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  5. 15 Jul 2022 (7 commits)
  6. 13 Jul 2022 (1 commit)
  7. 09 Jul 2022 (1 commit)
  8. 08 Jul 2022 (1 commit)
  9. 06 Jul 2022 (1 commit)
  10. 29 Jun 2022 (1 commit)
  11. 28 Jun 2022 (1 commit)
  12. 23 Jun 2022 (1 commit)
  13. 17 Jun 2022 (1 commit)
  14. 09 Jun 2022 (1 commit)
    • ipv6: Fix signed integer overflow in __ip6_append_data · f93431c8
      Authored by Wang Yufen
      Resurrecting the UBSAN overflow checks produced the report below; fix it
      by changing the type of the variable [length] to size_t.
      
      UBSAN: signed-integer-overflow in net/ipv6/ip6_output.c:1489:19
      2147479552 + 8567 cannot be represented in type 'int'
      CPU: 0 PID: 253 Comm: err Not tainted 5.16.0+ #1
      Hardware name: linux,dummy-virt (DT)
      Call trace:
        dump_backtrace+0x214/0x230
        show_stack+0x30/0x78
        dump_stack_lvl+0xf8/0x118
        dump_stack+0x18/0x30
        ubsan_epilogue+0x18/0x60
        handle_overflow+0xd0/0xf0
        __ubsan_handle_add_overflow+0x34/0x44
        __ip6_append_data.isra.48+0x1598/0x1688
        ip6_append_data+0x128/0x260
        udpv6_sendmsg+0x680/0xdd0
        inet6_sendmsg+0x54/0x90
        sock_sendmsg+0x70/0x88
        ____sys_sendmsg+0xe8/0x368
        ___sys_sendmsg+0x98/0xe0
        __sys_sendmmsg+0xf4/0x3b8
        __arm64_sys_sendmmsg+0x34/0x48
        invoke_syscall+0x64/0x160
        el0_svc_common.constprop.4+0x124/0x300
        do_el0_svc+0x44/0xc8
        el0_svc+0x3c/0x1e8
        el0t_64_sync_handler+0x88/0xb0
        el0t_64_sync+0x16c/0x170
      
      Changes since v1:
      -Change the variable [length] type to unsigned, as Eric Dumazet suggested.
      Changes since v2:
      -Don't change exthdrlen type in ip6_make_skb, as Paolo Abeni suggested.
      Changes since v3:
      -Don't change ulen type in udpv6_sendmsg and l2tp_ip6_sendmsg, as
      Jakub Kicinski suggested.
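      
      As a standalone illustration of the arithmetic in the report above (not
      kernel code): adding 8567 to 2147479552 overflows a signed int, while a
      size_t accumulator, as the patch uses for [length], does not on 64-bit.
      
      #include <stdio.h>
      #include <stddef.h>
      
      int main(void)
      {
              size_t length = 2147479552;     /* value from the UBSAN report */
      
              /* With "int length" this addition would exceed INT_MAX
               * (2147483647) and be undefined behaviour. */
              length += 8567;
              printf("%zu\n", length);        /* 2147488119 */
              return 0;
      }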
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Wang Yufen <wangyufen@huawei.com>
      Link: https://lore.kernel.org/r/20220607120028.845916-1-wangyufen@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  15. 07 Jun 2022 (1 commit)
  16. 06 Jun 2022 (1 commit)
  17. 02 Jun 2022 (2 commits)
    • ax25: Fix ax25 session cleanup problems · 7d8a3a47
      Authored by Duoming Zhou
      There are session cleanup problems in ax25_release() and
      ax25_disconnect(). If we set up a session and then disconnect,
      the disconnected session is still left in the "LISTENING" state,
      as shown below.
      
      Active AX.25 sockets
      Dest       Source     Device  State        Vr/Vs    Send-Q  Recv-Q
      DL9SAU-4   DL9SAU-3   ???     LISTENING    000/000  0       0
      DL9SAU-3   DL9SAU-4   ???     LISTENING    000/000  0       0
      
      The first problem is caused by del_timer_sync() in ax25_release().
      The ax25 timers are needed for correct session cleanup. If we use
      ax25_release() to close ax25 sessions while ax25_dev is not NULL,
      the del_timer_sync() calls in ax25_release() will execute.
      As a result, the sessions cannot be cleaned up correctly,
      because the timers have been stopped.
      
      In order to solve this problem, this patch adds a device_up flag
      in ax25_dev in order to judge whether the device is up. If there
      are sessions to be cleaned up, the del_timer_sync() in
      ax25_release() will not execute. What's more, we add ax25_cb_del()
      in ax25_kill_by_device(), because the timers have been stopped
      and there are no functions that could delete ax25_cb if we do not
      call ax25_release(). Finally, we reorder the position of
      ax25_list_lock in ax25_cb_del() in order to synchronize among
      different functions that call ax25_cb_del().
      
      The second problem is caused by an improper check in ax25_disconnect().
      Incoming ax25 sessions whose ax25->sk is NULL stop the heartbeat
      timer, because the check "if (!ax25->sk || ..)" is satisfied.
      As a result, the session cannot be cleaned up properly.
      
      In order to solve this problem, this patch changes the improper
      check to "if(ax25->sk && ..)" in ax25_disconnect().
      
      What's more, ax25_disconnect() may be called twice, which is
      not necessary. For example, ax25_kill_by_device() calls
      ax25_disconnect() and sets ax25->state to AX25_STATE_0, but
      ax25_release() then calls ax25_disconnect() again.
      
      In order to solve this problem, this patch adds a check in
      ax25_release(). If the SOCK_DEAD flag of ax25->sk is set, the
      ax25_disconnect() call in ax25_release() is skipped, as illustrated
      by the sketch below.
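      
      A minimal sketch of that guard (the helper name and its placement in
      ax25_release() are simplified assumptions, not the verbatim patch):
      
      static void ax25_release_disconnect(struct sock *sk, ax25_cb *ax25)
      {
              /* ax25_kill_by_device() may already have disconnected the
               * session and marked the socket dead; do not run
               * ax25_disconnect() a second time. */
              if (!sock_flag(sk, SOCK_DEAD))
                      ax25_disconnect(ax25, 0);
      }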
      
      Fixes: 82e31755 ("ax25: Fix UAF bugs in ax25 timers")
      Fixes: 8a367e74 ("ax25: Fix segfault after sock connection timeout")
      Reported-and-tested-by: Thomas Osterried <thomas@osterried.de>
      Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
      Link: https://lore.kernel.org/r/20220530152158.108619-1-duoming@zju.edu.cn
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • netfilter: nf_tables: delete flowtable hooks via transaction list · b6d9014a
      Authored by Pablo Neira Ayuso
      Remove the inactive bool field in the nft_hook object that was
      introduced in abadb2f8 ("netfilter: nf_tables: delete devices from
      flowtable"). Move stale flowtable hooks to the transaction list instead.
      
      Deleting the same device twice does not result in ENOENT.
      
      Fixes: abadb2f8 ("netfilter: nf_tables: delete devices from flowtable")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  18. 01 Jun 2022 (2 commits)
    • bonding: guard ns_targets by CONFIG_IPV6 · c4caa500
      Authored by Hangbin Liu
      Guard ns_targets in struct bond_params by CONFIG_IPV6, which saves
      256 bytes if IPv6 is not configured. Also add this protection to the
      functions bond_is_ip6_target_ok() and bond_get_targets_ip6().
      
      Remove the IS_ENABLED() check for bond_opts[], as it would leave
      BOND_OPT_NS_TARGETS uninitialized if CONFIG_IPV6 is not enabled. Add
      a dummy bond_option_ns_ip6_targets_set() for this situation.
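      
      A rough sketch of the guard and the dummy setter (struct layout, return
      value and signatures are simplified assumptions, not the exact patch):
      
      struct bond_params_sketch {
              /* ... other parameters ... */
      #if IS_ENABLED(CONFIG_IPV6)
              struct in6_addr ns_targets[BOND_MAX_NS_TARGETS]; /* absent when IPv6 is off */
      #endif
      };
      
      #if !IS_ENABLED(CONFIG_IPV6)
      /* Dummy handler so bond_opts[] always has a valid .set callback. */
      static int bond_option_ns_ip6_targets_set(struct bonding *bond,
                                                 const struct bond_opt_value *newval)
      {
              return -EPERM;
      }
      #endif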
      
      Fixes: 4e24be01 ("bonding: add new parameter ns_targets")
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Acked-by: Jonathan Toppins <jtoppins@redhat.com>
      Link: https://lore.kernel.org/r/20220531063727.224043-1-liuhangbin@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: sched: add barrier to fix packet stuck problem for lockless qdisc · 2e8728c9
      Authored by Guoju Fang
      In qdisc_run_end(), the spin_unlock() only has store-release semantics,
      which guarantees that all earlier memory accesses are visible before it.
      But the subsequent test_bit() has no barrier semantics and so may be
      reordered ahead of the spin_unlock(). This store-load reordering may
      cause a packet stuck problem.
      
      The concurrent operations can be described as below,
               CPU 0                      |          CPU 1
         qdisc_run_end()                  |     qdisc_run_begin()
                .                         |           .
       ----> /* may be reordered here */  |           .
      |         .                         |           .
      |     spin_unlock()                 |         set_bit()
      |         .                         |         smp_mb__after_atomic()
       ---- test_bit()                    |         spin_trylock()
                .                         |          .
      
      Consider the following sequence of events:
          CPU 0 reorders test_bit() ahead and sees MISSED = 0
          CPU 1 calls set_bit()
          CPU 1 calls spin_trylock() and it fails
          CPU 0 executes spin_unlock()
      
      At the end of the sequence, CPU 0 has called spin_unlock() and done
      nothing, because it saw MISSED = 0. The skb on CPU 1 has been enqueued
      but nobody takes it, until the next CPU pushing to the qdisc (if
      ever ...) notices and dequeues it.
      
      This patch fixes this by adding one explicit barrier. As the
      spin_unlock() / test_bit() ordering is a store-load ordering, a full
      memory barrier smp_mb() is needed here.
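      
      A simplified sketch of qdisc_run_end() with the barrier (lockless-qdisc
      path only; not the verbatim hunk from sch_generic.h):
      
      static inline void qdisc_run_end_sketch(struct Qdisc *qdisc)
      {
              spin_unlock(&qdisc->seqlock);
      
              /* spin_unlock() is only a release: it orders earlier accesses
               * before it, but does not stop the later test_bit() load from
               * being hoisted above it.  smp_mb() gives the required
               * store-load ordering. */
              smp_mb();
      
              if (unlikely(test_bit(__QDISC_STATE_MISSED, &qdisc->state)))
                      __netif_schedule(qdisc);
      }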
      
      Fixes: a90c57f2 ("net: sched: fix packet stuck problem for lockless qdisc")
      Signed-off-by: Guoju Fang <gjfang@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20220528101628.120193-1-gjfang@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  19. 27 May 2022 (2 commits)
    • netfilter: conntrack: re-fetch conntrack after insertion · 56b14ece
      Authored by Florian Westphal
      In case the conntrack is clashing, insertion can free skb->_nfct and
      set skb->_nfct to the already-confirmed entry.
      
      This wasn't found before because the conntrack entry and the extension
      space used to be freed after an RCU grace period, plus the race needs
      events enabled to trigger.
      
      Reported-by: <syzbot+793a590957d9c1b96620@syzkaller.appspotmail.com>
      Fixes: 71d8c47f ("netfilter: conntrack: introduce clash resolution on insertion race")
      Fixes: 2ad9d774 ("netfilter: conntrack: free extension area immediately")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • net: sched: fixed barrier to prevent skbuff sticking in qdisc backlog · a54ce370
      Authored by Vincent Ray
      In qdisc_run_begin(), the smp_mb__before_atomic() used before test_bit()
      does not provide any ordering guarantee, as test_bit() is not an atomic
      operation. This, combined with the fact that the spin_trylock() call at
      the beginning of qdisc_run_begin() does not guarantee acquire
      semantics if it does not grab the lock, makes it possible for the
      following statement:
      
      if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
      
      to be executed before an enqueue operation called before
      qdisc_run_begin().
      
      As a result the following race can happen:
      
                 CPU 1                             CPU 2
      
            qdisc_run_begin()               qdisc_run_begin() /* true */
              set(MISSED)                            .
            /* returns false */                      .
                .                            /* sees MISSED = 1 */
                .                            /* so qdisc not empty */
                .                            __qdisc_run()
                .                                    .
                .                              pfifo_fast_dequeue()
       ----> /* may be done here */                  .
      |         .                                clear(MISSED)
      |         .                                    .
      |         .                                smp_mb__after_atomic();
      |         .                                    .
      |         .                                /* recheck the queue */
      |         .                                /* nothing => exit   */
      |   enqueue(skb1)
      |         .
      |   qdisc_run_begin()
      |         .
      |     spin_trylock() /* fail */
      |         .
      |     smp_mb__before_atomic() /* not enough */
      |         .
       ---- if (test_bit(MISSED))
              return false;   /* exit */
      
      In the above scenario, CPU 1 and CPU 2 both try to grab the
      qdisc->seqlock at the same time. Only CPU 2 succeeds and enters the
      bypass code path, where it emits its skb then calls __qdisc_run().
      
      CPU 1 fails, sets MISSED and goes down the traditional enqueue() +
      dequeue() code path. But when executing qdisc_run_begin() for the
      second time, after enqueuing its skbuff, it sees the MISSED bit still
      set (by itself) and consequently chooses to exit early without setting
      it again nor trying to grab the spinlock again.
      
      Meanwhile CPU 2 has seen MISSED = 1, cleared it, checked the queue
      and found it empty, so it returned.
      
      At the end of the sequence, we end up with skb1 enqueued in the
      backlog, both CPUs out of __dev_xmit_skb(), the MISSED bit not set,
      and no __netif_schedule() call made. skb1 will now linger in the
      qdisc until somebody later performs a full __qdisc_run(). Combined
      with the bypass capability of the qdisc, and the ability of the TCP
      layer to avoid resending packets which it knows are still in the
      qdisc, this can lead to serious traffic "holes" in a TCP connection.
      
      We fix this by replacing the smp_mb__before_atomic() / test_bit() /
      set_bit() / smp_mb__after_atomic() sequence inside qdisc_run_begin()
      by a single test_and_set_bit() call, which is more concise and
      enforces the needed memory barriers.
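      
      A simplified sketch of the resulting slow path in qdisc_run_begin()
      (lockless-qdisc case only; not the verbatim patch):
      
      static inline bool qdisc_run_begin_sketch(struct Qdisc *qdisc)
      {
              if (spin_trylock(&qdisc->seqlock))
                      return true;
      
              /* test_and_set_bit() is a full barrier and always sets MISSED,
               * replacing the racy smp_mb__before_atomic() / test_bit() /
               * set_bit() / smp_mb__after_atomic() sequence. */
              if (test_and_set_bit(__QDISC_STATE_MISSED, &qdisc->state))
                      return false;
      
              /* MISSED was clear: retry the lock once before giving up, in
               * case the other CPU released it in the meantime. */
              return spin_trylock(&qdisc->seqlock);
      }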
      
      Fixes: 89837eb4 ("net: sched: add barrier to ensure correct ordering for lockless qdisc")
      Signed-off-by: Vincent Ray <vray@kalrayinc.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220526001746.2437669-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  20. 26 May 2022 (1 commit)
  21. 23 May 2022 (1 commit)
  22. 21 May 2022 (2 commits)
  23. 20 May 2022 (1 commit)
  24. 19 May 2022 (1 commit)
    • tls: Add opt-in zerocopy mode of sendfile() · c1318b39
      Authored by Boris Pismenny
      TLS device offload copies sendfile data to a bounce buffer before
      transmitting. This allows maintaining a valid MAC on TLS records when
      the file contents change and part of a TLS record has to be
      retransmitted at the TCP level.
      
      In many common use cases (like serving static files over HTTPS) the file
      contents are not changed on the fly. In many use cases breaking the
      connection is totally acceptable if the file is changed during
      transmission, because it would be received corrupted in any case.
      
      This commit optimizes performance for such use cases by providing a
      new optional mode of TLS sendfile(), in which the extra copy is
      skipped. Removing this copy improves performance significantly, as
      TLS and TCP sendfile perform the same operations, and the only
      overhead is TLS header/trailer insertion.
      
      The new mode can only be enabled with the new socket option named
      TLS_TX_ZEROCOPY_SENDFILE on a per-socket basis. It preserves backwards
      compatibility with existing applications that rely on the copying
      behavior.
      
      The new mode is safe, meaning that unsolicited modifications of the file
      being sent can't break integrity of the kernel. The worst thing that can
      happen is sending a corrupted TLS record, which is in any case not
      forbidden when using regular TCP sockets.
      
      Sockets other than TLS device offload are not affected by the new socket
      option. The actual status of zerocopy sendfile can be queried with
      sock_diag.
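      
      A userspace sketch of opting in on a kTLS socket is shown below; it
      assumes a socket already upgraded to TLS with the "tls" ULP and TLS_TX
      configured, and the fallback #defines are only for headers that predate
      this option (values should be verified against your linux/tls.h):
      
      #include <sys/socket.h>
      #include <linux/tls.h>
      
      #ifndef SOL_TLS
      #define SOL_TLS 282
      #endif
      #ifndef TLS_TX_ZEROCOPY_SENDFILE
      #define TLS_TX_ZEROCOPY_SENDFILE 3
      #endif
      
      static int enable_zerocopy_sendfile(int fd)
      {
              int one = 1;
      
              /* Opt in per socket; applications that do not set this keep
               * the existing copying behaviour. */
              return setsockopt(fd, SOL_TLS, TLS_TX_ZEROCOPY_SENDFILE,
                                &one, sizeof(one));
      }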
      
      Performance numbers in a single-core test with 24 HTTPS streams on
      nginx, under 100% CPU load:
      
      * non-zerocopy: 33.6 Gbit/s
      * zerocopy: 79.92 Gbit/s
      
      CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
      Signed-off-by: Boris Pismenny <borisp@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20220518092731.1243494-1-maximmi@nvidia.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  25. 17 May 2022 (1 commit)
  26. 16 May 2022 (1 commit)
    • netfilter: nf_conncount: reduce unnecessary GC · d2659299
      Authored by William Tu
      Currently nf_conncount can trigger garbage collection (GC)
      in multiple places. Each GC pass takes a spin_lock_bh
      to traverse the nf_conncount_list. We found that when testing
      port scanning with two parallel nmap runs, because the number of
      connections increases fast, nf_conncount_count() and its
      subsequent call to __nf_conncount_add() take too much time,
      causing several CPU lockups. This happens when the user sets the
      conntrack limit to 20,000 or more, because the larger the limit,
      the longer the list that GC has to traverse.
      
      The patch mitigates the performance issue by avoiding unnecessary
      GC with a timestamp. Whenever nf_conncount has done a GC,
      a timestamp is updated, and before the next GC is triggered,
      we make sure at least one jiffy has passed.
      By doing this we can greatly reduce the CPU cycles and
      avoid the softirq lockup.
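      
      A minimal sketch of the jiffies-based throttle (the field and helper
      names are illustrative assumptions, not the exact patch):
      
      static bool nf_conncount_gc_is_due(const struct nf_conncount_list *list)
      {
              /* Run GC for this list at most once per jiffy. */
              return (u32)jiffies != list->last_gc;
      }
      
      static void nf_conncount_gc_stamp(struct nf_conncount_list *list)
      {
              /* Record when the last GC pass completed. */
              list->last_gc = (u32)jiffies;
      }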
      
      To reproduce it in OVS,
      $ ovs-appctl dpctl/ct-set-limits zone=1,limit=20000
      $ ovs-appctl dpctl/ct-get-limits
      
      On another machine, run two nmap scans
      $ nmap -p1- <IP>
      $ nmap -p1- <IP>
      Signed-off-by: William Tu <u9012063@gmail.com>
      Co-authored-by: Yifeng Sun <pkusunyifeng@gmail.com>
      Reported-by: Greg Rose <gvrose8192@gmail.com>
      Suggested-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>