1. 05 10月, 2019 4 次提交
  2. 03 10月, 2019 1 次提交
    • V
      net: dsa: Remove unused __DSA_SKB_CB macro · 37048e94
      Vladimir Oltean 提交于
      The struct __dsa_skb_cb is supposed to span the entire 48-byte skb
      control block, while the struct dsa_skb_cb only the portion of it which
      is used by the DSA core (the rest is available as private data to
      drivers).
      
      The DSA_SKB_CB and __DSA_SKB_CB helpers are supposed to help retrieve
      this pointer based on a skb, but it turns out there is nobody directly
      interested in the struct __dsa_skb_cb in the kernel. So remove it.
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      37048e94
  3. 02 10月, 2019 3 次提交
  4. 27 9月, 2019 3 次提交
    • E
      tcp: honor SO_PRIORITY in TIME_WAIT state · f6c0f5d2
      Eric Dumazet 提交于
      ctl packets sent on behalf of TIME_WAIT sockets currently
      have a zero skb->priority, which can cause various problems.
      
      In this patch we :
      
      - add a tw_priority field in struct inet_timewait_sock.
      
      - populate it from sk->sk_priority when a TIME_WAIT is created.
      
      - For IPv4, change ip_send_unicast_reply() and its two
        callers to propagate tw_priority correctly.
        ip_send_unicast_reply() no longer changes sk->sk_priority.
      
      - For IPv6, make sure TIME_WAIT sockets pass their tw_priority
        field to tcp_v6_send_response() and tcp_v6_send_ack().
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6c0f5d2
    • E
      ipv6: add priority parameter to ip6_xmit() · 4f6570d7
      Eric Dumazet 提交于
      Currently, ip6_xmit() sets skb->priority based on sk->sk_priority
      
      This is not desirable for TCP since TCP shares the same ctl socket
      for a given netns. We want to be able to send RST or ACK packets
      with a non zero skb->priority.
      
      This patch has no functional change.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f6570d7
    • E
      sch_netem: fix rcu splat in netem_enqueue() · 159d2c7d
      Eric Dumazet 提交于
      qdisc_root() use from netem_enqueue() triggers a lockdep warning.
      
      __dev_queue_xmit() uses rcu_read_lock_bh() which is
      not equivalent to rcu_read_lock() + local_bh_disable_bh as far
      as lockdep is concerned.
      
      WARNING: suspicious RCU usage
      5.3.0-rc7+ #0 Not tainted
      -----------------------------
      include/net/sch_generic.h:492 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      3 locks held by syz-executor427/8855:
       #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: lwtunnel_xmit_redirect include/net/lwtunnel.h:92 [inline]
       #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: ip_finish_output2+0x2dc/0x2570 net/ipv4/ip_output.c:214
       #1: 00000000b5525c01 (rcu_read_lock_bh){....}, at: __dev_queue_xmit+0x20a/0x3650 net/core/dev.c:3804
       #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: spin_lock include/linux/spinlock.h:338 [inline]
       #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_xmit_skb net/core/dev.c:3502 [inline]
       #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_queue_xmit+0x14b8/0x3650 net/core/dev.c:3838
      
      stack backtrace:
      CPU: 0 PID: 8855 Comm: syz-executor427 Not tainted 5.3.0-rc7+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5357
       qdisc_root include/net/sch_generic.h:492 [inline]
       netem_enqueue+0x1cfb/0x2d80 net/sched/sch_netem.c:479
       __dev_xmit_skb net/core/dev.c:3527 [inline]
       __dev_queue_xmit+0x15d2/0x3650 net/core/dev.c:3838
       dev_queue_xmit+0x18/0x20 net/core/dev.c:3902
       neigh_hh_output include/net/neighbour.h:500 [inline]
       neigh_output include/net/neighbour.h:509 [inline]
       ip_finish_output2+0x1726/0x2570 net/ipv4/ip_output.c:228
       __ip_finish_output net/ipv4/ip_output.c:308 [inline]
       __ip_finish_output+0x5fc/0xb90 net/ipv4/ip_output.c:290
       ip_finish_output+0x38/0x1f0 net/ipv4/ip_output.c:318
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip_mc_output+0x292/0xf40 net/ipv4/ip_output.c:417
       dst_output include/net/dst.h:436 [inline]
       ip_local_out+0xbb/0x190 net/ipv4/ip_output.c:125
       ip_send_skb+0x42/0xf0 net/ipv4/ip_output.c:1555
       udp_send_skb.isra.0+0x6b2/0x1160 net/ipv4/udp.c:887
       udp_sendmsg+0x1e96/0x2820 net/ipv4/udp.c:1174
       inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:657
       ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311
       __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413
       __do_sys_sendmmsg net/socket.c:2442 [inline]
       __se_sys_sendmmsg net/socket.c:2439 [inline]
       __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439
       do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      159d2c7d
  5. 25 9月, 2019 1 次提交
  6. 21 9月, 2019 1 次提交
  7. 20 9月, 2019 1 次提交
  8. 17 9月, 2019 2 次提交
    • V
      net: dsa: Pass ndo_setup_tc slave callback to drivers · 47d23af2
      Vladimir Oltean 提交于
      DSA currently handles shared block filters (for the classifier-action
      qdisc) in the core due to what I believe are simply pragmatic reasons -
      hiding the complexity from drivers and offerring a simple API for port
      mirroring.
      
      Extend the dsa_slave_setup_tc function by passing all other qdisc
      offloads to the driver layer, where the driver may choose what it
      implements and how. DSA is simply a pass-through in this case.
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Acked-by: NKurt Kanzenbach <kurt@linutronix.de>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Acked-by: NIlias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      47d23af2
    • V
      taprio: Add support for hardware offloading · 9c66d156
      Vinicius Costa Gomes 提交于
      This allows taprio to offload the schedule enforcement to capable
      network cards, resulting in more precise windows and less CPU usage.
      
      The gate mask acts on traffic classes (groups of queues of same
      priority), as specified in IEEE 802.1Q-2018, and following the existing
      taprio and mqprio semantics.
      It is up to the driver to perform conversion between tc and individual
      netdev queues if for some reason it needs to make that distinction.
      
      Full offload is requested from the network interface by specifying
      "flags 2" in the tc qdisc creation command, which in turn corresponds to
      the TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD bit.
      
      The important detail here is the clockid which is implicitly /dev/ptpN
      for full offload, and hence not configurable.
      
      A reference counting API is added to support the use case where Ethernet
      drivers need to keep the taprio offload structure locally (i.e. they are
      a multi-port switch driver, and configuring a port depends on the
      settings of other ports as well). The refcount_t variable is kept in a
      private structure (__tc_taprio_qopt_offload) and not exposed to drivers.
      
      In the future, the private structure might also be expanded with a
      backpointer to taprio_sched *q, to implement the notification system
      described in the patch (of when admin became oper, or an error occurred,
      etc, so the offload can be monitored with 'tc qdisc show').
      Signed-off-by: NVinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: NVoon Weifeng <weifeng.voon@intel.com>
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9c66d156
  9. 16 9月, 2019 5 次提交
    • V
      net: sched: use get_dev() action API in flow_action infra · 470d5060
      Vlad Buslov 提交于
      When filling in hardware intermediate representation tc_setup_flow_action()
      directly obtains, checks and takes reference to dev used by mirred action,
      instead of using act->ops->get_dev() API created specifically for this
      purpose. In order to remove code duplication, refactor flow_action infra to
      use action API when obtaining mirred action target dev. Extend get_dev()
      with additional argument that is used to provide dev destructor to the
      user.
      
      Fixes: 5a6ff4b1 ("net: sched: take reference to action dev before calling offloads")
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      470d5060
    • V
      net: sched: take reference to psample group in flow_action infra · 4a5da47d
      Vlad Buslov 提交于
      With recent patch set that removed rtnl lock dependency from cls hardware
      offload API rtnl lock is only taken when reading action data and can be
      released after action-specific data is parsed into intermediate
      representation. However, sample action psample group is passed by pointer
      without obtaining reference to it first, which makes it possible to
      concurrently overwrite the action and deallocate object pointed by
      psample_group pointer after rtnl lock is released but before driver
      finished using the pointer.
      
      To prevent such race condition, obtain reference to psample group while it
      is used by flow_action infra. Extend psample API with function
      psample_group_take() that increments psample group reference counter.
      Extend struct tc_action_ops with new get_psample_group() API. Implement the
      API for action sample using psample_group_take() and already existing
      psample_group_put() as a destructor. Use it in tc_setup_flow_action() to
      take reference to psample group pointed to by entry->sample.psample_group
      and release it in tc_cleanup_flow_action().
      
      Disable bh when taking psample_groups_lock. The lock is now taken while
      holding action tcf_lock that is used by data path and requires bh to be
      disabled, so doing the same for psample_groups_lock is necessary to
      preserve SOFTIRQ-irq-safety.
      
      Fixes: 918190f5 ("net: sched: flower: don't take rtnl lock for cls hw offloads API")
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4a5da47d
    • V
      net: sched: extend flow_action_entry with destructor · 1158958a
      Vlad Buslov 提交于
      Generalize flow_action_entry cleanup by extending the structure with
      pointer to destructor function. Set the destructor in
      tc_setup_flow_action(). Refactor tc_cleanup_flow_action() to call
      entry->destructor() instead of using switch that dispatches by entry->id
      and manually executes cleanup.
      
      This refactoring is necessary for following patches in this series that
      require destructor to use tc_action->ops callbacks that can't be easily
      obtained in tc_cleanup_flow_action().
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1158958a
    • W
      udp: correct reuseport selection with connected sockets · acdcecc6
      Willem de Bruijn 提交于
      UDP reuseport groups can hold a mix unconnected and connected sockets.
      Ensure that connections only receive all traffic to their 4-tuple.
      
      Fast reuseport returns on the first reuseport match on the assumption
      that all matches are equal. Only if connections are present, return to
      the previous behavior of scoring all sockets.
      
      Record if connections are present and if so (1) treat such connected
      sockets as an independent match from the group, (2) only return
      2-tuple matches from reuseport and (3) do not return on the first
      2-tuple reuseport match to allow for a higher scoring match later.
      
      New field has_conns is set without locks. No other fields in the
      bitmap are modified at runtime and the field is only ever set
      unconditionally, so an RMW cannot miss a change.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Link: http://lkml.kernel.org/r/CA+FuTSfRP09aJNYRt04SS6qj22ViiOEWaWmLAwX0psk8-PGNxw@mail.gmail.comSigned-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      acdcecc6
    • P
      net/sched: fix race between deactivation and dequeue for NOLOCK qdisc · d518d2ed
      Paolo Abeni 提交于
      The test implemented by some_qdisc_is_busy() is somewhat loosy for
      NOLOCK qdisc, as we may hit the following scenario:
      
      CPU1						CPU2
      // in net_tx_action()
      clear_bit(__QDISC_STATE_SCHED...);
      						// in some_qdisc_is_busy()
      						val = (qdisc_is_running(q) ||
      						       test_bit(__QDISC_STATE_SCHED,
      								&q->state));
      						// here val is 0 but...
      qdisc_run(q)
      // ... CPU1 is going to run the qdisc next
      
      As a conseguence qdisc_run() in net_tx_action() can race with qdisc_reset()
      in dev_qdisc_reset(). Such race is not possible for !NOLOCK qdisc as
      both the above bit operations are under the root qdisc lock().
      
      After commit 021a17ed ("pfifo_fast: drop unneeded additional lock on dequeue")
      the race can cause use after free and/or null ptr dereference, but the root
      cause is likely older.
      
      This patch addresses the issue explicitly checking for deactivation under
      the seqlock for NOLOCK qdisc, so that the qdisc_run() in the critical
      scenario becomes a no-op.
      
      Note that the enqueue() op can still execute concurrently with dev_qdisc_reset(),
      but that is safe due to the skb_array() locking, and we can't avoid that
      for NOLOCK qdiscs.
      
      Fixes: 021a17ed ("pfifo_fast: drop unneeded additional lock on dequeue")
      Reported-by: NLi Shuang <shuali@redhat.com>
      Reported-and-tested-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d518d2ed
  10. 14 9月, 2019 3 次提交
  11. 13 9月, 2019 16 次提交