1. 21 Mar, 2019 3 commits
    • net: remove 'fallback' argument from dev->ndo_select_queue() · a350ecce
      Paolo Abeni committed
      After the previous patch, all the callers of ndo_select_queue()
      provide as a 'fallback' argument netdev_pick_tx.
      The only exceptions are nested calls to ndo_select_queue(),
      which pass down the 'fallback' available in the current scope
      - still netdev_pick_tx.
      
      We can drop such argument and replace fallback() invocation with
      netdev_pick_tx(). This avoids an indirect call per xmit packet
      in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen)
      with device drivers implementing such ndo. It also cleans up the code
      a bit.
      
      Tested with ixgbe and CONFIG_FCOE=m
      
      With pktgen using queue xmit:
      threads   vanilla (kpps)   patched (kpps)
      1         2334             2428
      2         4166             4278
      4         7895             8100
      
       v1 -> v2:
       - rebased after helper's name change
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a350ecce
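The effect of dropping the argument can be pictured with a small userspace model. This is only a sketch: `struct fake_dev`, `struct fake_skb`, `fake_pick_tx()` and both driver hooks are hypothetical stand-ins rather than kernel APIs, and the modulo-based queue choice is an assumption made purely so the example has something to compute.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct net_device and struct sk_buff. */
struct fake_dev { uint16_t real_num_tx_queues; };
struct fake_skb { uint32_t hash; int is_control; };

/* Stand-in for netdev_pick_tx(): the common core selection (assumed logic). */
static uint16_t fake_pick_tx(const struct fake_dev *dev,
			     const struct fake_skb *skb)
{
	assert(dev->real_num_tx_queues > 0);
	return (uint16_t)(skb->hash % dev->real_num_tx_queues);
}

/* Old shape: the caller passes the fallback as a function pointer. */
typedef uint16_t (*fallback_fn)(const struct fake_dev *,
				const struct fake_skb *);

static uint16_t drv_select_queue_old(const struct fake_dev *dev,
				     const struct fake_skb *skb,
				     fallback_fn fallback)
{
	if (skb->is_control)
		return 0;		/* driver-specific choice */
	return fallback(dev, skb);	/* indirect call on every packet */
}

/* New shape: no fallback argument, the common helper is called directly. */
static uint16_t drv_select_queue_new(const struct fake_dev *dev,
				     const struct fake_skb *skb)
{
	if (skb->is_control)
		return 0;		/* driver-specific choice */
	return fake_pick_tx(dev, skb);	/* direct call */
}
```

Both variants pick the same queue; the only difference in this model is that the second one no longer goes through a function pointer.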
    • packet: rework packet_pick_tx_queue() to use common code selection · b71b5837
      Paolo Abeni committed
      Currently packet_pick_tx_queue() is the only caller of
      ndo_select_queue() using a fallback argument other than
      netdev_pick_tx.
      
      Leveraging rx queue, we can obtain a similar queue selection
      behavior using core helpers. After this change, ndo_select_queue()
      is always invoked with netdev_pick_tx() as fallback.
      We can change ndo_select_queue() signature in a followup patch,
      dropping an indirect call per transmitted packet in some scenarios
      (e.g. TCP syn and XDP generic xmit)
      
      This slightly changes how AF_PACKET queue selection happens when
      PACKET_QDISC_BYPASS is set. It is now more similar to plain dev_queue_xmit(),
      taking into account both XPS and TC mapping.
      
       v1  -> v2:
        - rebased after helper name change
       RFC -> v1:
        - initialize sender_cpu to the expected value
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b71b5837
    • net: dev: rename queue selection helpers. · 4bd97d51
      Paolo Abeni committed
      With the following patches, we are going to use __netdev_pick_tx() in
      many modules. Rename it to netdev_pick_tx(), to make it clear that it is
      a public API.
      
      Also rename the existing netdev_pick_tx() to netdev_core_pick_tx(),
      to avoid name clashes.
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Suggested-by: David Miller <davem@davemloft.net>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4bd97d51
  2. 28 Feb, 2019 2 commits
  3. 27 Feb, 2019 1 commit
  4. 25 Feb, 2019 1 commit
  5. 23 Feb, 2019 1 commit
  6. 15 Feb, 2019 1 commit
  7. 07 Feb, 2019 1 commit
    • net: Introduce ndo_get_port_parent_id() · d6abc596
      Florian Fainelli committed
      In preparation for getting rid of switchdev_ops, create a dedicated NDO
      operation for getting the port's parent identifier. There are
      essentially two classes of drivers that need to implement getting the
      port's parent ID which are VF/PF drivers with a built-in switch, and
      pure switchdev drivers such as mlxsw, ocelot, dsa etc.
      
      We introduce a helper function: dev_get_port_parent_id() which supports
      recursion into the lower devices to obtain the first port's parent ID.
      
      Convert the bridge, core and ipv4 multicast routing code to check for
      such ndo_get_port_parent_id() and call the helper function when valid
      before falling back to switchdev_port_attr_get(). This will allow us to
      convert all relevant drivers in one go instead of having to implement
      both switchdev_port_attr_get() and ndo_get_port_parent_id() operations,
      then get rid of switchdev_port_attr_get().
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d6abc596
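The recursion into lower devices described above can be sketched in userspace as follows. All names here (`struct fake_dev`, `fake_get_port_parent_id()`, the integer ID) are hypothetical, and the policy of returning the first ID found is a simplification based only on the wording of the message; the real helper may treat errors and mismatching IDs differently.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_MAX_LOWER 4

/* Hypothetical stand-in for a net_device with a list of lower devices. */
struct fake_dev {
	int has_parent_id;	/* models a driver implementing the NDO */
	int parent_id;		/* simplified: a plain integer ID */
	size_t n_lower;
	const struct fake_dev *lower[FAKE_MAX_LOWER];
};

/* Returns 0 and fills *id on success, -1 if no parent ID is available. */
static int fake_get_port_parent_id(const struct fake_dev *dev, int *id,
				   int recurse)
{
	size_t i;

	if (dev->has_parent_id) {
		*id = dev->parent_id;
		return 0;
	}
	if (!recurse)
		return -1;

	/* Walk the lower devices and report the first ID found. */
	for (i = 0; i < dev->n_lower; i++)
		if (fake_get_port_parent_id(dev->lower[i], id, recurse) == 0)
			return 0;
	return -1;
}
```

A stacked device such as a bond has no parent ID of its own in this model, so a caller only gets an ID for it when recursion is requested.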
  8. 04 Feb, 2019 1 commit
  9. 31 Jan, 2019 1 commit
    • ipvlan, l3mdev: fix broken l3s mode wrt local routes · d5256083
      Daniel Borkmann committed
      While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
      I ran into the issue that while l3 mode is working fine, l3s mode
      does not have any connectivity to kube-apiserver and hence all pods
      end up in Error state as well. The ipvlan master device sits on
      top of a bond device and hostns traffic to kube-apiserver (also running
      in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
      where the latter is the address of the bond0. While in l3 mode, a
      curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
      works fine from hostns, neither of them do in case of l3s. In the
      latter only a curl to https://127.0.0.1:37573 appeared to work where
      for local addresses of bond0 I saw kernel suddenly starting to emit
      ARP requests to query HW address of bond0 which remained unanswered
      and neighbor entries in INCOMPLETE state. These ARP requests only
      happen while in l3s.
      
      Debugging this further, I found the issue is that l3s mode is piggy-
      backing on l3 master device, and in this case local routes are using
      l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
      f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev
      if relevant") and 5f02ce24 ("net: l3mdev: Allow the l3mdev to be
      a loopback"). I found that reverting them back into using the
      net->loopback_dev fixed ipvlan l3s connectivity and got everything
      working for the CNI.
      
      Now judging from 4fbae7d8 ("ipvlan: Introduce l3s mode") and the
      l3mdev paper in [0], the sole reason why ipvlan l3s is relying
      on l3 master device is to get the l3mdev_ip_rcv() receive hook for
      setting the dst entry of the input route without adding its own
      ipvlan specific hacks into the receive path, however, any l3 domain
      semantics beyond just that are breaking l3s operation. Note that
      ipvlan also has the ability to dynamically switch its internal
      operation from l3 to l3s for all ports via ipvlan_set_port_mode()
      at runtime. In any case, l3 vs l3s solely distinguishes itself by
      'de-confusing' netfilter through switching skb->dev to ipvlan slave
      device late in NF_INET_LOCAL_IN before handing the skb to L4.
      
      Minimal fix taken here is to add an IFF_L3MDEV_RX_HANDLER flag which,
      if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
      without any additional l3mdev semantics on top. This should also have
      minimal impact since dev->priv_flags is already hot in cache. With
      this set, l3s mode is working fine and I also get things like
      masquerading pod traffic on the ipvlan master properly working.
      
        [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
      
      Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
      Fixes: 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback")
      Fixes: 4fbae7d8 ("ipvlan: Introduce l3s mode")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: David Ahern <dsa@cumulusnetworks.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Martynas Pumputis <m@lambda.lt>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d5256083
  10. 23 Jan, 2019 1 commit
    • net: introduce a knob to control whether to inherit devconf config · 856c395c
      Cong Wang committed
      There have been many people complaining about the inconsistent
      behaviors of IPv4 and IPv6 devconf when creating new network
      namespaces.  Currently, for IPv4, we inherit all current settings
      from init_net, but for IPv6 we reset all settings to default.
      
      This patch introduces a new /proc file
      /proc/sys/net/core/devconf_inherit_init_net to control the
      behavior of whether to inherit the current sysctl settings from init_net.
      This file itself is only available in init_net.
      
      As demonstrated below:
      
      Initial setup in init_net:
       # cat /proc/sys/net/ipv4/conf/all/rp_filter
       2
       # cat /proc/sys/net/ipv6/conf/all/accept_dad
       1
      
      Default value 0 (current behavior):
       # ip netns del test
       # ip netns add test
       # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
       2
       # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
       0
      
      Set to 1 (inherit from init_net):
       # echo 1 > /proc/sys/net/core/devconf_inherit_init_net
       # ip netns del test
       # ip netns add test
       # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
       2
       # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
       1
      
      Set to 2 (reset to default):
       # echo 2 > /proc/sys/net/core/devconf_inherit_init_net
       # ip netns del test
       # ip netns add test
       # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter
       0
       # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad
       0
      
      Set to a value out of range (invalid):
       # echo 3 > /proc/sys/net/core/devconf_inherit_init_net
       -bash: echo: write error: Invalid argument
       # echo -1 > /proc/sys/net/core/devconf_inherit_init_net
       -bash: echo: write error: Invalid argument
      Reported-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
      Reported-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      856c395c
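The three modes demonstrated above can be summarised as a small decision function. This is a userspace sketch, not the kernel code: `enum devconf_family`, `inherits_from_init_net()` and `new_netns_value()` are hypothetical names, and the mapping is read directly off the shell transcript in the message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tag for the two devconf families discussed above. */
enum devconf_family { DEVCONF_IPV4, DEVCONF_IPV6 };

/*
 * Returns true when a new network namespace copies the current init_net
 * settings for the given family, false when it starts from the defaults.
 * Values outside 0..2 are rejected by the sysctl itself, so they are
 * treated as a programming error here.
 */
static bool inherits_from_init_net(int mode, enum devconf_family family)
{
	assert(mode >= 0 && mode <= 2);

	switch (mode) {
	case 1:			/* inherit for both families */
		return true;
	case 2:			/* reset both families to default */
		return false;
	default:		/* 0: historical behavior, IPv4 only */
		return family == DEVCONF_IPV4;
	}
}

/* Value a new namespace would see for one setting, under this model. */
static int new_netns_value(int mode, enum devconf_family family,
			   int init_net_value, int default_value)
{
	return inherits_from_init_net(mode, family) ? init_net_value
						     : default_value;
}
```

With the transcript's values (rp_filter is 2 and accept_dad is 1 in init_net, and both show 0 after a reset), mode 0 yields 2 and 0, mode 1 yields 2 and 1, and mode 2 yields 0 and 0.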
  11. 18 Jan, 2019 1 commit
    • net: Add extack argument to ndo_fdb_add() · 87b0984e
      Petr Machata committed
      Drivers may not be able to support certain FDB entries, and an error
      code is insufficient to give clear hints as to the reasons of rejection.
      
      In order to make it possible to communicate the rejection reason, extend
      ndo_fdb_add() with an extack argument. Adapt the existing
      implementations of ndo_fdb_add() to take the parameter (and ignore it).
      Pass the extack parameter when invoking ndo_fdb_add() from rtnl_fdb_add().
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      87b0984e
  12. 17 Dec, 2018 1 commit
  13. 14 Dec, 2018 3 commits
  14. 13 Dec, 2018 1 commit
  15. 07 Dec, 2018 3 commits
  16. 26 Nov, 2018 1 commit
  17. 25 Nov, 2018 1 commit
  18. 20 Nov, 2018 1 commit
  19. 18 Nov, 2018 1 commit
  20. 16 Nov, 2018 1 commit
  21. 15 Nov, 2018 1 commit
  22. 11 Nov, 2018 3 commits
  23. 09 Nov, 2018 1 commit
    • net: core: dev_addr_lists: add auxiliary func to handle reference address updates · e7946760
      Ivan Khoronzhuk committed
      To avoid updating the whole table and only add or remove the changed
      addresses, there is an auxiliary function named __hw_addr_sync_dev().
      It lets the end driver do nothing when nothing has changed, and add or
      remove an address only when it is first added or last removed. But it
      doesn't cover cases where an address of the real device or a vlan was
      reused by other vlans or vlan/macvlan devices.
      
      To handle events where an address was reused/unreused, the patch adds
      a new auxiliary routine, __hw_addr_ref_sync_dev(). It allows doing
      nothing when nothing has changed and doing updates only for an address
      being added/reused/deleted/unreused. Thus, clone address changes for
      vlans can be mirrored in the table. The function is mutually exclusive
      with __hw_addr_sync_dev(). It is the responsibility of the end driver
      to identify the address's vlan device, if it needs to.
      Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e7946760
  24. 04 Nov, 2018 1 commit
    • net: bql: add __netdev_tx_sent_queue() · 3e59020a
      Eric Dumazet committed
      When qdisc_run() tries to use BQL budget to bulk-dequeue a batch
      of packets, GSO can later transform this list in another list
      of skbs, and each skb is sent to device ndo_start_xmit(),
      one at a time, with skb->xmit_more being set to one but
      for last skb.
      
      Problem is that very often, BQL limit is hit in the middle of
      the packet train, forcing dev_hard_start_xmit() to stop the
      bulk send and requeue the end of the list.
      
      BQL role is to avoid head of line blocking, making sure
      a qdisc can deliver high priority packets before low priority ones.
      
      But there is no way requeued packets can be bypassed by fresh
      packets in the qdisc.
      
      Aborting the bulk send increases TX softirqs, and hot cache
      lines (after skb_segment()) are wasted.
      
      Note that for TSO packets, we never split a packet in the middle
      because of BQL limit being hit.
      
      Drivers should be able to update BQL counters without
      flipping/caring about BQL status, if the current skb
      has xmit_more set.
      
      Upper layers are ultimately responsible to stop sending another
      packet train when BQL limit is hit.
      
      Code template in a driver might look like the following:
      
      	send_doorbell = __netdev_tx_sent_queue(tx_queue, nr_bytes, skb->xmit_more);
      
      Note that __netdev_tx_sent_queue() use is not mandatory,
      since following patch will change dev_hard_start_xmit()
      to not care about BQL status.
      
      But it is highly recommended so that xmit_more full benefits
      can be reached (less doorbells sent, and less atomic operations as well)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3e59020a
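The intended behavior of the new helper can be modelled in userspace as below. This is a deliberately simplified sketch: `struct fake_txq`, `fake_tx_sent_queue()` and `fake_tx_sent_queue_more()` are hypothetical names, and the "stop when queued bytes reach the limit" rule stands in for the real BQL accounting, which is more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a TX queue with a byte limit. */
struct fake_txq {
	unsigned int queued;	/* bytes handed to the device so far */
	unsigned int limit;	/* simplified byte limit */
	bool stopped;		/* models the queue-stopped state */
};

/* Models the existing helper: account bytes and update the queue status. */
static void fake_tx_sent_queue(struct fake_txq *q, unsigned int bytes)
{
	q->queued += bytes;
	if (q->queued >= q->limit)
		q->stopped = true;
}

/*
 * Models the helper described above. The return value tells the driver
 * whether it should ring the doorbell now.
 */
static bool fake_tx_sent_queue_more(struct fake_txq *q, unsigned int bytes,
				    bool xmit_more)
{
	if (xmit_more) {
		/* Only update the counter; leave the queue status alone. */
		q->queued += bytes;
		return q->stopped;
	}
	fake_tx_sent_queue(q, bytes);
	return true;
}
```

In this model a train of packets with xmit_more set never flips the queue state mid-train, and the doorbell is rung once, on the last packet.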
  25. 16 Oct, 2018 1 commit
    • FDDI: defza: Support capturing outgoing SMT traffic · 9f9a742d
      Maciej W. Rozycki committed
      DEC FDDIcontroller 700 (DEFZA) uses a Tx/Rx queue pair to communicate
      SMT frames with adapter's firmware.  Any SMT frame received from the RMC
      via the Rx queue is queued back by the driver to the SMT Rx queue for
      the firmware to process.  Similarly the firmware uses the SMT Tx queue
      to supply the driver with SMT frames which are queued back to the Tx
      queue for the RMC to send to the ring.
      
      When a network tap is attached to an FDDI interface handled by `defza'
      any incoming SMT frames captured are queued to our usual processing of
      network data received, which in turn delivers them to any listening
      taps.
      
      However the outgoing SMT frames produced by the firmware bypass our
      network protocol stack and are therefore not delivered to taps.  This in
      turn means that taps are missing a part of network traffic sent by the
      adapter, which may make it more difficult to track down network problems
      or do general traffic analysis.
      
      Call `dev_queue_xmit_nit' then in the SMT Tx path, having checked that
      a network tap is attached, with a newly-created `dev_nit_active' helper
      wrapping the usual condition used in the transmit path.
      Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f9a742d
  26. 11 Oct, 2018 1 commit
    • net: ipv4: update fnhe_pmtu when first hop's MTU changes · af7d6cce
      Sabrina Dubroca committed
      Since commit 5aad1de5 ("ipv4: use separate genid for next hop
      exceptions"), exceptions get deprecated separately from cached
      routes. In particular, administrative changes don't clear PMTU anymore.
      
      As Stefano described in commit e9fa1495 ("ipv6: Reflect MTU changes
      on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
      the local MTU change can become stale:
       - if the local MTU is now lower than the PMTU, that PMTU is now
         incorrect
       - if the local MTU was the lowest value in the path, and is increased,
         we might discover a higher PMTU
      
      Similarly to what commit e9fa1495 did for IPv6, update PMTU in those
      cases.
      
      If the exception was locked, the discovered PMTU was smaller than the
      minimal accepted PMTU. In that case, if the new local MTU is smaller
      than the current PMTU, let PMTU discovery figure out if locking of the
      exception is still needed.
      
      To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
      notifier. By the time the notifier is called, dev->mtu has been
      changed. This patch adds the old MTU as additional information in the
      notifier structure, and a new call_netdevice_notifiers_u32() function.
      
      Fixes: 5aad1de5 ("ipv4: use separate genid for next hop exceptions")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      af7d6cce
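The update rules stated above can be written down as a small userspace function. This is a sketch of the rules as worded in the message, not the kernel implementation: `struct fake_fnhe` and `fake_update_pmtu()` are hypothetical, and the exact comparison operators are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a next-hop exception entry. */
struct fake_fnhe {
	uint32_t pmtu;	/* models fnhe_pmtu */
	bool locked;	/* discovered PMTU was below the minimal accepted PMTU */
};

/* Apply a local MTU change from old_mtu to new_mtu to one exception. */
static void fake_update_pmtu(struct fake_fnhe *e, uint32_t new_mtu,
			     uint32_t old_mtu)
{
	assert(new_mtu > 0);

	if (e->locked) {
		/* Let PMTU discovery decide again whether locking is needed. */
		if (new_mtu < e->pmtu) {
			e->pmtu = new_mtu;
			e->locked = false;
		}
	} else if (new_mtu < e->pmtu || old_mtu == e->pmtu) {
		/*
		 * Either the stored PMTU is now too high, or the local MTU
		 * was the lowest value in the path and has changed.
		 */
		e->pmtu = new_mtu;
	}
}
```

The old MTU is needed for the second condition, which is why the message adds it to the notifier information: once the notifier runs, the device already reports the new value.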
  27. 05 Oct, 2018 1 commit
  28. 27 Sep, 2018 1 commit
  29. 19 Sep, 2018 1 commit
  30. 14 Sep, 2018 1 commit
  31. 06 Sep, 2018 1 commit
    • packet: add sockopt to ignore outgoing packets · fa788d98
      Vincent Whitchurch committed
      Currently, the only way to ignore outgoing packets on a packet socket is
      via the BPF filter.  With MSG_ZEROCOPY, packets that are looped into
      AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even
      if the filter run from packet_rcv() would reject them.  So the presence
      of a packet socket on the interface takes away the benefits of
      MSG_ZEROCOPY, even if the packet socket is not interested in outgoing
      packets.  (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily
      cloned, but the cost for that is much lower.)
      
      Add a socket option to allow AF_PACKET sockets to ignore outgoing
      packets to solve this.  Note that the *BSDs already have something
      similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT.
      
      The first intended user is lldpd.
      Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fa788d98
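From userspace, enabling such an option might look like the sketch below. The message above does not name the option; the example assumes it is the `PACKET_IGNORE_OUTGOING` option found in `linux/if_packet.h`, and the fallback value 23 is only there for older headers that lack the definition. The helper name `packet_ignore_outgoing()` is hypothetical.

```c
#include <assert.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Assumption: older kernel headers may not define the option yet. */
#ifndef PACKET_IGNORE_OUTGOING
#define PACKET_IGNORE_OUTGOING 23
#endif

/*
 * Ask an AF_PACKET socket to skip outgoing packets.
 * Returns 0 on success, -1 on failure with errno set by setsockopt().
 */
static int packet_ignore_outgoing(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_PACKET, PACKET_IGNORE_OUTGOING,
			  &one, sizeof(one));
}
```

A capture tool would call this right after creating its packet socket, which normally requires CAP_NET_RAW, and before binding it to an interface, so the kernel never has to copy outgoing packets for that socket.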