1. 24 3月, 2019 2 次提交
  2. 21 3月, 2019 3 次提交
    • P
      net: remove 'fallback' argument from dev->ndo_select_queue() · a350ecce
      Paolo Abeni 提交于
      After the previous patch, all the callers of ndo_select_queue()
      provide as a 'fallback' argument netdev_pick_tx.
      The only exceptions are nested calls to ndo_select_queue(),
      which pass down the 'fallback' available in the current scope
      - still netdev_pick_tx.
      
      We can drop such argument and replace fallback() invocation with
      netdev_pick_tx(). This avoids an indirect call per xmit packet
      in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen)
      with device drivers implementing such ndo. It also clean the code
      a bit.
      
      Tested with ixgbe and CONFIG_FCOE=m
      
      With pktgen using queue xmit:
      threads		vanilla 	patched
      		(kpps)		(kpps)
      1		2334		2428
      2		4166		4278
      4		7895		8100
      
       v1 -> v2:
       - rebased after helper's name change
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a350ecce
    • P
      packet: rework packet_pick_tx_queue() to use common code selection · b71b5837
      Paolo Abeni 提交于
      Currently packet_pick_tx_queue() is the only caller of
      ndo_select_queue() using a fallback argument other than
      netdev_pick_tx.
      
      Leveraging rx queue, we can obtain a similar queue selection
      behavior using core helpers. After this change, ndo_select_queue()
      is always invoked with netdev_pick_tx() as fallback.
      We can change ndo_select_queue() signature in a followup patch,
      dropping an indirect call per transmitted packet in some scenarios
      (e.g. TCP syn and XDP generic xmit)
      
      This changes slightly how af packet queue selection happens when
      PACKET_QDISC_BYPASS is set. It's now more similar to plan dev_queue_xmit()
      tacking in account both XPS and TC mapping.
      
       v1  -> v2:
        - rebased after helper name change
       RFC -> v1:
        - initialize sender_cpu to the expected value
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b71b5837
    • P
      net: dev: rename queue selection helpers. · 4bd97d51
      Paolo Abeni 提交于
      With the following patches, we are going to use __netdev_pick_tx() in
      many modules. Rename it to netdev_pick_tx(), to make it clear is
      a public API.
      
      Also rename the existing netdev_pick_tx() to netdev_core_pick_tx(),
      to avoid name clashes.
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Suggested-by: NDavid Miller <davem@davemloft.net>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4bd97d51
  3. 25 2月, 2019 2 次提交
  4. 16 2月, 2019 1 次提交
    • H
      net: Fix for_each_netdev_feature on Big endian · 3b89ea9c
      Hauke Mehrtens 提交于
      The features attribute is of type u64 and stored in the native endianes on
      the system. The for_each_set_bit() macro takes a pointer to a 32 bit array
      and goes over the bits in this area. On little Endian systems this also
      works with an u64 as the most significant bit is on the highest address,
      but on big endian the words are swapped. When we expect bit 15 here we get
      bit 47 (15 + 32).
      
      This patch converts it more or less to its own for_each_set_bit()
      implementation which works on 64 bit integers directly. This is then
      completely in host endianness and should work like expected.
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      Signed-off-by: NHauke Mehrtens <hauke.mehrtens@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b89ea9c
  5. 07 2月, 2019 1 次提交
    • F
      net: Introduce ndo_get_port_parent_id() · d6abc596
      Florian Fainelli 提交于
      In preparation for getting rid of switchdev_ops, create a dedicated NDO
      operation for getting the port's parent identifier. There are
      essentially two classes of drivers that need to implement getting the
      port's parent ID which are VF/PF drivers with a built-in switch, and
      pure switchdev drivers such as mlxsw, ocelot, dsa etc.
      
      We introduce a helper function: dev_get_port_parent_id() which supports
      recursion into the lower devices to obtain the first port's parent ID.
      
      Convert the bridge, core and ipv4 multicast routing code to check for
      such ndo_get_port_parent_id() and call the helper function when valid
      before falling back to switchdev_port_attr_get(). This will allow us to
      convert all relevant drivers in one go instead of having to implement
      both switchdev_port_attr_get() and ndo_get_port_parent_id() operations,
      then get rid of switchdev_port_attr_get().
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d6abc596
  6. 06 2月, 2019 1 次提交
  7. 02 2月, 2019 1 次提交
  8. 30 1月, 2019 1 次提交
    • J
      net: set default network namespace in init_dummy_netdev() · 35edfdc7
      Josh Elsasser 提交于
      Assign a default net namespace to netdevs created by init_dummy_netdev().
      Fixes a NULL pointer dereference caused by busy-polling a socket bound to
      an iwlwifi wireless device, which bumps the per-net BUSYPOLLRXPACKETS stat
      if napi_poll() received packets:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000190
        IP: napi_busy_loop+0xd6/0x200
        Call Trace:
          sock_poll+0x5e/0x80
          do_sys_poll+0x324/0x5a0
          SyS_poll+0x6c/0xf0
          do_syscall_64+0x6b/0x1f0
          entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: 7db6b048 ("net: Commonize busy polling code to focus on napi_id instead of socket")
      Signed-off-by: NJosh Elsasser <jelsasser@appneta.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35edfdc7
  9. 06 1月, 2019 1 次提交
    • M
      jump_label: move 'asm goto' support test to Kconfig · e9666d10
      Masahiro Yamada 提交于
      Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".
      
      The jump label is controlled by HAVE_JUMP_LABEL, which is defined
      like this:
      
        #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
        # define HAVE_JUMP_LABEL
        #endif
      
      We can improve this by testing 'asm goto' support in Kconfig, then
      make JUMP_LABEL depend on CC_HAS_ASM_GOTO.
      
      Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
      match to the real kernel capability.
      Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      e9666d10
  10. 16 12月, 2018 1 次提交
  11. 14 12月, 2018 3 次提交
  12. 07 12月, 2018 5 次提交
  13. 06 12月, 2018 1 次提交
    • E
      net: use skb_list_del_init() to remove from RX sublists · 22f6bbb7
      Edward Cree 提交于
      list_del() leaves the skb->next pointer poisoned, which can then lead to
       a crash in e.g. OVS forwarding.  For example, setting up an OVS VXLAN
       forwarding bridge on sfc as per:
      
      ========
      $ ovs-vsctl show
      5dfd9c47-f04b-4aaa-aa96-4fbb0a522a30
          Bridge "br0"
              Port "br0"
                  Interface "br0"
                      type: internal
              Port "enp6s0f0"
                  Interface "enp6s0f0"
              Port "vxlan0"
                  Interface "vxlan0"
                      type: vxlan
                      options: {key="1", local_ip="10.0.0.5", remote_ip="10.0.0.4"}
          ovs_version: "2.5.0"
      ========
      (where 10.0.0.5 is an address on enp6s0f1)
      and sending traffic across it will lead to the following panic:
      ========
      general protection fault: 0000 [#1] SMP PTI
      CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.20.0-rc3-ehc+ #701
      Hardware name: Dell Inc. PowerEdge R710/0M233H, BIOS 6.4.0 07/23/2013
      RIP: 0010:dev_hard_start_xmit+0x38/0x200
      Code: 53 48 89 fb 48 83 ec 20 48 85 ff 48 89 54 24 08 48 89 4c 24 18 0f 84 ab 01 00 00 48 8d 86 90 00 00 00 48 89 f5 48 89 44 24 10 <4c> 8b 33 48 c7 03 00 00 00 00 48 8b 05 c7 d1 b3 00 4d 85 f6 0f 95
      RSP: 0018:ffff888627b437e0 EFLAGS: 00010202
      RAX: 0000000000000000 RBX: dead000000000100 RCX: ffff88862279c000
      RDX: ffff888614a342c0 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff888618a88000 R08: 0000000000000001 R09: 00000000000003e8
      R10: 0000000000000000 R11: ffff888614a34140 R12: 0000000000000000
      R13: 0000000000000062 R14: dead000000000100 R15: ffff888616430000
      FS:  0000000000000000(0000) GS:ffff888627b40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6d2bc6d000 CR3: 000000000200a000 CR4: 00000000000006e0
      Call Trace:
       <IRQ>
       __dev_queue_xmit+0x623/0x870
       ? masked_flow_lookup+0xf7/0x220 [openvswitch]
       ? ep_poll_callback+0x101/0x310
       do_execute_actions+0xaba/0xaf0 [openvswitch]
       ? __wake_up_common+0x8a/0x150
       ? __wake_up_common_lock+0x87/0xc0
       ? queue_userspace_packet+0x31c/0x5b0 [openvswitch]
       ovs_execute_actions+0x47/0x120 [openvswitch]
       ovs_dp_process_packet+0x7d/0x110 [openvswitch]
       ovs_vport_receive+0x6e/0xd0 [openvswitch]
       ? dst_alloc+0x64/0x90
       ? rt_dst_alloc+0x50/0xd0
       ? ip_route_input_slow+0x19a/0x9a0
       ? __udp_enqueue_schedule_skb+0x198/0x1b0
       ? __udp4_lib_rcv+0x856/0xa30
       ? __udp4_lib_rcv+0x856/0xa30
       ? cpumask_next_and+0x19/0x20
       ? find_busiest_group+0x12d/0xcd0
       netdev_frame_hook+0xce/0x150 [openvswitch]
       __netif_receive_skb_core+0x205/0xae0
       __netif_receive_skb_list_core+0x11e/0x220
       netif_receive_skb_list+0x203/0x460
       ? __efx_rx_packet+0x335/0x5e0 [sfc]
       efx_poll+0x182/0x320 [sfc]
       net_rx_action+0x294/0x3c0
       __do_softirq+0xca/0x297
       irq_exit+0xa6/0xb0
       do_IRQ+0x54/0xd0
       common_interrupt+0xf/0xf
       </IRQ>
      ========
      So, in all listified-receive handling, instead pull skbs off the lists with
       skb_list_del_init().
      
      Fixes: 9af86f93 ("net: core: fix use-after-free in __netif_receive_skb_list_core")
      Fixes: 7da517a3 ("net: core: Another step of skb receive list processing")
      Fixes: a4ca8b7d ("net: ipv4: fix drop handling in ip_list_rcv() and ip_list_rcv_finish()")
      Fixes: d8269e2c ("net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()")
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      22f6bbb7
  14. 04 12月, 2018 1 次提交
    • Q
      net/core: tidy up an error message · bf29e9e9
      Qian Cai 提交于
      netif_napi_add() could report an error like this below due to it allows
      to pass a format string for wildcarding before calling
      dev_get_valid_name(),
      
      "netif_napi_add() called with weight 256 on device eth%d"
      
      For example, hns_enet_drv module does this.
      
      hns_nic_try_get_ae
        hns_nic_init_ring_data
          netif_napi_add
        register_netdev
          dev_get_valid_name
      
      Hence, make it a bit more human-readable by using netdev_err_once()
      instead.
      Signed-off-by: NQian Cai <cai@gmx.us>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bf29e9e9
  15. 01 12月, 2018 1 次提交
    • G
      net: Add trace events for all receive exit points · b0e3f1bd
      Geneviève Bastien 提交于
      Trace events are already present for the receive entry points, to indicate
      how the reception entered the stack.
      
      This patch adds the corresponding exit trace events that will bound the
      reception such that all events occurring between the entry and the exit
      can be considered as part of the reception context. This greatly helps
      for dependency and root cause analyses.
      
      Without this, it is not possible with tracepoint instrumentation to
      determine whether a sched_wakeup event following a netif_receive_skb
      event is the result of the packet reception or a simple coincidence after
      further processing by the thread. It is possible using other mechanisms
      like kretprobes, but considering the "entry" points are already present,
      it would be good to add the matching exit events.
      
      In addition to linking packets with wakeups, the entry/exit event pair
      can also be used to perform network stack latency analyses.
      Signed-off-by: NGeneviève Bastien <gbastien@versatic.net>
      CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      CC: Steven Rostedt <rostedt@goodmis.org>
      CC: Ingo Molnar <mingo@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> (tracing side)
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b0e3f1bd
  16. 30 11月, 2018 3 次提交
    • C
      net: explain __skb_checksum_complete() with comments · 14641931
      Cong Wang 提交于
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      14641931
    • S
      net: fix XPS static_key accounting · 867d0ad4
      Sabrina Dubroca 提交于
      Commit 04157469 ("net: Use static_key for XPS maps") introduced a
      static key for XPS, but the increments/decrements don't match.
      
      First, the static key's counter is incremented once for each queue, but
      only decremented once for a whole batch of queues, leading to large
      unbalances.
      
      Second, the xps_rxqs_needed key is decremented whenever we reset a batch
      of queues, whether they had any rxqs mapping or not, so that if we setup
      cpu-XPS on em1 and RXQS-XPS on em2, resetting the queues on em1 would
      decrement the xps_rxqs_needed key.
      
      This reworks the accounting scheme so that the xps_needed key is
      incremented only once for each type of XPS for all the queues on a
      device, and the xps_rxqs_needed key is incremented only once for all
      queues. This is sufficient to let us retrieve queues via
      get_xps_queue().
      
      This patch introduces a new reset_xps_maps(), which reinitializes and
      frees the appropriate map (xps_rxqs_map or xps_cpus_map), and drops a
      reference to the needed keys:
       - both xps_needed and xps_rxqs_needed, in case of rxqs maps,
       - only xps_needed, in case of CPU maps.
      
      Now, we also need to call reset_xps_maps() at the end of
      __netif_set_xps_queue() when there's no active map left, for example
      when writing '00000000,00000000' to all queues' xps_rxqs setting.
      
      Fixes: 04157469 ("net: Use static_key for XPS maps")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      867d0ad4
    • S
      net: restore call to netdev_queue_numa_node_write when resetting XPS · f28c020f
      Sabrina Dubroca 提交于
      Before commit 80d19669 ("net: Refactor XPS for CPUs and Rx queues"),
      netif_reset_xps_queues() did netdev_queue_numa_node_write() for all the
      queues being reset. Now, this is only done when the "active" variable in
      clean_xps_maps() is false, ie when on all the CPUs, there's no active
      XPS mapping left.
      
      Fixes: 80d19669 ("net: Refactor XPS for CPUs and Rx queues")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f28c020f
  17. 24 11月, 2018 2 次提交
  18. 18 11月, 2018 1 次提交
    • E
      net-gro: reset skb->pkt_type in napi_reuse_skb() · 33d9a2c7
      Eric Dumazet 提交于
      eth_type_trans() assumes initial value for skb->pkt_type
      is PACKET_HOST.
      
      This is indeed the value right after a fresh skb allocation.
      
      However, it is possible that GRO merged a packet with a different
      value (like PACKET_OTHERHOST in case macvlan is used), so
      we need to make sure napi->skb will have pkt_type set back to
      PACKET_HOST.
      
      Otherwise, valid packets might be dropped by the stack because
      their pkt_type is not PACKET_HOST.
      
      napi_reuse_skb() was added in commit 96e93eab ("gro: Add
      internal interfaces for VLAN"), but this bug always has
      been there.
      
      Fixes: 96e93eab ("gro: Add internal interfaces for VLAN")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      33d9a2c7
  19. 16 11月, 2018 1 次提交
  20. 09 11月, 2018 1 次提交
  21. 04 11月, 2018 1 次提交
    • E
      net: do not abort bulk send on BQL status · fe60faa5
      Eric Dumazet 提交于
      Before calling dev_hard_start_xmit(), upper layers tried
      to cook optimal skb list based on BQL budget.
      
      Problem is that GSO packets can end up comsuming more than
      the BQL budget.
      
      Breaking the loop is not useful, since requeued packets
      are ahead of any packets still in the qdisc.
      
      It is also more expensive, since next TX completion will
      push these packets later, while skbs are not in cpu caches.
      
      It is also a behavior difference with TSO packets, that can
      break the BQL limit by a large amount.
      
      Note that drivers should use __netdev_tx_sent_queue()
      in order to have optimal xmit_more support, and avoid
      useless atomic operations as shown in the following patch.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fe60faa5
  22. 29 10月, 2018 1 次提交
  23. 16 10月, 2018 1 次提交
    • M
      FDDI: defza: Support capturing outgoing SMT traffic · 9f9a742d
      Maciej W. Rozycki 提交于
      DEC FDDIcontroller 700 (DEFZA) uses a Tx/Rx queue pair to communicate
      SMT frames with adapter's firmware.  Any SMT frame received from the RMC
      via the Rx queue is queued back by the driver to the SMT Rx queue for
      the firmware to process.  Similarly the firmware uses the SMT Tx queue
      to supply the driver with SMT frames which are queued back to the Tx
      queue for the RMC to send to the ring.
      
      When a network tap is attached to an FDDI interface handled by `defza'
      any incoming SMT frames captured are queued to our usual processing of
      network data received, which in turn delivers them to any listening
      taps.
      
      However the outgoing SMT frames produced by the firmware bypass our
      network protocol stack and are therefore not delivered to taps.  This in
      turn means that taps are missing a part of network traffic sent by the
      adapter, which may make it more difficult to track down network problems
      or do general traffic analysis.
      
      Call `dev_queue_xmit_nit' then in the SMT Tx path, having checked that
      a network tap is attached, with a newly-created `dev_nit_active' helper
      wrapping the usual condition used in the transmit path.
      Signed-off-by: NMaciej W. Rozycki <macro@linux-mips.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9f9a742d
  24. 11 10月, 2018 1 次提交
    • S
      net: ipv4: update fnhe_pmtu when first hop's MTU changes · af7d6cce
      Sabrina Dubroca 提交于
      Since commit 5aad1de5 ("ipv4: use separate genid for next hop
      exceptions"), exceptions get deprecated separately from cached
      routes. In particular, administrative changes don't clear PMTU anymore.
      
      As Stefano described in commit e9fa1495 ("ipv6: Reflect MTU changes
      on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
      the local MTU change can become stale:
       - if the local MTU is now lower than the PMTU, that PMTU is now
         incorrect
       - if the local MTU was the lowest value in the path, and is increased,
         we might discover a higher PMTU
      
      Similarly to what commit e9fa1495 did for IPv6, update PMTU in those
      cases.
      
      If the exception was locked, the discovered PMTU was smaller than the
      minimal accepted PMTU. In that case, if the new local MTU is smaller
      than the current PMTU, let PMTU discovery figure out if locking of the
      exception is still needed.
      
      To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
      notifier. By the time the notifier is called, dev->mtu has been
      changed. This patch adds the old MTU as additional information in the
      notifier structure, and a new call_netdevice_notifiers_u32() function.
      
      Fixes: 5aad1de5 ("ipv4: use separate genid for next hop exceptions")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af7d6cce
  25. 10 10月, 2018 1 次提交
    • J
      net: fix generic XDP to handle if eth header was mangled · 29724956
      Jesper Dangaard Brouer 提交于
      XDP can modify (and resize) the Ethernet header in the packet.
      
      There is a bug in generic-XDP, because skb->protocol and skb->pkt_type
      are setup before reaching (netif_receive_)generic_xdp.
      
      This bug was hit when XDP were popping VLAN headers (changing
      eth->h_proto), as skb->protocol still contains VLAN-indication
      (ETH_P_8021Q) causing invocation of skb_vlan_untag(skb), which corrupt
      the packet (basically popping the VLAN again).
      
      This patch catch if XDP changed eth header in such a way, that SKB
      fields needs to be updated.
      
      V2: on request from Song Liu, use ETH_HLEN instead of mac_len,
      in __skb_push() as eth_type_trans() use ETH_HLEN in paired skb_pull_inline().
      
      Fixes: d4455169 ("net: xdp: support xdp generic on virtual devices")
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      29724956
  26. 11 9月, 2018 2 次提交