  1. 12 January 2020 (1 commit)
  2. 11 January 2020 (1 commit)
    • devlink: Wait longer before warning about unset port type · 4c582234
      Ido Schimmel committed
      The commit cited below causes devlink to emit a warning if a type was
      not set on a devlink port for longer than 30 seconds to "prevent
      misbehavior of drivers". This proved to be problematic when
      unregistering the backing netdev. The flow is always:
      
      devlink_port_type_clear()	// schedules the warning
      unregister_netdev()		// blocking
      devlink_port_unregister()	// cancels the warning
      
      The call to unregister_netdev() can block for long periods of time for
      various reasons: RTNL lock is contended, large amounts of configuration
      to unroll following dismantle of the netdev, etc. This results in
      devlink emitting a warning despite the driver behaving correctly.
      
      In emulated environments (of future hardware), which are usually very
      slow, the warning can also be emitted during port creation, as more than
      30 seconds can pass between the time the devlink port is registered and
      the time its type is set.
      
      In addition, syzbot has hit this warning [1] 1974 times since 07/11/19
      without being able to produce a reproducer, probably because
      reproduction depends on system load or other bugs (e.g., RTNL not
      being released).
      
      To prevent bogus warnings, increase the timeout to 1 hour.
      
      [1] https://syzkaller.appspot.com/bug?id=e99b59e9c024a666c9f7450dc162a4b74d09d9cb
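
      The fix itself is just the timeout constant; a rough sketch of the
      surrounding delayed-work pattern (helper and field names assumed, not
      the exact hunk):

        #define DEVLINK_PORT_TYPE_WARN_TIMEOUT (HZ * 3600)	/* was HZ * 30 */

        /* scheduled by devlink_port_type_clear() */
        static void devlink_port_type_warn_schedule(struct devlink_port *port)
        {
        	schedule_delayed_work(&port->type_warn_dw,
        			      DEVLINK_PORT_TYPE_WARN_TIMEOUT);
        }

        /* cancelled by devlink_port_type_set() and devlink_port_unregister() */
        static void devlink_port_type_warn_cancel(struct devlink_port *port)
        {
        	cancel_delayed_work_sync(&port->type_warn_dw);
        }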
      
      Fixes: 136bf27f ("devlink: add warning in case driver does not set port type")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reported-by: syzbot+b0a18ed7b08b735d2f41@syzkaller.appspotmail.com
      Reported-by: Alex Veber <alexve@mellanox.com>
      Tested-by: Alex Veber <alexve@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4c582234
  3. 20 December 2019 (1 commit)
    • net, sysctl: Fix compiler warning when only cBPF is present · 1148f9ad
      Alexander Lobakin committed
      proc_dointvec_minmax_bpf_restricted() was first introduced in
      commit 2e4a3098 ("bpf: restrict access to core bpf sysctls")
      under CONFIG_HAVE_EBPF_JIT. This ifdef was then removed in
      ede95a63 ("bpf: add bpf_jit_limit knob to restrict unpriv
      allocations"), because a new sysctl, bpf_jit_limit, made use of the
      handler. Finally, that parameter became a long instead of an int
      with fdadd049 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K"),
      and thus a new proc_dolongvec_minmax_bpf_restricted() was added.
      
      After this last change, proc_dointvec_minmax_bpf_restricted() is once
      again used only under CONFIG_HAVE_EBPF_JIT, but the corresponding
      ifdef was never brought back.
      
      So, since v4.20, configurations with CONFIG_BPF_JIT=y &&
      CONFIG_HAVE_EBPF_JIT=n produce:
      
        CC      net/core/sysctl_net_core.o
      net/core/sysctl_net_core.c:292:1: warning: ‘proc_dointvec_minmax_bpf_restricted’ defined but not used [-Wunused-function]
        292 | proc_dointvec_minmax_bpf_restricted(struct ctl_table *table, int write,
            | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Suppress this by guarding it with CONFIG_HAVE_EBPF_JIT again.
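
      A minimal sketch of the guard (handler body elided):

        #ifdef CONFIG_HAVE_EBPF_JIT
        static int
        proc_dointvec_minmax_bpf_restricted(struct ctl_table *table, int write,
        				    void __user *buffer, size_t *lenp,
        				    loff_t *ppos)
        {
        	/* ... unchanged handler body ... */
        }
        #endif /* CONFIG_HAVE_EBPF_JIT */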
      
      Fixes: fdadd049 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
      Signed-off-by: Alexander Lobakin <alobakin@dlink.ru>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191218091821.7080-1-alobakin@dlink.ru
      1148f9ad
  4. 18 December 2019 (2 commits)
  5. 14 December 2019 (1 commit)
  6. 10 December 2019 (2 commits)
  7. 08 December 2019 (1 commit)
    • inet: protect against too small mtu values. · 501a90c9
      Eric Dumazet committed
      syzbot was once again able to crash a host by setting a very small mtu
      on a loopback device.
      
      Let's make inetdev_valid_mtu() available in include/net/ip.h
      and use it in ip_setup_cork(), so that we protect both
      ip_append_page() and __ip_append_data().
      
      Also add a READ_ONCE() when the device mtu is read.
      
      Pair this lockless read with one WRITE_ONCE() in __dev_set_mtu(),
      even though other code paths might also write to this field.
      
      Add a big comment in include/linux/netdevice.h about dev->mtu
      needing READ_ONCE()/WRITE_ONCE() annotations.
      
      Hopefully we will add the missing ones in followup patches.
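
      A sketch of the pairing (abridged; the exact context differs):

        /* writer side, __dev_set_mtu(): annotate the lockless store */
        WRITE_ONCE(dev->mtu, new_mtu);

        /* reader side, ip_setup_cork(): validate a racy read of dev->mtu */
        mtu = READ_ONCE(rt->dst.dev->mtu);
        if (!inetdev_valid_mtu(mtu))
        	return -ENETUNREACH;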
      
      [1]
      
      refcount_t: saturated; leaking memory.
      WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       panic+0x2e3/0x75c kernel/panic.c:221
       __warn.cold+0x2f/0x3e kernel/panic.c:582
       report_bug+0x289/0x300 lib/bug.c:195
       fixup_bug arch/x86/kernel/traps.c:174 [inline]
       fixup_bug arch/x86/kernel/traps.c:169 [inline]
       do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
       do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
       invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
      RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd <0f> 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89
      RSP: 0018:ffff88809689f550 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c
      RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1
      R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001
      R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40
       refcount_add include/linux/refcount.h:193 [inline]
       skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999
       sock_wmalloc+0xf1/0x120 net/core/sock.c:2096
       ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383
       udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276
       inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821
       kernel_sendpage+0x92/0xf0 net/socket.c:3794
       sock_sendpage+0x8b/0xc0 net/socket.c:936
       pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458
       splice_from_pipe_feed fs/splice.c:512 [inline]
       __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636
       splice_from_pipe+0x108/0x170 fs/splice.c:671
       generic_splice_sendpage+0x3c/0x50 fs/splice.c:842
       do_splice_from fs/splice.c:861 [inline]
       direct_splice_actor+0x123/0x190 fs/splice.c:1035
       splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990
       do_splice_direct+0x1da/0x2a0 fs/splice.c:1078
       do_sendfile+0x597/0xd00 fs/read_write.c:1464
       __do_sys_sendfile64 fs/read_write.c:1525 [inline]
       __se_sys_sendfile64 fs/read_write.c:1511 [inline]
       __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x441409
      Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409
      RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005
      RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010
      R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180
      R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 1470ddf7 ("inet: Remove explicit write references to sk/inet in ip_append_data")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      501a90c9
  8. 07 December 2019 (3 commits)
    • net: core: rename indirect block ingress cb function · dbad3408
      John Hurley committed
      With indirect blocks, a driver can register for callbacks from a device
      that it does not 'own', for example, a tunnel device. When registering to
      or unregistering from a new device, a callback is triggered to generate
      a bind/unbind event. This, in turn, allows the driver to receive any
      existing rules or to properly clean up installed rules.
      
      When first added, it was assumed that all indirect block registrations
      would be for ingress offloads. However, the NFP driver can, in some
      instances, support clsact qdisc binds for egress offload.
      
      Change the name of the indirect block callback command in flow_offload
      to remove the 'ingress' identifier from it. While this does not change
      functionality, a follow-up patch will implement a more generic callback
      than the current one, which supports only ingress offload.
      
      Fixes: 4d12ba42 ("nfp: flower: allow offloading of matches on 'internal' ports")
      Signed-off-by: John Hurley <john.hurley@netronome.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dbad3408
    • net-sysfs: Call dev_hold always in netdev_queue_add_kobject · e0b60903
      Jouni Hogander committed
      dev_hold() has to be called unconditionally in
      netdev_queue_add_kobject(); otherwise the usage count drops below
      zero when kobject_init_and_add() fails.
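
      A sketch of the resulting shape of netdev_queue_add_kobject()
      (abridged):

        static int netdev_queue_add_kobject(struct net_device *dev, int index)
        {
        	struct netdev_queue *queue = dev->_tx + index;
        	struct kobject *kobj = &queue->kobj;
        	int error;

        	/* take the reference unconditionally, before any failure
        	 * point; the kobject release callback does the matching
        	 * dev_put() */
        	dev_hold(queue->dev);

        	kobj->kset = dev->queues_kset;
        	error = kobject_init_and_add(kobj, &netdev_queue_ktype, NULL,
        				     "tx-%u", index);
        	if (error)
        		goto err;

        	/* ... sysfs group registration ... */

        	kobject_uevent(kobj, KOBJ_ADD);
        	return 0;

        err:
        	kobject_put(kobj);
        	return error;
        }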
      
      Fixes: b8eb7183 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Miller <davem@davemloft.net>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e0b60903
    • net: dsa: fix flow dissection on Tx path · 8bef0af0
      Alexander Lobakin committed
      Commit 43e66528 ("net-next: dsa: fix flow dissection") added the
      ability to override the protocol and network offset during flow
      dissection for DSA-enabled devices (i.e. controllers shipped as switch
      CPU ports) in order to fix skb hashing for RPS on the Rx path.
      
      However, skb_get_hash() and the added code can be invoked not only on
      Rx, but also on the Tx path, if we have a multi-queued device and:
       - the kernel is running on a UP system, or
       - XPS is not configured.
      
      The call stack in these two cases will be: dev_queue_xmit() ->
      __dev_queue_xmit() -> netdev_core_pick_tx() -> netdev_pick_tx() ->
      skb_tx_hash() -> skb_get_hash().
      
      The problem is that skbs queued for Tx have both the network offset
      and the correct protocol already set up, even after the DSA tagger has
      inserted a CPU tag, so calling tag_ops->flow_dissect() on this path
      actually only breaks flow dissection and hashing.
      
      This can be observed by adding debug prints just before and right after
      the tag_ops->flow_dissect() call in the related block of code:
      
      Before the patch:
      
      Rx path (RPS):
      
      [   19.240001] Rx: proto: 0x00f8, nhoff: 0	/* ETH_P_XDSA */
      [   19.244271] tag_ops->flow_dissect()
      [   19.247811] Rx: proto: 0x0800, nhoff: 8	/* ETH_P_IP */
      
      [   19.215435] Rx: proto: 0x00f8, nhoff: 0	/* ETH_P_XDSA */
      [   19.219746] tag_ops->flow_dissect()
      [   19.223241] Rx: proto: 0x0806, nhoff: 8	/* ETH_P_ARP */
      
      [   18.654057] Rx: proto: 0x00f8, nhoff: 0	/* ETH_P_XDSA */
      [   18.658332] tag_ops->flow_dissect()
      [   18.661826] Rx: proto: 0x8100, nhoff: 8	/* ETH_P_8021Q */
      
      Tx path (UP system):
      
      [   18.759560] Tx: proto: 0x0800, nhoff: 26	/* ETH_P_IP */
      [   18.763933] tag_ops->flow_dissect()
      [   18.767485] Tx: proto: 0x920b, nhoff: 34	/* junk */
      
      [   22.800020] Tx: proto: 0x0806, nhoff: 26	/* ETH_P_ARP */
      [   22.804392] tag_ops->flow_dissect()
      [   22.807921] Tx: proto: 0x920b, nhoff: 34	/* junk */
      
      [   16.898342] Tx: proto: 0x86dd, nhoff: 26	/* ETH_P_IPV6 */
      [   16.902705] tag_ops->flow_dissect()
      [   16.906227] Tx: proto: 0x920b, nhoff: 34	/* junk */
      
      After:
      
      Rx path (RPS):
      
      [   16.520993] Rx: proto: 0x00f8, nhoff: 0	/* ETH_P_XDSA */
      [   16.525260] tag_ops->flow_dissect()
      [   16.528808] Rx: proto: 0x0800, nhoff: 8	/* ETH_P_IP */
      
      [   15.484807] Rx: proto: 0x00f8, nhoff: 0	/* ETH_P_XDSA */
      [   15.490417] tag_ops->flow_dissect()
      [   15.495223] Rx: proto: 0x0806, nhoff: 8	/* ETH_P_ARP */
      
      [   17.134621] Rx: proto: 0x00f8, nhoff: 0	/* ETH_P_XDSA */
      [   17.138895] tag_ops->flow_dissect()
      [   17.142388] Rx: proto: 0x8100, nhoff: 8	/* ETH_P_8021Q */
      
      Tx path (UP system):
      
      [   15.499558] Tx: proto: 0x0800, nhoff: 26	/* ETH_P_IP */
      
      [   20.664689] Tx: proto: 0x0806, nhoff: 26	/* ETH_P_ARP */
      
      [   18.565782] Tx: proto: 0x86dd, nhoff: 26	/* ETH_P_IPV6 */
      
      In order to fix that, we can add the check 'proto == htons(ETH_P_XDSA)'
      to prevent the code from calling tag_ops->flow_dissect() on Tx.
      I also decided to initialize the 'offset' variable so tagger callbacks
      can now safely leave it untouched. A sketch of the resulting check:
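
        /* in __skb_flow_dissect(); a sketch, abridged */
        if (unlikely(skb->dev && netdev_uses_dsa(skb->dev) &&
        	     proto == htons(ETH_P_XDSA))) {
        	const struct dsa_device_ops *ops;
        	int offset = 0;		/* now pre-initialized */

        	ops = skb->dev->dsa_ptr->tag_ops;
        	if (ops->flow_dissect &&
        	    !ops->flow_dissect(skb, &proto, &offset)) {
        		hlen -= offset;
        		nhoff += offset;
        	}
        }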
      
      Fixes: 43e66528 ("net-next: dsa: fix flow dissection")
      Signed-off-by: Alexander Lobakin <alobakin@dlink.ru>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8bef0af0
  9. 05 December 2019 (3 commits)
  10. 04 December 2019 (3 commits)
    • cls_flower: Fix the behavior using port ranges with hw-offload · 8ffb055b
      Yoshiki Komachi committed
      The recent commit 5c72299f ("net: sched: cls_flower: Classify
      packets using port ranges") added filtering based on port ranges
      to tc flower. However, the commit missed the necessary changes in the
      hw-offload code, so the feature gave rise to incorrect offloaded flow
      keys in the NIC.
      
      A more detailed example is below:
      
      $ tc qdisc add dev eth0 ingress
      $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
        dst_port 100-200 action drop
      
      With the setup above, an exact match filter with dst_port == 0 will be
      installed in the NIC by hw-offload. IOW, the NIC will have a rule which
      is equivalent to the following one.
      
      $ tc qdisc add dev eth0 ingress
      $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
        dst_port 0 action drop
      
      The behavior was caused by the flow dissector, which extracts packet
      data into the flow key in tc flower. More specifically, regardless
      of exact match or specified port ranges, fl_init_dissector() set the
      FLOW_DISSECTOR_KEY_PORTS flag in struct flow_dissector to extract port
      numbers from the skb in skb_flow_dissect(), called by fl_classify().
      Note that device drivers received the same struct flow_dissector object
      as used in skb_flow_dissect(). Thus, offload drivers could not identify
      which of the two cases applied, because the FLOW_DISSECTOR_KEY_PORTS
      flag was set in struct flow_dissector in either case.
      
      This patch adds the new FLOW_DISSECTOR_KEY_PORTS_RANGE flag and the new
      tp_range field in struct fl_flow_key so offload drivers can recognize
      which kind of filter they are given. For now, when filters based on port
      ranges are passed to drivers, the drivers return -EOPNOTSUPP because
      they do not support the feature (the newly created
      FLOW_DISSECTOR_KEY_PORTS_RANGE flag).
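
      On the driver side, the rejection then reduces to a capability check
      along these lines (a sketch, not taken from any particular driver):

        struct flow_rule *rule = flow_cls_offload_flow_rule(f);

        /* exact port matches still arrive as FLOW_DISSECTOR_KEY_PORTS;
         * ranges now arrive under the new key and are refused */
        if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_PORTS_RANGE))
        	return -EOPNOTSUPP;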
      
      Fixes: 5c72299f ("net: sched: cls_flower: Classify packets using port ranges")
      Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8ffb055b
    • net/core: Populate VF index in struct ifla_vf_guid · 9aed6ae0
      Danit Goldberg committed
      In addition to filling the node_guid and port_guid attributes,
      the VF index needs to be populated too; otherwise users of the netlink
      interface will see the same VF index for all VFs.
      
      Fixes: 30aad417 ("net/core: Add support for getting VF GUIDs")
      Signed-off-by: Danit Goldberg <danitg@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9aed6ae0
    • net: fix a leak in register_netdevice() · 42c17fa6
      Dan Carpenter committed
      We have to free "dev->name_node" on this error path.
      
      Fixes: ff927412 ("net: introduce name_node struct to be used in hashlist")
      Reported-by: syzbot+6e13e65ffbaa33757bcb@syzkaller.appspotmail.com
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      42c17fa6
  11. 03 December 2019 (1 commit)
    • Fixed updating of ethertype in function skb_mpls_pop · 040b5cfb
      Martin Varghese committed
      skb_mpls_pop() was not updating the ethertype of an Ethernet packet if
      the packet was originally received from a non-ARPHRD_ETHER device.
      
      In the OVS datapath flow below, since the device corresponding to port 7
      is an L3 device (ARPHRD_NONE), skb_mpls_pop() does not update the
      ethertype of the packet even though the previous push_eth action had
      added an Ethernet header to the packet.
      
      recirc_id(0),in_port(7),eth_type(0x8847),
      mpls(label=12/0xfffff,tc=0/0,ttl=0/0x0,bos=1/1),
      actions:push_eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00),
      pop_mpls(eth_type=0x800),4
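
      The fix makes the ethertype update depend on the caller saying an
      Ethernet header is present, instead of guessing from skb->dev->type;
      roughly (a sketch, abridged):

        int skb_mpls_pop(struct sk_buff *skb, __be16 next_proto, int mac_len,
        		 bool ethernet)
        {
        	/* ... */
        	/* was: if (skb->dev && skb->dev->type == ARPHRD_ETHER) */
        	if (ethernet) {
        		struct ethhdr *hdr;

        		/* the Ethernet header sits right before the MPLS stack */
        		hdr = (struct ethhdr *)((void *)mpls_hdr(skb) - ETH_HLEN);
        		hdr->h_proto = next_proto;
        	}
        	/* ... */
        }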
      
      Fixes: ed246cee ("net: core: move pop MPLS functionality from OvS to core helper")
      Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      040b5cfb
  12. 29 November 2019 (1 commit)
    • net: skmsg: fix TLS 1.3 crash with full sk_msg · 031097d9
      Jakub Kicinski committed
      TLS 1.3 started using the entry at the end of the SG array
      for chaining-in the single byte content type entry. This mostly
      works:
      
      [ E E E E E E . . ]
        ^           ^
         start       end
      
                       E < content type
                     /
      [ E E E E E E C . ]
        ^           ^
         start       end
      
      (Where E denotes a populated SG entry; C denotes a chaining entry.)
      
      If the array is full, however, the end will point to the start:
      
      [ E E E E E E E E ]
        ^
         start
         end
      
      And we end up overwriting the start:
      
          E < content type
         /
      [ C E E E E E E E ]
        ^
         start
         end
      
      The sg array is supposed to be a circular buffer with start and
      end markers pointing anywhere. In the case where start > end
      (i.e. the circular buffer has "wrapped"), there is an extra entry
      reserved at the end to chain the two halves together.
      
      [ E E E E E E . . l ]
      
      (Where l is the reserved entry for "looping" back to the front.)
      
      As suggested by John, let's reserve another entry for chaining
      SG entries after the main circular buffer. Note that this entry
      has to be pointed to by the end entry so its position is not fixed.
      
      Examples of full messages:
      
      [ E E E E E E E E . l ]
        ^               ^
         start           end
      
         <---------------.
      [ E E . E E E E E E l ]
            ^ ^
         end   start
      
      Now the end will always point to an unused entry, so TLS 1.3
      can always use it.
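
      In code, the sizing and wrap arithmetic come out roughly as follows
      (a sketch; names as I read them from the patch):

        #define MAX_MSG_FRAGS	MAX_SKB_FRAGS
        /* indices wrap one slot later, keeping a spare entry after 'end' */
        #define NR_MSG_FRAG_IDS	(MAX_MSG_FRAGS + 1)

        struct sk_msg_sg {
        	u32			start;
        	u32			end;
        	/* ... */
        	/* +1 for the wrap-around chain entry, +1 for the entry that
        	 * is always left free after 'end' (TLS 1.3 content type) */
        	struct scatterlist	data[MAX_MSG_FRAGS + 2];
        };

        /* number of used entries, modulo the widened index space */
        static inline u32 sk_msg_iter_dist(u32 start, u32 end)
        {
        	return end >= start ? end - start
        			    : end + NR_MSG_FRAG_IDS - start;
        }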
      
      Fixes: 130b392c ("net: tls: Add tls 1.3 support")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      031097d9
  13. 24 November 2019 (1 commit)
  14. 23 November 2019 (2 commits)
    • net: rtnetlink: prevent underflows in do_setvfinfo() · ff08ddba
      Dan Carpenter committed
      The "ivm->vf" variable is a u32, but the problem is that a number of
      drivers cast it to an int and then forget to check for negatives.  An
      example of this is in the cxgb4 driver.
      
      drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
        2890  static int cxgb4_mgmt_get_vf_config(struct net_device *dev,
        2891                                      int vf, struct ifla_vf_info *ivi)
                                                  ^^^^^^
        2892  {
        2893          struct port_info *pi = netdev_priv(dev);
        2894          struct adapter *adap = pi->adapter;
        2895          struct vf_info *vfinfo;
        2896
        2897          if (vf >= adap->num_vfs)
                          ^^^^^^^^^^^^^^^^^^^
        2898                  return -EINVAL;
        2899          vfinfo = &adap->vfinfo[vf];
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
      
      There are 48 functions affected.
      
      drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:8435 hclge_set_vf_vlan_filter() warn: can 'vfid' underflow 's32min-2147483646'
      drivers/net/ethernet/freescale/enetc/enetc_pf.c:377 enetc_pf_set_vf_mac() warn: can 'vf' underflow 's32min-2147483646'
      drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:2899 cxgb4_mgmt_get_vf_config() warn: can 'vf' underflow 's32min-254'
      drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:2960 cxgb4_mgmt_set_vf_rate() warn: can 'vf' underflow 's32min-254'
      drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:3019 cxgb4_mgmt_set_vf_rate() warn: can 'vf' underflow 's32min-254'
      drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:3038 cxgb4_mgmt_set_vf_vlan() warn: can 'vf' underflow 's32min-254'
      drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:3086 cxgb4_mgmt_set_vf_link_state() warn: can 'vf' underflow 's32min-254'
      drivers/net/ethernet/chelsio/cxgb/cxgb2.c:791 get_eeprom() warn: can 'i' underflow 's32min-(-4),0,4-s32max'
      drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c:82 bnxt_set_vf_spoofchk() warn: can 'vf_id' underflow 's32min-65534'
      drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c:164 bnxt_set_vf_trust() warn: can 'vf_id' underflow 's32min-65534'
      drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c:186 bnxt_get_vf_config() warn: can 'vf_id' underflow 's32min-65534'
      drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c:228 bnxt_set_vf_mac() warn: can 'vf_id' underflow 's32min-65534'
      drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c:264 bnxt_set_vf_vlan() warn: can 'vf_id' underflow 's32min-65534'
      drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c:293 bnxt_set_vf_bw() warn: can 'vf_id' underflow 's32min-65534'
      drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c:333 bnxt_set_vf_link_state() warn: can 'vf_id' underflow 's32min-65534'
      drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c:2595 bnx2x_vf_op_prep() warn: can 'vfidx' underflow 's32min-63'
      drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c:2595 bnx2x_vf_op_prep() warn: can 'vfidx' underflow 's32min-63'
      drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c:2281 bnx2x_post_vf_bulletin() warn: can 'vf' underflow 's32min-63'
      drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c:2285 bnx2x_post_vf_bulletin() warn: can 'vf' underflow 's32min-63'
      drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c:2286 bnx2x_post_vf_bulletin() warn: can 'vf' underflow 's32min-63'
      drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c:2292 bnx2x_post_vf_bulletin() warn: can 'vf' underflow 's32min-63'
      drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c:2297 bnx2x_post_vf_bulletin() warn: can 'vf' underflow 's32min-63'
      drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c:1832 qlcnic_sriov_set_vf_mac() warn: can 'vf' underflow 's32min-254'
      drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c:1864 qlcnic_sriov_set_vf_tx_rate() warn: can 'vf' underflow 's32min-254'
      drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c:1937 qlcnic_sriov_set_vf_vlan() warn: can 'vf' underflow 's32min-254'
      drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c:2005 qlcnic_sriov_get_vf_config() warn: can 'vf' underflow 's32min-254'
      drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c:2036 qlcnic_sriov_set_vf_spoofchk() warn: can 'vf' underflow 's32min-254'
      drivers/net/ethernet/emulex/benet/be_main.c:1914 be_get_vf_config() warn: can 'vf' underflow 's32min-65534'
      drivers/net/ethernet/emulex/benet/be_main.c:1915 be_get_vf_config() warn: can 'vf' underflow 's32min-65534'
      drivers/net/ethernet/emulex/benet/be_main.c:1922 be_set_vf_tvt() warn: can 'vf' underflow 's32min-65534'
      drivers/net/ethernet/emulex/benet/be_main.c:1951 be_clear_vf_tvt() warn: can 'vf' underflow 's32min-65534'
      drivers/net/ethernet/emulex/benet/be_main.c:2063 be_set_vf_tx_rate() warn: can 'vf' underflow 's32min-65534'
      drivers/net/ethernet/emulex/benet/be_main.c:2091 be_set_vf_link_state() warn: can 'vf' underflow 's32min-65534'
      drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c:2609 ice_set_vf_port_vlan() warn: can 'vf_id' underflow 's32min-65534'
      drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c:3050 ice_get_vf_cfg() warn: can 'vf_id' underflow 's32min-65534'
      drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c:3103 ice_set_vf_spoofchk() warn: can 'vf_id' underflow 's32min-65534'
      drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c:3181 ice_set_vf_mac() warn: can 'vf_id' underflow 's32min-65534'
      drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c:3237 ice_set_vf_trust() warn: can 'vf_id' underflow 's32min-65534'
      drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c:3286 ice_set_vf_link_state() warn: can 'vf_id' underflow 's32min-65534'
      drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:3919 i40e_validate_vf() warn: can 'vf_id' underflow 's32min-2147483646'
      drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:3957 i40e_ndo_set_vf_mac() warn: can 'vf_id' underflow 's32min-2147483646'
      drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:4104 i40e_ndo_set_vf_port_vlan() warn: can 'vf_id' underflow 's32min-2147483646'
      drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:4263 i40e_ndo_set_vf_bw() warn: can 'vf_id' underflow 's32min-2147483646'
      drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:4309 i40e_ndo_get_vf_config() warn: can 'vf_id' underflow 's32min-2147483646'
      drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:4371 i40e_ndo_set_vf_link_state() warn: can 'vf_id' underflow 's32min-2147483646'
      drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:4441 i40e_ndo_set_vf_spoofchk() warn: can 'vf_id' underflow 's32min-2147483646'
      drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:4441 i40e_ndo_set_vf_spoofchk() warn: can 'vf_id' underflow 's32min-2147483646'
      drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:4504 i40e_ndo_set_vf_trust() warn: can 'vf_id' underflow 's32min-2147483646'
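
      Rather than patching all 48 functions, bound the value once in the
      rtnetlink core; a sketch of the added guard in do_setvfinfo() (the
      same check is repeated for each IFLA_VF_* attribute):

        if (tb[IFLA_VF_MAC]) {
        	struct ifla_vf_mac *ivm = nla_data(tb[IFLA_VF_MAC]);

        	if (ivm->vf >= INT_MAX)
        		return -EINVAL;	/* would turn negative in drivers */
        	/* ... */
        }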
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff08ddba
    • net/core: Add support for getting VF GUIDs · 30aad417
      Danit Goldberg committed
      Introduce a new ndo, ndo_get_vf_guid, to get the port and node GUIDs
      from the net device.
      
      New applications can choose to use this interface to show
      GUIDs in iproute2, with commands such as:
      
      - ip link show ib4
      ib4: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256
      link/infiniband 00:00:0a:2d:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:44:36:8d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      vf 0     link/infiniband 00:00:0a:2d:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:44:36:8d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff,
      spoof checking off, NODE_GUID 22:44:33:00:33:11:00:33, PORT_GUID 10:21:33:12:00:11:22:10, link-state disable, trust off, query_rss off
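
      The shape of the new hook, as I read the change (a sketch):

        /* include/linux/netdevice.h */
        int	(*ndo_get_vf_guid)(struct net_device *dev, int vf,
        			   struct ifla_vf_guid *node_guid,
        			   struct ifla_vf_guid *port_guid);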
      Signed-off-by: Danit Goldberg <danitg@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      30aad417
  15. 22 November 2019 (1 commit)
  16. 21 November 2019 (5 commits)
    • net-sysfs: fix netdev_queue_add_kobject() breakage · 48a322b6
      Eric Dumazet committed
      kobject_put() should only be called in the error path.
      
      Fixes: b8eb7183 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jouni Hogander <jouni.hogander@unikie.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      48a322b6
    • net: page_pool: add the possibility to sync DMA memory for device · e68bc756
      Lorenzo Bianconi committed
      Introduce the following parameters in order to add the possibility to
      sync DMA memory for the device before putting allocated pages in the
      page_pool caches:
      - PP_FLAG_DMA_SYNC_DEV: if set in page_pool_params flags, all pages that
        the driver gets from page_pool will be DMA-synced-for-device according
        to the length provided by the device driver. Please note that
        DMA-sync-for-CPU is still the device driver's responsibility
      - offset: DMA address offset where the DMA engine starts copying rx data
      - max_len: maximum DMA memory size page_pool is allowed to flush. This
        is currently used in the __page_pool_alloc_pages_slow routine when
        pages are allocated from the page allocator
      These parameters are supposed to be set by device drivers.
      
      This optimization reduces the length of the DMA-sync-for-device.
      The optimization is valid because pages are initially
      DMA-synced-for-device as defined via max_len. At RX time, the driver
      will perform a DMA-sync-for-CPU on the memory for the packet length.
      What matters is the memory occupied by the packet payload, because
      that is the area the CPU is allowed to read and modify. As we don't
      track cache lines written into by the CPU, simply use the packet payload
      length as dma_sync_size at page_pool recycle time. This also takes any
      tail extension into account. From the driver side, the setup looks
      roughly like the sketch below.
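
        /* illustrative driver-side setup; values are placeholders */
        struct page_pool_params pp_params = {
        	.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
        	.order		= 0,
        	.pool_size	= ring_size,
        	.nid		= dev_to_node(dev),
        	.dev		= dev,
        	.dma_dir	= DMA_FROM_DEVICE,
        	.offset		= XDP_PACKET_HEADROOM,	/* HW write start */
        	.max_len	= PAGE_SIZE - XDP_PACKET_HEADROOM, /* max sync */
        };

        pool = page_pool_create(&pp_params);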
      Tested-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e68bc756
    • net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject · b8eb7183
      Jouni Hogander committed
      kobject_init_and_add() takes a reference even when it fails. This
      reference has to be given up by the caller in its error handling;
      otherwise the memory allocated by kobject_init_and_add() is never
      freed. Originally found by Syzkaller:
      
      BUG: memory leak
      unreferenced object 0xffff8880679f8b08 (size 8):
        comm "netdev_register", pid 269, jiffies 4294693094 (age 12.132s)
        hex dump (first 8 bytes):
          72 78 2d 30 00 36 20 d4                          rx-0.6 .
        backtrace:
          [<000000008c93818e>] __kmalloc_track_caller+0x16e/0x290
          [<000000001f2e4e49>] kvasprintf+0xb1/0x140
          [<000000007f313394>] kvasprintf_const+0x56/0x160
          [<00000000aeca11c8>] kobject_set_name_vargs+0x5b/0x140
          [<0000000073a0367c>] kobject_init_and_add+0xd8/0x170
          [<0000000088838e4b>] net_rx_queue_update_kobjects+0x152/0x560
          [<000000006be5f104>] netdev_register_kobject+0x210/0x380
          [<00000000e31dab9d>] register_netdevice+0xa1b/0xf00
          [<00000000f68b2465>] __tun_chr_ioctl+0x20d5/0x3dd0
          [<000000004c50599f>] tun_chr_ioctl+0x2f/0x40
          [<00000000bbd4c317>] do_vfs_ioctl+0x1c7/0x1510
          [<00000000d4c59e8f>] ksys_ioctl+0x99/0xb0
          [<00000000946aea81>] __x64_sys_ioctl+0x78/0xb0
          [<0000000038d946e5>] do_syscall_64+0x16f/0x580
          [<00000000e0aa5d8f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
          [<00000000285b3d1a>] 0xffffffffffffffff
      
      Cc: David Miller <davem@davemloft.net>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Signed-off-by: Jouni Hogander <jouni.hogander@unikie.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b8eb7183
    • page_pool: Don't recycle non-reusable pages · d5394610
      Saeed Mahameed committed
      A page is NOT reusable when at least one of the following is true:
      1) it was allocated when the system was under some pressure
         (page_is_pfmemalloc())
      2) it belongs to a different NUMA node than pool->p.nid.
      
      To update pool->p.nid, users should call page_pool_update_nid().
      
      Holding on to such pages in the pool hurts consumer performance
      when the pool migrates to a different NUMA node.
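
      The check itself is a two-condition predicate; a sketch:

        /* false => give the page back to the page allocator instead of
         * recycling it into the pool */
        static bool pool_page_reusable(struct page_pool *pool,
        			       struct page *page)
        {
        	return !page_is_pfmemalloc(page) &&
        	       page_to_nid(page) == pool->p.nid;
        }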
      
      Performance testing:
      XDP drop/tx rate and TCP single/multi stream, on the mlx5 driver,
      while migrating the rx ring IRQ from a close to a far NUMA node:
      
      mlx5 internal page cache was locally disabled to get pure page pool
      results.
      
      CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
      NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G)
      
      XDP Drop/TX single core:
      NUMA  | XDP  | Before    | After
      ---------------------------------------
      Close | Drop | 11   Mpps | 10.9 Mpps
      Far   | Drop | 4.4  Mpps | 5.8  Mpps
      
      Close | TX   | 6.5 Mpps  | 6.5 Mpps
      Far   | TX   | 3.5 Mpps  | 4  Mpps
      
      The improvement is about 30% in drop packet rate and 15% in tx packet
      rate for the far-NUMA test, with no degradation in the close-NUMA
      tests.
      
      TCP single/multi cpu/stream:
      NUMA  | #cpu | Before  | After
      --------------------------------------
      Close | 1    | 18 Gbps | 18 Gbps
      Far   | 1    | 15 Gbps | 18 Gbps
      Close | 12   | 80 Gbps | 80 Gbps
      Far   | 12   | 68 Gbps | 80 Gbps
      
      In all test cases we see improvement for the far numa case, and no
      impact on the close numa case.
      
      The impact of adding a per-page check is negligible and shows no
      performance degradation whatsoever. Functionality-wise, it also seems
      more correct and more robust for the page pool to verify whether pages
      should be recycled, since the page pool can't guarantee where pages are
      coming from.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d5394610
    • page_pool: Add API to update numa node · bc836748
      Saeed Mahameed committed
      Add page_pool_update_nid() to be called by page pool consumers when
      they detect NUMA node changes.
      
      It will update the page pool nid value so that allocation starts from
      the new effective NUMA node.
      
      This mitigates the page pool allocating pages from the wrong NUMA node
      (the one where the pool was originally allocated) and holding on to
      pages that belong to a different NUMA node, which causes performance
      degradation.
      
      For pages that are already in flight and may be returned to the pool by
      the consumer, the next patch adds a per-page check that avoids recycling
      them into the pool and returns them to the page allocator instead.
      A sketch of the new API:
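
        /* a sketch; the real function may also emit a tracepoint */
        void page_pool_update_nid(struct page_pool *pool, int new_nid)
        {
        	/* subsequent allocations come from the new effective node */
        	pool->p.nid = new_nid;
        }
        EXPORT_SYMBOL(page_pool_update_nid);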
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc836748
  17. 19 November 2019 (2 commits)
    • page_pool: add destroy attempts counter and rename tracepoint · 7c9e6942
      Jesper Dangaard Brouer committed
      When Jonathan changed the page_pool to become responsible for its
      own shutdown via a deferred work queue, the disconnect_cnt
      counter was removed from the xdp memory model tracepoint.
      
      This patch changes the page_pool_inflight tracepoint name to
      page_pool_release, because that reflects the new responsibility
      better. It also reintroduces a counter that reflects the number of
      times page_pool_release has been tried.
      
      The counter is also used by the code to only empty the alloc
      cache once. With a stuck work queue running every second and the
      counter being 64-bit, it will overrun in approx. 584 billion
      years. For comparison, Earth's life expectancy is 7.5 billion
      years, before the Sun engulfs, and destroys, it.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c9e6942
    • xdp: remove memory poison on free for struct xdp_mem_allocator · c491eae8
      Jesper Dangaard Brouer committed
      When looking at the details, I realised that the memory poisoning in
      __xdp_mem_allocator_rcu_free doesn't make sense. This is because the
      SLUB allocator uses the first 16 bytes (on 64-bit) for its freelist,
      which overlaps with the members of struct xdp_mem_allocator that were
      being updated. Thus, SLUB already does the "poisoning" for us.
      
      I still believe that poisoning memory makes sense in other cases.
      The kernel has gained various use-after-free detection mechanisms, but
      enabling them comes with a huge overhead. Experience shows that
      debugging facilities can change the timing so much that a race
      condition will not be provoked when they are enabled. Thus, I'm still
      in favour of poisoning memory where it makes sense.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c491eae8
  18. 18 November 2019 (1 commit)
    • bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails · 1e0bd5a0
      Andrii Nakryiko committed
      92117d84 ("bpf: fix refcnt overflow") turned refcounting of bpf_map
      into a potentially failing operation once the refcount reaches the
      BPF_MAX_REFCNT limit (32k). With a 32-bit counter, it was possible in
      practice to overflow the refcounter and make it wrap around to 0,
      causing an erroneous map free while there were still references to it,
      leading to use-after-free problems.
      
      But having failing refcounting operations is problematic in some cases.
      One example is the mmap() interface. After establishing an initial
      memory mapping, the user is allowed to arbitrarily map/remap/unmap
      parts of the mapped memory, arbitrarily splitting it into multiple
      non-contiguous regions. All of this happens without any control from
      the users of the mmap subsystem. Rather, the mmap subsystem sends
      notifications to the original creator of the memory mapping through
      open/close callbacks, which are optionally specified during initial
      memory mapping creation. These callbacks are used to maintain an
      accurate refcount for bpf_map (see the next patch in this series). The
      problem is that the open() callback is not supposed to fail, because
      the memory-mapped resource is set up and properly referenced. This
      poses a problem for using memory mapping with BPF maps.
      
      One solution to this is to maintain a separate refcount for just memory
      mappings and do a single bpf_map_inc/bpf_map_put when it goes from/to
      zero, respectively. There are similar use cases in the current work on
      tcp-bpf, necessitating an extra counter as well. This seems like a
      rather unfortunate and ugly solution that doesn't scale well to various
      new use cases.
      
      Another approach to solve this is to use the non-failing refcount_t
      type, which uses a 32-bit counter internally but, once reaching the
      overflow state at UINT_MAX, stays there. This ultimately causes a
      memory leak, but prevents use-after-free.
      
      But given that refcounting is not the most performance-critical
      operation with BPF maps (it's not used from running BPF program code),
      we can also just switch to a 64-bit counter that can't overflow in
      practice, potentially disadvantaging 32-bit platforms a tiny bit. This
      simplifies the semantics and allows the scenarios described above to
      not worry about a failing refcount increment operation.
      
      In terms of struct bpf_map size, we are still good and use the same amount of
      space:
      
      BEFORE (3 cache lines, 8 bytes of padding at the end):
      struct bpf_map {
      	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
      	struct bpf_map *           inner_map_meta;       /*     8     8 */
      	void *                     security;             /*    16     8 */
      	enum bpf_map_type  map_type;                     /*    24     4 */
      	u32                        key_size;             /*    28     4 */
      	u32                        value_size;           /*    32     4 */
      	u32                        max_entries;          /*    36     4 */
      	u32                        map_flags;            /*    40     4 */
      	int                        spin_lock_off;        /*    44     4 */
      	u32                        id;                   /*    48     4 */
      	int                        numa_node;            /*    52     4 */
      	u32                        btf_key_type_id;      /*    56     4 */
      	u32                        btf_value_type_id;    /*    60     4 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	struct btf *               btf;                  /*    64     8 */
      	struct bpf_map_memory memory;                    /*    72    16 */
      	bool                       unpriv_array;         /*    88     1 */
      	bool                       frozen;               /*    89     1 */
      
      	/* XXX 38 bytes hole, try to pack */
      
      	/* --- cacheline 2 boundary (128 bytes) --- */
      	atomic_t                   refcnt __attribute__((__aligned__(64))); /*   128     4 */
      	atomic_t                   usercnt;              /*   132     4 */
      	struct work_struct work;                         /*   136    32 */
      	char                       name[16];             /*   168    16 */
      
      	/* size: 192, cachelines: 3, members: 21 */
      	/* sum members: 146, holes: 1, sum holes: 38 */
      	/* padding: 8 */
      	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
      } __attribute__((__aligned__(64)));
      
      AFTER (same 3 cache lines, no extra padding now):
      struct bpf_map {
      	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
      	struct bpf_map *           inner_map_meta;       /*     8     8 */
      	void *                     security;             /*    16     8 */
      	enum bpf_map_type  map_type;                     /*    24     4 */
      	u32                        key_size;             /*    28     4 */
      	u32                        value_size;           /*    32     4 */
      	u32                        max_entries;          /*    36     4 */
      	u32                        map_flags;            /*    40     4 */
      	int                        spin_lock_off;        /*    44     4 */
      	u32                        id;                   /*    48     4 */
      	int                        numa_node;            /*    52     4 */
      	u32                        btf_key_type_id;      /*    56     4 */
      	u32                        btf_value_type_id;    /*    60     4 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	struct btf *               btf;                  /*    64     8 */
      	struct bpf_map_memory memory;                    /*    72    16 */
      	bool                       unpriv_array;         /*    88     1 */
      	bool                       frozen;               /*    89     1 */
      
      	/* XXX 38 bytes hole, try to pack */
      
      	/* --- cacheline 2 boundary (128 bytes) --- */
      	atomic64_t                 refcnt __attribute__((__aligned__(64))); /*   128     8 */
      	atomic64_t                 usercnt;              /*   136     8 */
      	struct work_struct work;                         /*   144    32 */
      	char                       name[16];             /*   176    16 */
      
      	/* size: 192, cachelines: 3, members: 21 */
      	/* sum members: 154, holes: 1, sum holes: 38 */
      	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
      } __attribute__((__aligned__(64)));
      
      This patch, while modifying all users of bpf_map_inc, also cleans up
      its interface to match bpf_map_put, with separate operations for
      bpf_map_inc and bpf_map_inc_with_uref (matching bpf_map_put and
      bpf_map_put_with_uref, respectively). Also, given there are no users of
      bpf_map_inc_not_zero specifying uref=true, remove the uref flag and
      default to uref=false internally.
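
      The resulting increment paths (a sketch):

        /* can no longer fail, so it returns void instead of a map pointer */
        void bpf_map_inc(struct bpf_map *map)
        {
        	atomic64_inc(&map->refcnt);
        }

        void bpf_map_inc_with_uref(struct bpf_map *map)
        {
        	atomic64_inc(&map->refcnt);
        	atomic64_inc(&map->usercnt);
        }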
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-2-andriin@fb.com
      1e0bd5a0
  19. 17 November 2019 (2 commits)
    • net: core: allow fast GRO for skbs with Ethernet header in head · 8aef998d
      Alexander Lobakin committed
      Commit 78d3fd0b ("gro: Only use skb_gro_header for completely
      non-linear packets") back in May '09 (v2.6.31-rc1) changed the
      original condition '!skb_headlen(skb)' to
      'skb->mac_header == skb->tail' in gro_reset_offset(), saying: "Since
      the drivers that need this optimisation all provide completely
      non-linear packets". (Note that this condition later became the current
      'skb_mac_header(skb) == skb_tail_pointer(skb)' with commit
      ced14f68 ("net: Correct comparisons and calculations using
      skb->tail and skb->transport_header"), without any functional changes.)
      
      For now, we have the following rough statistics for v5.4-rc7:
      1) napi_gro_frags: 14
      2) napi_gro_receive with skb->head containing (most of) payload: 83
      3) napi_gro_receive with skb->head containing all the headers: 20
      4) napi_gro_receive with skb->head containing only Ethernet header: 2
      
      With the current condition, fast GRO with the usage of
      NAPI_GRO_CB(skb)->frag0 is available only in the [1] case.
      Packets pushed by [2] and [3] go through the 'slow' path, but
      it's not a problem for them as they already contain all the needed
      headers in skb->head, so pskb_may_pull() only moves skb->data.
      
      The layout of skbs in the fourth [4] case at the moment of
      dev_gro_receive() is identical to skbs that have come through [1],
      as napi_frags_skb() pulls Ethernet header to skb->head. The only
      difference is that the mentioned condition is always false for them,
      because skb_put() and friends irreversibly alter the tail pointer.
      They also go through the 'slow' path, but now every single
      pskb_may_pull() in every single .gro_receive() will call the *really*
      slow __pskb_pull_tail() to pull headers to head. This significantly
      decreases the overall performance for no visible reasons.
      
      The only two users of method [4] are:
      * drivers/staging/qlge
      * drivers/net/wireless/iwlwifi (all three variants: dvm, mvm, mvm-mq)
      
      Note that in the case of the wireless drivers we can't use [1]
      (napi_gro_frags()), at least for now, and the mac80211 stack always
      performs pushes and pulls anyway, so the performance hit is unavoidable.
      
      At the time of v2.6.31 the mentioned change was necessary (that's
      why I don't add the "Fixes:" tag), but it became obsolete once
      skb_gro_mac_header() went away in commit a50e233c ("net-gro:
      restore frag0 optimization"), so we can simply revert the condition
      in gro_reset_offset() to allow skbs from [4] to go through the 'fast'
      path just like in case [1].
      
      This was tested on a 600 MHz MIPS CPU with a custom driver, and this
      patch gave boosts of up to 40 Mbps to method [4] in both directions
      compared to net-next, which made overall performance relatively
      close to [1] (without it, [4] is the slowest).
      
      v2:
      - Add more references and explanations to commit message
      - Fix some typos ibid
      - No functional changes
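
      The change in gro_reset_offset() (skb_gro_reset_offset() in today's
      tree) is effectively a revert of the condition; a sketch:

        static void skb_gro_reset_offset(struct sk_buff *skb)
        {
        	const struct skb_shared_info *pinfo = skb_shinfo(skb);
        	const skb_frag_t *frag0 = &pinfo->frags[0];

        	NAPI_GRO_CB(skb)->data_offset = 0;
        	NAPI_GRO_CB(skb)->frag0 = NULL;
        	NAPI_GRO_CB(skb)->frag0_len = 0;

        	/* was: skb_mac_header(skb) == skb_tail_pointer(skb) */
        	if (!skb_headlen(skb) && pinfo->nr_frags &&
        	    !PageHighMem(skb_frag_page(frag0))) {
        		NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
        		NAPI_GRO_CB(skb)->frag0_len = skb_frag_size(frag0);
        	}
        }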
      Signed-off-by: Alexander Lobakin <alobakin@dlink.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8aef998d
    • page_pool: do not release pool until inflight == 0. · c3f812ce
      Jonathan Lemon committed
      The page pool keeps track of the number of pages in flight, and
      it isn't safe to remove the pool until all pages are returned.
      
      Disallow removing the pool until all pages are back, so the pool
      is always available for page producers.
      
      Make the page pool responsible for its own delayed destruction
      instead of relying on XDP, so the page pool can be used without
      the xdp memory model.
      
      When all pages are returned, free the pool and notify xdp if the
      pool is registered with the xdp memory system.  Have the callback
      perform a table walk since some drivers (cpsw) may share the pool
      among multiple xdp_rxq_info.
      
      Note that the increment of pages_state_release_cnt may result in
      inflight == 0, in which case the pool is released.
      
      Fixes: d956a048 ("xdp: force mem allocator removal and periodic warning")
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3f812ce
  20. 16 November 2019 (2 commits)
  21. 15 November 2019 (1 commit)
  22. 13 November 2019 (3 commits)
    • devlink: Allow large formatted message of binary output · e2cde864
      Aya Levin committed
      Devlink supports paired output of a name and a value. When the value is
      binary, it must be presented as an array. If the length of the binary
      value exceeds the fmsg limit, break the value into chunks internally.
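
      Internally, the chunking is a simple loop over the value; a sketch
      (the DEVLINK_FMSG_MAX_SIZE limit name is assumed):

        int devlink_fmsg_binary_pair_put(struct devlink_fmsg *fmsg,
        				 const char *name, const void *value,
        				 u32 value_len)
        {
        	u32 data_size;
        	u32 offset;
        	int err;

        	err = devlink_fmsg_arr_pair_nest_start(fmsg, name);
        	if (err)
        		return err;

        	for (offset = 0; offset < value_len; offset += data_size) {
        		data_size = min_t(u32, DEVLINK_FMSG_MAX_SIZE,
        				  value_len - offset);
        		err = devlink_fmsg_binary_put(fmsg, value + offset,
        					      data_size);
        		if (err)
        			return err;
        	}

        	return devlink_fmsg_arr_pair_nest_end(fmsg);
        }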
      Signed-off-by: Aya Levin <ayal@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e2cde864
    • cgroup: use cgrp->kn->id as the cgroup ID · 74321038
      Tejun Heo committed
      The cgroup ID is currently allocated using a dedicated per-hierarchy
      idr, used internally, and exposed through tracepoints and bpf. This is
      confusing because there are tracepoints and other interfaces that use
      the cgroupfs ino as the ID.
      
      The preceding changes exposed kn->id as the ino: a full 64-bit ino on
      supported archs, or ino+gen (low 32 bits as ino, high 32 bits as gen)
      elsewhere. There's no reason for cgroup to use different IDs. The
      kernfs IDs are unique, and userland can easily discover them and map
      them back to paths using standard file operations.
      
      This patch replaces cgroup IDs with kernfs IDs.
      
      * cgroup_id() is added and all cgroup ID users are converted to use it.
      
      * kernfs_node creation is moved to earlier during cgroup init so that
        cgroup_id() is available during init.
      
      * While at it, s/cgroup/cgrp/ in psi helpers for consistency.
      
      * Fallback ID value is changed to 1 to be consistent with root cgroup
        ID.
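
      The new accessor is deliberately thin (a sketch):

        static inline u64 cgroup_id(const struct cgroup *cgrp)
        {
        	return cgrp->kn->id;
        }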
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      74321038
    • kernfs: convert kernfs_node->id from union kernfs_node_id to u64 · 67c0496e
      Tejun Heo committed
      kernfs_node->id is currently a union kernfs_node_id, which represents
      either a 32-bit (ino, gen) pair or a u64 value. I can't see much value
      in the use of the union - all that's needed is a 64-bit ID, which the
      current code is already limited to. Using the union makes the code
      unnecessarily complicated and prevents using 64-bit inos without adding
      practical benefits.
      
      This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
      The ino is stored in the lower 32 bits and the gen in the upper 32.
      Accessors - kernfs[_id]_ino() and kernfs[_id]_gen() - are added to
      retrieve the ino and gen. This makes ID handling less cumbersome and
      will allow using 64-bit inos on supported archs.
      
      This patch doesn't make any functional changes.
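
      A sketch of the layout and the accessors:

        /* id: lower 32 bits ino, upper 32 bits gen */
        static inline ino_t kernfs_id_ino(u64 id)
        {
        	/* archs with a 64-bit ino_t can use the whole value */
        	if (sizeof(ino_t) >= sizeof(u64))
        		return id;
        	else
        		return (u32)id;
        }

        static inline u32 kernfs_id_gen(u64 id)
        {
        	/* gen is meaningless when the full id fits in ino */
        	if (sizeof(ino_t) >= sizeof(u64))
        		return 1;
        	else
        		return id >> 32;
        }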
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      67c0496e