1. 10 Jan, 2020 4 commits
  2. 09 Jan, 2020 2 commits
  3. 03 Jan, 2020 2 commits
    • J
      page_pool: help compiler remove code in case CONFIG_NUMA=n · f13fc107
      Committed by Jesper Dangaard Brouer
      When the kernel is compiled without NUMA support, the page_pool NUMA
      config setting (pool->p.nid) doesn't make any practical sense, but the
      compiler cannot see that it can remove the related code paths.
      
      This patch avoids reading the pool->p.nid setting in case of !CONFIG_NUMA,
      in the allocation and NUMA check code, which helps the compiler see the
      optimisation potential. It leaves the update code intact to keep the API
      the same.
      
       $ ./scripts/bloat-o-meter net/core/page_pool.o-numa-enabled \
                                 net/core/page_pool.o-numa-disabled
       add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-113 (-113)
       Function                                     old     new   delta
       page_pool_create                             401     398      -3
       __page_pool_alloc_pages_slow                 439     426     -13
       page_pool_refill_alloc_cache                 425     328     -97
       Total: Before=3611, After=3498, chg -3.13%
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f13fc107
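      The idea in the commit above can be illustrated with a minimal userspace sketch. It is not the kernel patch: `SKETCH_NUMA`, `struct sketch_pool` and the `sketch_*` helpers are hypothetical stand-ins for CONFIG_NUMA, the pool configuration and the node lookups. The sketch assumes that without NUMA every node lookup is the constant 0, so the node comparison folds to a constant and the compiler can drop the mismatch path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical switch: build with -DSKETCH_NUMA=1 to model CONFIG_NUMA=y. */
#ifndef SKETCH_NUMA
#define SKETCH_NUMA 0
#endif

#define SKETCH_NO_NODE (-1)

struct sketch_pool {
	int nid;	/* still stored and updatable, so the API stays the same */
};

/* Preferred node: only reads pool->nid when NUMA support is compiled in. */
static inline int sketch_pref_nid(const struct sketch_pool *pool, int cpu_nid)
{
	assert(pool != NULL);
#if SKETCH_NUMA
	return pool->nid == SKETCH_NO_NODE ? cpu_nid : pool->nid;
#else
	(void)cpu_nid;
	return 0;	/* single node: a compile-time constant */
#endif
}

/* Node a page lives on: also a constant when NUMA is disabled. */
static inline int sketch_page_nid(int raw_nid)
{
#if SKETCH_NUMA
	return raw_nid;
#else
	(void)raw_nid;
	return 0;
#endif
}

/* With SKETCH_NUMA == 0 this reduces to "0 == 0", i.e. always true. */
static inline bool sketch_page_matches(const struct sketch_pool *pool,
				       int raw_page_nid, int cpu_nid)
{
	return sketch_page_nid(raw_page_nid) == sketch_pref_nid(pool, cpu_nid);
}
```

      With the default `SKETCH_NUMA=0`, `sketch_page_matches()` is true for any input, whatever value was stored in `pool->nid`.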
    • J
      page_pool: handle page recycle for NUMA_NO_NODE condition · 44768dec
      Committed by Jesper Dangaard Brouer
      The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
      not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.
      
      The goal of the NUMA changes in commit d5394610 ("page_pool: Don't
      recycle non-reusable pages") was to have RX-pages that belong to the
      same NUMA node as the CPU processing the RX-packet during softirq/NAPI,
      as illustrated by the performance measurements.
      
      This patch moves the NAPI checks out of fast-path, and at the same time
      solves the NUMA_NO_NODE issue.
      
      First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
      will look up the current CPU nid (NUMA ID) via numa_mem_id(), which is
      used as the preferred nid.  It is only in rare situations, where
      e.g. a NUMA zone runs dry, that a page doesn't get allocated from the
      preferred nid.  The page_pool API allows drivers to control the nid
      themselves via controlling pool->p.nid.
      
      This patch moves the NAPI check to when alloc cache is refilled, via
      dequeuing/consuming pages from the ptr_ring. Thus, we can allow placing
      pages from remote NUMA into the ptr_ring, as the dequeue/consume step
      will check the NUMA node. All current drivers using page_pool will
      alloc/refill RX-ring from same CPU running softirq/NAPI process.
      
      Drivers that control the nid explicitly also use page_pool_update_nid
      when changing the nid at runtime.  To speed up the transition to the new
      nid, the alloc cache is now flushed on nid changes.  This forces pages
      to come from the ptr_ring, which does the appropriate nid check.
      
      For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
      node, we accept that transitioning the alloc cache doesn't happen
      immediately. The preferred nid changes at runtime by consulting
      numa_mem_id() based on the CPU processing RX-packets.
      
      Notice, to avoid stressing the page buddy allocator and avoid doing too
      much work under softirq with preempt disabled, the NUMA check at
      ptr_ring dequeue will break the refill cycle, when detecting a NUMA
      mismatch. This will cause a slower transition, but it's done on purpose.
      
      Fixes: d5394610 ("page_pool: Don't recycle non-reusable pages")
      Reported-by: Li RongQing <lirongqing@baidu.com>
      Reported-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      44768dec
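      The refill behaviour described in the commit above can be sketched in userspace C. This is a simplified model, not the kernel code: the array-based ring, the `released` counter and all `sketch_*` names are hypothetical, and the sketch assumes that a page failing the node check is handed back to the page allocator before the refill loop stops.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NO_NODE   (-1)
#define SKETCH_CACHE_MAX 8
#define SKETCH_RING_MAX  16

struct sketch_page { int nid; };

struct sketch_pool {
	int nid;				/* SKETCH_NO_NODE: follow the current CPU */
	struct sketch_page *ring[SKETCH_RING_MAX];	/* stand-in for the ptr_ring */
	size_t ring_head, ring_tail;		/* consume at head, produce at tail */
	struct sketch_page *cache[SKETCH_CACHE_MAX];	/* stand-in for the alloc cache */
	size_t cache_count;
	size_t released;			/* pages given back to the allocator */
};

/* Refill the alloc cache from the ring; returns the number of pages cached. */
static size_t sketch_refill(struct sketch_pool *pool, int cpu_nid)
{
	int pref = pool->nid == SKETCH_NO_NODE ? cpu_nid : pool->nid;
	size_t before = pool->cache_count;

	assert(pool->ring_head <= pool->ring_tail);
	while (pool->cache_count < SKETCH_CACHE_MAX &&
	       pool->ring_head < pool->ring_tail) {
		struct sketch_page *page = pool->ring[pool->ring_head++];

		if (page->nid != pref) {
			/* Node mismatch: release the page and stop refilling,
			 * which bounds the work done per softirq round. */
			pool->released++;
			break;
		}
		pool->cache[pool->cache_count++] = page;
	}
	return pool->cache_count - before;
}

/* Changing the nid flushes the cache, so later pages pass the ring check. */
static void sketch_update_nid(struct sketch_pool *pool, int new_nid)
{
	pool->nid = new_nid;
	pool->released += pool->cache_count;
	pool->cache_count = 0;
}
```

      Because the loop breaks on the first mismatch, a ring holding many remote pages drains one page per refill call, which is the deliberately slow transition the commit message mentions.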
  4. 26 Dec, 2019 2 commits
  5. 25 Dec, 2019 2 commits
  6. 20 Dec, 2019 6 commits
  7. 18 Dec, 2019 3 commits
  8. 14 Dec, 2019 2 commits
  9. 13 Dec, 2019 2 commits
  10. 10 Dec, 2019 2 commits
  11. 08 Dec, 2019 1 commit
    • E
      inet: protect against too small mtu values. · 501a90c9
      Committed by Eric Dumazet
      syzbot was once again able to crash a host by setting a very small mtu
      on loopback device.
      
      Let's make inetdev_valid_mtu() available in include/net/ip.h,
      and use it in ip_setup_cork(), so that we protect both ip_append_page()
      and __ip_append_data().
      
      Also add a READ_ONCE() when the device mtu is read.
      
      Pair this lockless read with one WRITE_ONCE() in __dev_set_mtu(),
      even if other code paths might write over this field.
      
      Add a big comment in include/linux/netdevice.h about dev->mtu
      needing READ_ONCE()/WRITE_ONCE() annotations.
      
      Hopefully we will add the missing ones in followup patches.
      
      [1]
      
      refcount_t: saturated; leaking memory.
      WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       panic+0x2e3/0x75c kernel/panic.c:221
       __warn.cold+0x2f/0x3e kernel/panic.c:582
       report_bug+0x289/0x300 lib/bug.c:195
       fixup_bug arch/x86/kernel/traps.c:174 [inline]
       fixup_bug arch/x86/kernel/traps.c:169 [inline]
       do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
       do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
       invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
      RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd <0f> 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89
      RSP: 0018:ffff88809689f550 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c
      RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1
      R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001
      R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40
       refcount_add include/linux/refcount.h:193 [inline]
       skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999
       sock_wmalloc+0xf1/0x120 net/core/sock.c:2096
       ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383
       udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276
       inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821
       kernel_sendpage+0x92/0xf0 net/socket.c:3794
       sock_sendpage+0x8b/0xc0 net/socket.c:936
       pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458
       splice_from_pipe_feed fs/splice.c:512 [inline]
       __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636
       splice_from_pipe+0x108/0x170 fs/splice.c:671
       generic_splice_sendpage+0x3c/0x50 fs/splice.c:842
       do_splice_from fs/splice.c:861 [inline]
       direct_splice_actor+0x123/0x190 fs/splice.c:1035
       splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990
       do_splice_direct+0x1da/0x2a0 fs/splice.c:1078
       do_sendfile+0x597/0xd00 fs/read_write.c:1464
       __do_sys_sendfile64 fs/read_write.c:1525 [inline]
       __se_sys_sendfile64 fs/read_write.c:1511 [inline]
       __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x441409
      Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409
      RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005
      RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010
      R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180
      R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 1470ddf7 ("inet: Remove explicit write references to sk/inet in ip_append_data")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      501a90c9
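      The validation described in the commit above can be approximated in userspace as follows. This is a sketch under assumptions, not the kernel source: the `sketch_*` names are hypothetical, the volatile accesses only imitate READ_ONCE()/WRITE_ONCE(), and the sketch assumes the lower bound is 68 bytes (the IPv4 minimum MTU) and that an invalid MTU is reported as -ENETUNREACH.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_IPV4_MIN_MTU 68U	/* assumed lower bound for a usable IPv4 MTU */

struct sketch_dev {
	unsigned int mtu;	/* may be changed concurrently by another thread */
};

static bool sketch_valid_mtu(unsigned int mtu)
{
	return mtu >= SKETCH_IPV4_MIN_MTU;
}

/* Writer side: a single volatile store, in the spirit of WRITE_ONCE(). */
static void sketch_set_mtu(struct sketch_dev *dev, unsigned int mtu)
{
	*(volatile unsigned int *)&dev->mtu = mtu;
}

/* Reader side: load the MTU exactly once, validate it, then use that copy. */
static int sketch_setup_cork(const struct sketch_dev *dev, unsigned int *fragsize)
{
	unsigned int mtu;

	assert(dev != NULL && fragsize != NULL);
	mtu = *(const volatile unsigned int *)&dev->mtu;
	if (!sketch_valid_mtu(mtu))
		return -ENETUNREACH;	/* refuse to build fragments for a tiny MTU */
	*fragsize = mtu;
	return 0;
}
```

      Validating the locally loaded copy matters: re-reading `dev->mtu` after the check could observe a different, smaller value written in between.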
  12. 07 Dec, 2019 3 commits
    • J
      net: core: rename indirect block ingress cb function · dbad3408
      Committed by John Hurley
      With indirect blocks, a driver can register for callbacks from a device
      that it does not 'own', for example, a tunnel device. When registering to
      or unregistering from a new device, a callback is triggered to generate
      a bind/unbind event. This, in turn, allows the driver to receive any
      existing rules or to properly clean up installed rules.
      
      When first added, it was assumed that all indirect block registrations
      would be for ingress offloads. However, the NFP driver can, in some
      instances, support clsact qdisc binds for egress offload.
      
      Change the name of the indirect block callback command in flow_offload to
      remove the 'ingress' identifier from it. While this does not change
      functionality, a follow-up patch will implement a more generic
      callback than the current ones, which only support ingress offload.
      
      Fixes: 4d12ba42 ("nfp: flower: allow offloading of matches on 'internal' ports")
      Signed-off-by: John Hurley <john.hurley@netronome.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dbad3408
    • J
      net-sysfs: Call dev_hold always in netdev_queue_add_kobject · e0b60903
      Committed by Jouni Hogander
      dev_hold() must always be called in netdev_queue_add_kobject().
      Otherwise the usage count drops below 0 if kobject_init_and_add()
      fails.
      
      Fixes: b8eb7183 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Miller <davem@davemloft.net>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e0b60903
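      The reference-count balance behind the commit above can be modelled with a small sketch. All names are hypothetical, and the sketch rests on one assumption not spelled out in the commit message: that the failure path ends in a release callback which always drops one device reference. Under that assumption the reference has to be taken before anything can fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_dev { int refcnt; };
struct sketch_queue { struct sketch_dev *dev; };

static void sketch_hold(struct sketch_dev *dev) { dev->refcnt++; }
static void sketch_put(struct sketch_dev *dev)  { dev->refcnt--; }

/* Assumed release callback: unconditionally drops the queue's device ref. */
static void sketch_queue_release(struct sketch_queue *q)
{
	sketch_put(q->dev);
}

/* init_fails simulates a failing kobject registration step. */
static int sketch_add_queue(struct sketch_queue *q, bool init_fails)
{
	assert(q != NULL && q->dev != NULL);

	/* Take the reference first, so the release on the error path
	 * has something to drop. */
	sketch_hold(q->dev);

	if (init_fails) {
		sketch_queue_release(q);	/* count returns to its old value */
		return -1;
	}
	return 0;
}
```

      If the hold were taken only after a successful registration, the failing call would decrement a count it never incremented.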
    • A
      net: dsa: fix flow dissection on Tx path · 8bef0af0
      Committed by Alexander Lobakin
      Commit 43e66528 ("net-next: dsa: fix flow dissection") added an
      ability to override protocol and network offset during flow dissection
      for DSA-enabled devices (i.e. controllers shipped as switch CPU ports)
      in order to fix skb hashing for RPS on Rx path.
      
      However, skb_hash() and added part of code can be invoked not only on
      Rx, but also on Tx path if we have a multi-queued device and:
       - kernel is running on UP system or
       - XPS is not configured.
      
      The call stack in these two cases will look like: dev_queue_xmit() ->
      __dev_queue_xmit() -> netdev_core_pick_tx() -> netdev_pick_tx() ->
      skb_tx_hash() -> skb_get_hash().
      
      The problem is that skbs queued for Tx have both network offset and
      correct protocol already set up even after inserting a CPU tag by DSA
      tagger, so calling tag_ops->flow_dissect() on this path actually only
      breaks flow dissection and hashing.
      
      This can be observed by adding debug prints just before and right after
      tag_ops->flow_dissect() call to the related block of code:
      
      Before the patch:
      
      Rx path (RPS):
      
      [   19.240001] Rx: proto: 0x00f8, nhoff: 0	/* ETH_P_XDSA */
      [   19.244271] tag_ops->flow_dissect()
      [   19.247811] Rx: proto: 0x0800, nhoff: 8	/* ETH_P_IP */
      
      [   19.215435] Rx: proto: 0x00f8, nhoff: 0	/* ETH_P_XDSA */
      [   19.219746] tag_ops->flow_dissect()
      [   19.223241] Rx: proto: 0x0806, nhoff: 8	/* ETH_P_ARP */
      
      [   18.654057] Rx: proto: 0x00f8, nhoff: 0	/* ETH_P_XDSA */
      [   18.658332] tag_ops->flow_dissect()
      [   18.661826] Rx: proto: 0x8100, nhoff: 8	/* ETH_P_8021Q */
      
      Tx path (UP system):
      
      [   18.759560] Tx: proto: 0x0800, nhoff: 26	/* ETH_P_IP */
      [   18.763933] tag_ops->flow_dissect()
      [   18.767485] Tx: proto: 0x920b, nhoff: 34	/* junk */
      
      [   22.800020] Tx: proto: 0x0806, nhoff: 26	/* ETH_P_ARP */
      [   22.804392] tag_ops->flow_dissect()
      [   22.807921] Tx: proto: 0x920b, nhoff: 34	/* junk */
      
      [   16.898342] Tx: proto: 0x86dd, nhoff: 26	/* ETH_P_IPV6 */
      [   16.902705] tag_ops->flow_dissect()
      [   16.906227] Tx: proto: 0x920b, nhoff: 34	/* junk */
      
      After:
      
      Rx path (RPS):
      
      [   16.520993] Rx: proto: 0x00f8, nhoff: 0	/* ETH_P_XDSA */
      [   16.525260] tag_ops->flow_dissect()
      [   16.528808] Rx: proto: 0x0800, nhoff: 8	/* ETH_P_IP */
      
      [   15.484807] Rx: proto: 0x00f8, nhoff: 0	/* ETH_P_XDSA */
      [   15.490417] tag_ops->flow_dissect()
      [   15.495223] Rx: proto: 0x0806, nhoff: 8	/* ETH_P_ARP */
      
      [   17.134621] Rx: proto: 0x00f8, nhoff: 0	/* ETH_P_XDSA */
      [   17.138895] tag_ops->flow_dissect()
      [   17.142388] Rx: proto: 0x8100, nhoff: 8	/* ETH_P_8021Q */
      
      Tx path (UP system):
      
      [   15.499558] Tx: proto: 0x0800, nhoff: 26	/* ETH_P_IP */
      
      [   20.664689] Tx: proto: 0x0806, nhoff: 26	/* ETH_P_ARP */
      
      [   18.565782] Tx: proto: 0x86dd, nhoff: 26	/* ETH_P_IPV6 */
      
      In order to fix that we can add the check 'proto == htons(ETH_P_XDSA)'
      to prevent code from calling tag_ops->flow_dissect() on Tx.
      I also decided to initialize the 'offset' variable so that tagger
      callbacks can now safely leave it untouched without causing chaos.
      
      Fixes: 43e66528 ("net-next: dsa: fix flow dissection")
      Signed-off-by: Alexander Lobakin <alobakin@dlink.ru>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8bef0af0
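      The guard described in the commit above can be sketched as follows. The structure and function names are hypothetical, and the tagger callback is a made-up one that assumes an 8-byte tag in front of an IPv4 packet; only the protocol value 0x00f8 for ETH_P_XDSA and the offsets are taken from the debug output in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>	/* htons() */

#define SKETCH_ETH_P_XDSA 0x00F8	/* value shown in the debug prints above */
#define SKETCH_ETH_P_IP   0x0800

struct sketch_flow {
	uint16_t proto;	/* network byte order */
	int nhoff;	/* network header offset */
};

/* Hypothetical tagger callback: 8-byte tag, IPv4 payload behind it. */
static void sketch_tag_dissect(uint16_t *proto, int *offset)
{
	*proto = htons(SKETCH_ETH_P_IP);
	*offset = 8;
}

static void sketch_dissect(struct sketch_flow *flow)
{
	assert(flow != NULL);

	/* Only frames still carrying the switch tag (Rx) need the tagger;
	 * Tx skbs already have the real protocol and offset set up. */
	if (flow->proto == htons(SKETCH_ETH_P_XDSA)) {
		int offset = 0;	/* initialised, so a tagger may leave it alone */

		sketch_tag_dissect(&flow->proto, &offset);
		flow->nhoff += offset;
	}
}
```

      An Rx flow starting at proto 0x00f8 and nhoff 0 ends at 0x0800 and nhoff 8, while a Tx flow at 0x0800 and nhoff 26 is left unchanged, matching the "After" traces.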
  13. 05 Dec, 2019 3 commits
  14. 04 Dec, 2019 3 commits
    • Y
      cls_flower: Fix the behavior using port ranges with hw-offload · 8ffb055b
      Committed by Yoshiki Komachi
      The recent commit 5c72299f ("net: sched: cls_flower: Classify
      packets using port ranges") added filtering based on port ranges
      to tc flower. However, the commit missed necessary changes in the
      hw-offload code, so the feature resulted in incorrect offloaded flow
      keys being generated in the NIC.
      
      One more detailed example is below:
      
      $ tc qdisc add dev eth0 ingress
      $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
        dst_port 100-200 action drop
      
      With the setup above, an exact match filter with dst_port == 0 will be
      installed in NIC by hw-offload. IOW, the NIC will have a rule which is
      equivalent to the following one.
      
      $ tc qdisc add dev eth0 ingress
      $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
        dst_port 0 action drop
      
      The behavior was caused by the flow dissector which extracts packet
      data into the flow key in the tc flower. More specifically, regardless
      of exact match or specified port ranges, fl_init_dissector() set the
      FLOW_DISSECTOR_KEY_PORTS flag in struct flow_dissector to extract port
      numbers from skb in skb_flow_dissect() called by fl_classify(). Note
      that device drivers received the same struct flow_dissector object as
      used in skb_flow_dissect(). Thus, offloaded drivers could not identify
      which of these is used because the FLOW_DISSECTOR_KEY_PORTS flag was
      set to struct flow_dissector in either case.
      
      This patch adds the new FLOW_DISSECTOR_KEY_PORTS_RANGE flag and the new
      tp_range field in struct fl_flow_key to recognize which filters are applied
      to offloaded drivers. At this point, when filters based on port ranges
      are passed to drivers, the drivers return the EOPNOTSUPP error because
      they do not support the feature (the newly created
      FLOW_DISSECTOR_KEY_PORTS_RANGE flag).
      
      Fixes: 5c72299f ("net: sched: cls_flower: Classify packets using port ranges")
      Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8ffb055b
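      The effect of a separate range key, as described in the commit above, can be illustrated with a small sketch. The flag names, `struct sketch_rule` and the driver function are hypothetical simplifications and do not reflect the real flow_dissector layout; the point is only that a driver that knows exact-match ports can now recognise a range rule and reject it instead of installing a wrong exact match.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical key flags standing in for the dissector key bits. */
#define SKETCH_KEY_PORTS	(1U << 0)	/* exact port match */
#define SKETCH_KEY_PORTS_RANGE	(1U << 1)	/* port range match */

struct sketch_rule {
	unsigned int used_keys;
	uint16_t dst_port;		/* valid with SKETCH_KEY_PORTS */
	uint16_t dst_min, dst_max;	/* valid with SKETCH_KEY_PORTS_RANGE */
};

/* Build a rule; a single port uses the exact key, anything else the range key. */
static struct sketch_rule sketch_make_rule(uint16_t min, uint16_t max)
{
	struct sketch_rule rule = { 0 };

	assert(min <= max);
	if (min == max) {
		rule.used_keys = SKETCH_KEY_PORTS;
		rule.dst_port = min;
	} else {
		rule.used_keys = SKETCH_KEY_PORTS_RANGE;
		rule.dst_min = min;
		rule.dst_max = max;
	}
	return rule;
}

/* Hypothetical driver that only understands exact port matches. */
static int sketch_driver_offload(const struct sketch_rule *rule)
{
	if (rule->used_keys & ~SKETCH_KEY_PORTS)
		return -EOPNOTSUPP;	/* unknown key: refuse, do not guess */
	return 0;
}
```

      With a single shared flag, the range rule `dst_port 100-200` would look identical to an exact match and the unset exact port (0) would be offloaded, which is the failure shown in the tc example.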
    • D
      net/core: Populate VF index in struct ifla_vf_guid · 9aed6ae0
      Committed by Danit Goldberg
      In addition to filling the node_guid and port_guid attributes,
      the VF index needs to be populated too; otherwise users of the netlink
      interface will see the same VF index for all VFs.
      
      Fixes: 30aad417 ("net/core: Add support for getting VF GUIDs")
      Signed-off-by: Danit Goldberg <danitg@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9aed6ae0
    • D
      net: fix a leak in register_netdevice() · 42c17fa6
      Committed by Dan Carpenter
      We have to free "dev->name_node" on this error path.
      
      Fixes: ff927412 ("net: introduce name_node struct to be used in hashlist")
      Reported-by: syzbot+6e13e65ffbaa33757bcb@syzkaller.appspotmail.com
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      42c17fa6
  15. 03 Dec, 2019 1 commit
    • M
      Fixed updating of ethertype in function skb_mpls_pop · 040b5cfb
      Committed by Martin Varghese
      The skb_mpls_pop was not updating ethertype of an ethernet packet if the
      packet was originally received from a non ARPHRD_ETHER device.
      
      In the OVS datapath flow below, since the device corresponding to port 7
      is an L3 device (ARPHRD_NONE), the skb_mpls_pop function does not update
      the ethertype of the packet, even though the previous push_eth action had
      added an Ethernet header to the packet.
      
      recirc_id(0),in_port(7),eth_type(0x8847),
      mpls(label=12/0xfffff,tc=0/0,ttl=0/0x0,bos=1/1),
      actions:push_eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00),
      pop_mpls(eth_type=0x800),4
      
      Fixes: ed246cee ("net: core: move pop MPLS functionality from OvS to core helper")
      Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      040b5cfb
  16. 29 Nov, 2019 1 commit
    • J
      net: skmsg: fix TLS 1.3 crash with full sk_msg · 031097d9
      Committed by Jakub Kicinski
      TLS 1.3 started using the entry at the end of the SG array
      for chaining-in the single byte content type entry. This mostly
      works:
      
      [ E E E E E E . . ]
        ^           ^
         start       end
      
                       E < content type
                     /
      [ E E E E E E C . ]
        ^           ^
         start       end
      
      (Where E denotes a populated SG entry; C denotes a chaining entry.)
      
      If the array is full, however, the end will point to the start:
      
      [ E E E E E E E E ]
        ^
         start
         end
      
      And we end up overwriting the start:
      
          E < content type
         /
      [ C E E E E E E E ]
        ^
         start
         end
      
      The sg array is supposed to be a circular buffer with start and
      end markers pointing anywhere. In case where start > end
      (i.e. the circular buffer has "wrapped") there is an extra entry
      reserved at the end to chain the two halves together.
      
      [ E E E E E E . . l ]
      
      (Where l is the reserved entry for "looping" back to front.)
      
      As suggested by John, let's reserve another entry for chaining
      SG entries after the main circular buffer. Note that this entry
      has to be pointed to by the end entry so its position is not fixed.
      
      Examples of full messages:
      
      [ E E E E E E E E . l ]
        ^               ^
         start           end
      
         <---------------.
      [ E E . E E E E E E l ]
            ^ ^
         end   start
      
      Now the end will always point to an unused entry, so TLS 1.3
      can always use it.
      
      Fixes: 130b392c ("net: tls: Add tls 1.3 support")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      031097d9
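      The index arithmetic behind the diagrams above can be sketched like this. The names and the size of 8 populated entries are chosen to match the drawings and are otherwise hypothetical: the sketch assumes a ring of 8 + 1 positions (so `end` always lands on a free slot) plus one fixed trailing slot for the "looping" entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_FRAGS 8			/* populated entries in a full message */
#define SKETCH_RING_IDS  (SKETCH_MAX_FRAGS + 1)	/* ring positions, one always free */
#define SKETCH_SLOTS     (SKETCH_MAX_FRAGS + 2)	/* plus the fixed "l" loop entry */

struct sketch_sg {
	unsigned int start;	/* first populated entry */
	unsigned int end;	/* first unused entry, never equal to start when full */
};

/* Number of populated entries between start and end, handling wrap-around. */
static unsigned int sketch_used(const struct sketch_sg *sg)
{
	return sg->end >= sg->start ?
		sg->end - sg->start :
		sg->end + (SKETCH_RING_IDS - sg->start);
}

static bool sketch_full(const struct sketch_sg *sg)
{
	return sketch_used(sg) == SKETCH_MAX_FRAGS;
}

/* Claim the slot at end for a new entry; fails when the message is full. */
static bool sketch_push(struct sketch_sg *sg)
{
	assert(sg != NULL);
	assert(sg->start < SKETCH_RING_IDS && sg->end < SKETCH_RING_IDS);

	if (sketch_full(sg))
		return false;
	sg->end = (sg->end + 1) % SKETCH_RING_IDS;
	return true;
}
```

      Starting from `start = 0`, eight pushes leave `end = 8`; starting from `start = 3`, they leave `end = 2`. In both cases `end` points at an unused slot that a chaining entry could occupy, as in the two "full message" diagrams.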
  17. 24 Nov, 2019 1 commit