1. 14 Oct 2019, 3 commits
  2. 10 Oct 2019, 5 commits
    • net: silence KCSAN warnings about sk->sk_backlog.len reads · 70c26558
      Eric Dumazet committed
      sk->sk_backlog.len can be written by BH handlers, and read
      from process contexts in a lockless way.
      
      Note the write side should also use WRITE_ONCE() or a variant.
      We need some agreement about the best way to do this.
      
      syzbot reported :
      
      BUG: KCSAN: data-race in tcp_add_backlog / tcp_grow_window.isra.0
      
      write to 0xffff88812665f32c of 4 bytes by interrupt on cpu 1:
       sk_add_backlog include/net/sock.h:934 [inline]
       tcp_add_backlog+0x4a0/0xcc0 net/ipv4/tcp_ipv4.c:1737
       tcp_v4_rcv+0x1aba/0x1bf0 net/ipv4/tcp_ipv4.c:1925
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
       virtnet_receive drivers/net/virtio_net.c:1323 [inline]
       virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
       napi_poll net/core/dev.c:6352 [inline]
       net_rx_action+0x3ae/0xa50 net/core/dev.c:6418
      
      read to 0xffff88812665f32c of 4 bytes by task 7292 on cpu 0:
       tcp_space include/net/tcp.h:1373 [inline]
       tcp_grow_window.isra.0+0x6b/0x480 net/ipv4/tcp_input.c:413
       tcp_event_data_recv+0x68f/0x990 net/ipv4/tcp_input.c:717
       tcp_rcv_established+0xbfe/0xf50 net/ipv4/tcp_input.c:5618
       tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
       sk_backlog_rcv include/net/sock.h:945 [inline]
       __release_sock+0x135/0x1e0 net/core/sock.c:2427
       release_sock+0x61/0x160 net/core/sock.c:2943
       tcp_recvmsg+0x63b/0x1a30 net/ipv4/tcp.c:2181
       inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:871 [inline]
       sock_recvmsg net/socket.c:889 [inline]
       sock_recvmsg+0x92/0xb0 net/socket.c:885
       sock_read_iter+0x15f/0x1e0 net/socket.c:967
       call_read_iter include/linux/fs.h:1864 [inline]
       new_sync_read+0x389/0x4f0 fs/read_write.c:414
       __vfs_read+0xb1/0xc0 fs/read_write.c:427
       vfs_read fs/read_write.c:461 [inline]
       vfs_read+0x143/0x2c0 fs/read_write.c:446
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 7292 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    • net: annotate sk->sk_rcvlowat lockless reads · eac66402
      Eric Dumazet committed
      sock_rcvlowat() or int_sk_rcvlowat() might be called without the socket
      lock, for example from tcp_poll().
      
      Use READ_ONCE() to document the fact that other cpus might change
      sk->sk_rcvlowat under us and avoid KCSAN splats.
      
      Use WRITE_ONCE() on write sides too.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
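
The annotation pattern described in this commit can be sketched in user space as follows. This is only an illustration under stated assumptions: the `READ_ONCE()`/`WRITE_ONCE()` macros below are simplified volatile-cast approximations (they rely on the GCC/Clang `__typeof__` extension), and `struct demo_lowat_sock`, `demo_rcvlowat()` and `demo_set_rcvlowat()` are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stddef.h>

/* Simplified user-space approximations of the kernel macros (assumption). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical stand-in holding only the field discussed in the commit. */
struct demo_lowat_sock {
	int sk_rcvlowat;
};

/* Lockless reader: load the field exactly once, then work on the copy. */
int demo_rcvlowat(const struct demo_lowat_sock *sk, int waitall, int len)
{
	int lowat = READ_ONCE(sk->sk_rcvlowat);
	int v = waitall ? len : (lowat < len ? lowat : len);

	return v ? v : 1; /* never report a low-water mark of zero */
}

/* Writer side: a single marked store, so readers never see a torn value. */
void demo_set_rcvlowat(struct demo_lowat_sock *sk, int val)
{
	WRITE_ONCE(sk->sk_rcvlowat, val ? val : 1);
}
```

Reading the field into a local once means the comparison and the returned value are based on the same snapshot, even if another CPU updates the field in between.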
    • net: silence KCSAN warnings around sk_add_backlog() calls · 8265792b
      Eric Dumazet committed
      sk_add_backlog() callers usually read sk->sk_rcvbuf without
      owning the socket lock. This means sk_rcvbuf value can
      be changed by other cpus, and KCSAN complains.
      
      Add READ_ONCE() annotations to document the lockless nature
      of these reads.
      
      Note that writes over sk_rcvbuf should also use WRITE_ONCE(),
      but this will be done in separate patches to ease stable
      backports (if we decide this is relevant for stable trees).
      
      BUG: KCSAN: data-race in tcp_add_backlog / tcp_recvmsg
      
      write to 0xffff88812ab369f8 of 8 bytes by interrupt on cpu 1:
       __sk_add_backlog include/net/sock.h:902 [inline]
       sk_add_backlog include/net/sock.h:933 [inline]
       tcp_add_backlog+0x45a/0xcc0 net/ipv4/tcp_ipv4.c:1737
       tcp_v4_rcv+0x1aba/0x1bf0 net/ipv4/tcp_ipv4.c:1925
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
       virtnet_receive drivers/net/virtio_net.c:1323 [inline]
       virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
       napi_poll net/core/dev.c:6352 [inline]
       net_rx_action+0x3ae/0xa50 net/core/dev.c:6418
      
      read to 0xffff88812ab369f8 of 8 bytes by task 7271 on cpu 0:
       tcp_recvmsg+0x470/0x1a30 net/ipv4/tcp.c:2047
       inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:871 [inline]
       sock_recvmsg net/socket.c:889 [inline]
       sock_recvmsg+0x92/0xb0 net/socket.c:885
       sock_read_iter+0x15f/0x1e0 net/socket.c:967
       call_read_iter include/linux/fs.h:1864 [inline]
       new_sync_read+0x389/0x4f0 fs/read_write.c:414
       __vfs_read+0xb1/0xc0 fs/read_write.c:427
       vfs_read fs/read_write.c:461 [inline]
       vfs_read+0x143/0x2c0 fs/read_write.c:446
       ksys_read+0xd5/0x1b0 fs/read_write.c:587
       __do_sys_read fs/read_write.c:597 [inline]
       __se_sys_read fs/read_write.c:595 [inline]
       __x64_sys_read+0x4c/0x60 fs/read_write.c:595
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 7271 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    • net: avoid possible false sharing in sk_leave_memory_pressure() · 503978ac
      Eric Dumazet committed
      As mentioned in https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE#it-may-improve-performance
      a C compiler can legally transform :
      
      if (memory_pressure && *memory_pressure)
              *memory_pressure = 0;
      
      to :
      
      if (memory_pressure)
              *memory_pressure = 0;
      
      Fixes: 06044751 ("tcp: add TCPMemoryPressuresChrono counter")
      Fixes: 180d8cd9 ("foundations of per-cgroup memory pressure controlling.")
      Fixes: 3ab224be ("[NET] CORE: Introducing new memory accounting interface.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
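
The transformation quoted in this commit turns a conditional store into an unconditional one, which dirties the cache line even when the flag is already zero. A minimal user-space sketch of how marked accesses rule that out is shown below. The macros are simplified volatile-cast approximations (an assumption, using the GCC/Clang `__typeof__` extension), and `demo_leave_memory_pressure()` is a hypothetical helper, not the kernel function.

```c
#include <stddef.h>

/* Simplified user-space approximations of the kernel macros (assumption). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/*
 * Hypothetical helper: clear the flag only if it is currently set.
 * Both accesses are volatile, so the compiler must perform the load and
 * may not invent a store on the path where the flag is already zero.
 * Returns 1 if a store was performed, 0 otherwise.
 */
int demo_leave_memory_pressure(unsigned long *memory_pressure)
{
	if (memory_pressure && READ_ONCE(*memory_pressure)) {
		WRITE_ONCE(*memory_pressure, 0);
		return 1;
	}
	return 0;
}
```

When the flag is already clear, the function only reads the cache line, so CPUs sharing that line are not forced to invalidate their copies.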
    • netns: fix NLM_F_ECHO mechanism for RTM_NEWNSID · 993e4c92
      Nicolas Dichtel committed
      The NLM_F_ECHO flag asks for the message that is notified to all
      listeners to also be sent back to the requesting user.
      This was not the case for the RTM_NEWNSID command; let's fix this.
      
      Fixes: 0c7aecd4 ("netns: add rtnl cmd to add and get peer netns ids")
      Reported-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Guillaume Nault <gnault@redhat.com>
      Tested-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
  3. 05 Oct 2019, 1 commit
  4. 02 Oct 2019, 2 commits
  5. 01 Oct 2019, 1 commit
    • net: Unpublish sk from sk_reuseport_cb before call_rcu · 8c7138b3
      Martin KaFai Lau committed
      The "reuse->sock[]" array is shared by multiple sockets.  The going away
      sk must unpublish itself from "reuse->sock[]" before making call_rcu()
      call.  However, this unpublish-action is currently done after a grace
      period and it may cause use-after-free.
      
      The fix is to move reuseport_detach_sock() to sk_destruct().
      Due to the above reason, any socket with sk_reuseport_cb has
      to go through the rcu grace period before freeing it.
      
      It is a rather old bug (~3 yrs).  The Fixes tag is not necessarily
      the right commit, but it is the one that introduced the SOCK_RCU_FREE
      logic, and this fix depends on it.
      
      Fixes: a4298e45 ("net: add SOCK_RCU_FREE socket flag")
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 28 Sep 2019, 1 commit
    • sk_buff: drop all skb extensions on free and skb scrubbing · 174e2381
      Florian Westphal committed
      Now that we have a 3rd extension, add a new helper that drops the
      extension space and use it when we need to scrub an sk_buff.
      
      At this time, scrubbing clears secpath and bridge netfilter data, but
      retains the tc skb extension; after this patch all three get cleared.
      
      NAPI reuse/free assumes we can only have a secpath attached to skb, but
      it seems better to clear all extensions there as well.
      
      v2: add unlikely hint (Eric Dumazet)
      
      Fixes: 95a7233c ("net: openvswitch: Set OvS recirc_id from tc chain index")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 26 Sep 2019, 1 commit
  8. 17 Sep 2019, 2 commits
    • ethtool: implement Energy Detect Powerdown support via phy-tunable · 9f2f13f4
      Alexandru Ardelean committed
      The `phy_tunable_id` has been named `ETHTOOL_PHY_EDPD` since it looks like
      this feature is common across other PHYs (like EEE), and defining
      `ETHTOOL_PHY_ENERGY_DETECT_POWER_DOWN` seems too long.
      
      The way EDPD works, is that the RX block is put to a lower power mode,
      except for link-pulse detection circuits. The TX block is also put to low
      power mode, but the PHY wakes-up periodically to send link pulses, to avoid
      lock-ups in case the other side is also in EDPD mode.
      
      Currently, there are 2 PHY drivers that look like they could use this new
      PHY tunable feature: the `adin` and `micrel` PHYs.
      
      The ADIN's datasheet mentions that TX pulses are sent at intervals of
      1 second by default, and that they can be disabled. For the Micrel KSZ9031
      PHY, the datasheet does not mention whether they can be disabled, but
      mentions that they can be modified.
      
      The way this change is structured, is similar to the PHY tunable downshift
      control:
      * a `ETHTOOL_PHY_EDPD_DFLT_TX_MSECS` value is exposed to cover a default
        TX interval; some PHYs could specify a certain value that makes sense
      * `ETHTOOL_PHY_EDPD_NO_TX` would disable TX when EDPD is enabled
      * `ETHTOOL_PHY_EDPD_DISABLE` will disable EDPD
      
      As noted by the `ETHTOOL_PHY_EDPD_DFLT_TX_MSECS` the interval unit is 1
      millisecond, which should cover a reasonable range of intervals:
       - from 1 millisecond, which does not sound like much of a power-saver
       - to ~65 seconds which is quite a lot to wait for a link to come up when
         plugging a cable
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
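
The three special values listed in this commit share one 16-bit tunable with ordinary millisecond intervals. The sketch below shows one way a driver might classify such a value. The numeric values of the `ETHTOOL_PHY_EDPD_*` constants are given here as I understand them from the uapi header and should be treated as assumptions; `enum demo_edpd_mode` and `demo_edpd_classify()` are hypothetical names introduced only for this illustration.

```c
#include <stdint.h>

/* Constant names come from the commit message; the values are assumptions. */
#define ETHTOOL_PHY_EDPD_DFLT_TX_MSECS 0xffff
#define ETHTOOL_PHY_EDPD_NO_TX         0xfffe
#define ETHTOOL_PHY_EDPD_DISABLE       0

/* Hypothetical classification of a tunable value. */
enum demo_edpd_mode {
	DEMO_EDPD_OFF,            /* EDPD disabled */
	DEMO_EDPD_ON_NO_TX,       /* EDPD enabled, no link pulses sent */
	DEMO_EDPD_ON_DEFAULT_TX,  /* EDPD enabled, PHY-specific default interval */
	DEMO_EDPD_ON_CUSTOM_TX    /* EDPD enabled, interval given in milliseconds */
};

enum demo_edpd_mode demo_edpd_classify(uint16_t tx_interval)
{
	switch (tx_interval) {
	case ETHTOOL_PHY_EDPD_DISABLE:
		return DEMO_EDPD_OFF;
	case ETHTOOL_PHY_EDPD_NO_TX:
		return DEMO_EDPD_ON_NO_TX;
	case ETHTOOL_PHY_EDPD_DFLT_TX_MSECS:
		return DEMO_EDPD_ON_DEFAULT_TX;
	default:
		return DEMO_EDPD_ON_CUSTOM_TX;
	}
}
```

Because the value is a 16-bit count of milliseconds, the largest custom interval is a little over 65 seconds, which matches the range discussed in the commit message.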
    • drop_monitor: Better sanitize notified packets · bef17466
      Ido Schimmel committed
      When working in 'packet' mode, drop monitor generates a notification
      with a potentially truncated payload of the dropped packet. The payload
      is copied from the MAC header, but I forgot to check that the MAC header
      was set, so do it now.
      
      Fixes: ca30707d ("drop_monitor: Add packet alert mode")
      Fixes: 5e58109b ("drop_monitor: Add support for packet alert mode for hardware drops")
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 16 Sep 2019, 2 commits
    • udp: correct reuseport selection with connected sockets · acdcecc6
      Willem de Bruijn committed
      UDP reuseport groups can hold a mix of unconnected and connected sockets.
      Ensure that connections only receive all traffic to their 4-tuple.
      
      Fast reuseport returns on the first reuseport match on the assumption
      that all matches are equal. Only if connections are present, return to
      the previous behavior of scoring all sockets.
      
      Record if connections are present and if so (1) treat such connected
      sockets as an independent match from the group, (2) only return
      2-tuple matches from reuseport and (3) do not return on the first
      2-tuple reuseport match to allow for a higher scoring match later.
      
      New field has_conns is set without locks. No other fields in the
      bitmap are modified at runtime and the field is only ever set
      unconditionally, so an RMW cannot miss a change.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Link: http://lkml.kernel.org/r/CA+FuTSfRP09aJNYRt04SS6qj22ViiOEWaWmLAwX0psk8-PGNxw@mail.gmail.com
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Craig Gallek <kraig@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: fix race between deactivation and dequeue for NOLOCK qdisc · d518d2ed
      Paolo Abeni committed
      The test implemented by some_qdisc_is_busy() is somewhat loose for
      NOLOCK qdiscs, as we may hit the following scenario:
      
      CPU1						CPU2
      // in net_tx_action()
      clear_bit(__QDISC_STATE_SCHED...);
      						// in some_qdisc_is_busy()
      						val = (qdisc_is_running(q) ||
      						       test_bit(__QDISC_STATE_SCHED,
      								&q->state));
      						// here val is 0 but...
      qdisc_run(q)
      // ... CPU1 is going to run the qdisc next
      
      As a consequence, qdisc_run() in net_tx_action() can race with qdisc_reset()
      in dev_qdisc_reset(). Such a race is not possible for !NOLOCK qdiscs, as
      both of the above bit operations are done under the root qdisc lock().
      
      After commit 021a17ed ("pfifo_fast: drop unneeded additional lock on dequeue")
      the race can cause use after free and/or null ptr dereference, but the root
      cause is likely older.
      
      This patch addresses the issue explicitly checking for deactivation under
      the seqlock for NOLOCK qdisc, so that the qdisc_run() in the critical
      scenario becomes a no-op.
      
      Note that the enqueue() op can still execute concurrently with dev_qdisc_reset(),
      but that is safe due to the skb_array() locking, and we can't avoid that
      for NOLOCK qdiscs.
      
      Fixes: 021a17ed ("pfifo_fast: drop unneeded additional lock on dequeue")
      Reported-by: Li Shuang <shuali@redhat.com>
      Reported-and-tested-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
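
The idea of checking for deactivation only after the run has been claimed can be modelled in a few lines. This is a deliberately single-threaded sketch: plain booleans stand in for the seqlock and the atomic state bits the kernel uses, and `struct demo_qdisc` and the `demo_*` functions are hypothetical names, not kernel code.

```c
#include <stdbool.h>

/* Hypothetical model of the qdisc state relevant to the race. */
struct demo_qdisc {
	bool nolock;       /* models a NOLOCK qdisc */
	bool deactivated;  /* set once the device is being deactivated */
	bool running;      /* models ownership of the running seqlock */
	int  dequeued;     /* counts how often the dequeue loop ran */
};

/* Claim the right to run the qdisc; fails if someone else holds it. */
bool demo_run_begin(struct demo_qdisc *q)
{
	if (q->running)
		return false;
	q->running = true;
	return true;
}

void demo_run_end(struct demo_qdisc *q)
{
	q->running = false;
}

/*
 * The deactivation test happens after demo_run_begin() succeeded, i.e.
 * while the run is owned, so a deactivated NOLOCK qdisc makes the call
 * a no-op instead of dequeuing from a qdisc that is being reset.
 */
void demo_qdisc_run(struct demo_qdisc *q)
{
	if (!demo_run_begin(q))
		return;
	if (!q->nolock || !q->deactivated)
		q->dequeued++; /* stands in for the real dequeue loop */
	demo_run_end(q);
}
```

In this model a reset path that first sets `deactivated` and then waits for `running` to clear can no longer overlap with a dequeue.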
  10. 14 Sep 2019, 2 commits
  11. 12 Sep 2019, 1 commit
    • net: Fix null de-reference of device refcount · 10cc514f
      Subash Abhinov Kasiviswanathan committed
      In event of failure during register_netdevice, free_netdev is
      invoked immediately. free_netdev assumes that all the netdevice
      refcounts have been dropped prior to it being called and as a
      result frees and clears out the refcount pointer.
      
      However, this is not necessarily true as some of the operations
      in the NETDEV_UNREGISTER notifier handlers queue RCU callbacks for
      invocation after a grace period. The IPv4 callback in_dev_rcu_put
      tries to access the refcount after free_netdev is called which
      leads to a null dereference:
      
      44837.761523:   <6> Unable to handle kernel paging request at
                          virtual address 0000004a88287000
      44837.761651:   <2> pc : in_dev_finish_destroy+0x4c/0xc8
      44837.761654:   <2> lr : in_dev_finish_destroy+0x2c/0xc8
      44837.762393:   <2> Call trace:
      44837.762398:   <2>  in_dev_finish_destroy+0x4c/0xc8
      44837.762404:   <2>  in_dev_rcu_put+0x24/0x30
      44837.762412:   <2>  rcu_nocb_kthread+0x43c/0x468
      44837.762418:   <2>  kthread+0x118/0x128
      44837.762424:   <2>  ret_from_fork+0x10/0x1c
      
      Fix this by waiting for the completion of the call_rcu() in
      case of register_netdevice errors.
      
      Fixes: 93ee31f1 ("[NET]: Fix free_netdev on register_netdev failure.")
      Cc: Sean Tranchetti <stranche@codeaurora.org>
      Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 11 Sep 2019, 1 commit
  13. 07 Sep 2019, 2 commits
  14. 06 Sep 2019, 1 commit
    • net: openvswitch: Set OvS recirc_id from tc chain index · 95a7233c
      Paul Blakey committed
      Offloaded OvS datapath rules are translated one to one to tc rules,
      for example the following simplified OvS rule:
      
      recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) actions:ct(),recirc(2)
      
      Will be translated to the following tc rule:
      
      $ tc filter add dev dev1 ingress \
      	    prio 1 chain 0 proto ip \
      		flower tcp ct_state -trk \
      		action ct pipe \
      		action goto chain 2
      
      Received packets will first travel through tc, and if they aren't stolen
      by it, as in the above rule, they will continue to the OvS datapath.
      Since we already did some actions (action ct in this case) which might
      modify the packets, and updated action stats, we would like to continue
      the processing with the correct recirc_id in OvS (here recirc_id(2))
      where we left off.
      
      To support this, introduce a new skb extension for tc, which
      will be used for translating tc chain to ovs recirc_id to
      handle these miss cases. Last tc chain index will be set
      by tc goto chain action and read by OvS datapath.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 05 Sep 2019, 1 commit
  16. 01 Sep 2019, 3 commits
  17. 31 Aug 2019, 1 commit
  18. 28 Aug 2019, 1 commit
    • net: fix skb use after free in netpoll · 2c1644cf
      Feng Sun committed
      After commit baeababb
      ("tun: return NET_XMIT_DROP for dropped packets"),
      when tun_net_xmit drops a packet, it frees the skb and returns NET_XMIT_DROP.
      netpoll_send_skb_on_dev then runs into the following use-after-free cases:
      1. retrying netpoll_start_xmit with the freed skb;
      2. queueing the freed skb in npinfo->txq.
      queue_process also runs into a use-after-free case.
      
      The first netpoll_send_skb_on_dev case was hit with the following kernel log:
      
      [  117.864773] kernel BUG at mm/slub.c:306!
      [  117.864773] invalid opcode: 0000 [#1] SMP PTI
      [  117.864774] CPU: 3 PID: 2627 Comm: loop_printmsg Kdump: loaded Tainted: P           OE     5.3.0-050300rc5-generic #201908182231
      [  117.864775] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      [  117.864775] RIP: 0010:kmem_cache_free+0x28d/0x2b0
      [  117.864781] Call Trace:
      [  117.864781]  ? tun_net_xmit+0x21c/0x460
      [  117.864781]  kfree_skbmem+0x4e/0x60
      [  117.864782]  kfree_skb+0x3a/0xa0
      [  117.864782]  tun_net_xmit+0x21c/0x460
      [  117.864782]  netpoll_start_xmit+0x11d/0x1b0
      [  117.864788]  netpoll_send_skb_on_dev+0x1b8/0x200
      [  117.864789]  __br_forward+0x1b9/0x1e0 [bridge]
      [  117.864789]  ? skb_clone+0x53/0xd0
      [  117.864790]  ? __skb_clone+0x2e/0x120
      [  117.864790]  deliver_clone+0x37/0x50 [bridge]
      [  117.864790]  maybe_deliver+0x89/0xc0 [bridge]
      [  117.864791]  br_flood+0x6c/0x130 [bridge]
      [  117.864791]  br_dev_xmit+0x315/0x3c0 [bridge]
      [  117.864792]  netpoll_start_xmit+0x11d/0x1b0
      [  117.864792]  netpoll_send_skb_on_dev+0x1b8/0x200
      [  117.864792]  netpoll_send_udp+0x2c6/0x3e8
      [  117.864793]  write_msg+0xd9/0xf0 [netconsole]
      [  117.864793]  console_unlock+0x386/0x4e0
      [  117.864793]  vprintk_emit+0x17e/0x280
      [  117.864794]  vprintk_default+0x29/0x50
      [  117.864794]  vprintk_func+0x4c/0xbc
      [  117.864794]  printk+0x58/0x6f
      [  117.864795]  loop_fun+0x24/0x41 [printmsg_loop]
      [  117.864795]  kthread+0x104/0x140
      [  117.864795]  ? 0xffffffffc05b1000
      [  117.864796]  ? kthread_park+0x80/0x80
      [  117.864796]  ret_from_fork+0x35/0x40
      Signed-off-by: Feng Sun <loyou85@gmail.com>
      Signed-off-by: Xiaojun Zhao <xiaojunzhao141@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 25 Aug 2019, 2 commits
  20. 24 Aug 2019, 3 commits
  21. 20 Aug 2019, 2 commits
    • tcp: make sure EPOLLOUT wont be missed · ef8d8ccd
      Eric Dumazet committed
      As Jason Baron explained in commit 790ba456 ("tcp: set SOCK_NOSPACE
      under memory pressure"), it is crucial we properly set SOCK_NOSPACE
      when needed.
      
      However, Jason's patch had a bug, because the 'nonblocking' status
      as far as sk_stream_wait_memory() is concerned is governed
      by the MSG_DONTWAIT flag passed at sendmsg() time:
      
          long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
      
      So it is very possible that tcp sendmsg() calls sk_stream_wait_memory(),
      and that sk_stream_wait_memory() returns -EAGAIN with SOCK_NOSPACE
      cleared, if sk->sk_sndtimeo has been set to a small (but not zero)
      value.
      
      This patch removes the 'noblock' variable since we must always
      set SOCK_NOSPACE if -EAGAIN is returned.
      
      It also renames the do_nonblock label since we might reach this
      code path even if we were in blocking mode.
      
      Fixes: 790ba456 ("tcp: set SOCK_NOSPACE under memory pressure")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Reported-by: Vladimir Rutsky <rutsky@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Jason Baron <jbaron@akamai.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
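
The case this commit describes, a blocking send with a small send timeout that still ends in -EAGAIN, can be modelled roughly as follows. This is a heavily simplified sketch: `struct demo_snd_sock` and the `demo_*` functions are hypothetical, the `nospace` field stands in for the SOCK_NOSPACE bit, and no real waiting takes place.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the socket state involved. */
struct demo_snd_sock {
	long sk_sndtimeo;  /* send timeout; 0 would mean "do not wait" */
	bool nospace;      /* models the SOCK_NOSPACE bit */
};

/* Zero timeout for MSG_DONTWAIT, the socket's send timeout otherwise. */
long demo_sndtimeo(const struct demo_snd_sock *sk, bool noblock)
{
	return noblock ? 0 : sk->sk_sndtimeo;
}

/*
 * Models the out-of-memory path of a send. If memory never becomes
 * available, the caller gets -EAGAIN whether the timeout started at
 * zero or was a small positive value that has now run out, so the
 * no-space flag is set unconditionally before returning.
 */
int demo_wait_memory(struct demo_snd_sock *sk, long *timeo, bool memory_available)
{
	if (memory_available)
		return 0;
	sk->nospace = true; /* poll() can now report EPOLLOUT later */
	*timeo = 0;         /* the whole timeout has been consumed */
	return -EAGAIN;
}
```

Setting the flag only when the call started out non-blocking would miss the second case, which is the situation the commit message describes.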
    • net: flow_offload: convert block_ing_cb_list to regular list type · 607f625b
      Vlad Buslov committed
      The RCU list block_ing_cb_list is protected by the RCU read lock in
      flow_block_ing_cmd() and by the flow_indr_block_ing_cb_lock mutex in all
      functions that use it. However, flow_block_ing_cmd() needs to call blocking
      functions while iterating block_ing_cb_list, which leads to the following
      suspicious RCU usage warning:
      
      [  401.510948] =============================
      [  401.510952] WARNING: suspicious RCU usage
      [  401.510993] 5.3.0-rc3+ #589 Not tainted
      [  401.510996] -----------------------------
      [  401.511001] include/linux/rcupdate.h:265 Illegal context switch in RCU read-side critical section!
      [  401.511004]
                     other info that might help us debug this:
      
      [  401.511008]
                     rcu_scheduler_active = 2, debug_locks = 1
      [  401.511012] 7 locks held by test-ecmp-add-v/7576:
      [  401.511015]  #0: 00000000081d71a5 (sb_writers#4){.+.+}, at: vfs_write+0x166/0x1d0
      [  401.511037]  #1: 000000002bd338c3 (&of->mutex){+.+.}, at: kernfs_fop_write+0xef/0x1b0
      [  401.511051]  #2: 00000000c921c634 (kn->count#317){.+.+}, at: kernfs_fop_write+0xf7/0x1b0
      [  401.511062]  #3: 00000000a19cdd56 (&dev->mutex){....}, at: sriov_numvfs_store+0x6b/0x130
      [  401.511079]  #4: 000000005425fa52 (pernet_ops_rwsem){++++}, at: unregister_netdevice_notifier+0x30/0x140
      [  401.511092]  #5: 00000000c5822793 (rtnl_mutex){+.+.}, at: unregister_netdevice_notifier+0x35/0x140
      [  401.511101]  #6: 00000000c2f3507e (rcu_read_lock){....}, at: flow_block_ing_cmd+0x5/0x130
      [  401.511115]
                     stack backtrace:
      [  401.511121] CPU: 21 PID: 7576 Comm: test-ecmp-add-v Not tainted 5.3.0-rc3+ #589
      [  401.511124] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
      [  401.511127] Call Trace:
      [  401.511138]  dump_stack+0x85/0xc0
      [  401.511146]  ___might_sleep+0x100/0x180
      [  401.511154]  __mutex_lock+0x5b/0x960
      [  401.511162]  ? find_held_lock+0x2b/0x80
      [  401.511173]  ? __tcf_get_next_chain+0x1d/0xb0
      [  401.511179]  ? mark_held_locks+0x49/0x70
      [  401.511194]  ? __tcf_get_next_chain+0x1d/0xb0
      [  401.511198]  __tcf_get_next_chain+0x1d/0xb0
      [  401.511251]  ? uplink_rep_async_event+0x70/0x70 [mlx5_core]
      [  401.511261]  tcf_block_playback_offloads+0x39/0x160
      [  401.511276]  tcf_block_setup+0x1b0/0x240
      [  401.511312]  ? mlx5e_rep_indr_setup_tc_cb+0xca/0x290 [mlx5_core]
      [  401.511347]  ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
      [  401.511359]  tc_indr_block_get_and_ing_cmd+0x11b/0x1e0
      [  401.511404]  ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
      [  401.511414]  flow_block_ing_cmd+0x7e/0x130
      [  401.511453]  ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
      [  401.511462]  __flow_indr_block_cb_unregister+0x7f/0xf0
      [  401.511502]  mlx5e_nic_rep_netdevice_event+0x75/0xb0 [mlx5_core]
      [  401.511513]  unregister_netdevice_notifier+0xe9/0x140
      [  401.511554]  mlx5e_cleanup_rep_tx+0x6f/0xe0 [mlx5_core]
      [  401.511597]  mlx5e_detach_netdev+0x4b/0x60 [mlx5_core]
      [  401.511637]  mlx5e_vport_rep_unload+0x71/0xc0 [mlx5_core]
      [  401.511679]  esw_offloads_disable+0x5b/0x90 [mlx5_core]
      [  401.511724]  mlx5_eswitch_disable.cold+0xdf/0x176 [mlx5_core]
      [  401.511759]  mlx5_device_disable_sriov+0xab/0xb0 [mlx5_core]
      [  401.511794]  mlx5_core_sriov_configure+0xaf/0xd0 [mlx5_core]
      [  401.511805]  sriov_numvfs_store+0xf8/0x130
      [  401.511817]  kernfs_fop_write+0x122/0x1b0
      [  401.511826]  vfs_write+0xdb/0x1d0
      [  401.511835]  ksys_write+0x65/0xe0
      [  401.511847]  do_syscall_64+0x5c/0xb0
      [  401.511857]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  401.511862] RIP: 0033:0x7fad892d30f8
      [  401.511868] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 96 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 60 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
      [  401.511871] RSP: 002b:00007ffca2a9fad8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  401.511875] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fad892d30f8
      [  401.511878] RDX: 0000000000000002 RSI: 000055afeb072a90 RDI: 0000000000000001
      [  401.511881] RBP: 000055afeb072a90 R08: 00000000ffffffff R09: 000000000000000a
      [  401.511884] R10: 000055afeb058710 R11: 0000000000000246 R12: 0000000000000002
      [  401.511887] R13: 00007fad893a8780 R14: 0000000000000002 R15: 00007fad893a3740
      
      To fix the described incorrect RCU usage, convert block_ing_cb_list from
      RCU list to regular list and protect it with flow_indr_block_ing_cb_lock
      mutex in flow_block_ing_cmd().
      
      Fixes: 1150ab0f ("flow_offload: support get multi-subsystem block")
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 18 Aug 2019, 2 commits