1. 10 1月, 2020 2 次提交
  2. 09 1月, 2020 2 次提交
    • F
      net: dsa: Get information about stacked DSA protocol · 4d776482
      Florian Fainelli 提交于
      It is possible to stack multiple DSA switches in a way that they are not
      part of the tree (disjoint) but the DSA master of a switch is a DSA
      slave of another. When that happens switch drivers may have to know this
      is the case so as to determine whether their tagging protocol has a
      remove chance of working.
      
      This is useful for specific switch drivers such as b53 where devices
      have been known to be stacked in the wild without the Broadcom tag
      protocol supporting that feature. This allows b53 to continue supporting
      those devices by forcing the disabling of Broadcom tags on the outermost
      switches if necessary.
      
      The get_tag_protocol() function is therefore updated to gain an
      additional enum dsa_tag_protocol argument which denotes the current
      tagging protocol used by the DSA master we are attached to, else
      DSA_TAG_PROTO_NONE for the top of the dsa_switch_tree.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4d776482
    • V
      devlink: add support for reporter recovery completion · 6181e5cb
      Vikas Gupta 提交于
      It is possible that a reporter recovery completion do not finish
      successfully when recovery is triggered via
      devlink_health_reporter_recover as recovery could be processed in
      different context. In such scenario an error is returned by driver when
      recover hook is invoked and successful recovery completion is
      intimated later.
      Expose devlink recover done API to update recovery stats.
      Signed-off-by: NVikas Gupta <vikas.gupta@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6181e5cb
  3. 06 1月, 2020 3 次提交
    • P
      netfilter: flowtable: add nf_flowtable_time_stamp · fb46f1b7
      Pablo Neira Ayuso 提交于
      This patch adds nf_flowtable_time_stamp and updates the existing code to
      use it.
      
      This patch is also implicitly fixing up hardware statistic fetching via
      nf_flow_offload_stats() where casting to u32 is missing. Use
      nf_flow_timeout_delta() to fix this.
      
      Fixes: c29f74e0 ("netfilter: nf_flow_table: hardware offload support")
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: Nwenxu <wenxu@ucloud.cn>
      fb46f1b7
    • V
      net: dsa: Pass pcs_poll flag from driver to PHYLINK · 787cac3f
      Vladimir Oltean 提交于
      The DSA drivers that implement .phylink_mac_link_state should normally
      register an interrupt for the PCS, from which they should call
      phylink_mac_change(). However not all switches implement this, and those
      who don't should set this flag in dsa_switch in the .setup callback, so
      that PHYLINK will poll for a few ms until the in-band AN link timer
      expires and the PCS state settles.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      787cac3f
    • V
      net: dsa: Make deferred_xmit private to sja1105 · a68578c2
      Vladimir Oltean 提交于
      There are 3 things that are wrong with the DSA deferred xmit mechanism:
      
      1. Its introduction has made the DSA hotpath ever so slightly more
         inefficient for everybody, since DSA_SKB_CB(skb)->deferred_xmit needs
         to be initialized to false for every transmitted frame, in order to
         figure out whether the driver requested deferral or not (a very rare
         occasion, rare even for the only driver that does use this mechanism:
         sja1105). That was necessary to avoid kfree_skb from freeing the skb.
      
      2. Because L2 PTP is a link-local protocol like STP, it requires
         management routes and deferred xmit with this switch. But as opposed
         to STP, the deferred work mechanism needs to schedule the packet
         rather quickly for the TX timstamp to be collected in time and sent
         to user space. But there is no provision for controlling the
         scheduling priority of this deferred xmit workqueue. Too bad this is
         a rather specific requirement for a feature that nobody else uses
         (more below).
      
      3. Perhaps most importantly, it makes the DSA core adhere a bit too
         much to the NXP company-wide policy "Innovate Where It Doesn't
         Matter". The sja1105 is probably the only DSA switch that requires
         some frames sent from the CPU to be routed to the slave port via an
         out-of-band configuration (register write) rather than in-band (DSA
         tag). And there are indeed very good reasons to not want to do that:
         if that out-of-band register is at the other end of a slow bus such
         as SPI, then you limit that Ethernet flow's throughput to effectively
         the throughput of the SPI bus. So hardware vendors should definitely
         not be encouraged to design this way. We do _not_ want more
         widespread use of this mechanism.
      
      Luckily we have a solution for each of the 3 issues:
      
      For 1, we can just remove that variable in the skb->cb and counteract
      the effect of kfree_skb with skb_get, much to the same effect. The
      advantage, of course, being that anybody who doesn't use deferred xmit
      doesn't need to do any extra operation in the hotpath.
      
      For 2, we can create a kernel thread for each port's deferred xmit work.
      If the user switch ports are named swp0, swp1, swp2, the kernel threads
      will be named swp0_xmit, swp1_xmit, swp2_xmit (there appears to be a 15
      character length limit on kernel thread names). With this, the user can
      change the scheduling priority with chrt $(pidof swp2_xmit).
      
      For 3, we can actually move the entire implementation to the sja1105
      driver.
      
      So this patch deletes the generic implementation from the DSA core and
      adds a new one, more adequate to the requirements of PTP TX
      timestamping, in sja1105_main.c.
      Suggested-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a68578c2
  4. 03 1月, 2020 1 次提交
    • D
      tcp: Add l3index to tcp_md5sig_key and md5 functions · dea53bb8
      David Ahern 提交于
      Add l3index to tcp_md5sig_key to represent the L3 domain of a key, and
      add l3index to tcp_md5_do_add and tcp_md5_do_del to fill in the key.
      
      With the key now based on an l3index, add the new parameter to the
      lookup functions and consider the l3index when looking for a match.
      
      The l3index comes from the skb when processing ingress packets leveraging
      the helpers created for socket lookups, tcp_v4_sdif and inet_iif (and the
      v6 variants). When the sdif index is set it means the packet ingressed a
      device that is part of an L3 domain and inet_iif points to the VRF device.
      For egress, the L3 domain is determined from the socket binding and
      sk_bound_dev_if.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dea53bb8
  5. 31 12月, 2019 1 次提交
    • D
      net/sched: add delete_empty() to filters and use it in cls_flower · a5b72a08
      Davide Caratti 提交于
      Revert "net/sched: cls_u32: fix refcount leak in the error path of
      u32_change()", and fix the u32 refcount leak in a more generic way that
      preserves the semantic of rule dumping.
      On tc filters that don't support lockless insertion/removal, there is no
      need to guard against concurrent insertion when a removal is in progress.
      Therefore, for most of them we can avoid a full walk() when deleting, and
      just decrease the refcount, like it was done on older Linux kernels.
      This fixes situations where walk() was wrongly detecting a non-empty
      filter, like it happened with cls_u32 in the error path of change(), thus
      leading to failures in the following tdc selftests:
      
       6aa7: (filter, u32) Add/Replace u32 with source match and invalid indev
       6658: (filter, u32) Add/Replace u32 with custom hash table and invalid handle
       74c2: (filter, u32) Add/Replace u32 filter with invalid hash table id
      
      On cls_flower, and on (future) lockless filters, this check is necessary:
      move all the check_empty() logic in a callback so that each filter
      can have its own implementation. For cls_flower, it's sufficient to check
      if no IDRs have been allocated.
      
      This reverts commit 275c44aa.
      
      Changes since v1:
       - document the need for delete_empty() when TCF_PROTO_OPS_DOIT_UNLOCKED
         is used, thanks to Vlad Buslov
       - implement delete_empty() without doing fl_walk(), thanks to Vlad Buslov
       - squash revert and new fix in a single patch, to be nice with bisect
         tests that run tdc on u32 filter, thanks to Dave Miller
      
      Fixes: 275c44aa ("net/sched: cls_u32: fix refcount leak in the error path of u32_change()")
      Fixes: 6676d5e4 ("net: sched: set dedicated tcf_walker flag when tp is empty")
      Suggested-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Suggested-by: NVlad Buslov <vladbu@mellanox.com>
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Reviewed-by: NVlad Buslov <vladbu@mellanox.com>
      Tested-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a5b72a08
  6. 25 12月, 2019 6 次提交
  7. 21 12月, 2019 4 次提交
  8. 20 12月, 2019 2 次提交
  9. 19 12月, 2019 2 次提交
  10. 18 12月, 2019 1 次提交
  11. 17 12月, 2019 4 次提交
  12. 14 12月, 2019 2 次提交
  13. 13 12月, 2019 2 次提交
  14. 12 12月, 2019 1 次提交
  15. 10 12月, 2019 4 次提交
  16. 08 12月, 2019 1 次提交
    • E
      inet: protect against too small mtu values. · 501a90c9
      Eric Dumazet 提交于
      syzbot was once again able to crash a host by setting a very small mtu
      on loopback device.
      
      Let's make inetdev_valid_mtu() available in include/net/ip.h,
      and use it in ip_setup_cork(), so that we protect both ip_append_page()
      and __ip_append_data()
      
      Also add a READ_ONCE() when the device mtu is read.
      
      Pairs this lockless read with one WRITE_ONCE() in __dev_set_mtu(),
      even if other code paths might write over this field.
      
      Add a big comment in include/linux/netdevice.h about dev->mtu
      needing READ_ONCE()/WRITE_ONCE() annotations.
      
      Hopefully we will add the missing ones in followup patches.
      
      [1]
      
      refcount_t: saturated; leaking memory.
      WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       panic+0x2e3/0x75c kernel/panic.c:221
       __warn.cold+0x2f/0x3e kernel/panic.c:582
       report_bug+0x289/0x300 lib/bug.c:195
       fixup_bug arch/x86/kernel/traps.c:174 [inline]
       fixup_bug arch/x86/kernel/traps.c:169 [inline]
       do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
       do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
       invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
      RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd <0f> 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89
      RSP: 0018:ffff88809689f550 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c
      RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1
      R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001
      R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40
       refcount_add include/linux/refcount.h:193 [inline]
       skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999
       sock_wmalloc+0xf1/0x120 net/core/sock.c:2096
       ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383
       udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276
       inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821
       kernel_sendpage+0x92/0xf0 net/socket.c:3794
       sock_sendpage+0x8b/0xc0 net/socket.c:936
       pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458
       splice_from_pipe_feed fs/splice.c:512 [inline]
       __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636
       splice_from_pipe+0x108/0x170 fs/splice.c:671
       generic_splice_sendpage+0x3c/0x50 fs/splice.c:842
       do_splice_from fs/splice.c:861 [inline]
       direct_splice_actor+0x123/0x190 fs/splice.c:1035
       splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990
       do_splice_direct+0x1da/0x2a0 fs/splice.c:1078
       do_sendfile+0x597/0xd00 fs/read_write.c:1464
       __do_sys_sendfile64 fs/read_write.c:1525 [inline]
       __se_sys_sendfile64 fs/read_write.c:1511 [inline]
       __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x441409
      Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409
      RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005
      RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010
      R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180
      R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 1470ddf7 ("inet: Remove explicit write references to sk/inet in ip_append_data")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      501a90c9
  17. 07 12月, 2019 2 次提交
    • G
      tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE() · 721c8daf
      Guillaume Nault 提交于
      Syncookies borrow the ->rx_opt.ts_recent_stamp field to store the
      timestamp of the last synflood. Protect them with READ_ONCE() and
      WRITE_ONCE() since reads and writes aren't serialised.
      
      Use of .rx_opt.ts_recent_stamp for storing the synflood timestamp was
      introduced by a0f82f64 ("syncookies: remove last_synq_overflow from
      struct tcp_sock"). But unprotected accesses were already there when
      timestamp was stored in .last_synq_overflow.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      721c8daf
    • G
      tcp: tighten acceptance of ACKs not matching a child socket · cb44a08f
      Guillaume Nault 提交于
      When no synflood occurs, the synflood timestamp isn't updated.
      Therefore it can be so old that time_after32() can consider it to be
      in the future.
      
      That's a problem for tcp_synq_no_recent_overflow() as it may report
      that a recent overflow occurred while, in fact, it's just that jiffies
      has grown past 'last_overflow' + TCP_SYNCOOKIE_VALID + 2^31.
      
      Spurious detection of recent overflows lead to extra syncookie
      verification in cookie_v[46]_check(). At that point, the verification
      should fail and the packet dropped. But we should have dropped the
      packet earlier as we didn't even send a syncookie.
      
      Let's refine tcp_synq_no_recent_overflow() to report a recent overflow
      only if jiffies is within the
      [last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval. This
      way, no spurious recent overflow is reported when jiffies wraps and
      'last_overflow' becomes in the future from the point of view of
      time_after32().
      
      However, if jiffies wraps and enters the
      [last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval (with
      'last_overflow' being a stale synflood timestamp), then
      tcp_synq_no_recent_overflow() still erroneously reports an
      overflow. In such cases, we have to rely on syncookie verification
      to drop the packet. We unfortunately have no way to differentiate
      between a fresh and a stale syncookie timestamp.
      
      In practice, using last_overflow as lower bound is problematic.
      If the synflood timestamp is concurrently updated between the time
      we read jiffies and the moment we store the timestamp in
      'last_overflow', then 'now' becomes smaller than 'last_overflow' and
      tcp_synq_no_recent_overflow() returns true, potentially dropping a
      valid syncookie.
      
      Reading jiffies after loading the timestamp could fix the problem,
      but that'd require a memory barrier. Let's just accommodate for
      potential timestamp growth instead and extend the interval using
      'last_overflow - HZ' as lower bound.
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb44a08f