1. 02 Nov 2019, 1 commit
  2. 01 Nov 2019, 3 commits
  3. 31 Oct 2019, 11 commits
    • net: sched: update action implementations to support flags · e3822678
      Authored by Vlad Buslov
      Extend struct tc_action with a new "tcfa_flags" field. Set the field in
      the tcf_idr_create() function and provide a new helper,
      tcf_idr_create_from_flags(), that derives the 'cpustats' boolean from the
      flags value. Update the individual hardware-offloaded actions' init()
      callbacks to pass their "flags" argument to the new helper in order to
      skip percpu stats allocation when the user requested that through flags.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e3822678
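      A minimal sketch of the new helper described above, assuming it simply maps
      TCA_ACT_FLAGS_NO_PERCPU_STATS onto the existing 'cpustats' boolean; the exact
      prototype is approximated from the commit message rather than copied from the tree.

          /* Sketch: derive the 'cpustats' boolean from the action flags. */
          static int tcf_idr_create_from_flags(struct tc_action_net *tn, u32 index,
                                               struct nlattr *est, struct tc_action **a,
                                               const struct tc_action_ops *ops,
                                               int bind, u32 flags)
          {
                  /* Allocate percpu stats unless the user explicitly opted out. */
                  return tcf_idr_create(tn, index, est, a, ops, bind,
                                        !(flags & TCA_ACT_FLAGS_NO_PERCPU_STATS), flags);
          }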
    • net: sched: extend TCA_ACT space with TCA_ACT_FLAGS · abbb0d33
      Authored by Vlad Buslov
      Extend the TCA_ACT space with an nla_bitfield32 "flags" attribute. Add
      TCA_ACT_FLAGS_NO_PERCPU_STATS as the only allowed flag. Parse the flags in
      tcf_action_init_1() and pass the resulting value as an additional argument
      to a_o->init().
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      abbb0d33
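      For reference, a hedged sketch of how a bitfield32 attribute like this is
      typically declared and parsed with the standard netlink helpers; the policy-table
      placement and the surrounding init code are illustrative.

          /* Only TCA_ACT_FLAGS_NO_PERCPU_STATS is a valid selector bit. */
          [TCA_ACT_FLAGS] = NLA_POLICY_BITFIELD32(TCA_ACT_FLAGS_NO_PERCPU_STATS),

          /* In tcf_action_init_1(), roughly: */
          u32 flags = 0;

          if (tb[TCA_ACT_FLAGS])
                  flags = nla_get_bitfield32(tb[TCA_ACT_FLAGS]).value;
          /* 'flags' is then passed as the extra argument to a_o->init(). */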
    • net: sched: modify stats helper functions to support regular stats · 5e174d5e
      Authored by Vlad Buslov
      Modify the stats update helper functions introduced in previous patches in
      this series to fall back to the regular tc_action->tcfa_{b|q}stats if percpu
      stats are not allocated for the action argument. If the regular, non-percpu
      counters are in use, take the action's tcfa_lock while modifying them.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e174d5e
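      A sketch of the fallback pattern, assuming a helper along these lines
      (identifier names approximated): the percpu counters stay lock-free, while the
      regular counters are protected by tcfa_lock.

          void tcf_action_update_bstats(struct tc_action *a, struct sk_buff *skb)
          {
                  if (likely(a->cpu_bstats)) {
                          bstats_cpu_update(this_cpu_ptr(a->cpu_bstats), skb);
                          return;
                  }
                  /* regular, non-percpu counters need the per-action lock */
                  spin_lock(&a->tcfa_lock);
                  bstats_update(&a->tcfa_bstats, skb);
                  spin_unlock(&a->tcfa_lock);
          }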
    • net: sched: don't expose action qstats to skb_tc_reinsert() · ef816f3c
      Authored by Vlad Buslov
      The previous commit introduced helper functions for updating qstats and
      refactored a set of actions to use them instead of modifying qstats
      directly. However, one of the affected actions exposes its qstats to
      skb_tc_reinsert(), which then modifies them.
      
      Refactor skb_tc_reinsert() to return an integer error code and no longer
      increment the overlimit qstats on error; instead, use the returned error
      code in tcf_mirred_act() to increment the overlimit counter through the
      new helper function.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef816f3c
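      A rough sketch of the refactoring: the helper only reports success or failure,
      and the mirred action does its own overlimit accounting. Bodies are approximated
      from the commit message.

          static inline int skb_tc_reinsert(struct sk_buff *skb, struct tcf_result *res)
          {
                  return res->ingress ? netif_receive_skb(skb) : dev_queue_xmit(skb);
          }

          /* In tcf_mirred_act(), roughly: */
          if (skb_tc_reinsert(skb, res))
                  tcf_action_inc_overlimit_qstats(&m->common);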
    • net: sched: extract qstats update code into functions · 26b537a8
      Authored by Vlad Buslov
      Extract common code that increments cpu_qstats counters into standalone act
      API functions. Change hardware offloaded actions that use percpu counter
      allocation to use the new functions instead of accessing cpu_qstats
      directly.
      
      This commit doesn't change functionality.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      26b537a8
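      A plausible shape of one such helper, sketched; the name follows the style used
      elsewhere in this series and the body is approximated, but the point is that
      call sites stop dereferencing cpu_qstats themselves.

          void tcf_action_inc_drop_qstats(struct tc_action *a)
          {
                  qstats_drop_inc(this_cpu_ptr(a->cpu_qstats));
          }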
    • net: sched: extract bstats update code into function · 5e1ad95b
      Authored by Vlad Buslov
      Extract common code that increments cpu_bstats counter into standalone act
      API function. Change hardware offloaded actions that use percpu counter
      allocation to use the new function instead of incrementing cpu_bstats
      directly.
      
      This commit doesn't change functionality.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e1ad95b
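      At the call sites the change is mechanical; a before/after sketch, with the
      helper name taken from the entries above and the action variable purely
      illustrative.

          /* before: each action open-codes the percpu update */
          bstats_cpu_update(this_cpu_ptr(a->common.cpu_bstats), skb);

          /* after: the act API hides the counter layout */
          tcf_action_update_bstats(&a->common, skb);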
    • net: sched: extract common action counters update code into function · c8ecebd0
      Authored by Vlad Buslov
      Currently, all implementations of the tc_action_ops->stats_update() callback
      have almost exactly the same counters update code (except gact, which also
      updates the drop counter). In order to simplify support for using both
      percpu-allocated and regular action counters depending on a run-time flag
      in the following patches, extract the action counters update code into a
      standalone function in the act API.
      
      This commit doesn't change functionality.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c8ecebd0
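      A sketch of the consolidated ->stats_update() helper the commit describes, with
      the gact drop counter handled via a flag and the non-percpu fallback from the
      companion patch folded in; prototype and field handling are approximated.

          void tcf_action_update_stats(struct tc_action *a, u64 bytes, u32 packets,
                                       bool drop, bool hw)
          {
                  if (a->cpu_bstats) {
                          _bstats_cpu_update(this_cpu_ptr(a->cpu_bstats), bytes, packets);
                          if (drop)
                                  this_cpu_ptr(a->cpu_qstats)->drops += packets;
                          return;
                  }
                  spin_lock(&a->tcfa_lock);
                  _bstats_update(&a->tcfa_bstats, bytes, packets);
                  if (drop)
                          a->tcfa_qstats.drops += packets;
                  spin_unlock(&a->tcfa_lock);
          }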
    • net: annotate lockless accesses to sk->sk_napi_id · ee8d153d
      Authored by Eric Dumazet
      We already annotated most accesses to sk->sk_napi_id.
      
      We missed sk_mark_napi_id() and sk_mark_napi_id_once(), which may be
      called without the socket lock held in the UDP stack.
      
      KCSAN reported :
      BUG: KCSAN: data-race in udpv6_queue_rcv_one_skb / udpv6_queue_rcv_one_skb
      
      write to 0xffff888121c6d108 of 4 bytes by interrupt on cpu 0:
       sk_mark_napi_id include/net/busy_poll.h:125 [inline]
       __udpv6_queue_rcv_skb net/ipv6/udp.c:571 [inline]
       udpv6_queue_rcv_one_skb+0x70c/0xb40 net/ipv6/udp.c:672
       udpv6_queue_rcv_skb+0xb5/0x400 net/ipv6/udp.c:689
       udp6_unicast_rcv_skb.isra.0+0xd7/0x180 net/ipv6/udp.c:832
       __udp6_lib_rcv+0x69c/0x1770 net/ipv6/udp.c:913
       udpv6_rcv+0x2b/0x40 net/ipv6/udp.c:1015
       ip6_protocol_deliver_rcu+0x22a/0xbe0 net/ipv6/ip6_input.c:409
       ip6_input_finish+0x30/0x50 net/ipv6/ip6_input.c:450
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip6_input+0x177/0x190 net/ipv6/ip6_input.c:459
       dst_input include/net/dst.h:442 [inline]
       ip6_rcv_finish+0x110/0x140 net/ipv6/ip6_input.c:76
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ipv6_rcv+0x1a1/0x1b0 net/ipv6/ip6_input.c:284
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
      
      write to 0xffff888121c6d108 of 4 bytes by interrupt on cpu 1:
       sk_mark_napi_id include/net/busy_poll.h:125 [inline]
       __udpv6_queue_rcv_skb net/ipv6/udp.c:571 [inline]
       udpv6_queue_rcv_one_skb+0x70c/0xb40 net/ipv6/udp.c:672
       udpv6_queue_rcv_skb+0xb5/0x400 net/ipv6/udp.c:689
       udp6_unicast_rcv_skb.isra.0+0xd7/0x180 net/ipv6/udp.c:832
       __udp6_lib_rcv+0x69c/0x1770 net/ipv6/udp.c:913
       udpv6_rcv+0x2b/0x40 net/ipv6/udp.c:1015
       ip6_protocol_deliver_rcu+0x22a/0xbe0 net/ipv6/ip6_input.c:409
       ip6_input_finish+0x30/0x50 net/ipv6/ip6_input.c:450
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip6_input+0x177/0x190 net/ipv6/ip6_input.c:459
       dst_input include/net/dst.h:442 [inline]
       ip6_rcv_finish+0x110/0x140 net/ipv6/ip6_input.c:76
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ipv6_rcv+0x1a1/0x1b0 net/ipv6/ip6_input.c:284
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 10890 Comm: syz-executor.0 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: e68b6e50 ("udp: enable busy polling for all sockets")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee8d153d
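      The annotation pattern, sketched; this is conceptually what the busy-poll helper
      ends up doing (exact body approximated).

          static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
          {
          #ifdef CONFIG_NET_RX_BUSY_POLL
                  /* UDP may call this without the socket lock; make the racy
                   * store explicit so KCSAN and the compiler leave it intact. */
                  WRITE_ONCE(sk->sk_napi_id, skb->napi_id);
          #endif
          }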
    • flow_dissector: extract more ICMP information · 5dec597e
      Authored by Matteo Croce
      The ICMP flow dissector currently parses only the Type and Code fields.
      Some ICMP packets (echo, timestamp) have a 16-bit Identifier field which
      is used to correlate packets.
      Add this field to flow_dissector_key_icmp and replace skb_flow_get_be16()
      with a more complex function which populates it.
      Signed-off-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5dec597e
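      The extended key presumably looks like the sketch below; the exact layout is
      approximated from the description.

          struct flow_dissector_key_icmp {
                  struct {
                          u8 type;
                          u8 code;
                  };
                  u16 id;         /* echo/timestamp Identifier, 0 for other types */
          };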
    • flow_dissector: add meaningful comments · 98298e6c
      Authored by Matteo Croce
      Document two pieces of code that can't be understood at a glance.
      Signed-off-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      98298e6c
    • net: annotate accesses to sk->sk_incoming_cpu · 7170a977
      Authored by Eric Dumazet
      This socket field can be read and written by concurrent CPUs.
      
      Use READ_ONCE() and WRITE_ONCE() annotations to document this,
      and avoid some compiler 'optimizations'.
      
      KCSAN reported :
      
      BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv
      
      write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0:
       sk_incoming_cpu_update include/net/sock.h:953 [inline]
       tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
       do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
       do_softirq kernel/softirq.c:329 [inline]
       __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189
      
      read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1:
       sk_incoming_cpu_update include/net/sock.h:952 [inline]
       tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
       smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7170a977
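      Based on the sk_incoming_cpu_update() frames in the report, the annotated update
      helper plausibly becomes something like this (an approximation):

          static inline void sk_incoming_cpu_update(struct sock *sk)
          {
                  int cpu = raw_smp_processor_id();

                  /* Racy by design: both the read and the write are annotated. */
                  if (unlikely(READ_ONCE(sk->sk_incoming_cpu) != cpu))
                          WRITE_ONCE(sk->sk_incoming_cpu, cpu);
          }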
  4. 30 Oct 2019, 2 commits
  5. 29 Oct 2019, 3 commits
    • net: dsa: Add support for devlink device parameters · 6b297524
      Authored by Andrew Lunn
      Add plumbing to allow DSA drivers to register parameters with devlink.
      
      To keep with the abstraction, DSA drivers pass the ds structure to
      these helpers, and the DSA core then translates it into the devlink
      structure associated with the device.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6b297524
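      A hedged sketch of the plumbing: the driver hands over its dsa_switch, and the
      DSA core resolves the matching devlink instance before calling the regular
      devlink API (wrapper name approximated from the commit description).

          int dsa_devlink_params_register(struct dsa_switch *ds,
                                          const struct devlink_param *params,
                                          size_t params_count)
          {
                  /* ds->devlink is the devlink instance DSA created for the switch */
                  return devlink_params_register(ds->devlink, params, params_count);
          }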
    • net: fix sk_page_frag() recursion from memory reclaim · 20eb4f29
      Authored by Tejun Heo
      sk_page_frag() optimizes skb_frag allocations by using per-task
      skb_frag cache when it knows it's the only user.  The condition is
      determined by seeing whether the socket allocation mask allows
      blocking - if the allocation may block, it obviously owns the task's
      context and ergo exclusively owns current->task_frag.
      
      Unfortunately, this misses recursion through memory reclaim path.
      Please take a look at the following backtrace.
      
       [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
           ...
           tcp_sendmsg+0x27/0x40
           sock_sendmsg+0x30/0x40
           sock_xmit.isra.24+0xa1/0x170 [nbd]
           nbd_send_cmd+0x1d2/0x690 [nbd]
           nbd_queue_rq+0x1b5/0x3b0 [nbd]
           __blk_mq_try_issue_directly+0x108/0x1b0
           blk_mq_request_issue_directly+0xbd/0xe0
           blk_mq_try_issue_list_directly+0x41/0xb0
           blk_mq_sched_insert_requests+0xa2/0xe0
           blk_mq_flush_plug_list+0x205/0x2a0
           blk_flush_plug_list+0xc3/0xf0
       [1] blk_finish_plug+0x21/0x2e
           _xfs_buf_ioapply+0x313/0x460
           __xfs_buf_submit+0x67/0x220
           xfs_buf_read_map+0x113/0x1a0
           xfs_trans_read_buf_map+0xbf/0x330
           xfs_btree_read_buf_block.constprop.42+0x95/0xd0
           xfs_btree_lookup_get_block+0x95/0x170
           xfs_btree_lookup+0xcc/0x470
           xfs_bmap_del_extent_real+0x254/0x9a0
           __xfs_bunmapi+0x45c/0xab0
           xfs_bunmapi+0x15/0x30
           xfs_itruncate_extents_flags+0xca/0x250
           xfs_free_eofblocks+0x181/0x1e0
           xfs_fs_destroy_inode+0xa8/0x1b0
           destroy_inode+0x38/0x70
           dispose_list+0x35/0x50
           prune_icache_sb+0x52/0x70
           super_cache_scan+0x120/0x1a0
           do_shrink_slab+0x120/0x290
           shrink_slab+0x216/0x2b0
           shrink_node+0x1b6/0x4a0
           do_try_to_free_pages+0xc6/0x370
           try_to_free_mem_cgroup_pages+0xe3/0x1e0
           try_charge+0x29e/0x790
           mem_cgroup_charge_skmem+0x6a/0x100
           __sk_mem_raise_allocated+0x18e/0x390
           __sk_mem_schedule+0x2a/0x40
       [0] tcp_sendmsg_locked+0x8eb/0xe10
           tcp_sendmsg+0x27/0x40
           sock_sendmsg+0x30/0x40
           ___sys_sendmsg+0x26d/0x2b0
           __sys_sendmsg+0x57/0xa0
           do_syscall_64+0x42/0x100
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      In [0], tcp_sendmsg_locked() was using current->page_frag when it
      called sk_wmem_schedule().  It had already calculated how many bytes
      can fit into current->page_frag.  Due to memory pressure,
      sk_wmem_schedule() called into memory reclaim path which called into
      xfs and then IO issue path.  Because the filesystem in question is
      backed by nbd, the control goes back into the tcp layer - back into
      tcp_sendmsg_locked().
      
      nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC) which makes
      sense - it's in the process of freeing memory and wants to be able to,
      e.g., drop clean pages to make forward progress.  However, this
      confused sk_page_frag() called from [2].  Because it only tests
      whether the allocation allows blocking, which it does, it now thinks
      current->page_frag can be used again although it was already being
      used in [0].
      
      After [2] used current->page_frag, the offset would be increased by
      the used amount.  When the control returns to [0],
      current->page_frag's offset is increased and the previously calculated
      number of bytes now may overrun the end of allocated memory leading to
      silent memory corruptions.
      
      Fix it by adding gfpflags_normal_context(), which tests sleepable &&
      !reclaim, and using it to determine whether to use current->task_frag.
      
      v2: Eric didn't like gfp flags being tested twice.  Introduce a new
          helper gfpflags_normal_context() and combine the two tests.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      20eb4f29
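      A sketch of the two pieces of the fix: the new gfp helper and sk_page_frag()
      switching on it. The GFP bits are the real ones; treat the exact expressions as
      approximations.

          /* "Normal" context: the allocation may sleep and is not itself part
           * of memory reclaim (nbd sets __GFP_MEMALLOC on its socket). */
          static inline bool gfpflags_normal_context(const gfp_t gfp_flags)
          {
                  return (gfp_flags & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC)) ==
                          __GFP_DIRECT_RECLAIM;
          }

          static inline struct page_frag *sk_page_frag(struct sock *sk)
          {
                  if (gfpflags_normal_context(sk->sk_allocation))
                          return &current->task_frag;  /* safe to reuse the task cache */
                  return &sk->sk_frag;                 /* reclaim path: use the socket's own frag */
          }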
    • net: Fix various misspellings of "connect" · e1b18549
      Authored by Geert Uytterhoeven
      Fix misspellings of "disconnect", "disconnecting", "connections", and
      "disconnected".
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Acked-by: Kalle Valo <kvalo@codeaurora.org>
      Acked-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e1b18549
  6. 26 Oct 2019, 2 commits
    • netns: fix GFP flags in rtnl_net_notifyid() · d4e4fdf9
      Authored by Guillaume Nault
      In rtnl_net_notifyid(), we certainly can't pass a null GFP flag to
      rtnl_notify(). A GFP_KERNEL flag would be fine in most circumstances,
      but there are a few paths calling rtnl_net_notifyid() from atomic
      context or from RCU critical sections. The latter also precludes the use
      of gfp_any(), as it wouldn't detect the RCU case. The nlmsg_new()
      call is wrong too, as it uses GFP_KERNEL unconditionally.
      
      Therefore, we need to pass the GFP flags as a parameter and propagate
      them through the call chain until the proper flags can be determined.
      
      In most cases, GFP_KERNEL is fine. The exceptions are:
        * openvswitch: ovs_vport_cmd_get() and ovs_vport_cmd_dump()
          indirectly call rtnl_net_notifyid() from RCU critical section,
      
        * rtnetlink: rtmsg_ifinfo_build_skb() already receives GFP flags as
          parameter.
      
      Also, in ovs_vport_cmd_build_info(), let's change the GFP flags used
      by nlmsg_new(). The function is allowed to sleep, so better make the
      flags consistent with the ones used in the following
      ovs_vport_cmd_fill_info() call.
      
      Found by code inspection.
      
      Fixes: 9a963454 ("netns: notify netns id events")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4e4fdf9
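      The shape of the fix, sketched: the notifier takes the GFP flags from its caller
      instead of hard-coding them. The parameter list is approximated.

          static void rtnl_net_notifyid(struct net *net, int cmd, int id,
                                        u32 portid, struct nlmsghdr *nlh, gfp_t gfp)
          {
                  struct sk_buff *msg;

                  msg = nlmsg_new(NLMSG_DEFAULT_SIZE, gfp);      /* was GFP_KERNEL */
                  if (!msg)
                          return;
                  /* ... fill in the NEWNSID/DELNSID message ... */
                  rtnl_notify(msg, net, portid, RTNLGRP_NSID, nlh, gfp); /* was 0 */
          }

          /* RCU read-side callers (the openvswitch vport paths) pass GFP_ATOMIC;
           * everyone else can keep GFP_KERNEL. */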
    • net: hwbm: if CONFIG_NET_HWBM unset, make stub functions static · 91e2e576
      Authored by Ben Dooks (Codethink)
      If CONFIG_NET_HWBM is not set, then these stub functions in
      <net/hwbm.h> should be declared static to avoid trying to
      export them from every driver that includes this header.
      
      Fixes the following sparse warnings:
      
      ./include/net/hwbm.h:24:6: warning: symbol 'hwbm_buf_free' was not declared. Should it be static?
      ./include/net/hwbm.h:25:5: warning: symbol 'hwbm_pool_refill' was not declared. Should it be static?
      ./include/net/hwbm.h:26:5: warning: symbol 'hwbm_pool_add' was not declared. Should it be static?
      Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      91e2e576
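      The fix amounts to the stubs below becoming static (inline) in <net/hwbm.h>, so
      each includer gets a private, optimized-away copy instead of a would-be exported
      symbol; the bodies are approximated.

          #else /* !CONFIG_NET_HWBM */
          static inline void hwbm_buf_free(struct hwbm_pool *bm_pool, void *buf) {}
          static inline int hwbm_pool_refill(struct hwbm_pool *bm_pool, gfp_t gfp)
          { return 0; }
          static inline int hwbm_pool_add(struct hwbm_pool *bm_pool, unsigned int buf_num)
          { return 0; }
          #endif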
  7. 25 Oct 2019, 3 commits
    • net: remove unnecessary variables and callback · f3b0a18b
      Authored by Taehee Yoo
      This patch removes the variables and the callback that are related to the
      nested device structure.
      Devices that can be nested have their own nest_level variable that
      represents the depth of nesting.
      In the previous patch, new {lower/upper}_level variables were added to
      replace the old private nest_level variable.
      So, this patch removes all 'nest_level' variables.
      
      In order to avoid lockdep warnings, ->ndo_get_lock_subclass() was added
      to obtain the lockdep subclass value, which is actually the lower nesting depth.
      But drivers now use dynamic lockdep keys instead of the subclass to avoid
      those warnings.
      So, this patch removes the ->ndo_get_lock_subclass() callback.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f3b0a18b
    • vxlan: add adjacent link to limit depth level · 0ce1822c
      Authored by Taehee Yoo
      The current vxlan code doesn't limit the number of nested devices.
      Nested devices are handled recursively, and that routine needs a large
      amount of stack memory, so an unlimited nesting depth can overflow
      the stack.
      
      In order to fix this issue, this patch adds adjacent links.
      The adjacent link APIs internally check the depth level.
      
      Test commands:
          ip link add dummy0 type dummy
          ip link add vxlan0 type vxlan id 0 group 239.1.1.1 dev dummy0 \
      	    dstport 4789
          for i in {1..100}
          do
      	    let A=$i-1
      	    ip link add vxlan$i type vxlan id $i group 239.1.1.1 \
      		    dev vxlan$A dstport 4789
          done
          ip link del dummy0
      
      The top upper link is vxlan100 and the lowest link is vxlan0.
      When vxlan0 is deleted, the upper devices are deleted recursively.
      That needs a large amount of stack memory, so it overflows the stack.
      
      Splat looks like:
      [  229.628477] =============================================================================
      [  229.629785] BUG page->ptl (Not tainted): Padding overwritten. 0x0000000026abf214-0x0000000091f6abb2
      [  229.629785] -----------------------------------------------------------------------------
      [  229.629785]
      [  229.655439] ==================================================================
      [  229.629785] INFO: Slab 0x00000000ff7cfda8 objects=19 used=19 fp=0x00000000fe33776c flags=0x200000000010200
      [  229.655688] BUG: KASAN: stack-out-of-bounds in unmap_single_vma+0x25a/0x2e0
      [  229.655688] Read of size 8 at addr ffff888113076928 by task vlan-network-in/2334
      [  229.655688]
      [  229.629785] Padding 0000000026abf214: 00 80 14 0d 81 88 ff ff 68 91 81 14 81 88 ff ff  ........h.......
      [  229.629785] Padding 0000000001e24790: 38 91 81 14 81 88 ff ff 68 91 81 14 81 88 ff ff  8.......h.......
      [  229.629785] Padding 00000000b39397c8: 33 30 62 a7 ff ff ff ff ff eb 60 22 10 f1 ff 1f  30b.......`"....
      [  229.629785] Padding 00000000bc98f53a: 80 60 07 13 81 88 ff ff 00 80 14 0d 81 88 ff ff  .`..............
      [  229.629785] Padding 000000002aa8123d: 68 91 81 14 81 88 ff ff f7 21 17 a7 ff ff ff ff  h........!......
      [  229.629785] Padding 000000001c8c2369: 08 81 14 0d 81 88 ff ff 03 02 00 00 00 00 00 00  ................
      [  229.629785] Padding 000000004e290c5d: 21 90 a2 21 10 ed ff ff 00 00 00 00 00 fc ff df  !..!............
      [  229.629785] Padding 000000000e25d731: 18 60 07 13 81 88 ff ff c0 8b 13 05 81 88 ff ff  .`..............
      [  229.629785] Padding 000000007adc7ab3: b3 8a b5 41 00 00 00 00                          ...A....
      [  229.629785] FIX page->ptl: Restoring 0x0000000026abf214-0x0000000091f6abb2=0x5a
      [  ... ]
      
      Fixes: acaf4e70 ("net: vxlan: when lower dev unregisters remove vxlan dev as well")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ce1822c
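      Illustration only: the adjacent-link helpers can refuse a link that would push
      the combined nesting depth past the supported maximum. The field names follow
      the upper/lower-level scheme mentioned in the neighbouring commits, but the
      exact check and constant are approximated.

          #define MAX_NEST_DEV 8

          static int check_nest_depth(const struct net_device *upper,
                                      const struct net_device *lower)
          {
                  if (upper->upper_level + lower->lower_level > MAX_NEST_DEV)
                          return -EMLINK;         /* too many nested devices */
                  return 0;
          }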
    • bonding: use dynamic lockdep key instead of subclass · 089bca2c
      Authored by Taehee Yoo
      All bonding devices share the same lockdep key, and the subclass is
      initialized with nest_level.
      But the actual nest_level value can change when a lower device is attached,
      and at that moment the subclass should be updated, which appears to be
      unsafe.
      So this patch makes bonding use a dynamic lockdep key instead of the
      subclass.
      
      Test commands:
          ip link add bond0 type bond
      
          for i in {1..5}
          do
      	    let A=$i-1
      	    ip link add bond$i type bond
      	    ip link set bond$i master bond$A
          done
          ip link set bond5 master bond0
      
      Splat looks like:
      [  307.992912] WARNING: possible recursive locking detected
      [  307.993656] 5.4.0-rc3+ #96 Tainted: G        W
      [  307.994367] --------------------------------------------
      [  307.995092] ip/761 is trying to acquire lock:
      [  307.995710] ffff8880513aac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
      [  307.997045]
      	       but task is already holding lock:
      [  307.997923] ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
      [  307.999215]
      	       other info that might help us debug this:
      [  308.000251]  Possible unsafe locking scenario:
      
      [  308.001137]        CPU0
      [  308.001533]        ----
      [  308.001915]   lock(&(&bond->stats_lock)->rlock#2/2);
      [  308.002609]   lock(&(&bond->stats_lock)->rlock#2/2);
      [  308.003302]
      		*** DEADLOCK ***
      
      [  308.004310]  May be due to missing lock nesting notation
      
      [  308.005319] 3 locks held by ip/761:
      [  308.005830]  #0: ffffffff9fcc42b0 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
      [  308.006894]  #1: ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
      [  308.008243]  #2: ffffffff9f9219c0 (rcu_read_lock){....}, at: bond_get_stats+0x9f/0x500 [bonding]
      [  308.009422]
      	       stack backtrace:
      [  308.010124] CPU: 0 PID: 761 Comm: ip Tainted: G        W         5.4.0-rc3+ #96
      [  308.011097] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  308.012179] Call Trace:
      [  308.012601]  dump_stack+0x7c/0xbb
      [  308.013089]  __lock_acquire+0x269d/0x3de0
      [  308.013669]  ? register_lock_class+0x14d0/0x14d0
      [  308.014318]  lock_acquire+0x164/0x3b0
      [  308.014858]  ? bond_get_stats+0xb8/0x500 [bonding]
      [  308.015520]  _raw_spin_lock_nested+0x2e/0x60
      [  308.016129]  ? bond_get_stats+0xb8/0x500 [bonding]
      [  308.017215]  bond_get_stats+0xb8/0x500 [bonding]
      [  308.018454]  ? bond_arp_rcv+0xf10/0xf10 [bonding]
      [  308.019710]  ? rcu_read_lock_held+0x90/0xa0
      [  308.020605]  ? rcu_read_lock_sched_held+0xc0/0xc0
      [  308.021286]  ? bond_get_stats+0x9f/0x500 [bonding]
      [  308.021953]  dev_get_stats+0x1ec/0x270
      [  308.022508]  bond_get_stats+0x1d1/0x500 [bonding]
      
      Fixes: d3fff6c4 ("net: add netdev_lockdep_set_classes() helper")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      089bca2c
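      The lockdep primitives involved are real (lockdep_register_key(),
      lockdep_set_class(), lockdep_unregister_key()); where exactly bonding calls them
      and the key field name are approximated in this sketch.

          /* setup: give this bond's stats_lock its own lock class */
          lockdep_register_key(&bond->stats_lock_key);
          lockdep_set_class(&bond->stats_lock, &bond->stats_lock_key);

          /* teardown: release the dynamic key once the lock is no longer used */
          lockdep_unregister_key(&bond->stats_lock_key);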
  8. 24 Oct 2019, 2 commits
    • ipvs: move old_secure_tcp into struct netns_ipvs · c24b75e0
      Authored by Eric Dumazet
      syzbot reported the following issue :
      
      BUG: KCSAN: data-race in update_defense_level / update_defense_level
      
      read to 0xffffffff861a6260 of 4 bytes by task 3006 on cpu 1:
       update_defense_level+0x621/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:177
       defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225
       process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
       worker_thread+0xa0/0x800 kernel/workqueue.c:2415
       kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
      
      write to 0xffffffff861a6260 of 4 bytes by task 7333 on cpu 0:
       update_defense_level+0xa62/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:205
       defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225
       process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
       worker_thread+0xa0/0x800 kernel/workqueue.c:2415
       kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 7333 Comm: kworker/0:5 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events defense_work_handler
      
      Indeed, old_secure_tcp is currently a static variable, while it
      needs to be a per-netns variable.
      
      Fixes: a0840e2e ("IPVS: netns, ip_vs_ctl local vars moved to ipvs struct.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      c24b75e0
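      The change, sketched: the file-scope static moves into struct netns_ipvs so each
      namespace's defense work only touches its own copy (field placement approximated).

          /* before: one global, shared by every namespace */
          static int old_secure_tcp;

          /* after (sketch): per-namespace state next to the sysctl it mirrors */
          struct netns_ipvs {
                  /* ... */
                  int     sysctl_secure_tcp;
                  int     old_secure_tcp;
                  /* ... */
          };
          /* update_defense_level() then reads/writes ipvs->old_secure_tcp. */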
    • net/flow_dissector: switch to siphash · 55667441
      Authored by Eric Dumazet
      The auto flowlabels of IPv6 UDP packets use a 32-bit secret
      (the static u32 hashrnd in net/core/flow_dissector.c) and
      apply jhash() over fields known to the receivers.
      
      Attackers can easily infer the 32-bit secret and use this information
      to identify a device and/or user, since this secret is only
      set at boot time.
      
      Really, using jhash() to generate cookies sent on the wire
      is a serious security concern.
      
      Trying to change the rol32(hash, 16) in ip6_make_flowlabel() would be
      a dead end. Trying to periodically change the secret (like in sch_sfq.c)
      could change paths taken in the network for long lived flows.
      
      Let's switch to siphash, as we did in commit df453700
      ("inet: switch IP ID generator to siphash")
      
      Using a cryptographically strong pseudo random function will solve this
      privacy issue and more generally remove other weak points in the stack.
      
      Packet schedulers using skb_get_hash_perturb() benefit from this change.
      
      Fixes: b5677416 ("ipv6: Enable auto flow labels by default")
      Fixes: 42240901 ("ipv6: Implement different admin modes for automatic flow labels")
      Fixes: 67800f9b ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel")
      Fixes: cb1ce2ef ("ipv6: Implement automatic flow label generation on transmit")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jonathan Berger <jonathann1@walla.com>
      Reported-by: Amit Klein <aksecurity@gmail.com>
      Reported-by: Benny Pinkas <benny@pinkas.net>
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55667441
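      A hedged sketch of the switch: seed a siphash key once and hash the flow keys
      with it instead of jhash. siphash() and net_get_random_once() are the real
      interfaces; the surrounding plumbing in flow_dissector.c is approximated.

          #include <linux/siphash.h>

          static siphash_key_t hashrnd __read_mostly;

          static u32 flow_keys_hash(const struct flow_keys *keys)
          {
                  net_get_random_once(&hashrnd, sizeof(hashrnd));
                  return (u32)siphash(flow_keys_hash_start(keys),
                                      flow_keys_hash_length(keys), &hashrnd);
          }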
  9. 23 Oct 2019, 11 commits
  10. 22 Oct 2019, 1 commit
  11. 20 Oct 2019, 1 commit
    • net: reorder 'struct net' fields to avoid false sharing · 2a06b898
      Authored by Eric Dumazet
      Intel test robot reported a ~7% regression on TCP_CRR tests
      that they bisected to the cited commit.
      
      Indeed, every time a new TCP socket is created or deleted,
      the atomic counter net->count is touched (via get_net(net)
      and put_net(net) calls).
      
      So CPUs might have to reload a contended cache line in
      net_hash_mix(net) calls.
      
      We need to reorder 'struct net' fields to move @hash_mix
      in a read mostly cache line.
      
      We move in the first cache line fields that can be
      dirtied often.
      
      We probably will have to address in a followup patch
      the __randomize_layout that was added in linux-4.13,
      since this might break our placement choices.
      
      Fixes: 355b9855 ("netns: provide pure entropy for net_hash_mix()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2a06b898
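      Schematically (not the actual field list), the idea is to keep the often-dirtied
      reference counts together in the leading cache line and keep read-mostly fields
      such as hash_mix away from them:

          struct net {
                  /* first cache line: fields written on every socket create/destroy */
                  refcount_t      passive;
                  refcount_t      count;
                  spinlock_t      rules_mod_lock;

                  /* read-mostly from here on; net_hash_mix() only reads hash_mix */
                  unsigned int    hash_mix;
                  /* ... */
          };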