1. 06 Sep, 2019 1 commit
    •
      net: openvswitch: Set OvS recirc_id from tc chain index · 95a7233c
      Paul Blakey committed
      Offloaded OvS datapath rules are translated one to one to tc rules,
      for example the following simplified OvS rule:
      
      recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) actions:ct(),recirc(2)
      
      Will be translated to the following tc rule:
      
      $ tc filter add dev dev1 ingress \
      	    prio 1 chain 0 proto ip \
      		flower tcp ct_state -trk \
      		action ct pipe \
      		action goto chain 2
      
      Received packets first travel through tc, and if tc does not steal
      them, as with the rule above, they continue to the OvS datapath.
      Since some actions have already run (action ct in this case), which
      might have modified the packets and updated action stats, we would like
      to continue the processing in OvS with the correct recirc_id
      (here recirc_id(2)), where tc left off.
      
      To support this, introduce a new skb extension for tc, which
      is used to translate the tc chain to the OvS recirc_id in
      these miss cases. The last tc chain index is set by the
      tc goto chain action and read by the OvS datapath.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
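The hand-off described in this commit message can be pictured with a small userspace model. This is only a sketch under stated assumptions: `struct tc_ext_model`, `struct pkt_model`, `struct flow_key_model`, `tc_goto_chain_model()` and `ovs_key_extract_model()` are hypothetical stand-ins invented for illustration, not the kernel's types or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-packet extension: remembers the last tc chain index. */
struct tc_ext_model {
	uint32_t chain;
};

/* Hypothetical packet; ext stays NULL until tc attaches the extension. */
struct pkt_model {
	struct tc_ext_model *ext;
};

/* Hypothetical subset of the OvS flow key. */
struct flow_key_model {
	uint32_t recirc_id;
};

/* Models "action goto chain N": record the target chain on the packet. */
static void tc_goto_chain_model(struct pkt_model *pkt,
				struct tc_ext_model *storage, uint32_t chain)
{
	assert(pkt != NULL && storage != NULL);
	storage->chain = chain;
	pkt->ext = storage;
}

/* Models the OvS side on a tc miss: resume from the recorded chain,
 * or from recirc_id 0 when tc never attached an extension. */
static void ovs_key_extract_model(const struct pkt_model *pkt,
				  struct flow_key_model *key)
{
	assert(pkt != NULL && key != NULL);
	key->recirc_id = pkt->ext ? pkt->ext->chain : 0;
}
```

In this model a packet that missed in tc after `goto chain 2` is looked up in the datapath with `recirc_id` 2, while a packet that never passed a goto action starts at `recirc_id` 0.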
  2. 01 Sep, 2019 3 commits
  3. 28 Aug, 2019 1 commit
    •
      net: fix skb use after free in netpoll · 2c1644cf
      Feng Sun committed
      After commit baeababb
      ("tun: return NET_XMIT_DROP for dropped packets"),
      when tun_net_xmit drops a packet, it frees the skb and returns
      NET_XMIT_DROP. netpoll_send_skb_on_dev then runs into the following
      use-after-free cases:
      1. it retries netpoll_start_xmit with the freed skb;
      2. it queues the freed skb in npinfo->txq.
      queue_process runs into the same use-after-free case.
      
      Hitting the first netpoll_send_skb_on_dev case produces the following
      kernel log:
      
      [  117.864773] kernel BUG at mm/slub.c:306!
      [  117.864773] invalid opcode: 0000 [#1] SMP PTI
      [  117.864774] CPU: 3 PID: 2627 Comm: loop_printmsg Kdump: loaded Tainted: P           OE     5.3.0-050300rc5-generic #201908182231
      [  117.864775] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      [  117.864775] RIP: 0010:kmem_cache_free+0x28d/0x2b0
      [  117.864781] Call Trace:
      [  117.864781]  ? tun_net_xmit+0x21c/0x460
      [  117.864781]  kfree_skbmem+0x4e/0x60
      [  117.864782]  kfree_skb+0x3a/0xa0
      [  117.864782]  tun_net_xmit+0x21c/0x460
      [  117.864782]  netpoll_start_xmit+0x11d/0x1b0
      [  117.864788]  netpoll_send_skb_on_dev+0x1b8/0x200
      [  117.864789]  __br_forward+0x1b9/0x1e0 [bridge]
      [  117.864789]  ? skb_clone+0x53/0xd0
      [  117.864790]  ? __skb_clone+0x2e/0x120
      [  117.864790]  deliver_clone+0x37/0x50 [bridge]
      [  117.864790]  maybe_deliver+0x89/0xc0 [bridge]
      [  117.864791]  br_flood+0x6c/0x130 [bridge]
      [  117.864791]  br_dev_xmit+0x315/0x3c0 [bridge]
      [  117.864792]  netpoll_start_xmit+0x11d/0x1b0
      [  117.864792]  netpoll_send_skb_on_dev+0x1b8/0x200
      [  117.864792]  netpoll_send_udp+0x2c6/0x3e8
      [  117.864793]  write_msg+0xd9/0xf0 [netconsole]
      [  117.864793]  console_unlock+0x386/0x4e0
      [  117.864793]  vprintk_emit+0x17e/0x280
      [  117.864794]  vprintk_default+0x29/0x50
      [  117.864794]  vprintk_func+0x4c/0xbc
      [  117.864794]  printk+0x58/0x6f
      [  117.864795]  loop_fun+0x24/0x41 [printmsg_loop]
      [  117.864795]  kthread+0x104/0x140
      [  117.864795]  ? 0xffffffffc05b1000
      [  117.864796]  ? kthread_park+0x80/0x80
      [  117.864796]  ret_from_fork+0x35/0x40
      Signed-off-by: Feng Sun <loyou85@gmail.com>
      Signed-off-by: Xiaojun Zhao <xiaojunzhao141@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
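The commit message above describes the bug rather than the change that fixes it. The ownership rule it implies (once the driver has freed the buffer, the caller must neither retry nor queue it) can be sketched in userspace as follows. Everything here is an assumption made for illustration: the `*_model` names, the three placeholder return codes, and the retry count are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Placeholder return codes for this model only. */
enum xmit_rc_model {
	XMIT_OK_MODEL,   /* driver sent the buffer and owns it now     */
	XMIT_DROP_MODEL, /* driver dropped AND freed the buffer        */
	XMIT_BUSY_MODEL  /* driver did not take it; caller still owns  */
};

struct buf_model {
	int payload;
};

typedef enum xmit_rc_model (*xmit_fn_model)(struct buf_model *buf);

static int xmit_calls_model; /* how many times a driver was invoked */

/* Driver that behaves like the tun case above: frees, then reports a drop. */
static enum xmit_rc_model drop_driver_model(struct buf_model *buf)
{
	xmit_calls_model++;
	free(buf);
	return XMIT_DROP_MODEL;
}

/* Driver that is always busy and leaves the buffer untouched. */
static enum xmit_rc_model busy_driver_model(struct buf_model *buf)
{
	(void)buf;
	xmit_calls_model++;
	return XMIT_BUSY_MODEL;
}

/* True when the driver consumed the buffer, whether it was sent or dropped. */
static bool buf_consumed_model(enum xmit_rc_model rc)
{
	return rc != XMIT_BUSY_MODEL;
}

/* Retry loop: returns the buffer if the caller should queue it for later,
 * or NULL if the driver consumed it and it must not be touched again. */
static struct buf_model *send_with_retry_model(struct buf_model *buf,
					       xmit_fn_model xmit, int tries)
{
	assert(buf != NULL && tries > 0);
	while (tries-- > 0) {
		if (buf_consumed_model(xmit(buf)))
			return NULL;
	}
	return buf;
}
```

A loop that stopped only on `XMIT_OK_MODEL` would call the dropping driver again with a freed buffer and then queue it, which corresponds to the two cases listed in the message.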
  4. 25 Aug, 2019 2 commits
  5. 24 Aug, 2019 3 commits
  6. 20 Aug, 2019 2 commits
    •
      tcp: make sure EPOLLOUT wont be missed · ef8d8ccd
      Eric Dumazet committed
      As Jason Baron explained in commit 790ba456 ("tcp: set SOCK_NOSPACE
      under memory pressure"), it is crucial we properly set SOCK_NOSPACE
      when needed.
      
      However, Jason's patch had a bug, because the 'nonblocking' status
      as far as sk_stream_wait_memory() is concerned is governed
      by the MSG_DONTWAIT flag passed at sendmsg() time:
      
          long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
      
      So it is very possible that tcp sendmsg() calls sk_stream_wait_memory(),
      and that sk_stream_wait_memory() returns -EAGAIN with SOCK_NOSPACE
      cleared, if sk->sk_sndtimeo has been set to a small (but not zero)
      value.
      
      This patch removes the 'noblock' variable since we must always
      set SOCK_NOSPACE if -EAGAIN is returned.
      
      It also renames the do_nonblock label since we might reach this
      code path even if we were in blocking mode.
      
      Fixes: 790ba456 ("tcp: set SOCK_NOSPACE under memory pressure")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Reported-by: Vladimir Rutsky <rutsky@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Jason Baron <jbaron@akamai.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
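The failure mode in this message, a small but non-zero send timeout that expires while waiting, can be reproduced in a simplified userspace model. This is a sketch only: `struct wait_sock_model`, `wait_memory_model()` and the `fixed` switch are hypothetical names made up here, time is counted in abstract ticks, and the real sk_stream_wait_memory() does considerably more.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct wait_sock_model {
	bool nospace;            /* stands in for the SOCK_NOSPACE bit      */
	long ticks_until_memory; /* ticks before send memory frees up       */
};

/* fixed == false models the behaviour described as buggy above,
 * fixed == true models "always set SOCK_NOSPACE if -EAGAIN is returned". */
static int wait_memory_model(struct wait_sock_model *sk, long *timeo,
			     bool fixed)
{
	/* Buggy variant: decided once, from the initial timeout only. */
	bool noblock = (*timeo == 0);

	assert(*timeo >= 0);
	sk->nospace = false;
	for (;;) {
		if (sk->ticks_until_memory == 0)
			return 0; /* memory became available */
		if (*timeo == 0) {
			if (fixed || noblock)
				sk->nospace = true;
			return -EAGAIN;
		}
		/* "Sleep" for one tick, consuming part of the timeout. */
		(*timeo)--;
		sk->ticks_until_memory--;
	}
}
```

With an initial timeout of one tick and memory five ticks away, the unfixed variant returns `-EAGAIN` with `nospace` still false, so a caller waiting for EPOLLOUT in this model would never be woken; the fixed variant sets the flag on every `-EAGAIN` return.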
    •
      net: flow_offload: convert block_ing_cb_list to regular list type · 607f625b
      Vlad Buslov committed
      The RCU list block_ing_cb_list is protected by the RCU read lock in
      flow_block_ing_cmd() and by the flow_indr_block_ing_cb_lock mutex in all
      other functions that use it. However, flow_block_ing_cmd() needs to call
      blocking functions while iterating block_ing_cb_list, which leads to the
      following suspicious RCU usage warning:
      
      [  401.510948] =============================
      [  401.510952] WARNING: suspicious RCU usage
      [  401.510993] 5.3.0-rc3+ #589 Not tainted
      [  401.510996] -----------------------------
      [  401.511001] include/linux/rcupdate.h:265 Illegal context switch in RCU read-side critical section!
      [  401.511004]
                     other info that might help us debug this:
      
      [  401.511008]
                     rcu_scheduler_active = 2, debug_locks = 1
      [  401.511012] 7 locks held by test-ecmp-add-v/7576:
      [  401.511015]  #0: 00000000081d71a5 (sb_writers#4){.+.+}, at: vfs_write+0x166/0x1d0
      [  401.511037]  #1: 000000002bd338c3 (&of->mutex){+.+.}, at: kernfs_fop_write+0xef/0x1b0
      [  401.511051]  #2: 00000000c921c634 (kn->count#317){.+.+}, at: kernfs_fop_write+0xf7/0x1b0
      [  401.511062]  #3: 00000000a19cdd56 (&dev->mutex){....}, at: sriov_numvfs_store+0x6b/0x130
      [  401.511079]  #4: 000000005425fa52 (pernet_ops_rwsem){++++}, at: unregister_netdevice_notifier+0x30/0x140
      [  401.511092]  #5: 00000000c5822793 (rtnl_mutex){+.+.}, at: unregister_netdevice_notifier+0x35/0x140
      [  401.511101]  #6: 00000000c2f3507e (rcu_read_lock){....}, at: flow_block_ing_cmd+0x5/0x130
      [  401.511115]
                     stack backtrace:
      [  401.511121] CPU: 21 PID: 7576 Comm: test-ecmp-add-v Not tainted 5.3.0-rc3+ #589
      [  401.511124] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
      [  401.511127] Call Trace:
      [  401.511138]  dump_stack+0x85/0xc0
      [  401.511146]  ___might_sleep+0x100/0x180
      [  401.511154]  __mutex_lock+0x5b/0x960
      [  401.511162]  ? find_held_lock+0x2b/0x80
      [  401.511173]  ? __tcf_get_next_chain+0x1d/0xb0
      [  401.511179]  ? mark_held_locks+0x49/0x70
      [  401.511194]  ? __tcf_get_next_chain+0x1d/0xb0
      [  401.511198]  __tcf_get_next_chain+0x1d/0xb0
      [  401.511251]  ? uplink_rep_async_event+0x70/0x70 [mlx5_core]
      [  401.511261]  tcf_block_playback_offloads+0x39/0x160
      [  401.511276]  tcf_block_setup+0x1b0/0x240
      [  401.511312]  ? mlx5e_rep_indr_setup_tc_cb+0xca/0x290 [mlx5_core]
      [  401.511347]  ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
      [  401.511359]  tc_indr_block_get_and_ing_cmd+0x11b/0x1e0
      [  401.511404]  ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
      [  401.511414]  flow_block_ing_cmd+0x7e/0x130
      [  401.511453]  ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
      [  401.511462]  __flow_indr_block_cb_unregister+0x7f/0xf0
      [  401.511502]  mlx5e_nic_rep_netdevice_event+0x75/0xb0 [mlx5_core]
      [  401.511513]  unregister_netdevice_notifier+0xe9/0x140
      [  401.511554]  mlx5e_cleanup_rep_tx+0x6f/0xe0 [mlx5_core]
      [  401.511597]  mlx5e_detach_netdev+0x4b/0x60 [mlx5_core]
      [  401.511637]  mlx5e_vport_rep_unload+0x71/0xc0 [mlx5_core]
      [  401.511679]  esw_offloads_disable+0x5b/0x90 [mlx5_core]
      [  401.511724]  mlx5_eswitch_disable.cold+0xdf/0x176 [mlx5_core]
      [  401.511759]  mlx5_device_disable_sriov+0xab/0xb0 [mlx5_core]
      [  401.511794]  mlx5_core_sriov_configure+0xaf/0xd0 [mlx5_core]
      [  401.511805]  sriov_numvfs_store+0xf8/0x130
      [  401.511817]  kernfs_fop_write+0x122/0x1b0
      [  401.511826]  vfs_write+0xdb/0x1d0
      [  401.511835]  ksys_write+0x65/0xe0
      [  401.511847]  do_syscall_64+0x5c/0xb0
      [  401.511857]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  401.511862] RIP: 0033:0x7fad892d30f8
      [  401.511868] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 96 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 60 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
      [  401.511871] RSP: 002b:00007ffca2a9fad8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  401.511875] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fad892d30f8
      [  401.511878] RDX: 0000000000000002 RSI: 000055afeb072a90 RDI: 0000000000000001
      [  401.511881] RBP: 000055afeb072a90 R08: 00000000ffffffff R09: 000000000000000a
      [  401.511884] R10: 000055afeb058710 R11: 0000000000000246 R12: 0000000000000002
      [  401.511887] R13: 00007fad893a8780 R14: 0000000000000002 R15: 00007fad893a3740
      
      To fix the described incorrect RCU usage, convert block_ing_cb_list from
      RCU list to regular list and protect it with flow_indr_block_ing_cb_lock
      mutex in flow_block_ing_cmd().
      
      Fixes: 1150ab0f ("flow_offload: support get multi-subsystem block")
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 18 Aug, 2019 9 commits
  8. 16 Aug, 2019 2 commits
  9. 14 Aug, 2019 2 commits
  10. 12 Aug, 2019 10 commits
  11. 10 Aug, 2019 2 commits
    •
      sock: make cookie generation global instead of per netns · cd48bdda
      Daniel Borkmann committed
      Generating and retrieving socket cookies is a useful feature that is
      exposed to BPF for various program types through the
      bpf_get_socket_cookie() helper.
      
      The fact that the cookie counter is per netns is quite a limitation
      for BPF in practice, in particular for programs in the host namespace
      that use socket cookies as part of a map lookup key: they will see
      socket cookie collisions, e.g. when attached to BPF cgroup hooks
      or to cls_bpf on tc egress in the host namespace handling container
      traffic from veth or ipvlan devices with the peer in a different netns.
      Change the counter to be global instead.
      
      Socket cookie consumers must treat the value as opaque in any case.
      Not every socket must have a cookie generated, and knowledge of the
      counter value itself does not provide much value either way, hence
      the conversion to global is fine.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Martynas Pumputis <m@lambda.lt>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      devlink: remove pointless data_len arg from region snapshot create · 3a5e5234
      Jiri Pirko committed
      The size of the snapshot has to be the same as the size of the region,
      so there is no need to pass it again during snapshot creation. Remove
      the arg and use region->size instead.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 09 Aug, 2019 3 commits
    •
      net/tls: prevent skb_orphan() from leaking TLS plain text with offload · 41477662
      Jakub Kicinski committed
      sk_validate_xmit_skb() and drivers depend on the sk member of
      struct sk_buff to identify segments requiring encryption.
      Any operation which removes or does not preserve the original TLS
      socket such as skb_orphan() or skb_clone() will cause clear text
      leaks.
      
      Make the TCP socket underlying an offloaded TLS connection
      mark all skbs as decrypted, if TLS TX is in offload mode.
      Then in sk_validate_xmit_skb() catch skbs which have no socket
      (or a socket with no validation) and decrypted flag set.
      
      Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
      sk->sk_validate_xmit_skb are slightly interchangeable right now,
      they all imply TLS offload. The new checks are guarded by
      CONFIG_TLS_DEVICE because that's the option guarding the
      sk_buff->decrypted member.
      
      A second, smaller issue with orphaning is that it breaks
      the guarantee that packets will be delivered to device
      queues in-order. All TLS offload drivers depend on that
      scheduling property. This means skb_orphan_partial()'s
      trick of preserving partial socket references will cause
      issues in the drivers. We need a full orphan, and as a
      result netem delay/throttling will cause all TLS offload
      skbs to be dropped.
      
      Reusing the sk_buff->decrypted flag also protects from
      leaking clear text when incoming, decrypted skb is redirected
      (e.g. by TC).
      
      See commit 0608c69c ("bpf: sk_msg, sock{map|hash} redirect
      through ULP") for justification why the internal flag is safe.
      The only location which could leak the flag in is tcp_bpf_sendmsg(),
      which is taken care of by clearing the previously unused bit.
      
      v2:
       - remove superfluous decrypted mark copy (Willem);
       - remove the stale doc entry (Boris);
       - rely entirely on EOR marking to prevent coalescing (Boris);
       - use an internal sendpages flag instead of marking the socket
         (Boris).
      v3 (Willem):
       - reorganize the can_skb_orphan_partial() condition;
       - fix the flag leak-in through tcp_bpf_sendmsg.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      flow_offload: support get multi-subsystem block · 1150ab0f
      wenxu committed
      Provide a callback list for finding the blocks of the tc
      and nft subsystems.
      Signed-off-by: wenxu <wenxu@ucloud.cn>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      flow_offload: move tc indirect block to flow offload · 4e481908
      wenxu committed
      Move the tc indirect block to flow_offload and rename
      it to flow indirect block, so that nf_tables can use the
      indr block architecture.
      Signed-off-by: wenxu <wenxu@ucloud.cn>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>