1. 14 Sep, 2019 1 commit
  2. 11 Sep, 2019 1 commit
  3. 07 Sep, 2019 1 commit
  4. 06 Sep, 2019 1 commit
    • net: openvswitch: Set OvS recirc_id from tc chain index · 95a7233c
      Paul Blakey committed
      Offloaded OvS datapath rules are translated one to one to tc rules,
      for example the following simplified OvS rule:
      
      recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) actions:ct(),recirc(2)
      
      Will be translated to the following tc rule:
      
      $ tc filter add dev dev1 ingress \
      	    prio 1 chain 0 proto ip \
      		flower tcp ct_state -trk \
      		action ct pipe \
      		action goto chain 2
      
      Received packets will first travel through tc, and if they aren't stolen
      by it, like in the above rule, they will continue to the OvS datapath.
      Since we already did some actions (action ct in this case) which might
      modify the packets, and updated action stats, we would like to continue
      the processing with the correct recirc_id in OvS (here recirc_id(2))
      where we left off.
      
      To support this, introduce a new skb extension for tc, which
      will be used for translating tc chain to ovs recirc_id to
      handle these miss cases. Last tc chain index will be set
      by tc goto chain action and read by OvS datapath.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
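
The miss path this commit describes can be modelled in user space. The following is a minimal Python sketch, not kernel code: `Packet`, `tc_ext_chain`, `tc_goto_chain` and `ovs_recirc_id` are hypothetical names chosen for illustration. The only behaviour taken from the commit message is that the goto chain action records the last tc chain index in a per-packet extension and that the OvS datapath reads it back as the recirc_id.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    """Stand-in for an skb; tc_ext_chain models the new tc skb extension."""
    payload: bytes
    tc_ext_chain: Optional[int] = None  # None: no extension attached


def tc_goto_chain(pkt: Packet, chain: int) -> None:
    """Model of the tc 'goto chain' action: remember the last chain index."""
    pkt.tc_ext_chain = chain


def ovs_recirc_id(pkt: Packet) -> int:
    """Model of the OvS datapath picking its starting recirc_id on a tc miss.

    Assumption: a packet without the extension starts at recirc_id 0.
    """
    return pkt.tc_ext_chain if pkt.tc_ext_chain is not None else 0


# A packet that ran 'action ct pipe; action goto chain 2' and then missed in tc.
pkt = Packet(b"\x45\x00")
tc_goto_chain(pkt, 2)
print(ovs_recirc_id(pkt))
```

In this model the packet from the example rule resumes at recirc_id 2, which matches the `recirc(2)` action of the original OvS rule.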
  5. 01 Sep, 2019 3 commits
  6. 31 Aug, 2019 1 commit
  7. 28 Aug, 2019 1 commit
    • net: fix skb use after free in netpoll · 2c1644cf
      Feng Sun committed
      After commit baeababb
      ("tun: return NET_XMIT_DROP for dropped packets"),
      when tun_net_xmit drops a packet, it frees the skb and returns NET_XMIT_DROP.
      netpoll_send_skb_on_dev then runs into the following use-after-free cases:
      1. retry netpoll_start_xmit with the freed skb;
      2. queue the freed skb in npinfo->txq.
      queue_process also runs into a use-after-free case.
      
      The first netpoll_send_skb_on_dev case was hit with the following kernel log:
      
      [  117.864773] kernel BUG at mm/slub.c:306!
      [  117.864773] invalid opcode: 0000 [#1] SMP PTI
      [  117.864774] CPU: 3 PID: 2627 Comm: loop_printmsg Kdump: loaded Tainted: P           OE     5.3.0-050300rc5-generic #201908182231
      [  117.864775] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      [  117.864775] RIP: 0010:kmem_cache_free+0x28d/0x2b0
      [  117.864781] Call Trace:
      [  117.864781]  ? tun_net_xmit+0x21c/0x460
      [  117.864781]  kfree_skbmem+0x4e/0x60
      [  117.864782]  kfree_skb+0x3a/0xa0
      [  117.864782]  tun_net_xmit+0x21c/0x460
      [  117.864782]  netpoll_start_xmit+0x11d/0x1b0
      [  117.864788]  netpoll_send_skb_on_dev+0x1b8/0x200
      [  117.864789]  __br_forward+0x1b9/0x1e0 [bridge]
      [  117.864789]  ? skb_clone+0x53/0xd0
      [  117.864790]  ? __skb_clone+0x2e/0x120
      [  117.864790]  deliver_clone+0x37/0x50 [bridge]
      [  117.864790]  maybe_deliver+0x89/0xc0 [bridge]
      [  117.864791]  br_flood+0x6c/0x130 [bridge]
      [  117.864791]  br_dev_xmit+0x315/0x3c0 [bridge]
      [  117.864792]  netpoll_start_xmit+0x11d/0x1b0
      [  117.864792]  netpoll_send_skb_on_dev+0x1b8/0x200
      [  117.864792]  netpoll_send_udp+0x2c6/0x3e8
      [  117.864793]  write_msg+0xd9/0xf0 [netconsole]
      [  117.864793]  console_unlock+0x386/0x4e0
      [  117.864793]  vprintk_emit+0x17e/0x280
      [  117.864794]  vprintk_default+0x29/0x50
      [  117.864794]  vprintk_func+0x4c/0xbc
      [  117.864794]  printk+0x58/0x6f
      [  117.864795]  loop_fun+0x24/0x41 [printmsg_loop]
      [  117.864795]  kthread+0x104/0x140
      [  117.864795]  ? 0xffffffffc05b1000
      [  117.864796]  ? kthread_park+0x80/0x80
      [  117.864796]  ret_from_fork+0x35/0x40
      Signed-off-by: Feng Sun <loyou85@gmail.com>
      Signed-off-by: Xiaojun Zhao <xiaojunzhao141@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
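
The ownership rule behind both use-after-free cases can be shown with a small user-space model. This is a hedged Python sketch, not the actual patch: the commit message above does not spell out the fix, and `xmit_consumed` and `send_with_retry` are hypothetical helpers. The constant values are assumed to mirror include/linux/netdevice.h and should be checked against the tree.

```python
NETDEV_TX_OK = 0x00    # sent; the driver consumed the buffer
NET_XMIT_DROP = 0x01   # dropped; the driver already freed the buffer
NETDEV_TX_BUSY = 0x10  # not sent; the caller still owns the buffer
NET_XMIT_MASK = 0x0F   # assumed value, see lead-in


def xmit_consumed(rc: int) -> bool:
    """True when the driver took ownership of the buffer (sent, dropped or error)."""
    return rc < NET_XMIT_MASK


def send_with_retry(xmit, skb, tries: int = 3) -> str:
    """Retry only while the caller still owns skb; otherwise never touch it again."""
    for _ in range(tries):
        rc = xmit(skb)
        if xmit_consumed(rc):
            return "done"   # skb may already be freed: no retry, no queueing
    return "queued"         # still owned by the caller, safe to queue for later


calls = []


def dropping_xmit(skb):
    """Model of a driver that frees the buffer and reports a drop."""
    calls.append(skb)
    return NET_XMIT_DROP


print(send_with_retry(dropping_xmit, "skb0"), len(calls))
```

The point of the sketch is that a drop is treated like a completed transmit: the sender stops after the first call instead of retrying or queueing a buffer it no longer owns.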
  8. 25 Aug, 2019 2 commits
  9. 24 Aug, 2019 3 commits
  10. 20 Aug, 2019 2 commits
    • tcp: make sure EPOLLOUT wont be missed · ef8d8ccd
      Eric Dumazet committed
      As Jason Baron explained in commit 790ba456 ("tcp: set SOCK_NOSPACE
      under memory pressure"), it is crucial we properly set SOCK_NOSPACE
      when needed.
      
      However, Jason's patch had a bug, because the 'nonblocking' status
      as far as sk_stream_wait_memory() is concerned is governed
      by the MSG_DONTWAIT flag passed at sendmsg() time:
      
          long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
      
      So it is very possible that tcp sendmsg() calls sk_stream_wait_memory(),
      and that sk_stream_wait_memory() returns -EAGAIN with SOCK_NOSPACE
      cleared, if sk->sk_sndtimeo has been set to a small (but not zero)
      value.
      
      This patch removes the 'noblock' variable since we must always
      set SOCK_NOSPACE if -EAGAIN is returned.
      
      It also renames the do_nonblock label since we might reach this
      code path even if we were in blocking mode.
      
      Fixes: 790ba456 ("tcp: set SOCK_NOSPACE under memory pressure")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Reported-by: Vladimir Rutsky <rutsky@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Jason Baron <jbaron@akamai.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
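
The failure mode can be reproduced with a toy model of the wait loop. This Python sketch is an illustration only: `Sock`, `wait_memory_old` and `wait_memory_new` are hypothetical stand-ins, timeouts are counted in abstract ticks, and memory is assumed never to become available. It contrasts a 'noblock' decision taken once on entry with always setting SOCK_NOSPACE before returning -EAGAIN.

```python
import errno


class Sock:
    """Tiny stand-in for struct sock."""

    def __init__(self, sndtimeo: int):
        self.sndtimeo = sndtimeo  # models sk->sk_sndtimeo, in ticks
        self.nospace = False      # models the SOCK_NOSPACE bit


def sock_sndtimeo(sk: Sock, noblock: bool) -> int:
    """Model of the timeout selection quoted in the commit message."""
    return 0 if noblock else sk.sndtimeo


def wait_memory_old(sk: Sock, timeo: int) -> int:
    noblock = timeo == 0          # decided once, on entry
    while True:
        if timeo == 0:
            if noblock:
                sk.nospace = True
            return -errno.EAGAIN
        timeo -= 1                # one tick passes, no memory is freed


def wait_memory_new(sk: Sock, timeo: int) -> int:
    while True:
        if timeo == 0:
            sk.nospace = True     # always set before returning -EAGAIN
            return -errno.EAGAIN
        timeo -= 1


# Blocking send (no MSG_DONTWAIT) on a socket with a small send timeout.
old, new = Sock(sndtimeo=2), Sock(sndtimeo=2)
wait_memory_old(old, sock_sndtimeo(old, noblock=False))
wait_memory_new(new, sock_sndtimeo(new, noblock=False))
print(old.nospace, new.nospace)
```

In the old model the timeout expires with the flag still cleared, so a later EPOLLOUT wakeup could be missed; in the new model the flag is set on every -EAGAIN return.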
    • net: flow_offload: convert block_ing_cb_list to regular list type · 607f625b
      Vlad Buslov committed
      RCU list block_ing_cb_list is protected by the RCU read lock in
      flow_block_ing_cmd() and by the flow_indr_block_ing_cb_lock mutex in all
      functions that use it. However, flow_block_ing_cmd() needs to call blocking
      functions while iterating block_ing_cb_list, which leads to the following
      suspicious RCU usage warning:
      
      [  401.510948] =============================
      [  401.510952] WARNING: suspicious RCU usage
      [  401.510993] 5.3.0-rc3+ #589 Not tainted
      [  401.510996] -----------------------------
      [  401.511001] include/linux/rcupdate.h:265 Illegal context switch in RCU read-side critical section!
      [  401.511004]
                     other info that might help us debug this:
      
      [  401.511008]
                     rcu_scheduler_active = 2, debug_locks = 1
      [  401.511012] 7 locks held by test-ecmp-add-v/7576:
      [  401.511015]  #0: 00000000081d71a5 (sb_writers#4){.+.+}, at: vfs_write+0x166/0x1d0
      [  401.511037]  #1: 000000002bd338c3 (&of->mutex){+.+.}, at: kernfs_fop_write+0xef/0x1b0
      [  401.511051]  #2: 00000000c921c634 (kn->count#317){.+.+}, at: kernfs_fop_write+0xf7/0x1b0
      [  401.511062]  #3: 00000000a19cdd56 (&dev->mutex){....}, at: sriov_numvfs_store+0x6b/0x130
      [  401.511079]  #4: 000000005425fa52 (pernet_ops_rwsem){++++}, at: unregister_netdevice_notifier+0x30/0x140
      [  401.511092]  #5: 00000000c5822793 (rtnl_mutex){+.+.}, at: unregister_netdevice_notifier+0x35/0x140
      [  401.511101]  #6: 00000000c2f3507e (rcu_read_lock){....}, at: flow_block_ing_cmd+0x5/0x130
      [  401.511115]
                     stack backtrace:
      [  401.511121] CPU: 21 PID: 7576 Comm: test-ecmp-add-v Not tainted 5.3.0-rc3+ #589
      [  401.511124] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
      [  401.511127] Call Trace:
      [  401.511138]  dump_stack+0x85/0xc0
      [  401.511146]  ___might_sleep+0x100/0x180
      [  401.511154]  __mutex_lock+0x5b/0x960
      [  401.511162]  ? find_held_lock+0x2b/0x80
      [  401.511173]  ? __tcf_get_next_chain+0x1d/0xb0
      [  401.511179]  ? mark_held_locks+0x49/0x70
      [  401.511194]  ? __tcf_get_next_chain+0x1d/0xb0
      [  401.511198]  __tcf_get_next_chain+0x1d/0xb0
      [  401.511251]  ? uplink_rep_async_event+0x70/0x70 [mlx5_core]
      [  401.511261]  tcf_block_playback_offloads+0x39/0x160
      [  401.511276]  tcf_block_setup+0x1b0/0x240
      [  401.511312]  ? mlx5e_rep_indr_setup_tc_cb+0xca/0x290 [mlx5_core]
      [  401.511347]  ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
      [  401.511359]  tc_indr_block_get_and_ing_cmd+0x11b/0x1e0
      [  401.511404]  ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
      [  401.511414]  flow_block_ing_cmd+0x7e/0x130
      [  401.511453]  ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
      [  401.511462]  __flow_indr_block_cb_unregister+0x7f/0xf0
      [  401.511502]  mlx5e_nic_rep_netdevice_event+0x75/0xb0 [mlx5_core]
      [  401.511513]  unregister_netdevice_notifier+0xe9/0x140
      [  401.511554]  mlx5e_cleanup_rep_tx+0x6f/0xe0 [mlx5_core]
      [  401.511597]  mlx5e_detach_netdev+0x4b/0x60 [mlx5_core]
      [  401.511637]  mlx5e_vport_rep_unload+0x71/0xc0 [mlx5_core]
      [  401.511679]  esw_offloads_disable+0x5b/0x90 [mlx5_core]
      [  401.511724]  mlx5_eswitch_disable.cold+0xdf/0x176 [mlx5_core]
      [  401.511759]  mlx5_device_disable_sriov+0xab/0xb0 [mlx5_core]
      [  401.511794]  mlx5_core_sriov_configure+0xaf/0xd0 [mlx5_core]
      [  401.511805]  sriov_numvfs_store+0xf8/0x130
      [  401.511817]  kernfs_fop_write+0x122/0x1b0
      [  401.511826]  vfs_write+0xdb/0x1d0
      [  401.511835]  ksys_write+0x65/0xe0
      [  401.511847]  do_syscall_64+0x5c/0xb0
      [  401.511857]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  401.511862] RIP: 0033:0x7fad892d30f8
      [  401.511868] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 96 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 60 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
      [  401.511871] RSP: 002b:00007ffca2a9fad8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  401.511875] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fad892d30f8
      [  401.511878] RDX: 0000000000000002 RSI: 000055afeb072a90 RDI: 0000000000000001
      [  401.511881] RBP: 000055afeb072a90 R08: 00000000ffffffff R09: 000000000000000a
      [  401.511884] R10: 000055afeb058710 R11: 0000000000000246 R12: 0000000000000002
      [  401.511887] R13: 00007fad893a8780 R14: 0000000000000002 R15: 00007fad893a3740
      
      To fix the described incorrect RCU usage, convert block_ing_cb_list from
      RCU list to regular list and protect it with flow_indr_block_ing_cb_lock
      mutex in flow_block_ing_cmd().
      
      Fixes: 1150ab0f ("flow_offload: support get multi-subsystem block")
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
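
The locking scheme after this change can be sketched in user space. Python has no RCU, so this is only an analogy under stated assumptions: `_cb_list`, `_cb_lock`, `register_cb`, `unregister_cb` and `block_ing_cmd` are hypothetical names, and a `threading.Lock` stands in for the flow_indr_block_ing_cb_lock mutex. The idea it illustrates is that a plain list guarded by a sleeping lock may be iterated while the callbacks block.

```python
import threading
import time

_cb_list = []                   # plain list: no lock-free readers
_cb_lock = threading.Lock()     # stands in for flow_indr_block_ing_cb_lock


def register_cb(cb) -> None:
    with _cb_lock:
        _cb_list.append(cb)


def unregister_cb(cb) -> None:
    with _cb_lock:
        _cb_list.remove(cb)


def block_ing_cmd(command: str) -> list:
    """Run every callback under the mutex; callbacks are allowed to block."""
    with _cb_lock:
        return [cb(command) for cb in _cb_list]


def slow_cb(command: str) -> str:
    time.sleep(0.01)            # blocking is acceptable while holding a mutex
    return f"handled {command}"


register_cb(slow_cb)
print(block_ing_cmd("bind"))
```

One limitation of this sketch: the lock is not reentrant, so a callback must not register or unregister callbacks itself.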
  11. 18 Aug, 2019 11 commits
  12. 16 Aug, 2019 2 commits
  13. 14 Aug, 2019 2 commits
  14. 12 Aug, 2019 9 commits
    • drop_monitor: Expose tail drop counter · e9feb580
      Ido Schimmel committed
      Previous patch made the length of the per-CPU skb drop list
      configurable. Expose a counter that shows how many packets could not be
      enqueued to this list.
      
      This allows users to determine the desired queue length.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
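
A bounded queue with a tail drop counter can be modelled in a few lines. This Python sketch is a user-space illustration, not the drop_monitor implementation: `DropQueue`, `max_len` and `tail_drops` are hypothetical names, and only the default of 1000 comes from the commit messages in this series.

```python
from collections import deque


class DropQueue:
    """Model of one CPU's skb drop list with a configurable length."""

    def __init__(self, max_len: int = 1000):
        self.max_len = max_len
        self.items = deque()
        self.tail_drops = 0     # packets that could not be enqueued

    def enqueue(self, skb) -> bool:
        if len(self.items) >= self.max_len:
            self.tail_drops += 1
            return False
        self.items.append(skb)
        return True


q = DropQueue(max_len=2)
for i in range(5):
    q.enqueue(f"skb{i}")
print(len(q.items), q.tail_drops)
```

A counter that keeps growing under load suggests the configured length is too small for the drop rate; a counter that stays at zero suggests the length could be reduced.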
    • drop_monitor: Make drop queue length configurable · 30328d46
      Ido Schimmel committed
      In packet alert mode, each CPU holds a list of dropped skbs that need to
      be processed in process context and sent to user space. To avoid
      exhausting the system's memory the maximum length of this queue is
      currently set to 1000.
      
      Allow users to tune the length of this queue according to their needs.
      The configured length is reported to user space when drop monitor
      configuration is queried.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drop_monitor: Add a command to query current configuration · 444be061
      Ido Schimmel committed
      Users should be able to query the current configuration of drop monitor
      before they start using it. Add a command to query the existing
      configuration which currently consists of alert mode and packet
      truncation length.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drop_monitor: Allow truncation of dropped packets · 57986617
      Ido Schimmel committed
      When sending dropped packets to user space it is not always necessary to
      copy the entire packet as usually only the headers are of interest.
      
      Allow the user to specify the truncation length and add the original length
      of the packet as additional metadata to the netlink message.
      
      By default no truncation is performed.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
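
Truncation with the original length kept as metadata can be illustrated as follows. This is a hedged Python sketch: `build_alert`, `trunc_len` and the dictionary keys are hypothetical, and treating a truncation length of 0 as "no truncation" is an assumption made here to model the stated default.

```python
def build_alert(packet: bytes, trunc_len: int = 0) -> dict:
    """Return a message-like dict with the (possibly truncated) payload.

    Assumption: trunc_len == 0 means no truncation is performed.
    """
    payload = packet if trunc_len == 0 else packet[:trunc_len]
    return {"orig_len": len(packet), "payload": payload}


frame = bytes(range(64))
msg = build_alert(frame, trunc_len=14)   # keep only an Ethernet-sized header
print(msg["orig_len"], len(msg["payload"]))
```

Carrying `orig_len` alongside the payload lets a consumer tell a short packet apart from a truncated one.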
    • drop_monitor: Add packet alert mode · ca30707d
      Ido Schimmel committed
      So far drop monitor supported only one alert mode in which a summary of
      locations in which packets were recently dropped was sent to user space.
      
      This alert mode is sufficient in order to understand that packets were
      dropped, but lacks information to perform a more detailed analysis.
      
      Add a new alert mode in which the dropped packet itself is passed to
      user space along with metadata: The drop location (as program counter
      and resolved symbol), ingress netdevice and drop timestamp. More
      metadata can be added in the future.
      
      To avoid performing expensive operations in the context in which
      kfree_skb() is invoked (can be hard IRQ), the dropped skb is cloned and
      queued on the per-CPU skb drop list. Then, in process context, the netlink
      message is allocated, prepared and finally sent to user space.
      
      The per-CPU skb drop list is limited to 1000 skbs to prevent exhausting
      the system's memory. Subsequent patches will make this limit
      configurable and also add a counter that indicates how many skbs were
      tail dropped.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
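
The split between a cheap hot path and deferred message building can be sketched like this. It is a user-space Python model under stated assumptions: `DroppedPacket`, `on_drop` and `work_item` are hypothetical names, a single deque stands in for one CPU's drop list, and a byte copy stands in for cloning the skb. The metadata fields follow the list in the commit message.

```python
import time
from collections import deque
from dataclasses import dataclass

MAX_QUEUE_LEN = 1000      # limit quoted in the commit message


@dataclass
class DroppedPacket:
    payload: bytes
    pc: int               # drop location as a program counter
    in_dev: str           # ingress netdevice name
    timestamp: float      # drop timestamp


drop_list = deque()       # models one CPU's skb drop list


def on_drop(payload: bytes, pc: int, in_dev: str) -> None:
    """Hot path: copy and queue only; no message is built here."""
    if len(drop_list) < MAX_QUEUE_LEN:
        drop_list.append(DroppedPacket(bytes(payload), pc, in_dev, time.time()))


def work_item() -> list:
    """Deferred path: build one message per queued packet and drain the list."""
    messages = []
    while drop_list:
        d = drop_list.popleft()
        messages.append({"pc": hex(d.pc), "in_dev": d.in_dev,
                         "timestamp": d.timestamp, "payload": d.payload})
    return messages


on_drop(b"\x45\x00\x00\x14", 0xFFFFFFFF81000000, "eth0")
msgs = work_item()
print(msgs[0]["pc"], msgs[0]["in_dev"], len(drop_list))
```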
    • drop_monitor: Add alert mode operations · 28315f79
      Ido Schimmel committed
      The next patch is going to add another alert mode in which the dropped
      packet is notified to user space, instead of only a summary of recent
      drops.
      
      Abstract the differences between the modes by adding alert mode
      operations. The operations are selected based on the currently
      configured mode and associated with the probes and the work item just
      before tracing starts.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
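
Selecting a set of operations by mode is a dispatch-table pattern, which can be sketched as follows. This Python model is illustrative only: `AlertOps`, `ALERT_OPS`, `start_tracing` and the mode names `"summary"` and `"packet"` are hypothetical, and the returned strings merely stand in for the real probe and work functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class AlertOps:
    on_drop: Callable[[bytes], str]   # probe called when a packet is dropped
    work: Callable[[], str]           # function run by the work item


def summary_on_drop(pkt: bytes) -> str:
    return "summary: record drop location"


def summary_work() -> str:
    return "summary: send recent drop locations"


def packet_on_drop(pkt: bytes) -> str:
    return f"packet: queue {len(pkt)} bytes"


def packet_work() -> str:
    return "packet: send queued packets"


ALERT_OPS: Dict[str, AlertOps] = {
    "summary": AlertOps(summary_on_drop, summary_work),
    "packet": AlertOps(packet_on_drop, packet_work),
}


def start_tracing(mode: str) -> AlertOps:
    """Pick the operations for the configured mode just before tracing starts."""
    return ALERT_OPS[mode]


ops = start_tracing("summary")
print(ops.on_drop(b"abc"), "|", ops.work())
```

With this shape, adding a mode means adding one table entry; the code that attaches probes and the work item does not need to know which mode is active.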
    • drop_monitor: Require CAP_NET_ADMIN for drop monitor configuration · c5ab9b1c
      Ido Schimmel committed
      Currently, the configure command does not do anything but return an
      error. Subsequent patches will enable the command to change various
      configuration options such as alert mode and packet truncation.
      
      Similar to other netlink-based configuration channels, make sure only
      users with the CAP_NET_ADMIN capability set can execute this command.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drop_monitor: Reset per-CPU data before starting to trace · 44075f56
      Ido Schimmel committed
      The function reset_per_cpu_data() allocates and prepares a new skb for
      the summary netlink alert message ('NET_DM_CMD_ALERT'). The new skb is
      stored in the per-CPU 'data' variable and the old one is returned.
      
      The function is invoked during module initialization and from the
      workqueue, before an alert is sent. This means that it is possible to
      receive an alert with stale data, if we stopped tracing when the
      hysteresis timer ('data->send_timer') was pending.
      
      Instead of invoking the function during module initialization, invoke it
      just before we start tracing and ensure we get a fresh skb.
      
      This also allows us to remove the calls to initialize the timer and the
      work item from the module initialization path, since both could have
      been triggered by the error paths of reset_per_cpu_data().
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drop_monitor: Initialize timer and work item upon tracing enable · 70c69274
      Ido Schimmel committed
      The timer and work item are currently initialized once during module
      init, but subsequent patches will need to associate different functions
      with the work item, based on the configured alert mode.
      
      Allow subsequent patches to make that change by initializing and
      de-initializing these objects during tracing enable and disable.
      
      This also guarantees that once the request to disable tracing returns,
      no more netlink notifications will be generated.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>