1. 25 May, 2019 6 commits
    • D
      ipv6: Refactor ip6_route_del for cached routes · 0fa6efc5
      David Ahern committed
      Move the removal of cached routes to a helper, ip6_del_cached_rt, that
      can be invoked per nexthop. Rename the existing ip6_del_cached_rt to
      __ip6_del_cached_rt since it is called by ip6_del_cached_rt.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv6: Make fib6_nh optional at the end of fib6_info · 1cf844c7
      David Ahern committed
      Move fib6_nh to the end of fib6_info and make it an array of
      size 0. Pass a flag to fib6_info_alloc indicating if the
      allocation needs to add space for a fib6_nh.
      
      The current code path always has a fib6_nh allocated with a
      fib6_info; with nexthop objects they will be separate.
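
The optional trailing nexthop can be pictured with a small userspace sketch. It is not the kernel code: `demo_info`, `demo_nh`, `demo_info_alloc` and `demo_info_nh` are hypothetical names, the `has_nh` field is an assumption added so the example can tell the two cases apart, and a C99 flexible array member stands in for the size-0 array the commit describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for fib6_nh and fib6_info; not the kernel layouts. */
struct demo_nh {
    int oif;
    unsigned int flags;
};

struct demo_info {
    int metric;
    bool has_nh;            /* assumption: remembers whether trailing space exists */
    struct demo_nh nh[];    /* flexible array member, must be the last field */
};

/* Allocate an entry, adding room for one nexthop only when the flag asks for it. */
static struct demo_info *demo_info_alloc(bool with_nh)
{
    size_t sz = sizeof(struct demo_info);
    struct demo_info *f;

    if (with_nh)
        sz += sizeof(struct demo_nh);

    f = calloc(1, sz);
    if (f)
        f->has_nh = with_nh;
    return f;
}

/* Return the embedded nexthop, or NULL when none was allocated. */
static struct demo_nh *demo_info_nh(struct demo_info *f)
{
    assert(f != NULL);
    return f->has_nh ? &f->nh[0] : NULL;
}
```

Calling `demo_info_alloc(true)` mirrors the current code path, where every entry carries its own nexthop; `demo_info_alloc(false)` mirrors the case where the nexthop lives in a separate object.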
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv6: Move exception bucket to fib6_nh · cc5c073a
      David Ahern committed
      Similar to the pcpu routes, exceptions are really per nexthop, so move
      rt6i_exception_bucket from fib6_info to fib6_nh.
      
      To avoid additional increases to the size of fib6_nh for a 1-bit flag,
      use the lowest bit in the allocated memory pointer for the flushed flag.
      Add helpers for retrieving the bucket pointer to mask off the flag.
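
A minimal userspace sketch of the low-bit technique follows. The names (`demo_nh`, `demo_bucket`, `DEMO_FLUSHED` and the helpers) are hypothetical and do not match the kernel's; the sketch assumes, as `malloc`-family allocators guarantee for this type, that the allocated pointer is at least 2-byte aligned so bit 0 is free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_FLUSHED ((uintptr_t)1)   /* hypothetical flag kept in bit 0 */

struct demo_bucket { int depth; };

struct demo_nh {
    struct demo_bucket *bucket;       /* bit 0 may carry DEMO_FLUSHED */
};

/* Return the real bucket pointer with the flag bit masked off. */
static struct demo_bucket *demo_bucket_get(const struct demo_nh *nh)
{
    uintptr_t p = (uintptr_t)nh->bucket;

    return (struct demo_bucket *)(p & ~DEMO_FLUSHED);
}

/* Report whether the flushed flag is set on the stored pointer. */
static bool demo_bucket_flushed(const struct demo_nh *nh)
{
    return ((uintptr_t)nh->bucket & DEMO_FLUSHED) != 0;
}

/* Set the flushed flag without disturbing the pointer bits. */
static void demo_bucket_set_flushed(struct demo_nh *nh)
{
    nh->bucket = (struct demo_bucket *)((uintptr_t)nh->bucket | DEMO_FLUSHED);
}

/* Allocate a bucket; the low bit must be clear for the scheme to work. */
static int demo_bucket_alloc(struct demo_nh *nh)
{
    struct demo_bucket *b = calloc(1, sizeof(*b));

    if (!b)
        return -1;
    assert(((uintptr_t)b & DEMO_FLUSHED) == 0);
    nh->bucket = b;
    return 0;
}
```

Every reader has to go through `demo_bucket_get` rather than dereferencing the field directly, which is why the commit adds accessor helpers alongside the flag.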
      
      The cleanup of the exception bucket is moved to fib6_nh_release.
      
      fib6_nh_flush_exceptions can now be called from 2 contexts:
      1. deleting a fib entry
      2. deleting a fib6_nh
      
      For 1., fib6_nh_flush_exceptions is called for a specific fib6_info that
      is getting deleted. All exceptions in the cache using the entry are
      deleted. For 2., the fib6_nh itself is getting destroyed, so
      fib6_nh_flush_exceptions is called for a NULL fib6_info which means
      flush all entries.
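
The two calling contexts can be sketched with a flat table in place of the kernel's hashed exception bucket. Everything here (`demo_info`, `demo_exception`, `demo_flush_exceptions`) is a hypothetical illustration of the "NULL means flush everything" convention, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

struct demo_info { int id; };          /* stand-in for a fib entry */

struct demo_exception {
    const struct demo_info *from;      /* fib entry the exception was created from */
    int in_use;
};

/*
 * Remove exceptions that were created from `from`.
 * A NULL `from` removes every entry, as when the nexthop itself goes away.
 * Returns the number of entries removed.
 */
static size_t demo_flush_exceptions(struct demo_exception *tbl, size_t n,
                                    const struct demo_info *from)
{
    size_t removed = 0;

    assert(tbl != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!tbl[i].in_use)
            continue;
        if (from && tbl[i].from != from)
            continue;                  /* belongs to a different fib entry */
        tbl[i].in_use = 0;
        removed++;
    }
    return removed;
}
```

Passing a specific entry corresponds to case 1 (deleting a fib entry); passing NULL corresponds to case 2 (deleting the fib6_nh).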
      
      The pmtu.sh selftest exercises the affected code paths - from creating
      exceptions to cleaning them up on device delete. All tests pass without
      any rcu locking or memleak warnings.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv6: Refactor exception functions · c0b220cf
      David Ahern committed
      Before moving exception bucket from fib6_info to fib6_nh, refactor
      rt6_flush_exceptions, rt6_remove_exception_rt, rt6_mtu_change_route,
      and rt6_update_exception_stamp_rt. In all 4 cases, move the primary
      logic into a new helper that starts with fib6_nh_. The latter 3
      functions still take a fib6_info; this will be changed to fib6_nh
      in the next patch.
      
      In the case of rt6_mtu_change_route, move the fib6_metric_locked
      out as a standalone check - no need to call the new function if
      the fib entry has the mtu locked. Also, add fib6_info to
      rt6_mtu_change_arg as a way of passing the fib entry to the new
      helper.
      
      No functional change intended. The goal here is to make the next
      patch easier to review by moving existing lookup logic for each to
      new helpers.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv6: Refactor fib6_drop_pcpu_from · 7d88d8b5
      David Ahern committed
      Move the existing pcpu walk in fib6_drop_pcpu_from to a new
      helper, __fib6_drop_pcpu_from, that can be invoked per fib6_nh with a
      reference to the from entries that need to be evicted. If the passed
      in 'from' is non-NULL then only entries associated with that fib6_info
      are removed (e.g., case where fib entry is deleted); if the 'from' is
      NULL, all entries are flushed (e.g., fib6_nh is deleted).
      
      For fib6_info entries with builtin fib6_nh (i.e., current code) there
      is no change in behavior.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv6: Move pcpu cached routes to fib6_nh · f40b6ae2
      David Ahern committed
      rt6_info are specific instances of a fib entry and are tied to a
      device and gateway - i.e., a nexthop. Before nexthop objects, IPv6 fib
      entries have separate fib6_info for each nexthop in a multipath route,
      so the location of the pcpu cache in the fib6_info struct worked.
      However, with nexthop objects a fib6_info can point to a set of nexthops
      (yet another alignment of ipv6 with ipv4). Accordingly, the pcpu
      cache needs to be moved to the fib6_nh struct so the cached entries
      are local to the nexthop specification used to create the rt6_info.
      
      Initialization and free of the pcpu entries moved to fib6_nh_init and
      fib6_nh_release.
      
      Change in location only, from fib6_info down to fib6_nh; no other
      functional change intended.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 24 May, 2019 1 commit
  3. 23 May, 2019 13 commits
    • S
      hv_sock: perf: loop in send() to maximize bandwidth · 14a1eaa8
      Sunil Muthuswamy committed
      Currently, the hv_sock send() iterates once over the buffer, puts data into
      the VMBUS channel and returns. It doesn't maximize on the case when there
      is a simultaneous reader draining data from the channel. In such a case,
      the send() can maximize the bandwidth (and consequently minimize the cpu
      cycles) by iterating until the channel is found to be full.
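
The difference between a single pass and looping until the channel is full can be sketched with a toy in-memory ring. This is an illustration under stated assumptions, not the hv_sock code: `demo_channel`, `demo_send_once` and `demo_send_loop` are hypothetical names, and the ring never drains here, whereas the real benefit appears when a reader frees space concurrently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a VMBUS channel ring; the size is arbitrary. */
struct demo_channel {
    unsigned char ring[64];
    size_t used;
};

static size_t demo_channel_space(const struct demo_channel *ch)
{
    return sizeof(ch->ring) - ch->used;
}

/* Copy at most one packet's worth of data into the ring; return bytes written. */
static size_t demo_send_once(struct demo_channel *ch, const unsigned char *buf,
                             size_t len, size_t max_pkt)
{
    size_t n = len;
    size_t space = demo_channel_space(ch);

    if (n > max_pkt)
        n = max_pkt;
    if (n > space)
        n = space;
    memcpy(ch->ring + ch->used, buf, n);
    ch->used += n;
    return n;
}

/* Keep sending until the buffer is consumed or the channel reports no space. */
static size_t demo_send_loop(struct demo_channel *ch, const unsigned char *buf,
                             size_t len, size_t max_pkt)
{
    size_t total = 0;

    assert(max_pkt > 0);
    while (total < len) {
        size_t n = demo_send_once(ch, buf + total, len - total, max_pkt);

        if (n == 0)
            break;                     /* channel is full */
        total += n;
    }
    return total;
}
```

With a 16-byte packet limit, one call to `demo_send_once` moves 16 bytes, while `demo_send_loop` keeps going until the 64-byte ring is full, so the caller crosses into the send path fewer times for the same amount of data.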
      
      Perf data:
      Total Data Transfer: 10GB/iteration
      Single threaded reader/writer, Linux hvsocket writer with Windows hvsocket
      reader
      Packet size: 64KB
      CPU sys time was captured using the 'time' command for the writer to send
      10GB of data.
      'Send Buffer Loop' is with the patch applied.
      The values below are over 10 iterations.
      
      |--------------------------------------------------------|
      |        |        Current        |   Send Buffer Loop    |
      |--------------------------------------------------------|
      |        | Throughput | CPU sys  | Throughput | CPU sys  |
      |        | (MB/s)     | time (s) | (MB/s)     | time (s) |
      |--------------------------------------------------------|
      | Min    |     407    |   7.048  |    401     |  5.958   |
      |--------------------------------------------------------|
      | Max    |     455    |   7.563  |    542     |  6.993   |
      |--------------------------------------------------------|
      | Avg    |     440    |   7.411  |    451     |  6.639   |
      |--------------------------------------------------------|
      | Median |     446    |   7.417  |    447     |  6.761   |
      |--------------------------------------------------------|
      
      Observation:
      1. The avg throughput doesn't really change much with this change for this
      scenario. This is most probably because the bottleneck on throughput is
      somewhere else.
      2. The average system (or kernel) cpu time goes down by 10%+ with this
      change, for the same amount of data transfer.
      Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
      Reviewed-by: Dexuan Cui <decui@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      hv_sock: perf: Allow the socket buffer size options to influence the actual socket buffers · ac383f58
      Sunil Muthuswamy committed
      Currently, the hv_sock buffer size is static and can't scale to the
      bandwidth requirements of the application. This change allows the
      applications to influence the socket buffer sizes using the SO_SNDBUF and
      the SO_RCVBUF socket options.
      
      A few interesting points to note:
      1. Since the VMBUS does not allow a resize operation of the ring size, the
      socket buffer size option should be set prior to establishing the
      connection for it to take effect.
      2. Setting the socket option comes with the cost of that much memory being
      reserved/allocated by the kernel, for the lifetime of the connection.
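
Point 1 above implies an ordering in application code: set the option on the freshly created socket, then connect. A minimal sketch using only the standard sockets API follows; `demo_socket_with_sndbuf` and `demo_get_sndbuf` are hypothetical helper names, and the address family is a parameter (an hv_sock user would pass AF_VSOCK, any other family works for experimentation).

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a stream socket and request a send buffer size before connecting.
 * Returns the file descriptor, or -1 on error. The caller connects afterwards.
 */
static int demo_socket_with_sndbuf(int family, int requested)
{
    int fd;

    assert(requested > 0);
    fd = socket(family, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &requested, sizeof(requested)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Read back the size the kernel actually recorded, or -1 on error. */
static int demo_get_sndbuf(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len) < 0)
        return -1;
    return val;
}
```

The value read back may differ from the one requested, since the kernel is free to adjust it, so the sketch reports it rather than assuming it matches.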
      
      Perf data:
      Total Data Transfer: 1GB
      Single threaded reader/writer
      Results below are summarized over 10 iterations.
      
      Linux hvsocket writer + Windows hvsocket reader:
      |---------------------------------------------------------------------------------------------|
      |Packet size ->   |      128B       |       1KB       |       4KB       |        64KB         |
      |---------------------------------------------------------------------------------------------|
      |SO_SNDBUF size | |                 Throughput in MB/s (min/max/avg/median):                  |
      |               v |                                                                           |
      |---------------------------------------------------------------------------------------------|
      |      Default    | 109/118/114/116 | 636/774/701/700 | 435/507/480/476 |   410/491/462/470   |
      |      16KB       | 110/116/112/111 | 575/705/662/671 | 749/900/854/869 |   592/824/692/676   |
      |      32KB       | 108/120/115/115 | 703/823/767/772 | 718/878/850/866 | 1593/2124/2000/2085 |
      |      64KB       | 108/119/114/114 | 592/732/683/688 | 805/934/903/911 | 1784/1943/1862/1843 |
      |---------------------------------------------------------------------------------------------|
      
      Windows hvsocket writer + Linux hvsocket reader:
      |---------------------------------------------------------------------------------------------|
      |Packet size ->   |     128B    |      1KB        |          4KB        |        64KB         |
      |---------------------------------------------------------------------------------------------|
      |SO_RCVBUF size | |               Throughput in MB/s (min/max/avg/median):                    |
      |               v |                                                                           |
      |---------------------------------------------------------------------------------------------|
      |      Default    | 69/82/75/73 | 313/343/333/336 |   418/477/446/445   |   659/701/676/678   |
      |      16KB       | 69/83/76/77 | 350/401/375/382 |   506/548/517/516   |   602/624/615/615   |
      |      32KB       | 62/83/73/73 | 471/529/496/494 |   830/1046/935/939  | 944/1180/1070/1100  |
      |      64KB       | 64/70/68/69 | 467/533/501/497 | 1260/1590/1430/1431 | 1605/1819/1670/1660 |
      |---------------------------------------------------------------------------------------------|
      Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
      Reviewed-by: Dexuan Cui <decui@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      neighbor: Add tracepoint to __neigh_create · fc651001
      David Ahern committed
      Add tracepoint to __neigh_create to enable debugging of new entries.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: Set strict_start_type for routes and rules · 75425657
      David Ahern committed
      New userspace on an older kernel can send unknown and unsupported
      attributes, resulting in an incomplete config which is almost
      always wrong for routing (few exceptions are passthrough settings
      like the protocol that installed the route).
      
      Set strict_start_type in the policies for IPv4 and IPv6 routes and
      rules to detect new, unsupported attributes and fail the route add.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv4: Rename and export nh_update_mtu · 06c77c3e
      David Ahern committed
      Rename nh_update_mtu to fib_nhc_update_mtu and export for use by the
      nexthop code.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv4: export fib_info_update_nh_saddr · c3669486
      David Ahern committed
      Add scope as input argument versus relying on fib_info reference in
      fib_nh, and export fib_info_update_nh_saddr.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv4: export fib_flush · 9bd83667
      David Ahern committed
      As nexthops are deleted, fib entries referencing them are marked dead.
      Export fib_flush so those entries can be removed in a timely manner.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv4: export fib_check_nh · ac1fab2d
      David Ahern committed
      Change fib_check_nh to take net, table and scope as input arguments
      over struct fib_config and export for use by nexthop code.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv4: Add function to send route updates · 1bff1a0c
      David Ahern committed
      Add fib_info_notify_update to walk the fib and send RTM_NEWROUTE
      notifications with NLM_F_REPLACE set for entries linked to a fib_info
      that have nh_updated flag set. This helper will be used by the nexthop
      code to notify userspace of routes that are impacted when a nexthop
      config is updated via replace. The new function and its helper are
      similar to how fib_flush and fib_table_flush work for address delete
      and link down events.
      
      This notification is needed for legacy apps that do not understand
      the new nexthop object. Apps that are nexthop aware can use the
      RTA_NH_ID attribute in the route notification to just ignore it.
      
      In the future this should be wrapped in a sysctl to allow OSes that
      are fully updated to avoid the notification storm.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv6: export function to send route updates · 19a3b7ee
      David Ahern committed
      Add fib6_rt_update to send RTM_NEWROUTE with NLM_F_REPLACE set. This
      helper will be used by the nexthop code to notify userspace of routes
      that are impacted when a nexthop config is updated via replace.
      
      This notification is needed for legacy apps that do not understand
      the new nexthop object. Apps that are nexthop aware can use the
      RTA_NH_ID attribute in the route notification to just ignore it.
      
      In the future this should be wrapped in a sysctl to allow OSes that
      are fully updated to avoid the notification storm.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv6: Add hook to bump sernum for a route to stubs · cdaa16a4
      David Ahern committed
      Add hook to ipv6 stub to bump the sernum up to the root node for a
      route. This is needed by the nexthop code when a nexthop config changes.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv6: Add delete route hook to stubs · 68a9b13d
      David Ahern committed
      Add ip6_del_rt to the IPv6 stub. The hook is needed by the nexthop
      code to remove entries linked to a nexthop that is getting deleted.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • F
      net: Add UNIX_DIAG_UID to Netlink UNIX socket diagnostics. · cae9910e
      Felipe Gasper committed
      This adds the ability for Netlink to report a socket's UID along with the
      other UNIX diagnostic information that is already available. This will
      allow diagnostic tools greater insight into which users control which
      socket.
      
      To test this, do the following as a non-root user:
      
          unshare -U -r bash
          nc -l -U user.socket.$$ &
      
      .. and verify from within that same session that Netlink UNIX socket
      diagnostics report the socket's UID as 0. Also verify that Netlink UNIX
      socket diagnostics report the socket's UID as the user's UID from an
      unprivileged process in a different session. Verify the same from
      a root process.
      Signed-off-by: Felipe Gasper <felipe@felipegasper.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 22 May, 2019 1 commit
  5. 21 May, 2019 11 commits
  6. 20 May, 2019 2 commits
    • R
      net: fix kernel-doc warnings for socket.c · 85806af0
      Randy Dunlap committed
      Fix kernel-doc warnings by moving the kernel-doc notation to be
      immediately above the functions that it describes.
      
      Fixes these warnings for sock_sendmsg() and sock_recvmsg():
      
      ../net/socket.c:658: warning: Excess function parameter 'sock' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:658: warning: Excess function parameter 'msg' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:889: warning: Excess function parameter 'sock' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:889: warning: Excess function parameter 'msg' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:889: warning: Excess function parameter 'flags' description in 'INDIRECT_CALLABLE_DECLARE'
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      net: Treat sock->sk_drops as an unsigned int when printing · ea9a0379
      Patrick Talbert committed
      Currently, procfs socket stats format sk_drops as a signed int (%d). For large
      values this will cause a negative number to be printed.
      
      We know the drop count can never be negative, so change the format specifier to
      %u.
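
The effect of the format specifier can be reproduced in userspace. The helper below is a hypothetical illustration, not the procfs code: it formats the same counter once as a signed int, which is what `%d` effectively did, and once with `%u`. The cast to `int` for values above `INT_MAX` is implementation-defined in C; on the usual two's-complement platforms it wraps to a negative number.

```c
#include <assert.h>
#include <stdio.h>

/* Format a drop counter the old way (as a signed int) and the fixed way (%u). */
static void demo_format_drops(unsigned int drops,
                              char *as_signed, char *as_unsigned, size_t len)
{
    assert(len > 0);
    snprintf(as_signed, len, "%d", (int)drops);   /* old behaviour */
    snprintf(as_unsigned, len, "%u", drops);      /* fixed behaviour */
}
```

For a counter of 3000000000, the unsigned form prints the value as is, while the signed form starts with a minus sign on typical platforms.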
      Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 19 May, 2019 1 commit
    • J
      vsock/virtio: Initialize core virtio vsock before registering the driver · ba95e5df
      Jorge E. Moreira committed
      Avoid a race in which static variables in net/vmw_vsock/af_vsock.c are
      accessed (while handling interrupts) before they are initialized.
      
      [    4.201410] BUG: unable to handle kernel paging request at ffffffffffffffe8
      [    4.207829] IP: vsock_addr_equals_addr+0x3/0x20
      [    4.211379] PGD 28210067 P4D 28210067 PUD 28212067 PMD 0
      [    4.211379] Oops: 0000 [#1] PREEMPT SMP PTI
      [    4.211379] Modules linked in:
      [    4.211379] CPU: 1 PID: 30 Comm: kworker/1:1 Not tainted 4.14.106-419297-gd7e28cc1f241 #1
      [    4.211379] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      [    4.211379] Workqueue: virtio_vsock virtio_transport_rx_work
      [    4.211379] task: ffffa3273d175280 task.stack: ffffaea1800e8000
      [    4.211379] RIP: 0010:vsock_addr_equals_addr+0x3/0x20
      [    4.211379] RSP: 0000:ffffaea1800ebd28 EFLAGS: 00010286
      [    4.211379] RAX: 0000000000000002 RBX: 0000000000000000 RCX: ffffffffb94e42f0
      [    4.211379] RDX: 0000000000000400 RSI: ffffffffffffffe0 RDI: ffffaea1800ebdd0
      [    4.211379] RBP: ffffaea1800ebd58 R08: 0000000000000001 R09: 0000000000000001
      [    4.211379] R10: 0000000000000000 R11: ffffffffb89d5d60 R12: ffffaea1800ebdd0
      [    4.211379] R13: 00000000828cbfbf R14: 0000000000000000 R15: ffffaea1800ebdc0
      [    4.211379] FS:  0000000000000000(0000) GS:ffffa3273fd00000(0000) knlGS:0000000000000000
      [    4.211379] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    4.211379] CR2: ffffffffffffffe8 CR3: 000000002820e001 CR4: 00000000001606e0
      [    4.211379] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    4.211379] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    4.211379] Call Trace:
      [    4.211379]  ? vsock_find_connected_socket+0x6c/0xe0
      [    4.211379]  virtio_transport_recv_pkt+0x15f/0x740
      [    4.211379]  ? detach_buf+0x1b5/0x210
      [    4.211379]  virtio_transport_rx_work+0xb7/0x140
      [    4.211379]  process_one_work+0x1ef/0x480
      [    4.211379]  worker_thread+0x312/0x460
      [    4.211379]  kthread+0x132/0x140
      [    4.211379]  ? process_one_work+0x480/0x480
      [    4.211379]  ? kthread_destroy_worker+0xd0/0xd0
      [    4.211379]  ret_from_fork+0x35/0x40
      [    4.211379] Code: c7 47 08 00 00 00 00 66 c7 07 28 00 c7 47 08 ff ff ff ff c7 47 04 ff ff ff ff c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 8b 47 08 <3b> 46 08 75 0a 8b 47 04 3b 46 04 0f 94 c0 c3 31 c0 c3 90 66 2e
      [    4.211379] RIP: vsock_addr_equals_addr+0x3/0x20 RSP: ffffaea1800ebd28
      [    4.211379] CR2: ffffffffffffffe8
      [    4.211379] ---[ end trace f31cc4a2e6df3689 ]---
      [    4.211379] Kernel panic - not syncing: Fatal exception in interrupt
      [    4.211379] Kernel Offset: 0x37000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [    4.211379] Rebooting in 5 seconds..
      
      Fixes: 22b5c0b6 ("vsock/virtio: fix kernel panic after device hot-unplug")
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Stefano Garzarella <sgarzare@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: kvm@vger.kernel.org
      Cc: virtualization@lists.linux-foundation.org
      Cc: netdev@vger.kernel.org
      Cc: kernel-team@android.com
      Cc: stable@vger.kernel.org [4.9+]
      Signed-off-by: Jorge E. Moreira <jemoreira@google.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 18 May, 2019 5 commits