1. 18 Jan, 2014 24 commits
  2. 17 Jan, 2014 16 commits
    • Merge branch 'virtio_rx_merging' · cf84eb0b
      David S. Miller committed
      Michael Dalton says:
      
      ====================
      virtio-net: mergeable rx buffer size auto-tuning
      
      The virtio-net device currently uses aligned MTU-sized mergeable receive
      packet buffers. Network throughput for workloads with large average
      packet size can be improved by posting larger receive packet buffers.
      However, due to SKB truesize effects, posting large (e.g., PAGE_SIZE)
      buffers reduces the throughput of workloads that do not benefit from GRO
      and have no large inbound packets.
      
      This patchset introduces virtio-net mergeable buffer size auto-tuning,
      with buffer sizes ranging from aligned MTU-size to PAGE_SIZE. Packet
      buffer size is chosen based on a per-receive queue EWMA of incoming
      packet size.
      
      To unify mergeable receive buffer memory allocation and improve
      SKB frag coalescing, all mergeable buffer memory allocation is
      migrated to per-receive queue page frag allocators.
      
      The per-receive queue mergeable packet buffer size is exported via
      sysfs, and the network device sysfs layer has been extended to add
      support for device-specific per-receive queue sysfs attribute groups.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio-net: initial rx sysfs support, export mergeable rx buffer size · fbf28d78
      Michael Dalton committed
      Add initial support for per-rx queue sysfs attributes to virtio-net. If
      mergeable packet buffers are enabled, adds a read-only mergeable packet
      buffer size sysfs attribute for each RX queue.
      Suggested-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Michael Dalton <mwdalton@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lib: Ensure EWMA does not store wrong intermediate values · 03144b58
      Michael Dalton committed
      To ensure ewma_read() without a lock returns a valid but possibly
      out of date average, modify ewma_add() by using ACCESS_ONCE to prevent
      intermediate wrong values from being written to avg->internal.
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Michael Dalton <mwdalton@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
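      The idea behind this fix can be sketched in userspace C. This is a
      minimal illustration, not the kernel source: the sketch_ names, the
      field names, and the SKETCH_ACCESS_ONCE macro (a volatile cast standing
      in for the kernel's ACCESS_ONCE) are assumptions made for the example.
      It shows the new average being computed in a local variable and then
      published with a single store, so that, on platforms where an aligned
      unsigned long is written in one access, a lockless reader sees either
      the old or the new average.

      ```c
      #include <assert.h>

      /* Userspace stand-in for ACCESS_ONCE(): a volatile access that the
       * compiler may not split, repeat, or cache. Uses the GNU __typeof__
       * extension (gcc and clang). */
      #define SKETCH_ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

      struct sketch_ewma {
      	unsigned long internal;   /* scaled average; 0 means no samples yet */
      	unsigned long factor_log; /* log2 of the fixed-point scaling factor */
      	unsigned long weight_log; /* log2 of the weight of the old average */
      };

      void sketch_ewma_init(struct sketch_ewma *avg, unsigned long factor_log,
      		      unsigned long weight_log)
      {
      	const unsigned long bits = 8 * sizeof(unsigned long);

      	assert(factor_log < bits && weight_log < bits);
      	avg->internal = 0;
      	avg->factor_log = factor_log;
      	avg->weight_log = weight_log;
      }

      void sketch_ewma_add(struct sketch_ewma *avg, unsigned long val)
      {
      	/* Read the shared field once into a local. */
      	unsigned long internal = SKETCH_ACCESS_ONCE(avg->internal);
      	unsigned long next;

      	if (internal)
      		next = (((internal << avg->weight_log) - internal) +
      			(val << avg->factor_log)) >> avg->weight_log;
      	else
      		next = val << avg->factor_log;

      	/* Publish the finished value with one store, so no partially
      	 * computed value is ever written to the shared field. */
      	SKETCH_ACCESS_ONCE(avg->internal) = next;
      }

      unsigned long sketch_ewma_read(const struct sketch_ewma *avg)
      {
      	return SKETCH_ACCESS_ONCE(avg->internal) >> avg->factor_log;
      }
      ```

      With a weight of 8 (weight_log = 3), each new sample contributes one
      eighth of the updated average.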
    • net-sysfs: add support for device-specific rx queue sysfs attributes · a953be53
      Michael Dalton committed
      Extend existing support for netdevice receive queue sysfs attributes to
      permit a device-specific attribute group. Initial use case for this
      support will be to allow the virtio-net device to export per-receive
      queue mergeable receive buffer size.
      Signed-off-by: Michael Dalton <mwdalton@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio-net: auto-tune mergeable rx buffer size for improved performance · ab7db917
      Michael Dalton committed
      Commit 2613af0e ("virtio_net: migrate mergeable rx buffers to page frag
      allocators") changed the mergeable receive buffer size from PAGE_SIZE to
      MTU-size, introducing a single-stream regression for benchmarks with large
      average packet size. There is no single optimal buffer size for all
      workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
      header-sized buffers are preferred as larger buffers reduce the TCP window
      due to SKB truesize. However, single-stream workloads with large average
      packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
      are used.
      
      This commit auto-tunes the mergeable receive buffer size by
      choosing the packet buffer size based on an EWMA of the recent packet
      sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
      virtio-net header len to PAGE_SIZE. This improves throughput for
      large packet workloads, as any workload with average packet size >=
      PAGE_SIZE will use PAGE_SIZE buffers.
      
      These optimizations interact positively with recent commit
      ba275241 ("virtio-net: coalesce rx frags when possible during rx"),
      which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
      optimizations benefit buffers of any size.
      
      Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
      between two QEMU VMs on a single physical machine. Each VM has two VCPUs
      with all offloads & vhost enabled. All VMs and vhost threads run in a
      single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
      in the system will not be scheduled on the benchmark CPUs. Trunk includes
      SKB rx frag coalescing.
      
      net-next w/ virtio_net before 2613af0e (PAGE_SIZE bufs): 14642.85Gb/s
      net-next (MTU-size bufs):  13170.01Gb/s
      net-next + auto-tune: 14555.94Gb/s
      
      Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
      using MTU-sized buffers to about 26Gb/s using auto-tuning.
      Signed-off-by: Michael Dalton <mwdalton@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
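      The sizing rule described above (clamp the average packet length
      between the MTU-sized minimum and what fits in a page, add the header,
      then align) can be sketched as follows. This is an illustration under
      stated assumptions, not the driver code: the function name and all
      parameters are hypothetical, and the concrete header length, minimum
      packet length, and alignment are passed in by the caller instead of
      being taken from the driver.

      ```c
      #include <assert.h>

      /* Hypothetical helper: choose a mergeable rx buffer length from the
       * running average packet length. All sizes are in bytes. */
      unsigned int sketch_mergeable_buf_len(unsigned int avg_pkt_len,
      				      unsigned int min_pkt_len,
      				      unsigned int hdr_len,
      				      unsigned int page_size,
      				      unsigned int align)
      {
      	unsigned int max_pkt_len;
      	unsigned int len = avg_pkt_len;

      	/* align must be a power of two for the mask trick below. */
      	assert(align != 0 && (align & (align - 1)) == 0);
      	assert(hdr_len < page_size);

      	max_pkt_len = page_size - hdr_len;
      	assert(min_pkt_len <= max_pkt_len);

      	/* Clamp the average into [min_pkt_len, page_size - hdr_len]. */
      	if (len < min_pkt_len)
      		len = min_pkt_len;
      	if (len > max_pkt_len)
      		len = max_pkt_len;

      	/* Add room for the header, then round up to the alignment. */
      	len += hdr_len;
      	return (len + align - 1) & ~(align - 1);
      }
      ```

      Small averages therefore yield the aligned MTU-sized buffer, and any
      average at or above a page yields a full-page buffer.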
    • virtio-net: use per-receive queue page frag alloc for mergeable bufs · fb51879d
      Michael Dalton committed
      The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
      mergeable rx buffer allocations. This commit migrates virtio-net to use
      per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
      mergeable rx buffer memory allocation, which now will use
      skb_page_frag_refill() for both atomic and GFP_WAIT buffer allocations.
      
      To address fragmentation concerns, if after buffer allocation there
      is too little space left in the page frag to allocate a subsequent
      buffer, the remaining space is added to the current allocated buffer
      so that the remaining space can be used to store packet data.
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Michael Dalton <mwdalton@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
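      The fragmentation rule in the second paragraph can be sketched as
      below. This is a minimal model, not the driver code: struct sketch_frag
      and sketch_carve_buf() are hypothetical names, and the frag is reduced
      to a size and an offset with no backing memory.

      ```c
      #include <assert.h>

      /* Hypothetical model of a page frag: total bytes and bytes handed out. */
      struct sketch_frag {
      	unsigned int size;
      	unsigned int offset;
      };

      /* Carve a buffer of at least len bytes from the frag. Returns the
       * length actually assigned to the buffer, or 0 if len does not fit
       * (a real allocator would refill the frag at that point). */
      unsigned int sketch_carve_buf(struct sketch_frag *frag, unsigned int len)
      {
      	unsigned int hole;

      	assert(frag->offset <= frag->size);
      	if (len == 0 || len > frag->size - frag->offset)
      		return 0;

      	frag->offset += len;
      	hole = frag->size - frag->offset;
      	if (hole < len) {
      		/* Too little space is left for another buffer of this
      		 * size, so give the remainder to the current buffer. */
      		len += hole;
      		frag->offset += hole;
      	}
      	return len;
      }
      ```

      For a 4096-byte frag and 1536-byte requests, the first buffer is 1536
      bytes and the second absorbs the 1024-byte tail, for 2560 bytes.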
    • net: allow > 0 order atomic page alloc in skb_page_frag_refill · 097b4f19
      Michael Dalton committed
      skb_page_frag_refill currently permits only order-0 page allocs
      unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
      higher-order page allocations whether or not GFP_WAIT is used. If
      memory cannot be allocated, the allocator will fall back to
      successively smaller page allocs (down to order-0 page allocs).
      
      This change brings skb_page_frag_refill in line with the existing
      page allocation strategy employed by netdev_alloc_frag, which attempts
      higher-order page allocations whether or not GFP_WAIT is set, falling
      back to successively lower-order page allocations on failure. Part
      of migration of virtio-net to per-receive queue page frag allocators.
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Michael Dalton <mwdalton@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
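      The fallback strategy described above (try the highest order first,
      then successively smaller orders down to order 0) can be sketched with
      an injected allocator. Everything here is hypothetical scaffolding for
      illustration: sketch_alloc_fallback(), the callback type, and the mock
      allocator are not kernel APIs, and the mock only pretends to allocate.

      ```c
      #include <assert.h>
      #include <stddef.h>

      /* Hypothetical allocator callback: returns memory for 2^order pages,
       * or NULL on failure. */
      typedef void *(*sketch_alloc_fn)(unsigned int order, void *ctx);

      /* Try max_order first and fall back to smaller orders on failure.
       * On success, stores the order that worked in *got_order. */
      void *sketch_alloc_fallback(sketch_alloc_fn alloc, void *ctx,
      			    unsigned int max_order, unsigned int *got_order)
      {
      	unsigned int order = max_order;

      	assert(alloc != NULL && got_order != NULL);
      	for (;;) {
      		void *mem = alloc(order, ctx);

      		if (mem) {
      			*got_order = order;
      			return mem;
      		}
      		if (order == 0)
      			return NULL; /* even a single page failed */
      		order--;
      	}
      }

      /* Mock allocator used only to exercise the loop: succeeds for orders
       * up to *ctx (an int), and always fails when *ctx is negative. */
      static char sketch_backing[1];

      void *sketch_mock_alloc(unsigned int order, void *ctx)
      {
      	const int *largest_ok = ctx;

      	if (*largest_ok >= 0 && order <= (unsigned int)*largest_ok)
      		return sketch_backing;
      	return NULL;
      }
      ```

      With the mock limited to order 1, a request starting at order 3 fails
      twice and then succeeds at order 1.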
    • net_sched: fix error return code in fw_change_attrs() · 722e47d7
      Wei Yongjun committed
      The error code was not set if changing indev failed, so the error
      condition was not reflected in the return value. Fix this by returning
      a negative error code from this error-handling path instead of 0.
      
      Fixes: 2519a602 ('net_sched: optimize tcf_match_indev()')
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
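      The class of bug fixed here can be illustrated with a small
      self-contained sketch. The functions below are hypothetical and are not
      the net_sched code; they only show why the error variable must be
      assigned before jumping to a shared error label.

      ```c
      #include <assert.h>
      #include <errno.h>
      #include <stddef.h>

      /* Hypothetical helper: pretend to resolve an input device name.
       * Returns a positive index on success or a negative errno. */
      int sketch_lookup_indev(const char *name)
      {
      	if (name == NULL || name[0] == '\0')
      		return -ENODEV;
      	return 1; /* made-up interface index */
      }

      int sketch_change_attrs(const char *indev_name, int *indev_out)
      {
      	int err = 0; /* earlier setup steps are assumed to have succeeded */
      	int ret;

      	assert(indev_out != NULL);

      	ret = sketch_lookup_indev(indev_name);
      	if (ret < 0) {
      		/* Propagate the failure. Jumping to errout without this
      		 * assignment would return the stale 0 held in err. */
      		err = ret;
      		goto errout;
      	}
      	*indev_out = ret;
      	return 0;

      errout:
      	/* Cleanup of partially initialised state would go here. */
      	return err;
      }
      ```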
    • Merge branch 'tipc' · 8b88a11e
      David S. Miller committed
      Ying Xue says:
      
      ====================
      tipc: align TIPC behaviours of waiting for events with other stacks
      
      Compared with other stacks, the TIPC socket layer waits for events
      in a very different way: it always calls
      wait_event_interruptible_timeout()/wait_event_interruptible() and
      passes the relevant socket or port variables to them as arguments.
      Because the socket lock has to be released temporarily before these
      two wait routines are called, the socket and port fields used as
      their arguments are read without socket lock protection. This can
      cause serious problems: a process blocked in a socket syscall such
      as sendmsg(), connect(), accept(), or recvmsg() may never be woken
      up even though the expected event arrives, or may be woken up even
      though the wake-up condition is not actually satisfied.

      Aligning TIPC's behaviour with the similar functions implemented in
      other stacks, for instance sk_stream_wait_connect() and
      inet_csk_wait_for_connect(), avoids these risks.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
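      The locking discipline this series moves towards (evaluate the wake-up
      condition only while holding the lock, and re-check it after every
      wake-up) has a close userspace analogue in POSIX condition variables.
      The sketch below is that analogue, not TIPC code: struct sketch_sock
      and both functions are hypothetical, the mutex stands in for the socket
      lock, and the condition variable stands in for the socket wait queue.

      ```c
      #define _POSIX_C_SOURCE 200809L
      #include <assert.h>
      #include <errno.h>
      #include <pthread.h>
      #include <stdbool.h>
      #include <time.h>

      struct sketch_sock {
      	pthread_mutex_t lock;  /* stands in for the socket lock */
      	pthread_cond_t wakeup; /* stands in for the socket wait queue */
      	bool connected;        /* state the waiter depends on */
      };

      /* Caller must hold s->lock. Returns 0 once connected, or ETIMEDOUT
       * if the timeout expires first. */
      int sketch_wait_connected(struct sketch_sock *s, long timeout_ms)
      {
      	struct timespec deadline;
      	int rc = 0;

      	assert(timeout_ms >= 0);
      	clock_gettime(CLOCK_REALTIME, &deadline);
      	deadline.tv_sec += timeout_ms / 1000;
      	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
      	if (deadline.tv_nsec >= 1000000000L) {
      		deadline.tv_sec++;
      		deadline.tv_nsec -= 1000000000L;
      	}

      	/* The condition is read only with the lock held.
      	 * pthread_cond_timedwait() drops the lock while sleeping and
      	 * re-acquires it before returning, so each re-check sees
      	 * current state and spurious wake-ups are harmless. */
      	while (!s->connected && rc == 0)
      		rc = pthread_cond_timedwait(&s->wakeup, &s->lock, &deadline);

      	return s->connected ? 0 : rc;
      }

      /* Called by the side that delivers the event. */
      void sketch_mark_connected(struct sketch_sock *s)
      {
      	pthread_mutex_lock(&s->lock);
      	s->connected = true;
      	pthread_cond_broadcast(&s->wakeup);
      	pthread_mutex_unlock(&s->lock);
      }
      ```

      Because the state change and the wake-up both happen under the same
      lock that the waiter holds when it tests the condition, the waiter can
      neither miss the event nor act on a stale value.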
    • tipc: standardize recvmsg routine · 9bbb4ecc
      Ying Xue committed
      Standardize the behaviour of waiting for events in TIPC recvmsg()
      so that all variables of socket or port structures are protected
      within socket lock, allowing the process of calling recvmsg() to
      be woken up at appropriate time.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: standardize sendmsg routine of connected socket · 391a6dd1
      Ying Xue committed
      Standardize the behaviour of waiting for events in TIPC send_packet()
      so that all variables of socket or port structures are protected within
      socket lock, allowing the process of calling sendmsg() to be woken up
      at appropriate time.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: standardize sendmsg routine of connectionless socket · 3f40504f
      Ying Xue committed
      Comparing the behaviour of how to wait for events in TIPC sendmsg()
      with other stacks, the TIPC implementation might be perceived as
      different, and sometimes even incorrect. For instance, sk_sleep()
      and the tport->congested variable associated with the socket are
      accessed by wait_event_interruptible_timeout() without socket lock
      protection. Standardizing this code on the similar implementations
      in other stacks corrects these errors, in which the process calling
      sendmsg() may not be woken up even if the expected event arrives at
      the socket, or may be woken up although the wake condition is not
      met.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: standardize accept routine · 6398e23c
      Ying Xue committed
      Comparing the behaviour of how to wait for events in TIPC accept()
      with other stacks, the TIPC implementation might be perceived as
      different, and sometimes even incorrect. As sk_sleep() and
      sk->sk_receive_queue variables associated with socket are not
      protected by socket lock, the process of calling accept() may be
      woken up improperly or sometimes cannot be woken up at all. After
      standardizing it with inet_csk_wait_for_connect routine, we can
      get benefits including: avoiding 'thundering herd' phenomenon,
      adding a timeout mechanism for accept(), coping with a pending
      signal, and having sk_sleep() and sk->sk_receive_queue being
      always protected within socket lock scope and so on.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: standardize connect routine · 78eb3a53
      Ying Xue committed
      Comparing the behaviour of how to wait for events in TIPC connect()
      with other stacks, the TIPC implementation might be perceived as
      different, and sometimes even incorrect. For instance, both
      sock->state and sk_sleep() are fed directly to
      wait_event_interruptible_timeout() as its arguments, and the socket
      lock has to be released before wait_event_interruptible_timeout() is
      called. The two socket variables are therefore read outside socket
      lock protection and may be stale, so the process calling connect()
      may not be woken up even if the correct event arrives, or may be
      woken up even if the wake condition is not actually satisfied.
      Standardizing this behaviour on the sk_stream_wait_connect() routine
      avoids these risks.
      
      Additionally the implementation of connect routine is simplified as a
      whole, allowing it to return correct values in all different cases.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: remove the unnecessary assignment · abfce3ef
      wangweidong committed
      On the success path the status is already 0, so there is no need to
      assign it again. Remove the redundant assignment.
      Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio-net: drop rq->max and rq->num · be121f46
      Jason Wang committed
      It looks like there's no need for those two fields:
      
      - Unless there's a failure for the first refill try, rq->max should be always
        equal to the vring size.
      - rq->num is only used to determine the condition that we need to do the refill,
        we could check vq->num_free instead.
      - rq->num was required to be increased or decreased explicitly after each
        get/put, which results in a bad API.
      
      So this patch removes them both to make the code simpler.
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
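      The second bullet above (decide whether to refill from the ring's own
      free count instead of a driver-side counter) can be sketched as below.
      The struct and function are hypothetical, and the "more than half the
      ring is free" threshold is an assumption chosen for illustration, not
      something stated in the commit message.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Hypothetical view of a virtqueue: only the two counts that matter. */
      struct sketch_vq {
      	unsigned int ring_size; /* total descriptors in the ring */
      	unsigned int num_free;  /* descriptors not currently posted */
      };

      /* Refill when more than half of the ring is unused. No separate
       * counter has to be kept in sync after each get/put. */
      bool sketch_needs_refill(const struct sketch_vq *vq)
      {
      	assert(vq->num_free <= vq->ring_size);
      	return vq->num_free > vq->ring_size / 2;
      }
      ```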