1. 13 7月, 2012 1 次提交
    • A
      net: Update alloc frag to reduce get/put page usage and recycle pages · 540eb7bf
      Alexander Duyck 提交于
      This patch is meant to help improve performance by reducing the number of
      locked operations required to allocate a frag on x86 and other platforms.
      This is accomplished by using atomic_set operations on the page count
      instead of calling get_page and put_page.  It is based on work originally
      provided by Eric Dumazet.
      
      In addition it also helps to reduce memory overhead when using TCP.  This
      is done by recycling the page if the only holder of the frame is the
      netdev_alloc_frag call itself.  This can occur when skb heads are stolen by
      either GRO or TCP and the driver providing the packets is using paged frags
      to store all of the data for the packets.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      540eb7bf
  2. 12 7月, 2012 1 次提交
    • E
      tcp: TCP Small Queues · 46d3ceab
      Eric Dumazet 提交于
      This introduce TSQ (TCP Small Queues)
      
      TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
      device queues), to reduce RTT and cwnd bias, part of the bufferbloat
      problem.
      
      sk->sk_wmem_alloc not allowed to grow above a given limit,
      allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
      given time.
      
      TSO packets are sized/capped to half the limit, so that we have two
      TSO packets in flight, allowing better bandwidth use.
      
      As a side effect, setting the limit to 40000 automatically reduces the
      standard gso max limit (65536) to 40000/2 : It can help to reduce
      latencies of high prio packets, having smaller TSO packets.
      
      This means we divert sock_wfree() to a tcp_wfree() handler, to
      queue/send following frames when skb_orphan() [2] is called for the
      already queued skbs.
      
      Results on my dev machines (tg3/ixgbe nics) are really impressive,
      using standard pfifo_fast, and with or without TSO/GSO.
      
      Without reduction of nominal bandwidth, we have reduction of buffering
      per bulk sender :
      < 1ms on Gbit (instead of 50ms with TSO)
      < 8ms on 100Mbit (instead of 132 ms)
      
      I no longer have 4 MBytes backlogged in qdisc by a single netperf
      session, and both side socket autotuning no longer use 4 Mbytes.
      
      As skb destructor cannot restart xmit itself ( as qdisc lock might be
      taken at this point ), we delegate the work to a tasklet. We use one
      tasklest per cpu for performance reasons.
      
      If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
      This flag is tested in a new protocol method called from release_sock(),
      to eventually send new segments.
      
      [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
      [2] skb_orphan() is usually called at TX completion time,
        but some drivers call it in their start_xmit() handler.
        These drivers should at least use BQL, or else a single TCP
        session can still fill the whole NIC TX ring, since TSQ will
        have no effect.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Dave Taht <dave.taht@bufferbloat.net>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46d3ceab
  3. 11 7月, 2012 3 次提交
  4. 10 7月, 2012 1 次提交
  5. 09 7月, 2012 1 次提交
    • G
      cgroup: fix panic in netprio_cgroup · b761c9b1
      Gao feng 提交于
      we set max_prioidx to the first zero bit index of prioidx_map in
      function get_prioidx.
      
      So when we delete the low index netprio cgroup and adding a new
      netprio cgroup again,the max_prioidx will be set to the low index.
      
      when we set the high index cgroup's net_prio.ifpriomap,the function
      write_priomap will call update_netdev_tables to alloc memory which
      size is sizeof(struct netprio_map) + sizeof(u32) * (max_prioidx + 1),
      so the size of array that map->priomap point to is max_prioidx +1,
      which is low than what we actually need.
      
      fix this by adding check in get_prioidx,only set max_prioidx when
      max_prioidx low than the new prioidx.
      Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b761c9b1
  6. 05 7月, 2012 5 次提交
  7. 30 6月, 2012 1 次提交
    • P
      netlink: add netlink_kernel_cfg parameter to netlink_kernel_create · a31f2d17
      Pablo Neira Ayuso 提交于
      This patch adds the following structure:
      
      struct netlink_kernel_cfg {
              unsigned int    groups;
              void            (*input)(struct sk_buff *skb);
              struct mutex    *cb_mutex;
      };
      
      That can be passed to netlink_kernel_create to set optional configurations
      for netlink kernel sockets.
      
      I've populated this structure by looking for NULL and zero parameters at the
      existing code. The remaining parameters that always need to be set are still
      left in the original interface.
      
      That includes optional parameters for the netlink socket creation. This allows
      easy extensibility of this interface in the future.
      
      This patch also adapts all callers to use this new interface.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a31f2d17
  8. 29 6月, 2012 2 次提交
  9. 28 6月, 2012 3 次提交
  10. 20 6月, 2012 1 次提交
    • D
      ipv4: Early TCP socket demux. · 41063e9d
      David S. Miller 提交于
      Input packet processing for local sockets involves two major demuxes.
      One for the route and one for the socket.
      
      But we can optimize this down to one demux for certain kinds of local
      sockets.
      
      Currently we only do this for established TCP sockets, but it could
      at least in theory be expanded to other kinds of connections.
      
      If a TCP socket is established then it's identity is fully specified.
      
      This means that whatever input route was used during the three-way
      handshake must work equally well for the rest of the connection since
      the keys will not change.
      
      Once we move to established state, we cache the receive packet's input
      route to use later.
      
      Like the existing cached route in sk->sk_dst_cache used for output
      packets, we have to check for route invalidations using dst->obsolete
      and dst->ops->check().
      
      Early demux occurs outside of a socket locked section, so when a route
      invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
      actually inside of established state packet processing and thus have
      the socket locked.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      41063e9d
  11. 16 6月, 2012 1 次提交
  12. 14 6月, 2012 2 次提交
    • E
      netpoll: fix netpoll_send_udp() bugs · 954fba02
      Eric Dumazet 提交于
      Bogdan Hamciuc diagnosed and fixed following bug in netpoll_send_udp() :
      
      "skb->len += len;" instead of "skb_put(skb, len);"
      
      Meaning that _if_ a network driver needs to call skb_realloc_headroom(),
      only packet headers would be copied, leaving garbage in the payload.
      
      However the skb_realloc_headroom() must be avoided as much as possible
      since it requires memory and netpoll tries hard to work even if memory
      is exhausted (using a pool of preallocated skbs)
      
      It appears netpoll_send_udp() reserved 16 bytes for the ethernet header,
      which happens to work for typicall drivers but not all.
      
      Right thing is to use LL_RESERVED_SPACE(dev)
      (And also add dev->needed_tailroom of tailroom)
      
      This patch combines both fixes.
      
      Many thanks to Bogdan for raising this issue.
      Reported-by: NBogdan Hamciuc <bogdan.hamciuc@freescale.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Tested-by: NBogdan Hamciuc <bogdan.hamciuc@freescale.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Reviewed-by: NNeil Horman <nhorman@tuxdriver.com>
      Reviewed-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      954fba02
    • E
      splice: fix racy pipe->buffers uses · 047fe360
      Eric Dumazet 提交于
      Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered
      by splice_shrink_spd() called from vmsplice_to_pipe()
      
      commit 35f3d14d (pipe: add support for shrinking and growing pipes)
      added capability to adjust pipe->buffers.
      
      Problem is some paths don't hold pipe mutex and assume pipe->buffers
      doesn't change for their duration.
      
      Fix this by adding nr_pages_max field in struct splice_pipe_desc, and
      use it in place of pipe->buffers where appropriate.
      
      splice_shrink_spd() loses its struct pipe_inode_info argument.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Tom Herbert <therbert@google.com>
      Cc: stable <stable@vger.kernel.org> # 2.6.35
      Tested-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      047fe360
  13. 13 6月, 2012 2 次提交
    • B
      ethtool: Make more commands available to unprivileged processes · 2da45db2
      Ben Hutchings 提交于
      'Get' commands should generally not require CAP_NET_ADMIN, with
      the exception of those that expose internal state.
      Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2da45db2
    • M
      net-next: add dev_loopback_xmit() to avoid duplicate code · 95603e22
      Michel Machado 提交于
      Add dev_loopback_xmit() in order to deduplicate functions
      ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and
      ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c).
      
      I was about to reinvent the wheel when I noticed that
      ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I
      need and are not IP-only functions, but they were not available to reuse
      elsewhere.
      
      ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I
      understand that this is harmless, and should be in dev_loopback_xmit().
      Signed-off-by: NMichel Machado <michel@digirati.com.br>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jpirko@redhat.com>
      CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95603e22
  14. 09 6月, 2012 1 次提交
  15. 08 6月, 2012 3 次提交
  16. 06 6月, 2012 1 次提交
  17. 04 6月, 2012 2 次提交
    • E
      drop_monitor: dont sleep in atomic context · bec4596b
      Eric Dumazet 提交于
      drop_monitor calls several sleeping functions while in atomic context.
      
       BUG: sleeping function called from invalid context at mm/slub.c:943
       in_atomic(): 1, irqs_disabled(): 0, pid: 2103, name: kworker/0:2
       Pid: 2103, comm: kworker/0:2 Not tainted 3.5.0-rc1+ #55
       Call Trace:
        [<ffffffff810697ca>] __might_sleep+0xca/0xf0
        [<ffffffff811345a3>] kmem_cache_alloc_node+0x1b3/0x1c0
        [<ffffffff8105578c>] ? queue_delayed_work_on+0x11c/0x130
        [<ffffffff815343fb>] __alloc_skb+0x4b/0x230
        [<ffffffffa00b0360>] ? reset_per_cpu_data+0x160/0x160 [drop_monitor]
        [<ffffffffa00b022f>] reset_per_cpu_data+0x2f/0x160 [drop_monitor]
        [<ffffffffa00b03ab>] send_dm_alert+0x4b/0xb0 [drop_monitor]
        [<ffffffff810568e0>] process_one_work+0x130/0x4c0
        [<ffffffff81058249>] worker_thread+0x159/0x360
        [<ffffffff810580f0>] ? manage_workers.isra.27+0x240/0x240
        [<ffffffff8105d403>] kthread+0x93/0xa0
        [<ffffffff816be6d4>] kernel_thread_helper+0x4/0x10
        [<ffffffff8105d370>] ? kthread_freezable_should_stop+0x80/0x80
        [<ffffffff816be6d0>] ? gs_change+0xb/0xb
      
      Rework the logic to call the sleeping functions in right context.
      
      Use standard timer/workqueue api to let system chose any cpu to perform
      the allocation and netlink send.
      
      Also avoid a loop if reset_per_cpu_data() cannot allocate memory :
      use mod_timer() to wait 1/10 second before next try.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Reviewed-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bec4596b
    • E
      sock_diag: add SK_MEMINFO_BACKLOG · d594e987
      Eric Dumazet 提交于
      Adding socket backlog len in INET_DIAG_SKMEMINFO is really useful to
      diagnose various TCP problems.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d594e987
  18. 01 6月, 2012 1 次提交
  19. 30 5月, 2012 1 次提交
  20. 20 5月, 2012 1 次提交
    • E
      net: introduce skb_try_coalesce() · bad43ca8
      Eric Dumazet 提交于
      Move tcp_try_coalesce() protocol independent part to
      skb_try_coalesce().
      
      skb_try_coalesce() can be used in IPv4 defrag and IPv6 reassembly,
      to build optimized skbs (less sk_buff, and possibly less 'headers')
      
      skb_try_coalesce() is zero copy, unless the copy can fit in destination
      header (its a rare case)
      
      kfree_skb_partial() is also moved to net/core/skbuff.c and exported,
      because IPv6 will need it in patch (ipv6: use skb coalescing in
      reassembly).
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bad43ca8
  21. 19 5月, 2012 3 次提交
  22. 18 5月, 2012 2 次提交
    • N
      drop_monitor: convert to modular building · cad456d5
      Neil Horman 提交于
      When I first wrote drop monitor I wrote it to just build monolithically.  There
      is no reason it can't be built modularly as well, so lets give it that
      flexibiity.
      
      I've tested this by building it as both a module and monolithically, and it
      seems to work quite well
      
      Change notes:
      
      v2)
      * fixed for_each_present_cpu loops to be more correct as per Eric D.
      * Converted exit path failures to BUG_ON as per Ben H.
      
      v3)
      * Converted del_timer to del_timer_sync to close race noted by Ben H.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      Reviewed-by: NBen Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cad456d5
    • E
      net: netdev_alloc_skb() use build_skb() · a1c7fff7
      Eric Dumazet 提交于
      netdev_alloc_skb() is used by networks driver in their RX path to
      allocate an skb to receive an incoming frame.
      
      With recent skb->head_frag infrastructure, it makes sense to change
      netdev_alloc_skb() to use build_skb() and a frag allocator.
      
      This permits a zero copy splice(socket->pipe), and better GRO or TCP
      coalescing.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a1c7fff7
  23. 17 5月, 2012 1 次提交