1. 11 12月, 2014 4 次提交
    • G
      net: introduce helper macro for_each_cmsghdr · f95b414e
      Gu Zheng 提交于
      Introduce helper macro for_each_cmsghdr as a wrapper of the enumerating
      cmsghdr from msghdr, just cleanup.
      Signed-off-by: NGu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f95b414e
    • D
      net, lib: kill arch_fast_hash library bits · 0cb6c969
      Daniel Borkmann 提交于
      As there are now no remaining users of arch_fast_hash(), lets kill
      it entirely.
      
      This basically reverts commit 71ae8aac ("lib: introduce arch
      optimized hash library") and follow-up work, that is f.e., commit
      23721754 ("lib: hash: follow-up fixups for arch hash"),
      commit e3fec2f7 ("lib: Add missing arch generic-y entries for
      asm-generic/hash.h") and last but not least commit 6a02652d
      ("perf tools: Fix include for non x86 architectures").
      
      Cc: Francesco Fusco <fusco@ntop.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0cb6c969
    • A
      net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb · fd11a83d
      Alexander Duyck 提交于
      This change pulls the core functionality out of __netdev_alloc_skb and
      places them in a new function named __alloc_rx_skb.  The reason for doing
      this is to make these bits accessible to a new function __napi_alloc_skb.
      In addition __alloc_rx_skb now has a new flags value that is used to
      determine which page frag pool to allocate from.  If the SKB_ALLOC_NAPI
      flag is set then the NAPI pool is used.  The advantage of this is that we
      do not have to use local_irq_save/restore when accessing the NAPI pool from
      NAPI context.
      
      In my test setup I saw at least 11ns of savings using the napi_alloc_skb
      function versus the netdev_alloc_skb function, most of this being due to
      the fact that we didn't have to call local_irq_save/restore.
      
      The main use case for napi_alloc_skb would be for things such as copybreak
      or page fragment based receive paths where an skb is allocated after the
      data has been received instead of before.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd11a83d
    • A
      net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag · ffde7328
      Alexander Duyck 提交于
      This patch splits the netdev_alloc_frag function up so that it can be used
      on one of two page frag pools instead of being fixed on the
      netdev_alloc_cache.  By doing this we can add a NAPI specific function
      __napi_alloc_frag that accesses a pool that is only used from softirq
      context.  The advantage to this is that we do not need to call
      local_irq_save/restore which can be a significant savings.
      
      I also took the opportunity to refactor the core bits that were placed in
      __alloc_page_frag.  First I updated the allocation to do either a 32K
      allocation or an order 0 page.  This is based on the changes in commmit
      d9b2938a where it was found that latencies could be reduced in case of
      failures.  Then I also rewrote the logic to work from the end of the page to
      the start.  By doing this the size value doesn't have to be used unless we
      have run out of space for page fragments.  Finally I cleaned up the atomic
      bits so that we just do an atomic_sub_and_test and if that returns true then
      we set the page->_count via an atomic_set.  This way we can remove the extra
      conditional for the atomic_read since it would have led to an atomic_inc in
      the case of success anyway.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ffde7328
  2. 10 12月, 2014 12 次提交
    • E
      tcp: refine TSO autosizing · 605ad7f1
      Eric Dumazet 提交于
      Commit 95bd09eb ("tcp: TSO packets automatic sizing") tried to
      control TSO size, but did this at the wrong place (sendmsg() time)
      
      At sendmsg() time, we might have a pessimistic view of flow rate,
      and we end up building very small skbs (with 2 MSS per skb).
      
      This is bad because :
      
       - It sends small TSO packets even in Slow Start where rate quickly
         increases.
       - It tends to make socket write queue very big, increasing tcp_ack()
         processing time, but also increasing memory needs, not necessarily
         accounted for, as fast clones overhead is currently ignored.
       - Lower GRO efficiency and more ACK packets.
      
      Servers with a lot of small lived connections suffer from this.
      
      Lets instead fill skbs as much as possible (64KB of payload), but split
      them at xmit time, when we have a precise idea of the flow rate.
      skb split is actually quite efficient.
      
      Patch looks bigger than necessary, because TCP Small Queue decision now
      has to take place after the eventual split.
      
      As Neal suggested, introduce a new tcp_tso_autosize() helper, so that
      tcp_tso_should_defer() can be synchronized on same goal.
      
      Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable
      contains number of mss that we can put in GSO packet, and is not
      related to the autosizing goal anymore.
      
      Tested:
      
      40 ms rtt link
      
      nstat >/dev/null
      netperf -H remote -l -2000000 -- -s 1000000
      nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"
      
      Before patch :
      
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/s
      
       87380 2000000 2000000    0.36         44.22
      IpInReceives                    600                0.0
      IpOutRequests                   599                0.0
      TcpOutSegs                      1397               0.0
      IpExtOutOctets                  2033249            0.0
      
      After patch :
      
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380 2000000 2000000    0.36       44.27
      IpInReceives                    221                0.0
      IpOutRequests                   232                0.0
      TcpOutSegs                      1397               0.0
      IpExtOutOctets                  2013953            0.0
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      605ad7f1
    • A
      bury memcpy_toiovec() · 218321e7
      Al Viro 提交于
      no users left
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      218321e7
    • A
      skb_copy_datagram_iovec() can die · d3a9632f
      Al Viro 提交于
      no callers other than itself.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d3a9632f
    • A
      switch memcpy_to_msg() and skb_copy{,_and_csum}_datagram_msg() to primitives · e5a4b0bb
      Al Viro 提交于
      ... making both non-draining.  That means that tcp_recvmsg() becomes
      non-draining.  And _that_ would break iscsit_do_rx_data() unless we
      	a) make sure tcp_recvmsg() is uniformly non-draining (it is)
      	b) make sure it copes with arbitrary (including shifted)
      iov_iter (it does, all it uses is iov_iter primitives)
      	c) make iscsit_do_rx_data() initialize ->msg_iter only once.
      
      Fortunately, (c) is doable with minimal work and we are rid of one
      the two places where kernel send/recvmsg users would be unhappy with
      non-draining behaviour.
      
      Actually, that makes all but one of ->recvmsg() instances iov_iter-clean.
      The exception is skcipher_recvmsg() and it also isn't hard to convert
      to primitives (iov_iter_get_pages() is needed there).  That'll wait
      a bit - there's some interplay with ->sendmsg() path for that one.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e5a4b0bb
    • A
      put iov_iter into msghdr · c0371da6
      Al Viro 提交于
      Note that the code _using_ ->msg_iter at that point will be very
      unhappy with anything other than unshifted iovec-backed iov_iter.
      We still need to convert users to proper primitives.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c0371da6
    • A
      vmci: propagate msghdr all way down to __qp_memcpy_from_queue() · d838df2e
      Al Viro 提交于
      ... and switch it to memcpy_to_msg()
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d838df2e
    • A
      f4362a2c
    • M
      ipvlan: move the device check function into netdevice.h · 5933fea7
      Mahesh Bandewar 提交于
      Move the port check [ipvlan_dev_master()] and device check
      [ipvlan_dev_slave()] functions to netdevice.h and rename them
      netif_is_ipvlan_port() and netif_is_ipvlan() resp. to be
      consistent with macvlan api naming.
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5933fea7
    • M
      netdevice: Add a function to check macvlan port · 2f33e7d5
      Mahesh Bandewar 提交于
      Similar to a check for macvlan device, netif_is_macvlan(), add
      another function to check if a device is used as macvlan port.
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2f33e7d5
    • H
      dst: no need to take reference on DST_NOCACHE dsts · dbfc4fb7
      Hannes Frederic Sowa 提交于
      Since commit f8864972 ("ipv4: fix dst race in sk_dst_get()")
      DST_NOCACHE dst_entries get freed by RCU. So there is no need to get a
      reference on them when we are in rcu protected sections.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Reviewed-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dbfc4fb7
    • E
      net: avoid two atomic operations in fast clones · 6ffe75eb
      Eric Dumazet 提交于
      Commit ce1a4ea3 ("net: avoid one atomic operation in skb_clone()")
      took the wrong way to save one atomic operation.
      
      It is actually possible to avoid two atomic operations, if we
      do not change skb->fclone values, and only rely on clone_ref
      content to signal if the clone is available or not.
      
      skb_clone() can simply use the fast clone if clone_ref is 1.
      
      kfree_skbmem() can avoid the atomic_dec_and_test() if clone_ref is 1.
      
      Note that because we usually free the clone before the original skb,
      this particular attempt is only done for the original skb to have better
      branch prediction.
      
      SKB_FCLONE_FREE is removed.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ffe75eb
    • M
      rtnetlink: delay RTM_DELLINK notification until after ndo_uninit() · 395eea6c
      Mahesh Bandewar 提交于
      The commit 56bfa7ee ("unregister_netdevice : move RTM_DELLINK to
      until after ndo_uninit") tried to do this ealier but while doing so
      it created a problem. Unfortunately the delayed rtmsg_ifinfo() also
      delayed call to fill_info(). So this translated into asking driver
      to remove private state and then query it's private state. This
      could have catastropic consequences.
      
      This change breaks the rtmsg_ifinfo() into two parts - one takes the
      precise snapshot of the device by called fill_info() before calling
      the ndo_uninit() and the second part sends the notification using
      collected snapshot.
      
      It was brought to notice when last link is deleted from an ipvlan device
      when it has free-ed the port and the subsequent .fill_info() call is
      trying to get the info from the port.
      
      kernel: [  255.139429] ------------[ cut here ]------------
      kernel: [  255.139439] WARNING: CPU: 12 PID: 11173 at net/core/rtnetlink.c:2238 rtmsg_ifinfo+0x100/0x110()
      kernel: [  255.139493] Modules linked in: ipvlan bonding w1_therm ds2482 wire cdc_acm ehci_pci ehci_hcd i2c_dev i2c_i801 i2c_core msr cpuid bnx2x ptp pps_core mdio libcrc32c
      kernel: [  255.139513] CPU: 12 PID: 11173 Comm: ip Not tainted 3.18.0-smp-DEV #167
      kernel: [  255.139514] Hardware name: Intel RML,PCH/Ibis_QC_18, BIOS 1.0.10 05/15/2012
      kernel: [  255.139515]  0000000000000009 ffff880851b6b828 ffffffff815d87f4 00000000000000e0
      kernel: [  255.139516]  0000000000000000 ffff880851b6b868 ffffffff8109c29c 0000000000000000
      kernel: [  255.139518]  00000000ffffffa6 00000000000000d0 ffffffff81aaf580 0000000000000011
      kernel: [  255.139520] Call Trace:
      kernel: [  255.139527]  [<ffffffff815d87f4>] dump_stack+0x46/0x58
      kernel: [  255.139531]  [<ffffffff8109c29c>] warn_slowpath_common+0x8c/0xc0
      kernel: [  255.139540]  [<ffffffff8109c2ea>] warn_slowpath_null+0x1a/0x20
      kernel: [  255.139544]  [<ffffffff8150d570>] rtmsg_ifinfo+0x100/0x110
      kernel: [  255.139547]  [<ffffffff814f78b5>] rollback_registered_many+0x1d5/0x2d0
      kernel: [  255.139549]  [<ffffffff814f79cf>] unregister_netdevice_many+0x1f/0xb0
      kernel: [  255.139551]  [<ffffffff8150acab>] rtnl_dellink+0xbb/0x110
      kernel: [  255.139553]  [<ffffffff8150da90>] rtnetlink_rcv_msg+0xa0/0x240
      kernel: [  255.139557]  [<ffffffff81329283>] ? rhashtable_lookup_compare+0x43/0x80
      kernel: [  255.139558]  [<ffffffff8150d9f0>] ? __rtnl_unlock+0x20/0x20
      kernel: [  255.139562]  [<ffffffff8152cb11>] netlink_rcv_skb+0xb1/0xc0
      kernel: [  255.139563]  [<ffffffff8150a495>] rtnetlink_rcv+0x25/0x40
      kernel: [  255.139565]  [<ffffffff8152c398>] netlink_unicast+0x178/0x230
      kernel: [  255.139567]  [<ffffffff8152c75f>] netlink_sendmsg+0x30f/0x420
      kernel: [  255.139571]  [<ffffffff814e0b0c>] sock_sendmsg+0x9c/0xd0
      kernel: [  255.139575]  [<ffffffff811d1d7f>] ? rw_copy_check_uvector+0x6f/0x130
      kernel: [  255.139577]  [<ffffffff814e11c9>] ? copy_msghdr_from_user+0x139/0x1b0
      kernel: [  255.139578]  [<ffffffff814e1774>] ___sys_sendmsg+0x304/0x310
      kernel: [  255.139581]  [<ffffffff81198723>] ? handle_mm_fault+0xca3/0xde0
      kernel: [  255.139585]  [<ffffffff811ebc4c>] ? destroy_inode+0x3c/0x70
      kernel: [  255.139589]  [<ffffffff8108e6ec>] ? __do_page_fault+0x20c/0x500
      kernel: [  255.139597]  [<ffffffff811e8336>] ? dput+0xb6/0x190
      kernel: [  255.139606]  [<ffffffff811f05f6>] ? mntput+0x26/0x40
      kernel: [  255.139611]  [<ffffffff811d2b94>] ? __fput+0x174/0x1e0
      kernel: [  255.139613]  [<ffffffff814e2129>] __sys_sendmsg+0x49/0x90
      kernel: [  255.139615]  [<ffffffff814e2182>] SyS_sendmsg+0x12/0x20
      kernel: [  255.139617]  [<ffffffff815df092>] system_call_fastpath+0x12/0x17
      kernel: [  255.139619] ---[ end trace 5e6703e87d984f6b ]---
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Reported-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
      Cc: David S. Miller <davem@davemloft.net>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      395eea6c
  3. 09 12月, 2014 10 次提交
  4. 08 12月, 2014 1 次提交
  5. 06 12月, 2014 1 次提交
    • A
      net: sock: allow eBPF programs to be attached to sockets · 89aa0758
      Alexei Starovoitov 提交于
      introduce new setsockopt() command:
      
      setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd))
      
      where prog_fd was received from syscall bpf(BPF_PROG_LOAD, attr, ...)
      and attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER
      
      setsockopt() calls bpf_prog_get() which increments refcnt of the program,
      so it doesn't get unloaded while socket is using the program.
      
      The same eBPF program can be attached to multiple sockets.
      
      User task exit automatically closes socket which calls sk_filter_uncharge()
      which decrements refcnt of eBPF program
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      89aa0758
  6. 03 12月, 2014 8 次提交
  7. 02 12月, 2014 2 次提交
  8. 26 11月, 2014 1 次提交
    • A
      kvm: fix kvm_is_mmio_pfn() and rename to kvm_is_reserved_pfn() · d3fccc7e
      Ard Biesheuvel 提交于
      This reverts commit 85c8555f ("KVM: check for !is_zero_pfn() in
      kvm_is_mmio_pfn()") and renames the function to kvm_is_reserved_pfn.
      
      The problem being addressed by the patch above was that some ARM code
      based the memory mapping attributes of a pfn on the return value of
      kvm_is_mmio_pfn(), whose name indeed suggests that such pfns should
      be mapped as device memory.
      
      However, kvm_is_mmio_pfn() doesn't do quite what it says on the tin,
      and the existing non-ARM users were already using it in a way which
      suggests that its name should probably have been 'kvm_is_reserved_pfn'
      from the beginning, e.g., whether or not to call get_page/put_page on
      it etc. This means that returning false for the zero page is a mistake
      and the patch above should be reverted.
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d3fccc7e
  9. 25 11月, 2014 1 次提交
    • M
      ipvlan: Initial check-in of the IPVLAN driver. · 2ad7bf36
      Mahesh Bandewar 提交于
      This driver is very similar to the macvlan driver except that it
      uses L3 on the frame to determine the logical interface while
      functioning as packet dispatcher. It inherits L2 of the master
      device hence the packets on wire will have the same L2 for all
      the packets originating from all virtual devices off of the same
      master device.
      
      This driver was developed keeping the namespace use-case in
      mind. Hence most of the examples given here take that as the
      base setup where main-device belongs to the default-ns and
      virtual devices are assigned to the additional namespaces.
      
      The device operates in two different modes and the difference
      in these two modes in primarily in the TX side.
      
      (a) L2 mode : In this mode, the device behaves as a L2 device.
      TX processing upto L2 happens on the stack of the virtual device
      associated with (namespace). Packets are switched after that
      into the main device (default-ns) and queued for xmit.
      
      RX processing is simple and all multicast, broadcast (if
      applicable), and unicast belonging to the address(es) are
      delivered to the virtual devices.
      
      (b) L3 mode : In this mode, the device behaves like a L3 device.
      TX processing upto L3 happens on the stack of the virtual device
      associated with (namespace). Packets are switched to the
      main-device (default-ns) for the L2 processing. Hence the routing
      table of the default-ns will be used in this mode.
      
      RX processins is somewhat similar to the L2 mode except that in
      this mode only Unicast packets are delivered to the virtual device
      while main-dev will handle all other packets.
      
      The devices can be added using the "ip" command from the iproute2
      package -
      
      	ip link add link <master> <virtual> type ipvlan mode [ l2 | l3 ]
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Laurent Chavey <chavey@google.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Brandon Philips <brandon.philips@coreos.com>
      Cc: Pavel Emelianov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2ad7bf36