1. 04 1月, 2014 3 次提交
  2. 11 12月, 2013 1 次提交
  3. 23 10月, 2013 1 次提交
  4. 18 10月, 2013 1 次提交
    • E
      net: refactor sk_page_frag_refill() · 400dfd3a
      Eric Dumazet 提交于
      While working on virtio_net new allocation strategy to increase
      payload/truesize ratio, we found that refactoring sk_page_frag_refill()
      was needed.
      
      This patch splits sk_page_frag_refill() into two parts, adding
      skb_page_frag_refill() which can be used without a socket.
      
      While we are at it, add a minimum frag size of 32 for
      sk_page_frag_refill()
      
      Michael will either use netdev_alloc_frag() from softirq context,
      or skb_page_frag_refill() from process context in refill_work()
       (GFP_KERNEL allocations)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Michael Dalton <mwdalton@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      400dfd3a
  5. 09 10月, 2013 1 次提交
  6. 29 9月, 2013 1 次提交
    • E
      net: introduce SO_MAX_PACING_RATE · 62748f32
      Eric Dumazet 提交于
      As mentioned in commit afe4fd06 ("pkt_sched: fq: Fair Queue packet
      scheduler"), this patch adds a new socket option.
      
      SO_MAX_PACING_RATE offers the application the ability to cap the
      rate computed by transport layer. Value is in bytes per second.
      
      u32 val = 1000000;
      setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val));
      
      To be effectively paced, a flow must use FQ packet scheduler.
      
      Note that a packet scheduler takes into account the headers for its
      computations. The effective payload rate depends on MSS and retransmits
      if any.
      
      I chose to make this pacing rate a SOL_SOCKET option instead of a
      TCP one because this can be used by other protocols.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Steinar H. Gunderson <sesse@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      62748f32
  7. 10 8月, 2013 1 次提交
    • E
      net: attempt high order allocations in sock_alloc_send_pskb() · 28d64271
      Eric Dumazet 提交于
      Adding paged frags skbs to af_unix sockets introduced a performance
      regression on large sends because of additional page allocations, even
      if each skb could carry at least 100% more payload than before.
      
      We can instruct sock_alloc_send_pskb() to attempt high order
      allocations.
      
      Most of the time, it does a single page allocation instead of 8.
      
      I added an additional parameter to sock_alloc_send_pskb() to
      let other users to opt-in for this new feature on followup patches.
      
      Tested:
      
      Before patch :
      
      $ netperf -t STREAM_STREAM
      STREAM STREAM TEST
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       2304  212992  212992    10.00    46861.15
      
      After patch :
      
      $ netperf -t STREAM_STREAM
      STREAM STREAM TEST
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       2304  212992  212992    10.00    57981.11
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      28d64271
  8. 02 8月, 2013 1 次提交
  9. 01 8月, 2013 1 次提交
    • E
      netem: Introduce skb_orphan_partial() helper · f2f872f9
      Eric Dumazet 提交于
      Commit 547669d4 ("tcp: xps: fix reordering issues") added
      unexpected reorders in case netem is used in a MQ setup for high
      performance test bed.
      
      ETH=eth0
      tc qd del dev $ETH root 2>/dev/null
      tc qd add dev $ETH root handle 1: mq
      for i in `seq 1 32`
      do
       tc qd add dev $ETH parent 1:$i netem delay 100ms
      done
      
      As all tcp packets are orphaned by netem, TCP stack believes it can
      set skb->ooo_okay on all packets.
      
      In order to allow producers to send more packets, we want to
      keep sk_wmem_alloc from reaching sk_sndbuf limit.
      
      We can do that by accounting one byte per skb in netem queues,
      so that TCP stack is not fooled too much.
      
      Tested:
      
      With above MQ/netem setup, scaling number of concurrent flows gives
      linear results and no reorders/retransmits
      
      lpq83:~# for n in 1 10 20 30 40 50 60 70 80 90 100
       do echo -n "n:$n " ; ./super_netperf $n -H 10.7.7.84; done
      n:1 198.46
      n:10 2002.69
      n:20 4000.98
      n:30 6006.35
      n:40 8020.93
      n:50 10032.3
      n:60 12081.9
      n:70 13971.3
      n:80 16009.7
      n:90 17117.3
      n:100 17425.5
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f2f872f9
  10. 23 7月, 2013 1 次提交
  11. 11 7月, 2013 2 次提交
  12. 27 6月, 2013 1 次提交
    • N
      net: fix kernel deadlock with interface rename and netdev name retrieval. · 5dbe7c17
      Nicolas Schichan 提交于
      When the kernel (compiled with CONFIG_PREEMPT=n) is performing the
      rename of a network interface, it can end up waiting for a workqueue
      to complete. If userland is able to invoke a SIOCGIFNAME ioctl or a
      SO_BINDTODEVICE getsockopt in between, the kernel will deadlock due to
      the fact that read_secklock_begin() will spin forever waiting for the
      writer process (the one doing the interface rename) to update the
      devnet_rename_seq sequence.
      
      This patch fixes the problem by adding a helper (netdev_get_name())
      and using it in the code handling the SIOCGIFNAME ioctl and
      SO_BINDTODEVICE setsockopt.
      
      The netdev_get_name() helper uses raw_seqcount_begin() to avoid
      spinning forever, waiting for devnet_rename_seq->sequence to become
      even. cond_resched() is used in the contended case, before retrying
      the access to give the writer process a chance to finish.
      
      The use of raw_seqcount_begin() will incur some unneeded work in the
      reader process in the contended case, but this is better than
      deadlocking the system.
      Signed-off-by: NNicolas Schichan <nschichan@freebox.fr>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5dbe7c17
  13. 26 6月, 2013 1 次提交
    • E
      net: poll/select low latency socket support · 2d48d67f
      Eliezer Tamir 提交于
      select/poll busy-poll support.
      
      Split sysctl value into two separate ones, one for read and one for poll.
      updated Documentation/sysctl/net.txt
      
      Add a new poll flag POLL_LL. When this flag is set, sock_poll will call
      sk_poll_ll if possible. sock_poll sets this flag in its return value
      to indicate to select/poll when a socket that can busy poll is found.
      
      When poll/select have nothing to report, call the low-level
      sock_poll again until we are out of time or we find something.
      
      Once the system call finds something, it stops setting POLL_LL, so it can
      return the result to the user ASAP.
      Signed-off-by: NEliezer Tamir <eliezer.tamir@linux.intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d48d67f
  14. 18 6月, 2013 1 次提交
  15. 11 6月, 2013 1 次提交
  16. 29 5月, 2013 1 次提交
  17. 12 5月, 2013 1 次提交
    • E
      ipv6: do not clear pinet6 field · f77d6021
      Eric Dumazet 提交于
      We have seen multiple NULL dereferences in __inet6_lookup_established()
      
      After analysis, I found that inet6_sk() could be NULL while the
      check for sk_family == AF_INET6 was true.
      
      Bug was added in linux-2.6.29 when RCU lookups were introduced in UDP
      and TCP stacks.
      
      Once an IPv6 socket, using SLAB_DESTROY_BY_RCU is inserted in a hash
      table, we no longer can clear pinet6 field.
      
      This patch extends logic used in commit fcbdf09d
      ("net: fix nulls list corruptions in sk_prot_alloc")
      
      TCP/UDP/UDPLite IPv6 protocols provide their own .clear_sk() method
      to make sure we do not clear pinet6 field.
      
      At socket clone phase, we do not really care, as cloning the parent (non
      NULL) pinet6 is not adding a fatal race.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f77d6021
  18. 10 4月, 2013 2 次提交
  19. 01 4月, 2013 1 次提交
    • K
      net: add option to enable error queue packets waking select · 7d4c04fc
      Keller, Jacob E 提交于
      Currently, when a socket receives something on the error queue it only wakes up
      the socket on select if it is in the "read" list, that is the socket has
      something to read. It is useful also to wake the socket if it is in the error
      list, which would enable software to wait on error queue packets without waking
      up for regular data on the socket. The main use case is for receiving
      timestamped transmit packets which return the timestamp to the socket via the
      error queue. This enables an application to select on the socket for the error
      queue only instead of for the regular traffic.
      
      -v2-
      * Added the SO_SELECT_ERR_QUEUE socket option to every architechture specific file
      * Modified every socket poll function that checks error queue
      Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
      Cc: Jeffrey Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Matthew Vick <matthew.vick@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7d4c04fc
  20. 21 3月, 2013 1 次提交
  21. 23 2月, 2013 1 次提交
  22. 19 2月, 2013 2 次提交
  23. 05 2月, 2013 1 次提交
  24. 24 1月, 2013 1 次提交
  25. 17 1月, 2013 1 次提交
    • V
      sk-filter: Add ability to lock a socket filter program · d59577b6
      Vincent Bernat 提交于
      While a privileged program can open a raw socket, attach some
      restrictive filter and drop its privileges (or send the socket to an
      unprivileged program through some Unix socket), the filter can still
      be removed or modified by the unprivileged program. This commit adds a
      socket option to lock the filter (SO_LOCK_FILTER) preventing any
      modification of a socket filter program.
      
      This is similar to OpenBSD BIOCLOCK ioctl on bpf sockets, except even
      root is not allowed change/drop the filter.
      
      The state of the lock can be read with getsockopt(). No error is
      triggered if the state is not changed. -EPERM is returned when a user
      tries to remove the lock or to change/remove the filter while the lock
      is active. The check is done directly in sk_attach_filter() and
      sk_detach_filter() and does not affect only setsockopt() syscall.
      Signed-off-by: NVincent Bernat <bernat@luffy.cx>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d59577b6
  26. 22 12月, 2012 1 次提交
  27. 27 11月, 2012 1 次提交
    • B
      sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name · c91f6df2
      Brian Haley 提交于
      Instead of having the getsockopt() of SO_BINDTODEVICE return an index, which
      will then require another call like if_indextoname() to get the actual interface
      name, have it return the name directly.
      
      This also matches the existing man page description on socket(7) which mentions
      the argument being an interface name.
      
      If the value has not been set, zero is returned and optlen will be set to zero
      to indicate there is no interface name present.
      
      Added a seqlock to protect this code path, and dev_ifname(), from someone
      changing the device name via dev_change_name().
      
      v2: Added seqlock protection while copying device name.
      
      v3: Fixed word wrap in patch.
      Signed-off-by: NBrian Haley <brian.haley@hp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c91f6df2
  28. 19 11月, 2012 1 次提交
    • E
      net: Allow userns root control of the core of the network stack. · 5e1fccc0
      Eric W. Biederman 提交于
      Allow an unpriviled user who has created a user namespace, and then
      created a network namespace to effectively use the new network
      namespace, by reducing capable(CAP_NET_ADMIN) and
      capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
      CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.
      
      Settings that merely control a single network device are allowed.
      Either the network device is a logical network device where
      restrictions make no difference or the network device is hardware NIC
      that has been explicity moved from the initial network namespace.
      
      In general policy and network stack state changes are allowed
      while resource control is left unchanged.
      
      Allow ethtool ioctls.
      
      Allow binding to network devices.
      Allow setting the socket mark.
      Allow setting the socket priority.
      
      Allow setting the network device alias via sysfs.
      Allow setting the mtu via sysfs.
      Allow changing the network device flags via sysfs.
      Allow setting the network device group via sysfs.
      
      Allow the following network device ioctls.
      SIOCGMIIPHY
      SIOCGMIIREG
      SIOCSIFNAME
      SIOCSIFFLAGS
      SIOCSIFMETRIC
      SIOCSIFMTU
      SIOCSIFHWADDR
      SIOCSIFSLAVE
      SIOCADDMULTI
      SIOCDELMULTI
      SIOCSIFHWBROADCAST
      SIOCSMIIREG
      SIOCBONDENSLAVE
      SIOCBONDRELEASE
      SIOCBONDSETHWADDR
      SIOCBONDCHANGEACTIVE
      SIOCBRADDIF
      SIOCBRDELIF
      SIOCSHWTSTAMP
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e1fccc0
  29. 01 11月, 2012 1 次提交
    • P
      sk-filter: Add ability to get socket filter program (v2) · a8fc9277
      Pavel Emelyanov 提交于
      The SO_ATTACH_FILTER option is set only. I propose to add the get
      ability by using SO_ATTACH_FILTER in getsockopt. To be less
      irritating to eyes the SO_GET_FILTER alias to it is declared. This
      ability is required by checkpoint-restore project to be able to
      save full state of a socket.
      
      There are two issues with getting filter back.
      
      First, kernel modifies the sock_filter->code on filter load, thus in
      order to return the filter element back to user we have to decode it
      into user-visible constants. Fortunately the modification in question
      is interconvertible.
      
      Second, the BPF_S_ALU_DIV_K code modifies the command argument k to
      speed up the run-time division by doing kernel_k = reciprocal(user_k).
      Bad news is that different user_k may result in same kernel_k, so we
      can't get the original user_k back. Good news is that we don't have
      to do it. What we need to is calculate a user2_k so, that
      
        reciprocal(user2_k) == reciprocal(user_k) == kernel_k
      
      i.e. if it's re-loaded back the compiled again value will be exactly
      the same as it was. That said, the user2_k can be calculated like this
      
        user2_k = reciprocal(kernel_k)
      
      with an exception, that if kernel_k == 0, then user2_k == 1.
      
      The optlen argument is treated like this -- when zero, kernel returns
      the amount of sock_fprog elements in filter, otherwise it should be
      large enough for the sock_fprog array.
      
      changes since v1:
      * Declared SO_GET_FILTER in all arch headers
      * Added decode of vlan-tag codes
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a8fc9277
  30. 26 10月, 2012 2 次提交
  31. 22 10月, 2012 1 次提交
    • P
      sockopt: Make SO_BINDTODEVICE readable · f7b86bfe
      Pavel Emelyanov 提交于
      The SO_BINDTODEVICE option is the only SOL_SOCKET one that can be set, but
      cannot be get via sockopt API. The only way we can find the device id a
      socket is bound to is via sock-diag interface. But the diag works only on
      hashed sockets, while the opt in question can be set for yet unhashed one.
      
      That said, in order to know what device a socket is bound to (we do want
      to know this in checkpoint-restore project) I propose to make this option
      getsockopt-able and report the respective device index.
      
      Another solution to the problem might be to teach the sock-diag reporting
      info on unhashed sockets. Should I go this way instead?
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f7b86bfe
  32. 28 9月, 2012 1 次提交
  33. 25 9月, 2012 2 次提交
    • E
      net: guard tcp_set_keepalive() to tcp sockets · 3e10986d
      Eric Dumazet 提交于
      Its possible to use RAW sockets to get a crash in
      tcp_set_keepalive() / sk_reset_timer()
      
      Fix is to make sure socket is a SOCK_STREAM one.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e10986d
    • E
      net: use a per task frag allocator · 5640f768
      Eric Dumazet 提交于
      We currently use a per socket order-0 page cache for tcp_sendmsg()
      operations.
      
      This page is used to build fragments for skbs.
      
      Its done to increase probability of coalescing small write() into
      single segments in skbs still in write queue (not yet sent)
      
      But it wastes a lot of memory for applications handling many mostly
      idle sockets, since each socket holds one page in sk->sk_sndmsg_page
      
      Its also quite inefficient to build TSO 64KB packets, because we need
      about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
      page allocator more than wanted.
      
      This patch adds a per task frag allocator and uses bigger pages,
      if available. An automatic fallback is done in case of memory pressure.
      
      (up to 32768 bytes per frag, thats order-3 pages on x86)
      
      This increases TCP stream performance by 20% on loopback device,
      but also benefits on other network devices, since 8x less frags are
      mapped on transmit and unmapped on tx completion. Alexander Duyck
      mentioned a probable performance win on systems with IOMMU enabled.
      
      Its possible some SG enabled hardware cant cope with bigger fragments,
      but their ndo_start_xmit() should already handle this, splitting a
      fragment in sub fragments, since some arches have PAGE_SIZE=65536
      
      Successfully tested on various ethernet devices.
      (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: NVijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5640f768