1. 08 Mar 2018: 1 commit
  2. 22 Feb 2018: 1 commit
    • tcp: switch to GSO being always on · 0a6b2a1d
      Committed by Eric Dumazet
      Oleksandr Natalenko reported performance issues with BBR without the FQ
      packet scheduler that were root-caused to the lack of SG and GSO/TSO in
      his configuration.
      
      In this mode, TCP internal pacing has to set up a high-resolution timer
      for each MSS sent.
      
      We could implement in TCP a strategy similar to the one adopted
      in commit fefa569a ("net_sched: sch_fq: account for schedule/timers drifts")
      or decide to finally switch TCP stack to a GSO only mode.
      
      This has many benefits:
      
      1) Most TCP developments are done with TSO in mind.
      2) Fewer high-resolution timers need to be armed for TCP pacing.
      3) GSO can benefit from the xmit_more hint.
      4) Receiver GRO is more effective (as if TSO was used for real on sender)
         -> Lower ACK traffic
      5) Write queues have less overhead (one skb holds about 64KB of payload)
      6) SACK coalescing just works.
      7) The rtx rb-tree contains fewer packets, so SACK processing is cheaper.
      
      This patch implements the minimal change; we can remove some legacy
      code as follow-ups.
      
      Tested:
      
      On a 40Gbit link, one netperf -t TCP_STREAM
      
      BBR+fq:
      sg on:  26 Gbits/sec
      sg off: 15.7 Gbits/sec   (was 2.3 Gbit before patch)
      
      BBR+pfifo_fast:
      sg on:  24.2 Gbits/sec
      sg off: 14.9 Gbits/sec  (was 0.66 Gbit before patch !!! )
      
      BBR+fq_codel:
      sg on:  24.4 Gbits/sec
      sg off: 15 Gbits/sec  (was 0.66 Gbit before patch !!! )
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: David S. Miller <davem@davemloft.net>
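      A hedged sketch of the mechanism: one way to express "GSO always on"
      is a forced-capabilities field that TCP sets at socket init and that
      sk_setup_caps() always honors (treat the field name as an assumption
      based on the commit text, not a confirmed interface):

      	/* At TCP socket creation, unconditionally request software GSO,
      	 * so later route capability updates can never turn GSO off. */
      	sk->sk_route_forced_caps = NETIF_F_GSO;	/* assumed field */

      	void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
      	{
      		sk_dst_set(sk, dst);
      		sk->sk_route_caps = dst->dev->features | sk->sk_route_forced_caps;
      		if (sk->sk_route_caps & NETIF_F_GSO)
      			sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
      		/* SG/checksum fixups elided */
      	}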
  3. 17 Feb 2018: 1 commit
  4. 13 Feb 2018: 3 commits
    • net: Convert proto_net_ops · 36b0068e
      Committed by Kirill Tkhai
      This patch starts converting the pernet_subsys entries registered
      from subsys initcalls.
      
      It seems safe to execute this one in parallel with others,
      as it only creates/destroys a proc entry
      which nobody else is interested in.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Convert net_inuse_ops · 604da74e
      Committed by Kirill Tkhai
      net_inuse_ops methods expose statistics in /proc.
      Nothing else on the pernet_subsys or pernet_device
      lists touches net::core::inuse.
      
      So, it's safe to make net_inuse_ops async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
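      Both conversions follow the same pattern; a hedged sketch of marking a
      pernet_operations instance as safe for parallel init/exit (the .async
      flag name is an assumption based on this series):

      	static struct pernet_operations proto_net_ops = {
      		.init	= proto_init_net,	/* creates /proc/net/protocols */
      		.exit	= proto_exit_net,	/* removes it */
      		.async	= true,			/* assumed: may run in parallel */
      	};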
    • net: make getname() functions return length rather than use int* parameter · 9b2c45d4
      Committed by Denys Vlasenko
      Changes since v1:
      Added changes in these files:
          drivers/infiniband/hw/usnic/usnic_transport.c
          drivers/staging/lustre/lnet/lnet/lib-socket.c
          drivers/target/iscsi/iscsi_target_login.c
          drivers/vhost/net.c
          fs/dlm/lowcomms.c
          fs/ocfs2/cluster/tcp.c
          security/tomoyo/network.c
      
      Before:
      All these functions either return a negative error indicator,
      or store length of sockaddr into "int *socklen" parameter
      and return zero on success.
      
      The "int *socklen" parameter is awkward. For example, even if the
      caller does not care about the length, it still has to provide
      on-stack storage for a value it will not use.
      
      None of the many FOO_getname() functions of various protocols
      ever used old value of *socklen. They always just overwrite it.
      
      This change drops this parameter, and makes all these functions, on success,
      return the length of the sockaddr. It's always >= 0 and can be
      distinguished from an error.
      
      Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.
      
      rpc_sockname() lost its "int buflen" parameter, since its only use
      was to be passed to kernel_getsockname() as &buflen; the value was
      otherwise unused.
      
      Userspace API is not changed.
      
          text    data     bss      dec     hex filename
      30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
      30108109 2633612  873672 33615393 200ee21 vmlinux.o
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      CC: linux-bluetooth@vger.kernel.org
      CC: linux-decnet-user@lists.sourceforge.net
      CC: linux-wireless@vger.kernel.org
      CC: linux-rdma@vger.kernel.org
      CC: linux-sctp@vger.kernel.org
      CC: linux-nfs@vger.kernel.org
      CC: linux-x25@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
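      A before/after sketch of the calling convention change for in-kernel
      callers (illustrative variable names):

      	struct sockaddr_storage addr;
      	int addrlen, err;

      	/* Before: length returned through an out parameter. */
      	err = kernel_getsockname(sock, (struct sockaddr *)&addr, &addrlen);
      	if (err)
      		goto fail;

      	/* After: the return value is the length (>= 0) or a negative error. */
      	addrlen = kernel_getsockname(sock, (struct sockaddr *)&addr);
      	if (addrlen < 0)
      		goto fail;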
  5. 12 Feb 2018: 1 commit
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Committed by Linus Torvalds
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
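      After the rename, in-kernel poll handlers build their masks from the
      EPOLL* constants; an illustrative shape (the device type and fields
      here are hypothetical):

      	static __poll_t my_dev_poll(struct file *file, poll_table *wait)
      	{
      		struct my_dev *dev = file->private_data;
      		__poll_t mask = 0;

      		poll_wait(file, &dev->waitq, wait);
      		if (dev->have_data)
      			mask |= EPOLLIN | EPOLLRDNORM;	/* was POLLIN | POLLRDNORM */
      		if (dev->gone)
      			mask |= EPOLLHUP;		/* was POLLHUP */
      		return mask;
      	}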
  6. 03 Feb 2018: 1 commit
    • Revert "defer call to mem_cgroup_sk_alloc()" · edbe69ef
      Committed by Roman Gushchin
      This patch effectively reverts commit 9f1c2674 ("net: memcontrol:
      defer call to mem_cgroup_sk_alloc()").
      
      Moving mem_cgroup_sk_alloc() to inet_csk_accept() completely breaks
      memcg socket memory accounting, as packets received before the memcg
      pointer is initialized are not accounted and cause a refcounting
      underflow on socket release.
      
      Actually the use-after-free problem was fixed by
      commit c0576e39 ("net: call cgroup_sk_alloc() earlier in
      sk_clone_lock()") for the cgroup pointer.
      
      So, let's revert it and call mem_cgroup_sk_alloc() just before
      cgroup_sk_alloc(). This is safe, as we hold a reference to the socket
      we're cloning, and it holds a reference to the memcg.
      
      Also, let's drop the BUG_ON(mem_cgroup_is_root()) check from
      mem_cgroup_sk_alloc(). I see no reason why bumping the root
      memcg counter should be grounds to panic, and there is no realistic
      way to hit it.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 17 Jan 2018: 1 commit
    • net: delete /proc THIS_MODULE references · 96890d62
      Committed by Alexey Dobriyan
      /proc has been ignoring the struct file_operations::owner field for 10 years.
      Specifically, it started with commit 786d7e16
      ("Fix rmmod/read/write races in /proc entries"). Notice the chunk where
      inode->i_fop is initialized with proxy struct file_operations for
      regular files:
      
      	-               if (de->proc_fops)
      	-                       inode->i_fop = de->proc_fops;
      	+               if (de->proc_fops) {
      	+                       if (S_ISREG(inode->i_mode))
      	+                               inode->i_fop = &proc_reg_file_ops;
      	+                       else
      	+                               inode->i_fop = de->proc_fops;
      	+               }
      
      The VFS stopped pinning the module at this point.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 16 Jan 2018: 2 commits
    • net: Restrict unwhitelisted proto caches to size 0 · 289a4860
      Committed by Kees Cook
      Now that protocols have been annotated (the copy of icsk_ca_ops->name
      is from an ops structure outside the slab cache):
      
      $ git grep 'copy_.*_user.*sk.*->'
      caif/caif_socket.c: copy_from_user(&cf_sk->conn_req.param.data, ov, ol)) {
      ipv4/raw.c:   if (copy_from_user(&raw_sk(sk)->filter, optval, optlen))
      ipv4/raw.c:       copy_to_user(optval, &raw_sk(sk)->filter, len))
      ipv4/tcp.c:       if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
      ipv4/tcp.c:       if (copy_to_user(optval, icsk->icsk_ulp_ops->name, len))
      ipv6/raw.c:       if (copy_from_user(&raw6_sk(sk)->filter, optval, optlen))
      ipv6/raw.c:           if (copy_to_user(optval, &raw6_sk(sk)->filter, len))
      sctp/socket.c: if (copy_from_user(&sctp_sk(sk)->subscribe, optval, optlen))
      sctp/socket.c: if (copy_to_user(optval, &sctp_sk(sk)->subscribe, len))
      sctp/socket.c: if (copy_to_user(optval, &sctp_sk(sk)->initmsg, len))
      
      we can switch the default proto usercopy region to size 0. Any protocols
      needing to add whitelisted regions must annotate the fields with the
      useroffset and usersize fields of struct proto.
      
      This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code are
      mine and don't reflect the original grsecurity/PaX code.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
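      What such an annotation looks like for a protocol with one
      user-copied field, sketched from the raw socket lines in the grep
      above (surrounding fields elided; layout is illustrative):

      	static struct proto raw_prot = {
      		.name		= "RAW",
      		.obj_size	= sizeof(struct raw_sock),
      		/* whitelist exactly the filter for user copies */
      		.useroffset	= offsetof(struct raw_sock, filter),
      		.usersize	= sizeof_field(struct raw_sock, filter),
      		/* ... */
      	};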
    • net: Define usercopy region in struct proto slab cache · 30c2c9f1
      Committed by David Windsor
      In support of usercopy hardening, this patch defines a region in the
      struct proto slab cache in which userspace copy operations are allowed.
      Some protocols need to copy objects to/from userspace, and they can
      declare the region via their proto structure with the new usersize and
      useroffset fields. Initially, if no region is specified (usersize ==
      0), the entire object is marked as whitelisted. This allows protocols
      to be whitelisted in subsequent patches. Once all protocols have been
      annotated, the full-whitelist default can be removed.
      
      This region is known as the slab cache's usercopy region. Slab caches
      can now check that each dynamically sized copy operation involving
      cache-managed memory falls entirely within the slab's usercopy region.
      
      This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code are
      mine and don't reflect the original grsecurity/PaX code.
      Signed-off-by: David Windsor <dave@nullcore.net>
      [kees: adjust commit log, split off per-proto patches]
      [kees: add logic for by-default full-whitelist]
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
  9. 19 Dec 2017: 2 commits
  10. 28 Nov 2017: 1 commit
  11. 16 Nov 2017: 1 commit
  12. 14 Nov 2017: 1 commit
    • tcp: allow drivers to tweak TSQ logic · 3a9b76fd
      Committed by Eric Dumazet
      I had many reports that TSQ logic breaks wifi aggregation.
      
      Current logic is to allow up to 1 ms of bytes to be queued into qdisc
      and drivers queues.
      
      But wifi aggregation needs a bigger budget to allow bigger rates to
      be discovered by the various TCP congestion control algorithms.
      
      This patch adds an extra socket field, allowing wifi drivers to select
      another log scale to derive TCP Small Queue credit from current pacing
      rate.
      
      Initial value is 10, meaning that this patch does not change current
      behavior.
      
      We expect wifi drivers to set this field to smaller values (tests have
      been done with values from 6 to 9).
      
      They would have to use the following template:
      
      if (skb->sk && skb->sk->sk_pacing_shift != MY_PACING_SHIFT)
           skb->sk->sk_pacing_shift = MY_PACING_SHIFT;
      
      Ref: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1670041
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Johannes Berg <johannes.berg@intel.com>
      Cc: Toke Høiland-Jørgensen <toke@toke.dk>
      Cc: Kir Kolyshkin <kir@openvz.org>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
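      How the shift feeds the TSQ budget, sketched (simplified from the
      small-queue check; the real code clamps the result further):

      	static bool tcp_small_queue_check(struct sock *sk,
      					  const struct sk_buff *skb)
      	{
      		unsigned long limit;

      		/* shift 10: pacing_rate / 1024 bytes, roughly 1 ms of
      		 * traffic. A wifi driver choosing 8 allows ~4 ms, enough
      		 * to build large aggregates. */
      		limit = max_t(unsigned long, 2 * skb->truesize,
      			      sk->sk_pacing_rate >> sk->sk_pacing_shift);
      		/* true means: enough is queued already, defer transmit */
      		return refcount_read(&sk->sk_wmem_alloc) > limit;
      	}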
  13. 11 Nov 2017: 1 commit
  14. 10 Nov 2017: 1 commit
  15. 18 Oct 2017: 2 commits
    • net/core: Convert sk_timer users to use timer_setup() · 99767f27
      Committed by Kees Cook
      In preparation for unconditionally passing the struct timer_list pointer to
      all timer callbacks, switch to using the new timer_setup() and from_timer()
      to pass the timer pointer explicitly for all users of sk_timer.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Andrew Hendry <andrew.hendry@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Julia Lawall <julia.lawall@lip6.fr>
      Cc: linzhang <xiaolou4617@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: netdev@vger.kernel.org
      Cc: linux-hams@vger.kernel.org
      Cc: linux-x25@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
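      The conversion pattern, sketched for a hypothetical sk_timer user:

      	/* Before: the callback received an opaque unsigned long set
      	 * via the timer's .data at init time. */
      	static void my_proto_timer(unsigned long arg)
      	{
      		struct sock *sk = (struct sock *)arg;
      		/* ... */
      	}
      	setup_timer(&sk->sk_timer, my_proto_timer, (unsigned long)sk);

      	/* After: the callback receives the timer itself and recovers
      	 * the sock with from_timer() (container_of underneath). */
      	static void my_proto_timer(struct timer_list *t)
      	{
      		struct sock *sk = from_timer(sk, t, sk_timer);
      		/* ... */
      	}
      	timer_setup(&sk->sk_timer, my_proto_timer, 0);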
    • net/core: Collapse redundant sk_timer callback data assignments · 9f12a77e
      Committed by Kees Cook
      The core sk_timer initializer can provide the common .data assignment
      instead of it being set separately by each user.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Andrew Hendry <andrew.hendry@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: linzhang <xiaolou4617@gmail.com>
      Cc: netdev@vger.kernel.org
      Cc: linux-hams@vger.kernel.org
      Cc: linux-x25@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 11 Oct 2017: 2 commits
  17. 10 Oct 2017: 2 commits
  18. 03 Oct 2017: 1 commit
  19. 29 Sep 2017: 1 commit
    • net: Set sk_prot_creator when cloning sockets to the right proto · 9d538fa6
      Committed by Christoph Paasch
      sk->sk_prot and sk->sk_prot_creator can differ when the app uses
      IPV6_ADDRFORM (transforming an IPv6-socket to an IPv4-one).
      That is why sk_prot_creator is there: to make sure that sk_prot_free()
      does the kmem_cache_free() on the right kmem_cache slab.
      
      Now, if such a socket gets transformed back to a listening socket (using
      connect() with AF_UNSPEC) we will allocate an IPv4 tcp_sock through
      sk_clone_lock() when a new connection comes in. But sk_prot_creator will
      still point to the IPv6 kmem_cache (as everything got copied in
      sk_clone_lock()). When freeing, we will thus put this
      memory back into the IPv6 kmem_cache although it was allocated in the
      IPv4 cache. I have seen memory corruption happening because of this.
      
      With slub-debugging and MEMCG_KMEM enabled this gives the warning
      	"cache_from_obj: Wrong slab cache. TCPv6 but object is from TCP"
      
      A C program to trigger this:
      
      #include <string.h>
      #include <arpa/inet.h>
      #include <netinet/in.h>
      #include <sys/socket.h>
      #include <unistd.h>
      
      int main(void)
      {
              int fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);
              int new_fd, newest_fd, client_fd;
              struct sockaddr_in6 bind_addr;
              struct sockaddr_in bind_addr4, client_addr1, client_addr2;
              struct sockaddr unsp;
              int val;
      
              memset(&bind_addr, 0, sizeof(bind_addr));
              bind_addr.sin6_family = AF_INET6;
              bind_addr.sin6_port = htons(42424);
      
              memset(&client_addr1, 0, sizeof(client_addr1));
              client_addr1.sin_family = AF_INET;
              client_addr1.sin_port = htons(42424);
              client_addr1.sin_addr.s_addr = inet_addr("127.0.0.1");
      
              memset(&client_addr2, 0, sizeof(client_addr2));
              client_addr2.sin_family = AF_INET;
              client_addr2.sin_port = htons(42421);
              client_addr2.sin_addr.s_addr = inet_addr("127.0.0.1");
      
              memset(&unsp, 0, sizeof(unsp));
              unsp.sa_family = AF_UNSPEC;
      
              bind(fd, (struct sockaddr *)&bind_addr, sizeof(bind_addr));
              listen(fd, 5);
      
              client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
              connect(client_fd, (struct sockaddr *)&client_addr1, sizeof(client_addr1));
              new_fd = accept(fd, NULL, NULL);
              close(fd);
      
              /* Convert the accepted IPv6 socket to an IPv4 one ... */
              val = AF_INET;
              setsockopt(new_fd, SOL_IPV6, IPV6_ADDRFORM, &val, sizeof(val));
      
              /* ... disconnect it, and turn it back into a listener. */
              connect(new_fd, &unsp, sizeof(unsp));
      
              memset(&bind_addr4, 0, sizeof(bind_addr4));
              bind_addr4.sin_family = AF_INET;
              bind_addr4.sin_port = htons(42421);
              bind(new_fd, (struct sockaddr *)&bind_addr4, sizeof(bind_addr4));
              listen(new_fd, 5);
      
              client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
              connect(client_fd, (struct sockaddr *)&client_addr2, sizeof(client_addr2));
      
              /* The socket accepted here is allocated from the IPv4 cache,
               * but inherits the IPv6 sk_prot_creator via sk_clone_lock(). */
              newest_fd = accept(new_fd, NULL, NULL);
              close(new_fd);
      
              close(client_fd);
              close(newest_fd);
              return 0;
      }
      
      As far as I can see, this bug has been there since the beginning of the
      git-days.
      Signed-off-by: Christoph Paasch <cpaasch@apple.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
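      A hedged sketch of the fix's shape: repoint sk_prot_creator at the
      cache the clone was actually allocated from (placement inside
      sk_clone_lock() is an assumption based on the description above):

      	newsk = sk_prot_alloc(prot, priority, sk->sk_family);
      	/* ... */
      	sock_copy(newsk, sk);
      	/* sock_copy() duplicated the parent's sk_prot_creator, which may
      	 * still point at the IPv6 cache; fix it up so sk_prot_free()
      	 * returns the object to the cache it came from. */
      	newsk->sk_prot_creator = prot;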
  20. 30 Aug 2017: 1 commit
  21. 24 Aug 2017: 1 commit
  22. 04 Aug 2017: 3 commits
    • sock: add SOCK_ZEROCOPY sockopt · 76851d12
      Committed by Willem de Bruijn
      The send call ignores unknown flags. Legacy applications may already
      unwittingly pass MSG_ZEROCOPY. Continue to ignore this flag unless a
      socket opts in to zerocopy.
      
      Introduce socket option SO_ZEROCOPY to enable MSG_ZEROCOPY processing.
      Processes can also query this socket option to detect kernel support
      for the feature. Older kernels will return ENOPROTOOPT.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
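      A minimal userspace sketch of the opt-in and the capability probe
      described above (error handling trimmed; the fallback define guards
      against old headers):

      	#include <errno.h>
      	#include <stdio.h>
      	#include <sys/socket.h>

      	#ifndef SO_ZEROCOPY
      	#define SO_ZEROCOPY 60	/* Linux value; missing from old headers */
      	#endif

      	static int enable_zerocopy(int fd)
      	{
      		int one = 1;

      		/* Without this, MSG_ZEROCOPY on send() stays ignored. */
      		if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one))) {
      			if (errno == ENOPROTOOPT)
      				fprintf(stderr, "no MSG_ZEROCOPY support\n");
      			return -1;
      		}
      		return 0;
      	}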
    • sock: add MSG_ZEROCOPY · 52267790
      Committed by Willem de Bruijn
      The kernel supports zerocopy sendmsg in virtio and tap. Expand the
      infrastructure to support other socket types. Introduce a completion
      notification channel over the socket error queue. Notifications are
      returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
      blocking the send/recv path on receiving notifications.
      
      Add reference counting, to support the skb split, merge, resize and
      clone operations possible with SOCK_STREAM and other socket types.
      
      The patch does not yet modify any datapaths.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
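      Reading one completion notification, sketched for IPv4/TCP (the cmsg
      parsing follows the usual IP_RECVERR error-queue convention; treat
      the details as illustrative rather than a complete implementation):

      	#include <linux/errqueue.h>
      	#include <netinet/in.h>
      	#include <sys/socket.h>

      	static int read_zerocopy_completion(int fd)
      	{
      		char control[128];
      		struct msghdr msg = {
      			.msg_control	= control,
      			.msg_controllen	= sizeof(control),
      		};
      		struct sock_extended_err *serr;
      		struct cmsghdr *cm;

      		if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
      			return -1;	/* nothing queued, or a real error */

      		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
      			if (cm->cmsg_level != SOL_IP || cm->cmsg_type != IP_RECVERR)
      				continue;
      			serr = (struct sock_extended_err *)CMSG_DATA(cm);
      			if (serr->ee_errno == 0 &&
      			    serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
      				/* sends ee_info..ee_data have completed */
      				return 0;
      		}
      		return -1;
      	}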
    • sock: allocate skbs from optmem · 98ba0bd5
      Committed by Willem de Bruijn
      Add sock_omalloc and sock_ofree to be able to allocate control skbs,
      for instance for looping errors onto sk_error_queue.
      
      The transmit budget (sk_wmem_alloc) is involved in transmit skb
      shaping, most notably in TCP Small Queues. Using this budget for
      control packets would impact transmission.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
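      A sketch of the helper's shape, based on the description above (the
      exact accounting and limit checks are assumptions):

      	struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
      				     gfp_t priority)
      	{
      		struct sk_buff *skb = alloc_skb(size, priority);

      		if (!skb)
      			return NULL;
      		/* Charge the option-memory budget, not sk_wmem_alloc, so
      		 * control skbs never eat into the transmit budget that
      		 * TSQ shapes against. */
      		atomic_add(skb->truesize, &sk->sk_omem_alloc);
      		skb->sk = sk;
      		skb->destructor = sock_ofree;	/* uncharges on free */
      		return skb;
      	}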
  23. 02 Aug 2017: 2 commits
  24. 01 Jul 2017: 2 commits
  25. 21 Jun 2017: 1 commit
    • net: introduce SO_PEERGROUPS getsockopt · 28b5ba2a
      Committed by David Herrmann
      This adds the new getsockopt(2) option SO_PEERGROUPS on SOL_SOCKET to
      retrieve the auxiliary groups of the remote peer. It is designed to
      naturally extend SO_PEERCRED. That is, the underlying data is from the
      same credentials. Regarding its syntax, it is based on SO_PEERSEC. That
      is, if the provided buffer is too small, ERANGE is returned and @optlen
      is updated. Otherwise, the information is copied, @optlen is set to the
      actual size, and 0 is returned.
      
      While SO_PEERCRED (and thus `struct ucred') already returns the primary
      group, it lacks the auxiliary group vector. However, nearly all access
      controls (including kernel side VFS and SYSVIPC, but also user-space
      polkit, DBus, ...) consider the entire set of groups, rather than just
      the primary group. But this is currently not possible with pure
      SO_PEERCRED. Instead, user-space has to work around this and query the
      system database for the auxiliary groups of a UID retrieved via
      SO_PEERCRED.
      
      Unfortunately, there is no race-free way to query the auxiliary groups
      of the PID/UID retrieved via SO_PEERCRED. Hence, the current user-space
      solution is to use getgrouplist(3p), which itself falls back to NSS and
      whatever is configured in nsswitch.conf(3). This effectively checks
      which groups we *would* assign to the user if they logged in *now*. On
      normal systems it is as easy as reading /etc/group, but with NSS it can
      resort to querying network databases (e.g., LDAP), using IPC or network
      communication.
      
      Long story short: Whenever we want to use auxiliary groups for access
      checks on IPC, we need further IPC to talk to the user/group databases,
      rather than just relying on SO_PEERCRED and the incoming socket. This
      is unfortunate, and might even result in deadlocks if the database
      query uses the same IPC as the original request.
      
      So far, those recursions/deadlocks have been avoided by using
      primitive IPC for all crucial NSS modules. However, we want to avoid
      re-inventing the wheel for each NSS module that might be involved in
      user/group queries. Hence, we would preferably make DBus (and other IPC
      that supports access-management based on groups) work without resorting
      to the user/group database. This new SO_PEERGROUPS sockopt would allow us
      to make dbus-daemon work without ever calling into NSS.
      
      Cc: Michal Sekletar <msekleta@redhat.com>
      Cc: Simon McVittie <simon.mcvittie@collabora.co.uk>
      Reviewed-by: Tom Gundersen <teg@jklm.no>
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
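      A userspace sketch of the ERANGE-and-retry pattern the syntax
      description implies (illustrative helper; the fallback define guards
      against old headers):

      	#include <errno.h>
      	#include <stdlib.h>
      	#include <sys/socket.h>
      	#include <sys/types.h>

      	#ifndef SO_PEERGROUPS
      	#define SO_PEERGROUPS 59	/* Linux value; missing from old headers */
      	#endif

      	/* Returns the number of gids stored in *out, or -1 on error. */
      	static ssize_t peer_groups(int fd, gid_t **out)
      	{
      		socklen_t len = 8 * sizeof(gid_t);
      		gid_t *g = NULL, *tmp;

      		for (;;) {
      			tmp = realloc(g, len);
      			if (!tmp) {
      				free(g);
      				return -1;
      			}
      			g = tmp;
      			if (!getsockopt(fd, SOL_SOCKET, SO_PEERGROUPS, g, &len)) {
      				*out = g;
      				return len / sizeof(gid_t);
      			}
      			if (errno != ERANGE) {
      				free(g);
      				return -1;
      			}
      			/* len now holds the required size; retry */
      		}
      	}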
  26. 08 Jun 2017: 1 commit
    • tcp: add TCPMemoryPressuresChrono counter · 06044751
      Committed by Eric Dumazet
      DRAM supply shortages and poor memory pressure tracking in the TCP
      stack make any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
      limits) and tcp_mem[] quite hazardous.
      
      The TCPMemoryPressures SNMP counter is an indication of the tcp_mem
      sysctl limits being hit, but it only tracks the number of transitions.
      
      If TCP stack behavior under stress were perfect:
      1) It would maintain memory usage close to the limit.
      2) Memory pressure state would be entered for short times.
      
      We certainly prefer 100 events lasting 10ms compared to one event
      lasting 200 seconds.
      
      This patch adds a new SNMP counter tracking cumulative duration of
      memory pressure events, given in ms units.
      
      $ cat /proc/sys/net/ipv4/tcp_mem
      3088    4117    6176
      $ grep TCP /proc/net/sockstat
      TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
      $ nstat -n ; sleep 10 ; nstat |grep Pressure
      TcpExtTCPMemoryPressures        1700
      TcpExtTCPMemoryPressuresChrono  5209
      
      v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
      instructed.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
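      A sketch of how a cumulative-duration counter can be kept: record a
      jiffies timestamp on entering pressure, add the elapsed ms on leaving
      (names follow the SNMP entries above; treat the exact primitives as
      assumptions):

      	void tcp_enter_memory_pressure(struct sock *sk)
      	{
      		unsigned long val = jiffies;

      		if (!val)
      			val--;	/* 0 means "not under pressure" */
      		if (!cmpxchg(&tcp_memory_pressure, 0, val))
      			NET_INC_STATS(sock_net(sk),
      				      LINUX_MIB_TCPMEMORYPRESSURES);
      	}

      	void tcp_leave_memory_pressure(struct sock *sk)
      	{
      		unsigned long val = xchg(&tcp_memory_pressure, 0);

      		if (val)
      			NET_ADD_STATS(sock_net(sk),
      				      LINUX_MIB_TCPMEMORYPRESSURESCHRONO,
      				      jiffies_to_msecs(jiffies - val));
      	}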
  27. 18 May 2017: 1 commit
  28. 17 May 2017: 1 commit
    • tcp: internal implementation for pacing · 218af599
      Committed by Eric Dumazet
      BBR congestion control depends on pacing, and pacing is
      currently handled by the sch_fq packet scheduler, for performance
      reasons and also because implementing pacing with FQ was convenient
      to truly avoid bursts.
      
      However there are many cases where this packet scheduler constraint
      is not practical.
      - Many linux hosts are not focusing on handling thousands of TCP
        flows in the most efficient way.
      - Some routers use fq_codel or other AQM, but still would like
        to use BBR for the few TCP flows they initiate/terminate.
      
      This patch implements an automatic fallback to internal pacing.
      
      Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.
      
      If sch_fq happens to be in the egress path, pacing is delegated to
      the qdisc, otherwise pacing is done by TCP itself.
      
      One advantage of pacing from the TCP stack is more precise rtt
      estimations, and less work done at TX completion, since TCP Small
      Queues limits are not generally hit. Setups with a single TX queue but
      many cpus might even benefit from this.
      
      Note that unlike sch_fq, we do not take into account header sizes.
      Taking care of these headers would add additional complexity for
      no practical differences in behavior.
      
      Some performance numbers using 800 TCP_STREAM flows rate limited to
      ~48 Mbit per second on 40Gbit NIC.
      
      If MQ+pfifo_fast is used on the NIC :
      
      $ sar -n DEV 1 5 | grep eth
      14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
      14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
      14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
      14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
      14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
      Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
       4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
       0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
       1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
       1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0
      
      Now use MQ+FQ :
      
      lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
      lpaa23:~# tc qdisc replace dev eth0 root mq
      
      $ sar -n DEV 1 5 | grep eth
      14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
      14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
      14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
      14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
      14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
      Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
       1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
       0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
       4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
       3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0
      
      As expected, number of interrupts per second is very different.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Jerry Chu <hkchu@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
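      The core of internal pacing, sketched: after transmitting an skb, arm
      a high-resolution timer for the time that skb occupies at the current
      pacing rate (simplified; clamps and the hrtimer handler are elided):

      	static void tcp_internal_pacing(struct sock *sk,
      					const struct sk_buff *skb)
      	{
      		u64 len_ns;
      		u32 rate;

      		if (!tcp_needs_internal_pacing(sk))
      			return;
      		rate = sk->sk_pacing_rate;
      		if (!rate || rate == ~0U)
      			return;

      		/* time = bytes / (bytes per second), in nanoseconds */
      		len_ns = (u64)skb->len * NSEC_PER_SEC;
      		do_div(len_ns, rate);
      		hrtimer_start(&tcp_sk(sk)->pacing_timer,
      			      ktime_add_ns(ktime_get(), len_ns),
      			      HRTIMER_MODE_ABS_PINNED);
      	}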
  29. 16 May 2017: 1 commit
    • net: fix some identation issues at kernel-doc markups · d651983d
      Committed by Mauro Carvalho Chehab
      Sphinx is very pedantic with regard to indentation and
      escape sequences:
      
        ./include/net/sock.h:1967: ERROR: Unexpected indentation.
        ./include/net/sock.h:1969: ERROR: Unexpected indentation.
        ./include/net/sock.h:1970: WARNING: Block quote ends without a blank line; unexpected unindent.
        ./include/net/sock.h:1971: WARNING: Block quote ends without a blank line; unexpected unindent.
        ./include/net/sock.h:2268: WARNING: Inline emphasis start-string without end-string.
        ./net/core/sock.c:2686: ERROR: Unexpected indentation.
        ./net/core/sock.c:2687: WARNING: Block quote ends without a blank line; unexpected unindent.
        ./net/core/datagram.c:182: WARNING: Inline emphasis start-string without end-string.
        ./include/linux/netdevice.h:1444: ERROR: Unexpected indentation.
        ./drivers/net/phy/phy.c:381: ERROR: Unexpected indentation.
        ./drivers/net/phy/phy.c:382: WARNING: Block quote ends without a blank line; unexpected unindent.
      
      - Fix spacing where needed;
      - Properly escape constants;
      - Use a literal block for a race description.
      
      No functional changes.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>