1. 27 7月, 2016 1 次提交
    • V
      af_unix: charge buffers to kmemcg · 3aa9799e
      Vladimir Davydov 提交于
      Unix sockets can consume a significant amount of system memory, hence
      they should be accounted to kmemcg.
      
      Since unix socket buffers are always allocated from process context, all
      we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
      sock->sk_allocation mask.
      
      Eric asked:
      
      > 1) What happens when a buffer, allocated from socket <A> lands in a
      > different socket <B>, maybe owned by another user/process.
      >
      > Who owns it now, in term of kmemcg accounting ?
      
      We never move memcg charges.  E.g.  if two processes from different
      cgroups are sharing a memory region, each page will be charged to the
      process which touched it first.  Or if two processes are working with
      the same directory tree, inodes and dentries will be charged to the
      first user.  The same is fair for unix socket buffers - they will be
      charged to the sender.
      
      > 2) Has performance impact been evaluated ?
      
      I ran netperf STREAM_STREAM with default options in a kmemcg on a 4 core
      x2 HT box.  The results are below:
      
       # clients            bandwidth (10^6bits/sec)
                          base              patched
               1      67643 +-  725      64874 +-  353    - 4.0 %
               4     193585 +- 2516     186715 +- 1460    - 3.5 %
               8     194820 +-  377     187443 +- 1229    - 3.7 %
      
      So the accounting doesn't come for free - it takes ~4% of performance.
      I believe we could optimize it by using per cpu batching not only on
      charge, but also on uncharge in memcg core, but that's beyond the scope
      of this patch set - I'll take a look at this later.
      
      Anyway, if performance impact is found to be unacceptable, it is always
      possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
      or not use memory cgroups at runtime at all (thanks to jump labels
      there'll be no overhead even if they are compiled in).
      
      Link: http://lkml.kernel.org/r/fcfe6cae27a59fbc5e40145664b3cf085a560c68.1464079538.git.vdavydov@virtuozzo.comSigned-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3aa9799e
  2. 21 5月, 2016 1 次提交
    • M
      af_unix: fix hard linked sockets on overlay · eb0a4a47
      Miklos Szeredi 提交于
      Overlayfs uses separate inodes even in the case of hard links on the
      underlying filesystems.  This is a problem for AF_UNIX socket
      implementation which indexes sockets based on the inode.  This resulted in
      hard linked sockets not working.
      
      The fix is to use the real, underlying inode.
      
      Test case follows:
      
      -- ovl-sock-test.c --
      #include <unistd.h>
      #include <err.h>
      #include <sys/socket.h>
      #include <sys/un.h>
      
      #define SOCK "test-sock"
      #define SOCK2 "test-sock2"
      
      int main(void)
      {
      	int fd, fd2;
      	struct sockaddr_un addr = {
      		.sun_family = AF_UNIX,
      		.sun_path = SOCK,
      	};
      	struct sockaddr_un addr2 = {
      		.sun_family = AF_UNIX,
      		.sun_path = SOCK2,
      	};
      
      	unlink(SOCK);
      	unlink(SOCK2);
      	if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
      		err(1, "socket");
      	if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1)
      		err(1, "bind");
      	if (listen(fd, 0) == -1)
      		err(1, "listen");
      	if (link(SOCK, SOCK2) == -1)
      		err(1, "link");
      	if ((fd2 = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
      		err(1, "socket");
      	if (connect(fd2, (struct sockaddr *) &addr2, sizeof(addr2)) == -1)
      		err (1, "connect");
      	return 0;
      }
      ----
      
      Reported-by: Alexander Morozov <alexandr.morozov@docker.com> 
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Cc: <stable@vger.kernel.org>
      eb0a4a47
  3. 28 3月, 2016 1 次提交
  4. 20 2月, 2016 2 次提交
  5. 17 2月, 2016 2 次提交
  6. 08 2月, 2016 2 次提交
  7. 25 1月, 2016 1 次提交
  8. 11 1月, 2016 1 次提交
  9. 05 1月, 2016 1 次提交
    • R
      af_unix: Fix splice-bind deadlock · c845acb3
      Rainer Weikusat 提交于
      On 2015/11/06, Dmitry Vyukov reported a deadlock involving the splice
      system call and AF_UNIX sockets,
      
      http://lists.openwall.net/netdev/2015/11/06/24
      
      The situation was analyzed as
      
      (a while ago) A: socketpair()
      B: splice() from a pipe to /mnt/regular_file
      	does sb_start_write() on /mnt
      C: try to freeze /mnt
      	wait for B to finish with /mnt
      A: bind() try to bind our socket to /mnt/new_socket_name
      	lock our socket, see it not bound yet
      	decide that it needs to create something in /mnt
      	try to do sb_start_write() on /mnt, block (it's
      	waiting for C).
      D: splice() from the same pipe to our socket
      	lock the pipe, see that socket is connected
      	try to lock the socket, block waiting for A
      B:	get around to actually feeding a chunk from
      	pipe to file, try to lock the pipe.  Deadlock.
      
      on 2015/11/10 by Al Viro,
      
      http://lists.openwall.net/netdev/2015/11/10/4
      
      The patch fixes this by removing the kern_path_create related code from
      unix_mknod and executing it as part of unix_bind prior acquiring the
      readlock of the socket in question. This means that A (as used above)
      will sb_start_write on /mnt before it acquires the readlock, hence, it
      won't indirectly block B which first did a sb_start_write and then
      waited for a thread trying to acquire the readlock. Consequently, A
      being blocked by C waiting for B won't cause a deadlock anymore
      (effectively, both A and B acquire two locks in opposite order in the
      situation described above).
      
      Dmitry Vyukov(<dvyukov@google.com>) tested the original patch.
      Signed-off-by: NRainer Weikusat <rweikusat@mobileactivedefense.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c845acb3
  10. 18 12月, 2015 1 次提交
  11. 07 12月, 2015 1 次提交
    • R
      af_unix: fix unix_dgram_recvmsg entry locking · 64874280
      Rainer Weikusat 提交于
      The current unix_dgram_recvsmg code acquires the u->readlock mutex in
      order to protect access to the peek offset prior to calling
      __skb_recv_datagram for actually receiving data. This implies that a
      blocking reader will go to sleep with this mutex held if there's
      presently no data to return to userspace. Two non-desirable side effects
      of this are that a later non-blocking read call on the same socket will
      block on the ->readlock mutex until the earlier blocking call releases it
      (or the readers is interrupted) and that later blocking read calls
      will wait longer than the effective socket read timeout says they
      should: The timeout will only start 'ticking' once such a reader hits
      the schedule_timeout in wait_for_more_packets (core.c) while the time it
      already had to wait until it could acquire the mutex is unaccounted for.
      
      The patch avoids both by using the __skb_try_recv_datagram and
      __skb_wait_for_more packets functions created by the first patch to
      implement a unix_dgram_recvmsg read loop which releases the readlock
      mutex prior to going to sleep and reacquires it as needed
      afterwards. Non-blocking readers will thus immediately return with
      -EAGAIN if there's no data available regardless of any concurrent
      blocking readers and all blocking readers will end up sleeping via
      schedule_timeout, thus honouring the configured socket receive timeout.
      Signed-off-by: NRainer Weikusat <rweikusat@mobileactivedefense.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      64874280
  12. 02 12月, 2015 2 次提交
  13. 01 12月, 2015 2 次提交
  14. 24 11月, 2015 1 次提交
    • R
      unix: avoid use-after-free in ep_remove_wait_queue · 7d267278
      Rainer Weikusat 提交于
      Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
      An AF_UNIX datagram socket being the client in an n:1 association with
      some server socket is only allowed to send messages to the server if the
      receive queue of this socket contains at most sk_max_ack_backlog
      datagrams. This implies that prospective writers might be forced to go
      to sleep despite none of the message presently enqueued on the server
      receive queue were sent by them. In order to ensure that these will be
      woken up once space becomes again available, the present unix_dgram_poll
      routine does a second sock_poll_wait call with the peer_wait wait queue
      of the server socket as queue argument (unix_dgram_recvmsg does a wake
      up on this queue after a datagram was received). This is inherently
      problematic because the server socket is only guaranteed to remain alive
      for as long as the client still holds a reference to it. In case the
      connection is dissolved via connect or by the dead peer detection logic
      in unix_dgram_sendmsg, the server socket may be freed despite "the
      polling mechanism" (in particular, epoll) still has a pointer to the
      corresponding peer_wait queue. There's no way to forcibly deregister a
      wait queue with epoll.
      
      Based on an idea by Jason Baron, the patch below changes the code such
      that a wait_queue_t belonging to the client socket is enqueued on the
      peer_wait queue of the server whenever the peer receive queue full
      condition is detected by either a sendmsg or a poll. A wake up on the
      peer queue is then relayed to the ordinary wait queue of the client
      socket via wake function. The connection to the peer wait queue is again
      dissolved if either a wake up is about to be relayed or the client
      socket reconnects or a dead peer is detected or the client socket is
      itself closed. This enables removing the second sock_poll_wait from
      unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
      that no blocked writer sleeps forever.
      Signed-off-by: NRainer Weikusat <rweikusat@mobileactivedefense.com>
      Fixes: ec0d215f ("af_unix: fix 'poll for write'/connected DGRAM sockets")
      Reviewed-by: NJason Baron <jbaron@akamai.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7d267278
  15. 18 11月, 2015 1 次提交
  16. 17 11月, 2015 1 次提交
  17. 16 11月, 2015 1 次提交
    • H
      af-unix: fix use-after-free with concurrent readers while splicing · 73ed5d25
      Hannes Frederic Sowa 提交于
      During splicing an af-unix socket to a pipe we have to drop all
      af-unix socket locks. While doing so we allow another reader to enter
      unix_stream_read_generic which can read, copy and finally free another
      skb. If exactly this skb is just in process of being spliced we get a
      use-after-free report by kasan.
      
      First, we must make sure to not have a free while the skb is used during
      the splice operation. We simply increment its use counter before unlocking
      the reader lock.
      
      Stream sockets have the nice characteristic that we don't care about
      zero length writes and they never reach the peer socket's queue. That
      said, we can take the UNIXCB.consumed field as the indicator if the
      skb was already freed from the socket's receive queue. If the skb was
      fully consumed after we locked the reader side again we know it has been
      dropped by a second reader. We indicate a short read to user space and
      abort the current splice operation.
      
      This bug has been found with syzkaller
      (http://github.com/google/syzkaller) by Dmitry Vyukov.
      
      Fixes: 2b514574 ("net: af_unix: implement splice for stream af_unix sockets")
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73ed5d25
  18. 25 10月, 2015 1 次提交
  19. 05 10月, 2015 1 次提交
  20. 30 9月, 2015 1 次提交
  21. 11 6月, 2015 1 次提交
  22. 27 5月, 2015 1 次提交
  23. 25 5月, 2015 2 次提交
  24. 11 5月, 2015 1 次提交
  25. 24 4月, 2015 1 次提交
  26. 16 4月, 2015 2 次提交
  27. 03 3月, 2015 1 次提交
  28. 29 1月, 2015 1 次提交
    • C
      net: remove sock_iocb · 7cc05662
      Christoph Hellwig 提交于
      The sock_iocb structure is allocate on stack for each read/write-like
      operation on sockets, and contains various fields of which only the
      embedded msghdr and sometimes a pointer to the scm_cookie is ever used.
      Get rid of the sock_iocb and put a msghdr directly on the stack and pass
      the scm_cookie explicitly to netlink_mmap_sendmsg.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7cc05662
  29. 18 1月, 2015 1 次提交
    • J
      netlink: make nlmsg_end() and genlmsg_end() void · 053c095a
      Johannes Berg 提交于
      Contrary to common expectations for an "int" return, these functions
      return only a positive value -- if used correctly they cannot even
      return 0 because the message header will necessarily be in the skb.
      
      This makes the very common pattern of
      
        if (genlmsg_end(...) < 0) { ... }
      
      be a whole bunch of dead code. Many places also simply do
      
        return nlmsg_end(...);
      
      and the caller is expected to deal with it.
      
      This also commonly (at least for me) causes errors, because it is very
      common to write
      
        if (my_function(...))
          /* error condition */
      
      and if my_function() does "return nlmsg_end()" this is of course wrong.
      
      Additionally, there's not a single place in the kernel that actually
      needs the message length returned, and if anyone needs it later then
      it'll be very easy to just use skb->len there.
      
      Remove this, and make the functions void. This removes a bunch of dead
      code as described above. The patch adds lines because I did
      
      -	return nlmsg_end(...);
      +	nlmsg_end(...);
      +	return 0;
      
      I could have preserved all the function's return values by returning
      skb->len, but instead I've audited all the places calling the affected
      functions and found that none cared. A few places actually compared
      the return value with <= 0 in dump functionality, but that could just
      be changed to < 0 with no change in behaviour, so I opted for the more
      efficient version.
      
      One instance of the error I've made numerous times now is also present
      in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
      check for <0 or <=0 and thus broke out of the loop every single time.
      I've preserved this since it will (I think) have caused the messages to
      userspace to be formatted differently with just a single message for
      every SKB returned to userspace. It's possible that this isn't needed
      for the tools that actually use this, but I don't even know what they
      are so couldn't test that changing this behaviour would be acceptable.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      053c095a
  30. 10 12月, 2014 1 次提交
    • A
      put iov_iter into msghdr · c0371da6
      Al Viro 提交于
      Note that the code _using_ ->msg_iter at that point will be very
      unhappy with anything other than unshifted iovec-backed iov_iter.
      We still need to convert users to proper primitives.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c0371da6
  31. 24 11月, 2014 1 次提交
  32. 06 11月, 2014 1 次提交
    • D
      net: Add and use skb_copy_datagram_msg() helper. · 51f3d02b
      David S. Miller 提交于
      This encapsulates all of the skb_copy_datagram_iovec() callers
      with call argument signature "skb, offset, msghdr->msg_iov, length".
      
      When we move to iov_iters in the networking, the iov_iter object will
      sit in the msghdr.
      
      Having a helper like this means there will be less places to touch
      during that transformation.
      
      Based upon descriptions and patch from Al Viro.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51f3d02b
  33. 08 10月, 2014 1 次提交