1. 30 Jul, 2012 3 commits
  2. 26 Apr, 2012 1 commit
  3. 21 Apr, 2012 2 commits
  4. 16 Apr, 2012 1 commit
  5. 04 Apr, 2012 1 commit
    • af_unix: reduce high order page allocations · eb6a2481
      Committed by Eric Dumazet
      unix_dgram_sendmsg() currently builds linear skbs, and this can stress
      page allocator with high order page allocations. When memory gets
      fragmented, this can eventually fail.
      
      We can try to use order-2 allocations for skb head (SKB_MAX_ALLOC) plus
      up to 16 page fragments to lower pressure on buddy allocator.
      
      This patch has no effect on messages of 16064 bytes or less
      (on 64bit arches with PAGE_SIZE=4096).
      
      For bigger messages (from 16065 to 81600 bytes), this patch brings
      reliability at the expense of a performance penalty caused by the extra
      page allocations.
      
      netperf -t DG_STREAM -T 0,2 -- -m 16064 -s 200000
      ->4086040 Messages / 10s
      
      netperf -t DG_STREAM -T 0,2 -- -m 16068 -s 200000
      ->3901747 Messages / 10s
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
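The 16064 and 81600 byte thresholds quoted in this commit message can be reproduced with a little arithmetic. The sketch below is a userland illustration, not kernel code: the 320-byte figure for the aligned skb_shared_info footprint is an assumption inferred from the commit's own numbers (16384 - 16064), and all function and constant names are hypothetical.

```c
#include <stddef.h>

enum {
    PAGE_SIZE_ASSUMED   = 4096, /* PAGE_SIZE stated in the commit message */
    HEAD_ORDER          = 2,    /* order-2 allocation for the skb head */
    MAX_FRAGS           = 16,   /* "up to 16 page fragments" */
    SHINFO_ALIGNED_SIZE = 320   /* assumption: 16384 - 16064 on 64bit */
};

/* Largest payload that still fits in a linear order-2 skb head. */
size_t max_linear_head(void)
{
    return ((size_t)PAGE_SIZE_ASSUMED << HEAD_ORDER) - SHINFO_ALIGNED_SIZE;
}

/* Largest payload once up to 16 page fragments are appended. */
size_t max_dgram_len(void)
{
    return max_linear_head() + (size_t)MAX_FRAGS * PAGE_SIZE_ASSUMED;
}

/* Number of extra page fragments a message of 'len' bytes would need. */
size_t frag_pages_needed(size_t len)
{
    size_t head = max_linear_head();

    if (len <= head)
        return 0;
    return (len - head + PAGE_SIZE_ASSUMED - 1) / PAGE_SIZE_ASSUMED;
}
```

Under these assumptions max_linear_head() is 16064 and max_dgram_len() is 81600, and a 16068-byte message (the second netperf run) needs one extra page, which is consistent with the slowdown reported above.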
  6. 24 Mar, 2012 1 commit
    • poll: add poll_requested_events() and poll_does_not_wait() functions · 626cf236
      Committed by Hans Verkuil
      In some cases the poll() implementation in a driver has to do different
      things depending on the events the caller wants to poll for.  An example
      is when a driver needs to start a DMA engine if the caller polls for
      POLLIN, but doesn't want to do that if POLLIN is not requested but instead
      only POLLOUT or POLLPRI is requested.  This is something that can happen
      in the video4linux subsystem among others.
      
      Unfortunately, the current epoll/poll/select implementation doesn't
      provide that information reliably.  The poll_table_struct does have it: it
      has a key field with the event mask.  But once a poll() call matches one
      or more bits of that mask any following poll() calls are passed a NULL
      poll_table pointer.
      
      Also, the eventpoll implementation always left the key field at ~0 instead
      of using the requested events mask.
      
      This was changed in eventpoll.c so the key field now contains the actual
      events that should be polled for as set by the caller.
      
      The solution to the NULL poll_table pointer is to set the qproc field to
      NULL in poll_table once poll() matches the events, not the poll_table
      pointer itself.  That way drivers can obtain the mask through a new
      poll_requested_events inline.
      
      The poll_table_struct can still be NULL since some kernel code calls it
      internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h).  In
      that case poll_requested_events() returns ~0 (i.e.  all events).
      
      Very rarely drivers might want to know whether poll_wait will actually
      wait.  If another earlier file descriptor in the set already matched the
      events the caller wanted to wait for, then the kernel will return from the
      select() call without waiting.  This might be useful information in order
      to avoid doing expensive work.
      
      A new helper function poll_does_not_wait() is added that drivers can use
      to detect this situation.  This is now used in sock_poll_wait() in
      include/net/sock.h.  This was the only place in the kernel that needed
      this information.
      
      Drivers should no longer access any of the poll_table internals, but use
      the poll_requested_events() and poll_does_not_wait() access functions
      instead.  In order to enforce that the poll_table fields are now prepended
      with an underscore and a comment was added warning against using them
      directly.
      
      This required a change in unix_dgram_poll() in unix/af_unix.c which used
      the key field to get the requested events.  It's been replaced by a call
      to poll_requested_events().
      
      Renaming qproc was especially important because this patch changes the
      behavior of that field: the function pointer can now be NULL, which was
      not possible in the past.
      
      Any driver accessing the qproc or key fields directly will now fail to compile.
      
      Some notes regarding the correctness of this patch: the driver's poll()
      function is called with a 'struct poll_table_struct *wait' argument.  This
      pointer may or may not be NULL.  Drivers can never rely on it being one or
      the other, as that depends on whether or not an earlier file descriptor in
      the select()'s fdset matched the requested events.
      
      There are only three things a driver can do with the wait argument:
      
      1) obtain the key field:
      
      	events = wait ? wait->key : ~0;
      
         This will still work although it should be replaced with the new
         poll_requested_events() function (which does exactly the same).
         This will now even work better, since wait is no longer set to NULL
         unnecessarily.
      
      2) use the qproc callback. This could be deadly since qproc can now be
         NULL. Renaming qproc should prevent this from happening. There are no
         kernel drivers that actually access this callback directly, BTW.
      
      3) test whether wait == NULL to determine whether poll would return without
         waiting. This is no longer sufficient as the correct test is now
         wait == NULL || wait->_qproc == NULL.
      
         However, the worst that can happen here is a slight performance hit in
         the case where wait != NULL and wait->_qproc == NULL. In that case the
         driver will assume that poll_wait() will actually add the fd to the set
         of waiting file descriptors. Of course, poll_wait() will not do that
         since it tests for wait->_qproc. This will not break anything, though.
      
         There is only one place in the whole kernel where this happens
         (sock_poll_wait() in include/net/sock.h) and that code will be replaced
         by a call to poll_does_not_wait() in the next patch.
      
         Note that even if wait->_qproc != NULL drivers cannot rely on poll_wait()
         actually waiting. The next file descriptor from the set might match the
         event mask and thus any possible waits will never happen.
      Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
      Reviewed-by: Jonathan Corbet <corbet@lwn.net>
      Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
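The two accessors described in this commit message can be modelled in userland roughly as follows. This is a sketch under stated assumptions rather than the kernel source: the struct and every name ending in _model are hypothetical stand-ins, and should_start_dma() is an invented example of the video4linux-style decision mentioned at the top of the message.

```c
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>

struct poll_table_model;
typedef void (*queue_proc_model)(struct poll_table_model *);

/* Hypothetical stand-in for the kernel's poll_table. */
struct poll_table_model {
    queue_proc_model _qproc; /* NULL once an earlier fd already matched */
    unsigned long _key;      /* event mask requested by the caller */
};

/* Placeholder callback so a table can have a non-NULL _qproc. */
void noop_qproc_model(struct poll_table_model *p)
{
    (void)p;
}

/* True when poll_wait() would not queue the caller anywhere. */
bool poll_does_not_wait_model(const struct poll_table_model *p)
{
    return p == NULL || p->_qproc == NULL;
}

/* Requested events, or all events when no table was passed. */
unsigned long poll_requested_events_model(const struct poll_table_model *p)
{
    return p ? p->_key : ~0UL;
}

/* Invented driver decision: only start DMA if POLLIN was requested. */
bool should_start_dma(const struct poll_table_model *p)
{
    return (poll_requested_events_model(p) & POLLIN) != 0;
}
```

With this model a table whose _key is only POLLOUT does not trigger DMA, while a NULL table is treated as requesting all events and therefore does.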
  7. 21 Mar, 2012 2 commits
  8. 27 Feb, 2012 1 commit
  9. 23 Feb, 2012 1 commit
    • af_unix: MSG_TRUNC support for dgram sockets · 9f6f9af7
      Committed by Eric Dumazet
      Piergiorgio Beruto expressed the need to fetch size of first datagram in
      queue for AF_UNIX sockets and suggested a patch against SIOCINQ ioctl.
      
      I suggested instead to implement MSG_TRUNC support as a recv() input
      flag, as already done for RAW, UDP & NETLINK sockets.
      
      len = recv(fd, &byte, 1, MSG_PEEK | MSG_TRUNC);
      
      MSG_TRUNC asks recv() to return the real length of the packet, even when
      it was longer than the passed buffer.
      
      There is risk that a userland application used MSG_TRUNC by accident
      (since it had no effect on af_unix sockets) and this might break after
      this patch.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Tested-by: Piergiorgio Beruto <piergiorgio.beruto@gmail.com>
      CC: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
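The recv() call quoted in this commit message can be exercised from userland with a datagram socketpair. The following is a minimal sketch that assumes a Linux kernel containing this change; the function names are hypothetical and error handling is kept to a minimum.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Size of the first queued datagram, without consuming it. */
ssize_t first_datagram_size(int fd)
{
    char byte;

    /* MSG_PEEK leaves the datagram queued; MSG_TRUNC asks for its
     * real length even though the buffer holds a single byte. */
    return recv(fd, &byte, 1, MSG_PEEK | MSG_TRUNC);
}

/* Send one datagram of payload_len bytes and report the peeked size. */
ssize_t demo_peek_size(size_t payload_len)
{
    int sv[2];
    char payload[512];
    ssize_t len = -1;

    if (payload_len > sizeof(payload))
        return -1;
    memset(payload, 'x', payload_len);

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;
    if (send(sv[0], payload, payload_len, 0) == (ssize_t)payload_len)
        len = first_datagram_size(sv[1]);

    close(sv[0]);
    close(sv[1]);
    return len;
}
```

On a kernel without this patch the same call would be expected to return 1 (the buffer size) rather than the datagram length, which is the compatibility risk the message warns about.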
  10. 22 Feb, 2012 2 commits
  11. 31 Jan, 2012 1 commit
    • af_unix: fix EPOLLET regression for stream sockets · 6f01fd6e
      Committed by Eric Dumazet
      Commit 0884d7aa (AF_UNIX: Fix poll blocking problem when reading from
      a stream socket) added a regression for epoll() in Edge Triggered mode
      (EPOLLET).
      
      The appropriate fix is to use skb_peek()/skb_unlink() instead of
      skb_dequeue(), and only call skb_unlink() when the skb is fully consumed.
      
      This removes the need to requeue a partial skb at the head of
      sk_receive_queue, and the extra sk->sk_data_ready() calls that added the
      regression.
      
      This is safe because once an skb is given to sk_receive_queue, it is not
      modified by a writer, and readers are serialized by the u->readlock mutex.
      
      This also reduces the number of spinlock acquisitions for small reads or
      MSG_PEEK users, so it should improve overall performance.
      Reported-by: Nick Mathewson <nickm@freehaven.net>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Alexey Moiseytsev <himeraster@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
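The peek-then-unlink pattern described in this commit message can be illustrated with a simple userland queue. This is only a model: the struct and function names are hypothetical, a plain singly linked list stands in for sk_receive_queue, and no locking is shown (in the kernel, readers are serialized by u->readlock as the message explains).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an skb sitting in a receive queue. */
struct buf_node {
    const char *data;
    size_t len;
    size_t consumed;        /* bytes already handed to readers */
    struct buf_node *next;
};

struct buf_queue {
    struct buf_node *head;
};

/* Read up to n bytes from the head buffer. The buffer stays queued
 * while partially consumed and is unlinked only once it is empty, so
 * nothing ever has to be requeued at the head. */
size_t queue_read(struct buf_queue *q, char *dst, size_t n)
{
    struct buf_node *b = q->head;   /* peek, do not dequeue */
    size_t avail, chunk;

    if (b == NULL)
        return 0;

    avail = b->len - b->consumed;
    chunk = n < avail ? n : avail;
    memcpy(dst, b->data + b->consumed, chunk);
    b->consumed += chunk;

    if (b->consumed == b->len)      /* fully consumed: unlink now */
        q->head = b->next;
    return chunk;
}
```

A short read leaves the head buffer in place, which mirrors why the fixed code no longer needs the requeue and the extra wakeups it caused.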
  12. 08 Jan, 2012 1 commit
  13. 04 Jan, 2012 1 commit
  14. 31 Dec, 2011 3 commits
  15. 27 Dec, 2011 2 commits
  16. 21 Dec, 2011 1 commit
  17. 17 Dec, 2011 10 commits
  18. 27 Nov, 2011 1 commit
  19. 29 Sep, 2011 1 commit
    • af_unix: dont send SCM_CREDENTIALS by default · 16e57262
      Committed by Eric Dumazet
      Since commit 7361c36c (af_unix: Allow credentials to work across
      user and pid namespaces) af_unix performance dropped a lot.
      
      This is because we now take a reference on pid and cred in each write(),
      and release them in read(), which is usually done from another process,
      possibly on another cpu. This triggers false sharing.
      
      # Events: 154K cycles
      #
      # Overhead  Command       Shared Object        Symbol
      # ........  .......  ..................  .........................
      #
          10.40%  hackbench  [kernel.kallsyms]   [k] put_pid
           8.60%  hackbench  [kernel.kallsyms]   [k] unix_stream_recvmsg
           7.87%  hackbench  [kernel.kallsyms]   [k] unix_stream_sendmsg
           6.11%  hackbench  [kernel.kallsyms]   [k] do_raw_spin_lock
           4.95%  hackbench  [kernel.kallsyms]   [k] unix_scm_to_skb
           4.87%  hackbench  [kernel.kallsyms]   [k] pid_nr_ns
           4.34%  hackbench  [kernel.kallsyms]   [k] cred_to_ucred
           2.39%  hackbench  [kernel.kallsyms]   [k] unix_destruct_scm
           2.24%  hackbench  [kernel.kallsyms]   [k] sub_preempt_count
           1.75%  hackbench  [kernel.kallsyms]   [k] fget_light
           1.51%  hackbench  [kernel.kallsyms]   [k] __mutex_lock_interruptible_slowpath
           1.42%  hackbench  [kernel.kallsyms]   [k] sock_alloc_send_pskb
      
      This patch includes SCM_CREDENTIALS information in an af_unix message/skb
      only if requested by the sender (see man 7 unix for details on how to
      include ancillary data using the sendmsg() system call).
      
      Note: This might break buggy applications that expected SCM_CREDENTIALS
      from an unaware write() system call, with the receiver not using the
      SO_PASSCRED socket option.
      
      If SOCK_PASSCRED is set on source or destination socket, we still
      include credentials for mere write() syscalls.
      
      Performance boost in hackbench : more than 50% gain on a 16 thread
      machine (2 quad-core cpus, 2 threads per core)
      
      hackbench 20 thread 2000
      
      4.228 sec instead of 9.102 sec
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
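For a sender that does want credentials attached, the explicit request mentioned in this commit message looks roughly like the following on Linux. This is a hedged sketch based on the interface documented in man 7 unix; the function names are hypothetical, and the receiver enables SO_PASSCRED so the control message is delivered.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for struct ucred */
#endif
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

union cred_ctl {
    char buf[CMSG_SPACE(sizeof(struct ucred))];
    struct cmsghdr align;       /* forces correct alignment */
};

/* Send one byte with an explicit SCM_CREDENTIALS control message. */
int send_with_creds(int fd)
{
    char byte = 'x';
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    struct ucred cred = { .pid = getpid(), .uid = getuid(), .gid = getgid() };
    union cred_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_CREDENTIALS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(cred));
    memcpy(CMSG_DATA(cmsg), &cred, sizeof(cred));

    return sendmsg(fd, &msg, 0) == 1 ? 0 : -1;
}

/* Returns 1 if credentials were attached, 0 if not, -1 on error. */
int recv_creds(int fd, struct ucred *out)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union cred_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(fd, &msg, 0) < 0)
        return -1;
    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_CREDENTIALS) {
            memcpy(out, CMSG_DATA(cmsg), sizeof(*out));
            return 1;
        }
    }
    return 0;
}

/* Round trip over a socketpair; returns the pid the receiver saw. */
pid_t creds_roundtrip_pid(void)
{
    int sv[2], on = 1;
    struct ucred cred;
    pid_t pid = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;
    if (setsockopt(sv[1], SOL_SOCKET, SO_PASSCRED, &on, sizeof(on)) == 0 &&
        send_with_creds(sv[0]) == 0 && recv_creds(sv[1], &cred) == 1)
        pid = cred.pid;
    close(sv[0]);
    close(sv[1]);
    return pid;
}
```

Senders that never build such a control message, and whose peers do not set SO_PASSCRED, are the ones that avoid the pid and cred reference counting after this patch.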
  20. 17 Sep, 2011 1 commit
  21. 25 Aug, 2011 1 commit
    • Scm: Remove unnecessary pid & credential references in Unix socket's send and receive path · 0856a304
      Committed by Tim Chen
      Patch series 109f6e39..7361c36c back in 2.6.36 added functionality to
      allow credentials to work across pid namespaces for packets sent via
      UNIX sockets.  However, the atomic reference counts on pid and
      credentials caused plenty of cache bouncing when there are numerous
      threads of the same pid sharing a UNIX socket.  This patch mitigates the
      problem by eliminating extraneous reference counts on pid and
      credentials on both send and receive path of UNIX sockets. I found a 2x
      improvement in hackbench's threaded case.
      
      On the receive path in unix_dgram_recvmsg, currently there is an
      increment of the reference count on pid and credentials in scm_set_cred.
      Then there are two decrements of the reference counts: once in scm_recv
      and once when skb_free_datagram calls the skb->destructor function
      unix_destruct_scm.  One pair of increment and decrement of ref count on
      pid and credentials can be eliminated from the receive path.  Until we
      destroy the skb, we already hold the reference taken when we created the
      skb on the send side.
      
      On the send path, there are two increments of ref count on pid and
      credentials, once in scm_send and once in unix_scm_to_skb.  Then there
      is a decrement of the reference counts in scm_destroy's call to
      scm_destroy_cred at the end of the unix_dgram_sendmsg function.  One pair
      of increment and decrement of the reference counts can be removed so we
      only need to increment the ref counts once.
      
      By incorporating these changes, for hackbench running on a 4 socket
      NHM-EX machine with 40 cores, the execution of hackbench on
      50 groups of 20 threads sped up by factor of 2.
      
      Hackbench command used for testing:
      ./hackbench 50 thread 2000
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 20 Jul, 2011 1 commit
  23. 24 May, 2011 1 commit
    • net: convert %p usage to %pK · 71338aa7
      Committed by Dan Rosenberg
      The %pK format specifier is designed to hide exposed kernel pointers,
      specifically via /proc interfaces.  Exposing these pointers provides an
      easy target for kernel write vulnerabilities, since they reveal the
      locations of writable structures containing easily triggerable function
      pointers.  The behavior of %pK depends on the kptr_restrict sysctl.
      
      If kptr_restrict is set to 0, no deviation from the standard %p behavior
      occurs.  If kptr_restrict is set to 1 (the default) and the current user
      (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
      (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
      If kptr_restrict is set to 2, kernel pointers using %pK are printed as
      0's regardless of privileges.  Replacing with 0's was chosen over the
      default "(null)", which cannot be parsed by userland %p, which expects
      "(nil)".
      
      The supporting code for kptr_restrict and %pK is currently in the -mm
      tree.  This patch converts users of %p in net/ to %pK.  Cases of printing
      pointers to the syslog are not covered, since this would eliminate useful
      information for postmortem debugging and the reading of the syslog is
      already optionally protected by the dmesg_restrict sysctl.
      Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Thomas Graf <tgraf@infradead.org>
      Cc: Eugene Teo <eugeneteo@kernel.org>
      Cc: Kees Cook <kees.cook@canonical.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
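The three kptr_restrict settings described in this commit message amount to a small decision table. The sketch below models only that table in userland; the function name is hypothetical, and treating values other than 0, 1 and 2 as "hide" is an assumption made for the example, not something the message states.

```c
#include <stdbool.h>

/* Returns true if a %pK pointer would be printed like plain %p,
 * false if it would be replaced by 0's. */
bool pk_prints_real_pointer(int kptr_restrict, bool has_cap_syslog)
{
    switch (kptr_restrict) {
    case 0:
        return true;            /* no deviation from %p */
    case 1:
        return has_cap_syslog;  /* hidden unless the reader has CAP_SYSLOG */
    case 2:
        return false;           /* hidden regardless of privileges */
    default:
        return false;           /* assumption: unknown values hide */
    }
}
```

So with the setting at 1, an unprivileged reader of a /proc file sees zeros while a reader holding CAP_SYSLOG still sees the address.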