1. 22 2月, 2012 1 次提交
  2. 31 1月, 2012 1 次提交
    • E
      af_unix: fix EPOLLET regression for stream sockets · 6f01fd6e
      Eric Dumazet 提交于
      Commit 0884d7aa (AF_UNIX: Fix poll blocking problem when reading from
      a stream socket) added a regression for epoll() in Edge Triggered mode
      (EPOLLET)
      
      Appropriate fix is to use skb_peek()/skb_unlink() instead of
      skb_dequeue(), and only call skb_unlink() when skb is fully consumed.
      
      This remove the need to requeue a partial skb into sk_receive_queue head
      and the extra sk->sk_data_ready() calls that added the regression.
      
      This is safe because once skb is given to sk_receive_queue, it is not
      modified by a writer, and readers are serialized by u->readlock mutex.
      
      This also reduce number of spinlock acquisition for small reads or
      MSG_PEEK users so should improve overall performance.
      Reported-by: NNick Mathewson <nickm@freehaven.net>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Alexey Moiseytsev <himeraster@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6f01fd6e
  3. 04 1月, 2012 1 次提交
  4. 31 12月, 2011 1 次提交
  5. 17 12月, 2011 1 次提交
  6. 27 11月, 2011 1 次提交
  7. 29 9月, 2011 1 次提交
    • E
      af_unix: dont send SCM_CREDENTIALS by default · 16e57262
      Eric Dumazet 提交于
      Since commit 7361c36c (af_unix: Allow credentials to work across
      user and pid namespaces) af_unix performance dropped a lot.
      
      This is because we now take a reference on pid and cred in each write(),
      and release them in read(), usually done from another process,
      eventually from another cpu. This triggers false sharing.
      
      # Events: 154K cycles
      #
      # Overhead  Command       Shared Object        Symbol
      # ........  .......  ..................  .........................
      #
          10.40%  hackbench  [kernel.kallsyms]   [k] put_pid
           8.60%  hackbench  [kernel.kallsyms]   [k] unix_stream_recvmsg
           7.87%  hackbench  [kernel.kallsyms]   [k] unix_stream_sendmsg
           6.11%  hackbench  [kernel.kallsyms]   [k] do_raw_spin_lock
           4.95%  hackbench  [kernel.kallsyms]   [k] unix_scm_to_skb
           4.87%  hackbench  [kernel.kallsyms]   [k] pid_nr_ns
           4.34%  hackbench  [kernel.kallsyms]   [k] cred_to_ucred
           2.39%  hackbench  [kernel.kallsyms]   [k] unix_destruct_scm
           2.24%  hackbench  [kernel.kallsyms]   [k] sub_preempt_count
           1.75%  hackbench  [kernel.kallsyms]   [k] fget_light
           1.51%  hackbench  [kernel.kallsyms]   [k]
      __mutex_lock_interruptible_slowpath
           1.42%  hackbench  [kernel.kallsyms]   [k] sock_alloc_send_pskb
      
      This patch includes SCM_CREDENTIALS information in a af_unix message/skb
      only if requested by the sender, [man 7 unix for details how to include
      ancillary data using sendmsg() system call]
      
      Note: This might break buggy applications that expected SCM_CREDENTIAL
      from an unaware write() system call, and receiver not using SO_PASSCRED
      socket option.
      
      If SOCK_PASSCRED is set on source or destination socket, we still
      include credentials for mere write() syscalls.
      
      Performance boost in hackbench : more than 50% gain on a 16 thread
      machine (2 quad-core cpus, 2 threads per core)
      
      hackbench 20 thread 2000
      
      4.228 sec instead of 9.102 sec
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      16e57262
  8. 17 9月, 2011 1 次提交
  9. 25 8月, 2011 1 次提交
    • T
      Scm: Remove unnecessary pid & credential references in Unix socket's send and receive path · 0856a304
      Tim Chen 提交于
      Patch series 109f6e39..7361c36c back in 2.6.36 added functionality to
      allow credentials to work across pid namespaces for packets sent via
      UNIX sockets.  However, the atomic reference counts on pid and
      credentials caused plenty of cache bouncing when there are numerous
      threads of the same pid sharing a UNIX socket.  This patch mitigates the
      problem by eliminating extraneous reference counts on pid and
      credentials on both send and receive path of UNIX sockets. I found a 2x
      improvement in hackbench's threaded case.
      
      On the receive path in unix_dgram_recvmsg, currently there is an
      increment of reference count on pid and credentials in scm_set_cred.
      Then there are two decrement of the reference counts.  Once in scm_recv
      and once when skb_free_datagram call skb->destructor function
      unix_destruct_scm.  One pair of increment and decrement of ref count on
      pid and credentials can be eliminated from the receive path.  Until we
      destroy the skb, we already set a reference when we created the skb on
      the send side.
      
      On the send path, there are two increments of ref count on pid and
      credentials, once in scm_send and once in unix_scm_to_skb.  Then there
      is a decrement of the reference counts in scm_destroy's call to
      scm_destroy_cred at the end of unix_dgram_sendmsg functions.   One pair
      of increment and decrement of the reference counts can be removed so we
      only need to increment the ref counts once.
      
      By incorporating these changes, for hackbench running on a 4 socket
      NHM-EX machine with 40 cores, the execution of hackbench on
      50 groups of 20 threads sped up by factor of 2.
      
      Hackbench command used for testing:
      ./hackbench 50 thread 2000
      Signed-off-by: NTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0856a304
  10. 20 7月, 2011 1 次提交
  11. 24 5月, 2011 1 次提交
    • D
      net: convert %p usage to %pK · 71338aa7
      Dan Rosenberg 提交于
      The %pK format specifier is designed to hide exposed kernel pointers,
      specifically via /proc interfaces.  Exposing these pointers provides an
      easy target for kernel write vulnerabilities, since they reveal the
      locations of writable structures containing easily triggerable function
      pointers.  The behavior of %pK depends on the kptr_restrict sysctl.
      
      If kptr_restrict is set to 0, no deviation from the standard %p behavior
      occurs.  If kptr_restrict is set to 1, the default, if the current user
      (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
      (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
       If kptr_restrict is set to 2, kernel pointers using %pK are printed as
      0's regardless of privileges.  Replacing with 0's was chosen over the
      default "(null)", which cannot be parsed by userland %p, which expects
      "(nil)".
      
      The supporting code for kptr_restrict and %pK are currently in the -mm
      tree.  This patch converts users of %p in net/ to %pK.  Cases of printing
      pointers to the syslog are not covered, since this would eliminate useful
      information for postmortem debugging and the reading of the syslog is
      already optionally protected by the dmesg_restrict sysctl.
      Signed-off-by: NDan Rosenberg <drosenberg@vsecurity.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Thomas Graf <tgraf@infradead.org>
      Cc: Eugene Teo <eugeneteo@kernel.org>
      Cc: Kees Cook <kees.cook@canonical.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      71338aa7
  12. 02 5月, 2011 1 次提交
    • E
      af_unix: Only allow recv on connected seqpacket sockets. · a05d2ad1
      Eric W. Biederman 提交于
      This fixes the following oops discovered by Dan Aloni:
      > Anyway, the following is the output of the Oops that I got on the
      > Ubuntu kernel on which I first detected the problem
      > (2.6.37-12-generic). The Oops that followed will be more useful, I
      > guess.
      
      >[ 5594.669852] BUG: unable to handle kernel NULL pointer dereference
      > at           (null)
      > [ 5594.681606] IP: [<ffffffff81550b7b>] unix_dgram_recvmsg+0x1fb/0x420
      > [ 5594.687576] PGD 2a05d067 PUD 2b951067 PMD 0
      > [ 5594.693720] Oops: 0002 [#1] SMP
      > [ 5594.699888] last sysfs file:
      
      The bug was that unix domain sockets use a pseduo packet for
      connecting and accept uses that psudo packet to get the socket.
      In the buggy seqpacket case we were allowing unconnected
      sockets to call recvmsg and try to receive the pseudo packet.
      
      That is always wrong and as of commit 7361c36c the pseudo
      packet had become enough different from a normal packet
      that the kernel started oopsing.
      
      Do for seqpacket_recv what was done for seqpacket_send in 2.5
      and only allow it on connected seqpacket sockets.
      
      Cc: stable@kernel.org
      Tested-by: NDan Aloni <dan@aloni.org>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a05d2ad1
  13. 31 3月, 2011 1 次提交
  14. 15 3月, 2011 1 次提交
  15. 14 3月, 2011 1 次提交
  16. 08 3月, 2011 2 次提交
    • H
      6118e35a
    • R
      net: fix multithreaded signal handling in unix recv routines · b3ca9b02
      Rainer Weikusat 提交于
      The unix_dgram_recvmsg and unix_stream_recvmsg routines in
      net/af_unix.c utilize mutex_lock(&u->readlock) calls in order to
      serialize read operations of multiple threads on a single socket. This
      implies that, if all n threads of a process block in an AF_UNIX recv
      call trying to read data from the same socket, one of these threads
      will be sleeping in state TASK_INTERRUPTIBLE and all others in state
      TASK_UNINTERRUPTIBLE. Provided that a particular signal is supposed to
      be handled by a signal handler defined by the process and that none of
      this threads is blocking the signal, the complete_signal routine in
      kernel/signal.c will select the 'first' such thread it happens to
      encounter when deciding which thread to notify that a signal is
      supposed to be handled and if this is one of the TASK_UNINTERRUPTIBLE
      threads, the signal won't be handled until the one thread not blocking
      on the u->readlock mutex is woken up because some data to process has
      arrived (if this ever happens). The included patch fixes this by
      changing mutex_lock to mutex_lock_interruptible and handling possible
      error returns in the same way interruptions are handled by the actual
      receive-code.
      Signed-off-by: NRainer Weikusat <rweikusat@mobileactivedefense.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b3ca9b02
  17. 23 2月, 2011 1 次提交
  18. 20 1月, 2011 1 次提交
  19. 19 1月, 2011 1 次提交
    • A
      af_unix: implement socket filter · d6ae3bae
      Alban Crequy 提交于
      Linux Socket Filters can already be successfully attached and detached on unix
      sockets with setsockopt(sockfd, SOL_SOCKET, SO_{ATTACH,DETACH}_FILTER, ...).
      See: Documentation/networking/filter.txt
      
      But the filter was never used in the unix socket code so it did not work. This
      patch uses sk_filter() to filter buffers before delivery.
      
      This short program demonstrates the problem on SOCK_DGRAM.
      
      int main(void) {
        int i, j, ret;
        int sv[2];
        struct pollfd fds[2];
        char *message = "Hello world!";
        char buffer[64];
        struct sock_filter ins[32] = {{0,},};
        struct sock_fprog filter;
      
        socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
      
        for (i = 0 ; i < 2 ; i++) {
          fds[i].fd = sv[i];
          fds[i].events = POLLIN;
          fds[i].revents = 0;
        }
      
        for(j = 1 ; j < 13 ; j++) {
      
          /* Set a socket filter to truncate the message */
          memset(ins, 0, sizeof(ins));
          ins[0].code = BPF_RET|BPF_K;
          ins[0].k = j;
          filter.len = 1;
          filter.filter = ins;
          setsockopt(sv[1], SOL_SOCKET, SO_ATTACH_FILTER, &filter, sizeof(filter));
      
          /* send a message */
          send(sv[0], message, strlen(message) + 1, 0);
      
          /* The filter should let the message pass but truncated. */
          poll(fds, 2, 0);
      
          /* Receive the truncated message*/
          ret = recv(sv[1], buffer, 64, 0);
          printf("received %d bytes, expected %d\n", ret, j);
        }
      
          for (i = 0 ; i < 2 ; i++)
            close(sv[i]);
      
        return 0;
      }
      Signed-off-by: NAlban Crequy <alban.crequy@collabora.co.uk>
      Reviewed-by: NIan Molton <ian.molton@collabora.co.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d6ae3bae
  20. 06 1月, 2011 1 次提交
  21. 30 11月, 2010 1 次提交
  22. 09 11月, 2010 3 次提交
  23. 27 10月, 2010 1 次提交
    • E
      fs: allow for more than 2^31 files · 518de9b3
      Eric Dumazet 提交于
      Robin Holt tried to boot a 16TB system and found af_unix was overflowing
      a 32bit value :
      
      <quote>
      
      We were seeing a failure which prevented boot.  The kernel was incapable
      of creating either a named pipe or unix domain socket.  This comes down
      to a common kernel function called unix_create1() which does:
      
              atomic_inc(&unix_nr_socks);
              if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
                      goto out;
      
      The function get_max_files() is a simple return of files_stat.max_files.
      files_stat.max_files is a signed integer and is computed in
      fs/file_table.c's files_init().
      
              n = (mempages * (PAGE_SIZE / 1024)) / 10;
              files_stat.max_files = n;
      
      In our case, mempages (total_ram_pages) is approx 3,758,096,384
      (0xe0000000).  That leaves max_files at approximately 1,503,238,553.
      This causes 2 * get_max_files() to integer overflow.
      
      </quote>
      
      Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
      integers, and change af_unix to use an atomic_long_t instead of atomic_t.
      
      get_max_files() is changed to return an unsigned long.  get_nr_files() is
      changed to return a long.
      
      unix_nr_socks is changed from atomic_t to atomic_long_t, while not
      strictly needed to address Robin problem.
      
      Before patch (on a 64bit kernel) :
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      -18446744071562067968
      
      After patch:
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      2147483648
      # cat /proc/sys/fs/file-nr
      704     0       2147483648
      Reported-by: NRobin Holt <holt@sgi.com>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NDavid Miller <davem@davemloft.net>
      Reviewed-by: NRobin Holt <holt@sgi.com>
      Tested-by: NRobin Holt <holt@sgi.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      518de9b3
  24. 26 10月, 2010 1 次提交
    • E
      fs: allow for more than 2^31 files · 7e360c38
      Eric Dumazet 提交于
      Andrew,
      
      Could you please review this patch, you probably are the right guy to
      take it, because it crosses fs and net trees.
      
      Note : /proc/sys/fs/file-nr is a read-only file, so this patch doesnt
      depend on previous patch (sysctl: fix min/max handling in
      __do_proc_doulongvec_minmax())
      
      Thanks !
      
      [PATCH V4] fs: allow for more than 2^31 files
      
      Robin Holt tried to boot a 16TB system and found af_unix was overflowing
      a 32bit value :
      
      <quote>
      
      We were seeing a failure which prevented boot.  The kernel was incapable
      of creating either a named pipe or unix domain socket.  This comes down
      to a common kernel function called unix_create1() which does:
      
              atomic_inc(&unix_nr_socks);
              if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
                      goto out;
      
      The function get_max_files() is a simple return of files_stat.max_files.
      files_stat.max_files is a signed integer and is computed in
      fs/file_table.c's files_init().
      
              n = (mempages * (PAGE_SIZE / 1024)) / 10;
              files_stat.max_files = n;
      
      In our case, mempages (total_ram_pages) is approx 3,758,096,384
      (0xe0000000).  That leaves max_files at approximately 1,503,238,553.
      This causes 2 * get_max_files() to integer overflow.
      
      </quote>
      
      Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
      integers, and change af_unix to use an atomic_long_t instead of
      atomic_t.
      
      get_max_files() is changed to return an unsigned long.
      get_nr_files() is changed to return a long.
      
      unix_nr_socks is changed from atomic_t to atomic_long_t, while not
      strictly needed to address Robin problem.
      
      Before patch (on a 64bit kernel) :
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      -18446744071562067968
      
      After patch:
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      2147483648
      # cat /proc/sys/fs/file-nr
      704     0       2147483648
      Reported-by: NRobin Holt <holt@sgi.com>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NDavid Miller <davem@davemloft.net>
      Reviewed-by: NRobin Holt <holt@sgi.com>
      Tested-by: NRobin Holt <holt@sgi.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7e360c38
  25. 06 10月, 2010 1 次提交
  26. 08 9月, 2010 1 次提交
    • T
      UNIX: Do not loop forever at unix_autobind(). · 8df73ff9
      Tetsuo Handa 提交于
      We assumed that unix_autobind() never fails if kzalloc() succeeded.
      But unix_autobind() allows only 1048576 names. If /proc/sys/fs/file-max is
      larger than 1048576 (e.g. systems with more than 10GB of RAM), a local user can
      consume all names using fork()/socket()/bind().
      
      If all names are in use, those who call bind() with addr_len == sizeof(short)
      or connect()/sendmsg() with setsockopt(SO_PASSCRED) will continue
      
        while (1)
              yield();
      
      loop at unix_autobind() till a name becomes available.
      This patch adds a loop counter in order to give up after 1048576 attempts.
      
      Calling yield() for once per 256 attempts may not be sufficient when many names
      are already in use, for __unix_find_socket_byname() can take long time under
      such circumstance. Therefore, this patch also adds cond_resched() call.
      
      Note that currently a local user can consume 2GB of kernel memory if the user
      is allowed to create and autobind 1048576 UNIX domain sockets. We should
      consider adding some restriction for autobind operation.
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8df73ff9
  27. 07 9月, 2010 1 次提交
  28. 21 7月, 2010 1 次提交
    • N
      drop_monitor: convert some kfree_skb call sites to consume_skb · 70d4bf6d
      Neil Horman 提交于
      Convert a few calls from kfree_skb to consume_skb
      
      Noticed while I was working on dropwatch that I was detecting lots of internal
      skb drops in several places.  While some are legitimate, several were not,
      freeing skbs that were at the end of their life, rather than being discarded due
      to an error.  This patch converts those calls sites from using kfree_skb to
      consume_skb, which quiets the in-kernel drop_monitor code from detecting them as
      drops.  Tested successfully by myself
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70d4bf6d
  29. 17 6月, 2010 3 次提交
  30. 02 5月, 2010 1 次提交
    • E
      net: sock_def_readable() and friends RCU conversion · 43815482
      Eric Dumazet 提交于
      sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
      need two atomic operations (and associated dirtying) per incoming
      packet.
      
      RCU conversion is pretty much needed :
      
      1) Add a new structure, called "struct socket_wq" to hold all fields
      that will need rcu_read_lock() protection (currently: a
      wait_queue_head_t and a struct fasync_struct pointer).
      
      [Future patch will add a list anchor for wakeup coalescing]
      
      2) Attach one of such structure to each "struct socket" created in
      sock_alloc_inode().
      
      3) Respect RCU grace period when freeing a "struct socket_wq"
      
      4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct
      socket_wq"
      
      5) Change sk_sleep() function to use new sk->sk_wq instead of
      sk->sk_sleep
      
      6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
      a rcu_read_lock() section.
      
      7) Change all sk_has_sleeper() callers to :
        - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
        - Use wq_has_sleeper() to eventually wakeup tasks.
        - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)
      
      8) sock_wake_async() is modified to use rcu protection as well.
      
      9) Exceptions :
        macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
      instead of dynamically allocated ones. They dont need rcu freeing.
      
      Some cleanups or followups are probably needed, (possible
      sk_callback_lock conversion to a spinlock for example...).
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      43815482
  31. 21 4月, 2010 1 次提交
  32. 19 2月, 2010 1 次提交
  33. 18 1月, 2010 1 次提交
  34. 30 11月, 2009 1 次提交
  35. 11 11月, 2009 1 次提交