1. 01 11月, 2007 1 次提交
  2. 20 10月, 2007 1 次提交
    • P
      pid namespaces: changes to show virtual ids to user · b488893a
      Pavel Emelyanov 提交于
      This is the largest patch in the set. Make all (I hope) the places where
      the pid is shown to or get from user operate on the virtual pids.
      
      The idea is:
       - all in-kernel data structures must store either struct pid itself
         or the pid's global nr, obtained with pid_nr() call;
       - when seeking the task from kernel code with the stored id one
         should use find_task_by_pid() call that works with global pids;
       - when showing pid's numerical value to the user the virtual one
         should be used, but however when one shows task's pid outside this
         task's namespace the global one is to be used;
       - when getting the pid from userspace one need to consider this as
         the virtual one and use appropriate task/pid-searching functions.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: nuther build fix]
      [akpm@linux-foundation.org: yet nuther build fix]
      [akpm@linux-foundation.org: remove unneeded casts]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NAlexey Dobriyan <adobriyan@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b488893a
  3. 15 10月, 2007 1 次提交
    • I
      sched: affine sync wakeups · 71e20f18
      Ingo Molnar 提交于
      make sync wakeups affine for cache-cold tasks: if a cache-cold task
      is woken up by a sync wakeup then use the opportunity to migrate it
      straight away. (the two tasks are 'related' because they communicate)
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      71e20f18
  4. 11 10月, 2007 3 次提交
    • P
      [NET]: Make core networking code use seq_open_private · cf7732e4
      Pavel Emelyanov 提交于
      This concerns the ipv4 and ipv6 code mostly, but also the netlink
      and unix sockets.
      
      The netlink code is an example of how to use the __seq_open_private()
      call - it saves the net namespace on this private.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cf7732e4
    • E
      [NET]: Make socket creation namespace safe. · 1b8d7ae4
      Eric W. Biederman 提交于
      This patch passes in the namespace a new socket should be created in
      and has the socket code do the appropriate reference counting.  By
      virtue of this all socket create methods are touched.  In addition
      the socket create methods are modified so that they will fail if
      you attempt to create a socket in a non-default network namespace.
      
      Failing if we attempt to create a socket outside of the default
      network namespace ensures that as we incrementally make the network stack
      network namespace aware we will not export functionality that someone
      has not audited and made certain is network namespace safe.
      Allowing us to partially enable network namespaces before all of the
      exotic protocols are supported.
      
      Any protocol layers I have missed will fail to compile because I now
      pass an extra parameter into the socket creation code.
      
      [ Integrated AF_IUCV build fixes from Andrew Morton... -DaveM ]
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b8d7ae4
    • E
      [NET]: Make /proc/net per network namespace · 457c4cbc
      Eric W. Biederman 提交于
      This patch makes /proc/net per network namespace.  It modifies the global
      variables proc_net and proc_net_stat to be per network namespace.
      The proc_net file helpers are modified to take a network namespace argument,
      and all of their callers are fixed to pass &init_net for that argument.
      This ensures that all of the /proc/net files are only visible and
      usable in the initial network namespace until the code behind them
      has been updated to be handle multiple network namespaces.
      
      Making /proc/net per namespace is necessary as at least some files
      in /proc/net depend upon the set of network devices which is per
      network namespace, and even more files in /proc/net have contents
      that are relevant to a single network namespace.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      457c4cbc
  5. 31 7月, 2007 1 次提交
  6. 12 7月, 2007 1 次提交
    • M
      [AF_UNIX]: Rewrite garbage collector, fixes race. · 1fd05ba5
      Miklos Szeredi 提交于
      Throw out the old mark & sweep garbage collector and put in a
      refcounting cycle detecting one.
      
      The old one had a race with recvmsg, that resulted in false positives
      and hence data loss.  The old algorithm operated on all unix sockets
      in the system, so any additional locking would have meant performance
      problems for all users of these.
      
      The new algorithm instead only operates on "in flight" sockets, which
      are very rare, and the additional locking for these doesn't negatively
      impact the vast majority of users.
      
      In fact it's probable, that there weren't *any* heavy senders of
      sockets over sockets, otherwise the above race would have been
      discovered long ago.
      
      The patch works OK with the app that exposed the race with the old
      code.  The garbage collection has also been verified to work in a few
      simple cases.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1fd05ba5
  7. 11 7月, 2007 1 次提交
  8. 08 6月, 2007 1 次提交
    • M
      [AF_UNIX]: Fix stream recvmsg() race. · 3c0d2f37
      Miklos Szeredi 提交于
      A recv() on an AF_UNIX, SOCK_STREAM socket can race with a
      send()+close() on the peer, causing recv() to return zero, even though
      the sent data should be received.
      
      This happens if the send() and the close() is performed between
      skb_dequeue() and checking sk->sk_shutdown in unix_stream_recvmsg():
      
      process A  skb_dequeue() returns NULL, there's no data in the socket queue
      process B  new data is inserted onto the queue by unix_stream_sendmsg()
      process B  sk->sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock()
      process A  sk->sk_shutdown is checked, unix_release_sock() returns zero
      
      I'm surprised nobody noticed this, it's not hard to trigger.  Maybe
      it's just (un)luck with the timing.
      
      It's possible to work around this bug in userspace, by retrying the
      recv() once in case of a zero return value.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c0d2f37
  9. 04 6月, 2007 2 次提交
    • D
      [AF_UNIX]: Fix datagram connect race causing an OOPS. · 278a3de5
      David S. Miller 提交于
      Based upon an excellent bug report and initial patch by
      Frederik Deweerdt.
      
      The UNIX datagram connect code blindly dereferences other->sk_socket
      via the call down to the security_unix_may_send() function.
      
      Without locking 'other' that pointer can go NULL via unix_release_sock()
      which does sock_orphan() which also marks the socket SOCK_DEAD.
      
      So we have to lock both 'sk' and 'other' yet avoid all kinds of
      potential deadlocks (connect to self is OK for datagram sockets and it
      is possible for two datagram sockets to perform a simultaneous connect
      to each other).  So what we do is have a "double lock" function similar
      to how we handle this situation in other areas of the kernel.  We take
      the lock of the socket pointer with the smallest address first in
      order to avoid ABBA style deadlocks.
      
      Once we have them both locked, we check to see if SOCK_DEAD is set
      for 'other' and if so, drop everything and retry the lookup.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      278a3de5
    • D
      [AF_UNIX]: Make socket locking much less confusing. · 1c92b4e5
      David S. Miller 提交于
      The unix_state_*() locking macros imply that there is some
      rwlock kind of thing going on, but the implementation is
      actually a spinlock which makes the code more confusing than
      it needs to be.
      
      So use plain unix_state_lock and unix_state_unlock.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1c92b4e5
  10. 09 5月, 2007 1 次提交
  11. 26 4月, 2007 1 次提交
  12. 07 3月, 2007 1 次提交
  13. 03 3月, 2007 1 次提交
  14. 13 2月, 2007 1 次提交
  15. 11 2月, 2007 1 次提交
  16. 03 12月, 2006 1 次提交
  17. 23 9月, 2006 2 次提交
  18. 03 8月, 2006 1 次提交
    • C
      [AF_UNIX]: Kernel memory leak fix for af_unix datagram getpeersec patch · dc49c1f9
      Catherine Zhang 提交于
      From: Catherine Zhang <cxzhang@watson.ibm.com>
      
      This patch implements a cleaner fix for the memory leak problem of the
      original unix datagram getpeersec patch.  Instead of creating a
      security context each time a unix datagram is sent, we only create the
      security context when the receiver requests it.
      
      This new design requires modification of the current
      unix_getsecpeer_dgram LSM hook and addition of two new hooks, namely,
      secid_to_secctx and release_secctx.  The former retrieves the security
      context and the latter releases it.  A hook is required for releasing
      the security context because it is up to the security module to decide
      how that's done.  In the case of Selinux, it's a simple kfree
      operation.
      Acked-by: NStephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dc49c1f9
  19. 22 7月, 2006 1 次提交
  20. 04 7月, 2006 2 次提交
  21. 01 7月, 2006 1 次提交
  22. 30 6月, 2006 1 次提交
    • C
      [AF_UNIX]: Datagram getpeersec · 877ce7c1
      Catherine Zhang 提交于
      This patch implements an API whereby an application can determine the
      label of its peer's Unix datagram sockets via the auxiliary data mechanism of
      recvmsg.
      
      Patch purpose:
      
      This patch enables a security-aware application to retrieve the
      security context of the peer of a Unix datagram socket.  The application
      can then use this security context to determine the security context for
      processing on behalf of the peer who sent the packet.
      
      Patch design and implementation:
      
      The design and implementation is very similar to the UDP case for INET
      sockets.  Basically we build upon the existing Unix domain socket API for
      retrieving user credentials.  Linux offers the API for obtaining user
      credentials via ancillary messages (i.e., out of band/control messages
      that are bundled together with a normal message).  To retrieve the security
      context, the application first indicates to the kernel such desire by
      setting the SO_PASSSEC option via getsockopt.  Then the application
      retrieves the security context using the auxiliary data mechanism.
      
      An example server application for Unix datagram socket should look like this:
      
      toggle = 1;
      toggle_len = sizeof(toggle);
      
      setsockopt(sockfd, SOL_SOCKET, SO_PASSSEC, &toggle, &toggle_len);
      recvmsg(sockfd, &msg_hdr, 0);
      if (msg_hdr.msg_controllen > sizeof(struct cmsghdr)) {
          cmsg_hdr = CMSG_FIRSTHDR(&msg_hdr);
          if (cmsg_hdr->cmsg_len <= CMSG_LEN(sizeof(scontext)) &&
              cmsg_hdr->cmsg_level == SOL_SOCKET &&
              cmsg_hdr->cmsg_type == SCM_SECURITY) {
              memcpy(&scontext, CMSG_DATA(cmsg_hdr), sizeof(scontext));
          }
      }
      
      sock_setsockopt is enhanced with a new socket option SOCK_PASSSEC to allow
      a server socket to receive security context of the peer.
      
      Testing:
      
      We have tested the patch by setting up Unix datagram client and server
      applications.  We verified that the server can retrieve the security context
      using the auxiliary data mechanism of recvmsg.
      Signed-off-by: NCatherine Zhang <cxzhang@watson.ibm.com>
      Acked-by: NAcked-by: James Morris <jmorris@namei.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      877ce7c1
  23. 26 3月, 2006 1 次提交
    • D
      [PATCH] POLLRDHUP/EPOLLRDHUP handling for half-closed devices notifications · f348d70a
      Davide Libenzi 提交于
      Implement the half-closed devices notifiation, by adding a new POLLRDHUP
      (and its alias EPOLLRDHUP) bit to the existing poll/select sets.  Since the
      existing POLLHUP handling, that does not report correctly half-closed
      devices, was feared to be changed, this implementation leaves the current
      POLLHUP reporting unchanged and simply add a new bit that is set in the few
      places where it makes sense.  The same thing was discussed and conceptually
      agreed quite some time ago:
      
      http://lkml.org/lkml/2003/7/12/116
      
      Since this new event bit is added to the existing Linux poll infrastruture,
      even the existing poll/select system calls will be able to use it.  As far
      as the existing POLLHUP handling, the patch leaves it as is.  The
      pollrdhup-2.6.16.rc5-0.10.diff defines the POLLRDHUP for all the existing
      archs and sets the bit in the six relevant files.  The other attached diff
      is the simple change required to sys/epoll.h to add the EPOLLRDHUP
      definition.
      
      There is "a stupid program" to test POLLRDHUP delivery here:
      
       http://www.xmailserver.org/pollrdhup-test.c
      
      It tests poll(2), but since the delivery is same epoll(2) will work equally.
      Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f348d70a
  24. 21 3月, 2006 2 次提交
  25. 09 3月, 2006 1 次提交
    • D
      [PATCH] fix file counting · 529bf6be
      Dipankar Sarma 提交于
      I have benchmarked this on an x86_64 NUMA system and see no significant
      performance difference on kernbench.  Tested on both x86_64 and powerpc.
      
      The way we do file struct accounting is not very suitable for batched
      freeing.  For scalability reasons, file accounting was
      constructor/destructor based.  This meant that nr_files was decremented
      only when the object was removed from the slab cache.  This is susceptible
      to slab fragmentation.  With RCU based file structure, consequent batched
      freeing and a test program like Serge's, we just speed this up and end up
      with a very fragmented slab -
      
      llm22:~ # cat /proc/sys/fs/file-nr
      587730  0       758844
      
      At the same time, I see only a 2000+ objects in filp cache.  The following
      patch I fixes this problem.
      
      This patch changes the file counting by removing the filp_count_lock.
      Instead we use a separate percpu counter, nr_files, for now and all
      accesses to it are through get_nr_files() api.  In the sysctl handler for
      nr_files, we populate files_stat.nr_files before returning to user.
      
      Counting files as an when they are created and destroyed (as opposed to
      inside slab) allows us to correctly count open files with RCU.
      Signed-off-by: NDipankar Sarma <dipankar@in.ibm.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      529bf6be
  26. 10 1月, 2006 1 次提交
  27. 04 1月, 2006 5 次提交
  28. 09 11月, 2005 1 次提交
  29. 30 8月, 2005 2 次提交