1. 05 9月, 2012 1 次提交
    • M
      net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries · 600e1779
      Masatake YAMATO 提交于
      lsof reports some of socket descriptors as "can't identify protocol" like:
      
          [yamato@localhost]/tmp% sudo lsof | grep dbus | grep iden
          dbus-daem   652          dbus    6u     sock ... 17812 can't identify protocol
          dbus-daem   652          dbus   34u     sock ... 24689 can't identify protocol
          dbus-daem   652          dbus   42u     sock ... 24739 can't identify protocol
          dbus-daem   652          dbus   48u     sock ... 22329 can't identify protocol
          ...
      
      lsof cannot resolve the protocol used in a socket because procfs
      doesn't provide the map between inode number on sockfs and protocol
      type of the socket.
      
      For improving the situation this patch adds an extended attribute named
      'system.sockprotoname' in which the protocol name for
      /proc/PID/fd/SOCKET is stored. So lsof can know the protocol for a
      given /proc/PID/fd/SOCKET with getxattr system call.
      
      A few weeks ago I submitted a patch for the same purpose. The patch
      was introduced /proc/net/sockfs which enumerates inodes and protocols
      of all sockets alive on a system. However, it was rejected because (1)
      a global lock was needed, and (2) the layout of struct socket was
      changed with the patch.
      
      This patch doesn't use any global lock; and doesn't change the layout
      of any structs.
      
      In this patch, a protocol name is stored to dentry->d_name of sockfs
      when new socket is associated with a file descriptor. Before this
      patch dentry->d_name was not used; it was just filled with empty
      string. lsof may use an extended attribute named
      'system.sockprotoname' to retrieve the value of dentry->d_name.
      
      It is nice if we can see the protocol name with ls -l
      /proc/PID/fd. However, "socket:[#INODE]", the name format returned
      from sockfs_dname() was already defined. To keep the compatibility
      between kernel and user land, the extended attribute is used to
      prepare the value of dentry->d_name.
      Signed-off-by: NMasatake YAMATO <yamato@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      600e1779
  2. 16 8月, 2012 1 次提交
  3. 23 7月, 2012 1 次提交
    • J
      net: netprio_cgroup: rework update socket logic · 406a3c63
      John Fastabend 提交于
      Instead of updating the sk_cgrp_prioidx struct field on every send
      this only updates the field when a task is moved via cgroup
      infrastructure.
      
      This allows sockets that may be used by a kernel worker thread
      to be managed. For example in the iscsi case today a user can
      put iscsid in a netprio cgroup and control traffic will be sent
      with the correct sk_cgrp_prioidx value set but as soon as data
      is sent the kernel worker thread isssues a send and sk_cgrp_prioidx
      is updated with the kernel worker threads value which is the
      default case.
      
      It seems more correct to only update the field when the user
      explicitly sets it via control group infrastructure. This allows
      the users to manage sockets that may be used with other threads.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      406a3c63
  4. 21 7月, 2012 1 次提交
    • M
      tun: fix a crash bug and a memory leak · b09e786b
      Mikulas Patocka 提交于
      This patch fixes a crash
      tun_chr_close -> netdev_run_todo -> tun_free_netdev -> sk_release_kernel ->
      sock_release -> iput(SOCK_INODE(sock))
      introduced by commit 1ab5ecb9
      
      The problem is that this socket is embedded in struct tun_struct, it has
      no inode, iput is called on invalid inode, which modifies invalid memory
      and optionally causes a crash.
      
      sock_release also decrements sockets_in_use, this causes a bug that
      "sockets: used" field in /proc/*/net/sockstat keeps on decreasing when
      creating and closing tun devices.
      
      This patch introduces a flag SOCK_EXTERNALLY_ALLOCATED that instructs
      sock_release to not free the inode and not decrement sockets_in_use,
      fixing both memory corruption and sockets_in_use underflow.
      
      It should be backported to 3.3 an 3.4 stabke.
      Signed-off-by: NMikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
      Cc: stable@kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b09e786b
  5. 16 5月, 2012 1 次提交
  6. 15 5月, 2012 1 次提交
  7. 22 4月, 2012 1 次提交
  8. 21 4月, 2012 1 次提交
  9. 16 4月, 2012 1 次提交
  10. 06 4月, 2012 1 次提交
    • E
      tcp: tcp_sendpages() should call tcp_push() once · 35f9c09f
      Eric Dumazet 提交于
      commit 2f533844 (tcp: allow splice() to build full TSO packets) added
      a regression for splice() calls using SPLICE_F_MORE.
      
      We need to call tcp_flush() at the end of the last page processed in
      tcp_sendpages(), or else transmits can be deferred and future sends
      stall.
      
      Add a new internal flag, MSG_SENDPAGE_NOTLAST, acting like MSG_MORE, but
      with different semantic.
      
      For all sendpage() providers, its a transparent change. Only
      sock_sendpage() and tcp_sendpages() can differentiate the two different
      flags provided by pipe_to_sendpage()
      Reported-by: NTom Herbert <therbert@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: H.K. Jerry Chu <hkchu@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail&gt;com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35f9c09f
  11. 12 3月, 2012 1 次提交
  12. 21 2月, 2012 1 次提交
  13. 13 1月, 2012 1 次提交
  14. 06 1月, 2012 1 次提交
    • L
      vfs: fix up ENOIOCTLCMD error handling · 07d106d0
      Linus Torvalds 提交于
      We're doing some odd things there, which already messes up various users
      (see the net/socket.c code that this removes), and it was going to add
      yet more crud to the block layer because of the incorrect error code
      translation.
      
      ENOIOCTLCMD is not an error return that should be returned to user mode
      from the "ioctl()" system call, but it should *not* be translated as
      EINVAL ("Invalid argument").  It should be translated as ENOTTY
      ("Inappropriate ioctl for device").
      
      That EINVAL confusion has apparently so permeated some code that the
      block layer actually checks for it, which is sad.  We continue to do so
      for now, but add a big comment about how wrong that is, and we should
      remove it entirely eventually.  In the meantime, this tries to keep the
      changes localized to just the EINVAL -> ENOTTY fix, and removing code
      that makes it harder to do the right thing.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07d106d0
  15. 05 1月, 2012 1 次提交
  16. 23 11月, 2011 1 次提交
    • N
      net: add network priority cgroup infrastructure (v4) · 5bc1421e
      Neil Horman 提交于
      This patch adds in the infrastructure code to create the network priority
      cgroup.  The cgroup, in addition to the standard processes file creates two
      control files:
      
      1) prioidx - This is a read-only file that exports the index of this cgroup.
      This is a value that is both arbitrary and unique to a cgroup in this subsystem,
      and is used to index the per-device priority map
      
      2) priomap - This is a writeable file.  On read it reports a table of 2-tuples
      <name:priority> where name is the name of a network interface and priority is
      indicates the priority assigned to frames egresessing on the named interface and
      originating from a pid in this cgroup
      
      This cgroup allows for skb priority to be set prior to a root qdisc getting
      selected. This is benenficial for DCB enabled systems, in that it allows for any
      application to use dcb configured priorities so without application modification
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      CC: Robert Love <robert.w.love@intel.com>
      CC: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5bc1421e
  17. 10 11月, 2011 1 次提交
    • J
      net: add wireless TX status socket option · 6e3e939f
      Johannes Berg 提交于
      The 802.1X EAPOL handshake hostapd does requires
      knowing whether the frame was ack'ed by the peer.
      Currently, we fudge this pretty badly by not even
      transmitting the frame as a normal data frame but
      injecting it with radiotap and getting the status
      out of radiotap monitor as well. This is rather
      complex, confuses users (mon.wlan0 presence) and
      doesn't work with all hardware.
      
      To get rid of that hack, introduce a real wifi TX
      status option for data frame transmissions.
      
      This works similar to the existing TX timestamping
      in that it reflects the SKB back to the socket's
      error queue with a SCM_WIFI_STATUS cmsg that has
      an int indicating ACK status (0/1).
      
      Since it is possible that at some point we will
      want to have TX timestamping and wifi status in a
      single errqueue SKB (there's little point in not
      doing that), redefine SO_EE_ORIGIN_TIMESTAMPING
      to SO_EE_ORIGIN_TXSTATUS which can collect more
      than just the timestamp; keep the old constant
      as an alias of course. Currently the internal APIs
      don't make that possible, but it wouldn't be hard
      to split them up in a way that makes it possible.
      
      Thanks to Neil Horman for helping me figure out
      the functions that add the control messages.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
      6e3e939f
  18. 25 8月, 2011 1 次提交
    • M
      sendmmsg/sendmsg: fix unsafe user pointer access · bc909d9d
      Mathieu Desnoyers 提交于
      Dereferencing a user pointer directly from kernel-space without going
      through the copy_from_user family of functions is a bad idea. Two of
      such usages can be found in the sendmsg code path called from sendmmsg,
      added by
      
      commit c71d8ebe upstream.
      commit 5b47b8038f183b44d2d8ff1c7d11a5c1be706b34 in the 3.0-stable tree.
      
      Usages are performed through memcmp() and memcpy() directly. Fix those
      by using the already copied msg_sys structure instead of the __user *msg
      structure. Note that msg_sys can be set to NULL by verify_compat_iovec()
      or verify_iovec(), which requires additional NULL pointer checks.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NDavid Goulet <dgoulet@ev0ke.net>
      CC: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      CC: Anton Blanchard <anton@samba.org>
      CC: David S. Miller <davem@davemloft.net>
      CC: stable <stable@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bc909d9d
  19. 05 8月, 2011 3 次提交
    • T
      net: Fix security_socket_sendmsg() bypass problem. · c71d8ebe
      Tetsuo Handa 提交于
      The sendmmsg() introduced by commit 228e548e "net: Add sendmmsg socket system
      call" is capable of sending to multiple different destination addresses.
      
      SMACK is using destination's address for checking sendmsg() permission.
      However, security_socket_sendmsg() is called for only once even if multiple
      different destination addresses are passed to sendmmsg().
      
      Therefore, we need to call security_socket_sendmsg() for each destination
      address rather than only the first destination address.
      
      Since calling security_socket_sendmsg() every time when only single destination
      address was passed to sendmmsg() is a waste of time, omit calling
      security_socket_sendmsg() unless destination address of previous datagram and
      that of current datagram differs.
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NAnton Blanchard <anton@samba.org>
      Cc: stable <stable@kernel.org> [3.0+]
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c71d8ebe
    • A
      net: Cap number of elements for sendmmsg · 98382f41
      Anton Blanchard 提交于
      To limit the amount of time we can spend in sendmmsg, cap the
      number of elements to UIO_MAXIOV (currently 1024).
      
      For error handling an application using sendmmsg needs to retry at
      the first unsent message, so capping is simpler and requires less
      application logic than returning EINVAL.
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Cc: stable <stable@kernel.org> [3.0+]
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      98382f41
    • A
      net: sendmmsg should only return an error if no messages were sent · 728ffb86
      Anton Blanchard 提交于
      sendmmsg uses a similar error return strategy as recvmmsg but it
      turns out to be a confusing way to communicate errors.
      
      The current code stores the error code away and returns it on the next
      sendmmsg call. This means a call with completely valid arguments could
      get an error from a previous call.
      
      Change things so we only return an error if no datagrams could be sent.
      If less than the requested number of messages were sent, the application
      must retry starting at the first failed one and if the problem is
      persistent the error will be returned.
      
      This matches the behaviour of other syscalls like read/write - it
      is not an error if less than the requested number of elements are sent.
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Cc: stable <stable@kernel.org> [3.0+]
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      728ffb86
  20. 02 8月, 2011 1 次提交
  21. 28 7月, 2011 1 次提交
  22. 27 7月, 2011 1 次提交
  23. 18 5月, 2011 1 次提交
  24. 08 5月, 2011 1 次提交
  25. 06 5月, 2011 1 次提交
    • A
      net: Add sendmmsg socket system call · 228e548e
      Anton Blanchard 提交于
      This patch adds a multiple message send syscall and is the send
      version of the existing recvmmsg syscall. This is heavily
      based on the patch by Arnaldo that added recvmmsg.
      
      I wrote a microbenchmark to test the performance gains of using
      this new syscall:
      
      http://ozlabs.org/~anton/junkcode/sendmmsg_test.c
      
      The test was run on a ppc64 box with a 10 Gbit network card. The
      benchmark can send both UDP and RAW ethernet packets.
      
      64B UDP
      
      batch   pkts/sec
      1       804570
      2       872800 (+ 8 %)
      4       916556 (+14 %)
      8       939712 (+17 %)
      16      952688 (+18 %)
      32      956448 (+19 %)
      64      964800 (+20 %)
      
      64B raw socket
      
      batch   pkts/sec
      1       1201449
      2       1350028 (+12 %)
      4       1461416 (+22 %)
      8       1513080 (+26 %)
      16      1541216 (+28 %)
      32      1553440 (+29 %)
      64      1557888 (+30 %)
      
      We see a 20% improvement in throughput on UDP send and 30%
      on raw socket send.
      
      [ Add sparc syscall entries. -DaveM ]
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      228e548e
  26. 12 4月, 2011 1 次提交
  27. 31 3月, 2011 1 次提交
  28. 19 3月, 2011 1 次提交
  29. 24 2月, 2011 1 次提交
  30. 23 2月, 2011 1 次提交
  31. 01 2月, 2011 2 次提交
    • G
      Revert "appletalk: move to staging" · 0ffbf8bf
      Greg Kroah-Hartman 提交于
      This reverts commit a6238f21
      
      Appletalk got some patches to fix up the BLK usage in it in the
      network tree, so this removal isn't needed.
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: <acme@ghostprotocols.net>
      Cc: netdev@vger.kernel.org,
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      0ffbf8bf
    • A
      appletalk: move to staging · a6238f21
      Arnd Bergmann 提交于
      For all I know, Appletalk is dead, the only reasonable
      use right now would be nostalgia, and that can be served
      well enough by old kernels. The code is largely not
      in a bad shape, but it still uses the big kernel lock,
      and nobody seems motivated to change that.
      
      FWIW, the last release of MacOS that supported Appletalk
      was MacOS X 10.5, made in 2007, and it has been abandoned
      by Apple with 10.6. Using TCP/IP instead of Appletalk has
      been supported since MacOS 7.6, which was released in
      1997 and is able to run on most of the legacy hardware.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: netdev@vger.kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      a6238f21
  32. 13 1月, 2011 1 次提交
  33. 07 1月, 2011 5 次提交
    • N
      fs: scale mntget/mntput · b3e19d92
      Nick Piggin 提交于
      The problem that this patch aims to fix is vfsmount refcounting scalability.
      We need to take a reference on the vfsmount for every successful path lookup,
      which often go to the same mount point.
      
      The fundamental difficulty is that a "simple" reference count can never be made
      scalable, because any time a reference is dropped, we must check whether that
      was the last reference. To do that requires communication with all other CPUs
      that may have taken a reference count.
      
      We can make refcounts more scalable in a couple of ways, involving keeping
      distributed counters, and checking for the global-zero condition less
      frequently.
      
      - check the global sum once every interval (this will delay zero detection
        for some interval, so it's probably a showstopper for vfsmounts).
      
      - keep a local count and only taking the global sum when local reaches 0 (this
        is difficult for vfsmounts, because we can't hold preempt off for the life of
        a reference, so a counter would need to be per-thread or tied strongly to a
        particular CPU which requires more locking).
      
      - keep a local difference of increments and decrements, which allows us to sum
        the total difference and hence find the refcount when summing all CPUs. Then,
        keep a single integer "long" refcount for slow and long lasting references,
        and only take the global sum of local counters when the long refcount is 0.
      
      This last scheme is what I implemented here. Attached mounts and process root
      and working directory references are "long" references, and everything else is
      a short reference.
      
      This allows scalable vfsmount references during path walking over mounted
      subtrees and unattached (lazy umounted) mounts with processes still running
      in them.
      
      This results in one fewer atomic op in the fastpath: mntget is now just a
      per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
      and non-atomic decrement in the common case. However code is otherwise bigger
      and heavier, so single threaded performance is basically a wash.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b3e19d92
    • N
      fs: improve scalability of pseudo filesystems · 4b936885
      Nick Piggin 提交于
      Regardless of how much we possibly try to scale dcache, there is likely
      always going to be some fundamental contention when adding or removing children
      under the same parent. Pseudo filesystems do not seem need to have connected
      dentries because by definition they are disconnected.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      4b936885
    • N
      fs: dcache reduce branches in lookup path · fb045adb
      Nick Piggin 提交于
      Reduce some branches and memory accesses in dcache lookup by adding dentry
      flags to indicate common d_ops are set, rather than having to check them.
      This saves a pointer memory access (dentry->d_op) in common path lookup
      situations, and saves another pointer load and branch in cases where we
      have d_op but not the particular operation.
      
      Patched with:
      
      git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fb045adb
    • N
      fs: avoid inode RCU freeing for pseudo fs · ff0c7d15
      Nick Piggin 提交于
      Pseudo filesystems that don't put inode on RCU list or reachable by
      rcu-walk dentries do not need to RCU free their inodes.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      ff0c7d15
    • N
      fs: icache RCU free inodes · fa0d7e3d
      Nick Piggin 提交于
      RCU free the struct inode. This will allow:
      
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      
      The downsides of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      
      In cases where inode lifetimes are longer (ie. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fa0d7e3d