1. 20 6月, 2012 2 次提交
    • D
      ipv4: Early TCP socket demux. · 41063e9d
      David S. Miller 提交于
      Input packet processing for local sockets involves two major demuxes.
      One for the route and one for the socket.
      
      But we can optimize this down to one demux for certain kinds of local
      sockets.
      
      Currently we only do this for established TCP sockets, but it could
      at least in theory be expanded to other kinds of connections.
      
      If a TCP socket is established then it's identity is fully specified.
      
      This means that whatever input route was used during the three-way
      handshake must work equally well for the rest of the connection since
      the keys will not change.
      
      Once we move to established state, we cache the receive packet's input
      route to use later.
      
      Like the existing cached route in sk->sk_dst_cache used for output
      packets, we have to check for route invalidations using dst->obsolete
      and dst->ops->check().
      
      Early demux occurs outside of a socket locked section, so when a route
      invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
      actually inside of established state packet processing and thus have
      the socket locked.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      41063e9d
    • D
      inet: Sanitize inet{,6} protocol demux. · f9242b6b
      David S. Miller 提交于
      Don't pretend that inet_protos[] and inet6_protos[] are hashes, thay
      are just a straight arrays.  Remove all unnecessary hash masking.
      
      Document MAX_INET_PROTOS.
      
      Use RAW_HTABLE_SIZE when appropriate.
      Reported-by: NBen Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f9242b6b
  2. 04 6月, 2012 1 次提交
    • J
      net: Remove casts to same type · e3192690
      Joe Perches 提交于
      Adding casts of objects to the same type is unnecessary
      and confusing for a human reader.
      
      For example, this cast:
      
      	int y;
      	int *p = (int *)&y;
      
      I used the coccinelle script below to find and remove these
      unnecessary casts.  I manually removed the conversions this
      script produces of casts with __force and __user.
      
      @@
      type T;
      T *p;
      @@
      
      -	(T *)p
      +	p
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e3192690
  3. 22 4月, 2012 1 次提交
  4. 16 4月, 2012 1 次提交
  5. 29 3月, 2012 1 次提交
  6. 13 3月, 2012 1 次提交
  7. 12 3月, 2012 1 次提交
    • J
      net: Convert printks to pr_<level> · 058bd4d2
      Joe Perches 提交于
      Use a more current kernel messaging style.
      
      Convert a printk block to print_hex_dump.
      Coalesce formats, align arguments.
      Use %s, __func__ instead of embedding function names.
      
      Some messages that were prefixed with <foo>_close are
      now prefixed with <foo>_fini.  Some ah4 and esp messages
      are now not prefixed with "ip ".
      
      The intent of this patch is to later add something like
        #define pr_fmt(fmt) "IPv4: " fmt.
      to standardize the output messages.
      
      Text size is trivially reduced. (x86-32 allyesconfig)
      
      $ size net/ipv4/built-in.o*
         text	   data	    bss	    dec	    hex	filename
       887888	  31558	 249696	1169142	 11d6f6	net/ipv4/built-in.o.new
       887934	  31558	 249800	1169292	 11d78c	net/ipv4/built-in.o.old
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      058bd4d2
  8. 13 2月, 2012 1 次提交
    • J
      net: implement IP_RECVTOS for IP_PKTOPTIONS · 4c507d28
      Jiri Benc 提交于
      Currently, it is not easily possible to get TOS/DSCP value of packets from
      an incoming TCP stream. The mechanism is there, IP_PKTOPTIONS getsockopt
      with IP_RECVTOS set, the same way as incoming TTL can be queried. This is
      not actually implemented for TOS, though.
      
      This patch adds this functionality, both for IPv4 (IP_PKTOPTIONS) and IPv6
      (IPV6_2292PKTOPTIONS). For IPv4, like in the IP_RECVTTL case, the value of
      the TOS field is stored from the other party's ACK.
      
      This is needed for proxies which require DSCP transparency. One such example
      is at http://zph.bratcheda.org/.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4c507d28
  9. 13 12月, 2011 1 次提交
  10. 17 11月, 2011 1 次提交
  11. 10 11月, 2011 1 次提交
    • E
      ipv4: reduce percpu needs for icmpmsg mibs · acb32ba3
      Eric Dumazet 提交于
      Reading /proc/net/snmp on a machine with a lot of cpus is very expensive
      (can be ~88000 us).
      
      This is because ICMPMSG MIB uses 4096 bytes per cpu, and folding values
      for all possible cpus can read 16 Mbytes of memory.
      
      ICMP messages are not considered as fast path on a typical server, and
      eventually few cpus handle them anyway. We can afford an atomic
      operation instead of using percpu data.
      
      This saves 4096 bytes per cpu and per network namespace.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      acb32ba3
  12. 20 10月, 2011 1 次提交
  13. 31 8月, 2011 1 次提交
  14. 05 7月, 2011 1 次提交
    • M
      net: bind() fix error return on wrong address family · c349a528
      Marcus Meissner 提交于
      Hi,
      
      Reinhard Max also pointed out that the error should EAFNOSUPPORT according
      to POSIX.
      
      The Linux manpages have it as EINVAL, some other OSes (Minix, HPUX, perhaps BSD) use
      EAFNOSUPPORT. Windows uses WSAEFAULT according to MSDN.
      
      Other protocols error values in their af bind() methods in current mainline git as far
      as a brief look shows:
      	EAFNOSUPPORT: atm, appletalk, l2tp, llc, phonet, rxrpc
      	EINVAL: ax25, bluetooth, decnet, econet, ieee802154, iucv, netlink, netrom, packet, rds, rose, unix, x25,
      	No check?: can/raw, ipv6/raw, irda, l2tp/l2tp_ip
      
      Ciao, Marcus
      Signed-off-by: NMarcus Meissner <meissner@suse.de>
      Cc: Reinhard Max <max@suse.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c349a528
  15. 18 6月, 2011 1 次提交
    • E
      net: rfs: enable RFS before first data packet is received · 1eddcead
      Eric Dumazet 提交于
      Le jeudi 16 juin 2011 à 23:38 -0400, David Miller a écrit :
      > From: Ben Hutchings <bhutchings@solarflare.com>
      > Date: Fri, 17 Jun 2011 00:50:46 +0100
      >
      > > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote:
      > >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
      > >>  			goto discard;
      > >>
      > >>  		if (nsk != sk) {
      > >> +			sock_rps_save_rxhash(nsk, skb->rxhash);
      > >>  			if (tcp_child_process(sk, nsk, skb)) {
      > >>  				rsk = nsk;
      > >>  				goto reset;
      > >>
      > >
      > > I haven't tried this, but it looks reasonable to me.
      > >
      > > What about IPv6?  The logic in tcp_v6_do_rcv() looks very similar.
      >
      > Indeed ipv6 side needs the same fix.
      >
      > Eric please add that part and resubmit.  And in fact I might stick
      > this into net-2.6 instead of net-next-2.6
      >
      
      OK, here is the net-2.6 based one then, thanks !
      
      [PATCH v2] net: rfs: enable RFS before first data packet is received
      
      First packet received on a passive tcp flow is not correctly RFS
      steered.
      
      One sock_rps_record_flow() call is missing in inet_accept()
      
      But before that, we also must record rxhash when child socket is setup.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Tom Herbert <therbert@google.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      CC: Jamal Hadi Salim <hadi@cyberus.ca>
      Signed-off-by: NDavid S. Miller <davem@conan.davemloft.net>
      1eddcead
  16. 12 6月, 2011 1 次提交
    • E
      snmp: reduce percpu needs by 50% · 8f0ea0fe
      Eric Dumazet 提交于
      SNMP mibs use two percpu arrays, one used in BH context, another in USER
      context. With increasing number of cpus in machines, and fact that ipv6
      uses per network device ipstats_mib, this is consuming a lot of memory
      if many network devices are registered.
      
      commit be281e55 (ipv6: reduce per device ICMP mib sizes) shrinked
      percpu needs for ipv6, but we can reduce memory use a bit more.
      
      With recent percpu infrastructure (irqsafe_cpu_inc() ...), we no longer
      need this BH/USER separation since we can update counters in a single
      x86 instruction, regardless of the BH/USER context.
      
      Other arches than x86 might need to disable irq in their
      irqsafe_cpu_inc() implementation : If this happens to be a problem, we
      can make SNMP_ARRAY_SZ arch dependent, but a previous poll
      ( https://lkml.org/lkml/2011/3/17/174 ) to arch maintainers did not
      raise strong opposition.
      
      Only on 32bit arches, we need to disable BH for 64bit counters updates
      done from USER context (currently used for IP MIB)
      
      This also reduces vmlinux size :
      
      1) x86_64 build
      $ size vmlinux.before vmlinux.after
         text	   data	    bss	    dec	    hex	filename
      7853650	1293772	1896448	11043870	 a8841e	vmlinux.before
      7850578	1293772	1896448	11040798	 a8781e	vmlinux.after
      
      2) i386  build
      $ size vmlinux.before vmlinux.afterpatch
         text	   data	    bss	    dec	    hex	filename
      6039335	 635076	3670016	10344427	 9dd7eb	vmlinux.before
      6037342	 635076	3670016	10342434	 9dd022	vmlinux.afterpatch
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Andi Kleen <andi@firstfloor.org>
      CC: Ingo Molnar <mingo@elte.hu>
      CC: Tejun Heo <tj@kernel.org>
      CC: Christoph Lameter <cl@linux-foundation.org>
      CC: Benjamin Herrenschmidt <benh@kernel.crashing.org
      CC: linux-arch@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8f0ea0fe
  17. 02 6月, 2011 1 次提交
  18. 14 5月, 2011 1 次提交
    • V
      net: ipv4: add IPPROTO_ICMP socket kind · c319b4d7
      Vasiliy Kulikov 提交于
      This patch adds IPPROTO_ICMP socket kind.  It makes it possible to send
      ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
      without any special privileges.  In other words, the patch makes it
      possible to implement setuid-less and CAP_NET_RAW-less /bin/ping.  In
      order not to increase the kernel's attack surface, the new functionality
      is disabled by default, but is enabled at bootup by supporting Linux
      distributions, optionally with restriction to a group or a group range
      (see below).
      
      Similar functionality is implemented in Mac OS X:
      http://www.manpagez.com/man/4/icmp/
      
      A new ping socket is created with
      
          socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
      
      Message identifiers (octets 4-5 of ICMP header) are interpreted as local
      ports. Addresses are stored in struct sockaddr_in. No port numbers are
      reserved for privileged processes, port 0 is reserved for API ("let the
      kernel pick a free number"). There is no notion of remote ports, remote
      port numbers provided by the user (e.g. in connect()) are ignored.
      
      Data sent and received include ICMP headers. This is deliberate to:
      1) Avoid the need to transport headers values like sequence numbers by
      other means.
      2) Make it easier to port existing programs using raw sockets.
      
      ICMP headers given to send() are checked and sanitized. The type must be
      ICMP_ECHO and the code must be zero (future extensions might relax this,
      see below). The id is set to the number (local port) of the socket, the
      checksum is always recomputed.
      
      ICMP reply packets received from the network are demultiplexed according
      to their id's, and are returned by recv() without any modifications.
      IP header information and ICMP errors of those packets may be obtained
      via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
      quenches and redirects are reported as fake errors via the error queue
      (IP_RECVERR); the next hop address for redirects is saved to ee_info (in
      network order).
      
      socket(2) is restricted to the group range specified in
      "/proc/sys/net/ipv4/ping_group_range".  It is "1 0" by default, meaning
      that nobody (not even root) may create ping sockets.  Setting it to "100
      100" would grant permissions to the single group (to either make
      /sbin/ping g+s and owned by this group or to grant permissions to the
      "netadmins" group), "0 4294967295" would enable it for the world, "100
      4294967295" would enable it for the users, but not daemons.
      
      The existing code might be (in the unlikely case anyone needs it)
      extended rather easily to handle other similar pairs of ICMP messages
      (Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
      etc.).
      
      Userspace ping util & patch for it:
      http://openwall.info/wiki/people/segoon/ping
      
      For Openwall GNU/*/Linux it was the last step on the road to the
      setuid-less distro.  A revision of this patch (for RHEL5/OpenVZ kernels)
      is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
      http://mirrors.kernel.org/openwall/Owl/current/iso/
      
      Initially this functionality was written by Pavel Kankovsky for
      Linux 2.4.32, but unfortunately it was never made public.
      
      All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
      the patch.
      
      PATCH v3:
          - switched to flowi4.
          - minor changes to be consistent with raw sockets code.
      
      PATCH v2:
          - changed ping_debug() to pr_debug().
          - removed CONFIG_IP_PING.
          - removed ping_seq_fops.owner field (unused for procfs).
          - switched to proc_net_fops_create().
          - switched to %pK in seq_printf().
      
      PATCH v1:
          - fixed checksumming bug.
          - CAP_NET_RAW may not create icmp sockets anymore.
      
      RFC v2:
          - minor cleanups.
          - introduced sysctl'able group range to restrict socket(2).
      Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c319b4d7
  19. 09 5月, 2011 1 次提交
  20. 04 5月, 2011 1 次提交
  21. 29 4月, 2011 2 次提交
    • D
      ipv4: Fetch route saddr from flow key in inet_sk_reselect_saddr(). · b8831877
      David S. Miller 提交于
      Now that output route lookups update the flow with
      source address selection, we can fetch it from
      fl4->saddr instead of rt->rt_src
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b8831877
    • E
      inet: add RCU protection to inet->opt · f6d8bd05
      Eric Dumazet 提交于
      We lack proper synchronization to manipulate inet->opt ip_options
      
      Problem is ip_make_skb() calls ip_setup_cork() and
      ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
      without any protection against another thread manipulating inet->opt.
      
      Another thread can change inet->opt pointer and free old one under us.
      
      Use RCU to protect inet->opt (changed to inet->inet_opt).
      
      Instead of handling atomic refcounts, just copy ip_options when
      necessary, to avoid cache line dirtying.
      
      We cant insert an rcu_head in struct ip_options since its included in
      skb->cb[], so this patch is large because I had to introduce a new
      ip_options_rcu structure.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6d8bd05
  22. 28 4月, 2011 1 次提交
    • D
      ipv4: Sanitize and simplify ip_route_{connect,newports}() · 2d7192d6
      David S. Miller 提交于
      These functions are used together as a unit for route resolution
      during connect().  They address the chicken-and-egg problem that
      exists when ports need to be allocated during connect() processing,
      yet such port allocations require addressing information from the
      routing code.
      
      It's currently more heavy handed than it needs to be, and in
      particular we allocate and initialize a flow object twice.
      
      Let the callers provide the on-stack flow object.  That way we only
      need to initialize it once in the ip_route_connect() call.
      
      Later, if ip_route_newports() needs to do anything, it re-uses that
      flow object as-is except for the ports which it updates before the
      route re-lookup.
      
      Also, describe why this set of facilities are needed and how it works
      in a big comment.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Reviewed-by: NEric Dumazet <eric.dumazet@gmail.com>
      2d7192d6
  23. 23 4月, 2011 1 次提交
  24. 13 3月, 2011 1 次提交
  25. 03 3月, 2011 1 次提交
  26. 02 3月, 2011 3 次提交
  27. 30 1月, 2011 1 次提交
    • E
      net: Add compat ioctl support for the ipv4 multicast ioctl SIOCGETSGCNT · 709b46e8
      Eric W. Biederman 提交于
      SIOCGETSGCNT is not a unique ioctl value as it it maps tio SIOCPROTOPRIVATE +1,
      which unfortunately means the existing infrastructure for compat networking
      ioctls is insufficient.  A trivial compact ioctl implementation would conflict
      with:
      
      SIOCAX25ADDUID
      SIOCAIPXPRISLT
      SIOCGETSGCNT_IN6
      SIOCGETSGCNT
      SIOCRSSCAUSE
      SIOCX25SSUBSCRIP
      SIOCX25SDTEFACILITIES
      
      To make this work I have updated the compat_ioctl decode path to mirror the
      the normal ioctl decode path.  I have added an ipv4 inet_compat_ioctl function
      so that I can have ipv4 specific compat ioctls.   I have added a compat_ioctl
      function into struct proto so I can break out ioctls by which kind of ip socket
      I am using.  I have added a compat_raw_ioctl function because SIOCGETSGCNT only
      works on raw sockets.  I have added a ipmr_compat_ioctl that mirrors the normal
      ipmr_ioctl.
      
      This was necessary because unfortunately the struct layout for the SIOCGETSGCNT
      has unsigned longs in it so changes between 32bit and 64bit kernels.
      
      This change was sufficient to run a 32bit ip multicast routing daemon on a
      64bit kernel.
      Reported-by: NBill Fenner <fenner@aristanetworks.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      709b46e8
  28. 25 1月, 2011 1 次提交
  29. 18 11月, 2010 1 次提交
  30. 20 8月, 2010 1 次提交
  31. 13 7月, 2010 1 次提交
  32. 01 7月, 2010 1 次提交
    • E
      snmp: 64bit ipstats_mib for all arches · 4ce3c183
      Eric Dumazet 提交于
      /proc/net/snmp and /proc/net/netstat expose SNMP counters.
      
      Width of these counters is either 32 or 64 bits, depending on the size
      of "unsigned long" in kernel.
      
      This means user program parsing these files must already be prepared to
      deal with 64bit values, regardless of user program being 32 or 64 bit.
      
      This patch introduces 64bit snmp values for IPSTAT mib, where some
      counters can wrap pretty fast if they are 32bit wide.
      
      # netstat -s|egrep "InOctets|OutOctets"
          InOctets: 244068329096
          OutOctets: 244069348848
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4ce3c183
  33. 26 6月, 2010 1 次提交
    • E
      snmp: add align parameter to snmp_mib_init() · 1823e4c8
      Eric Dumazet 提交于
      In preparation for 64bit snmp counters for some mibs,
      add an 'align' parameter to snmp_mib_init(), instead
      of assuming mibs only contain 'unsigned long' fields.
      
      Callers can use __alignof__(type) to provide correct
      alignment.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Herbert Xu <herbert@gondor.apana.org.au>
      CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Vlad Yasevich <vladislav.yasevich@hp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1823e4c8
  34. 24 6月, 2010 1 次提交
  35. 11 6月, 2010 1 次提交
  36. 16 5月, 2010 1 次提交