1. 07 6月, 2015 1 次提交
    • E
      inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations · 90c337da
      Eric Dumazet 提交于
      When an application needs to force a source IP on an active TCP socket
      it has to use bind(IP, port=x).
      
      As most applications do not want to deal with already used ports, x is
      often set to 0, meaning the kernel is in charge to find an available
      port.
      But kernel does not know yet if this socket is going to be a listener or
      be connected.
      It has very limited choices (no full knowledge of final 4-tuple for a
      connect())
      
      With limited ephemeral port range (about 32K ports), it is very easy to
      fill the space.
      
      This patch adds a new SOL_IP socket option, asking kernel to ignore
      the 0 port provided by application in bind(IP, port=0) and only
      remember the given IP address.
      
      The port will be automatically chosen at connect() time, in a way
      that allows sharing a source port as long as the 4-tuples are unique.
      
      This new feature is available for both IPv4 and IPv6 (Thanks Neal)
      
      Tested:
      
      Wrote a test program and checked its behavior on IPv4 and IPv6.
      
      strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
      connect().
      Also getsockname() show that the port is still 0 right after bind()
      but properly allocated after connect().
      
      socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
      setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
      bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
      getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
      connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
      getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
      
      IPv6 test :
      
      socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
      setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
      bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
      getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
      connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
      getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
      
      I was able to bind()/connect() a million concurrent IPv4 sockets,
      instead of ~32000 before patch.
      
      lpaa23:~# ulimit -n 1000010
      lpaa23:~# ./bind --connect --num-flows=1000000 &
      1000000 sockets
      
      lpaa23:~# grep TCP /proc/net/sockstat
      TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66
      
      Check that a given source port is indeed used by many different
      connections :
      
      lpaa23:~# ss -t src :40000 | head -10
      State      Recv-Q Send-Q   Local Address:Port          Peer Address:Port
      ESTAB      0      0           127.0.0.2:40000         127.0.202.33:44983
      ESTAB      0      0           127.0.0.2:40000         127.2.27.240:44983
      ESTAB      0      0           127.0.0.2:40000           127.2.98.5:44983
      ESTAB      0      0           127.0.0.2:40000        127.0.124.196:44983
      ESTAB      0      0           127.0.0.2:40000         127.2.139.38:44983
      ESTAB      0      0           127.0.0.2:40000          127.1.59.80:44983
      ESTAB      0      0           127.0.0.2:40000          127.3.6.228:44983
      ESTAB      0      0           127.0.0.2:40000          127.0.38.53:44983
      ESTAB      0      0           127.0.0.2:40000         127.1.197.10:44983
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      90c337da
  2. 05 6月, 2015 1 次提交
  3. 06 1月, 2015 1 次提交
  4. 27 2月, 2014 1 次提交
    • H
      ipv4: yet another new IP_MTU_DISCOVER option IP_PMTUDISC_OMIT · 1b346576
      Hannes Frederic Sowa 提交于
      IP_PMTUDISC_INTERFACE has a design error: because it does not allow the
      generation of fragments if the interface mtu is exceeded, it is very
      hard to make use of this option in already deployed name server software
      for which I introduced this option.
      
      This patch adds yet another new IP_MTU_DISCOVER option to not honor any
      path mtu information and not accepting new icmp notifications destined for
      the socket this option is enabled on. But we allow outgoing fragmentation
      in case the packet size exceeds the outgoing interface mtu.
      
      As such this new option can be used as a drop-in replacement for
      IP_PMTUDISC_DONT, which is currently in use by most name server software
      making the adoption of this option very smooth and easy.
      
      The original advantage of IP_PMTUDISC_INTERFACE is still maintained:
      ignoring incoming path MTU updates and not honoring discovered path MTUs
      in the output path.
      
      Fixes: 482fc609 ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE")
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b346576
  5. 06 11月, 2013 1 次提交
    • H
      ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE · 482fc609
      Hannes Frederic Sowa 提交于
      Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery,
      their sockets won't accept and install new path mtu information and they
      will always use the interface mtu for outgoing packets. It is guaranteed
      that the packet is not fragmented locally. But we won't set the DF-Flag
      on the outgoing frames.
      
      Florian Weimer had the idea to use this flag to ensure DNS servers are
      never generating outgoing fragments. They may well be fragmented on the
      path, but the server never stores or usees path mtu values, which could
      well be forged in an attack.
      
      (The root of the problem with path MTU discovery is that there is
      no reliable way to authenticate ICMP Fragmentation Needed But DF Set
      messages because they are sent from intermediate routers with their
      source addresses, and the IMCP payload will not always contain sufficient
      information to identify a flow.)
      
      Recent research in the DNS community showed that it is possible to
      implement an attack where DNS cache poisoning is feasible by spoofing
      fragments. This work was done by Amir Herzberg and Haya Shulman:
      <https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf>
      
      This issue was previously discussed among the DNS community, e.g.
      <http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>,
      without leading to fixes.
      
      This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode
      regarding local fragmentation with UFO/CORK" for the enforcement of the
      non-fragmentable checks. If other users than ip_append_page/data should
      use this semantic too, we have to add a new flag to IPCB(skb)->flags to
      suppress local fragmentation and check for this in ip_finish_output.
      
      Many thanks to Florian Weimer for the idea and feedback while implementing
      this patch.
      
      Cc: David S. Miller <davem@davemloft.net>
      Suggested-by: NFlorian Weimer <fweimer@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      482fc609
  6. 05 9月, 2013 1 次提交
    • C
      net: sync some IP headers with glibc · cfd280c9
      Carlos O'Donell 提交于
      Solution:
      =========
      
      - Synchronize linux's `include/uapi/linux/in6.h'
        with glibc's `inet/netinet/in.h'.
      - Synchronize glibc's `inet/netinet/in.h with linux's
        `include/uapi/linux/in6.h'.
      - Allow including the headers in either other.
      - First header included defines the structures and macros.
      
      Details:
      ========
      
      The kernel promises not to break the UAPI ABI so I don't
      see why we can't just have the two userspace headers
      coordinate?
      
      If you include the kernel headers first you get those,
      and if you include the glibc headers first you get those,
      and the following patch arranges a coordination and
      synchronization between the two.
      
      Let's handle `include/uapi/linux/in6.h' from linux,
      and `inet/netinet/in.h' from glibc and ensure they compile
      in any order and preserve the required ABI.
      
      These two patches pass the following compile tests:
      
      cat >> test1.c <<EOF
      int main (void) {
        return 0;
      }
      EOF
      gcc -c test1.c
      
      cat >> test2.c <<EOF
      int main (void) {
        return 0;
      }
      EOF
      gcc -c test2.c
      
      One wrinkle is that the kernel has a different name for one of
      the members in ipv6_mreq. In the kernel patch we create a macro
      to cover the uses of the old name, and while that's not entirely
      clean it's one of the best solutions (aside from an anonymous
      union which has other issues).
      
      I've reviewed the code and it looks to me like the ABI is
      assured and everything matches on both sides.
      
      Notes:
      - You want netinet/in.h to include bits/in.h as early as possible,
        but it needs in_addr so define in_addr early.
      - You want bits/in.h included as early as possible so you can use
        the linux specific code to define __USE_KERNEL_DEFS based on
        the _UAPI_* macro definition and use those to cull in.h.
      - glibc was missing IPPROTO_MH, added here.
      
      Compile tested and inspected.
      Reported-by: NThomas Backlund <tmb@mageia.org>
      Cc: Thomas Backlund <tmb@mageia.org>
      Cc: libc-alpha@sourceware.org
      Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Cc: David S. Miller <davem@davemloft.net>
      Tested-by: NCong Wang <amwang@redhat.com>
      Signed-off-by: NCarlos O'Donell <carlos@redhat.com>
      Signed-off-by: NCong Wang <amwang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cfd280c9
  7. 13 10月, 2012 1 次提交
  8. 09 2月, 2012 1 次提交
    • E
      ipv4: Implement IP_UNICAST_IF socket option. · 76e21053
      Erich E. Hoover 提交于
      The IP_UNICAST_IF feature is needed by the Wine project.  This patch
      implements the feature by setting the outgoing interface in a similar
      fashion to that of IP_MULTICAST_IF.  A separate option is needed to
      handle this feature since the existing options do not provide all of
      the characteristics required by IP_UNICAST_IF, a summary is provided
      below.
      
      SO_BINDTODEVICE:
      * SO_BINDTODEVICE requires administrative privileges, IP_UNICAST_IF
      does not.  From reading some old mailing list articles my
      understanding is that SO_BINDTODEVICE requires administrative
      privileges because it can override the administrator's routing
      settings.
      * The SO_BINDTODEVICE option restricts both outbound and inbound
      traffic, IP_UNICAST_IF only impacts outbound traffic.
      
      IP_PKTINFO:
      * Since IP_PKTINFO and IP_UNICAST_IF are independent options,
      implementing IP_UNICAST_IF with IP_PKTINFO will likely break some
      applications.
      * Implementing IP_UNICAST_IF on top of IP_PKTINFO significantly
      complicates the Wine codebase and reduces the socket performance
      (doing this requires a lot of extra communication between the
      "server" and "user" layers).
      
      bind():
      * bind() does not work on broadcast packets, IP_UNICAST_IF is
      specifically intended to work with broadcast packets.
      * Like SO_BINDTODEVICE, bind() restricts both outbound and inbound
      traffic.
      Signed-off-by: NErich E. Hoover <ehoover@mines.edu>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      76e21053
  9. 27 8月, 2011 1 次提交
  10. 20 8月, 2010 1 次提交
  11. 24 6月, 2010 1 次提交
  12. 12 1月, 2010 1 次提交
  13. 05 11月, 2009 1 次提交
  14. 02 6月, 2009 1 次提交
    • N
      ipv4: New multicast-all socket option · f771bef9
      Nivedita Singhvi 提交于
      After some discussion offline with Christoph Lameter and David Stevens
      regarding multicast behaviour in Linux, I'm submitting a slightly
      modified patch from the one Christoph submitted earlier.
      
      This patch provides a new socket option IP_MULTICAST_ALL.
      
      In this case, default behaviour is _unchanged_ from the current
      Linux standard. The socket option is set by default to provide
      original behaviour. Sockets wishing to receive data only from
      multicast groups they join explicitly will need to clear this
      socket option.
      Signed-off-by: NNivedita Singhvi <niv@us.ibm.com>
      Signed-off-by: Christoph Lameter<cl@linux.com>
      Acked-by: NDavid Stevens <dlstevens@us.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f771bef9
  15. 17 11月, 2008 1 次提交
  16. 01 10月, 2008 1 次提交
  17. 18 3月, 2008 1 次提交
  18. 29 1月, 2008 4 次提交
  19. 26 4月, 2007 1 次提交
  20. 03 12月, 2006 1 次提交
    • G
      [NET]: Supporting UDP-Lite (RFC 3828) in Linux · ba4e58ec
      Gerrit Renker 提交于
      This is a revision of the previously submitted patch, which alters
      the way files are organized and compiled in the following manner:
      
      	* UDP and UDP-Lite now use separate object files
      	* source file dependencies resolved via header files
      	  net/ipv{4,6}/udp_impl.h
      	* order of inclusion files in udp.c/udplite.c adapted
      	  accordingly
      
      [NET/IPv4]: Support for the UDP-Lite protocol (RFC 3828)
      
      This patch adds support for UDP-Lite to the IPv4 stack, provided as an
      extension to the existing UDPv4 code:
              * generic routines are all located in net/ipv4/udp.c
              * UDP-Lite specific routines are in net/ipv4/udplite.c
              * MIB/statistics support in /proc/net/snmp and /proc/net/udplite
              * shared API with extensions for partial checksum coverage
      
      [NET/IPv6]: Extension for UDP-Lite over IPv6
      
      It extends the existing UDPv6 code base with support for UDP-Lite
      in the same manner as per UDPv4. In particular,
              * UDPv6 generic and shared code is in net/ipv6/udp.c
              * UDP-Litev6 specific extensions are in net/ipv6/udplite.c
              * MIB/statistics support in /proc/net/snmp6 and /proc/net/udplite6
              * support for IPV6_ADDRFORM
              * aligned the coding style of protocol initialisation with af_inet6.c
              * made the error handling in udpv6_queue_rcv_skb consistent;
                to return `-1' on error on all error cases
              * consolidation of shared code
      
      [NET]: UDP-Lite Documentation and basic XFRM/Netfilter support
      
      The UDP-Lite patch further provides
              * API documentation for UDP-Lite
              * basic xfrm support
              * basic netfilter support for IPv4 and IPv6 (LOG target)
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ba4e58ec
  21. 04 10月, 2006 1 次提交
  22. 29 9月, 2006 1 次提交
  23. 23 9月, 2006 2 次提交
  24. 21 3月, 2006 1 次提交
    • C
      [SECURITY]: TCP/UDP getpeersec · 2c7946a7
      Catherine Zhang 提交于
      This patch implements an application of the LSM-IPSec networking
      controls whereby an application can determine the label of the
      security association its TCP or UDP sockets are currently connected to
      via getsockopt and the auxiliary data mechanism of recvmsg.
      
      Patch purpose:
      
      This patch enables a security-aware application to retrieve the
      security context of an IPSec security association a particular TCP or
      UDP socket is using.  The application can then use this security
      context to determine the security context for processing on behalf of
      the peer at the other end of this connection.  In the case of UDP, the
      security context is for each individual packet.  An example
      application is the inetd daemon, which could be modified to start
      daemons running at security contexts dependent on the remote client.
      
      Patch design approach:
      
      - Design for TCP
      The patch enables the SELinux LSM to set the peer security context for
      a socket based on the security context of the IPSec security
      association.  The application may retrieve this context using
      getsockopt.  When called, the kernel determines if the socket is a
      connected (TCP_ESTABLISHED) TCP socket and, if so, uses the dst_entry
      cache on the socket to retrieve the security associations.  If a
      security association has a security context, the context string is
      returned, as for UNIX domain sockets.
      
      - Design for UDP
      Unlike TCP, UDP is connectionless.  This requires a somewhat different
      API to retrieve the peer security context.  With TCP, the peer
      security context stays the same throughout the connection, thus it can
      be retrieved at any time between when the connection is established
      and when it is torn down.  With UDP, each read/write can have
      different peer and thus the security context might change every time.
      As a result the security context retrieval must be done TOGETHER with
      the packet retrieval.
      
      The solution is to build upon the existing Unix domain socket API for
      retrieving user credentials.  Linux offers the API for obtaining user
      credentials via ancillary messages (i.e., out of band/control messages
      that are bundled together with a normal message).
      
      Patch implementation details:
      
      - Implementation for TCP
      The security context can be retrieved by applications using getsockopt
      with the existing SO_PEERSEC flag.  As an example (ignoring error
      checking):
      
      getsockopt(sockfd, SOL_SOCKET, SO_PEERSEC, optbuf, &optlen);
      printf("Socket peer context is: %s\n", optbuf);
      
      The SELinux function, selinux_socket_getpeersec, is extended to check
      for labeled security associations for connected (TCP_ESTABLISHED ==
      sk->sk_state) TCP sockets only.  If so, the socket has a dst_cache of
      struct dst_entry values that may refer to security associations.  If
      these have security associations with security contexts, the security
      context is returned.
      
      getsockopt returns a buffer that contains a security context string or
      the buffer is unmodified.
      
      - Implementation for UDP
      To retrieve the security context, the application first indicates to
      the kernel such desire by setting the IP_PASSSEC option via
      getsockopt.  Then the application retrieves the security context using
      the auxiliary data mechanism.
      
      An example server application for UDP should look like this:
      
      toggle = 1;
      toggle_len = sizeof(toggle);
      
      setsockopt(sockfd, SOL_IP, IP_PASSSEC, &toggle, &toggle_len);
      recvmsg(sockfd, &msg_hdr, 0);
      if (msg_hdr.msg_controllen > sizeof(struct cmsghdr)) {
          cmsg_hdr = CMSG_FIRSTHDR(&msg_hdr);
          if (cmsg_hdr->cmsg_len <= CMSG_LEN(sizeof(scontext)) &&
              cmsg_hdr->cmsg_level == SOL_IP &&
              cmsg_hdr->cmsg_type == SCM_SECURITY) {
              memcpy(&scontext, CMSG_DATA(cmsg_hdr), sizeof(scontext));
          }
      }
      
      ip_setsockopt is enhanced with a new socket option IP_PASSSEC to allow
      a server socket to receive security context of the peer.  A new
      ancillary message type SCM_SECURITY.
      
      When the packet is received we get the security context from the
      sec_path pointer which is contained in the sk_buff, and copy it to the
      ancillary message space.  An additional LSM hook,
      selinux_socket_getpeersec_udp, is defined to retrieve the security
      context from the SELinux space.  The existing function,
      selinux_socket_getpeersec does not suit our purpose, because the
      security context is copied directly to user space, rather than to
      kernel space.
      
      Testing:
      
      We have tested the patch by setting up TCP and UDP connections between
      applications on two machines using the IPSec policies that result in
      labeled security associations being built.  For TCP, we can then
      extract the peer security context using getsockopt on either end.  For
      UDP, the receiving end can retrieve the security context using the
      auxiliary data mechanism of recvmsg.
      Signed-off-by: NCatherine Zhang <cxzhang@watson.ibm.com>
      Acked-by: NJames Morris <jmorris@namei.org>
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c7946a7
  25. 30 8月, 2005 1 次提交
  26. 17 4月, 2005 1 次提交
    • L
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds 提交于
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4