1. 14 5月, 2014 2 次提交
    • L
      net: support marking accepting TCP sockets · 84f39b08
      Lorenzo Colitti 提交于
      When using mark-based routing, sockets returned from accept()
      may need to be marked differently depending on the incoming
      connection request.
      
      This is the case, for example, if different socket marks identify
      different networks: a listening socket may want to accept
      connections from all networks, but each connection should be
      marked with the network that the request came in on, so that
      subsequent packets are sent on the correct network.
      
      This patch adds a sysctl to mark TCP sockets based on the fwmark
      of the incoming SYN packet. If enabled, and an unmarked socket
      receives a SYN, then the SYN packet's fwmark is written to the
      connection's inet_request_sock, and later written back to the
      accepted socket when the connection is established.  If the
      socket already has a nonzero mark, then the behaviour is the same
      as it is today, i.e., the listening socket's fwmark is used.
      
      Black-box tested using user-mode linux:
      
      - IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
        mark of the incoming SYN packet.
      - The socket returned by accept() is marked with the mark of the
        incoming SYN packet.
      - Tested with syncookies=1 and syncookies=2.
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84f39b08
    • L
      net: add a sysctl to reflect the fwmark on replies · e110861f
      Lorenzo Colitti 提交于
      Kernel-originated IP packets that have no user socket associated
      with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
      are emitted with a mark of zero. Add a sysctl to make them have
      the same mark as the packet they are replying to.
      
      This allows an administrator that wishes to do so to use
      mark-based routing, firewalling, etc. for these replies by
      marking the original packets inbound.
      
      Tested using user-mode linux:
       - ICMP/ICMPv6 echo replies and errors.
       - TCP RST packets (IPv4 and IPv6).
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e110861f
  2. 09 5月, 2014 2 次提交
  3. 14 1月, 2014 1 次提交
    • H
      ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing · f87c10a8
      Hannes Frederic Sowa 提交于
      While forwarding we should not use the protocol path mtu to calculate
      the mtu for a forwarded packet but instead use the interface mtu.
      
      We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
      introduced for multicast forwarding. But as it does not conflict with
      our usage in unicast code path it is perfect for reuse.
      
      I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
      along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
      dependencies because of IPSKB_FORWARDED.
      
      Because someone might have written a software which does probe
      destinations manually and expects the kernel to honour those path mtus
      I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone
      can disable this new behaviour. We also still use mtus which are locked on a
      route for forwarding.
      
      The reason for this change is, that path mtus information can be injected
      into the kernel via e.g. icmp_err protocol handler without verification
      of local sockets. As such, this could cause the IPv4 forwarding path to
      wrongfully emit fragmentation needed notifications or start to fragment
      packets along a path.
      
      Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
      won't be set and further fragmentation logic will use the path mtu to
      determine the fragmentation size. They also recheck packet size with
      help of path mtu discovery and report appropriate errors.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f87c10a8
  4. 19 12月, 2013 1 次提交
  5. 22 10月, 2013 1 次提交
  6. 01 10月, 2013 1 次提交
  7. 01 8月, 2013 1 次提交
  8. 06 2月, 2013 1 次提交
  9. 07 1月, 2013 1 次提交
  10. 19 9月, 2012 1 次提交
  11. 30 8月, 2012 1 次提交
  12. 15 8月, 2012 1 次提交
    • E
      userns: Use kgids for sysctl_ping_group_range · 7064d16e
      Eric W. Biederman 提交于
      - Store sysctl_ping_group_range as a paire of kgid_t values
        instead of a pair of gid_t values.
      - Move the kgid conversion work from ping_init_sock into ipv4_ping_group_range
      - For invalid cases reset to the default disabled state.
      
      With the kgid_t conversion made part of the original value sanitation
      from userspace understand how the code will react becomes clearer
      and it becomes possible to set the sysctl ping group range from
      something other than the initial user namespace.
      
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      7064d16e
  13. 31 7月, 2012 1 次提交
  14. 21 7月, 2012 1 次提交
  15. 20 7月, 2012 1 次提交
    • E
      ipv4: tcp: remove per net tcp_sock · be9f4a44
      Eric Dumazet 提交于
      tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
      per network namespace.
      
      This leads to bad behavior on multiqueue NICS, because many cpus
      contend for the socket lock and once socket lock is acquired, extra
      false sharing on various socket fields slow down the operations.
      
      To better resist to attacks, we use a percpu socket. Each cpu can
      run without contention, using appropriate memory (local node)
      
      Additional features :
      
      1) We also mirror the queue_mapping of the incoming skb, so that
      answers use the same queue if possible.
      
      2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree()
      
      3) We now limit the number of in-flight RST/ACK [1] packets
      per cpu, instead of per namespace, and we honor the sysctl_wmem_default
      limit dynamically. (Prior to this patch, sysctl_wmem_default value was
      copied at boot time, so any further change would not affect tcp_sock
      limit)
      
      [1] These packets are only generated when no socket was matched for
      the incoming packet.
      Reported-by: NBill Sommerfeld <wsommerfeld@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      be9f4a44
  16. 11 7月, 2012 1 次提交
    • D
      tcp: Maintain dynamic metrics in local cache. · 51c5d0c4
      David S. Miller 提交于
      Maintain a local hash table of TCP dynamic metrics blobs.
      
      Computed TCP metrics are no longer maintained in the route metrics.
      
      The table uses RCU and an extremely simple hash so that it has low
      latency and low overhead.  A simple hash is legitimate because we only
      make metrics blobs for fully established connections.
      
      Some tweaking of the default hash table sizes, metric timeouts, and
      the hash chain length limit certainly could use some tweaking.  But
      the basic design seems sound.
      
      With help from Eric Dumazet and Joe Perches.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51c5d0c4
  17. 06 7月, 2012 1 次提交
  18. 09 6月, 2012 1 次提交
  19. 13 12月, 2011 1 次提交
  20. 14 5月, 2011 1 次提交
    • V
      net: ipv4: add IPPROTO_ICMP socket kind · c319b4d7
      Vasiliy Kulikov 提交于
      This patch adds IPPROTO_ICMP socket kind.  It makes it possible to send
      ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
      without any special privileges.  In other words, the patch makes it
      possible to implement setuid-less and CAP_NET_RAW-less /bin/ping.  In
      order not to increase the kernel's attack surface, the new functionality
      is disabled by default, but is enabled at bootup by supporting Linux
      distributions, optionally with restriction to a group or a group range
      (see below).
      
      Similar functionality is implemented in Mac OS X:
      http://www.manpagez.com/man/4/icmp/
      
      A new ping socket is created with
      
          socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
      
      Message identifiers (octets 4-5 of ICMP header) are interpreted as local
      ports. Addresses are stored in struct sockaddr_in. No port numbers are
      reserved for privileged processes, port 0 is reserved for API ("let the
      kernel pick a free number"). There is no notion of remote ports, remote
      port numbers provided by the user (e.g. in connect()) are ignored.
      
      Data sent and received include ICMP headers. This is deliberate to:
      1) Avoid the need to transport headers values like sequence numbers by
      other means.
      2) Make it easier to port existing programs using raw sockets.
      
      ICMP headers given to send() are checked and sanitized. The type must be
      ICMP_ECHO and the code must be zero (future extensions might relax this,
      see below). The id is set to the number (local port) of the socket, the
      checksum is always recomputed.
      
      ICMP reply packets received from the network are demultiplexed according
      to their id's, and are returned by recv() without any modifications.
      IP header information and ICMP errors of those packets may be obtained
      via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
      quenches and redirects are reported as fake errors via the error queue
      (IP_RECVERR); the next hop address for redirects is saved to ee_info (in
      network order).
      
      socket(2) is restricted to the group range specified in
      "/proc/sys/net/ipv4/ping_group_range".  It is "1 0" by default, meaning
      that nobody (not even root) may create ping sockets.  Setting it to "100
      100" would grant permissions to the single group (to either make
      /sbin/ping g+s and owned by this group or to grant permissions to the
      "netadmins" group), "0 4294967295" would enable it for the world, "100
      4294967295" would enable it for the users, but not daemons.
      
      The existing code might be (in the unlikely case anyone needs it)
      extended rather easily to handle other similar pairs of ICMP messages
      (Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
      etc.).
      
      Userspace ping util & patch for it:
      http://openwall.info/wiki/people/segoon/ping
      
      For Openwall GNU/*/Linux it was the last step on the road to the
      setuid-less distro.  A revision of this patch (for RHEL5/OpenVZ kernels)
      is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
      http://mirrors.kernel.org/openwall/Owl/current/iso/
      
      Initially this functionality was written by Pavel Kankovsky for
      Linux 2.4.32, but unfortunately it was never made public.
      
      All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
      the patch.
      
      PATCH v3:
          - switched to flowi4.
          - minor changes to be consistent with raw sockets code.
      
      PATCH v2:
          - changed ping_debug() to pr_debug().
          - removed CONFIG_IP_PING.
          - removed ping_seq_fops.owner field (unused for procfs).
          - switched to proc_net_fops_create().
          - switched to %pK in seq_printf().
      
      PATCH v1:
          - fixed checksumming bug.
          - CAP_NET_RAW may not create icmp sockets anymore.
      
      RFC v2:
          - minor cleanups.
          - introduced sysctl'able group range to restrict socket(2).
      Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c319b4d7
  21. 25 3月, 2011 1 次提交
  22. 14 1月, 2011 1 次提交
  23. 08 5月, 2010 1 次提交
  24. 14 4月, 2010 4 次提交
  25. 09 2月, 2010 2 次提交
    • P
      netfilter: nf_conntrack: fix hash resizing with namespaces · d696c7bd
      Patrick McHardy 提交于
      As noticed by Jon Masters <jonathan@jonmasters.org>, the conntrack hash
      size is global and not per namespace, but modifiable at runtime through
      /sys/module/nf_conntrack/hashsize. Changing the hash size will only
      resize the hash in the current namespace however, so other namespaces
      will use an invalid hash size. This can cause crashes when enlarging
      the hashsize, or false negative lookups when shrinking it.
      
      Move the hash size into the per-namespace data and only use the global
      hash size to initialize the per-namespace value when instanciating a
      new namespace. Additionally restrict hash resizing to init_net for
      now as other namespaces are not handled currently.
      
      Cc: stable@kernel.org
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d696c7bd
    • P
      netfilter: nf_conntrack: fix hash resizing with namespaces · 9ab48ddc
      Patrick McHardy 提交于
      As noticed by Jon Masters <jonathan@jonmasters.org>, the conntrack hash
      size is global and not per namespace, but modifiable at runtime through
      /sys/module/nf_conntrack/hashsize. Changing the hash size will only
      resize the hash in the current namespace however, so other namespaces
      will use an invalid hash size. This can cause crashes when enlarging
      the hashsize, or false negative lookups when shrinking it.
      
      Move the hash size into the per-namespace data and only use the global
      hash size to initialize the per-namespace value when instanciating a
      new namespace. Additionally restrict hash resizing to init_net for
      now as other namespaces are not handled currently.
      
      Cc: stable@kernel.org
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      9ab48ddc
  26. 18 1月, 2010 1 次提交
  27. 23 1月, 2009 6 次提交
  28. 28 10月, 2008 1 次提交
    • N
      net: implement emergency route cache rebulds when gc_elasticity is exceeded · 1080d709
      Neil Horman 提交于
      This is a patch to provide on demand route cache rebuilding.  Currently, our
      route cache is rebulid periodically regardless of need.  This introduced
      unneeded periodic latency.  This patch offers a better approach.  Using code
      provided by Eric Dumazet, we compute the standard deviation of the average hash
      bucket chain length while running rt_check_expire.  Should any given chain
      length grow to larger that average plus 4 standard deviations, we trigger an
      emergency hash table rebuild for that net namespace.  This allows for the common
      case in which chains are well behaved and do not grow unevenly to not incur any
      latency at all, while those systems (which may be being maliciously attacked),
      only rebuild when the attack is detected.  This patch take 2 other factors into
      account:
      1) chains with multiple entries that differ by attributes that do not affect the
      hash value are only counted once, so as not to unduly bias system to rebuilding
      if features like QOS are heavily used
      2) if rebuilding crosses a certain threshold (which is adjustable via the added
      sysctl in this patch), route caching is disabled entirely for that net
      namespace, since constant rebuilding is less efficient that no caching at all
      
      Tested successfully by me.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1080d709
  29. 08 10月, 2008 1 次提交