1. 14 10月, 2015 1 次提交
  2. 13 10月, 2015 1 次提交
    • P
      ipv4/icmp: redirect messages can use the ingress daddr as source · e2ca690b
      Paolo Abeni 提交于
      This patch allows configuring how the source address of ICMP
      redirect messages is selected; by default the old behaviour is
      retained, while setting icmp_redirects_use_orig_daddr force the
      usage of the destination address of the packet that caused the
      redirect.
      
      The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the
      following scenario:
      
      Two machines are set up with VRRP to act as routers out of a subnet,
      they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
      x.x.x.254/24.
      
      If a host in said subnet needs to get an ICMP redirect from the VRRP
      router, i.e. to reach a destination behind a different gateway, the
      source IP in the ICMP redirect is chosen as the primary IP on the
      interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.
      
      The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
      and will continue to use the wrong next-op.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2ca690b
  3. 01 9月, 2015 1 次提交
    • P
      IGMP: Document igmp_link_local_mcast_reports · 87583ebb
      Philip Downey 提交于
      Document the addition of a new sysctl variable which controls the
      generation of IGMP reports for link local multicast groups in the
      224.0.0.X range.
      
      IGMP reports for local multicast groups can now be optionally
      inhibited by setting the value to zero e.g.:
      echo 0 > /proc/sys/net/ipv4/igmp_link_local_mcast_reports
      
      To retain backwards compatibility the previous behaviour is retained
      by default on system boot or reverted by setting the value back to
      non-zero.
      Signed-off-by: NPhilip Downey <pdowney@brocade.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      87583ebb
  4. 26 8月, 2015 1 次提交
    • E
      tcp: refine pacing rate determination · 43e122b0
      Eric Dumazet 提交于
      When TCP pacing was added back in linux-3.12, we chose
      to apply a fixed ratio of 200 % against current rate,
      to allow probing for optimal throughput even during
      slow start phase, where cwnd can be doubled every other gRTT.
      
      At Google, we found it was better applying a different ratio
      while in Congestion Avoidance phase.
      This ratio was set to 120 %.
      
      We've used the normal tcp_in_slow_start() helper for a while,
      then tuned the condition to select the conservative ratio
      as soon as cwnd >= ssthresh/2 :
      
      - After cwnd reduction, it is safer to ramp up more slowly,
        as we approach optimal cwnd.
      - Initial ramp up (ssthresh == INFINITY) still allows doubling
        cwnd every other RTT.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      43e122b0
  5. 12 8月, 2015 1 次提交
  6. 01 8月, 2015 2 次提交
  7. 31 7月, 2015 1 次提交
    • H
      net/ipv6: add sysctl option accept_ra_min_hop_limit · 8013d1d7
      Hangbin Liu 提交于
      Commit 6fd99094 ("ipv6: Don't reduce hop limit for an interface")
      disabled accept hop limit from RA if it is smaller than the current hop
      limit for security stuff. But this behavior kind of break the RFC definition.
      
      RFC 4861, 6.3.4.  Processing Received Router Advertisements
         A Router Advertisement field (e.g., Cur Hop Limit, Reachable Time,
         and Retrans Timer) may contain a value denoting that it is
         unspecified.  In such cases, the parameter should be ignored and the
         host should continue using whatever value it is already using.
      
         If the received Cur Hop Limit value is non-zero, the host SHOULD set
         its CurHopLimit variable to the received value.
      
      So add sysctl option accept_ra_min_hop_limit to let user choose the minimum
      hop limit value they can accept from RA. And set default to 1 to meet RFC
      standards.
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Acked-by: NYOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8013d1d7
  8. 23 7月, 2015 1 次提交
  9. 10 7月, 2015 1 次提交
    • T
      ipv6: Nonlocal bind · 35a256fe
      Tom Herbert 提交于
      Add support to allow non-local binds similar to how this was done for IPv4.
      Non-local binds are very useful in emulating the Internet in a box, etc.
      
      This add the ip_nonlocal_bind sysctl under ipv6.
      
      Testing:
      
      Set up nonlocal binding and receive routing on a host, e.g.:
      
      ip -6 rule add from ::/0 iif eth0 lookup 200
      ip -6 route add local 2001:0:0:1::/64 dev lo proto kernel scope host table 200
      sysctl -w net.ipv6.ip_nonlocal_bind=1
      
      Set up routing to 2001:0:0:1::/64 on peer to go to first host
      
      ping6 -I 2001:0:0:1::1 peer-address -- to verify
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35a256fe
  10. 28 5月, 2015 1 次提交
    • E
      tcp/dccp: try to not exhaust ip_local_port_range in connect() · 07f4c900
      Eric Dumazet 提交于
      A long standing problem on busy servers is the tiny available TCP port
      range (/proc/sys/net/ipv4/ip_local_port_range) and the default
      sequential allocation of source ports in connect() system call.
      
      If a host is having a lot of active TCP sessions, chances are
      very high that all ports are in use by at least one flow,
      and subsequent bind(0) attempts fail, or have to scan a big portion of
      space to find a slot.
      
      In this patch, I changed the starting point in __inet_hash_connect()
      so that we try to favor even [1] ports, leaving odd ports for bind()
      users.
      
      We still perform a sequential search, so there is no guarantee, but
      if connect() targets are very different, end result is we leave
      more ports available to bind(), and we spread them all over the range,
      lowering time for both connect() and bind() to find a slot.
      
      This strategy only works well if /proc/sys/net/ipv4/ip_local_port_range
      is even, ie if start/end values have different parity.
      
      Therefore, default /proc/sys/net/ipv4/ip_local_port_range was changed to
      32768 - 60999 (instead of 32768 - 61000)
      
      There is no change on security aspects here, only some poor hashing
      schemes could be eventually impacted by this change.
      
      [1] : The odd/even property depends on ip_local_port_range values parity
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      07f4c900
  11. 20 5月, 2015 1 次提交
    • D
      tcp: add rfc3168, section 6.1.1.1. fallback · 49213555
      Daniel Borkmann 提交于
      This work as a follow-up of commit f7b3bec6 ("net: allow setting ecn
      via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
      ECN connections. In other words, this work adds a retry with a non-ECN
      setup SYN packet, as suggested from the RFC on the first timeout:
      
        [...] A host that receives no reply to an ECN-setup SYN within the
        normal SYN retransmission timeout interval MAY resend the SYN and
        any subsequent SYN retransmissions with CWR and ECE cleared. [...]
      
      Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
      that is, Linux default since 2009 via commit 255cac91 ("tcp: extend
      ECN sysctl to allow server-side only ECN"):
      
       1) Normal ECN-capable path:
      
          SYN ECE CWR ----->
                      <----- SYN ACK ECE
                  ACK ----->
      
       2) Path with broken middlebox, when client has fallback:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
                  SYN ----->
                      <----- SYN ACK
                  ACK ----->
      
      In case we would not have the fallback implemented, the middlebox drop
      point would basically end up as:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
      
      In any case, it's rather a smaller percentage of sites where there would
      occur such additional setup latency: it was found in end of 2014 that ~56%
      of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
      ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
      when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
      fallback would mitigate with a slight latency trade-off. Recent related
      paper on this topic:
      
        Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
        Gorry Fairhurst, and Richard Scheffenegger:
          "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
          Proc. PAM 2015, New York.
        http://ecn.ethz.ch/ecn-pam15.pdf
      
      Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
      section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
      which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
      allows for disabling the fallback.
      
      tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
      rather we let tcp_ecn_rcv_synack() take that over on input path in case a
      SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
      ECN being negotiated eventually in that case.
      
      Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
      Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdfSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NMirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
      Signed-off-by: NBrian Trammell <trammell@tik.ee.ethz.ch>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Dave That <dave.taht@gmail.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      49213555
  12. 04 5月, 2015 1 次提交
    • T
      ipv6: Flow label state ranges · 82a584b7
      Tom Herbert 提交于
      This patch divides the IPv6 flow label space into two ranges:
      0-7ffff is reserved for flow label manager, 80000-fffff will be
      used for creating auto flow labels (per RFC6438). This only affects how
      labels are set on transmit, it does not affect receive. This range split
      can be disbaled by systcl.
      
      Background:
      
      IPv6 flow labels have been an unmitigated disappointment thus far
      in the lifetime of IPv6. Support in HW devices to use them for ECMP
      is lacking, and OSes don't turn them on by default. If we had these
      we could get much better hashing in IPv6 networks without resorting
      to DPI, possibly eliminating some of the motivations to to define new
      encaps in UDP just for getting ECMP.
      
      Unfortunately, the initial specfications of IPv6 did not clarify
      how they are to be used. There has always been a vague concept that
      these can be used for ECMP, flow hashing, etc. and we do now have a
      good standard how to this in RFC6438. The problem is that flow labels
      can be either stateful or stateless (as in RFC6438), and we are
      presented with the possibility that a stateless label may collide
      with a stateful one.  Attempts to split the flow label space were
      rejected in IETF. When we added support in Linux for RFC6438, we
      could not turn on flow labels by default due to this conflict.
      
      This patch splits the flow label space and should give us
      a path to enabling auto flow labels by default for all IPv6 packets.
      This is an API change so we need to consider compatibility with
      existing deployment. The stateful range is chosen to be the lower
      values in hopes that most uses would have chosen small numbers.
      
      Once we resolve the stateless/stateful issue, we can proceed to
      look at enabling RFC6438 flow labels by default (starting with
      scaled testing).
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      82a584b7
  13. 24 3月, 2015 1 次提交
  14. 21 3月, 2015 1 次提交
  15. 07 3月, 2015 1 次提交
  16. 08 2月, 2015 1 次提交
    • N
      tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks · 032ee423
      Neal Cardwell 提交于
      Helpers for mitigating ACK loops by rate-limiting dupacks sent in
      response to incoming out-of-window packets.
      
      This patch includes:
      
      - rate-limiting logic
      - sysctl to control how often we allow dupacks to out-of-window packets
      - SNMP counter for cases where we rate-limited our dupack sending
      
      The rate-limiting logic in this patch decides to not send dupacks in
      response to out-of-window segments if (a) they are SYNs or pure ACKs
      and (b) the remote endpoint is sending them faster than the configured
      rate limit.
      
      We rate-limit our responses rather than blocking them entirely or
      resetting the connection, because legitimate connections can rely on
      dupacks in response to some out-of-window segments. For example, zero
      window probes are typically sent with a sequence number that is below
      the current window, and ZWPs thus expect to thus elicit a dupack in
      response.
      
      We allow dupacks in response to TCP segments with data, because these
      may be spurious retransmissions for which the remote endpoint wants to
      receive DSACKs. This is safe because segments with data can't
      realistically be part of ACK loops, which by their nature consist of
      each side sending pure/data-less ACKs to each other.
      
      The dupack interval is controlled by a new sysctl knob,
      tcp_invalid_ratelimit, given in milliseconds, in case an administrator
      needs to dial this upward in the face of a high-rate DoS attack. The
      name and units are chosen to be analogous to the existing analogous
      knob for ICMP, icmp_ratelimit.
      
      The default value for tcp_invalid_ratelimit is 500ms, which allows at
      most one such dupack per 500ms. This is chosen to be 2x faster than
      the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
      2.4). We allow the extra 2x factor because network delay variations
      can cause packets sent at 1 second intervals to be compressed and
      arrive much closer.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      032ee423
  17. 26 1月, 2015 1 次提交
  18. 13 1月, 2015 1 次提交
  19. 06 11月, 2014 1 次提交
  20. 30 10月, 2014 2 次提交
    • E
      net: ipv6: Add a sysctl to make optimistic addresses useful candidates · 7fd2561e
      Erik Kline 提交于
      Add a sysctl that causes an interface's optimistic addresses
      to be considered equivalent to other non-deprecated addresses
      for source address selection purposes.  Preferred addresses
      will still take precedence over optimistic addresses, subject
      to other ranking in the source address selection algorithm.
      
      This is useful where different interfaces are connected to
      different networks from different ISPs (e.g., a cell network
      and a home wifi network).
      
      The current behaviour complies with RFC 3484/6724, and it
      makes sense if the host has only one interface, or has
      multiple interfaces on the same network (same or cooperating
      administrative domain(s), but not in the multiple distinct
      networks case.
      
      For example, if a mobile device has an IPv6 address on an LTE
      network and then connects to IPv6-enabled wifi, while the wifi
      IPv6 address is undergoing DAD, IPv6 connections will try use
      the wifi default route with the LTE IPv6 address, and will get
      stuck until they time out.
      
      Also, because optimistic nodes can receive frames, issue
      an RTM_NEWADDR as soon as DAD starts (with the IFA_F_OPTIMSTIC
      flag appropriately set).  A second RTM_NEWADDR is sent if DAD
      completes (the address flags have changed), otherwise an
      RTM_DELADDR is sent.
      
      Also: add an entry in ip-sysctl.txt for optimistic_dad.
      Signed-off-by: NErik Kline <ek@google.com>
      Acked-by: NLorenzo Colitti <lorenzo@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7fd2561e
    • E
      tcp: allow for bigger reordering level · dca145ff
      Eric Dumazet 提交于
      While testing upcoming Yaogong patch (converting out of order queue
      into an RB tree), I hit the max reordering level of linux TCP stack.
      
      Reordering level was limited to 127 for no good reason, and some
      network setups [1] can easily reach this limit and get limited
      throughput.
      
      Allow a new max limit of 300, and add a sysctl to allow admins to even
      allow bigger (or lower) values if needed.
      
      [1] Aggregation of links, per packet load balancing, fabrics not doing
       deep packet inspections, alternative TCP congestion modules...
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yaogong Wang <wygivan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dca145ff
  21. 28 9月, 2014 1 次提交
  22. 24 9月, 2014 1 次提交
    • E
      icmp: add a global rate limitation · 4cdf507d
      Eric Dumazet 提交于
      Current ICMP rate limiting uses inetpeer cache, which is an RBL tree
      protected by a lock, meaning that hosts can be stuck hard if all cpus
      want to check ICMP limits.
      
      When say a DNS or NTP server process is restarted, inetpeer tree grows
      quick and machine comes to its knees.
      
      iptables can not help because the bottleneck happens before ICMP
      messages are even cooked and sent.
      
      This patch adds a new global limitation, using a token bucket filter,
      controlled by two new sysctl :
      
      icmp_msgs_per_sec - INTEGER
          Limit maximal number of ICMP packets sent per second from this host.
          Only messages whose type matches icmp_ratemask are
          controlled by this limit.
          Default: 1000
      
      icmp_msgs_burst - INTEGER
          icmp_msgs_per_sec controls number of ICMP packets sent per second,
          while icmp_msgs_burst controls the burst size of these packets.
          Default: 50
      
      Note that if we really want to send millions of ICMP messages per
      second, we might extend idea and infra added in commit 04ca6973
      ("ip: make IP identifiers less predictable") :
      add a token bucket in the ip_idents hash and no longer rely on inetpeer.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4cdf507d
  23. 13 9月, 2014 1 次提交
  24. 05 9月, 2014 2 次提交
  25. 26 8月, 2014 1 次提交
  26. 28 7月, 2014 3 次提交
    • N
      inet: frag: set limits and make init_net's high_thresh limit global · 1bab4c75
      Nikolay Aleksandrov 提交于
      This patch makes init_net's high_thresh limit to be the maximum for all
      namespaces, thus introducing a global memory limit threshold equal to the
      sum of the individual high_thresh limits which are capped.
      It also introduces some sane minimums for low_thresh as it shouldn't be
      able to drop below 0 (or > high_thresh in the unsigned case), and
      overall low_thresh should not ever be above high_thresh, so we make the
      following relations for a namespace:
      init_net:
       high_thresh - max(not capped), min(init_net low_thresh)
       low_thresh - max(init_net high_thresh), min (0)
      
      all other namespaces:
       high_thresh = max(init_net high_thresh), min(namespace's low_thresh)
       low_thresh = max(namespace's high_thresh), min(0)
      
      The major issue with having low_thresh > high_thresh is that we'll
      schedule eviction but never evict anything and thus rely only on the
      timers.
      Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1bab4c75
    • F
      inet: frag: remove periodic secret rebuild timer · e3a57d18
      Florian Westphal 提交于
      merge functionality into the eviction workqueue.
      
      Instead of rebuilding every n seconds, take advantage of the upper
      hash chain length limit.
      
      If we hit it, mark table for rebuild and schedule workqueue.
      To prevent frequent rebuilds when we're completely overloaded,
      don't rebuild more than once every 5 seconds.
      
      ipfrag_secret_interval sysctl is now obsolete and has been marked as
      deprecated, it still can be changed so scripts won't be broken but it
      won't have any effect. A comment is left above each unused secret_timer
      variable to avoid confusion.
      
      Joint work with Nikolay Aleksandrov.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e3a57d18
    • F
      inet: frag: move eviction of queues to work queue · b13d3cbf
      Florian Westphal 提交于
      When the high_thresh limit is reached we try to toss the 'oldest'
      incomplete fragment queues until memory limits are below the low_thresh
      value.  This happens in softirq/packet processing context.
      
      This has two drawbacks:
      
      1) processors might evict a queue that was about to be completed
      by another cpu, because they will compete wrt. resource usage and
      resource reclaim.
      
      2) LRU list maintenance is expensive.
      
      But when constantly overloaded, even the 'least recently used' element is
      recent, so removing 'lru' queue first is not 'fairer' than removing any
      other fragment queue.
      
      This moves eviction out of the fast path:
      
      When the low threshold is reached, a work queue is scheduled
      which then iterates over the table and removes the queues that exceed
      the memory limits of the namespace. It sets a new flag called
      INET_FRAG_EVICTED on the evicted queues so the proper counters will get
      incremented when the queue is forcefully expired.
      
      When the high threshold is reached, no more fragment queues are
      created until we're below the limit again.
      
      The LRU list is now unused and will be removed in a followup patch.
      
      Joint work with Nikolay Aleksandrov.
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b13d3cbf
  27. 08 7月, 2014 1 次提交
    • T
      ipv6: Implement automatic flow label generation on transmit · cb1ce2ef
      Tom Herbert 提交于
      Automatically generate flow labels for IPv6 packets on transmit.
      The flow label is computed based on skb_get_hash. The flow label will
      only automatically be set when it is zero otherwise (i.e. flow label
      manager hasn't set one). This supports the transmit side functionality
      of RFC 6438.
      
      Added an IPv6 sysctl auto_flowlabels to enable/disable this behavior
      system wide, and added IPV6_AUTOFLOWLABEL socket option to enable this
      functionality per socket.
      
      By default, auto flowlabels are disabled to avoid possible conflicts
      with flow label manager, however if this feature proves useful we
      may want to enable it by default.
      
      It should also be noted that FreeBSD has already implemented automatic
      flow labels (including the sysctl and socket option). In FreeBSD,
      automatic flow labels default to enabled.
      
      Performance impact:
      
      Running super_netperf with 200 flows for TCP_RR and UDP_RR for
      IPv6. Note that in UDP case, __skb_get_hash will be called for
      every packet with explains slight regression. In the TCP case
      the hash is saved in the socket so there is no regression.
      
      Automatic flow labels disabled:
      
        TCP_RR:
          86.53% CPU utilization
          127/195/322 90/95/99% latencies
          1.40498e+06 tps
      
        UDP_RR:
          90.70% CPU utilization
          118/168/243 90/95/99% latencies
          1.50309e+06 tps
      
      Automatic flow labels enabled:
      
        TCP_RR:
          85.90% CPU utilization
          128/199/337 90/95/99% latencies
          1.40051e+06
      
        UDP_RR
          92.61% CPU utilization
          115/164/236 90/95/99% latencies
          1.4687e+06
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb1ce2ef
  28. 02 7月, 2014 1 次提交
    • B
      ipv6: Allow accepting RA from local IP addresses. · d9333196
      Ben Greear 提交于
      This can be used in virtual networking applications, and
      may have other uses as well.  The option is disabled by
      default.
      
      A specific use case is setting up virtual routers, bridges, and
      hosts on a single OS without the use of network namespaces or
      virtual machines.  With proper use of ip rules, routing tables,
      veth interface pairs and/or other virtual interfaces,
      and applications that can bind to interfaces and/or IP addresses,
      it is possibly to create one or more virtual routers with multiple
      hosts attached.  The host interfaces can act as IPv6 systems,
      with radvd running on the ports in the virtual routers.  With the
      option provided in this patch enabled, those hosts can now properly
      obtain IPv6 addresses from the radvd.
      Signed-off-by: NBen Greear <greearb@candelatech.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d9333196
  29. 28 1月, 2014 1 次提交
  30. 20 1月, 2014 1 次提交
  31. 14 1月, 2014 2 次提交
    • H
      ipv4: introduce hardened ip_no_pmtu_disc mode · 8ed1dc44
      Hannes Frederic Sowa 提交于
      This new ip_no_pmtu_disc mode only allowes fragmentation-needed errors
      to be honored by protocols which do more stringent validation on the
      ICMP's packet payload. This knob is useful for people who e.g. want to
      run an unmodified DNS server in a namespace where they need to use pmtu
      for TCP connections (as they are used for zone transfers or fallback
      for requests) but don't want to use possibly spoofed UDP pmtu information.
      
      Currently the whitelisted protocols are TCP, SCTP and DCCP as they check
      if the returned packet is in the window or if the association is valid.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Suggested-by: NFlorian Weimer <fweimer@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8ed1dc44
    • H
      ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing · f87c10a8
      Hannes Frederic Sowa 提交于
      While forwarding we should not use the protocol path mtu to calculate
      the mtu for a forwarded packet but instead use the interface mtu.
      
      We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
      introduced for multicast forwarding. But as it does not conflict with
      our usage in unicast code path it is perfect for reuse.
      
      I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
      along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
      dependencies because of IPSKB_FORWARDED.
      
      Because someone might have written a software which does probe
      destinations manually and expects the kernel to honour those path mtus
      I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone
      can disable this new behaviour. We also still use mtus which are locked on a
      route for forwarding.
      
      The reason for this change is, that path mtus information can be injected
      into the kernel via e.g. icmp_err protocol handler without verification
      of local sockets. As such, this could cause the IPv4 forwarding path to
      wrongfully emit fragmentation needed notifications or start to fragment
      packets along a path.
      
      Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
      won't be set and further fragmentation logic will use the path mtu to
      determine the fragmentation size. They also recheck packet size with
      help of path mtu discovery and report appropriate errors.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f87c10a8
  32. 08 1月, 2014 1 次提交
  33. 19 12月, 2013 1 次提交
  34. 18 12月, 2013 1 次提交