1. 07 Sep, 2013 3 commits
    • E
      tcp: properly increase rcv_ssthresh for ofo packets · 4e4f1fc2
      Eric Dumazet committed
      TCP receive window handling is multi staged.
      
      A socket has a memory budget, static or dynamic, in sk_rcvbuf.
      
      Because we do not really know how this memory budget translates to
      a TCP window (payload), TCP announces a small initial window
      (about 20 MSS).
      
      When a packet is received, we increase TCP rcv_win depending
      on the payload/truesize ratio of this packet. Good citizen
      packets give a hint that it's reasonable to have rcv_win = sk_rcvbuf/2.
      
      This heuristic takes place in tcp_grow_window().
      
      The problem is that we currently call tcp_grow_window() only for
      in-order packets.
      
      This means that reordering or packet loss prevents rcv_win from
      growing properly, and senders are unable to benefit from fast
      recovery or proper reordering level detection.
      
      Really, a packet being stored in OFO queue is not a bad citizen.
      It should be part of the game as in-order packets.
      
      In our traces, we very often see that the sender is limited by small
      Linux receive windows, even though Linux hosts use autotuning (DRS)
      and should allow rcv_win to grow to ~3MB.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
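The window growth heuristic described in this commit can be sketched in user space as follows. This is a minimal illustration, not the kernel's tcp_grow_window(): the struct, the function names, the "payload fills at least half of truesize" test, and the 2 * MSS growth step are all assumptions made for the example. The point it shows is that the same growth rule is applied to every accepted packet, whether it arrived in order or was stored in the OFO queue.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified receiver state; not the kernel's struct tcp_sock. */
struct rx_state {
    uint32_t rcvbuf;       /* memory budget, in the role of sk_rcvbuf */
    uint32_t rcv_ssthresh; /* current cap on the advertised window */
    uint32_t mss;          /* maximum segment size */
};

/* Assumed rule: a packet is a "good citizen" when its payload fills
 * at least half of the memory it occupies (its truesize). */
static bool good_citizen(uint32_t payload, uint32_t truesize)
{
    return (uint64_t)2 * payload >= truesize;
}

/* Grow the window cap for any accepted packet, in-order or out-of-order.
 * Growth is an assumed 2 * MSS per good packet, capped at rcvbuf / 2. */
static void grow_window(struct rx_state *s, uint32_t payload, uint32_t truesize)
{
    uint32_t limit = s->rcvbuf / 2;

    if (!good_citizen(payload, truesize) || s->rcv_ssthresh >= limit)
        return;

    uint32_t room = limit - s->rcv_ssthresh;
    uint32_t incr = 2 * s->mss;

    s->rcv_ssthresh += incr < room ? incr : room;
}
```

With this shape, a receiver that calls grow_window() only on the in-order path stops growing during a loss episode, which is the behaviour the commit message describes.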
    • Y
      tcp: fix no cwnd growth after timeout · 16edfe7e
      Yuchung Cheng committed
      In commit 0f7cc9a3 "tcp: increase throughput when reordering is high",
      it only allows cwnd to increase in Open state. This mistakenly disables
      slow start after timeout (CA_Loss). Moreover cwnd won't grow if the
      state moves from Disorder to Open later in tcp_fastretrans_alert().
      
      Therefore the correct logic should be to allow cwnd to grow as long
      as the data is received in order in Open, Loss, or even Disorder state.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
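One way to read the rule stated in this commit message is as a small predicate over the congestion-avoidance state. The sketch below is only that reading: the enum is a local illustration whose names echo the kernel's TCP_CA_* states, and may_grow_cwnd() is a hypothetical helper, not a function from the patch.

```c
#include <stdbool.h>

/* Local illustration; names echo the kernel's TCP_CA_* states. */
enum ca_state { CA_OPEN, CA_DISORDER, CA_CWR, CA_RECOVERY, CA_LOSS };

/* Allow cwnd growth when data is acknowledged in order and the
 * connection is in Open, Disorder, or Loss state. */
static bool may_grow_cwnd(enum ca_state state, bool acked_in_order)
{
    if (!acked_in_order)
        return false;

    switch (state) {
    case CA_OPEN:
    case CA_DISORDER:
    case CA_LOSS:
        return true;
    default:
        return false;
    }
}
```

Under the earlier behaviour the message describes, only the CA_OPEN case would return true, which is why slow start after a timeout (CA_LOSS) was disabled.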
    • D
      net: netlink: filter particular protocols from analyzers · 5ffd5cdd
      Daniel Borkmann committed
      Fix finer-grained control and let only a whitelist of allowed netlink
      protocols pass, in our case related to networking. If later on, other
      subsystems decide they want to add their protocol as well to the list
      of allowed protocols they shall simply add it. While at it, we also
      need to tell what protocol is in use otherwise BPF_S_ANC_PROTOCOL can
      not pick it up (as it's not filled out).
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 06 Sep, 2013 7 commits
  3. 05 Sep, 2013 12 commits
    • D
      net: ipv6: mld: introduce mld_{gq, ifc, dad}_stop_timer functions · b4af8def
      Daniel Borkmann committed
      We already have mld_{gq,ifc,dad}_start_timer() functions, so introduce
      mld_{gq,ifc,dad}_stop_timer() functions to reduce code size and make it
      more readable.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: ipv6: mld: refactor query processing into v1/v2 functions · 2b7c121f
      Daniel Borkmann committed
      Make igmp6_event_query() a bit easier to read by refactoring code
      parts into mld_process_v1() and mld_process_v2().
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: ipv6: mld: similarly to MLDv2 have min max_delay of 1 · cc7f7ab7
      Daniel Borkmann committed
      As we already do for MLDv2 queries, clamp a forged MLDv1 query with
      a 0 ms mld_maxdelay to the minimum timer shot time of 1 jiffy. This is
      eventually done in igmp6_group_queried() anyway, so we can simplify
      a check there.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: ipv6: mld: implement RFC3810 MLDv2 mode only · 58c0ecfd
      Daniel Borkmann committed
      RFC3810, 10. Security Considerations says under subsection 10.1.
      Query Message:
      
        A forged Version 1 Query message will put MLDv2 listeners on that
        link in MLDv1 Host Compatibility Mode. This scenario can be avoided
        by providing MLDv2 hosts with a configuration option to ignore
        Version 1 messages completely.
      
      Hence, implement a MLDv2-only mode that will ignore MLDv1 traffic:
      
        echo 2 > /proc/sys/net/ipv6/conf/ethX/force_mld_version  or
        echo 2 > /proc/sys/net/ipv6/conf/all/force_mld_version
      
      Note that the <all> device takes precedence, as was previously also
      the case in the MLD_V1_SEEN() macro, whose condition would
      "short-circuit" on the <all> case.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: ipv6: mld: get rid of MLDV2_MRC and simplify calculation · e3f5b170
      Daniel Borkmann committed
      Get rid of MLDV2_MRC and use our new macros for mantissa and
      exponent to calculate Maximum Response Delay out of the Maximum
      Response Code.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
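For readers unfamiliar with the encoding, the mantissa/exponent calculation of the Maximum Response Delay follows RFC 3810, section 5.1.3: codes below 32768 are the delay in milliseconds, and larger codes are a small floating-point value with a 3-bit exponent and a 12-bit mantissa. The sketch below is a user-space illustration of that formula; the function name is hypothetical and is not one of the macros introduced by the patch.

```c
#include <stdint.h>

/* Convert an MLDv2 Maximum Response Code (host byte order) into the
 * Maximum Response Delay in milliseconds, per RFC 3810, 5.1.3.
 *
 *   MRC <  32768: delay = MRC
 *   MRC >= 32768: |1| exp (3 bits) | mant (12 bits) |
 *                 delay = (mant | 0x1000) << (exp + 3)
 */
static uint32_t mld_mrc_to_delay_ms(uint16_t mrc)
{
    if (mrc < 0x8000)
        return mrc;

    uint32_t exp = (mrc >> 12) & 0x7;  /* bits 12..14 */
    uint32_t mant = mrc & 0x0fff;      /* bits 0..11 */

    return (mant | 0x1000) << (exp + 3);
}
```

The default Query Response Interval of 10000 is below 32768 and therefore maps to itself, i.e. 10 seconds.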
    • D
      net: ipv6: mld: clean up MLD_V1_SEEN macro · 6c567b78
      Daniel Borkmann committed
      Replace the macro with a function to make it more readable. GCC will
      eventually decide whether to inline this or not (also, that's not
      fast-path anyway).
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: ipv6: mld: fix v1/v2 switchback timeout to rfc3810, 9.12. · 89225d1c
      Daniel Borkmann committed
      i) RFC3810, 9.2. Query Interval [QI] says:
      
         The Query Interval variable denotes the interval between General
         Queries sent by the Querier. Default value: 125 seconds. [...]
      
      ii) RFC3810, 9.3. Query Response Interval [QRI] says:
      
        The Maximum Response Delay used to calculate the Maximum Response
        Code inserted into the periodic General Queries. Default value:
        10000 (10 seconds) [...] The number of seconds represented by the
        [Query Response Interval] must be less than the [Query Interval].
      
      iii) RFC3810, 9.12. Older Version Querier Present Timeout [OVQPT] says:
      
        The Older Version Querier Present Timeout is the time-out for
        transitioning a host back to MLDv2 Host Compatibility Mode. When an
        MLDv1 query is received, MLDv2 hosts set their Older Version Querier
        Present Timer to [Older Version Querier Present Timeout].
      
        This value MUST be ([Robustness Variable] times (the [Query Interval]
        in the last Query received)) plus ([Query Response Interval]).
      
      Hence, by *default*, the timeout results in:
      
        [RV] = 2, [QI] = 125sec, [QRI] = 10sec
        [OVQPT] = [RV] * [QI] + [QRI] = 260sec
      
      That said, we currently calculate [OVQPT] (here given as the 'switchback'
      variable) as ...
      
        switchback = (idev->mc_qrv + 1) * max_delay
      
      RFC3810, 9.12. says "the [Query Interval] in the last Query received". In
      section "9.14. Configuring timers", it is said:
      
        This section is meant to provide advice to network administrators on
        how to tune these settings to their network. Ambitious router
        implementations might tune these settings dynamically based upon
        changing characteristics of the network. [...]
      
      iv) RFC3810, 9.14.2. Query Interval:
      
        The overall level of periodic MLD traffic is inversely proportional
        to the Query Interval. A longer Query Interval results in a lower
        overall level of MLD traffic. The value of the Query Interval MUST
        be equal to or greater than the Maximum Response Delay used to
        calculate the Maximum Response Code inserted in General Query
        messages.
      
      I assume that is why switchback is calculated the way it is (3 * max_delay),
      although this setting seems to be meant only for routers to configure their
      [QI] for non-default intervals. So its usage here is clearly wrong.
      
      In conclusion, the current behaviour of the IPv6 multicast code does not
      conform to the RFC, as the switchback timeout is calculated wrongly. Its value
      is too small, so MLDv2 hosts switch back to MLDv2 far too early, i.e. after
      ~30secs instead of ~260secs by default.
      
      Hence, introduce the necessary helper functions and fix this up properly.
      
      Introduced in 06da92283 ("[IPV6]: Add MLDv2 support."). Credits to Hannes
      Frederic Sowa, who also had a hand in this. Also thanks to Hangbin Liu,
      who did the initial testing.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: David Stevens <dlstevens@us.ibm.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
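The arithmetic in this commit message can be checked with a short sketch. It works in milliseconds for readability, whereas the kernel works in jiffies, and both function names are hypothetical helpers invented for this example rather than the helpers the patch introduces.

```c
#include <stdint.h>

/* RFC 3810, 9.12: [OVQPT] = [RV] * [QI] + [QRI], all in milliseconds here. */
static uint32_t ovqpt_ms(uint32_t rv, uint32_t qi_ms, uint32_t qri_ms)
{
    return rv * qi_ms + qri_ms;
}

/* The calculation the commit message calls wrong:
 * switchback = (mc_qrv + 1) * max_delay */
static uint32_t old_switchback_ms(uint32_t qrv, uint32_t max_delay_ms)
{
    return (qrv + 1) * max_delay_ms;
}
```

With the defaults quoted above ([RV] = 2, [QI] = 125 s, [QRI] = 10 s), ovqpt_ms(2, 125000, 10000) yields 260000 ms, while old_switchback_ms(2, 10000) yields 30000 ms, matching the ~260secs and ~30secs figures in the message.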
    • D
      net: ipv6: tcp: fix potential use after free in tcp_v6_do_rcv · 3a1c7565
      Daniel Borkmann committed
      In tcp_v6_do_rcv() code, when processing pkt options, we solely work
      on our skb clone opt_skb that we created earlier, before entering
      tcp_rcv_established(). However, only in condition ...
      
        if (np->rxopt.bits.rxtclass)
          np->rcv_tclass = ipv6_get_dsfield(ipv6_hdr(skb));
      
      ... we work on skb itself. As we extract every other piece of information
      out of opt_skb in the ipv6_pktoptions path, this seems wrong, since skb can
      already have been released by tcp_rcv_established() earlier on. When we
      try to access it in ipv6_hdr(), we will dereference a freed skb.
      
      [ Bug added by commit 4c507d28 ("net: implement IP_RECVTOS for
        IP_PKTOPTIONS") ]
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Y
      tcp: better comments for RTO initiallization · 52f20e65
      Yuchung Cheng committed
      Commit 1b7fdd2a ("tcp: do not use cached RTT for RTT estimation")
      removed important comments on how RTO is initialized and updated.
      Hopefully this patch puts that information back.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • T
      ipv6: Don't depend on per socket memory for neighbour discovery messages · 25a6e6b8
      Thomas Graf committed
      Allocating skbs when sending out neighbour discovery messages
      currently uses sock_alloc_send_skb() based on a per net namespace
      socket and thus shares a socket wmem buffer space.
      
      If a netdevice is temporarily unable to transmit due to carrier
      loss or for other reasons, the queued up ndisc messages will consume
      all of the wmem space and will thus prevent any more skbs from
      being allocated, even for netdevices that are able to transmit packets.
      
      The number of neighbour discovery messages sent is very limited;
      use of alloc_skb() bypasses the socket wmem buffer size enforcement,
      while the manual call to skb_set_owner_w() maintains the socket
      reference needed for the IPv6 output path.
      
      This patch was originally posted by Eric Dumazet in a modified
      form.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Fabio Estevam <festevam@gmail.com>
      Tested-by: Fabio Estevam <fabio.estevam@freescale.com>
      Tested-by: Stephen Warren <swarren@nvidia.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • H
      ipv6: fix null pointer dereference in __ip6addrlbl_add · 639739b5
      Hannes Frederic Sowa committed
      Commit b67bfe0d ("hlist: drop the node parameter from iterators")
      changed the behavior of hlist_for_each_entry_safe to leave the p
      argument NULL.
      
      Fix this up by tracking the last argument.
      Reported-by: Michele Baldessari <michele@acksyn.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Tested-by: Michele Baldessari <michele@acksyn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      net: sctp: Fix data chunk fragmentation for MTU values which are not multiple of 4 · c08751c8
      Alexander Sverdlin committed
      Initially the problem was observed with ipsec, but later it became clear that
      the SCTP data chunk fragmentation algorithm has problems with MTU values which
      are not a multiple of 4. A test program was used which just transmits 2000-byte
      packets to another host. tcpdump was used to observe re-fragmentation in the IP
      layer after SCTP had already fragmented the data chunks.
      
      With MTU 1500:
      12:54:34.082904 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 1500)
          10.151.38.153.39303 > 10.151.24.91.54321: sctp (1) [DATA] (B) [TSN: 2366088589] [SID: 0] [SSEQ 1] [PPID 0x0]
      12:54:34.082933 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 596)
          10.151.38.153.39303 > 10.151.24.91.54321: sctp (1) [DATA] (E) [TSN: 2366088590] [SID: 0] [SSEQ 1] [PPID 0x0]
      12:54:34.090576 IP (tos 0x2,ECT(0), ttl 63, id 0, offset 0, flags [DF], proto SCTP (132), length 48)
          10.151.24.91.54321 > 10.151.38.153.39303: sctp (1) [SACK] [cum ack 2366088590] [a_rwnd 79920] [#gap acks 0] [#dup tsns 0]
      
      With MTU 1499:
      13:02:49.955220 IP (tos 0x2,ECT(0), ttl 64, id 48215, offset 0, flags [+], proto SCTP (132), length 1492)
          10.151.38.153.39084 > 10.151.24.91.54321: sctp[|sctp]
      13:02:49.955249 IP (tos 0x2,ECT(0), ttl 64, id 48215, offset 1472, flags [none], proto SCTP (132), length 28)
          10.151.38.153 > 10.151.24.91: ip-proto-132
      13:02:49.955262 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 600)
          10.151.38.153.39084 > 10.151.24.91.54321: sctp (1) [DATA] (E) [TSN: 404355346] [SID: 0] [SSEQ 1] [PPID 0x0]
      13:02:49.956770 IP (tos 0x2,ECT(0), ttl 63, id 0, offset 0, flags [DF], proto SCTP (132), length 48)
          10.151.24.91.54321 > 10.151.38.153.39084: sctp (1) [SACK] [cum ack 404355346] [a_rwnd 79920] [#gap acks 0] [#dup tsns 0]
      
      Here a problem in the data portion limit calculation leads to re-fragmentation
      in IP, which is sub-optimal. The problem is the initial value of max_data, which
      doesn't take into account the fact that a data chunk must be padded to a 4-byte
      boundary. It's enough to correct max_data, because all later adjustments are
      correctly aligned to a 4-byte boundary.
      
      After the fix is applied, everything is fragmented correctly for uneven MTUs:
      15:16:27.083881 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 1496)
          10.151.38.153.53417 > 10.151.24.91.54321: sctp (1) [DATA] (B) [TSN: 3077098183] [SID: 0] [SSEQ 1] [PPID 0x0]
      15:16:27.083907 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 600)
          10.151.38.153.53417 > 10.151.24.91.54321: sctp (1) [DATA] (E) [TSN: 3077098184] [SID: 0] [SSEQ 1] [PPID 0x0]
      15:16:27.085640 IP (tos 0x2,ECT(0), ttl 63, id 0, offset 0, flags [DF], proto SCTP (132), length 48)
          10.151.24.91.54321 > 10.151.38.153.53417: sctp (1) [SACK] [cum ack 3077098184] [a_rwnd 79920] [#gap acks 0] [#dup tsns 0]
      
      The bug has been there for years, but
       - it is a performance issue; the packets are still transmitted
       - it doesn't show up with the default MTU 1500, but possibly with ipsec (MTU 1438)
      Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
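The alignment issue described in this commit can be illustrated with a small calculation. The sketch assumes IPv4 without options (20 bytes), the 12-byte SCTP common header, and the 16-byte DATA chunk header; the function name and constants are hypothetical, and the code is a user-space illustration of rounding the payload limit down to a 4-byte boundary rather than the kernel's own expression.

```c
/* Assumed header sizes: IPv4 without options, SCTP common header,
 * and SCTP DATA chunk header. */
enum {
    IPV4_HDR_LEN = 20,
    SCTP_COMMON_HDR_LEN = 12,
    SCTP_DATA_CHUNK_HDR_LEN = 16
};

/* Largest DATA payload per fragment for a given path MTU, rounded down
 * to a multiple of 4 so the padded chunk still fits within the MTU. */
static unsigned int sctp_max_data(unsigned int mtu)
{
    unsigned int overhead = IPV4_HDR_LEN + SCTP_COMMON_HDR_LEN +
                            SCTP_DATA_CHUNK_HDR_LEN;

    if (mtu <= overhead)
        return 0;

    return (mtu - overhead) & ~3u;
}
```

Under these assumptions an MTU of 1499 gives 1448 payload bytes, so the first fragment is 1448 + 48 = 1496 bytes on the wire and the remaining 552 bytes of a 2000-byte message travel in a 600-byte packet, which agrees with the lengths in the post-fix trace above.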
  4. 04 Sep, 2013 18 commits