1. 02 Aug, 2018 2 commits
  2. 31 Jul, 2018 1 commit
  3. 30 Jul, 2018 1 commit
    • route: add support for directed broadcast forwarding · 5cbf777c
      Xin Long committed
      This patch implements the feature described in rfc1812#section-5.3.5.2
      and rfc2644. It allows the router to forward directed broadcasts when
      the bc_forwarding sysctl is enabled.
      
      Note that this feature could be done with iptables -j TEE, but that
      would cause some problems:
        - The TEE target's gateway parameter has to be set to a specific
          address, which is inflexible, especially when the router wants to
          forward all directed broadcasts.
        - It duplicates the directed broadcasts, which may cause side
          effects for applications.
      
      Besides, to stay consistent with other OS routers such as BSD, it is
      also necessary to implement this in the route rx path.
      
      Note that route cache needs to be flushed when bc_forwarding is
      changed.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
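To make the term concrete, here is a minimal Python sketch of what "directed broadcast" means in this commit message: a packet whose destination is the broadcast address of a subnet reachable through the router. This is not the kernel code. The function names `is_directed_broadcast` and `should_forward` and the example prefix are hypothetical, and the `bc_forwarding` sysctl is modeled as a plain boolean.

```python
import ipaddress


def is_directed_broadcast(dst, connected_prefix):
    """Return True if dst is the broadcast address of connected_prefix."""
    net = ipaddress.ip_network(connected_prefix)
    addr = ipaddress.ip_address(dst)
    # /31 and /32 prefixes have no distinct broadcast address.
    if net.prefixlen >= 31:
        return False
    return addr == net.broadcast_address


def should_forward(dst, connected_prefix, bc_forwarding):
    """Toy forwarding decision: directed broadcasts pass only when enabled."""
    if is_directed_broadcast(dst, connected_prefix):
        return bc_forwarding
    # Ordinary unicast inside the prefix is forwarded as usual.
    return ipaddress.ip_address(dst) in ipaddress.ip_network(connected_prefix)


if __name__ == "__main__":
    print(should_forward("192.0.2.255", "192.0.2.0/24", bc_forwarding=True))
    print(should_forward("192.0.2.255", "192.0.2.0/24", bc_forwarding=False))
```

Unlike the TEE approach described in the message, nothing is duplicated in this model: the packet is either forwarded or not.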
  4. 26 Jul, 2018 1 commit
  5. 25 Jul, 2018 3 commits
  6. 24 Jul, 2018 6 commits
  7. 22 Jul, 2018 5 commits
  8. 21 Jul, 2018 3 commits
    • tcp: do not delay ACK in DCTCP upon CE status change · a0496ef2
      Yuchung Cheng committed
      Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
      has to be sent immediately so the sender can respond quickly:
      
      """ When receiving packets, the CE codepoint MUST be processed as follows:
      
         1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
             true and send an immediate ACK.
      
         2.  If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
             to false and send an immediate ACK.
      """
      
      Previously the DCTCP implementation could continue to delay the ACK.
      This patch fixes that by forcing an immediate ACK, as the RFC requires.
      
      Tested with this packetdrill script provided by Larry Brakmo:
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
         +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      +0.005 < [ce] . 2001:3001(1000) ack 2 win 257
      
      +0.000 > [ect01] . 2:2(0) ack 2001
      // Previously the ACK below would be delayed by 40ms
      +0.000 > [ect01] E. 2:2(0) ack 3001
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
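The two receiver rules quoted from RFC 8257 section 3.2 can be modeled in a few lines of Python. This is a toy state machine, not the kernel's DCTCP code: the class name `DctcpReceiver`, its `acks` log, and the "immediate"/"delayed" labels are all hypothetical, and the model only follows the quoted text (it ignores any previously unacknowledged data).

```python
class DctcpReceiver:
    """Toy model of the CE-codepoint rules quoted in the commit message."""

    def __init__(self):
        self.ce = False   # the DCTCP.CE state variable
        self.acks = []    # log of (kind, ack_seq) decisions

    def on_data(self, end_seq, ce_marked):
        if ce_marked != self.ce:
            # CE status changed: flip DCTCP.CE and ACK right away.
            self.ce = ce_marked
            self.acks.append(("immediate", end_seq))
        else:
            # No change: the ACK may be delayed as usual.
            self.acks.append(("delayed", end_seq))


if __name__ == "__main__":
    r = DctcpReceiver()
    r.on_data(2001, ce_marked=False)  # like the [ect0] segment in the script
    r.on_data(3001, ce_marked=True)   # like the [ce] segment in the script
    print(r.acks)
```

In the packetdrill script this corresponds to the ACK of 3001 going out at once instead of after the 40 ms delayed-ACK timer.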
    • tcp: do not cancel delay-AcK on DCTCP special ACK · 27cde44a
      Yuchung Cheng committed
      Currently when a DCTCP receiver delays an ACK and receives a
      data packet with a different CE mark from the previous one's, it
      sends two immediate ACKs acking the previous and latest sequences
      respectively (for ECN accounting).
      
      Previously, sending the first ACK could cancel the delayed ACK timer
      (tcp_event_ack_sent). This may subsequently prevent sending the
      second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
      The culprit is that tcp_send_ack() assumes it always acknowledges
      the latest sequence, which is not true for the first special ACK.
      
      The fix is to not make that assumption in tcp_send_ack() and to check
      the actual ack sequence before cancelling the delayed ACK. Further, it's
      safer to pass the ack sequence number as a local variable into the
      tcp_send_ack routine, instead of intercepting tp->rcv_nxt, to avoid
      future bugs like this.
      Reported-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
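The idea of the fix, checking the acknowledged sequence before cancelling the delayed ACK, can be sketched as a small Python model. It is only an illustration under assumptions: the `Receiver` class, its fields, and `send_ack` are hypothetical stand-ins, not the kernel's tcp_send_ack() or tcp_event_ack_sent().

```python
class Receiver:
    """Toy receiver that tracks one pending delayed ACK."""

    def __init__(self, rcv_nxt):
        self.rcv_nxt = rcv_nxt            # next sequence expected from the peer
        self.delayed_ack_pending = False  # stands in for the delayed-ACK timer
        self.sent = []                    # ack sequences actually sent

    def send_ack(self, ack_seq):
        """Send an ACK for an explicit sequence number."""
        self.sent.append(ack_seq)
        # Only an ACK that covers the latest sequence satisfies the
        # delayed ACK; a special ACK for older data must not cancel it.
        if ack_seq == self.rcv_nxt:
            self.delayed_ack_pending = False


if __name__ == "__main__":
    r = Receiver(rcv_nxt=3001)
    r.delayed_ack_pending = True
    r.send_ack(2001)               # special ACK for the previous sequence
    print(r.delayed_ack_pending)   # still pending
    r.send_ack(3001)               # ACK for the latest sequence
    print(r.delayed_ack_pending)   # now cancelled
```

Passing `ack_seq` explicitly mirrors the design choice described in the message: the caller states which sequence it is acknowledging rather than temporarily overwriting shared state.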
    • tcp: helpers to send special DCTCP ack · 2987babb
      Yuchung Cheng committed
      Refactor and create helpers to send the special ACK in DCTCP.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 19 Jul, 2018 2 commits
  10. 17 Jul, 2018 3 commits
    • netfilter: conntrack: remove l3proto abstraction · a0ae2562
      Florian Westphal committed
      This unifies ipv4 and ipv6 protocol trackers and removes the l3proto
      abstraction.
      
      This gets rid of all l3proto indirect calls and the need to do
      a lookup on the function to call for l3 demux.
      
      It increases the size of nf_conntrack.ko by only a small amount
      (12 kbyte), so overall this reduces size, because nf_conntrack.ko is
      useless without either the nf_conntrack_ipv4 or the nf_conntrack_ipv6
      module.
      
      before:
         text    data     bss     dec     hex filename
         7357    1088       0    8445    20fd nf_conntrack_ipv4.ko
         7405    1084       4    8493    212d nf_conntrack_ipv6.ko
        72614   13689     236   86539   1520b nf_conntrack.ko
       19K nf_conntrack_ipv4.ko
       19K nf_conntrack_ipv6.ko
      179K nf_conntrack.ko
      
      after:
         text    data     bss     dec     hex filename
        79277   13937     236   93450   16d0a nf_conntrack.ko
        191K nf_conntrack.ko
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
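The size claim can be checked directly from the text/data/bss columns listed above. The short Python calculation below only redoes that arithmetic; the variable names are arbitrary, and it works on the section sizes in bytes, not on the on-disk .ko file sizes (the 12 kbyte figure is the 179K to 191K file-size difference).

```python
# (text, data, bss) in bytes, copied from the tables in the commit message.
before = {
    "nf_conntrack_ipv4.ko": (7357, 1088, 0),
    "nf_conntrack_ipv6.ko": (7405, 1084, 4),
    "nf_conntrack.ko": (72614, 13689, 236),
}
after = {"nf_conntrack.ko": (79277, 13937, 236)}


def total(modules):
    """Sum text + data + bss over all listed modules."""
    return sum(sum(sections) for sections in modules.values())


# Growth of the core module on its own.
growth_core = sum(after["nf_conntrack.ko"]) - sum(before["nf_conntrack.ko"])
# Saving when both l3 modules used to be loaded.
saving_both = total(before) - total(after)
# Saving when only the IPv4 module used to be loaded.
saving_ipv4_only = (
    sum(before["nf_conntrack.ko"]) + sum(before["nf_conntrack_ipv4.ko"])
) - total(after)

print(growth_core, saving_both, saving_ipv4_only)  # prints: 6911 10027 1534
```

So even a system that previously loaded only one of the two l3 modules ends up with slightly less code and data in total.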
    • tcp: Fix broken repair socket window probe patch · 31048d7a
      Stefan Baranoff committed
      Correct the previous bad attempt at allowing sockets to come out of TCP
      repair without sending window probes. To avoid changing the size of
      the repair variable in struct tcp_sock, this lets the decision whether
      to send probes be made when coming out of repair, by introducing two
      ways to turn repair off.
      
      v2:
      * Remove erroneous comment; defines now make behavior clear
      
      Fixes: 70b7ff13 ("tcp: allow user to create repair socket without window probes")
      Signed-off-by: Stefan Baranoff <sbaranoff@gmail.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4/igmp: init group mode as INCLUDE when join source group · 6e2059b5
      Hangbin Liu committed
      Based on RFC3376 5.1
         If no interface
         state existed for that multicast address before the change (i.e., the
         change consisted of creating a new per-interface record), or if no
         state exists after the change (i.e., the change consisted of deleting
         a per-interface record), then the "non-existent" state is considered
         to have a filter mode of INCLUDE and an empty source list.
      
      Which means a new multicast group should start with state IN().
      
      Function ip_mc_join_group() works correctly for IGMP ASM (Any-Source
      Multicast) mode. It adds a group with state EX() and initializes crcount
      to mc_qrv, so the kernel sends a TO_EX() report message after adding the
      group.
      
      But for IGMPv3 SSM (Source-Specific Multicast) JOIN_SOURCE_GROUP mode, we
      split the group join into two steps. First we join the group as in ASM,
      i.e. via ip_mc_join_group(), so the state changes from IN() to EX().
      
      Then we add the source-specific address with INCLUDE mode, so the state
      changes from EX() to IN(A).
      
      The second step finishes before the first step has sent its group change
      record, so only the second change record, TO_IN(A), is sent.
      
      According to the RFC, we should actually send an ALLOW(A) message for
      SSM JOIN_SOURCE_GROUP, as the state should mimic the 'IN() to IN(A)'
      transition.
      
      The issue was exposed by commit a052517a ("net/multicast: should not
      send source list records when have filter mode change"). Before this change,
      we used to send both ALLOW(A) and TO_IN(A). After this change we only send
      TO_IN(A).
      
      Fix it by adding a new parameter to init group mode. Also add new wrapper
      functions so we don't need to change too much code.
      
      v1 -> v2:
      In my first version I only cleared the group change record, but this is
      not enough. When a new group is joined, it is initialized as EXCLUDE and
      triggers a filter mode change in ip/ip6_mc_add_src(), which clears every
      source address's sf_crcount. This prevents earlier-joined addresses from
      sending state change records if multiple source addresses are joined at
      the same time.
      
      In the v2 patch, I fixed this by directly initializing the mode to INCLUDE
      for SSM JOIN_SOURCE_GROUP. I also split the original patch into two
      separate patches for IPv4 and IPv6.
      
      Fixes: a052517a ("net/multicast: should not send source list records when have filter mode change")
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
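The difference between the two transitions discussed above, IN() to IN(A) versus EX() to IN(A), follows from the state-change rules in RFC 3376 section 5.1. The Python sketch below models only those rules; the function name `state_change_records` and the tuple encoding of interface state are hypothetical, and nothing here reflects the kernel's data structures.

```python
def state_change_records(old, new):
    """Return the state-change records for a transition.

    old and new are (mode, sources) pairs, where mode is "IN" or "EX"
    and sources is a frozenset of source addresses.
    """
    old_mode, old_src = old
    new_mode, new_src = new
    if old_mode != new_mode:
        # Filter mode change: a single TO_IN or TO_EX record.
        return [("TO_" + new_mode, sorted(new_src))]
    if new_mode == "IN":
        allow, block = new_src - old_src, old_src - new_src
    else:
        allow, block = old_src - new_src, new_src - old_src
    records = []
    if allow:
        records.append(("ALLOW", sorted(allow)))
    if block:
        records.append(("BLOCK", sorted(block)))
    return records


if __name__ == "__main__":
    empty = frozenset()
    a = frozenset({"A"})
    # Group initialized as INCLUDE: the join yields ALLOW(A).
    print(state_change_records(("IN", empty), ("IN", a)))
    # Group initialized as EXCLUDE: the join yields TO_IN(A) instead.
    print(state_change_records(("EX", empty), ("IN", a)))
```

This is why initializing the group mode as INCLUDE for JOIN_SOURCE_GROUP changes which report is sent.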
  11. 16 Jul, 2018 7 commits
  12. 15 Jul, 2018 2 commits
  13. 14 Jul, 2018 3 commits
    • tcp: remove DELAYED ACK events in DCTCP · a69258f7
      Yuchung Cheng committed
      After fixing the way DCTCP tracks delayed ACKs, the delayed-ACK
      related callbacks are no longer needed.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix dctcp delayed ACK schedule · b0c05d0e
      Yuchung Cheng committed
      Previously, when a data segment was sent an ACK was piggybacked
      on the data segment without generating a CA_EVENT_NON_DELAYED_ACK
      event to notify congestion control modules. So the DCTCP
      ca->delayed_ack_reserved flag could incorrectly stay set when
      in fact there were no delayed ACKs being reserved. This could result
      in sending a special ECN notification ACK that carries an older
      ACK sequence, when in fact there was no need for such an ACK.
      DCTCP keeps track of the delayed ACK status with its own separate
      state ca->delayed_ack_reserved. Previously it could accidentally cancel
      the delayed ACK without updating this field upon sending a special
      ACK that carries an older ACK sequence. This inconsistency could, in
      some cases, lead to the DCTCP receiver never acknowledging the latest
      data until the sender times out and retries.
      
      Packetdrill script (provided by Larry Brakmo)
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 2:3(1) ack 2001
      
      0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
      0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
      0.200 > [ect01] . 3:3(0) ack 4001
      
      0.210 < [ce] P. 4001:4501(500) ack 3 win 257
      
      +0.001 read(4, ..., 4500) = 4500
      +0 write(4, ..., 1) = 1
      +0 > [ect01] PE. 3:4(1) ack 4501
      
      +0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
      // Previously the ACK sequence below would be 4501, causing a long RTO
      +0.040~+0.045 > [ect01] . 4:4(0) ack 5501   // delayed ack
      
      +0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // More data
      +0 > [ect01] . 4:4(0) ack 6501     // now acks everything
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      Reported-by: Larry Brakmo <brakmo@fb.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipmr: add support for passing full packet on wrong vif · c921c207
      Nikolay Aleksandrov committed
      This patch adds support for IGMPMSG_WRVIFWHOLE which is used to pass
      full packet and real vif id when the incoming interface is wrong.
      While the RP and FHR are setting up state, we need to send the
      registers encapsulated with all the data inside, otherwise we lose it.
      The RP then decapsulates it and forwards it to the interested parties.
      Currently with WRONGVIF we can only send empty register packets
      and will lose that data.
      This behaviour can be enabled by using MRT_PIM with
      val == IGMPMSG_WRVIFWHOLE. This doesn't prevent IGMPMSG_WRONGVIF from
      happening; it happens in addition to it, and it is controlled by the same
      throttling parameters as WRONGVIF (i.e. 1 packet per 3 seconds currently).
      Both messages are generated to keep backward compatibility and avoid
      breaking anyone who was enabling MRT_PIM with val == 4, since any
      positive val is accepted and treated the same.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 13 Jul, 2018 1 commit
    • net: ipv4: fix listify ip_rcv_finish in case of forwarding · 0761680d
      Jesper Dangaard Brouer committed
      In commit 5fa12739 ("net: ipv4: listify ip_rcv_finish") the call to
      dst_input(skb) was split out. ip_sublist_rcv_finish() just calls
      dst_input(skb) in a loop.
      
      The problem is that ip_sublist_rcv_finish() forgot to remove the SKB
      from the list before invoking dst_input(). Furthermore, we need to
      clear skb->next, as other parts of the network stack use another kind
      of SKB list for xmit_more (see dev_hard_start_xmit).
      
      A crash occurs if e.g. dst_input() invokes ip_forward(), which calls
      dst_output()/ip_output(), which eventually calls __dev_queue_xmit() +
      sch_direct_xmit(); the crash happens in validate_xmit_skb_list().
      
      This patch only fixes the crash, but there is a huge potential for
      a performance boost if we can pass an SKB list through to ip_forward.
      
      Fixes: 5fa12739 ("net: ipv4: listify ip_rcv_finish")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
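The bug pattern described above, handing a packet to the next stage while it is still linked into a list, can be illustrated with a small Python model. It is a sketch under assumptions: `Skb`, `link`, and `sublist_rcv_finish` are hypothetical stand-ins built on a singly linked list, not the kernel's sk_buff or list_head APIs.

```python
class Skb:
    """Toy packet buffer with a single 'next' link."""

    def __init__(self, name):
        self.name = name
        self.next = None


def link(skbs):
    """Chain the buffers together and return the head (or None)."""
    for current, following in zip(skbs, skbs[1:]):
        current.next = following
    return skbs[0] if skbs else None


def sublist_rcv_finish(head, dst_input):
    """Deliver each buffer on its own, detached from the list."""
    skb = head
    while skb is not None:
        following = skb.next  # remember the successor before unlinking
        skb.next = None       # later stages must see a lone packet
        dst_input(skb)
        skb = following


if __name__ == "__main__":
    seen = []
    head = link([Skb("a"), Skb("b"), Skb("c")])
    sublist_rcv_finish(head, lambda s: seen.append((s.name, s.next)))
    print(seen)
```

Without the `skb.next = None` step, the handler for "a" would see "b" and "c" hanging off it, which in this model is the analogue of the transmit path mistaking the leftover link for an xmit_more list.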