1. 03 3月, 2017 2 次提交
    • A
      net: Introduce sk_clone_lock() error path routine · 94352d45
      Arnaldo Carvalho de Melo 提交于
      When handling problems in cloning a socket with the sk_clone_locked()
      function we need to perform several steps that were open coded in it and
      its callers, so introduce a routine to avoid this duplication:
      sk_free_unlock_clone().
      
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/n/net-ui6laqkotycunhtmqryl9bfx@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      94352d45
    • A
      dccp: Unlock sock before calling sk_free() · d5afb6f9
      Arnaldo Carvalho de Melo 提交于
      The code where sk_clone() came from created a new socket and locked it,
      but then, on the error path didn't unlock it.
      
      This problem stayed there for a long while, till b0691c8e ("net:
      Unlock sock before calling sk_free()") fixed it, but unfortunately the
      callers of sk_clone() (now sk_clone_locked()) were not audited and the
      one in dccp_create_openreq_child() remained.
      
      Now in the age of the syskaller fuzzer, this was finally uncovered, as
      reported by Dmitry:
      
       ---- 8< ----
      
      I've got the following report while running syzkaller fuzzer on
      86292b33 ("Merge branch 'akpm' (patches from Andrew)")
      
        [ BUG: held lock freed! ]
        4.10.0+ #234 Not tainted
        -------------------------
        syz-executor6/6898 is freeing memory
        ffff88006286cac0-ffff88006286d3b7, with a lock still held there!
         (slock-AF_INET6){+.-...}, at: [<ffffffff8362c2c9>] spin_lock
        include/linux/spinlock.h:299 [inline]
         (slock-AF_INET6){+.-...}, at: [<ffffffff8362c2c9>]
        sk_clone_lock+0x3d9/0x12c0 net/core/sock.c:1504
        5 locks held by syz-executor6/6898:
         #0:  (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff839a34b4>] lock_sock
        include/net/sock.h:1460 [inline]
         #0:  (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff839a34b4>]
        inet_stream_connect+0x44/0xa0 net/ipv4/af_inet.c:681
         #1:  (rcu_read_lock){......}, at: [<ffffffff83bc1c2a>]
        inet6_csk_xmit+0x12a/0x5d0 net/ipv6/inet6_connection_sock.c:126
         #2:  (rcu_read_lock){......}, at: [<ffffffff8369b424>] __skb_unlink
        include/linux/skbuff.h:1767 [inline]
         #2:  (rcu_read_lock){......}, at: [<ffffffff8369b424>] __skb_dequeue
        include/linux/skbuff.h:1783 [inline]
         #2:  (rcu_read_lock){......}, at: [<ffffffff8369b424>]
        process_backlog+0x264/0x730 net/core/dev.c:4835
         #3:  (rcu_read_lock){......}, at: [<ffffffff83aeb5c0>]
        ip6_input_finish+0x0/0x1700 net/ipv6/ip6_input.c:59
         #4:  (slock-AF_INET6){+.-...}, at: [<ffffffff8362c2c9>] spin_lock
        include/linux/spinlock.h:299 [inline]
         #4:  (slock-AF_INET6){+.-...}, at: [<ffffffff8362c2c9>]
        sk_clone_lock+0x3d9/0x12c0 net/core/sock.c:1504
      
      Fix it just like was done by b0691c8e ("net: Unlock sock before calling
      sk_free()").
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170301153510.GE15145@kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d5afb6f9
  2. 28 4月, 2016 1 次提交
  3. 23 10月, 2015 1 次提交
    • E
      tcp/dccp: fix hashdance race for passive sessions · 5e0724d0
      Eric Dumazet 提交于
      Multiple cpus can process duplicates of incoming ACK messages
      matching a SYN_RECV request socket. This is a rare event under
      normal operations, but definitely can happen.
      
      Only one must win the race, otherwise corruption would occur.
      
      To fix this without adding new atomic ops, we use logic in
      inet_ehash_nolisten() to detect the request was present in the same
      ehash bucket where we try to insert the new child.
      
      If request socket was not found, we have to undo the child creation.
      
      This actually removes a spin_lock()/spin_unlock() pair in
      reqsk_queue_unlink() for the fast path.
      
      Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e0724d0
  4. 30 9月, 2015 2 次提交
  5. 22 9月, 2015 1 次提交
    • E
      tcp/dccp: fix timewait races in timer handling · ed2e9239
      Eric Dumazet 提交于
      When creating a timewait socket, we need to arm the timer before
      allowing other cpus to find it. The signal allowing cpus to find
      the socket is setting tw_refcnt to non zero value.
      
      As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to
      call inet_twsk_schedule() first.
      
      This also means we need to remove tw_refcnt changes from
      inet_twsk_schedule() and let the caller handle it.
      
      Note that because we use mod_timer_pinned(), we have the guarantee
      the timer wont expire before we set tw_refcnt as we run in BH context.
      
      To make things more readable I introduced inet_twsk_reschedule() helper.
      
      When rearming the timer, we can use mod_timer_pending() to make sure
      we do not rearm a canceled timer.
      
      Note: This bug can possibly trigger if packets of a flow can hit
      multiple cpus. This does not normally happen, unless flow steering
      is broken somehow. This explains this bug was spotted ~5 months after
      its introduction.
      
      A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(),
      but will be provided in a separate patch for proper tracking.
      
      Fixes: 789f558c ("tcp/dccp: get rid of central timewait timer")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NYing Cai <ycai@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed2e9239
  6. 24 4月, 2015 1 次提交
    • E
      inet: fix possible panic in reqsk_queue_unlink() · b357a364
      Eric Dumazet 提交于
      [ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
       0000000000000080
      [ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243
      
      There is a race when reqsk_timer_handler() and tcp_check_req() call
      inet_csk_reqsk_queue_unlink() on the same req at the same time.
      
      Before commit fa76ce73 ("inet: get rid of central tcp/dccp listener
      timer"), listener spinlock was held and race could not happen.
      
      To solve this bug, we change reqsk_queue_unlink() to not assume req
      must be found, and we return a status, to conditionally release a
      refcount on the request sock.
      
      This also means tcp_check_req() in non fastopen case might or not
      consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
      to properly handle this.
      
      (Same remark for dccp_check_req() and its callers)
      
      inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
      called 4 times in tcp and 3 times in dccp.
      
      Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b357a364
  7. 14 4月, 2015 1 次提交
    • E
      tcp/dccp: get rid of central timewait timer · 789f558c
      Eric Dumazet 提交于
      Using a timer wheel for timewait sockets was nice ~15 years ago when
      memory was expensive and machines had a single processor.
      
      This does not scale, code is ugly and source of huge latencies
      (Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)
      
      We can afford to use an extra 64 bytes per timewait sock and spread
      timewait load to all cpus to have better behavior.
      
      Tested:
      
      On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
      on the target (lpaa24)
      
      Before patch :
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      419594
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      437171
      
      While test is running, we can observe 25 or even 33 ms latencies.
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
      rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
      rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2
      
      After patch :
      
      About 90% increase of throughput :
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      810442
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      800992
      
      And latencies are kept to minimal values during this load, even
      if network utilization is 90% higher :
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
      rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      789f558c
  8. 21 3月, 2015 1 次提交
  9. 02 7月, 2014 1 次提交
    • E
      inet: move ipv6only in sock_common · 9fe516ba
      Eric Dumazet 提交于
      When an UDP application switches from AF_INET to AF_INET6 sockets, we
      have a small performance degradation for IPv4 communications because of
      extra cache line misses to access ipv6only information.
      
      This can also be noticed for TCP listeners, as ipv6_only_sock() is also
      used from __inet_lookup_listener()->compute_score()
      
      This is magnified when SO_REUSEPORT is used.
      
      Move ipv6only into struct sock_common so that it is available at
      no extra cost in lookups.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9fe516ba
  10. 12 4月, 2014 1 次提交
    • D
      net: Fix use after free by removing length arg from sk_data_ready callbacks. · 676d2369
      David S. Miller 提交于
      Several spots in the kernel perform a sequence like:
      
      	skb_queue_tail(&sk->s_receive_queue, skb);
      	sk->sk_data_ready(sk, skb->len);
      
      But at the moment we place the SKB onto the socket receive queue it
      can be consumed and freed up.  So this skb->len access is potentially
      to freed up memory.
      
      Furthermore, the skb->len can be modified by the consumer so it is
      possible that the value isn't accurate.
      
      And finally, no actual implementation of this callback actually uses
      the length argument.  And since nobody actually cared about it's
      value, lots of call sites pass arbitrary values in such as '0' and
      even '1'.
      
      So just remove the length argument from the callback, that way there
      is no confusion whatsoever and all of these use-after-free cases get
      fixed as a side effect.
      
      Based upon a patch by Eric Dumazet and his suggestion to audit this
      issue tree-wide.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      676d2369
  11. 11 10月, 2013 1 次提交
  12. 10 10月, 2013 1 次提交
    • E
      inet: includes a sock_common in request_sock · 634fb979
      Eric Dumazet 提交于
      TCP listener refactoring, part 5 :
      
      We want to be able to insert request sockets (SYN_RECV) into main
      ehash table instead of the per listener hash table to allow RCU
      lookups and remove listener lock contention.
      
      This patch includes the needed struct sock_common in front
      of struct request_sock
      
      This means there is no more inet6_request_sock IPv6 specific
      structure.
      
      Following inet_request_sock fields were renamed as they became
      macros to reference fields from struct sock_common.
      Prefix ir_ was chosen to avoid name collisions.
      
      loc_port   -> ir_loc_port
      loc_addr   -> ir_loc_addr
      rmt_addr   -> ir_rmt_addr
      rmt_port   -> ir_rmt_port
      iif        -> ir_iif
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      634fb979
  13. 09 10月, 2013 1 次提交
    • E
      ipv6: make lookups simpler and faster · efe4208f
      Eric Dumazet 提交于
      TCP listener refactoring, part 4 :
      
      To speed up inet lookups, we moved IPv4 addresses from inet to struct
      sock_common
      
      Now is time to do the same for IPv6, because it permits us to have fast
      lookups for all kind of sockets, including upcoming SYN_RECV.
      
      Getting IPv6 addresses in TCP lookups currently requires two extra cache
      lines, plus a dereference (and memory stall).
      
      inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
      
      This patch is way bigger than its IPv4 counter part, because for IPv4,
      we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
      it's not doable easily.
      
      inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
      inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
      
      And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
      at the same offset.
      
      We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
      macro.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      efe4208f
  14. 04 11月, 2012 1 次提交
    • E
      tcp: better retrans tracking for defer-accept · e6c022a4
      Eric Dumazet 提交于
      For passive TCP connections using TCP_DEFER_ACCEPT facility,
      we incorrectly increment req->retrans each time timeout triggers
      while no SYNACK is sent.
      
      SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
      which we received the ACK from client). Only the last SYNACK is sent
      so that we can receive again an ACK from client, to move the req into
      accept queue. We plan to change this later to avoid the useless
      retransmit (and potential problem as this SYNACK could be lost)
      
      TCP_INFO later gives wrong information to user, claiming imaginary
      retransmits.
      
      Decouple req->retrans field into two independent fields :
      
      num_retrans : number of retransmit
      num_timeout : number of timeouts
      
      num_timeout is the counter that is incremented at each timeout,
      regardless of actual SYNACK being sent or not, and used to
      compute the exponential timeout.
      
      Introduce inet_rtx_syn_ack() helper to increment num_retrans
      only if ->rtx_syn_ack() succeeded.
      
      Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
      when we re-send a SYNACK in answer to a (retransmitted) SYN.
      Prior to this patch, we were not counting these retransmits.
      
      Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
      only if a synack packet was successfully queued.
      Reported-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Elliott Hughes <enh@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e6c022a4
  15. 04 3月, 2012 1 次提交
  16. 12 12月, 2011 1 次提交
  17. 23 11月, 2011 1 次提交
  18. 09 11月, 2011 1 次提交
  19. 12 10月, 2010 1 次提交
    • G
      dccp: fix the adjustments to AWL and SWL · 0b53d460
      Gerrit Renker 提交于
      This fixes a problem and a potential loophole with regard to seqno/ackno
      validity: currently the initial adjustments to AWL/SWL are only performed
      once at the begin of the connection, during the handshake.
      
      Since the Sequence Window feature is always greater than Wmin=32 (7.5.2),
      it is however necessary to perform these adjustments at least for the first
      W/W' (variables as per 7.5.1) packets in the lifetime of a connection.
      
      This requirement is complicated by the fact that W/W' can change at any time
      during the lifetime of a connection.
      
      Therefore it is better to perform that safety check each time SWL/AWL are
      updated, as implemented by the patch.
      
      A second problem solved by this patch is that the remote/local Sequence Window
      feature values (which set the bounds for AWL/SWL/SWH) are undefined until the
      feature negotiation has completed.
      
      During the initial handshake we have more stringent sequence number protection;
      the changes added by this patch effect that {A,S}W{L,H} are within the correct
      bounds at the instant that feature negotiation completes (since the SeqWin
      feature activation handlers call dccp_update_gsr/gss()).
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      0b53d460
  20. 30 3月, 2010 1 次提交
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo 提交于
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  21. 06 3月, 2010 1 次提交
  22. 03 12月, 2009 1 次提交
  23. 22 1月, 2009 1 次提交
    • G
      dccp: Implement both feature-local and feature-remote Sequence Window feature · 792b4878
      Gerrit Renker 提交于
      This adds full support for local/remote Sequence Window feature, from which the
        * sequence-number-validity (W) and
        * acknowledgment-number-validity (W') windows
      derive as specified in RFC 4340, 7.5.3.
      
      Specifically, the following is contained in this patch:
        * integrated new socket fields into dccp_sk;
        * updated the update_gsr/gss routines with regard to these fields;
        * updated handler code: the Sequence Window feature is located at the TX side,
          so the local feature is meant if the handler-rx flag is false;
        * the initialisation of `rcv_wnd' in reqsk is removed, since
          - rcv_wnd is not used by the code anywhere;
          - sequence number checks are not done in the LISTEN state (cf. 7.5.3);
          - dccp_check_req checks the Ack number validity more rigorously;
        * the `struct dccp_minisock' became empty and is now removed.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      792b4878
  24. 08 12月, 2008 4 次提交
    • G
      dccp ccid-2: Phase out the use of boolean Ack Vector sysctl · 6fdd34d4
      Gerrit Renker 提交于
      This removes the use of the sysctl and the minisock variable for the Send Ack
      Vector feature, as it now is handled fully dynamically via feature negotiation
      (i.e. when CCID-2 is enabled, Ack Vectors are automatically enabled as per
       RFC 4341, 4.).
      
      Using a sysctl in parallel to this implementation would open the door to
      crashes, since much of the code relies on tests of the boolean minisock /
      sysctl variable. Thus, this patch replaces all tests of type
      
      	if (dccp_msk(sk)->dccpms_send_ack_vector)
      		/* ... */
      with
      	if (dp->dccps_hc_rx_ackvec != NULL)
      		/* ... */
      
      The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature
      negotiation concluded that Ack Vectors are to be used on the half-connection.
      Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child),
      so that the test is a valid one.
      
      The activation handler for Ack Vectors is called as soon as the feature
      negotiation has concluded at the
       * server when the Ack marking the transition RESPOND => OPEN arrives;
       * client after it has sent its ACK, marking the transition REQUEST => PARTOPEN.
      
      Adding the sequence number of the Response packet to the Ack Vector has been
      removed, since
       (a) connection establishment implies that the Response has been received;
       (b) the CCIDs only look at packets received in the (PART)OPEN state, i.e.
           this entry will always be ignored;
       (c) it can not be used for anything useful - to detect loss for instance, only
           packets received after the loss can serve as pseudo-dupacks.
      
      There was a FIXME to change the error code when dccp_ackvec_add() fails.
      I removed this after finding out that:
       * the check whether ackno < ISN is already made earlier,
       * this Response is likely the 1st packet with an Ackno that the client gets,
       * so when dccp_ackvec_add() fails, the reason is likely not a packet error.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6fdd34d4
    • G
      dccp: Remove manual influence on NDP Count feature · 4098dce5
      Gerrit Renker 提交于
      Updating the NDP count feature is handled automatically now:
       * for CCID-2 it is disabled, since the code does not use NDP counts;
       * for CCID-3 it is enabled, as NDP counts are used to determine loss lengths.
      
      Allowing the user to change NDP values leads to unpredictable and failing
      behaviour, since it is then possible to disable NDP counts even when they
      are needed (e.g. in CCID-3).
      
      This means that only those user settings are sensible that agree with the
      values for Send NDP Count implied by the choice of CCID. But those settings
      are already activated by the feature negotiation (CCID dependency tracking),
      hence this form of support is redundant.
      
      At startup the initialisation of the NDP count feature uses the default
      value of 0, which is done implicitly by the zeroing-out of the socket when
      it is allocated. If the choice of CCID or feature negotiation enables NDP
      count, this will then be updated via the NDP activation handler.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4098dce5
    • G
      dccp: Remove obsolete parts of the old CCID interface · 0049bab5
      Gerrit Renker 提交于
      The TX/RX CCIDs of the minisock are now redundant: similar to the Ack Vector
      case, their value equals initially that of the sysctl, but at the end of
      feature negotiation may be something different.
      
      The old interface removed by this patch thus has been replaced by the newer
      interface to dynamically query the currently loaded CCIDs.
      
      Also removed are the constructors for the TX CCID and the RX CCID, since the
      switch "rx <-> non-rx" is done by the handler in minisocks.c (and the handler
      is the only place in the code where CCIDs are loaded).
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0049bab5
    • G
      dccp: Integration of dynamic feature activation - part 2 (server side) · 192b27ff
      Gerrit Renker 提交于
      This patch integrates the activation of features at the end of negotiation
      into the server-side code.
      
      Note regarding the removal of 'const':
      --------------------------------------
       The 'const' attribute has been removed from 'dreq' since dccp_activate_values()
       needs to operate on dreq's feature list. Part of the activation is to remove
       those options from the list that have already been confirmed, hence it is not
       purely read-only.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      192b27ff
  25. 17 11月, 2008 1 次提交
    • G
      dccp: Deprecate Ack Ratio sysctl · dd9c0e36
      Gerrit Renker 提交于
      This patch deprecates the Ack Ratio sysctl, since
       * Ack Ratio is entirely ignored by CCID-3 and CCID-4,
       * Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1);
       * even if it would work in CCID-2, there is no point for a user to change it:
         - Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2),
         - if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts
           (since waiting for Acks which will never arrive in this window),
         - cwnd is not a user-configurable value.
      
      The only reasonable place for Ack Ratio is to print it for debugging. It is
      planned to do this later on, as part of e.g. dccp_probe.
      
      With this patch Ack Ratio is now under full control of feature negotiation:
       * Ack Ratio is resolved as a dependency of the selected CCID;
       * if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to
         the default of 2, following RFC 4340, 11.3 - "New connections start with Ack
         Ratio 2 for both endpoints";
       * what happens then is part of another patch set, since it concerns the
         dynamic update of Ack Ratio while the connection is in full flight.
      
      Thanks to Tomasz Grobelny for discussion leading up to this patch.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dd9c0e36
  26. 05 11月, 2008 1 次提交
  27. 20 10月, 2008 1 次提交
  28. 09 9月, 2008 1 次提交
  29. 04 9月, 2008 7 次提交
    • G
      dccp: Fix the adjustments to AWL and SWL · bfbddd08
      Gerrit Renker 提交于
      This fixes a problem and a potential loophole with regard to seqno/ackno
      validity: the problem is that the initial adjustments to AWL/SWL were
      only performed at the begin of the connection, during the handshake.
      
      Since the Sequence Window feature is always greater than Wmin=32 (7.5.2), 
      it is however necessary to perform these adjustments at least for the first
      W/W' (variables as per 7.5.1) packets in the lifetime of a connection.
      
      This requirement is complicated by the fact that W/W' can change at any time
      during the lifetime of a connection.
      
      Therefore the consequence is to perform this safety check each time SWL/AWL
      are updated.
      
      A second problem solved by this patch is that the remote/local Sequence Window
      feature values (which set the bounds for AWL/SWL/SWH) are undefined until the
      feature negotiation has completed.
      
      During the initial handshake we have more stringent sequence number protection,
      the changes added by this patch effect that {A,S}W{L,H} are within the correct
      bounds at the instant that feature negotiation completes (since the SeqWin
      feature activation handlers call dccp_update_gsr/gss()). 
      
      A detailed rationale is below -- can be removed from the commit message.
      
      
      1. Server sequence number checks during initial handshake
      ---------------------------------------------------------
      The server can not use the fields of the listening socket for seqno/ackno checks
      and thus needs to store all relevant information on a per-connection basis on
      the dccp_request socket. This is a size-constrained structure and has currently
      only ISS (dreq_iss) and ISR (dreq_isr) defined.
      Adding further fields (SW{L,H}, AW{L,H}) would increase the size of the struct
      and it is questionable whether this will have any practical gain. The currently
      implemented solution is as follows.
       * receiving first Request: dccp_v{4,6}_conn_request sets 
                                  ISR := P.seqno, ISS := dccp_v{4,6}_init_sequence()
      
       * sending first Response:  dccp_v{4,6}_send_response via dccp_make_response()	
                                  sets P.seqno := ISS, sets P.ackno := ISR
      
       * receiving retransmitted Request: dccp_check_req() overrides ISR := P.seqno
      
       * answering retransmitted Request: dccp_make_response() sets ISS += 1,
                                          otherwise as per first Response
      
       * completing the handshake: succeeds in dccp_check_req() for the first Ack
                                   where P.ackno == ISS (P.seqno is not tested)
      
       * creating child socket: ISS, ISR are copied from the request_sock
      
      This solution will succeed whenever the server can receive the Request and the
      subsequent Ack in succession, without retransmissions. If there is packet loss,
      the client needs to retransmit until this condition succeeds; it will otherwise
      eventually give up. Adding further fields to the request_sock could increase
      the robustness a bit, in that it would make possible to let a reordered Ack
      (from a retransmitted Response) pass. The argument against such a solution is
      that if the packet loss is not persistent and an Ack gets through, why not
      wait for the one answering the original response: if the loss is persistent, it
      is probably better to not start the connection in the first place.
      
      Long story short: the present design (by Arnaldo) is simple and will likely work
      just as well as a more complicated solution. As a consequence, {A,S}W{L,H} are
      not needed until the moment the request_sock is cloned into the accept queue.
      
      At that stage feature negotiation has completed, so that the values for the local
      and remote Sequence Window feature (7.5.2) are known, i.e. we are now in a better
      position to compute {A,S}W{L,H}.
      
      
      2. Client sequence number checks during initial handshake
      ---------------------------------------------------------
      Until entering PARTOPEN the client does not need the adjustments, since it 
      constrains the Ack window to the packet it sent.
      
       * sending first Request: dccp_v{4,6}_connect() choose ISS, 
                                dccp_connect() then sets GAR := ISS (as per 8.5),
      			  dccp_transmit_skb() (with the previous bug fix) sets
      			         GSS := ISS, AWL := ISS, AWH := GSS
       * n-th retransmitted Request (with previous patch):
      	                  dccp_retransmit_skb() via timer calls
      			  dccp_transmit_skb(), which sets GSS := ISS+n
                                and then AWL := ISS, AWH := ISS+n
      	                  
       * receiving any Response: dccp_rcv_request_sent_state_process() 
      	                   -- accepts packet if AWL <= P.ackno <= AWH;
      			   -- sets GSR = ISR = P.seqno
      
       * sending the Ack completing the handshake: dccp_send_ack() calls 
                                 dccp_transmit_skb(), which sets GSS += 1
      			   and AWL := ISS, AWH := GSS
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      bfbddd08
    • G
      dccp: Implement both feature-local and feature-remote Sequence Window feature · 51c7d4fa
      Gerrit Renker 提交于
      This adds full support for local/remote Sequence Window feature, from which the 
        * sequence-number-validity (W) and 
        * acknowledgment-number-validity (W') windows 
      derive as specified in RFC 4340, 7.5.3. 
      
      Specifically, the following changes are introduced:
        * integrated new socket fields into dccp_sk;
        * updated the update_gsr/gss routines with regard to these fields;
        * updated handler code: the Sequence Window feature is located at the TX side,
          so the local feature is meant if the handler-rx flag is false;
        * the initialisation of `rcv_wnd' in reqsk is removed, since
          - rcv_wnd is not used by the code anywhere;
          - sequence number checks are not done in the LISTEN state (cf. 7.5.3);
          - dccp_check_req checks the Ack number validity more rigorously;
        * the `struct dccp_minisock' became empty and is now removed.
      
      Until the handshake completes with activating negotiated values, the local/remote
      Sequence-Window values are undefined and thus can not reliably be estimated.
      This issue is addressed in a separate patch.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      51c7d4fa
    • G
      dccp ccid-2: Phase out the use of boolean Ack Vector sysctl · b235dc4a
      Gerrit Renker 提交于
      This removes the use of the sysctl and the minisock variable for the Send Ack
      Vector feature, which is now handled fully dynamically via feature negotiation;
      i.e. when CCID2 is enabled, Ack Vectors are automatically enabled (as per
      RFC 4341, 4.).
      
      Using a sysctl in parallel to this implementation would open the door to
      crashes, since much of the code relies on tests of the boolean minisock /
      sysctl variable. Thus, this patch replaces all tests of type
      
      	if (dccp_msk(sk)->dccpms_send_ack_vector)
      		/* ... */
      with
      	if (dp->dccps_hc_rx_ackvec != NULL)
      		/* ... */
      
      The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature
      negotiation concluded that Ack Vectors are to be used on the half-connection.
      Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child),
      so that the test is a valid one.
      
      The activation handler for Ack Vectors is called as soon as the feature
      negotiation has concluded at the
       * server when the Ack marking the transition RESPOND => OPEN arrives;
       * client after it has sent its ACK, marking the transition REQUEST => PARTOPEN.
      
      Adding the sequence number of the Response packet to the Ack Vector has been 
      removed, since
       (a) connection establishment implies that the Response has been received;
       (b) the CCIDs only look at packets received in the (PART)OPEN state, i.e.
           this entry will always be ignored;
       (c) it can not be used for anything useful - to detect loss for instance, only
           packets received after the loss can serve as pseudo-dupacks.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      b235dc4a
    • G
      dccp: Remove manual influence on NDP Count feature · 68e074bf
      Gerrit Renker 提交于
      Updating the NDP count feature is handled automatically now:
       * for CCID-2 it is disabled, since the code does not use NDP counts;
       * for CCID-3 it is enabled, as NDP counts are used to determine loss lengths.
      
      Allowing the user to change NDP values leads to unpredictable and failing
      behaviour, since it is then possible to disable NDP counts even when they
      are needed (e.g. in CCID-3).
      
      This means that only those user settings are sensible that agree with the
      values for Send NDP Count implied by the choice of CCID. But those settings
      are already activated by the feature negotiation (CCID dependency tracking),
      hence this form of support is redundant.
      
      At startup the initialisation of the NDP count feature is with the default
      value of 0, which is done implicitly by the zeroing-out of the socket when
      it is allocated. If the choice of CCID or feature negotiation enables NDP
      count, this will then be updated via the NDP activation handler.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      68e074bf
    • G
      dccp: Remove obsolete parts of the old CCID interface · 78673e24
      Gerrit Renker 提交于
      The TX/RX CCIDs of the minisock are now redundant: similar to the Ack Vector
      case, their value equals initially that of the sysctl, but at the end of
      feature negotiation may be something different.
      
      The old interface removed by this patch thus has been replaced by the newer
      interface to dynamically query the currently loaded CCIDs earlier in this
      patch set.
      
      Also removed the constructors for the TX CCID and the RX CCID, since the
      switch rx/non-rx is done by the handler in minisocks.c (and the handler is
      the only place in the code where CCIDs are loaded).
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      78673e24
    • G
      dccp: Integration of dynamic feature activation - part 2 (server side) · e70cacb9
      Gerrit Renker 提交于
      This patch integrates the activation of features at the end of negotiation
      into the server-side code.
      
      Note: 
        In dccp_create_openreq_child the request_sock argument is no longer constant,
        since dccp_activate_values() uses the feature-negotiation list on dreq to sort
        out the initialisation values for the different features of the child socket;
        and purges this queue after use (but the `req' argument to openreq_child
        can and does still remain constant).
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      e70cacb9
    • G
      dccp: Deprecate Ack Ratio sysctl · 17c30b40
      Gerrit Renker 提交于
      This patch deprecates the Ack Ratio sysctl, since
       * Ack Ratio is entirely ignored by CCID-3 and CCID-4,
       * Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1);
       * even if it would work in CCID-2, there is no point for a user to change it:
         - Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2),
         - if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts 
           (since waiting for Acks which will never arrive in this window),
         - cwnd is not a user-configurable value.	
      
      The only reasonable place for Ack Ratio is to print it for debugging. It is
      planned to do this later on, as part of e.g. dccp_probe.
      
      With this patch Ack Ratio is now under full control of feature negotiation:
       * Ack Ratio is resolved as a dependency of the selected CCID;
       * if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to
         the default of 2, following RFC 4340, 11.3 - "New connections start with Ack
         Ratio 2 for both endpoints";
       * what happens then is part of another patch set, since it concerns the 
         dynamic update of Ack Ratio while the connection is in full flight.
      
      Thanks to Tomasz Grobelny for discussion leading up to this patch.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      17c30b40