1. 29 1月, 2008 9 次提交
    • G
      [DCCP]: Collapse repeated `len' statements into one · 79133506
      Gerrit Renker 提交于
      This replaces 4 individual assignments for `len' with a single
      one, placed where the control flow of those 4 leads to.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79133506
    • G
      [DCCP]: Support for server holding timewait state · b8599d20
      Gerrit Renker 提交于
      This adds a socket option and signalling support for the case where the server
      holds timewait state on closing the connection, as described in RFC 4340, 8.3.
      
      Since holding timewait state at the server is the non-usual case, it is enabled
      via a socket option. Documentation for this socket option has been added.
      
      The setsockopt statement has been made resilient against different possible cases
      of expressing boolean `true' values using a suggestion by Ian McDonald.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b8599d20
    • G
      [DCCP]: Shift the retransmit timer for active-close into output.c · 92d31920
      Gerrit Renker 提交于
      When performing active close, RFC 4340, 8.3. requires to retransmit the
      Close/CloseReq with a backoff-retransmit timer starting at intially 2 RTTs.
      
      This patch shifts the existing code for active-close retransmit timer
      into output.c, so that the retransmit timer is started when the first
      Close/CloseReq is sent. Previously, the timer was started when, after
      releasing the socket in dccp_close(), the actively-closing side had not yet
      reached the CLOSED/TIMEWAIT state.
      
      The patch further reduces the initial timeout from 3 seconds to the required
      2 RTTs, where - in absence of a known RTT - the fallback value specified in
      RFC 4340, 3.4 is used.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      92d31920
    • G
      [DCCP]: Integrate state transitions for passive-close · 0c869620
      Gerrit Renker 提交于
      This adds the necessary state transitions for the two forms of passive-close
      
       * PASSIVE_CLOSE    - which is entered when a host   receives a Close;
       * PASSIVE_CLOSEREQ - which is entered when a client receives a CloseReq.
      
      Here is a detailed account of what the patch does in each state.
      
      1) Receiving CloseReq
      
        The pseudo-code in 8.5 says:
      
           Step 13: Process CloseReq
                If P.type == CloseReq and S.state < CLOSEREQ,
                    Generate Close
                    S.state := CLOSING
                    Set CLOSING timer.
      
        This means we need to address what to do in CLOSED, LISTEN, REQUEST, RESPOND, PARTOPEN, and OPEN.
      
         * CLOSED:         silently ignore - it may be a late or duplicate CloseReq;
         * LISTEN/RESPOND: will not appear, since Step 7 is performed first (we know we are the client);
         * REQUEST:        perform Step 13 directly (no need to enqueue packet);
         * OPEN/PARTOPEN:  enter PASSIVE_CLOSEREQ so that the application has a chance to process unread data.
      
        When already in PASSIVE_CLOSEREQ, no second CloseReq is enqueued. In any other state, the CloseReq is ignored.
        I think that this offers some robustness against rare and pathological cases: e.g. a simultaneous close where
        the client sends a Close and the server a CloseReq. The client will then be retransmitting its Close until it
        gets the Reset, so ignoring the CloseReq while in state CLOSING is sane.
      
      2) Receiving Close
      
        The code below from 8.5 is unconditional.
      
           Step 14: Process Close
                If P.type == Close,
                    Generate Reset(Closed)
                    Tear down connection
                    Drop packet and return
      
        Thus we need to consider all states:
         * CLOSED:           silently ignore, since this can happen when a retransmitted or late Close arrives;
         * LISTEN:           dccp_rcv_state_process() will generate a Reset ("No Connection");
         * REQUEST:          perform Step 14 directly (no need to enqueue packet);
         * RESPOND:          dccp_check_req() will generate a Reset ("Packet Error") -- left it at that;
         * OPEN/PARTOPEN:    enter PASSIVE_CLOSE so that application has a chance to process unread data;
         * CLOSEREQ:         server performed active-close -- perform Step 14;
         * CLOSING:          simultaneous-close: use a tie-breaker to avoid message ping-pong (see comment);
         * PASSIVE_CLOSEREQ: ignore - the peer has a bug (sending first a CloseReq and now a Close);
         * TIMEWAIT:         packet is ignored.
      
         Note that the condition of receiving a packet in state CLOSED here is different from the condition "there
         is no socket for such a connection": the socket still exists, but its state indicates it is unusable.
      
         Last, dccp_finish_passive_close sets either DCCP_CLOSED or DCCP_CLOSING = TCP_CLOSING, so that
         sk_stream_wait_close() will wait for the final Reset (which will trigger CLOSING => CLOSED).
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c869620
    • G
      [DCCP]: Dedicated auxiliary states to support passive-close · f11135a3
      Gerrit Renker 提交于
      This adds two auxiliary states to deal with passive closes:
        * PASSIVE_CLOSE    (reached from OPEN via reception of Close)    and
        * PASSIVE_CLOSEREQ (reached from OPEN via reception of CloseReq)
      as internal intermediate states.
      
      These states are used to allow a receiver to process unread data before
      acknowledging the received connection-termination-request (the Close/CloseReq).
      
      Without such support, it will happen that passively-closed sockets enter CLOSED
      state while there is still unprocessed data in the queue; leading to unexpected
      and erratic API behaviour.
      
      PASSIVE_CLOSE has been mapped into TCPF_CLOSE_WAIT, so that the code will
      seamlessly work with inet_accept() (which tests for this state).
      
      The state names are thanks to Arnaldo, who suggested this naming scheme
      following an earlier revision of this patch.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f11135a3
    • G
      [DCCP]: Add support for abortive release · ce865a61
      Gerrit Renker 提交于
      This continues from the previous patch and adds support for actively aborting
      a DCCP connection, using a Reset Code 2, "Aborted" to inform the peer of an
      abortive release.
      
      I have tried this in various client/server settings and it works as expected.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ce865a61
    • G
      [DCCP]: Check for unread data on close · d83bd95b
      Gerrit Renker 提交于
      This removes one FIXME with regard to close when there is still unread data.
      The mechanism is implemented similar to TCP: with regard to DCCP-specifics,
      a Reset with Code 2, "Aborted" is sent to the peer.
      
      This corresponds in part to RFC 4340, 8.1.1 and 8.1.5.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d83bd95b
    • A
      [DCCP]: Initialize dccp_sock before calling the ccid constructors · e18d7a98
      Arnaldo Carvalho de Melo 提交于
      This is because in the next patch CCID2 will assume that dccps_mss_cache is
      non-zero.
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e18d7a98
    • G
      [DCCP]: Honour and make use of shutdown option set by user · 8e8c71f1
      Gerrit Renker 提交于
      This extends the DCCP socket API by honouring any shutdown(2) option set by the user.
      The behaviour is, as much as possible, made consistent with the API for TCP's shutdown.
      
      This patch exploits the information provided by the user via the socket API to reduce
      processing costs:
       * if the read end is closed (SHUT_RD), it is not necessary to deliver to input CCID;
       * if the write end is closed (SHUT_WR), the same idea applies, but with a difference -
         as long as the TX queue has not been drained, we need to receive feedback to keep
         congestion-control rates up to date. Hence SHUT_WR is honoured only after the last
         packet (under congestion control) has been sent;
       * although SHUT_RDWR seems nonsensical, it is nevertheless supported in the same manner
         as for TCP (and agrees with test for SHUTDOWN_MASK in dccp_poll() in net/dccp/proto.c).
      
      Furthermore, most of the code already honours the sk_shutdown flags (dccp_recvmsg() for
      instance sets the read length to 0 if SHUT_RD had been called); CCID handling is now added
      to this by the present patch.
      
      There will also no longer be any delivery when the socket is in the final stages, i.e. when
      one of dccp_close(), dccp_fin(), or dccp_done() has been called - which is fine since at
      that stage the connection is its final stages.
      
      Motivation and background are on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/shutdown
      
      A FIXME has been added to notify the other end if SHUT_RD has been set (RFC 4340, 11.7).
      
      Note: There is a comment in inet_shutdown() in net/ipv4/af_inet.c which asks to "make
            sure the socket is a TCP socket". This should probably be extended to mean
            `TCP or DCCP socket' (the code is also used by UDP and raw sockets).
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e8c71f1
  2. 07 11月, 2007 1 次提交
    • E
      [INET]: Remove per bucket rwlock in tcp/dccp ehash table. · 230140cf
      Eric Dumazet 提交于
      As done two years ago on IP route cache table (commit
      22c047cc) , we can avoid using one
      lock per hash bucket for the huge TCP/DCCP hash tables.
      
      On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for
      litle performance differences. (we hit a different cache line for the
      rwlock, but then the bucket cache line have a better sharing factor
      among cpus, since we dirty it less often). For netstat or ss commands
      that want a full scan of hash table, we perform fewer memory accesses.
      
      Using a 'small' table of hashed rwlocks should be more than enough to
      provide correct SMP concurrency between different buckets, without
      using too much memory. Sizing of this table depends on
      num_possible_cpus() and various CONFIG settings.
      
      This patch provides some locking abstraction that may ease a future
      work using a different model for TCP/DCCP table.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Acked-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      230140cf
  3. 24 10月, 2007 1 次提交
  4. 11 10月, 2007 6 次提交
  5. 20 7月, 2007 1 次提交
    • P
      mm: Remove slab destructors from kmem_cache_create(). · 20c2df83
      Paul Mundt 提交于
      Slab destructors were no longer supported after Christoph's
      c59def9f change. They've been
      BUGs for both slab and slub, and slob never supported them
      either.
      
      This rips out support for the dtor pointer from kmem_cache_create()
      completely and fixes up every single callsite in the kernel (there were
      about 224, not including the slab allocator definitions themselves,
      or the documentation references).
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      20c2df83
  6. 29 3月, 2007 1 次提交
  7. 11 2月, 2007 1 次提交
  8. 09 2月, 2007 1 次提交
    • E
      [NET]: change layout of ehash table · dbca9b27
      Eric Dumazet 提交于
      ehash table layout is currently this one :
      
      First half of this table is used by sockets not in TIME_WAIT state
      Second half of it is used by sockets in TIME_WAIT state.
      
      This is non optimal because of for a given hash or socket, the two chain heads 
      are located in separate cache lines.
      Moreover the locks of the second half are never used.
      
      If instead of this halving, we use two list heads in inet_ehash_bucket instead 
      of only one, we probably can avoid one cache miss, and reduce ram usage, 
      particularly if sizeof(rwlock_t) is big (various CONFIG_DEBUG_SPINLOCK, 
      CONFIG_DEBUG_LOCK_ALLOC settings). So we still halves the table but we keep 
      together related chains to speedup lookups and socket state change.
      
      In this patch I did not try to align struct inet_ehash_bucket, but a future 
      patch could try to make this structure have a convenient size (a power of two 
      or a multiple of L1_CACHE_SIZE).
      I guess rwlock will just vanish as soon as RCU is plugged into ehash :) , so 
      maybe we dont need to scratch our heads to align the bucket...
      
      Note : In case struct inet_ehash_bucket is not a power of two, we could 
      probably change alloc_large_system_hash() (in case it use __get_free_pages()) 
      to free the unused space. It currently allocates a big zone, but the last 
      quarter of it could be freed. Again, this should be a temporary 'problem'.
      
      Patch tested on ipv4 tcp only, but should be OK for IPV6 and DCCP.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dbca9b27
  9. 12 12月, 2006 1 次提交
  10. 03 12月, 2006 8 次提交
    • A
      [DCCP]: Make {set,get}sockopt(DCCP_SOCKOPT_PACKET_SIZE) return 0 · 841bac1d
      Arnaldo Carvalho de Melo 提交于
      To reflect the fact that this now is of no effect, not making apps
      stop working, just be warned in the system log.
      Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
      841bac1d
    • G
      [DCCP]: Tidy up unused structures · 5aed3243
      Gerrit Renker 提交于
      This removes and cleans up unused variables and structures which have become
      unnecessary following the introduction of the EWMA patch to automatically track
      the CCID 3 receiver/sender packet sizes `s'.
      
      It deprecates the PACKET_SIZE socket option by returning an error code and
      printing a deprecation warning if an application tries to read or write this
      socket option.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
      5aed3243
    • G
      [DCCP]: Simplified conditions due to use of enum:8 states · 59348b19
      Gerrit Renker 提交于
      This reaps the benefit of the earlier patch, which changed the type of
      CCID 3 states to use enums, in that many conditions are now simplified
      and the number of possible (unexpected) values is greatly reduced.
      
      In a few instances, this also allowed to simplify pre-conditions; where
      care has been taken to retain logical equivalence.
      
      [DCCP]: Introduce a consistent BUG/WARN message scheme
      
      This refines the existing set of DCCP messages so that
       * BUG(), BUG_ON(), WARN_ON() have meaningful DCCP-specific counterparts
       * DCCP_CRIT (for severe warnings) is not rate-limited
       * DCCP_WARN() is introduced as rate-limited wrapper
      
      Using these allows a faster and cleaner transition to their original
      counterparts once the code has matured into a full DCCP implementation.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
      59348b19
    • I
      [DCCP]: Set TX Queue Length Bounds via Sysctl · b1308dc0
      Ian McDonald 提交于
      Previously the transmit queue was unbounded.
      
      This patch:
      	* puts a limit on transmit queue length
      	  and sends back EAGAIN if the buffer is full
      	* sets the TX queue length to a sensible default
      	* implements tx buffer sysctls for DCCP
      Signed-off-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
      b1308dc0
    • G
      [DCCP]: Miscellaneous code tidy-ups · 09dbc389
      Gerrit Renker 提交于
      This patch does not change code; it performs some trivial clean/tidy-ups:
      
        * removal of a `debug_prefix' string in favour of the
          already existing dccp_role(sk)
      
        * add documentation of structures and constants
      
        * separated out the cases for invalid packets (step 1
          of the packet validation)
      
        * removing duplicate statements
      
        * combining declaration & initialisation
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
      09dbc389
    • G
      [DCCP]: Add sysctls to control retransmission behaviour · 2e2e9e92
      Gerrit Renker 提交于
      This adds 3 sysctls which govern the retransmission behaviour of DCCP control
      packets (3way handshake, feature negotiation).
      
      It removes 4 FIXMEs from the code.
      
      The close resemblance of sysctl variables to their TCP analogues is emphasised
      not only by their name, but also by giving them the same initial values.
      This is useful since there is not much practical experience with DCCP yet.
      
      Furthermore, with regard to the previous patch, it is now possible to limit
      the number of keepalive-Responses by setting net.dccp.default.request_retries
      (also a bit like in TCP).
      
      Lastly, added documentation of all existing DCCP sysctls.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
      2e2e9e92
    • G
      [DCCP]: Support for partial checksums (RFC 4340, sec. 9.2) · 6f4e5fff
      Gerrit Renker 提交于
      This patch does the following:
        a) introduces variable-length checksums as specified in [RFC 4340, sec. 9.2]
        b) provides necessary socket options and documentation as to how to use them
        c) basic support and infrastructure for the Minimum Checksum Coverage feature
           [RFC 4340, sec. 9.2.1]: acceptability tests, user notification and user
           interface
      
      In addition, it
      
       (1) fixes two bugs in the DCCPv4 checksum computation:
       	* pseudo-header used checksum_len instead of skb->len
      	* incorrect checksum coverage calculation based on dccph_x
       (2) removes dccp_v4_verify_checksum() since it reduplicates code of the
           checksum computation; code calling this function is updated accordingly.
       (3) now uses skb_checksum(), which is safer than checksum_partial() if the
           sk_buff has is a non-linear buffer (has pages attached to it).
       (4) fixes an outstanding TODO item:
              * If P.CsCov is too large for the packet size, drop packet and return.
      
      The code has been tested with applications, the latest version of tcpdump now
      comes with support for partial DCCP checksums.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
      6f4e5fff
    • E
      [NET]: Size listen hash tables using backlog hint · 72a3effa
      Eric Dumazet 提交于
      We currently allocate a fixed size (TCP_SYNQ_HSIZE=512) slots hash table for
      each LISTEN socket, regardless of various parameters (listen backlog for
      example)
      
      On x86_64, this means order-1 allocations (might fail), even for 'small'
      sockets, expecting few connections. On the contrary, a huge server wanting a
      backlog of 50000 is slowed down a bit because of this fixed limit.
      
      This patch makes the sizing of listen hash table a dynamic parameter,
      depending of :
      - net.core.somaxconn tunable (default is 128)
      - net.ipv4.tcp_max_syn_backlog tunable (default : 256, 1024 or 128)
      - backlog value given by user application  (2nd parameter of listen())
      
      For large allocations (bigger than PAGE_SIZE), we use vmalloc() instead of
      kmalloc().
      
      We still limit memory allocation with the two existing tunables (somaxconn &
      tcp_max_syn_backlog). So for standard setups, this patch actually reduce RAM
      usage.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      72a3effa
  11. 25 9月, 2006 1 次提交
  12. 23 9月, 2006 1 次提交
  13. 11 7月, 2006 1 次提交
  14. 01 7月, 2006 1 次提交
  15. 18 6月, 2006 1 次提交
  16. 06 5月, 2006 1 次提交
    • H
      [DCCP]: Fix sock_orphan dead lock · 134af346
      Herbert Xu 提交于
      Calling sock_orphan inside bh_lock_sock in dccp_close can lead to dead
      locks.  For example, the inet_diag code holds sk_callback_lock without
      disabling BH.  If an inbound packet arrives during that admittedly tiny
      window, it will cause a dead lock on bh_lock_sock.  Another possible
      path would be through sock_wfree if the network device driver frees the
      tx skb in process context with BH enabled.
      
      We can fix this by moving sock_orphan out of bh_lock_sock.
      
      The tricky bit is to work out when we need to destroy the socket
      ourselves and when it has already been destroyed by someone else.
      
      By moving sock_orphan before the release_sock we can solve this
      problem.  This is because as long as we own the socket lock its
      state cannot change.
      
      So we simply record the socket state before the release_sock
      and then check the state again after we regain the socket lock.
      If the socket state has transitioned to DCCP_CLOSED in the time being,
      we know that the socket has been destroyed.  Otherwise the socket is
      still ours to keep.
      
      This problem was discoverd by Ingo Molnar using his lock validator.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      134af346
  17. 26 3月, 2006 1 次提交
    • D
      [PATCH] POLLRDHUP/EPOLLRDHUP handling for half-closed devices notifications · f348d70a
      Davide Libenzi 提交于
      Implement the half-closed devices notifiation, by adding a new POLLRDHUP
      (and its alias EPOLLRDHUP) bit to the existing poll/select sets.  Since the
      existing POLLHUP handling, that does not report correctly half-closed
      devices, was feared to be changed, this implementation leaves the current
      POLLHUP reporting unchanged and simply add a new bit that is set in the few
      places where it makes sense.  The same thing was discussed and conceptually
      agreed quite some time ago:
      
      http://lkml.org/lkml/2003/7/12/116
      
      Since this new event bit is added to the existing Linux poll infrastruture,
      even the existing poll/select system calls will be able to use it.  As far
      as the existing POLLHUP handling, the patch leaves it as is.  The
      pollrdhup-2.6.16.rc5-0.10.diff defines the POLLRDHUP for all the existing
      archs and sets the bit in the six relevant files.  The other attached diff
      is the simple change required to sys/epoll.h to add the EPOLLRDHUP
      definition.
      
      There is "a stupid program" to test POLLRDHUP delivery here:
      
       http://www.xmailserver.org/pollrdhup-test.c
      
      It tests poll(2), but since the delivery is same epoll(2) will work equally.
      Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f348d70a
  18. 21 3月, 2006 3 次提交