1. 23 7月, 2012 1 次提交
    • N
      sctp: Implement quick failover draft from tsvwg · 5aa93bcf
      Neil Horman 提交于
      I've seen several attempts recently made to do quick failover of sctp transports
      by reducing various retransmit timers and counters.  While its possible to
      implement a faster failover on multihomed sctp associations, its not
      particularly robust, in that it can lead to unneeded retransmits, as well as
      false connection failures due to intermittent latency on a network.
      
      Instead, lets implement the new ietf quick failover draft found here:
      http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05
      
      This will let the sctp stack identify transports that have had a small number of
      errors, and avoid using them quickly until their reliability can be
      re-established.  I've tested this out on two virt guests connected via multiple
      isolated virt networks and believe its in compliance with the above draft and
      works well.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      CC: Vlad Yasevich <vyasevich@gmail.com>
      CC: Sridhar Samudrala <sri@us.ibm.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: linux-sctp@vger.kernel.org
      CC: joe@perches.com
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5aa93bcf
  2. 17 7月, 2012 3 次提交
    • D
      net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() · 6700c270
      David S. Miller 提交于
      This will be used so that we can compose a full flow key.
      
      Even though we have a route in this context, we need more.  In the
      future the routes will be without destination address, source address,
      etc. keying.  One ipv4 route will cover entire subnets, etc.
      
      In this environment we have to have a way to possess persistent storage
      for redirects and PMTU information.  This persistent storage will exist
      in the FIB tables, and that's why we'll need to be able to rebuild a
      full lookup flow key here.  Using that flow key will do a fib_lookup()
      and create/update the persistent entry.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6700c270
    • I
      sctp: fix sparse warning for sctp_init_cause_fixed · db28aafa
      Ioan Orghici 提交于
      Fix the following sparse warning:
      	* symbol 'sctp_init_cause_fixed' was not declared. Should it be
      	  static?
      Signed-off-by: NIoan Orghici <ioanorghici@gmail.com>
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      db28aafa
    • N
      sctp: Fix list corruption resulting from freeing an association on a list · 2eebc1e1
      Neil Horman 提交于
      A few days ago Dave Jones reported this oops:
      
      [22766.294255] general protection fault: 0000 [#1] PREEMPT SMP
      [22766.295376] CPU 0
      [22766.295384] Modules linked in:
      [22766.387137]  ffffffffa169f292 6b6b6b6b6b6b6b6b ffff880147c03a90
      ffff880147c03a74
      [22766.387135] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000
      [22766.387136] Process trinity-watchdo (pid: 10896, threadinfo ffff88013e7d2000,
      [22766.387137] Stack:
      [22766.387140]  ffff880147c03a10
      [22766.387140]  ffffffffa169f2b6
      [22766.387140]  ffff88013ed95728
      [22766.387143]  0000000000000002
      [22766.387143]  0000000000000000
      [22766.387143]  ffff880003fad062
      [22766.387144]  ffff88013c120000
      [22766.387144]
      [22766.387145] Call Trace:
      [22766.387145]  <IRQ>
      [22766.387150]  [<ffffffffa169f292>] ? __sctp_lookup_association+0x62/0xd0
      [sctp]
      [22766.387154]  [<ffffffffa169f2b6>] __sctp_lookup_association+0x86/0xd0 [sctp]
      [22766.387157]  [<ffffffffa169f597>] sctp_rcv+0x207/0xbb0 [sctp]
      [22766.387161]  [<ffffffff810d4da8>] ? trace_hardirqs_off_caller+0x28/0xd0
      [22766.387163]  [<ffffffff815827e3>] ? nf_hook_slow+0x133/0x210
      [22766.387166]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
      [22766.387168]  [<ffffffff8159043d>] ip_local_deliver_finish+0x18d/0x4c0
      [22766.387169]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
      [22766.387171]  [<ffffffff81590a07>] ip_local_deliver+0x47/0x80
      [22766.387172]  [<ffffffff8158fd80>] ip_rcv_finish+0x150/0x680
      [22766.387174]  [<ffffffff81590c54>] ip_rcv+0x214/0x320
      [22766.387176]  [<ffffffff81558c07>] __netif_receive_skb+0x7b7/0x910
      [22766.387178]  [<ffffffff8155856c>] ? __netif_receive_skb+0x11c/0x910
      [22766.387180]  [<ffffffff810d423e>] ? put_lock_stats.isra.25+0xe/0x40
      [22766.387182]  [<ffffffff81558f83>] netif_receive_skb+0x23/0x1f0
      [22766.387183]  [<ffffffff815596a9>] ? dev_gro_receive+0x139/0x440
      [22766.387185]  [<ffffffff81559280>] napi_skb_finish+0x70/0xa0
      [22766.387187]  [<ffffffff81559cb5>] napi_gro_receive+0xf5/0x130
      [22766.387218]  [<ffffffffa01c4679>] e1000_receive_skb+0x59/0x70 [e1000e]
      [22766.387242]  [<ffffffffa01c5aab>] e1000_clean_rx_irq+0x28b/0x460 [e1000e]
      [22766.387266]  [<ffffffffa01c9c18>] e1000e_poll+0x78/0x430 [e1000e]
      [22766.387268]  [<ffffffff81559fea>] net_rx_action+0x1aa/0x3d0
      [22766.387270]  [<ffffffff810a495f>] ? account_system_vtime+0x10f/0x130
      [22766.387273]  [<ffffffff810734d0>] __do_softirq+0xe0/0x420
      [22766.387275]  [<ffffffff8169826c>] call_softirq+0x1c/0x30
      [22766.387278]  [<ffffffff8101db15>] do_softirq+0xd5/0x110
      [22766.387279]  [<ffffffff81073bc5>] irq_exit+0xd5/0xe0
      [22766.387281]  [<ffffffff81698b03>] do_IRQ+0x63/0xd0
      [22766.387283]  [<ffffffff8168ee2f>] common_interrupt+0x6f/0x6f
      [22766.387283]  <EOI>
      [22766.387284]
      [22766.387285]  [<ffffffff8168eed9>] ? retint_swapgs+0x13/0x1b
      [22766.387285] Code: c0 90 5d c3 66 0f 1f 44 00 00 4c 89 c8 5d c3 0f 1f 00 55 48
      89 e5 48 83
      ec 20 48 89 5d e8 4c 89 65 f0 4c 89 6d f8 66 66 66 66 90 <0f> b7 87 98 00 00 00
      48 89 fb
      49 89 f5 66 c1 c0 08 66 39 46 02
      [22766.387307]
      [22766.387307] RIP
      [22766.387311]  [<ffffffffa168a2c9>] sctp_assoc_is_match+0x19/0x90 [sctp]
      [22766.387311]  RSP <ffff880147c039b0>
      [22766.387142]  ffffffffa16ab120
      [22766.599537] ---[ end trace 3f6dae82e37b17f5 ]---
      [22766.601221] Kernel panic - not syncing: Fatal exception in interrupt
      
      It appears from his analysis and some staring at the code that this is likely
      occuring because an association is getting freed while still on the
      sctp_assoc_hashtable.  As a result, we get a gpf when traversing the hashtable
      while a freed node corrupts part of the list.
      
      Nominally I would think that an mibalanced refcount was responsible for this,
      but I can't seem to find any obvious imbalance.  What I did note however was
      that the two places where we create an association using
      sctp_primitive_ASSOCIATE (__sctp_connect and sctp_sendmsg), have failure paths
      which free a newly created association after calling sctp_primitive_ASSOCIATE.
      sctp_primitive_ASSOCIATE brings us into the sctp_sf_do_prm_asoc path, which
      issues a SCTP_CMD_NEW_ASOC side effect, which in turn adds a new association to
      the aforementioned hash table.  the sctp command interpreter that process side
      effects has not way to unwind previously processed commands, so freeing the
      association from the __sctp_connect or sctp_sendmsg error path would lead to a
      freed association remaining on this hash table.
      
      I've fixed this but modifying sctp_[un]hash_established to use hlist_del_init,
      which allows us to proerly use hlist_unhashed to check if the node is on a
      hashlist safely during a delete.  That in turn alows us to safely call
      sctp_unhash_established in the __sctp_connect and sctp_sendmsg error paths
      before freeing them, regardles of what the associations state is on the hash
      list.
      
      I noted, while I was doing this, that the __sctp_unhash_endpoint was using
      hlist_unhsashed in a simmilar fashion, but never nullified any removed nodes
      pointers to make that function work properly, so I fixed that up in a simmilar
      fashion.
      
      I attempted to test this using a virtual guest running the SCTP_RR test from
      netperf in a loop while running the trinity fuzzer, both in a loop.  I wasn't
      able to recreate the problem prior to this fix, nor was I able to trigger the
      failure after (neither of which I suppose is suprising).  Given the trace above
      however, I think its likely that this is what we hit.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Reported-by: davej@redhat.com
      CC: davej@redhat.com
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Vlad Yasevich <vyasevich@gmail.com>
      CC: Sridhar Samudrala <sri@us.ibm.com>
      CC: linux-sctp@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2eebc1e1
  3. 16 7月, 2012 1 次提交
  4. 12 7月, 2012 3 次提交
  5. 09 7月, 2012 1 次提交
    • N
      sctp: refactor sctp_packet_append_chunk and clenup some memory leaks · ed106277
      Neil Horman 提交于
      While doing some recent work on sctp sack bundling I noted that
      sctp_packet_append_chunk was pretty inefficient.  Specifially, it was called
      recursively while trying to bundle auth and sack chunks.  Because of that we
      call sctp_packet_bundle_sack and sctp_packet_bundle_auth a total of 4 times for
      every call to sctp_packet_append_chunk, knowing that at least 3 of those calls
      will do nothing.
      
      So lets refactor sctp_packet_bundle_auth to have an outer part that does the
      attempted bundling, and an inner part that just does the chunk appends.  This
      saves us several calls per iteration that we just don't need.
      
      Also, noticed that the auth and sack bundling fail to free the chunks they
      allocate if the append fails, so make sure we add that in
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      CC: Vlad Yasevich <vyasevich@gmail.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: linux-sctp@vger.kernel.org
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed106277
  6. 01 7月, 2012 1 次提交
    • N
      sctp: be more restrictive in transport selection on bundled sacks · 4244854d
      Neil Horman 提交于
      It was noticed recently that when we send data on a transport, its possible that
      we might bundle a sack that arrived on a different transport.  While this isn't
      a major problem, it does go against the SHOULD requirement in section 6.4 of RFC
      2960:
      
       An endpoint SHOULD transmit reply chunks (e.g., SACK, HEARTBEAT ACK,
         etc.) to the same destination transport address from which it
         received the DATA or control chunk to which it is replying.  This
         rule should also be followed if the endpoint is bundling DATA chunks
         together with the reply chunk.
      
      This patch seeks to correct that.  It restricts the bundling of sack operations
      to only those transports which have moved the ctsn of the association forward
      since the last sack.  By doing this we guarantee that we only bundle outbound
      saks on a transport that has received a chunk since the last sack.  This brings
      us into stricter compliance with the RFC.
      
      Vlad had initially suggested that we strictly allow only sack bundling on the
      transport that last moved the ctsn forward.  While this makes sense, I was
      concerned that doing so prevented us from bundling in the case where we had
      received chunks that moved the ctsn on multiple transports.  In those cases, the
      RFC allows us to select any of the transports having received chunks to bundle
      the sack on.  so I've modified the approach to allow for that, by adding a state
      variable to each transport that tracks weather it has moved the ctsn since the
      last sack.  This I think keeps our behavior (and performance), close enough to
      our current profile that I think we can do this without a sysctl knob to
      enable/disable it.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      CC: Vlad Yaseivch <vyasevich@gmail.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: linux-sctp@vger.kernel.org
      Reported-by: NMichele Baldessari <michele@redhat.com>
      Reported-by: Nsorin serban <sserban@redhat.com>
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4244854d
  7. 19 6月, 2012 1 次提交
  8. 16 5月, 2012 1 次提交
  9. 11 5月, 2012 1 次提交
  10. 24 4月, 2012 1 次提交
    • E
      net: add a limit parameter to sk_add_backlog() · f545a38f
      Eric Dumazet 提交于
      sk_add_backlog() & sk_rcvqueues_full() hard coded sk_rcvbuf as the
      memory limit. We need to make this limit a parameter for TCP use.
      
      No functional change expected in this patch, all callers still using the
      old sk_rcvbuf limit.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Rick Jones <rick.jones2@hp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f545a38f
  11. 21 4月, 2012 2 次提交
  12. 16 4月, 2012 1 次提交
  13. 05 4月, 2012 1 次提交
    • T
      sctp: Allow struct sctp_event_subscribe to grow without breaking binaries · acdd5985
      Thomas Graf 提交于
      getsockopt(..., SCTP_EVENTS, ...) performs a length check and returns
      an error if the user provides less bytes than the size of struct
      sctp_event_subscribe.
      
      Struct sctp_event_subscribe needs to be extended by an u8 for every
      new event or notification type that is added.
      
      This obviously makes getsockopt fail for binaries that are compiled
      against an older versions of <net/sctp/user.h> which do not contain
      all event types.
      
      This patch changes getsockopt behaviour to no longer return an error
      if not enough bytes are being provided by the user. Instead, it
      returns as much of sctp_event_subscribe as fits into the provided buffer.
      
      This leads to the new behavior that users see what they have been aware
      of at compile time.
      
      The setsockopt(..., SCTP_EVENTS, ...) API is already behaving like this.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NVlad Yasevich <vladislav.yasevich@hp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      acdd5985
  14. 09 3月, 2012 1 次提交
  15. 21 12月, 2011 1 次提交
    • T
      sctp: Do not account for sizeof(struct sk_buff) in estimated rwnd · a76c0adf
      Thomas Graf 提交于
      When checking whether a DATA chunk fits into the estimated rwnd a
      full sizeof(struct sk_buff) is added to the needed chunk size. This
      quickly exhausts the available rwnd space and leads to packets being
      sent which are much below the PMTU limit. This can lead to much worse
      performance.
      
      The reason for this behaviour was to avoid putting too much memory
      pressure on the receiver. The concept is not completely irational
      because a Linux receiver does in fact clone an skb for each DATA chunk
      delivered. However, Linux also reserves half the available socket
      buffer space for data structures therefore usage of it is already
      accounted for.
      
      When proposing to change this the last time it was noted that this
      behaviour was introduced to solve a performance issue caused by rwnd
      overusage in combination with small DATA chunks.
      
      Trying to reproduce this I found that with the sk_buff overhead removed,
      the performance would improve significantly unless socket buffer limits
      are increased.
      
      The following numbers have been gathered using a patched iperf
      supporting SCTP over a live 1 Gbit ethernet network. The -l option
      was used to limit DATA chunk sizes. The numbers listed are based on
      the average of 3 test runs each. Default values have been used for
      sk_(r|w)mem.
      
      Chunk
      Size    Unpatched     No Overhead
      -------------------------------------
         4    15.2 Kbit [!]   12.2 Mbit [!]
         8    35.8 Kbit [!]   26.0 Mbit [!]
        16    95.5 Kbit [!]   54.4 Mbit [!]
        32   106.7 Mbit      102.3 Mbit
        64   189.2 Mbit      188.3 Mbit
       128   331.2 Mbit      334.8 Mbit
       256   537.7 Mbit      536.0 Mbit
       512   766.9 Mbit      766.6 Mbit
      1024   810.1 Mbit      808.6 Mbit
      Signed-off-by: NThomas Graf <tgraf@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a76c0adf
  16. 20 12月, 2011 1 次提交
  17. 12 12月, 2011 1 次提交
  18. 02 12月, 2011 1 次提交
  19. 30 11月, 2011 1 次提交
    • X
      sctp: better integer overflow check in sctp_auth_create_key() · c89304b8
      Xi Wang 提交于
      The check from commit 30c2235c is incomplete and cannot prevent
      cases like key_len = 0x80000000 (INT_MAX + 1).  In that case, the
      left-hand side of the check (INT_MAX - key_len), which is unsigned,
      becomes 0xffffffff (UINT_MAX) and bypasses the check.
      
      However this shouldn't be a security issue.  The function is called
      from the following two code paths:
      
       1) setsockopt()
      
       2) sctp_auth_asoc_set_secret()
      
      In case (1), sca_keylength is never going to exceed 65535 since it's
      bounded by a u16 from the user API.  As such, the key length will
      never overflow.
      
      In case (2), sca_keylength is computed based on the user key (1 short)
      and 2 * key_vector (3 shorts) for a total of 7 * USHRT_MAX, which still
      will not overflow.
      
      In other words, this overflow check is not really necessary.  Just
      make it more correct.
      Signed-off-by: NXi Wang <xi.wang@gmail.com>
      Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c89304b8
  20. 23 11月, 2011 1 次提交
  21. 09 11月, 2011 2 次提交
  22. 01 11月, 2011 1 次提交
  23. 27 10月, 2011 1 次提交
  24. 14 10月, 2011 1 次提交
    • E
      net: more accurate skb truesize · 87fb4b7b
      Eric Dumazet 提交于
      skb truesize currently accounts for sk_buff struct and part of skb head.
      kmalloc() roundings are also ignored.
      
      Considering that skb_shared_info is larger than sk_buff, its time to
      take it into account for better memory accounting.
      
      This patch introduces SKB_TRUESIZE(X) macro to centralize various
      assumptions into a single place.
      
      At skb alloc phase, we put skb_shared_info struct at the exact end of
      skb head, to allow a better use of memory (lowering number of
      reallocations), since kmalloc() gives us power-of-two memory blocks.
      
      Unless SLUB/SLUB debug is active, both skb->head and skb_shared_info are
      aligned to cache lines, as before.
      
      Note: This patch might trigger performance regressions because of
      misconfigured protocol stacks, hitting per socket or global memory
      limits that were previously not reached. But its a necessary step for a
      more accurate memory accounting.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Andi Kleen <ak@linux.intel.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      87fb4b7b
  25. 17 9月, 2011 1 次提交
    • M
      sctp: deal with multiple COOKIE_ECHO chunks · d5ccd496
      Max Matveev 提交于
      Attempt to reduce the number of IP packets emitted in response to single
      SCTP packet (2e3216cd) introduced a complication - if a packet contains
      two COOKIE_ECHO chunks and nothing else then SCTP state machine corks the
      socket while processing first COOKIE_ECHO and then loses the association
      and forgets to uncork the socket. To deal with the issue add new SCTP
      command which can be used to set association explictly. Use this new
      command when processing second COOKIE_ECHO chunk to restore the context
      for SCTP state machine.
      Signed-off-by: NMax Matveev <makc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d5ccd496
  26. 25 8月, 2011 2 次提交
  27. 15 7月, 2011 1 次提交
  28. 09 7月, 2011 1 次提交
  29. 08 7月, 2011 1 次提交
    • T
      sctp: Enforce retransmission limit during shutdown · f8d96052
      Thomas Graf 提交于
      When initiating a graceful shutdown while having data chunks
      on the retransmission queue with a peer which is in zero
      window mode the shutdown is never completed because the
      retransmission error count is reset periodically by the
      following two rules:
      
       - Do not timeout association while doing zero window probe.
       - Reset overall error count when a heartbeat request has
         been acknowledged.
      
      The graceful shutdown will wait for all outstanding TSN to
      be acknowledged before sending the SHUTDOWN request. This
      never happens due to the peer's zero window not acknowledging
      the continuously retransmitted data chunks. Although the
      error counter is incremented for each failed retransmission,
      the receiving of the SACK announcing the zero window clears
      the error count again immediately. Also heartbeat requests
      continue to be sent periodically. The peer acknowledges these
      requests causing the error counter to be reset as well.
      
      This patch changes behaviour to only reset the overall error
      counter for the above rules while not in shutdown. After
      reaching the maximum number of retransmission attempts, the
      T5 shutdown guard timer is scheduled to give the receiver
      some additional time to recover. The timer is stopped as soon
      as the receiver acknowledges any data.
      
      The issue can be easily reproduced by establishing a sctp
      association over the loopback device, constantly queueing
      data at the sender while not reading any at the receiver.
      Wait for the window to reach zero, then initiate a shutdown
      by killing both processes simultaneously. The association
      will never be freed and the chunks on the retransmission
      queue will be retransmitted indefinitely.
      Signed-off-by: NThomas Graf <tgraf@infradead.org>
      Acked-by: NVlad Yasevich <vladislav.yasevich@hp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8d96052
  30. 07 7月, 2011 2 次提交
  31. 02 7月, 2011 1 次提交
  32. 17 6月, 2011 1 次提交