1. 15 10月, 2014 2 次提交
    • D
      net: sctp: fix panic on duplicate ASCONF chunks · b69040d8
      Daniel Borkmann 提交于
      When receiving a e.g. semi-good formed connection scan in the
      form of ...
      
        -------------- INIT[ASCONF; ASCONF_ACK] ------------->
        <----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------
        -------------------- COOKIE-ECHO -------------------->
        <-------------------- COOKIE-ACK ---------------------
        ---------------- ASCONF_a; ASCONF_b ----------------->
      
      ... where ASCONF_a equals ASCONF_b chunk (at least both serials
      need to be equal), we panic an SCTP server!
      
      The problem is that good-formed ASCONF chunks that we reply with
      ASCONF_ACK chunks are cached per serial. Thus, when we receive a
      same ASCONF chunk twice (e.g. through a lost ASCONF_ACK), we do
      not need to process them again on the server side (that was the
      idea, also proposed in the RFC). Instead, we know it was cached
      and we just resend the cached chunk instead. So far, so good.
      
      Where things get nasty is in SCTP's side effect interpreter, that
      is, sctp_cmd_interpreter():
      
      While incoming ASCONF_a (chunk = event_arg) is being marked
      !end_of_packet and !singleton, and we have an association context,
      we do not flush the outqueue the first time after processing the
      ASCONF_ACK singleton chunk via SCTP_CMD_REPLY. Instead, we keep it
      queued up, although we set local_cork to 1. Commit 2e3216cd
      changed the precedence, so that as long as we get bundled, incoming
      chunks we try possible bundling on outgoing queue as well. Before
      this commit, we would just flush the output queue.
      
      Now, while ASCONF_a's ASCONF_ACK sits in the corked outq, we
      continue to process the same ASCONF_b chunk from the packet. As
      we have cached the previous ASCONF_ACK, we find it, grab it and
      do another SCTP_CMD_REPLY command on it. So, effectively, we rip
      the chunk->list pointers and requeue the same ASCONF_ACK chunk
      another time. Since we process ASCONF_b, it's correctly marked
      with end_of_packet and we enforce an uncork, and thus flush, thus
      crashing the kernel.
      
      Fix it by testing if the ASCONF_ACK is currently pending and if
      that is the case, do not requeue it. When flushing the output
      queue we may relink the chunk for preparing an outgoing packet,
      but eventually unlink it when it's copied into the skb right
      before transmission.
      
      Joint work with Vlad Yasevich.
      
      Fixes: 2e3216cd ("sctp: Follow security requirement of responding with 1 packet")
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b69040d8
    • D
      net: sctp: fix skb_over_panic when receiving malformed ASCONF chunks · 9de7922b
      Daniel Borkmann 提交于
      Commit 6f4c618d ("SCTP : Add paramters validity check for
      ASCONF chunk") added basic verification of ASCONF chunks, however,
      it is still possible to remotely crash a server by sending a
      special crafted ASCONF chunk, even up to pre 2.6.12 kernels:
      
      skb_over_panic: text:ffffffffa01ea1c3 len:31056 put:30768
       head:ffff88011bd81800 data:ffff88011bd81800 tail:0x7950
       end:0x440 dev:<NULL>
       ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:129!
      [...]
      Call Trace:
       <IRQ>
       [<ffffffff8144fb1c>] skb_put+0x5c/0x70
       [<ffffffffa01ea1c3>] sctp_addto_chunk+0x63/0xd0 [sctp]
       [<ffffffffa01eadaf>] sctp_process_asconf+0x1af/0x540 [sctp]
       [<ffffffff8152d025>] ? _read_unlock_bh+0x15/0x20
       [<ffffffffa01e0038>] sctp_sf_do_asconf+0x168/0x240 [sctp]
       [<ffffffffa01e3751>] sctp_do_sm+0x71/0x1210 [sctp]
       [<ffffffff8147645d>] ? fib_rules_lookup+0xad/0xf0
       [<ffffffffa01e6b22>] ? sctp_cmp_addr_exact+0x32/0x40 [sctp]
       [<ffffffffa01e8393>] sctp_assoc_bh_rcv+0xd3/0x180 [sctp]
       [<ffffffffa01ee986>] sctp_inq_push+0x56/0x80 [sctp]
       [<ffffffffa01fcc42>] sctp_rcv+0x982/0xa10 [sctp]
       [<ffffffffa01d5123>] ? ipt_local_in_hook+0x23/0x28 [iptable_filter]
       [<ffffffff8148bdc9>] ? nf_iterate+0x69/0xb0
       [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0
       [<ffffffff8148bf86>] ? nf_hook_slow+0x76/0x120
       [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0
       [<ffffffff81496ded>] ip_local_deliver_finish+0xdd/0x2d0
       [<ffffffff81497078>] ip_local_deliver+0x98/0xa0
       [<ffffffff8149653d>] ip_rcv_finish+0x12d/0x440
       [<ffffffff81496ac5>] ip_rcv+0x275/0x350
       [<ffffffff8145c88b>] __netif_receive_skb+0x4ab/0x750
       [<ffffffff81460588>] netif_receive_skb+0x58/0x60
      
      This can be triggered e.g., through a simple scripted nmap
      connection scan injecting the chunk after the handshake, for
      example, ...
      
        -------------- INIT[ASCONF; ASCONF_ACK] ------------->
        <----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------
        -------------------- COOKIE-ECHO -------------------->
        <-------------------- COOKIE-ACK ---------------------
        ------------------ ASCONF; UNKNOWN ------------------>
      
      ... where ASCONF chunk of length 280 contains 2 parameters ...
      
        1) Add IP address parameter (param length: 16)
        2) Add/del IP address parameter (param length: 255)
      
      ... followed by an UNKNOWN chunk of e.g. 4 bytes. Here, the
      Address Parameter in the ASCONF chunk is even missing, too.
      This is just an example and similarly-crafted ASCONF chunks
      could be used just as well.
      
      The ASCONF chunk passes through sctp_verify_asconf() as all
      parameters passed sanity checks, and after walking, we ended
      up successfully at the chunk end boundary, and thus may invoke
      sctp_process_asconf(). Parameter walking is done with
      WORD_ROUND() to take padding into account.
      
      In sctp_process_asconf()'s TLV processing, we may fail in
      sctp_process_asconf_param() e.g., due to removal of the IP
      address that is also the source address of the packet containing
      the ASCONF chunk, and thus we need to add all TLVs after the
      failure to our ASCONF response to remote via helper function
      sctp_add_asconf_response(), which basically invokes a
      sctp_addto_chunk() adding the error parameters to the given
      skb.
      
      When walking to the next parameter this time, we proceed
      with ...
      
        length = ntohs(asconf_param->param_hdr.length);
        asconf_param = (void *)asconf_param + length;
      
      ... instead of the WORD_ROUND()'ed length, thus resulting here
      in an off-by-one that leads to reading the follow-up garbage
      parameter length of 12336, and thus throwing an skb_over_panic
      for the reply when trying to sctp_addto_chunk() next time,
      which implicitly calls the skb_put() with that length.
      
      Fix it by using sctp_walk_params() [ which is also used in
      INIT parameter processing ] macro in the verification *and*
      in ASCONF processing: it will make sure we don't spill over,
      that we walk parameters WORD_ROUND()'ed. Moreover, we're being
      more defensive and guard against unknown parameter types and
      missized addresses.
      
      Joint work with Vlad Yasevich.
      
      Fixes: b896b82be4ae ("[SCTP] ADDIP: Support for processing incoming ASCONF_ACK chunks.")
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NVlad Yasevich <vyasevich@gmail.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9de7922b
  2. 06 10月, 2014 1 次提交
    • V
      sctp: handle association restarts when the socket is closed. · bdf6fa52
      Vlad Yasevich 提交于
      Currently association restarts do not take into consideration the
      state of the socket.  When a restart happens, the current assocation
      simply transitions into established state.  This creates a condition
      where a remote system, through a the restart procedure, may create a
      local association that is no way reachable by user.  The conditions
      to trigger this are as follows:
        1) Remote does not acknoledge some data causing data to remain
           outstanding.
        2) Local application calls close() on the socket.  Since data
           is still outstanding, the association is placed in SHUTDOWN_PENDING
           state.  However, the socket is closed.
        3) The remote tries to create a new association, triggering a restart
           on the local system.  The association moves from SHUTDOWN_PENDING
           to ESTABLISHED.  At this point, it is no longer reachable by
           any socket on the local system.
      
      This patch addresses the above situation by moving the newly ESTABLISHED
      association into SHUTDOWN-SENT state and bundling a SHUTDOWN after
      the COOKIE-ACK chunk.  This way, the restarted associate immidiately
      enters the shutdown procedure and forces the termination of the
      unreachable association.
      Reported-by: NDavid Laight <David.Laight@aculab.com>
      Signed-off-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bdf6fa52
  3. 30 8月, 2014 1 次提交
    • D
      net: sctp: fix ABI mismatch through sctp_assoc_to_state helper · 38ab1fa9
      Daniel Borkmann 提交于
      Since SCTP day 1, that is, 19b55a2af145 ("Initial commit") from lksctp
      tree, the official <netinet/sctp.h> header carries a copy of enum
      sctp_sstat_state that looks like (compared to the current in-kernel
      enumeration):
      
        User definition:                     Kernel definition:
      
        enum sctp_sstat_state {              typedef enum {
          SCTP_EMPTY             = 0,          <removed>
          SCTP_CLOSED            = 1,          SCTP_STATE_CLOSED            = 0,
          SCTP_COOKIE_WAIT       = 2,          SCTP_STATE_COOKIE_WAIT       = 1,
          SCTP_COOKIE_ECHOED     = 3,          SCTP_STATE_COOKIE_ECHOED     = 2,
          SCTP_ESTABLISHED       = 4,          SCTP_STATE_ESTABLISHED       = 3,
          SCTP_SHUTDOWN_PENDING  = 5,          SCTP_STATE_SHUTDOWN_PENDING  = 4,
          SCTP_SHUTDOWN_SENT     = 6,          SCTP_STATE_SHUTDOWN_SENT     = 5,
          SCTP_SHUTDOWN_RECEIVED = 7,          SCTP_STATE_SHUTDOWN_RECEIVED = 6,
          SCTP_SHUTDOWN_ACK_SENT = 8,          SCTP_STATE_SHUTDOWN_ACK_SENT = 7,
        };                                   } sctp_state_t;
      
      This header was later on also placed into the uapi, so that user space
      programs can compile without having <netinet/sctp.h>, but the shipped
      with <linux/sctp.h> instead.
      
      While RFC6458 under 8.2.1.Association Status (SCTP_STATUS) says that
      sstat_state can range from SCTP_CLOSED to SCTP_SHUTDOWN_ACK_SENT, we
      nevertheless have a what it appears to be dummy SCTP_EMPTY state from
      the very early days.
      
      While it seems to do just nothing, commit 0b8f9e25 ("sctp: remove
      completely unsed EMPTY state") did the right thing and removed this dead
      code. That however, causes an off-by-one when the user asks the SCTP
      stack via SCTP_STATUS API and checks for the current socket state thus
      yielding possibly undefined behaviour in applications as they expect
      the kernel to tell the right thing.
      
      The enumeration had to be changed however as based on the current socket
      state, we access a function pointer lookup-table through this. Therefore,
      I think the best way to deal with this is just to add a helper function
      sctp_assoc_to_state() to encapsulate the off-by-one quirk.
      Reported-by: NTristan Su <sooqing@gmail.com>
      Fixes: 0b8f9e25 ("sctp: remove completely unsed EMPTY state")
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38ab1fa9
  4. 01 8月, 2014 1 次提交
    • J
      sctp: Fixup v4mapped behaviour to comply with Sock API · 299ee123
      Jason Gunthorpe 提交于
      The SCTP socket extensions API document describes the v4mapping option as
      follows:
      
      8.1.15.  Set/Clear IPv4 Mapped Addresses (SCTP_I_WANT_MAPPED_V4_ADDR)
      
         This socket option is a Boolean flag which turns on or off the
         mapping of IPv4 addresses.  If this option is turned on, then IPv4
         addresses will be mapped to V6 representation.  If this option is
         turned off, then no mapping will be done of V4 addresses and a user
         will receive both PF_INET6 and PF_INET type addresses on the socket.
         See [RFC3542] for more details on mapped V6 addresses.
      
      This description isn't really in line with what the code does though.
      
      Introduce addr_to_user (renamed addr_v4map), which should be called
      before any sockaddr is passed back to user space. The new function
      places the sockaddr into the correct format depending on the
      SCTP_I_WANT_MAPPED_V4_ADDR option.
      
      Audit all places that touched v4mapped and either sanely construct
      a v4 or v6 address then call addr_to_user, or drop the
      unnecessary v4mapped check entirely.
      
      Audit all places that call addr_to_user and verify they are on a sycall
      return path.
      
      Add a custom getname that formats the address properly.
      
      Several bugs are addressed:
       - SCTP_I_WANT_MAPPED_V4_ADDR=0 often returned garbage for
         addresses to user space
       - The addr_len returned from recvmsg was not correct when
         returning AF_INET on a v6 socket
       - flowlabel and scope_id were not zerod when promoting
         a v4 to v6
       - Some syscalls like bind and connect behaved differently
         depending on v4mapped
      
      Tested bind, getpeername, getsockname, connect, and recvmsg for proper
      behaviour in v4mapped = 1 and 0 cases.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Tested-by: NJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: NJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      299ee123
  5. 23 7月, 2014 1 次提交
  6. 17 7月, 2014 3 次提交
    • G
      net: sctp: implement rfc6458, 5.3.6. SCTP_NXTINFO cmsg support · 2347c80f
      Geir Ola Vaagland 提交于
      This patch implements section 5.3.6. of RFC6458, that is, support
      for 'SCTP Next Receive Information Structure' (SCTP_NXTINFO) which
      is placed into ancillary data cmsghdr structure for each recvmsg()
      call, if this information is already available when delivering the
      current message.
      
      This option can be enabled/disabled via setsockopt(2) on SOL_SCTP
      level by setting an int value with 1/0 for SCTP_RECVNXTINFO in
      user space applications as per RFC6458, section 8.1.30.
      
      The sctp_nxtinfo structure is defined as per RFC as below ...
      
        struct sctp_nxtinfo {
          uint16_t nxt_sid;
          uint16_t nxt_flags;
          uint32_t nxt_ppid;
          uint32_t nxt_length;
          sctp_assoc_t nxt_assoc_id;
        };
      
      ... and provided under cmsg_level IPPROTO_SCTP, cmsg_type
      SCTP_NXTINFO, while cmsg_data[] contains struct sctp_nxtinfo.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: NGeir Ola Vaagland <geirola@gmail.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2347c80f
    • G
      net: sctp: implement rfc6458, 5.3.5. SCTP_RCVINFO cmsg support · 0d3a421d
      Geir Ola Vaagland 提交于
      This patch implements section 5.3.5. of RFC6458, that is, support
      for 'SCTP Receive Information Structure' (SCTP_RCVINFO) which is
      placed into ancillary data cmsghdr structure for each recvmsg()
      call.
      
      This option can be enabled/disabled via setsockopt(2) on SOL_SCTP
      level by setting an int value with 1/0 for SCTP_RECVRCVINFO in user
      space applications as per RFC6458, section 8.1.29.
      
      The sctp_rcvinfo structure is defined as per RFC as below ...
      
        struct sctp_rcvinfo {
          uint16_t rcv_sid;
          uint16_t rcv_ssn;
          uint16_t rcv_flags;
          <-- 2 bytes hole  -->
          uint32_t rcv_ppid;
          uint32_t rcv_tsn;
          uint32_t rcv_cumtsn;
          uint32_t rcv_context;
          sctp_assoc_t rcv_assoc_id;
        };
      
      ... and provided under cmsg_level IPPROTO_SCTP, cmsg_type
      SCTP_RCVINFO, while cmsg_data[] contains struct sctp_rcvinfo.
      An sctp_rcvinfo item always corresponds to the data in msg_iov.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: NGeir Ola Vaagland <geirola@gmail.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0d3a421d
    • G
      net: sctp: implement rfc6458, 5.3.4. SCTP_SNDINFO cmsg support · 63b94938
      Geir Ola Vaagland 提交于
      This patch implements section 5.3.4. of RFC6458, that is, support
      for 'SCTP Send Information Structure' (SCTP_SNDINFO) which can be
      placed into ancillary data cmsghdr structure for sendmsg() calls.
      
      The sctp_sndinfo structure is defined as per RFC as below ...
      
        struct sctp_sndinfo {
          uint16_t snd_sid;
          uint16_t snd_flags;
          uint32_t snd_ppid;
          uint32_t snd_context;
          sctp_assoc_t snd_assoc_id;
        };
      
      ... and supplied under cmsg_level IPPROTO_SCTP, cmsg_type
      SCTP_SNDINFO, while cmsg_data[] contains struct sctp_sndinfo.
      An sctp_sndinfo item always corresponds to the data in msg_iov.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: NGeir Ola Vaagland <geirola@gmail.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      63b94938
  7. 09 7月, 2014 3 次提交
  8. 03 7月, 2014 1 次提交
    • D
      net: sctp: improve timer slack calculation for transport HBs · 8f61059a
      Daniel Borkmann 提交于
      RFC4960, section 8.3 says:
      
        On an idle destination address that is allowed to heartbeat,
        it is recommended that a HEARTBEAT chunk is sent once per RTO
        of that destination address plus the protocol parameter
        'HB.interval', with jittering of +/- 50% of the RTO value,
        and exponential backoff of the RTO if the previous HEARTBEAT
        is unanswered.
      
      Currently, we calculate jitter via sctp_jitter() function first,
      and then add its result to the current RTO for the new timeout:
      
        TMO = RTO + (RAND() % RTO) - (RTO / 2)
                    `------------------------^-=> sctp_jitter()
      
      Instead, we can just simplify all this by directly calculating:
      
        TMO = (RTO / 2) + (RAND() % RTO)
      
      With the help of prandom_u32_max(), we don't need to open code
      our own global PRNG, but can instead just make use of the per
      CPU implementation of prandom with better quality numbers. Also,
      we can now spare us the conditional for divide by zero check
      since no div or mod operation needs to be used. Note that
      prandom_u32_max() won't emit the same result as a mod operation,
      but we really don't care here as we only want to have a random
      number scaled into RTO interval.
      
      Note, exponential RTO backoff is handeled elsewhere, namely in
      sctp_do_8_2_transport_strike().
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8f61059a
  9. 12 6月, 2014 1 次提交
  10. 19 4月, 2014 1 次提交
    • V
      net: sctp: cache auth_enable per endpoint · b14878cc
      Vlad Yasevich 提交于
      Currently, it is possible to create an SCTP socket, then switch
      auth_enable via sysctl setting to 1 and crash the system on connect:
      
      Oops[#1]:
      CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.1-mipsgit-20140415 #1
      task: ffffffff8056ce80 ti: ffffffff8055c000 task.ti: ffffffff8055c000
      [...]
      Call Trace:
      [<ffffffff8043c4e8>] sctp_auth_asoc_set_default_hmac+0x68/0x80
      [<ffffffff8042b300>] sctp_process_init+0x5e0/0x8a4
      [<ffffffff8042188c>] sctp_sf_do_5_1B_init+0x234/0x34c
      [<ffffffff804228c8>] sctp_do_sm+0xb4/0x1e8
      [<ffffffff80425a08>] sctp_endpoint_bh_rcv+0x1c4/0x214
      [<ffffffff8043af68>] sctp_rcv+0x588/0x630
      [<ffffffff8043e8e8>] sctp6_rcv+0x10/0x24
      [<ffffffff803acb50>] ip6_input+0x2c0/0x440
      [<ffffffff8030fc00>] __netif_receive_skb_core+0x4a8/0x564
      [<ffffffff80310650>] process_backlog+0xb4/0x18c
      [<ffffffff80313cbc>] net_rx_action+0x12c/0x210
      [<ffffffff80034254>] __do_softirq+0x17c/0x2ac
      [<ffffffff800345e0>] irq_exit+0x54/0xb0
      [<ffffffff800075a4>] ret_from_irq+0x0/0x4
      [<ffffffff800090ec>] rm7k_wait_irqoff+0x24/0x48
      [<ffffffff8005e388>] cpu_startup_entry+0xc0/0x148
      [<ffffffff805a88b0>] start_kernel+0x37c/0x398
      Code: dd0900b8  000330f8  0126302d <dcc60000> 50c0fff1  0047182a  a48306a0
      03e00008  00000000
      ---[ end trace b530b0551467f2fd ]---
      Kernel panic - not syncing: Fatal exception in interrupt
      
      What happens while auth_enable=0 in that case is, that
      ep->auth_hmacs is initialized to NULL in sctp_auth_init_hmacs()
      when endpoint is being created.
      
      After that point, if an admin switches over to auth_enable=1,
      the machine can crash due to NULL pointer dereference during
      reception of an INIT chunk. When we enter sctp_process_init()
      via sctp_sf_do_5_1B_init() in order to respond to an INIT chunk,
      the INIT verification succeeds and while we walk and process
      all INIT params via sctp_process_param() we find that
      net->sctp.auth_enable is set, therefore do not fall through,
      but invoke sctp_auth_asoc_set_default_hmac() instead, and thus,
      dereference what we have set to NULL during endpoint
      initialization phase.
      
      The fix is to make auth_enable immutable by caching its value
      during endpoint initialization, so that its original value is
      being carried along until destruction. The bug seems to originate
      from the very first days.
      
      Fix in joint work with Daniel Borkmann.
      Reported-by: NJoshua Kinard <kumba@gentoo.org>
      Signed-off-by: NVlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Tested-by: NJoshua Kinard <kumba@gentoo.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b14878cc
  11. 15 4月, 2014 1 次提交
    • D
      Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" · 362d5204
      Daniel Borkmann 提交于
      This reverts commit ef2820a7 ("net: sctp: Fix a_rwnd/rwnd management
      to reflect real state of the receiver's buffer") as it introduced a
      serious performance regression on SCTP over IPv4 and IPv6, though a not
      as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.
      
      Current state:
      
      [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
      Time: Fri, 11 Apr 2014 17:56:21 GMT
      Connecting to host 192.168.241.3, port 5201
            Cookie: Lab200slot2.1397238981.812898.548918
      [  4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.09   sec  20.8 MBytes   161 Mbits/sec
      [  4]   1.09-2.13   sec  10.8 MBytes  86.8 Mbits/sec
      [  4]   2.13-3.15   sec  3.57 MBytes  29.5 Mbits/sec
      [  4]   3.15-4.16   sec  4.33 MBytes  35.7 Mbits/sec
      [  4]   4.16-6.21   sec  10.4 MBytes  42.7 Mbits/sec
      [  4]   6.21-6.21   sec  0.00 Bytes    0.00 bits/sec
      [  4]   6.21-7.35   sec  34.6 MBytes   253 Mbits/sec
      [  4]   7.35-11.45  sec  22.0 MBytes  45.0 Mbits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-12.51  sec  16.0 MBytes   126 Mbits/sec
      [  4]  12.51-13.59  sec  20.3 MBytes   158 Mbits/sec
      [  4]  13.59-14.65  sec  13.4 MBytes   107 Mbits/sec
      [  4]  14.65-16.79  sec  33.3 MBytes   130 Mbits/sec
      [  4]  16.79-16.79  sec  0.00 Bytes    0.00 bits/sec
      [  4]  16.79-17.82  sec  5.94 MBytes  48.7 Mbits/sec
      (etc)
      
      [root@Lab200slot2 ~]#  iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
      Time: Fri, 11 Apr 2014 19:08:41 GMT
      Connecting to host 2001:db8:0:f101::1, port 5201
            Cookie: Lab200slot2.1397243321.714295.2b3f7c
      [  4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.00   sec   169 MBytes  1.42 Gbits/sec
      [  4]   1.00-2.00   sec   201 MBytes  1.69 Gbits/sec
      [  4]   2.00-3.00   sec   188 MBytes  1.58 Gbits/sec
      [  4]   3.00-4.00   sec   174 MBytes  1.46 Gbits/sec
      [  4]   4.00-5.00   sec   165 MBytes  1.39 Gbits/sec
      [  4]   5.00-6.00   sec   199 MBytes  1.67 Gbits/sec
      [  4]   6.00-7.00   sec   163 MBytes  1.36 Gbits/sec
      [  4]   7.00-8.00   sec   174 MBytes  1.46 Gbits/sec
      [  4]   8.00-9.00   sec   193 MBytes  1.62 Gbits/sec
      [  4]   9.00-10.00  sec   196 MBytes  1.65 Gbits/sec
      [  4]  10.00-11.00  sec   157 MBytes  1.31 Gbits/sec
      [  4]  11.00-12.00  sec   175 MBytes  1.47 Gbits/sec
      [  4]  12.00-13.00  sec   192 MBytes  1.61 Gbits/sec
      [  4]  13.00-14.00  sec   199 MBytes  1.67 Gbits/sec
      (etc)
      
      After patch:
      
      [root@Lab200slot2 ~]#  iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
      Time: Mon, 14 Apr 2014 16:40:48 GMT
      Connecting to host 192.168.240.3, port 5201
            Cookie: Lab200slot2.1397493648.413274.65e131
      [  4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.00   sec   240 MBytes  2.02 Gbits/sec
      [  4]   1.00-2.00   sec   239 MBytes  2.01 Gbits/sec
      [  4]   2.00-3.00   sec   240 MBytes  2.01 Gbits/sec
      [  4]   3.00-4.00   sec   239 MBytes  2.00 Gbits/sec
      [  4]   4.00-5.00   sec   245 MBytes  2.05 Gbits/sec
      [  4]   5.00-6.00   sec   240 MBytes  2.01 Gbits/sec
      [  4]   6.00-7.00   sec   240 MBytes  2.02 Gbits/sec
      [  4]   7.00-8.00   sec   239 MBytes  2.01 Gbits/sec
      
      With the reverted patch applied, the SCTP/IPv4 performance is back
      to normal on latest upstream for IPv4 and IPv6 and has same throughput
      as 3.4.2 test kernel, steady and interval reports are smooth again.
      
      Fixes: ef2820a7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
      Reported-by: NPeter Butler <pbutler@sonusnet.com>
      Reported-by: NDongsheng Song <dongsheng.song@gmail.com>
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Tested-by: NPeter Butler <pbutler@sonusnet.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
      Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      362d5204
  12. 12 4月, 2014 1 次提交
    • D
      net: Fix use after free by removing length arg from sk_data_ready callbacks. · 676d2369
      David S. Miller 提交于
      Several spots in the kernel perform a sequence like:
      
      	skb_queue_tail(&sk->s_receive_queue, skb);
      	sk->sk_data_ready(sk, skb->len);
      
      But at the moment we place the SKB onto the socket receive queue it
      can be consumed and freed up.  So this skb->len access is potentially
      to freed up memory.
      
      Furthermore, the skb->len can be modified by the consumer so it is
      possible that the value isn't accurate.
      
      And finally, no actual implementation of this callback actually uses
      the length argument.  And since nobody actually cared about it's
      value, lots of call sites pass arbitrary values in such as '0' and
      even '1'.
      
      So just remove the length argument from the callback, that way there
      is no confusion whatsoever and all of these use-after-free cases get
      fixed as a side effect.
      
      Based upon a patch by Eric Dumazet and his suggestion to audit this
      issue tree-wide.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      676d2369
  13. 17 2月, 2014 1 次提交
    • M
      net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer · ef2820a7
      Matija Glavinic Pecotic 提交于
      Implementation of (a)rwnd calculation might lead to severe performance issues
      and associations completely stalling. These problems are described and solution
      is proposed which improves lksctp's robustness in congestion state.
      
      1) Sudden drop of a_rwnd and incomplete window recovery afterwards
      
      Data accounted in sctp_assoc_rwnd_decrease takes only payload size (sctp data),
      but size of sk_buff, which is blamed against receiver buffer, is not accounted
      in rwnd. Theoretically, this should not be the problem as actual size of buffer
      is double the amount requested on the socket (SO_RECVBUF). Problem here is
      that this will have bad scaling for data which is less then sizeof sk_buff.
      E.g. in 4G (LTE) networks, link interfacing radio side will have a large portion
      of traffic of this size (less then 100B).
      
      An example of sudden drop and incomplete window recovery is given below. Node B
      exhibits problematic behavior. Node A initiates association and B is configured
      to advertise rwnd of 10000. A sends messages of size 43B (size of typical sctp
      message in 4G (LTE) network). On B data is left in buffer by not reading socket
      in userspace.
      
      Lets examine when we will hit pressure state and declare rwnd to be 0 for
      scenario with above stated parameters (rwnd == 10000, chunk size == 43, each
      chunk is sent in separate sctp packet)
      
      Logic is implemented in sctp_assoc_rwnd_decrease:
      
      socket_buffer (see below) is maximum size which can be held in socket buffer
      (sk_rcvbuf). current_alloced is amount of data currently allocated (rx_count)
      
      A simple expression is given for which it will be examined after how many
      packets for above stated parameters we enter pressure state:
      
      We start by condition which has to be met in order to enter pressure state:
      
      	socket_buffer < currently_alloced;
      
      currently_alloced is represented as size of sctp packets received so far and not
      yet delivered to userspace. x is the number of chunks/packets (since there is no
      bundling, and each chunk is delivered in separate packet, we can observe each
      chunk also as sctp packet, and what is important here, having its own sk_buff):
      
      	socket_buffer < x*each_sctp_packet;
      
      each_sctp_packet is sctp chunk size + sizeof(struct sk_buff). socket_buffer is
      twice the amount of initially requested size of socket buffer, which is in case
      of sctp, twice the a_rwnd requested:
      
      	2*rwnd < x*(payload+sizeof(struc sk_buff));
      
      sizeof(struct sk_buff) is 190 (3.13.0-rc4+). Above is stated that rwnd is 10000
      and each payload size is 43
      
      	20000 < x(43+190);
      
      	x > 20000/233;
      
      	x ~> 84;
      
      After ~84 messages, pressure state is entered and 0 rwnd is advertised while
      received 84*43B ~= 3612B sctp data. This is why external observer notices sudden
      drop from 6474 to 0, as it will be now shown in example:
      
      IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
      IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
      IP A.34340 > B.12345: sctp (1) [COOKIE ECHO]
      IP B.12345 > A.34340: sctp (1) [COOKIE ACK]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
      <...>
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]
      
      --> Sudden drop
      
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      At this point, rwnd_press stores current rwnd value so it can be later restored
      in sctp_assoc_rwnd_increase. This however doesn't happen as condition to start
      slowly increasing rwnd until rwnd_press is returned to rwnd is never met. This
      condition is not met since rwnd, after it hit 0, must first reach rwnd_press by
      adding amount which is read from userspace. Let us observe values in above
      example. Initial a_rwnd is 10000, pressure was hit when rwnd was ~6500 and the
      amount of actual sctp data currently waiting to be delivered to userspace
      is ~3500. When userspace starts to read, sctp_assoc_rwnd_increase will be blamed
      only for sctp data, which is ~3500. Condition is never met, and when userspace
      reads all data, rwnd stays on 3569.
      
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
      
      --> At this point userspace read everything, rwnd recovered only to 3569
      
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
      
      Reproduction is straight forward, it is enough for sender to send packets of
      size less then sizeof(struct sk_buff) and receiver keeping them in its buffers.
      
      2) Minute size window for associations sharing the same socket buffer
      
      In case multiple associations share the same socket, and same socket buffer
      (sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one
      of the associations can permanently drop rwnd of other association(s).
      
      Situation will be typically observed as one association suddenly having rwnd
      dropped to size of last packet received and never recovering beyond that point.
      Different scenarios will lead to it, but all have in common that one of the
      associations (let it be association from 1)) nearly depleted socket buffer, and
      the other association blames socket buffer just for the amount enough to start
      the pressure. This association will enter pressure state, set rwnd_press and
      announce 0 rwnd.
      When data is read by userspace, similar situation as in 1) will occur, rwnd will
      increase just for the size read by userspace but rwnd_press will be high enough
      so that association doesn't have enough credit to reach rwnd_press and restore
      to previous state. This case is special case of 1), being worse as there is, in
      the worst case, only one packet in buffer for which size rwnd will be increased.
      Consequence is association which has very low maximum rwnd ('minute size', in
      our case down to 43B - size of packet which caused pressure) and as such
      unusable.
      
      Scenario happened in the field and labs frequently after congestion state (link
      breaks, different probabilities of packet drop, packet reordering) and with
      scenario 1) preceding. Here is given a deterministic scenario for reproduction:
      
      >From node A establish two associations on the same socket, with rcvbuf_policy
      being set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1
      repeat scenario from 1), that is, bring it down to 0 and restore up. Observe
      scenario 1). Use small payload size (here we use 43). Once rwnd is 'recovered',
      bring it down close to 0, as in just one more packet would close it. This has as
      a consequence that association number 2 is able to receive (at least) one more
      packet which will bring it in pressure state. E.g. if association 2 had rwnd of
      10000, packet received was 43, and we enter at this point into pressure,
      rwnd_press will have 9957. Once payload is delivered to userspace, rwnd will
      increase for 43, but conditions to restore rwnd to original state, just as in
      1), will never be satisfied.
      
      --> Association 1, between A.y and B.12345
      
      IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
      IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
      IP A.55915 > B.12345: sctp (1) [COOKIE ECHO]
      IP B.12345 > A.55915: sctp (1) [COOKIE ACK]
      
      --> Association 2, between A.z and B.12346
      
      IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
      IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
      IP A.55915 > B.12346: sctp (1) [COOKIE ECHO]
      IP B.12346 > A.55915: sctp (1) [COOKIE ACK]
      
      --> Deplete socket buffer by sending messages of size 43B over association 1
      
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
      
      <...>
      
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]
      
      --> Sudden drop on 1
      
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      --> Here userspace read, rwnd 'recovered' to 3698, now deplete again using
          association 1 so there is place in buffer for only one more packet
      
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
      
      --> Socket buffer is almost depleted, but there is space for one more packet,
          send them over association 2, size 43B
      
      IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
      IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      --> Immediate drop
      
      IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      --> Read everything from the socket, both association recover up to maximum rwnd
          they are capable of reaching, note that association 1 recovered up to 3698,
          and association 2 recovered only to 43
      
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
      IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
      IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
      
      A careful reader might wonder why it is necessary to reproduce 1) prior
      reproduction of 2). It is simply easier to observe when to send packet over
      association 2 which will push association into the pressure state.
      
      Proposed solution:
      
      Both problems share the same root cause, and that is improper scaling of socket
      buffer with rwnd. Solution in which sizeof(sk_buff) is taken into concern while
      calculating rwnd is not possible due to fact that there is no linear
      relationship between amount of data blamed in increase/decrease with IP packet
      in which payload arrived. Even in case such solution would be followed,
      complexity of the code would increase. Due to nature of current rwnd handling,
      slow increase (in sctp_assoc_rwnd_increase) of rwnd after pressure state is
      entered is rationale, but it gives false representation to the sender of current
      buffer space. Furthermore, it implements additional congestion control mechanism
      which is defined on implementation, and not on standard basis.
      
      Proposed solution simplifies whole algorithm having on mind definition from rfc:
      
      o  Receiver Window (rwnd): This gives the sender an indication of the space
         available in the receiver's inbound buffer.
      
      Core of the proposed solution is given with these lines:
      
      sctp_assoc_rwnd_update:
      	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
      		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
      	else
      		asoc->rwnd = 0;
      
      We advertise to sender (half of) actual space we have. Half is in the braces
      depending whether you would like to observe size of socket buffer as SO_RECVBUF
      or twice the amount, i.e. size is the one visible from userspace, that is,
      from kernelspace.
      In this way sender is given with good approximation of our buffer space,
      regardless of the buffer policy - we always advertise what we have. Proposed
      solution fixes described problems and removes necessity for rwnd restoration
      algorithm. Finally, as proposed solution is simplification, some lines of code,
      along with some bytes in struct sctp_association are saved.
      
      Version 2 of the patch addressed comments from Vlad. Name of the function is set
      to be more descriptive, and two parts of code are changed, in one removing the
      superfluous call to sctp_assoc_rwnd_update since call would not result in update
      of rwnd, and the other being reordering of the code in a way that call to
      sctp_assoc_rwnd_update updates rwnd. Version 3 corrected change introduced in v2
      in a way that existing function is not reordered/copied in line, but it is
      correctly called. Thanks Vlad for suggesting.
      Signed-off-by: NMatija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
      Reviewed-by: NAlexander Sverdlin <alexander.sverdlin@nsn.com>
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ef2820a7
  14. 22 1月, 2014 7 次提交
  15. 14 1月, 2014 1 次提交
  16. 03 1月, 2014 1 次提交
    • V
      sctp: Remove outqueue empty state · 619a60ee
      Vlad Yasevich 提交于
      The SCTP outqueue structure maintains a data chunks
      that are pending transmission, the list of chunks that
      are pending a retransmission and a length of data in
      flight.  It also tries to keep the emtpy state so that
      it can performe shutdown sequence or notify user.
      
      The problem is that the empy state is inconsistently
      tracked.  It is possible to completely drain the queue
      without sending anything when using PR-SCTP.  In this
      case, the empty state will not be correctly state as
      report by Jamal Hadi Salim <jhs@mojatatu.com>.  This
      can cause an association to be perminantly stuck in the
      SHUTDOWN_PENDING state.
      
      Additionally, SCTP is incredibly inefficient when setting
      the empty state.  Even though all the data is availaible
      in the outqueue structure, we ignore it and walk a list
      of trasnports.
      
      In the end, we can completely remove the extra empty
      state and figure out if the queue is empty by looking
      at 3 things:  length of pending data, length of in-flight
      data, and exisiting of retransmit data.  All of these
      are already in the strucutre.
      Reported-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NVlad Yasevich <vyasevich@gmail.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Tested-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      619a60ee
  17. 18 12月, 2013 1 次提交
  18. 11 12月, 2013 1 次提交
    • N
      sctp: properly latch and use autoclose value from sock to association · 9f70f46b
      Neil Horman 提交于
      Currently, sctp associations latch a sockets autoclose value to an association
      at association init time, subject to capping constraints from the max_autoclose
      sysctl value.  This leads to an odd situation where an application may set a
      socket level autoclose timeout, but sliently sctp will limit the autoclose
      timeout to something less than that.
      
      Fix this by modifying the autoclose setsockopt function to check the limit, cap
      it and warn the user via syslog that the timeout is capped.  This will allow
      getsockopt to return valid autoclose timeout values that reflect what subsequent
      associations actually use.
      
      While were at it, also elimintate the assoc->autoclose variable, it duplicates
      whats in the timeout array, which leads to multiple sources for the same
      information, that may differ (as the former isn't subject to any capping).  This
      gives us the timeout information in a canonical place and saves some space in
      the association structure as well.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      CC: Wang Weidong <wangweidong1@huawei.com>
      CC: David Miller <davem@davemloft.net>
      CC: Vlad Yasevich <vyasevich@gmail.com>
      CC: netdev@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9f70f46b
  19. 07 12月, 2013 2 次提交
  20. 29 11月, 2013 1 次提交
  21. 04 11月, 2013 1 次提交
    • D
      net: sctp: fix and consolidate SCTP checksumming code · e6d8b64b
      Daniel Borkmann 提交于
      This fixes an outstanding bug found through IPVS, where SCTP packets
      with skb->data_len > 0 (non-linearized) and empty frag_list, but data
      accumulated in frags[] member, are forwarded with incorrect checksum
      letting SCTP initial handshake fail on some systems. Linearizing each
      SCTP skb in IPVS to prevent that would not be a good solution as
      this leads to an additional and unnecessary performance penalty on
      the load-balancer itself for no good reason (as we actually only want
      to update the checksum, and can do that in a different/better way
      presented here).
      
      The actual problem is elsewhere, namely, that SCTP's checksumming
      in sctp_compute_cksum() does not take frags[] into account like
      skb_checksum() does. So while we are fixing this up, we better reuse
      the existing code that we have anyway in __skb_checksum() and use it
      for walking through the data doing checksumming. This will not only
      fix this issue, but also consolidates some SCTP code with core
      sk_buff code, bringing it closer together and removing respectively
      avoiding reimplementation of skb_checksum() for no good reason.
      
      As crc32c() can use hardware implementation within the crypto layer,
      we leave that intact (it wraps around / falls back to e.g. slice-by-8
      algorithm in __crc32c_le() otherwise); plus use the __crc32c_le_combine()
      combinator for crc32c blocks.
      
      Also, we remove all other SCTP checksumming code, so that we only
      have to use sctp_compute_cksum() from now on; for doing that, we need
      to transform SCTP checkumming in output path slightly, and can leave
      the rest intact.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e6d8b64b
  22. 24 9月, 2013 1 次提交
  23. 30 8月, 2013 1 次提交
  24. 10 8月, 2013 3 次提交
  25. 06 8月, 2013 1 次提交
    • F
      sctp: Pack dst_cookie into 1st cacheline hole for 64bit host · 5a139296
      fan.du 提交于
      As dst_cookie is used in fast path sctp_transport_dst_check.
      
      Before:
      struct sctp_transport {
      	struct list_head           transports;           /*     0    16 */
      	atomic_t                   refcnt;               /*    16     4 */
      	__u32                      dead:1;               /*    20:31  4 */
      	__u32                      rto_pending:1;        /*    20:30  4 */
      	__u32                      hb_sent:1;            /*    20:29  4 */
      	__u32                      pmtu_pending:1;       /*    20:28  4 */
      
      	/* XXX 28 bits hole, try to pack */
      
      	__u32                      sack_generation;      /*    24     4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	struct flowi               fl;                   /*    32    64 */
      	/* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */
      	union sctp_addr            ipaddr;               /*    96    28 */
      
      After:
      struct sctp_transport {
      	struct list_head           transports;           /*     0    16 */
      	atomic_t                   refcnt;               /*    16     4 */
      	__u32                      dead:1;               /*    20:31  4 */
      	__u32                      rto_pending:1;        /*    20:30  4 */
      	__u32                      hb_sent:1;            /*    20:29  4 */
      	__u32                      pmtu_pending:1;       /*    20:28  4 */
      
      	/* XXX 28 bits hole, try to pack */
      
      	__u32                      sack_generation;      /*    24     4 */
      	u32                        dst_cookie;           /*    28     4 */
      	struct flowi               fl;                   /*    32    64 */
      	/* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */
      	union sctp_addr            ipaddr;               /*    96    28 */
      Signed-off-by: NFan Du <fan.du@windriver.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a139296
  26. 03 8月, 2013 1 次提交