1. 15 Jul, 2014 1 commit
    • net: sctp: fix information leaks in ulpevent layer · 8f2e5ae4
      Daniel Borkmann committed
      While working on some other SCTP code, I noticed that some
      structures shared with user space leak uninitialized stack or
      heap memory. In particular, struct sctp_sndrcvinfo has a 2-byte
      hole between .sinfo_flags and .sinfo_ppid that we leave unfilled
      in sctp_ulpevent_read_sndrcvinfo() when putting the structure
      into a cmsg. Likewise, struct sctp_remote_error contains a
      2-byte hole that we don't fill but place into an skb through
      skb_copy_expand() via sctp_ulpevent_make_remote_error().
      
      Both structures are defined by the IETF in RFC6458:
      
      * Section 5.3.2. SCTP Header Information Structure:
      
        The sctp_sndrcvinfo structure is defined below:
      
        struct sctp_sndrcvinfo {
          uint16_t sinfo_stream;
          uint16_t sinfo_ssn;
          uint16_t sinfo_flags;
          <-- 2 bytes hole  -->
          uint32_t sinfo_ppid;
          uint32_t sinfo_context;
          uint32_t sinfo_timetolive;
          uint32_t sinfo_tsn;
          uint32_t sinfo_cumtsn;
          sctp_assoc_t sinfo_assoc_id;
        };
      
      * 6.1.3. SCTP_REMOTE_ERROR:
      
        A remote peer may send an Operation Error message to its peer.
        This message indicates a variety of error conditions on an
        association. The entire ERROR chunk as it appears on the wire
        is included in an SCTP_REMOTE_ERROR event. Please refer to the
        SCTP specification [RFC4960] and any extensions for a list of
        possible error formats. An SCTP error notification has the
        following format:
      
        struct sctp_remote_error {
          uint16_t sre_type;
          uint16_t sre_flags;
          uint32_t sre_length;
          uint16_t sre_error;
          <-- 2 bytes hole  -->
          sctp_assoc_t sre_assoc_id;
          uint8_t  sre_data[];
        };
      
      Fix this by setting both to 0 before filling them out. We also
      have other structures shared between user and kernel space in
      SCTP that contain holes (e.g. struct sctp_paddrthlds), but we
      copy that buffer over from user space first and thus don't need
      to care about it in those cases.
      
      While at it, we can also remove lengthy comments copied from
      the draft; instead, we update the comment with the correct RFC
      number where one can look it up.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 15 Apr, 2014 1 commit
    • Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" · 362d5204
      Daniel Borkmann committed
      This reverts commit ef2820a7 ("net: sctp: Fix a_rwnd/rwnd management
      to reflect real state of the receiver's buffer") as it introduced a
      serious performance regression on SCTP over IPv4 and IPv6, though not
      as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.
      
      Current state:
      
      [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
      Time: Fri, 11 Apr 2014 17:56:21 GMT
      Connecting to host 192.168.241.3, port 5201
            Cookie: Lab200slot2.1397238981.812898.548918
      [  4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.09   sec  20.8 MBytes   161 Mbits/sec
      [  4]   1.09-2.13   sec  10.8 MBytes  86.8 Mbits/sec
      [  4]   2.13-3.15   sec  3.57 MBytes  29.5 Mbits/sec
      [  4]   3.15-4.16   sec  4.33 MBytes  35.7 Mbits/sec
      [  4]   4.16-6.21   sec  10.4 MBytes  42.7 Mbits/sec
      [  4]   6.21-6.21   sec  0.00 Bytes    0.00 bits/sec
      [  4]   6.21-7.35   sec  34.6 MBytes   253 Mbits/sec
      [  4]   7.35-11.45  sec  22.0 MBytes  45.0 Mbits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-12.51  sec  16.0 MBytes   126 Mbits/sec
      [  4]  12.51-13.59  sec  20.3 MBytes   158 Mbits/sec
      [  4]  13.59-14.65  sec  13.4 MBytes   107 Mbits/sec
      [  4]  14.65-16.79  sec  33.3 MBytes   130 Mbits/sec
      [  4]  16.79-16.79  sec  0.00 Bytes    0.00 bits/sec
      [  4]  16.79-17.82  sec  5.94 MBytes  48.7 Mbits/sec
      (etc)
      
      [root@Lab200slot2 ~]#  iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
      Time: Fri, 11 Apr 2014 19:08:41 GMT
      Connecting to host 2001:db8:0:f101::1, port 5201
            Cookie: Lab200slot2.1397243321.714295.2b3f7c
      [  4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.00   sec   169 MBytes  1.42 Gbits/sec
      [  4]   1.00-2.00   sec   201 MBytes  1.69 Gbits/sec
      [  4]   2.00-3.00   sec   188 MBytes  1.58 Gbits/sec
      [  4]   3.00-4.00   sec   174 MBytes  1.46 Gbits/sec
      [  4]   4.00-5.00   sec   165 MBytes  1.39 Gbits/sec
      [  4]   5.00-6.00   sec   199 MBytes  1.67 Gbits/sec
      [  4]   6.00-7.00   sec   163 MBytes  1.36 Gbits/sec
      [  4]   7.00-8.00   sec   174 MBytes  1.46 Gbits/sec
      [  4]   8.00-9.00   sec   193 MBytes  1.62 Gbits/sec
      [  4]   9.00-10.00  sec   196 MBytes  1.65 Gbits/sec
      [  4]  10.00-11.00  sec   157 MBytes  1.31 Gbits/sec
      [  4]  11.00-12.00  sec   175 MBytes  1.47 Gbits/sec
      [  4]  12.00-13.00  sec   192 MBytes  1.61 Gbits/sec
      [  4]  13.00-14.00  sec   199 MBytes  1.67 Gbits/sec
      (etc)
      
      After patch:
      
      [root@Lab200slot2 ~]#  iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
      Time: Mon, 14 Apr 2014 16:40:48 GMT
      Connecting to host 192.168.240.3, port 5201
            Cookie: Lab200slot2.1397493648.413274.65e131
      [  4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.00   sec   240 MBytes  2.02 Gbits/sec
      [  4]   1.00-2.00   sec   239 MBytes  2.01 Gbits/sec
      [  4]   2.00-3.00   sec   240 MBytes  2.01 Gbits/sec
      [  4]   3.00-4.00   sec   239 MBytes  2.00 Gbits/sec
      [  4]   4.00-5.00   sec   245 MBytes  2.05 Gbits/sec
      [  4]   5.00-6.00   sec   240 MBytes  2.01 Gbits/sec
      [  4]   6.00-7.00   sec   240 MBytes  2.02 Gbits/sec
      [  4]   7.00-8.00   sec   239 MBytes  2.01 Gbits/sec
      
      With the revert applied, SCTP performance on latest upstream is back
      to normal for both IPv4 and IPv6 and matches the throughput of the
      3.4.2 test kernel; it is steady and the interval reports are smooth again.
      
      Fixes: ef2820a7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
      Reported-by: Peter Butler <pbutler@sonusnet.com>
      Reported-by: Dongsheng Song <dongsheng.song@gmail.com>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Tested-by: Peter Butler <pbutler@sonusnet.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
      Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 17 Feb, 2014 1 commit
    • net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer · ef2820a7
      Matija Glavinic Pecotic committed
      The implementation of the (a)rwnd calculation can lead to severe performance
      issues and to associations stalling completely. These problems are described
      below, and a solution is proposed that improves lksctp's robustness in the
      congestion state.
      
      1) Sudden drop of a_rwnd and incomplete window recovery afterwards
      
      sctp_assoc_rwnd_decrease accounts only for the payload size (sctp data);
      the size of the sk_buff, which is charged against the receiver buffer, is
      not accounted for in rwnd. Theoretically, this should not be a problem, as
      the actual size of the buffer is double the amount requested on the socket
      (SO_RCVBUF). The problem is that this scales badly for data smaller than
      sizeof(struct sk_buff). E.g. in 4G (LTE) networks, the link interfacing the
      radio side carries a large portion of traffic of this size (less than 100B).
      
      An example of a sudden drop and incomplete window recovery is given below.
      Node B exhibits the problematic behavior. Node A initiates the association
      and B is configured to advertise an rwnd of 10000. A sends messages of size
      43B (the size of a typical sctp message in a 4G (LTE) network). On B, data
      is left in the buffer by not reading the socket in userspace.
      
      Let's examine when we hit the pressure state and declare rwnd to be 0 for
      the scenario with the parameters stated above (rwnd == 10000, chunk size ==
      43, each chunk sent in a separate sctp packet).
      
      The logic is implemented in sctp_assoc_rwnd_decrease:
      
      socket_buffer (see below) is the maximum size that can be held in the
      socket buffer (sk_rcvbuf). currently_alloced is the amount of data
      currently allocated (rx_count).
      
      A simple expression is derived below to determine after how many packets
      we enter the pressure state for the parameters stated above.
      
      We start with the condition that has to be met in order to enter the
      pressure state:
      
      	socket_buffer < currently_alloced;
      
      currently_alloced is the total size of the sctp packets received so far and
      not yet delivered to userspace. x is the number of chunks/packets (since
      there is no bundling and each chunk is delivered in a separate packet, each
      chunk can be treated as an sctp packet that, importantly, has its own
      sk_buff):
      
      	socket_buffer < x*each_sctp_packet;
      
      each_sctp_packet is the sctp chunk size + sizeof(struct sk_buff).
      socket_buffer is twice the initially requested socket buffer size, which
      in the case of sctp is twice the requested a_rwnd:
      
      	2*rwnd < x*(payload+sizeof(struct sk_buff));
      
      sizeof(struct sk_buff) is 190 (3.13.0-rc4+). As stated above, rwnd is 10000
      and each payload is 43 bytes:
      
      	20000 < x(43+190);
      
      	x > 20000/233;
      
      	x ~> 84;
      
      After ~84 messages, the pressure state is entered and an rwnd of 0 is
      advertised, although only 84*43B ~= 3612B of sctp data has been received.
      This is why an external observer notices a sudden drop from 6474 to 0, as
      the following example shows:
      
      IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
      IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
      IP A.34340 > B.12345: sctp (1) [COOKIE ECHO]
      IP B.12345 > A.34340: sctp (1) [COOKIE ACK]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
      <...>
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]
      
      --> Sudden drop
      
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      At this point, rwnd_press stores the current rwnd value so that it can later
      be restored in sctp_assoc_rwnd_increase. This, however, never happens,
      because the condition for slowly increasing rwnd until rwnd_press has been
      returned to it is never met: after hitting 0, rwnd must first reach
      rwnd_press by adding up the amounts read from userspace. Consider the values
      in the above example. The initial a_rwnd is 10000, pressure was hit when
      rwnd was ~6500, and the amount of actual sctp data waiting to be delivered
      to userspace is ~3500. When userspace starts to read,
      sctp_assoc_rwnd_increase is credited only with the sctp data, which is
      ~3500. The condition is never met, and when userspace has read all data,
      rwnd stays at 3569.
      
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
      
      --> At this point userspace read everything, rwnd recovered only to 3569
      
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
      
      Reproduction is straightforward: it is enough for the sender to send packets
      smaller than sizeof(struct sk_buff) and for the receiver to keep them in its
      buffers.
      
      2) Minute size window for associations sharing the same socket buffer
      
      When multiple associations share the same socket and the same socket buffer
      (sctp.rcvbuf_policy == 0), there are different scenarios in which congestion
      on one of the associations can permanently drop the rwnd of the other
      association(s).
      
      The situation is typically observed as one association suddenly having its
      rwnd dropped to the size of the last packet received and never recovering
      beyond that point. Different scenarios lead to it, but all have in common
      that one of the associations (let it be the association from 1)) has nearly
      depleted the socket buffer, and the other association charges the socket
      buffer with just enough to trigger pressure. This association enters the
      pressure state, sets rwnd_press and announces an rwnd of 0.
      When data is read by userspace, a situation similar to 1) occurs: rwnd
      increases only by the size read by userspace, but rwnd_press is high enough
      that the association does not have enough credit to reach rwnd_press and
      restore the previous state. This is a special case of 1), and a worse one,
      since in the worst case there is only one packet in the buffer by whose size
      rwnd is increased. The consequence is an association with a very low maximum
      rwnd ('minute size', in our case down to 43B, the size of the packet that
      caused pressure), which makes it unusable.
      
      The scenario occurred frequently in the field and in labs after congestion
      (link breaks, different probabilities of packet drop, packet reordering) and
      with scenario 1) preceding it. Here is a deterministic scenario for
      reproduction:
      
      From node A, establish two associations on the same socket, with
      rcvbuf_policy set to share one common buffer (sctp.rcvbuf_policy == 0). On
      association 1, repeat the scenario from 1), that is, bring it down to 0 and
      restore it. Observe scenario 1). Use a small payload size (here we use 43).
      Once rwnd is 'recovered', bring it down close to 0, so that just one more
      packet would close it. As a consequence, association 2 is able to receive
      (at least) one more packet, which brings it into the pressure state. E.g. if
      association 2 had an rwnd of 10000, the packet received was 43 bytes, and we
      enter pressure at this point, rwnd_press will be 9957. Once the payload is
      delivered to userspace, rwnd increases by 43, but the conditions to restore
      rwnd to its original state, just as in 1), are never satisfied.
      
      --> Association 1, between A.y and B.12345
      
      IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
      IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
      IP A.55915 > B.12345: sctp (1) [COOKIE ECHO]
      IP B.12345 > A.55915: sctp (1) [COOKIE ACK]
      
      --> Association 2, between A.z and B.12346
      
      IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
      IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
      IP A.55915 > B.12346: sctp (1) [COOKIE ECHO]
      IP B.12346 > A.55915: sctp (1) [COOKIE ACK]
      
      --> Deplete socket buffer by sending messages of size 43B over association 1
      
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
      
      <...>
      
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]
      
      --> Sudden drop on 1
      
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      --> Here userspace has read, rwnd 'recovered' to 3698; now deplete again
          using association 1 so that there is room in the buffer for only one
          more packet
      
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
      
      --> Socket buffer is almost depleted, but there is space for one more packet;
          send it over association 2, size 43B
      
      IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
      IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      --> Immediate drop
      
      IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      --> Read everything from the socket; both associations recover up to the
          maximum rwnd they are capable of reaching. Note that association 1
          recovered up to 3698, and association 2 recovered only to 43
      
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
      IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
      IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
      
      A careful reader might wonder why it is necessary to reproduce 1) prior to
      reproducing 2). It simply makes it easier to observe when to send the packet
      over association 2 that pushes the association into the pressure state.
      
      Proposed solution:
      
      Both problems share the same root cause: improper scaling of the socket
      buffer with rwnd. A solution in which sizeof(sk_buff) is taken into account
      while calculating rwnd is not possible, because there is no linear
      relationship between the amount of data charged in increase/decrease and the
      IP packet in which the payload arrived. Even if such a solution were
      followed, the complexity of the code would increase. Given the nature of the
      current rwnd handling, the slow increase of rwnd (in
      sctp_assoc_rwnd_increase) after the pressure state is entered is rational,
      but it gives the sender a false representation of the current buffer space.
      Furthermore, it implements an additional congestion control mechanism that
      is defined by the implementation rather than by the standard.
      
      The proposed solution simplifies the whole algorithm, keeping in mind the
      definition from the RFC:
      
      o  Receiver Window (rwnd): This gives the sender an indication of the space
         available in the receiver's inbound buffer.
      
      The core of the proposed solution is given by these lines:
      
      sctp_assoc_rwnd_update:
      	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
      		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
      	else
      		asoc->rwnd = 0;
      
      We advertise to the sender (half of) the actual space we have. 'Half' is in
      parentheses because it depends on whether you view the size of the socket
      buffer as SO_RCVBUF, i.e. the size visible from userspace, or as twice that
      amount, i.e. the size visible from kernelspace.
      In this way the sender is given a good approximation of our buffer space,
      regardless of the buffer policy: we always advertise what we have. The
      proposed solution fixes the described problems and removes the need for the
      rwnd restoration algorithm. Finally, as the proposed solution is a
      simplification, some lines of code, along with some bytes in struct
      sctp_association, are saved.
      
      Version 2 of the patch addressed comments from Vlad. The name of the
      function was made more descriptive, and two parts of the code were changed:
      one removes a superfluous call to sctp_assoc_rwnd_update, since that call
      would not result in an update of rwnd, and the other reorders the code so
      that the call to sctp_assoc_rwnd_update updates rwnd. Version 3 corrected
      the change introduced in v2 so that the existing function is not
      reordered/copied inline but is called correctly. Thanks to Vlad for the
      suggestion.
      Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
      Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 07 Dec, 2013 1 commit
  5. 10 Aug, 2013 1 commit
  6. 25 Jul, 2013 1 commit
  7. 18 Jun, 2013 1 commit
  8. 01 Aug, 2012 1 commit
    • netvm: prevent a stream-specific deadlock · c76562b6
      Mel Gorman committed
      This patch series is based on top of "Swap-over-NBD without deadlocking
      v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.
      
      When a user or administrator requires swap for their application, they
      create a swap partition or file, format it with mkswap and activate it
      with swapon.  In diskless systems this is not an option, so if swap is
      required then swapping over the network is considered.  The two likely
      scenarios are blade servers used as part of a cluster, where the form
      factor or maintenance costs do not allow the use of disks, and thin
      clients.
      
      The Linux Terminal Server Project recommends the use of the Network Block
      Device (NBD) for swap but this is not always an option.  There is no
      guarantee that the network attached storage (NAS) device is running Linux
      or supports NBD.  However, it is likely that it supports NFS so there are
      users that want support for swapping over NFS despite any performance
      concern.  Some distributions currently carry patches that support swapping
      over NFS but it would be preferable to support it in the mainline kernel.
      
      Patch 1 avoids a stream-specific deadlock that potentially affects TCP.
      
      Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
      	reserves.
      
      Patch 3 adds three helpers for filesystems to handle swap cache pages.
      	For example, page_file_mapping() returns page->mapping for
      	file-backed pages and the address_space of the underlying
      	swap file for swap cache pages.
      
      Patch 4 adds two address_space_operations to allow a filesystem
      	to pin all metadata relevant to a swapfile in memory. Upon
      	successful activation, the swapfile is marked SWP_FILE and
      	the address space operation ->direct_IO is used for writing
      	and ->readpage for reading in swap pages.
      
      Patch 5 notes that patch 3 is bolting
      	filesystem-specific-swapfile-support onto the side and that
      	the default handlers have different information to what
      	is available to the filesystem. This patch refactors the
      	code so that there are generic handlers for each of the new
      	address_space operations.
      
      Patch 6 adds an API to allow a vector of kernel addresses to be
      	translated to struct pages and pinned for IO.
      
      Patch 7 adds support for using highmem pages for swap by kmapping
      	the pages before calling the direct_IO handler.
      
      Patch 8 updates NFS to use the helpers from patch 3 where necessary.
      
      Patch 9 avoids setting PF_private on PG_swapcache pages within NFS.
      
      Patch 10 implements the new swapfile-related address_space operations
      	for NFS and teaches the direct IO handler how to manage
      	kernel addresses.
      
      Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
      	where appropriate.
      
      Patch 12 fixes a NULL pointer dereference that occurs when using
      	swap-over-NFS.
      
      With the patches applied, it is possible to mount a swapfile that is on an
      NFS filesystem.  Swap performance is not great, with a swap stress test
      taking roughly twice as long to complete as when the swap device is
      backed by NBD.
      
      This patch: netvm: prevent a stream-specific deadlock
      
      It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
      that we're over the global rmem limit.  This will prevent SOCK_MEMALLOC
      buffers from receiving data, which will prevent userspace from running,
      which is needed to reduce the buffered data.
      
      Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.  Once
      this change is applied, it is important that sockets that set
      SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
      If this happens, a warning is generated and the tokens reclaimed to avoid
      accounting errors until the bug is fixed.
      
      [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 01 Jul, 2012 1 commit
    • sctp: be more restrictive in transport selection on bundled sacks · 4244854d
      Neil Horman committed
      It was noticed recently that when we send data on a transport, it's possible
      that we might bundle a sack that arrived on a different transport.  While
      this isn't a major problem, it does go against the SHOULD requirement in
      section 6.4 of RFC 2960:
      
       An endpoint SHOULD transmit reply chunks (e.g., SACK, HEARTBEAT ACK,
         etc.) to the same destination transport address from which it
         received the DATA or control chunk to which it is replying.  This
         rule should also be followed if the endpoint is bundling DATA chunks
         together with the reply chunk.
      
      This patch seeks to correct that.  It restricts the bundling of sack operations
      to only those transports which have moved the ctsn of the association forward
      since the last sack.  By doing this we guarantee that we only bundle outbound
      sacks on a transport that has received a chunk since the last sack.  This brings
      us into stricter compliance with the RFC.
      
      Vlad had initially suggested that we strictly allow only sack bundling on the
      transport that last moved the ctsn forward.  While this makes sense, I was
      concerned that doing so prevented us from bundling in the case where we had
      received chunks that moved the ctsn on multiple transports.  In those cases, the
      RFC allows us to select any of the transports having received chunks to bundle
      the sack on.  So I've modified the approach to allow for that, by adding a state
      variable to each transport that tracks whether it has moved the ctsn since the
      last sack.  This I think keeps our behavior (and performance) close enough to
      our current profile that I think we can do this without a sysctl knob to
      enable/disable it.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: Vlad Yaseivch <vyasevich@gmail.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: linux-sctp@vger.kernel.org
      Reported-by: Michele Baldessari <michele@redhat.com>
      Reported-by: sorin serban <sserban@redhat.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 09 Jul, 2011 1 commit
  11. 27 Apr, 2011 1 commit
  12. 22 Apr, 2011 2 commits
  13. 31 Mar, 2011 1 commit
  14. 30 Mar, 2010 1 commit
    • include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo committed
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include
      those headers directly instead of assuming availability.  As this
      conversion needs to touch a large number of source files, the
      following script is used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surroundings.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have a fitting include block), it prints
        out an error message indicating which .h file needs to be added to
        the file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  15. 09 Jun, 2009 1 commit
  16. 09 Oct, 2008 1 commit
    • V
      sctp: Rework the tsn map to use generic bitmap. · 8e1ee18c
      Vlad Yasevich authored
      The tsn map currently in use is 4K large and is stuck inside
      the sctp_association structure making memory references REALLY
      expensive.  What we really need is at most 4K worth of bits
      so the biggest map we would have is 512 bytes.   Also, the
      map is only really useful when we have gaps to store and
      report.  As such, starting with a minimal map of say 32 TSNs (bits)
      should be enough for normal low-loss operations.  We can grow
      the map by some multiple of 32 along with some extra room any
      time we receive a TSN which would put us outside of the map
      boundary.  As we close gaps, we can shift the map to rebase
      it on the latest TSN we've seen.  This saves 4088 bytes per
      association just in the map alone along with savings from the now
      unnecessary structure members.
      Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
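
A rough user-space sketch of the scheme described above, assuming a plain byte-array bitmap rather than the kernel's generic bitmap helpers. All `toy_*` names, the growth policy (round up to 32 bits plus 32 spare bits) and the handling of old TSNs are assumptions made for illustration only.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAP_INITIAL 32u    /* bits in a fresh map */
#define TOY_MAP_MAX     4096u  /* upper bound: 4K bits = 512 bytes */

struct toy_tsnmap {
    uint8_t *bits;   /* bit i set => TSN (base + i) was received */
    uint32_t len;    /* map size in bits, always a multiple of 32 */
    uint32_t base;   /* lowest TSN not yet received in sequence */
};

int toy_tsnmap_init(struct toy_tsnmap *m, uint32_t initial_tsn)
{
    m->bits = calloc(TOY_MAP_INITIAL / 8, 1);
    if (!m->bits)
        return -1;
    m->len = TOY_MAP_INITIAL;
    m->base = initial_tsn;
    return 0;
}

static int toy_get(const struct toy_tsnmap *m, uint32_t i)
{
    return (m->bits[i / 8] >> (i % 8)) & 1;
}

static void toy_set(struct toy_tsnmap *m, uint32_t i, int v)
{
    if (v)
        m->bits[i / 8] |= (uint8_t)(1u << (i % 8));
    else
        m->bits[i / 8] &= (uint8_t)~(1u << (i % 8));
}

/* Grow so that bit index gap fits: round up to 32 and add 32 spare bits. */
static int toy_grow(struct toy_tsnmap *m, uint32_t gap)
{
    uint32_t want = (gap / 32 + 2) * 32;
    uint8_t *p;

    if (gap >= TOY_MAP_MAX)
        return -1;
    if (want > TOY_MAP_MAX)
        want = TOY_MAP_MAX;
    p = realloc(m->bits, want / 8);
    if (!p)
        return -1;
    memset(p + m->len / 8, 0, (want - m->len) / 8);  /* clear the new tail */
    m->bits = p;
    m->len = want;
    return 0;
}

/* Record a received TSN, then rebase the map past any closed gap. */
int toy_tsnmap_mark(struct toy_tsnmap *m, uint32_t tsn)
{
    uint32_t gap = tsn - m->base;  /* unsigned wrap: TSNs below base look huge */
    uint32_t n = 0, i;

    if (gap >= m->len && toy_grow(m, gap) < 0)
        return -1;                 /* too far ahead, or already below base */
    toy_set(m, gap, 1);

    while (n < m->len && toy_get(m, n))
        n++;                       /* n = run of in-sequence TSNs at the front */
    for (i = 0; i < m->len; i++)   /* shift the map left by n bits */
        toy_set(m, i, i + n < m->len ? toy_get(m, i + n) : 0);
    m->base += n;
    return 0;
}
```

In this sketch a connection without loss never grows beyond the initial 32 bits, since every in-order TSN immediately shifts the map forward; the map only expands while a gap is outstanding.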
      8e1ee18c
  17. 08 Jul, 2008 1 commit
  18. 13 Apr, 2008 1 commit
  19. 28 Feb, 2008 1 commit
  20. 05 Feb, 2008 1 commit
  21. 29 Jan, 2008 1 commit
    • H
      [NET] CORE: Introducing new memory accounting interface. · 3ab224be
      Hideo Aoki authored
      This patch introduces new memory accounting functions for each network
      protocol. Most of them are renamed from memory accounting functions
      for stream protocols. At the same time, some stream memory accounting
      functions are removed since other functions do the same thing.
      
      Renaming:
      	sk_stream_free_skb()		->	sk_wmem_free_skb()
      	__sk_stream_mem_reclaim()	->	__sk_mem_reclaim()
      	sk_stream_mem_reclaim()		->	sk_mem_reclaim()
      	sk_stream_mem_schedule()	->	__sk_mem_schedule()
      	sk_stream_pages()		->	sk_mem_pages()
      	sk_stream_rmem_schedule()	->	sk_rmem_schedule()
      	sk_stream_wmem_schedule()	->	sk_wmem_schedule()
      	sk_charge_skb()			->	sk_mem_charge()
      
      Removing:
      	sk_stream_rfree():	consolidates into sock_rfree()
      	sk_stream_set_owner_r(): consolidates into skb_set_owner_r()
      	sk_stream_mem_schedule()
      
      The following functions are added.
      	sk_has_account(): check if the protocol supports accounting
      	sk_mem_uncharge(): do the opposite of sk_mem_charge()
      
      In addition, to achieve consolidation, updating sk_wmem_queued is
      removed from sk_mem_charge().
      
      Next, to consolidate memory accounting functions, this patch adds
      memory accounting calls to network core functions. Moreover, the
      existing memory accounting call is renamed to the new accounting call.
      
      Finally, we replace the existing memory accounting calls with the new
      interface in TCP and SCTP.
      Signed-off-by: Takahiro Yasui <tyasui@redhat.com>
      Signed-off-by: Hideo Aoki <haoki@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
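
The relationship between the accounting check and the charge/uncharge pair can be illustrated with a toy user-space model. This is only a sketch under stated assumptions: `struct toy_sock`, its fields and the `toy_*` functions are hypothetical and merely loosely mirror the roles of sk_has_account(), sk_mem_charge() and sk_mem_uncharge(); the real functions operate on struct sock and differ in detail.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified socket; not the kernel's struct sock. */
struct toy_sock {
    long *memory_allocated;  /* protocol-wide counter, NULL if the protocol has no accounting */
    int forward_alloc;       /* bytes reserved for this socket but not yet charged */
};

/* Does the socket's protocol support memory accounting at all? */
bool toy_has_account(const struct toy_sock *sk)
{
    return sk->memory_allocated != NULL;
}

/* Consume size bytes of the socket's reservation. */
void toy_mem_charge(struct toy_sock *sk, int size)
{
    if (!toy_has_account(sk))
        return;
    sk->forward_alloc -= size;
}

/* The opposite of toy_mem_charge(): hand size bytes back. */
void toy_mem_uncharge(struct toy_sock *sk, int size)
{
    if (!toy_has_account(sk))
        return;
    sk->forward_alloc += size;
}
```

Guarding both helpers with the accounting check is what lets protocol-independent core code call them unconditionally; for a protocol without accounting they are no-ops.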
      3ab224be
  22. 09 Jan, 2008 1 commit
  23. 11 Oct, 2007 2 commits
  24. 11 May, 2007 1 commit
  25. 26 Apr, 2007 1 commit
  26. 11 Feb, 2007 1 commit
  27. 23 Dec, 2006 1 commit
  28. 14 Dec, 2006 1 commit
  29. 03 Dec, 2006 1 commit
  30. 12 Oct, 2006 1 commit
  31. 18 Jun, 2006 1 commit
  32. 12 Nov, 2005 1 commit
  33. 29 Oct, 2005 1 commit
  34. 09 Oct, 2005 1 commit
  35. 12 Jul, 2005 1 commit
  36. 17 Apr, 2005 1 commit
    • L
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds authored
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4